My apologies if this is a duplicate. I
think the attachment size was too big.
Reference: Bruce Momjian
writes: -> http://archives.postgresql.org/pgsql-committers/2007-09/msg00402.php
Other references: Boyer
Moore?? -> http://www.cs.utexas.edu/users/moore/best-ideas/string-searching/fstrpos-example.html
As a first time hacker of
postgresql I’ve decided to attempt something that does not go too deep in
to the heart of the software.
Before I make an attempt at
writing a patch, I’ve decided to start benchmarking outside of Postgres.
On reading the link underneath the Boyer Moore item in the in the new TODO list
wiki, I learned that someone else had a shot at speeding up strpos around this
time last year using KMP algotithm. On looking a little deeper into Boyer
Moore’s algorithm it’s quite obvious that there is some overhead
involved with preprocessing the search key (needle). My target has been to not
have any trade offs with this possibly to be patch. I didn’t want
smaller searches to suffer.
I’d like to receive
some feedback on the various functions in the attached before I go ahead and
work on this any more. I’ve written 8 different versions of the function.
Version 8 seems to be the fastest out of them, however I’ve not tested
with multi byte chars.
Please note that these are
just some initial ideas. For example startpos is not yet implemented, but there
is very little overhead to that anyway.
For comparision sake I
mocked up the current postgresql 8.3’s strpos() from varlena.c.
I’ve no idea how accuratly this will be to the real version. Probably
find out once I write the patch. I replaced strncmp with my own version as I
was having problems with my compilers optimizer, it optimized too much and I
was unable to benchmark. I felt running without and optimization was unfair as
strncmp is. Perhaps if someone else wants to benchmark with –O2 on gcc
they should first replace pg_strncmp with strncmp.
Currently I have a single
file which can be compiled itself that contains benchmarks for the functions
I’ve tested so far. All of these implement a stripped out version of the
Boyer Moore algotithm. I’ve done away with one of the character maps in a
bid to reduce the overhead involved creating them.
The position calculation
will likely need changed for multi byte characters. I’m suspecting
I’m not meant to be doing length calculations on pointers. Can someone
confirm?
To keep this email short
I’ve put most of the information required in as comments into the source
of the program.
I’ve uploaded some
benchmark results using the attached program. The benchmarking was done on
windows…
Please see http://www.unixbeast.com/~fat/test1.html
there are 10 tests in total, in each I attempt to exploit weaknesses, some with
my functions and some with the existing function.
My method for version 8 of
the function probably looks quite unusual as it does some simple cost based
optimisation
I look forward to receiving
feedback on this.
David.