[email protected]
[Top] [All Lists]

Re: Test code for regex queries

Subject: Re: Test code for regex queries
From: Erik Hatcher
Date: Thu, 24 Nov 2005 04:25:15 -0500
On 24 Nov 2005, at 03:17, Paul Elschot wrote:
I must admit that I haven't used the surround parser.  For my custom
parser (a legacy syntax that no one here would want), I take any term
that has an *, ?, or [...] syntax as a regex term.
I had another look at the javadocs of java regex package.
The normal brackets in a regex are not needed for queries, so they
can be left as they are.
I don't understand what you mean about brackets not being needed.
Brackets certainly factor in greatly into regular expressions:
        <http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html>

What am I missing?

There are still some TODO's with the (Span)RegexQuery - such as being
wise about the prefix length.  Right now it is not wise enough.  I've
spent some time looking for a regex parser that could parse a regex
expression into an AST so that it could be used for determining the
last static character to start term enumeration.  This would also
come in very handy in being able to rotate a regular expression
string to maximize the static prefix when indexing with an analyzer
that rotates terms.  If anyone has suggestions/pointers to how this
could be accomplished, it'd be most appreciated!
I think I'll simply treat each term as a potential regex and
use alphanumeric characters for the prefix. I'll try and leave
parsing of the regex to the java regex package as much as possible.
Simply using alphanumeric to determine the prefix doesn't work in all
cases though. Here's an example I just added, commented out, to
TestRegexQuery:
 assertEquals(1, regexQueryNrHits("r?over"));

The 'r' is optional, yet it is chosen as a prefix and thus no documents match even though "over" does match that regex. The quick fix for this is, like FuzzyQuery, to enumerate all terms blindly. Another option is to allow the user of (Span)RegexQuery to provide the prefix, at least pushing the burden back a bit.
I still think a regex parser would be mighty useful for this issue,
as well as the next...
Rotating from the suffix should also be straightforward for
alphanumerical chars.
It's not as trivial as this, if you want the rotated prefixes to be
maximized. Take, for example, the regex expression
        ThisContainsTwo[abc]RegexPieces.*Total

Using rotation to maximize the prefix and speeding up term enumeration a calculation of of the maximum non-regex piece is needed, including a calculation on whether the head and tail combined make a larger prefix. For example, using '$' to denote the end of the string, the rotated version of this should be:
        Total$ThisContainsTwo[abc]RegexPieces.*

With a regex parse tree, it should be possible to be wise about what is a static prefix and to compute the size of all the static pieces allowing for clever rotation to make regex queries as efficient as possible. Now where is that regex parser grammar?! :/
I'd like the surround parser to be a power tool that provides
everything that Lucene has under the hood. Regexes fit well,
because they are already used for truncation, only the
truncation syntax will have a dot added as far as I can see now.
I'm still missing why brackets don't factor into the equation.

        Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

<Prev in Thread] Current Thread [Next in Thread>