Robert Muir commented on LUCENE-2089:
bq. but BKTree can help only if threshold is small, otherwise it is similar to
right, this is one aspect that makes the DFA approach attractive, even if there
is a high threshold, so high that its not even worth seeking around and
brute-force iteration is faster, we can still compute the levenstein distance
in O\(n\) time with a DFA, instead of the expensive dynamic programming
i found this out for wildcard as well, as even when just doing brute-force
enumeration, the compiled RunAutomaton eats the old hand-coded wildcardEquals()
for breakfast, because its a faster matcher.
this is in line with some tests done by jflex:
http://jflex.de/manual.html#PerformanceHandwritten . I have to agree that in
this day and age, writing hand-coded stuff to parse or match is simply a waste
of time, as chances are a DFA will be considerably more efficient.
> explore using automaton for fuzzyquery
> Key: LUCENE-2089
> URL: https://issues.apache.org/jira/browse/LUCENE-2089
> Project: Lucene - Java
> Issue Type: Wish
> Components: Search
> Reporter: Robert Muir
> Assignee: Mark Miller
> Priority: Minor
> Attachments: LUCENE-2089.patch, Moman-0.2.1.tar.gz, TestFuzzy.java
> Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is
> itching to write that nasty algorithm)
> we can optimize fuzzyquery by using AutomatonTermsEnum, here is my idea
> * up front, calculate the maximum required K edits needed to match the users
> supplied float threshold.
> * for at least small common E up to some max K (1,2,3, etc) we should create
> a DFA for each E.
> if the required E is above our supported max, we use "dumb mode" at first (no
> seeking, no DFA, just brute force like now).
> As the pq fills, we swap progressively lower DFAs into the enum, based upon
> the lowest score in the pq.
> This should work well on avg, at high E, you will typically fill the pq very
> quickly since you will match many terms.
> This not only provides a mechanism to switch to more efficient DFAs during
> enumeration, but also to switch from "dumb mode" to "smart mode".
> i modified my wildcard benchmark to generate random fuzzy queries.
> * Pattern: 7N stands for NNNNNNN, etc.
> * AvgMS_DFA: this is the time spent creating the automaton (constructor)
> ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
> as you can see, this prototype is no good yet, because it creates the DFA in
> a slow way. right now it creates an NFA, and all this wasted time is in
> NFA->DFA conversion.
> So, for a very long string, it just gets worse and worse. This has nothing to
> do with lucene, and here you can see, the TermEnum is fast (AvgMS -
> AvgMS_DFA), there is no problem there.
> instead we should just build a DFA to begin with, maybe with this paper:
> we can precompute the tables with that algorithm up to some reasonable K, and
> then I think we are ok.
> the paper references using http://portal.acm.org/citation.cfm?id=135907 for
> linear minimization, if someone wants to implement this they should not worry
> about minimization.
> in fact, we need to at some point determine if AutomatonQuery should even
> minimize FSM's at all, or if it is simply enough for them to be deterministic
> with no transitions to dead states. (The only code that actually assumes
> minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a
> summation easily). we need to benchmark really complex DFAs (i.e. write a
> regex benchmark) to figure out if minimization is even helping right now.
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
To unsubscribe, e-mail: java-dev-unsubscribe@xxxxxxxxxxxxxxxxx
For additional commands, e-mail: java-dev-help@xxxxxxxxxxxxxxxxx