Hello folks,
Maybe one of you can help me with this (sorry, long read).
I have implemented a FuzzyPhraseQuery that works similarly to Lucene's
native PhraseQuery.
I.e. it can retrieve phrases for a query while tolerating insertions
and changes in term order.
In addition, it can also find matches where terms are missing (deletions).
Scoring is implemented as described here:
http://www.gossamer-threads.com/lists/lucene/java-user/33558#33558
So the scorer uses the total error rather than the maximum error for
insertions and out-of-order terms. That part all works fine (even though
the total errors I'm observing quickly lead to very low frequencies
being returned by sloppyFreq()).
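
To illustrate how quickly the frequency drops, here is a minimal Java sketch. It assumes the total error is passed as the distance to a function of the form 1 / (distance + 1), which is what Lucene's DefaultSimilarity.sloppyFreq computes. The class name SloppyFreqDecay and the sample error values are made up for illustration; the method only mirrors the formula and does not call Lucene.

```java
public class SloppyFreqDecay {

    // Mirrors the formula used by Lucene's DefaultSimilarity.sloppyFreq:
    // 1 / (distance + 1). Here "distance" is assumed to be the total error.
    static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    public static void main(String[] args) {
        // Hypothetical total errors, chosen only to show the decay.
        int[] totalErrors = {0, 1, 3, 6, 20};
        for (int totalError : totalErrors) {
            System.out.printf("total error %2d -> freq %.3f%n",
                    totalError, sloppyFreq(totalError));
        }
    }
}
```

With a total error of 6 (as in the deletion example further down), the frequency is already only 1/7 of an exact match.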
Now my problem is with scoring the deletion cases.
My initial idea was to penalize a missing term position with its maximum error.
Consider this:
Query: a b c d
Document A: b c d
Term a is missing, so score it as if it were at the worst possible position:
result: b c d a
pos. diffs: -1 -1 -1 +3
It can be observed that the max error for the nth missing term is 2n - 2
(here: 1 + 1 + 1 + 3 = 6 for n = 4).
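
The worst-case idea can be sketched in Java as a brute-force check. This is only an illustration under stated assumptions, not the actual scorer: all query terms are assumed to be distinct, the found terms are assumed to sit next to each other in query order, and the error is taken as the sum of absolute position differences. The names MissingTermPenalty, totalError and worstCasePenalty are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MissingTermPenalty {

    // Sum of |documentPosition - queryPosition| over all query terms.
    // Assumes every query term occurs exactly once in the arrangement.
    static int totalError(List<String> query, List<String> arrangement) {
        int total = 0;
        for (int queryPos = 0; queryPos < query.size(); queryPos++) {
            int docPos = arrangement.indexOf(query.get(queryPos));
            total += Math.abs(docPos - queryPos);
        }
        return total;
    }

    // Re-inserts the missing term at every possible slot among the found
    // terms and returns the largest total error seen.
    static int worstCasePenalty(List<String> query, String missing) {
        List<String> found = new ArrayList<>(query);
        found.remove(missing);
        int worst = 0;
        for (int slot = 0; slot <= found.size(); slot++) {
            List<String> arrangement = new ArrayList<>(found);
            arrangement.add(slot, missing);
            worst = Math.max(worst, totalError(query, arrangement));
        }
        return worst;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("a", "b", "c", "d");
        System.out.println(worstCasePenalty(query, "a")); // 6, i.e. 2n - 2 for n = 4
        System.out.println(worstCasePenalty(query, "b")); // 4
    }
}
```

Under these assumptions, a term missing from the middle of the phrase gets a smaller worst case than one missing from either end.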
If a query has 100 terms and, say, 10 of them are not found, I would
have a penalty of 190 + 192 + 194 etc.
For the extreme cases, this is rather simple to calculate. In the middle
of a phrase, things get tricky though. Also, the penalty becomes higher
as the number of terms increases.
So I think this is not a viable solution for my problem.
Does anyone know a better solution for scoring deletion cases?
Thanks for your input,
Philipp