One question that comes to mind is: what are you looking to do?
What I'm trying to do is prevent Lucene from ranking documents that
use one term many times above documents that match more distinct terms.
I've got some huge queries with quite a number of unique terms. I
want the documents that hit more unique terms to float to the top,
while documents that hit only a few of the terms sink to the
bottom (even if they have more occurrences of those terms).
Lucene, as I understand things, does this for the most part, though it
is possible that term frequency can play a significant role and drown
out the part of the desired behavior that I'd like to keep.
My users want to grab search results using the current Lucene
method, then select a checkbox and search without term frequency
contributing to the score. I, on the other hand, have a vested
interest in not maintaining two indexes.
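
To make the problem concrete, here is a toy sketch in plain Java. It is
not Lucene code: the class FlatTfSketch and its methods are names I made
up for illustration. It assumes the classic sqrt(freq) shape for tf() and
deliberately ignores idf, norms, coord and boosts. All it shows is how a
flat tf (1 if the term is present, 0 otherwise) changes which of two
documents comes out on top.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only; hypothetical names, not part of the Lucene API.
final class FlatTfSketch {

    /** Assumed classic-style tf: grows with the square root of the in-document count. */
    static double sqrtTf(int freq) {
        return Math.sqrt(freq);
    }

    /** Flattened tf: 1 if the term occurs at all, 0 otherwise. */
    static double flatTf(int freq) {
        return freq > 0 ? 1.0 : 0.0;
    }

    /** Sums tf over the query terms. idf, norms, coord and boosts are left out on purpose. */
    static double score(Map<String, Integer> termCounts, List<String> queryTerms, boolean flat) {
        double sum = 0.0;
        for (String term : queryTerms) {
            int freq = termCounts.getOrDefault(term, 0);
            sum += flat ? flatTf(freq) : sqrtTf(freq);
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("caesar", "rome", "senate");

        // One term repeated nine times.
        Map<String, Integer> repetitive = new HashMap<>();
        repetitive.put("caesar", 9);

        // Two distinct query terms, once each.
        Map<String, Integer> broad = new HashMap<>();
        broad.put("caesar", 1);
        broad.put("rome", 1);

        // With sqrt tf the repetitive document wins (3.0 vs 2.0).
        System.out.println(score(repetitive, query, false) + " vs " + score(broad, query, false));
        // With flat tf the broader document wins (1.0 vs 2.0).
        System.out.println(score(repetitive, query, true) + " vs " + score(broad, query, true));
    }
}
```

If tf() really is applied at search time (an assumption I still need to
verify), then a switch like the flat flag above could live in a second
Similarity rather than in a second index.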
I don't want the title "Caesar Came, Caesar Saw, Caesar Conquered" to have a
higher score just because the word Caesar is repeated.
This is very much the kind of problem I'm trying to address.
I'm currently looking at whether overriding queryNorm() to always
return 1 is a good thing or not.
I vaguely recall reading something about boosts as well; I'm not sure
you want to mess with this one. For me, idf() is the bigger question.
For a single term, [freq is] determined at index time and stored in the index.
I guess what I'm asking is: is freq, the value passed to tf(), the raw
count of the term, or a ratio of the term count to the total terms in the index?
Scores across queries aren't really comparable though.
... Lucene doesn't currently do this "mixing", ...
This has always been my understanding of search engines: the raw
scores are effectively meaningless when used to mix in results from other queries.
However, according to the documentation,
it would seem that #4 queryNorm() says:
"This factor does not affect document ranking (since all ranked
documents are multiplied by the same factor), but rather just attempts
to make scores from different queries (or even different indexes) comparable."
This to me is black magic. It implies that one can do two different
queries and a merge-sort, and further that the content can come from
different indexes. Either I'm reading this completely wrong, or the documentation may
need an update.
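
The first half of that quote, at least, I can convince myself of with a
toy sketch in plain Java. Again this is not Lucene code, and
QueryNormSketch is a made-up name. It assumes queryNorm is
1/sqrt(sumOfSquaredWeights), which is how I read the default
implementation, and checks that multiplying every score by that one
factor leaves the ranking untouched. It says nothing about whether
scores from two different queries become comparable.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only; hypothetical names, not part of the Lucene API.
final class QueryNormSketch {

    /** Assumed default form of the normalization factor. */
    static double queryNorm(double sumOfSquaredWeights) {
        return 1.0 / Math.sqrt(sumOfSquaredWeights);
    }

    /** Returns a new array with every score multiplied by the same factor. */
    static double[] scale(double[] scores, double factor) {
        double[] out = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            out[i] = scores[i] * factor;
        }
        return out;
    }

    /** Returns document indices ordered by descending score. */
    static List<Integer> rank(double[] scores) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < scores.length; i++) {
            ids.add(i);
        }
        ids.sort((a, b) -> Double.compare(scores[b], scores[a]));
        return ids;
    }

    public static void main(String[] args) {
        double[] raw = {0.4, 2.5, 1.1};
        double[] normalized = scale(raw, queryNorm(16.0)); // factor is 0.25

        // Both lines print the same ordering: [1, 2, 0]
        System.out.println(rank(raw));
        System.out.println(rank(normalized));
    }
}
```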
Hope this provides better clarification.
As for why I'm asking all this: I'd like to get a firm understanding
of how Lucene does scoring, so that the behavior can be modified and,
if I'm ambitious enough, so that I can fill in some of the holes in
the documentation.