Hi,
I am trying to move away from a system where I counted term frequencies by
hand in a highlighter to decide whether a result was useful to me. In an
earlier post on this list someone suggested I could boost the terms that are
useful to me and only accept hits above a certain score threshold. However,
in my tests, I can't find a deterministic way of calculating that threshold.
Here is an example of what I mean:
My query: "John Smith" "John Smith Manufacturing" "San Francisco"
"California"
Results are only useful to me if they contain the first term "John Smith"
and/or the second term "John Smith Manufacturing", with or without the
"San Francisco" and "California" terms. Results that contain only "San
Francisco" or "California" can be ignored.
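To make the rule concrete, here is a minimal Python sketch of the accept/reject
behaviour I'm after. This is not Lucene code: the name is_useful and the plain
case-insensitive substring matching are only assumptions for illustration.

```python
# Hypothetical illustration of the acceptance rule, not a Lucene API.
# A hit is useful if it mentions at least one required phrase, however
# often the optional phrases ("San Francisco", "California") occur.
REQUIRED_PHRASES = ("john smith", "john smith manufacturing")


def is_useful(text: str) -> bool:
    """Return True if the text contains at least one required phrase.

    Frequency is deliberately ignored: one occurrence is enough.
    Note that any text containing "john smith manufacturing" also
    contains "john smith"; both are listed only to mirror the query.
    """
    lowered = text.lower()
    return any(phrase in lowered for phrase in REQUIRED_PHRASES)


if __name__ == "__main__":
    hits = [
        "John Smith lives in California",
        "San Francisco " * 1000,
        "John Smith Manufacturing, San Francisco, California",
    ]
    for hit in hits:
        print(is_useful(hit), hit[:40])
```

The point is that the decision is a yes/no test on the required phrases, and
the number of "San Francisco" occurrences never enters into it.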
I tried something like:
"John Smith"^200 "John Smith Manufacturing"^100 "San Francisco"^2 "California"^1
But I can't find a good method of calculating a cut-off score that filters
out the results containing only "San Francisco" or "California" using the
term boosts and the resulting score. I also don't care about frequency:
I want the result even if "John Smith" occurs only once, and I don't want a
document with "San Francisco" a million times to score higher than a
document with a single occurrence of "John Smith".
Sorry if that's confusing.
Any ideas?
Thanks,
Max