[Top] [All Lists]

Re: Term numbering and range filtering

Subject: Re: Term numbering and range filtering
From: Michael McCandless
Date: Sun, 9 Nov 2008 15:23:56 -0500

Conceivably, TermInfosReader could track the sequence number of each term.

A seek/skipTo would know which sequence number it just jumped too, because the index is regular (every 128 terms by default), and then each next() call could increment that. Then retrieving this number would be as costly as calling eg IndexReader.docFreq(Term) is now.

But I'm not sure how a multi-segment index would work, ie how would MultiSegmentReader compute this for its terms? Or maybe you'd just do this per-segment?


Tim Sturge wrote:


I’m wondering if there is any easy technique to number the terms in an index (By number I mean map a sequence of terms to a contiguous range of integers
and map terms to these numbers efficiently)

Looking at the Term class and the .tis/.tii index format it appears that the terms are stored in an ordered and prefix-compressed format, but while there
are pointers from a term to the .frq and .prx files, neither is really
suitable as a sequence number.

The reason I have this question is that I am writing a multi-filter for single term fields. My index contains many fields for which each document contains a single term (e.g. date, zipcode, country) and I need to perform
range queries or set matches over these fields, many of which are very
inclusive (they match >10% of the total documents)

A cached RangeFilter works well when there are a small number of potential options (e.g. for countries) but when there are many options (consider a date range or a set of zipcodes) there are too many potential choices to cache each possibility and it is too inefficient to build a filter on the
fly for each query (as you have to visit 10% of documents to build the
filter despite the query itself matching 0.1%)

Therefore I was considering building a int[reader.maxDocs()] array for each field and putting into it the term number for each document. This relies on the fact that each document contains only a single term for this field, but with it I should be able to quickly construct a “multi-filter” (that is, something that iterates the array and checks that the term is in the range
or set).

Right now it looks like I can do some very ugly surgery and perhaps use the offset to the prx file even though it is not contiguous. But I’m hoping
there is a better technique that I’m just not seeing right now.



To unsubscribe, e-mail: java-user-unsubscribe@xxxxxxxxxxxxxxxxx
For additional commands, e-mail: java-user-help@xxxxxxxxxxxxxxxxx

<Prev in Thread] Current Thread [Next in Thread>