[email protected]
[Top] [All Lists]

Re: Term numbering and range filtering

Subject: Re: Term numbering and range filtering
From: Tim Sturge
Date: Mon, 10 Nov 2008 13:21:20 -0800
Hmmm -- I hadn't thought about that so I took a quick look at the term
vector support. 

What I'm really looking for is a compact but performant representation of
a set of filters on the same (one term field). Using term vectors would
mean an algorithm similar to:

String myfield;
String myterm;
TermVector tv;
for (int i = 0 ;  i < maxDoc ; i++) {
    tv = reader.getTermFreqVector(i,country)
    if (tv.indexOf(myterm) != -1) {
          // include this doc...

The key thing I am looking to achieve here is performance comparable to
filters. I suspect getTermFremVector() is not efficient enough but I'll give
it a try.


On 11/10/08 10:54 AM, "Paul Elschot" <[email protected]> wrote:

> Tim,
> I didn't follow all the details, so this may be somewhat off,
> but did you consider using TermVectors?
> Regards,
> Paul Elschot
> Op Monday 10 November 2008 19:18:38 schreef Tim Sturge:
>> Yes, that is a significant issue. What I'm coming to realize is that
>> either I will end up with something like
>> class MultiFilter {
>>    String field;
>>    private int[] termInDoc;
>>    Map<Term,int> termToInt;
>>    ...
>> }
>> which can be entirely built on the current lucene APIs but has
>> significantly more overhead (the termToInt mapping in particular and
>> the need to construct the mapping and array on startup)
>> Or I can go deep into the guts and add a data file per-segment with a
>> format something like
>> int version
>> int numFields
>> (int fieldNum, long offset) ^ numFields
>> (int termForDoc) ^ (maxDocs * numFields)
>> and add something to FieldInfo  like  boolean storeMultiFilter;
>> and FieldInfos something like STORE_MULTIFILTER = 0x40; I'd need to
>> add an int termNum to the .tis file as well.
>> This is clearly a lot more work than the first solution, but it is a
>> lot nicer to deal with as well. Is this interesting to anyone other
>> than me?
>> Tim
>> On 11/9/08 12:23 PM, "Michael McCandless" <[email protected]>
> wrote:
>>> Conceivably, TermInfosReader could track the sequence number of
>>> each term.
>>> A seek/skipTo would know which sequence number it just jumped too,
>>> because the index is regular (every 128 terms by default), and then
>>> each next() call could increment that.  Then retrieving this number
>>> would be as costly as calling eg IndexReader.docFreq(Term) is now.
>>> But I'm not sure how a multi-segment index  would work, ie how
>>> would MultiSegmentReader compute this for its terms?  Or maybe
>>> you'd just do this per-segment?
>>> Mike
>>> Tim Sturge wrote:
>>>> Hi,
>>>> I¹m wondering if there is any easy technique to number the terms
>>>> in an index
>>>> (By number I mean map a sequence of terms to a contiguous range of
>>>> integers
>>>> and map terms to these numbers efficiently)
>>>> Looking at the Term class and the .tis/.tii index format it
>>>> appears that the
>>>> terms are stored in an ordered and prefix-compressed format, but
>>>> while there
>>>> are pointers from a term to the .frq and .prx files, neither is
>>>> really suitable as a sequence number.
>>>> The reason I have this question is that I am writing a
>>>> multi-filter for
>>>> single term fields. My index contains many fields for which each
>>>> document
>>>> contains a single term (e.g. date, zipcode, country) and I need to
>>>> perform
>>>> range queries or set matches over these fields, many of which are
>>>> very inclusive (they match >10% of the total documents)
>>>> A cached RangeFilter works well when there are a small number of
>>>> potential
>>>> options (e.g. for countries) but when there are many options
>>>> (consider a
>>>> date range or a set of zipcodes) there are too many potential
>>>> choices to
>>>> cache each possibility and it is too inefficient to build a filter
>>>> on the
>>>> fly for each query (as you have to visit 10% of documents to build
>>>> the filter despite the query itself matching 0.1%)
>>>> Therefore I was considering building a int[reader.maxDocs()] array
>>>> for each
>>>> field and putting into it the term number for each document. This
>>>> relies on
>>>> the fact that each document contains only a single term for this
>>>> field, but
>>>> with it I should be able to quickly construct a ³multi-filter²
>>>> (that is,
>>>> something that iterates the array and checks that the term is in
>>>> the range
>>>> or set).
>>>> Right now it looks like I can do some very ugly surgery and
>>>> perhaps use the
>>>> offset to the prx file even though it is not contiguous. But I¹m
>>>> hoping
>>>> there is a better technique that I¹m just not seeing right now.
>>>> Thanks,
>>>> Tim
>>> -------------------------------------------------------------------
>>> -- To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]

To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

<Prev in Thread] Current Thread [Next in Thread>