
Hmmm, I hadn't thought about that, so I took a quick look at the term
vector support.
What I'm really looking for is a compact but performant representation of
a set of filters on the same single-term field. Using term vectors would
mean an algorithm similar to:
String myfield;
String myterm;
TermFreqVector tv;
for (int i = 0; i < maxDoc; i++) {
    tv = reader.getTermFreqVector(i, myfield);
    if (tv != null && tv.indexOf(myterm) != -1) {
        // include this doc...
    }
}
The key thing I am looking to achieve here is performance comparable to
filters. I suspect getTermFreqVector() is not efficient enough, but I'll
give it a try.
Tim
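For reference, here is a minimal standalone sketch of the termInDoc/termToInt approach from the MultiFilter idea quoted below. It is plain Java with hand-built sample data and no Lucene dependency; the class name and the first-seen term numbering are illustrative (a real implementation would number terms in sort order so that range queries map to contiguous term numbers):

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Sketch of the "multi-filter" idea: one term per document per field,
// terms mapped to small contiguous ints, so a range/set query becomes
// a single linear scan over an int[] instead of a per-term filter.
public class MultiFilterSketch {

    final Map<String, Integer> termToInt = new HashMap<>();
    final int[] termInDoc; // termInDoc[doc] = number of this doc's single term

    MultiFilterSketch(String[] termPerDoc) {
        termInDoc = new int[termPerDoc.length];
        for (int doc = 0; doc < termPerDoc.length; doc++) {
            // Assign numbers in first-seen order for simplicity; a real
            // index would assign them in term-sort order.
            Integer n = termToInt.get(termPerDoc[doc]);
            if (n == null) {
                n = termToInt.size();
                termToInt.put(termPerDoc[doc], n);
            }
            termInDoc[doc] = n;
        }
    }

    // Range check: include every doc whose term number is in [lo, hi].
    BitSet rangeFilter(int lo, int hi) {
        BitSet bits = new BitSet(termInDoc.length);
        for (int doc = 0; doc < termInDoc.length; doc++) {
            if (termInDoc[doc] >= lo && termInDoc[doc] <= hi) {
                bits.set(doc);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        // one "country" term per document
        MultiFilterSketch f = new MultiFilterSketch(
            new String[] {"de", "us", "de", "fr", "us"});
        int de = f.termToInt.get("de");
        System.out.println(f.rangeFilter(de, de)); // {0, 2}
    }
}
```

The point is that once the int[] exists, any range or set query over the field costs one linear scan, comparable to applying a cached filter.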
On 11/10/08 10:54 AM, "Paul Elschot" <[email protected]> wrote:
> Tim,
>
> I didn't follow all the details, so this may be somewhat off,
> but did you consider using TermVectors?
>
> Regards,
> Paul Elschot
>
>
> On Monday 10 November 2008 19:18:38, Tim Sturge wrote:
>> Yes, that is a significant issue. What I'm coming to realize is that
>> either I will end up with something like
>>
>> class MultiFilter {
>>     String field;
>>     private int[] termInDoc;
>>     Map<Term,Integer> termToInt;
>>     ...
>> }
>>
>> which can be entirely built on the current Lucene APIs, but has
>> significantly more overhead (the termToInt mapping in particular, and
>> the need to construct the mapping and array on startup).
>>
>>
>> Or I can go deep into the guts and add a per-segment data file with a
>> format something like
>>
>> int version
>> int numFields
>> (int fieldNum, long offset) ^ numFields
>> (int termForDoc) ^ (maxDocs * numFields)
>>
>> and add something like boolean storeMultiFilter to FieldInfo, and
>> something like STORE_MULTIFILTER = 0x40 to FieldInfos. I'd need to
>> add an int termNum to the .tis file as well.
>>
>> This is clearly a lot more work than the first solution, but it is a
>> lot nicer to deal with as well. Is this interesting to anyone other
>> than me?
>>
>> Tim
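As a rough illustration, the proposed per-segment layout round-trips with plain java.io streams. This is only a sketch of the byte layout described above, not Lucene code; the version, field numbers, and sample data are made up:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Round-trips the proposed layout:
//   int version
//   int numFields
//   (int fieldNum, long offset) ^ numFields
//   (int termForDoc) ^ (maxDocs * numFields)
public class MultiFilterFileSketch {

    static byte[] write(int version, int[] fieldNums, int[][] termsPerField)
            throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(version);
        out.writeInt(fieldNums.length);
        // Header entry per field: fieldNum plus the byte offset where
        // that field's per-doc term numbers start.
        long offset = 4 + 4 + fieldNums.length * 12L; // header size
        for (int i = 0; i < fieldNums.length; i++) {
            out.writeInt(fieldNums[i]);
            out.writeLong(offset);
            offset += termsPerField[i].length * 4L;
        }
        for (int[] terms : termsPerField) {
            for (int t : terms) {
                out.writeInt(t);
            }
        }
        return bytes.toByteArray();
    }

    static int[] readField(byte[] data, int fieldNum, int maxDoc)
            throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        in.readInt(); // version
        int numFields = in.readInt();
        long offset = -1;
        for (int i = 0; i < numFields; i++) {
            int f = in.readInt();
            long o = in.readLong();
            if (f == fieldNum) offset = o;
        }
        in.close();
        // Seek to the field's block and read maxDoc term numbers.
        DataInputStream in2 = new DataInputStream(new ByteArrayInputStream(data));
        in2.skipBytes((int) offset);
        int[] terms = new int[maxDoc];
        for (int d = 0; d < maxDoc; d++) terms[d] = in2.readInt();
        return terms;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = write(1, new int[] {3, 7},
            new int[][] {{0, 2, 1}, {5, 5, 4}});
        int[] f7 = readField(data, 7, 3);
        System.out.println(f7[0] + " " + f7[1] + " " + f7[2]); // 5 5 4
    }
}
```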
>>
>> On 11/9/08 12:23 PM, "Michael McCandless" <[email protected]>
> wrote:
>>> Conceivably, TermInfosReader could track the sequence number of
>>> each term.
>>>
>>> A seek/skipTo would know which sequence number it just jumped to,
>>> because the index is regular (every 128 terms by default), and then
>>> each next() call could increment that. Then retrieving this number
>>> would be as costly as calling e.g. IndexReader.docFreq(Term) is now.
>>>
>>> But I'm not sure how a multi-segment index would work, i.e. how
>>> would MultiSegmentReader compute this for its terms? Or maybe
>>> you'd just do this per-segment?
>>>
>>> Mike
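The sequence-number bookkeeping Mike describes can be sketched without Lucene. Here a sorted String[] stands in for the .tis term list and a second array holding every 128th term stands in for the .tii index; ordinal() binary-searches the index, then scans forward counting next() calls (names and data are illustrative):

```java
import java.util.Arrays;

// Sketch: if the term index stores every 128th term, a seek knows its
// sequence number as indexEntry * 128, and each next() increments it.
public class TermOrdinalSketch {

    static final int INDEX_INTERVAL = 128;

    // terms: all terms in sorted order; index: terms[0], terms[128], ...
    static int ordinal(String[] terms, String[] index, String target) {
        // Binary-search the in-memory index (the .tii equivalent).
        int pos = Arrays.binarySearch(index, target);
        int start = (pos >= 0 ? pos : -pos - 2) * INDEX_INTERVAL;
        if (start < 0) return -1; // target sorts before the first term
        // Scan at most one interval of the main list, counting steps.
        for (int ord = start;
             ord < terms.length && ord < start + INDEX_INTERVAL; ord++) {
            if (terms[ord].equals(target)) return ord;
        }
        return -1; // not in the dictionary
    }

    public static void main(String[] args) {
        String[] terms = new String[300];
        for (int i = 0; i < 300; i++) terms[i] = String.format("term%05d", i);
        String[] index = {terms[0], terms[128], terms[256]};
        System.out.println(ordinal(terms, index, "term00200")); // 200
    }
}
```

The scan never visits more than one index interval, so the cost is the same order as a term dictionary seek today.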
>>>
>>> Tim Sturge wrote:
>>>> Hi,
>>>>
>>>> I'm wondering if there is any easy technique to number the terms
>>>> in an index.
>>>> (By "number" I mean: map a sequence of terms to a contiguous range
>>>> of integers, and map terms to these numbers efficiently.)
>>>>
>>>> Looking at the Term class and the .tis/.tii index format, it
>>>> appears that the terms are stored in an ordered and
>>>> prefix-compressed format, but while there are pointers from a term
>>>> to the .frq and .prx files, neither is really suitable as a
>>>> sequence number.
>>>>
>>>> The reason I have this question is that I am writing a
>>>> multi-filter for single-term fields. My index contains many fields
>>>> for which each document contains a single term (e.g. date, zipcode,
>>>> country), and I need to perform range queries or set matches over
>>>> these fields, many of which are very inclusive (they match >10% of
>>>> the total documents).
>>>>
>>>> A cached RangeFilter works well when there are a small number of
>>>> potential options (e.g. for countries), but when there are many
>>>> options (consider a date range or a set of zipcodes) there are too
>>>> many potential choices to cache each possibility, and it is too
>>>> inefficient to build a filter on the fly for each query (as you
>>>> have to visit 10% of the documents to build the filter, despite the
>>>> query itself matching 0.1%).
>>>>
>>>> Therefore I was considering building an int[reader.maxDoc()] array
>>>> for each field and putting into it the term number for each
>>>> document. This relies on the fact that each document contains only
>>>> a single term for this field, but with it I should be able to
>>>> quickly construct a "multi-filter" (that is, something that
>>>> iterates the array and checks that the term is in the range or
>>>> set).
>>>>
>>>> Right now it looks like I can do some very ugly surgery and
>>>> perhaps use the offset into the .prx file, even though it is not
>>>> contiguous. But I'm hoping there is a better technique that I'm
>>>> just not seeing right now.
>>>>
>>>> Thanks,
>>>>
>>>> Tim
>>>
>>> 
>>>  To unsubscribe, email: [email protected]
>>> For additional commands, email: [email protected]


