java-user@lucene.apache.org
[Top] [All Lists]

Re: Weird time results doing wildcard queries

Subject: Re: Weird time results doing wildcard queries
From: Richard Krenek
Date: Thu, 8 Sep 2005 17:55:36 -0600
This answers a lot of questions and observations. We looked in the source 
code of the Hits object and found the getMoreDocs(int min) method which does 
what you stated below.

We are assuming you meant for us to use a HitCollector instead. This brings 
up a new question does the Searcher call the collect method in order that 
the docs are in the index or highest score first. We will look in LIA and do 
some testing to find this answer, but if you know it, feel free to answer.

Thanks, this has helped our understanding of Lucene a lot.

On 9/8/05, Yonik Seeley <yseeley@xxxxxxxxx> wrote:
> 
> The Hits class collects the document ids from the query in batches. If you
> iterate beyond what was collected, the query is re-executed to collect 
> more
> ids.
> 
> You can use the expert level search methods on IndexSearcher if this isn't
> what you want.
> 
> -Yonik
> 
> On 9/8/05, Richard Krenek <richard.krenek@xxxxxxxxx> wrote:
> >
> > I understand that for the query, but why does it matter once you have 
> the
> > Hits object? That is the part I'm baffled on. The query with the 
> wildcard
> > in
> > the front takes a lot longer, but we expected that.
> >
> > On 9/8/05, Jeremy Meyer <jakmcbane@xxxxxxxxx> wrote:
> > >
> > > The issue isn't with multiple wildcards exactly. Specifically, the
> > problem
> > > is if the query starts with a wildcard. In the case where it starts 
> with
> > a
> > > wildcard, lucene has no option but to linearly go over every term in 
> the
> > > index to see if it matches your pattern. It must visit every singe 
> term
> > in
> > > the index. If it doesn't start with a wildcard, lucene can skip to the
> > > relevant part of the index and only visit the relevant terms. For this
> > > reason, many people that use Lucene choose to disable having wildcard 
> at
> > the
> > > start of a search term. This is discussed in the "Lucene in Action"
> > book.
> > >
> > > ~Jack~
> > >
> > > >>Hello All,
> > > >>I am getting some weird time results when retrieving documents back
> > from
> > > a hits object. I am just timing this bit of code:
> > > >>Hits hits = searcher.search(query);
> > > >>long startTime = System.currentTimeMillis(); for (int i = 0; i <
> > > hits.length(); i++) { Document doc = hits.doc(i); String field = 
> doc.get
> > (defaultField);
> > > } System.out.println("Cycle Time: "+(System.currentTimeMillis
> > > ()-startTime));
> > > >>
> > > >>It seems when I have a wilcard query like *abcd* vs weqrew*, the
> > *abcd*
> > > query will always take longer to retrieve the documents even if they 
> are
> > of
> > > simular result sizes. We are talking a big difference 1 second vs 16. 
> It
> > is
> > > consistent no matter >>what order I run the queries in, terms with
> > multiple
> > > wildcards always take longer to retrieve the documents. I am not
> > counting
> > > the time of the query.
> > > >>
> > > >>The index is 2.18 GB, 9 fields per document, 10,694,190 documents,
> > > >>25,538,793 terms and has been optimized.
> > > >>
> > > >>I am not sure if this is a real or just a percieved issue. We cannot
> > > figure out why the type of query would affect the speed it takes to
> > retrieve
> > > each document. We have run this on both Windows XP and Linux. With the
> > same
> > > results. Also to >>note we did watch GC and this did not have any
> > > significant impact that we could se.
> > > >>
> > > >>We are trying to figure out what could cause this and how we can 
> work
> > > around it.
> > > >>
> > > >>
> > > >>Thanks,
> > > >>Richard
> > >
> >
> >
> 
> 
> --
> -Yonik
> Now hiring -- http://tinyurl.com/7m67g
> 
>
<Prev in Thread] Current Thread [Next in Thread>