Very quick comments.
----- Original Message ----
> From: Justus Pendleton <[email protected]>
> To: [email protected]
> Sent: Sunday, November 2, 2008 10:42:52 PM
> Subject: Performance of never optimizing
> I have a couple of questions regarding some Lucene benchmarking and what the
> results mean. (Skip to the numbered list at the end if you don't want to read
> the lengthy exegesis :)
> I'm a developer for JIRA. We are currently trying to get a better
> understanding of Lucene, and our use of it, to cope with the needs of our
> customers. These "large" indexes are only a couple hundred thousand documents
> but our problem is compounded by the fact that they have a relatively high rate
> of modification (=delete+insert of new document) and our users expect these
> modifications to show up in query results pretty much instantly.
This will be a tough call with large indices - there is no real-time search in
> Our current default behaviour is a merge factor of 4. We perform an optimize
> on the index every 4000 additions. We also perform an optimize at midnight.
I wouldn't optimize every 4000 additions - you are killing IO, rewriting the
whole index, while trying to provide fast searches, plus you are locking the
index for other modifications.
> Our fundamental problem is that these optimizations are locking the index for
> unacceptably long periods of time, something that we want to resolve for our
> next major release, hopefully without undermining search performance too much.
Why are you optimizing? Trying to make the search faster? I would try to
avoid optimizing during high usage periods.
> In the Lucene javadoc there is a comment, and a link to a mailing list
> discussion, that suggests applications such as JIRA should never perform
> optimize but should instead set their merge factor very low.
Right, you can let Lucene merge segments.
> In an attempt to understand the impact of a) lowering the merge factor from 4 to
> 2 and b) never, ever optimizing on an index (over the course of years and
> millions of additions/updates) I wanted to try to benchmark Lucene.
One thing that you might not have tried is the constant re-opening of the
IndexReader, which you'll need to do if you want to see index changes instantly.
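To illustrate why the reopen matters, here is a toy sketch in plain Java. It does not use Lucene at all: ToyIndex and ToyReader are made-up classes that only mimic the point-in-time behaviour of a reader, where a reader sees the index as it was when opened and has to be reopened to see later changes. In Lucene itself the corresponding call should be IndexReader.reopen(), but check that against the version you run.

```java
import java.util.ArrayList;
import java.util.List;

public class ReopenSketch {
    /** Hypothetical toy index: an append-only document list plus a version counter. */
    static class ToyIndex {
        private final List<String> docs = new ArrayList<>();
        private long version = 0;

        synchronized void add(String doc) {
            docs.add(doc);
            version++;
        }

        synchronized long version() {
            return version;
        }

        /** Opens a reader over a copy of the current documents. */
        synchronized ToyReader open() {
            return new ToyReader(this, new ArrayList<>(docs), version);
        }
    }

    /** Hypothetical toy reader: sees only the documents present when it was opened. */
    static class ToyReader {
        private final ToyIndex index;
        private final List<String> snapshot;
        private final long version;

        ToyReader(ToyIndex index, List<String> snapshot, long version) {
            this.index = index;
            this.snapshot = snapshot;
            this.version = version;
        }

        int numDocs() {
            return snapshot.size();
        }

        /** Returns this reader if the index is unchanged, otherwise a fresh snapshot. */
        ToyReader reopen() {
            return index.version() == version ? this : index.open();
        }
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("issue-1");
        ToyReader reader = index.open();
        index.add("issue-2"); // not visible to the reader opened above
        System.out.println("before reopen: " + reader.numDocs());
        reader = reader.reopen();
        System.out.println("after reopen:  " + reader.numDocs());
    }
}
```

In a benchmark, the cost of each reopen (and of warming the new reader) is part of what your users experience, so it probably belongs inside the measured loop.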
> I used the contrib/benchmark framework and wrote a small algorithm that adds
> documents to an index (using the Reuters doc generator), does a search, does
> optimize, then does another search. All the pretty pictures can be seen at:
So you indexed once and then measured search performance? Or did you measure
indexing performance? I can't quite tell from your email.
And in one case you optimized before searching and in the other you did not
> I have several questions, hopefully they aren't overwhelming in their
> 1. Why does the merge factor of 4 appear to be faster than the merge factor
> of 2?
Faster for indexing or searching? If indexing, then it's because 4 means fewer
segment merges than 2. If searching, then I don't know, unless you had
indexing and searching happening in parallel, which then means less IO for 4.
Did your index fit in RAM, by the way?
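To make the "fewer segment merges" point concrete, here is a rough simulation in plain Java. It is a simplified model and not Lucene's actual merge policy: it assumes every flush produces one equally sized segment and that whenever mergeFactor segments pile up at one level they are merged into a single segment at the next level. The class and method names are made up for the sketch.

```java
public class MergeFactorSketch {
    /**
     * Simulates a simplified level-based merge policy (an assumption, not Lucene's code).
     * Returns {number of merges, flush-sized units rewritten, segments remaining}.
     */
    static long[] simulate(int flushes, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be >= 2");
        }
        int[] segmentsAtLevel = new int[32];
        long merges = 0;
        long rewritten = 0;
        for (int i = 0; i < flushes; i++) {
            segmentsAtLevel[0]++;          // a flush adds one small segment
            int level = 0;
            long segmentSize = 1;          // size of one segment at this level, in flush units
            // Cascade: a full level is merged into one segment at the next level.
            while (segmentsAtLevel[level] == mergeFactor) {
                segmentsAtLevel[level] = 0;
                segmentsAtLevel[level + 1]++;
                merges++;
                rewritten += segmentSize * mergeFactor;
                segmentSize *= mergeFactor;
                level++;
            }
        }
        int remaining = 0;
        for (int count : segmentsAtLevel) {
            remaining += count;
        }
        return new long[] {merges, rewritten, remaining};
    }

    public static void main(String[] args) {
        for (int mergeFactor : new int[] {2, 4}) {
            long[] r = simulate(16, mergeFactor);
            System.out.println("mergeFactor=" + mergeFactor
                    + " merges=" + r[0]
                    + " unitsRewritten=" + r[1]
                    + " segmentsLeft=" + r[2]);
        }
    }
}
```

For 16 flushes this toy model gives 15 merges and 64 units rewritten with a merge factor of 2, against 5 merges and 32 units rewritten with a merge factor of 4. The flip side is that between merges a higher merge factor leaves more segments for each search to visit.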
> 2. Why does non-optimized searching appear to be faster than optimized searching
> once the index hits ~500,000 documents?
Not sure without seeing the index/machine.
It sounds like you were measuring search performance while at the same time
increasing the index size by incrementally adding more docs?
> 3. There appears to be a fairly sizable performance drop across the board around
> 450,000 documents. Why is that?
Something to do with Lucene merging index segments around that point? At this
point I'm assuming you were measuring search speed while indexing.
> 4. Searching performance appears to decrease towards a fairly pessimistic 20
> searches per second (for a relatively simple search). Is this really what we
> should expect long-term from Lucene?
20 reqs/sec sounds very low. How large is your index, how much RAM, and how
about heap size?
What were your queries like? random? from log?
> 5. Does my benchmark even make sense? I am far from an expert on benchmarking so
> it is possible I'm not measuring what I think I am measuring.
I'm confused by what exactly you did and measured, but it could just be that
> Thanks in advance for any insight you can provide. This is an area that we very
> much want to understand better as Lucene is a key part of JIRA's success,
> : http://www.atlassian.com
> : http://www.gossamer-threads.com/lists/lucene/java-dev/47895
> : http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch