
Re: Multi -threaded indexing of large number of PDF documents

Subject: Re: Multi -threaded indexing of large number of PDF documents
From: Michael McCandless
Date: Thu, 23 Oct 2008 15:45:09 -0400

Mark Miller wrote:

Glen Newton wrote:
2008/10/23 Mark Miller <[email protected]>:

It sounds like you might have some thread synchronization issues outside of Lucene. To simplify things a bit, you might try just using one IndexWriter. If I remember right, the IndexWriter is now pretty efficient, and there isn't much need to index to smaller indexes and then merge. There is a lot of juggling to get wrong with that approach.

While I agree it is easier to have a single IndexWriter, if you have
multiple cores you will get significant speed-ups with multiple
IndexWriters, even with the impact of merging at the end.
# IndexWriters = # physical cores is a reasonable rule of thumb.

General speed-up estimate: (# cores) * 0.6 to 0.8 times the throughput of a single IndexWriter

When I get around to it, I'll re-run my tests varying the # of
IndexWriters & post.
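
For illustration only, the rule of thumb above can be written out as arithmetic. This is a minimal sketch, not a benchmark: the class and method names are hypothetical, and the 0.6 and 0.8 factors are simply the ones quoted in the estimate. With 4 physical cores it gives a range of 2.4x to 3.2x over a single IndexWriter.

```java
import java.util.Locale;

/** Back-of-the-envelope form of the quoted rule of thumb (hypothetical helper). */
final class SpeedupEstimate {
    // Per-core scaling factors taken from the estimate in the message.
    static final double LOW_FACTOR = 0.6;
    static final double HIGH_FACTOR = 0.8;

    /** Returns {low, high}: the estimated speed-up over one IndexWriter. */
    static double[] estimate(int physicalCores) {
        if (physicalCores < 1) {
            throw new IllegalArgumentException("need at least one core");
        }
        return new double[] { physicalCores * LOW_FACTOR, physicalCores * HIGH_FACTOR };
    }

    public static void main(String[] args) {
        // Note: availableProcessors() reports logical processors, which can be
        // higher than the physical core count on hyper-threaded machines.
        int cores = args.length > 0
                ? Integer.parseInt(args[0])
                : Runtime.getRuntime().availableProcessors();
        double[] range = estimate(cores);
        System.out.printf(Locale.ROOT, "%d cores: about %.1fx to %.1fx%n",
                cores, range[0], range[1]);
    }
}
```

Pass the physical core count as the first argument if it differs from what the JVM reports.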


Hey Mr McCandless, what's up with that? Can IndexWriter be made to be as efficient as using multiple writers? Where do you suppose the holdup is? Number of threads doing merges? Sync contention? I hate the idea of multiple IndexWriters/Readers being more efficient than a single instance. In an ideal Lucene world, a single instance would hide the complexity and use the number of threads needed to match multiple-instance performance.

Honestly this surprises me: I would expect a single IndexWriter with multiple threads to be as fast as (or faster than, considering the extra merge time at the end) multiple IndexWriters.

IndexWriter's concurrency has improved a lot lately, with ConcurrentMergeScheduler. The only serious operation that is not concurrent is flushing the RAM buffer as a new segment; but in a well-tuned indexing process (large RAM buffer) the time spent there should be quite small, especially with a fast IO system.

Actually, addIndexes is also not concurrent in that if multiple threads call it, only one can run at once. But normally you would call it with all the indices you want to add, and then the merging is concurrent.

Glen, in your single IndexWriter test, is it possible there was accidental thread contention during document preparation or analysis?
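
To make that question concrete, here is a rough sketch of the single-writer layout with several threads. It uses only the JDK, not Lucene: DocSink is a hypothetical stand-in for the one shared writer, and prepare() is a hypothetical stand-in for PDF text extraction and analysis. The point it illustrates is that each worker prepares its document using only local state, so the only shared object the threads touch is the sink.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class SharedWriterSketch {
    /** Hypothetical stand-in for the single shared, thread-safe writer. */
    interface DocSink {
        void add(String preparedDoc);
    }

    /** Hypothetical per-document work (e.g. text extraction); touches no shared state. */
    static String prepare(String path) {
        return "text-of:" + path;
    }

    /** Prepares every path on a worker thread and adds it to the one shared sink. */
    static int indexAll(List<String> paths, DocSink sink, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger added = new AtomicInteger();
        for (String path : paths) {
            pool.execute(() -> {
                String doc = prepare(path); // done outside any lock
                sink.add(doc);              // the only shared call
                added.incrementAndGet();
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("indexing did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        }
        return added.get();
    }

    public static void main(String[] args) {
        List<String> paths = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            paths.add("doc" + i + ".pdf");
        }
        ConcurrentLinkedQueue<String> collected = new ConcurrentLinkedQueue<>();
        int count = indexAll(paths, collected::add, 4);
        System.out.println("added " + count + " documents");
    }
}
```

If prepare() were synchronized, or shared one parser instance behind a lock, the workers would serialize there no matter how concurrent the writer itself is.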

I do agree that we should strive to have enough concurrency in IndexWriter and IndexReader so that you don't get any real benefit from using separate instances. E.g., in 2.4.0 you can now open read-only IndexReaders, and on Unix you can use NIOFSDirectory, both of which should go a long way towards fixing IndexReader's concurrency issue.

