Glen Newton wrote:
2008/10/23 Michael McCandless <[email protected]>:
Mark Miller wrote:
Glen Newton wrote:
Hey Mr McCandless, what's up with that? Can IndexWriter be made to
efficient as using Multiple Writers? Where do you suppose the hold
Number of threads doing merges? Sync contention? I hate the idea
IndexWriter/Readers being more efficient than a single instance.
In an ideal
Lucene world, a single instance would hide the complexity and use
2008/10/23 Mark Miller <[email protected]>:
It sounds like you might have some thread synchronization issues
Lucene. To simplify things a bit, you might try just using one
If I remember right, the IndexWriter is now pretty efficient, and there
isn't much need to index to smaller indexes and then merge.
There is a lot
of juggling to get wrong with that approach.
While I agree it is easier to have a single IndexWriter, if you have
multiple cores you will get significant speed-ups with multiple
IndexWriters, even with the impact of merging at the end.
# IndexWriters = # physical cores is a reasonable rule of thumb.
General speed-up estimate: # cores * (0.6 to 0.8) over single
When I get around to it, I'll re-run my tests varying the # of
IndexWriters & post.
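
As a rough illustration of the rule of thumb above, the arithmetic can be sketched in Java. The class and method names here are hypothetical, and the 0.6 and 0.8 factors are simply the ones quoted in the message; this is an estimate, not a measurement.

```java
// Hypothetical helper: applies the "# cores * (0.6 to 0.8)" rule of thumb.
final class SpeedUpEstimate {

    /** Returns {low, high}: the estimated speed-up range over a single IndexWriter. */
    static double[] estimate(int physicalCores) {
        if (physicalCores < 1) {
            throw new IllegalArgumentException("physicalCores must be >= 1");
        }
        return new double[] { physicalCores * 0.6, physicalCores * 0.8 };
    }

    public static void main(String[] args) {
        // availableProcessors() counts logical processors, which may exceed
        // the number of physical cores the rule of thumb refers to.
        int cores = Runtime.getRuntime().availableProcessors();
        double[] range = estimate(cores);
        System.out.printf("%d cores: estimated speed-up %.1fx to %.1fx%n",
                cores, range[0], range[1]);
    }
}
```

The rule only makes sense for multi-core machines: for a single core the formula yields a factor below 1.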
of threads needed to match multiple instance performance.
Honestly this surprises me: I would expect a single IndexWriter with
multiple threads to be as fast as (or faster than, considering the extra merging
at the end) multiple IndexWriters.
IndexWriter's concurrency has improved a lot lately, with
ConcurrentMergeScheduler. The only serious operation that is not
concurrent is flushing the RAM buffer as a new segment; but in a well-tuned
process (large RAM buffer) the time spent there should be quite
small, especially with a fast IO system.
Actually, addIndexes is also not concurrent in that if multiple
threads call it, only one can run at once. But normally you would call it with
all the indices you want to add, and then the merging is concurrent.
Glen, in your single IndexWriter test, is it possible there was
thread contention during document preparation or analysis?
I don't think there is. I've been refining this for quite a while, and
have done a lot of analysis and hand-checking of the threading stuff.
For your multiple-index-writer test, how much time is spent building
the N indices vs merging them in the end?
I do use multiple threads for document creation: this is where much of
the speed-up happens (at least in my case where I have a large indexed
field for the full-text of an article: the parsing becomes a
significant part of the process).
So in the single-index-writer vs multiple-index-writer tests, this
part (64 threads that construct document objects) is unchanged, right?
How do you rate limit the 64 threads? (Ie, slow them down when they
get too far ahead of indexing).
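
One common way to rate limit producer threads is a bounded queue between document construction and indexing: producers block when the queue is full. The sketch below uses only java.util.concurrent and is not taken from the setup discussed in this thread; the class name, the string "documents", and the counting consumer are all hypothetical stand-ins.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: producers build "documents", one consumer "indexes" them.
final class BoundedPipeline {
    // Unique sentinel object that tells the consumer to stop.
    private static final String POISON = new String("EOF");

    /** Returns the number of documents the consumer received. */
    static int run(int producers, int docsPerProducer, int queueCapacity) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(queueCapacity);
        AtomicInteger indexed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(producers);
        for (int p = 0; p < producers; p++) {
            final int id = p;
            pool.execute(() -> {
                try {
                    for (int i = 0; i < docsPerProducer; i++) {
                        // put() blocks while the queue is full, which slows
                        // producers down when they get ahead of the consumer.
                        queue.put("doc-" + id + "-" + i);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    String doc = queue.take();
                    if (doc == POISON) break;   // identity check on the sentinel
                    indexed.incrementAndGet();  // a real consumer would index here
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();
        try {
            pool.shutdown();
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                pool.shutdownNow();             // give up on slow producers
            }
            queue.put(POISON);                  // queued after all produced docs
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return indexed.get();
    }
}
```

The queue capacity bounds how many built documents can be held in memory at once, regardless of how many producer threads there are.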
If you only process documents with the 64 threads (but not index
them), what percentage of the total time is that? I'd like to tease
out "building documents" vs "indexing" times.
I do agree that we should strive to have enough concurrency in IndexWriter
and IndexReader so that you don't get any real benefit by using multiple
instances. Eg in 2.4.0 you can now open read-only IndexReaders, and
you can use NIOFSDirectory, both of which should go a long way toward
fixing IndexReader's concurrency issue.
My original tests were in the Spring with 2.3.1. I am planning on
doing the new tests with 2.4 for indexing, as well as re-doing my
concurrent query tests and concurrent multiple reader tests
using the features you describe. I am sure the results will be quite
Also, for the indexing tests, make sure you run with autoCommit=false.
BTW the files I am indexing were originally PDFs, but were batch
converted to text and stored compressed on the filesystem, so except
for GUnzipping them there is no other overhead.
But I'm confused: why do you need 64 threads to build up the
documents? Gunzipping should be very low CPU cost. Are you pre-
analyzing the fields on your documents?