Re: Multi-threaded indexing of large number of PDF documents

Subject: Re: Multi-threaded indexing of large number of PDF documents
From: Mark Miller
Date: Thu, 23 Oct 2008 12:36:51 -0400
It sounds like you might have some thread synchronization issues outside of Lucene. To simplify things a bit, you might try using just one IndexWriter shared by all of your threads. If I remember right, IndexWriter is now pretty efficient, and there isn't much need to index into smaller indexes and then merge them. There is a lot of juggling to get wrong with that approach.

- Mark
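
As a rough illustration of the single-writer suggestion above, here is a minimal Java sketch in which a fixed pool of threads feeds one shared sink. The names DocumentSink, CountingSink and indexAll are hypothetical, made up for this example. DocumentSink stands in for the one shared IndexWriter (which Lucene documents as safe to call from multiple threads), so the sketch runs without Lucene or PDFBox on the classpath. It is a sketch under those assumptions, not the poster's actual program.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

class SharedWriterSketch {

    // Hypothetical stand-in for the single shared IndexWriter. In real code,
    // add() would extract the PDF text and hand a Document to the writer.
    interface DocumentSink {
        void add(String path) throws Exception;
    }

    // Trivial thread-safe sink, present only so the sketch runs on its own.
    static class CountingSink implements DocumentSink {
        private final AtomicInteger added = new AtomicInteger();

        @Override
        public void add(String path) {
            added.incrementAndGet();
        }

        int count() {
            return added.get();
        }
    }

    // Feeds every path to the one shared sink from a fixed pool of threads.
    // Returns the number of files whose task threw an exception.
    static int indexAll(List<String> paths, final DocumentSink sink, int threads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Void>> results = new ArrayList<Future<Void>>();
        try {
            for (final String path : paths) {
                Callable<Void> task = () -> {
                    sink.add(path);
                    return null;
                };
                results.add(pool.submit(task));
            }
            int failures = 0;
            for (Future<Void> result : results) {
                try {
                    result.get(); // rethrows a task failure as ExecutionException
                } catch (ExecutionException e) {
                    failures++;   // one bad file does not stop the others
                }
            }
            return failures;
        } finally {
            pool.shutdown();      // always release the worker threads
        }
    }

    public static void main(String[] args) throws Exception {
        CountingSink sink = new CountingSink();
        int failures = indexAll(Arrays.asList("a.pdf", "b.pdf", "c.pdf"), sink, 2);
        System.out.println(sink.count() + " added, " + failures + " failed");
    }
}
```

With this shape there is no per-thread index and no merge step, so the only shared state the application has to manage is the sink itself.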

Sudarsan, Sithu D. wrote:

We are trying to index a large collection of PDF documents, with sizes
varying from a few KB to a few GB. We use Lucene 2.3.2 with JDK 1.6.0_01
(and PDFBox for text extraction) on Windows as well as CentOS Linux. We
set the java -Xms and -Xmx options to 1080m each, even though we have
4 GB on Windows and 32 GB on Linux with sufficient swap space.

With just one thread, the indexing completes, though it takes time. To
speed it up, we tried a multi-threaded approach with one IndexWriter for
each thread. After all the threads finish indexing, their indexes are
merged. With about 100 sample files and 10 threads, the program works
pretty well and does run faster. But when we run it on a document
collection of about 25 GB, a couple of threads just hang, while the rest
complete their indexing. The program never exits gracefully, and the
threads that seem to have died prevent the final index merge from taking
place. The program has to be terminated manually.
We tried both the simple analyzer and the standard analyzer, with
similar results.
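
The post does not establish why the threads hang, so the following is only a sketch of one way to keep a stuck worker from blocking the final step: wait on each task with a timeout and cancel it if the timeout expires. The class HangGuardSketch, the method runWithTimeout and the timeout values are hypothetical and chosen for illustration. A sleeping task stands in for a PDF extraction that never returns.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class HangGuardSketch {

    // Runs one task on the pool and waits at most timeoutMillis for it.
    // Returns true if it finished normally, false if it timed out or threw.
    static boolean runWithTimeout(ExecutorService pool, Callable<Void> task,
                                  long timeoutMillis) throws InterruptedException {
        Future<Void> future = pool.submit(task);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // Interrupts the worker. Code that ignores interrupts keeps
            // running, but the caller is no longer blocked by it.
            future.cancel(true);
            return false;
        } catch (ExecutionException e) {
            return false; // the task itself failed
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            boolean fast = runWithTimeout(pool, () -> null, 1000);
            boolean slow = runWithTimeout(pool, () -> {
                Thread.sleep(60000); // stands in for an extraction that hangs
                return null;
            }, 100);
            System.out.println("fast=" + fast + ", slow=" + slow);
        } finally {
            pool.shutdownNow(); // interrupt anything still running
        }
    }
}
```

Logging which file timed out would also show whether the hangs are tied to particular documents, such as the multi-GB ones.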

Any useful tips / solutions welcome.

Thanks in advance,
Sithu Sudarsan
Graduate Research Assistant, UALR
& Visiting Researcher, CDRH/OSEL

