What is the expected memory usage of Lucene these days? I dug up an old
email from 2001 which gave the following summary of memory usage:
An IndexReader requires:
* one byte per field per document in the index (norms)
* one open file per file in the index
* 1/128 of the Terms in the index; each Term holds two pointers (8 bytes)
  plus a String (4 pointers = 24 bytes, one of them to the 16-bit chars)
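That summary multiplies out easily. Here is a back-of-envelope sketch in
Python; the document and field counts are made-up illustrative values,
while the term count matches the profile discussed below:

```python
# Back-of-envelope estimate following the 2001 summary.
# num_docs and num_fields are hypothetical; num_terms is from the profile below.
num_docs = 10_000_000
num_fields = 5
num_terms = 32_000_000

norms_bytes = num_fields * num_docs   # one byte per field per document
held_terms = num_terms // 128         # only 1 in 128 terms is held in memory
term_bytes = held_terms * (8 + 24)    # two pointers + a String, per the old email

print(f"norms: {norms_bytes / 1e6:.1f} MB")  # 50.0 MB
print(f"terms: {term_bytes / 1e6:.1f} MB")   # 8.0 MB
```

On those (assumed) numbers the norms dwarf the held terms, which matches
the conclusion we drew below.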
From this, we determined that the norms were by far the biggest problem,
and set about removing them, based on a patch submitted on the issue.
However, we have now hit the next hurdle: the terms use much more memory
than suggested above.
Profiling a text index with roughly 32,000,000 terms, we have about:
* 13MB of char
* 6MB of java.lang.String
* 6MB of org.apache.lucene.index.Term
* 8MB of org.apache.lucene.index.TermInfo
=> Total = 33MB
This actually equates to about:
* 52 bytes of char data per term (an average; it depends on the term
  lengths in the index)
* 24 bytes per String
* 24 bytes per Term
* 32 bytes per TermInfo
=> 132 bytes per term, for the 1 in 128 terms which are held.
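As a sanity check, the per-term figures multiply out to the profiled
total. A sketch using only the numbers quoted in this message:

```python
num_terms = 32_000_000
held = num_terms // 128        # 250,000 terms actually resident in memory

# average bytes per held term, from the profile above:
# char data + String + Term + TermInfo
per_term = 52 + 24 + 24 + 32   # = 132 bytes

total_bytes = held * per_term
print(f"{total_bytes / 1e6:.0f} MB")  # 33 MB, matching the profiled total
```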
This isn't a problem for a single index, but when loading 30 of these
text indexes at once, we start running into serious memory usage issues.
My question is: is this 1/128 figure set in stone, or can it be changed
without major consequences?
I would rather have an application which uses less memory and takes
longer than one which uses all the available RAM just to squeeze out a
bit of extra speed.
NUIX Pty Ltd