java-user@lucene.apache.org

Subject: Re: Big Document Indexing Limit?
From: Catalin Mititelu
Date: Fri, 15 Sep 2006 04:52:17 -0700 PDT
One more hint for 2) and 3): run SimpleAnalyzer directly on your XML and drop
XmlContentExtractor. That way every "word" in the XML file (tag names,
attribute names, attribute values and content) is indexed in lower case.
Of course, you should use the same analyzer for searching.
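
To see what the hint above would put into the index, here is a minimal plain-Java sketch that approximates SimpleAnalyzer's behaviour: split on every non-letter character and lowercase each token. It does not use Lucene; the class name `SimpleAnalyzerSketch` and the sample XML are made up for illustration. One consequence worth noting is that digits are treated as separators, so an attribute value such as "A1" would be indexed as just "a".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: approximates SimpleAnalyzer in plain Java (no Lucene needed).
class SimpleAnalyzerSketch {

    // Splits the text at every non-letter character and lowercases each token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-letter ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Sample input, invented for this example.
        String xml = "<Book lang=\"EN\">Lucene in Action</Book>";
        // prints [book, lang, en, lucene, in, action, book]
        System.out.println(tokenize(xml));
    }
}
```

Tag names, attribute names, attribute values and text all end up as ordinary lowercase terms, which is why the same analyzer has to be applied to the query as well.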

Simon Willnauer <simon.willnauer@xxxxxxxxxxxxxx> wrote:
On 9/15/06, aslam bari wrote:
> Dear Mititelu,
>   Thanks for the reply. Can you help me with some small issues related to it?
>   1) I am new to Lucene. Can you tell me where the
> DEFAULT_MAX_FIELD_LENGTH variable is available, how to set it, and how high
> I should set it for my purpose (a 6-10MB file)?
DEFAULT_MAX_FIELD_LENGTH is a constant of IndexWriter.
To set your own value, use IndexWriter.setMaxFieldLength(int).

JavaDoc:
http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int)
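
As for how high to set the value: one rough way to decide is to count the tokens in the file and compare the result with the 10,000 default mentioned in this thread. The sketch below is plain Java without Lucene; the class name `FieldLengthEstimate` is hypothetical, and it counts runs of letters, which only approximates what a letter-based analyzer such as SimpleAnalyzer would emit. Other analyzers will produce different counts.

```java
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper: estimates how many tokens a file would produce.
class FieldLengthEstimate {

    // Default limit as quoted in this thread (IndexWriter.DEFAULT_MAX_FIELD_LENGTH).
    static final int DEFAULT_LIMIT = 10000;

    // Counts runs of consecutive letters; every run is treated as one token.
    static int countTokens(Reader in) {
        try {
            int count = 0;
            boolean inToken = false;
            int c;
            while ((c = in.read()) != -1) {
                if (Character.isLetter((char) c)) {
                    if (!inToken) {
                        count++;          // first letter of a new token
                        inToken = true;
                    }
                } else {
                    inToken = false;      // any non-letter ends the token
                }
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // With no argument, fall back to a tiny inline sample.
        Reader in = args.length > 0
                ? new FileReader(args[0])
                : new StringReader("<a>one two</a>");
        int tokens = countTokens(in);
        in.close();
        System.out.println(tokens + " tokens; above default limit: "
                + (tokens > DEFAULT_LIMIT));
    }
}
```

If the count is above the default, everything past the limit is silently left out of the index, which matches the symptom of getting some results but not others. Passing the counted value, or simply Integer.MAX_VALUE, to setMaxFieldLength(int) would be one way to lift the limit, at the cost of more memory during indexing.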

>   2) How can I index all the words of the XML file case-insensitively,
> i.e. either in lowercase or in uppercase, so that my searches are
> case-insensitive?
How your text is processed depends on the analyzer you use for
indexing a certain field. Many of the available analyzers lowercase
their tokens, for example SimpleAnalyzer and StandardAnalyzer. Note
that WhitespaceAnalyzer does not: it only splits on whitespace and
leaves the case as it is.
See "Analysis" at:
http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html


You should have a look at
>   3) I am using XmlContentExtractor. It extracts only the content of the
> XML tags. If I want everything in the XML file (tags, contents and
> attributes) to be indexed, where should I change the code?

I'm not 100% sure, but isn't XmlContentExtractor part of Jakarta
Slide? If so, please contact the Slide mailing list. There are many
XML APIs available to extract all the content of your document. If you
really want to index the whole XML file, just read the file using
java.io, although I would not suggest doing that at all.
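
As an illustration of the "XML API" route, here is a minimal sketch that uses the SAX parser shipped with the JDK to collect element names, attribute names, attribute values and text into one string, which could then be handed to Lucene as the field text. The class name `XmlTextCollector` and the sample document are invented for this example, and this is not how XmlContentExtractor works internally.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: flattens an XML document into indexable text.
class XmlTextCollector extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        // Element name first, then each attribute name and value.
        text.append(' ').append(qName).append(' ');
        for (int i = 0; i < atts.getLength(); i++) {
            text.append(atts.getQName(i)).append(' ')
                .append(atts.getValue(i)).append(' ');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per text node, so no separator is added here.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Keep text of adjacent elements from running together.
        text.append(' ');
    }

    // Parses the XML string and returns tag names, attributes and content as text.
    static String extract(String xml) {
        try {
            XmlTextCollector handler = new XmlTextCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.text.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Sample input, invented for this example.
        System.out.println(extract("<book id=\"A1\">Lucene <b>in</b> Action</book>"));
    }
}
```

Compared with feeding the raw file to the analyzer, this keeps markup characters and comments out of the token stream, and it makes it easy to put tag names and content into separate fields later if that turns out to be useful.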

Best regards,
Simon
>
>   Thanks...
> Catalin Mititelu  wrote:
>   Yes. By default, only the first 10,000 tokens of a field are indexed.
> Look here:
> http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#DEFAULT_MAX_FIELD_LENGTH
>
> aslam bari wrote: Dear all,
> I am trying to index an XML file that is 6MB in size. Does Lucene support
> documents this big? What is the maximum file size Lucene can index?
> When I search the indexed file, I do not get all the results. It gives me
> some results but not others. I think Lucene indexed the file only partially
> and left out the remaining part, which it could not process. Is that right?
> If not, what could the problem be?
> Thanks...
