java-user@lucene.apache.org

Subject: Re: Beginner: Specific indexing
From: Chris Hostetter
Date: Mon, 8 Sep 2008 15:27:03 -0700 PDT
: I think I'm getting you. But the files I'm going to parse have many formats:
: PDF, HTML, Word.
: they don't have a particular structure, memos if you will. But the ones I'm
: interested in will have the triplets I described

Ahhhh...  see this is something i completely didn't realize.  "Lucene" as 
a library really doesn't provide any sort of mechanism for doing text 
extraction from unknown file formats ... With some small exceptions (like 
the HTMLStripTokenizer in Solr) the TokenStream concept is much more about 
finding "Tokens" from a stream of plain text -- not about finding "Text" 
in arbitrary (possibly binary) files.
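
To make that split concrete, here is a minimal sketch of the two separate 
steps, using only the JDK. It does not use any Lucene or Tika API: the class 
name and both method names are hypothetical, and the regex tag stripper is 
only a crude stand-in for a real extractor (it handles trivial HTML and 
nothing else, certainly not PDF or Word).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; none of these names come from Lucene or Tika.
public class ExtractThenTokenize {

    // Step 1: get plain text out of a raw document.
    // Assumption: the input is simple HTML. A real extractor would be used here.
    static String extractText(String rawHtml) {
        // Replace each tag with a space so adjacent words do not run together.
        return rawHtml.replaceAll("<[^>]*>", " ");
    }

    // Step 2: find tokens in a stream of plain text.
    // Lowercases the text and splits on anything that is not a letter or digit.
    static List<String> tokenize(String plainText) {
        List<String> tokens = new ArrayList<>();
        String lowered = plainText.toLowerCase(Locale.ROOT);
        for (String piece : lowered.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String raw = "<html><body><p>Memo: <b>Lucene</b> indexes text.</p></body></html>";
        // prints [memo, lucene, indexes, text]
        System.out.println(tokenize(extractText(raw)));
    }
}
```

If you skip step 1 and hand the raw markup straight to tokenize(), the tag 
names come out as tokens too, which is the reason extraction has to happen 
before the text ever reaches an analyzer.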

You'll probably want to check out the Tika subproject...
    http://incubator.apache.org/tika/
...or some of the various "How do i index _____ documents?" FAQs...
    http://wiki.apache.org/lucene-java/LuceneFAQ


-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@xxxxxxxxxxxxxxxxx
For additional commands, e-mail: java-user-help@xxxxxxxxxxxxxxxxx
