
Lucene - FileFormat

Subject: Lucene - FileFormat
From: Fisheye
Date: Fri, 21 Apr 2006 04:23:51 -0700 PDT
I'm trying to build a plain-text parser for different file formats, such as MS
Word, Excel, PowerPoint, Rich Text Format, plain text, HTML and PDF.

I use the well-known libraries PDFBox and POI, plus some parts of AtLeap. Now I
also need to support the OpenOffice formats and, more importantly, the MSG
format (MS Outlook message format).

Does anyone know how I can extract plain text from MSG files as simply as with
POI? Perhaps there is an open source library for this, as there is for PDF and
MS Office.

I need the plain text because the only approach I can see is to extract all
the plain text from every single document and then add it to my Lucene index.
This is necessary to get the best excerpt from the highlighter.
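
To make the setup described above concrete, here is a minimal sketch of a
per-format extractor registry, assuming a reasonably recent Java version. All
names in it (PlainTextExtractors, TextExtractor, withDefaults) are hypothetical
and chosen only for illustration. The two built-in extractors use nothing but
the standard library; the PDF, MS Office and MSG cases would be further
TextExtractor implementations wrapping PDFBox, POI or whatever MSG library
turns up, and the returned string is what would go into the Lucene document.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical registry: maps a file extension to a plain-text extractor.
class PlainTextExtractors {

    // Hypothetical interface: turns raw file bytes into plain text.
    // Library-backed implementations would wrap their checked exceptions.
    interface TextExtractor {
        String extract(byte[] content);
    }

    private final Map<String, TextExtractor> byExtension = new HashMap<>();

    // Registers an extractor for an extension such as "pdf" or "msg".
    void register(String extension, TextExtractor extractor) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    // Picks the extractor by file extension and returns the plain text.
    String extract(String fileName, byte[] content) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        TextExtractor extractor = byExtension.get(ext);
        if (extractor == null) {
            throw new IllegalArgumentException("No extractor registered for: " + fileName);
        }
        return extractor.extract(content);
    }

    // Standard-library-only defaults; assumes UTF-8 input for simplicity.
    static PlainTextExtractors withDefaults() {
        PlainTextExtractors registry = new PlainTextExtractors();
        registry.register("txt", c -> new String(c, StandardCharsets.UTF_8));
        // Crude tag stripping, for illustration only; not a real HTML parser.
        registry.register("html", c -> new String(c, StandardCharsets.UTF_8)
                .replaceAll("<[^>]*>", " ")
                .replaceAll("\\s+", " ")
                .trim());
        return registry;
    }
}
```

With this shape, supporting MSG would only mean registering one more extractor
under "msg"; the indexing and highlighting code would not need to change.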


Simon Dietschi
Sent from the Lucene - Java Users forum at Nabble.com.
