java-user@lucene.apache.org
[Top] [All Lists]

Re: Indexing MSword Documents

Subject: Re: Indexing MSword Documents
From: "jim shirreffs"
Date: Sat, 09 Jun 2007 09:55:55 -0500
thanks the apprach you and Donna Gresh suggested worked out fine. I now have a much better understanding of the Document class.


here is the create Document code in case another newie is interested. as more mine types are added I will expand the in if

thanks again

jim s


public class KcmiDocument
{
String objectId;
String title;
String mtype;

private boolean debug=true;

public KcmiDocument(String oId, String t, String f)
{
 objectId=oId;
 title = t;
 mtype=f;
if(debug) System.out.println("In KcmiDocument with " + oId + ":" + t + ":" + f);
}

/*
 * Makes a document for a file.
 * The document has four custom fields:
 * path, modified date, title and objectId
 */
public Document document(File f) throws java.io.FileNotFoundException,
 java.io.IOException, java.lang.InterruptedException
{
if(debug) System.out.println("In KcmiDocument.document with " + f.getAbsolutePath());

 Document doc = null;

 /*
  * make a document
  */
 try
 {
  if (mtype.startsWith("text"))
  {
   doc = HTMLDocument.Document(f);
  }
  else if (mtype.equals("application/pdf"))
  {
   doc = LucenePDFDocument.getDocument(f);
  }
  else if (mtype.equals("application/msword"))
  {
   FileInputStream fin = new FileInputStream(f.getAbsolutePath());
   WordExtractor extractor = new WordExtractor(fin);
   String content = extractor.getText();

   if(debug) System.out.println(content);

   doc = new Document();
doc.add(new Field("contents", content, Field.Store.NO, Field.Index.TOKENIZED));
  }
  else if (mtype.equals("???")) //binary
  {
   throw new java.io.IOException(f.getName() +
    " mime type can not be determined unable to extract doument content");
  }
  else
  {
   if(debug) System.out.println("Unknown using trying htmldocument.");
   doc = HTMLDocument.Document(f);
  }
 }
 catch(java.lang.InterruptedException ie)
 {
  throw ie;
 }
 catch(java.io.IOException ioe)
 {
  throw ioe;
 }

 /*
  * Add the path of the file as a field named "path".  Use a field that is
  * indexed (i.e. searchable), but don't tokenize the field into words.
  */
doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.UN_TOKENIZED));

 /*
  * Add ObjectId
  */
 doc.add(new Field("objectId",
  objectId,
  Field.Store.YES,
  Field.Index.UN_TOKENIZED));

 /*
  * Add title
  */
 doc.add(new Field("kcmititle",
  title,
  Field.Store.YES,
  Field.Index.UN_TOKENIZED));

 /*
  * return the document
  */
 return doc;
}
}
----- Original Message ----- From: "Wayne Graham" <wsgrah@xxxxxx>
To: <java-user@xxxxxxxxxxxxxxxxx>
Sent: Friday, June 08, 2007 2:10 PM
Subject: Re: Indexing MSword Documents


Jim,

There are a few things you can do to make extracting text easier on
yourself. There are several libraries that can assist you, POI and
TextMining.org both have excellent text extractors for Word.

As Mathieu suggests, you need to take a look at Document. Essentially,
you do everything you're doing and when it gets time to insert your
content, you'll do something along the lines of (this is using
TextMining.org's extractor):

String content = new WordExtractor().extractText(new FileInputStream(file));

doc.add(new Field("content", content, Field.Store.NO,
Field.Index.TOKENIZED));

You should be able to edit the example code you're working with fairly
easily with the above.

HTH,
Wayne

jim shirreffs wrote:
I looked at nutches code but it is too complicated for me to follow.

I do not understand the guts of Lucene and how analyzers, parsers,
readers, etc all fit together. I suppose I will be forced to learn it
all someday but at the moment I am adhering to KISS, Keep It Simple Stupid.

thanks for taking the time to reply


jim s

----- Original Message ----- From: "Mathieu Lecarme"
<mathieu@xxxxxxxxxxxxxxx>
To: <java-user@xxxxxxxxxxxxxxxxx>
Sent: Friday, June 08, 2007 12:48 PM
Subject: Re: Indexing MSword Documents


Why don't use Document?
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/
org/apache/lucene/document/Document.html

HTMLDocument manage HTML stuff like encoding, header, and other
specificity.

Nutch use specific word tools (http://lucene.apache.org/nutch/apidocs/
org/apache/nutch/parse/msword/package-summary.html), but, IMHO, it's
not the more difficult part.

M.

Le 8 juin 07 à 19:23, jim shirreffs a écrit :

Hi,
I am trying to index msword documents. I've got things working but  I
do not think I am doing things properly.

To index msword docs I use an extractor to extract the text. Then I
write the text to a .txt file and index that using an HTMLDocument
object. Seems to me that since I have the text I should be able to
just do a

       Doc.add("content", the_text_from_the_word_doc, ???, ???);

But looking at Document.java it seems the field "content" requires  a
reader. So I write a temporary file to satified that requirement.

What I would like to have is an MSWORDDocument class that would  take
the extracted text as a argument to the constructor and create  a
Ducument object that I could get.

If any one has any idea, please let me know.

Here is my code segment. Notice the msword hack,


/*
* make a document
*/

try
{
  if (ftype.startsWith("text"))
  {
     doc = HTMLDocument.Document(f);
  }
  else if (ftype.equals("application/pdf"))
  {
     doc = LucenePDFDocument.getDocument(f);
  }
  else if (ftype.equals("application/msword"))
  {
     FileInputStream fin = new FileInputStream(f.getAbsolutePath());
     WordExtractor extractor = new WordExtractor(fin);
     String content = extractor.getText();
     if(debug) System.out.println(content);
     String tempFileName=f.getAbsolutePath() + ".txt";
     BufferedWriter bw = new BufferedWriter(new FileWriter
(tempFileName, false));
     bw.write((String) content.toString());
     bw.close();
     File df = new File(tempFileName);
     doc = HTMLDocument.Document(df);
     df.delete();
  }
  else if (ftype.equals("binary"))
  {
     return null;
  }
  else
  {
     if(debug) System.out.println("Unknown file type not ascii or
pdf.");
     doc = HTMLDocument.Document(f);
  }
}
catch(java.lang.InterruptedException ie)
{
  throw ie;
}
catch(java.io.IOException ioe)
{
  throw ioe;
}





Thanks in advance


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@xxxxxxxxxxxxxxxxx
For additional commands, e-mail: java-user-help@xxxxxxxxxxxxxxxxx




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@xxxxxxxxxxxxxxxxx
For additional commands, e-mail: java-user-help@xxxxxxxxxxxxxxxxx


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@xxxxxxxxxxxxxxxxx
For additional commands, e-mail: java-user-help@xxxxxxxxxxxxxxxxx


--
/**
* Wayne Graham
* Earl Gregg Swem Library
* PO Box 8794
* Williamsburg, VA 23188
* 757.221.3112
* http://swem.wm.edu/blogs/waynegraham/
*/


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@xxxxxxxxxxxxxxxxx
For additional commands, e-mail: java-user-help@xxxxxxxxxxxxxxxxx



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@xxxxxxxxxxxxxxxxx
For additional commands, e-mail: java-user-help@xxxxxxxxxxxxxxxxx

<Prev in Thread] Current Thread [Next in Thread>