java-user@lucene.apache.org

Good representation for part-of-speech, chunk, sentence boundary tags?

From: Dave Kor
Date: Wed, 4 Jan 2006 14:34:12 +0800

Hi,

  I would like to associate information (labels) with each word, or with a
range of words, in a document. For example: this word is a noun, that word
is a verb, this period marks the end of a sentence, "kick the bucket" is a
contiguous phrase, "white house" is a location, and so on. I am looking for
a good representation for such information, so that it can be stored easily
as additional fields in a Lucene document and recovered easily after a
search. For the more technically inclined, this would let me store
part-of-speech tags, chunk tags, sentence boundary markers and parse trees
for every indexed document.
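
  To make the requirement concrete, here is a rough sketch of one possible
shape for such labels: "standoff" annotations, each naming a layer, a label
and a range of token positions, flattened into a single string that could be
kept as an extra stored field. This is only an illustration under my own
assumptions. The class and method names (StandoffSketch, Annotation, encode,
decode) and the layer:label:start:end format are hypothetical, labels are
assumed to contain no ':' or space, and no Lucene calls are involved.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: standoff annotations over token positions,
// serialized to one string. Not part of Lucene.
class StandoffSketch {

    /** One label over the token range [start, end). */
    static final class Annotation {
        final String layer;  // e.g. "pos", "chunk", "ne", "sent"
        final String label;  // e.g. "VBD", "NP", "LOCATION", "S"
        final int start;     // index of the first token covered
        final int end;       // index one past the last token covered

        Annotation(String layer, String label, int start, int end) {
            this.layer = layer;
            this.label = label;
            this.start = start;
            this.end = end;
        }
    }

    /** Flatten annotations to "layer:label:start:end" entries separated by spaces. */
    static String encode(List<Annotation> annotations) {
        StringBuilder sb = new StringBuilder();
        for (Annotation a : annotations) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(a.layer).append(':').append(a.label)
              .append(':').append(a.start).append(':').append(a.end);
        }
        return sb.toString();
    }

    /** Rebuild the annotations from a string produced by encode(). */
    static List<Annotation> decode(String encoded) {
        List<Annotation> out = new ArrayList<>();
        if (encoded.isEmpty()) {
            return out;
        }
        for (String entry : encoded.split(" ")) {
            String[] f = entry.split(":");
            out.add(new Annotation(f[0], f[1],
                    Integer.parseInt(f[2]), Integer.parseInt(f[3])));
        }
        return out;
    }

    public static void main(String[] args) {
        // Tokens: 0=He 1=visited 2=the 3=White 4=House 5=.
        List<Annotation> anns = new ArrayList<>();
        anns.add(new Annotation("pos", "PRP", 0, 1));
        anns.add(new Annotation("pos", "VBD", 1, 2));
        anns.add(new Annotation("ne", "LOCATION", 3, 5));
        anns.add(new Annotation("sent", "S", 0, 6));
        System.out.println(encode(anns));
    }
}
```

  Using token ranges rather than per-word tags lets the same format cover
single-word labels, multi-word phrases and sentence spans. Parse trees would
need nested or parent-linked spans, which this sketch does not attempt.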

  This additional information would enable further post-processing of the
documents Lucene retrieves, for purposes such as information extraction,
summarization and question answering. Is there an existing API for this?
If not, I would appreciate any suggestions and tips on how such information
can best be stored in a Lucene document.


Regards,
Dave.