I would like to associate information (or labels) with each word or a
range of words in a document. Information such as this word is a noun, that
word is a verb, this period marks the end of a sentence, "kick the bucket"
is a contiguous phrase, "white house" is a location and so on. I am seeking
a good representation for such information so that they can be easily stored
as additional fields in a lucene document, and easily recovered after a
search. For the more technically inclined, this would allow me to store
part-of-speech tags, chunk tags, sentence boundary markers and parse trees
for every indexed document.
These additional information will enable Lucene to perform additional
post-processing on retrieved documents for various purposes such as
information extraction, summarization, question answering, etc... Is there
any available api? If not, I would appreciate any suggestions and tips on
how such information can best be stored in a Lucene document.