java-user@lucene.apache.org

Subject: Re: Good representation for part-of-speech, chunk, sentence boundary tags?
From: Erik Hatcher
Date: Wed, 4 Jan 2006 08:14:34 -0500

On Jan 4, 2006, at 7:53 AM, Paul Elschot wrote:

On Wednesday 04 January 2006 07:34, Dave Kor wrote:
Hi,

I would like to associate information (or labels) with each word or range of words in a document: for example, that this word is a noun, that word is a verb, this period marks the end of a sentence, "kick the bucket" is a contiguous phrase, "white house" is a location, and so on. I am looking for a good representation for such information so that it can be easily stored as additional fields in a Lucene document and easily recovered after a search. For the more technically inclined, this would allow me to store part-of-speech tags, chunk tags, sentence boundary markers and parse trees for every indexed document.

This additional information will enable Lucene to perform further post-processing on retrieved documents for purposes such as information extraction, summarization, and question answering. Is there an existing API for this? If not, I would appreciate any suggestions and tips on how such information can best be stored in a Lucene document.

Basically, the index information available in Lucene is the Term, which is a combination of a field name and a token. For each Term, Lucene indexes document presence and all positions within a document. Lucene also indexes the field length as a norm.
By using one or more extra fields, the tags and sentence boundary markers can easily be indexed at their positions. To search these, have a look at the span package.
In case you want to search for tokens combined with some (part of speech) tag, and the tokens and their tags are in different fields, the span package is not sufficient, because it does not allow position search over different
fields.
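
As a rough illustration of the point above, here is a minimal plain-Java sketch of tags indexed at the same positions as the words they describe. It is a toy model, not Lucene code: the class `PositionAlignedFields`, its methods, and the field names "text" and "pos" are all hypothetical. It also shows the kind of cross-field position match that, as noted above, the span package does not provide.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of a positional index for a single document; not Lucene code.
class PositionAlignedFields {
    // Key is "field:token" (a stand-in for a Term), value is the list of positions.
    private final Map<String, List<Integer>> postings = new HashMap<>();

    // Index one field; the i-th token is recorded at position i.
    void addField(String field, String... tokens) {
        for (int pos = 0; pos < tokens.length; pos++) {
            postings.computeIfAbsent(field + ":" + tokens[pos], k -> new ArrayList<>()).add(pos);
        }
    }

    // All positions at which the given token occurs in the given field.
    List<Integer> positions(String field, String token) {
        return postings.getOrDefault(field + ":" + token, new ArrayList<>());
    }

    // Positions where both terms occur, even though they live in different fields.
    List<Integer> samePosition(String fieldA, String tokenA, String fieldB, String tokenB) {
        List<Integer> result = new ArrayList<>(positions(fieldA, tokenA));
        result.retainAll(positions(fieldB, tokenB));
        return result;
    }

    public static void main(String[] args) {
        PositionAlignedFields doc = new PositionAlignedFields();
        // The tag field is kept parallel to the text field, one tag per word.
        doc.addField("text", "the", "white", "house", "is", "white");
        doc.addField("pos", "DT", "JJ", "NN", "VBZ", "JJ");
        System.out.println(doc.samePosition("text", "white", "pos", "JJ"));
    }
}
```

With the sample data in `main`, the lookup prints `[1, 4]`, the two positions where the word "white" carries the adjective tag. The sketch only works because both fields assign the same position to the same word, so the tagger and the tokenizer would have to agree on token boundaries.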

Paul - I'm interested in this topic myself. Suppose the "text" field is indexed, but entities such as names and places are also detected. Suppose I'd like a query for "all names that have the initials EH in the text field" (where we could identify EH names by doing a SpanRegexQuery for "E.* H.*").

I've been pondering whether it makes sense for Lucene to be enhanced to carry over a Token's type into the index such that it could factor into the query also.
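
To make the idea concrete, here is a minimal plain-Java sketch of what a type-aware match could look like for the "EH" example. It is only an assumption about the behavior, not an existing Lucene API: `TypedTokenSearch`, `TypedToken`, `adjacentMatches` and the type labels "name" and "word" are hypothetical, and the scan runs over an in-memory token list rather than an index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Toy illustration of filtering regex matches by token type; not Lucene code.
class TypedTokenSearch {
    // A token together with the type assigned by an analyzer or entity detector.
    static final class TypedToken {
        final String text;
        final String type;

        TypedToken(String text, String type) {
            this.text = text;
            this.type = type;
        }
    }

    // Returns adjacent token pairs that both have the given type and whose
    // texts fully match the first and second regular expression respectively.
    static List<String> adjacentMatches(List<TypedToken> tokens, String type,
                                        String firstRegex, String secondRegex) {
        Pattern first = Pattern.compile(firstRegex);
        Pattern second = Pattern.compile(secondRegex);
        List<String> hits = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            TypedToken a = tokens.get(i);
            TypedToken b = tokens.get(i + 1);
            if (a.type.equals(type) && b.type.equals(type)
                    && first.matcher(a.text).matches()
                    && second.matcher(b.text).matches()) {
                hits.add(a.text + " " + b.text);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<TypedToken> tokens = new ArrayList<>();
        tokens.add(new TypedToken("Erik", "name"));
        tokens.add(new TypedToken("Hatcher", "name"));
        tokens.add(new TypedToken("waits", "word"));
        tokens.add(new TypedToken("Every", "word"));
        tokens.add(new TypedToken("Hour", "word"));
        System.out.println(adjacentMatches(tokens, "name", "E.*", "H.*"));
    }
}
```

Here "Every Hour" matches the two regular expressions but is skipped because its tokens are not typed as names, which is the distinction a type carried into the index would let a query make.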

Thoughts?

        Erik

