On Jan 4, 2006, at 7:53 AM, Paul Elschot wrote:
On Wednesday 04 January 2006 07:34, Dave Kor wrote:
I would like to associate information (or labels) with each word or
range of words in a document. Information such as this word is a
word is a verb, this period marks the end of a sentence, "kick the
is a contiguous phrase, "white house" is a location and so on. I
a good representation for such information so that they can be
as additional fields in a Lucene document, and easily recovered
search. For the more technically inclined, this would allow me to
part-of-speech tags, chunk tags, sentence boundary markers and
for every indexed document.
This additional information will enable Lucene to perform
post-processing on retrieved documents for various purposes such as
information extraction, summarization, question answering, etc...
Is there any available API? If not, I would appreciate any suggestions and
how such information can best be stored in a Lucene document.
Basically, the index information available in Lucene is the Term,
which is a combination of a field name and a token. For these, Lucene
indexes document presence and all positions within a document. Lucene
also indexes the field length as a norm.
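
To make the description above concrete, here is a minimal sketch of a positional index in plain Python. It is not Lucene code: the `build_index` function, the `docs` layout and the `lengths` table (a rough stand-in for the norm) are all assumptions made for illustration.

```python
from collections import defaultdict

def build_index(docs):
    """Map (field, token) -> {doc_id: [positions]} for a toy corpus.

    `docs` is a hypothetical structure: {doc_id: {field_name: [tokens]}}.
    """
    postings = defaultdict(dict)
    lengths = {}
    for doc_id, fields in docs.items():
        for field, tokens in fields.items():
            # Field length per document, a simplified stand-in for the norm.
            lengths[(doc_id, field)] = len(tokens)
            for pos, token in enumerate(tokens):
                postings[(field, token)].setdefault(doc_id, []).append(pos)
    return postings, lengths

docs = {1: {"text": ["the", "white", "house", "is", "white"]}}
postings, lengths = build_index(docs)
print(postings[("text", "white")])  # {1: [1, 4]}
```

The key is the (field name, token) pair, and each entry records both which documents contain it and where.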
By using one or more extra fields the tags and sentence boundary
can be easily indexed at their positions. To search these have a
look at the
In case you want to search for tokens combined with some (part of
tag, and the tokens and their tags are in different fields, the
is not sufficient, because it does not allow position search over
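
The idea of tags stored in a separate field at the same positions as the tokens can be sketched as follows. This is plain Python over in-memory lists, not a Lucene query; the field names `text` and `pos`, the tag labels and both helper functions are hypothetical.

```python
def positions(field_tokens, value):
    """Return the set of positions where `value` occurs in one field."""
    return {i for i, v in enumerate(field_tokens) if v == value}

def match_token_with_tag(doc, token, tag):
    """Positions where `token` (text field) and `tag` (pos field) coincide."""
    return sorted(positions(doc["text"], token) & positions(doc["pos"], tag))

# Two parallel fields: entry i of "pos" is the tag for entry i of "text".
doc = {
    "text": ["they", "kick", "the", "ball", "after", "the", "kick"],
    "pos":  ["PRP",  "VB",   "DT",  "NN",   "IN",    "DT",  "NN"],
}
print(match_token_with_tag(doc, "kick", "VB"))  # [1]
print(match_token_with_tag(doc, "kick", "NN"))  # [6]
```

Intersecting positions across the two fields is the cross-field step that a query confined to a single field cannot express.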
Paul - I'm interested in this topic myself. Suppose the "text" field
is indexed but also entities are detected like names and places.
Suppose I'd like a query that was "all names that have the initials
EH in the text field" (where we could identify EH names by doing a
SpanRegexQuery for "E.* H.*").
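
As a rough illustration of that query, the sketch below filters detected name spans with the same pattern using Python's standard `re` module. It does not use SpanRegexQuery or any Lucene API; the `names_matching` function, the `(start, end, label)` span format and the sample data are assumptions.

```python
import re

def names_matching(tokens, entity_spans, pattern):
    """Return name strings whose joined tokens fully match `pattern`.

    `entity_spans` is a hypothetical list of (start, end, label) tuples,
    with `end` exclusive, as some entity detector might produce.
    """
    regex = re.compile(pattern)
    hits = []
    for start, end, label in entity_spans:
        if label != "NAME":
            continue  # only names are of interest, not places
        candidate = " ".join(tokens[start:end])
        if regex.fullmatch(candidate):
            hits.append(candidate)
    return hits

tokens = ["Edward", "Hopper", "painted", "while", "Ernest", "Hemingway",
          "visited", "East", "Hampton", "with", "Ezra", "Pound"]
spans = [(0, 2, "NAME"), (4, 6, "NAME"), (7, 9, "PLACE"), (10, 12, "NAME")]
print(names_matching(tokens, spans, "E.* H.*"))
# ['Edward Hopper', 'Ernest Hemingway']
```

"East Hampton" matches the pattern but is skipped because its label is not a name, which is the part that needs the entity information alongside the text.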
I've been pondering whether it makes sense for Lucene to be enhanced
to carry over a Token's type into the index such that it could factor
into the query also.