java-user@lucene.apache.org
[Top] [All Lists]

How to post date encoded in NCR(decimal) to lucene indexer?

Subject: How to post date encoded in NCR(decimal) to lucene indexer?
From: KK
Date: Mon, 1 Jun 2009 14:18:53 +0530
Hi All,
I'm trying to index data to lucene index in unicode utf-8 format. All my
search queries are of the form \uxxxx and its working fine. But the problem
is in some cases, when the document[actually a webpage content] contains
Numeric Character Reference[decimal], these are getting indexed as such. For
example I've the following data[some telugu language data],

డాక్టర్

When I index this they get indexed as such and querying using \uxxxx
format doesnot give any result. so I want to know is there any way
where we can configure lucene to take
care of such things by itself, or I've to convert the same to \uxxxx
format[this is just replace &# with \u and replace the 4-dig number
with its hex equivalent]. This manual

method doesnot sound good to me. If there is any standard way to doing
the same, please someone let ke know. Thank you.

--KK.

One question?
Is it mandatory that the data to be indexed by lucene has to in \uxxxx
format for unicode utf-8 encoded data?
<Prev in Thread] Current Thread [Next in Thread>
  • How to post date encoded in NCR(decimal) to lucene indexer?, KK <=