java-user@lucene.apache.org
[Top] [All Lists]

Unicode Normalization

Subject: Unicode Normalization
From: "David Woodward"
Date: Wed, 11 Apr 2007 16:00:40 -0400
Hi.

I have encountered a problem searching in my application because of 
inconsistant unicode normalization forms in the corpus (and the queries). I 
would like to normalize to form NFKD in an analyzer (I think). I was thinking 
about creating a filter similar to the lowercasefilter that would do the 
unicode normalization. Then I will add that filter to my existing snowball 
analyzer. I am about to embark on creating said analyzer/filter using the ICU 
(http://icu-project.org/) icu4j jar.

Is this already accounted for in standard lucene somewhere and I'm just missing 
it?

Anything similar out there?

Any other advice?

Thanks,
Dave Wooodward
Library of Congress


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@xxxxxxxxxxxxxxxxx
For additional commands, e-mail: java-user-help@xxxxxxxxxxxxxxxxx

<Prev in Thread] Current Thread [Next in Thread>