Hi.
I have run into a search problem in my application caused by
inconsistent Unicode normalization forms in the corpus (and in the queries). I
would like to normalize to form NFKD in an analyzer (I think). I was thinking
about creating a filter, similar to LowerCaseFilter, that would do the
Unicode normalization, and then adding that filter to my existing Snowball
analyzer. I am about to start writing this analyzer/filter using the ICU
(http://icu-project.org/) icu4j jar.
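
To make the mismatch concrete, here is a minimal sketch of the normalization
step on its own. It assumes the JDK's java.text.Normalizer (Java 6 and later)
as a stand-in for ICU4J so that it runs without extra jars; the class name,
the helper method, and the sample strings are made up for illustration, and
this is not a Lucene filter.

```java
import java.text.Normalizer;

// Hypothetical demo class; not part of Lucene or ICU4J.
class NfkdSketch {

    // Hypothetical helper: return the NFKD form of a term.
    static String toNfkd(String term) {
        return Normalizer.normalize(term, Normalizer.Form.NFKD);
    }

    public static void main(String[] args) {
        String composed = "caf\u00E9";     // e-acute as one precomposed code point
        String decomposed = "cafe\u0301";  // plain e followed by a combining acute accent
        String ligature = "\uFB01le";      // "file" spelled with the fi ligature U+FB01

        // The raw strings differ, so an index term and a query term would not match.
        System.out.println(composed.equals(decomposed));

        // After NFKD both spellings become the same sequence of code points.
        System.out.println(toNfkd(composed).equals(toNfkd(decomposed)));

        // The compatibility decomposition also expands the ligature to "fi".
        System.out.println(toNfkd(ligature).equals("file"));
    }
}
```

The filter I have in mind would apply the same call to each token's text,
both at index time and at query time.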
Is this already handled somewhere in standard Lucene and I'm just missing
it?
Anything similar out there?
Any other advice?
Thanks,
Dave Wooodward
Library of Congress