|
|
On 3/9/06, Raul Raja Martinez <dobleerre@xxxxxxxxxxxxxxx> wrote:
> Hi I have a lot of html indexed such as:
>
> Martínez
>
> Of course my users are gonna search for Martínez and they're not gonna
> get a match.
>
> Is there a common approach to solve this kind of problem in lucene,
> Maybe some utility class or something?
If you might have other random HTML markup as well as entities check out,
Solr's HTMLStrip* tokenizers:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
It's good if your input is dirty - if you don't know if it's HTML or
not, or if there are HTML fragments that would cause a normaly HTML
parser to choke.
If you actually have HTML documents, I would go with an HTML parser.
If you have *just* entities, there is probably a simpler approach.
-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@xxxxxxxxxxxxxxxxx
For additional commands, e-mail: java-user-help@xxxxxxxxxxxxxxxxx
|
|