Subject: Re: Best practice for searching html
From: "Yonik Seeley"
Date: Thu, 9 Mar 2006 09:32:26 -0500
On 3/9/06, Raul Raja Martinez <[email protected]> wrote:
> Hi I have a lot of html indexed such as:
> Mart&iacute;nez
> Of course my users are gonna search for Martínez and they're not gonna
> get a match.
> Is there a common approach to solve this kind of problem in lucene,
> Maybe some utility class or something?

If you might have other random HTML markup as well as entities, check out
Solr's HTMLStrip* tokenizers:

It's good if your input is dirty - if you don't know whether it's HTML or
not, or if there are HTML fragments that would cause a normal HTML
parser to choke.

If you actually have HTML documents, I would go with an HTML parser.
If you have *just* entities, there is probably a simpler approach.
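
For the entities-only case, something along these lines might be enough.
This is only a rough sketch: EntityDecoder is a made-up class name (it is
not part of Lucene or Solr), and the named-entity table is deliberately
tiny, so you would need to fill it in with whatever entities actually
occur in your data. Unknown or malformed entities are left untouched.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not a Lucene or Solr class.
public class EntityDecoder {
    // Matches named (&iacute;), decimal (&#237;) and hex (&#xED;) references.
    private static final Pattern ENTITY =
        Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);");

    // Assumption: a small sample table. Extend it to cover your data.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("aacute", "\u00E1");
        NAMED.put("eacute", "\u00E9");
        NAMED.put("iacute", "\u00ED");
        NAMED.put("oacute", "\u00F3");
        NAMED.put("uacute", "\u00FA");
        NAMED.put("ntilde", "\u00F1");
    }

    // Replaces every recognized entity in the text with its character.
    public static String decode(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            String replacement = body.startsWith("#")
                ? decodeNumeric(body.substring(1), m.group())
                : NAMED.getOrDefault(body, m.group());
            // quoteReplacement keeps '$' and '\' in the result literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Decodes "237" or "xED"; returns the original text if it is not valid.
    private static String decodeNumeric(String digits, String original) {
        try {
            boolean hex = digits.charAt(0) == 'x' || digits.charAt(0) == 'X';
            int codePoint = hex
                ? Integer.parseInt(digits.substring(1), 16)
                : Integer.parseInt(digits, 10);
            if (!Character.isValidCodePoint(codePoint)) {
                return original;
            }
            return new String(Character.toChars(codePoint));
        } catch (NumberFormatException e) {
            return original;
        }
    }
}
```

You would run the field text through EntityDecoder.decode() before handing
it to the analyzer at index time, so that "Mart&iacute;nez" is indexed as
"Martínez" and matches what your users type.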

