java-user@lucene.apache.org

Re: Best practice for searching html

Subject: Re: Best practice for searching html
From: "Yonik Seeley"
Date: Thu, 9 Mar 2006 09:32:26 -0500
On 3/9/06, Raul Raja Martinez <dobleerre@xxxxxxxxxxxxxxx> wrote:
> Hi I have a lot of html indexed such as:
>
> Mart&iacute;nez
>
> Of course my users are gonna search for Martínez and they're not gonna
> get a match.
>
> Is there a common approach to solve this kind of problem in lucene,
> Maybe some utility class or something?

If you might have other random HTML markup as well as entities, check out
Solr's HTMLStrip* tokenizers:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

They're good if your input is dirty: if you don't know whether it's HTML
or not, or if there are HTML fragments that would cause a normal HTML
parser to choke.

If you actually have HTML documents, I would go with an HTML parser.
If you have *just* entities, there is probably a simpler approach.
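
For the entities-only case, something along these lines might be enough.
This is only a rough sketch, not anything that ships with Lucene or Solr:
the EntityDecoder class, its decode() method and the tiny table of named
entities are all made up for illustration, and a real table would need to
cover the full HTML entity list. It decodes named and numeric references
in a single pass and leaves anything it doesn't recognize untouched.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Lucene or Solr.
final class EntityDecoder {

    // Matches &name;, decimal &#237; and hexadecimal &#xED; references.
    private static final Pattern ENTITY =
        Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);");

    // Deliberately small sample table (an assumption, not a complete list).
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("aacute", "\u00e1");
        NAMED.put("eacute", "\u00e9");
        NAMED.put("iacute", "\u00ed");
        NAMED.put("oacute", "\u00f3");
        NAMED.put("uacute", "\u00fa");
        NAMED.put("ntilde", "\u00f1");
    }

    // Replaces every recognized entity in the text with its character.
    static String decode(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = resolve(m.group(1));
            if (replacement == null) {
                replacement = m.group(); // unknown entity: keep it as written
            }
            // quoteReplacement stops '$' and '\' from being treated specially.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the decoded text for one entity body, or null if unrecognized.
    private static String resolve(String body) {
        if (body.charAt(0) != '#') {
            return NAMED.get(body);
        }
        boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
        try {
            int codePoint = hex
                ? Integer.parseInt(body.substring(2), 16)
                : Integer.parseInt(body.substring(1));
            return Character.isValidCodePoint(codePoint)
                ? new String(Character.toChars(codePoint))
                : null;
        } catch (NumberFormatException e) {
            return null; // more digits than fit in an int
        }
    }
}
```

You would run the text through EntityDecoder.decode() before handing it to
your analyzer and then reindex, so that the indexed term is the accented
form your users actually type.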


-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server

