Hi Sudha,
In the past, I've built regexes to recognize URLs using the information here:
http://www.foad.org/~abigail/Perl/url2.html
The above, however, is currently a dead link.
Here's the Internet Archive Wayback Machine's copy of this page from August
2007:
<http://web.archive.org/web/20070807114147/http://www.foad.org/~abigail/Perl/url2.html>
Here's the same content, of unknown vintage, as a text file (even though it has
a .html extension):
http://nerxs.com/mirrorpages/urlregex.html
Also, Jeffrey Friedl's book "Mastering Regular Expressions", 2nd edition (but
not the 1st edition), has a section on recognizing URLs in Chapter 5.
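
As a starting point, here is a minimal Java sketch of the regex approach. The
pattern is a deliberately simplified one of my own for illustration, not the
RFC-derived regex from the page above, and the class name UrlFinder and method
findUrls are hypothetical. It assumes only http, https and ftp schemes, and it
drops common trailing punctuation (such as a sentence-ending period) from each
match:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: pulls URL-like strings out of free text.
class UrlFinder {
    // Simplified pattern (an assumption, not a full RFC grammar):
    // a scheme, "://", any run of characters other than whitespace, < > or ",
    // ending on a character that is not common trailing punctuation.
    private static final Pattern URL = Pattern.compile(
            "\\b(?:https?|ftp)://[^\\s<>\"]*[^\\s<>\".,;:!?)\\]]",
            Pattern.CASE_INSENSITIVE);

    // Returns every substring of text that matches the pattern, in order.
    static List<String> findUrls(String text) {
        List<String> urls = new ArrayList<String>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "See <http://nerxs.com/mirrorpages/urlregex.html>, "
                + "or http://www.foad.org/~abigail/Perl/url2.html.";
        for (String url : findUrls(text)) {
            System.out.println(url);
        }
    }
}
```

This only finds URLs in a string. Getting an analyzer to keep them as single
tokens is a separate step that the sketch does not cover.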
Steve
On 11/19/2009 at 12:58 AM, Sudha Verma wrote:
> Hi,
>
> I am using Lucene 2.9.1.
>
> I am reading in free text documents which I index using Lucene and the
> StandardAnalyzer at the moment.
>
> The StandardAnalyzer keeps email addresses intact and does not tokenize
> them. Is there something similar for
> URLs? This seems like a common need. So, I thought I'd check if there
> is anything out there that does it already.
>
> I'd appreciate any help.
>
> Thanks,
> sudha