activeperl@listserv.activestate.com

Re: Got this far with regex, now I'm stumped

Subject: Re: Got this far with regex, now I'm stumped
From: Barry Hemphill
Date: Wed, 26 Apr 2006 17:04:58 -0400
Deane.Rothenmaier@xxxxxxxxxxxxx wrote:

Hi, all.

I have a sub that uses a set of URL-parsing regexes that almost works:

Deane,

I've seen a few answers here that address your regex, but there is a broader challenge. Your approach seems targeted at domain names with two components - a single word, a period, and the name of the TLD (top-level domain), e.g. cpan.org. However, there are at least three variations that will trip you up (and maybe others I missed):

- There are domain names with three components - e.g. telstra.com.au or gov.on.ca. In fact many (most?) of the country TLDs assign domains with either regional distinctions (gov.on.ca is Ontario, Canada) or domain types (e.g. co.uk or com.au). I'm not aware of any that assign domain names with four parts, but I don't think it's prohibited by RFC 1035.

- Companies, schools, etc. often create their own subdomains - I have worked with many customers who do this. For example, company foo.com creates engineering.foo.com, accounting.foo.com, etc. This is perfectly valid, although these are not assigned domain names. Of course, for logging purposes, you could just treat everything in front of the assigned domain (e.g. "www.engineering" in www.engineering.foo.com) as the hostname, even though strictly speaking a single hostname label (RFC 952) cannot contain a period.

- IP addresses - I'm guessing you're getting your input names from a web server log or something similar. Where an IP address cannot be resolved to an FQDN (fully qualified domain name), your log will contain the IP address itself. Unlike hostnames, you probably don't want to separate this into components by chopping off the first octet or two (which would give you nonsensical results). Conceivably you could take the last octet as the host portion, but without knowing how the network is subnetted that's just a guess.
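
To illustrate the last point, here is a minimal sketch of screening out IP addresses before any hostname/domain splitting is attempted. It is written in Python rather than Perl purely for illustration, and the function name is_ip_address is my own invention, not something from your code:

```python
import ipaddress

def is_ip_address(entry: str) -> bool:
    """Return True if a log entry is a literal IPv4 or IPv6 address."""
    try:
        # ip_address() raises ValueError for anything that is not
        # a complete, well-formed address.
        ipaddress.ip_address(entry)
        return True
    except ValueError:
        return False

# Entries that pass this check should be kept whole, not split on dots.
for entry in ("192.168.1.10", "engineering.foo.com", "cpan.org"):
    kind = "IP address" if is_ip_address(entry) else "name"
    print(f"{entry}: {kind}")
```

Anything this check flags as an address is probably best logged as-is, with the name-splitting logic applied only to the rest.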

To really know which part is hostname and which part is domain name when you are dealing with arbitrary domains, you will probably have to check with an assigned name authority to verify each domain, and then take the remainder as the host portion. There are a couple of modules on CPAN that might be helpful, for example Net::Domain::TLD (although I haven't tried it myself). Of course, if you have a limited set of data to deal with (e.g. it's an internal web server and you know all the possible domains), the problem is a lot easier to solve.
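
For the limited-data case, a rough sketch of the idea might look like the following (again in Python for illustration only). It assumes you maintain your own set of suffixes under which domains are assigned; KNOWN_SUFFIXES and split_host_domain are hypothetical names, and the handful of suffixes shown is nowhere near a complete list:

```python
# Hypothetical, hand-maintained set of suffixes under which domains
# are assigned. A real list would be far longer.
KNOWN_SUFFIXES = {"org", "com", "com.au", "co.uk", "on.ca"}

def split_host_domain(fqdn, suffixes=KNOWN_SUFFIXES):
    """Split a name into (host, domain), or return None if no suffix matches.

    The domain is the longest matching suffix plus the one label in
    front of it; whatever is left over is treated as the host portion.
    """
    labels = fqdn.lower().rstrip(".").split(".")
    # Walk from the longest candidate suffix to the shortest.
    for start in range(1, len(labels)):
        candidate = ".".join(labels[start:])
        if candidate in suffixes:
            domain = ".".join(labels[start - 1:])
            host = ".".join(labels[:start - 1])
            return host, domain
    return None

for name in ("cpan.org", "www.telstra.com.au", "www.engineering.foo.com"):
    print(name, "->", split_host_domain(name))
```

Checking the longest suffix first matters: otherwise www.telstra.com.au would never match "com.au" correctly if a shorter suffix such as "au" were also in the set. Names that match nothing come back as None, so you can decide separately what to do with them.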

It might be helpful if you could tell us what you're trying to accomplish - we might be able to point you at a good solution.

Hope this helps,
Barry