Deane.Rothenmaier@xxxxxxxxxxxxx wrote:
Hi, all.
I have a sub that uses a set of URL-parsing regexes that almost works:
Deane,
I've seen a few answers here that address your regex, but there is a
broader challenge. Your approach seems targeted to domain names with
two components - a single word, a period, and the name of the TLD
(top-level domain), e.g. cpan.org. However, there are at least three
variations that will trip you up (and maybe others I missed):
- There are domain names with three components - e.g. telstra.com.au or
gov.on.ca - in fact many (most?) of the country TLDs assign domains
with either regional distinctions (gov.on.ca is Ontario, Canada) or
domain type (e.g. co.uk or com.au). I'm not aware of any that assign
domain names with four parts, but I don't think it's prohibited by RFC 1035.
- Companies, schools, etc. often create their own subdomains - I have
worked with many customers who do this. For example, company foo.com
creates engineering.foo.com, accounting.foo.com etc. This is perfectly
valid, although these are not assigned domain names. Of course, for
logging purposes, you could just treat something like "pc1.engineering"
(from pc1.engineering.foo.com) as a hostname, even though strictly
speaking a valid (RFC 952) hostname cannot have a period.
- IP addresses - I'm guessing you're getting your input names from a web
server log or something similar. Where an IP address cannot
be resolved to an FQDN (fully qualified domain name), your log will
contain the IP address itself. However, unlike hostnames, you probably don't want
to separate this into components by chopping off the first octet or two
(which would give you nonsensical results). Conceivably you could take
the last octet as the host portion, but without knowing how the network
is subnetted that's just a guess.
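To make the three cases concrete, here is a rough sketch (not your
original sub). The helper names is_ipv4 and naive_split, and the sample
names pc1.engineering.foo.com and 192.168.10.25, are made up for
illustration. It sets IP addresses aside first, then applies the naive
"last two labels are the domain" rule to everything else:

```perl
use strict;
use warnings;

# Hypothetical helper: true if $name looks like a dotted-quad IPv4 address.
sub is_ipv4 {
    my ($name) = @_;
    my @octets = $name =~ /\A(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\z/
        or return 0;
    return (grep { $_ > 255 } @octets) ? 0 : 1;
}

# Hypothetical helper: naive rule that treats the last two labels as the
# domain and everything before them as the host.
sub naive_split {
    my ($name) = @_;
    my @labels = split /\./, $name;
    return ('', $name) if @labels <= 2;
    my $domain = join '.', @labels[-2, -1];
    my $host   = join '.', @labels[0 .. $#labels - 2];
    return ($host, $domain);
}

for my $name (qw(www.cpan.org www.telstra.com.au
                 pc1.engineering.foo.com 192.168.10.25)) {
    if (is_ipv4($name)) {
        # Never chop octets off an address; report it whole.
        printf "%-26s IP address, left unsplit\n", $name;
        next;
    }
    my ($host, $domain) = naive_split($name);
    printf "%-26s host='%s' domain='%s'\n", $name, $host, $domain;
}
```

The naive rule gets www.cpan.org right, but for www.telstra.com.au it
reports the domain as com.au, which is exactly the three-component
problem described above.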
To really know what part is hostname and what part is domain name if you
are dealing with arbitrary domains, you will probably have to check with
an assigned name authority to verify each domain, and then take the
remainder as the host portion. There are a couple of modules on CPAN that
might be helpful, such as Net::Domain::TLD (although I haven't tried it myself). Of
course, if you have a limited set of data to deal with (e.g. it's an
internal web server and you know all the possible domains) the problem
is a lot easier to solve.
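For the limited-set case, a minimal sketch might look like the
following. It assumes you can list the registered domains you expect to
see; the list contents and the helper name split_by_known_domain are
purely illustrative. The idea is to match the longest known domain at
the end of the name, on a label boundary, and take the rest as the host
portion:

```perl
use strict;
use warnings;

# Assumed, illustrative list of registered domains expected in the data.
my @known_domains = qw(cpan.org telstra.com.au foo.com gov.on.ca);

# Hypothetical helper: returns (host, domain) using the longest known
# domain that ends $name on a label boundary, or an empty list if no
# known domain matches.
sub split_by_known_domain {
    my ($name, @domains) = @_;
    $name = lc $name;
    my @by_length = sort { length($b) <=> length($a) }
                    map  { lc($_) } @domains;
    for my $domain (@by_length) {
        return ('', $domain) if $name eq $domain;
        my $suffix = ".$domain";
        next if length($name) <= length($suffix);
        if (substr($name, -length($suffix)) eq $suffix) {
            my $host = substr($name, 0, length($name) - length($suffix));
            return ($host, $domain);
        }
    }
    return;
}

for my $name (qw(www.cpan.org www.telstra.com.au
                 pc1.engineering.foo.com unknown.example.net)) {
    my ($host, $domain) = split_by_known_domain($name, @known_domains);
    if (defined $domain) {
        printf "%-26s host='%s' domain='%s'\n", $name, $host, $domain;
    }
    else {
        printf "%-26s no known domain\n", $name;
    }
}
```

Matching on ".$domain" rather than the bare domain keeps a name like
notfoo.com from being mistaken for a host under foo.com. Names that
match nothing are reported rather than guessed at.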
It might be helpful if you could tell us what you're trying to
accomplish - we might be able to point you at a good solution.
Hope this helps,
Barry
_______________________________________________
ActivePerl mailing list