Perl RegEx (url explode)

by U_nix$_@ (Initiate)
on Nov 01, 2012 at 22:18 UTC ( #1001877=perlquestion )
U_nix$_@ has asked for the wisdom of the Perl Monks concerning the following question:


I think it's a banal bug, but I would be happy if someone could help me with this. Here's the code:

($CON,$WWW,$HOST,$SLD,$TLD,$PORT) = $conf[1] =~ m|(http(?:s?))?(?:(?:://)?(w{0,3})\.{0,1})?((.*)(?:\.)(.*))(?::(\d{0,10})?)|;

The following "types" of URLs must come through:


and if they come with a port, it must work too:

If a URL with a port is used, everything works fine. Without a port, nothing works.


It prints the following:

http(s) www example de 443

If something is missing:

http "empty" example de 80

There must be a little bug somewhere.

No variable gets a value if a URL without a port is given.


I guess the reason is "(?::": no ":", no match. If I change it, both URLs are accepted, but the port is not split off. It stays attached to the top-level domain and also ends up in the host variable.
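
The behaviour described in the question can be reproduced with a short sketch. The two input URLs are hypothetical (the thread's own examples are not shown), and the pattern is the one from the question stored in a `qr` object, here called `$re`:

```perl
use strict;
use warnings;

# Pattern from the question. The trailing group (?::(\d{0,10})?) is not
# optional as a whole: only the digits are, so a literal ":" must follow
# the dotted host part for the match to succeed.
my $re = qr|(http(?:s?))?(?:(?:://)?(w{0,3})\.{0,1})?((.*)(?:\.)(.*))(?::(\d{0,10})?)|;

# Hypothetical inputs, not taken from the thread.
for my $url ('https://www.example.de:443', 'https://www.example.de') {
    my @parts = $url =~ $re;
    printf "%-28s %s\n", $url, @parts ? 'matched' : 'no match';
}
```

The ":" after the scheme does not help the second URL, because the pattern needs a dot somewhere before the colon it matches.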

Re: Perl RegEx (url explode)
by choroba (Bishop) on Nov 01, 2012 at 22:53 UTC
    I noticed just one problem: the placement of the final question mark. The whole port part is optional, together with the colon:
    m% (http(?:s?))?                   # http
       (?:(?:://)? (w{0,3})\.{0,1})?   # www
       ((.*)(?:\.)([^:]*))             # domains
       (?::(\d+))? %x;                 # port
    Edit: . changed to [^:] in # domains.
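
As a rough check of the pattern above, the sketch below runs it against two hypothetical URLs (they are assumptions, not the thread's examples) and prints the six captures, with an empty string standing in for a capture that did not take part in the match:

```perl
use strict;
use warnings;

# Pattern as posted in the reply above, stored in a qr object.
my $re = qr%
    (http(?:s?))?                  # scheme
    (?:(?:://)? (w{0,3})\.{0,1})?  # optional "www."
    ((.*)(?:\.)([^:]*))            # host, second-level part, top-level part
    (?::(\d+))?                    # optional ":port"
%x;

# Hypothetical inputs, not taken from the thread.
for my $url ('https://www.example.de:443', 'http://example.de') {
    my @parts = map { defined $_ ? $_ : '' } $url =~ $re;
    print join(' | ', @parts), "\n";
}
```

Because the whole `(?::(\d+))?` group is optional, the URL without a port still matches and simply leaves the last capture undefined.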
Re: Perl RegEx (url explode)
by aitap (Curate) on Nov 02, 2012 at 18:54 UTC
    Isn't URI better in this case? Bigger, but simpler code:
    use URI;
    for (URI::->new($conf[1],"http")) {
        my @domain = split /\./, $_->host;
        my $tld = pop @domain;
        # split off a leading "www" label before building the second-level domain
        my $www = @domain > 1 && $domain[0] eq "www" ? shift @domain : "";
        my $sld = join ".",@domain;
        my $host = join ".",(@domain,$tld);
        print($_->scheme,$www,$host,$sld,$tld,$_->port);
    }
    (this code will work even in weird cases like perfectly valid
Re: Perl RegEx (url explode)
by U_nix$_@ (Initiate) on Nov 01, 2012 at 23:07 UTC

    Thanks. That's one of the ways I tried before. It allows both types, with and without a port, but produces the following output:

    http
    www        // this one should be ""
    example
    de:9944    // this one should be only "de"
    ## PORT is empty ##

      (.*) seems to ignore what comes after it if ":" is optional,
      and the port becomes part of this capture:


      But how to fix it? A fixed set of commonly used top-level domains is not flexible enough.

        Try to match a character class that does not contain ':' (i.e. [^:]):

        use strict;
        use warnings;

        for my $uri( qw( ) ) {
            print "in ($uri):\n";
            my (@spl) = $uri =~ m|(http(?:s?))?
                (?:(?:://)? (w{0,3})\.{0,1})?
                ((.*)(?:\.)([^:/]*))    # match if it is not a ":"
                (?::(\d{0,10}))?
                |x;
            print 'out: ', join(', ', map { defined $_ ? $_ : '-' } @spl), "\n\n";
        }
        __DATA__
        in (
        out: https, www,, example, de, -

        in (
        out: http, www,, example, de, -

        in (
        out: https, ,, example, de, -

        in (
        out: http, ,, example, de, -

        in (
        out: -, www,, example, de, -

        in (
        out: -, ,, example, de, 123

        in (
        out: http, www,, example, de, 445

        in (
        out: http, www,, example, de, -

        in (
        out: http, www,, example, de, 445
        Update: Added '/' to character class and example '#foo'
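
The difference between the greedy `(.*)` and the negated character class discussed in this sub-thread can be seen side by side in a minimal sketch. The input URL and the helper names `$prefix`, `$greedy` and `$negated` are assumptions made for illustration:

```perl
use strict;
use warnings;

# Hypothetical input, not taken from the thread.
my $url = 'http://www.example.de:9944';

# Shared prefix (scheme and optional "www."); only the last capture of the
# host part differs between the two patterns.
my $prefix  = qr|(http(?:s?))?(?:(?:://)?(w{0,3})\.{0,1})?|;
my $greedy  = qr|$prefix((.*)\.(.*))(?::(\d{0,10}))?|;
my $negated = qr|$prefix((.*)\.([^:/]*))(?::(\d{0,10}))?|;

for my $case (['greedy (.*)', $greedy], ['negated [^:/]*', $negated]) {
    my ($label, $re) = @$case;
    # Captures 5 and 6 are the top-level domain and the port.
    my (undef, undef, undef, undef, $tld, $port) = $url =~ $re;
    printf "%-16s tld=%s port=%s\n",
        $label, $tld, defined $port ? $port : '(undef)';
}
```

With `(.*)` the capture runs to the end of the string, and since the port group is optional the engine has no reason to backtrack to the colon. With `[^:/]*` the capture has to stop at the colon, so the port group gets its chance. Note that in both patterns the earlier `(.*)` still backtracks to the last dot in the whole string, so a path containing a dot would likely need extra handling.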

        Ok. The spirit reached me. This fixed it:


        Your "match if not" version is the cleaner one. Merci.


