PerlMonks  

Perl RegEx (url explode)

by U_nix$_@ (Initiate)
on Nov 01, 2012 at 22:18 UTC (#1001877=perlquestion)
U_nix$_@ has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I think it's a banal bug, but I would be happy if someone could help me with this. Here's the code:

($CON,$WWW,$HOST,$SLD,$TLD,$PORT) = $conf[1] =~ m|(http(?:s?))?(?:(?:://)?(w{0,3})\.{0,1})?((.*)(?:\.)(.*))(?::(\d{0,10})?)|;

The following "types" of URLs must be accepted:

http(s)://www.example.de
http(s)://example.de
www.example.de
example.de

and if they come with a port, it must work too:
.de:443

If a URL with a port is used, everything works fine. Without a port, nothing works.

print $CON,$WWW,$HOST,$SLD,$TLD,$PORT;

This prints the following:

http(s) www example.de example de 443

If a part is missing, e.g. http://example.de:80, it prints:

http "empty" example.de example de 80

There must be a small bug somewhere.

No variable gets a value if a URL without a port is given:

(?::(\d{0,10})?)

I guess the reason is "?::": no ":", no match. If I change it, both kinds of URL are accepted, but the port is no longer split off. It stays attached to the top-level domain and also ends up in the host variable.
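To illustrate the diagnosis above, here is a minimal sketch that compares the two possible positions of the final `?` in the port group. The patterns are deliberately simplified and anchored, and the names `$required` and `$optional` are made up for this example; they are not part of the original code.

```perl
use strict;
use warnings;

# Simplified patterns for illustration only; not the full URL regex.
my $required = qr/^([^:]+)(?::(\d+)?)$/;   # colon is mandatory, only the digits are optional
my $optional = qr/^([^:]+)(?::(\d+))?$/;   # colon and digits are optional together

for my $in (qw(example.de example.de:443)) {
    for my $case ([required => $required], [optional => $optional]) {
        my ($name, $re) = @$case;
        if (my ($host, $port) = $in =~ $re) {
            printf "%-8s %-15s host=%s port=%s\n", $name, $in, $host, $port // '-';
        }
        else {
            printf "%-8s %-15s no match\n", $name, $in;
        }
    }
}
```

With the `?` inside the group, the input without a port fails to match at all, which is consistent with every variable staying empty.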

Re: Perl RegEx (url explode)
by choroba (Abbot) on Nov 01, 2012 at 22:53 UTC
    I noticed just one problem: the placement of the final question mark. The whole port part is optional, together with the colon:
    m% (http(?:s?))?                 # http
       (?:(?:://)? (w{0,3})\.{0,1})? # www
       ((.*)(?:\.)([^:]*))           # domains
       (?::(\d+))? %x;               # port
    Edit: . changed to [^:] in # domains.
    لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
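A small sketch of why the switch from `.` to `[^:]` matters, assuming a cut-down version of the pattern that keeps only the domain and port groups. The names `$greedy` and `$negated` are illustrative only.

```perl
use strict;
use warnings;

# Same pattern twice; only the TLD group differs (illustrative names).
my $greedy  = qr/((.*)\.(.*))(?::(\d+))?/;      # TLD group may swallow ":" and the port
my $negated = qr/((.*)\.([^:]*))(?::(\d+))?/;   # TLD group stops at the first ":"

for my $re ($greedy, $negated) {
    my ($host, $sld, $tld, $port) = 'example.de:9944' =~ $re;
    print join(', ', map { $_ // '-' } $host, $sld, $tld, $port), "\n";
}
# prints:
# example.de:9944, example, de:9944, -
# example.de, example, de, 9944
```

In the first pattern the greedy `(.*)` consumes the rest of the string, so the optional port group is satisfied by matching nothing.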
Re: Perl RegEx (url explode)
by U_nix$_@ (Initiate) on Nov 01, 2012 at 23:07 UTC

    Hi,
    thanks. That's one of the ways I tried before. It accepts both types, with and without a port, but produces the following output:

    http
    www
    example.de:9944 //this one should be "example.de"
    example
    de:9944 //this one should be only "de"
    ## PORT is empty ##

      (.*) seems to swallow whatever comes after it if ":" is optional,
      and the port becomes a part of this:

      ((.*)(?:\.)(.*))

      But how to fix it? A fixed set of commonly used top-level domains is not flexible enough.

        Try to match a character class that does not contain ':' (i.e. [^:]):

        use strict;
        use warnings;

        for my $uri( qw(https://www.example.de
                        http://www.example.de
                        https://example.de
                        http://example.de
                        www.example.de
                        example.de:123
                        http://www.example.de:445/can?this=happen&too=1#lalala
                        http://www.example.de/can?this=happen&too=1#foo
                        http://www.example.de:445
                       ) ) {
            print "in ($uri):\n";
            my (@spl) = $uri =~ m|(http(?:s?))?
                                  (?:(?:://)? (w{0,3})\.{0,1})?
                                  ((.*)(?:\.)([^:/]*)) # match if it is not a ":"
                                  (?::(\d{0,10}))?
                                 |x;
            print 'out: ', join(', ', map { defined $_ ? $_ : '-' } @spl), "\n\n";
        }

        __DATA__
        in (https://www.example.de):
        out: https, www, example.de, example, de, -

        in (http://www.example.de):
        out: http, www, example.de, example, de, -

        in (https://example.de):
        out: https, , example.de, example, de, -

        in (http://example.de):
        out: http, , example.de, example, de, -

        in (www.example.de):
        out: -, www, example.de, example, de, -

        in (example.de:123):
        out: -, , example.de, example, de, 123

        in (http://www.example.de:445/can?this=happen&too=1#lalala):
        out: http, www, example.de, example, de, 445

        in (http://www.example.de/can?this=happen&too=1#foo):
        out: http, www, example.de, example, de, -

        in (http://www.example.de:445):
        out: http, www, example.de, example, de, 445
        Update: Added '/' to character class and example '#foo'

        Ok. The spirit reached me. This fixed it:

        ((.*)(?:\.)([a-zA-Z]*))(?::(\d{0,10}))?

        Edit:
        @Perlbotics,
        Your "match if not" version is the cleaner one. Merci.
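For reuse, the "match if not" idea could be wrapped in a small function. The following is only a sketch under a few assumptions that are not in the thread: the helper name `split_url` and the hash keys are hypothetical, the pattern is anchored at the start, the scheme is written as `https?`, and the prefix is matched as a literal `www` rather than `w{0,3}`.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash ref of URL parts, or undef if nothing matches.
sub split_url {
    my ($url) = @_;
    return unless $url =~ m{
        ^ (?: (https?) :// )?        # optional scheme
        (?: (www) \. )?              # optional "www." prefix
        ( ([^:/]*) \. ([^.:/]+) )    # host = SLD part + "." + TLD
        (?: : (\d+) )?               # optional port
    }x;
    return +{ con => $1, www => $2, host => $3, sld => $4, tld => $5, port => $6 };
}

for my $url (qw(https://www.example.de example.de:123 http://www.ru)) {
    my $p = split_url($url) or next;
    print join(', ', map { $p->{$_} // '-' } qw(con www host sld tld port)), "\n";
}
# prints:
# https, www, example.de, example, de, -
# -, -, example.de, example, de, 123
# http, -, www.ru, www, ru, -
```

Because the "www." prefix is optional, the regex engine backtracks and treats `www` as the second-level domain when no other label is left, as in the last example.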

Re: Perl RegEx (url explode)
by aitap (Deacon) on Nov 02, 2012 at 18:54 UTC
    Isn't URI better in this case? Bigger, but simpler code:
    use URI;
    for (URI::->new($conf[1],"http")) {
        my @domain = split /\./, $_->host;
        my $tld = pop @domain;
        # shift off a leading "www" only if another label remains,
        # so http://www.ru keeps "www" as its SLD
        my $www = @domain > 1 && $domain[0] eq "www" ? shift @domain : "";
        my $sld = join ".",@domain;
        my $host = join ".",(@domain,$tld);
        print($_->scheme,$www,$host,$sld,$tld,$_->port);
    }
    (this code will work even in weird cases like perfectly valid http://www.ru)
    Sorry if my advice was wrong.
