PerlMonks  

Perl RegEx (url explode)

by U_nix$_@ (Initiate)
on Nov 01, 2012 at 22:18 UTC ( [id://1001877]=perlquestion )

U_nix$_@ has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I think it's a banal bug, but I would be happy if someone could help me with this. Here's the code:

($CON,$WWW,$HOST,$SLD,$TLD,$PORT) = $conf[1] =~ m|(http(?:s?))?(?:(?:://)?(w{0,3})\.{0,1})?((.*)(?:\.)(.*))(?::(\d{0,10})?)|;

The following "types" of URLs must be accepted:

http(s)://www.example.de
http(s)://example.de
www.example.de
example.de

and if they come with a port, it must work too:
.de:443

If a URL with a port is used, everything works fine. Without a port, nothing matches.

print $CON,$WWW,$HOST,$SLD,$TLD,$PORT;

This prints the following:

http(s) www example.de example de 443

If something is missing, e.g. http://example.de:80, it prints:

http "empty" example.de example de 80

There must be a little bug somewhere.

No variable gets a value if a URL without a port is given:

(?::(\d{0,10})?)

I guess the reason is "?::": no ":", no match. If I change it, both kinds of URL are accepted, but the port is not split off. It stays attached to the top-level domain and ends up in the host variable.
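The guess above can be checked in isolation. The following is only a minimal sketch: the host part is reduced to `[a-z.]+` and both patterns are anchored, which are simplifications for illustration and not part of the original pattern. The helper name `port_of` is hypothetical. It compares the tail as written in the question, where the colon is mandatory, with a variant where the whole `:port` group is optional.

```perl
use strict;
use warnings;

# Simplified stand-ins for the tail of the pattern. The host part is
# reduced to [a-z.]+ for illustration only.
my $colon_required = qr/^([a-z.]+)(?::(\d{0,10})?)$/;   # ':' must be present
my $colon_optional = qr/^([a-z.]+)(?::(\d{0,10}))?$/;   # whole ':port' group is optional

# Hypothetical helper: returns the captured port, 'no port' if the
# pattern matched without one, or 'no match' if it failed entirely.
sub port_of {
    my ($re, $url) = @_;
    return $url =~ $re ? ($2 // 'no port') : 'no match';
}

for my $url (qw(example.de example.de:443)) {
    printf "%-15s required: %-9s optional: %s\n",
        $url, port_of($colon_required, $url), port_of($colon_optional, $url);
}
```

With the colon outside the optional part, a URL without a port cannot match at all, which would explain why every variable stays empty.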

Replies are listed 'Best First'.
Re: Perl RegEx (url explode)
by choroba (Cardinal) on Nov 01, 2012 at 22:53 UTC
    I noticed just one problem: the placement of the final question mark. The whole port part is optional, together with the colon:
    m% (http(?:s?))?                    # http
       (?:(?:://)? (w{0,3})\.{0,1})?    # www
       ((.*)(?:\.)([^:]*))              # domains
       (?::(\d+))? %x;                  # port
    Edit: . changed to [^:] in # domains.
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: Perl RegEx (url explode)
by aitap (Curate) on Nov 02, 2012 at 18:54 UTC
    Isn't URI better in this case? Bigger, but simpler code:
    use URI;
    for (URI::->new($conf[1],"http")) {
        my @domain = split /\./, $_->host;
        my $tld = pop @domain;
        my $sld = join ".",@domain;
        my $www = @domain > 2 && $domain[0] eq "www" ? shift @domain : "";
        my $host = join ".",(@domain,$tld);
        print ($_->scheme,$www,$host,$sld,$tld,$_->port);
    }
    (this code will work even in weird cases like perfectly valid http://www.ru)
    Sorry if my advice was wrong.
Re: Perl RegEx (url explode)
by cnd (Acolyte) on Mar 31, 2018 at 06:12 UTC

    This answer caters for usernames and passwords too:

    /^(\w+):\/\/(?:([^:@\/]*)(?::([^@\/]+)|)\@|)((?:[a-zA-Z0-9]+\.|[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9]\.)*(?:[a-zA-Z0-9]+|[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9]))(?::(\d{1,5})|)(.*)/o

    e.g.

    #!perl
    use strict;
    use warnings;

    for my $uri( qw(https://www.example.de
                    http://www.example.de
                    https://example.de
                    http://example.de
                    www.example.de
                    example.de:123
                    http://www.example.de:445/can?this=happen&too=1#lalala
                    http://www.example.de/can?this=happen&too=1#foo
                    http://www.example.de:445
                    wss://stream.binance.com:9443/stream?streams=xrpbtc@kline_1m/ethbtc@kline_1m/btcusdt@kline_1m
                    http://a:b@example.com:890/path/wah@t/foo.js?foo=bar&bingobang=&king=kong@kong.com#foobar/bing/bo@ng?bang"
                    ftp://username@hostname/
                    ftp://username:password@hostname/
                   ) ) {
      print "in ($uri):\n";
      my @parts=($uri=~/^(\w+):\/\/   # scheme (ftp http wss etc)
          (?:([^:@\/]*)               # optional username
          (?::([^@\/]+)|)             # optional password
          \@|)                        # username and password are optional
          (                           # group all the bits of the URL and its dots
            (?:[a-zA-Z0-9]+\.|[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9]\.)*(?:[a-zA-Z0-9]+|[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])
          )
          (?::(\d{1,5})|)             # optional port
          (.*)/xo);                   # path and query parms come last
      for(my $i=0;$i<=$#parts;$i++) {
        print " $i: $parts[$i]\n" if($parts[$i]);
      }
    }
Re: Perl RegEx (url explode)
by U_nix$_@ (Initiate) on Nov 01, 2012 at 23:07 UTC

    Hi,
    thanks. That's one of the ways I tried before. It allows both types, with and without a port, but produces the following output:

    http
    www
    example.de:9944   //this one should be "example.de"
    example
    de:9944           //this one should be only "de"
    ## PORT is empty ##

      (.*) seems to swallow whatever comes after it if the ":" is optional.
      And the port becomes a part of this:

      ((.*)(?:\.)(.*))

      But how to fix it? A fixed set of commonly used top-level domains is not flexible enough.

        Try to match a character class that does not contain ':' (i.e. [^:]):

        use strict;
        use warnings;

        for my $uri( qw(https://www.example.de
                        http://www.example.de
                        https://example.de
                        http://example.de
                        www.example.de
                        example.de:123
                        http://www.example.de:445/can?this=happen&too=1#lalala
                        http://www.example.de/can?this=happen&too=1#foo
                        http://www.example.de:445
                       ) ) {
          print "in ($uri):\n";
          my (@spl) = $uri =~ m|(http(?:s?))?
                                (?:(?:://)?
                                (w{0,3})\.{0,1})?
                                ((.*)(?:\.)([^:/]*))   # match if it is not a ":"
                                (?::(\d{0,10}))?
                               |x;
          print 'out: ', join(', ', map { defined $_ ? $_ : '-' } @spl), "\n\n";
        }

        __DATA__
        in (https://www.example.de):
        out: https, www, example.de, example, de, -

        in (http://www.example.de):
        out: http, www, example.de, example, de, -

        in (https://example.de):
        out: https, , example.de, example, de, -

        in (http://example.de):
        out: http, , example.de, example, de, -

        in (www.example.de):
        out: -, www, example.de, example, de, -

        in (example.de:123):
        out: -, , example.de, example, de, 123

        in (http://www.example.de:445/can?this=happen&too=1#lalala):
        out: http, www, example.de, example, de, 445

        in (http://www.example.de/can?this=happen&too=1#foo):
        out: http, www, example.de, example, de, -

        in (http://www.example.de:445):
        out: http, www, example.de, example, de, 445
        Update: Added '/' to character class and example '#foo'
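        The difference between a greedy `(.*)` and a negated character class can also be seen on its own. This is a minimal sketch on a reduced pattern (only the top-level domain and the optional port, not the full expression from this thread); the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $url = 'example.de:9944';

# Greedy (.*) for the TLD runs to the end of the string, so the optional
# port group has nothing left to match and stays undefined.
my ($tld_greedy, $port_greedy) = $url =~ /\.(.*)(?::(\d+))?/;

# [^:]* stops at the first colon, leaving ":9944" for the port group.
my ($tld_class, $port_class) = $url =~ /\.([^:]*)(?::(\d+))?/;

print "greedy: tld=$tld_greedy port=", $port_greedy // '(undef)', "\n";
print "class:  tld=$tld_class port=",  $port_class  // '(undef)', "\n";
```

        Because the port group is optional, the regex engine has no reason to make the greedy `(.*)` give characters back; the negated class removes that choice by never consuming the colon in the first place.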

        Ok. The spirit reached me. This fixed it:

        ((.*)(?:\.)([a-zA-Z]*))(?::(\d{0,10}))?

        Edit:
        @Perlbotics,
        Your "match if not" version is the cleaner one. Merci.
