http://www.perlmonks.org?node_id=1008114

Doozer has asked for the wisdom of the Perl Monks concerning the following question:

I have scoured the Perl user guides and beginner tutorials and couldn't find an answer to my question (This is just a disclaimer in case anyone tells me to 'read the fine manual' ;).

I have a few scripts that interact with each other via system commands. One script passes a website url to another script using the following system command:

perl getrequest.pm http://www.google.com

In the 'getrequest.pm' script I have defined the following variable:

my $url = $ARGV[0];

I have used $ARGV[0] because later on the script gets passed another argument as $ARGV[1].

I am then trying to split the $url variable further so that I can get the bare host name, 'google.com' for example (I need both the bare host name and the full URL for further processing). I have tried a number of ways of using split, but none of them seem to work. I would really appreciate some advice on whether split is the right function to use.

Re: Unable to split $ARGV[0] variable. Can it be done?
by space_monk (Chaplain) on Dec 10, 2012 at 15:47 UTC

    You could use split, but instead of RTFM I always say: if in doubt, see whether someone has already written a library to do the job you need the code to do.

    Break up URIs? Maybe URI::Split, or simply URI itself, will do the job.

    A Monk aims to give answers to those who have none, and to learn from those who know more.
Re: Unable to split $ARGV[0] variable. Can it be done?
by McDarren (Abbot) on Dec 10, 2012 at 15:50 UTC
    ..so that I can get the bare host name 'google.com'

    Um, google.com is not a hostname, it's a domain name.
    Also, you start your example with 'www.google.com', and then you say you want 'google.com'.
    Is that correct, or was it a typo?

    I'll assume you want to extract the Fully Qualified Domain Name.

    ..appreciate some advice and whether split is the right function to use or not?

    Although you could get what you want with split, I wouldn't consider it the best thing to use here, especially if you're dealing with more complex URLs.
    Personally, I'd use URI::Split:

    use URI::Split qw/uri_split/;

    my $url = 'http://www.google.com';
    my ($proto, $fqdn) = uri_split($url);
    print "Protocol:$proto Domain:$fqdn\n";
    Prints:
    Protocol:http Domain:www.google.com

    Cheers,
    Darren
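For reference, uri_split actually returns five parts, and URI::Split also exports uri_join to put them back together. The following is only a minimal sketch of that round trip; the example URL and the variable names are assumptions made up for illustration, not something from the original post.

```perl
use strict;
use warnings;
use URI::Split qw(uri_split uri_join);

# Hypothetical URL chosen for illustration; it is not from the thread.
my $url = 'http://www.example.com/search?q=perl#top';

# uri_split returns five parts: scheme, authority, path, query, fragment.
my ($scheme, $auth, $path, $query, $frag) = uri_split($url);
print "scheme=$scheme auth=$auth path=$path query=$query frag=$frag\n";

# uri_join reassembles the parts, so swapping the scheme is a one-liner.
my $https_url = uri_join('https', $auth, $path, $query, $frag);
print "$https_url\n";
```

Keeping the parts separate like this makes it easy to retry the same host under a different scheme without any string surgery.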

      Sorry, domain name was what I meant yes. No it wasn't a typo. 'http://www.google.com' is passed in to the script and a 'get' request is made against that URL using LWP. If the get request fails, it then tries a different prefix 'https://www.google.com' or 'http://google.com' for example. I want to split the domain name away from the prefix so I can chop and change the combinations as I please. It may be easier to have just the domain name passed in to the script and then the script can handle ALL of the prefixes itself.

      I appreciate all the responses and am currently working through the suggestions to see what I can work with.
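If the full URL keeps being passed in, one way to "chop and change" the prefixes might look like the sketch below. It assumes the input always carries an http or https scheme (URI only offers host for such URLs), and candidate_urls is a hypothetical helper name, not part of any module.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: list the URLs worth trying for one input URL.
# Assumes the input has an http or https scheme, otherwise ->host is unavailable.
sub candidate_urls {
    my ($input) = @_;
    my $host = URI->new($input)->host;      # e.g. 'www.google.com'
    (my $bare = $host) =~ s/\Awww\.//i;     # e.g. 'google.com'

    my @urls;
    for my $scheme (qw(http https)) {
        for my $name ("www.$bare", $bare) {
            push @urls, "$scheme://$name";
        }
    }
    return @urls;
}

my @urls = candidate_urls('http://www.google.com');
print "$_\n" for @urls;
```

Each candidate can then be handed to the LWP get request in turn until one succeeds.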

        It may be easier to have just the domain name passed in to the script and then the script can handle ALL of the prefixes itself.

        Yeah, that sounds sensible.
        Here is an example of how you might implement that approach:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use LWP::UserAgent;    # the code below calls LWP::UserAgent directly

        DOMAIN: while (my $domain = <DATA>) {
            chomp($domain);
            for my $protocol (qw/http https/) {
                # Try the bare domain first, then the common subdomains.
                next DOMAIN if test_url("$protocol://$domain");
                for my $sub (qw/www web/) {
                    next DOMAIN if test_url("$protocol://$sub.$domain");
                }
            }
            print "Couldn't get anything from $domain\n";
        }

        # Returns true if a GET request against the URL succeeds.
        sub test_url {
            my $url = shift;
            print "Trying $url ...";
            my $ua = LWP::UserAgent->new(
                timeout  => 5,
                agent    => 'Mozilla/5.0',
                ssl_opts => { verify_hostname => 0 },
            );
            my $response = $ua->get($url);
            if ($response->is_success) {
                print "OK\n";
                return 1;
            }
            else {
                print "FAILED because " . $response->status_line . "\n";
                return;
            }
        }

        __DATA__
        google.com
        apple.com
        fred.com
        dschjksdbckjqh.com
        Output:
        Trying http://google.com ...OK
        Trying http://apple.com ...OK
        Trying http://fred.com ...OK
        Trying http://dschjksdbckjqh.com ...FAILED because 500 Can't connect to dschjksdbckjqh.com:80 (Bad hostname 'dschjksdbckjqh.com')
        Trying http://www.dschjksdbckjqh.com ...FAILED because 500 Can't connect to www.dschjksdbckjqh.com:80 (Bad hostname 'www.dschjksdbckjqh.com')
        Trying http://web.dschjksdbckjqh.com ...FAILED because 500 Can't connect to web.dschjksdbckjqh.com:80 (Bad hostname 'web.dschjksdbckjqh.com')
        Trying https://dschjksdbckjqh.com ...FAILED because 500 Can't connect to dschjksdbckjqh.com:443 (getaddrinfo: nodename nor servname provided, or not known)
        Trying https://www.dschjksdbckjqh.com ...FAILED because 500 Can't connect to www.dschjksdbckjqh.com:443 (getaddrinfo: nodename nor servname provided, or not known)
        Trying https://web.dschjksdbckjqh.com ...FAILED because 500 Can't connect to web.dschjksdbckjqh.com:443 (getaddrinfo: nodename nor servname provided, or not known)
        Couldn't get anything from dschjksdbckjqh.com

        HTH,
        Darren

Re: Unable to split $ARGV[0] variable. Can it be done?
by Your Mother (Archbishop) on Dec 10, 2012 at 19:58 UTC

    .pm is usually for modules, which don't usually take arguments in the way you seem to want. So I'd recommend changing the name to getrequest.pl, or something more descriptive, as it sounds like it's doing a lot. I would highly recommend URI, as mentioned by space_monk:

    use URI;

    my $url = URI->new( $ARGV[0] || die "Give a URI\n" );
    # Anchor the match so only http and https schemes are accepted.
    ($url->scheme // '') =~ /\Ahttps?\z/
        or die "This is not a URL we can use...\n";
    print $url, $/, $url->host, $/, $url->path, $/;

    __END__
    perl ~/getrequest.pl http://nasa.org/moon
    http://nasa.org/moon
    nasa.org
    /moon

      Thanks to everyone for the help and suggestions on this. I have sorted it now using inspiration from the responses.

      I didn't use URI in the end, as it was overkill for what I actually wanted to do. I will, however, keep it in mind if we decide to evolve our current test

        I'm glad you worked it out, but I'd caution you against thinking that one line of code is overkill. URI is an excellent package (a set of packages, really) that is not going to surprise you or let you down on edge cases, and you can see from the example how dead simple it is to apply. This kind of OOP in short, straightforward code isn't overkill so much as it is applying standards and preparing for future growth, which visits short scripts much more often than intended.
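To make the edge-case point concrete, here is a small sketch comparing a quick regex (the one suggested elsewhere in this thread) with URI on a URL that carries credentials and a port. The URL itself is a made-up example, so treat this as an illustration rather than a benchmark of either approach.

```perl
use strict;
use warnings;
use URI;

# Hypothetical URL with credentials and a port, for illustration only.
my $url = 'http://user:secret@www.example.com:8080/moon';

# Regex approach: grabs everything between '//' and the next slash.
my ($by_regex) = $url =~ m{https?://([^/]+)};

# URI separates the userinfo, host, and port for us.
my $uri = URI->new($url);

print "regex: $by_regex\n";
print "URI:   ", $uri->host, " (port ", $uri->port, ")\n";
```

The regex hands back the credentials and port along with the host, while URI returns just the host name.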

Re: Unable to split $ARGV[0] variable. Can it be done?
by Utilitarian (Vicar) on Dec 10, 2012 at 15:16 UTC
    perl getrequest.pm http://www.google.com

    my $url = $ARGV[0];
    my ($host) = $url =~ /https?:\/\/([^\/]+)/;
    print "Good ",qw(night morning afternoon evening)[(localtime)[2]/6]," fellow monks."
Re: Unable to split $ARGV[0] variable. Can it be done?
by Anonymous Monk on Dec 10, 2012 at 15:15 UTC

    Split does not care where you got your string data from; that would be silly.

    Print out the contents of your variable just before you do the split, and show the actual code that includes the split. Comparing the two with the documentation on how split works should then make the problem obvious.
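As a rough sketch of that debugging approach, the snippet below prints the string and then splits it on '/'. The hard-coded URL stands in for $ARGV[0] and is only an assumption for the example; the same code behaves identically on a command-line argument.

```perl
use strict;
use warnings;

# Stand-in for $ARGV[0]; split treats it like any other string.
my $url = 'http://www.google.com/some/path';
print "About to split: '$url'\n";

# Splitting on '/' yields: 'http:', '', 'www.google.com', 'some', 'path'.
my @parts = split m{/}, $url;
my $host  = $parts[2];

print "Host: $host\n";
```

The empty second field comes from the two adjacent slashes after the scheme, which is why the host ends up at index 2 rather than 1.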