PerlMonks  

Re: Grabbing numbers from a URL

by kevbot (Vicar)
on Jul 09, 2017 at 05:10 UTC


in reply to Grabbing numbers from a URL

Hello htmanning,

Take a look at the documentation for regex quantifiers and capture groups.

Because of the + after the group, your pattern matches the 4-digit group one or more times. For example, something-1234.html will match, and so will something-12341234.html. To match only 4 digits, your pattern can be simplified to:

$url=~/(\d{4})\.htm/i;
Note that the + has been removed from your regex. Also, as your code is written, $num will not contain the number; it will contain the whole URL. To get just the number, you need the value of the first capture group:
$num = $1;
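
As a quick check of the two points above, here is a minimal sketch that compares the pattern with and without the + and reads $1 only after a successful match. The helper name capture_num and the sample file names are made up for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use v5.10;

# Return the text captured by the first group of $re, or undef when $url
# does not match. $1 is only meaningful after a successful match.
sub capture_num {
    my ($url, $re) = @_;
    return $url =~ $re ? $1 : undef;
}

my $with_plus    = qr/(\d{4})+\.htm/i;   # pattern with the repeated group
my $without_plus = qr/(\d{4})\.htm/i;    # simplified pattern

# Made-up file names, for illustration only.
for my $url (qw(something-1234.html something-12345678.html no-digits.html)) {
    my $plus    = capture_num($url, $with_plus)    // 'no match';
    my $no_plus = capture_num($url, $without_plus) // 'no match';
    say "$url: with + => $plus, without + => $no_plus";
}
```

A repeated capture group keeps only its last repetition, so for these inputs both patterns leave the same final four digits in $1. The + changes how much text is matched, not what ends up captured.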

To allow for 4 or more digits, use the following:

$url=~/(\d{4,})\.htm/i;

To allow for only 4 or 5 digits, use the following:

$url=~/(\d{4,5})\.htm/i;
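
To see how the three quantifier forms behave side by side, the sketch below runs them over a few made-up file names. None of the patterns is anchored on the left, so a longer run of digits can still match through its trailing digits. The last entry, with a (?<!\d) lookbehind, is not from the post above; it is one possible way to reject longer runs, included as an assumption in case that matters for your URLs. The helper name grab is hypothetical.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use v5.10;

# The quantifier forms discussed above, in order, plus one stricter variant.
my @patterns = (
    [ 'exactly 4'     => qr/(\d{4})\.htm/i          ],
    [ '4 or more'     => qr/(\d{4,})\.htm/i         ],
    [ '4 or 5'        => qr/(\d{4,5})\.htm/i        ],
    # Assumption, not in the original post: refuse a digit just before the run.
    [ 'strict 4 or 5' => qr/(?<!\d)(\d{4,5})\.htm/i ],
);

# Hypothetical helper: return the capture for the pattern with the given
# label, or undef when the URL does not match.
sub grab {
    my ($label, $url) = @_;
    my ($entry) = grep { $_->[0] eq $label } @patterns;
    return $url =~ $entry->[1] ? $1 : undef;
}

for my $url (qw(a-123.html a-1234.html a-12345.html a-123456.html)) {
    for my $entry (@patterns) {
        my $got = grab($entry->[0], $url) // 'no match';
        say sprintf '%-14s %-14s %s', $url, $entry->[0], $got;
    }
}
```

If the URLs never contain more than 5 digits in a row, the plain {4,5} form is enough and the lookbehind can be left out.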

UPDATE: I really like the named capture groups feature that comes with Perl 5.10 and later. Named groups can be overkill when you are only dealing with one or two groups, but they can make the code much clearer when you are dealing with multiple capture groups.

#!/usr/bin/env perl
use strict;
use warnings;
use v5.10;

my $url = 'something-12345.html';
$url =~ /(?<num>\d{4,5})\.htm/i;
my $num = $+{num};
print "$num\n";

exit;
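
To illustrate the point about multiple groups, here is a sketch that splits a file name into three named parts. The helper name parse_page_name, the group names slug and ext, and the assumed file name layout (optional text, a hyphen, 4 or 5 digits, then .htm or .html) are all my own assumptions, not something taken from your code.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use v5.10;

# Hypothetical helper: split a name such as 'some-thing-1234.html' into an
# optional text prefix, a 4-5 digit number, and the extension.
# Returns a hash reference, or nothing when the name does not fit.
sub parse_page_name {
    my ($name) = @_;
    return unless $name =~ /^(?:(?<slug>.*)-)?(?<num>\d{4,5})\.(?<ext>html?)$/i;
    return {
        slug => $+{slug} // '',   # empty when there is no text prefix
        num  => $+{num},
        ext  => $+{ext},
    };
}

for my $name (qw(1234.html something-something-1234.html something-12345.htm)) {
    my $page = parse_page_name($name) or next;
    say "$name: slug='$page->{slug}' num=$page->{num} ext=$page->{ext}";
}
```

With three groups, $+{slug}, $+{num}, and $+{ext} are easier to keep straight than $1, $2, and $3, and they do not shift if a group is added later.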

Re^2: Grabbing numbers from a URL
by htmanning (Friar) on Jul 09, 2017 at 05:52 UTC
    Thanks so much. Now I'm thoroughly confused. Apparently the code I posted grabs the contents of $itemID, which includes a number but is the entire string, like something-1234.html. What I really need to do is simply capture the 4 digits in the string, or 5 digits if it is a 5-digit number. There would be no other numbers in any of the URLs. Perhaps there is a better way to achieve this. Can you tell me what each of these lines does? I think the first line takes the query string and sets it to $url. I'm not sure what the second line does.
    $url=q|$itemID|;
    $url=~/(\d{4})+\.htm/i;
    Here's what I'm doing. I'm using this recipe in the .htaccess file to run a perl script to display a page, but it displays a .html file in the browser.
    RewriteEngine on
    RewriteRule ^(.*)$ $1 [nc]
    RewriteRule ^(.*)$ /cgi-bin/getpage.pl?itemID=$1
    So in the script getpage.pl, I grab the $itemID with the code above and turn it into a filename. I search the database for a filename field that includes the page. Sometimes it's 1234.html, other times it's something-something-1234.html. It would really be best to simply grab the 1234 but I don't know how to do that.
      According to Quote and Quote Like Operators, the q operator does not interpolate the string. So, this code
      $url = q|$itemID|;
      results in $url containing the literal string $itemID, not the contents of $itemID. To test this out, run the following code:
      #!/usr/bin/env perl
      use strict;
      use warnings;

      my $itemID = 'something-12345.html';
      my $url = q|$itemID|;
      $url =~ /(\d{4,5})\.htm/i;
      print "Item ID: $itemID\n";
      print " URL: $url\n";
      my $num = $1;
      print " NUM: $1\n";

      exit;
      You should get a warning about an uninitialized value, since there is no number in the URL and so $1 is never set. If you change the code to use the qq operator, then the string is interpolated, the match succeeds, and you get the number in the $1 capture group variable.
      #!/usr/bin/env perl
      use strict;
      use warnings;

      my $itemID = 'something-12345.html';
      my $url = qq|$itemID|;
      $url =~ /(\d{4,5})\.htm/i;
      print "Item ID: $itemID\n";
      print " URL: $url\n";
      my $num = $1;
      print " NUM: $1\n";

      exit;
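
      Since $itemID already holds the string, one option is to skip the quoting step and match against it directly. The sketch below does that with a list-context match, which hands back the capture (or nothing) without touching $1. The helper name item_number is hypothetical, and the sample value stands in for whatever your script receives in the itemID parameter.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use v5.10;

# Hypothetical helper: pull the 4 or 5 digits that precede ".htm" out of
# a string. In list context a match returns its captures, so $num stays
# undef when the pattern does not match.
sub item_number {
    my ($item) = @_;
    my ($num) = $item =~ /(\d{4,5})\.htm/i;
    return $num;
}

# Stand-in for the value of the itemID query parameter.
my $itemID = 'something-something-1234.html';

my $literal      = q|$itemID|;    # the 7 characters "$itemID", not interpolated
my $interpolated = qq|$itemID|;   # a copy of the contents of $itemID

say 'direct:       ', item_number($itemID)       // 'no match';
say 'qq|$itemID|:  ', item_number($interpolated) // 'no match';
say 'q|$itemID|:   ', item_number($literal)      // 'no match';
```

      Checking whether the returned value is defined before using it in the database lookup avoids carrying a stale or empty number forward when the URL has no digits.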
      If you want a description of a regular expression in plain English, then you can use the YAPE::Regex::Explain module. Running this one-liner on your regex
      perl -MYAPE::Regex::Explain -E 'say YAPE::Regex::Explain->new(q{(\d{4})+\.htm})->explain();'
      gives a token-by-token explanation of the pattern. You can compare this to the explanation of one of the modified regex patterns that I gave you. For example,
      perl -MYAPE::Regex::Explain -E 'say YAPE::Regex::Explain->new(q{(\d{4,5})\.htm})->explain();'
