Extracting Page Name

JimStone has asked for the wisdom of the Perl Monks concerning the following question:

Re: Extracting Page Name by ww (Archbishop) on Apr 26, 2012 at 23:16 UTC
The something simple may be that the 'pagename' will follow the last slash. The catch: it may be followed by any of several things: a colon [Note 1] if it's followed by a port number; a question mark, for several possible uses; and perhaps others that I'm blanking on just now. But regardless, the entity from the last slash, through a period, to the next punctuation should be what you're looking for. And to broaden the hint a bit further, the regex documentation and tutorials here will show you precisely how to obtain it. Update: [Note 1] See the correction (++) by quester immediately below. Aargh.
Re^2: Extracting Page Name by quester (Vicar) on Apr 27, 2012 at 06:34 UTC
"... a colon, if it's followed by a port number..." Minor nit: the colon and port number come just after the hostname in a URL, not after the page name. For example, consider the port 8080 in `http://www.example.com:8080/pagename.html`. The question mark following the page name in a URL starts a list of parameters being passed from the browser to the script running on the server. The parameter values can be more or less anything; by convention spaces will have been replaced by plus signs, but otherwise almost anything goes, including colons. For example, `http://www.example.com/filename.pl?credentials=myuserid:zomg_dont_send_passwords_in_the_clear`
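To make the layout quester describes concrete, here is a minimal sketch that splits a URL with the generic pattern from RFC 3986, Appendix B, and then pulls the port out of the authority part and the page name out of the path. The helper name `url_parts` and its hash keys are made up for illustration; nothing in this thread uses them.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): split a URL into its generic
# parts using the pattern from RFC 3986, Appendix B.
sub url_parts {
    my ($url) = @_;
    my ($scheme, $authority, $path, $query, $fragment) = $url =~
        m{^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?$};
    return unless defined $path;    # no match, e.g. an embedded newline

    # The port, if present, trails the hostname inside the authority part.
    my ($port) = defined $authority ? $authority =~ /:(\d+)$/ : ();

    # The page name is whatever follows the last slash of the path.
    my ($page) = $path =~ m{([^/]+)$};

    return {
        scheme => $scheme,
        port   => $port,
        page   => $page,
        query  => $query,
    };
}

my $parts = url_parts('http://www.example.com:8080/pagename.html');
print "$parts->{page} on port $parts->{port}\n";   # pagename.html on port 8080
```

Because the query string is split off before the page name is taken, a colon inside a parameter value (as in the `credentials=` example) is never mistaken for a port.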
Re: Extracting Page Name by choroba (Cardinal) on Apr 26, 2012 at 23:20 UTC
You can use a regular expression. It matches non-slash characters up to the end of the URL. `my ($pagename) = $url =~ m{([^/]+)$};`
Re^2: Extracting Page Name by ww (Archbishop) on Apr 27, 2012 at 13:54 UTC
... but has a (greedy) failure mode: `C:\>perl -e "my $url = 'http://www.perlmonks.com/index.pl?node_id=967484'; my ($pagename) = $url =~ m{([^/]+)$}; print $pagename;"` prints `index.pl?node_id=967484`
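One possible way around that failure mode, offered as a sketch rather than as what choroba or ww had in mind, is to exclude `?` and `#` from the captured characters and let the rest of the pattern swallow any query string or fragment. The sub name `page_name` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): like m{([^/]+)$}, but the
# capture stops at the first '?' or '#', so a trailing query string or
# fragment is matched and discarded instead of being captured.
sub page_name {
    my ($url) = @_;
    my ($pagename) = $url =~ m{([^/?#]+)(?:[?#].*)?$};
    return $pagename;    # undef when the URL ends in a slash
}

print page_name('http://www.perlmonks.com/index.pl?node_id=967484'), "\n";  # index.pl
```

Like the original regex, this still returns the hostname for a URL with no path at all (for example `http://www.example.com`), so it suits input that is known to contain a page name.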
Re^2: Extracting Page Name by JimStone (Initiate) on Apr 27, 2012 at 00:07 UTC
Thanks everyone for the quick replies. This regular expression is just what I needed to see what I was doing wrong.
Re: Extracting Page Name by Marshall (Canon) on Apr 26, 2012 at 23:31 UTC
There are many modules that can deal with web pages. A very easy one is LWP::Simple. There are many others! Get started, then show us your code and where you are having trouble. Of course, test the URL that you are trying to get by using your normal Web browser. If it can't "get" it, Perl can't either.

