regex to match URLs

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: regex to match URLs by ikegami (Patriarch) on Feb 28, 2006 at 20:30 UTC
Regexp::Common::URI should do the trick. Caveats: It only validates absolute URIs. (I think. But that's probably what you want anyway.) If you want to validate URIs of a specific scheme, it has to be one of: fax, file, FTP, gopher, HTTP, news, NNTP, pop, prospero, tel, telnet, tv and WAIS. If you want to validate URIs of any scheme, it will still reject any URI whose scheme is not on that list.
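For illustration, a minimal sketch of the Regexp::Common approach for the HTTP case. It assumes Regexp::Common is installed from CPAN; the helper name looks_like_http_uri and the sample strings are made up for this example. The pattern is anchored by hand because the module's patterns match anywhere in a string by default.

```perl
use strict;
use warnings;
use Regexp::Common qw(URI);

# The HTTP pattern only covers plain http unless -scheme is given.
my $pattern = $RE{URI}{HTTP}{-scheme => 'https?'};

# Anchor at both ends so the whole string has to be a URI.
my $http_re = qr/\A$pattern\z/;

# Hypothetical helper name, not part of Regexp::Common.
sub looks_like_http_uri {
    my ($candidate) = @_;
    return $candidate =~ $http_re ? 1 : 0;
}

for my $candidate ('http://www.example.com/index.html',
                   'www.example.com',
                   'bill://example.com/') {
    printf "%-36s %s\n", $candidate,
        looks_like_http_uri($candidate) ? 'accepted' : 'rejected';
}
```

Only the first string should be accepted: the second has no scheme and the third uses a scheme the HTTP pattern does not cover.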
Re: regex to match URLs by JediWizard (Deacon) on Feb 28, 2006 at 20:29 UTC
Try Regexp::Common::URI.
They say that time changes things, but you actually have to change them yourself. -- Andy Warhol
Re: regex to match URLs by pileofrogs (Priest) on Feb 28, 2006 at 21:22 UTC
You also might want to consider a different approach. It's really hard to define what a "valid" URL is. Maybe you only need http://, or maybe http:// and ftp://, and so on. Then there's the problem of nonstandard URLs that I'm sure someone is using or will start to use. For instance, if Microsoft released a product that used a URL like bill://, you might have to support it, even if it's not in a standard.

Rather than trying to validate the entire URL with one regex, break it into parts, then test them. For instance, test the bit you think is a host name by running gethostbyname() and test the part that names the protocol by running getservbyname(). This takes some of the strain off your regex. The best part is, you don't have to update your script to keep up with changes in the world. If a new bill:// protocol comes out (and you keep your /etc/services file up to date), your script won't miss a beat. Even more likely is a new top-level domain.

Of course, this will impact performance, so you need to ask yourself how fast you need this to be and how well you need it to check the URL. If letting a bad URL through is just a little annoying, it might be easiest to cull out the really egregious offenders and let the slippery ones pass. If, on the other hand, you really suffer when a bad URL makes it past this test, it might be worth the clock cycles.
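A rough sketch of the split-then-test idea, using only Perl built-ins. The helper names split_url and check_url are hypothetical, the splitting pattern is deliberately loose, and the sketch assumes the scheme is listed as a TCP service in the local services database, which will not be true for every scheme.

```perl
use strict;
use warnings;

# Pull out scheme, host and optional port. The pattern is loose on purpose;
# the system lookups in check_url do the real checking.
sub split_url {
    my ($url) = @_;
    my ($scheme, $host, $port) =
        $url =~ m{\A([A-Za-z][A-Za-z0-9+.-]*)://([^/?#:]+)(?::(\d+))?(?:[/?#]|\z)}
        or return;
    return (lc $scheme, lc $host, $port);
}

# Returns nothing if the URL passes, or a message saying what went wrong.
sub check_url {
    my ($url) = @_;
    my ($scheme, $host, $port) = split_url($url)
        or return "cannot split '$url' into scheme and host";

    # Look the scheme up in the services database (e.g. /etc/services).
    my $service_port = getservbyname($scheme, 'tcp');
    return "unknown service '$scheme'" unless defined $service_port;

    # Resolve the host name. This can hit DNS, so it may be slow.
    my $packed_addr = gethostbyname($host);
    return "cannot resolve host '$host'" unless defined $packed_addr;

    return;
}

my $problem = check_url('www.example.com');
print defined $problem ? "rejected: $problem\n" : "looks fine\n";
```

The lookups are where the performance cost mentioned above comes from, so a script checking many URLs might want to cache the results per scheme and per host.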
Re: regex to match URLs by atcroft (Abbot) on Feb 28, 2006 at 20:33 UTC
You may wish to look at the URI module if you are looking to get specific components. Also, the documentation for that module includes a regex that can be used to split a URI into its parts (something also handled by URI::Split's uri_split() function). You might be able to adapt that regex to ensure what you want is there. Hope that helps.
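As a sketch of what such a splitting regex looks like, here is the generic five-part pattern from RFC 3986 appendix B, written with non-capturing groups and anchors. The helper name split_uri is made up for this example. Note that the pattern matches almost any string, so it splits but does not validate.

```perl
use strict;
use warnings;

# Split a URI reference into scheme, authority, path, query and fragment.
# Parts that are absent come back as undef.
sub split_uri {
    my ($uri) = @_;
    my %part;
    @part{qw(scheme authority path query fragment)} =
        $uri =~ m{\A(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?\z}s;
    return %part;
}

my %part = split_uri('http://www.perlmonks.org/index.pl?node_id=533');
for my $name (qw(scheme authority path query fragment)) {
    printf "%-9s %s\n", $name, defined $part{$name} ? $part{$name} : '(none)';
}
```

To adapt it, test the pieces you care about afterwards, for example that the scheme is defined and the authority is not empty.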
Re^2: regex to match URLs by ikegami (Patriarch) on Feb 28, 2006 at 20:43 UTC
URI will not do the trick, since it accepts both absolute and relative URIs, and it doesn't do validation. For example, "www.example.com" is accepted (even though it's not a valid absolute URI), ":80" and "http://:80" are accepted (even though they are not valid URIs).
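If all you need is to reject cases like the three above, a hand-rolled check for "has a scheme and a non-empty host" may be enough. This is only a rough sanity check in plain Perl, not a full RFC 3986 validator, and the helper name is_absolute_with_host is made up for this example.

```perl
use strict;
use warnings;

# Accept only strings of the form scheme://host..., with a non-empty host.
sub is_absolute_with_host {
    my ($candidate) = @_;
    return $candidate =~ m{
        \A
        [A-Za-z][A-Za-z0-9+.-]* :    # scheme, then a colon
        //
        (?: [^/?\#@]* @ )?           # optional userinfo
        [^/?\#:@]+                   # host, must not be empty
        (?: : \d* )?                 # optional port
        (?: [/?\#] .* )?             # path, query, fragment
        \z
    }xs ? 1 : 0;
}

for my $candidate ('http://www.example.com/', 'www.example.com', ':80', 'http://:80') {
    printf "%-26s %s\n", $candidate,
        is_absolute_with_host($candidate) ? 'accepted' : 'rejected';
}
```

It says nothing about whether the scheme is known or the host exists, so it complements the module-based checks rather than replacing them.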
Re: regex to match URLs by spiritway (Vicar) on Mar 01, 2006 at 06:00 UTC
For a really, REALLY, REALLY complete regex, there is the one Abigail posted on comp.lang.perl.misc (it's quite long, about 11 kB). Untested.
Re: regex to match URLs by moklevat (Priest) on Feb 28, 2006 at 20:34 UTC
I found this (untested) one for North American URLs by Brad Dobyns in the Regular Expression Library. Perhaps you can modify it for your specific purpose. `^(((ht|f)tp(s?))\://)?(www.|[a-zA-Z].)[a-zA-Z0-9\-\.]+\.(com|edu|gov|mil|net|org|biz|info|name|museum|us|ca|uk)(\:[0-9]+)(/($|[a-zA-Z0-9\.\,\;\?\'\\\+&%\$#\=~_\-]+))$` Updated: Fixed wrapping in the code tags. Thanks for the tip ikegami.
Re: regex to match URLs by holli (Abbot) on Feb 28, 2006 at 21:11 UTC
I'd simply ping the given URL. If it answers, fine; if not, reject it. holli, /regexed monk/
Re^2: regex to match URLs by Anonymous Monk on Feb 28, 2006 at 21:15 UTC
You can't ping http://www.perlmonks.org/index.pl?node_id=533 even though it's valid.
Re: regex to match URLs by Anonymous Monk on Mar 01, 2006 at 00:21 UTC
You can also ignore the verification of URLs entirely, and just carry the assumption that you were given a proper URL as far as you can -- until something fails that actually wanted to do something practical with your URL. If your program is interactive, and not just feeding a database, please consider this option. You'll need to gracefully recover from errors anyway, so why engage in duplicate efforts?
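A small sketch of that idea, assuming the practical use is fetching the URL over HTTP. It uses HTTP::Tiny (in core since Perl 5.14) as a stand-in for whatever your program really does with the URL; the helper name try_fetch is made up for this example.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Use the URL for real and report what happened, with no check up front.
sub try_fetch {
    my ($url) = @_;
    my $response = HTTP::Tiny->new(timeout => 10)->get($url);
    return (1, $response->{content}) if $response->{success};

    # HTTP::Tiny reports errors raised while making the request (such as a
    # URL it cannot parse) as status 599, with the error text in content.
    my $why = $response->{status} == 599
        ? $response->{content}
        : "$response->{status} $response->{reason}";
    return (0, $why);
}

my ($ok, $result) = try_fetch('not a url');
print $ok ? "fetched\n" : "could not fetch: $result\n";
```

The error path here is the same one you would need for a well-formed URL that points at a dead server, which is the point about not duplicating effort.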