Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
I have a URL form field, and for my script to work I need to ensure that the URL is in fact a proper URL.
if ($url =~ m/regex/) { .. }
I'm sure someone has a suitable regex for this already.
Re: regex to match URLs
by ikegami (Patriarch) on Feb 28, 2006 at 20:30 UTC
Re: regex to match URLs
by JediWizard (Deacon) on Feb 28, 2006 at 20:29 UTC
Try Regexp::Common::URI.
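A minimal sketch of how that might look, assuming the Regexp::Common distribution (which provides Regexp::Common::URI) is installed. The pattern itself is unanchored, so anchor it yourself for whole-string validation:

```perl
use strict;
use warnings;
use Regexp::Common qw(URI);

# $RE{URI}{HTTP} matches http:// URIs; the -scheme flag widens it to https too.
my $url = 'http://www.example.com/index.html';
if ($url =~ /^$RE{URI}{HTTP}{-scheme => qr{https?}}$/) {
    print "looks like a valid HTTP(S) URI\n";
}
```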
They say that time changes things, but you actually have to change them yourself. Andy Warhol
Re: regex to match URLs
by pileofrogs (Priest) on Feb 28, 2006 at 21:22 UTC
You also might want to consider a different approach. It's really hard to define what a "valid" URL is. Maybe you only need http://, or maybe http:// and ftp://, and so on.
Then there's the problem of non-standard URLs that I'm sure someone is using or will start to use. For instance, if Microsoft released a product that used a URL scheme like bill://, you might have to support it, even if it's not in a standard.
Rather than trying to validate the entire URL with one regex, break it into parts, then test them. For instance, test the bit you think is a host name by running gethostbyname(), and test the part that names the protocol by running getservbyname().
This takes some of the strain off your regex. The best part is, you don't have to update your script to keep up with changes in the world. If a new bill:// protocol comes out (and you keep your /etc/services file up to date), your script won't miss a beat. Even more likely is a new top-level domain.
Of course, this will impact performance, so you need to ask yourself how fast you need this to be and how well you need it to check the URL. If letting a bad URL through is just a little annoying, it might be easiest to cull out the really egregious offenders and let the slippery ones pass. If on the other hand, you really suffer if a bad URL makes it past this test, it might be worth the clock cycles.
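The decomposition approach described above might be sketched like this (the loose splitting pattern and the helper name are just illustrations, not a complete parser, and the checks depend on your resolver and /etc/services):

```perl
use strict;
use warnings;

# Hypothetical helper: pull the scheme and host out with a deliberately
# loose pattern, then let the system do the real checking.
sub url_parts_look_sane {
    my ($url) = @_;
    my ($scheme, $host) = $url =~ m{^([a-z][a-z0-9+.-]*)://([^/:?#]+)}i
        or return 0;
    return 0 unless getservbyname(lc $scheme, 'tcp');  # known protocol?
    return 0 unless gethostbyname($host);              # resolvable host?
    return 1;
}

print url_parts_look_sane('http://www.example.com/') ? "plausible\n" : "rejected\n";
```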
Re: regex to match URLs
by atcroft (Abbot) on Feb 28, 2006 at 20:33 UTC
You may wish to look at the URI module if you are looking to extract specific components. The documentation for that module also includes a regex that can be used to split a URI into its parts (something also handled by URI::Split's uri_split() function); you might be able to adapt that regex to ensure what you want is there.
Hope that helps.
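For example, uri_split() from the URI distribution returns the five generic components; note that it splits without validating, so you still have to check the pieces you care about:

```perl
use strict;
use warnings;
use URI::Split qw(uri_split);

# Splits into (scheme, authority, path, query, fragment).
my ($scheme, $auth, $path, $query, $frag) =
    uri_split('http://www.example.com:8080/index.html?q=perl');
# $scheme is "http", $auth is "www.example.com:8080",
# $path is "/index.html", $query is "q=perl", $frag is undef
```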
URI will not do the trick, since it accepts both absolute and relative URIs, and it doesn't do validation.
For example, "www.example.com" is accepted (even though it's not a valid absolute URI), ":80" and "http://:80" are accepted (even though they are not valid URIs).
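A quick illustration of that caveat, assuming the URI module is installed: neither constructor call below complains, because URI simply treats the first string as a relative reference and never validates the second.

```perl
use strict;
use warnings;
use URI;

my $rel = URI->new('www.example.com');   # no scheme, accepted anyway
print defined $rel->scheme ? "absolute\n" : "no scheme - treated as relative\n";

my $odd = URI->new('http://:80');        # empty host, also accepted
print $odd->as_string, "\n";
```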
Re: regex to match URLs
by spiritway (Vicar) on Mar 01, 2006 at 06:00 UTC
For a really, REALLY, REALLY complete regex, there is the one Abigail posted on comp.lang.perl.misc (it's quite long):
Untested
Re: regex to match URLs
by moklevat (Priest) on Feb 28, 2006 at 20:34 UTC
I found this (untested) one for North American URLs by Brad Dobyns in the Regular Expression Library
Perhaps you can modify it for your specific purpose.
^(((ht|f)tp(s?))\://)?(www.|[a-zA-Z].)[a-zA-Z0-9\-\.]+\.(com|edu|gov|mil|net|org|biz|info|name|museum|us|ca|uk)(\:[0-9]+)*(/($|[a-zA-Z0-9\.\,\;\?\'\\\+&%\$#\=~_\-]+))*$
Updated: Fixed wrapping in the code tags. Thanks for the tip ikegami.
Re: regex to match URLs
by holli (Abbot) on Feb 28, 2006 at 21:11 UTC
I'd simply ping the given URL. If it answers, fine; if not, reject it.
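One way to do that check is a HEAD request rather than an ICMP ping, for example with HTTP::Tiny (core in modern Perls, though not in 2006-era ones). Be aware this conflates "invalid URL" with "server currently down": a perfectly valid URL gets rejected if the host doesn't answer within the timeout.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# "Ping" in the loose sense: does anything answer at this URL?
my $url = 'http://www.example.com/';
my $res = HTTP::Tiny->new(timeout => 5)->head($url);
print $res->{success} ? "reachable\n" : "rejected: $res->{status}\n";
```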
Re: regex to match URLs
by Anonymous Monk on Mar 01, 2006 at 00:21 UTC
You can also ignore the verification of URLs entirely, and just carry the assumption that you were given a proper URL as far as you can -- until something fails that actually wanted to do something practical with your URL.
If your program is interactive, and not just feeding a database, please consider this option. You'll need to recover gracefully from errors anyway, so why duplicate the effort?