PerlMonks  

regex to match URLs

by Anonymous Monk
on Feb 28, 2006 at 20:09 UTC ( [id://533485]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a URL form field, and for my script to work I need to ensure that the URL is in fact a proper URL.

if ($url =~ m/regex/) { .. }

I'm sure someone has a suitable regex for this already.

Replies are listed 'Best First'.
Re: regex to match URLs
by ikegami (Patriarch) on Feb 28, 2006 at 20:30 UTC

    Regexp::Common::URI should do the trick.

    Caveats:

    • It only validates absolute URIs. (I think. But that's probably what you want anyway.)

    • If you want to validate URIs of a specific scheme, it has to be one of: fax, file, FTP, gopher, HTTP, news, NNTP, pop, prospero, tel, telnet, tv and WAIS.

    • If you want to validate URIs of any scheme, it will fail if the URI is not of one of the following schemes: fax, file, FTP, gopher, HTTP, news, NNTP, pop, prospero, tel, telnet, tv and WAIS.
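A minimal sketch of the Regexp::Common::URI approach, assuming the form field should hold an http or https URL. The helper name is_http_url is made up for illustration; the module comes from CPAN and is not part of the Perl core.

```perl
use strict;
use warnings;
use Regexp::Common qw(URI);

# The HTTP pattern matches only the 'http' scheme by default;
# -scheme widens it here so that https is accepted too.
my $http_re = $RE{URI}{HTTP}{-scheme => 'https?'};

# Hypothetical helper: true if the whole string is an absolute
# http:// or https:// URI according to Regexp::Common::URI.
sub is_http_url {
    my ($url) = @_;
    return 0 unless defined $url;
    # \A and \z anchor the match so nothing may precede or follow the URI.
    return $url =~ /\A$http_re\z/ ? 1 : 0;
}
```

With this in place, the test from the question becomes if (is_http_url($url)) { .. }. A bare host name such as www.example.com is rejected because it has no scheme.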

Re: regex to match URLs
by JediWizard (Deacon) on Feb 28, 2006 at 20:29 UTC

    Try Regexp::Common::URI.


    They say that time changes things, but you actually have to change them yourself.

    —Andy Warhol

Re: regex to match URLs
by pileofrogs (Priest) on Feb 28, 2006 at 21:22 UTC

    You also might want to consider a different approach. It's really hard to define what a "valid" URL is. Maybe you only need http://, or maybe http:// and ftp://, and so on. Then there's the problem of nonstandard URLs that I'm sure someone is using or will start to use. For instance, if Microsoft released a product that used a URL like bill://, you might have to support it, even if it's not in a standard.

    Rather than trying to validate the entire URL with one regex, break it into parts, then test them. For instance, test the bit you think is a host name by running gethostbyname() and test the part that names the protocol by running getservbyname().

    This takes some of the strain off your regex. The best part is, you don't have to update your script to keep up with changes in the world. If a new bill:// protocol comes out (and you keep your /etc/services file up to date), your script won't miss a beat. Even more likely is a new top-level domain.

    Of course, this will impact performance, so you need to ask yourself how fast you need this to be and how well you need it to check the URL. If letting a bad URL through is just a little annoying, it might be easiest to cull out the really egregious offenders and let the slippery ones pass. If, on the other hand, you really suffer when a bad URL makes it past this test, it might be worth the clock cycles.
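The split-then-test idea could be sketched roughly as follows. The helper names scheme_and_host and check_url_parts are invented for this example, and the splitting regex only handles the common scheme://host[:port]/... shape. It also assumes the scheme is listed as a TCP service in the local services database, which holds for http and ftp on most Unix systems but not for every scheme.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the scheme and host out of something shaped
# like scheme://[user@]host[:port]/...  Returns an empty list on failure.
sub scheme_and_host {
    my ($url) = @_;
    return $url =~ m{^([A-Za-z][A-Za-z0-9+.\-]*)://(?:[^/?#@]*@)?([^/?#:]+)};
}

# Hypothetical helper: true if both parts check out on this machine.
sub check_url_parts {
    my ($url) = @_;
    my ($scheme, $host) = scheme_and_host($url)
        or return 0;

    # In scalar context getservbyname() returns the port number, or undef
    # when the services database has no TCP entry under that name.
    my $port = getservbyname(lc $scheme, 'tcp');
    return 0 unless defined $port;

    # In scalar context gethostbyname() returns a packed address, or undef
    # when the name does not resolve. This is a real lookup and can be slow.
    my $addr = gethostbyname($host);
    return defined $addr ? 1 : 0;
}
```

Calling check_url_parts($url) does a name lookup for every submission, which is the performance cost mentioned above.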

Re: regex to match URLs
by atcroft (Abbot) on Feb 28, 2006 at 20:33 UTC

    You may wish to look at the URI module, if you are looking to get specific components. Also, the documentation for that module includes a regex that can be used to split a URI into its parts (something also handled by URI::Split's uri_split() function). You might be able to adapt that regex to ensure what you want is there.

    Hope that helps.

      URI will not do the trick, since it accepts both absolute and relative URIs, and it doesn't do validation.

      For example, "www.example.com" is accepted (even though it's not a valid absolute URI), ":80" and "http://:80" are accepted (even though they are not valid URIs).
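One way to adapt the splitting regex is sketched below. It uses the generic pattern from RFC 3986, Appendix B with no modules, then adds the checks that splitting alone does not do. The helper names split_uri and looks_like_absolute_url are assumptions made for this example, and the check deliberately insists on a host, so host-less schemes such as mailto: are rejected.

```perl
use strict;
use warnings;

# Split a URI reference into its five generic components using the
# regex from RFC 3986, Appendix B. Missing components come back undef.
sub split_uri {
    my ($uri) = @_;
    my ($scheme, $authority, $path, $query, $fragment) =
        $uri =~ m{^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?$}s;
    return ($scheme, $authority, $path, $query, $fragment);
}

# Hypothetical helper: require a well-formed scheme and a non-empty host.
sub looks_like_absolute_url {
    my ($uri) = @_;
    my ($scheme, $authority) = split_uri($uri);
    return 0 unless defined $scheme && $scheme =~ /^[A-Za-z][A-Za-z0-9+.\-]*$/;
    return 0 unless defined $authority;

    # Strip optional userinfo and port to get at the host itself.
    (my $host = $authority) =~ s/^[^@]*@//;
    $host =~ s/:\d*$//;
    return length $host ? 1 : 0;
}
```

With these checks, "www.example.com" and ":80" fail for lack of a scheme, and "http://:80" fails because nothing is left of the authority once the port is removed.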

Re: regex to match URLs
by spiritway (Vicar) on Mar 01, 2006 at 06:00 UTC

    For a really, REALLY, REALLY complete regex, there is the one Abigail posted on comp.lang.perl.misc (it's quite long):

    Untested

Re: regex to match URLs
by moklevat (Priest) on Feb 28, 2006 at 20:34 UTC
    I found this (untested) one for North American URLs by Brad Dobyns in the Regular Expression Library

    Perhaps you can modify it for your specific purpose.
    ^(((ht|f)tp(s?))\://)?(www.|[a-zA-Z].)[a-zA-Z0-9\-\.]+\.(com|edu|gov|mil|net|org|biz|info|name|museum|us|ca|uk)(\:[0-9]+)*(/($|[a-zA-Z0-9\.\,\;\?\'\\\+&%\$#\=~_\-]+))*$
    Updated: Fixed wrapping in the code tags. Thanks for the tip ikegami.
Re: regex to match URLs
by holli (Abbot) on Feb 28, 2006 at 21:11 UTC
    I'd simply ping the given URL. If it answers, fine; if not, reject it.


    holli, /regexed monk/
Re: regex to match URLs
by Anonymous Monk on Mar 01, 2006 at 00:21 UTC

    You can also ignore the verification of URLs entirely, and just carry the assumption that you were given a proper URL as far as you can -- until something fails that actually wanted to do something practical with your URL.

    If your program is interactive, and not just feeding a database, please consider this option. You'll need to gracefully recover from errors anyway, so why engage in duplicate efforts?
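As a rough illustration of this option, assume the "something practical" is fetching the page. That is only a guess, since the original question does not say what the script does with the URL. The helper name fetch_or_explain is made up; HTTP::Tiny has shipped with Perl since 5.14.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Hypothetical helper: use the URL without validating it first and
# report why it failed, if it did. Returns (1, $content) or (0, $reason).
sub fetch_or_explain {
    my ($url) = @_;
    my $res = HTTP::Tiny->new(timeout => 10)->get($url);
    return (1, $res->{content}) if $res->{success};

    # HTTP::Tiny reports its own failures (unparsable URL, connection
    # errors) with status 599 and puts the error message in the content.
    my $why = $res->{status} == 599
        ? $res->{content}
        : "$res->{status} $res->{reason}";
    return (0, $why);
}
```

An interactive program can show the returned reason to the user and ask for the URL again, which is the same recovery path it needs anyway when a well-formed URL turns out to be dead.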

Node Type: perlquestion [id://533485]
Approved by Old_Gray_Bear