PerlMonks  

very quick link regex

by Anonymous Monk
on Aug 03, 2007 at 04:27 UTC

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to parse a link that looks like http://www.page.com/files4/i/90c/1493010807t1.jpg

files# can be any number, /i/ can be any letter, /90c/ can be any mix of numbers and letters, and the filename can be a mix of numbers and letters but will always end in .jpg. I tried my best to work it out myself but I failed.

This is what I came up with.

my $content2 =~ m#(www\.page\.com/files\d+/[a..z]/\d+[a-z]/(.+)+\.jpg)#i;

Replies are listed 'Best First'.
Re: very quick link regex
by Thelonius (Priest) on Aug 03, 2007 at 09:09 UTC
    You have [a..z] where you should have [a-z]. You also have 'my' in front of $content2, which is wrong if you are trying to match a string which is already in $content2. If you are trying to match against $_ and put the results in $content2, you should write:
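    my ($content2) = m#(yourregexhere)#i;
    or, maybe clearer,
    my ($content2) = ($_ =~ m#(yourregexhere)#i);
    (The parentheses around $content2 put the match in list context, so it receives the captured text; without them it would only get a true/false value.)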

    The construct (.+)+ doesn't really make sense. Just .+ would be okay, but you have to worry about matching too much. One possibility is .+?, but that's not the most efficient code and you could still get false matches in some cases, such as if the source text was, e.g.:

    http://www.page.com/files4/i/90c/93898.gif is not a .jpg file
    You might want
    m#(www\.page\.com/files\d+/[a-z]/\d\S+\.jpg)#i
    or, if this is a real HTML link,
    m#(www\.page\.com/files\d+/[a-z]/[^"']+\.jpg)#i
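To make the false-match concern above concrete, here is a minimal sketch comparing the lazy `.+?` with the `\S+` variant. The two-line sample text is hypothetical (it reuses the .gif example and the URL from the question), and the variable names are only for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample text: the .gif line is the false-match case described above.
my $text = join "\n",
    'http://www.page.com/files4/i/90c/93898.gif is not a .jpg file',
    'http://www.page.com/files4/i/90c/1493010807t1.jpg';

# Lazy .+? may cross whitespace, so it starts at the .gif URL
# and keeps going until it reaches the ".jpg" in "a .jpg file".
my ($lazy) = $text =~ m#(www\.page\.com/files\d+/[a-z]/.+?\.jpg)#i;

# \S+ cannot cross whitespace, so the .gif URL fails to match
# and the engine moves on to the real .jpg URL on the second line.
my ($strict) = $text =~ m#(www\.page\.com/files\d+/[a-z]/\d\S+\.jpg)#i;

print "lazy:   $lazy\n";
print "strict: $strict\n";
```

The list-context assignment (`my ($var) = ...`) is what captures the matched text; whether `\S+` or `[^"']+` is the better boundary depends on whether the input is plain text or HTML attributes.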
Re: very quick link regex
by Zaxo (Archbishop) on Aug 03, 2007 at 04:35 UTC

    If you're parsing HTML, try HTML::LinkExtor. If text, Regexp::Common::URI.

    The former is a parser, the latter a well-tested regex.

    After Compline,
    Zaxo

Re: very quick link regex
by GrandFather (Saint) on Aug 03, 2007 at 04:55 UTC

    Are you trying to match, validate or extract parts?

    my $url = 'http://www.page.com/files4/i/90c/1493010807t1.jpg';
    my ($domain, $path, $file) = $url =~ m'^(\w+://[\w.]+/)(.*?/)([\w.]+)$';
    print "$domain\n$path\n$file\n";

    Prints:

    http://www.page.com/
    files4/i/90c/
    1493010807t1.jpg

    which may or may not be anything like what you need to do.


    DWIM is Perl's answer to Gödel
Re: very quick link regex
by Anno (Deacon) on Aug 03, 2007 at 11:40 UTC
    Other monks have provided possible solutions to your problem. I only want to comment on the subject of your query.

    I may be lacking in monkish indulgence, but I resent it when a question is presented as "very quick" (or "easy", or whatever). Since you don't have a solution, how do you know it's "quick"?

    Anno

Re: very quick link regex
by Anonymous Monk on Aug 03, 2007 at 04:31 UTC
    Actually, after doing a little more research I found that the link is MORE dynamic than I thought. I need to match everything from page.com/files\d+ up to and including the .jpg. No other part of the page has /files\d+, so if it finds a match, it'll always be the right one. Can someone help with this?
      You were very close to the result already.
      /(.+)+\.jpg
      In this last part of your regexp the dot-plus is greedy: it first grabs everything up to the end of the line, then backtracks only as far as the last '.jpg' it can find. If the line holds more text after the link, the match can run well past the filename you want.

      The solution is to make the dot-plus part of the regexp less greedy; change the part above to:

      /(.+?)\.jpg

      try

      $content2 =~ m#(www\.page\.com/files\d+/.+?\.jpg)#i;
      Note: the code is NOT tested.

      Cheers !

      --VC



      There are three sides to any argument.....
      your side, my side and the right side.
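For anyone comparing the greedy and non-greedy forms discussed in this sub-thread, here is a small sketch. The line of page source is made up for illustration (the second link and its host are hypothetical); only the two patterns come from the thread.

```perl
use strict;
use warnings;

# Hypothetical line of page source containing two image links.
my $content2 = '<img src="http://www.page.com/files4/i/90c/a1.jpg"> '
             . '<img src="http://other.example/b2.jpg">';

# Greedy .+ grabs the rest of the line, then backtracks to the LAST ".jpg".
my ($greedy) = $content2 =~ m#(www\.page\.com/files\d+/.+\.jpg)#i;

# Lazy .+? expands one character at a time and stops at the FIRST ".jpg".
my ($lazy) = $content2 =~ m#(www\.page\.com/files\d+/.+?\.jpg)#i;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

Note that there is no `my` in front of the variable being matched against, and the captures are collected with a list-context assignment.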
