PerlMonks  

WWW::Mechanize find_link question.

by devnul (Monk)
on May 12, 2005 at 23:22 UTC (#456565=perlquestion)
devnul has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I have several HTML links like the following:
    <a href="url?page=1">10</a>
    <a href="url?page=2">20</a>
    <a href="url?page=3">30</a>
    <a href="url?page=2"><img src="next_image"></a>


I need to extract the last link (and only the last link), which is the one containing the image. Is there a way to do this that I'm not seeing? I have tried using find_link, but I'm not sure what I should pass to it, or whether this is even possible.

I should probably mention that the actual URL for "20" and the image are identical to each other.

Thanks!

dEvNuL

Re: WWW::Mechanize find_link question.
by mrborisguy (Hermit) on May 12, 2005 at 23:35 UTC
    I would think something like:
    $mech->find_link( text_regex => qr/<img src="next_image">/i );
    would do the trick, but I've never gone in depth with this module before, so I can't be sure. It just seems that anything between the opening and closing tags would be the "text" of the link.
Re: WWW::Mechanize find_link question.
by thor (Priest) on May 13, 2005 at 00:11 UTC
    I wonder if find_all_links returns all of the links in order of appearance. If so, just use that and grab the last item in the array.

    thor

    Feel the white light, the light within
    Be your own disciple, fan the sparks of will
    For all of us waiting, your kingdom will come
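A minimal sketch of this idea, not a tested recipe: it assumes that find_all_links returns links in document order, which is exactly the open question above, so check it against your installed version. The sample markup is copied from the question, and update_html is used here only to load it without a network request; in real use you would call $mech->get on the page first.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Sample markup copied from the original question.
my $html = join ' ',
    '<a href="url?page=1">10</a>',
    '<a href="url?page=2">20</a>',
    '<a href="url?page=3">30</a>',
    '<a href="url?page=2"><img src="next_image"></a>';

my $mech = WWW::Mechanize->new;

# Load the markup directly instead of fetching a page.
$mech->update_html($html);

# Collect every <a> link, then take the last one.
# Assumption: the list comes back in order of appearance.
my @links = $mech->find_all_links( tag => 'a' );
my $last  = $links[-1];

print $last->url, "\n" if $last;
```

Because the "20" link and the image link share the same URL, position in the list is what tells them apart here, not the URL itself.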

Re: WWW::Mechanize find_link question.
by Adrade (Pilgrim) on May 13, 2005 at 00:59 UTC
    Dear dEvNuL,

    Well, I've never worked with the Mechanize library either, but if I understand your question correctly, I would personally use a pattern match to solve your problem.

    If you want to capture the last link in some HTML:

    # If you have a link that includes a space,
    # then remove the space from the last set of brackets
    ($linkloc) = ($html =~ m/.*href=["']([^"'> ]+)/s);
    Now, if you wanted to match the last url in your document that specifically linked an image (if there were more links that you want to ignore that follow the image link):
    ($linkloc) = ($html =~ m/.*href=["']([^"'> ]+)[^>]*>\s*<img/s);

    Using your html, in both cases, $linkloc becomes url?page=2
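For anyone who wants to try the two patterns above, here is a small self-contained sketch using only core Perl. The helper names last_href and last_image_href are made up for illustration, and the markup is the sample from the question; the patterns are only expected to hold for markup of this fixed shape, not for HTML in general.

```perl
use strict;
use warnings;

# Sample markup copied from the original question.
my $html = join ' ',
    '<a href="url?page=1">10</a>',
    '<a href="url?page=2">20</a>',
    '<a href="url?page=3">30</a>',
    '<a href="url?page=2"><img src="next_image"></a>';

# Hypothetical helper: URL of the last href in the document.
# The greedy .* pushes the match as far right as it can go.
sub last_href {
    my ($doc) = @_;
    my ($url) = $doc =~ m/.*href=["']([^"'> ]+)/s;
    return $url;
}

# Hypothetical helper: URL of the last href whose tag is
# immediately followed by an <img> tag.
sub last_image_href {
    my ($doc) = @_;
    my ($url) = $doc =~ m/.*href=["']([^"'> ]+)[^>]*>\s*<img/s;
    return $url;
}

print last_href($html), "\n";
print last_image_href($html), "\n";
```

The second helper is the one to use if ordinary text links can follow the image link, since the first simply returns whatever href comes last.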

    I hope this is helpful. Best,
      -Adam
      /me downvoted, because using regex to match HTML is almost always wrong, unless you use a very correct regex, which you didn't.

      Please see the other answers in this thread for much better solutions.

      -- Randal L. Schwartz, Perl hacker
      Be sure to read my standard disclaimer if this is a reply.

        Dear Merlyn,

        It so happens that this particular user is trying to parse specifically formatted HTML. I would normally agree with you, but a regex is especially convenient when one is expecting data of a certain structure - this seems to meet that condition.

        Also, I'm interested in how you would modify the regex to meet your more stringent requirements. Always looking to better my ability here.

          -Adam
Re: WWW::Mechanize find_link question.
by devnul (Monk) on May 17, 2005 at 00:03 UTC
    For what it's worth, regarding the above debate:

    In this case, using a regular expression (matching an HTML tag) is exactly what I needed to do. I was not aware that the find_link method would allow this.

    It works flawlessly and IMHO is a perfect solution in this case. Thanks!

    - Greg
