2 questions for Corion regarding href with Mechanize::Firefox

by help_3452 (Initiate)
on Jul 15, 2013 at 18:15 UTC (#1044435=perlquestion)
help_3452 has asked for the wisdom of the Perl Monks concerning the following question:

Good afternoon all,
These are two questions for Corion. I'm asking them here because I think the answers might help others!

1. I noticed that Mechanize::Firefox does not consistently return usable links when you use $mech->find_all_links.
2. I spent quite some time tracing through it in the Perl debugger.
3. I think I understand what's going on.
4. If you navigate to http://m.rte.ie and examine one of the news links in Firebug, you will notice that the href in the HTML is not the same as the href in the DOM.
5. This was not obvious to me at the start!

Questions:
Is this correct? Does Mechanize::Firefox pick up the href from the DOM? If so, it might be worth making this a little clearer in your CPAN entry. Perhaps it is clear, but it wasn't to me. (Apologies, I'm not a Perl pro.)
If this is correct, could you suggest how I might access all the links consistently? My thought at the moment is to re-examine the outerHTML, which seems very clumsy!
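
For anyone reading along, the difference described above can be reproduced without a browser. This is a minimal sketch, assuming the CPAN URI module is installed; the page URL and the relative href are made-up placeholders, not values taken from m.rte.ie. The href attribute in the HTML source is often relative, while the href the DOM reports is that value resolved against the page URL:

```perl
use strict;
use warnings;
use URI;

# Hypothetical values, for illustration only.
my $page_url  = 'http://example.com/news/index.html';
my $attr_href = '/news/2013/0715/story.html';   # as written in the HTML source

# A browser's DOM reports the attribute resolved against the page URL.
my $dom_href = URI->new_abs($attr_href, $page_url)->as_string;

print "attribute: $attr_href\n";
print "DOM href:  $dom_href\n";
```

A string comparison between the two values fails even though they point at the same resource, which is consistent with the symptom in the question.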

Re: 2 questions for Corion regarding href with Mechanize::Firefox
by moritz (Cardinal) on Jul 15, 2013 at 19:38 UTC
Re: 2 questions for Corion regarding href with Mechanize::Firefox
by mtmcc (Hermit) on Jul 15, 2013 at 19:43 UTC
    Have you looked through the documentation?

    I haven't used that module, but it seems to me that there are options for both?

    I hope that's helpful! Apologies if not.

    -Michael
Re: 2 questions for Corion regarding href with Mechanize::Firefox
by Anonymous Monk on Jul 15, 2013 at 19:55 UTC
    There is no such module as Mechanize::Firefox. The site you are scraping forbids that in its terms of use, which is why they make it harder to scrape. To start? Ask for API access, or a feed from their DB. Clearer CPAN entry? Submit a patch or say something meaningful.
Re: 2 questions for Corion regarding href with Mechanize::Firefox
by LanX (Canon) on Jul 15, 2013 at 20:53 UTC
    You have to decide when to poll the links.

    The module has only limited means to tell when a page is "ready" (it's meant to mechanize dynamic pages! remember?)

    That's why you have to check what "ready" means for you (e.g. testing if a special DOM-element already appeared) and synchronize your actions.
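
The synchronization idea above can be sketched as a small polling loop. This is only an illustration: wait_until is a hypothetical helper, not part of WWW::Mechanize::Firefox, and the condition used here is a stand-in counter so the snippet runs without a browser. With a live session, the code reference would instead test for whatever DOM element means "ready" for your page.

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep time);

# Hypothetical helper: call $is_ready repeatedly until it returns true
# or $timeout seconds have passed. Returns 1 on success, 0 on timeout.
sub wait_until {
    my ($is_ready, $timeout, $interval) = @_;
    $timeout  //= 10;
    $interval //= 0.5;
    my $deadline = time() + $timeout;
    while (1) {
        return 1 if $is_ready->();
        return 0 if time() >= $deadline;
        sleep $interval;
    }
}

# Stand-in condition for the demo: becomes true on the third check.
my $checks = 0;
my $ok = wait_until(sub { ++$checks >= 3 }, 5, 0.1);
print $ok ? "ready after $checks checks\n" : "timed out\n";
```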

    Cheers Rolf

    ( addicted to the Perl Programming Language)

      Error Code :
      No link found matching '//a(@href = "https://www43.friendsprovident.com/CAA/jsp/register_customer.jsp;jsessionid=0001xxxxxxxxxxxxxxxxxxxxxxx:xxxxxxxxxxx?targeturl=https%3A%2F%2Fwww88.friendsprovident.com%2Fmembersite%2Factivate%2FhaveNoCRN.jhtml" or @src="https://www43.friendsprovident.com/CAA/jsp/register_customer.jsp;jsessionid=0001xxxxxxxxxxxxxxxxxxxxxxx:xxxxxxxxxxx?targeturl=https%3A%2F%2Fwww88.friendsprovident.com%2Fmembersite%2Factivate%2FhaveNoCRN.jhtml")'

      Actual HTML:

      <a id="register" tabindex="106" href="/CAA/jsp/register_customer.jsp;jsessionid=0001xxxxxxxxxxxxxxxxxxxxxxx:xxxxxxxxxxx?targeturl=https%3A%2F%2Fwww88.friendsprovident.com%2Fmembersite%2Factivate%2FhaveNoCRN.jhtml"> Register</a>


      Found @ : https://www43.friendsprovident.com/CAA/jsp/login.jsp;jsessionid=0001xxxxxxxxxxxxxxxxxxxxxxx:xxxxxxxxxxx?targeturl=https%3A%2F%2Fwww88.friendsprovident.com%2Fmembersite%2Flogin%2FMSLogin.jsp%3Fsite_id%3Dmembersite%26finaltargetURL%3Dhttps%3A%2F%2Fwww88.friendsprovident.com%2Fmembersite%2Findex.jhtml%26realmid%3D3DMS-UID

      (Probably best found by navigating to https://www88.friendsprovident.com/membersite/ and then clicking login.)
      The register link on the login page is what I'm trying to demo $mech->follow_link on.

      Using Code:
      my @arr = $mech->find_all_links; my $link_obj = $arr[4];

      Further on I have
      eval { $mech->follow_link( url => $link_obj->url, tag => $link_obj->tag ) };

      These two methods don't work consistently.

      To respond to all of you in turn.

      moritz: OK. I do need to have the ability to navigate/edit JavaScript. That part I have covered.
      Michael: Yes, I've looked at the documentation. Having now gone through pretty much all the code for WWW::Mechanize::Firefox, it is clear to me. The code I've quoted is reasonable and logical. The documentation only needs one extra line to point out that follow_link will not necessarily work with the links found by find_all_links. Or, if I understand the code, follow_link searches the HTML while find_all_links refers to the DOM, which is not documented.
      Anonymous Monk: This is really aggressive! I'm just asking a question.
      1. "Mechanize::Firefox doesn't exist." You are being pedantic.
      2. Terms of service. I don't want to scrape either website. In fact, I don't want to scrape any website, certainly not any that I don't have permission to. Both websites cited here are only mentioned so I can show the problem.
      3. API/DB access. That is just aggressive and unhelpful.
      4. The documentation could be a little clearer in the follow_link section.
      5. Submit a patch. Sure, no problem. This is my fifth day learning Perl, so once I have the skill I would be very happy to help improve the code.
      6. Say something meaningful. WWW::Mechanize::Firefox is a super piece of work; it is very slick. I'm only quibbling about a small element, because I would like others to use this cool software and would not like them to get confused as I did.

      Bottom line: find_all_links returns objects which are then not usable with follow_link. This is not what I would expect.
      Rolf: Cheers, but no, that is not the problem. It is not a question of the data being unavailable. I can access the data.

        Thanks for posting actual code and data.

        I've never used ->find_all_links in conjunction with ->follow_link. As ->find_all_links needs to return absolute links, and as there is no easy way for ->follow_link to determine URL equivalence for arbitrary href attributes, the problem is basically unsolvable that way.

        As a workaround, I would look at ->find_link_dom, which returns DOM objects instead of converting things to strings.
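
A rough sketch of that workaround follows. It has not been run against the site in question and assumes a running Firefox with the MozRepl extension, which WWW::Mechanize::Firefox requires. The login URL is a placeholder, and the method names and options (find_all_links_dom with text_contains, click with a dom option) are used as I understand them from the module's documentation, so check them against your installed version. The point is to keep the DOM node and click it, instead of turning it into a URL string and searching for that string again:

```perl
use strict;
use warnings;
use WWW::Mechanize::Firefox;

# Placeholder: set this to the login page you are testing against.
my $login_url = 'http://example.com/login';

my $mech = WWW::Mechanize::Firefox->new();
$mech->get($login_url);

# Ask for DOM nodes rather than WWW::Mechanize::Link objects.
# text_contains => 'Register' assumes the link text from the HTML quoted earlier.
my ($link) = $mech->find_all_links_dom( text_contains => 'Register' );
die "No 'Register' link found on $login_url\n" unless $link;

# The DOM node exposes the browser's resolved (absolute) href.
print "Following: $link->{href}\n";

# Click the node itself; no URL string matching is involved.
$mech->click({ dom => $link });
```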
