http://www.perlmonks.org?node_id=814989


in reply to Re^3: anchor text match
in thread anchor text match

#!/usr/bin/perl --
use strict;
use warnings;
use HTML::TreeBuilder;

my $html = <<'__HTML__';
<a href="http://www.yahoo.com" target=_blank><img src="http://us.i1.yimg.com/nw.gif" alt="Open this result in new window">ANCHOR TEXT</a>
<a href="http://www.yahoo.com" target=_blank><img src="http://us.i1.yimg.com/nw.gif" alt="Two clues for the price of one"></a>
__HTML__

{
    my $h = HTML::TreeBuilder->new_from_content($html);
    for my $link ( $h->look_down( _tag => q{a}, href => 'http://www.yahoo.com' ) ) {
        print $link->attr('href'), "\n";
        my $text = $link->as_trimmed_text;
        unless ($text) {
            $text = join ' ', map { $_->attr('alt') } $link->look_down( alt => qr/^.+$/ );
        }
        print "$text\n\n";
    }    ## end for my $link ( $h->look_down...)
}
__END__

The output is:

http://www.yahoo.com; ANCHOR TEXT

http://www.yahoo.com; Two clues for the price of one

But the desired output is:

http://www.yahoo.com; ANCHOR TEXT

http://www.yahoo.com; IMAGE (indicating that there is no anchor text and that an img tag is present within the anchor tag)

Any ideas?

Re^5: anchor text match
by JadeNB (Chaplain) on Dec 30, 2009 at 21:08 UTC
    You have just copied the code from Re^2: anchor text match literatim (except for some mild massaging of the input). What have you tried?
      I also tried this:
      use WWW::Mechanize ();

      my $mech  = WWW::Mechanize->new();
      my $html  = $mech->get('http://umallvt.com/directory.php');
      my @links = $mech->find_all_links( text_regex => qr/a/i );

      foreach (@links) {
          if ( $_->url() eq 'http://www.victoriassecret.com/' ) {
              print "\n";
              print "url \n";
              print $_->url();
              print "\n";
              print " text\n";
              print $_->text();
              print "\n";
          }
      }
      __END__

      The output is:

      url: http://www.victoriassecret.com/

      text: Victoria's Secret

      Suppose the page had an anchor tag like the one below:

      <a href="http://www.victoriassecret.com/" target=_blank><img src=http://www.victoriassecret.com/nw.gif height=11 width=11 border=0 alt="Open this result in new window"> </a>

      The above Perl script would give:

      url: http://www.victoriassecret.com/

      text: Open this result in new window

      But the desired result is:

      url: http://www.victoriassecret.com/

      text: IMAGE

        I think that you may have expected a ready-made solution, which is why Re^2: anchor text match surprised you. The poster there was not (I think) trying to solve your problem, but rather to indicate to you how you could solve it. (That was the meaning of the “Two clues in one” text.)

        It's not surprising that the code you indicate doesn't do what you want—the for loop makes no effort to check whether the link being processed satisfies any special conditions, and so must treat every link equally.

        To fix this, you must have something of the following shape in your code:

        for my $link ( @links ) {
            if ( is_special $link ) {
                do_special_thing $link
            }
            else {
                do_ordinary_thing $link
            }
        }
        * where it's up to you to determine how to write is_special and do_special_thing (you've already indicated what you want do_ordinary_thing to be). As an aid, you have the $link object to hand, and so can test its properties in as much detail as necessary.

        * I don't mean literally that your code has to contain these words; just that, without some sort of conditional, you'll never get the special treatment you like.
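For what it's worth, here is one way that shape might be filled in with HTML::TreeBuilder. It is only a sketch under some assumptions: the names is_image_only and link_labels are made up for illustration, "special" is taken to mean "no visible anchor text, but an img somewhere inside the anchor", and the sample HTML is the snippet from the question. Adapt the test to whatever you actually consider special.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (not part of HTML::TreeBuilder): true when the
# anchor has no visible text of its own but contains an <img> element.
sub is_image_only {
    my ($link) = @_;
    return 0 if length $link->as_trimmed_text;    # it has real anchor text
    # look_down in scalar context returns the first match, or undef
    return defined( $link->look_down( _tag => 'img' ) ) ? 1 : 0;
}

# Hypothetical helper: returns a list of [ href, label ] pairs, one per
# <a> element that carries an href attribute.
sub link_labels {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @rows;
    for my $link ( $tree->look_down( _tag => 'a', sub { defined $_[0]->attr('href') } ) ) {
        my $label = is_image_only($link)
            ? 'IMAGE'                     # the special case
            : $link->as_trimmed_text;     # the ordinary case
        push @rows, [ $link->attr('href'), $label ];
    }
    $tree->delete;    # free the parse tree
    return @rows;
}

my $html = <<'__HTML__';
<a href="http://www.yahoo.com" target=_blank><img src="http://us.i1.yimg.com/nw.gif" alt="Open this result in new window">ANCHOR TEXT</a>
<a href="http://www.yahoo.com" target=_blank><img src="http://us.i1.yimg.com/nw.gif" alt="Two clues for the price of one"></a>
__HTML__

printf "%s; %s\n", @$_ for link_labels($html);
```

Note that the alt attribute is never consulted here; an anchor with neither text nor an image simply gets an empty label, which you may want to handle as a third case.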