PerlMonks  

Re: Using HTTP::LinkExtor to get URL and description info

by crazyinsomniac (Prior)
on Aug 08, 2002 at 04:39 UTC ( #188520=note )


in reply to Using HTTP::LinkExtor to get URL and description info

You have to know your tools. HTML::LinkExtor was designed to extract only the links themselves, not the text between the tags (the anchor text, i.e. the character data).

Demo

use strict;
use Data::Dumper;
use HTML::LinkExtor;

my $base    = 'http://perlmonks.org/';
my $stringy = q{
<tr><td><a HREF="/index.pl?node_id=188511">How does this code work (warnings.pm)?</a></td> <td>by <a HREF="/index.pl?node_id=80322">John M. Dlugosz</a></td></tr>
<tr><td><a HREF="/index.pl?node_id=188509">Tk and X events</a></td> <td>by <a HREF="/index.pl?node_id=961">Anonymous Monk</a></td></tr>
<tr><td><a HREF="/index.pl?node_id=188507">warnings::warnif etc. wise usage?</a></td> <td>by <a HREF="/index.pl?node_id=80322">John M. Dlugosz</a></td></tr>
<tr><td><a HREF="/index.pl?node_id=188505">52-bit numbers as floating point</a></td> <td>by <a HREF="/index.pl?node_id=80322">John M. Dlugosz</a></td></tr>
};

# Default interface: parse first, then ask for the collected links.
my $p = HTML::LinkExtor->new(undef, $base);
$p->parse($stringy);
print Dumper $p->links;

# Callback interface: the sub is invoked once per link as it is found.
$p = HTML::LinkExtor->new( sub { print Dumper($_) for @_ }, $base );
$p->parse($stringy);
And now for the nudge, HTML::TokeParser tutorial
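Following that nudge, here is a minimal HTML::TokeParser sketch of the "link plus its text" pattern the original question is after. The toy HTML snippet and variable names are my own illustration; it assumes HTML::TokeParser is installed.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Toy document standing in for a fetched page.
my $html = q{<a href="/foo">Foo headline</a> <a href="/bar">Bar headline</a>};
my $p    = HTML::TokeParser->new(\$html);

my @pairs;
while (my $token = $p->get_tag('a')) {
    my $href = $token->[1]{href};            # attribute hash of the <a> start tag
    my $text = $p->get_trimmed_text('/a');   # text up to the matching </a>
    push @pairs, [ $href, $text ];
}
print "$_->[0] => $_->[1]\n" for @pairs;
```

This prints one "href => text" line per link, which is exactly the URL-plus-description pairing HTML::LinkExtor alone won't give you.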

update: surprise, surprise, I've solved this one before (crazyinsomniac) Re: Getting the Linking Text from a page

 
______crazyinsomniac_____________________________
Of all the things I've lost, I miss my mind the most.
perl -e "$q=$_;map({chr unpack qq;H*;,$_}split(q;;,q*H*));print;$q/$q;"


Replies are listed 'Best First'.
Re: Re: Using HTTP::LinkExtor to get URL and description info
by Popcorn Dave (Abbot) on Aug 08, 2002 at 05:58 UTC
    Thanks for that!

    I'll be looking at that tomorrow for certain, but I do have one question. My program takes headlines off of newspaper sites, but at the moment I'm using LWP::Simple's get(URL), dumping the page into an array, reading through to a certain pre-determined point, and then using a regex to extract the info I want.

    Is HTML::TokeParser going to allow me to do that type of thing or will I have to write new "rules" to determine what is a headline and what is just a link on the page?

    Thanks again!

    Some people fall from grace. I prefer a running start...

      I would suggest the CPAN module HTML::Parser. It's pretty straightforward:
      use strict;
      use warnings;
      use HTML::Parser;

      my $in_a = 0;    # true while we are inside an <a> ... </a> pair

      my $p = HTML::Parser->new(
          start_h   => [ \&start,   "tagname" ],
          end_h     => [ \&end,     "tagname" ],
          default_h => [ \&default, "text"    ],
      );
      $p->parse($some_html);      # $some_html holds the fetched page, or:
      $p->parsefile(\*SOME_FH);   # parse straight from a filehandle

      sub start {
          my ($tagname) = @_;
          $in_a = 1 if $tagname eq 'a';
      }
      sub end {
          my ($tagname) = @_;
          $in_a = 0 if $tagname eq 'a';
      }
      sub default {
          my ($text) = @_;
          # do something with $text if $in_a
      }
      HTH. This is off the top of my head, so check the HTML::Parser POD for absolute correctness.
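      For a runnable end-to-end version of that handler approach, here is a sketch with inline subs; the sample HTML and variable names are my own illustration.

```perl
use strict;
use warnings;
use HTML::Parser;

my $html = q{<p>not a link</p><a href="/x">First link</a><a href="/y">Second link</a>};

my $in_a = 0;     # true while inside an <a> ... </a> pair
my @link_text;    # collected anchor text

my $p = HTML::Parser->new(
    start_h => [ sub { $in_a = 1 if $_[0] eq 'a' },        'tagname' ],
    end_h   => [ sub { $in_a = 0 if $_[0] eq 'a' },        'tagname' ],
    text_h  => [ sub { push @link_text, $_[0] if $in_a },  'text'    ],
);
$p->parse($html);
$p->eof;

print "$_\n" for @link_text;
```

      Using a text_h handler instead of default_h keeps the callback focused on character data; default_h would also receive comments, declarations, and any other unhandled events.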
