Use Parsers To Get Chunk of HTML?

by Cody Pendant (Prior)
on Jul 04, 2005 at 03:15 UTC (#472117)

Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

Slight corollary question to my previous scraper node -- although you don't need to read it to answer this one -- I need to extract links from a page, but not all of them.

Now sometimes I can find the links I want on an HTML page just by matching a URL pattern. That approach works well with HTML::TokeParser or similar.

But say a site uses a completely opaque URL format like "?storyid=123456" for everything?

What I've done in the past is to find the chunk of the page which contains those "good" links as a way to exclude the "bad" ones. And I've done it the "dumb" way, i.e.

$whole_thing =~ m|<some unique html start string>(.*?)<end string>|s;
$good_chunk = $1;
and then working on the $good_chunk.

I've spent a bit of time looking at HTML::TokeParser and HTML::Parser and I can't seem to figure out how to do the equivalent.

Say I've determined that what I need is

<div id="good_chunk">
up to the closing tag of that DIV.

I need something like

while ( my $token = $p->get_tag( "div" ) ) {
    if ( $token->[1]->{'id'} eq 'good_chunk' ){
        # get the entire contents of the div, as HTML,
        # for further parsing
    }
}
Perhaps I'm missing something obvious?
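
For comparison, here is a minimal sketch of how a loop like the one above might be completed with HTML::TokeParser alone. It assumes the target DIV may contain nested DIVs, so it counts nesting depth and accumulates the raw text of each token until the matching close tag. The sample markup and the variable names are hypothetical.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample page, kept on one logical line for clarity.
my $whole_thing = q{<div id="other"><a href="?storyid=1">bad</a></div>}
                . q{<div id="good_chunk"><div class="inner">}
                . q{<a href="?storyid=123456">good</a></div></div><p>after</p>};

my $p = HTML::TokeParser->new( \$whole_thing )
    or die "Could not create parser";

my $good_chunk_html;
while ( my $tag = $p->get_tag('div') ) {
    my $id = $tag->[1]{id};
    next unless defined $id && $id eq 'good_chunk';

    my $depth = 1;    # we are inside the target div
    $good_chunk_html = '';
    while ( my $token = $p->get_token ) {
        my $type = $token->[0];
        if ( $type eq 'S' ) {
            $depth++ if $token->[1] eq 'div';
        }
        elsif ( $type eq 'E' && $token->[1] eq 'div' ) {
            last if --$depth == 0;    # matching close tag reached
        }
        # The original source text sits at a different index per token type.
        $good_chunk_html .=
              $type eq 'S'                    ? $token->[4]
            : ( $type eq 'E' || $type eq 'PI' ) ? $token->[2]
            :                                     $token->[1];
    }
    last;
}

print "$good_chunk_html\n" if defined $good_chunk_html;
```

The resulting string can then be handed to a second parser for link extraction. This relies on the DIV tags being balanced in the source; badly nested markup would need a more forgiving approach.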

($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
=~y~b-v~a-z~s; print

Replies are listed 'Best First'.
Re: Use Parsers To Get Chunk of HTML?
by merzy (Scribe) on Jul 04, 2005 at 04:22 UTC
    I've become a big fan of HTML::TreeBuilder for this sort of thing. If I understand your question, you'd do something like:
    use HTML::TreeBuilder;
    my $tree = HTML::TreeBuilder->new_from_content($whole_thing);
    $tree->elementify();
    my $good_chunk = $tree->look_down("_tag","div","id","good_chunk");
    my $links_ref = $good_chunk->extract_links;
    my $good_chunk_html = $good_chunk->as_HTML;
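
A slightly fuller sketch of the same idea, showing how the return value of extract_links might be walked and how a missing DIV could be guarded against. The sample page is hypothetical; restricting extract_links to 'a' tags is an illustrative choice, not something the original snippet does.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample page; only links inside div#good_chunk are wanted.
my $whole_thing = <<'HTML';
<html><body>
<div id="nav"><a href="?storyid=1">bad</a></div>
<div id="good_chunk"><p><a href="?storyid=123456">good</a></p></div>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($whole_thing);

# In scalar context look_down returns the first match, or undef if none.
my $good_chunk = $tree->look_down( _tag => 'div', id => 'good_chunk' );

my @good_links;
if ($good_chunk) {
    # Each entry is an array ref: [ $url, $element, $attr_name, $tag_name ].
    for my $link ( @{ $good_chunk->extract_links('a') } ) {
        my ( $url, $element, $attr, $tag ) = @$link;
        push @good_links, $url;
    }
}
else {
    warn "No div with id 'good_chunk' found\n";
}

print "$_\n" for @good_links;

$tree->delete;    # release the tree's memory explicitly
```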
      Wow, that worked straight away! Brilliant stuff. Thank you.
Re: Use Parsers To Get Chunk of HTML?
by GrandFather (Saint) on Jul 04, 2005 at 04:17 UTC

    Take a look at HTML::TreeBuilder. It builds an in-memory tree representing the HTML, which you can then extract information from in various ways.

    Perl is Huffman encoded by design.
Re: Use Parsers To Get Chunk of HTML?
by polettix (Vicar) on Jul 04, 2005 at 10:45 UTC
    $whole_thing =~ m|<some unique html start string>(.*?)<end string>|s;
    $good_chunk = $1;
    The match could fail here, so you should check that it succeeded before using $1; otherwise $1 still holds whatever the last successful match captured. You could also evaluate the match in list context:
    ($good_chunk) = $whole_thing =~ m|<some unique html start string>(.*?)<end string>|s;
    although readability may suffer a bit. This assigns the captured text to $good_chunk if the regex matches, and undef otherwise.
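
As a minimal sketch of that check, using hypothetical comment markers as the start and end strings:

```perl
use strict;
use warnings;

# Hypothetical input and delimiters, for illustration only.
my $whole_thing =
    '<!-- start --><a href="?storyid=123456">story</a><!-- end -->';

# In list context a failed match returns an empty list,
# so $good_chunk ends up undef instead of a stale $1.
my ($good_chunk) = $whole_thing =~ m|<!-- start -->(.*?)<!-- end -->|s;

if ( defined $good_chunk ) {
    print "Found chunk: $good_chunk\n";
}
else {
    warn "Delimiters not found; nothing to parse\n";
}
```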

    perl -ple'$_=reverse' <<<ti.xittelop@oivalf

    Don't fool yourself.
      Thanks for that, good point.
