HTML::TokeParser - extract values between tags

by doubledecker (Scribe)
on Feb 17, 2014 at 10:27 UTC

doubledecker has asked for the wisdom of the Perl Monks concerning the following question:


I am trying to extract the following text from an HTML page using the code below, but my code fails.

Budget - $25,000,000
Gross(worldwide) - $58,500,000

#!/usr/bin/perl
use HTML::TokeParser;

my $content = <<HTML;
<h5>Budget</h5>
$25,000,000 (estimated)<br/>
<br/>
<h5>Opening Weekend</h5>
$727,327 (USA) (<a href="/date/09-25/">25 September</a> <a href="/year/1994/">1994</a>) (33 Screens)<br/>
<br/>
<h5>Gross</h5>
$28,341,469 (USA) (<a href="/date/08-05/">5 August</a> <a href="/year/2012/">2012</a>)<br/>&#163;2,344,349 (UK) (<a href="/date/05-18/">18 May</a> <a href="/year/1995/">1995</a>)<br/>&#163;1,732,123 (UK) (<a href="/date/04-16/">16 April</a> <a href="/year/1995/">1995</a>)<br/>$58,500,000 (Worldwide)<br/>$555,480 (Belgium)<br/>ESP 637,291,985 (Spain)<br/>
<br/>
<h5>Admissions</h5>
82,890 (Belgium)<br/>163,594 (France) (<a href="/date/03-28/">28 March</a> <a href="/year/1995/">1995</a>)<br/>410,811 (Germany) (<a href="/date/12-31/">31 December</a> <a href="/year/1995/">1995</a>)<br/>1,245,604 (Spain)<br/>
<br/>
<h5>Filming Dates</h5>
<a href="/date/06-16/">16 June</a> <a href="/year/1993/">1993</a>&nbsp;-&nbsp;<a href="/date/09-10/">10 September</a> <a href="/year/1993/">1993</a><br/>
<br/>
HTML

my $description = "";
my $parser = HTML::TokeParser->new(\$content) || die "Can't open: $!";

while (my $token = $tp->get_tag("h5")) {
    my $text = $parser->get_text();
    last if $text =~ /budget/i;
}

Replies are listed 'Best First'.
Re: HTML::TokeParser - extract values between tags
by Anonymous Monk on Feb 17, 2014 at 12:06 UTC
    You're using an interpolating heredoc :) A bare <<HTML behaves like a double-quoted string, so $25 and friends are interpolated as variables :) strict vars or warnings would have warned you
    #!/usr/bin/perl --
    use strict;
    use warnings;
    use XML::LibXML 1.70; ## for load_html/load_xml/location
    use Data::Dump qw/ dd /;

    my %shabs;
    my $dom = XML::LibXML->new( qw/ recover 2 / )->load_html( string => $content );
    for my $h5 ( $dom->findnodes( q{ //h5 } ) ){
        print $h5->nodePath, "\n";
        my $key = $h5->textContent;
        my $next = $h5->nextSibling;
        while( $next ){
            print $next->nodePath, "\n";
            $shabs{$key} .= $next->textContent;
            $next = $next->nextSibling;
            last if eval { $next->tagName eq 'h5' } ;
        }
        print "\n";
    }
    dd( \%shabs );
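
The heredoc point in the reply above can be seen in isolation with a small sketch. It uses two lines of the HTML from the question; the variable names $interpolated and $literal are made up for illustration. As far as I know, digit variables such as $25 are exempt from strict vars, so in practice it is "use warnings" that reports the problem (as an uninitialized value).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Unquoted terminator: the body behaves like a double-quoted string, so
# "$25" is read as the capture variable $25 (undefined here) and vanishes.
my $interpolated = do {
    no warnings 'uninitialized';    # silenced for this demo only
    <<HTML;
<h5>Budget</h5>
$25,000,000 (estimated)<br/>
HTML
};

# Single-quoted terminator: the body is taken literally.
my $literal = <<'HTML';
<h5>Budget</h5>
$25,000,000 (estimated)<br/>
HTML

print "interpolated: $interpolated";
print "literal:      $literal";
```

With the unquoted form the amount comes out as ",000,000 (estimated)"; with <<'HTML' the text survives as written.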
Re: HTML::TokeParser - extract values between tags
by hdb (Monsignor) on Feb 17, 2014 at 10:56 UTC

    In what way does your code fail? Even if it parses the HTML correctly, it will not produce any output, as there is no print or anything similar in the code.
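
To make that point concrete, here is a minimal sketch of a HTML::TokeParser loop that does print something. It assumes a trimmed copy of the HTML from the question, a single-quoted heredoc, and one consistently named $parser object. The helper sections_by_heading and the two regexes are illustrative assumptions, not the original poster's code.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Trimmed copy of the HTML from the question; the single-quoted heredoc
# keeps the dollar amounts from being interpolated.
my $content = <<'HTML';
<h5>Budget</h5>
$25,000,000 (estimated)<br/>
<br/>
<h5>Gross</h5>
$28,341,469 (USA)<br/>$58,500,000 (Worldwide)<br/>$555,480 (Belgium)<br/>
<br/>
HTML

# Hypothetical helper: map each <h5> heading to the text that follows it.
sub sections_by_heading {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Can't parse HTML string";
    my %section;
    while ( $parser->get_tag('h5') ) {
        my $heading = $parser->get_trimmed_text('/h5');
        # Collect all text up to the next <h5> (or the end of input).
        $section{$heading} = $parser->get_text('h5');
    }
    return %section;
}

my %section = sections_by_heading($content);

# Pull the first dollar amount, and the one labelled "(Worldwide)".
my ($budget) = $section{Budget} =~ /(\$[\d,]+)/;
my ($gross)  = $section{Gross}  =~ /(\$[\d,]+)\s*\(Worldwide\)/;

print "Budget - $budget\n";
print "Gross(worldwide) - $gross\n";
```

Passing an end tag to get_text makes it gather all text up to that tag rather than stopping at the first <br/>, which is why each section's amounts end up in one string for the regexes to search.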

Re: HTML::TokeParser - extract values between tags
by Anonymous Monk on Feb 17, 2014 at 10:57 UTC
    Why choose tokeparser? Where does this $parser variable come from?
      Updated the code to reflect the parser object. Apologies for the same.
        the other question is more important :)
