PerlMonks  

HTML::TokeParser - extract values between tags

by doubledecker (Scribe)
on Feb 17, 2014 at 10:27 UTC ( [id://1075152]=perlquestion )

doubledecker has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I am trying to extract the following text from an HTML page using the code below, but my code fails:

Budget - $25,000,000
Gross (worldwide) - $58,500,000

#!/usr/bin/perl
use HTML::TokeParser;

my $content = <<HTML;
<h5>Budget</h5>
$25,000,000 (estimated)<br/>
<br/>
<h5>Opening Weekend</h5>
$727,327 (USA) (<a href="/date/09-25/">25 September</a> <a href="/year/1994/">1994</a>) (33 Screens)<br/>
<br/>
<h5>Gross</h5>
$28,341,469 (USA) (<a href="/date/08-05/">5 August</a> <a href="/year/2012/">2012</a>)<br/>&#163;2,344,349 (UK) (<a href="/date/05-18/">18 May</a> <a href="/year/1995/">1995</a>)<br/>&#163;1,732,123 (UK) (<a href="/date/04-16/">16 April</a> <a href="/year/1995/">1995</a>)<br/>$58,500,000 (Worldwide)<br/>$555,480 (Belgium)<br/>ESP 637,291,985 (Spain)<br/>
<br/>
<h5>Admissions</h5>
82,890 (Belgium)<br/>163,594 (France) (<a href="/date/03-28/">28 March</a> <a href="/year/1995/">1995</a>)<br/>410,811 (Germany) (<a href="/date/12-31/">31 December</a> <a href="/year/1995/">1995</a>)<br/>1,245,604 (Spain)<br/>
<br/>
<h5>Filming Dates</h5>
<a href="/date/06-16/">16 June</a> <a href="/year/1993/">1993</a>&nbsp;-&nbsp;<a href="/date/09-10/">10 September</a> <a href="/year/1993/">1993</a><br/>
<br/>
HTML

my $description = "";
my $parser = HTML::TokeParser->new(\$content) || die "Can't open: $!";
while (my $token = $tp->get_tag("h5")) {
    my $text = $parser->get_text();
    last if $text =~ /budget/i;
}

Replies are listed 'Best First'.
Re: HTML::TokeParser - extract values between tags
by Anonymous Monk on Feb 17, 2014 at 12:06 UTC
    You're using an interpolating (double-quoted) heredoc :) so $25,000,000 and the other dollar amounts are treated as variables. strict vars or warnings would have warned you.
    #!/usr/bin/perl --
    use strict;
    use warnings;
    use XML::LibXML 1.70; ## for load_html/load_xml/location
    use Data::Dump qw/ dd /;

    my %shabs;
    my $dom = XML::LibXML->new( qw/ recover 2 / )->load_html( string => $content );
    for my $h5 ( $dom->findnodes( q{ //h5 } ) ){
        print $h5->nodePath, "\n";
        my $key = $h5->textContent;
        my $next = $h5->nextSibling;
        while( $next ){
            print $next->nodePath, "\n";
            $shabs{$key} .= $next->textContent;
            $next = $next->nextSibling;
            last if eval { $next->tagName eq 'h5' } ;
        }
        print "\n";
    }
    dd( \%shabs );
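The heredoc remark can be checked in isolation. The following is a minimal sketch using only core Perl; the variable names `$interpolated` and `$literal` are made up for the illustration. It compares an unquoted terminator, which interpolates like a double-quoted string, with a single-quoted terminator, which keeps the text verbatim.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Unquoted terminator: the heredoc interpolates, so "$25" is read as the
# (undefined) regex capture variable $25 and disappears from the string.
my $interpolated;
{
    no warnings 'uninitialized';    # silenced for this demonstration only
    $interpolated = <<HTML;
<h5>Budget</h5>
$25,000,000 (estimated)<br/>
HTML
}

# Single-quoted terminator: nothing is interpolated, the amount survives.
my $literal = <<'HTML';
<h5>Budget</h5>
$25,000,000 (estimated)<br/>
HTML

print "interpolated:\n$interpolated";
print "literal:\n$literal";
```

Without the `no warnings` line, `use warnings` reports the use of an uninitialized value, which is the hint mentioned above.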
Re: HTML::TokeParser - extract values between tags
by hdb (Monsignor) on Feb 17, 2014 at 10:56 UTC

    In what way does your code fail? Even if it parses the HTML correctly, it will not produce any output, as there is no print or anything similar in the code.
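For comparison, here is one possible way to get the requested output with HTML::TokeParser. It is only a sketch under a few assumptions: the sample markup is a shortened copy of the posted HTML, the `%section` hash and the two regexes are illustrative choices rather than anything from the thread, and the heredoc is single-quoted so the dollar amounts survive. It relies on `get_text` accepting an end tag name, in which case it collects text up to that tag and skips the tags in between.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Shortened sample; the single-quoted terminator prevents interpolation.
my $content = <<'HTML';
<h5>Budget</h5>
$25,000,000 (estimated)<br/>
<br/>
<h5>Gross</h5>
$28,341,469 (USA)<br/>$58,500,000 (Worldwide)<br/>$555,480 (Belgium)<br/>
<br/>
<h5>Admissions</h5>
82,890 (Belgium)<br/>
HTML

my $parser = HTML::TokeParser->new(\$content)
    or die "Can't create parser";

# Map each <h5> heading to the text that follows it, up to the next <h5>.
my %section;
while ( $parser->get_tag('h5') ) {
    my $heading = $parser->get_trimmed_text('/h5');
    $section{$heading} = $parser->get_text('h5');
}

# Pull out the two amounts asked for in the question.
my ($budget)    = ( $section{Budget} // '' ) =~ /(\$[\d,]+)/;
my ($worldwide) = ( $section{Gross}  // '' ) =~ /(\$[\d,]+)\s*\(Worldwide\)/;

print "Budget - $budget\n"               if defined $budget;
print "Gross (worldwide) - $worldwide\n" if defined $worldwide;
```

With the sample above this should print the Budget and worldwide Gross lines from the question. Real pages may format the amounts differently, so the regexes would need adjusting.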

Re: HTML::TokeParser - extract values between tags
by Anonymous Monk on Feb 17, 2014 at 10:57 UTC
    Why choose tokeparser? Where does this $parser variable come from?
      Updated the code to reflect the parser object. Apologies for the same.
        the other question is more important :)
