PerlMonks  

Extract HTML rows with headers specified

by kalyanrajsista (Scribe)
on Jan 29, 2010 at 05:10 UTC (#820306)
kalyanrajsista has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

I'm trying to extract HTML table rows with the following code

use strict;
use warnings;
use HTML::TableExtract;
use Data::Dumper;

my $html = qq{
<HTML>
<BODY>
<table border="1">
<tr><td align="center" nowrap><font size="2"><u>Activity #</u><td align="center"><font size="2">Some&nbsp;ID<br>/Debit&nbsp;ID</font></td></tr>
<tr><td align="right"><font size="2">588476377</font></td><td><font size="2"><a href="/cgi-bin/page?id=1275591">1275591</a></font></td></tr>
<tr><td align="right"><font size="2">588484813</font></td><td><font size="2"><a href="/cgi-bin/page?id=1210540">1210540</a></font></td></tr>
</table>
</BODY>
</HTML>
};

my $te = HTML::TableExtract->new( headers => ['Some ID'] );
$te->parse($html);

eval { $te->rows; };
if ( $@ ) {
    print "No rows found\n";
}
print Dumper($te->rows);

When I extract using a table header such as 'Invoice ID', whose page source contains characters that are not displayed literally on the web page (for example the &nbsp; in the sample above), the code prints 'No rows found'. How can I extract the data when the headers contain spaces, '/', or other such characters?

Re: Extract HTML rows with headers specified
by wfsp (Abbot) on Jan 29, 2010 at 08:49 UTC
    Change
    ['Some ID']
    to
    ['Some&nbsp;ID']
    and it works OK here.

    Update: No it doesn't :-(
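    The match fails because 'Some ID' (with an ordinary space) isn't the same string as 'Some&nbsp;ID', whose entity decodes to a non-breaking space, chr(0xA0), although the two may look the same in the browser.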

    Update2:

    my $header = q{Some} . chr(0x0A0) . q{ID};
    my $te = HTML::TableExtract->new( headers => [$header] );
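To illustrate why the plain-space header fails while the chr(0xA0) one succeeds, here is a minimal sketch using only core Perl. The helper name `header_pattern` is hypothetical (it is not part of HTML::TableExtract); it builds a pattern that accepts either ordinary whitespace or a non-breaking space between the words of a header. Whether such a pattern can be passed directly in `headers` is an assumption to check against the module's documentation.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a header as typed ("Some ID") into a pattern
# that accepts ordinary whitespace or U+00A0 (the decoded form of &nbsp;)
# between its words.
sub header_pattern {
    my ($header) = @_;
    my @words   = split ' ', $header;
    my $pattern = join '[\s\x{A0}]+', map { quotemeta($_) } @words;
    return qr/$pattern/i;
}

# What the header cell contains once &nbsp; has been decoded.
my $cell_text = 'Some' . chr(0xA0) . 'ID';
my $re        = header_pattern('Some ID');

# A plain string comparison fails, but the tolerant pattern matches.
print "plain string equal: ", ( $cell_text eq 'Some ID' ? "yes" : "no" ), "\n";
print "pattern matches:    ", ( $cell_text =~ $re       ? "yes" : "no" ), "\n";
```

The first line prints "no" and the second prints "yes", which is the same distinction the browser hides.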
Re: Extract HTML rows with headers specified
by Anonymous Monk on Jan 29, 2010 at 08:58 UTC
    Try turning on debugging.
Re: Extract HTML rows with headers specified
by steve (Deacon) on Jan 29, 2010 at 19:27 UTC
    The HTML::TableExtract documentation indicates that there is a "decode" constructor attribute, described as follows:
    Automatically decode retrieved text with HTML::Entities::decode_entities(). Enabled by default. Has no effect if keep_html was specified or if extracting into an element tree structure.

    The following works for me:
    use strict;
    use warnings;
    use HTML::TableExtract;
    use Data::Dumper;

    my $html = qq{
    <HTML>
    <BODY>
    <table border="1">
    <tr><td align="center" nowrap><font size="2"><u>Activity #</u><td align="center"><font size="2">Some&nbsp;ID<br>/Debit&nbsp;ID</font></td></tr>
    <tr><td align="right"><font size="2">588476377</font></td><td><font size="2"><a href="/cgi-bin/page?id=1275591">1275591</a></font></td></tr>
    <tr><td align="right"><font size="2">588484813</font></td><td><font size="2"><a href="/cgi-bin/page?id=1210540">1210540</a></font></td></tr>
    </table>
    </BODY>
    </HTML>
    };

    # With decoding turned off, the header is matched against the raw entity text.
    my $te = HTML::TableExtract->new( headers => ['Some&nbsp;ID'], decode => 0 );
    $te->parse($html);

    eval { $te->rows; };
    if ( $@ ) {
        print "No rows found\n";
    }
    print Dumper($te->rows);
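The snippets in this thread call `$te->rows` once inside `eval` and then again in `Dumper`, so a failure would be reported and then triggered a second time. Below is a minimal sketch of running the call once and keeping its result. `fetch_rows` is a hypothetical stand-in for `$te->rows`, and it assumes, as the thread's code implies, that the call dies when no table matched.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $te->rows: dies when nothing matched,
# otherwise returns a list of array references (one per row).
sub fetch_rows {
    my ($matched) = @_;
    die "no matching table\n" unless $matched;
    return ( [ 588476377, 1275591 ], [ 588484813, 1210540 ] );
}

# Run the call once inside eval and keep its result. Returns an
# array reference of rows, or undef if the call died.
sub rows_or_undef {
    my ($matched) = @_;
    my @rows;
    my $ok = eval { @rows = fetch_rows($matched); 1 };
    return $ok ? \@rows : undef;
}

for my $matched ( 1, 0 ) {
    my $rows = rows_or_undef($matched);
    if ($rows) {
        print scalar(@$rows), " rows found\n";
    }
    else {
        print "No rows found\n";
    }
}
```

Ending the `eval` block with a true value and testing its return avoids relying on `$@` alone.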
