PerlMonks  

Queries on HTML::TableExtract - How to parse from saved html file

by howdoesitwork (Initiate)
on Aug 08, 2012 at 08:11 UTC (#986173=perlquestion)
howdoesitwork has asked for the wisdom of the Perl Monks concerning the following question:

Hello there. I'm fairly new to Perl and mainly have a Java background, so do let me know if I'm missing something completely obvious.

I have been looking at various examples and trying to get them to work, but so far I haven't been able to get any that use parse_file to work. The only one I got working uses parse() to parse an HTML string.

For more background, I'm on 64-bit Windows using Strawberry Perl, and I installed all the prerequisites for HTML::TableExtract through CPAN. If possible, what I'd love is an example of extracting table data from an HTML file already saved locally; I should hopefully be able to fumble my way around from there.

Essentially, what I need to do is extract some table rows from an HTML file that's saved on my computer. My apologies for the pretty horribly formatted post, and thanks for having a look!

Edit: I can't seem to post in the thread, so I'm probably doing something wrong.

aitap: This is part of the file I'll be parsing (it's pretty horribly formatted, and there are sometimes empty td tags):

<tr>
<td>2012/07/30</td>
<td><a href="http://www.zone-h.org/archive/special=1/notifier=Dg4nx">Dg4nx</a></td>
<td>H</td>
<td></td>
<td><a href="http://www.zone-h.org/archive/domain=www.bauan.gov.ph">R</a></td>
<td><img src="../../images/cflags/png/us.png" alt="United States" title="United States"></td>
<td><img src="../../images/star.gif" border="0"></td>
<td>www.bauan.gov.ph </td>
<td>Linux</td>
<td><a href="http://www.zone-h.org/mirror/id/18160940">mirror</a></td>
</tr>

As to examples, the one I'm trying is from http://search.cpan.org/~msisk/HTML-TableExtract-2.10/lib/HTML/TableExtract.pm, but I seem to be missing something. I keep seeing a "can't call method "tree" on an undefined value at line 5" error when using this code from the TableExtract examples. (I have tried passing in an HTML file, $html_file = "page1.html";, but it doesn't seem to be working.)

use HTML::TableExtract qw(tree);
$te = HTML::TableExtract->new( headers => [qw(Date Notifier H M R L Domain OS View)] );
$te->parse_file($html_file);
$table = $te->first_table_found;
$table_tree = $table->tree;
$table_html = $table_tree->as_HTML;
$table_text = $table_tree->as_text;
$document_tree = $te->tree;
$document_html = $document_tree->as_HTML;

(My input likely won't fit this, but I'm just trying to get an example working to start with. I know I'm missing something, but I'm not quite sure what.)

influx: I'll give that a shot, thanks. appreciate the responses!

Re: Queries on HTML::TableExtract - How to parse from saved html file
by aitap (Chaplain) on Aug 08, 2012 at 08:31 UTC

    Can you post a small example of your file, so it will be easier to help you parse it?

    Posting examples usually helps others help you, whether it's example code, the file to be parsed, or the error message.

    Sorry if my advice was wrong.
Re: Queries on HTML::TableExtract - How to parse from saved html file
by influx (Beadle) on Aug 08, 2012 at 08:37 UTC

    I don't know much about that module, but if parse_file() isn't cooperating, then perhaps you could slurp the file into a string and continue using the parse() method instead.

    For larger files you might be better off using File::Slurp or something:

    use File::Slurp 'read_file';
    my $html = "/path/to/file.html";
    my $str = read_file($html);

    Once you've done that, then you can just parse $str as you normally would a string.
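If File::Slurp isn't installed, the same slurp step can be done with core Perl only. This is a minimal sketch; the helper name slurp_file is made up for illustration and is not part of any module.

```perl
use strict;
use warnings;

# Read an entire file into one string using only core Perl.
# slurp_file is a hypothetical helper name, not part of any module.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;                 # no record separator: read the whole file at once
    my $content = <$fh>;
    close $fh;
    return defined $content ? $content : '';
}
```

The returned string can then be handed to parse() in place of $str above, for example my $str = slurp_file("/path/to/file.html");.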

Re: Queries on HTML::TableExtract - How to parse from saved html file
by Anonymous Monk on Aug 08, 2012 at 10:10 UTC

    can't call method "tree" on an undefined value at line 5

    That means first_table_found did not find any tables, so $table is undefined. It can happen.

      hmm.. gotcha =/ i'll keep trying, then, thanks.
        There are such things as "CSS div tables", which use div elements and CSS to look like tables in modern browsers but aren't real tables. TableExtract won't help you with those.
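One likely reason for the undefined value: as far as the module documentation describes it, the headers option restricts matching to tables whose header cells contain the listed strings, and the sample rows posted above show no such header row. A rough sketch that sidesteps this is to drop the headers constraint, use plain row extraction instead of tree mode, and check what came back before calling methods on it. It assumes the saved page really contains table markup; the helper name extract_rows and the command-line usage are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Return every row of every table in a saved HTML file, each row as an
# array ref of cell strings. extract_rows is a hypothetical helper name.
sub extract_rows {
    my ($html_file) = @_;
    die "Cannot read $html_file\n" unless -r $html_file;

    # No headers/depth/count constraints, so all tables are candidates.
    my $te = HTML::TableExtract->new;
    $te->parse_file($html_file);

    my @rows;
    for my $table ($te->tables) {
        for my $row ($table->rows) {
            # Empty <td></td> cells may be undefined: normalise them to ''
            # and trim surrounding whitespace from every cell.
            my @cells = map {
                my $cell = defined $_ ? $_ : '';
                $cell =~ s/^\s+|\s+$//g;
                $cell;
            } @$row;
            push @rows, \@cells;
        }
    }
    return @rows;
}

# Runs only when a file name is given, e.g.:  perl extract.pl page1.html
if (@ARGV) {
    my @rows = extract_rows($ARGV[0]);
    warn "No tables found in $ARGV[0]\n" unless @rows;
    print join(' | ', @$_), "\n" for @rows;
}
```

If this prints nothing, the page probably has no real table elements (the div case mentioned above). If it does print rows, the headers list can be reintroduced one name at a time to see which one fails to match.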

Node Type: perlquestion [id://986173]
Approved by Corion