PerlMonks  

Table.pm: Extract text from html tables

by zzspectrez (Hermit)
on Jan 18, 2001 at 12:00 UTC ( [id://52723]=CUFP )

A simple little module that lets you access the text inside nested tables through a multidimensional array. The HTML can come either from a variable (passed by reference) or from a file.

usage:
my $table = Table->parse_it(\$content); or
my $table = Table->parse_it($filename);

then:
print $table->[$table_number][$row][$col];

package Table;

use strict;
use HTML::Parser;

## PRIVATE
my $table = [];
my $tb_count;
my $tb_idx;
my $row;
my $column;
my $table_status;
my @save;

sub new {
    my $type = shift;
    return bless $table, $type;
}

sub parse_it {
    my $self = shift;
    my $src  = shift;

    my $p = HTML::Parser->new(
        api_version => 3,
        handlers    => [
            start => [ \&_start, "tagname" ],
            end   => [ \&_end,   "tagname" ],
            text  => [ \&_text,  "dtext" ],
        ],
        marked_sections => 1,
    );

    if (ref($src)) {
        $p->parse($$src) or return;
    }
    else {
        $p->parse_file($src) or return;
    }
    return 1;
}

sub _start {
    my $tag = shift;
    if ($tag eq 'table') {
        push @save, [$tb_idx, $row, $column];
        $row = $column = 0;
        ++$tb_count;
        $tb_idx = $tb_count;
        ++$table_status;
    }
    $row++    if ($tag eq 'tr');
    $column++ if ($tag eq 'td');
}

sub _end {
    my $tag = shift;
    if ($tag eq 'table') {
        ($tb_idx, $row, $column) = @{ pop @save };
        --$table_status;
    }
    $column = 0 if ($tag eq 'tr');
}

sub _text {
    my $text = shift;
    $text =~ s/\xa0//;
    $table->[$tb_idx][$row][$column] .= $text
        if ($table_status) && ($text !~ m/^\s+$/) && ($text);
}

return 1;
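To see what the handlers above build without installing HTML::Parser, here is a minimal sketch that replays a hand-written event list through the same counters. The event list, the build_tables name and the $depth variable are assumptions made for illustration, and the text filter is simplified to "contains a non-space character", so this is not the module itself.

```perl
use strict;
use warnings;

# Hypothetical event list standing in for HTML::Parser callbacks.
# It describes one outer table whose second cell holds a nested table.
my @events = (
    [ start => 'table' ],
    [ start => 'tr' ],
    [ start => 'td' ], [ text => 'outer A' ], [ end => 'td' ],
    [ start => 'td' ],
    [ start => 'table' ],
    [ start => 'tr' ],
    [ start => 'td' ], [ text => 'inner' ], [ end => 'td' ],
    [ end => 'tr' ],
    [ end => 'table' ],
    [ text => 'outer B' ],
    [ end => 'td' ],
    [ end => 'tr' ],
    [ end => 'table' ],
);

sub build_tables {
    my @events = @_;
    my $table  = [];
    my ($tb_count, $tb_idx, $row, $column, $depth) = (0, 0, 0, 0, 0);
    my @save;    # stack of positions in the enclosing tables

    for my $ev (@events) {
        my ($kind, $arg) = @$ev;
        if ($kind eq 'start') {
            if ($arg eq 'table') {
                # Remember where we were, then start a fresh table.
                push @save, [ $tb_idx, $row, $column ];
                $row = $column = 0;
                $tb_idx = ++$tb_count;
                ++$depth;
            }
            $row++    if $arg eq 'tr';
            $column++ if $arg eq 'td';
        }
        elsif ($kind eq 'end') {
            if ($arg eq 'table') {
                # Resume the enclosing table at the saved position.
                ($tb_idx, $row, $column) = @{ pop @save };
                --$depth;
            }
            $column = 0 if $arg eq 'tr';
        }
        elsif ($kind eq 'text') {
            # Keep only non-blank text found inside some table.
            next unless $depth && $arg =~ /\S/;
            $table->[$tb_idx][$row][$column] .= $arg;
        }
    }
    return $table;
}

my $table = build_tables(@events);
print $table->[1][1][1], "\n";    # outer A
print $table->[2][1][1], "\n";    # inner
print $table->[1][1][2], "\n";    # outer B
```

Note that table, row and column numbers all start at 1, and that text following a nested table lands back in the enclosing cell because the saved position is restored when the inner table closes.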

Replies are listed 'Best First'.
Re: Table.pm: Extract text from html tables
by merlyn (Sage) on Jan 18, 2001 at 19:58 UTC

      I did search search.cpan.org first. I agree with you that it is better not to reinvent the wheel if you don't have to: not only are you wasting time, but the established code will probably be more efficient, or at least better debugged.

      However, I don't think HTML::Table applies well in this situation because it is for creating tables. I just want to get the data.

      I did install HTML::TableExtract before attempting it myself. However, it did not seem to work well for my needs. The author states that it was designed with selecting table data by table headers in mind. In my case, the site I am accessing doesn't use text headers in its tables at all. The module also allows selecting data by Depth and Count.

      From the pod.

      Depth and Count are more specific ways to specify tables in relation to one another. Depth represents how deeply a table resides in other tables. The depth of a top-level table in the document is 0. A table within a top-level table has a depth of 1, and so on. Each depth can be thought of as a layer; tables sharing the same depth are on the same layer. Within each of these layers, Count represents the order in which a table was seen at that depth, starting with 0. Providing both a depth and a count will uniquely specify a table within a document.

      This seems confusing to me for a document such as the one I am accessing, which has multiple top-level tables with many sub-tables beneath them.

      My solution lets me access the table data through a multidimensional array. Just count each <table> tag until you reach the table that contains the data you want, note the row and column within that table, and then access it as $table->[table_number][row][column]. This seems much easier and, in my opinion, a better tool for my particular situation. Of course, HTML::TableExtract is a much more robust way to handle tables, and better for situations where you can select the tables using headers instead of hard-coding to the page layout.
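To make the two numbering schemes concrete, here is a small sketch that labels each table in a document both ways: the sequential number obtained by counting <table> tags, and the depth/count pair as described in the pod quoted above. The tag list and the number_tables helper are hypothetical; nothing here calls HTML::TableExtract.

```perl
use strict;
use warnings;

# Hypothetical layout: a top-level table holding two nested tables,
# followed by a second top-level table.
my @tags = qw(table table /table table /table /table table /table);

sub number_tables {
    my @tags = @_;
    my $seq   = 0;     # running count of <table> tags seen so far
    my $depth = -1;    # becomes 0 inside a top-level table
    my @seen_at_depth; # number of tables already seen at each depth
    my @rows;

    for my $tag (@tags) {
        if ($tag eq 'table') {
            ++$depth;
            $seen_at_depth[$depth] = 0 unless defined $seen_at_depth[$depth];
            my $count = $seen_at_depth[$depth]++;
            push @rows, { seq => ++$seq, depth => $depth, count => $count };
        }
        elsif ($tag eq '/table') {
            --$depth;
        }
    }
    return @rows;
}

my @rows = number_tables(@tags);
printf "table %d -> depth %d, count %d\n", @{$_}{qw(seq depth count)}
    for @rows;
```

With this layout, tables 1 and 4 sit at depth 0 (counts 0 and 1) and tables 2 and 3 sit at depth 1 (counts 0 and 1), so the sequential number is simply the order in which the opening tags appear in the source.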

      If you disagree with this, I would be interested in your reasons why. I respect your opinion, as a known Perl wizard!

      Thanks!
      zzSPECTREz

      I dunno, merlyn if you were already committed to using HTML::Parser this might be nice to have about. Once you are carrying the whole toolbox in, it seems a shame to go back for one more tool.

      --
      $you = new YOU;
      honk() if $you->love(perl)

Re: Table.pm: Extract text from html tables
by Anonymous Monk on Feb 10, 2005 at 06:33 UTC
    Thanks for writing this little snippet/module. Has proved useful in my use of HTML::Parser, particularly in being able to index into a complex table structure.
      HTML::TableExtract is not useful at all. I never can get any piece out of the html file.
        HTML::TableExtract is not useful at all. I never can get any piece out of the html file.
        You are doing something wrong then, because it works and I've found it very useful on occasion.
Re: Table.pm: Extract text from html tables
by Anonymous Monk on Apr 30, 2010 at 23:56 UTC
    Is there an error in the Table.pm code?

    It seems that getting a "table" object populated with the HTML table contents required a change to Table.pm.

    It was necessary to replace the line "return 1;" in "parse_it" with "return $table;" in order for the "example.pl" script to function correctly.

    Here's the "example.pl" script:

    #!/bin/perl
    use table;
    $filename = Some html file ;
    my $table = Table->parse_it($filename);
    for $x (1 .. $#$table) {
        print "table $x contents:\n";
        for $y (1 .. $#{$table->[$x]}) {
            for $z (1 .. $#{$table->[$x][$y]}) {
                print "cell($y,$z)-->\"$table->[$x][$y][$z]\"\n";
            }
        }
        print "\n\n";
    }
      Is there an error in the Table.pm code?

      I don't know, but since you have use table; the problem is with your code, since Table ne table
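For anyone walking the resulting structure, here is a rough sketch of a loop that skips empty slots, since a cell that held only a nested table is never assigned. The data is a hand-built stand-in (an assumption, not real parser output), and cell_lines is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hand-built stand-in for the structure Table.pm fills in.
# Index 0 is left empty at every level; one cell is undefined.
my $table = [
    undef,
    [ undef, [ undef, 'name', 'price' ], [ undef, 'apple', undef ] ],
    [ undef, [ undef, '0.50' ] ],
];

sub cell_lines {
    my ($table) = @_;
    my @lines;
    for my $t (1 .. $#$table) {
        my $rows = $table->[$t] or next;        # table with no text at all
        for my $r (1 .. $#$rows) {
            my $cols = $rows->[$r] or next;     # row with no text
            for my $c (1 .. $#$cols) {
                next unless defined $cols->[$c];    # cell with no text
                push @lines, "table $t cell($r,$c) --> \"$cols->[$c]\"";
            }
        }
    }
    return @lines;
}

my @lines = cell_lines($table);
print "$_\n" for @lines;
```

Checking each level before dereferencing keeps the loop quiet under "use warnings" and avoids autovivifying empty rows while reading.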
