PerlMonks  

embedded table remover

by BigJoe (Curate)
on May 26, 2000 at 11:15 UTC ( #14932=sourcecode )

Category: HTML Utilities
Author/Contact Info Big Joe Big_Joe1008@linuxstart.com
Description: Run this script on an HTML document to remove all embedded (nested) tables, assuming the tables in the document are correctly nested. By default it removes every embedded table and leaves the main table, but you can allow more levels of nesting by changing the $numofTables variable.
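#!/usr/bin/perl -w
use strict;

my $inputfile   = "test.htm";
my $outputfile  = "outfile2.html";
my $numofTables = 1;    # how many levels of table nesting to keep

# read the whole input document into memory
open(INFILE, '<', $inputfile) or die("no file $inputfile: $!");
my $filesize = -s INFILE;
my $thispage;
read(INFILE, $thispage, $filesize);
close(INFILE);

# this replaces any line-break tags with a space
$thispage =~ s/<BR>/ /g;
$thispage =~ s/<\/BR>/ /g;

# split on whitespace; "\s" inside a double-quoted string is just the letter "s"
my @myarray = split(' ', $thispage);
open(OUTFILE, '>', $outputfile) or die("cannot write $outputfile: $!");

my $start = 0;    # current nesting depth
foreach (@myarray) {
    # this is not meant to clean up the HTML, but the ASP that wrote it
    # put the table tags and script tags on their own line
    if (($_ =~ m/<TABLE/) || ($_ =~ m/<SCRIPT/)) {
        $start++;
    }
    # only print tokens that are not nested too deeply
    if ($start <= $numofTables) {
        print OUTFILE "$_ ";
    }
    if ($_ =~ m/<\/TABLE>/) {
        $start--;
        print OUTFILE "</TR><TR>\n<TD>";
    } elsif ($_ =~ m/<\/SCRIPT>/) {
        $start--;
    }
}

close(OUTFILE);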
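
The depth-counting idea in the script can also be sketched as a small function that works on a string instead of files. This is only an illustration, not the author's code: the name strip_nested_tables is hypothetical, the matching is case-insensitive (an assumption; the script above matches uppercase tags only), and it leaves out the SCRIPT handling and the extra row markup the script prints after each closing table tag.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: drop every table nested deeper than $max_depth,
# together with everything inside it.
sub strip_nested_tables {
    my ($html, $max_depth) = @_;
    my $depth = 0;
    my $out   = '';

    # The capturing group makes split return the table tags as their own pieces.
    for my $piece (split /(<\/?table\b[^>]*>)/i, $html) {
        my $is_open  = $piece =~ /^<table\b/i;
        my $is_close = $piece =~ /^<\/table\b/i;

        $depth++ if $is_open;                      # entering a table
        $out .= $piece if $depth <= $max_depth;    # keep shallow content only
        $depth-- if $is_close && $depth > 0;       # leaving a table
    }
    return $out;
}

my $html = '<table><tr><td>outer<table><tr><td>inner</td></tr></table></td></tr></table>';
print strip_nested_tables($html, 1), "\n";
```

Passing 1 as the second argument mirrors the default of $numofTables: the main table stays and anything nested inside it is dropped. Like the original, this relies on the tables being correctly nested.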

RE: embedded table remover
by merlyn (Sage) on May 27, 2000 at 00:01 UTC
    Perhaps a more robust (and shorter) solution can be created on top of HTML::Table, part of LWP. Amazing how much reinvention happens (creating more fragile solutions) when you don't check the CPAN first. :)

    -- Randal L. Schwartz, Perl hacker

      I read up on that and really didn't understand it. It showed how to access the data but I wanted to just remove all the embedded tables.
      HTML::Table is used for creating tables, rather than reading them. I suspect you meant HTML::TableExtract?

      Again, however, I suspect that won't really work either, as it discards all the information it doesn't need.

      You probably just want to build a handler onto HTML::Parser:

      #!/usr/bin/perl -w
      use strict;
      use HTML::Parser;

      my $in_table = 0;
      my $p = HTML::Parser->new(
          default_h => [ sub { print shift unless $in_table }, 'text' ],
          start_h   => [ sub { shift eq 'table' ? $in_table++ : $in_table || print shift }, 'tagname, text' ],
          end_h     => [ sub { shift eq 'table' ? $in_table-- : $in_table || print shift }, 'tagname, text' ],
      );
      $p->parse_file(shift || die "Need a file") || die $!;

      Tony
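
The handler above drops every table, including the outermost one. A possible variant that keeps the main table, as the original post wanted, could count nesting depth inside the handlers. This is a rough sketch, not Tony's code: the function name strip_nested_tables_parser and its $max_depth parameter (meant to play the role of $numofTables) are hypothetical, and it collects the output in a string rather than printing it.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: keep tables up to $max_depth levels deep.
sub strip_nested_tables_parser {
    my ($html, $max_depth) = @_;
    my $depth = 0;
    my $out   = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        # text, comments and declarations: keep them unless nested too deeply
        default_h => [ sub { $out .= $_[0] if $depth <= $max_depth }, 'text' ],
        start_h   => [ sub {
            my ($tag, $text) = @_;
            $depth++ if $tag eq 'table';            # count before deciding
            $out .= $text if $depth <= $max_depth;
        }, 'tagname, text' ],
        end_h     => [ sub {
            my ($tag, $text) = @_;
            $out .= $text if $depth <= $max_depth;  # decide before uncounting
            $depth-- if $tag eq 'table' && $depth > 0;
        }, 'tagname, text' ],
    );

    $p->parse($html);
    $p->eof;
    return $out;
}

print strip_nested_tables_parser(
    '<table><tr><td>outer<table><tr><td>inner</td></tr></table></td></tr></table>', 1
), "\n";
```

Because HTML::Parser reports tag names in lowercase, this version does not depend on how the source document capitalises its tags or where its line breaks fall.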
