PerlMonks  

Remove section from a HTML file

by Xevven (Initiate)
on Oct 24, 2013 at 14:34 UTC (#1059482=perlquestion)
Xevven has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I'm getting desperate trying to remove a section from a bunch of HTML files. I have never worked with additional modules in Perl, but I guess this time I can't avoid it ;-)

What I want to achieve:
* Remove a section from an HTML file, but only from files that have exactly one dot in their filename (match abc.html, but not abc.html_aaa.html or abc.html_bbb.html in the same folder)

* The section looks like this:

<div class="sectionHeading">REMOVE_THIS</div>
<div class="sectionContent">
<table class="sectionTable" border="0" cellspacing="0" cellpadding="0" title="Properties" summary="Properties">
<tr valign="top">
<th class="sectionTableHeading" scope="row" id="property_aaa bbb" abbr="aaa bbb">aaa bbb</th><td class="sectionTableCell" align="left" headers="property_aaa bbb"><img width="20" height="15" alt="" title="" src="./../../images/indent.gif"></td>
</tr>
<tr valign="top">
<th class="sectionTableHeading" scope="row" id="property_ccc ddd" abbr="ccc ddd">ccc ddd</th><td class="sectionTableCell" align="left" headers="property_ccc ddd"><img width="20" height="15" alt="" title="" src="./../../images/indent.gif"></td>
</tr>
<tr valign="top">
<th class="sectionTableHeading" scope="row" id="property_eee" abbr="eee">eee</th><td class="sectionTableCell" align="left" headers="property_eee"><img width="20" height="15" alt="" title="" src="./../../images/indent.gif"></td>
</tr>
<tr valign="top">
<th class="sectionTableHeading" scope="row" id="property_fff" abbr="fff">fff</th><td class="sectionTableCell" align="left" headers="property_fff"><img width="20" height="15" alt="" title="" src="./../../images/indent.gif"></td>
</tr>
<tr valign="top">
<th class="sectionTableHeading" scope="row" id="property_ggg" abbr="ggg">ggg</th><td class="sectionTableCell" align="left" headers="property_ggg"><img width="20" height="15" alt="Yes" title="Yes" src="./../../images/true.gif"></td>
</tr>
<tr valign="top">
<th class="sectionTableHeading" scope="row" id="property_hhh" abbr="hhh">hhh</th><td class="sectionTableCell" align="left" headers="property_hhh"><img width="20" height="15" alt="" title="" src="./../../images/indent.gif"></td>
</tr>
</table>
</div>


* The section always starts with the two div containers, but the content of its inner table may vary (to be precise: only the referenced filenames './../../images/indent.gif' and './../../images/true.gif' vary).

I think, this section is too complicated to match with RegExp, do you agree?
Can I expect help from HTML::TokeParser or something similar?

Thanks for any helping hand :]
Cheers,
Xevven

Replies are listed 'Best First'.
Re: Remove section from a HTML file
by kcott (Chancellor) on Oct 24, 2013 at 15:42 UTC

    G'day Xevven,

    Welcome to the monastery.

    "I think, this section is too complicated to match with RegExp, do you agree?"

    No, I don't agree. On the basis of the data you've shown, this regex works just fine:

    my $re = qr{
        <div \s+ class="sectionHeading">.*?</div>\s+
        <div \s+ class="sectionContent">.*?</div>\s+
    }msx;

    Here's my test:

    #!/usr/bin/env perl

    use strict;
    use warnings;

    my $re = qr{
        <div \s+ class="sectionHeading">.*?</div>\s+
        <div \s+ class="sectionContent">.*?</div>\s+
    }msx;

    my $html = do { local $/; <DATA> };

    $html =~ s/$re//;

    print $html;

    __DATA__
    <!-- KEEP -->
    <div class="sectionHeading">REMOVE_THIS</div>
    <div class="sectionContent">
    <table class="sectionTable" ...
    ...
    </table>
    </div>
    <!-- KEEP -->

    I added the <!-- KEEP --> comments as markers. I used all the <table>...</table> data exactly as you posted: I saw no reason to repeat it all again here.

    Here's the output:

    <!-- KEEP -->
    <!-- KEEP -->

    -- Ken

      Thank you very much, this is indeed working as expected, even if I put a complete real-world file in the __DATA__ section ;-)

      I tried to alter the script so that it modifies all of the appropriate files. For testing purposes, I tried to match the files and output their modified content. It seems that this approach eliminates all line breaks: the output is all on a single line. Can someone help me find where my error is? ;-)

      Cheers,
      Xevven

      #!/usr/bin/env perl

      use strict;
      use warnings;

      my $re = qr{
          <div \s+ class="sectionHeading">REMOVE_THIS.*?</div>\s+
          <div \s+ class="sectionContent">.*?</div>\s+
      }msx;

      #my $html = do { local $/; <DATA> };
      #$html =~ s/$re//;

      opendir(my $dh, ".") or die "$!";
      my @files = grep { s/\././g < 2 } <*.html>;
      closedir $dh;

      for my $file (@files) {
          local $/ = undef;
          open my $fh, "<", $file or die "$!";
          my $content = <$fh>;
          $content =~ s/$re//;
          print $content;
          close $fh;
      }

Re: Remove section from a HTML file
by rhumbliner (Sexton) on Oct 24, 2013 at 15:30 UTC

    you probably need to explain this case in a little more detail, but i see no reason why this example is too complicated to solve with a simple regex.

    i would break this problem up into two parts where the first portion consists of working with only the files with one dot:

    my @files = grep { s/\././g < 2 } <*.html>;
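A variant of the same filter, sketched with tr/// instead of s///: tr/.// counts the dots without rewriting the string. The helper name has_single_dot and the sample file names are only illustrative (the names are taken from the question), so treat this as one possible approach rather than the poster's method.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the name contains exactly one dot.
# tr/.// with an empty replacement list only counts; the string is unchanged.
sub has_single_dot {
    my ($name) = @_;
    return ($name =~ tr/.//) == 1;
}

# Sample names from the question; in a real run these would come from <*.html>.
my @candidates = ('abc.html', 'abc.html_aaa.html', 'abc.html_bbb.html');
my @files      = grep { has_single_dot($_) } @candidates;

print "$_\n" for @files;    # prints abc.html
```

Compared with counting via s/\././g, this tests for exactly one dot and makes it obvious that nothing in the list is modified.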

    then you can focus on modifying the files that qualify. if i understand your example correctly (and i probably don't) here's how i would remove the first div only if the second div contains an img tag with its src set to indent.gif

    $html =~ m|<div class="sectionHeading">.+?</div>\s+<div class="sectionContent">.+?<img .+? src="./../../images/indent.gif">.+?</div>|s
        and do {
            $html =~ s|<div class="sectionHeading">.+?</div>\s+||;
        };

    actually, this can probably be done using a look ahead but then the example gets a little more complicated.
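A rough sketch of what the lookahead version might look like. It assumes the heading div contains plain text only and that the content div has no nested divs; the sub name strip_heading and the shortened sample HTML are made up for illustration. The lookahead (?= ... ) checks that the following div references indent.gif without consuming it, so only the heading div is removed.

```perl
use strict;
use warnings;

# Match the heading div only when the next div contains an <img> whose src
# is indent.gif. The tempered dot (?: (?! </div> ) . )*? keeps the lookahead
# from scanning past the end of the content div.
my $re = qr{
    <div \s+ class="sectionHeading"> [^<]* </div> \s+
    (?=
        <div \s+ class="sectionContent">
        (?: (?! </div> ) . )*?
        <img \s [^>]* src="\./\.\./\.\./images/indent\.gif">
    )
}sx;

# Hypothetical helper: returns a copy of the HTML with the heading removed.
sub strip_heading {
    my ($html) = @_;
    $html =~ s/$re//;
    return $html;
}

# Shortened stand-in for the section shown in the question.
my $sample = <<'HTML';
<!-- KEEP -->
<div class="sectionHeading">REMOVE_THIS</div>
<div class="sectionContent">
<table class="sectionTable">
<tr><td><img width="20" height="15" alt="" title="" src="./../../images/indent.gif"></td></tr>
</table>
</div>
<!-- KEEP -->
HTML

print strip_heading($sample);
```

The dots in the src path are escaped here so they match literal dots only, which is slightly stricter than the two-step version.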

Re: Remove section from a HTML file
by aaron_baugher (Curate) on Oct 24, 2013 at 16:12 UTC

    There's no such thing as "too complicated to match with RegExp." Often it does make sense to use a module that understands the format, but in a case like this where you're matching one exact chunk of text, a regex is pretty straightforward:

    perl -0777 -p -i -e 's|<div class="sectionHeading.+?</table>\s+</div>||s' test.html

    To apply that only to certain files, you can wrap that regex in perl code that filters multiple files through it, or use the shell to tell that command what files to work on.
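One way the "wrap that regex in perl code" option could look, as a minimal sketch. It assumes the file names are passed on the command line and that rewriting each file in place is acceptable (as with -i above, there is no backup). The helper names has_single_dot and strip_section_in_file are hypothetical.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Same pattern as the one-liner above; /s lets "." match newlines.
my $re = qr{<div class="sectionHeading.+?</table>\s+</div>}s;

# Hypothetical helper: true when the file name (without its directory)
# contains exactly one dot, e.g. abc.html but not abc.html_aaa.html.
sub has_single_dot {
    my $name = basename($_[0]);
    return ($name =~ tr/.//) == 1;
}

# Hypothetical helper: slurp the file, drop the first matching section,
# and rewrite the file only if something was removed.
sub strip_section_in_file {
    my ($path) = @_;
    open my $in, '<', $path or die "Cannot read $path: $!";
    my $content = do { local $/; <$in> };
    close $in;

    return 0 unless $content =~ s/$re//;

    open my $out, '>', $path or die "Cannot write $path: $!";
    print {$out} $content;
    close $out or die "Cannot close $path: $!";
    return 1;
}

for my $path (grep { has_single_dot($_) } @ARGV) {
    my $changed = strip_section_in_file($path);
    print "$path: ", ($changed ? "section removed" : "no match"), "\n";
}
```

Saved under a name of your choosing, it could be run from the shell with the candidate files as arguments, for example with *.html; names with more than one dot are skipped by the filter.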

    Aaron B.
    Available for small or large Perl jobs and *nix system administration; see my home node.

Re: Remove section from a HTML file
by stonecolddevin (Vicar) on Oct 25, 2013 at 21:45 UTC

    First off, don't parse HTML with regex.

    Second off, Web::Scraper is absolutely awesome for dealing with HTML using a nice, easy and coherent DSL.

    Three thousand years of beautiful tradition, from Moses to Sandy Koufax, you're god damn right I'm living in the fucking past
