PerlMonks  

remove html comments using HTML::HTML5::ToText

by mendeepak (Scribe)
on Jun 14, 2013 at 12:36 UTC (#1038958, perlquestion)
mendeepak has asked for the wisdom of the Perl Monks concerning the following question:

When I tried to convert HTML content to plain text using the HTML::HTML5::ToText module (I am using the newer version, 0.003, which was supposed to have fixed the comment bug), the converted text still contained the text of HTML comments, which should have been removed.
But when I removed the "RenderTables" trait, the module worked fine and did not show the HTML comment text.
Any ideas?
Here is my example code. HTML part:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>my html</title>
</head>
<body>
<table width="100%" border="0" bgcolor="a5a3a6">
  <tr>
    <td>
      <p>&nbsp;</p>
      <table width="600" cellpadding="0" cellspacing="0" border="0" align="center" style="margin-top:10px;font-family:Verdana, Geneva, sans-serif">
        <tr style="background-color:#5d5d65;background: -webkit-gradient(linear, 0% 0%, 0% 100%, from(#5d5d65), to(#45454d));background: -webkit-linear-gradient(top, #5d5d65, #45454d bottom);background: -moz-linear-gradient(top, #5d5d65 0%, #45454d 100%);background: -ms-linear-gradient(top, #5d5d65, #45454d bottom);background: -o-linear-gradient(top, #5d5d65, #45454d bottom);filter: progid:DXImageTransform.Microsoft.gradient(startColorstr='#5d5d65', endColorstr='#45454d');">
          <td style="border-top-left-radius: 5px;-webkit-border-top-left-radius:5px;-moz-border-top-left-radius:5px;padding:15px 0 15px 25px"><img src="http://www.midphase.com/emails/midphase-email/images/logo.png" width="191" height="52" style="display:block" /></td>
          <td align="right" style="border-top-right-radius: 5px;-webkit-border-top-right-radius:5px;-moz-border-top-right-radius:5px;padding:15px 25px 15px 0;color:#cccccc;font-size:21px;"><!-- SUBJECT HERE -->Thank You <!-- // SUBJECT HERE --></td>
        </tr>
      </table>
</body>
</html>
Perl part:

print HTML::HTML5::ToText->with_traits(qw/ShowLinks ShowImages TextFormatting RenderTables/)->new()->process_string($html_string);

*=*=*dEEPAk*=*=*

Re: remove html comments using HTML::HTML5::ToText
by tobyink (Abbot) on Jun 14, 2013 at 13:10 UTC

    It looks like there's still a $node->isa("XML::LibXML::Text") check that should be replaced with $node->nodeName eq "#text".
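To illustrate why the two checks can disagree, here is a minimal sketch, assuming XML::LibXML is installed and that (as in the releases I am aware of) XML::LibXML::Comment inherits from XML::LibXML::Text. The scratch document and the variable names are made up for the example; this is not the module's actual code. It prints, for a text node and a comment node, whether each check treats the node as text.

```perl
use strict;
use warnings;
use XML::LibXML;

# Build one text node and one comment node in a scratch document.
my $doc     = XML::LibXML::Document->new('1.0', 'UTF-8');
my $text    = $doc->createTextNode('Thank You');
my $comment = $doc->createComment(' SUBJECT HERE ');

for my $node ($text, $comment) {
    # Class-based check: also matches subclasses of XML::LibXML::Text.
    my $by_isa  = $node->isa('XML::LibXML::Text') ? 'yes' : 'no';

    # Name-based check: matches only nodes whose DOM name is "#text".
    my $by_name = $node->nodeName eq '#text' ? 'yes' : 'no';

    printf "%-10s isa Text: %-3s  nodeName eq #text: %s\n",
        $node->nodeName, $by_isa, $by_name;
}
```

If the inheritance assumption holds, the isa check lets comment nodes through while the nodeName comparison does not, which would match the symptom described in the question.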

    HTML::HTML5::ToText 0.004 should be on a CPAN mirror near you some time in the next few hours with a fix.

    package Cow { use Moo; has name => (is => 'lazy', default => sub { 'Mooington' }) } say Cow->new->name

      Thanks, it works fine now.

      *=*=*dEEPAk*=*=*
