
Converting HTML to txt with HTML::Strip

by elef (Friar)
on Oct 03, 2010 at 10:28 UTC (#863161, perlquestion)
elef has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I am asking for your help in converting HTML files to UTF-8 txt.

The task seems simple, but it's a lot trickier than I expected.

I have a working solution, but it involves decoding character entities with HTML::Entities before running HTML::Strip. As a result, if the text in the HTML file contains something like &lt;this is a tag quoted inside html&gt; (angle brackets written as character references), it is decoded to <this is a tag quoted inside html> and then stripped along with the real HTML tags.
I tried decoding the character entities later, when the stripper is run (see the lines I commented out). In that case I get incorrect character conversions (eacute and uuml) and "Wide character in print" warnings. I could patch my original solution with some sort of workaround (say, telling HTML::Entities to ignore &lt; and &gt;, although I can't find an easy way to do that), but I'm more interested in what the "proper" solution is.
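
For reference, Perl emits "Wide character in print" when a string containing code points above 255 is printed to a filehandle that has no encoding layer. Whether that is the exact cause in the script below is an assumption, but the second pass does open its output with a plain ">". A minimal sketch using only core Perl; the in-memory filehandle and the variable names are illustrative and not taken from the post:

```perl
use strict;
use warnings;

# A character string: U+0150 (O with double acute) and U+00E1 (a with acute).
my $text = "F\x{150}tan\x{E1}csnok";

# Printing $text to a handle opened with plain ">" would warn
# "Wide character in print". With an encoding layer, Perl encodes
# the characters to UTF-8 bytes on the way out and stays silent.
my $encoded = '';
open my $out, '>:encoding(UTF-8)', \$encoded
    or die "cannot open in-memory file: $!";
print {$out} $text;
close $out;

# $encoded now holds bytes: 11 characters, two of which need 2 bytes each.
printf "%d characters became %d bytes\n", length $text, length $encoded;
```

The same layer string works for real files, e.g. as the second argument of a three-argument open.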

Update: I've found a good workaround: just insert

s/\&gt\;/\&amp\;gt\;/g; s/\&lt\;/\&amp\;lt\;/g;
before print OUT decode_entities($_); to make lt and gt stay character references. Still, I'm interested in your comments/improvements.
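
The effect of that workaround can be sketched in isolation. The helper name decode_keeping_brackets is hypothetical, and the second decode_entities call only stands in for whatever later step decodes the remaining references (in the script, presumably the HTML::Strip pass); that part is an assumption:

```perl
use strict;
use warnings;
use HTML::Entities;

# Hypothetical helper (not part of the original script): escape the
# ampersand of &lt; and &gt; so that one round of decoding turns them
# back into &lt; and &gt; instead of literal angle brackets.
sub decode_keeping_brackets {
    my ($html) = @_;
    $html =~ s/&(lt|gt);/&amp;$1;/g;
    my $decoded = decode_entities($html);
    return $decoded;
}

my $line = 'kontra&lt;this should be left in&gt; &uuml;gy';

# First round: &uuml; is decoded, the brackets stay character references,
# so a tag stripper run at this point cannot mistake them for markup.
my $pass1 = decode_keeping_brackets($line);

# Second round (assumed to happen after the tags are stripped):
# the remaining references become literal < and >.
my $pass2 = decode_entities($pass1);

binmode STDOUT, ':encoding(UTF-8)';
print "$pass1\n$pass2\n";
```

The single substitution s/&(lt|gt);/&amp;$1;/g does the same job as the two substitutions above.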

Here's my code, it's in a sub as it's part of a larger project (obviously, fill in path/to/test.html if you want to run the script):

#!/usr/bin/perl
use strict;
use warnings;
use File::Copy;
use HTML::Strip;
use HTML::Entities;

sub convert_html;

convert_html("path/to/test.html");

sub convert_html($){
    # NOTE: $pf contains the path as well as the filename excluding the extension.

    # parse filename
    $_[0] =~ /(.*)\.(.*)/;
    my $pf = $1;
    my $ext = $2;

    # PREPARE FILES BEFORE RUNNING THE TAG STRIPPER
    open (IN, "<:encoding(UTF-8)", "${pf}.${ext}");
    open (OUT, ">:encoding(UTF-8)", "${pf}_htmlmod.${ext}");

    while (<IN>) {
        s/\x{A0}/ /g;          # remove non-breaking spaces
        s/\n//g;               # remove literal line breaks
        s/<\/?p>/\n/ig;        # conserve line breaks ("\/?" because "<p style =...> blabla</p>" is not caught by the normal regex
        s/<br( \/)?>/\n/ig;    # yet more line breaks
        s/\&\#8209;/-/g;
        print OUT decode_entities($_);
        # print OUT $_;        # alternative attempt
    }
    close IN;
    close OUT;

    print "\nline break and nbsp preparation done\n";
    <STDIN>;

    # STRIP TAGS
    # using :encoding(UTF-8) breaks this
    open (IN, "<", "${pf}_htmlmod.${ext}");
    open (OUT, ">", "${pf}.txt");
    {
        my $hs = HTML::Strip->new();
        # my $hs = HTML::Strip->new( decode_entities => 1 );    # alternative attempt
        while (<IN>) {
            my $clean_text = $hs->parse($_);
            print OUT $clean_text;
        }
        close IN;
        close OUT;
        unlink "${pf}_htmlmod.${ext}";
    }
    print "\nhtml conversion done\n";
    <STDIN>;
}

The test file with a couple of BRK tags in the text:

<HTML>
<HEAD>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<!--Filename : PISZ@TRA-DOC-HU-CONCL-C-0371-2003-200506500-06_00-->
<!-- Feuille de style -->
<LINK HREF="lex/css/Style_CNC_C_FR.css" REL="stylesheet" TYPE="text/css">
<LINK HREF="lex/css/Style_CNC_C_HU.css" REL="stylesheet" TYPE="text/css">
<!-- Titre du document -->
<TITLE></TITLE>
</HEAD>
<BODY>
<P class="C36Centre">JACOBS</P>
<P class="C36Centre">F&#336;TAN&Aacute;CSNOK IND&Iacute;TV&Aacute;NYA&lt;BRK&gt;</P>
<P class="C36Centre">Az ismertet&eacute;s napja: 2005.&nbsp;november&nbsp;17.<SUP>1</SUP>(<A HREF="#Footnote1" NAME="Footref1">1</A>) </P>
<P class="C38Centregrasgrandespacement"><B>C&#8209;371/03.&nbsp;sz.&nbsp;&uuml;gy</B></P>
<P class="C37Centregras"><B>Siegfried Aulinger&lt;BRK&gt;</B></P>
<P class="C37Centregras"><B>kontra&lt;this should be left in&gt;</B></P>
<P class="C37Centregras"><B>Bundesrepublik Deutschland</B></P>
<P class="C71Indicateur"><br></P><BR><BR><BR><BR><P class="C01PointAltN">1.&lt;BRK&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Ebben az &uuml;gyben az &#8216;Oberlandesgericht K&ouml;ln&#8217; (k&ouml;lni fellebbviteli b&iacute;r&oacute;s&aacute;g) a Szerb &eacute;s a Montenegr&oacute;i K&ouml;zt&aacute;rsas&aacute;g, valamint az
Eur&oacute;pai Gazdas&aacute;gi K&ouml;z&ouml;ss&eacute;g k&ouml;z&ouml;tti kereskedelem megtilt&aacute;s&aacute;r&oacute;l sz&oacute;l&oacute;, 1992. j&uacute;nius 1&#8209;jei 1432/92/EGK tan&aacute;csi rendelet (a tov&aacute;bbiakban:
az embarg&oacute;r&oacute;l sz&oacute;l&oacute; rendelet)(<A HREF="#Footnote2" NAME="Footref2">2</A>) &eacute;rtelmez&eacute;s&eacute;re vonatkoz&oacute;an k&eacute;t k&eacute;rd&eacute;st terjesztett a B&iacute;r&oacute;s&aacute;g el&eacute; el&#337;zetes d&ouml;nt&eacute;shozatalra.
</BODY>
</HTML>

Replies are listed 'Best First'.
Re: Converting HTML to txt with HTML::Strip
by wfsp (Abbot) on Oct 03, 2010 at 13:43 UTC
    This uses HTML::TokeParser::Simple (there are many other parsers) and may help get you started. It preserves your <BRK> 'tags'; is that what you were after?
    #! /usr/bin/perl
    use warnings;
    use strict;
    use HTML::Entities;
    use HTML::TokeParser::Simple;

    my $p = HTML::TokeParser::Simple->new(
        q{monk.html},
    ) or die qq{cant parse HTML};

    open my $fh_out, q{>:utf8}, q{out.txt}
        or die qq{cant open file to write};

    while (my $t = $p->get_token){
        if ($t->is_end_tag(q{p}) or $t->is_tag(q{br})){
            print $fh_out qq{\n};
        }
        elsif ($t->is_text){
            my $out = $t->as_is;
            for ($out){
                s/^\s+//;
                s/\s+$//;
            }
            next unless $out;
            print $fh_out decode_entities($out);
        }
    }
    Output (long lines snipped):
    JACOBS
    F&#336;TANÁCSNOK INDÍTVÁNYA<BRK>
    Az ismertetés napja: 2005. november 17.1(1)
    C&#8209;371/03. sz. ügy
    Siegfried Aulinger<BRK>
    kontra<this should be left in>
    Bundesrepublik Deutschland
    1.<BRK>        Ebben az ügyben az...
    Európai Gazdasági Közösség közötti...
    az embargóról szóló rendelet)(2)...
    Some numeric entities appear here (in the browser), e.g. &#336;; these aren't in the file.
      Well, yes, the BRK tags should be conserved with the lt and gt character references converted to < and > (everything that's "in the text", i.e. everything that isn't part of the HTML markup should stay in).
      Frankly, most of your actual code went right over my head. I'm pretty new to Perl and programming in general.
      I'm not sure what you mean about the numerical entities not being in the file. They are in the original HTML file and should be converted to the appropriate characters, e.g. 336 is the accented letter Ő.
      Either way, I now have a solution I'm happy with (the workaround I posted). It's not elegant, but it does everything I want it to, so I think I'll stick with it.
      By the way, it's pretty surprising that there seems to be no foolproof HTML->txt converter module that would let you just provide a path to an HTML file and spit out a UTF-8 txt with the right line breaks, all the character entities decoded, etc.
      I.e. instead of the 20 or so lines you and I posted, it should be
      #! /usr/bin/perl
      use warnings;
      use strict;
      use HTML::Convert;
      HTML::Convert(file.html);
      ... and you'd get file.txt created in the same folder.
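
HTML::Convert above is wishful pseudocode, not an existing module. As a rough sketch of what such a one-call helper might look like, the token loop from the reply can be wrapped in a sub. The sub name html_to_text is hypothetical, and this version uses HTML::TokeParser (the parser that HTML::TokeParser::Simple builds on) and takes the HTML as a string rather than a path, purely to keep the example self-contained:

```perl
use strict;
use warnings;
use HTML::Entities;
use HTML::TokeParser;

# Hypothetical helper: the name and exact behaviour are illustrative only.
# Emits a newline for every </p> and <br>, keeps text, drops all other markup.
sub html_to_text {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "cannot parse HTML";
    my $text = '';
    while (my $t = $p->get_token) {
        my ($type, $tag) = @$t;    # for text tokens, $tag holds the raw text
        if (($type eq 'E' and $tag eq 'p') or ($type eq 'S' and $tag eq 'br')) {
            $text .= "\n";
        }
        elsif ($type eq 'T') {
            # Text tokens arrive undecoded, so &lt;BRK&gt; can no longer be
            # mistaken for a tag by the time it is turned into <BRK>.
            $text .= decode_entities($t->[1]);
        }
    }
    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';
print html_to_text('<P>Siegfried Aulinger&lt;BRK&gt;</P><P>C&#8209;371/03. sz. &uuml;gy</P>');
```

Because entities are decoded only after the parser has separated tags from text, no &amp;lt; workaround is needed in this arrangement. Reading the input file and writing the .txt next to it would still have to be added around the sub.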
