Dear Monks,
I am asking for your help in converting HTML files to UTF-8 txt.

The task seems simple, but it's a lot trickier than I expected.

I have a working solution, but it involves decoding character entities with HTML::Entities before running HTML::Strip. As a result, if the text in the HTML file contains an escaped tag such as &lt;this is a tag quoted inside html&gt;, it is decoded into real angle brackets and gets stripped along with the real HTML tags.
I tried decoding the character entities later instead, when the stripper is run (see the lines I commented out). In that case, I get incorrect character conversions (eacute and uuml) and "wide character in print" error messages. I could work around the problem in my original solution (say, by telling HTML::Entities to ignore &lt; and &gt;, although I can't find an easy way to do that), but I'm more interested in what the "proper" solution is.

Update: I've found a good workaround: just insert

        s/&gt;/&amp;gt;/g;
        s/&lt;/&amp;lt;/g;

before the line print OUT decode_entities($_); so that &lt; and &gt; remain character references. Still, I'm interested in your comments and improvements.
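
The effect of the two substitutions above can be sketched with core Perl only. This is a minimal illustration, not the real modules: decode_once is a hypothetical, heavily reduced stand-in for decode_entities from HTML::Entities (it knows just four entities), and protect_angle_entities is an illustrative name for the workaround.

```perl
use strict;
use warnings;

# Hypothetical stand-in for HTML::Entities::decode_entities, limited to
# the few entities used in this sketch; it decodes exactly one level.
my %ENTITY = (amp => '&', lt => '<', gt => '>', eacute => "\x{E9}");

sub decode_once {
    my ($text) = @_;
    $text =~ s/&(\w+);/exists $ENTITY{$1} ? $ENTITY{$1} : "&$1;"/ge;
    return $text;
}

# Re-escape the ampersand of &lt; and &gt; so that one decoding pass
# turns them back into entities instead of into literal angle brackets.
sub protect_angle_entities {
    my ($text) = @_;
    $text =~ s/&(lt|gt);/&amp;$1;/g;
    return $text;
}

my $line      = '<B>kontra&lt;this should be left in&gt; &eacute;</B>';
my $protected = protect_angle_entities($line);
my $decoded   = decode_once($protected);

binmode STDOUT, ':encoding(UTF-8)';
print "$protected\n";
print "$decoded\n";
```

Because the replacement text is not rescanned, &amp;lt; decodes to the text &lt; rather than to a literal angle bracket, so a tag stripper run afterwards only sees the real B tags as markup. The escaped brackets then stay as &lt; and &gt; in the output unless they are decoded once more after stripping.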

Here's my code. It's in a sub because it's part of a larger project (fill in path/to/test.html if you want to run the script):

#!/usr/bin/perl
use strict;
use warnings;
use File::Copy;

use HTML::Strip;
use HTML::Entities;
sub convert_html($);    # forward declaration; prototype must match the definition below

convert_html("path/to/test.html");


sub convert_html($){
    # NOTE: $pf contains the path as well as the filename excluding the extension.

# parse filename
    $_[0] =~ /(.*)\.(.*)/;
    my $pf = $1;
    my $ext = $2;

# PREPARE FILES BEFORE RUNNING THE TAG STRIPPER
    open (IN, "<:encoding(UTF-8)", "${pf}.${ext}") or die "Cannot read ${pf}.${ext}: $!";
    open (OUT, ">:encoding(UTF-8)", "${pf}_htmlmod.${ext}") or die "Cannot write ${pf}_htmlmod.${ext}: $!";

    while (<IN>) {
        s/\x{A0}/ /g;                # replace non-breaking spaces with plain spaces
        s/\n//g;                    # remove literal line breaks
        s/<\/?p>/\n/ig;                # conserve line breaks ("\/?" because "<p style =...> blabla</p>" is not caught by the normal regex)
        s/<br( \/)?>/\n/ig;            # yet more line breaks
        s/\&\#8209;/-/g;
        print OUT decode_entities($_);
        # print OUT $_;                # alternative attempt
    }
    close IN;
    close OUT;
print "\nline break and nbsp preparation done\n";
<STDIN>;


# STRIP TAGS

    # using :encoding(UTF-8) breaks this
    open (IN, "<", "${pf}_htmlmod.${ext}");
    open (OUT, ">", "${pf}.txt");
    {
        my $hs = HTML::Strip->new();
        # my $hs = HTML::Strip->new( decode_entities => 1 );    # alternative attempt

        while (<IN>) {
            my $clean_text = $hs->parse($_);
            print OUT $clean_text;
        }

        close IN;
        close OUT;
        unlink "${pf}_htmlmod.${ext}";
    }
print "\nhtml conversion done\n";
<STDIN>;
}
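
On the "wide character in print" messages: Perl emits that warning when a string containing a code point above 255 is printed to a handle that has no encoding layer, which is the case for the second OUT handle when the stripper itself decodes the entities. The following is a minimal sketch in core Perl, not a fix for HTML::Strip; the in-memory handles and the variable names are assumptions made for illustration and stand in for the real output file.

```perl
use strict;
use warnings;

# "\x{151}" is U+0151 (o with double acute), the character that
# decoding &#337; produces; it does not fit in a single byte.
my $text = "el\x{151}zetes";

# In-memory file handles stand in for the real output file here.
my $bytes = '';
open my $out, '>:encoding(UTF-8)', \$bytes or die "open failed: $!";
print {$out} $text;
close $out or die "close failed: $!";

# The layer encoded the string on output: 8 characters became 9 bytes.
printf "%d characters, %d bytes\n", length($text), length($bytes);

# Reading back through the same layer restores the character string.
open my $in, '<:encoding(UTF-8)', \$bytes or die "open failed: $!";
my $round_trip = <$in>;
close $in;
print "round trip ok\n" if $round_trip eq $text;
```

Without a layer, a line whose highest character is below 256 (such as one containing only eacute or uuml) is written as single Latin-1 bytes, while a line containing a wider character is written as UTF-8 with a warning. That mix may be what produces the incorrect conversions described above.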

The test file with a couple of BRK tags in the text:

<HTML>
   <HEAD>
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
   
      <!--Filename : PISZ@TRA-DOC-HU-CONCL-C-0371-2003-200506500-06_00-->
      <!-- Feuille de style -->
      <LINK HREF="lex/css/Style_CNC_C_FR.css" REL="stylesheet" TYPE="text/css">
      <LINK HREF="lex/css/Style_CNC_C_HU.css" REL="stylesheet" TYPE="text/css">
      <!-- Titre du document -->
      <TITLE></TITLE>
   </HEAD>
   <BODY>
      <P class="C36Centre">JACOBS</P>
      <P class="C36Centre">F&#336;TAN&Aacute;CSNOK IND&Iacute;TV&Aacut
+e;NYA&lt;BRK&gt;</P>
      <P class="C36Centre">Az ismertet&eacute;s napja: 2005.&nbsp;nove
+mber&nbsp;17.<SUP>1</SUP>(<A HREF="#Footnote1" NAME="Footref1">1</A>)
      </P>
      <P class="C38Centregrasgrandespacement"><B>C&#8209;371/03.&nbsp;
+sz.&nbsp;&uuml;gy</B></P>
      <P class="C37Centregras"><B>Siegfried Aulinger&lt;BRK&gt;</B></P
+>
      <P class="C37Centregras"><B>kontra&lt;this should be left in&gt;
+</B></P>
      <P class="C37Centregras"><B>Bundesrepublik Deutschland</B></P>
      <P class="C71Indicateur"><br></P><BR><BR><BR><BR><P class="C01PointAltN">1.&lt;BRK&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Ebben az &uuml;gyben az &#8216;Oberlandesgericht K&ouml;ln&#8217; (k&ouml;lni fellebbviteli b&iacute;r&oacute;s&aacute;g) a Szerb &eacute;s a Montenegr&oacute;i K&ouml;zt&aacute;rsas&aacute;g, valamint az
         Eur&oacute;pai Gazdas&aacute;gi K&ouml;z&ouml;ss&eacute;g k&ouml;z&ouml;tti kereskedelem megtilt&aacute;s&aacute;r&oacute;l sz&oacute;l&oacute;, 1992. j&uacute;nius 1&#8209;jei 1432/92/EGK tan&aacute;csi rendelet (a tov&aacute;bbiakban:
         az embarg&oacute;r&oacute;l sz&oacute;l&oacute; rendelet)(<A HREF="#Footnote2" NAME="Footref2">2</A>) &eacute;rtelmez&eacute;s&eacute;re vonatkoz&oacute;an k&eacute;t k&eacute;rd&eacute;st terjesztett a B&iacute;r&oacute;s&aacute;g el&eacute; el&#337;zetes d&ouml;nt&eacute;shozatalra.

   </BODY>
</HTML>

In reply to Converting HTML to txt with HTML::Strip by elef
