Dear Monks,
I am asking for your help in converting HTML files to UTF-8 text files.
The task seems simple, but it is a lot trickier than I expected.
I have a working solution, but it involves decoding character entities with HTML::Entities before running HTML::Strip. As a result, if the text in the HTML file contains something like
&lt;this is a tag quoted inside html&gt;, it is decoded to <this is a tag quoted inside html> and gets stripped along with the real HTML tags.
I tried decoding the character entities later, when the stripper is run (see the lines I commented out). In that case, I get incorrect character conversions (for &eacute; and &uuml;) and "Wide character in print" warnings.
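A likely explanation for both symptoms, sketched with core Perl only: if entities are decoded into characters but the output handle has no encoding layer, characters up to U+00FF (such as é) are written as single Latin-1 bytes, which are not valid UTF-8, while characters above U+00FF (such as Ő) trigger "Wide character in print". The sample characters, the in-memory filehandles and the helper name bytes_written are illustrative assumptions, not part of the script.

```perl
use strict;
use warnings;

# Collect warnings instead of printing them, so they can be inspected.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# Write $text to an in-memory file opened with $mode
# and return the bytes that actually ended up in it.
sub bytes_written {
    my ($mode, $text) = @_;
    my $buffer = '';
    open(my $fh, $mode, \$buffer) or die "cannot open in-memory file: $!";
    print {$fh} $text;
    close $fh;
    return $buffer;
}

my $e_acute  = "\x{E9}";     # the character that &eacute; decodes to
my $o_dacute = "\x{150}";    # the character that &#336; decodes to

my $raw_e     = bytes_written('>',                 $e_acute);   # no layer
my $utf8_e    = bytes_written('>:encoding(UTF-8)', $e_acute);
my $raw_wide  = bytes_written('>',                 $o_dacute);  # warns
my $utf8_wide = bytes_written('>:encoding(UTF-8)', $o_dacute);

printf "e-acute, no layer:    %d byte(s)\n", length $raw_e;
printf "e-acute, UTF-8 layer: %d byte(s)\n", length $utf8_e;
printf "warnings collected:   %d\n",         scalar @warnings;
```

If this is what happens in the second stage, the output handle needs an encoding layer whenever decoded characters are printed to it; whether HTML::Strip itself copes with decoded input is a separate question.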
I could fix the problem by introducing some sort of workaround into my original solution (say, telling HTML::Entities to ignore &lt; and &gt;, although I can't find an easy way to do that), but I'm more interested in what the "proper" solution is.
Update: I've found a good workaround: just insert
s/&gt;/&amp;gt;/g;
s/&lt;/&amp;lt;/g;
before
print OUT decode_entities($_);
so that &lt; and &gt; stay character references after decoding. Still, I'm interested in your comments/improvements.
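The substitution step from the update, as a small stand-alone sketch. It assumes HTML::Entities is installed; the function name decode_except_angle_brackets and the sample string are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Decode all entities except &lt; and &gt;, which stay as references
# so that a later tag stripper does not mistake them for markup.
sub decode_except_angle_brackets {
    my ($html) = @_;
    $html =~ s/&lt;/&amp;lt;/g;    # &lt; -> &amp;lt;, which decodes back to &lt;
    $html =~ s/&gt;/&amp;gt;/g;    # same for &gt;
    return decode_entities($html);
}

my $sample  = 'kontra&lt;this should be left in&gt; &eacute;s <B>bold</B>';
my $decoded = decode_except_angle_brackets($sample);

binmode(STDOUT, ':encoding(UTF-8)');
print "$decoded\n";
```

One caveat: numeric references such as &#60; or &#x3C; also decode to a literal <, and this sketch does not protect them.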
Here's my code, it's in a sub as it's part of a larger project (obviously, fill in path/to/test.html if you want to run the script):
#!/usr/bin/perl
use strict;
use warnings;
use File::Copy;
use HTML::Strip;
use HTML::Entities;

# forward declaration; the prototype has to match the definition below
sub convert_html($);

convert_html("path/to/test.html");

sub convert_html($) {
    # NOTE: $pf contains the path as well as the filename excluding the extension.
    # parse filename
    $_[0] =~ /(.*)\.(.*)/ or die "no file extension in '$_[0]'\n";
    my $pf  = $1;
    my $ext = $2;

    # PREPARE FILES BEFORE RUNNING THE TAG STRIPPER
    open (IN, "<:encoding(UTF-8)", "${pf}.${ext}")
        or die "cannot read ${pf}.${ext}: $!";
    open (OUT, ">:encoding(UTF-8)", "${pf}_htmlmod.${ext}")
        or die "cannot write ${pf}_htmlmod.${ext}: $!";
    while (<IN>) {
        s/\x{A0}/ /g;          # replace non-breaking spaces with plain spaces
        s/\n//g;               # remove literal line breaks
        s/<\/?p>/\n/ig;        # conserve line breaks ("\/?" because "<p style =...> blabla</p>" is not caught by the normal regex)
        s/<br( \/)?>/\n/ig;    # yet more line breaks
        s/\&\#8209;/-/g;       # non-breaking hyphen becomes a plain hyphen
        print OUT decode_entities($_);
        # print OUT $_;        # alternative attempt
    }
    close IN;
    close OUT;
    print "\nline break and nbsp preparation done\n";
    <STDIN>;

    # STRIP TAGS
    # using :encoding(UTF-8) breaks this
    open (IN, "<", "${pf}_htmlmod.${ext}")
        or die "cannot read ${pf}_htmlmod.${ext}: $!";
    open (OUT, ">", "${pf}.txt")
        or die "cannot write ${pf}.txt: $!";
    {
        my $hs = HTML::Strip->new();
        # my $hs = HTML::Strip->new( decode_entities => 1 );    # alternative attempt
        while (<IN>) {
            my $clean_text = $hs->parse($_);
            print OUT $clean_text;
        }
        close IN;
        close OUT;
        unlink "${pf}_htmlmod.${ext}";
    }
    print "\nhtml conversion done\n";
    <STDIN>;
}
The test file with a couple of BRK tags in the text:
<HTML>
<HEAD>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<!--Filename : PISZ@TRA-DOC-HU-CONCL-C-0371-2003-200506500-06_00-->
<!-- Feuille de style -->
<LINK HREF="lex/css/Style_CNC_C_FR.css" REL="stylesheet" TYPE="text/css">
<LINK HREF="lex/css/Style_CNC_C_HU.css" REL="stylesheet" TYPE="text/css">
<!-- Titre du document -->
<TITLE></TITLE>
</HEAD>
<BODY>
<P class="C36Centre">JACOBS</P>
<P class="C36Centre">FŐTANÁCSNOK INDÍTV&Aacute;NYA&lt;BRK&gt;</P>
<P class="C36Centre">Az ismertetés napja: 2005. november 17.<SUP>1</SUP>(<A HREF="#Footnote1" NAME="Footref1">1</A>)
</P>
<P class="C38Centregrasgrandespacement"><B>C‑371/03. sz. ügy</B></P>
<P class="C37Centregras"><B>Siegfried Aulinger&lt;BRK&gt;</B></P>
<P class="C37Centregras"><B>kontra&lt;this should be left in&gt;</B></P>
<P class="C37Centregras"><B>Bundesrepublik Deutschland</B></P>
<P class="C71Indicateur"><br></P><BR><BR><BR><BR><P class="C01PointAltN">1.&lt;BRK&gt;&nbsp;&nbsp;Ebben az ügyben az ‘Oberlandesgericht Köln’ (kölni fellebbviteli bíróság) a Szerb &eacute;s a Montenegrói Köztársaság, valamint az
Európai Gazdasági Közösség k&ouml;zötti kereskedelem megtiltásáról sz&oacute;ló, 1992. június 1‑jei 1432/92/EGK tan&aacute;csi rendelet (a továbbiakban:
az embargóról szóló rendelet)(<A HREF="#Footnote2" NAME="Footref2">2</A>) értelmezés&eacute;re vonatkozóan két kérdést terjesztett a Bíróság elé előzetes dönt&eacute;shozatalra.
</BODY>
</HTML>
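For comparison, one way the "proper" route might look: let a parser separate tags from text first, and decode entities only in the text, so that quoted tags such as &lt;this should be left in&gt; can never be confused with markup. This is only a sketch, assuming that HTML::Parser (shipped in the same distribution as HTML::Entities) is acceptable in place of HTML::Strip. The function name html_to_text and the sample string are made up, and the line-break handling is reduced to P and BR.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Return the text content of $html (a string of decoded characters,
# not raw bytes), with entities decoded and tags dropped.
sub html_to_text {
    my ($html) = @_;
    my $text = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' passes the text with entities already decoded
        text_h  => [ sub { $text .= $_[0] }, 'dtext' ],
        # turn <p> and <br> into line breaks, ignore all other tags
        start_h => [ sub { $text .= "\n" if $_[0] eq 'p' or $_[0] eq 'br' }, 'tagname' ],
        end_h   => [ sub { $text .= "\n" if $_[0] eq 'p' }, 'tagname' ],
    );
    $parser->unbroken_text(1);    # deliver each run of text in one piece
    $parser->parse($html);
    $parser->eof;
    return $text;
}

my $sample = '<P class="C37Centregras"><B>kontra&lt;this should be left in&gt;</B></P>'
           . '<P>Szerb &eacute;s &#336;<br></P>';
my $plain = html_to_text($sample);

binmode(STDOUT, ':encoding(UTF-8)');
print $plain;
```

For a real file, the input would be read through :encoding(UTF-8) and the result printed through an output handle with the same layer, so no intermediate file is needed.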