HTML::Strip and UTF8 -- is there some way I can just skip all the "UTF8 only" entities?

tphyahoo has asked for the wisdom of the Perl Monks concerning the following question:

I've got zillions of lines of stuff that should be html, but, you know, not very clean.

Every line needs to be cleaned up. Problem I'm having is html that has exotic characters like

’

What the hell is that anyway? I don't know, I don't care. It seems to only have meaning under utf-8, and the team I am delivering the data to hasn't switched to utf-8 yet. So the agreed workaround is that we skip formatting that is "utf-8 only". However, we'd like to quick-convert html to text using HTML::Strip for everything else. Is there a way to do this? Or is there a better way to quick-convert html to text than HTML::Strip?

Below are tests and code that demonstrate the problem.

The meat is in two functions, stripUtf8Entities and stripUtf8EntitiesBetter, which I call before converting my "html" to text. stripUtf8Entities lets me pass my tests, but only for that one "ugly" special character; I guess it won't work in general. stripUtf8EntitiesBetter doesn't pass tests, because it's just a stub. But this would be the code to change if you have a better idea on how to do this. Test output:

ok 1 - stripUtf8Entities
# before:blah
# after: blah
ok 2 - stripUtf8Entities
# before:&Uuml --
# after: Ü --
ok 3 - stripUtf8Entities
# before:blah -- &rsquo; -- blah
# after: blah --  -- blah
ok 4 - stripUtf8Entities
# before:&Uuml; -- &rsquo; -- blah
# after: Ü --  -- blah
ok 5 - stripUtf8EntitiesBetter
# before:blah
# after: blah
ok 6 - stripUtf8EntitiesBetter
# before:&Uuml --
# after: Ü --
not ok 7 - stripUtf8EntitiesBetter
# before:blah -- &rsquo; -- blah
# after: blah --  -- blah
#   Failed test 'stripUtf8EntitiesBetter
# before:blah -- &rsquo; -- blah
# after: blah --  -- blah'
#   at shopImporter-test.pl line 49.
Wide character in print at /home/hartman/idealo_external_dependencies/current/localperl/lib/5.8.8/Test/Builder.pm line 1192.
#          got: 'blah -- â -- blah'
#     expected: 'blah --  -- blah'
not ok 8 - stripUtf8EntitiesBetter
# before:&Uuml; -- &rsquo; -- blah
# after: Ü --  -- blah
#   Failed test 'stripUtf8EntitiesBetter
# before:&Uuml; -- &rsquo; -- blah
# after: Ü --  -- blah'
#   at shopImporter-test.pl line 49.
Wide character in print at /home/hartman/idealo_external_dependencies/current/localperl/lib/5.8.8/Test/Builder.pm line 1192.
#          got: 'Ã -- â -- blah'
#     expected: 'Ã --  -- blah'
1..8
# Looks like you failed 2 tests of 8.

Code:

$ cat utf8-and-html-entities.pl
#!/usr/angebote/perlroot/bin/perl
use strict;
use warnings;

# use strict;
# use IO::File;
# use Text::CSV_XS;
# use DBI;
# use Time::Local;
# use Time::HiRes;
# use Compress::Zlib;
# use LWP::UserAgent;
#use POSIX qw(locale_h);
use HTML::Strip;
use Test::More qw(no_plan);
use Data::Dumper;

#setlocale(LC_CTYPE, "de_DE.ISO8859-1");

require "../../perl/agentFunc.pl";

my $stringsBeforeAfter = [
               [ 'blah', 'blah' ],
               [ '&Uuml --', 'Ü --'],
               ["blah -- &rsquo; -- blah", "blah --  -- blah"],
               ["&Uuml; -- &rsquo; -- blah", "Ü --  -- blah"],
              ];


foreach my $beforeAfter ( @$stringsBeforeAfter ) {
  my ( $before, $after )  = @$beforeAfter;
  my $transformed =HTML2Text(  stripUtf8Entities( $before ) );
  my $strings = [ [ "before", $before ],
                  [ "after", $after ],
                  [ "transformed", $transformed ]
                ];
  #print "strings: " . Dumper($strings);
  is($transformed, $after, "stripUtf8Entities");
}

foreach my $beforeAfter ( @$stringsBeforeAfter ) {
  my ( $before, $after )  = @$beforeAfter;
  my $transformed =HTML2Text(  stripUtf8EntitiesBetter( $before ) );
  my $strings = [ [ "before", $before ],
                  [ "after", $after ],
                  [ "transformed", $transformed ]
                ];
  #print "strings: " . Dumper($strings);
  is($transformed, $after, "stripUtf8EntitiesBetter");
}

sub HTML2Text {
    my ($changeText) = @_;

    my $htmlStripObject = HTML::Strip->new();

    $changeText = $htmlStripObject->parse($changeText);

    return $changeText;
}

# works, but only for one special character: &rsquo;
# what happens when I hit another char that doesn't translate well out of utf8?
sub stripUtf8Entities {
   my $string = shift || "";

   my $utf8Entities = ["&rsquo;"];

   foreach my $utf8Entity ( @$utf8Entities ) {
     # \Q...\E makes sure the entity is matched literally, not as a regex
     $string =~ s/\Q$utf8Entity\E//g;
   }

   return $string;
}

#just a stub -- is there a better, more general way to do this?
sub stripUtf8EntitiesBetter {
   my $string = shift || "";
   return $string;

}


Replies are listed 'Best First'.
Re: HTML::Strip and UTF8 -- is there some way I can just skip all the "UTF8 only" entities? by ikegami (Patriarch) on Jan 03, 2007 at 19:04 UTC
HTML::Strip's `parse` returns a string of (Unicode) characters. You can't write a string of characters to a stream of bytes (such as STDOUT). The characters must be encoded first. That's what the warning you're getting means. `encode` in Encode can be used to perform that task.

So what happens when you try to encode a character that doesn't exist in the target character set? It gets replaced, often with a question mark ("`?`"). If that's ok, then check out the following snippet.

use 5.008000;
use strict;
use warnings;

use Test::More 'no_plan';
use Encode qw( encode );
use HTML::Strip qw( );

use constant ENCODING => 'iso-latin-1';

sub html_to_text {
    my ($html) = @_;
    my $stripper = HTML::Strip->new();
    return $stripper->parse($html);
}

sub chars_to_bytes {
    my ($encoding, $text) = @_;
    return encode($encoding, $text);
}

{
    foreach (
        [ 'blah', 'blah' ],
        [ '&Uuml --', 'Ü --' ],
        [ 'blah -- &rsquo; -- blah', 'blah -- ? -- blah' ],
        [ '&Uuml; -- &rsquo; -- blah', 'Ü -- ? -- blah' ],
    ) {
        my $html   = $_->[0];
        my $expect = $_->[1];
        my $text = chars_to_bytes(ENCODING, html_to_text($html));
        is($text, $expect, $html);
    }
}

If question marks are not ok, you can specify your own replacement character, including nothing.

use 5.008000;
use strict;
use warnings;

use Test::More qw( no_plan );
use Encode qw( encode );
use HTML::Strip qw( );

use constant ENCODING => 'iso-latin-1';

sub html_to_text {
    my ($html) = @_;
    my $stripper = HTML::Strip->new();
    return $stripper->parse($html);
}

sub chars_to_bytes {
    my ($encoding, $text) = @_;
    # The code ref is called for each unmappable character;
    # returning '' drops that character from the output.
    return encode($encoding, $text, sub { '' });
}

{
    for (
        [ 'blah', 'blah' ],
        [ '&Uuml --', 'Ü --' ],
        [ 'blah -- &rsquo; -- blah', 'blah --  -- blah' ],
        [ '&Uuml; -- &rsquo; -- blah', 'Ü --  -- blah' ],
    ) {
        my $html   = $_->[0];
        my $expect = $_->[1];
        my $text = chars_to_bytes(ENCODING, html_to_text($html));
        is($text, $expect, $html);
    }
}
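Besides a question mark or nothing, Encode also ships predefined fallback modes. The following is only a rough sketch: the helper name `to_latin1_with_refs` is made up for illustration, and whether numeric character references are acceptable to the team receiving the data is an assumption. With `Encode::FB_HTMLCREF`, each character that has no ISO-8859-1 equivalent is written as a decimal character reference, so nothing is silently lost and the output is still plain Latin-1 bytes.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not from the thread): encode to ISO-8859-1,
# keeping unmappable characters as decimal references such as &#8217;.
sub to_latin1_with_refs {
    my ($text) = @_;
    return Encode::encode('iso-8859-1', $text, Encode::FB_HTMLCREF());
}

# U+00DC (U with diaeresis) exists in ISO-8859-1; U+2019 does not.
my $bytes = to_latin1_with_refs("\x{DC} -- \x{2019} -- blah");
print $bytes, "\n";    # the U+2019 character comes out as "&#8217;"
```

This may suit a case where the data is expected to move to UTF-8 later, since the references can be decoded again at that point.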
Re: HTML::Strip and UTF8 -- is there some way I can just skip all the "UTF8 only" entities? by wfsp (Abbot) on Jan 03, 2007 at 20:22 UTC
Problem I'm having is html that has exotic characters like: `’`

A possible source of such exoticism is MS and their use of x80-x9F for such characters (a range not used by either Latin1 or utf8). After a round trip through something like HTML::Entities they come back as a utf8 equivalent (e.g. 0x201C).

I've had several runs round the block with this myself. See Fixing suspect characters in HTML for a possible approach.

Hope that helps.
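To make the x80-x9F point concrete, here is a minimal sketch. It assumes the input really is Windows-1252 (cp1252) bytes and that plain ASCII stand-ins are acceptable; it is not necessarily the approach taken in the node linked above. The mapping table `%ascii_for` and the function name `cp1252_to_latin1` are invented for this example, and the table covers only a handful of punctuation characters.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical mapping (an assumption, deliberately incomplete):
# typographic punctuation from cp1252's 0x80-0x9F range to ASCII.
my %ascii_for = (
    "\x{2018}" => "'",   "\x{2019}" => "'",    # single quotes
    "\x{201C}" => '"',   "\x{201D}" => '"',    # double quotes
    "\x{2013}" => '-',   "\x{2014}" => '--',   # dashes
    "\x{2026}" => '...',                       # ellipsis
);

sub cp1252_to_latin1 {
    my ($bytes) = @_;
    # Decoding as cp1252 turns byte 0x92 into the character U+2019, etc.
    my $text = decode('cp1252', $bytes);
    # Swap the mapped punctuation for its ASCII stand-in.
    $text =~ s/([\x{2018}\x{2019}\x{201C}\x{201D}\x{2013}\x{2014}\x{2026}])/$ascii_for{$1}/g;
    # Anything still outside ISO-8859-1 falls back to "?".
    return encode('iso-8859-1', $text);
}

print cp1252_to_latin1("it\x92s"), "\n";    # it's
```

Mapping before encoding keeps the apostrophe readable instead of turning it into a question mark or dropping it.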
Re: HTML::Strip and UTF8 -- is there some way I can just skip all the "UTF8 only" entities? by tphyahoo (Vicar) on Jan 03, 2007 at 18:45 UTC
Answering my own question (partially), I think I have to do something along the lines of

use strict;
use warnings;
use Encode qw( encode );

my $utf8String = "\x{2019}";
my $latin1String = latin1ify($utf8String);
print "$latin1String\n";

sub latin1ify {
    my $string = shift || "";
    # "\x{2019}" is already a character string, so it is encoded directly;
    # decode_utf8 would only be needed for raw UTF-8 bytes.
    return encode( "iso-8859-1", $string );
}

which gives "?" and then strip the question marks. But I have to go now, so I'll finish this another time.
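One caveat with encoding first and stripping question marks afterwards is that question marks already present in the text disappear as well. A small sketch of the difference, using only the core Encode module (the function names `strip_after` and `drop_during` are made up for illustration):

```perl
use strict;
use warnings;
use Encode qw( encode );

# Approach 1: let encode() substitute "?", then delete every "?".
sub strip_after {
    my ($text) = @_;
    my $bytes = encode('iso-8859-1', $text);
    $bytes =~ tr/?//d;
    return $bytes;
}

# Approach 2: drop unmappable characters while encoding.
sub drop_during {
    my ($text) = @_;
    return encode('iso-8859-1', $text, sub { '' });
}

my $text = "really? \x{2019}yes\x{2019}";
print strip_after($text), "\n";    # really yes   (the genuine "?" is gone)
print drop_during($text), "\n";    # really? yes
```

Passing a code ref as the third argument to `encode` removes only the characters that cannot be represented, so legitimate question marks survive.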

