PerlMonks  

HTML::Strip and UTF8 -- is there some way I can just skip all the "UTF8 only" entities?

by tphyahoo (Vicar)
on Jan 03, 2007 at 18:21 UTC ( #592806=perlquestion )
tphyahoo has asked for the wisdom of the Perl Monks concerning the following question:

I've got zillions of lines of stuff that should be html, but, you know, not very clean.

Every line needs to be cleaned up. The problem I'm having is html that contains exotic characters like

’

What the hell is that anyway? I don't know, and I don't care. It seems to only have meaning under utf-8, and the team I am delivering the data to hasn't switched to utf-8 yet. So the agreed workaround is to skip anything that is "utf-8 only". However, we'd like to quick-convert html to text using HTML::Strip for everything else. Is there a way to do this? Or is there a better way to quick-convert html to text than HTML::Strip?

Below are tests and code that demonstrate the problem.

The meat is in two functions, stripUtf8Entities and stripUtf8EntitiesBetter, which I call before converting my "html" to text. stripUtf8Entities passes my tests, but it only handles that one "ugly" special character, so I guess it won't work in general. stripUtf8EntitiesBetter doesn't pass the tests, because it's just a stub, but it is the code to change if you have a better idea on how to do this. Test output:

    ok 1 - stripUtf8Entities # before:blah # after: blah
    ok 2 - stripUtf8Entities # before:&Uuml -- # after: --
    ok 3 - stripUtf8Entities # before:blah -- ’ -- blah # after: blah -- -- blah
    ok 4 - stripUtf8Entities # before:Ü -- ’ -- blah # after: -- -- blah
    ok 5 - stripUtf8EntitiesBetter # before:blah # after: blah
    ok 6 - stripUtf8EntitiesBetter # before:&Uuml -- # after: --
    not ok 7 - stripUtf8EntitiesBetter # before:blah -- ’ -- blah # after: blah -- -- blah
    # Failed test 'stripUtf8EntitiesBetter # before:blah -- ’ -- blah # after: blah -- -- blah'
    # at shopImporter-test.pl line 49.
    Wide character in print at /home/hartman/idealo_external_dependencies/current/localperl/lib/5.8.8/Test/Builder.pm line 1192.
    # got: 'blah -- -- blah'
    # expected: 'blah -- -- blah'
    not ok 8 - stripUtf8EntitiesBetter # before:Ü -- ’ -- blah # after: -- -- blah
    # Failed test 'stripUtf8EntitiesBetter # before:Ü -- ’ -- blah # after: -- -- blah'
    # at shopImporter-test.pl line 49.
    Wide character in print at /home/hartman/idealo_external_dependencies/current/localperl/lib/5.8.8/Test/Builder.pm line 1192.
    # got: ' -- -- blah'
    # expected: ' -- -- blah'
    1..8
    # Looks like you failed 2 tests of 8.
Code:
    $ cat utf8-and-html-entities.pl
    #!/usr/angebote/perlroot/bin/perl

    use strict;
    use warnings;

    # use strict;
    # use IO::File;
    # use Text::CSV_XS;
    # use DBI;
    # use Time::Local;
    # use Time::HiRes;
    # use Compress::Zlib;
    # use LWP::UserAgent;
    #use POSIX qw(locale_h);
    use HTML::Strip;
    use Test::More qw(no_plan);
    use Data::Dumper;
    #setlocale(LC_CTYPE, "de_DE.ISO8859-1");

    require "../../perl/agentFunc.pl";

    my $stringsBeforeAfter = [
        [ 'blah', 'blah' ],
        [ '&Uuml --', ' --'],
        ["blah -- ’ -- blah", "blah -- -- blah"],
        ["Ü -- ’ -- blah", " -- -- blah"],
    ];

    foreach my $beforeAfter ( @$stringsBeforeAfter ) {
        my ( $before, $after ) = @$beforeAfter;
        my $transformed = HTML2Text( stripUtf8Entities( $before ) );
        my $strings = [ [ "before", $before ], [ "after", $after ], [ "transformed", $transformed ] ];
        #print "strings: " . Dumper($strings);
        is($transformed, $after, "stripUtf8Entities");
    }

    foreach my $beforeAfter ( @$stringsBeforeAfter ) {
        my ( $before, $after ) = @$beforeAfter;
        my $transformed = HTML2Text( stripUtf8EntitiesBetter( $before ) );
        my $strings = [ [ "before", $before ], [ "after", $after ], [ "transformed", $transformed ] ];
        #print "strings: " . Dumper($strings);
        is($transformed, $after, "stripUtf8EntitiesBetter");
    }

    sub HTML2Text {
        my ($changeText) = @_;
        my $htmlStripObject = HTML::Strip->new();
        $changeText = $htmlStripObject->parse($changeText);
        return $changeText;
    }

    # works, but only for one special character: &rsquo
    # what happens when I hit another char that doesn't translate well out of utf8?
    sub stripUtf8Entities {
        my $string = shift || "";
        my $utf8Entities = ["’"];
        foreach my $utf8Entity ( @$utf8Entities ) {
            $string =~ s/$utf8Entity//g;
        }
        return $string;
    }

    #just a stub -- is there a better, more general way to do this?
    sub stripUtf8EntitiesBetter {
        my $string = shift || "";
        return $string;
    }

Re: HTML::Strip and UTF8 -- is there some way I can just skip all the "UTF8 only" entities?
by tphyahoo (Vicar) on Jan 03, 2007 at 18:45 UTC
    Answering my own question (partially), I think I have to do something along the lines of

        use strict;
        use warnings;
        use Encode::Encoder;

        my $utf8String = "\x{2019}";
        my $latin1String = latin1ify($utf8String);
        print "$latin1String\n";

        sub latin1ify {
            my $string = shift || "";
            Encode::encode( "iso-8859-1", Encode::decode_utf8($string) );
        }

    which gives "?" and then strip the question marks.

    But I have to go now, so I'll finish this another time.
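
One caveat with stripping the question marks afterwards is that genuine question marks in the text would disappear too. A minimal sketch of an alternative, assuming ISO-8859-1 is the delivery encoding: drop every character above U+00FF before encoding, so nothing ever needs a "?" substitute. The helper name latin1_bytes_dropping_rest is hypothetical and does not come from the thread.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper (not from the thread): remove every character that
# ISO-8859-1 cannot represent, then encode what is left to bytes.
sub latin1_bytes_dropping_rest {
    my ($chars) = @_;
    $chars = '' unless defined $chars;

    # ISO-8859-1 covers exactly the code points U+0000 through U+00FF.
    $chars =~ s/[^\x{00}-\x{FF}]//g;

    return encode('iso-8859-1', $chars);
}

my $bytes = latin1_bytes_dropping_rest("Why? \x{DC}ber \x{2019}quoted\x{2019}");

binmode STDOUT;    # the result is already bytes, so write it unmodified
print $bytes, "\n";
```

The question mark after "Why" survives, while the two curly quotes are removed.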

Re: HTML::Strip and UTF8 -- is there some way I can just skip all the "UTF8 only" entities?
by ikegami (Pope) on Jan 03, 2007 at 19:04 UTC

    HTML::Strip's parse returns a string of (unicode) characters. You can't write a string of characters to a stream of bytes (such as STDOUT). The characters must be encoded first. That's what the warning you're getting means. encode in Encode can be used to perform that task.
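
To isolate the encoding step from HTML::Strip, here is a small sketch that uses only the core Encode module. The sample string is made up for illustration; it contains one character that ISO-8859-1 has (U+00DC) and one that it lacks (U+2019).

```perl
use strict;
use warnings;
use Encode qw( encode );

# Illustrative input: U+00DC exists in ISO-8859-1, U+2019 does not.
my $chars = "\x{DC} -- \x{2019} -- blah";

# Printing $chars directly to a plain STDOUT is what triggers
# "Wide character in print". Encoding first produces a byte string.
my $bytes = encode('iso-8859-1', $chars);

binmode STDOUT;    # raw bytes out, no further translation
print $bytes, "\n";
```

With the default behaviour of encode, the unmappable U+2019 comes out as "?", while U+00DC becomes the single byte 0xDC.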

    So what happens when you try to encode a character that doesn't exist in the target character set? It gets replaced, often with a question mark ("?"). If that's ok, then check out the following snippet.

        use 5.008000;
        use strict;
        use warnings;

        use Test::More 'no_plan';
        use Encode qw( encode );
        use HTML::Strip qw( );

        use constant ENCODING => 'iso-latin-1';

        sub html_to_text {
            my ($html) = @_;
            my $stripper = HTML::Strip->new();
            return $stripper->parse($html);
        }

        sub chars_to_bytes {
            my ($encoding, $text) = @_;
            return encode($encoding, $text);
        }

        {
            foreach (
                [ 'blah', 'blah' ],
                [ '&Uuml --', ' --' ],
                [ 'blah -- ’ -- blah', 'blah -- ? -- blah' ],
                [ 'Ü -- ’ -- blah', ' -- ? -- blah' ],
            ) {
                my $html   = $_->[0];
                my $expect = $_->[1];
                my $text   = chars_to_bytes(ENCODING, html_to_text($html));
                is($text, $expect, $html);
            }
        }

    If question marks are not ok, you can specify your own replacement character, including nothing.

        use 5.008000;
        use strict;
        use warnings;

        use Test::More qw( no_plan );
        use Encode qw( encode );
        use HTML::Strip qw( );

        use constant ENCODING => 'iso-latin-1';

        sub html_to_text {
            my ($html) = @_;
            my $stripper = HTML::Strip->new();
            return $stripper->parse($html);
        }

        sub chars_to_bytes {
            my ($encoding, $text) = @_;
            return encode($encoding, $text, sub { '' });
        }

        {
            for (
                [ 'blah', 'blah' ],
                [ '&Uuml --', ' --' ],
                [ 'blah -- ’ -- blah', 'blah -- -- blah' ],
                [ 'Ü -- ’ -- blah', ' -- -- blah' ],
            ) {
                my $html   = $_->[0];
                my $expect = $_->[1];
                my $text   = chars_to_bytes(ENCODING, html_to_text($html));
                is($text, $expect, $html);
            }
        }
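
Besides an empty replacement, the third argument to encode accepts a few other fallback choices. The sketch below shows three of them using only the core Encode module; the sample string and the mapping inside the code reference are illustrative assumptions, not something the original poster asked for.

```perl
use strict;
use warnings;
use Encode qw( encode );

my $chars = "blah -- \x{2019} -- blah";

# LEAVE_SRC asks encode() not to modify $chars in place.
my $as_html = encode('iso-8859-1', $chars,
    Encode::FB_HTMLCREF() | Encode::LEAVE_SRC());   # numeric HTML reference
my $as_perl = encode('iso-8859-1', $chars,
    Encode::FB_PERLQQ() | Encode::LEAVE_SRC());     # Perl-style \x{...} escape

# A code reference receives the code point of each unmappable character.
# Illustrative policy: turn curly single quotes into an apostrophe and
# drop everything else.
my $as_ascii = encode('iso-8859-1', $chars, sub {
    my ($ord) = @_;
    return ($ord == 0x2018 || $ord == 0x2019) ? "'" : '';
});

binmode STDOUT;
print "$_\n" for $as_html, $as_perl, $as_ascii;
```

Which fallback fits depends on what the receiving team can tolerate: the HTML reference keeps the information, while the code reference gives full control over what is kept or dropped.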
Re: HTML::Strip and UTF8 -- is there some way I can just skip all the "UTF8 only" entities?
by wfsp (Abbot) on Jan 03, 2007 at 20:22 UTC
    Problem I'm having is html that has exotic characters like:
    ’
    A possible source of such exoticism is MS and their use of x80-x9F for such characters (a range not used by either Latin1 or utf8). After a round trip through something like HTML::Entities they come back as a utf8 equivalent (e.g. 0x201C).
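
As a rough illustration of that round trip, assuming the input really is Windows-1252 (cp1252) bytes: byte 0x92 decodes to U+2019, which can then be mapped to a plain ASCII apostrophe before encoding to Latin-1. The four-character mapping is a hypothetical starting point, not a complete list.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Byte 0x92 is a right single quotation mark in Windows-1252.
my $win_bytes = "it\x92s";
my $chars     = decode('cp1252', $win_bytes);    # now contains U+2019

# Hypothetical mapping of curly quotes to ASCII; extend as needed.
(my $plain = $chars) =~ tr/\x{2018}\x{2019}\x{201C}\x{201D}/''""/;

my $latin1 = encode('iso-8859-1', $plain);

binmode STDOUT;
print $latin1, "\n";
```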

    I've had several runs round the block with this myself.
    See Fixing suspect characters in HTML for a possible approach.

    Hope that helps.
