transforming html

morgon has asked for the wisdom of the Perl Monks concerning the following question:

Venerable monks!

I have a collection of Spanish-language html-files that I want to convert into a Plucker document.

But before I can do that I need to get rid of some crap that the files contain, so I parse them, extract the things I want and create a new html-file containing just the extracted bits.

The files I start with are utf8-encoded and claim (in the DOCTYPE) to be xhtml, but they don't validate (missing closing tags - oh well) so I use HTML::TreeBuilder::XPath for parsing.

Now the thing is that the source documents do not use any html-entities but contain the Spanish special characters as two-byte UTF-8 characters. And this is where I have a problem.

Here is my code (which works apart from the problem below):


use strict;

use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file("in.html");

my ($tit) = $tree->findnodes(q{/html//h1[@class='title']});
my ($sub) = $tree->findnodes(q{/html//div[@class='submitted']});
my ($aut) = $tree->findnodes(q{/html//div[@class='autor']});
my ($art) = $tree->findnodes(q{/html//div[@class='content clear-block']});
my ($nav) = $tree->findnodes(q{/html//div[@class='book-navigation']});

$nav->detach;

my (@childs, undef, undef) = $art->content_list;

open my $fh, ">", "out.html" or die $!;

print $fh "<html><body>"
          .   join("\n", map { $_->as_HTML } ($tit, $sub, $aut, $art))
          .   "</body></html>";

What happens now is that e.g. the Spanish character ú, which is encoded as hex c3ba in the source document, gets transformed into &Atilde;&ordm; (i.e. Ãº) in the output - and that is wrong...

Does someone have an idea on how to fix this?

Many thanks!


Re: transforming html by muba (Priest) on Sep 29, 2010 at 01:44 UTC
It sounds like HTML::TreeBuilder isn't dealing with the utf-8 encoding right. However, I think there's an easy solution, hinted at by HTML::TreeBuilder and open.

HTML::TreeBuilder:

$root = HTML::TreeBuilder->new()
$root->parse_file(...)

An important method inherited from HTML::Parser, which see. Current versions of HTML::Parser can take a filespec, or a filehandle object, like *FOO, or some object from class IO::Handle, IO::File, IO::Socket) or the like. I think you should check that a given file exists before calling $root->parse_file($filespec).

Ok, so it accepts file handles? Good...

open:

open(my $fh, "<:encoding(UTF-8)", "filename")
    || die "can't open UTF-8 encoded filename: $!";

Ok, so we can specify which encoding to use when we open a file? Hmm! So here's what I'd try.

open my $fh, "<:encoding(UTF-8)", $yourOriginalFileName or die $!;
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file($fh);

Untested though, but hopefully it helps.

Edit: made a weird mistake in my code (as well as in the links). Fixed. I hope.
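The effect of the encoding layer suggested above can be checked without any HTML module. The following is a minimal sketch using only core Perl; the helper name slurp_with_layer and the scratch file are made up for illustration. It writes the two bytes c3 ba to a temporary file and reads them back once without decoding and once through :encoding(UTF-8).

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the two raw bytes C3 BA (UTF-8 for "ú") to a scratch file.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "\xC3\xBA";
close $out or die "close failed: $!";

# Hypothetical helper: read a whole file through the given PerlIO layer.
sub slurp_with_layer {
    my ($layer, $file) = @_;
    open my $in, "<$layer", $file or die "can't open $file: $!";
    local $/;                      # slurp mode
    my $text = <$in>;
    close $in;
    return $text;
}

my $raw     = slurp_with_layer(':raw', $path);
my $decoded = slurp_with_layer(':encoding(UTF-8)', $path);

printf "raw:     %d characters\n", length $raw;
printf "decoded: %d character, U+%04X\n", length $decoded, ord $decoded;
```

Read raw, the file yields two separate characters (U+00C3 and U+00BA), which is what a serializer later escapes as two entities. Read through the layer, it yields the single character U+00FA.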
Re^2: transforming html by morgon (Priest) on Sep 29, 2010 at 02:01 UTC
Great - many thanks! Opening the file with the proper encoding and passing the filehandle to parse_file indeed results in the proper entities being used - which is good enough for me.

For extra points: Can you think of a way to make as_HTML emit the utf8-characters as in the source-document without replacing them with html-entities?
Re^3: transforming html by muba (Priest) on Sep 29, 2010 at 02:35 UTC
I'm not sure why you'd want this, as in the end it renders pretty much the same, but I suppose you have your reasons.

After digging through the documentation a bit, I finally found that as_HTML() is defined on HTML::Element and those docs don't really hint at a way to prevent that encoding from happening. But we don't let them discourage us that easily, do we? So after diving into the source of HTML::Element and having a look at the code of the as_HTML subroutine, I learned that the entities encoding is handled by HTML::Entities.

sub as_HTML {
    # Bla bla bla
    # Your typical subroutine initial stuff we don't care much about...

    if ( ... ) {
        # Some condition I don't really understand since I didn't bother to
        # understand the initial stuff above. But it didn't seem too relevant.

        # A whole lot of stuff happens here, seemingly all dealing with tags,
        # not with text.
    }
    else { # it's a text segment
        # Hey! Cool.
        # One more line of bla bla bla, before...:

        HTML::Entities::encode_entities( $node, $entities )
        # Yeah, this sounds about right. Let's look at that.

        # More stuff I didn't bother to look at...
    }
}

Ok, so HTML::Entities is our target now. There's no apparent way to disable entity encoding so we'll have to use the source as our documentation again. Shrug, whatever, it's way past bedtime anyway now so I might as well see what I can do.

# HTML::Entities

# First there's a whole lot of POD here, but since I already saw the HTML
# version of that (which wasn't very helpful) I don't really care.

# Hey, cool. The actual module begins here.
use strict;
use vars qw(@ISA @EXPORT @EXPORT_OK $VERSION);
use vars qw(%entity2char %char2entity);
# Bla bla bla. Oh, wait, that last line looks promising.

# Some more stuff for Exporter happens next. I don't care.

%entity2char = (
    # What follows is a long, long, long mapping of character names
    # to actual characters.
    # This list goes on and on and on... Never knew there were so many!
);

# Then, suddenly:

# Make the opposite mapping
while (my($entity, $char) = each(%entity2char)) {
    $entity =~ s/;\z//;
    $char2entity{$char} = "&$entity;";
}
delete $char2entity{"'"};  # only one-way decoding

He, he, he. I think we win. Just one line should, theoretically, keep this whole mean machine from replacing your characters with the html entities. It's a bit of a shame, since the original authors of this module went through such pain to first set up one mapping (which is really a handful of pages long) and then to reverse that mapping, but well, they should've made entity-encoding optional in the first place.

Just one line, I think (although again it's untested). Note that the hash lives in the HTML::Entities package:

%HTML::Entities::char2entity = (); # Bye bye.

Addendum: for completeness' sake, you'd put this line somewhere before you begin printing. Something like this should do the trick.

%HTML::Entities::char2entity = (); # Bye bye.

open my $fh, ">", "out.html" or die $!;

print $fh "<html><body>"
          .   join("\n", map { $_->as_HTML } ($tit, $sub, $aut, $art))
          .   "</body></html>";
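An alternative that leaves the module's internals alone is sketched here, on the assumption that the installed HTML::Element accepts a string of characters to escape as the first argument of as_HTML and hands it on to encode_entities (the call seen in the source excerpt above). The element and its text are invented for illustration, and the tree is built by hand so that the text is already decoded.

```perl
use strict;
use warnings;
use HTML::Element;

# Build a tiny tree by hand; "\x{FA}" is the decoded character "ú".
my $p = HTML::Element->new('p');
$p->push_content("Per\x{FA} & Chile");

# Default call: the entity set is left undefined.
my $escaped = $p->as_HTML;

# Assumption: passing '<>&' limits escaping to the characters that
# are unsafe in markup, so "ú" is left as a literal character.
my $literal = $p->as_HTML('<>&');

# Encode on output so the literal character is written as UTF-8 bytes.
binmode STDOUT, ':encoding(UTF-8)';
print $escaped, "\n", $literal, "\n";
```

In the original script this would amount to calling $_->as_HTML('<>&') inside the map and opening out.html with ">:encoding(UTF-8)".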
Re^4: transforming html by morgon (Priest) on Sep 29, 2010 at 21:07 UTC
Re^5: transforming html by muba (Priest) on Sep 29, 2010 at 23:21 UTC
Re: transforming html by salmonix (Initiate) on Sep 29, 2010 at 10:10 UTC
What you get are the html utf-8 escapes. See this table. Perhaps you should make a proper header for the page and indicate the encoding using CGI(::something), like

use CGI;
my $cgi = CGI->new();
print $cgi->header( -charset => 'utf-8' );

or something similar. You may also try passing your result text through Encode::Detect::Detector and, according to the result, convert it to utf-8 or the encoding you need, substitute the strings in your html code, and binmode the output as utf-8. This encoding story can be very twisted, so try to make a strategy for that.
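Since the files in question are written to disk rather than served over HTTP, one way to follow the advice above is to declare the charset inside the document and write it through an encoding layer. This is a minimal core-Perl sketch; the $body fragment is a hypothetical stand-in for the extracted nodes, and an in-memory filehandle stands in for out.html.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the extracted fragments; "\x{FA}" is "ú".
my $body = "<h1>Per\x{FA}</h1>";

# Declare the encoding inside the document itself, since a static file
# has no HTTP header to carry the charset.
my $page = join "\n",
    '<html><head>',
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">',
    '</head><body>',
    $body,
    '</body></html>';

# Write through an encoding layer; the in-memory handle replaces out.html.
open my $fh, '>:encoding(UTF-8)', \my $bytes or die "can't open buffer: $!";
print {$fh} $page;
close $fh or die "close failed: $!";

printf "wrote %d bytes for %d characters\n", length $bytes, length $page;
```

The byte count exceeds the character count by one because the single non-ASCII character is written as the two bytes c3 ba.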

