PerlMonks  

transforming html

by morgon (Priest)
on Sep 28, 2010 at 23:52 UTC ( [id://862527]=perlquestion )

morgon has asked for the wisdom of the Perl Monks concerning the following question:

Venerable monks!

I have a collection of Spanish-language HTML files that I want to convert into a Plucker document.

But before I can do that I need to get rid of some crap that the files contain, so I parse them, extract the things I want and create a new html-file containing just the extracted bits.

The files I start with are UTF-8 encoded and claim (in the DOCTYPE) to be XHTML, but they don't validate (missing closing tags, oh well), so I use HTML::TreeBuilder::XPath for parsing.

Now the thing is that the source documents do not use any HTML entities but contain the Spanish special characters as 2-byte UTF-8 characters. And this is where I have a problem.

Here is my code (which works apart from the problem below):

    use strict;
    use HTML::TreeBuilder::XPath;

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_file("in.html");

    my ($tit) = $tree->findnodes(q{/html//h1[@class='title']});
    my ($sub) = $tree->findnodes(q{/html//div[@class='submitted']});
    my ($aut) = $tree->findnodes(q{/html//div[@class='autor']});
    my ($art) = $tree->findnodes(q{/html//div[@class='content clear-block']});
    my ($nav) = $tree->findnodes(q{/html//div[@class='book-navigation']});
    $nav->detach;

    my (@childs, undef, undef) = $art->content_list;

    open my $fh, ">", "out.html" or die $!;
    print $fh "<html><body>"
        . join("\n", map { $_->as_HTML } ($tit, $sub, $aut, $art))
        . "</body></html>";

What happens now is that, e.g., the Spanish character ú, which is encoded as hex c3ba in the source document, gets transformed into &Atilde;&ordm; (i.e. Ãº) in the output, and that is wrong...
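
One way to see where &Atilde;&ordm; could come from, assuming the parser read the file's bytes as Latin-1 rather than UTF-8, is the following minimal sketch. It uses only the core Encode module; the variable names are illustrative and not part of the original script.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The two bytes that encode "ú" (U+00FA) in UTF-8.
my $bytes = "\xC3\xBA";

# Read as Latin-1, each byte becomes its own character:
# U+00C3 (&Atilde;) and U+00BA (&ordm;).
my $as_latin1 = decode('ISO-8859-1', $bytes);

# Decoded as UTF-8, the same bytes become the single intended character.
my $as_utf8 = decode('UTF-8', $bytes);

# Helper (hypothetical name) that lists the code points of a string.
sub codepoints {
    my ($string) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $string;
}

printf "latin-1: %d chars: %s\n", length($as_latin1), codepoints($as_latin1);
printf "utf-8:   %d chars: %s\n", length($as_utf8),   codepoints($as_utf8);
```

The Latin-1 reading yields two characters and the UTF-8 reading yields one, which matches the two entities seen in the output.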

Does someone have an idea on how to fix this?

Many thanks!

Replies are listed 'Best First'.
Re: transforming html
by muba (Priest) on Sep 29, 2010 at 01:44 UTC

    It sounds like HTML::TreeBuilder isn't dealing with the UTF-8 encoding right. However, I think there's an easy solution, hinted at by the documentation for HTML::TreeBuilder and open.

    HTML::TreeBuilder:

        $root = HTML::TreeBuilder->new()
        $root->parse_file(...)
    An important method inherited from HTML::Parser, which see. Current versions of HTML::Parser can take a filespec, or a filehandle object, like *FOO, or some object from class IO::Handle, IO::File, IO::Socket) or the like. I think you should check that a given file exists before calling $root->parse_file($filespec).
    Ok, so it accepts file handles? Good...

    open:

    open(my $fh, "<:encoding(UTF-8)", "filename") || die "can't open UTF-8 encoded filename: $!";
    Ok, so we can specify which encoding to use when we open a file? Hmm!
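
    To see what the :encoding(UTF-8) layer changes on input, here is a small sketch using only core modules. The scratch file and variable names are illustrative assumptions; it reads the same two bytes once without a layer and once with it.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the raw UTF-8 bytes for "ú" to a scratch file (illustrative data).
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "\xC3\xBA";
close $out or die "close failed: $!";

# Without a layer, Perl hands back the two bytes as two characters.
open my $raw, '<', $path or die "can't open $path: $!";
my $undecoded = do { local $/; <$raw> };
close $raw;

# With the layer, the bytes are decoded into one character, U+00FA.
open my $dec, '<:encoding(UTF-8)', $path or die "can't open $path: $!";
my $decoded = do { local $/; <$dec> };
close $dec;

printf "raw: %d chars, decoded: %d chars (U+%04X)\n",
    length($undecoded), length($decoded), ord($decoded);
```

    A parser fed from the second handle receives characters instead of bytes, so it has no reason to treat c3 and ba as two separate Latin-1 letters.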

    So here's what I'd try.

        open my $fh, "<:encoding(UTF-8)", $yourOriginalFileName
            or die "can't open $yourOriginalFileName: $!";
        my $tree = HTML::TreeBuilder::XPath->new;
        $tree->parse_file($fh);

    Untested though, but hopefully it helps.

    Edit: made a weird mistake in my code (as well as in the links). Fixed. I hope.

      Great - many thanks!

      Opening the file with the proper encoding and passing the filehandle to parse_file indeed results in the proper entities being used - which is good enough for me.

      For extra points:

      Can you think of a way to make as_HTML emit the UTF-8 characters as in the source document, without replacing them with HTML entities?

        I'm not sure why you'd want this, as in the end it renders pretty much the same, but I suppose you have your reasons.

        After digging through the documentation a bit, I finally found that as_HTML() is defined on HTML::Element and those docs don't really hint at a way to prevent that encoding from happening.

        But we don't let them discourage us that easily, do we? So after diving into the source of HTML::Element and having a look at the code of the as_HTML subroutine, I learned that the entities encoding is handled by HTML::Entities.

            sub as_HTML {
                # Bla bla bla
                # Your typical subroutine initial stuff we don't care much about...

                if ( ... ) {
                    # Some condition I don't really understand since I didn't bother to
                    # understand the initial stuff above. But it didn't seem too relevant.

                    # A whole lot of stuff happens here, seemingly all dealing with tags,
                    # not with text.
                }
                else {    # it's a text segment
                    # Hey! Cool.
                    # One more line of bla bla bla, before...:
                    HTML::Entities::encode_entities( $node, $entities )
                    # Yeah, this sounds about right. Let's look at that.

                    # More stuff I didn't bother to look at...
                }
            }

        Ok, so HTML::Entities is our target now. There's no apparent way to disable entity encoding so we'll have to use the source as our documentation again. *Shrug*, whatever, it's way past bedtime anyway now so I might as well see what I can do.

            # HTML::Entities
            # First there's a whole lot of POD here, but since I already saw the HTML
            # version of that (which wasn't very helpful) I don't really care.

            # Hey, cool. The actual module begins here.
            use strict;
            use vars qw(@ISA @EXPORT @EXPORT_OK $VERSION);
            use vars qw(%entity2char %char2entity);
            # Bla bla bla. Oh, wait, that last line looks promising.

            # Some more stuff for Exporter happens next. I don't care.

            %entity2char = (
                # What follows is a long, long, long mapping of character names
                # to actual characters.
                # This list goes on and on and on... Never knew there were so many!
            );

            # Then, suddenly:
            # Make the opposite mapping
            while (my($entity, $char) = each(%entity2char)) {
                $entity =~ s/;\z//;
                $char2entity{$char} = "&$entity;";
            }
            delete $char2entity{"'"};  # only one-way decoding

        He, he, he. I think we win. Just one line should, theoretically, keep this whole mean machine from replacing your characters with the html entities. It's a bit of a shame, since the original authors of this module went through such a pain to first set up one mapping (which is really a handful of pages long) and then to revert that mapping, but well, they should've made entity-encoding optional in the first place. Just one line, I think (although again it's untested).

            %HTML::Entities::char2entity = (); # Bye bye.

        Addendum: for completeness' sake, you'd put this line somewhere before you begin printing. Something like this should do the trick.

            %HTML::Entities::char2entity = (); # Bye bye.

            open my $fh, ">", "out.html" or die $!;
            print $fh "<html><body>"
                . join("\n", map { $_->as_HTML } ($tit, $sub, $aut, $art))
                . "</body></html>";
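
        A possibly simpler route, offered as a sketch rather than a tested fix for the original files: as far as I can tell from the HTML::Entities source, characters missing from %char2entity still fall back to numeric references, so emptying the hash may just trade &uacute; for &#xFA;. The HTML::Element documentation describes an optional first argument to as_HTML that lists the characters to encode, and passing '<>&' should leave accented letters alone. The input fragment and variable names below are illustrative.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Parse a small illustrative fragment (not the poster's real input).
# \x{FA} is the character "ú".
my $tree = HTML::TreeBuilder->new_from_content(
    qq{<html><body><h1 class="title">Men\x{FA}</h1></body></html>}
);

my ($h1) = $tree->look_down(_tag => 'h1');

# Default: every "unsafe" character, including "ú", becomes an entity.
my $default = $h1->as_HTML;

# Only the markup-significant characters are encoded; "ú" stays as it is.
my $minimal = $h1->as_HTML('<>&');

# The output handle needs an encoding layer so "ú" is written as UTF-8.
binmode STDOUT, ':encoding(UTF-8)';
print $default, "\n", $minimal, "\n";

$tree->delete;
```

        When writing to a file instead of STDOUT, opening it with ">:encoding(UTF-8)" serves the same purpose as the binmode call here.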
Re: transforming html
by salmonix (Initiate) on Sep 29, 2010 at 10:10 UTC

    What you get are the html utf-8 escapes. See this table

    Perhaps you should make a proper header for the page and indicate the encoding using CGI(::something), like

        use CGI;
        my $cgi = CGI->new();
        print $cgi->header( -charset => 'utf-8' );

    or something similar. You may also try passing your result text through Encode::Detect::Detector and, according to the result, converting it to UTF-8 or the encoding you need, then substituting the strings in your HTML code and using binmode to write the output as UTF-8.

    This encoding story can be very twisted so try to make a strategy for that.

Node Type: perlquestion [id://862527]
Approved by zwon