PerlMonks  

Wierd behaviour with HTML::Entities::decode_entities()

by Anonymous Monk
on Dec 13, 2009 at 21:16 UTC

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Here's my code -
    BEGIN { $ENV{LC_ALL} = $ENV{LANG} = 'sv_SE.UTF-8' }

    use strict;
    use warnings;
    use utf8;
    use open ':std', ':locale';

    use LWP::UserAgent qw( get );
    use HTML::Strip qw( );
    use HTML::Entities qw( decode_entities );
    use Text::Sentence qw( split_sentences );

    my $userAgent = LWP::UserAgent->new();
    #$userAgent->agent('Mozilla/5.0');
    my $url = "http://www.expressen.se";
    my $response = $userAgent->get($url);
    die "Can't get $url: ", $response->status_line
        unless $response->is_success;

    my $hs = HTML::Strip->new( decode_entities => 0 );
    my $parsedContent = $hs->parse( $response->content );
    utf8::decode( my $decodedParsedContent = $parsedContent );
    $decodedParsedContent =~ s/(\s)+/ /g; # remove double whitespace
    decode_entities(my $decodedParsedContentWithDecodeEntities = $decodedParsedContent);

    my @sentences = split_sentences( $decodedParsedContentWithDecodeEntities );
    foreach my $sentence (@sentences) {
        #$sentence =~ s/^\s+//; # remove leading whitespace
        #$sentence =~ s/\s+$//; # remove trailing
        decode_entities(my $sentenceDecodeEntities = $sentence);
        while ($sentenceDecodeEntities =~ /(\w+)/g) {
            print "$1 : ".$sentenceDecodeEntities."\n";
        }
    }
One of my output lines is -
    gor : BLOGG Europe Turnéblogg Mic i vår replokal "The Dungeon" BLOGG Lotta Gröning Krönikör Demokratifiasko...
This looks good. However, if I comment out either of the two calls to decode_entities(), I end up getting -
    gor : BLOGG Europe Turnéblogg Mic i vår replokal "The Dungeon&quot; BLOGG Lotta Gröning Krönikör Demokratifiasko...
Why do I need the two calls to decode_entities()???
Thanks very much for your help!

Replies are listed 'Best First'.
Re: Wierd behaviour with HTML::Entities::decode_entities()
by ikegami (Patriarch) on Dec 13, 2009 at 22:50 UTC
    • Wow, why the switch to insane variable names? You're not even separating the words! Your code is unreadable. (Wall of text.)

    • Anyway, you always discard the result of decode_entities, so I don't see how you could claim it ever works.

    • And the order of the following statements is backwards:

      $decodedParsedContent =~ s/(\s)+/ /g; # remove double whitespace
      decode_entities(my $decodedParsedContentWithDecodeEntities = $decodedParsedContent);

      You're removing the spaces before you decode the entities that might result in spaces.

    • Finally, s/(\s)+/ /g is wrong. Except in some very special circumstances, applying a quantifier to a capture makes no sense. Which brings me to the point that you don't need to be capturing at all. If you need parens, they're (?:...): s/(?:\s)+/ /g. Of course, there's only one atom in the parens, so all you need is s/\s+/ /g.
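    A minimal sketch of the last two points, assuming HTML::Entities is installed. The sample string and variable names are made up for illustration. It compares collapsing whitespace before decoding with decoding first and collapsing afterwards, using the plain s/\s+/ /g form:

```perl
use strict;
use warnings;
use HTML::Entities qw( decode_entities );

# Hypothetical input: two numeric entities that decode to spaces.
my $raw = 'Mic&#32;&#32;i replokal';

# Collapsing first misses the spaces that are still hidden in entities.
(my $collapse_first = $raw) =~ s/\s+/ /g;
decode_entities($collapse_first);    # void context: decodes in place

# Decoding first exposes them, so the collapse catches everything.
my $decode_first = decode_entities($raw);
$decode_first =~ s/\s+/ /g;

print "collapse, then decode: [$collapse_first]\n";
print "decode, then collapse: [$decode_first]\n";
```

    The first line printed keeps a double space; the second does not.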

    Update: "If called in void context the arguments are decoded in-place.", oops.
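    For reference, a small sketch of the two calling styles, assuming HTML::Entities is installed. The variable names are illustrative, not the OP's:

```perl
use strict;
use warnings;
use HTML::Entities qw( decode_entities );

my $old = '&quot;The Dungeon&quot;';

# Non-void context: the decoded string is returned and $old is left alone.
my $new = decode_entities($old);

# Void context: the argument itself is decoded in place. This idiom
# copies $old into a fresh variable and then decodes that copy.
decode_entities(my $copy = $old);

print "old:  $old\n";
print "new:  $new\n";
print "copy: $copy\n";
```

    Both $new and $copy end up decoded, while $old still holds the entities.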

    Your test produces waaaaay too much output, so I can't even see the problem, much less diagnose it.

Re: Wierd behaviour with HTML::Entities::decode_entities()
by JadeNB (Chaplain) on Dec 14, 2009 at 04:16 UTC

    I'm not very experienced with either of these modules, but, as ikegami points out, some of your code seems strange. For example, if you're going to put the result of decode_entities in another scalar anyway, why use the hard-to-read decode_entities(my $new = $old) rather than the more natural my $new = decode_entities $old? Have you looked at $decodedParsedContentWithDecodeEntities? I'd take a look at that, because, well, you have unexpected behaviour, and it's good to know what's happening every step of the way.

    Also note that the Text::Sentence documentation says:

    The split_sentences function takes a scalar containing ascii text as an argument and returns an array of sentences that the text has been split into.
    —that is, it mentions that it's expecting ASCII text, which you're explicitly not giving it.

    I'm also puzzled how you can get the gor : BLAH line at all. It seems that you're printing lines of the form word : words (why?), with the left-hand side a word in the right-hand side, but gor doesn't appear in the right-hand side of the output that you displayed.

    UPDATE: For that matter, have you looked at $decodedParsedContent itself? A quick look at the non-XS part of the source for HTML::Entities reveals that it's just substituting decimal, then hexadecimal, then named entities. One can imagine a strange scenario where, say, the expansion of a hexadecimal entity creates a decimal entity; it's possible, though (I imagine) unlikely, that you're seeing that here.

      One can imagine a strange scenario where,

      If the scenario you describe is possible, it's a bug in HTML::Entities. And it's not what the OP is seeing. The OP claims he needs to do too many decodings, not too few.

      I think you're thinking of double-encoding, where "foo" was accidentally encoded as

      &amp;quot;foo&amp;quot;
      when it should have been encoded as
      &quot;foo&quot;
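      A short sketch of what double-encoding looks like from the decoder's side, assuming HTML::Entities is installed. The input string is a made-up example, not taken from the page the OP fetched:

```perl
use strict;
use warnings;
use HTML::Entities qw( decode_entities );

# Hypothetical double-encoded input: the & of &quot; was itself encoded.
my $double = '&amp;quot;foo&amp;quot;';

# The first pass only turns &amp; back into &, leaving &quot; behind.
my $once = decode_entities($double);

# A second pass is needed to get the quote characters.
my $twice = decode_entities($once);

print "once:  $once\n";
print "twice: $twice\n";
```

      If the source HTML is double-encoded like this, needing two calls is expected rather than weird.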
        If the scenario you describe is possible, it's a bug in HTML::Entities. And it's not what the OP is seeing. The OP claims he needs to do too many decodings, not too few.

        I don't think that I'm claiming what you think I'm claiming. :-) What I meant was that, say, 
 (or even just 
) would be interpreted (incorrectly) as 
 by two passes of decode_entities, but not by one. This gives “unexpected decoding” after the second pass, but it's not a bug in HTML::Entities.

        (UPDATE: I meant what I meant, but it wasn't quite what I said. A better example is a, which becomes a after one pass of the decoder and then (incorrectly) a after another. This is very particular to the ordering I mentioned earlier (first decimal, then hexadecimal, then named entities are expanded). This is the ordering in the pure-Perl decode_entities_old in HTML::Entities; I have no idea if the XS version also behaves this way. Perhaps you thought that I was mentioning that, say, &#amp;quot; " would be incorrectly converted to "? You're right, it seems to me that that is what will happen, and that it is a bug.)

        On the other hand, I couldn't, and can't, think of a way that this would give the behaviour that the OP is seeing. The kind of double-encoding you mentioned sounds far more likely—and the remedy, I think, is the same, to look at the intermediate steps along the way to see where something's going wrong. (Actually, I guess that's so generic that it's true for just about any problem.)
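        One way to look at the intermediate steps, sketched under the assumption that HTML::Entities is installed. The helper decode_passes and its sample input are hypothetical, written only for this illustration. It decodes repeatedly until the string stops changing and keeps every intermediate result:

```perl
use strict;
use warnings;
use HTML::Entities qw( decode_entities );

# Hypothetical helper: returns the input followed by the result of each
# decoding pass that changed the string, up to $max passes.
sub decode_passes {
    my ($text, $max) = @_;
    $max //= 5;
    my @steps = ($text);
    for (1 .. $max) {
        my $next = decode_entities($steps[-1]);   # non-void: returns a copy
        last if $next eq $steps[-1];
        push @steps, $next;
    }
    return @steps;
}

my @steps = decode_passes('&amp;quot;The Dungeon&amp;quot;');
printf "pass %d: %s\n", $_, $steps[$_] for 0 .. $#steps;
```

        More than two entries in @steps means a single pass was not enough, which points at double-encoded input. This is a diagnostic only: text that legitimately contains an encoded entity name would be over-decoded by the extra passes.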
