Weird behaviour with HTML::Entities::decode_entities()

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Here's my code -

BEGIN { $ENV{LC_ALL} = $ENV{LANG} = 'sv_SE.UTF-8' }

use strict;
use warnings;
use utf8;

use open ':std', ':locale';

use LWP::UserAgent qw( get );
use HTML::Strip    qw( );
use HTML::Entities qw( decode_entities );
use Text::Sentence qw( split_sentences );

my $userAgent = LWP::UserAgent->new();
#$userAgent->agent('Mozilla/5.0');

my $url = "http://www.expressen.se";

my $response = $userAgent->get($url);
  die "Can't get $url: ", $response->status_line
   unless $response->is_success;

my $hs = HTML::Strip->new( decode_entities => 0 );
my $parsedContent = $hs->parse( $response->content );
utf8::decode( my $decodedParsedContent = $parsedContent );
$decodedParsedContent =~ s/(\s)+/ /g; # remove double whitespace
decode_entities(my $decodedParsedContentWithDecodeEntities = $decodedParsedContent);

my @sentences = split_sentences( $decodedParsedContentWithDecodeEntities );
foreach my $sentence (@sentences) 
{
    #$sentence =~ s/^\s+//; # remove leading whitespace
    #$sentence =~ s/\s+$//; # remove trailing whitespace
    decode_entities(my $sentenceDecodeEntities = $sentence);
    
    while ($sentenceDecodeEntities =~ /(\w+)/g) 
    {
        print "$1 : ".$sentenceDecodeEntities."\n";
    }
}

One of my output lines is -

gor : BLOGG Europe Turnéblogg Mic i vår replokal "The Dungeon" BLOGG Lotta Gröning Krönikör Demokratifiasko...

This looks good. However, if I comment out either of the two calls to decode_entities(), I get this instead -

gor : BLOGG Europe Turnéblogg Mic i vår replokal &quot;The Dungeon&quot; BLOGG Lotta Gröning  Krönikör Demokratifiasko...

Why do I need the two calls to decode_entities()???
Thanks very much for your help!

Re: Weird behaviour with HTML::Entities::decode_entities() by ikegami (Patriarch) on Dec 13, 2009 at 22:50 UTC
Wow, why the switch to insane variable names? You're not even separating the words! Your code is unreadable. (Wall of text.)

~~Anyway, you always discard the result of `decode_entities`, so I don't see how you could claim it ever works.~~

And the order of the following statements is backwards:

`$decodedParsedContent =~ s/(\s)+/ /g; # remove double whitespace`
`decode_entities(my $decodedParsedContentWithDecodeEntities = $decodedParsedContent);`

You're removing the spaces before you decode the entities that might result in spaces.

Finally, `s/(\s)+/ /g` is wrong. Except for `?` in some very special circumstances, applying a quantifier to a capture makes no sense. Which brings me to the point that you don't need to be capturing at all. If you need parens, they're `(?:...)`: `s/(?:\s)+/ /g`. Of course, there's only one atom in the parens, so all you need is `s/\s+/ /g`.

Update: "If called in void context the arguments are decoded in-place.", oops.

Your test produces waaaaay too much output, so I can't even see the problem, much less diagnose it.
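To make the regex point concrete, here is a minimal sketch using only core Perl. The sample string and the variable names are made up for illustration; the three substitutions are the ones discussed in the reply above, and on this input they all give the same result.

```perl
use strict;
use warnings;

# Sample text with mixed runs of whitespace (illustrative only).
my $text = "BLOGG \t Europe\n\n  The Dungeon";

# Three spellings of the same substitution. Each collapses every run of
# whitespace into a single space.
(my $with_capture    = $text) =~ s/(\s)+/ /g;    # capture group is never used
(my $with_noncapture = $text) =~ s/(?:\s)+/ /g;  # grouping without capturing
(my $plain           = $text) =~ s/\s+/ /g;      # simplest form

print "$plain\n";
print "all three agree\n"
    if $with_capture eq $plain and $with_noncapture eq $plain;
```

Since the capture adds nothing here, the plain `\s+` form is the one to prefer.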
Re: Weird behaviour with HTML::Entities::decode_entities() by JadeNB (Chaplain) on Dec 14, 2009 at 04:16 UTC
I'm not very experienced with either of these modules, but, as ikegami points out, some of your code seems strange. For example, if you're going to put the result of `decode_entities` in another scalar anyway, why use the hard-to-read `decode_entities(my $new = $old)` rather than the more natural `my $new = decode_entities $old`?

Have you looked at `$decodedParsedContentWithDecodeEntities`? I'd take a look at that, because, well, you have unexpected behaviour, and it's good to know what's happening every step of the way.

Also note that the Text::Sentence documentation says: "The split sentences function takes a scalar containing ascii text as an argument and returns an array of sentences that the text has been split into." That is, it mentions that it's expecting ASCII text, which you're explicitly not giving it.

I'm also puzzled how you can get the `gor : BLAH` line at all. It seems that you're printing lines of the form `word : words` (why?), with the left-hand side a word in the right-hand side, but `gor` doesn't appear in the right-hand side of the output that you displayed.

UPDATE: For that matter, have you looked at `$decodedParsedContent` itself? A quick look at the non-XS part of the source for HTML::Entities reveals that it's just substituting decimal, then hexadecimal, then named entities. One can imagine a strange scenario where, say, the expansion of a hexadecimal entity creates a decimal entity; it's possible, though (I imagine) unlikely, that you're seeing that here.
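For comparison, the two calling styles mentioned above can be put side by side. This is only a sketch and assumes HTML::Entities is installed; the sample string and variable names are invented. It relies on the documented behaviour that `decode_entities` decodes its arguments in place in void context and otherwise returns decoded copies.

```perl
use strict;
use warnings;
use HTML::Entities qw( decode_entities );

# Illustrative input, borrowed from the output shown in the question.
my $old = 'replokal &quot;The Dungeon&quot;';

# Style from the original post: copy first, then decode the copy in place
# (the call is in void context, so the argument itself is modified).
decode_entities(my $copy = $old);

# Style suggested in the reply: use the return value; $old is left alone.
my $new = decode_entities($old);

print "$copy\n";
print "$new\n";
print "$old\n";
```

Both styles should produce the same decoded string, so the choice is mostly about readability.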
Re^2: Weird behaviour with HTML::Entities::decode_entities() by ikegami (Patriarch) on Dec 14, 2009 at 06:22 UTC
"One can imagine a strange scenario where, ..."

If the scenario you describe is possible, it's a bug in HTML::Entities. And it's not what the OP is seeing. The OP claims he needs to do too many decodings, not too few.

I think you're thinking of double-encoding, where `"foo"` was accidentally encoded as `&amp;quot;foo&amp;quot;` when it should have been encoded as `&quot;foo&quot;`.
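Double-encoding is easy to reproduce. The sketch below assumes HTML::Entities is installed and uses invented variable names. It shows that text encoded twice needs two decoding passes to get back to the original, which would look like the symptom in the question; whether the page in question really is double-encoded has not been confirmed in this thread.

```perl
use strict;
use warnings;
use HTML::Entities qw( encode_entities decode_entities );

my $text = '"foo"';

# Encoding once is correct; encoding the result again is the accident.
my $encoded_once  = encode_entities($text);
my $encoded_twice = encode_entities($encoded_once);

# One decoding pass only undoes the outer layer of encoding.
my $decoded_once  = decode_entities($encoded_twice);
my $decoded_twice = decode_entities($decoded_once);

print "encoded once : $encoded_once\n";
print "encoded twice: $encoded_twice\n";
print "decoded once : $decoded_once\n";
print "decoded twice: $decoded_twice\n";
```

Printing the intermediate values of the real data in the same way is a quick check for this.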
Re^3: Weird behaviour with HTML::Entities::decode_entities() by JadeNB (Chaplain) on Dec 14, 2009 at 11:29 UTC
"If the scenario you describe is possible, it's a bug in HTML::Entities. And it's not what the OP is seeing. The OP claims he needs to do too many decodings, not too few."

I don't think that I'm claiming what you think I'm claiming. :-) What I meant was that, say, `&#38;#10;` (or even just `&amp;#10;`) would be interpreted (incorrectly) as a newline by two passes of `decode_entities`, but not by one. This gives "unexpected decoding" after the second pass, but it's not a bug in HTML::Entities.

(UPDATE: I meant what I meant, but it wasn't quite what I said. A better example is `&#x26;#97;`, which becomes `&#97;` after one pass of the decoder and then (incorrectly) `a` after another. This is very particular to the ordering I mentioned earlier (first decimal, then hexadecimal, then named entities are expanded). This is the ordering in the pure-Perl `decode_entities_old` in HTML::Entities; I have no idea if the XS version also behaves this way. Perhaps you thought that I was mentioning that, say, ~~`&#amp;quot;`~~ `&#38;quot;` would be incorrectly converted to `"`? You're right, it seems to me that that is what will happen, and that it is a bug.)

On the other hand, I couldn't, and can't, think of a way that this would give the behaviour that the OP is seeing. The kind of double-encoding you mentioned sounds far more likely, and the remedy, I think, is the same: look at the intermediate steps along the way to see where something's going wrong. (Actually, I guess that's so generic that it's true for just about any problem.)
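The pass-ordering effect described above (decimal, then hexadecimal, then named) can be sketched in a few lines of core Perl. To be clear, `three_pass_decode` and its tiny `%named` table are hypothetical and heavily simplified; this is not the HTML::Entities source, and it says nothing about how the XS decoder behaves. It only illustrates why running the passes in sequence lets the output of one pass be picked up later.

```perl
use strict;
use warnings;

# Hypothetical, simplified decoder: NOT the HTML::Entities implementation,
# only the pass ordering described above.
my %named = ( amp => '&', quot => '"', lt => '<', gt => '>' );

sub three_pass_decode {
    my ($s) = @_;
    $s =~ s{&#(\d+);}{chr($1)}ge;                    # pass 1: decimal
    $s =~ s{&#[xX]([0-9a-fA-F]+);}{chr(hex $1)}ge;   # pass 2: hexadecimal
    $s =~ s{&(\w+);}{exists $named{$1} ? $named{$1} : "&$1;"}ge;  # pass 3: named
    return $s;
}

# A hexadecimal entity whose expansion forms a decimal entity: the decimal
# pass has already run, so it survives the first call and goes in the second.
my $once  = three_pass_decode('&#x26;#97;');
my $twice = three_pass_decode($once);

# A decimal entity whose expansion forms a named entity: the named pass runs
# later in the same call, so a single call decodes two levels.
my $bug = three_pass_decode('&#38;quot;');

print "once : $once\n";
print "twice: $twice\n";
print "bug  : $bug\n";
```

A decoder that scans the input once and never rescans its own output would leave `&quot;` in the last case, which is why the single-call result here counts as a bug.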
Re^4: Weird behaviour with HTML::Entities::decode_entities() by ikegami (Patriarch) on Dec 14, 2009 at 16:27 UTC
Re^5: Weird behaviour with HTML::Entities::decode_entities() by JadeNB (Chaplain) on Dec 14, 2009 at 17:02 UTC