PerlMonks  

By the shine on my bald pate, I dislike this encoding stuff

by jfrm (Monk)
on Mar 04, 2018 at 07:40 UTC (#1210301=perlquestion)
jfrm has asked for the wisdom of the Perl Monks concerning the following question:

I spent most of last week wading/stumbling through a conversion of a large part of my system to UTF-8. After much reading and inexplicableness, I'm virtually there. I still have one problem, though: one routine expects UTF-8 data and can crash if it is given data from a non-UTF-8 file. So I need to test a file for UTF-8-edness. I have read much documentation/PerlMonks/Stack Overflow and apparently the following should work:

    open (ORDERFILE, '<:encoding(UTF-8)', $emailfile) or return (@err, "Could not open order email file: $emailfile");
    my(@LINES) = <ORDERFILE>;
    my $filedata = <ORDERFILE>;
    close(ORDERFILE);
    use Encode;
    eval { my $utf8 = decode("utf8", $filedata, Encode::FB_CROAK ) };
    return(@err, "File was not encoded in UTF-8") if ($@);

But I have ANSI files for which this doesn't return but just outputs lots of warnings such as: utf8 "\xA3" does not map to Unicode. If I remove the '<:encoding(UTF-8)' argument from 'open', it still works but there are no warnings. A salient insight would be a welcome relief, if anyone has any ideas.

Replies are listed 'Best First'.
Re: By the shine on my bald pate, I dislike this encoding stuff
by haukex (Canon) on Mar 04, 2018 at 11:52 UTC

    In addition to the issue poj pointed out with reading from <ORDERFILE> twice (my(@LINES) = <ORDERFILE> reads all lines from the file, so $filedata would normally be empty), I just wanted to point out that the pattern eval {...}; if ($@) {...} has issues and that the pattern eval {...; 1} or do {...} or a module like Try::Tiny is better. Also, nowadays lexical filehandles (open my $fh, ...) are generally preferred over bareword filehandles (open ORDERFILE, ...). (Update: The AM also made a good point that you appear to be decoding the data twice.)

    Really, the best way to go is to know in advance what encoding your files are in, and then open them with the appropriate encoding, as in open my $fh, '<:encoding(...)', $filename or die $!;

    You may want to have a look at The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
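
A minimal sketch of the two suggestions above (a lexical filehandle, and eval {...; 1} rather than testing $@ afterwards). The helper name slurp_raw and its (value, error) return convention are assumptions made for illustration; they are not from the original post.

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file as raw bytes through a lexical
# filehandle. Returns (bytes, undef) on success or (undef, message) on failure.
sub slurp_raw {
    my ($path) = @_;
    my $bytes;
    my $ok = eval {
        open my $fh, '<:raw', $path
            or die "Could not open $path: $!\n";
        local $/;          # slurp mode: read the whole file in one go
        $bytes = <$fh>;
        close $fh;
        1;                 # the eval is true only if nothing died
    };
    return ( $bytes, undef ) if $ok;
    # $@ can be empty or clobbered in some cases, hence the default text.
    my $err = $@ || "unknown error\n";
    return ( undef, $err );
}
```

The point of the trailing 1 is that success is judged by the value of the eval itself, so the error branch still runs if $@ was reset before it could be inspected.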

Re: By the shine on my bald pate, I dislike this encoding stuff
by Anonymous Monk on Mar 04, 2018 at 14:24 UTC

    Strictly speaking, a file containing "\xA3" is not ASCII, since ASCII only consists of the characters from "\x00" to "\x7F". Maybe it's ISO Latin-1?

    Also, your logic double-decodes the file. Assuming it is UTF-8, opening it '<:encoding(UTF-8)' decodes it, and then your decode() decodes it again.

    My knee-jerk would be to apply Encode::Guess to the problem, since that way somebody else has worked out this mess for you, and since if you are going to convert the file to UTF-8 you need to know what its encoding currently is. If I just wanted to know if the file decoded as UTF-8 I might be lazy and do something like

    open my $orderfile, '<:raw', $emailfile
        or return( @err, "Could not open $emailfile: $!" );
    local $/ = undef;
    my $filedata = <$orderfile>;
    close $orderfile;
    use Encode;
    eval {
        decode( "utf-8", $filedata, Encode::FB_CROAK );
        1;
    } or return( @err, "File was not encoded in UTF-8" );
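
If the goal is to read the legacy files rather than reject them, one possible variation on the check above is to try strict UTF-8 first and fall back to a single-byte encoding. This is only a sketch: the helper name decode_with_fallback is hypothetical, and ISO-8859-1 as the fallback is a guess based on the "\xA3" byte (a pound sign in Latin-1), not something the original poster confirmed. Encode::LEAVE_SRC is added because decode() with a CHECK argument otherwise modifies its source string.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns (text, encoding_name_used).
# ASSUMPTION: anything that is not valid UTF-8 is in $fallback
# (ISO-8859-1 by default); that is a guess, not a detection.
sub decode_with_fallback {
    my ( $bytes, $fallback ) = @_;
    $fallback //= 'ISO-8859-1';

    # Strict decode; dies on malformed input, so $text stays undef.
    my $text = eval {
        decode( 'UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC );
    };
    return ( $text, 'UTF-8' ) if defined $text;

    # Every byte value is a valid ISO-8859-1 character, so this fallback
    # never fails, which also means it can never tell you it was wrong.
    return ( decode( $fallback, $bytes ), $fallback );
}
```

For example, decode_with_fallback("\xC2\xA3 100") takes the UTF-8 path, while decode_with_fallback("\xA3 100") takes the fallback path; both produce the same character string.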
    

    One possible source of confusion in this horrible mess is that the ASCII encoding is a subset of the UTF-8 encoding, so technically there is no way to distinguish between a file encoded in ASCII and a file encoded in UTF-8 that happens to contain only ASCII characters.
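
To make the subset point concrete, here is a small sketch. The function name is_valid_utf8 is made up for this example; the Encode calls are the same ones used in the code above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: 1 if $bytes is well-formed UTF-8, 0 otherwise.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $ok = eval {
        decode( 'UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC );
        1;
    };
    return $ok ? 1 : 0;
}

# Pure ASCII passes, because every ASCII byte sequence is also UTF-8.
print is_valid_utf8("Total: 100 GBP"), "\n";
# A pound sign encoded as UTF-8 (two bytes) passes.
print is_valid_utf8("Total: \xC2\xA3100"), "\n";
# A lone 0xA3 byte (pound sign in Latin-1) is not valid UTF-8.
print is_valid_utf8("Total: \xA3100"), "\n";
```

So a "is this UTF-8?" test can only ever reject files that contain bytes above 0x7F in an invalid arrangement; an all-ASCII file will always pass.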

      Yep. Betcha the real problem is that the files which contain "non-ASCII characters" didn't use Unicode (UTF-8, UTF-16) to encode those characters, but instead used old-style code pages. But the program's logic assumes that it's Unicode without checking the entire file. I didn't see the OP ever describing what the nature of the "crash" actually is.
Re: By the shine on my bald pate, I dislike this encoding stuff
by Anonymous Monk on Mar 04, 2018 at 08:00 UTC

      This doesn't explain why it doesn't work but OK, thanks, I have added:

      use Encode::Guess;
      my $decoder = Encode::Guess->guess($filedata); # default detectable encodings include utf8.
      return(@err, "Can't guess encoding: $decoder") unless ref($decoder);

      This is still no good, as it fails for all files, ANSI and UTF-8, with the error:

      Can't guess encoding: Empty string, empty guess.
        Empty string

        All the content is read into @LINES so $filedata is empty

        my(@LINES) = <ORDERFILE>;
        my $filedata = <ORDERFILE>;
        

        Maybe try

        my @LINES = <ORDERFILE>;
        my $filedata = join '',@LINES;
        

        or just

        open ORDERFILE, '<', $emailfile or die "$emailfile: $!";
        my $filedata = do { local $/; <ORDERFILE> };
        close ORDERFILE;
        
        poj
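
Once $filedata actually holds the file contents, the Encode::Guess call from earlier in this sub-thread can be wrapped roughly as below. The wrapper name guess_name and its return convention are assumptions for illustration. With no extra suspects, Encode::Guess only considers ascii, utf8 and BOM-marked UTF-16/32, so a Latin-1 file would still be reported as a failure unless you add suspects yourself.

```perl
use strict;
use warnings;
use Encode::Guess;    # default suspects: ascii, utf8, BOM-marked UTF-16/32

# Hypothetical helper: returns (encoding_name, undef) on success,
# or (undef, reason) when no guess could be made.
sub guess_name {
    my ($bytes) = @_;
    my $decoder = Encode::Guess->guess($bytes);

    # On success guess() returns an encoding object; on failure it
    # returns a plain string describing the problem.
    return ( $decoder->name, undef ) if ref $decoder;
    return ( undef, $decoder );
}

my ( $name, $why ) = guess_name('');
print "no guess: $why\n" unless defined $name;
```

Passing an empty string reproduces the "Empty string, empty guess" message quoted above, which is why the double read of <ORDERFILE> made every file fail.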
Re: By the shine on my bald pate, I dislike this encoding stuff
by Anonymous Monk on Mar 06, 2018 at 11:09 UTC
    open (ORDERFILE, '<:encoding(UTF-8)', $emailfile) or return (@err, "Could not open order email file: $emailfile");
    #...
    my $filedata = <ORDERFILE>;
    #...
    eval { my $utf8 = decode("utf8", $filedata, Encode::FB_CROAK ) };

    First you're applying an I/O layer to a filehandle to obtain characters decoded from UTF-8, then you additionally decode those Unicode characters as if they were UTF-8 bytes. If this is working, it's by chance (i.e. when reading ASCII-only files).

    You should either open with :encoding(UTF-8) (but then you'll get warnings on non-UTF-8 text) or open without the I/O layer and do the decoding manually with the FB_CROAK option.
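
The double decoding can be reproduced without any file at all. In this sketch decode_once and decode_twice are hypothetical names; decode_twice simply mimics what happens when text that already came through an :encoding(UTF-8) layer is passed to decode() again.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Strictly decode UTF-8 bytes into characters; dies on malformed input.
sub decode_once {
    my ($bytes) = @_;
    return decode( 'UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC );
}

# Hypothetical stand-in for the bug: decode something already decoded.
sub decode_twice {
    my ($bytes) = @_;
    return decode_once( decode_once($bytes) );
}

# ASCII-only data survives a second decode unchanged, which is the
# "working by chance" case.
print decode_twice("plain ascii"), "\n";

# A UTF-8 pound sign decodes once to U+00A3, but the second decode sees
# that character as the byte 0xA3, which is malformed UTF-8, and dies.
my $ok = eval { decode_twice("\xC2\xA3"); 1 };
print "second decode failed\n" unless $ok;
```

In other words, a valid UTF-8 file containing a pound sign would be rejected by the original check, which is the opposite of what was intended.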

Node Type: perlquestion [id://1210301]
Approved by Athanasius
Front-paged by davies