Re: Fatal code point 0xFFFFFFFFFFFFFFFF

by Anonymous Monk
on Sep 09, 2018 at 06:20 UTC (#1221964)


in reply to Fatal code point 0xFFFFFFFFFFFFFFFF

Pseudo-SSCCE:
my $regex = qr/some: (\S+) pattern/si;
my $data = Encode::decode(utf8 => $file);
do_something($1) if $data =~ $regex; # <- ERROR HAPPENS HERE
This works ok on many files. I can't really control the input. Is there a way to deal with this issue besides disabling all warnings?

Use of code point 0xFFFFFFFFFFFFFFFF is deprecated; the permissible max is 0x7FFFFFFFFFFFFFFF. This will be fatal in Perl 5.28 at X.pm line 1685 (#2)

(D deprecated) You used a code point that will not be allowed in a
future perl version, because it is too large.  Unicode only allows code
points up to 0x10FFFF, but Perl allows much larger ones.  However, the
largest possible ones break the perl interpreter in some constructs,
including causing it to hang in a few cases.  The known problem areas
are in tr///, regular expression pattern matching using quantifiers,
as quote delimiters in qX...X (where X is the chr() of a large
code point), and as the upper limits in loops.

There may be other breakages as well.  If you get this warning, and
things aren't working correctly, you probably have found one of these.
    
If your code is to run on various platforms, keep in mind that the upper
limit depends on the platform.  It is much larger on 64-bit word sizes
than 32-bit ones.
    
The use of out of range code points was deprecated in Perl 5.24, and
it will be a fatal error in Perl 5.28.

Re^2: Fatal code point 0xFFFFFFFFFFFFFFFF
by dave_the_m (Monsignor) on Sep 09, 2018 at 10:47 UTC
    This works ok on many files. I can't really control the input. Is there a way to deal with this issue besides disabling all warnings?
    It appears that you are reading a corrupt/illegal sequence of octets from a file which, when fed through Encode::decode(), gets interpreted as a code point greater than the maximum allowed. See "Handling Malformed Data" in the docs for Encode.

    Dave.
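One thing that may be relevant here: the Encode documentation distinguishes the lax `utf8` encoding (the name used in the original snippet) from the strict `UTF-8` encoding, which rejects octet sequences that do not map to real Unicode code points. The sketch below is only an illustration of that distinction. It assumes that switching to the strict name is acceptable for your data, and `is_strict_utf8` and the sample strings are hypothetical names made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# The two spellings resolve to different encodings.
my $lax_name    = find_encoding('utf8')->name;     # Perl's permissive variant
my $strict_name = find_encoding('UTF-8')->name;    # strict variant

# Hypothetical helper: true if $octets decode cleanly as strict UTF-8.
sub is_strict_utf8 {
    my ($octets) = @_;
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps the
    # caller's buffer from being modified during the check.
    my $ok = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

my $good = "some: caf\xC3\xA9 pattern";    # valid UTF-8 octets
my $bad  = "some: \xFF\xFF pattern";       # 0xFF is never valid in UTF-8

print "lax: $lax_name, strict: $strict_name\n";
printf "good: %d, bad: %d\n", is_strict_utf8($good), is_strict_utf8($bad);
```

A check like this could be run over the input files first to find out which ones are actually corrupt before deciding how to handle them.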

      Thanks man but I hate the Encode docs. They can't give one lousy example on the exact syntax of CHECK to suppress unicode's stupid errors. Anyway I tried shoving ,Encode::FB_QUIET in there and it seems to work at turning off the noise. But as usual the docs are not so clear and seem to say that FB_QUIET stops decoding at the error and returns undecoded data? Really not useful. Is just sending Unicode's useless fatals to devnull an effective way to deal with data I just don't care too much about being exactly and perfectly correct according to the commandments of the freakin unicode consortium? Practical extraction of text used to be so easy, now it's calculus :-/

      Thank you for your help

        to suppress unicode's stupid errors.
        They are not stupid errors. The errors (in fact warnings - they only become errors from 5.28 onwards) are telling you that you are trying to feed illegal data into the regex engine. It's illegal because the octets you told decode() to interpret as utf8 aren't in fact valid utf8.

        How you want to handle this corrupt data is of course your choice, depending on what is best for your circumstances. You might want to make decode() croak if fed a bad file with FB_CROAK, then go back and delete or fix up any bad files. Or, depending on the nature of the corruption in the files, you might like to use FB_DEFAULT to just replace the corrupt bits with U+FFFD REPLACEMENT CHARACTER. You would only want to use FB_QUIET if you don't mind decode() stopping at the first bad part of the file and ignoring the rest of its contents.

        Dave.
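To make the difference between FB_DEFAULT and FB_QUIET concrete, here is a minimal sketch. It assumes the strict `UTF-8` encoding name rather than the lax `utf8` from the original snippet, and the sample octet string is invented for illustration. With a CHECK value such as FB_QUIET, decode() may modify its input in place, so the example works on a copy.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: valid text, one stray 0xFF octet, then more valid text.
my $octets = "some: caf\xC3\xA9 \xFF pattern";

# FB_DEFAULT: decoding continues past the bad octet, which is replaced
# with U+FFFD REPLACEMENT CHARACTER.
my $replaced = decode('UTF-8', $octets, Encode::FB_DEFAULT());

# FB_QUIET: decoding stops at the first bad octet. The return value holds
# what was decoded up to that point, and the source buffer is rewritten to
# hold the undecoded remainder, hence the copy in $rest.
my $rest   = $octets;
my $prefix = decode('UTF-8', $rest, Encode::FB_QUIET());

printf "FB_DEFAULT kept %d characters\n", length $replaced;
printf "FB_QUIET kept %d characters, %d octets left undecoded\n",
    length($prefix), length($rest);
```

If the text you want to match can appear after a corrupt octet, FB_DEFAULT is the mode that still lets the regex see it. FB_QUIET would drop everything from the bad octet onwards.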
