Fatal code point 0xFFFFFFFFFFFFFFFF

by Anonymous Monk
on Sep 05, 2018 at 12:22 UTC

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I get the following errors when matching text with a simple precompiled regex (5.26.2). I think my code handles Unicode properly because it seems to work, but those fatals have to go. Do you see a pattern in these errors that indicates the problem? I suspect some files are corrupt but also wonder if this is a programmer error. Thank you for running my problem on your brain :-)

Operation "pattern match (m//)" returns its argument 
for UTF-16 surrogate U+DFA8

Operation "pattern match (m//)" returns its argument 
for non-Unicode code point 0x1C9140

Operation "pattern match (m//)" returns its argument 
for non-Unicode code point 0xE6BAAA

Use of code point 0xFFFFFFFFFFFFFFFF is deprecated; 
the permissible max is 0x7FFFFFFFFFFFFFFF. 
This will be fatal in Perl 5.28 in pattern match (m//)

Use of code point 0xFFFFFFFFFFFFFFFF is deprecated; 
the permissible max is 0x7FFFFFFFFFFFFFFF. 
This will be fatal in Perl 5.28

Operation "pattern match (m//)" returns its argument 
for non-Unicode code point 0xFFFFFFFFFFFFFFFF

Re: Fatal code point 0xFFFFFFFFFFFFFFFF
by Corion (Pope) on Sep 05, 2018 at 12:29 UTC

    If in doubt, use diagnostics and/or see perldiag for your error message(s):

    Operation "%s" returns its argument for non-Unicode code point 0x%X

    (S non_unicode) You performed an operation requiring Unicode rules on a code point that is not in Unicode, so what it should do is not defined. Perl has chosen to have it do nothing, and warn you.

    To me, this means that the data you are reading is not really valid UTF-16 or valid Unicode. Please show us the relevant code that reads and decodes the data, and the relevant snippet of the data. That way, maybe we can see better where the problem originates and make better suggestions as to how to address it.

      I posted the wrong pattern. I'm matching about 1000 files and get this sequence of errors 8 times. It looks like something, but what? I can't tell if it means 8 files have 1 error or 1 file has 8 errors. I suspect it may be 8 files, and might involve the Japanese language:
      
      UTF-16 surrogate U+DFA8
      non-Unicode code point 0x1C9140
      non-Unicode code point 0xE6BAAA
      code point 0xFFFFFFFFFFFFFFFF
      code point 0xFFFFFFFFFFFFFFFF
      non-Unicode code point 0xFFFFFFFFFFFFFFFF
      non-Unicode code point 0x18B0E4
      non-Unicode code point 0x18B4DC
      non-Unicode code point 0x18B0E4
      
      
      Thank you for suggesting diagnostics. It shows something slightly different: a \t in front of every code point, like \tU+DFA8. I tried s/\t/ /gs before the regex but it has no effect.

      Sorry I can't really post the code or data because it's too complicated :-/

        Without either code or data, it's really hard for us to reproduce your problem or to suggest what might be the (root) cause, other than data that decodes to invalid Unicode sequences. My random guess is that you are either fiddling with the UTF-8 flag on strings or are creating Unicode strings in another invalid way, but that's hard to tell without code or data.

        My suggestion to you is to reduce your input data to find the line(s) which are causing the warnings to be thrown. In a second step, reduce the code of your program until nothing else remains except a short sequence of statements that are causing the warnings to be thrown.

        If the solution is still not obvious to you by then, show us both the data and the short program. Maybe then we can help you better.
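One way to narrow the input down is to scan each decoded string for the kinds of code points named in the warnings: surrogates (U+D800 to U+DFFF) and anything above U+10FFFF. The following is only a sketch. The helper name `suspicious_code_points` is made up for illustration, and it uses `ord` and `substr` rather than a regex so that the check itself does not go through the pattern-match code that emits the warnings.

```perl
use strict;
use warnings;

# Hypothetical helper: return [offset, code point] pairs for characters
# that are surrogates (U+D800..U+DFFF) or above the Unicode maximum
# (U+10FFFF).
sub suspicious_code_points {
    my ($text) = @_;
    my @found;
    for my $i (0 .. length($text) - 1) {
        my $cp = ord substr($text, $i, 1);
        push @found, [$i, $cp]
            if ($cp >= 0xD800 && $cp <= 0xDFFF) || $cp > 0x10FFFF;
    }
    return @found;
}

# Build a sample string containing one surrogate and one code point
# beyond U+10FFFF (both values taken from the warnings in this thread).
my $sample;
{
    no warnings 'utf8';    # precaution only, in case chr() complains
    $sample = "ok" . chr(0xDFA8) . "still ok" . chr(0x1C9140);
}

# Report the offset and value of every suspicious character.
printf "offset %d: 0x%X\n", @$_ for suspicious_code_points($sample);
```

Calling this once per file (or per line) and printing the file name alongside each hit would show whether the eight warning sequences come from one file or from eight.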

        Sorry I can't really post the code or data because it's too complicated

        Please see Short, Self-Contained, Correct Example. It should be possible for you to compose an SSCCE because you must have some idea of the parts of the text that are causing problems. Just the exercise of composing an SSCCE may give you insight into the root cause of the problem.


        Give a man a fish:  <%-{-{-{-<

Re: Fatal code point 0xFFFFFFFFFFFFFFFF
by kcott (Bishop) on Sep 07, 2018 at 09:03 UTC

    You've already received feedback regarding our inability to provide any direct help when you don't provide any code or data. Here's some indirect help.

    The core module Unicode::UCD can provide a lot of information about Unicode characters. Here's a brief example:

    $ perl -E 'use Unicode::UCD "charinfo"; my @cps = qw{U+DFA8 0xDFA8 0x1C9140 0xE6BAAA 0xFFFFFFFFFFFFFFFF}; for (@cps) { say "$_: ", defined charinfo($_) ? "Assigned" : "Unassigned" }'
    U+DFA8: Assigned
    0xDFA8: Assigned
    0x1C9140: Unassigned
    0xE6BAAA: Unassigned
    Hexadecimal number > 0xffffffff non-portable at /Users/ken/perl5/perlbrew/perls/perl-5.28.0t/lib/5.28.0/Unicode/UCD.pm line 355.
    0xFFFFFFFFFFFFFFFF: Unassigned

    The $codepoint argument to charinfo($codepoint) can be in many formats. I added 0xDFA8 to your posted U+DFA8 as a minimal example. This $codepoint argument is used in a similar fashion by many of the other functions provided by Unicode::UCD.

    Also be aware that different versions of Perl support different versions of Unicode.

    Also note that Unicode is currently at version 11.0 (see "Announcing The Unicode® Standard, Version 11.0") which isn't supported by any version of Perl as yet. Unicode characters that you're investigating could be one of the 684 new characters in this version.

    Just out of interest, I ran the above one-liner using 5.26 - the results were the same (except, of course, for 5.26 appearing in the message instead of 5.28).

    $ perl -E 'use Unicode::UCD "charinfo"; my @cps = qw{U+DFA8 0xDFA8 0x1C9140 0xE6BAAA 0xFFFFFFFFFFFFFFFF}; for (@cps) { say "$_: ", defined charinfo($_) ? "Assigned" : "Unassigned" }'
    U+DFA8: Assigned
    0xDFA8: Assigned
    0x1C9140: Unassigned
    0xE6BAAA: Unassigned
    Hexadecimal number > 0xffffffff non-portable at /Users/ken/perl5/perlbrew/perls/perl-5.26.0t/lib/5.26.0/Unicode/UCD.pm line 365.
    0xFFFFFFFFFFFFFFFF: Unassigned

    — Ken

Re: Fatal code point 0xFFFFFFFFFFFFFFFF
by ikegami (Pope) on Sep 10, 2018 at 22:41 UTC

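    You should be decoding using UTF-8 instead of utf8 to nip the problem in the bud. The difference: UTF-8 is the strict encoding and rejects surrogates and code points above 0x10FFFF, while utf8 is Perl's lax variant and lets them through.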
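To illustrate the utf8 versus UTF-8 distinction, here is a small sketch that decodes the same three bytes both ways. The bytes are the lax encoding of the surrogate U+DFA8 from the original warnings. The helper `code_points` is hypothetical and exists only to list the resulting code points without printing wide characters.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: list the code points of a string without
# running a regex over it.
sub code_points {
    my ($s) = @_;
    return map { ord substr($s, $_, 1) } 0 .. length($s) - 1;
}

# The three bytes a lax encoder produces for the surrogate U+DFA8.
my $bytes = "\xED\xBE\xA8";

# Lax "utf8": the bytes are accepted, so a surrogate ends up in the string.
my $lax = decode('utf8', $bytes);

# Strict "UTF-8": the same bytes are malformed. With no CHECK argument,
# Encode substitutes U+FFFD (REPLACEMENT CHARACTER) instead.
my $strict = decode('UTF-8', $bytes);

printf "lax:    %s\n", join ' ', map { sprintf 'U+%04X', $_ } code_points($lax);
printf "strict: %s\n", join ' ', map { sprintf 'U+%04X', $_ } code_points($strict);
```

With the strict decoder the surrogate never reaches the regex engine, so the "returns its argument" warnings should not be triggered by that input. How many replacement characters are emitted for one bad sequence may vary between Encode versions.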

Re: Fatal code point 0xFFFFFFFFFFFFFFFF
by Anonymous Monk on Sep 09, 2018 at 06:20 UTC
    Pseudo-SSCCE:
    my $regex = qr/some: (\S+) pattern/si;
    my $data = Encode::decode(utf8 => $file);
    do_something($1) if $data =~ $regex; # <- ERROR HAPPENS HERE
    This works ok on many files. I can't really control the input. Is there a way to deal with this issue besides disabling all warnings?
    
    Use of code point 0xFFFFFFFFFFFFFFFF is deprecated; the permissible max is 0x7FFFFFFFFFFFFFFF. This will be fatal in Perl 5.28 at X.pm line 1685 (#2)
    
    (D deprecated) You used a code point that will not be allowed in a
    future perl version, because it is too large.  Unicode only allows code
    points up to 0x10FFFF, but Perl allows much larger ones.  However, the
    largest possible ones break the perl interpreter in some constructs,
    including causing it to hang in a few cases.  The known problem areas
    are in tr///, regular expression pattern matching using quantifiers,
    as quote delimiters in qX...X (where X is the chr() of a large
    code point), and as the upper limits in loops.
    
    There may be other breakages as well.  If you get this warning, and
    things aren't working correctly, you probably have found one of these.
        
    If your code is to run on various platforms, keep in mind that the upper
    limit depends on the platform.  It is much larger on 64-bit word sizes
    than 32-bit ones.
        
    The use of out of range code points was deprecated in Perl 5.24, and
    it will be a fatal error in Perl 5.28.
    
      This works ok on many files. I can't really control the input. Is there a way to deal with this issue besides disabling all warnings?
      It appears that you are reading a corrupt/illegal sequence of octets from a file which, when fed through Encode::decode(), gets interpreted as a code point greater than the maximum allowed. See "Handling Malformed Data" in the docs for Encode.

      Dave.
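As a sketch of what the CHECK argument can look like in practice, the two helpers below show two of the documented fallback modes. `Encode::FB_CROAK` makes `decode` die on the first malformed sequence, which is handy for finding out which file is bad. `Encode::FB_PERLQQ` keeps decoding and renders each bad byte as a visible `\xHH` escape. The helper names `strict_decode` and `visible_decode` are invented for this example, and the strict `UTF-8` encoding is an assumption here, since the original code used `utf8`.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: strictly decode one blob of bytes.
# Returns (text, undef) on success or (undef, error message) on failure.
sub strict_decode {
    my ($bytes) = @_;
    my $text = eval {
        # FB_CROAK: die on the first malformed sequence.
        # LEAVE_SRC: do not modify $bytes in place.
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return defined $text ? ($text, undef) : (undef, $@);
}

# Hypothetical helper: keep going, but show each bad byte as \xHH.
sub visible_decode {
    my ($bytes) = @_;
    return decode('UTF-8', $bytes, Encode::FB_PERLQQ());
}

my ($text, $err) = strict_decode("abc\xFFdef");
print defined $text ? "clean\n" : "malformed: $err";

print visible_decode("abc\xFFdef"), "\n";
```

Running `strict_decode` over each input file and logging the file name together with the error message is one way to tell whether the problem is spread over several files or concentrated in one.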

        Thanks man but I hate the Encode docs. They can't give one lousy example on the exact syntax of CHECK to suppress unicode's stupid errors. Anyway I tried shoving ,Encode::FB_QUIET in there and it seems to work at turning off the noise. But as usual the docs are not so clear and seem to say that FB_QUIET stops decoding at the error and returns undecoded data? Really not useful. Is just sending Unicode's useless fatals to devnull an effective way to deal with data I just don't care too much about being exactly and perfectly correct according to the commandments of the freakin unicode consortium? Practical extraction of text used to be so easy, now it's calculus :-/

        Thank you for your help
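On the question of what FB_QUIET does: per the Encode documentation, `decode` returns the portion decoded so far and overwrites the source variable with the unprocessed remainder. That makes it possible to skip a bad byte and resume, as in the rough sketch below. The helper name `decode_skipping_bad_bytes` is hypothetical, and silently dropping bytes is an assumption about what is acceptable for this data, not a general recommendation.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: strictly decode as much as possible, silently
# dropping any byte that cannot be decoded.
sub decode_skipping_bad_bytes {
    my ($bytes) = @_;    # a copy, because decode() modifies it in place
    my $out = '';
    while (length $bytes) {
        # Returns the text decoded up to the first error and leaves the
        # undecoded remainder in $bytes.
        $out .= decode('UTF-8', $bytes, Encode::FB_QUIET());
        # Anything left starts with an undecodable byte: drop it, retry.
        substr($bytes, 0, 1, '') if length $bytes;
    }
    return $out;
}

print decode_skipping_bad_bytes("good\xFFrest"), "\n";
```

If dropping data is too lossy, calling `decode('UTF-8', $bytes)` with no CHECK argument replaces malformed input with U+FFFD instead, in a single call.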
