Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses
 
PerlMonks  

contest problem about UTF-8

by rsFalse (Pilgrim)
on Oct 06, 2017 at 22:08 UTC ( #1200866=perlquestion: print w/replies, xml ) Need Help??
rsFalse has asked for the wisdom of the Perl Monks concerning the following question:

Hello.

I was trying to solve a problem about UTF-8, and went with solution which is too slow. I want to ask suggestions how to optimize my code, or suggestions about better approaches. Maybe with using pack/unpack, but I've found perlpacktut not very clear for me.

I have no idea how to write a problem statement from .pdf paper, so I made some screenshots.
How would you solve this problem?

Test case #1:
C1 B3 E0 81 B3 F0 80 81 B3 F8 80 80 81 B3 FC 80 80 80 81 B3 80 C1 B3 E0 81 B3 F0 80 81 B3 F8 80 80 81 B3 FC 80 80 80 81 B3 E0 AF B5
Test case #2:
FF 00 FF D1 81 F0 80 91 81 00 D1 81 FF F0 80 91 81 FF FF 00 D1 81 F0 80 91 81
Output for TC #1:
73 73 73 73 73 73 73 73 73 73 BF5
Output for TC #2:
441 441 0 441 0 441 441
My code which solves two tests correctly, but it is slow for bigger tests (can't see them), and maybe it is incorrect(?):
#!/usr/bin/perl use warnings; use strict; $\ = $/; my $debug = 0; $_ = do { local $/; <> }; @_ = split; $_ = join '', map { sprintf "%08b", eval "0x$_" } @_, 'FF'; my @data = reverse <DATA>; chomp @data; y/x/./ for @data; my @rxs = join '|', @data; my @r; my @R; while( /\G(?:@rxs|.{8})/g ){ my $c = $&; $debug and print "c:$c"; $c =~ /@rxs/ or do { push @R, [ @r ] if @r > 2; @r = (); $debug and print " -F"; next; }; my $x = $c =~ s/.{8}/ ($&) =~ s!^1+0!!r /ger; $debug and print "x:",$x; push @r, ~~ reverse $x; } my @acc; my @ACC; for my $R (@R){ @acc = (); for my $r ( @{ $R } ){ my $acc = ''; while( $r =~ /.{1,4}/g ){ $acc .= ( 0 .. 9, 'A' .. 'F' )[ eval "0b" . reverse $& ]; } $acc = reverse $acc; $acc =~ s/^0+\B//; push @acc, $acc; } push @ACC, "@acc"; } print for @ACC; __DATA__ 0xxxxxxx 110xxxxx10xxxxxx 1110xxxx10xxxxxx10xxxxxx 11110xxx10xxxxxx10xxxxxx10xxxxxx 111110xx10xxxxxx10xxxxxx10xxxxxx10xxxxxx 1111110x10xxxxxx10xxxxxx10xxxxxx10xxxxxx10xxxxxx

Replies are listed 'Best First'.
Re: contest problem about UTF-8
by ikegami (Pope) on Oct 07, 2017 at 01:15 UTC

    They basically want you to write your own UTF-8 decoder[1] that ignores errors.

    Don't work with the hex or binary representations of the values. Work with the numbers directly.

    You could store those numbers in an array.

    # my @input = ( 0xC1, 0xB3, 0xE0, 0x81, ... ); my @input = do { local $/; unpack "C*", pack "H*", <> =~ s/\s//gr };

    Or you could store those numbers in a string.

    # my $input = "\xC1\xB3\xE0\x81..."; my $input = do { local $/; pack "H*", <> =~ s/\s//gr };

    The latter allows us to search the numbers using the regex engine. The following matches an encoded sequence, and will allow one to find sequences of three or more encoded values in a row very easily:

    (?: [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2} | ... )

    Once you've extracted the encoded values, it's just a question of decoding them.

    my @bytes = unpack 'C*', $encoded_value; if (@bytes == 1) { push @output, $bytes[0]; } elsif (@bytes == 2) { push @output, (( $bytes[0] & 0x1F ) << 6 ) | ( $ +bytes[0] & 0x3F ); } elsif (@bytes == 3) { push @output, (( $bytes[0] & 0x0F ) << 12 ) | (( + $bytes[0] & 0x3F ) << 6 ) | ( $bytes[0] & 0x3F ); } ...

    Back to hex:

    say join " ", map sprintf("%X", $_), @output;
    or
    # Separated by "." instead of spaces. say sprintf "%vX", pack "W*", @output;

    1. Note that the encoding deviates from UTF-8 in a few respects:

      • UTF-8 is limited to encoding values up to 0x10FFFF.
      • UTF-8 is limited to four-bytes sequences.
      • UTF-8 forbids using more bytes than necessary to encode a number.
      UTF-8 is limited to encoding values up to 0x10FFF.
      *currently

        They can always extend the spec, of course, but they've reserved of lot of space so they wouldn't need to do so. Before now, they've *reduced* the max to 0x10FFFF.

Re: contest problem about UTF-8
by LanX (Bishop) on Oct 07, 2017 at 00:55 UTC
    Does the  "Open Cup named after E.V. Pankratiev - Grand Prix Eurasia" allow external help? :)

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

      Contest is finished now. It was held on October 1st, 5 hours long.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1200866]
Approved by stevieb
Front-paged by Corion
help
Chatterbox?
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others browsing the Monastery: (8)
As of 2017-12-14 15:34 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    What programming language do you hate the most?




















    Results (396 votes). Check out past polls.

    Notices?