PerlMonks  

Can't decode ill-formed UTF-8 octet sequence

by Veraellyunjie (Initiate)
on Jul 17, 2024 at 02:47 UTC

Veraellyunjie has asked for the wisdom of the Perl Monks concerning the following question:

In all snippets below,
  1. the 1st half of the pipe slurps the file, decodes the BitTorrent bencode serialisation format and prints the torrent name
  2. the 2nd half of the pipe echoes STDIN to STDOUT

perl -MBencode=bdecode -gnE \
  'say $_->{info}{name} for bdecode $_' \
  ~/.local/share/qBittorrent/BT_backup/1f1c7bab36973365cbe6572b09e0926abaad31a6.torrent \
  | perl -pe ''

the output is:

Crysis�

(I've put the output as a paragraph, not as code, coz it doesn't display properly as code)

I need Unicode support, so I add -Mutf8::all, and it fails:

perl -MBencode=bdecode -gnE \
  'say $_->{info}{name} for bdecode $_' \
  ~/.local/share/qBittorrent/BT_backup/1f1c7bab36973365cbe6572b09e0926abaad31a6.torrent \
  | perl -Mutf8::all -pe ''
Can't decode ill-formed UTF-8 octet sequence <AE>.

Even Raku/Perl6 fails:

perl -MBencode=bdecode -gnE \
  'say $_->{info}{name} for bdecode $_' \
  ~/.local/share/qBittorrent/BT_backup/1f1c7bab36973365cbe6572b09e0926abaad31a6.torrent \
  | raku -e '$*IN.slurp.put'
Malformed UTF-8 near bytes 69 73 ae
  in block <unit> at -e line 1

Raku needs adjustments:

perl -MBencode=bdecode -gnE \
  'say $_->{info}{name} for bdecode $_' \
  ~/.local/share/qBittorrent/BT_backup/1f1c7bab36973365cbe6572b09e0926abaad31a6.torrent \
  | raku -e '$*IN.slurp(:bin).decode("utf8-c8").put'

the output is:

Crysis􏿽xAE

The torrent name in qBittorrent is shown as Crysis®

The torrent file (inside a .tgz archive):

https://github.com/user-attachments/files/16258067/1f1c7bab36973365cbe6572b09e0926abaad31a6.torrent.tgz

(part of Raku issue over there https://github.com/rakudo/rakudo/issues/5606. perlmonks doesn't support uploading files?)

The main issue is the whole thing dies instead of carrying on. I would expect a warning about a malformed string, not a fatal error.

  • How do I sanitize such malformed strings before feeding them into perl/raku/etc.?
  • How do I read such malformed strings in perl with unicode support enabled without it dying?

Replies are listed 'Best First'.
Re: Can't decode ill-formed UTF-8 octet sequence
by ikegami (Patriarch) on Jul 17, 2024 at 05:27 UTC

    How do I read such malformed strings in perl with unicode support enabled without it dying?

    If you don't like how the decoding layer handles errors, you are free to perform the decoding yourself. The third argument of Encode's decode controls how it behaves on error.
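A minimal sketch of what that third argument (CHECK) changes, using the seven octets from this thread as a hard-coded stand-in for the torrent name. The variable names are illustrative only, and the exact wording of the croak message is not shown because it varies.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Stand-in for the raw name from the torrent: "Crysis" followed by a lone 0xAE.
my $octets = "Crysis\xAE";

# No CHECK argument (FB_DEFAULT): the malformed octet becomes U+FFFD.
my $replaced = decode( 'UTF-8', $octets );

# FB_PERLQQ: the malformed octet is kept visible as a \xHH escape.
my $escaped = decode( 'UTF-8', $octets, Encode::FB_PERLQQ() );

# FB_CROAK: decoding dies, so the caller decides what to do.
# LEAVE_SRC stops decode() from modifying $octets in place.
my $strict = eval {
    decode( 'UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC() );
};
my $error = $@;

printf "replaced: %vX\n", $replaced;    # code points joined by "."
print  "escaped:  $escaped\n";
print  "strict:   ", ( defined $strict ? "ok" : "died" ), "\n";
```

With FB_CROAK wrapped in eval, a bad name becomes a condition the program can log and skip instead of a fatal error for the whole run.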

    But the solution to the problem is to avoid generating the garbage in the first place.

    $_->{info}{name} is a string that consists of the characters 43.72.79.73.69.73.AE. It's apparently a string of decoded text (a string of Unicode Code Points).

    But file handles can only transmit bytes. You need to encode the Unicode Code Points into bytes to write them to a file handle.
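To make the distinction concrete, here is a small sketch that writes the same seven characters once through a handle with no encoding layer and once after an explicit encode. It uses an in-memory file handle as a stand-in for STDOUT; the hard-coded name is an assumption standing in for bdecode's result.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Seven characters; the last one is code point U+00AE.
my $name = "Crysis\xAE";

# A handle with no encoding layer writes each character below 0x100 as one byte.
open my $fh, '>', \my $raw or die "open: $!";
print {$fh} $name;
close $fh;

# Encoding first turns U+00AE into the two UTF-8 bytes C2 AE.
my $encoded = encode( 'UTF-8', $name );

my $raw_hex = unpack 'H*', $raw;        # ends in a lone "ae"
my $enc_hex = unpack 'H*', $encoded;    # ends in "c2ae"

# utf8::decode returns false when its argument is not well-formed UTF-8.
my $probe       = $raw;
my $raw_is_utf8 = utf8::decode( $probe ) ? 1 : 0;

print "raw:     $raw_hex (valid UTF-8: $raw_is_utf8)\n";
print "encoded: $enc_hex\n";
```

The unencoded output is exactly the byte sequence that the second half of the pipe later rejects as ill-formed UTF-8.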

    One way of doing this is to add an encoding layer to the file handle.

    perl -gne'
       use v5.36;
       use utf8::all;
       use Bencode qw( bdecode );
       say bdecode( $_ )->{info}{name};
    '

      The above is incorrect, as it "decodes" the torrent file too. You want binary mode (:raw) for that, and an encoding layer (:encoding(UTF-8)) on the output.

      Unfortunately, while we can set a default encoding layer for the files read via ARGV, there's no way to make it use binary mode. It gives us a mess.

      perl -e'
         use v5.36;
         use utf8::all;
         use Bencode qw( bdecode );
         use File::Slurper qw( read_binary );

         binmode STDIN;

         sub process_torrent {
            say bdecode( $_[0] )->{info}{name};
         }

         if ( @ARGV ) {
            process_torrent( read_binary( $_ ) ) for @ARGV;
         } else {
            process_torrent( do { local $/; <STDIN> } );
         }
      '

      Outside of Windows, you can probably get away with not using binary mode.

      perl -gne'
         use v5.36;
         use Bencode qw( bdecode );
         binmode STDOUT, ":encoding( UTF-8 )";
         binmode STDERR, ":encoding( UTF-8 )";
         say bdecode( $_ )->{info}{name};
      '
Re: Can't decode ill-formed UTF-8 octet sequence
by NERDVANA (Curate) on Jul 17, 2024 at 15:51 UTC
    When you use utf8::all, among other things it will:
    Filehandles are opened with UTF-8 encoding turned on by default (including STDIN, STDOUT, and STDERR when utf8::all is used from the main package). Meaning that they automatically convert UTF-8 octets to characters and vice versa. If you don't want UTF-8 for a particular filehandle, you'll have to set binmode $filehandle.

    This means it adds the utf8 layer on your STDIN. Your STDIN is reading binary data, and will die because that binary data is not utf-8. Your options are to call 'binmode STDIN' to un-apply the utf-8 layer from STDIN, or to not use utf8::all and choose between calling utf8::encode on your string before printing it, or manually adding the utf8 layer to STDOUT: binmode STDOUT, ":utf8".
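The two output-side options can be sketched as follows. An in-memory handle stands in for STDOUT so the resulting bytes can be inspected, and the hard-coded name is an assumption standing in for the decoded torrent name.

```perl
use strict;
use warnings;

my $name = "Crysis\xAE";    # characters, last one is U+00AE

# Option 1: encode the string yourself, then print it to a plain handle.
my $octets = $name;
utf8::encode( $octets );    # in place: characters become UTF-8 bytes

# Option 2: leave the string alone and put the layer on the handle.
open my $out, '>', \my $buffer or die "open: $!";
binmode $out, ':utf8';
print {$out} $name;
close $out;

print unpack( 'H*', $octets ), "\n";
print unpack( 'H*', $buffer ), "\n";
```

Both routes produce the same bytes. Doing both at once would encode twice, so pick one per handle.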

    FWIW, on Windows you would need to call binmode STDIN; regardless of whether you use utf8::all, because the Windows default is to convert \r\n line endings on STDIN. Your example is exactly why perl5 can't just change the defaults to make utf8::all automatic. Sometimes people do want to read binary data from STDIN or write it to STDOUT, and on Unix they've never needed to call binmode before, because it's the default.

Re: Can't decode ill-formed UTF-8 octet sequence
by cavac (Parson) on Jul 25, 2024 at 08:09 UTC

    perlmonks doesn't support uploading files?

    No, it doesn't. It's not a file hosting site, but a place to discuss perl code. You are generally expected to provide a Short, Self-Contained, Correct Example. That means both the code and (if applicable) any example data should be boiled down to the minimum, with the whole node (text, markup, code, example data) fitting into 64k.

    That being said, if you need a (short'ish) binary file, just include it in the code as Base64. Before Base64 encoding, it might be good to compress it as much as possible. Let's take your torrent file and prepare it for a post to a code discussion forum (note: a shorter torrent file that shows the problem would have been better, but let's work with what we've got):

    wget https://github.com/user-attachments/files/16258067/1f1c7bab36973365cbe6572b09e0926abaad31a6.torrent.tgz
    tar xvzf 1f1c7bab36973365cbe6572b09e0926abaad31a6.torrent.tgz
    brotli -k -9 1f1c7bab36973365cbe6572b09e0926abaad31a6.torrent
    base64 1f1c7bab36973365cbe6572b09e0926abaad31a6.torrent.br > 1f1c7bab36973365cbe6572b09e0926abaad31a6.torrent.br.b64

    This can now be either posted as an additional code snippet ("save this in bla.torrent.b64") or directly included in your sample code like this:
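A minimal sketch of the embedding idea, under loud assumptions: the Base64 text below is a hypothetical, hand-made stand-in (a tiny bencoded dictionary holding only the problematic name), not the real torrent, and it skips the Brotli step, so a real post built from the commands above would need a decompression step after decode_base64.

```perl
use strict;
use warnings;
use MIME::Base64 qw( decode_base64 );

# Hypothetical payload: Base64 of the 25 bytes
#   d4:infod4:name7:Crysis<AE>ee
# i.e. { info => { name => "Crysis\xAE" } } in bencode.
my $b64 = 'ZDQ6aW5mb2Q0Om5hbWU3OkNyeXNpc65lZQ==';

# Back to the original bytes, ready to hand to bdecode().
my $torrent_bytes = decode_base64( $b64 );

print unpack( 'H*', $torrent_bytes ), "\n";
```

Keeping the test data inside the script means anyone can reproduce the problem by copying a single node.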

Re: Can't decode ill-formed UTF-8 octet sequence
by sectokia (Pilgrim) on Jul 18, 2024 at 03:36 UTC

    I think you are overthinking things. Firstly... there is no UTF-8 anywhere here:

    use Bencode qw( bdecode );
    use File::Slurp qw(read_file);
    my $b = read_file('test.torrent', { binmode => ':raw' });
    my $o = bdecode($b);
    print unpack('H*',$o->{info}{name});

    output:

    437279736973ae

    To me it looks like an extended ASCII code set. You still need to know the code page to convert the 0xAE to a Unicode code point, if that is what you are trying to do.
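One way to act on this is to try strict UTF-8 first and fall back to a single-byte code page only when that fails. The sketch below assumes the fallback is Windows-1252 (cp1252), where 0xAE is the registered sign; that choice is a guess that happens to match what qBittorrent displays, not something the torrent file states. The helper name decode_name is hypothetical.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical helper: returns decoded text for a raw torrent name.
sub decode_name {
    my ( $octets ) = @_;

    # Strict attempt: dies on malformed UTF-8, which eval turns into undef.
    my $text = eval {
        decode( 'UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC() );
    };
    return $text if defined $text;

    # Assumed fallback code page; adjust if the source is known to differ.
    return decode( 'cp1252', $octets );
}

my $from_legacy = decode_name( "Crysis\xAE" );        # lone 0xAE byte
my $from_utf8   = decode_name( "Crysis\xC2\xAE" );    # already valid UTF-8

# Either way the result is the same text, which can then be encoded for output.
print unpack( 'H*', encode( 'UTF-8', $from_legacy ) ), "\n";
print unpack( 'H*', encode( 'UTF-8', $from_utf8 ) ),   "\n";
```

The fallback can mis-decode names that come from some other legacy encoding, so it is a sanitizing heuristic rather than a guaranteed recovery.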

Approved by Corion