Re^6: Unicode strings internals

by vsespb (Chaplain)
on May 10, 2013 at 21:30 UTC ( [id://1033039]=note )


in reply to Re^5: Unicode strings internals
in thread [SOLVED] Unicode strings internals

The use of a raw filehandle for output of non-binary data

That was the problem: I used a raw filehandle for binary-only data.

Something like this:

    # $binarydata is binary data, $id is an ASCII-only number, $command is an ASCII-only string,
    # so the result of the concatenation should be binary data
    my $line = "$id\t$command\t$datalength\t$binarydata";
    syswrite $file, $line ...

However, I obtained $id in another part of the program, like this:

my ($id, $filename) = split (/\t/, $record);

The problem is that $record was intentionally a UTF-8 character string and contained a non-ASCII filename. Thus the ASCII-only $id had the UTF-8 flag set.

And thus $line became a non-ASCII UTF-8 character string with $binarydata mangled (i.e. its bytes converted from Latin-1 to UTF-8).

Surprisingly, everything worked fine, because the mangled $binarydata was converted back (bytes from UTF-8 to Latin-1) when I wrote it using syswrite().

So I noticed this strange behavior only when I added some more code there (for example, I used bytes::length somewhere).
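To make the effect concrete, here is a minimal sketch of what the paragraphs above describe. It assumes a made-up four-byte payload, and it uses utf8::upgrade as a stand-in for an $id that arrived with the flag set; neither detail comes from the original program.

```perl
use strict;
use warnings;
use bytes ();    # loads bytes::length without enabling the pragma

# Stand-in for an ASCII-only field that arrived with the UTF-8 flag set.
my $id = "123";
utf8::upgrade($id);

# Hypothetical binary payload with bytes above 0x7F.
my $binarydata = "\xFF\xFE\x00\x80";

# Concatenating with a flagged string upgrades the whole result.
my $line = "$id\t$binarydata";

my $flagged       = utf8::is_utf8($line) ? 1 : 0;
my $char_length   = length $line;             # counts characters
my $stored_length = bytes::length($line);     # counts bytes in the internal buffer

# Roughly what happens on output to a raw handle (my understanding, not
# verified against the syswrite source): the string is downgraded again.
my $ok = utf8::downgrade($line, 1);           # 1 = return false instead of dying
my $restored_length = bytes::length($line);

print "flagged=$flagged chars=$char_length stored=$stored_length restored=$restored_length\n";
```

The character length and the stored byte length disagree while the string is flagged, which is why bytes::length exposes the problem even though the written output looks correct.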

So now I am wondering: either I am responsible for making sure that $id never has the UTF-8 flag set, or I should additionally test it with "confess if is_utf8($id)". Or maybe I should never concatenate binary data with known ASCII-only data. Or maybe even never concatenate with known binary data...
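A minimal sketch of the first two options combined, assuming the field really is meant to be ASCII-only. The helper name ascii_bytes, the sample record, and the two-byte payload are all hypothetical. The helper checks the content rather than the flag, since a flagged string can still be pure ASCII, and then returns a downgraded copy.

```perl
use strict;
use warnings;
use Carp qw(confess);

# Hypothetical helper: return a byte-string copy of an ASCII-only field,
# dying with a stack trace if the value contains anything outside ASCII.
sub ascii_bytes {
    my ($value) = @_;                 # a copy, so the caller's scalar is untouched
    confess "non-ASCII data in field" if $value =~ /[^\x00-\x7F]/;
    utf8::downgrade($value);          # clears the flag; cannot fail for pure ASCII
    return $value;
}

# $id comes out of a decoded record, so it carries the UTF-8 flag.
my ($id) = split /\t/, "42\t\x{439}.dat";
my $clean = ascii_bytes($id);

# Hypothetical binary payload.
my $binarydata = "\xFF\x80";
my $line = "$clean\t$binarydata";

print utf8::is_utf8($line) ? "line is flagged\n" : "line is a byte string\n";
```

Whether to sanitize at the point where $id is produced or at the point where it is concatenated is a design choice; this sketch does it just before building the binary line.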

Re^7: Unicode strings internals
by kennethk (Abbot) on May 10, 2013 at 22:13 UTC
    It sounds like your bug would only rear its head when $id actually contains non-ASCII characters. The canonical method for handling this, as I understand it, is to explicitly encode incoming text streams that are potentially problematic; i.e.
    my ($id, $filename) = split (/\t/, $record);
    $id = encode ("UTF-8", $id);
    I'd watch out for the 'filtering programmer input' trap in all this; the Perl philosophy of giving people as much rope as they like means that a properly-motivated foolish programmer can always outwit your filtering. Since you expect that $id is printable ASCII, I'd be more inclined to filter using my regex above, and re-examine the logic that introduced UTF encoding sensitivity into the code in the first place. YMMV, of course.
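For reference, a small sketch of the encode() suggestion, assuming a made-up record with a Cyrillic filename. It only shows that Encode::encode returns byte strings without the UTF-8 flag; it makes no claim about how the original program should be structured.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical decoded record: ASCII id, tab, non-ASCII filename.
my $record = "123\t\x{439}\x{439}.txt";
my ($id, $filename) = split /\t/, $record;

# encode() returns new byte strings and leaves its arguments unchanged here.
my $id_bytes       = encode("UTF-8", $id);
my $filename_bytes = encode("UTF-8", $filename);

printf "id flagged before: %d, after: %d\n",
    utf8::is_utf8($id) ? 1 : 0, utf8::is_utf8($id_bytes) ? 1 : 0;
printf "filename is %d characters, %d encoded bytes\n",
    length $filename, length $filename_bytes;
```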

    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

      It sounds like your bug would only rear its head when $id actually contains non-ASCII characters.
      No! ASCII only: letters and digits. Just like in the example from my original post:
      my $utfstring = "123 \x{439}\x{439}\x{439}\x{439}";
      my ($ascii_but_utf, undef) = split ' ', $utfstring;
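A quick way to check that claim is to inspect the piece returned by split. This sketch reuses the two lines above and adds only the inspection; the variable names $is_ascii and $is_flagged are mine.

```perl
use strict;
use warnings;

my $utfstring = "123 \x{439}\x{439}\x{439}\x{439}";
my ($ascii_but_utf, undef) = split ' ', $utfstring;

# The content is pure ASCII...
my $is_ascii = $ascii_but_utf =~ /\A[\x00-\x7F]*\z/ ? 1 : 0;

# ...but the scalar inherits the UTF-8 flag from the string it was split from.
my $is_flagged = utf8::is_utf8($ascii_but_utf) ? 1 : 0;

print "value=$ascii_but_utf ascii=$is_ascii flagged=$is_flagged\n";
```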
