http://www.perlmonks.org?node_id=670579


in reply to Re^5: Getting mad with CGI::Application and utf8
in thread Getting mad with CGI::Application and utf8

So as the average John Doe Perl hacker, what should I use to find out if a certain module or sub returns text strings or binary strings?

Warning: culture shock ahead.

From perlunifaq:

How can I determine if a string is a text string or a binary string?

You can't. Some use the UTF8 flag for this, but that's misuse, and makes well behaved modules like Data::Dumper look bad. The flag is useless for this purpose, because it's off when an 8 bit encoding (by default ISO-8859-1) is used to store the string.

This is something you, the programmer, has to keep track of; sorry. You could consider adopting a kind of "Hungarian notation" to help with this.

There is no way to determine whether a string is binary or text. Every operation (including your own subroutines) should handle a single mode: either text or binary. If you want to handle both kinds of string, and for any reason need to know the difference between bytes and characters with the same ordinal values, you will have to specify multiple routines, or a way to indicate that a certain string is binary rather than text.
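
To make the FAQ's point concrete, here is a minimal sketch (the variable names are only for illustration) in which one text character stored two ways and one binary byte all compare equal, while the UTF8 flag differs only with the internal storage:

```perl
use strict;
use warnings;

# The same one-character text string, stored two different ways internally.
my $latin1 = "\xE9";       # e-acute held as a single byte; UTF8 flag off
my $wide   = "\xE9";
utf8::upgrade($wide);      # same character, now held as UTF-8; UTF8 flag on

# A binary string that happens to contain the same ordinal value.
my $binary = pack("C", 0xE9);

printf "latin1 flag: %d\n", utf8::is_utf8($latin1) ? 1 : 0;
printf "wide flag:   %d\n", utf8::is_utf8($wide)   ? 1 : 0;
printf "binary flag: %d\n", utf8::is_utf8($binary) ? 1 : 0;

# Perl treats all three as equal; the flag only describes internal storage.
print "all equal\n" if $latin1 eq $wide && $wide eq $binary;
```

The text string in $latin1 and the binary string in $binary are indistinguishable, which is why the flag cannot answer the text-or-binary question.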

Just an advance warning: you may want to argue that this is a stupid concept, but eventually you'll have to accept that Perl just works like this. I personally think the model is well thought through.

See also this journal post and the discussion tree that follows it. I plan to release a module called BLOB that lets you (and everyone else) flag a string as "this is binary, not text".

Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

Re^7: Getting mad with CGI::Application and utf8
by moritz (Cardinal) on Feb 27, 2008 at 10:55 UTC
    Let me repeat the question: I get some data from a foreign Perl module (let's say a file parser), and that module doesn't document what it returns.

    But other parts of the code have to deal with text strings (for example because they query unicode properties).

    What should I do? I only need to know that once, at write/debug time.

    My current approach is to try to get some data with high codepoints (outside the Latin-1 range) out of the foreign module, and check with Devel::Peek or utf8::is_utf8 whether that stupid flag is set.

    Is there a better, more reliable approach? And is that really an abuse?
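
A probe of this kind could be sketched as follows, for use at write/debug time only. The sub name guess_string_kind and its three return labels are hypothetical, and the result is a heuristic, not a proof. It deliberately looks at the string's contents rather than at the UTF8 flag: code points above 255 cannot be bytes, and a run of high bytes that is valid UTF-8 suggests the module handed back undecoded data.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical debug-time helper: guess what an undocumented module returned.
sub guess_string_kind {
    my ($str) = @_;

    # Code points above 255 cannot be bytes, so this must be decoded text.
    return 'characters' if $str =~ /[^\x00-\xFF]/;

    # High bytes that form valid UTF-8 look like text nobody decoded yet.
    if ($str =~ /[\x80-\xFF]/) {
        my $copy = $str;
        utf8::downgrade($copy);   # cannot fail: every code point is <= 255
        my $ok = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK); 1 };
        return 'maybe undecoded UTF-8' if $ok;
    }

    # Pure ASCII, Latin-1 text, or arbitrary binary: the content cannot tell.
    return 'unknown';
}

print guess_string_kind("\x{263A}"), "\n";   # decoded text
print guess_string_kind("\xC3\xA9"), "\n";   # UTF-8 bytes for e-acute
print guess_string_kind("\xE9"), "\n";       # Latin-1 text or binary
```

The 'unknown' case is the one the FAQ warns about: feeding the module input with characters outside Latin-1 is what makes the first two cases show up.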

      Let me repeat the question

      Do you expect me to repeat the answer too? :)

      If your subroutine or module specifically only handles binary strings, I'd recommend documenting it as such, and downgrading the string that you receive:

      my $copy = $foo;
      unless (utf8::downgrade($copy, 1)) {    # true second argument: return false instead of dying
          utf8::encode($copy);
          carp "Wide character in operation";  # needs "use Carp;"
      }
      That's more or less what Perl does in its binary operators, like print.
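
Wrapped in a subroutine, the downgrade idea might look like the sketch below. The name to_bytes is hypothetical. It assumes the documented behaviour of utf8::downgrade, which returns false instead of dying when given a true second argument, and of utf8::encode, which converts the string in place.

```perl
use strict;
use warnings;
use Carp qw(carp);

# Hypothetical helper for a routine that only handles binary strings:
# return a byte-string copy of the argument.
sub to_bytes {
    my ($string) = @_;
    my $copy = $string;

    # A true second argument makes a failed downgrade return false, not die.
    unless (utf8::downgrade($copy, 1)) {
        carp "Wide character in operation";
        utf8::encode($copy);   # fall back to the UTF-8 encoded bytes
    }
    return $copy;
}

# A Latin-1 character stored upgraded comes back as a single byte.
my $upgraded = "\xE9";
utf8::upgrade($upgraded);
my $bytes = to_bytes($upgraded);

# A character above 255 cannot be downgraded, so it is UTF-8 encoded instead.
my $smiley;
{
    local $SIG{__WARN__} = sub { };   # silence the warning for this demo
    $smiley = to_bytes("\x{263A}");
}

print unpack("H*", $bytes), "\n";    # e9
print unpack("H*", $smiley), "\n";   # e298ba
```

Working on a copy leaves the caller's string untouched, whichever branch is taken.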

      Whatever you do, though, never assume that the absence of the flag means it's not a text string!

      Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

        Do you expect me to repeat the answer too?

        No. Your answer ("you can't") is probably right, but not very helpful for the situation I described. When I need to solve a problem, and I can't get the 100% complete solution, I try to approximate.

        So I suggested an approximation, and asked if there's a better way. So, is there one? Or is it already as close as I can get, without having to read a foreign module, possibly tracing strings manually through thousands of lines of code?