Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??
In your first listing, the regex engine will match on every byte, so if you feed it a UTF-8-encoded file it will report multiple-byte sequences as its component values. Meanwhile, you are unpacking 'C', which also "only does bytes". So the program is consistant, but the output labeling is wrong: it's not "Unicode Value:", it's "byte value:".

On your second listing, the regex is in the scope of use utf8, so the dot will match a multi-byte character as one character. But then you use unpack 'C' again which ignores the fact that $1 might have a multi-byte character in it, and just returns the value of the first byte.

Now UTF8-encoding is designed to overcome the headaches of past variable-length encoding systems. It's very easy because a character that's represented by a single byte always has the high bit cleared, and all bytes that are part of a multi-byte sequence all have their high bits set.

So, when you test for your range of 32..126 inclusive, you are indeed going to test for ASCII graphics characters because (by design) UTF8 is a proper superset of ASCII. You are picking out those bytes that are single-byte characters that also are not control codes (<32) or the DEL character (127).

The unpack is a roundabout way of doing that. If you just used ord(), you would respect the multibyte nature of what's in $1, and get numbers >256 as applicable. This would work about the same but would not rely on this artifact of UTF8 encoding.

But you can skip the while loop completely!

use utf8; return ! ($column =~ /[^\x21-\x7f]/);
will also return false if the $column contains any character outside of that range, true if it contains only characters in that range.

Meanwhile, ASCII doesn't have a character 231, since it only goes up to 127. Windows displays an ANSI character in your current code page, which varies based on what country you are in. The command window is using the OEM character set to be compatible with old DOS text programs, which is why it interprets 231 as a different character!

—John


In reply to Re: ASCII, Unicode, use utf8: My Story of Discovery by John M. Dlugosz
in thread ASCII, Unicode, use utf8: My Story of Discovery by princepawn

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • Outside of code tags, you may need to use entities for some characters:
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others musing on the Monastery: (4)
    As of 2014-12-28 12:19 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      Is guessing a good strategy for surviving in the IT business?





      Results (181 votes), past polls