unicode normalization layer

by DrWhy (Chaplain)
on Sep 15, 2009 at 20:49 UTC ( [id://795472] )

DrWhy has asked for the wisdom of the Perl Monks concerning the following question:

Greetings, brothers (of all genders),

I wonder if any of you know where I could get my hands on a perlio layer that does Unicode normalization. I have data that is (supposed to be) in UTF-8. I am writing code that uses the :encoding(utf8) layer to validate that it is in fact good UTF-8/Unicode data, but to work with the data I'd like to have it in normalized form (NFKC, to be specific). I'd really like to have that done in a layer on top of :encoding(utf8) so that I can read the data in blocks and not have to worry about the block boundaries falling between a base character and following combining characters.

Thanks,

--DrWhy

"If God had meant for us to think for ourselves he would have given us brains. Oh, wait..."

Replies are listed 'Best First'.
Re: unicode normalization layer
by graff (Chancellor) on Sep 16, 2009 at 03:44 UTC
    If you don't have time to dive into PerlIO::via, would the following suffice (and if not, why not)?
    #!/usr/bin/perl
    use strict;
    use Unicode::Normalize;
    use open IN => ':encoding(utf8)';
    binmode STDIN, ':encoding(utf8)';
    while (<>) {
        $_ = NFKC( $_ );
        # now do whatever you want...
    }
    That applies the ':encoding(utf8)' layer to all input files (including STDIN), so any method of reading input from any file handle will complain if the data can't be interpreted as utf8. Once it's read in, you just apply normalization and do whatever else you need to do.

    You weren't very specific on what you mean by "validate" (or what you want to do with invalid data). Note that the above doesn't actually die on invalid input; it just prints a warning and tries to do the best it can with what it gets.

    Personally, when I really want to know whether a file is valid utf8 (and I want to provide useful diagnostics when it isn't), I tend to read it as raw data and then do

    eval "Encode::decode('utf8', \$_, FB_CROAK)";
    so I can trap non-utf8 input and give a proper error report.

    Did I misunderstand the question? I don't think there'll be any significant speed-up by trying to do block-oriented input; input gets buffered anyway. (But the Encode man page does explain how to handle input that isn't "character oriented", if you really want to do that.)
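
    For what it's worth, a rough, untested sketch of that block-oriented approach (not code from this thread): FB_QUIET, per the Encode docs, returns what it could decode and leaves the unprocessed tail in the buffer, and the decoded text is normalized only up to the last newline, on the assumption that a line break is an acceptable chunk boundary for this data.

    #!/usr/bin/perl
    # Untested sketch: block-oriented reading with FB_QUIET, which leaves any
    # incomplete trailing UTF-8 sequence in $bytes for the next read.  The
    # decoded text is normalized only up to the last "\n", since nothing
    # composes with a newline and combining marks never cross a starter.
    use strict;
    use warnings;
    use Encode qw( decode FB_QUIET );
    use Unicode::Normalize qw( NFKC );

    open my $fh, '<:raw', $ARGV[0] or die "open: $!";

    my ( $bytes, $text ) = ( '', '' );
    while ( read( $fh, $bytes, 65536, length $bytes ) ) {
        $text .= decode( 'UTF-8', $bytes, FB_QUIET );

        if ( ( my $pos = rindex( $text, "\n" ) ) >= 0 ) {
            my $chunk = NFKC( substr( $text, 0, $pos + 1 ) );
            substr( $text, 0, $pos + 1 ) = '';
            # ... process $chunk here ...
        }
    }
    # Anything still in $bytes at EOF is an incomplete or invalid sequence;
    # FB_QUIET alone can't tell those apart, so a real version needs more care.
    warn "trailing undecodable bytes\n" if length $bytes;
    my $rest = NFKC( $text );
    # ... process $rest here ...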

    UPDATE: ikegami's reply to my post is very helpful (++!), so definitely follow his advice over mine.

      I believe

      my $is_valid = utf8::decode($_);
      is cheaper than
      use Encode qw( decode FB_CROAK );
      my $is_valid = eval '$_ = decode("utf-8", $_, FB_CROAK); 1';

      It's definitely simpler (and you don't even need to load any modules!).

      Note that "utf8" is not the same thing "utf-8". "utf8" is the name of Perl's internal encoding. It differs from "utf-8". You definitely want to use "utf-8" when validating (if not always).

      I also fixed the bug where decoding the string "0" would be considered a validation error.

        This is certainly the simplest approach I've seen so far, and I'll definitely keep it in mind for future use. However, I'm currently using something closer to graff's approach. I need a count of the invalid items encountered in the input stream, so I've defined a CHECK function to be used by :encoding(utf8) that ticks up a counter of the number of bad things found and then returns the Unicode WTF?! character (U+FFFD, the REPLACEMENT CHARACTER) to replace them in the input stream.
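
        (If wiring a counter into the layer's CHECK ever gets inconvenient, a cruder sketch of getting such a count is to let decode() do its default U+FFFD substitution and count the substitutions afterwards; it over-counts if the input legitimately contains U+FFFD, and one malformed sequence may yield more than one replacement character.)

        use Encode qw( decode );

        my $invalid_count = 0;

        # FB_DEFAULT (CHECK == 0) replaces malformed sequences with U+FFFD,
        # the Unicode REPLACEMENT CHARACTER, during decoding.
        sub decode_and_count {
            my ($bytes) = @_;
            my $text = decode( 'UTF-8', $bytes );
            my $subs = () = $text =~ /\x{FFFD}/g;
            $invalid_count += $subs;
            return $text;
        }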

        As for the relative speed of getline (<>) versus block reads (read), I was recently working with a system where benchmarking showed a quite substantial difference between the two approaches, roughly 7-8 times, which is why I wanted to avoid getline in this case, especially since my processing needs are not specifically line-oriented.
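
        (For anyone wanting to reproduce that kind of comparison on their own data, a quick sketch with a made-up file name and an arbitrary block size; the ratio will depend heavily on line lengths and on which layers are in use.)

        use strict;
        use warnings;
        use Benchmark qw( cmpthese );

        my $file = 'sample.utf8';    # hypothetical test file

        cmpthese( -5, {
            getline => sub {
                open my $fh, '<:encoding(UTF-8)', $file or die $!;
                my $n = 0;
                $n += length while <$fh>;
            },
            block => sub {
                open my $fh, '<:encoding(UTF-8)', $file or die $!;
                my ( $buf, $n ) = ( '', 0 );
                $n += length $buf while read $fh, $buf, 65536;
            },
        } );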

        --DrWhy

        "If God had meant for us to think for ourselves he would have given us brains. Oh, wait..."

Re: unicode normalization layer
by ikegami (Patriarch) on Sep 15, 2009 at 21:39 UTC
    If you already have a module that does normalisation — a glance reveals two promising candidates — then it's easy to make a layer out of it using PerlIO::via (a rough, untested sketch of such a layer appears at the end of this thread).
      If I had the time, I would write my own PerlIO::via::* module using Unicode::Normalize and test that it's working right, but unfortunately, I don't. My fallback for now is to slurp the whole file in, pass it through Unicode::Normalize, and just hope I don't run into too huge a file in production.

      --DrWhy

      "If God had meant for us to think for ourselves he would have given us brains. Oh, wait..."
