http://www.perlmonks.org?node_id=314423

After spending the better part of a day figuring out how perfectly good UTF-8 data from a MySQL database was being mangled by Perl, I found the cause: Perl was upgrading a string that was already UTF-8 encoded but didn't have the internal UTF-8 flag set. So I needed a way to set that flag. I found Encode's "_utf8_on" through graff's answer Re: restore unicode data from database?.

However, doing this "manually" on everything you fetch from a database quickly becomes tedious. Since I frequently use the selectall_arrayref fetching methods, I created this little sub that can "wrap" around such a call.

It expects an arrayref of arrayrefs as input and returns that same reference, so you can "inline" the call to "_utf8_on_all_arrayref".

require Encode; # needs to be done only once in the beginning

sub _utf8_on_all_arrayref {

# For all the records specified
#  Switch on the UTF-8 flag for all values
# Return the original reference

    foreach (@{$_[0]}) {
        Encode::_utf8_on( $_ ) foreach @{$_};
    }
    $_[0];
} #_utf8_on_all_arrayref
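To see the effect of the sub without a database, here is a minimal sketch that feeds it hand-written byte strings. The `$rows` structure is a hypothetical stand-in for what `$dbh->selectall_arrayref(...)` would return, and the sub is repeated (with a named loop variable) only so the example is self-contained.

```perl
use strict;
use warnings;
use Encode ();

# Same logic as the sub above, repeated so this sketch runs on its own.
sub _utf8_on_all_arrayref {
    foreach my $row ( @{ $_[0] } ) {
        Encode::_utf8_on($_) foreach @{$row};
    }
    return $_[0];
}

# Hypothetical stand-in for the result of $dbh->selectall_arrayref(...):
# UTF-8 encoded bytes without the internal UTF-8 flag.
my $rows = [
    [ "caf\xC3\xA9",  "na\xC3\xAFve" ],
    [ "\xE2\x82\xAC", "plain" ],
];

my $before = length $rows->[0][0];    # counts bytes
my $result = _utf8_on_all_arrayref($rows);
my $after  = length $rows->[0][0];    # counts characters

print "same reference returned\n" if $result == $rows;
print "length before: $before, after: $after\n";
```

Note that Encode documents `_utf8_on` as an internal function that does not check whether the bytes are well-formed UTF-8. If the data might not be valid, `utf8::decode($_)` is a validating in-place alternative that returns false on malformed input.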

Re: Switching on internal UTF-8 flag on DBI result from database
by Aristotle (Chancellor) on Dec 14, 2003 at 17:19 UTC
    require is a no-op if called multiple times on the same module, so you can just move it into the function. You can also shorten the code by collapsing the outer for into a map:

    # Original suggestion: does not work, because map returns copies
    sub _utf8_on_all_arrayref {
        require Encode;
        Encode::_utf8_on($_) for map @$_, @{$_[0]};
        $_[0];
    }

    # Updated version: map over references to the elements instead
    sub _utf8_on_all_arrayref {
        require Encode;
        Encode::_utf8_on($$_) for map \(@$_), @{$_[0]};
        $_[0];
    }

    Makeshifts last the longest.

      Well, I think if you can avoid calling require over and over again, you should. Any require is at least a lookup in %INC each time you execute it.

      use Benchmark qw(:hireswallclock timethese);

      timethese( 1000000,{
       one => sub { require Benchmark },
       two => sub { require Benchmark; require Benchmark },
      });
      __END__

      $ perl 1
      Benchmark: timing 1000000 iterations of one, two...
             one: 2.43888 wallclock secs ( 1.70 usr +  0.00 sys =  1.70 CPU) @ 588235.29/s (n=1000000)
             two: 5.11982 wallclock secs ( 2.77 usr +  0.00 sys =  2.77 CPU) @ 361010.83/s (n=1000000)

      Also, for some reason your list flattening with map() does not work. Not sure whether this is a bug in Perl, or a conceptual problem with using map() and $_. Observe:

      use Encode qw(_utf8_on);

      $a = [['']];
      foreach (@$a) { _utf8_on($_) foreach @$_ } # my way
      print utf8::is_utf8( $a->[0][0] ),$/;

      $a = [['']];
      _utf8_on( $_ ) for map @$_, @{$a}; # Aristotle's way
      print utf8::is_utf8( $a->[0][0] ),$/;
      __END__
      1

      which should show two 1's instead of 1.

      But also from a performance point of view, the extra list flattening with map() is not very efficient:

      use Encode qw(_utf8_on);
      use Benchmark qw(:hireswallclock timethese);

      push @$a,[(0) x 10] foreach 1..10;

      timethese( 10000,{
       liz => sub {
        foreach (@{$a}) {
            _utf8_on( $_ ) foreach @{$_};
        }
       },
       Aristotle => sub {
        _utf8_on( $_ ) for map @$_, @{$a};
       },
      });
      __END__
      Benchmark: timing 10000 iterations of Aristotle, liz...
       Aristotle: 4.37344 wallclock secs ( 3.73 usr +  0.00 sys =  3.73 CPU) @ 2680.97/s (n=10000)
             liz: 4.06957 wallclock secs ( 2.73 usr +  0.00 sys =  2.73 CPU) @ 3663.00/s (n=10000)

      Liz

      Update:
      The way Aristotle proposed doesn't work because map() creates a copy of the elements, on which the UTF-8 flag is set and then discarded. See $_ and list flattening with map() for more info.
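The copy-versus-alias difference described in this update can be sketched without Encode at all. The example below uses `uc` as a hypothetical stand-in for `_utf8_on` (any in-place modification shows the same thing), and the variable names are illustrative only.

```perl
use strict;
use warnings;

# 1. map flattens into *copies*: the originals are left untouched.
my $copies = [ [ 'a', 'b' ], ['c'] ];
$_ = uc $_ for map @$_, @$copies;

# 2. map over references to the elements: modifications go through.
my $refs = [ [ 'a', 'b' ], ['c'] ];
$$_ = uc $$_ for map \(@$_), @$refs;

# 3. Nested foreach: $_ is an alias to each original element.
my $nested = [ [ 'a', 'b' ], ['c'] ];
foreach my $row (@$nested) {
    $_ = uc $_ foreach @$row;
}

print 'copies: ', join( ',', map @$_, @$copies ), "\n";    # unchanged
print 'refs:   ', join( ',', map @$_, @$refs ),   "\n";    # upper-cased
print 'nested: ', join( ',', map @$_, @$nested ), "\n";    # upper-cased
```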

        Doh. Of course, map aliases $_ during the expression, but the expression's result is completely separate. I updated my previous reply with a version that works. Returning a new value rather than just operating on an alias is the reason for the performance hit, of course. Unfortunately, I can think of only two other remotely relevant aliasing constructs in Perl, neither of which is any help here: grep returns aliases, and the @_ in a function call contains aliases rather than copies. I can't see how to use either to achieve something less awkward than a nested for loop, though.
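For reference, the two aliasing constructs mentioned here behave as in this small sketch. It only illustrates the aliasing itself on a flat list and is not a replacement for the nested loop; `double_in_place` is a hypothetical helper name.

```perl
use strict;
use warnings;

my @values = ( 1, 2, 3, 4 );

# grep returns aliases into the original list, so modifying
# its result modifies the selected elements of @values.
$_ *= 10 for grep { $_ % 2 == 0 } @values;
my @after_grep = @values;

# @_ holds aliases to the caller's arguments, not copies.
sub double_in_place {
    $_ *= 2 for @_;
    return;
}
double_in_place(@values);

print "after grep: @after_grep\n";
print "after sub:  @values\n";
```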

        Makeshifts last the longest.