Switching on internal UTF-8 flag on DBI result from database

by liz (Monsignor)
on Dec 12, 2003 at 22:53 UTC ( #314423=snippet )
Description: After spending the better part of a day figuring out how perfectly good UTF-8 data from a MySQL database was being mangled by Perl, I found the cause: Perl was upgrading a string that was already UTF-8 encoded but didn't have the internal UTF-8 flag set. So I needed a way to set that flag, and I found the "_utf8_on" function of Encode through graff's answer Re: restore unicode data from database?.

However, doing this "manually" on everything you fetch out of a database becomes tedious quickly. Since I frequently use the selectall_arrayref fetching method, I created this little sub that can "wrap" around such a call.

It expects the arrayref of arrayrefs as input, and also returns it. So you can "inline" the call to "_utf8_on_all_arrayref".

require Encode; # needs to be done only once in the beginning
sub _utf8_on_all_arrayref {
# For all the records specified
#  Switch on the UTF-8 flag for all values
# Return the original reference

    foreach (@{$_[0]}) {
        Encode::_utf8_on( $_ ) foreach @{$_};
    }
    $_[0];
} #_utf8_on_all_arrayref
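The sub can be exercised without a database. The following is a minimal sketch, assuming a hand-built arrayref of arrayrefs as a stand-in for what selectall_arrayref returns; the sample row is made up for illustration. Note that Encode documents _utf8_on as an internal function that does not check the string, so this is only safe when the bytes are known to be well-formed UTF-8.

```perl
use strict;
use warnings;
use Encode ();

# Same logic as the snippet: flag every value in every row as UTF-8
# and return the reference that was passed in.
sub _utf8_on_all_arrayref {
    my ($rows) = @_;
    for my $row (@$rows) {
        Encode::_utf8_on($_) for @$row;   # $_ aliases each element of the row
    }
    return $rows;
}

# Hypothetical stand-in for a selectall_arrayref() result: one row whose first
# column holds the two UTF-8 bytes of "e with acute accent" without the flag.
my $rows = _utf8_on_all_arrayref( [ [ "\xC3\xA9", 'plain' ] ] );

# With the flag on, Perl sees one character instead of two bytes.
printf "flag=%d length=%d\n",
    utf8::is_utf8( $rows->[0][0] ) ? 1 : 0,
    length $rows->[0][0];
```

With a real handle the call would be wrapped the same way, for example _utf8_on_all_arrayref( $dbh->selectall_arrayref($sql) ), where $dbh and $sql are placeholders for your own connection and query.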
Replies are listed 'Best First'.
Re: Switching on internal UTF-8 flag on DBI result from database
by Aristotle (Chancellor) on Dec 14, 2003 at 17:19 UTC
    require is a no-op if called multiple times on the same module, so you can just move it into the function. You can also shorten the code by collapsing the outer for into a map:
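    Original version (does not work, because map returns copies of the values; see the replies):
    sub _utf8_on_all_arrayref { require Encode; Encode::_utf8_on($_) for map @$_, @{$_[0]}; $_[0]; }
    Updated version (works, because map returns references to the original elements):
    sub _utf8_on_all_arrayref { require Encode; Encode::_utf8_on($$_) for map \(@$_), @{$_[0]}; $_[0]; }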

    Makeshifts last the longest.

      Well, I think if you can avoid calling require over and over again, you should. Any require is at least a lookup in %INC each time you execute it.

      use Benchmark qw(:hireswallclock timethese);
      timethese( 1000000,{
       one => sub { require Benchmark },
       two => sub { require Benchmark; require Benchmark },
      });
      __END__
      $ perl 1
      Benchmark: timing 1000000 iterations of one, two...
      one: 2.43888 wallclock secs ( 1.70 usr + 0.00 sys = 1.70 CPU) @ 588235.29/s (n=1000000)
      two: 5.11982 wallclock secs ( 2.77 usr + 0.00 sys = 2.77 CPU) @ 361010.83/s (n=1000000)

      Also, for some reason your list flattening with map() does not work. Not sure whether this is a bug in Perl, or a conceptual problem with using map() and $_. Observe:

      use Encode qw(_utf8_on);

      $a = [['']];
      foreach (@$a) { _utf8_on($_) foreach @$_ } # my way
      print utf8::is_utf8( $a->[0][0] ),$/;

      $a = [['']];
      _utf8_on( $_ ) for map @$_, @{$a}; # Aristotle's way
      print utf8::is_utf8( $a->[0][0] ),$/;
      __END__
      1
      which should show two 1's instead of 1.

      But also from a performance point of view, the extra list flattening with map() is not very efficient:

      use Encode qw(_utf8_on);
      use Benchmark qw(:hireswallclock timethese);

      push @$a,[(0) x 10] foreach 1..10;
      timethese( 10000,{
       liz => sub {
        foreach (@{$a}) {
            _utf8_on( $_ ) foreach @{$_};
        }
       },
       Aristotle => sub {
        _utf8_on( $_ ) for map @$_, @{$a};
       },
      });
      __END__
      Benchmark: timing 10000 iterations of Aristotle, liz...
      Aristotle: 4.37344 wallclock secs ( 3.73 usr + 0.00 sys = 3.73 CPU) @ 2680.97/s (n=10000)
      liz: 4.06957 wallclock secs ( 2.73 usr + 0.00 sys = 2.73 CPU) @ 3663.00/s (n=10000)


      The way Aristotle proposed doesn't work because map() creates a copy of the elements, on which the UTF-8 flag is set and then discarded. See $_ and list flattening with map() for more info.
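The difference between flattening with map and mapping to references can be seen in a minimal sketch. The sub names and sample data below are made up for illustration; only the two map expressions come from this thread, and string appending stands in for _utf8_on so the effect is visible when printed.

```perl
use strict;
use warnings;

# Append "!" to every value through map's flattened list.
# map returns copies of the values, so the rows stay untouched.
sub mark_via_copies {
    my ($rows) = @_;
    $_ .= '!' for map @$_, @$rows;
    return $rows;
}

# Append "!" through references to the original elements.
# \(@array) yields one reference per element, so $$_ reaches the original.
sub mark_via_refs {
    my ($rows) = @_;
    $$_ .= '!' for map \(@$_), @$rows;
    return $rows;
}

my $copies = mark_via_copies( [ [ 'a', 'b' ] ] );
my $refs   = mark_via_refs(   [ [ 'a', 'b' ] ] );

print "copies: @{ $copies->[0] }\n";   # copies: a b
print "refs:   @{ $refs->[0] }\n";     # refs:   a! b!
```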

        Doh. Of course, map aliases $_ during the expression, but the expression's result is completely separate. I updated my previous reply with a version that works. Returning a new value rather than just operating on an alias is, of course, the reason for the performance hit. Unfortunately, I can think of only two other remotely relevant aliasing constructs in Perl, neither of which is any help here: grep returns aliases, and the @_ in a function call contains aliases rather than copies. I can't see how to use either to achieve something less awkward than a nested for loop, though.
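The two aliasing constructs named here, grep and @_, can be illustrated with a minimal sketch. The variable and sub names are hypothetical, and string appending again stands in for _utf8_on.

```perl
use strict;
use warnings;

# grep returns aliases into the original list, so modifying its
# results in a for loop changes the original elements.
my @via_grep = ( 'a', 'b' );
$_ .= '!' for grep { 1 } @via_grep;

# @_ aliases the caller's arguments, so the sub changes the caller's array.
sub mark_args { $_ .= '!' for @_; return }
my @via_args = ( 'a', 'b' );
mark_args(@via_args);

print "grep: @via_grep\n";   # grep: a! b!
print "args: @via_args\n";   # args: a! b!
```

Both operate on a single flat list, so each row of an arrayref of arrayrefs would still have to be visited in turn, which is consistent with the conclusion that neither beats the nested for loop.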

        Makeshifts last the longest.
