PerlMonks

Unexpected utf8 in hash keys

by kappa (Chaplain)

on Feb 20, 2008 at 11:22 UTC (#668987=perlquestion)
kappa has asked for the wisdom of the Perl Monks concerning the following question:

Hello, fellow monks! I'm seeing strange things going on with my hash keys.
    #! /usr/bin/perl
    use strict;
    use warnings;
    use utf8;

    sub U { return utf8::is_utf8($_[0]) ? 'is_utf8' : 'not_utf8'; }

    my %s = (
        MaxAccountSize1   => 1,
        'MaxAccountSize2' => 1,
        2                 => 1,
    );

    foreach (sort keys %s) {
        print "'$_' " . U($_) . " => '" . $s{$_} . "' " . U($s{$_}) . "\n";
    }
It looks like, in the presence of use utf8, hash keys that are upgraded from barewords by the => operator get the utf8 flag. Quoted string literals, on the contrary, get the flag only if they contain characters with high codes, in full accordance with the docs. It cost us a lot of blood and sweat to debug why some perfectly ASCII strings would suddenly get the flag.

Is there any rationale for such a decision? Is it a bug? Does anyone know what the performance penalty of utf8 hash keys -- even if they contain only ASCII chars -- is?

--kap

Re: Unexpected utf8 in hash keys
by kappa (Chaplain) on Feb 20, 2008 at 11:27 UTC
    Fat comma is really at fault:
        use utf8;
        sub ff { die if utf8::is_utf8($_[0]); }
        ff(asd => 1);
    --kap

      I'd expect both of the following calls to output the same thing, so I think it's a bug.

          use utf8;
          sub ff { print utf8::is_utf8($_[0]) ? 1 : 0, "\n"; }
          ff(asd => 1);   # 1
          ff("asd");      # 0

      Tested in 5.8.8 and 5.10.0.

      After some investigation, barewords in general (anything parsed by force_word) are affected.

          use utf8;
          sub ff { print utf8::is_utf8($_[0]) ? 1 : 0, "\n"; }
          ff("asd");                # 0
          ff('asd');                # 0
          ff(qw(asd));              # 0
          ff(asd => 1);             # 1
          { no strict; ff(asd); }   # 1
          ff(-asd);                 # 1
          ff(asd::);                # 1

      I think function names (including built-ins) and keywords are also affected.

      Gotta go. No time to devise a fix atm.

      The encoding pragma produces the opposite result.

          #use encoding 'UTF-8';  # or 'utf8'
          #use utf8;
          use Encode qw( is_utf8 );
          sub ff { print is_utf8($_[0]) ? 1 : 0, "\n"; }

          #                          none  enco  utf8  both
          #                          ----  ----  ----  ----
          ff("asd");               # 0     1     0     1
          ff('asd');               # 0     1     0     1
          ff(qw(asd));             # 0     1     0     1
          ff(asd => 1);            # 0     0     1     1
          { no strict; ff(asd); }  # 0     0     1     1
          ff(-asd);                # 0     0     1     1
          ff(asd::);               # 0     0     1     1
Re: Unexpected utf8 in hash keys
by graff (Canon) on Feb 20, 2008 at 13:57 UTC
    It's definitely worthwhile to know and understand the trickiness demonstrated so clearly by ikegami's various tests, and I would agree that some of his results point to "actionable" inconsistencies that should probably be treated as bugs. BUT... when you say:

    It cost us a lot of blood and sweat to debug why some perfectly ASCII strings would suddenly get the flag.

    Does this mean you were using the utf8 flag to determine whether or not a string contains wide characters? That is not what the flag is for, and you shouldn't be using it that way. To test for wide characters in a string, use a regex:

        if ( /[^[:ascii:]]/ ) { ... }
        # which is equivalent to
        if ( /[^\x00-\x7f]/ ) { ... }
    The purpose of the utf8 flag, as I understand it, is to answer the question: if there happen to be non-ASCII bytes in this string, are they to be interpreted as utf8 characters, or not? The treatment of an all-ASCII string should be the same regardless of whether the utf8 flag is set.
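The point about all-ASCII strings can be checked with a short script. This is only a sketch: the variable names and the has_non_ascii helper are made up for illustration, and utf8::upgrade is used simply as one way to obtain a flagged copy of an ASCII string without relying on bareword parsing.

```perl
use strict;
use warnings;

# Two copies of the same all-ASCII text; only the internal flag differs.
my $plain   = "asd";
my $flagged = "asd";
utf8::upgrade($flagged);    # sets the utf8 flag, characters are unchanged

# Hypothetical helper: content-based test, as suggested above.
sub has_non_ascii { return $_[0] =~ /[^\x00-\x7f]/ ? 1 : 0 }

my %h = ( $plain => 'found' );

printf "flag:      %d vs %d\n",
    utf8::is_utf8($plain) ? 1 : 0, utf8::is_utf8($flagged) ? 1 : 0;
printf "equal:     %d\n", $plain eq $flagged ? 1 : 0;
printf "lookup:    %s\n", $h{$flagged};
printf "non-ASCII: %d vs %d\n",
    has_non_ascii($plain), has_non_ascii($flagged);
```

The two strings differ only in the flag line: they compare equal, the flagged copy finds the entry stored under the unflagged key, and the regex test gives the same answer for both.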
      Yes, that was the wrong way to do it: an attempt based on the wrong guess that perl would not set the utf8 flag on ASCII strings.
      --kap
Re: Unexpected utf8 in hash keys
by pc88mxer (Vicar) on Feb 20, 2008 at 14:13 UTC
    Actually, the issue really is the use utf8 which allows you to use utf8 in your program identifiers. For instance:

        use strict;
        use warnings;

        my %hash = ( asd => 1 );
        sub ff { print utf8::is_utf8($_[0]) ? 1 : 0, "\n"; }

        eval {
            use utf8;
            ff(%hash);   # now prints 0
        };

    I suppose it's reasonable that perl encodes barewords as utf8 if you use utf8 even if only ascii characters are involved.

    Update: from the utf8 documentation:

    The "use utf8" pragma tells the Perl parser to allow UTF-8 in the program text in the current lexical scope (allow UTF-EBCDIC on EBCDIC based platforms). ... Do not use this pragma for anything else than telling Perl that your script is written in UTF-8. The utility functions described below are useful for their own purposes, but they are not really part of the "pragmatic" effect.
      We actually use lots of non-ASCII strings so we need use utf8.
      --kap
Re: Unexpected utf8 in hash keys
by doc_faustroll (Beadle) on Feb 20, 2008 at 18:19 UTC
    I may be reading this too quickly and not appreciating or understanding your intent, or what exactly you are using the use utf8 pragma for or why. My hunch is that you might benefit from not enquiring as to when the utf8 flag is set or not. Can you tell me your purposes? We can play around with the internals of Perl and how it represents strings all day long, but I gather you have more than academic interest here? This may be a place where solid pragmatism trumps. Have you read perlunitut?
Re: Unexpected utf8 in hash keys
by Juerd (Abbot) on Feb 21, 2008 at 01:42 UTC

    Out of curiosity, why does the flag bother you?

      We are in the process of converting a huge production codebase from 5.6 + Unicode::String to 5.8. This is as painful as it can get and we log this flag in a lot of places just to find codepaths that need attention.
      --kap
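For this kind of audit logging, one option is to record what a string contains alongside the flag, so that flagged ASCII barewords do not show up as false alarms. A minimal sketch follows; describe_string is a hypothetical helper and not something from the codebase discussed here.

```perl
use strict;
use warnings;

# Hypothetical logging helper: report content and flag separately.
sub describe_string {
    my ($s) = @_;
    return 'undef' unless defined $s;
    my $chars = $s =~ /[^\x00-\x7f]/ ? 'non-ascii' : 'ascii';
    my $flag  = utf8::is_utf8($s)    ? 'flagged'   : 'unflagged';
    return "$chars/$flag";
}

# Example: a plain literal, and a copy with the flag forced on.
my $key = "MaxAccountSize1";
my $up  = $key;
utf8::upgrade($up);
print describe_string($_), "\n" for $key, $up;
```

Under this scheme only the "non-ascii" entries would normally need attention during the conversion; an "ascii/flagged" entry is harmless by the reasoning given earlier in the thread.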
Re: Unexpected utf8 in hash keys
by creamygoodness (Deacon) on Aug 27, 2009 at 11:12 UTC
    Does anyone know what the performance penalty of utf8 hash keys -- even if they contain only ASCII chars -- is?
    Benchmarking script: Results for vanilla custom-compiled Perl 5.10.0 on Mac OS X:
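A comparison of this kind could be set up roughly as follows. This is a sketch and not the script referred to above: the key count, the key names and the use of utf8::upgrade to produce flagged ASCII keys are all assumptions made for illustration, and no timings are claimed here.

```perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

# Assumption: 1000 short ASCII keys; the size is arbitrary.
my @plain   = map { "key$_" } 1 .. 1000;
my @flagged = @plain;
utf8::upgrade($_) for @flagged;    # same characters, utf8 flag set

my %h = map { $_ => 1 } @plain;

# Run each lookup loop for at least one CPU second and compare rates.
cmpthese( -1, {
    plain   => sub { my $n = 0; $n += $h{$_} for @plain;   $n },
    flagged => sub { my $n = 0; $n += $h{$_} for @flagged; $n },
} );
```

Both loops look up the same 1000 entries, so any difference in the reported rates comes from the key representation rather than from hits versus misses.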
