PerlMonks  

Other effects of 'use utf8' on script

by YenForYang (Beadle)
on Apr 28, 2018 at 19:22 UTC ( [id://1213737]=perlquestion )

YenForYang has asked for the wisdom of the Perl Monks concerning the following question:

I may get a bit of flak for this as I know micro-optimization is frowned upon basically everywhere...but technically this isn't entirely about optimization.

You see, I have two options here. I could type Unicode characters using a hex sequence \x{...} (tedious to look up) XOR I could enable the utf8 pragma and put the Unicode characters into the script directly, mostly via copy & paste or some other method (also tedious).

I know the general rule is to avoid micro-optimizing and go with readability/maintainability--but here, I have two tedious methods--one easier to type, harder to read, and one harder to type, easier to read. Having 'wide' characters is UNavoidable in my case. So I've decided on the 'which generally performs better' route, even though I realize that the impact is going to be minuscule. Nonetheless--my question is: does use utf8 affect script performance more than hex escape sequences (at the lowest possible level)?

Sidenote: I currently have only about 20 multibyte characters in the script (slightly more than 50 bytes in total). I would say that the script is very unlikely to ever have more than double this number.

EDIT: Accidentally said 'avoidable' when I meant UNavoidable. My apologies!

Replies are listed 'Best First'.
Re: Other effects of 'use utf8' on script
by LanX (Saint) on Apr 28, 2018 at 20:09 UTC
    > and one harder to type, easier to read.

    Depends on your editor; I'm pretty sure there are options to have a code translated into the character. °

    I have also seen that Perl allows named Unicode characters to be expanded. *

    > a total of about 20 multibyte characters

    And a third option is to define your own 20 Readonly constants and use variable interpolation of $widechar variables.

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)

    *) See perlunicode#Unicode-Encodings -> \N{notation}

    °) Like here
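
The \N{...} notation mentioned in the footnote can be sketched as follows. This is only an illustration: the variable names are made up for the example, and the Greek letter is an arbitrary choice (note that its official Unicode name is spelled LAMDA).

```perl
use strict;
use warnings;
use charnames ':full';   # loaded automatically on newer perls; explicit here for older ones

# Three spellings of the same character, none of which needs 'use utf8'
my $by_name  = "\N{GREEK SMALL LETTER LAMDA}";   # official Unicode name
my $by_code  = "\N{U+03BB}";                     # code point in \N notation
my $by_hex   = "\x{3bb}";                        # plain hex escape

my $all_same = ($by_name eq $by_code) && ($by_code eq $by_hex);
print $all_same ? "all three are the same character\n" : "mismatch\n";
```

Because all three forms are pure ASCII in the source, they work whether or not the utf8 pragma is in effect.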

Re: Other effects of 'use utf8' on script
by kcott (Archbishop) on Apr 29, 2018 at 11:35 UTC

    G'day YenForYang,

    Given you wrote "... I realize that the impact is going to be minuscule.", you should not allow it to affect your choices. Bear in mind that the utf8 pragma is lexically scoped: you could write your code such that it only affects a small portion of it; that portion could be just compile time code (see further down for examples).

    I would be more concerned with readability and maintainability. You haven't provided any information on what characters you'll be using nor who'll be reading the code; as a result, what follows are just general thoughts and ideas.

    Putting the actual character, its hexadecimal value, or its official Unicode name in code is often not helpful; for instance, seeing any of these in source code would not be immediately meaningful to me:

    • 對
    • \x{5c0d}
    • \N{CJK UNIFIED IDEOGRAPH-5C0D}

    By the way, they're all the same character:

    $ perl -C -E 'use utf8; my @x = ("對", "\x{5c0d}", "\N{CJK UNIFIED IDEOGRAPH-5C0D}"); say for @x'
    對
    對
    對
    

    As you're only dealing with a small number, perhaps use constants with meaningful names:

    {
        use utf8;
        use constant {
            SOME_MEANINGFUL_NAME => "\N{...}",
            OTHER_MEANINGFUL_NAME => "\N{...}",
            ...
        };
    }

    Some quick tests to demonstrate:

    $ perl -C -E '{use utf8; use constant { X => "λ", Y => "Д" }; } say X'
    λ
    $ perl -C -E '{use utf8; use constant { X => "λ", Y => "Д" }; } say X; say "對"'
    λ
    å°
    $ perl -C -E '{use utf8; use constant { X => "λ", Y => "Д" }; } say X; use utf8; say "對"'
    λ
    對
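
As a script rather than a one-liner, the same lexical-scoping idea might look like the sketch below. The constant names LAMBDA and CYRILLIC_DE are hypothetical, chosen only for this example, and the file is assumed to be saved as UTF-8.

```perl
use strict;
use warnings;

{
    use utf8;    # in effect only until the closing brace of this block
    use constant {
        LAMBDA      => "λ",
        CYRILLIC_DE => "Д",
    };
}

# Outside the block the pragma is off, so this literal is two raw UTF-8 bytes.
my $raw = "λ";

printf "constant:    length %d\n", length(LAMBDA);
printf "raw literal: length %d\n", length($raw);
```

The constant holds one character while the unscoped literal holds two bytes, which is the same difference that produced the garbled output in the second one-liner.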
    

    You mentioned that you thought the hex sequence was "tedious to look up". Let Perl do it for you. Write a small script for yourself, the guts of which might look something like this:

    $ perl -E 'use utf8; printf "%x\n", ord "對"'
    5c0d
    

    And, if you want the official Unicode name, Perl can do that for you too. Here's an example:

    $ perl -E 'use utf8; use Unicode::UCD "charinfo"; say charinfo(ord "對")->{name}'
    CJK UNIFIED IDEOGRAPH-5C0D
    

    Unicode::UCD is a core module. It can provide a lot more information about characters than just the name. UCD stands for "Unicode Character Database".
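
Putting the two lookups together, a small helper script could be sketched like this. The sub name describe_char and the hash keys it returns are invented for illustration; charinfo and its name, category and script fields come from Unicode::UCD.

```perl
use strict;
use warnings;
use utf8;                         # the sample character below is typed literally
use Unicode::UCD qw(charinfo);

# Hypothetical helper: report the ways one character can be written in source.
sub describe_char {
    my ($char) = @_;
    my $code = ord $char;
    my $info = charinfo($code) or return;   # undef for unassigned code points
    return {
        hex_escape  => sprintf('\x{%x}', $code),
        name_escape => sprintf('\N{%s}', $info->{name}),
        category    => $info->{category},
        script      => $info->{script},
    };
}

my $described = describe_char("對");
print "$_: $described->{$_}\n" for sort keys %$described;
```

The output is plain ASCII, so either escape form can be pasted straight back into a script that does not use the utf8 pragma.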

    — Ken

Re: Other effects of 'use utf8' on script
by dsheroh (Monsignor) on Apr 29, 2018 at 08:56 UTC
    The general rule of thumb, when all else is equal (or even vaguely close to equal), is to optimize for readability.

    If you really must decide on the basis of performance, then the readable option wins in this case anyhow, because it gives the compiler the actual bytes it needs instead of an escape sequence which then needs to be converted into the actual bytes. Doing the conversion takes non-zero time, therefore not needing to convert will take less time. Post-compilation, the executed code will contain the actual bytes either way, so there will be no difference when the code is executed (unless you're inside a string eval or something like that which forces the escape sequence to be processed repeatedly).

    But, although the conversion takes non-zero time, the time required is very near-zero. Even if your script is run millions of times, the aggregate difference in run time across all those runs will be less than the time it takes me to type this word. There's a good reason why people are so against micro-optimization - in the vast majority of cases, the time you spend to design and implement the optimization is orders of magnitude larger than the time saved when executing it.
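
The point that both spellings end up as the same data after compilation can be checked with a short sketch like this one. The variable names are illustrative only, and the file is assumed to be saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;   # needed for the literal form only

my $literal = "對";          # character typed directly into the source
my $escaped = "\x{5c0d}";    # same character via a hex escape

# Once compiled, both are one-character strings holding U+5C0D.
my $identical = $literal eq $escaped;

printf "identical: %s\n", $identical ? "yes" : "no";
printf "lengths: %d and %d\n", length($literal), length($escaped);
```

Any difference between the two forms is therefore confined to the moment the source is parsed.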

Re: Other effects of 'use utf8' on script
by Laurent_R (Canon) on Apr 29, 2018 at 08:15 UTC
    I would think that using hex sequences or use utf8; might very slightly change the compile time, but there will probably be no difference in the run time once the program is compiled. In other words, the difference, if any, is most probably negligible.

Re: Other effects of 'use utf8' on script
by ikegami (Patriarch) on May 01, 2018 at 14:50 UTC

    does use utf8 affect script performance more than hex escape sequences (at the lowest possible level)?

    Not at all, unless you have string literals with characters in the range 80..FF. With use utf8;, these will get stored in the upgraded (UTF8=1) format.
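
A rough way to look at this is to compare the internal flag of the two spellings for a character in that range. The sketch below uses é (U+00E9) as an arbitrary example and utf8::is_utf8 to read the flag. How the hex-escaped form is stored is not stated above, so the script simply prints it instead of assuming a result.

```perl
use strict;
use warnings;
use utf8;   # source literals below are decoded as UTF-8

my $literal = "é";         # typed directly, U+00E9
my $escaped = "\x{e9}";    # same character via a hex escape

# The two strings compare equal as characters...
my $same = $literal eq $escaped;

# ...but their internal storage format may differ.
my $literal_upgraded = utf8::is_utf8($literal) ? 1 : 0;
my $escaped_upgraded = utf8::is_utf8($escaped) ? 1 : 0;

printf "equal: %s\n", $same ? "yes" : "no";
printf "literal stored upgraded (UTF8=1): %d\n", $literal_upgraded;
printf "escape  stored upgraded (UTF8=1): %d\n", $escaped_upgraded;
```

Either way the strings behave identically as text; the flag only describes how Perl stores them internally.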
