PerlMonks  

Re^2: ASCII encoded unicode strings on web, such as \u00F3

by afoken (Chancellor)
on Jul 12, 2015 at 15:22 UTC ( [id://1134390] )


in reply to Re: ASCII encoded unicode strings on web, such as \u00F3
in thread ASCII encoded unicode strings on web, such as \u00F3

s/\\u(\w{4})/eval "\"\\x{$1}\""/ge;

Really? String eval and \w?

  • You only want hex digits, not arbitrary word characters, after \u.
  • To make a character from a number written in hexadecimal, convert the hex string to a number using hex, then convert that number to a character using chr. No need to torture perl with a string eval.
$_ = 'Compruebe si las direcciones URL que encontr\u00e9 en el archivo de configuraci\u00f3n son v\u00e1lidos';
s/\\u([0-9a-fA-F]{4})/chr hex $1/ge;
print;

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

Replies are listed 'Best First'.
Re^3: ASCII encoded unicode strings on web, such as \u00F3
by shmem (Chancellor) on Jul 12, 2015 at 15:48 UTC
    Really? String eval and \w?

    Yes. As stated, just one way to do it: \u0f00 => "\x{0f00}" => ༀ

    No need to torture perl with a string eval.

    Torture? String eval happens every time you use a module.

    Sometimes I just post TIMTOWTDI, since surely someone else will come up with a ( less odd | cleaner | more succinct | better | less costly ) way to do it. This time it has been you; Kudos ;-)

    perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
      Torture? String eval happens every time you use a module.

      Only if the module was not loaded before. Modules that have already been loaded are not evaluated again, see require.
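A small sketch of that behaviour, assuming the core module Text::Wrap has not been pulled in by anything else yet. The helper is_loaded is hypothetical and only checks %INC, which is where require records the files it has already loaded:

```perl
use strict;
use warnings;

# Hypothetical helper: report whether a module's file is already recorded in %INC.
sub is_loaded {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;   # Foo::Bar -> Foo/Bar.pm
    return exists $INC{$file};
}

my $before = is_loaded('Text::Wrap');   # false unless something loaded it already
require Text::Wrap;                     # first require: file is located, compiled, run
my $after  = is_loaded('Text::Wrap');   # true: %INC now has an entry
require Text::Wrap;                     # second require: finds the %INC entry, does not recompile

print "before=", ($before ? 1 : 0), " after=", ($after ? 1 : 0), "\n";
```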

      s/\\u(\w{4})/eval "\"\\x{$1}\""/ge has one string eval per match. In a non-English text, that may be one eval for every few words. The German language is quite harmless: umlauts are quite rare, and the sharp s (ß) suffers from the new spelling rules that prefer ss. But other languages tend to decorate Latin letters (the ASCII stuff) with all kinds of hooks, dots, and slashes. And with messages written in non-Latin alphabets (Cyrillic, Greek), words are composed entirely of \uXXXX, so you end up with one string eval for every single letter of the message.

      s/\\u([0-9a-fA-F]{4})/chr hex $1/ge also treats the replacement part as an expression, but that expression is compiled only once, at compile time.


      There still is a trap: The \uXXXX notation is limited to 16 bits = 65536 characters, but Unicode is larger. It depends on the encoder how characters needing 17 or more bits are represented.

      It would be wise to use the UTF-16 scheme, i.e. surrogates, i.e. two \uXXXX sequences to encode one of those characters. If the encoder uses surrogates, the Perl code has to handle them accordingly. Encode::Unicode looks promising, but s///g could be sufficient (find surrogate pairs, calculate the replacement character from the two surrogate code units according to the surrogate rules).
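The s///g idea could look roughly like the sketch below. It assumes the encoder really emits UTF-16 surrogate pairs as two adjacent \uXXXX escapes, and it makes no attempt to handle escaped backslashes. The name decode_u_escapes is made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: decode \uXXXX escapes, combining surrogate pairs
# into a single character above U+FFFF.
sub decode_u_escapes {
    my ($text) = @_;
    $text =~ s{
        \\u([dD][89abAB][0-9a-fA-F]{2})   # group 1: high surrogate, D800 to DBFF
        \\u([dD][c-fC-F][0-9a-fA-F]{2})   # group 2: low surrogate, DC00 to DFFF
      |
        \\u([0-9a-fA-F]{4})               # group 3: any other four-digit escape
    }{
        defined $3
            ? chr(hex($3))
            : chr(0x10000 + ((hex($1) - 0xD800) << 10) + (hex($2) - 0xDC00))
    }gex;
    return $text;
}

my $decoded = decode_u_escapes('caf\u00e9 \ud83d\ude00');
printf "%d characters, last one is U+%X\n",
    length($decoded), ord(substr($decoded, -1));
```

Doing both cases in one pass keeps a pair from being split into two separate characters. A lone surrogate without its partner falls through to the third branch and is passed on unchanged; whether that is acceptable depends on the data.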

      Another way could be to simply use more hex digits, perhaps by accident, so \u would be followed by five or six hex digits. If those are mixed with the four-digit variant, it is impossible to decode the text without heuristics: What does "\u101112" represent? chr(0x1011).'12', chr(0x10111).'2' or chr(0x101112)?
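To make the ambiguity concrete, here is a small sketch that lists every reading of the digits after \u when an escape may be four, five, or six hex digits long. The helper candidate_readings is hypothetical; it only enumerates the possibilities and does not try to pick one:

```perl
use strict;
use warnings;

# Hypothetical helper: list every way to read the hex digits after "\u"
# when the escape may be four, five, or six digits long.
sub candidate_readings {
    my ($digits) = @_;
    my @readings;
    for my $n (4, 5, 6) {
        next if length($digits) < $n;
        push @readings, {
            code_point => hex(substr($digits, 0, $n)),   # leading $n digits as a number
            rest       => substr($digits, $n),           # what would remain as literal text
        };
    }
    return @readings;
}

printf "U+%X followed by '%s'\n", $_->{code_point}, $_->{rest}
    for candidate_readings('101112');
```

All three results are valid code points, so nothing in the input itself tells the decoder which one was meant.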


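      By the way: s/\\u(\w{4})/"\"\\x{$1}\""/gee should be equal to s/\\u(\w{4})/eval "\"\\x{$1}\""/ge, according to perlop: with /ee, the replacement is first evaluated as an expression, and the resulting string is then evaluated again. Still, I would prefer the explicit eval over /ee, because /ee looks too much like a typo.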

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re^3: ASCII encoded unicode strings on web, such as \u00F3
by igoryonya (Pilgrim) on Jul 13, 2015 at 03:57 UTC
    So, I am curious: since chr hex is more efficient than eval, is pack('U', hex) more efficient than chr hex, or vice versa?

      Well, eval is clearly the loser, chr wins:

      use Benchmark qw(cmpthese);

      my $S = 'Compruebe si las direcciones URL que encontr\u00e9 en el archivo de configuraci\u00f3n son v\u00e1lidos';

      cmpthese(1e6, {
          eval => sub {
              $_ = $S;
              s/\\u([0-9a-fA-F]{4})/eval "\"\\x{$1}\""/ge;
          },
          chr => sub {
              $_ = $S;
              s/\\u([0-9a-fA-F]{4})/chr hex $1/ge;
          },
          pack => sub {
              $_ = $S;
              s/\\u([0-9a-fA-F]{4})/pack 'U', hex $1/ge;
          }
      } );
      __END__
               Rate  eval  pack   chr
      eval  38865/s    --  -84%  -88%
      pack 242131/s  523%    --  -28%
      chr  336700/s  766%   39%    --

      # without eval
               Rate  pack   chr
      pack 242131/s    --  -27%
      chr  330033/s   36%    --

      which is reasonable, since chr is more specialized than pack and should have less overhead (otherwise chr would have been implemented in terms of pack).

      perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
