Re: performance of length() in utf-8

by kennethk (Abbot)
on Mar 03, 2016 at 17:20 UTC ( [id://1156746] )


in reply to performance of length() in utf-8

An easier solution is to dodge the UTF-8 problem. The slowness of length comes from the fact that (from length):
length() normally deals in logical characters, not physical bytes
Essentially, in order to know how many characters are in the string, length has to interrogate every byte to see if it is part of a longer character. (Incidentally, your timings look linear, not exponential, to me.) You can avoid this cost if, instead of storing the string as you encounter it, you store it encoded:
{   package LenTestC;
    use Encode;
    sub new {
        my $class = shift;
        my $self  = '';
        return bless \$self, $class;
    }
    sub add {
        my ($self, $data) = @_;
        $$self .= Encode::encode_utf8($data);
    }
    sub len {
        my $self = shift;
        return length $$self;
    }
}
The target string never gets upgraded to UTF-8, and thus the fast length algorithm can be used.
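A quick way to confirm that the encoded copy stays a plain byte string (a minimal sketch; utf8::is_utf8, Encode and length are all available in core Perl since 5.8):

use strict;
use warnings;
use Encode;

my $chars  = "\x{20ac}" x 3;                # three euro characters, UTF8 flag on
my $octets = Encode::encode_utf8($chars);   # nine octets, UTF8 flag off

print utf8::is_utf8($chars)  ? "flagged\n" : "bytes\n";   # flagged
print utf8::is_utf8($octets) ? "flagged\n" : "bytes\n";   # bytes
print length($chars),  "\n";                # 3 (characters)
print length($octets), "\n";                # 9 (bytes)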

Note that your print/tell solution did the same kind of accounting, reporting bytes instead of characters.

use strict;
use warnings;
use feature 'say';
use feature 'state';
use utf8;
use Time::HiRes;

$|++;
my $chunk = '€' x 256;
my $td = Time::HiRes::time;
my $tf;
my $l;

say "with length()";
my $str = new LenTestA;
for my $n (1..15_000){
    state $count = 0;
    $str->add($chunk);
    $l = $str->len;
    $count++;
    if ($count % 1000 == 0){
        $tf = Time::HiRes::time;
        say sprintf "%12d L=%10d t=%f", $n, $l, $tf-$td;
        $td = $tf;
    }
}

$td = Time::HiRes::time;
say "\nwith a scalar";
$str = new LenTestB;
for my $n (1..15_000){
    state $count = 0;
    $str->add($chunk);
    $l = $str->len;
    $count++;
    if ($count % 1000 == 0){
        $tf = Time::HiRes::time;
        say sprintf "%12d L=%10d t=%f", $n, $l, $tf-$td;
        $td = $tf;
    }
}

say "\nwith encode/length()";
$str = new LenTestC;
for my $n (1..15_000){
    state $count = 0;
    $str->add($chunk);
    $l = $str->len;
    $count++;
    if ($count % 1000 == 0){
        $tf = Time::HiRes::time;
        say sprintf "%12d L=%10d t=%f", $n, $l, $tf-$td;
        $td = $tf;
    }
}

{   package LenTestA;
    sub new {
        my $class = shift;
        my $self  = '';
        return bless \$self, $class;
    }
    sub add {
        my ($self, $data) = @_;
        $$self .= $data;
    }
    sub len {
        my $self = shift;
        return length $$self;
    }
}

{   package LenTestB;
    my $len;
    sub new {
        my $class = shift;
        my $self  = '';
        return bless \$self, $class;
    }
    sub add {
        my ($self, $data) = @_;
        $$self .= $data;
        $len += length($data);
    }
    sub len {
        my $self = shift;
        return $len;
    }
}

{   package LenTestC;
    use Encode;
    sub new {
        my $class = shift;
        my $self  = '';
        return bless \$self, $class;
    }
    sub add {
        my ($self, $data) = @_;
        $$self .= Encode::encode_utf8($data);
    }
    sub len {
        my $self = shift;
        return length $$self;
    }
}
outputs
with length()
        1000 L=    256000 t=0.510051
        2000 L=    512000 t=1.387138
        3000 L=    768000 t=2.304231
        4000 L=   1024000 t=3.246324
        5000 L=   1280000 t=4.112412
        6000 L=   1536000 t=5.093509
        7000 L=   1792000 t=5.957596
        8000 L=   2048000 t=6.853685
        9000 L=   2304000 t=9.705970
       10000 L=   2560000 t=9.114912
       11000 L=   2816000 t=9.906990
       12000 L=   3072000 t=11.083109
       13000 L=   3328000 t=12.515251
       14000 L=   3584000 t=12.456246
       15000 L=   3840000 t=13.957395

with a scalar
        1000 L=    256000 t=0.021152
        2000 L=    512000 t=0.021664
        3000 L=    768000 t=0.026949
        4000 L=   1024000 t=0.025393
        5000 L=   1280000 t=0.021830
        6000 L=   1536000 t=0.022298
        7000 L=   1792000 t=0.022668
        8000 L=   2048000 t=0.021850
        9000 L=   2304000 t=0.026711
       10000 L=   2560000 t=0.019835
       11000 L=   2816000 t=0.023417
       12000 L=   3072000 t=0.020025
       13000 L=   3328000 t=0.021878
       14000 L=   3584000 t=0.020085
       15000 L=   3840000 t=0.019838

with encode/length()
        1000 L=    256000 t=0.044469
        2000 L=    512000 t=0.037547
        3000 L=    768000 t=0.038610
        4000 L=   1024000 t=0.040161
        5000 L=   1280000 t=0.039640
        6000 L=   1536000 t=0.041329
        7000 L=   1792000 t=0.038967
        8000 L=   2048000 t=0.037193
        9000 L=   2304000 t=0.040582
       10000 L=   2560000 t=0.042830
       11000 L=   2816000 t=0.039120
       12000 L=   3072000 t=0.038353
       13000 L=   3328000 t=0.047136
       14000 L=   3584000 t=0.037603
       15000 L=   3840000 t=0.036865

#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Re^2: performance of length() in utf-8
by seki (Monk) on Mar 04, 2016 at 10:59 UTC
    Many thanks for your valuable answer. I reproduced the same performance gains on my system, though without grasping the why.

    I was told that, since utf-8 string management was natively integrated into the Perl core, a string has an internal flag that tells whether it is utf-8 or not.

    When parsing an xml file declared with encoding="utf-8", are the strings produced by the XML SAX parser not given in utf-8? (I did not notice that, because I do not display the processed data; if the string is given undecoded, I guess it is written back as-is, but I should double-check that.)
    I seem to understand that the SAX writer may query the data size many times, so the overhead of encoding the data is compensated by the many calls to a length() that performs better.
    But I do not see how length() differs depending on the string encoding: if the string is not in utf-8, length should return a byte count (1 byte per character), while for utf-8 we must process each byte to know whether it is a simple char, the starting byte of a multi-byte char, or a continuation byte of a multi-byte char. I would have thought that processing a utf-8 string performs worse than a plain string...
    Note that your print/tell solution did the same kind of accounting, reporting bytes instead of characters.
    Yes, but that is not a problem, as I am asked to split the xml on a file-size basis (in 30, 100 or 200 MB chunks), so counting the bytes is ok.
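    For what it is worth, a minimal sketch of that byte-based accounting (the 30 MB limit and the sub name are only illustrative, not taken from the actual splitter):

    use strict;
    use warnings;
    use Encode;

    my $limit = 30 * 1024 * 1024;   # e.g. a 30 MB chunk size
    my $bytes = 0;

    # Append one record to the current output file and report whether the
    # configured size limit has been reached, counting encoded bytes.
    sub add_record {
        my ($fh, $record) = @_;
        my $octets = Encode::encode_utf8($record);
        $bytes += length $octets;   # byte length: cheap, no per-character scan
        print {$fh} $octets;
        return $bytes >= $limit;    # caller opens the next chunk file when true
    }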

      Encoding can be a challenge to get one's head around. When you read the strings in from your XML parsing, Perl pulls them in as a series of UTF-8 characters, and the string that contains them has the UTF-8 flag set to true. In order to determine the length of the string, each byte must be queried to figure out how many characters are represented, thus the slow length.
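      For illustration only, here is a minimal sketch of that per-byte scan (the idea, not Perl's actual implementation): in UTF-8, continuation bytes have the bit pattern 10xxxxxx, so the character count is the number of bytes that are not continuation bytes.

      use strict;
      use warnings;
      use Encode;

      # "Ç foo €" is 7 characters, but 10 bytes once encoded as UTF-8
      my $octets = Encode::encode_utf8("\x{c7} foo \x{20ac}");

      # count every byte whose top two bits are not 10, i.e. skip continuation bytes
      my $chars = grep { (ord($_) & 0xC0) != 0x80 } split //, $octets;

      printf "%d characters in %d bytes\n", $chars, length $octets;   # 7 characters in 10 bytes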

      Invoking Encode::encode_utf8($data) returns the UTF-8 string transformed into the equivalent byte stream. Essentially, from Perl's perspective, it breaks the logical connection between the bytes, and leaves it as some combination of high bit and low bit characters. Now, since every record in the string is exactly 1 byte wide, the byte count requires no introspection.

      So:

      print length chr 199;
      outputs 1 while
      use Encode; print length Encode::encode_utf8(chr 199);
      outputs 2. Similarly, if you run
      say join ",", map ord, split //, chr 199;
      you output 199, while
      use Encode; say join ",", map ord, split //, Encode::encode_utf8(chr 199);
      outputs 195, 135.

      However, if your terminal is set to display UTF-8, printing both of those strings will output the same because the series of bits is unaffected.
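      If you want to see the flag itself, Devel::Peek (a core module) will dump a scalar's internals; a minimal sketch (the exact FLAGS line varies by Perl version):

      use strict;
      use warnings;
      use Encode;
      use Devel::Peek;

      my $chars  = "\x{20ac}";                    # one character, the euro sign
      my $octets = Encode::encode_utf8($chars);   # the same text as three octets

      Dump($chars);    # FLAGS includes UTF8; PV shows "\342\202\254" [UTF8 "\x{20ac}"]
      Dump($octets);   # no UTF8 flag; PV shows the same three bytes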

      Does that help?


      #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

        Here is a quite long answer, trying to be specific about my understanding of this case...
        Does that help?
        In some way, but not completely. :op
        I am quite familiar with encodings (at least iso-8859-1 & 15, Win1252, "DOS" 437 & 850, utf-8 and utf-16), but I have not yet figured out the data flow in Perl.

        I think I have not yet grasped which part of the "magic" is done
        • at the (windows CMD) terminal level
        • by the xml parsing / decoding (if any?)
        • at the Perl internal level

        chcp
        Active code page: 1252
        perl -e "print chr 199"
        Ç
        perl -e "print join ' ', map {sprintf '%02x', $_} unpack 'C*', chr 199"
        c7
        I am in Win1252 and the code 199 (= 0xc7) corresponds to the upper-case c-cedilla character. Okay.
        perl -MEncode -e "print Encode::encode_utf8 chr 199"
        Ç
        perl -MEncode -e "print join ' ', map {sprintf '%02x', $_} unpack 'C*', Encode::encode_utf8 chr 199"
        c3 87
        So if I encode the byte 199 to utf-8 (I seem to understand "from the current console codepage"), I get the values c3 87, which correspond to the U+00C7 unicode "LATIN CAPITAL LETTER C WITH CEDILLA". I still follow.
        perl -MEncode -e "print Encode::decode_utf8 \"\xc3\x87\""
        Ç
        If I decode a raw "c3 87" I get back my "Ç", so everything is as I suppose it to be.
        Now, your part:
        Encoding can be a challenge to get one's head around. When you read the strings in from your XML parsing, Perl pulls them in as a series of UTF-8 characters, and the string that contains them has the UTF-8 flag set to true. In order to determine the length of the string, each byte must be queried to figure out how many characters are represented, thus the slow length.
        Well... Not sure: Here is a simple utf8-1.xml file:
        <?xml version="1.0" encoding="utf-8"?>
        <root>Ç foo</root>
        (to be sure: hex-editing the file, we actually see C3 87 in place of the char 199)
        With a little sax parser:
        use strict;
        use warnings;
        use feature 'say';
        #~ use utf8;
        use XML::SAX::ParserFactory;
        $|++;

        #to force one kind of parser for ParserFactory->parser()
        #~ $XML::SAX::ParserPackage = "XML::SAX::PurePerl";
        #~ $XML::SAX::ParserPackage = "XML::SAX::Expat";    #no xml_decl
        #~ $XML::SAX::ParserPackage = "XML::SAX::ExpatXS";
        #~ $XML::SAX::ParserPackage = "XML::LibXML::SAX";
        $XML::SAX::ParserPackage = "XML::LibXML::SAX::Parser";

        {   package MySax;
            use feature 'say';
            use Devel::Peek;
            sub new {
                my $class = shift;
                return bless {}, $class;
            }
            sub hexprint {
                my ($self, $data) = @_;
                join ' ', map { sprintf '%02X', $_ } unpack 'C*', $data;
            }
            sub characters {
                my ($self, $data) = @_;
                my $content = $data->{Data};
                say "characters for elt: " . $content;
                say "bytes for elt: " . $self->hexprint($content);
                Dump($content);
            }
        }

        my $handler = new MySax;
        my $parser = XML::SAX::ParserFactory->parser(Handler => $handler);
        say "parser is " . ref $parser;
        say "file: " . $ARGV[0] if $ARGV[0];
        $parser->parse_file($ARGV[0] // *DATA);

        __DATA__
        <empty/>
        I can see:
        perl sax_utf.pl utf8-1.xml
        parser is XML::LibXML::SAX::Parser
        file: utf8-1.xml
        characters for elt: Ç foo
        bytes for elt: C7 20 66 6F 6F
        SV = PV(0x288c658) at 0x233d2e8
          REFCNT = 1
          FLAGS = (PADMY,POK,IsCOW,pPOK,UTF8)
          PV = 0x2b28228 "\303\207 foo"\0 [UTF8 "\x{c7} foo"]
          CUR = 6
          LEN = 10
          COW_REFCNT = 1
        Can I assume the following:
        • the 199 / 0xC7 character was decoded by libxml, as I see that its byte is "C7"
        • but the string is flagged as utf-8?
        • and internally, the byte stream is actually utf-8, as shown by the octal values \303\207 (an unusual notation, but that is how my Emacs editor shows them) = C3 87

        So 1) I can't understand the difference between the unpack and Devel::Peek dumps,
        and 2) I cannot see why I would do the following:
        Invoking Encode::encode_utf8($data) returns the UTF-8 string transformed into the equivalent byte stream. Essentially, from Perl's perspective, it breaks the logical connection between the bytes, and leaves it as some combination of high bit and low bit characters. Now, since every record in the string is exactly 1 byte wide, the byte count requires no introspection.
        If the string is already in utf-8, why process it with encode_utf8?
        If I patch the sub characters like this:
        sub characters {
            use Encode;
            my ($self, $data) = @_;
            my $content = Encode::encode_utf8 $data->{Data};
            say "characters for elt: " . $content;
            say "bytes for elt: " . $self->hexprint($content);
            Dump($content);
        }

        Now I see (still in a Windows console in cp1252):
        characters for elt: Ç foo
        bytes for elt: C3 87 20 66 6F 6F
        SV = PV(0x28ba328) at 0x236d2b8
          REFCNT = 1
          FLAGS = (PADMY,POK,IsCOW,pPOK)
          PV = 0x2b548b8 "\303\207 foo"\0
          CUR = 6
          LEN = 10
          COW_REFCNT = 1
        So unpacking the string shows the expected C3 87 bytes for the char 199, confirmed by the octal dump, but the UTF8 flag has vanished? I'm puzzled!

        Now an additional challenge: I make a copy of the first xml and add the euro sign to the data ("Ç foo €"), so hex-editing the file shows C3 87 20 66 6F 6F 20 E2 82 AC.
        Without the encode_utf8 forcing of the string, it shows this in the console:
        parser is XML::LibXML::SAX::Parser
        file: utf8-2.xml
        Wide character in say at sax_utf.pl line 36.
        characters for elt: Ç foo €
        bytes for elt: C7 20 66 6F 6F 20 20AC
        SV = PV(0x2a61748) at 0x250ade8
          REFCNT = 1
          FLAGS = (PADMY,POK,IsCOW,pPOK,UTF8)
          PV = 0x2cefc98 "\303\207 foo \342\202\254"\0 [UTF8 "\x{c7} foo \x{20ac}"]
          CUR = 10
          LEN = 12
          COW_REFCNT = 1
        Now I am not sure of the byte representation:
        • it could be some Win1252 for the C7, but the euro char is 80 in cp1252, while the 20AC looks like the U+20AC unicode code point rather than the E2 82 AC utf-8 sequence; and why 20AC at all, when unpack should show bytes?
        • the "Ç foo" part is no longer displayed identically once that additional character is present

        Forcing the data with encode_utf8 gives a less surprising result:
        parser is XML::LibXML::SAX::Parser
        file: utf8-2.xml
        characters for elt: Ç foo €
        bytes for elt: C3 87 20 66 6F 6F 20 E2 82 AC
        SV = PV(0x2991768) at 0x243ade8
          REFCNT = 1
          FLAGS = (PADMY,POK,IsCOW,pPOK)
          PV = 0x2c1fc98 "\303\207 foo \342\202\254"\0
          CUR = 10
          LEN = 12
          COW_REFCNT = 1
        While I still do not understand the missing UTF8 flag...
