in reply to Re^2: Inspect the members of a Posix Character Class
in thread Inspect the members of a Posix Character Class

Generating them isn't a huge cost, and it's just a one-time cost. And if you use a string rather than an array to store it, it's much more efficient, both in time and size.

benchmark for [[:alpha:]] on Windows:

__END__ C:\Users\peter.jones\Downloads\TempData\perl>posix_bench.pl mem before generate arrays: 11,584 K mem after generate arrays: 25,688 K mem after delete arrays: 17,600 K mem before generate strings: 17,600 K mem after generate strings: 18,264 K mem after delete strings: 18,264 K COMPARE GENERATING: Rate genArr genStr genArr 1.27/s -- -92% genStr 16.5/s 1198% -- COMPARE ACCESSING: Rate getStr10k getArr10k getStr10k 7.40/s -- -98% getArr10k 417/s 5539% --

And practical code for using multiple posix character sets, and generating random strings from those sets:

#!perl use 5.012; # strict, // use warnings; use open ':std', ':encoding(UTF-8)'; my (%posix, %lengths); for my $class (qw/alpha digit punct/) { $posix{$class} .= $_ for grep { /[[:${class}:]]/ } map {chr} 0 .. +0xEFFFF; $lengths{$class} = length($posix{$class}); } sub get_random_char_from_posix { my $class = shift; die "no such class" unless exists $posix{$class} and exists $lengt +hs{$class}; substr $posix{$class}, rand($lengths{$class}), 1; } use Data::Dump; my $alpha_str; $alpha_str .= get_random_char_from_posix('alpha') for 1 + .. 10; my $digit_str; $digit_str .= get_random_char_from_posix('digit') for 1 + .. 10; my $punct_str; $punct_str .= get_random_char_from_posix('punct') for 1 + .. 10; dd $_ for $alpha_str, $digit_str, $punct_str; __END__ C:\Users\peter.jones\Downloads\TempData\perl>posix_use.pl "\x{288EC}\x{1309F}\x{88F8}\x{29B5D}\x{85A3}\x{209E2}\x{9EE1}\x{23015} +\x{168AB}\x{2A691}" "\x{B6D}\x{17E5}\x{1D7F6}\x{A627}\x{F21}\x{6F1}\x{1043}\x{A8D7}\x{118E +4}\x{116C2}" "\x{A673}\x{FE16}\x{110BC}\x{2CFF}\x{1BFE}\x{1804}\x{11642}\x{1AA0}\x{ +2051}\x{1183B}"

Replies are listed 'Best First'.
Re^4: Inspect the members of a Posix Character Class
by NERDVANA (Monk) on Jul 29, 2021 at 17:40 UTC

    Your benchmark seems to show that accessing an array if 55x faster than accessing a position in the string. This is what I would expect because unicode strings (utf8) don't have fixed-width characters and perl has to go searching for character boundaries.

    For comparison purposes, could you add construction of an inversion list to your benchmark?

    sub build_inversion_list_and_index { my @invlist; my $match; for (0..$max_codepoint) { next unless $match xor (chr($_) =~ /[[:$class:]]/); push @invlist, $_; $match= !$match; } my @index= ( 0 ); for (my $i= 0; $i < @invlist; $i+= 2) { push @index, $index[-1] + $invlist[$i+1] - $invlist[$i]; } shift @index; return \@invlist, \@index; }
    and random selection from the inversion list:
    sub get_nth_char($i, $invlist, $index) { return undef if $i >= $index[-1]; my ($min, $max, $mid)= (0, $#$index); while (1) { $mid= ($min+$max) >> 1; if ($i > $index[$mid]) { $min= $mid+1 } elsif ($mid > 0 && $i < $index[$mid-1]) { $max= $mid-1 } else { return $invlist[$mid*2] + ($i - ($mid > 0? $index[$mid-1] : + 0)) } } }
      Your benchmark seems to show that accessing an array if 55x faster than accessing a position in the string.

      I think you need to read it again. It says that string is 55x faster than array, not the other way around. The string version got 417 reads of 10k values per second, whereas the array version only got 7.4 reads of 10k values per second.

      And you can run your own benchmark using the benchmark code in the spoiler, adding whatever other code you wish to benchmark, because really it matters more how it performs for you, not how it performs for me.


      Update: sorry, apparently the code didn't get pasted into the spoiler section (or more likely, when I added spoiler tags, I accidentally deleted the code). Since I am searching for my tempfile for the code, I will also look at adding your functions in another update to this post in a little while.


      Update 2: Okay, better data now. Sorry for not including the code the first time.

      #!perl -l use 5.012; # strict, // use warnings; use open ':std', ':encoding(UTF-8)'; use Data::Dump; sub mem { `tasklist /nh /fi "PID eq $$"`=~ m[(\S+ K)$] } # [id://11752 +44] # grabs memory usage on Windows systems; if you have Memory::Usag +e, some of it's functions will work as well # store the characters for certain posix character classes for U+0000 +thru my $class = "alpha"; my $max_codepoint = 0xEFFFF; sub store_using_arrays { my @store; for my $c ( 0 .. $max_codepoint ) { my $character = chr($c); push @store, $character if $character =~ /[[:${class}:]]/; } return \@store; } sub store_using_strings { my $store; for my $c ( 0 .. $max_codepoint ) { my $character = chr($c); $store .= $character if $character =~ /[[:${class}:]]/; } return $store; } sub build_inversion_list_and_index { my @invlist; my $match; for (0..$max_codepoint) { next unless $match xor (chr($_) =~ /[[:$class:]]/); push @invlist, $_; $match= !$match; } my @index= ( 0 ); for (my $i= 0; $i < @invlist; $i+= 2) { push @index, $index[-1] + $invlist[$i+1] - $invlist[$i]; } shift @index; return \@invlist, \@index; } sub get_nth_char { my ($i, $invlist, $index) = @_; return undef if $i >= $index->[-1]; my ($min, $max, $mid)= (0, $#$index); while (1) { $mid= ($min+$max) >> 1; if ($i > $index->[$mid]) { $min= $mid+1 } elsif ($mid > 0 && $i < $index->[$mid-1]) { $max= $mid-1 } else { return $invlist->[$mid*2] + ($i - ($mid > 0? $index->[$ +mid-1] : 0)) } } } # memory comparisons: { print "mem before generate arrays: ", mem(); my $ref = store_using_arrays(); print "mem after generate arrays: ", mem(); undef $ref; print "mem after delete arrays: ", mem(); } { print "mem before generate strings: ", mem(); my $str = store_using_strings(); print "mem after generate strings: ", mem(); undef $str; print "mem after delete strings: ", mem(); } { print "mem before generate inversions: ", mem(); my @refs = build_inversion_list_and_index(); print "mem after generate inversions: ", mem(); undef @refs; print "mem after delete inversions: ", mem(); } require Benchmark; my ($aref, $strn, $ilist, $iidx); print "COMPARE GENERATING:"; Benchmark::cmpthese(-10, { 'genArr' => sub { $aref = store_using_arrays }, 'genStr' => sub { $strn = store_using_strings }, 'genInv' => sub { ($ilist, $iidx) = build_inversion_list_and_index + }, }); # dd $aref; dd $strn; #($ilist, $iidx) = build_inversion_list_and_index(); #dd $ilist; dd $iidx; #print $_, " => ", get_nth_char($_, $ilist, $iidx) for 0 .. 32; my $aasize = scalar @$aref; my $sasize = length $strn; my $iasize = scalar @$ilist; my (@arec, @srec, @irec); use constant MAX_REC => (10_000 - 1); $#arec = $#srec = MAX_REC; print "COMPARE ACCESSING:"; Benchmark::cmpthese(-10, { 'getArr10k' => sub { $arec[$_] = $aref->[rand $aasize] for 0 .. MA +X_REC }, 'getStr10k' => sub { $srec[$_] = substr($strn, rand($sasize), 1) f +or 0 .. MAX_REC }, 'getInv10k' => sub { $irec[$_] = get_nth_char(int rand($iasize), $ +ilist, $iidx) for 0 .. MAX_REC }, }); __END__ C:\Users\peter.jones\Downloads\TempData\perl>posix_bench.pl mem before generate arrays: 12,244 K mem after generate arrays: 26,460 K mem after delete arrays: 18,132 K mem before generate strings: 18,136 K mem after generate strings: 19,952 K mem after delete strings: 19,952 K mem before generate inversions: 19,952 K mem after generate inversions: 19,952 K mem after delete inversions: 19,956 K COMPARE GENERATING: Rate genArr genStr genInv genArr 1.32/s -- -7% -19% genStr 1.42/s 7% -- -13% genInv 1.62/s 23% 14% -- COMPARE ACCESSING: Rate getStr10k getInv10k getArr10k getStr10k 2.82/s -- -96% -99% getInv10k 69.7/s 2374% -- -83% getArr10k 422/s 14868% 505% --

      I had a bug in the first version of the benchmark that wasn't generating the same number of codepoints (off by a factor of 16) -- so that 55x wasn't real. Once that was fixed, getInv and getArr are both faster than getStr for the [[:alpha:]] class. (Not shown, but I debugged with the [[:punct:]] class, and with that, getStr was faster than getInv but still slower than getArr). So it looks like the array is the fastest to access, but string and inversion use significantly less memory. But really, even the memory usage of 8-16MB isn't horrible, at least on my system.