Perl Monk, Perl Meditation PerlMonks

### Re^2: Fast - Compact That String

by BrowserUk (Pope)
 on Feb 09, 2012 at 23:30 UTC ( #952886=note: print w/replies, xml ) Need Help??

in reply to Re: Fast - Compact That String
in thread Fast - Compact That String

Doubled the speed of the implementation:

```#! perl -slw
use strict;
use Time::HiRes qw[ time ];

my @c1 = (' ', '0'..'9', 'A'..'Z' );
sub fromB37 {
my \$n = shift;
my \$s = '      ';
substr( \$s, \$_, 1, \$c1[ \$n%37 ] ), \$n /= 37 for 0 .. 5;
\$s;
}

my @c2;
\$c2[  ord( \$c1[ \$_ ] ) ] = \$_ for 0 .. 36;
sub toB37 {
my \$n = 0;
\$n = \$n * 37 + \$c2[\$_] for reverse unpack 'C*', \$_[0];
\$n;
}

my \$start = time;
for ( 1 .. 1e6 ) {
my \$s = fromB37( rand 37**6 );
}
printf "fromB37 took %.3f seconds/million\n", time() - \$start;

\$start = time;
for ( 'AAAAA' .. 'CEXHN' ) {
my \$n = toB37( \$_ );
}
printf "toB37 took %.3f second/million\n", time() - \$start;

my @data = map int( rand 37**6 ), 1 .. 1e6;
\$start = time;
\$_ == toB37( fromB37( \$_ ) ) or die \$_ for @data;
printf "Both ways took %.3f second/million\n", time() - \$start;

__END__
C:\test>junk45
fromB37 took 5.309 seconds/million
toB37 took 3.234 second/million
Both ways took 8.953 second/million

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Replies are listed 'Best First'.
Re^3: Fast - Compact That String
by mbethke (Hermit) on Feb 10, 2012 at 05:12 UTC

The method is pretty much optimal but a little loop unrolling and speed/memory tradeoff can again speed it up a bit (a bit over twice as fast for fromB37() and a good 50% faster for toB37()). Using a couple 100K extra, it probably depends on the CPU's cache size though. Only changed/extra parts:

```my @c3 = map { my \$c = \$_; (map { "\$_\$c" } @c1) } @c1;
sub myFromB37 {
my \$n = shift;
my \$s = '      ';
substr( \$s, 0, 2, \$c3[\$n % 1369] );
\$n /= 1369;
substr( \$s, 2, 2, \$c3[\$n % 1369] );
\$n /= 1369;
substr( \$s, 4, 2, \$c3[\$n] );
\$s;
}

my @c4;
\$c4[unpack 'S', \$c3[\$_]] = \$_ foreach 0 .. 1368;
sub myToB37 {
1874161 * \$c4[unpack 'S', substr(\$_[0], 4, 2)] +
1369 * \$c4[unpack 'S', substr(\$_[0], 2, 2)] +
\$c4[unpack 'S', substr(\$_[0], 0, 2)];
}

my @num_data = map int( rand 37**6 ), 1 .. 1e6;
my @asc_data = map " \$_", 'AAAAA' .. 'CEXHN';

The last lines pre-generate test data because my version depends on strings being no less then 6 characters; the rest of the benchmarking is analogous to yours. You could save one "reverse" in your version though by putting the characters in reverse order, which they should be anyway to represent a normal left-to-right base37 number.

Nice!++ On my machine, both your routines are almost exactly twice as fast as mine.

I get the table sizes as totaling around 400k: @c3:87792 @c4:295160, but that is a perfect trade for speed.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

mbethke,
Very nice. I had to slightly modify it in order to get the sort order I wanted:
```my @c3 = map { my \$c = \$_; (map { "\$c\$_" } @c1) } @c1;
sub myFromB37 {
my \$n = shift;
my \$s = '      ';
substr( \$s, 4, 2, \$c3[\$n % 1369] );
\$n /= 1369;
substr( \$s, 2, 2, \$c3[\$n % 1369] );
\$n /= 1369;
substr( \$s, 0, 2, \$c3[\$n] );
\$s;
}

my @c4;
\$c4[unpack 'S', \$c3[\$_]] = \$_ foreach 0 .. 1368;
sub myToB37 {
1874161 * \$c4[unpack 'S', substr(\$_[0], 0, 2)] +
1369 * \$c4[unpack 'S', substr(\$_[0], 2, 2)] +
\$c4[unpack 'S', substr(\$_[0], 4, 2)];
}
If you have time, I would really appreciate an in-depth explanation.

Cheers - L~R

Yeah, that's what I meant with reversing the bytes: BrowserUK's version generates "little-endian" strings, I just made mine produce the same to more easily check that I'm getting the correct results. To be a proper base37 number they should really have their most significant digit left.

If you understand BrowserUK's version, mine is a relatively simple extension. fromB37 is probably easiest: first preallocate a string of six characters (faster than to keep appending characters if you know the final length anyway), then take the remainder of the current number divided by 37 and look up the digit to represent this with in @c1. Do this six times and you've converted the number.

For the reverse conversion, you need a lookup table that has, under the index of a character code, the associated decimal value (ord(' ') => 0, ord('A') => 1, etc.). Then you have to split up the string into characters (the unpack is just a fast alternative to "split //"), and for each characters multiply the old value of \$n by 37 and add the current character's "worth".

What my version does different is just that it processes two characters at a time. @c3 is the cross product @c1x@c1, i.e. every possible 2-character string from ' ' to 'ZZ'. That way I don't have to modulo and divide by 37 six times but only thrice by 37^2. Surprisingly, not using a loop but three separate substr()s was quite a bit faster. Apparently Perl does some clever optimizations there if the string indexes are known at compile time.

My unpack 'S' ... is just a "wider" version of ord() that works for two characters at a time by converting them to an unsigned short. So I use unpack to convert the 2-char string to an array index, and just like in BrowserUK's version, @c4 gives me the "worth" of that snippet. According to its position in the string I still have to multiply it by 37**4 and 37**2 respectively—for clarity I should probably have used these expressions instead of the "magic" numbers.

According to my benchmarking, this is still about 10% slower than my "bitshift" version. It's faster than my "multiplication" version though.

tobyink,
I am confused by that statement. After eliminating the subroutine overhead (inlining everything), I can encode and decode 1 million strings in under 3 seconds. When I read that yours was taking 28 seconds I didn't even bother to check it. Can you please provide your benchmark code?

Cheers - L~R

Re^3: Fast - Compact That String
by BrowserUk (Pope) on Feb 10, 2012 at 20:06 UTC

An explanation of the code as requested:

```## Map the numbers, 0 .. 36 to the symbols we use
## to represent the number in base37
my @c1 = (' ', '0'..'9', 'A'..'Z' );

sub fromB37 {
my \$n = shift;      ## Get the number to convert
## Allocate space for the Base37 representation
## Initialise it to the representation of 0 (six spaces)
my \$s = '      ';

## For each position in the string
for( 0 .. 5 ) {
## extract the next base37 digit value
## look up its representaion character
## and assign it to the 'right place' i the string.
substr( \$s, \$_, 1 ) = \$c1[ \$n%37 ] );

## dividing by 37 effectively right-shifts
## the last digit's value out of the number
\$n /= 37;
}
\$s;
}

my @c2;
## Map the ordinal values of the symbols
## to their numeric values (0 .. 37)
## The sparse array is faster than a hash
\$c2[  ord( \$c1[ \$_ ] ) ] = \$_ for 0 .. 36;

sub toB37 {
my \$n = 0; ## initialise our return value to 0

## split the base37 representation
## into a list of the ordinal values of the symbols
## and reverse their order to match that produced by fromB37()
for( reverse unpack 'C*', \$_[0] ) {
## multiple the running total by 37
## (effectively left-shifting the accumulator
## to accommodate the next digit.)
## and add value of the next base37 digit
## by looking it up in the mapping array
\$n = \$n * 37 + \$c2[ \$_ ];
}
## return the accumulated value.
\$n;
}

As mbethke points out, these treat the base37 number in 'little-endian' fashion. This because you emphasised compression and speed, with no mention of needing to manipulate the compressed values numerically (sorting).

To get the sortable, big-endian representation, you could use:

```my @c1 = (' ', '0'..'9', 'A'..'Z' );
sub fromB37 {
my \$n = shift;
my \$s = '      ';
substr( \$s, \$_, 1, \$c1[ \$n%37 ] ), \$n /= 37 for 5, 4, 3, 2, 1, 0;
\$s;
}

my @c2;
\$c2[  ord( \$c1[ \$_ ] ) ] = \$_ for 0 .. 36;
sub toB37 {
my \$n = 0;
\$n = \$n * 37 + \$c2[\$_] for unpack 'C*', \$_[0];
\$n;
}

Which actually works out a bit quicker still, but not as fast as mbethke's unrolled version.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Create A New User
Node Status?
node history
Node Type: note [id://952886]
help
Chatterbox?
 [Corion]: Meh. Once again I find that SQLite doesn't support window functions and I want to use those nowadays :-) [erix]: [Corion]: Hmm - actually, I don't need them, even though they'd be nice. I just want the (say) 10 latest images, and that's easily done with a limit 10 offset 0 clause, as I don't need all top 10 images for all users. [Corion]: erix: Sure, but this is for a really-lightweight application and I'm replacing a CSV file / JSON file for user configuration with SQLite (and optionally, Pg) :) [erix]: isn't a texty format handier for configs? [Corion]: So far, I've avoided having even a user database by storing the user information in a (signed) cookie that the browser keeps for me, but as I want to be able to lock users, I need a second storage option :) [Corion]: erix: It's needed for keeping the list of users and the list of tags associated with an image, and for keeping the images with users. I want an easy way to know if an image can be deleted, which means that it can't be referenced by any tag anymore. ... [Corion]: SQL feels like a natural choice here :)

How do I use this? | Other CB clients
Other Users?
Others exploiting the Monastery: (12)
As of 2018-03-20 14:02 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
When I think of a mole I think of:

Results (253 votes). Check out past polls.

Notices?