PerlMonks  

Bidirectional lookup algorithm? (Updated: further info.)

by BrowserUk (Pope)
on Jan 04, 2015 at 21:14 UTC ( #1112131=perlquestion )
BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

I have an application where I need to be able to map some strings to some integers and then be able to lookup the integers from their strings or vice versa. (NOTE: both strings and integers are unique!)

In Perl this can be (and currently is) achieved by using two hashes:

my %intByStr = populate();
my %strByInt;
$strByInt{ $intByStr{ $_ } } = $_ for keys %intByStr;
...

But the penalty for this is that every key and value is stored twice. Which means for a fairly modest dataset of just 450,000 key/value pairs, the total storage requirements are 95MB:

$i=1; $intByStr{ $_ } = $i++ for 'aaaa'..'zzzz';;
$strByInt{ $intByStr{ $_ } } = $_ for keys %intByStr;;
print total_size( $_ ) for \(%intByStr, %strByInt);;
50805928
44297159

As my dataset can be somewhat larger, I'd like to minimise the space requirement. A modest improvement can be achieved by storing both mappings in a single hash:

undef %AbyB; $i=1;
$AbyB{ $_ } = $i, $AbyB{ $i++ } = $_ for 'aaaa'..'zzzz';;
print total_size( \%AbyB );;
80479783

The potential space requirement for my dataset, which might grow to 10x (or 20x or more) the size of the examples above, is such that I'm looking for an alternative, even if it means coding the solution in (Inline::)C, but simply using Perl's hashes from C won't gain me anything.

I could use a more space-efficient hash implementation; I found a few of those on the net. They usually come with the penalty of less flexibility, which I could live with for this purpose.

Somewhere in the back of my mind I seem to recall a bidirectional lookup mechanism/algorithm that didn't require the duplicate storage, but for the life of me I cannot remember how it worked; and my attempts to search for "bidirectional search algorithm" all turn up variations on this, which is a shortest-path-between-two-nodes graph algorithm and thus not useful to me.

To the question:

Does anyone out there know/remember/have any pointers to a "bidirectional lookup algorithm" that only stores the values once?

Update per /msg request for further info:

The strings are usually fairly short, 1 to 8 characters, but can sometimes extend up to 32 or occasionally more. (Think: typical variable names!)

The integers are (unique) 64-bit values. Think memory addresses for 64-bit processors.

The data is loaded once. The two lookup mechanisms (hashes/arrays/indexes/whatever) would be built once as, or after, the data is read; so build/insert speed is not a great concern.

The vast majority of the uses will be lookups, many millions, so the priority is the lookup speed -- preferably O(1) though O(logN) might be acceptable if the memory reduction was sufficient.

The data is discarded at the end of each run. There are no deletes.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re: Bidirectional lookup algorithm? (judy trie?)
by Anonymous Monk on Jan 04, 2015 at 23:47 UTC
      Have you considered a trie, like a Judy array? Specifically Judy::L (int to pointer) and Judy::HS (string to int/pointer).

      I hadn't. In the past I've avoided Judy arrays for the simple reason that I hate to use stuff I don't understand. And whilst I 'understand' the principles behind Judy arrays, the code is so complex that I doubt my ability to maintain it; and I hate to get stuck needing other people to do things for me.

      And I just rediscovered another reason I've not used them:

      C:\perl64\packages\Judy-0.41>build
      compilet-1064304640.c
         Creating library C:\Windows\TEMP\compilet.lib and object C:\Windows\TEMP\compilet.exp
      Generating code
      Finished generating code
      Checking prerequisites...
        requires:
          !  Alien::Judy is not installed
        build_requires:
          !  Test::Deep is not installed
      ERRORS/WARNINGS FOUND IN PREREQUISITES.  You may wish to install the versions
      of the modules indicated above before proceeding with this installation
      Run 'Build installdeps' to install missing prerequisites.
      Creating new 'MYMETA.yml' with configuration results
      Creating new 'Build' script for 'Judy' version '0.41'

      C:\perl64\packages\Judy-0.41>Build installdeps
      Too early to specify a build action 'installdeps'.  Do 'Build installdeps' instead.

      Recurse: see recurse!


Re: Bidirectional lookup algorithm? (Updated: further info.)
by ikegami (Pope) on Jan 05, 2015 at 16:04 UTC

    So you have something like the following:

    my $addr  = 0x1234;
    my $ident = "abc";

    my %by_addr  = ( $addr  => $ident );
    my %by_ident = ( $ident => $addr );

    A copy of the address and of the identifier are stored in the elements of each hash. (The key of an element is stored in the element to distinguish elements in collisions.) You can cut the number of scalars (or is it string buffers?) in half as follows:

    use Data::Alias qw( alias );

    sub find_undef_element {
        my ($hash) = @_;
        while (my ($key, $val) = each(%$hash)) {
            if (!defined($val)) {
                keys(%$hash);  # Reset iterator.
                return $key;
            }
        }
        return undef;
    }

    my $addr  = 0x1234;
    my $ident = "abc";

    my %by_addr;
    my %by_ident;
    $by_addr{$addr}   = undef;
    $by_ident{$ident} = undef;

    alias my $shared_addr  = find_undef_element(\%by_addr);
    alias my $shared_ident = find_undef_element(\%by_ident);

    alias $by_addr{$shared_addr}   = $shared_ident;
    alias $by_ident{$shared_ident} = $shared_addr;

    Of course, constructing is going to be much slower. I think you can avoid iterating through the hash (and avoid using Data::Alias) by dropping to C, but if you can do that, you could use a C-based data structure and use strings instead of scalars.

    Update: It seems that only the string buffers are getting shared, not the entire scalar.

Re: Bidirectional lookup algorithm? (Updated: further info.)
by RichardK (Parson) on Jan 05, 2015 at 13:25 UTC

    Sorry, I don't have any pointers on bidirectional lookups, but have you considered using DB_File to tie your hash to a database file?

    It looks like your data will map nicely to a BTREE and lookups should be fast. You could run the database completely in memory or tweak the cache size to suit your requirements. It will be easy to try with your existing code if you want to give it a whirl.

    I tried a single-hash BTREE from your example above ( 'aaaa' - 'zzzz' ) and it created a ~11MB file, so it seems to pack pretty well.

Re: Bidirectional lookup algorithm? (Updated: further info.)
by soonix (Monsignor) on Jan 05, 2015 at 10:22 UTC
    would using a database be overkill? I saw that DBD::SQLite can work in-memory:
    my $dbh = DBI->connect("dbi:SQLite::memory:");

      That certainly is worth considering as it reduces the memory requirement to ~1/6th compared to the single hash method.

      However, the penalty is that the lookups run ~30x more slowly, so a job that currently takes ~8hrs would require 10 days! Hence, it will be a last resort for when I cannot find anything quicker.

      Single hash:

      #! perl -slw
      use strict;
      use Time::HiRes qw[ time ];

      sub mem{ `tasklist /nh /fi "PID eq $$"` =~ m[(\S+ K)$] }

      our $S //= 1;
      my $scale = chr( ord('a') + $S - 1 );

      print "Start: mem: ", mem;

      my( $i, %lookup ) = 1;
      $lookup{ $_ } = $i, $lookup{ $i++ } = $_ for 'aaaaa' .. $scale . 'zzzz';
      print scalar keys %lookup;
      print "Hash built: mem: ", mem;

      my $start = time;
      for( 'aaaaa' .. $scale . 'zzzz' ) {
          my $addr = $lookup{ $_ };
      #    print $addr;
      }
      printf "Fetch by Name took %f seconds \n", time() - $start;

      $start = time;
      for( 1 .. 26**4 * $S ) {
          my $name = $lookup{ $_ };
      #    print $name;
      }
      printf "Fetch by Addr took %f seconds \n", time() - $start;

      print "Hash queried: check mem: ", mem;

      __END__
      C:\test>1112165-hash -S=1
      Start: mem: 8,032 K
      913952
      Hash built: mem: 169,116 K
      Fetch by Name took 0.217840 seconds
      Fetch by Addr took 0.234923 seconds
      Hash queried: check mem: 169,140 K

      C:\test>1112165-hash -S=2
      Start: mem: 8,008 K
      1827904
      Hash built: mem: 329,660 K
      Fetch by Name took 0.445586 seconds
      Fetch by Addr took 0.512336 seconds
      Hash queried: check mem: 329,684 K

      C:\test>1112165-hash -S=3
      Start: mem: 7,984 K
      2741856
      Hash built: mem: 524,328 K
      Fetch by Name took 0.699219 seconds
      Fetch by Addr took 0.717473 seconds
      Hash queried: check mem: 524,352 K

      C:\test>1112165-hash -S=4
      Start: mem: 7,972 K
      3655808
      Hash built: mem: 651,160 K
      Fetch by Name took 0.976562 seconds
      Fetch by Addr took 1.065514 seconds
      Hash queried: check mem: 651,184 K

      Sqlite & :memory:

      #! perl -slw
      use strict;
      use Time::HiRes qw[ time ];
      use DBI;

      use constant {
          CREATE => 'create table DB ( NAME test primary key not null, ADDR int not null )',
          INSERT => 'insert into DB ( NAME, ADDR ) values ( ?, ? )',
          INDEX  => 'create index ADDR_IDX on DB ( ADDR );',
          QUERYBYNAME => 'select ADDR from DB where NAME = ?',
          QUERYBYADDR => 'select NAME from DB where ADDR = ?',
      };

      sub mem{ `tasklist /nh /fi "PID eq $$"` =~ m[(\S+ K)$] }

      our $S //= 1;
      my $scale = chr( ord('a') + $S - 1 );

      print "Start: mem: ", mem;

      my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:','','' ) or die DBI::errstr;
      $dbh->do( CREATE ) or die DBI::errstr;
      print "DB created: mem: ", mem;

      my $ins = $dbh->prepare( INSERT ) or die DBI->errstr;
      my $i = 1;
      for( 'aaaaa' .. $scale . 'zzzz' ) {
          $ins->execute( $_, $i++ );
      }
      print "DB populated: mem: ", mem;

      $dbh->do( INDEX );
      print "DB indexed: mem: ", mem;

      my $start = time;
      my $sth = $dbh->prepare( QUERYBYNAME ) or die DBI->errstr;
      for( 'aaaaa' .. $scale . 'zzzz' ) {
          $sth->execute( $_ ) or die DBI::errstr;
          my $ref = $sth->fetch;
          my $addr = $ref->[0];
      #    print $addr;
      }
      printf "Fetch by Name took %f seconds \n", time() - $start;

      $start = time;
      $sth = $dbh->prepare( QUERYBYADDR ) or die DBI->errstr;
      for( 1 .. 26**4 * $S ) {
          $sth->execute( $_ ) or die DBI::errstr;
          my $ref = $sth->fetch;
          my $name = $ref->[0];
      #    print $name;
      }
      printf "Fetch by Addr took %f seconds \n", time() - $start;

      print "DB queried: mem: ", mem;

      __END__
      C:\test>for /l %s in (1,1,4) do 1112165-sql -S=%s

      C:\test>1112165-sql -S=1
      Start: mem: 9,656 K
      DB created: mem: 11,072 K
      DB populated: mem: 30,796 K
      DB indexed: mem: 39,384 K
      Fetch by Name took 6.933570 seconds
      Fetch by Addr took 7.299877 seconds
      DB queried: mem: 39,412 K

      C:\test>1112165-sql -S=2
      Start: mem: 9,700 K
      DB created: mem: 11,116 K
      DB populated: mem: 50,088 K
      DB indexed: mem: 65,400 K
      Fetch by Name took 14.425735 seconds
      Fetch by Addr took 14.653500 seconds
      DB queried: mem: 65,428 K

      C:\test>1112165-sql -S=3
      Start: mem: 9,656 K
      DB created: mem: 11,068 K
      DB populated: mem: 69,544 K
      DB indexed: mem: 92,320 K
      Fetch by Name took 21.599556 seconds
      Fetch by Addr took 21.844892 seconds
      DB queried: mem: 92,348 K

      C:\test>1112165-sql -S=4
      Start: mem: 9,668 K
      DB created: mem: 11,084 K
      DB populated: mem: 88,800 K
      DB indexed: mem: 119,372 K
      Fetch by Name took 30.339751 seconds
      Fetch by Addr took 30.587817 seconds
      DB queried: mem: 119,400 K

Re: Bidirectional lookup algorithm? (Solution.)
by BrowserUk (Pope) on Jan 11, 2015 at 18:15 UTC

    Okay. I have my first pass solution completed, and here (as promised) is the code (in the form of an Inline::C script for RAD and testing), and a brief description.

    Description:

    The code presents a class: BiMap; which supports the following methods:

    • my $bm = BiMap::new( (U32)$initSize, (double)$factor );

      initSize is the initial size the tables are allocated to.

      (Fill) factor is a double representing the proportion of the table that will fill before it is resized to double its current size.

    • $bm->add( (U64)$addr, (char*)$name );

      Add a paired integer and string to the BiMap.

      (Currently) returns the number of (additional) probes required to add the pair to both tables. (For debugging and tuning purposes.)

    • my $addr = $bm->findByStr( (char*)$name );

      Lookup an integer from its paired string.

    • my $name = $bm->findByInt( (U64)$addr );

      Lookup a string from its paired integer.

    • Some utility/debug methods that may or may not persist into the final version:
      • my $size = $bm->size;

        (Getter only!)

      • my $used = $bm->used;

        (Getter only!)

      • my $factor = $bm->factor;

        (Getter only!)

      • $bm->dump( (BOOL)$flag );

        Dumps the BiMap to stdout. (Should probably be to stderr, but attempting to use stderr from I::C causes segfaults.)

        If flag is false only dumps the header rather than the full table.

    The basic structure of the BiMap contains pointers to two parallel arrays: byInt[], of PAIR structures; and byStr[], of U32 integers:

    typedef struct {
        U64   addr;
        char *name;
    } PAIR;

    typedef struct {
        PAIR **byInt;
        U32   *byStr;
        U32    size;
        U32    used;
        double factor;
    } BIMAP;

    The PAIRs are inserted into byInt[] hashed by the (U64)addr in the PAIR.

    The byStr[] holds indexes (+1, to allow 0 to represent empty) of the PAIRs, hashed by the (char*)name in the PAIR. Using indexes into byInt[] rather than the addresses of the pairs themselves saves half the space at the cost of an extra indirection.

    As the large majority of the names will be less than 8 characters, when that is the case the char* name component of the PAIR is used to hold the string directly, rather than a pointer to it, thus saving another 8 bytes per compliant PAIR. The high bit of the 8th byte is set to distinguish directly stored strings from addresses of strings.
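    The packing trick described above can be sketched roughly as follows. This is my own illustrative reconstruction (hypothetical names, not the BiMap code), and it assumes a little-endian layout and canonical user-space x86-64 pointers, whose most significant byte has the high bit clear:

```c
#include <stdlib.h>
#include <string.h>
#include <assert.h>

/* Names of up to 7 bytes are stored directly in the 8-byte slot,
   tagged by setting the high bit of the 8th byte; longer names
   fall back to a heap pointer, whose top byte leaves that bit
   clear on canonical little-endian x86-64 user-space addresses. */
typedef union {
    char  buf[8];   /* inline short string + tag byte */
    char *ptr;      /* heap pointer for long strings  */
} NameSlot;

#define INLINE_TAG 0x80

static void name_set(NameSlot *s, const char *name) {
    size_t len = strlen(name);
    if (len <= 7) {
        memset(s->buf, 0, sizeof s->buf);
        memcpy(s->buf, name, len);
        s->buf[7] |= INLINE_TAG;      /* mark as stored inline */
    } else {
        s->ptr = malloc(len + 1);     /* tag bit stays clear   */
        memcpy(s->ptr, name, len + 1);
    }
}

/* 'out' must be an 8-byte scratch buffer for the inline case. */
static const char *name_get(const NameSlot *s, char out[8]) {
    if ((unsigned char)s->buf[7] & INLINE_TAG) {
        memcpy(out, s->buf, 8);
        out[7] = '\0';                /* strip the tag byte */
        return out;
    }
    return s->ptr;
}
```

    Note that in this sketch only 7 characters fit inline, since the 8th byte carries the tag; that matches the "less than 8 characters" case described above.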

    I tested several combinations of: a) various hash functions; and b) various probing algorithms. In my (simplistic) testing, no other combination beat the performance of the (Jenkins) one-at-a-time hash function formerly used by Perl, together with the very simple +1 probing mechanism.

    Results:

    Using a single, combined Perl hash to look up 11 million strings from their associated 64-bit integer address, and vice versa, takes a total of 15.9 seconds and requires 4.17 GB:

    C:\test>1112165-hash
    Start: mem: 7,920 K
    23762752
    Hash built: mem: 4,382,828 K
    Fetch by Name took 7.539046 seconds
    Fetch by Addr took 8.349479 seconds
    Hash queried: check mem: 4,382,852 K

    Using BiMap on the same dataset, with the tables presized to accommodate 16 million pairs and a 0.71 fill factor, takes 22.3 seconds and requires just over 1/2 GB:

    C:\test>BiMap -S=5 -F=0.71 -I=12000000 -D=0
    bless(do{\(my $o = 52848976)}, "BiMap")
    Fetch by Addr & Fetch by Name took 22.257803 seconds for 11881376 items in a 16777216 sized BiMap
    Mem: 580,552 K

    Moving to a fill factor of 0.70 and pre-sizing the arrays to accommodate 33 million pairs takes 19.8 seconds and requires just over 3/4 GB:

    C:\test>BiMap -S=5 -F=0.70 -I=20000000 -D=0
    bless(do{\(my $o = 52848976)}, "BiMap")
    Fetch by Addr & Fetch by Name took 19.896479 seconds for 11881376 items in a 33554432 sized BiMap
    Mem: 777,532 K

    Conclusion:

    Using a BiMap in preference to a Perl hash will allow 8 times as much data to be processed per run; thus both reducing the number of runs required whilst increasing the statistical validity of the results by well over 8 times.

    Size for size, the total runtime of the tests will increase by approximately 40%. A fair penalty for the substantial increase in statistical validity.

    Other options:

    • It seems possible that Judy arrays would reduce the space requirement further and (possibly) increase the performance. Just a shame that the Build mechanism is broken.

      I'm also considering trying a minimal radix tree/trie implementation to see how it compares. It won't achieve the space reduction of a Judy array, but as it automatically 'compresses' any prefix overlap in the key sets, it might be competitive; and it should be fast, as it avoids both hashing and some indirection.

    • Perfect hashing/Minimal perfect hashing seems like it might both reduce the space requirement and speed up the lookups.

      However, generating such functions is non-trivial if you start from scratch; and life's too short to try to extract useful information from the SourceForge website for the CMPH project.

      The examples suggest that it will only take the inputs from disk; which is a pain, as in this case the PAIRs are generated entirely in memory, and writing them out to disk only to have them read back in would negate any gains.

      Of course, it could be that there is an API for supplying the keys directly from memory; but as there doesn't appear to be any API documentation, nor a way to view the individual files via SF, who knows.

      In addition, I work in the Windows world, and experience tells me that it would probably take longer to work out how to build the library on Windows than to re-write it from scratch. All of these OSS projects seem to have interminable dependencies upon autoconf or cmake or a dozen other variations of build tools that either don't work on Windows or, if they do, themselves have half a dozen more dependencies, some or all of which don't.

    • Other/"better" hashing functions and/or probing algorithms.

      Having tried several of each, there seems to be no particular benefit for my particular application data, without moving to separate chaining, which incurs the dual penalties of extra complexity and extra memory; taking things back in the direction of the well-tuned generality of the Perl implementation.

    The code:


      It seems possible that Judy arrays would reduce the space requirement further and (possibly) increase the performance. Just a shame that the Build mechanism is broken.

      Since Judy seemed the most promising approach, in my personal opinion, I took a quick look at the complexities of getting the Perl module working under Windows. There seem to be at least three ways to build the Judy library itself under Windows:

      1. Cross-compile it on Unix.
      2. Configure it by hand and install GNU Make (and maybe gcc) to build it.
      3. Use the included *.bat file and build it using Visual Studio.

      The last choice seems, by far, the easiest. The code included in Alien::Judy even includes the *.bat file. But since I didn't already have Visual Studio, I stopped looking at that point.

      I didn't reply as what I discovered was found easily so I guessed you'd either find the same thing soon enough or wouldn't be that interested in using Judy.

      Your node above (thanks for sharing your implementation) makes me think you may still be interested in Judy and also didn't yet find the *.bat file. So that little bit of information may encourage you and/or just save you some small bit of research.

      The next steps (getting Judy to not fail to build just because Alien::Judy isn't installed and getting the build process for Judy to find the needed include and library files) seem to me to likely be fairly simple.

      I may still eventually get around to jumping through those hoops (though, to be honest, there are a lot of higher-priority items on my to-do list right now). If I do, I'll post more details. If you do, please also post what you find.

      Thanks.

      - tye        

        I finally managed to build a 64-bit version of Judy.

        I started with this one-file/1250 line version and hacked out all the -DSTANDALONE and -DASKITIS stuff along with all the BIG_ENDIAN stuff; extracted a Judy.h; and got the file down to 965 lines and built it into a dll:

        C:\test\Judy>cl /W3 /Ot /favor:INTEL64 /MT /LD Judy.c
        Microsoft (R) C/C++ Optimizing Compiler Version 15.00.21022.08 for x64
        Copyright (C) Microsoft Corporation.  All rights reserved.
        Judy.c
        Microsoft (R) Incremental Linker Version 9.00.21022.08
        Copyright (C) Microsoft Corporation.  All rights reserved.
        /out:Judy.dll
        /dll
        /implib:Judy.lib
        Judy.obj
           Creating library Judy.lib and object Judy.exp

        I then wrote a C program to use it: it creates two Judy arrays and stores my test data 'aaaaa'..'zzzzz' paired with 64-bit integers:

        built it against the dll:

        C:\test\Judy>cl /W3 /Ot JudyTest.c Judy.lib
        Microsoft (R) C/C++ Optimizing Compiler Version 15.00.21022.08 for x64
        Copyright (C) Microsoft Corporation.  All rights reserved.
        JudyTest.c
        Microsoft (R) Incremental Linker Version 9.00.21022.08
        Copyright (C) Microsoft Corporation.  All rights reserved.
        /out:JudyTest.exe
        JudyTest.obj
        Judy.lib

        A run:

        C:\test\Judy>JudyTest.exe aaaaa
        Check memory:
        Bidi lookup of 11881376 pairs took: 6.325 seconds
        Check memory: 524,332k

        Then I built it as an Inline::C module, adding method wrappers for the important functions:

        Unfortunately, in this form the runtime increases -- mostly, I think, due to the perl->C->perl transitions -- from 6.5 seconds to over 25s:

        C:\test\Judy>perl Judy.pm -ODO=aaaaa
        Memory before building Judy: 10,760 K
        Memory after building Judy: 347,660 K
        Bidi lookups of 11881376 pairs took: 25.197204113 seconds
        Memory after lookups: 347,680 K

        So, whilst it does use somewhat less memory than my BiMap version, it is also somewhat slower.



      That line below your signature is somewhat enigmatic:

      The code present a class:BiMap, which supports the following methods: 7 ) { strncpy( (char*)( iHash script for RAD and testing), and a brief description. , CLEAN_AFTER_BUILD =used, newSize ); for( i = 0; i = pair; bm--1

      I've tried setting up a bidirectional hash using XS code. It makes both keys and values into keys, and misuses the IV slot of the SV of a hash entry to store the pointer to the crossover hash key.

      However, the benchmarks are disappointing:

      use blib;
      use Hash::Bidirectional;
      use Benchmark qw(cmpthese);
      use Devel::Size qw(total_size);
      use strict;
      use warnings;

      my $hn = {};
      my $hb = bless {}, 'Hash::Bidirectional';
      my $i = 0;

      for ( 'aaaaz' .. 'zzzzz') {
          $i++;
          $hn->{$_} = $i;      # - normal hash
          $hn->{$i} = $_;      # /
          $hb->pair($_,$i);    # - bidirectional hash
      }

      print "size normal hash: ", total_size($hn),"\n";
      print "size bi_dir hash: ", total_size($hb),"\n";

      cmpthese( 10000000, {
          normal => sub { $hn->{$hn->{shmem}} },
          bi_dir => sub { $hb->get($hb->get('shmem')) },
      });

      __END__
      size normal hash: 3025679168
      size bi_dir hash: 2360323536
                  Rate bi_dir normal
      bi_dir 2801120/s     --   -62%
      normal 7407407/s   164%     --

      An improvement of 22% in memory, and more than double the time for lookup? Why? Is it due to method call/XS overhead? XS code below.

      perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
        That line below your signature is somewhat enigmatic:

        It's an automatic appending of a collection of random snippets from the body of the post. It happens quite often when I post here. I try to remember to check for them and remove them; but sometimes I forget.

        I'm told (by the PMDevs) that it is due to the browser I use -- Opera -- but the fact that it never happens on any other site I use makes me think it is more to do with PM than Opera; but it's always easier to blame someone else's tools than fix your own code.

        more than double time for lookup? Why? Is it due to method call/XS overhead?

        I would assume so. Any XS code is always going to be at a disadvantage compared to built-ins; that's inevitable. Which means you really have to pare everything down to even begin to compete.



      Perfect hashing/Minimal perfect hashing seems like it might both reduce the space requirement and speed up the lookups.

      However, generating such functions is non-trivial if you start from scratch...

      This may not be as horrible as it sounds if your code is going to get a lot of use. Minimal Perfect Hashing looks like the only way you're going to get O(1) on lookups, and if you're going to be getting bigger and bigger data sets, the time up front starts to look more cost effective. A bit of digging around found a real algorithm that someone actually implemented in C and tested, designed for use with both integer keys and character keys. Unfortunately I've only found the original paper and a later one with pseudocode, nothing quite canned.

      The paper with pseudocode is here: Czech, Havas, Majewski, and an earlier paper that's a little denser to interpret is here Havas, Majewski. This page: Schnell gives a pretty readable interpretation of it as well. It doesn't look too painful to implement.

      I also found what looks like a similar algorithm, which appears to include bi-directionality, implemented in a python package called pyDAWG that you could either call or port.

      A little more digging around on the relevant names might find you some C code that does about the same.

      EDIT: Allegedly there's a Java package floating around called "ggperf" that's an implementation of the CHM algorithm (and is supposed to be faster than "gperf", a minimal perfect hash generator that's part of libg++), but I couldn't find source, just a paper that amounts to documentation for ggperf.

        bitingduck, thank you for your research. The links provided fascinating (if at times somewhat bewildering :) reading.

        In particular, the Schnell article made the CHM algorithm understandable; and the pyDAWG code is relatively concise and easy to follow.

        I think I understand the process (if not the implementation) enough to see a problem (for my application).

        Given that my datasets are generated at runtime, and the goal is to be able to make them as large as available memory will allow (but no larger), it would be necessary to run the CHM algorithm within the same process, and after the dataset has been generated.

        The problem is that even using the most parsimonious implementation of a DAG, the combined size of the node and edge structures (with their required pointers) requires substantially more memory per string-number pair than the data itself. Using the pyDAWG implementation as an example: each pairing requires two DAWGNode structs at 18 bytes each; and each word requires one DAWGEdge struct per letter (say ave. 5*9 + 8*9 = 117 bytes).

        Thus a quick estimate: for each 13 bytes of data (5-byte string + 8-byte integer), the algorithm would require a minimum of 117 + 36 = 153 bytes of structs per pair in order to build the DAG.


Re: Bidirectional lookup algorithm? (try perfect hashing)
by oiskuu (Hermit) on Jan 07, 2015 at 17:09 UTC

    You could lump longer symbols together a la join qq(\0), grep length() > 8, @syms; access via ptr or offset. Keep short ones directly in the 8-byte slots of the symbol table. (union). This should get you down to ~10 bytes per key.

    To index the tables, use minimal perfect hashing. There's the CMPH library, which appears quite simple to use. Though you'll need to work on the glue — cmph bindings via Perfect::Hash are incomplete, and it's not on CPAN.

    Anyway, with perfect hashing you need but a few extra bits to do the indexing. Various methods are available. CHM needs ~8 bytes per key, but preserves order. BRZ is about ~8 bits per key, uses external memory while building the tables so you're not mem-constrained. BDZ is just under 3 bits per key. Storing the 8-byte keys as you are already, these few extra bits aren't much.

    Performance estimates: about 3 sec per million keys when constructing the mph structures; about .3 sec per million lookups. Compared to .15 usec lookups with direct hashes.

    So expect roughly 3 times slower performance at about 10 times the memory density. Whether the tradeoff is worth the effort, I'll leave for you to decide.

      Well, I experimented a little further with CMPH. It might be rough around the edges (e.g. temp files aren't cleaned up or created safely), but the good thing is it basically works. First, I generated some data as follows... that took a while.

      #! /bin/sh
      perl -E '$/ = \36;
          for (1 .. 200e6) {
              ($v,$s) = unpack q<QA*>, <>;
              say qq($_ @{[$s =~ y/a-zA-Z//cdr || redo]} $v)
          }' /dev/urandom |
      sort -k2 | perl -ape '$_ x= $F ne $F[1], $F = $F[1]' |
      sort -k3 | perl -ape '$_ x= $F ne $F[2], $F = $F[2]' |
      sort -n  | perl -pe 's/\S*\s*//'
      In case many longer keys are present, it might be better to go with (4-byte) offset records. Simple and compact, but there's one more indirection, hence slower access.
      $ perl -anE '$h[length$F[0]]++ }{ say "@h";' data
      52 2704 140608 7190883 35010259 35855576 28751420 19240344 10899542 5278104
      2202814 795438 249757 68269 16155 3388 640 89 12 1

      Then a small test. You can see I didn't bother mapping value indexes but used order-preserving hash instead. Memory usage comes to about 26 bytes/pair. Double that while building the hashes.

      Edit. Same test with updated code; higher -O used.

      [   1.303844] data ALLOCATED; tab = 160002048, ss = 14680064 (10000000 pairs)
      [   5.478031] built BDZ for syms
      [   2.873171] inplace REORDER
      [  20.015568] built CHM for vals
      [   0.000028] mph size when packed: syms = 3459398, vals = 83600028
      [   0.522367] fgets loop; lines=10000000
      [   1.195339] fgets+strtoul; lines=10000000
      [   2.235220] SYMS fetch; found=10000000
      [   2.940386] VALS fetch; found=10000000
      [   2.709484] VRFY sym to val; matched=10000000
      [   4.258673] VRFY two-way; matched=10000000
      Old output.

        oiskuu. Thanks for pursuing this and posting your findings. Unfortunately, I do not understand what I am reading.

        1. You read 200 million 36 byte records from /dev/urandom;

          Then you unpack them as: a) a 64-bit uint; and b) the rest as a string, from which you remove any non-alpha characters (with redo if nothing is left);

          And print the string and int to stdout;

          Which you pipe through: sort & perl & sort & perl & sort & perl to ???

        2. You then count the lengths of the first fields from a file called data?
        3. You then run an executable called a.out, supplying a number and the data file?

          Which produces some output?


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

      Ok. Here's a crude demo in case someone is interested in doing benchmarking comparisons or whatever. The program is pure C, tested on 64-bit Linux. Give it two arguments: Nkeys and Filename. The file should contain sym-value pairs, both unique in their domains (space-separated, one pair per line).

      Update. Some additional notes regarding the program.

Re: Bidirectional lookup algorithm? (Updated: further info.)
by roboticus (Chancellor) on Jan 05, 2015 at 19:11 UTC

    BrowserUk:

    I have a couple of questions:

    1. Is there an even mix of lookup by number and by name?
    2. Is there any bias to the names/values looked up? In other words, are certain values more likely to be looked up than others? If so, is there a way of knowing which ones are more likely before your lookup pass?
    3. Similarly: are recently-looked-up values more likely to be looked up again soon, or is it a totally random distribution?
    4. As I understand your question, the data is loaded first, then used frequently afterwards. Is my understanding correct?

    There are likely other questions I could/should ask, if only I could think of them...

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      1. Is there an even mix of lookup by number and by name?

        The algorithm is a GA exploration, thus effectively random and approximately evenly distributed across the full ranges of both domains. At least in the early rounds.

        As the selection progresses towards a minima, the visited domains shrink, but the lookups in both domains remain roughly equal.

        In addition, after each iteration, many new pairs are selected (from the lookup tables), with an equivalent number of old ones discarded (from the actively considered subset, not the lookup tables), thus potentially opening up the full ranges again.

      2. Is there any bias to the names/values looked up? In other words, are certain values more likely to be looked up than others? If so, is there a way of knowing which ones are more likely before your lookup pass?

        As you'll gather from the above, no on all counts.

      3. Similarly: are recently-looked-up values more likely to be looked up again soon, or is it a totally random distribution?

        Yes, for brief periods of the total run, but then a different set will be active, then another. And there is no way to predict which subsets will be required at any given time.

        There is no mileage in only having a subset available at any given time.

      4. As I understand your question, the data is loaded first, then used frequently afterwards. Is my understanding correct?

        Actually generated rather than "loaded"; but yes.

        The dataset is generated as a random subset of a truly huge domain of possibilities.

        The larger the subset that can be generated -- limited by the size of the lookup table(s) and physical memory -- the more statistically valid the results.

        I'm currently limited to runs of 35e6 -- ~6GB of ram -- which represents ~0.000000003% of the total possibilities.

      I'm seeking a way to reduce the memory footprint without too much loss of performance, in order that I might increase the statistical validity of the simulation.



        BrowserUk:

        Hmmm ... I've been thinking about it off and on and haven't come up with anything interesting.

        ...roboticus


Re: Bidirectional lookup algorithm? (Updated: further info.)
by Eily (Parson) on Jan 07, 2015 at 19:28 UTC

    I don't know if you have found what you were looking for or not, but here is another way to win a little space: hashes use the stringified value of keys, so most 64-bit ints will be stored as 18 bytes or more*. But if you pack them first, they will only use 8 bytes.

    $i=1;
    $H{ 2**52 + $i++ } = $_, $G{ pack "Q", 2**52 + $i++ } = $_
        for "aaaa".."zzzz";
    say total_size(\%H);
    say total_size(\%G)
    __DATA__
    63601240
    59945432
    This shouldn't change much for values, though it would keep the scalars from holding both a string and an int value.

    On my computer:

    perl -E '$a = 2**53+1; $b = 2**53+2; say "Hello" if $a != $b and $a eq $b; say "End"'
    Hello
    End
    perl -Mbignum -E '$a = 2**53+1; $b = 2**53+2; say "Hello" if $a != $b and $a eq $b; say "End"'
    End
    (numbers past a certain point are stringified to the same thing)

    Think: typical variable names! If that's \w+, that's 63 possible values per character, so there probably is some room for compression here as well, maybe even Huffman coding. But if most of those strings are only up to 8 bytes long, I'm not sure there's going to be an actual improvement (though in this case, it would save space in both the keys and the values).

    * Numbers below 2^27 may have fewer than 4 digits in base 10, but that's only one number in every 2^37. And numbers above 2^30 are all shorter once packed than stringified, and there are 2^34 times as many numbers above that point as below.
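    As a rough illustration of the 6-bits-per-character idea: assuming symbols match \w and are at most 10 characters, the whole symbol fits in one 64-bit word. The charset table and function names here are invented for the sketch, not from the thread.

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* 63 usable \w codes plus 0 as terminator: 6 bits per character,
 * so up to 10 characters pack into a single uint64_t. */
static const char CS[] =
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_";

static uint64_t sym_pack(const char *s) {      /* returns 0 on overflow/bad char */
    uint64_t v = 0;
    int n = (int)strlen(s);
    if (n > 10) return 0;
    while (n--) {                              /* first char ends in low bits */
        const char *p = strchr(CS, s[n]);
        if (!p) return 0;
        v = (v << 6) | (uint64_t)(p - CS + 1); /* codes 1..63 */
    }
    return v;
}

static void sym_unpack(uint64_t v, char out[11]) {
    int i = 0;
    while (v) { out[i++] = CS[(v & 63) - 1]; v >>= 6; }
    out[i] = '\0';
}
```

Longer identifiers would need a variable-length scheme, but for short keys this already beats both the raw string and its packed decimal form.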

      Given that the numbers can range from 1 to 20 (decimal) digits, a simple pack will always make an 8-byte string, but if values less than 10_000_000 happen to predominate (the OP suggests there's an upper bound of 10M elements to be tracked in the application), the result will be a net increase of memory usage.

      Apart from that, there's a potential risk that some 64-bit int values, when packed into an 8-byte string, might collide with actual strings encountered in the data. There's a better way to pack integers into smaller strings.

      The following "encode/decode" functions will always return a string of characters in the range "\x80" - "\xff", and the number of characters will be roughly half the number of decimal digits. (I'm assuming that BrowserUK's strings will always be in the ASCII range, so I'm using non-ASCII characters that happen to be stored internally as single bytes in perl):

      sub int2str {
          my $int = shift;
          my $str = '';
          while ( $int ) {
              $str .= chr(( $int % 128 ) + 0x80 );
              $int = int( $int / 128 );
          }
          $str;
      }

      sub str2int {
          my $mult = 1;
          my $int  = 0;
          for ( split( //, shift )) {
              $int += ( ord() - 0x80 ) * $mult;
              $mult *= 128;
          }
          $int;
      }
      (There might be more efficient ways to implement that.)

      Even so, the overall space saved seems a bit disappointing - on my (macports, 5.12) version of perl, encoding the numbers like that over the ~914K entries in BrowserUK's single-hash example (strings 'aaaa' .. 'zzzz', numbers incrementing from 1) yielded just 4.1% reduction in the total_size of the hash. (If I use random integers ranging up to 15 digits, it edges closer to 6% reduction.)

      Exponentiation is implemented via pow internally, so the result is forced to floating point. Better use the shift operator: $a = (1<<53)+1; ...

      Set size of 35e6 has been mentioned, and the desire to push it further. Elements are unique. In decimal notation, certainly more than half of the values are going to be at least eight digits.

      Similar concerns hold for the symbols. 35e6 is just about 25 bits, but there would be no problem if the coding weren't sparse to begin with, so let's add a few bits for good measure. Base64 is 6 bits per char, [a-z] is 4.7 bits. In those cases, most of the symbols are going to be no less than five or six characters, respectively.

      (numbers past a certain point are stringified to the same thing)

      I have 64-bit ints here, so that isn't a problem for me.

      I did consider packing the keys, but the space saved is only around 15% to 20%, and the performance penalty is considerable. Thanks.


Re: Bidirectional lookup algorithm? (Updated: further info.)
by shmem (Chancellor) on Jan 06, 2015 at 16:28 UTC

    It seems to me that you need a simplified hash data structure, where adding a tuple (hk,he) to the hash stores he as a new hk in the hash table and turns the entries of both keys into pointers to each other. That structure would in fact be a hash consisting only of keys and pointers to keys.

    This hash structure probably would consume considerably less memory space than a full featured perl hash, whilst being maybe even faster at lookup. Maybe it is feasible to set up such a thing via XS and (possibly misusing) perl macros. Looks interesting enough to try it...
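    A minimal sketch of that keys-pointing-to-keys idea in C (all names are invented; fixed table size, no deletion or resizing): each string of a pair is stored exactly once in a single open-addressing table, and each entry records the slot of its partner, so a lookup on either member follows the partner link to the other.

```c
#include <stdint.h>
#include <string.h>
#include <stdlib.h>
#include <assert.h>

#define NSLOT 1024                 /* power of two, > 2 * npairs */

typedef struct { const char *key; int partner; } slot_t;
static slot_t tab[NSLOT];          /* key == NULL marks an empty slot */

static unsigned hstr(const char *s) {          /* FNV-1a string hash */
    unsigned h = 2166136261u;
    while (*s) { h ^= (unsigned char)*s++; h *= 16777619u; }
    return h;
}

static int slot_find(const char *key) {        /* -1 if absent */
    unsigned i = hstr(key) & (NSLOT - 1);
    while (tab[i].key) {
        if (strcmp(tab[i].key, key) == 0) return (int)i;
        i = (i + 1) & (NSLOT - 1);             /* linear probe */
    }
    return -1;
}

static int slot_insert(const char *key) {      /* returns slot used */
    unsigned i = hstr(key) & (NSLOT - 1);
    while (tab[i].key) i = (i + 1) & (NSLOT - 1);
    tab[i].key = key;
    return (int)i;
}

static void pair_insert(const char *a, const char *b) {
    int ia = slot_insert(a), ib = slot_insert(b);
    tab[ia].partner = ib;                      /* mutual pointers */
    tab[ib].partner = ia;
}

static const char *pair_lookup(const char *key) {
    int i = slot_find(key);
    return i < 0 ? NULL : tab[tab[i].partner].key;
}
```

The integer side is stored here as its decimal string, which matches keeping both directions in one hash; storing it as a raw 8-byte value in the entry instead would save a little more.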

    update: I've made some meek attempts to cobble together such a thing and wormed myself through macros... but I didn't even scratch the surface, I'm afraid. For such a thing to be possible, a HE would have to be able to hold a single PV - and nothing more. But AFAICS a hash value must be an SV, which means that even for undef hash values the whole struct sv is allocated. Populating a hash with undef values makes no difference in size compared to populating it with integers.

    Which leads me to think about a huge optimization possibility regarding memory consumption. Why do we need a whole struct SV for empty value fields? Shouldn't they be subject to autovivification?

    perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
Re: Bidirectional lookup algorithm? (Updated: further info.)
by shmem (Chancellor) on Jan 05, 2015 at 18:02 UTC

    Since your strings and integers are unique - why don't you use a hash for strings and an array for integers

    #!/usr/bin/perl -l
    my %str2int;
    my @int2str;
    $i = 0;
    for ('aaaa'..'zzzz') {
        $str2int{ $_ } = ++$i;
        $int2str[$i] = $_;
    }

    and look up the strings by index into the array, and the integers by key into the hash?

      The integers are 64-bit. I do not know what they really look like, but there is a clear danger of ending up with a sparse array growing to be much, much larger than the hash.

        Right you are.

        qwurx [shmem] ~/comp/perl/tmp > perl -le '@a = 2**64; for(1..29){$a[2**$_]=$_} print `ps -o vsz= -p $$`'
        Out of memory!

        I have 8 GB RAM; 2**28 doesn't die.

Re: Bidirectional lookup algorithm? (Updated: further info.)
by andal (Hermit) on Jan 05, 2015 at 15:57 UTC

    Well, in C I would keep an array of structures (string + integer) and then add two more arrays of integer offsets into that array of structures, each sorted appropriately. Lookup speed is then O(log N).

    The above approach is simple, saves space, and still provides decent lookup speed in both directions. One could also store pointers to individual records instead of one array with all the records, but then the "indexes" may be larger on 64-bit systems, since addresses are 8 bytes while offsets can be kept to 4 bytes.
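    That layout might look like the following C sketch (a toy with invented names): one record array, two 4-byte offset arrays sorted by each key domain, and binary search for both lookup directions.

```c
#include <stdint.h>
#include <string.h>
#include <stdlib.h>
#include <assert.h>

/* One array of (string, value) records plus two arrays of offsets
 * into it: one sorted by string, one by value. O(log N) both ways. */
typedef struct { const char *str; uint64_t val; } rec_t;

static rec_t    *recs;
static uint32_t *by_str, *by_val;
static uint32_t  nrec;

static int cmp_str(const void *a, const void *b) {
    return strcmp(recs[*(const uint32_t *)a].str,
                  recs[*(const uint32_t *)b].str);
}
static int cmp_val(const void *a, const void *b) {
    uint64_t x = recs[*(const uint32_t *)a].val;
    uint64_t y = recs[*(const uint32_t *)b].val;
    return (x > y) - (x < y);
}

static void build_index(void) {
    qsort(by_str, nrec, sizeof *by_str, cmp_str);
    qsort(by_val, nrec, sizeof *by_val, cmp_val);
}

static const rec_t *find_by_str(const char *s) {   /* binary search */
    uint32_t lo = 0, hi = nrec;
    while (lo < hi) {
        uint32_t mid = (lo + hi) / 2;
        int c = strcmp(recs[by_str[mid]].str, s);
        if (c == 0) return &recs[by_str[mid]];
        if (c < 0) lo = mid + 1; else hi = mid;
    }
    return NULL;
}

static const rec_t *find_by_val(uint64_t v) {
    uint32_t lo = 0, hi = nrec;
    while (lo < hi) {
        uint32_t mid = (lo + hi) / 2;
        uint64_t x = recs[by_val[mid]].val;
        if (x == v) return &recs[by_val[mid]];
        if (x < v) lo = mid + 1; else hi = mid;
    }
    return NULL;
}

static void demo_load(void) {        /* tiny illustrative dataset */
    static rec_t    data[] = { {"cccc",3}, {"aaaa",1}, {"bbbb",2} };
    static uint32_t s[] = {0,1,2}, v[] = {0,1,2};
    recs = data; by_str = s; by_val = v; nrec = 3;
    build_index();
}
```

Each pair costs its record plus 8 bytes of offsets, with no per-bucket hash overhead, at the price of O(log N) rather than O(1) lookups.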

Re: Bidirectional lookup algorithm? (Updated: further info.)
by jmacloue (Beadle) on Jan 05, 2015 at 16:26 UTC

    It seems you are asking the wrong question - if you want fast lookups, you need a pair of indexed structures (e.g., two hashes), which take lots of RAM; if you want to use as little RAM as possible, at least one direction of lookup will have to search an unsorted list, which of course is very slow.

    I'm all with the guys suggesting a ready-made solution like SQLite (and yes, it is slow, but not as slow as walking through an unsorted array in Perl). Unless, of course, your data allows some kind of optimization - e.g., if your keys and values are sorted such that the order is the same (like the pairs (1,3), (2,4), (3,5), ...). In that case you can implement, for example, a binary search over a part of the array and beat SQLite's performance.

Re: Bidirectional lookup algorithm? (Updated: further info.)
by flexvault (Monsignor) on Jan 06, 2015 at 01:53 UTC

    BrowserUk,

    Just a thought arising from my attempt at the 'Bidirectional Lookup'.

    A technique used with 'Big Data' databases is 'database sharding'. It is usually used to split a very large database between multiple computers. But for your requirements, it could be implemented with an array of scalars: depending on the 'key/address' you could select where the start and end would be for each element of the array, so you could use 'index' on a subset of the data.

    Obviously, the larger the scalar, the longer the 'index' function will take to search it. To use 'sharding' you have to know a lot about your data, but it may improve the lookup results dramatically, which seems to be what you need to satisfy your requirements.

    Just a thought!

    Regards...Ed

    "Well done is better than well said." - Benjamin Franklin

Re: Bidirectional lookup algorithm? (Updated: further info.)
by herveus (Parson) on Jan 05, 2015 at 19:20 UTC
    Howdy!

    Can you afford the memory footprint? If you have the RAM to spare, you should be getting O(1) lookups. On a Solaris box, I took your sample code and added another letter to get to almost 12 million entries, for a memory footprint up in the 4-gigabyte class (3738M).

    I'd expect that the hash implementation will be pretty time efficient compared to the alternatives, and it sounds like that's the main driver (assuming the RAM needs are not limiting).

    yours,
    Michael

      The current hash lookups are very fast -- possibly as fast as I could hope to get -- the problem I'm trying to address is the memory consumed by the duplication of keys & values.

      As I said in the OP: "so the priority is the lookup speed -- preferably O(1) though O(logN) might be acceptable if the memory reduction was sufficient."


Re: Bidirectional lookup algorithm? (Updated: further info.)
by salva (Abbot) on Jan 06, 2015 at 01:10 UTC
    You may like to read about the HEf_SVKEY flag in perlapi and in the perl source (hv.h and hv.c). It marks keys that are stored as SVs instead of raw strings (char *) inside the HV.

    I have not investigated it deeply enough to know if it is usable for your purposes.

Re: Bidirectional lookup algorithm? (Updated: further info.)
by polettix (Vicar) on Jan 05, 2015 at 22:37 UTC
    While I was reading through the question and some of the answers, I couldn't stop thinking that a redis data structure server might be something to investigate. brian_d_foy wrote something about it lately and might be a handy starter.

    And yes, I might be totally off track!

    perl -ple'$_=reverse' <<<ti.xittelop@oivalf

    Io ho capito... ma tu che hai detto?
      See http://redis.io/topics/benchmarks (esp. Pitfalls and misconceptions); it's too slow for this problem (it's IPC)
      $ redis-benchmark -t set -r 100000 -n 1000000
      ====== SET ======
        1000000 requests completed in 13.86 seconds
        50 parallel clients
        3 bytes payload
        keep alive: 1

      A simple perl program adds 1 million key/value pairs to a hash in 2.6875 seconds on a laptop from 2006

      The OP doesn't need concurrency, so redis isn't needed

        Although in the OP there were no specific quality parameters about space/time constraints, my understanding is that space efficiency improvements could come at the expense of (some) speed, hence the suggestion.

        Any benchmark should probably consider both aspects.


      With due deference to Brian, I must be candid in expressing my extreme doubts as to whether “an entirely separate server” could possibly be apropos in the present case.   (I understand the situation that he is speaking to, and I don’t think that the present situation qualifies.)   To my admittedly simple-minded way of thinking, the present problem is a problem that quite-obviously could quite-readily be solved by the use of two in-memory indexes to an in-memory data structure ... a-n-d / o-r by two indexes on an SQL table.

      To arrive at an appropriate solution to this problem, we do not need to climb the mountain of exotica.   We can satisfy ourselves just as well with the perfectly mundane.   The only decision that we must now arrive-at is this:   “exactly what are the ‘ruling constraints’ to this problem?”   Then, “what is the simplest solution that will satisfy it?”

      The only constraint that has so-far presented itself is ... the availability of RAM.   However, the validity of this constraint as-presented rests entirely upon the supposition that “the entire data-structure must reside completely in (physical(!) ...) RAM.”   If this were the case, then the answer would consist of simple arithmetic.   However, I perceive that this is not the only practicable solution.   This problem can be solved in a satisfactory manner regardless of memory-size.   (We’ve been managing such things for fifty-plus years.)

      “In-memory” approaches are unquestionably fast ... i-f “as much memory as one might desire” is, in fact, available ... on production machines as well as developer boxes ... “without delay.”   In the real world, alas, this is not the case, and therefore we are obliged to resort more-or-less to files.   (And, once we do that, our physical-RAM working-set requirements are dramatically reduced ... thereby re-defining the problem entirely.)

      The [only (!) ... ] “in-memory” solution to this problem is actually trivial:   “you must define two indexes to the data-structure, in addition to the data-structure itself, and therefore you must possess enough RAM to simultaneously hold all three and to access all three of them without any fear of virtual-memory paging delays.   Either you do have such luxury ... o-r ... you don’t.   Nothing more need be debated nor discussed.

      If “an embarrassment of RAM-riches” cannot be 100% counted-upon, then one is pushed rather-decisively toward the use of a database ... and thus upon an approach that must regard every query to that structure as being “more or less costly.”   The “cost-equals-zero proposition” of a pure-RAM solution is eliminated ... as are all of its theoretical advantages.   “Welcome to the Real World ...”

        “exactly what are the ‘ruling constraints’ to this problem?”

        The ruling constraints are clearly and concisely laid out in the OP; you're either too lazy to read them properly, or too dumb to understand them.


Re: Bidirectional lookup algorithm? (Updated: further info.)
by flexvault (Monsignor) on Jan 06, 2015 at 00:27 UTC

    BrowserUk,

    I came into this late, but I think using Perl's great 'index' function may give you the desired results. I copied your scripts to get a start, and then used a simple scalar to build the 'Bidirectional lookup' functions. I'm sure they can be improved!

    Here goes:

    #!/usr/local/bin/myperl
    use strict;
    use warnings;
    use Time::HiRes qw( gettimeofday );
    use Devel::Size qw( total_size );

    my ( $i, %intByStr, %strByInt, $Both );
    $i = 1;
    $intByStr{ $_ } = $i++ for 'aaaa' .. 'zzzz';
    $strByInt{ $intByStr{ $_ } } = $_ for keys %intByStr;
    print "Hash1: ", total_size( \%intByStr ),
          "\nHash2: ", total_size( \%strByInt ), "\n";

    our $ksep = chr(253);
    our $isep = chr(254);
    foreach my $key ( keys %intByStr ) {
        $Both .= $ksep . $key . $isep . $intByStr{ $key } . "\n";
    }
    print "\$Both: ", length( $Both ), "\n\n";

    my $testkey  = "zabc";
    my $testaddr = $intByStr{ $testkey };

    my $stime = gettimeofday;
    my $tkey  = get_key( $testaddr );
    print "\tget_key:  ", gettimeofday - $stime, " seconds\n";

    $stime = gettimeofday;
    my $taddr = get_addr( $tkey );
    print "\tget_addr: ", gettimeofday - $stime, " seconds\n";
    print "\t|$tkey|$taddr|\n";
    exit;

    sub get_key {
        our( $ksep, $isep );
        my $key = shift;
        my $result;
        my $si = index( $Both, "$isep$key\n" );
        if ( $si >= 0 ) {
            my $sj = rindex( $Both, $ksep, $si );
            if ( $sj >= 0 ) {
                $result = substr( $Both, $sj+1, $si - ($sj+1) );
            }
        }
        return ( $result );
    }

    sub get_addr {
        our( $ksep, $isep );
        my $key  = shift;
        my $result;
        my $srch = "$ksep$key$isep";
        my $len  = length( $srch );
        my $si = index( $Both, $srch );
        if ( $si >= 0 ) {
            my $sj = index( $Both, "\n", $si + $len );
            if ( $sj >= 0 ) {
                $result = substr( $Both, $si+$len, $sj - ($si+$len) );
            }
        }
        return ( $result );
    }
    1;
    __END__
    ~/tmp:> perl BUKtest.plx
    =========================================================
    Hash1: 37741336
    Hash2: 43113919
    $Both: 5829583

            get_key:  0.00113105773925781 seconds
            get_addr: 0.000988960266113281 seconds
            |zabc|439429|
    =========================================================
    Actual Time using Linux Debian 'time' command w/perl.5.14.2:
    real    0m1.591s
    user    0m1.560s
    sys     0m0.032s
    I didn't build my $Both scalar from the raw input, so that I could verify I did it correctly, but $Both doesn't need to be in any order for this to work.

    Please note the subs return 'undef' if the key or address is not in $Both. Also I would use 'pack' on the address, but that may or may not save space depending on the actual data supplied.

    Regards...Ed

    "Well done is better than well said." - Benjamin Franklin

      If you look at my code in Re^2: Bidirectional lookup algorithm? (Updated: further info.), using a hash does lookups by name in 0.976562/3655808 = 0.000000267 secs; and by addr in 1.065514/3655808 = 0.000000291 secs.

      C:\test>1112165-hash -S=4
      Start: mem: 7,972 K
      3655808
      Hash built:   mem: 651,160 K
      Fetch by Name took 0.976562 seconds
      Fetch by Addr took 1.065514 seconds
      Hash queried: check mem: 651,184 K

      And using an in-memory SQLite DB does them ~30 times slower: by name in 30.339751/3655808 = 0.0000083 secs; and by addr in 30.587817/3655808 = 0.0000084 secs, for a memory reduction of 5/6ths.

      C:\test>1112165-sql -S=4
      Start: mem: 9,668 K
      DB created:   mem: 11,084 K
      DB populated: mem: 88,800 K
      DB indexed:   mem: 119,372 K
      Fetch by Name took 30.339751 seconds
      Fetch by Addr took 30.587817 seconds
      DB queried:   mem: 119,400 K

      And using a disk-based SQLite DB does them ~57 times slower: by name in 6.867170/456976 = 0.0000150 secs; and by addr in 6.949115/456976 = 0.0000152 secs, for a memory reduction of ~3/5ths:

      C:\test>1112165-sql
      Start: mem: 9,620 K
      DB created:   mem: 11,464 K
      DB populated: mem: 29,788 K
      DB indexed:   mem: 45,540 K
      Fetch by Name took 6.867170 seconds
      Fetch by Addr took 6.949115 seconds
      DB queried:   mem: 45,564 K

      The (single, and possibly non-representative, depending upon where in the strings the chosen keys fall (beginning, middle or end)) timings you've posted are 800x slower again than the in-memory DB, and nearly 5000 times slower than the hashes.

      The net result would be that my current 8 hour runs would require 4.5 years. And that's before I added the extra data that the space reduction would permit.



        BrowserUk,

        When programming my database engine, I found that 'index' did best with smaller scalars (the reason for mentioning 'database sharding'). My times were close to the worst case (the worst case is 'not found').

        I found that with an 8-byte key, the 'index' technique I used was 4.7 times faster searching 512-byte records than searching 1,024-byte records. Naturally, 'index' searching a 5MM+ byte record is going to be really slow!

        My attempt was to give you another approach, not a final solution.

        Good Luck...Ed

        "Well done is better than well said." - Benjamin Franklin

Re: Bidirectional lookup algorithm?
by QM (Parson) on Jan 08, 2015 at 13:31 UTC
    It's too bad there's no easy mechanism for creating a hash value pointing to a hash key, instead of another hash value.

    Does it buy anything to create a hash with keys of some known hash function, and values as indexes pointing into an array? There's one more lookup, but into an array, which should be much cheaper than another hash lookup.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Devel::Size qw(size total_size);

    my $num = 0;
    my %hashed_keys;
    my @values = ();

    for my $word ('aaaa'..'zzzz') {
        # Store $word -> $num
        my $hash_word = some_hash($word);
        my $hash_word_index = scalar @values;
        $hashed_keys{$hash_word} = $hash_word_index;
        push @values, $num;

        # Store $num -> $word
        my $hash_num = some_hash($num);
        my $hash_num_index = scalar @values;
        $hashed_keys{$hash_num} = $hash_num_index;
        push @values, $word;

        ++$num;
    }

    sub some_hash {
        return shift;
    }

    my $total_size;
    for my $var (\%hashed_keys, \@values) {
        my $tsize = total_size($var);
        printf "%s\n", total_size($var);
        $total_size += $tsize;
    }
    print "total: $total_size\n";
    exit;

    Without a real hash function, this gives:

    43383174
    29785044
    total: 73168218

    ...which is only slightly smaller than your original 2-hash result. Mind you, I'm on a 32-bit perl, so that could be the difference.

    (Caveats: I've not used a real hash function. I've not accounted for hash collisions.)

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

      Essentially, that is what I am doing.

      One array of structs contains the paired data. This is ordered by hashing one of the values.

      A parallel array of ints contains indexes into the first array, ordered by hashing the second value.

      Once I've fixed a couple of edge cases I know about, tested it some more, and probably moved it from linear to hopscotch probing, I'll post the code in this thread.
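      Pending that post, here is a guess in C at the shape described: one open-addressing table of records placed by hashing the string, plus a parallel table of 4-byte indexes placed by hashing the value. Linear probing only (the post mentions moving to hopscotch), and every name here is invented.

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

#define N 1024                        /* power of two, > 2 * pairs */

typedef struct { const char *sym; uint64_t val; } pair_t;
static pair_t  recs[N];               /* slot used iff sym != NULL  */
static int32_t byval[N];              /* index into recs, -1 empty  */

static unsigned hs(const char *s) {   /* FNV-1a on the string       */
    unsigned h = 2166136261u;
    while (*s) { h ^= (unsigned char)*s++; h *= 16777619u; }
    return h & (N - 1);
}

static unsigned hv(uint64_t v) {      /* Fibonacci hashing on value */
    return (unsigned)((v * 0x9E3779B97F4A7C15ull) >> 54) & (N - 1);
}

static void bd_init(void) {
    for (int i = 0; i < N; i++) byval[i] = -1;
}

static void bd_insert(const char *sym, uint64_t val) {
    unsigned i = hs(sym);                      /* place the record   */
    while (recs[i].sym) i = (i + 1) & (N - 1);
    recs[i].sym = sym; recs[i].val = val;
    unsigned j = hv(val);                      /* place its index    */
    while (byval[j] >= 0) j = (j + 1) & (N - 1);
    byval[j] = (int32_t)i;
}

static const pair_t *bd_bysym(const char *sym) {
    for (unsigned i = hs(sym); recs[i].sym; i = (i + 1) & (N - 1))
        if (strcmp(recs[i].sym, sym) == 0) return &recs[i];
    return 0;
}

static const pair_t *bd_byval(uint64_t val) {
    for (unsigned j = hv(val); byval[j] >= 0; j = (j + 1) & (N - 1))
        if (recs[byval[j]].val == val) return &recs[byval[j]];
    return 0;
}
```

The value-side index costs only 4 bytes per pair on top of the record itself, which is where the memory saving over two full hashes comes from.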


Re: Bidirectional lookup algorithm? (Updated: further info.)
by sundialsvc4 (Abbot) on Jan 05, 2015 at 13:34 UTC

    I’m not sure that I am hearing all of the potential decision factors here.   How much memory does this machine have; how much memory can your application be assured of having access to without paging; how many keys will there be?   I would, as soon as possible, construct a worst-case test case and run it on a production machine.   Random data, realistic volume, and most importantly, realistic request-distributions.

    “In-memory” solutions look very attractive on developer machines and at less-than-production loads.   But virtual memory really should be looked-upon as a disk resource, as well as a preemptive-priority demand for another limited resource (RAM).   Solutions that were “very efficient” on developer machines can become 250,000-kilo elephants with enormous working-set sizes.   They apply excessive pressure on other processes, and suffer the most from the pressure that they, in effect, exert most upon themselves.   When they fall, they fall hard.

    On the other hand, “file-based” solutions are steady.   All disk-I/O is routinely buffered, and operating systems will stuff otherwise-unused memory with caches.   When memory pressure appears, they are the first to go, but unless it does appear, they’re nearly as efficient as memory, especially when a file is memory-mapped.   (Then, the data is mapped using page/segment tables but it is not treated as high-priority backing store.)   When you treat the resource as you would treat a file, e.g. requesting data in chunks, they hold up even better under stress.

    My gut tells me that a traditional database, e.g. SQLite, without specifying a memory-mapped file, will turn out to be the most efficient solution for you ... especially if you can request the data that you need in groups of a few hundred keys at a time.   While it might not be “fastest” in the short run on your dev box, “steady, steady wins the race.”

Re: Bidirectional lookup algorithm? (Updated: further info.)
by sundialsvc4 (Abbot) on Jan 06, 2015 at 18:51 UTC

    I-F the entire data structure can fit into memory, the solution to this task is obvious.   (Although, if there is known to be “no paging,” even a simple search of an array just might be good enough.)   You can “debate differences of a few microseconds” any way you wish.   We already know that Perl’s data structure implementations are fast.

    Really, it’s all going to come down to ... the production environment.   Will this program, on that machine, and surrounded by all of its other active processes, be able to operate continuously with all of its virtual-memory pages in RAM?   Or, not?

    And you’d better sincerely hope that the answer is always “yes,” because if not, a program designed in this way could stagger like a drunken elephant.   First of all, it’s going to have a large working-set size.   Then, when it does start having pages stolen from it, it’s going to continue making r-a-n-d-o-m memory accesses throughout its l-a-r-g-e virtual memory space.   The probability of such a process experiencing a page fault, or many page faults, is quite high ... and every page-fault that causes disk access will incur a wait of several milliseconds.   Milliseconds add up very fast, to minutes and hours.   “Big working-set” and “pure random” do not play well together in production, and tests run on unconstrained, memory-plentious developer machines will only mislead you.   (You find out when you deploy.)

    These are all external influences upon the program’s behavior which have nothing to do with Perl, nor with any programming language that the algorithm might be written in.   The vulnerability lies in the algorithm, itself.   You need to reduce the exposure to page faults, or make each page fault count for more.   Even if a database is used, you want to find ways to make a single search count for more than just the satisfaction of a single request.   You might for instance find it advantageous to load a bunch of search-keys into a temporary table then do a single SQL query with a JOIN.   Iterate over the resulting cursor.   Old-hat stuff.   Slower.   Yet ... graceful, predictable degradation.

      Mike Robertson, it isn't reasonable or polite to keep spamming BrowserUk like this. Why is sundialsvc4 such a constant spammer? It's unprofessional to be a spammer, Mike Robertson.

        Yes, that is my name.   But, pardon me, you don’t seem to be using yours.   Wonder why.   Nevertheless, all of this is entirely off-topic.   I believe that someone asked a question about bidirectional lookup algorithms, yes?

        Returning to the topic, then, if you really want to attempt to use a single indexing scheme to locate two keys, the only way that I know of to do this is to generate a synthetic key, within the same domain of values, for each key that you wish to store.   You then employ a hashref keyed by this synthetic-key.   Each bucket in this hashref points to an arrayref of references.   You insert a reference to the (single ...) record into this hashref/arrayref structure multiple times, one for each synthetic key.   Searching now involves retrieving the arrayref for either key, and iterating through it to find the actual record(s) that you want.   The synthetic-key and constrained domain for that key is merely intended to reduce the size of the hashref:   you will still have a lot of accompanying data-structures ... probably more-or-less equivalent to having two separate hashrefs.
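        The synthetic-key scheme described above might look something like this in Perl.   A hedged sketch only: the `synth` folding function, its modulus, and the record layout are all invented for illustration, and it assumes (as the OP stated) that no string key ever stringifies identically to an integer key.

```perl
use strict;
use warnings;

# Hypothetical synthetic-key function: fold any key, string or
# integer, into one small common domain (here 0 .. 9972).
sub synth {
    my ( $key ) = @_;
    my $h = 0;
    $h = ( $h * 31 + ord ) % 9973 for split //, "$key";
    return $h;
}

my %index;    # synthetic key => arrayref of refs to shared records

# Insert ONE record, referenced from two buckets: one per key.
sub insert {
    my ( $str, $int ) = @_;
    my $rec = [ $str, $int ];                  # the single stored record
    push @{ $index{ synth( $str ) } }, $rec;
    push @{ $index{ synth( $int ) } }, $rec;
}

# Lookup by either key: fetch the bucket, scan for the matching record.
sub lookup {
    my ( $key ) = @_;
    for my $rec ( @{ $index{ synth( $key ) } // [] } ) {
        return $rec if $rec->[0] eq $key or $rec->[1] eq $key;
    }
    return undef;
}

insert( 'aaaa', 1 );
insert( 'aaab', 2 );

print lookup( 'aaab' )->[1], "\n";   # integer from string
print lookup( 2 )->[0], "\n";        # string from integer
```

        Each `[ $str, $int ]` record exists once, so the payload is not duplicated; but as noted above, the index structure itself (buckets, arrayrefs, references) still costs roughly what two plain hashes would.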

        = = = = =

        But the over-arching problem remains the same:   memory must be presumed to be unlimited and resident.   Either you have that or you don’t.   This strategy continues to be “random.”   It does not exhibit “locality of reference.”   It will produce a very high working-set size that, under production load conditions (really, any sort of virtual-memory stress at all ...) will perform prohibitively poorly, even though it may have seemed incredibly-clever on the developer’s machine, which was never actually “paging.”

        (I like to give developers machines that are memory-tight.   It forces them to write better code, because now they feel it, too.)

        I confess to a little bit of déjà vu here, because I was once called upon to undo a similar program.   Oh, the thing was a gloriously clever mess.   There was “cleverness” running all through it, and memory-wise it was a big, fat pig.   I pretty much had to re-write the whole thing.   It just couldn’t run.   It staggered, as I said above, like a drunken elephant when page faults started hitting it, and of course page-fault behavior affects everything on the machine.   (Other things, too ... “oh no, don’t call a subroutine!   Subroutines are ‘inefficient!’   It’s ‘faster’ to repeat yourself in-line!”)   A big, fat mess.   But it sure did look clever to the developer, who by then had conveniently slipped out the door.   Code needs to be fast enough, under production conditions, and it must be maintainable.   That’s all.   Don’t be “too clever by half.”   Please.   Pretty please.

Node Type: perlquestion [id://1112131]
Approved by marto
Front-paged by davido