PerlMonks  

Re: Bidirectional lookup algorithm? (try perfect hashing)

by oiskuu (Hermit)
on Jan 07, 2015 at 17:09 UTC ( [id://1112524]=note )


in reply to Bidirectional lookup algorithm? (Updated: further info.)

You could lump the longer symbols together, à la join qq(\0), grep length() > 8, @syms; access them via pointer or offset. Keep the short ones directly in the 8-byte slots of the symbol table (a union). This should get you down to ~10 bytes per key.

To index the tables, use minimal perfect hashing. There's the CMPH library, which appears quite simple to use, though you'll need to work on the glue: the cmph bindings via Perfect::Hash are incomplete, and it's not on CPAN.

Anyway, with perfect hashing you need only a few extra bits to do the indexing. Various methods are available: CHM needs ~8 bytes per key but preserves order; BRZ is about ~8 bits per key and uses external memory while building the tables, so you're not memory-constrained; BDZ is just under 3 bits per key. Since you're already storing the 8-byte keys, these few extra bits aren't much.

Performance estimates: about 3 sec per million keys when constructing the mph structures; about 0.3 sec per million lookups (~0.3 usec each), compared to ~0.15 usec per lookup with direct hashes.

So expect roughly 3 times slower performance at about 10 times the memory density. Whether the tradeoff is worth the effort, I'll leave for you to decide.

Re^2: Bidirectional lookup algorithm? (try perfect hashing)
by oiskuu (Hermit) on Jan 11, 2015 at 02:52 UTC

    Well, I experimented a little further with CMPH. It might be rough around the edges (e.g. temp files aren't cleaned up or created safely), but the good thing is it basically works. First, I generated some data as follows... that took a while.

    #! /bin/sh
    perl -E '$/ = \36; for (1 .. 200e6) { ($v,$s) = unpack q<QA*>,<>;
        say +qq($_ @{[$s =~ y/a-zA-Z//cdr || redo]} $v) }' /dev/urandom |
    sort -k2 | perl -ape '$_ x= $F ne $F[1], $F = $F[1]' |
    sort -k3 | perl -ape '$_ x= $F ne $F[2], $F = $F[2]' |
    sort -n | perl -pe 's/\S*\s*//'
    If many longer keys are present, it might be better to go with (4-byte) offset records. Simple and compact, but there's one more indirection, hence slower access.
    $ perl -anE '$h[length$F[0]]++ }{ say "@h";' data
    52 2704 140608 7190883 35010259 35855576 28751420 19240344 10899542
    5278104 2202814 795438 249757 68269 16155 3388 640 89 12 1

    Then a small test. You can see I didn't bother mapping value indexes but used an order-preserving hash instead. Memory usage comes to about 26 bytes/pair; double that while building the hashes.

    Edit. Same test with updated code; higher -O used.

    [  1.303844] data ALLOCATED; tab = 160002048, ss = 14680064 (10000000 pairs)
    [  5.478031] built BDZ for syms
    [  2.873171] inplace REORDER
    [ 20.015568] built CHM for vals
    [  0.000028] mph size when packed: syms = 3459398, vals = 83600028
    [  0.522367] fgets loop; lines=10000000
    [  1.195339] fgets+strtoul; lines=10000000
    [  2.235220] SYMS fetch; found=10000000
    [  2.940386] VALS fetch; found=10000000
    [  2.709484] VRFY sym to val; matched=10000000
    [  4.258673] VRFY two-way; matched=10000000
    Old output.

      oiskuu, thanks for pursuing this and posting your findings. Unfortunately, I do not understand what I am reading.

      1. You read 200 million 36-byte records from /dev/urandom;

        Then you unpack them as: a) a 64-bit uint; and b) the rest as a string, from which you remove any non-alphabetic characters (with redo if nothing is left);

        And print the string and int to stdout;

        Which you pipe through: sort & perl & sort & perl & sort & perl to ???

      2. You then count the lengths of the first fields from a file called data?
      3. You then run an executable called a.out, supplying a number and the data file?

        Which produces some output?


Re^2: Bidirectional lookup algorithm? (try perfect hashing)
by oiskuu (Hermit) on Jan 13, 2015 at 02:21 UTC

    Ok. Here's a crude demo in case someone is interested in doing benchmarking comparisons or whatever. The program is pure C, tested on 64-bit Linux. Give it two arguments: Nkeys and Filename. The file should contain sym-value pairs, both unique in their domains (space-separated, one pair per line).

    Update. Some additional notes regarding the program.
