PerlMonks  

Re^2: Bidirectional lookup algorithm? (try perfect hashing)

by oiskuu (Hermit)
on Jan 11, 2015 at 02:52 UTC ( [id://1112876]=note )


in reply to Re: Bidirectional lookup algorithm? (try perfect hashing)
in thread Bidirectional lookup algorithm? (Updated: further info.)

Well, I experimented a little further with CMPH. It might be rough around the edges (e.g. temporary files aren't cleaned up or created safely), but the good thing is that it basically works. First, I generated some data as follows, which took a while.

#! /bin/sh
perl -E '$/ = \36; for (1 .. 200e6) { ($v,$s) = unpack q<QA*>,<>; say +qq($_ @{[$s =~ y/a-zA-Z//cdr || redo]} $v) }' /dev/urandom |
sort -k2 | perl -ape '$_ x= $F ne $F[1], $F = $F[1]' |
sort -k3 | perl -ape '$_ x= $F ne $F[2], $F = $F[2]' |
sort -n | perl -pe 's/\S*\s*//'
In case many longer keys are present, it might be better to go with (4-byte) offset records. Simple and compact, but there's one more indirection, hence slower access.
$ perl -anE '$h[length$F[0]]++ }{ say "@h";' data
52 2704 140608 7190883 35010259 35855576 28751420 19240344 10899542 5278104 2202814 795438 249757 68269 16155 3388 640 89 12 1

Then a small test. You can see I didn't bother mapping value indexes but used an order-preserving hash instead. Memory usage comes to about 26 bytes/pair. Double that while building the hashes.

Edit. Same test with updated code; higher -O used.

[ 1.303844] data ALLOCATED; tab = 160002048, ss = 14680064 (10000000 pairs)
[ 5.478031] built BDZ for syms
[ 2.873171] inplace REORDER
[ 20.015568] built CHM for vals
[ 0.000028] mph size when packed: syms = 3459398, vals = 83600028
[ 0.522367] fgets loop; lines=10000000
[ 1.195339] fgets+strtoul; lines=10000000
[ 2.235220] SYMS fetch; found=10000000
[ 2.940386] VALS fetch; found=10000000
[ 2.709484] VRFY sym to val; matched=10000000
[ 4.258673] VRFY two-way; matched=10000000

Re^3: Bidirectional lookup algorithm? (try perfect hashing)
by BrowserUk (Patriarch) on Jan 11, 2015 at 09:46 UTC

    oiskuu. Thanks for pursuing this and posting your findings. Unfortunately, I do not understand what I am reading.

    1. You read 200 million 36-byte records from /dev/urandom;

      Then you unpack them as: a) a 64-bit uint; and b) the rest as a string, from which you remove any non-alpha characters (with redo if nothing is left);

      And print the string and int to stdout;

      Which you pipe through: sort & perl & sort & perl & sort & perl to ???

    2. You then count the lengths of the first fields from a file called data?
    3. You then run an executable called a.out supplying a number and the data file?

      Which produces some output?


