Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked
 
PerlMonks  

Re: Bidirectional lookup algorithm? (Updated: further info.)

by flexvault (Monsignor)
on Jan 06, 2015 at 00:27 UTC ( [id://1112269]=note: print w/replies, xml ) Need Help??


in reply to Bidirectional lookup algorithm? (Updated: further info.)

BrowserUk,

I came into this late, but I think using Perl's great 'index' function may give you the desired results. I copied your scripts to get a start, and then used a simple scalar to build the 'Bidirectional lookup' functions. I'm sure they can be improved!

Here goes:

#!/usr/local/bin/myperl use strict; use warnings; use Time::HiRes qw( gettimeofday ); use Devel::Size qw(total_size); my ( $i, %intByStr, %strByInt, $Both ); $i = 1; $intByStr{ $_ } = $i++ for 'aaaa' .. 'zzzz';; $strByInt { $intByStr{ $_ } } = $_ for keys %intByStr;; print "Hash1: ",total_size( \%intByStr ), "\nHash2: ", total_size( + \%strByInt ), "\n";; our $ksep = chr(253); our $isep = chr(254); foreach my $key ( keys %intByStr ) { $Both .= $ksep . $key . $isep . $intByStr{ $key } . "\n"; } print "\$Both: ",length( $Both ), "\n\n";; my $testkey = "zabc"; my $testaddr = $intByStr{ $testkey }; my $stime = gettimeofday; my $tkey = get_key( $testaddr ); print "\tget_key: ", gettimeofday - $stime, " seconds\n"; $stime = gettimeofday; my $taddr = get_addr( $tkey ); print "\tget_addr: ", gettimeofday - $stime, " seconds\n"; print "\t|$tkey|$taddr|\n"; exit; sub get_key { our( $ksep, $isep ); my $key = shift; my $result; my $si = index( $Both, "$isep$key\n" ); if ( $si >= 0 ) { my $sj = rindex ( $Both, $ksep, $si ); if ( $sj >= 0 ) { $result = substr( $Both, $sj+1, $si - ($sj+1 +) ); } } return ( $result ); } sub get_addr { our( $ksep, $isep ); my $key = shift; my $result; my $srch = "$ksep$key$isep"; my $len = length( $srch ); my $si = index( $Both, "$ksep$key$isep" ); if ( $si >= 0 ) { my $sj = index ( $Both, "\n", $si + $len ); if ( $sj >= 0 ) { $result = substr( $Both, $si+$len, $sj - ($s +i+$len) ); } } return ( $result ); } 1; __END__ ~/tmp:> perl BUKtest.plx ========================================================= Hash1: 37741336 Hash2: 43113919 $Both: 5829583 get_key: 0.00113105773925781 seconds get_addr: 0.000988960266113281 seconds |zabc|439429| ========================================================= Actual Time using Linux Debian 'time' command w/perl.5.14.2: real 0m1.591s user 0m1.560s sys 0m0.032s
I didn't build my $Both scalar from the raw input, so that I could verify I did it correctly, but $Both doesn't need to be in any order for this to work.

Please note the subs return 'undef' if the key or address is not in $Both. Also I would use 'pack' on the address, but that may or may not save space depending on the actual data supplied.

Regards...Ed

"Well done is better than well said." - Benjamin Franklin

Replies are listed 'Best First'.
Re^2: Bidirectional lookup algorithm? (Updated: further info.)
by BrowserUk (Patriarch) on Jan 06, 2015 at 02:17 UTC

    If you look at my code in Re^2: Bidirectional lookup algorithm? (Updated: further info.), using a hash does lookups by name in: 0.976562/3655808 = 0.000000267 secs; and by addr in 1.065514/3655808 = 0.000000291.

    C:\test>1112165-hash -S=4 Start: mem: 7,972 K 3655808 Hash built: mem: 651,160 K Fetch by Name took 0.976562 seconds Fetch by Addr took 1.065514 seconds Hash queried: check mem: 651,184 K

    And using an in-memory sqlite DB does them ~30 times slower at: by name in: 30.339751/3655808 = 0.0000083; and by addr in: 30.587817/3655808 = 0.0000084 for a memory reduction of 5/6ths.

    C:\test>1112165-sql -S=4 Start: mem: 9,668 K DB created: mem: 11,084 K DB populated: mem: 88,800 K DB indexed: mem: 119,372 K Fetch by Name took 30.339751 seconds Fetch by Addr took 30.587817 seconds DB queried: mem: 119,400 K

    And using a disk-based sqllite db does them ~57 times slower at: by name in: 6.867170/456976 = 0.0000150; and by addr in: 6.949115/456976 = 0.0000152 seconds for a memory reduction of ~3/5ths:

    C:\test>1112165-sql Start: mem: 9,620 K DB created: mem: 11,464 K DB populated: mem: 29,788 K DB indexed: mem: 45,540 K Fetch by Name took 6.867170 seconds Fetch by Addr took 6.949115 seconds DB queried: mem: 45,564 K

    The (single, and possible non representative depending upon where in the strings the chosen keys are (beginning middle or end)) timings you've posted are 800x slower again than the memoryDB, or nearly 5000 times slower than the hashes.

    The net result would be that my current 8 hour runs would require 4.5 years. And that's before I added the extra data that the space reduction would permit.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      BrowserUk,

      When programming my database engine, I found that 'index' did best with smaller scalars ( reason for mentioning 'database sharding' ). My times were close to the worst case ( worst case is 'not found' ).

      I found that with a 8-byte key, the 'index' technique I used was 4.7 times faster searching 512 byte records than searching 1,024 byte records. Naturally, 'index' searching a 5MM+ byte record is going to be real slow!

      My attempt was to give you another approach, not a final solution.

      Good Luck...Ed

      "Well done is better than well said." - Benjamin Franklin

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1112269]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others scrutinizing the Monastery: (3)
As of 2024-04-19 22:28 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found