PerlMonks  

Loading 283600 records (WordNet)

by remiah (Hermit)
on Sep 22, 2012 at 11:37 UTC (#995089=perlquestion)
remiah has asked for the wisdom of the Perl Monks concerning the following question:

Hello, Perl Monks.

This is the "synlink" table of WordNet. It looks like this:

synset1    synset2    link
07125096-n 07128527-n hype
07126228-n 07109847-n hype
...
I wanted an overview of how synsets link to each other. That requires recursive calls, and querying SQLite on every call was too costly, so I load the table into a hash like this:
'07125096-n' => [ ['07128527-n', 'hype'], ..... ]
Loading the hash from the database took about 3.6 seconds. Loading from a text file turned out to be much faster, at about 1.2 seconds. I am almost satisfied with that, but I would like to ask for the monks' wisdom.
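As an illustration of the recursive overview that motivates this hash, here is a minimal sketch of a depth-limited walk. Only the hash layout comes from the post; the sub name `walk_links`, the depth limit, and the sample records are invented for illustration.

```perl
use strict;
use warnings;

# Hash layout from the post: synset => [ [ linked_synset, link_type ], ... ]
# The records below are invented sample data, not real WordNet content.
my %links = (
    'A-n' => [ [ 'B-n', 'hype' ], [ 'C-n', 'hype' ] ],
    'B-n' => [ [ 'D-n', 'hype' ] ],
    'D-n' => [ [ 'A-n', 'hype' ] ],    # a cycle, to exercise the guard
);

# Depth-limited recursive walk. Returns [ synset, depth ] pairs in visit
# order; %$seen keeps the walk from following cycles forever.
sub walk_links {
    my ( $links, $synset, $max_depth, $depth, $seen ) = @_;
    $depth //= 0;
    $seen  //= {};
    return () if $depth > $max_depth || $seen->{$synset}++;
    my @visited = ( [ $synset, $depth ] );
    for my $edge ( @{ $links->{$synset} || [] } ) {
        push @visited,
          walk_links( $links, $edge->[0], $max_depth, $depth + 1, $seen );
    }
    return @visited;
}

# Print an indented outline of everything reachable within 8 levels.
printf "%s%s\n", '  ' x $_->[1], $_->[0] for walk_links( \%links, 'A-n', 8 );
```

Every step of the walk is one hash lookup, which is why paying a database round trip per step adds up quickly.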

My questions are: 1) Is there a faster way? 2) If you have experience with WordNet, please give me advice.

My fastest script is the simple one below. (The commented-out lines use my own timing module, which wraps Time::HiRes.)

use strict;
use warnings;
#use Data::Dumper;
#use MyTime;

my $href = {};
#my $timeinf = MyTime->new();
#$timeinf->push('before open');
open(my $fh, "<", "04.txt") or die $!;
while (<$fh>) {
    chomp;
    push @{ $href->{ substr($_, 0, 10) } }, [ substr($_, 10, 10), substr($_, 20) ];
}
close $fh;
#$timeinf->push('after load');
#print $timeinf->as_string;
print "count=", scalar keys %{$href}, "\n";
#print "test item:", Dumper $href->{'01785341-a'}, "\n\n";
I put a sample text file here.

regards.

Re: Loading 283600 records (Updated)
by BrowserUk (Pope) on Sep 22, 2012 at 13:30 UTC

    Try:

    my %hash;
    my @rec;
    push @{ $hash{ $rec[0] } }, [ $rec[ 1 ], $rec[ 2 ] ]
        while @rec = split '(?<=-[a-z])', <>;

    Or 25% better still:

    my %hash;
    my @rec;
    @rec = unpack( 'a10a10a4', $_ ), push @{ $hash{ $rec[0] } }, [ @rec[ 1, 2 ] ]
        while <>;

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    RIP Neil Armstrong


      Thanks for the reply, BrowserUk.

      I tried them, and the results are below.

                s/iter 02_split1 04_unpack 03_split2 01_substr
      02_split1   6.34        --      -34%      -41%      -57%
      04_unpack   4.17       52%        --      -11%      -35%
      03_split2   3.71       71%       12%        --      -27%
      01_substr   2.70      134%       54%       37%        --
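The table above has the shape of output from `cmpthese` in the core Benchmark module, although the poster's actual test code is not reproduced here. As a rough sketch of how such a comparison could be set up, the following benchmarks a substr loader against an unpack loader. The synthetic data, the sub names, and the record count are assumptions made for illustration, so the timings will not match the thread's numbers.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic stand-in for the data file: 24-character fixed-width records
# (10 + 10 + 4), generated here because the real file is not included.
my $data = join '',
  map { sprintf "%08d-n%08d-nhype\n", $_ % 500, $_ } 1 .. 20_000;

sub load_substr {
    my %h;
    open my $fh, '<', \$data or die $!;    # in-memory filehandle
    while (<$fh>) {
        chomp;
        push @{ $h{ substr $_, 0, 10 } },
          [ substr( $_, 10, 10 ), substr( $_, 20 ) ];
    }
    return \%h;
}

sub load_unpack {
    my %h;
    open my $fh, '<', \$data or die $!;
    while (<$fh>) {
        my ( $k, @v ) = unpack 'a10a10a4', $_;
        push @{ $h{$k} }, \@v;
    }
    return \%h;
}

# A negative count means "run each candidate for at least that many CPU seconds".
cmpthese( -1, { '01_substr' => \&load_substr, '04_unpack' => \&load_unpack } );
```

Reading from an in-memory filehandle removes disk I/O from the measurement, so this isolates the parsing cost only.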
      
      And the test code. I hope there are no silly mistakes. Seeing your unpack example, I wondered whether something like the following could work. It cannot work as written, though, because unpack returns a flat list...
      open(my $fh, "<", "24length_packed.data" ) or die $!;
      local $/ = undef;
      map { push @{ $hash{ $_->[0] } }, [ $_->[1], $_->[2] ] }
          unpack( '(a10a10a4)*', <$fh>), close $fh;
      In a large loop, assigning values to a variable has a noticeable cost (BrowserUk taught me this in this thread). So I think unpack and split would become faster if I could avoid using @rec. Is there a good way?
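One way to cope with the flat list that `unpack` returns is to consume it three elements at a time. The sketch below is one possible approach, not something proposed in the thread; the sub name `load_packed` and the sample records are invented, and no claim is made that it is faster than the line-by-line loaders.

```perl
use strict;
use warnings;

# Stand-in for the slurped file: three 24-character records with no
# separators between them (sample values invented for illustration).
my $packed = '00000001-n00000002-nhype'
           . '00000001-n00000003-nhypo'
           . '00000002-n00000004-nhype';

sub load_packed {
    my ($buf) = @_;
    my %hash;

    # unpack returns one flat list: key, target, link, key, target, link, ...
    my @flat = unpack '(a10a10a4)*', $buf;

    # Take three fields per pass; the loop ends when splice returns nothing.
    while ( my ( $k, $to, $link ) = splice @flat, 0, 3 ) {
        push @{ $hash{$k} }, [ $to, $link ];
    }
    return \%hash;
}

my $h = load_packed($packed);
print scalar( keys %$h ), " keys\n";
```

This still builds a temporary list holding every field, so for a large file it trades extra memory for fewer I/O calls.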

        There are no rules -- beyond minimising the number of opcodes called -- that apply in all situations. Try plugging this into your benchmark:

        my %hash;
        while( <> ) {
            my( $k, @v ) = unpack( 'a10a10a4', $_ );
            push @{ $hash{ $k } }, \@v
        }


Re: Loading 283600 records (WordNet)
by dsheroh (Parson) on Sep 23, 2012 at 09:48 UTC
    I'm not entirely clear on what you actually will be doing with the data after getting it loaded into memory, but there is another option: Copy the dataset from the on-disk SQLite database to an in-memory SQLite database using DBD::SQLite's sqlite_backup_from_file method.
    my $mem_dbh = DBI->connect('dbi:SQLite:dbname=:memory:');
    $mem_dbh->sqlite_backup_from_file($db_file_name);

    # process, process, process...

    # Don't forget to copy the in-memory data back to disk if
    # you've made any changes that should be persistent; if not,
    # skip this step.
    $mem_dbh->sqlite_backup_to_file($db_file_name);
    It won't be as fast as using a hash for your in-memory manipulation, of course, but it will give you access to the full capabilities of SQLite if you find yourself needing to, e.g., query the data in ways that aren't easy/straightforward on a hash.

    For comparison, I recently tried this technique out on a program that inserted a bunch of data into an SQLite database file, but was running inconveniently slowly. By copying the database into memory, inserting the data, and copying the database back to disk, the run time dropped from about seven and a half minutes to around one second. Quite a nice speedup, especially for such a trivial change to the source.

      Thanks for the reply, dsheroh.

      I had never dreamed of such a solution. I am going to try it.

      I sometimes forget to describe what I am actually doing...
      I am drawing a WordNet graph in HTML. Currently it is a tree 8 levels deep, with anywhere from 2,000-3,000 up to 15,000 synsets. I am anxious about loading time because it runs as CGI.

      Before introducing SVG or GraphViz, I wanted to find out whether 8 levels and 15,000 synsets could be fetched and drawn smoothly. Sadly, my results show it is somewhat slow.

      Now I am thinking about Joe Celko's path enumeration (PathENum) table.
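For readers unfamiliar with path enumeration, the idea is to store each node's full path from the root as a string, so that a subtree becomes a prefix match instead of a recursion. The sketch below shows the idea on the thread's hash layout. It is an assumption-laden illustration, not the poster's implementation: the sub name `enumerate_paths`, the "/" separator, and the sample links are all invented.

```perl
use strict;
use warnings;

# Invented sample links in the thread's hash layout.
my %links = (
    'A-n' => [ [ 'B-n', 'hype' ], [ 'C-n', 'hype' ] ],
    'B-n' => [ [ 'D-n', 'hype' ] ],
);

# Enumerate every root-to-node path as a string such as "A-n/B-n/D-n".
# A synset reachable by several routes gets several rows, and a synset
# already on the current path is skipped so cycles cannot loop.
sub enumerate_paths {
    my ( $links, $synset, $prefix ) = @_;
    my $path  = defined $prefix ? "$prefix/$synset" : $synset;
    my @paths = ($path);
    for my $edge ( @{ $links->{$synset} || [] } ) {
        next if index( "/$path/", "/$edge->[0]/" ) >= 0;
        push @paths, enumerate_paths( $links, $edge->[0], $path );
    }
    return @paths;
}

my @paths = enumerate_paths( \%links, 'A-n' );

# With paths stored, a subtree is a prefix match rather than a recursion.
my @under_b = grep { index( $_, 'A-n/B-n' ) == 0 } @paths;
print "$_\n" for @under_b;
```

In a database the same prefix match would typically be a `LIKE 'A-n/B-n%'` condition on the path column. The cost is that the path table must be rebuilt or patched whenever links change.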

      Thanks for your kind reply.
      regards

      I was surprised.

      I have no complaints about the loading time. I never imagined that execute and fetch could be so fast. It is slower than a hash lookup, but it is still really fast.
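The lookup side of the in-memory approach might look like the sketch below, which prepares one statement and executes it repeatedly. It assumes DBI and DBD::SQLite are installed. The table and column names follow the header shown in the question, while the index, the sub name `links_of`, and the rows are invented for illustration; in the real program the database would be filled with `sqlite_backup_from_file` rather than with inserts.

```perl
use strict;
use warnings;
use DBI;

# Build a tiny in-memory database so the sketch is self-contained.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1 } );
$dbh->do('CREATE TABLE synlink (synset1 TEXT, synset2 TEXT, link TEXT)');
$dbh->do('CREATE INDEX synlink_s1 ON synlink (synset1)');

# Invented sample rows.
my $ins = $dbh->prepare('INSERT INTO synlink VALUES (?, ?, ?)');
$ins->execute(@$_)
  for [ 'A-n', 'B-n', 'hype' ], [ 'A-n', 'C-n', 'hype' ], [ 'B-n', 'D-n', 'hype' ];

# Prepare once, execute many times: the SQL is compiled a single time.
my $sth = $dbh->prepare(
    'SELECT synset2, link FROM synlink WHERE synset1 = ? ORDER BY synset2');

sub links_of {
    my ($synset) = @_;
    $sth->execute($synset);
    return $sth->fetchall_arrayref;    # [ [ synset2, link ], ... ]
}

print join( ' ', map { $_->[0] } @{ links_of('A-n') } ), "\n";
```

`links_of` returns the same shape as one value of the hash in the question, so a recursive walk can switch between the two back ends with little change.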

      Memory usage is smaller, too. Below is the output of "ps -axorss,vsz -p $$" before and after loading. Sizes are in KB.

      >perl 07-1.pl  #load to hash
      before:
        RSS   VSZ
       2736  5036
      after:
        RSS   VSZ
      61332 63404
      
      >perl 07-2.pl  #sqlite in-memory
      before:
        RSS   VSZ
       4612  7308
      after:
        RSS   VSZ
       7796  9992
      
      The database file is 37MB; if I dump this table to text, it is 8.9MB. I wonder how it is loaded into memory?

      Below are some lookup test results. Each test loads the data and then performs ARG lookups against the 283600 records (or hash entries).
      >perl 07.pl 100
                  (warning: too few iterations for a reliable count)
                     s/iter 03_sqlite_disk      01_substr  02_sqlite_mem
      03_sqlite_disk   24.0             --           -89%           -97%
      01_substr        2.72           780%             --           -77%
      02_sqlite_mem   0.627          3720%           334%             --
      

      I dropped SQLite on disk from here on.
      >perl 07.pl 1000
                    s/iter     01_substr 02_sqlite_mem
      01_substr       2.71            --          -75%
      02_sqlite_mem  0.687          295%            --
      
      >perl 07.pl 10000
                    s/iter     01_substr 02_sqlite_mem
      01_substr       2.74            --          -61%
      02_sqlite_mem   1.07          157%            --
      
      >perl 07.pl 50000
                    s/iter     01_substr 02_sqlite_mem
      01_substr       2.80            --           -3%
      02_sqlite_mem   2.72            3%            --
      
      >perl 07.pl 100000
                    s/iter 02_sqlite_mem     01_substr
      02_sqlite_mem   4.59            --          -36%
      01_substr       2.92           57%            --
      
      
      And the test code. I want to look up around 2,000 to 15,000 records, so in-memory SQLite suits me fine. Thanks for the information.

        Looks like you have your solution. ;)


