http://www.perlmonks.org?node_id=995089

remiah has asked for the wisdom of the Perl Monks concerning the following question:

Hello, Perl Monks.

This is "synlink" table of WordNet. Like this.

synset1    synset2    link
07125096-n 07128527-n hype
07126228-n 07109847-n hype
...
I would like to get an overview of how synsets link to each other. That requires recursive calls, and querying SQLite for each step cost too much, so I load the table into a hash like this:
'07125096-n' => [ ['07128527-n', 'hype'], ..... ]
Loading it from the database took about 3.6 seconds. I found that loading from a text file is far faster, about 1.2 seconds. With that I am almost satisfied, but I would still like to ask for the monks' wisdom.
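For context, the kind of recursive walk I have in mind over this hash looks roughly like this (a simplified, untested sketch with an arbitrary depth limit; it also ignores cycles):

sub walk {
    my ( $href, $synset, $depth ) = @_;
    return if $depth > 8;                        # arbitrary depth limit for illustration
    for my $edge ( @{ $href->{$synset} || [] } ) {
        my ( $target, $link ) = @$edge;
        print '  ' x $depth, "$synset -($link)-> $target\n";
        walk( $href, $target, $depth + 1 );      # recurse into the linked synset
    }
}

walk( $href, '07125096-n', 0 );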

My questions are: 1) Is there a faster way? 2) If you have experience with WordNet, please give me advice.

My fastest script is the simple one below (the timing lines using MyTime, my wrapper around Time::HiRes, are commented out).

use strict;
use warnings;
#use Data::Dumper;
#use MyTime;

my $href = {};
#my $timeinf = MyTime->new();
#$timeinf->push('before open');

open( my $fh, "<", "04.txt" ) or die $!;
while (<$fh>) {
    chomp;
    # key: synset1 (first 10 chars); value: [ synset2, link ]
    push @{ $href->{ substr( $_, 0, 10 ) } },
        [ substr( $_, 10, 10 ), substr( $_, 20 ) ];
}
close $fh;

#$timeinf->push('after load');
#print $timeinf->as_string;
print "count=", scalar keys %{$href}, "\n";
#print "test item:", Dumper $href->{'01785341-a'}, "\n\n";
I have put a sample text file here.

regards.

Re: Loading 283600 records (Updated)
by BrowserUk (Patriarch) on Sep 22, 2012 at 13:30 UTC

    Try:

    my %hash;
    my @rec;
    push @{ $hash{ $rec[0] } }, [ $rec[ 1 ], $rec[ 2 ] ]
        while @rec = split '(?<=-[a-z])', <>;

    Or 25% better still:

    my %hash;
    my @rec;
    @rec = unpack( 'a10a10a4', $_ ),
        push @{ $hash{ $rec[0] } }, [ @rec[ 1, 2 ] ]
            while <>;

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    RIP Neil Armstrong


      Thanks for the reply, BrowserUk.

      I tried them, and below are the results:

                s/iter 02_split1 04_unpack 03_split2 01_substr
      02_split1   6.34        --      -34%      -41%      -57%
      04_unpack   4.17       52%        --      -11%      -35%
      03_split2   3.71       71%       12%        --      -27%
      01_substr   2.70      134%       54%       37%        --
      
      And here is the test code. I hope there are no silly mistakes. Seeing your unpack example, I wondered whether there is a way like the following. It is impossible as written, because unpack returns a flat list, though...
      open(my $fh, "<", "24length_packed.data" ) or die $!; local $/ = undef; map { push @{ $hash{ $_->[0] } }, [ $_->[1], $_->[2] ] } unpack( '(a10a10a4)*', <$fh>), close $fh;
      With a large loop, assigning values to a variable adds some cost (this is what BrowserUk taught me in this thread). So I think that if I can avoid using @rec, the unpack and split versions would become faster. Is there a good way?
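      One untested idea: grouping the flat list from unpack three fields at a time with natatime from List::MoreUtils (that module is not used anywhere above, so this is just a sketch I have not benchmarked):

      use List::MoreUtils qw(natatime);

      my %hash;
      open( my $fh, "<", "24length_packed.data" ) or die $!;
      local $/ = undef;                                       # slurp the whole file
      my $it = natatime 3, unpack( '(a10a10a4)*', <$fh> );    # iterate over (synset1, synset2, link) triples
      while ( my ( $k, $synset2, $link ) = $it->() ) {
          push @{ $hash{$k} }, [ $synset2, $link ];
      }
      close $fh;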

        There are no rules -- beyond minimising the number of opcodes called -- that apply in all situations. Try plugging this into your benchmark:

        my %hash;
        while ( <> ) {
            my( $k, @v ) = unpack( 'a10a10a4', $_ );
            push @{ $hash{ $k } }, \@v;
        }


Re: Loading 283600 records (WordNet)
by dsheroh (Monsignor) on Sep 23, 2012 at 09:48 UTC
    I'm not entirely clear on what you actually will be doing with the data after getting it loaded into memory, but there is another option: Copy the dataset from the on-disk SQLite database to an in-memory SQLite database using DBD::SQLite's sqlite_backup_from_file method.
    my $mem_dbh = DBI->connect('dbi:SQLite:dbname=:memory:');
    $mem_dbh->sqlite_backup_from_file($db_file_name);

    # process, process, process...

    # Don't forget to copy the in-memory data back to disk if
    # you've made any changes that should be persistent; if not,
    # skip this step.
    $mem_dbh->sqlite_backup_to_file($db_file_name);
    It won't be as fast as using a hash for your in-memory manipulation, of course, but it will give you access to the full capabilities of SQLite if you find yourself needing to, e.g., query the data in ways that aren't easy/straightforward on a hash.

    For comparison, I recently tried this technique out on a program that inserted a bunch of data into an SQLite database file, but was running inconveniently slowly. By copying the database into memory, inserting the data, and copying the database back to disk, the run time dropped from about seven and a half minutes to around one second. Quite a nice speedup, especially for such a trivial change to the source.
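    The change was essentially just bracketing the existing insert code with the two backup calls, something like this (a simplified sketch only; the table, columns and sample rows here are made up for illustration):

    my $mem_dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '', { RaiseError => 1 } );
    $mem_dbh->sqlite_backup_from_file($db_file_name);      # pull the on-disk database into memory

    my @records = (                                         # the rows to insert (illustrative data)
        [ '07125096-n', '07128527-n', 'hype' ],
        [ '07126228-n', '07109847-n', 'hype' ],
    );
    my $sth = $mem_dbh->prepare('INSERT INTO synlink (synset1, synset2, link) VALUES (?, ?, ?)');
    $mem_dbh->begin_work;
    $sth->execute(@$_) for @records;                        # every insert now hits RAM, not disk
    $mem_dbh->commit;

    $mem_dbh->sqlite_backup_to_file($db_file_name);         # write the updated database back to disk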

      Thanks for the reply, dsheroh.

      I would never have dreamed of such a solution. I am going to try this.

      I sometimes forget to describe what I am currently doing...
      I am drawing a WordNet graph in HTML. Currently it is a tree of depth 8, with around 2,000-3,000 up to 15,000 synsets. I am nervous about the loading time because it runs as CGI.

      Before I introduce SVG or GraphViz, I wanted to figure out whether a depth-8 tree of 15,000 synsets could be fetched and drawn smoothly. Looking at my results, sadly, it is somewhat slow.

      Now I am thinking about Joe Celko's path enumeration (PathEnum) table.

      Thanks for your kind reply.
      regards

      I was surprised.

      I have no complaints about the loading time. I never imagined that execute and fetch could become so fast. It is slower than a hash lookup, but it is really fast.

      And memory usage is smaller. These are the outputs of "ps -axorss,vsz -p $$" before loading and after loading. Sizes are in KB.

      >perl 07-1.pl  #load to hash
      before:
        RSS   VSZ
       2736  5036
      after:
        RSS   VSZ
      61332 63404
      
      >perl 07-2.pl  #sqlite in-memory
      before:
        RSS   VSZ
       4612  7308
      after:
        RSS   VSZ
       7796  9992
      
      The database file is 37 MB; if I dump this table to text, it becomes 8.9 MB. I wonder how they load it in memory?

      Below are some test results of lookups. The test loads the data and then performs ARG lookups against the 283,600 records/hash (ARG is the number given on the command line).
      >perl 07.pl 100
                  (warning: too few iterations for a reliable count)
                     s/iter 03_sqlite_disk      01_substr  02_sqlite_mem
      03_sqlite_disk   24.0             --           -89%           -97%
      01_substr        2.72           780%             --           -77%
      02_sqlite_mem   0.627          3720%           334%             --
      

      I drop SQLite-on-disk from here on.
      >perl 07.pl 1000
                    s/iter     01_substr 02_sqlite_mem
      01_substr       2.71            --          -75%
      02_sqlite_mem  0.687          295%            --
      
      >perl 07.pl 10000
                    s/iter     01_substr 02_sqlite_mem
      01_substr       2.74            --          -61%
      02_sqlite_mem   1.07          157%            --
      
      >perl 07.pl 50000
                    s/iter     01_substr 02_sqlite_mem
      01_substr       2.80            --           -3%
      02_sqlite_mem   2.72            3%            --
      
      >perl 07.pl 100000
                    s/iter 02_sqlite_mem     01_substr
      02_sqlite_mem   4.59            --          -36%
      01_substr       2.92           57%            --
      
      
      And here is the test code. I would like to look up around 2,000 to 15,000 records, so SQLite in-memory suits me fine. Thanks for the information.
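      For reference, the benchmark is roughly along these lines (a simplified, untested sketch, not my full script; in the real script the loading is also repeated inside each timed run, which is why 01_substr stays near 2.7 s):

      use strict;
      use warnings;
      use DBI;
      use Benchmark qw(cmpthese);

      my $count        = shift || 100;    # ARG: number of lookups per run
      my $db_file_name = 'wordnet.db';    # illustrative path to the on-disk database

      # Load the hash once with the substr loader shown earlier.
      my $href = {};
      open( my $fh, "<", "04.txt" ) or die $!;
      while (<$fh>) {
          chomp;
          push @{ $href->{ substr( $_, 0, 10 ) } },
              [ substr( $_, 10, 10 ), substr( $_, 20 ) ];
      }
      close $fh;

      # Set up the in-memory SQLite copy once.
      my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '', { RaiseError => 1 } );
      $dbh->sqlite_backup_from_file($db_file_name);
      my $sth = $dbh->prepare('SELECT synset2, link FROM synlink WHERE synset1 = ?');

      my @keys = ( keys %$href )[ 0 .. $count - 1 ];    # which synsets to look up

      cmpthese( 10, {
          '01_substr' => sub {                          # plain hash lookups
              for my $k (@keys) { my $rows = $href->{$k}; }
          },
          '02_sqlite_mem' => sub {                      # prepared SELECTs against the :memory: copy
              for my $k (@keys) {
                  $sth->execute($k);
                  my $rows = $sth->fetchall_arrayref;
              }
          },
      } );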

        Looks like you have your solution. ;)

