PerlMonks  

Faster Hash Slices

by jht (Novice)
on Nov 28, 2013 at 10:53 UTC ( #1064798=perlquestion )
jht has asked for the wisdom of the Perl Monks concerning the following question:

I have a program that reads and processes approximately 8 million lines of data. For each line read it does:
my @aa = split(/\t/);
@{$ref}{@head} = @aa;
When I measure the time spent, the
@{$ref}{@head} = @aa;
seems to be fairly expensive time-wise. Is there a better/faster way to accomplish this? Should I dump the hash and try something like this?
my ($field1,$field2,$field3, ... ,$fieldn) = @aa
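A minimal Benchmark sketch of the question (field names and count are made up) to see how much of the time is the slice itself, versus the split:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical header and a sample 20-field line
my @head = map { "field$_" } 1 .. 20;
my $line = join "\t", map { "value$_" } 1 .. 20;
my $ref  = {};

cmpthese( 100_000, {
    hash_slice => sub {
        my @aa = split /\t/, $line;
        @{$ref}{@head} = @aa;    # the slice assignment under test
    },
    array_only => sub {
        my @aa = split /\t/, $line;    # baseline: split with no slice
    },
} );
```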

Re: Faster Hash Slices
by daxim (Chaplain) on Nov 28, 2013 at 10:58 UTC
    Please show the whole code and your NYTProf results.
Re: Faster Hash Slices
by tobyink (Abbot) on Nov 28, 2013 at 11:00 UTC

    Switching from a hashref to a hash may slightly improve matters. Accesses to hash elements are slightly faster going direct. Better still would be to abandon the hash altogether and access @aa directly, because array accesses are faster than hash ones, but this change might not be as simple.

    use Moops; class Cow :rw { has name => (default => 'Ermintrude') }; say Cow->new->name

Sure? AFAIK lookups against a named hash are slower*, since they need an extra lookup against the symbol table or pad to locate its reference.

      * the speed penalty is close to zero

      perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'

        Seems to be the case that named hash lookups are (marginally, but consistently) quicker:

        { my $ref = { 1 .. 1e6 };
          my %h   = 1 .. 1e6;
          cmpthese -5, {
              a => sub { exists $ref->{$_} and 1 for 1 .. 1e6 },
              b => sub { exists $h{$_}     and 1 for 1 .. 1e6 },
          };
        };;

            Rate    a    b
        a 1.24/s   --  -3%
        b 1.28/s   3%   --
            Rate    a    b
        a 1.36/s   --  -5%
        b 1.43/s   5%   --
            Rate    a    b
        a 1.24/s   --  -3%
        b 1.28/s   3%   --
            Rate    a    b
        a 1.23/s   --  -5%
        b 1.29/s   5%   --
            Rate    a    b
        a 1.27/s   --  -3%
        b 1.30/s   3%   --

        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        The reference is a named variable though, so that also needs a lookup in the symbol table or pad, plus needs to then be dereferenced.

Re: Faster Hash Slices
by Eily (Hermit) on Nov 28, 2013 at 11:12 UTC

    You could save a little time by not copying your data twice: @{$ref}{@head} = split(/\t/); (which will even make split a little faster if you put the result into fewer elements than there are tab-separated values in your string).
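A sketch of the point, with invented field names: split feeds the hash slice directly, with no intermediate @aa copy, and the list-assignment context lets Perl limit the split to the slots needed.

```perl
use strict;
use warnings;

my @head = qw(id name city);    # hypothetical header
my $ref  = {};

# In-memory filehandle standing in for the real input file
my $text = "1\tAlice\tLondon\n2\tBob\tParis\n";
open my $fh, '<', \$text or die $!;

while ( my $line = <$fh> ) {
    chomp $line;
    # split straight into the slice -- no @aa copy in between
    @{$ref}{@head} = split /\t/, $line;
    # ... process $ref->{id}, $ref->{name}, $ref->{city} here ...
}
```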

    Or you could keep the array, and access its elements with constants, like was done here for example (it's done on a blessed arrayref in this case, but it works all the same).
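For instance (field layout invented), naming the array slots with use constant keeps the readability of hash keys while compiling down to plain array accesses:

```perl
use strict;
use warnings;
use constant { ID => 0, NAME => 1, CITY => 2 };    # invented field layout

my @aa = split /\t/, "42\tAlice\tLondon";

# Reads almost like a hash lookup, but is an array access by index
print $aa[NAME], "\n";    # Alice
print $aa[CITY], "\n";    # London
```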

Re: Faster Hash Slices
by shmem (Canon) on Nov 28, 2013 at 13:50 UTC

    What Eily says. Then, according to perldata:

    You can preallocate space for a hash by assigning to the keys() function. This rounds up the allocated buckets to the next power of two:
    keys(%users) = 1000; # allocate 1024 buckets

    I'm not sure about the impact on performance, though.

Re: Faster Hash Slices
by Tux (Monsignor) on Nov 28, 2013 at 14:28 UTC

    Do you need to retain the values outside of the loop? When processing 8 million lines, if you put all of that data into a hash, you are likely to run out of memory.

    If the parsed data is "healthy" - as in all lines have the same number of fields, neatly separated by single TAB characters - and you only need the fields *during the iteration*, you could consider using Text::CSV_XS in combination with bind_columns. That will re-use the assigned variables for each line and /might/ save a lot of time.

    --8<---
    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    my @data = ("aaaaaaaaaa") x 100;
    my $line = join "\t" => @data;
    my $data = join "\n" => map { $line } 0 .. 50000;

    use Text::CSV_XS;
    my $csv = Text::CSV_XS->new ({ sep_char => "\t", binary => 1, auto_diag => 1 });

    sub bsplit {
        open my $fh, "<", \$data;
        while (<$fh>) {
            chomp;
            my @aa = split /\t/ => $_;
            }
        } # bsplit

    sub csvxs {
        open my $fh, "<", \$data;
        my @aa = ("") x scalar @data;
        $csv->bind_columns (\(@aa));
        while ($csv->getline ($fh)) {
            }
        } # csvxs

    cmpthese (10, {
        "split"  => \&bsplit,
        "csv_xs" => \&csvxs,
        });
    -->8---
    =>
             Rate  split csv_xs
    split  1.60/s     --   -27%
    csv_xs 2.21/s    38%     --

    Note that the difference is *completely* dependent on what your data looks like!

    my @data = ("aaaaa") x 10;
    my $line = join "\t" => @data;
    my $data = join "\n" => map { $line } 0 .. 500000;
    =>
             Rate csv_xs  split
    csv_xs 1.17/s     --   -13%
    split  1.34/s    14%     --
    -------------------------------------------------
    my @data = ("aaaaa") x 500;
    my $line = join "\t" => @data;
    my $data = join "\n" => map { $line } 0 .. 50000;
    =>
           s/iter  split csv_xs
    split    2.93     --   -57%
    csv_xs   1.25   134%     --
    -------------------------------------------------
    my @data = ("aaaaaaaaaaaaaaaaaaaaaaaaaaa") x 500;
    my $line = join "\t" => @data;
    my $data = join "\n" => map { $line } 0 .. 50000;
    =>
           s/iter csv_xs  split
    csv_xs   3.74     --    -5%
    split    3.55     5%     --

    Even running the same on a different machine might change the results. Benchmark if things get important, but bench where it matters: do not expect a benchmark run on a 64-bit non-threaded non-debugging perl on a 64-core machine with 128 GB of RAM to give the same result on an AIX 5.2 box with a 32-bit threaded debugging perl and only 1 CPU and 2 GB of RAM available to the process.


    Enjoy, Have FUN! H.Merijn
Re: Faster Hash Slices
by hdb (Parson) on Nov 28, 2013 at 14:36 UTC

    Under the assumption that @head stays constant during the loop, do you declare the hash outside of the loop? As the keys of the hash are the same on every iteration, it would be wasteful to create and delete the hash structure, allocate memory for the keys, etc. every time. So it should look similar to this (leaving chomping aside for the moment):

    my @head = split /\t/, <$fh>;
    my $ref  = {};
    undef @{$ref}{@head};            # pre-allocate the hash and its keys
    for ( <$fh> ) {
        @{$ref}{@head} = split /\t/;
        # do something with the data here
    }
      This helped. Thanks!

        If you'd show a little more of what you do with the hash within the loop, I'm betting there is more to be gained.


Re: Faster Hash Slices
by Anonymous Monk on Nov 28, 2013 at 14:54 UTC
    Answer the questions the others asked. The best solution depends on the use case. I think what you want is an AoA with numeric constants instead of hash keys.
      I agree that an array with constants should be faster; however, it would require more changes to the existing code than I'm up for at the moment. Moving the hash creation and key allocation out of the loop, as suggested by hdb, has helped. I will try the array method when I get a chance. Thanks!
Re: Faster Hash Slices
by Laurent_R (Vicar) on Nov 28, 2013 at 23:22 UTC

    I have a program that reads and processes approximately 8 million lines of data. For each line read it does:

    my @aa = split(/\t/);
    @{$ref}{@head} = @aa;

    Well, sorry, but that program does not seem to make any sense. As far as we can tell from the code you presented, this program is just assigning the split line 8 million times to the same hash (nothing ever changes $ref or @head). If this is what you want to do, then simply go to the last line of the file and assign its split value to your variable. But, of course, I suspect you want to do something else, but are not really saying what.

    Please explain what you really want. Or, stated differently, please provide a more comprehensive excerpt of your code.

      @{$ref}{@head} = @aa;

      ...is a hash slice, similar to @hash{@keys} = @values, except that in the former case, $ref holds a hash-ref, whereas in the latter case, %hash holds a plain hash.


      Dave

        I might not have expressed myself correctly, but I was only trying to say that, since $ref and @head are never changing in the presented code, each time through the loop the values of @aa will clobber the previously recorded ones, so that the loop is in effect useless; it would be faster to go straight to the last line of the file, split it, and run the discussed command only once. Example under the Perl debugger:

          DB<1> @head = qw 'foo bar baz'
          DB<2> @{$ref}{@head} = qw /one two three/;
          DB<3> x $ref
        0  HASH(0x8032f9b8)
           'bar' => 'two'
           'baz' => 'three'
           'foo' => 'one'
          DB<4> @{$ref}{@head} = qw /four five six/;
          DB<5> x $ref
        0  HASH(0x8032f9b8)
           'bar' => 'five'
           'baz' => 'six'
           'foo' => 'four'
          DB<6>
        The assignment at DB<4> clobbers what was previously in the hash referenced by $ref. It does not make sense to do that in a loop, since only the last assignment of the loop is useful.

        Or the real code is different, in which case it might be useful to see the real code.

Re: Faster Hash Slices
by oiskuu (Pilgrim) on Nov 29, 2013 at 00:31 UTC
    What Tux said above.

    The split + slice assign itself (times 8e6) might take 10 seconds. If you run out of memory, it can take "forever". Your hashes may easily consume > 4GB, not counting the data.

    Improve your data layout. Only use a hash for indexing. Keep data packed where possible. Consider using a database...
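A rough sketch of that layout (record format and field widths are invented): keep the rows packed as fixed-width strings, and let the only hash be the key-to-row index.

```perl
use strict;
use warnings;

my %index;    # key -> row number: the only hash in play
my @rows;     # packed fixed-width records instead of per-row hashes

# In-memory filehandle standing in for the real input file
my $text = "k1\taaa\tbbbb\nk2\tccc\tdddd\n";
open my $fh, '<', \$text or die $!;

my $n = 0;
while ( my $line = <$fh> ) {
    chomp $line;
    my ( $key, @fields ) = split /\t/, $line;
    push @rows, pack 'A8 A8', @fields;    # ~16 bytes/row (assumed widths)
    $index{$key} = $n++;
}

# Unpack only the row you actually need; 'A' strips the pad spaces
my @fields = unpack 'A8 A8', $rows[ $index{k2} ];
```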

Node Type: perlquestion [id://1064798]
Approved by marto