how to speed up dupe checking of arrays

by ultibuzz (Monk)
on Jul 31, 2007 at 09:44 UTC
ultibuzz has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I need to check arrays for duplicates; array sizes range from 5 million up to 1 billion elements.
I use the following code:

if ($file =~ $spec_text) {
    my $file_date = (split(/\./, $file))[3];
    open(IN, '<', $file) or die("open failed: $!");
    my @rows;
    while (<IN>) {
        chomp;
        my @eles = split(";", $_);
        push @rows, $eles[0] . ";" . $eles[1] . ";" . $file_date;
    }
    print scalar(@rows), "\n";
    my @non_dupe_rows = do { my %seen; grep !$seen{$_}++, @rows };
    print scalar(@non_dupe_rows), "\n";
}

This code needs 203 seconds for 9 million elements on our server, but I need it faster: in total I need to parse approximately 15 billion elements (from all files).
Ideas on how to speed it up are greatly welcome.

kd ultibuzz

Re: how to speed up dupe checking of arrays
by wind (Priest) on Jul 31, 2007 at 09:57 UTC
    Most likely you're going to simply be throttled by IO speed. However, your above code could in theory be simplified by limiting the split to only 3 parts, and by adding the dup check to the while loop. Assuming that dup count is all you really care about.
    if ($file =~ $spec_text) {
        my $file_date = (split(/\./, $file))[3];
        open(IN, '<', $file) or die("open failed: $!");
        my $count_uniq = 0;
        my %seen;
        while (<IN>) {
            chomp;
            my ($ele0, $ele1, undef) = split ';', $_, 3;
            $count_uniq++ if !$seen{"$ele0;$ele1;$file_date"}++;
        }
        print "$.\n"; # Total number of lines.
        print "$count_uniq\n";
        close(IN);
    }
    - Miller

      I need them in an array, so I adjusted your code like this:

      my @rows;
      my %seen;
      while (<IN>) {
          chomp;
          my ($ele0, $ele1, undef) = split ';', $_, 3;
          push @rows, "$ele0;$ele1;$file_date"
              if !$seen{"$ele0;$ele1;$file_date"}++;
      }
      close(IN);

      And what should I say: AWESOME. From 203 seconds down to around 11 seconds, great, so no overtime needed at the office ;)
      Thanks a lot.

      kd ultibuzz

        You don't need the array; you already have them in the hash:
        $seen{"$ele0;$ele1;$file_date"}++;
        Then, if you need the data, you can, for example:
        foreach (keys %seen) { .... }
        and the value of the hash is the number of times the string is repeated.
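
        A minimal sketch of that idea, reusing $file_date and the IN handle from the code above (the print is just for illustration):

        my %seen;
        while (<IN>) {
            chomp;
            my ($ele0, $ele1, undef) = split ';', $_, 3;
            $seen{"$ele0;$ele1;$file_date"}++;
        }
        close(IN);

        # keys are the unique rows, values are the repeat counts
        for my $row (keys %seen) {
            print "$row\t$seen{$row}\n";
        }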

        Oha

Re: how to speed up dupe checking of arrays
by FunkyMonk (Canon) on Jul 31, 2007 at 10:23 UTC
    Either @non_dupe_rows is very poorly named, or my @non_dupe_rows = do { my %seen; grep !$seen{$_}++, @rows }; isn't doing what you think it's doing:

    my @rows = qw/ 1 2 3 4 1 2 /; # 1 & 2 are dups
    my @non_dupe_rows = do { my %seen; grep !$seen{$_}++, @rows };
    print "@non_dupe_rows\n";

    prints

    1 2 3 4

    If you really want non-dup rows, try

    my @rows = qw/ 1 2 3 4 1 2 /; # 1 & 2 are dups
    my @non_dupe_rows = do {
        my %seen;
        $seen{$_}++ for @rows;
        grep $seen{$_} == 1, keys %seen;
    };
    print "@non_dupe_rows\n";
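
    which prints 3 4 (in whatever order keys happens to return them; hash order is not insertion order).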

      I want non-dupes, but I want to keep one copy of each dupe.
      So 1 2 3 4 is what I want ;D

      kd ultibuzz

Re: how to speed up dupe checking of arrays
by mjscott2702 (Pilgrim) on Jul 31, 2007 at 12:35 UTC
    The reduce function available in List::Util may make this faster. I haven't benchmarked it (I'm getting some weird benchmark results on my Win32/Cygwin install), but these calls are supposed to be REALLY fast.
      Update: the List::MoreUtils module actually has a uniq function that does exactly what you need. This may be faster?
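
      For example (a minimal sketch; uniq keeps the first occurrence of each value, the same semantics as the grep !$seen{$_}++ idiom above):

      use List::MoreUtils qw(uniq);

      my @rows = qw/ 1 2 3 4 1 2 /;
      my @non_dupe_rows = uniq @rows; # keeps the first copy of each dup
      print "@non_dupe_rows\n";       # prints: 1 2 3 4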

        List::MoreUtils is faster than my try, but slower than the dupe check inside the while loop.
        The problem might be the pass through each element of the array, which becomes the main time-consuming part.
        As I see it, it doesn't matter much what you use when you have fairly small arrays, but when you have arrays with several million elements, all the saved milliseconds count ;)
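
        A minimal sketch of how such a comparison could be run with the core Benchmark module (the data set is made up for illustration):

        use Benchmark qw(cmpthese);
        use List::MoreUtils qw(uniq);

        # synthetic rows with plenty of duplicates
        my @rows = map { int(rand 100_000) . ';' . int(rand 10) } 1 .. 1_000_000;

        cmpthese(5, {
            uniq      => sub { my @u = uniq @rows },
            grep_seen => sub { my %seen; my @u = grep !$seen{$_}++, @rows },
        });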

        kd ultibuzz

Re: how to speed up dupe checking of arrays
by radiantmatrix (Parson) on Jul 31, 2007 at 14:45 UTC

    Just a thought: could a DB be leveraged here? LOAD DATA INFILE is pretty fast, and SELECT DISTINCT is both easy to code and pretty quick...
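
    A hypothetical sketch of that route with DBI (the connection string, table name, column names, and file path are all assumptions; LOAD DATA INFILE is MySQL syntax):

    use DBI;

    my $dbh = DBI->connect('dbi:mysql:mydb', 'user', 'pass', { RaiseError => 1 });

    # bulk-load the raw file, then let the database deduplicate
    $dbh->do(q{LOAD DATA INFILE '/path/to/file' INTO TABLE raw_rows
               FIELDS TERMINATED BY ';'});
    my $uniq = $dbh->selectall_arrayref(q{SELECT DISTINCT ele0, ele1 FROM raw_rows});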

    <radiant.matrix>
    Ramblings and references
    The Code that can be seen is not the true Code
    I haven't found a problem yet that can't be solved by a well-placed trebuchet

      We have an Oracle 10g on an HP Superdome. Loading all the data with sqlloader and the direct=true option takes longer than dupe checking with Perl first and then loading the filtered data in.
      Loading the data into the DB with Perl, or with plain sqlloader without tuning, would take ages.
      With direct=true, sqlloader pumps in 10 million rows in less than 20 seconds; without direct=true it takes 10+ minutes, because Oracle sets commit points and checks the data for correctness ^^.
      The other point is that we need several indexes and partition groups on it, which would take hours to create if we used the unfiltered data.
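
      For reference, a direct-path load invocation might look like this (the user, control file, and data file names are hypothetical; direct=true is the option described above):

      sqlldr userid=user/pass@db control=rows.ctl data=filtered.csv direct=true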

      kd ultibuzz
