Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight

how to speed up dupe checking of arrays

by ultibuzz (Monk)
on Jul 31, 2007 at 09:44 UTC ( #629773=perlquestion: print w/replies, xml ) Need Help??
ultibuzz has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

i need to duplicate check arrays , array size is from 5 million till 1 billion elements.
i use the following code

if ($file =~ $spec_text){ my $file_date = (split(/\./,$file))[3]; open(IN, '<', $file) or die("open failed: $!"); my @rows; while (<IN>) { chomp; my @eles = split(";",$_); push @rows,$eles[0].";".$eles[1].";".$file_date; } print scalar(@rows),"\n"; my @non_dupe_rows = do { my %seen;grep !$seen{$_}++, @rows }; print scalar(@non_dupe_rows),"\n"; }

this code need for 9 million elements 203 seconds on our server, but i need it faster, i need to parse in total approx 15 billion elements ( from all files).
idears how do speed it up are greatly welcome

kd ultibuzz

Replies are listed 'Best First'.
Re: how to speed up dupe checking of arrays
by wind (Priest) on Jul 31, 2007 at 09:57 UTC
    Most likely you're going to simply be throttled by IO speed. However, your above code could in theory be simplified by limiting the split to only 3 parts, and by adding the dup check to the while loop. Assuming that dup count is all you really care about.
    if ($file =~ $spec_text){ my $file_date = (split(/\./,$file))[3]; open(IN, '<', $file) or die("open failed: $!"); my $count_uniq = 0; my %seen; while (<IN>) { chomp; my ($ele0, $ele1, undef) = split ';', $_, 3; $count_uniq++ if !$seen{"$ele0;$ele1;$file_date"}++; } print "$.\n"; # Total number of lines. print "$count_uniq\n"; close(IN); }
    - Miller

      i need them in an array so i adjusted ur code like this

      my @rows; my %seen; while (<IN>) { chomp; my ($ele0, $ele1, undef) = split ';', $_, 3; push @rows,"$ele0;$ele1;$file_date" if !$seen{"$ele0;$ele1 +;$file_date"}++; } close(IN);

      and waht shoud i say AWSOME, from 203 seconds down do around 11 seconds,great so no over hours in office needed ;)
      thx alot.

      kd ultibuzz

        you already have them in array.
        then if you need the data you can, for example:
        foreach (keys %seen) { .... }
        and the value of the hash is the number of times the string is repeated


Re: how to speed up dupe checking of arrays
by FunkyMonk (Canon) on Jul 31, 2007 at 10:23 UTC
    Either @non_dupe_rows is very poorly named, or my @non_dupe_rows = do { my %seen;grep !$seen{$_}++, @rows }; isn't doing what you think it's doing:

    my @rows = qw/ 1 2 3 4 1 2 /; # 1 & 2 are dups my @non_dupe_rows = do { my %seen;grep !$seen{$_}++, @rows }; print "@non_dupe_rows\n";


    1 2 3 4

    If you really want non-dup rows, try

    my @rows = qw/ 1 2 3 4 1 2 /; # 1 & 2 are dups my @non_dupe_rows = do { my %seen; $seen{$_}++ for @rows; grep $seen{$_} == 1, keys %seen }; print "@non_dupe_rows\n";

      i want non dupes but want to keep 1 of all the dupes.
      so 1 2 3 4 is waht i want ;D

      kd ultibuzz

Re: how to speed up dupe checking of arrays
by mjscott2702 (Pilgrim) on Jul 31, 2007 at 12:35 UTC
    The reduce method available at List::Util may make this faster. I haven't benchmarked it (getting some weird benchmark results on my Win32/Cygwin install), but these calls are supposed to be REALLY fast.
      Update: the module at List::MoreUtils actually has a uniq function that does exactly what you need. This may be faster?

        List::More-Utils is faster as my try but slower then the dupe checking while in while loop.
        problem might be the pass trough each element of the array wich becomee the main time consuming element.
        as i see it dosnt matte rmuch waht u use when u have kinda small arrays, but when u have arrays with several million elements, all the saved milliseconds count ;)

        kd ultibuzz

Re: how to speed up dupe checking of arrays
by radiantmatrix (Parson) on Jul 31, 2007 at 14:45 UTC

    Just a thought that a DB could be leveraged here? LOAD DATA INFILE is pretty fast, and SELECT DISTINCT is both easy to code and pretty quick...

    Ramblings and references
    The Code that can be seen is not the true Code
    I haven't found a problem yet that can't be solved by a well-placed trebuchet

      we have a oracle 10g on a hp superdome loading all data with sqlloader and direct=true as option takes longer then the dupechecking with perl and then load filtered data in.
      loading the data with perl into the db or normal sqlloader without tunig woud take ages.
      with direct=true the sqlloader pumps in 10million rows in less then 20 sec, without direct=true it takes 10 minutes+ because orcale set commit points and check for data correctness ^^.
      other point is we need several indexes and partiotion groups on it wich woud take hours to create if we used the unfiltered data.

      kd ultibuzz

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://629773]
Approved by Corion
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others exploiting the Monastery: (5)
As of 2017-01-22 13:12 GMT
Find Nodes?
    Voting Booth?
    Do you watch meteor showers?

    Results (187 votes). Check out past polls.