PerlMonks  

grep not equal

by rnaeye (Friar)
on Mar 21, 2014 at 15:15 UTC ( [id://1079281]=perlquestion )

rnaeye has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,
I was wondering if the following is the best way to grep non-equal elements from two arrays.
for my $item (@all_genes) { say "$item" if ! grep {$item eq $_} @covered_genes; }
Should I expect my code to work properly? (It's working fine in my tests.) If I wanted to use the "ne" operator instead, how could I write the code? I would appreciate suggestions. Thank you.

Replies are listed 'Best First'.
Re: grep not equal
by davido (Cardinal) on Mar 21, 2014 at 15:22 UTC

    It probably works fine. But it's an inefficient solution; for each element in @all_genes, your grep must inspect every element in @covered_genes. That's probably fine if your data set is small, but what little I know about genome projects leads me to the conclusion that data sets are rarely small.

    A more efficient solution would put "@covered_genes" into hash keys for constant time lookups:

    my %covered_lookup;
    @covered_lookup{ @covered_genes } = ();
    for my $item ( @all_genes ) {
        say $item unless exists $covered_lookup{$item};
    }

    With Perl, most times you're solving set problems you can solve them efficiently with hashes.


    Dave

Re: grep not equal
by choroba (Cardinal) on Mar 21, 2014 at 15:31 UTC
    If you really want to use ne, you have to check that the number of different elements equals the number of all the elements:
    for my $gene (@all_genes) { say $gene if @covered_genes == grep $_ ne $gene, @covered_genes; }
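To see why the count comparison is needed, here is a minimal sketch with made-up sample data (the single-letter gene names are placeholders, not from the original post). A bare `grep { $_ ne $gene }` is true as soon as any covered gene differs from `$gene`, so it lets everything through once there are two or more distinct covered genes:

```perl
use strict;
use warnings;

# Hypothetical sample data for illustration only.
my @all_genes     = qw(a b c d);
my @covered_genes = qw(b d);

# Naive "ne": true whenever ANY covered gene differs from $gene,
# so every gene passes here.
my @naive = grep {
    my $gene = $_;
    grep { $_ ne $gene } @covered_genes;
} @all_genes;

# Count-based "ne": keep $gene only if ALL covered genes differ from it.
my @uncovered = grep {
    my $gene = $_;
    @covered_genes == grep { $_ ne $gene } @covered_genes;
} @all_genes;

print "naive:     @naive\n";      # naive:     a b c d
print "uncovered: @uncovered\n";  # uncovered: a c
```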

    A benchmark shows your way, as inefficient as it might be, is about 15% faster.
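The benchmark code itself was not posted. A comparison along these lines could be set up with the core Benchmark module; the data size, the gene names, and the sub names below are all assumptions made for the sketch, and the relative timings will depend on the data and the Perl version:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test data: 200 gene names, roughly half of them "covered".
my @all_genes     = map { "gene$_" } 1 .. 200;
my @covered_genes = grep { /[02468]$/ } @all_genes;

# The original approach: negate a grep that tests for equality.
sub with_grep_eq {
    my @out;
    for my $gene (@all_genes) {
        push @out, $gene if !grep { $gene eq $_ } @covered_genes;
    }
    return @out;
}

# The "ne" approach: every covered gene must differ from $gene.
sub with_grep_ne {
    my @out;
    for my $gene (@all_genes) {
        push @out, $gene
            if @covered_genes == grep { $_ ne $gene } @covered_genes;
    }
    return @out;
}

# The hash approach suggested elsewhere in this thread.
sub with_hash {
    my %covered;
    @covered{@covered_genes} = ();
    return grep { !exists $covered{$_} } @all_genes;
}

# Run each variant for about one CPU second and print a comparison table.
cmpthese(-1, {
    grep_eq => \&with_grep_eq,
    grep_ne => \&with_grep_ne,
    hash    => \&with_hash,
});
```

All three subs should return the same list, so it is worth checking that before trusting any timing numbers.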

    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: grep not equal
by hazylife (Monk) on Mar 21, 2014 at 15:26 UTC
    Why not use a hash instead?
    my %covered; @covered{@covered_genes} = (); exists $covered{$_} || say for @all_genes;
Re: grep not equal
by boerdry (Sexton) on Mar 21, 2014 at 18:58 UTC
    I am wondering why you don't use BioPerl BLAST. It seems to me you are looking for a solution to a problem that is already solved. Site: http://www.bioperl.org/wiki/BLAST
Re: grep not equal
by rnaeye (Friar) on Mar 21, 2014 at 15:32 UTC

    Thank you for the hash suggestion. Yes, my current pilot project is small, but the data will be huge in the near future. I will change my code.

      Just some words of warning. There is huge, and there is huge. If it's huge enough, storing the lookup table in memory may be problematic. You might have to move to a database approach, or to binary searches in disk-based files.
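One lightweight way to move the lookup table out of memory, short of a full database, is to tie the hash to a DBM file on disk. The following is only a sketch using the core SDBM_File module; the file name and the sample gene names are assumptions, and SDBM limits the size of each key/value pair, so a different DBM back end or a real database may be a better fit for large records:

```perl
use strict;
use warnings;
use Fcntl;                      # provides O_RDWR and O_CREAT
use SDBM_File;
use File::Temp qw(tempdir);

# A temporary directory keeps the sketch self-contained.
my $dir = tempdir(CLEANUP => 1);

# Tie the lookup hash to files on disk instead of holding it in memory.
tie my %covered, 'SDBM_File', "$dir/covered", O_RDWR | O_CREAT, 0666
    or die "Cannot tie SDBM file: $!";

# Sample data for illustration only.
my @covered_genes = qw(BRCA1 TP53);
my @all_genes     = qw(BRCA1 BRCA2 TP53 EGFR);

# Each store goes to the DBM file on disk.
$covered{$_} = 1 for @covered_genes;

# The lookup code is the same as with an ordinary hash.
my @uncovered = grep { !exists $covered{$_} } @all_genes;
print "$_\n" for @uncovered;

untie %covered;
```

The lookup loop does not change at all; only the storage behind the hash does, at the cost of disk access on every lookup.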


      Dave

        When data becomes too large to fit into a hash (I often work on, for example, comparing files that are several GB in size, in some cases even dozens of GB), the approach I take is somewhat different: sorting both files on the comparison key (generally with the Unix sort utility) and reading them in parallel. Sorting takes some time, but not more (and usually significantly less) than inserting into a database and calculating an index.

        And once the files are sorted, I can chain up all kinds of processing: removing duplicates from each file, reading both files in parallel to find the differences, etc., and these processes are lightning fast, just about as fast as you can hope to get. And they are very significantly faster than database access or binary search: you are just reading the files sequentially, with no lookup time through a DB index or binary search. There is an O(N log N) penalty for the initial sorting, but everything else is O(N).

        The only slightly complicated thing is the algorithm to read two files in parallel correctly (because there are quite a few edge cases to manage), but I built a module to handle this complexity so that I no longer have to worry about it. I wrote that module at home, outside of my working hours, so that I could make it open source and freely available, but I haven't yet figured out how to package it correctly for the CPAN (I have made a number of modules before, but it is the first one that might be of some real interest outside of our working environment). I wrote detailed POD documentation, but, among other things, I do not know how to build a meaningful test suite. Also, I asked for a PAUSE account about six months ago and never got any answer.
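For readers who want a feel for the parallel read, here is a minimal sketch. It is not the module mentioned above, whose code is not shown in this thread; the sub name `sorted_difference` is made up, and it assumes one key per line with both inputs already sorted in the order Perl's string comparison uses (for example with `LC_ALL=C sort`):

```perl
use strict;
use warnings;

# Return the lines found in the first sorted input but not in the second.
# Both handles must yield lines sorted the way Perl's lt/eq compare strings.
# For truly large inputs, print each line instead of collecting a list.
sub sorted_difference {
    my ($fh_all, $fh_covered) = @_;
    my @missing;

    my $covered = <$fh_covered>;
    chomp $covered if defined $covered;

    while (defined(my $item = <$fh_all>)) {
        chomp $item;

        # Advance the second input until it catches up with $item.
        while (defined $covered && $covered lt $item) {
            $covered = <$fh_covered>;
            chomp $covered if defined $covered;
        }

        # Second input exhausted, or already past $item: not covered.
        push @missing, $item
            unless defined $covered && $covered eq $item;
    }
    return @missing;
}

# In-memory filehandles stand in for two large sorted files.
open my $all,     '<', \"a\nb\nc\nd\n" or die "open failed: $!";
open my $covered, '<', \"b\nd\n"       or die "open failed: $!";

print "$_\n" for sorted_difference($all, $covered);   # prints a, then c
```

Each input is read exactly once from start to finish, which is where the O(N) behaviour after sorting comes from. The edge cases handled here are an exhausted second file and repeated keys in the first file.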

        I have compared my solution with three ETL software packages: two free (or possibly even open-source) ones and a commercial one (with a $200,000+ licensing fee in our corporate environment). My solution is significantly faster (and in my view simpler) than all three of them, at least for the type of problems that I have to solve regularly, despite the fact that these ETLs are written in a compiled language (C or Java) that is presumably natively faster than Perl. The reason is that, in a Perl program, when I read some data, I can do everything that I need with it, whereas in an ETL you usually need to build a data pipeline with one simple task per action, so that you end up reading the data many times over.

        Well, I am afraid I am getting off-topic, sorry about that.

Node Type: perlquestion [id://1079281]
Front-paged by Arunbear