It probably works fine. But it's an inefficient solution: for each element in @all_genes, your grep must inspect every element of @covered_genes. That may not matter if your data set is small, but what little I know about genome projects leads me to believe that data sets are rarely small.
A more efficient solution would put the elements of @covered_genes into hash keys for constant-time lookups:
my %covered_lookup;
@covered_lookup{ @covered_genes } = ();
for my $item ( @all_genes ) {
    say $item unless exists $covered_lookup{$item};
}
With Perl, almost any time you're solving a set problem you can solve it efficiently with hashes.
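For instance, a minimal sketch of the usual hash idioms for set difference, intersection, and union (the gene names here are invented purely for illustration):

```perl
use strict;
use warnings;
use feature 'say';

# Invented example data, for illustration only
my @all_genes     = qw(brca1 egfr kras myc tp53);
my @covered_genes = qw(kras tp53 xyz1);

my %covered;
@covered{@covered_genes} = ();

# Difference: in @all_genes but not in @covered_genes
my @diff = grep { !exists $covered{$_} } @all_genes;

# Intersection: in both lists
my @both = grep { exists $covered{$_} } @all_genes;

# Union: keys of a hash built from both lists
my %seen;
@seen{ @all_genes, @covered_genes } = ();
my @union = sort keys %seen;

say "difference:   @diff";    # brca1 egfr myc
say "intersection: @both";    # kras tp53
say "union:        @union";   # brca1 egfr kras myc tp53 xyz1
```

Each operation is a single pass over one list, with O(1) lookups into the hash.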
If you really want to use ne, you have to check that the number of different elements equals the number of all the elements:
for my $gene (@all_genes) {
    say $gene if @covered_genes == grep $_ ne $gene, @covered_genes;
}
A benchmark shows your way, as inefficient as it might be, is about 15% faster.
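For anyone who wants to reproduce such a comparison, here is a sketch using the core Benchmark module. The gene lists are invented, and the measured ratio will vary with list sizes and Perl versions:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Invented test data -- real lists would come from the genome files
my @covered_genes = map { "gene$_" } 1 .. 100;
my @all_genes     = map { "gene$_" } 1 .. 200;

# The eq formulation: keep genes that no covered gene matches
sub via_eq {
    return [ grep { my $g = $_;
                    !grep { $_ eq $g } @covered_genes } @all_genes ];
}

# The ne formulation: keep genes that every covered gene differs from
sub via_ne {
    return [ grep { my $g = $_;
                    @covered_genes == grep { $_ ne $g } @covered_genes
                  } @all_genes ];
}

cmpthese( 100, { eq_grep => \&via_eq, ne_grep => \&via_ne } );
```

Note the `my $g = $_;` capture: the inner grep's `$_` shadows the outer one, so the outer value has to be saved first.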
Why not use a hash instead?
my %covered;
@covered{@covered_genes} = ();
exists $covered{$_} || say for @all_genes;
I am wondering why you don't use BioPerl's BLAST tools. It seems to me you are looking for a solution to a problem that is already solved.
See: http://www.bioperl.org/wiki/BLAST
Thank you for the hash suggestion. Yes, my current pilot project is small, but the data will be huge in the near future. I will change my code.
Just some words of warning. There is huge, and there is huge. If it's huge enough, storing the lookup table in memory may be problematic. You might have to move to a database approach, or to binary searches in disk-based files.
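To make the disk-based idea concrete, here is a sketch of a binary search over a plain sorted text file, using only core Perl. The function name is mine, and real code would also need to handle duplicate keys and locale/collation issues:

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Binary search for $key in a file whose lines are sorted in the
# default (ASCIIbetical) order. Returns the matching line or undef.
sub search_sorted_file {
    my ($path, $key) = @_;
    open my $fh, '<', $path or die "open $path: $!";
    my ($lo, $hi) = (0, -s $fh);   # $lo always sits at a line start
    while ($lo < $hi) {
        my $mid = int( ($lo + $hi) / 2 );
        seek $fh, $mid, SEEK_SET;
        <$fh>;                     # skip the (possibly partial) line at $mid
        my $pos = tell $fh;
        if ($pos >= $hi) {
            # window too small to bisect on a line boundary: scan it
            seek $fh, $lo, SEEK_SET;
            while (defined( my $line = <$fh> )) {
                chomp $line;
                return $line if $line eq $key;
                return undef if $line gt $key;
                last         if tell($fh) >= $hi;
            }
            return undef;
        }
        my $line = <$fh>;
        chomp $line;
        return $line if $line eq $key;
        if ($line lt $key) { $lo = tell $fh }
        else               { $hi = $pos    }
    }
    return undef;
}
```

Each probe costs one seek and at most two line reads, so the lookup is O(log N) disk accesses instead of loading the whole table into memory.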
When the data becomes too large to fit into a hash (I often work on comparing files that are several GB large, in some cases even dozens of GB), the approach I take is somewhat different: sorting both files (generally with the Unix sort utility) on the comparison key and reading them in parallel. Sorting takes some time, but no more (and usually significantly less) than inserting into a database and building an index.
And once the files are sorted, I can chain all kinds of processing: removing duplicates from each file, reading both files in parallel to find the differences, etc., and these processes are lightning fast, just about as fast as you can hope to get. They are very significantly faster than database access or binary search: you are just reading the files sequentially, with no lookup time through a DB index or a binary search. There is an O(N log N) penalty for the initial sorting, but everything else is O(N).
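The parallel read of two sorted inputs can be sketched in a few lines. This is my own illustration, not the poster's module, and it assumes both handles deliver lines in the same sort order:

```perl
use strict;
use warnings;
use feature 'say';

# Return the lines present in the first sorted stream but absent
# from the second -- one sequential pass over each stream, O(N).
sub in_first_only {
    my ($fh_a, $fh_b) = @_;
    my @only;
    my $b = <$fh_b>;
    chomp $b if defined $b;
    while (defined( my $a = <$fh_a> )) {
        chomp $a;
        # advance the second stream until it catches up with $a
        while (defined $b && $b lt $a) {
            $b = <$fh_b>;
            chomp $b if defined $b;
        }
        push @only, $a unless defined $b && $b eq $a;
    }
    return @only;
}

# Demo on in-memory handles; on disk you would open the sorted files
my $all     = join '', map "$_\n", sort qw(brca1 egfr kras myc tp53);
my $covered = join '', map "$_\n", sort qw(kras tp53);
open my $fh_all, '<', \$all     or die $!;
open my $fh_cov, '<', \$covered or die $!;
say for in_first_only($fh_all, $fh_cov);   # brca1, egfr, myc
```

The real edge cases (duplicates, uneven stream ends, records that compare equal on the key but differ elsewhere) are exactly what a reusable module would have to handle.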
The only slightly complicated part is the algorithm to read two files in parallel correctly (there are quite a few edge cases to manage), but I built a module to handle this complexity so that I no longer have to worry about it. I wrote that module at home, outside of my working hours, so that I could make it open source and freely available, but I haven't yet figured out how to package it correctly for CPAN (I have made a number of modules before, but this is the first one that might be of real interest outside of our working environment). I wrote detailed POD documentation, but, among other things, I do not know how to build a meaningful test suite. Also, I asked for a PAUSE account about six months ago and never got an answer.
I have compared my solution with three ETL software packages: two free (possibly even open-source) ones and a commercial one (with a $200,000+ licensing fee in our corporate environment). My solution is significantly faster (and, in my view, simpler) than all three of them, at least for the type of problem I regularly have to solve, despite the fact that these ETLs are written in a compiled language (C or Java) that is presumably natively faster than Perl. The reason is that, in a Perl program, when I read some data I can do everything I need with it, whereas in an ETL you usually have to build a data pipeline with one simple task per step, so you end up reading the data many times over.
Well, I am afraid I am getting off-topic, sorry about that.