NicholasNVS has asked for the wisdom of the Perl Monks concerning the following question:
Hi Monks,
I have the following problem:
I have data in a CSV file (contig name, species name):
contig2, ponkan
contig2, orange
contig2, clementine
contig2, maxima
contig8, maxima
contig8, orange
contig8, clementine
contig8, ponkan
contig8, medica
contig8, maxima
contig8, ponkan
contig9, maxima
contig9, orange
contig9, clementine
contig9, ponkan
contig9, medica
contig9, maxima
contig9, ponkan
contig9, ponkan
...
(The Excel CSV file has more rows of data.)
My output should look like:
contig8 = maxima, orange, clementine, ponkan, medica
contig9 = maxima, orange, clementine, ponkan, medica
The Perl script should read the first column and group the lines that have the same entry (for example, all lines with contig8), then evaluate the second column of each group and check whether it has all five elements (maxima, orange, clementine, ponkan, medica). Groups that have all five will be written to the output file. Groups that do not have all five elements (for example, contig2) will not be selected and will not be in the output file.
I ask for help and suggestions from you all on how to proceed in the analysis of these data. Please excuse any mistakes, as English is my second language.
Thank you very much!
Re: Extracting data from CSV file
by toolic (Bishop) on Dec 18, 2014 at 17:23 UTC
I haven't taken the time to absorb your dense spec, but I think reading all your data into a hash-of-arrays might help. I also show how to find the unique types.
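use strict;
use warnings;

my %data;
while (<DATA>) {
    chomp;
    my ($k, $v) = split /,\s+/;    # contig name, hit name
    push @{ $data{$k} }, $v;       # hash of arrays: contig => [ hits ]
}

for my $contig (sort keys %data) {
    # Strip everything from the first underscore to get the species name.
    # s///r (Perl 5.14+) returns the result and leaves the stored data intact.
    my %uniq = map { $_ => 1 } map { s/_.*//r } @{ $data{$contig} };
    print "$contig = ";
    print join ' ', sort keys %uniq;
    print "\n";
}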
__DATA__
contig_8, maxima_contig_63500
contig_8, orange_scaffold_0026
contig_8, clementine_scaffold_6
contig_8, ponkan_scaffold_27456
contig_8, medica_contig_12945
contig_8, maxima_contig_235908
contig_8, ponkan_scaffold_144138
contig_9, maxima_contig_63500
contig_9, orange_scaffold_0026
contig_9, clementine_scaffold_6
contig_9, ponkan_scaffold_27456
contig_9, medica_contig_12945
contig_9, maxima_contig_235908
contig_9, ponkan_scaffold_144138
contig_9, ponkan_scaffold_144138
output:
contig_8 = clementine maxima medica orange ponkan
contig_9 = clementine maxima medica orange ponkan
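The code above lists the unique species per contig but does not yet apply the filter from the question (keep only contigs that have all five species). A minimal sketch of that extra step follows, assuming the five required names are exactly those given in the question and that lines look like the samples in this thread. The helper name `complete_contigs` is hypothetical, not something from the thread.

```perl
use strict;
use warnings;

# Species every reported contig must have (taken from the question).
my @required = qw(maxima orange clementine ponkan medica);

# complete_contigs is a hypothetical helper name used for illustration.
# Takes a list of "contig, species" lines and returns a hash ref:
# contig name => array ref of its distinct species in first-seen order,
# keeping only contigs that have every required species.
sub complete_contigs {
    my @lines = @_;
    my (%seen, %species);
    for my $line (@lines) {
        (my $copy = $line) =~ s/^\s+|\s+$//g;    # trim, including line ending
        next unless length $copy;                # skip blank lines
        my ($contig, $name) = split /\s*,\s*/, $copy;
        next unless defined $name;               # skip lines without a comma
        $name =~ s/_.*//;    # drop a suffix such as "_scaffold_6" if present
        push @{ $species{$contig} }, $name unless $seen{$contig}{$name}++;
    }
    my %complete;
    for my $contig (keys %species) {
        my $missing = grep { !$seen{$contig}{$_} } @required;
        $complete{$contig} = $species{$contig} unless $missing;
    }
    return \%complete;
}

# Sample rows copied from the question.
my @sample = split /\n/, <<'END';
contig2, ponkan
contig2, orange
contig2, clementine
contig2, maxima
contig8, maxima
contig8, orange
contig8, clementine
contig8, ponkan
contig8, medica
contig8, maxima
contig8, ponkan
END

my $result = complete_contigs(@sample);
print "$_ = ", join(', ', @{ $result->{$_} }), "\n" for sort keys %$result;
# prints: contig8 = maxima, orange, clementine, ponkan, medica
```

contig2 is dropped because it never lists medica. Species are kept in first-seen order here to match the output format in the question; sorting them instead would be an equally reasonable choice.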
Re: Extracting data from CSV file
by AnomalousMonk (Archbishop) on Dec 18, 2014 at 21:06 UTC
Not directly related to your OPed question, but if you're doing a lot of CSV file parsing, please consider one of the many fine CSV support modules available through CPAN (Text::CSV is the prime candidate) and maybe you can reduce your Tylenol | Acetaminophen intake somewhat.
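For readers who have not used the module, here is a small sketch of what reading this kind of two-column file with Text::CSV might look like. It assumes Text::CSV is installed from CPAN (it is not part of core Perl). The helper name `read_pairs` is hypothetical, and the quoted field in the sample input is made up purely to show a case that a plain split would break apart.

```perl
use strict;
use warnings;
use Text::CSV;    # CPAN module, not part of core Perl

# read_pairs is a hypothetical helper name used for illustration.
# Reads two-column CSV from an open filehandle and returns a hash ref
# of arrays: first column => [ second-column values ].
sub read_pairs {
    my ($fh) = @_;
    my $csv = Text::CSV->new({
        binary           => 1,    # accept any bytes inside fields
        auto_diag        => 1,    # report parse errors automatically
        allow_whitespace => 1,    # ignore spaces around the separator
    });
    my %data;
    while (my $row = $csv->getline($fh)) {
        my ($contig, $species) = @$row;
        push @{ $data{$contig} }, $species;
    }
    return \%data;
}

# Made-up sample input held in a string; the quoted field contains a comma.
my $sample = qq{contig8, maxima\ncontig8, "orange, sweet"\ncontig9, ponkan\n};
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $data = read_pairs($fh);
close $fh;

print "$_ = ", join(' | ', @{ $data->{$_} }), "\n" for sort keys %$data;
```

The resulting hash of arrays has the same shape as the one built with split above, so the rest of a script would not need to change.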
| [reply] |
Well, I am sorry, I understand your point, but I have to disagree somewhat with that. Yes, in the general case of CSV files with complicated rules on separating, escaping and quoting characters (and so on), by all means use Text::CSV or some other specialized CSV module.
But you don't need a 30-ton truck to deliver one TV set to a residential home. (It might even be counterproductive.)
To me, using a CSV module for such an extremely simple CSV file is just plain over-engineering or technical overkill. For that, split is just the right tool: efficient, simple, very well integrated into the language, and easy to use. I do not see any reason to make things more complicated than they need to be when they can be very simple. That is really not my view of the Perl philosophy, which is to make simple things simple...
Update: Hehe, 3 negative XP points for stating something that should be obvious, i.e. that there is not just one single solution for absolutely everything. I thought the monks here had more freedom to think independently. ;-)
| [reply] |
Even on simple CSV data, Text::CSV_XS (not Text::CSV) can outperform plain Perl (using split) given the right options and the right methods. YMMV. Note that for the given data it most likely will not be faster.
The main reason to start using Text::CSV_XS (or Text::CSV) even with simple CSV data is that it makes moving onward easier when the data changes (yes, it will, eventually) or when you work on other projects that look simple but are not.
For programmers who know both approaches, you are definitely right. No single module (or tool, or even language) is the best choice for every problem, but when in doubt, choose the option that is most versatile and best documented, and that will also help you with similar problems in the (near) future.
Even the simplest change, like changing the line ending from \n to \r\n, will make you happy to have made that choice as early as possible.
Enjoy, Have FUN! H.Merijn
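To make the line-ending point concrete, here is a small sketch of how the chomp + split approach behaves when a line ends in \r\n, next to one possible pure-Perl adjustment. The two sub names are hypothetical, and the strings are passed in directly so that no I/O layer translates the line endings.

```perl
use strict;
use warnings;

# parse_line_chomp and parse_line_generic are hypothetical names
# used only for this illustration.

# Mirrors the split-based approach: chomp removes only "\n" (the default $/).
sub parse_line_chomp {
    my ($line) = @_;
    chomp $line;
    return split /,\s+/, $line;
}

# Strips any trailing line ending: \R (Perl 5.10+) matches "\n", "\r\n" and "\r".
sub parse_line_generic {
    my ($line) = @_;
    $line =~ s/\R\z//;
    return split /,\s+/, $line;
}

my @unix   = parse_line_chomp("contig8, medica\n");
my @dos    = parse_line_chomp("contig8, medica\r\n");
my @dos_ok = parse_line_generic("contig8, medica\r\n");

printf "%d %d %d\n", length($unix[1]), length($dos[1]), length($dos_ok[1]);
# prints "6 7 6": the stray "\r" makes "medica\r" a different hash key than "medica"
```

With the stray carriage return, a check for the five species would silently fail, because "medica\r" is not equal to "medica".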
| [reply] [d/l] [select] |
I kinda agree with your disagreement, but then, on the other hand, not so much. I tried to qualify my post ("Not directly related to your OPed question, but if you're doing a lot of CSV file parsing...") to make it clear that what I had in mind was more like using a 30-ton truck to deliver a 30-ton load.
And then there are also Tux's well-founded points (++).
| [reply] |
"Update: Hehe, 3 negative XP points for stating something that should be obvious, i.e. that there is not just one single solution for absolutely everything. I thought the monks here had more freedom to think independently. ;-)"
If it is so obvious, why did it need stating?
Basically you pounced on / chided / criticized / complained at AnomalousMonk for saying: now that you've gotten a solution, FYI, this other thing also exists.
The split solution did not need defending; it was a code solution, ready to use, not a link to a manual for another thing that could be used ... an actual solution for the immediate data.
- OP: I'm going fishing today, I need boots quick, help
- toolic: Here is some nylon and duck tape, it should last you one trip easy
- AnomalousMonk: you know, indirectly, for about the same price (free) there are boots on CPAN
- Laurent_R: I'm sorry, I disagree, free boots are a 30-ton truck of over-engineering, nylon + duck tape is the Perl philosophy of making things simple
Still funny :)
| [reply] |
"Well, I am sorry, I understand your point, but I have to disagree somewhat with that."
Hehe ... hey buddy, if you're doing a lot of this kind of thing, there are modules for that, you know, FYI ... it's overkill to consider alternatives