Extracting data from CSV file + Extracting selected few lines through perl

NicholasNVS has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I have the following problem:

I have data in a CSV file (contig name, species name):

contig2, ponkan

contig2, orange

contig2, clementine

contig2, maxima

contig8, maxima

contig8, orange

contig8, clementine

contig8, ponkan

contig8, medica

contig8, maxima

contig8, ponkan

contig9, maxima

contig9, orange

contig9, clementine

contig9, ponkan

contig9, medica

contig9, maxima

contig9, ponkan

The real file is an Excel CSV file with more rows of data.

My output should look like:

contig8 = maxima, orange, clementine, ponkan, medica

contig9 = maxima, orange, clementine, ponkan, medica

The Perl script should read the first column and group the lines that share the same entry (for example, all lines with contig8). It should then evaluate the second column of those grouped lines and check whether it has all five elements (maxima, orange, clementine, ponkan, medica). Groups that have all five will be in the output file. Groups that do not have all five elements (for example, contig2) will not be selected and will not be in the output file.

I ask for help and suggestions from you all on how to proceed with the analysis of these data. Please excuse any mistakes, as English is my second language.

Thank you very much!

Re: Extracting data from CSV file by toolic (Bishop) on Dec 18, 2014 at 17:23 UTC
I haven't taken the time to absorb your dense spec, but I think reading all your data into a hash-of-arrays might help. I also show how to find the uniq types.

    my %data;
    while (<DATA>) {
        chomp;
        my ($k, $v) = split /,\s+/;
        push @{ $data{$k} }, $v;
    }
    for my $contig (sort keys %data) {
        my %uniq = map { $_ => 1 } map { s/_.*//; $_ } @{ $data{$contig} };
        print "$contig = ";
        print join ' ', sort keys %uniq;
        print "\n";
    }
    __DATA__
    contig_8, maxima_contig_63500
    contig_8, orange_scaffold_0026
    contig_8, clementine_scaffold_6
    contig_8, ponkan_scaffold_27456
    contig_8, medica_contig_12945
    contig_8, maxima_contig_235908
    contig_8, ponkan_scaffold_144138
    contig_9, maxima_contig_63500
    contig_9, orange_scaffold_0026
    contig_9, clementine_scaffold_6
    contig_9, ponkan_scaffold_27456
    contig_9, medica_contig_12945
    contig_9, maxima_contig_235908
    contig_9, ponkan_scaffold_144138
    contig_9, ponkan_scaffold_144138

output:

    contig_8 = clementine maxima medica orange ponkan
    contig_9 = clementine maxima medica orange ponkan

perldsc
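The hash-of-arrays reply collects the species per contig but does not apply the filter the question asks for (keep only contigs that have all five species). A minimal sketch of that step follows, using a hash-of-hashes instead. The helper name `complete_contigs`, the `@required` list, and the in-memory sample are assumptions made for illustration; the sample uses the short names from the question rather than the longer `_scaffold_` names.

```perl
use strict;
use warnings;

# Assumption: the five species the question lists as required.
my @required = qw(maxima orange clementine ponkan medica);

# Hypothetical helper: takes "contig, species" lines and returns one
# "contig = species, ..." string for each contig that has all required species.
sub complete_contigs {
    my @lines = @_;
    my (%seen, @order);
    for my $line (@lines) {
        $line =~ s/\r?\n\z//;            # drop "\n" or "\r\n"
        next unless $line =~ /\S/;       # skip blank lines
        my ($contig, $species) = split /\s*,\s*/, $line, 2;
        next unless defined $species;    # skip lines without a comma
        $contig  =~ s/^\s+//;
        $species =~ s/\s+$//;
        push @order, $contig unless exists $seen{$contig};  # keep first-seen order
        $seen{$contig}{$species} = 1;    # duplicates collapse automatically
    }
    my @out;
    for my $contig (@order) {
        my @missing = grep { !$seen{$contig}{$_} } @required;
        next if @missing;                # e.g. contig2 lacks medica
        push @out, "$contig = " . join(', ', @required);
    }
    return @out;
}

# Sample rows taken from the question.
my @sample = split /\n/, <<'END';
contig2, ponkan
contig2, orange
contig2, clementine
contig2, maxima
contig8, maxima
contig8, orange
contig8, clementine
contig8, ponkan
contig8, medica
contig8, maxima
contig9, maxima
contig9, orange
contig9, clementine
contig9, ponkan
contig9, medica
END

print "$_\n" for complete_contigs(@sample);
```

For a real file, the same helper could be fed the lines read from an opened filehandle instead of the in-memory sample.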
Re: Extracting data from CSV file by AnomalousMonk (Archbishop) on Dec 18, 2014 at 21:06 UTC
Not directly related to your OPed question, but if you're doing a lot of CSV file parsing, please consider one of the many fine CSV support modules available through CPAN (Text::CSV is the prime candidate) and maybe you can reduce your ~~Tylenol~~ | Acetaminophen intake somewhat.
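For readers who have not used the module, here is a hedged sketch of what reading such rows with Text::CSV might look like. It assumes Text::CSV is installed from CPAN (it is not a core module). The second input row, with a quoted field containing a comma, is hypothetical and is there only to show a case that a plain split would get wrong.

```perl
use strict;
use warnings;
use Text::CSV;    # CPAN module, assumed to be installed

# allow_whitespace tolerates the space after each comma in the sample data.
my $csv = Text::CSV->new({ binary => 1, auto_diag => 1, allow_whitespace => 1 })
    or die "Cannot create Text::CSV object: " . Text::CSV->error_diag;

# Hypothetical input: the second row has a quoted field with an embedded comma.
my $input = qq{contig8, maxima\n"contig9", "orange, sweet"\n};
open my $fh, '<', \$input or die "Cannot open in-memory file: $!";

my @rows;
while (my $row = $csv->getline($fh)) {   # $row is an array reference of fields
    push @rows, $row;
}
close $fh;

print join(' | ', @$_), "\n" for @rows;
```

With a real file, the in-memory handle would be replaced by a handle opened on the file itself.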
Re^2: Extracting data from CSV file by Laurent_R (Canon) on Dec 18, 2014 at 22:36 UTC
Well, I am sorry, I understand your point, but I have to disagree somewhat with that.

Yes, in the general case of CSV files with complicated rules on separating, escaping and quoting characters (and so on), by all means use Text::CSV or some other specialized CSV module.

But you don't need a 30-ton truck to deliver one TV set to a residential home. (It might even be counterproductive.) To me, using a CSV module for such an extremely simple CSV file is just plain over-engineering or technical overkill. For that, split is just the right tool: efficient, simple, very well integrated into the language, easy to use. I do not see any reason to make things more complicated than they should be when they can be very simple. This is really not my view of the Perl philosophy: make simple things simple...

Update: Hehe, 3 negative XP points for stating something that should be obvious, i.e. that there is not just one single solution for absolutely everything. I thought the monks here had more freedom to think independently. ;-)
Re^3: Extracting data from CSV file by Tux (Canon) on Dec 19, 2014 at 07:46 UTC
Even on simple CSV data, Text::CSV_XS (not Text::CSV) can outperform plain perl (using split) given the right options and using the right methods. YMMV. Note that for the given data it most likely will not be faster.

The main reason to start using Text::CSV_XS (or Text::CSV) even with simple CSV data is that it makes moving onward easier when the data changes (yes, it will, eventually) or on other projects that look simple but are not.

For programmers that know both approaches, you are definitely right. No single module (or tool or even language) is the best choice for every problem, but when in doubt, choose the option that is most versatile and best documented and that will also help you with similar problems in the (near) future. Even the simplest change, like changing the line ending from `\n` to `\r\n`, will make you happy to have made that choice as early as possible.

Enjoy, Have FUN! H.Merijn
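The line-ending remark can be illustrated without any module. The sketch below uses a hypothetical in-memory record ending in `\r\n` and compares the chomp-plus-split approach with an explicit strip of either line ending. It assumes the default input record separator and no `:crlf` I/O layer, so the `\r` reaches the program untouched.

```perl
use strict;
use warnings;

# Hypothetical record as read from a file saved with "\r\n" line endings.
my $record = "contig8, medica\r\n";

# Approach 1: chomp then split. chomp removes only the trailing "\n"
# (with the default $/), so the "\r" stays glued to the last field.
my $line = $record;
chomp $line;
my (undef, $naive) = split /,\s+/, $line;

# Approach 2: strip either kind of line ending explicitly before splitting.
(my $clean = $record) =~ s/\r?\n\z//;
my (undef, $robust) = split /,\s+/, $clean;

printf "naive:  length %d, equals 'medica'? %s\n",
    length($naive), ($naive eq 'medica' ? 'yes' : 'no');
printf "robust: length %d, equals 'medica'? %s\n",
    length($robust), ($robust eq 'medica' ? 'yes' : 'no');
```

In the first approach the species name carries an invisible trailing `\r`, so a later comparison against `medica` or a hash lookup on the name would silently miss.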
Re^3: Extracting data from CSV file by AnomalousMonk (Archbishop) on Dec 19, 2014 at 17:39 UTC
I kinda agree with your disagreement, but then, on the other hand, not so much. I tried to qualify my post ("Not directly related to your OPed question, but if you're doing a lot of CSV file parsing...") to make it clear that what I had in mind was more like using a 30-ton truck to deliver a 30-ton load. And then there are also Tux's well-founded points (++).
Re^4: Extracting data from CSV file by Laurent_R (Canon) on Dec 19, 2014 at 17:59 UTC
Re^5: Extracting data from CSV file by AnomalousMonk (Archbishop) on Dec 19, 2014 at 18:03 UTC
Re^3: Extracting data from CSV file by Anonymous Monk on Dec 19, 2014 at 08:58 UTC
"Update: Hehe, 3 negative XP points for stating something that should be obvious, i.e. that there is not just one single solution for absolutely everything. I thought the monks here had more freedom to think independently. ;-)"

If it is so obvious, why did it need stating?

Basically you ~~pounced on~~ ~~chided~~ ~~criticize~~ complain at AnomalousMonk for saying: now that you've gotten a solution, FYI this other thing exists also.

The split solution did not need defending. It was a code solution, ready to use, not a link to a manual for another thing that could be used ... an actual solution for immediate data.

OP: I'm going fishing today, I need boots quick, help
toolic: Here is some nylon and duck tape, it should last you one trip easy
AnomalousMonk: you know, indirectly, for about the same price (free) there are boots on CPAN
Laurent_R: I'm sorry, I disagree, free boots are a 30-ton truck of over-engineering, nylon+ducktape is the Perl philosophy of making things simple

Still funny :)
Re^3: Extracting data from CSV file by Anonymous Monk on Dec 18, 2014 at 23:19 UTC
"Well, I am sorry, I understand your point, but I have to disagree somewhat with that."

Hehe ... hey buddy, if you're doing a lot of this kind of thing, there are modules for that, you know, FYI ... it's overkill to consider alternatives

