PerlMonks  

Extracting data from CSV file + Extracting selected few lines through perl

by NicholasNVS (Initiate)
on Dec 18, 2014 at 17:00 UTC ( [id://1110767]=perlquestion )

NicholasNVS has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I have the following problem:

I have data in a CSV file (contig name, species name):

contig2, ponkan

contig2, orange

contig2, clementine

contig2, maxima

contig8, maxima

contig8, orange

contig8, clementine

contig8, ponkan

contig8, medica

contig8, maxima

contig8, ponkan

contig9, maxima

contig9, orange

contig9, clementine

contig9, ponkan

contig9, medica

contig9, maxima

contig9, ponkan

contig9, ponkan

.

.

.

(The Excel CSV file has more rows of data.)

My output should look like:

contig8 = maxima, orange, clementine, ponkan, medica

contig9 = maxima, orange, clementine, ponkan, medica

The Perl script should read the first column and group the lines that have the same entry (e.g. all lines with contig8). It should then evaluate the second column of those grouped lines and check whether it contains all five elements (maxima, orange, clementine, ponkan, medica). Groups that contain all five go into the output file. Groups that do not contain all five elements (e.g. contig2) are not selected and will not appear in the output file.

I ask for help and suggestions from you all on how to proceed with the analysis of these data. Please excuse any mistakes, as English is my second language.

Thank you very much!

Replies are listed 'Best First'.
Re: Extracting data from CSV file
by toolic (Bishop) on Dec 18, 2014 at 17:23 UTC
    I haven't taken the time to absorb your dense spec, but I think reading all your data into a hash-of-arrays might help. I also show how to find the uniq types.
    my %data;
    while (<DATA>) {
        chomp;
        my ($k, $v) = split /,\s+/;
        push @{ $data{$k} }, $v;
    }
    for my $contig (sort keys %data) {
        my %uniq = map { $_ => 1 } map { s/_.*//; $_ } @{ $data{$contig} };
        print "$contig = ";
        print join ' ', sort keys %uniq;
        print "\n";
    }
    __DATA__
    contig_8, maxima_contig_63500
    contig_8, orange_scaffold_0026
    contig_8, clementine_scaffold_6
    contig_8, ponkan_scaffold_27456
    contig_8, medica_contig_12945
    contig_8, maxima_contig_235908
    contig_8, ponkan_scaffold_144138
    contig_9, maxima_contig_63500
    contig_9, orange_scaffold_0026
    contig_9, clementine_scaffold_6
    contig_9, ponkan_scaffold_27456
    contig_9, medica_contig_12945
    contig_9, maxima_contig_235908
    contig_9, ponkan_scaffold_144138
    contig_9, ponkan_scaffold_144138

    output:

    contig_8 = clementine maxima medica orange ponkan
    contig_9 = clementine maxima medica orange ponkan
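
The hash-of-arrays idea can be extended with the filter the question asks for: keep only contigs that contain all five species. What follows is a minimal sketch, not toolic's code. The sub name `complete_contigs`, the `@REQUIRED` list, and the inline sample lines are assumptions made for illustration, and it uses the plain names from the question (contig8, maxima) rather than suffixed names such as maxima_contig_63500.

```perl
use strict;
use warnings;

# Species every contig must contain (taken from the question).
my @REQUIRED = qw(maxima orange clementine ponkan medica);

# Hypothetical helper: takes "contig, species" lines and returns a hash ref
# mapping each complete contig to its species in first-seen order.
sub complete_contigs {
    my @lines = @_;
    my (%seen, %order);
    for my $line (@lines) {
        $line =~ s/\r?\n\z//;                      # tolerate \n and \r\n
        my ($contig, $species) = split /\s*,\s*/, $line;
        next unless defined $species && length $species;
        # Record each species once per contig, keeping first-seen order.
        push @{ $order{$contig} }, $species unless $seen{$contig}{$species}++;
    }
    my %complete;
    for my $contig (keys %order) {
        # Count the required species this contig lacks.
        my $missing = grep { !$seen{$contig}{$_} } @REQUIRED;
        $complete{$contig} = $order{$contig} unless $missing;
    }
    return \%complete;
}

my @sample = (
    "contig2, ponkan", "contig2, orange", "contig2, clementine", "contig2, maxima",
    "contig8, maxima", "contig8, orange", "contig8, clementine",
    "contig8, ponkan", "contig8, medica", "contig8, maxima", "contig8, ponkan",
);

my $complete = complete_contigs(@sample);
for my $contig (sort keys %$complete) {
    # prints: contig8 = maxima, orange, clementine, ponkan, medica
    print "$contig = ", join(', ', @{ $complete->{$contig} }), "\n";
}
```

In this sample contig2 is dropped because it never lists medica. For a real file, the lines would come from a filehandle instead of `@sample`.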

Re: Extracting data from CSV file
by AnomalousMonk (Archbishop) on Dec 18, 2014 at 21:06 UTC

    Not directly related to your OPed question, but if you're doing a lot of CSV file parsing, please consider one of the many fine CSV support modules available through CPAN (Text::CSV is the prime candidate) and maybe you can reduce your Tylenol | Acetaminophen intake somewhat.
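
For readers who have not used the module, reading this kind of two-column file with Text::CSV looks roughly like the sketch below. It assumes Text::CSV is installed from CPAN. The sub name `read_pairs` and the in-memory sample are assumptions for illustration only. The `allow_whitespace` option makes the parser drop the blank that follows each comma.

```perl
use strict;
use warnings;
use Text::CSV;

# Hypothetical helper: read "contig, species" rows from an open filehandle
# and return a hash of arrays keyed by contig.
sub read_pairs {
    my ($fh) = @_;
    my $csv = Text::CSV->new({ binary => 1, auto_diag => 1, allow_whitespace => 1 });
    my %data;
    while (my $row = $csv->getline($fh)) {
        my ($contig, $species) = @$row;
        next unless defined $species;      # skip rows without a second field
        push @{ $data{$contig} }, $species;
    }
    return \%data;
}

# In-memory sample standing in for the real file (an assumption).
my $sample = "contig8, maxima\ncontig8, \"orange\"\ncontig9, medica\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $data = read_pairs($fh);
close $fh;

print "$_ = @{ $data->{$_} }\n" for sort keys %$data;
```

The quoted field in the sample is there to show the kind of input that a bare split would pass through with its quotes still attached.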

      Well, I am sorry, I understand your point, but I have to disagree somewhat with that. Yes, in the general case of CSV files with complicated rules on separating, escaping and quoting characters (and so on), by all means use Text::CSV or some other specialized CSV module.

      But you don't need a 30-ton truck to deliver one TV set to a residential home. (It might even be counterproductive.)

      To me, using a CSV module for such an extremely simple CSV file is just plain over-engineering or technical overkill. For that, split is just the right tool: efficient, simple, very well integrated into the language, and easy to use. I do not see any reason to make things more complicated than they should be when they can be very simple. This is really not my view of the Perl philosophy: make simple things simple...

      Update: Hehe, 3 negative XP points for stating something that should be obvious, i.e. that there is not just one single solution for absolutely everything. I thought the monks here had more freedom to think independently. ;-)

        Even on simple CSV data, Text::CSV_XS (not Text::CSV) can outperform plain Perl (using split) given the right options and using the right methods. YMMV. Note that for the given data it most likely will not be faster.

        The main reason to start using Text::CSV_XS (or Text::CSV) even with simple CSV data is that it makes moving onward easier when the data changes (yes, it will, eventually) or on other projects that look simple but are not.

        For programmers that know both approaches, you are definitely right. No single module (or tool, or even language) is the best choice for every problem, but when in doubt, choose the option that is most versatile and best documented, and that will also help you with similar problems in the (near) future.

        Even the simplest change, like changing the line ending from \n to \r\n, will make you happy to have made that choice as early as possible.
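
The line-ending point is easy to reproduce in plain Perl. In the sketch below the sample strings and the sub name `species_of` are assumptions for illustration. With the default `$/`, chomp removes only the trailing \n, so a line that ends in \r\n keeps a stray \r on its last field and the species no longer compares equal to 'maxima'.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the chomp + split approach from this thread.
sub species_of {
    my ($line) = @_;
    chomp $line;                          # removes "\n" only (default $/)
    my (undef, $species) = split /,\s+/, $line;
    return $species;
}

my $unix = species_of("contig8, maxima\n");
my $dos  = species_of("contig8, maxima\r\n");

print "unix line matches\n"     if $unix eq 'maxima';
print "dos line does not match\n" if $dos ne 'maxima';   # still ends in "\r"

# One plain-Perl workaround: strip either ending explicitly.
(my $fixed = "contig8, maxima\r\n") =~ s/\r?\n\z//;
print "fixed line ends cleanly\n" if $fixed eq 'contig8, maxima';
```

The explicit substitution works, but it is one more detail to remember in every script, which is the kind of upkeep the paragraph above is warning about.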


        Enjoy, Have FUN! H.Merijn

        I kinda agree with your disagreement, but then, on the other hand, not so much. I tried to qualify my post ("Not directly related to your OPed question, but if you're doing a lot of CSV file parsing...") to make it clear that what I had in mind was more like using a 30-ton truck to deliver a 30-ton load.

        And then there are also Tux's well-founded points (++).

        Update: Hehe, 3 negative XP points for stating something that should be obvious, i.e. that there is not just one single solution for absolutely everything. I thought the monks here had more freedom to think independently. ;-)

        If it is so obvious why did it need stating?

        Basically you pounced on (chided, criticized, complained at) AnomalousMonk for saying: now that you've gotten a solution, FYI, this other thing also exists.

        The split solution did not need defending. It was a code solution, ready to use, not a link to a manual for another thing that could be used ... an actual solution for the immediate data.

        OP:
        I'm going fishing today, I need boots quick, help
        toolic:
        Here is some nylon and duck tape, it should last you one trip easy
        AnomalousMonk:
        you know, indirectly, for about the same price (free) there are boots on CPAN
        Laurent_R:
        I'm sorry, I disagree, free boots are a 30-ton truck of over-engineering, nylon+ducktape is the Perl philosophy of making things simple

        Still funny :)

        Well, I am sorry, I understand your point, but I have to disagree somewhat with that.

        Hehe ... hey buddy, if you're doing a lot of this kind of thing, there are modules for that, you know, FYI ... it's overkill to consider alternatives

Node Type: perlquestion [id://1110767]
Approved by Ratazong