http://www.perlmonks.org?node_id=950315

garyboyd has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I would like to average the data for certain rows in a tab-delimited file, for example the file looks like this:

Name10128 1A 33.2 A Test 0.71565 -0.011379692 Name4382 1A 34.3 A Test 0.9043 -0.035366577 Name1635 1A 34.9 A Test 0.836 -0.053808688 Name10267 1A 34.9 A Test 0.04575625 0.050878715 Name10039 1A 34.9 A Test 0.7606 -0.011123925 Name22270 1A 44.7 A Test 0.047 0.122532651 Name22285 1A 44.7 A Test 0 0.298871748 Name22701 1A 44.7 A Test 0.045 0.05038769 Name10054 1A 46.4 A Test 0 0.641618497

I want to scan down column3 and where there are identical entries, for example 34.9 and 44.7 I want to average the values in columns 6 and 7. So for the three rows that have 34.9 in column3, I would need to average the column6 values 0.836, 0.04575625 and 0.7606 and then replace those values in column6 with the average value, and do the same in column 7 for the values found there. What would be the most sensible way to proceed with this?

Replies are listed 'Best First'.
Re: average a column in tab-delimited file
by GrandFather (Saint) on Jan 27, 2012 at 10:47 UTC

    What have you tried? How would you do it on paper? When you do it on paper what are the steps you actually follow?

    Are the rows sorted by column 3, or at least are all the matching rows adjacent? What value does column 1 get for the combined rows? Are the contents of the remaining columns really invariant?

    True laziness is hard work
Re: average a column in tab-delimited file
by fisher (Priest) on Jan 27, 2012 at 10:39 UTC
    Well, the general algorithm might look like this:

    First, open your file using open().

    Then, in cycle, you read your file line by line, split it using split() or s///, find values of your interest, do calculations and form a content for new file in an array of lines. After cycle is done, just close this file, maybe rename it for backup version, then open it for write and push there your array with newly formed lines.

      Yep,

      you just can't start reading columns in a file. Cause there ain't any. There ain't no lines in a file actually either.

      There is no spoooon.

      Well that is - a text file in your computer should be just a big string of values. Those lines you see in your text editor are there only because once in a while in this big string there are line break characters, so your text editor can know when to make a break. No program can just bam go and start reading "columns" from this.

      So you have to make those columns yourself. As fisher said - the basic algorithm is to take your file line by line, split those lines(by whatever separates your values(whitespace,comma or tab), then you take each fourth(counting from zero) element from those split lines(you were looking for an average of third column if i remember correctly. So if you append those fourth elements in an array, print out somewhere or whatnot, then you've got your third column.

        To make life easier for you I will post a piece of code in here. Earlier today I had to transpose a large text file so I blurted out this piece of code. (Takes the rows from a file and makes them into a column)

        It is a rather stupid code. (I'm a noob). For example it would not be prudent to actually load all the file into an array, but better to do it line by line in a while($line=<>) loop. etc. But Being a rather stupid code it should also be quite readily understandable and demonstrates the idea of getting columns out from your lines pretty well.

        #!/usr/bin/perl -w use strict; use warnings; my @data=<>; my @column=(); for (my $count=3;$count<=4215;$count++) { foreach (@data) { my @row=split(/\t/, $_); push (@column,$row[$count]); } print "@column\n"; @column=() }

        As you can see this piece of code was used to extract columns 3 to 4215. You only need to extract one single column right now. Enjoy!

Re: average a column in tab-delimited file
by Anonymous Monk on Jan 27, 2012 at 10:35 UTC
Re: average a column in tab-delimited file
by sundialsvc4 (Abbot) on Jan 27, 2012 at 13:57 UTC

    Umm, maybe (seriously...) the way to do it is to put it into a database (e.g. an SQLite file), where the problem then becomes:   SELECT COL1, COL2, AVERAGE(COL3), AVERAGE(COL4) FROM MEH GROUP BY COL1, COL2;

    I am, once again, being perfectly serious.   Consider how many other results you might also need to obtain against this data, either now or in the near future, which would be reduced to “a query, not a program,” if the format of this disk file were, instead of being delimited text, SQLite, with all of the power that the aforesaid public-domain tool brings to bear.

    Since (IIRC...) SQLite already knows how to suck a delimited-file in, to make a table out of it, you literally might not need to “program” anything at all to accomplish this entire task.

Re: average a column in tab-delimited file
by Anonymous Monk on Jan 28, 2012 at 00:41 UTC
    Should output look like this?
    Name10128 1A 33.2 A Test 0.715650000 -0.011 +379692 Name4382 1A 34.3 A Test 0.904300000 -0.035 +366577 Name1635 1A 34.9 A Test 0.547452083 -0.004 +684633 Name10267 1A 34.9 A Test 0.547452083 -0.004 +684633 Name10039 1A 34.9 A Test 0.547452083 -0.004 +684633 Name22270 1A 44.7 A Test 0.030666667 0.1572 +64030 Name22285 1A 44.7 A Test 0.030666667 0.1572 +64030 Name22701 1A 44.7 A Test 0.030666667 0.1572 +64030 Name10054 1A 46.4 A Test 0.000000000 0.6416 +18497

      thanks for all the suggestions, and yes the output should look like the output that anonymous monk posted above.

      I can read the file into an array, but I wasn't sure how to proceed through the array to average columns 6 and 7.

      #!/usr/bin/perl use strict; use warnings; use Getopt::Long; my $infile; my @fields; my $i; my $j; GetOptions ( "infile=s" => \$infile, ); open INFILE, "<$infile" or die $!; #open RES, ">result.txt" or die $!; while (<INFILE>){ @fields = split(/\t+/, $_); my ($name, $chr, $location, $gen, $dom, $pval, $fst) = @fields +[0..6]; }

      once I have my values in an array, how do I find average of all similar values for column 3? eg how do I find all of those that have 34.9 and average the result in column 6?

        Please explain in words, how you would do it, using paper and pencil