http://www.perlmonks.org?node_id=1013076

ejohnston7 has asked for the wisdom of the Perl Monks concerning the following question:

The first one who helps me with this to the point where I can get it to work on my dataset can get a $15 paypal payment. Or you can have me donate it to PerlMonks, or a charity, or you could let me keep the money. Either way, this is very important to me and a deadline is coming up, so I am throwing that offer out there. This is my first time posting here so please bear with me. I am new to programming and have been successfully manipulating my data in perl until now. My data is in the following format:

occurence1 A a__bear;c__black occurence2 B a__wolf;c__grey occurence3 A a__wolf;c__white occurence4 A a__bear;c__ occurence5 C a__wolf;c__grey occurence6 C a__bear;c__brown occurence7 A a__wolf;c__ occurence8 B a__wolf;c__ occurence9 C a__bear;c__black occurence10 C a__wolf;c__ occurence11 A a__wolf;c__red occurence12 B a__wolf;c__grey occurence13 C a__wolf;c__grey occurence14 C a__wolf;c__grey occurence15 B a__bear;c__brown occurence16 C a__bear;c__brown occurence17 A a__bear;c__ occurence18 A a__bear;c__brown occurence19 C a__wolf;c__white occurence20 B a__wolf;c__grey occurence21 B a__bear;c__ occurence22 B a__wolf;c__grey occurence23 A a__wolf;c__grey occurence24 A a__bear;c__brown occurence25 C a__bear;c__brown occurence26 A a__bear;c__brown occurence27 C a__bear;c__ occurence28 C a__bear;c__brown occurence29 B a__wolf;c__red occurence30 B a__wolf;c__grey

Data is tab delimited. In the real dataset, instead of 30 occurrences (data points/lines) , there are ~5 million. There are also A LOT more animals in my data than just wolves and bears (I just included those two for simplification), so it would be unhelpful to create a script that contained if statements that specifically use words like 'bear' and 'wolf', because then we would need to make if statements for the hundreds of possible animals that can be spotted. Similarly, there are hundreds of possible colors for each animal. However, if you show me how to get to my desired end with this data then I'm sure I can apply it to my real data set. The data contained within the first column is not important, it just means that each row is a unique data point, that information doesn't need to exist in my desired output file. The numbers in the second column represent the sample ID. The third column is (you guessed it!) the animal ID. 'a__' is a broad observation of the animal and 'c__' is more specific information about the observed animal. For example, for the first data point, I observed a black bear in sample A. Note that for some data points, the color of the animal was not obtained. I am trying to generate a consolidated table that gives the number of occurrences of each animal per sample. I am imagining a table where the column headers are Sample IDs A, B, C, etc. and the rows represent the animal. The table will be filled with values (integers) that represent the number of times the animal occurred in that specific sample. It should look something like this....

A B C wolf 4 7 5 bear 6 2 6

I also want a table that gives similar information, but the rows represent the possible colors of the animal, like this

A B C black 1 0 1 grey 1 5 3 white 1 0 1 brown 3 1 4 red 1 1 0

Note that for the second table, if an occurrence (data point) did not contain a color, than that data should not be considered. Notice how the sum of all of the points in the second table is less than the sum of all data points in the second table. Comments throughout the script that let me know whats being done at each step would be super helpful, so I can alter it in the future for related data manipulations. Thanks in advance to anyone who tries to help me with this. I have been reading through forums for many hours trying to get answers.