http://www.perlmonks.org?node_id=960446


in reply to Select only desired features from a text

The basic answer is to build a hash from the relationships in input1, then use it to pick out and process the information you need from input2. If I understand your problem correctly, I would probably create a hash of arrays keyed on the values from column 4, giving something like this:

%hoa = (
    'frog-n'      => ['alligator-n'],
    'crocodile-n' => ['alligator-n'],
);

(I'd use a hash of arrays rather than a simple hash because I assume more than one value from column 1 could be related to 'frog-n'. If that's not the case, a simple hash will do.) Even if input1 is 4GB, you're only keeping parts of certain lines, so the hash may be much smaller than the file.
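As a rough sketch of that first step, something like the following would build the hash of arrays. I'm guessing at the layout of input1 here: whitespace-separated fields with the related terms in columns 1 and 4. The sample lines, the placeholder columns, and the name @input1_lines are all made up for illustration; with the real file you would read line by line from a filehandle instead.

```perl
use strict;
use warnings;

# Hypothetical sample of input1. Assumed layout: whitespace-separated,
# related terms in columns 1 and 4; columns 2 and 3 are placeholders.
my @input1_lines = (
    "alligator-n x y frog-n",
    "alligator-n x y crocodile-n",
    "lizard-n x y frog-n",
);

my %hoa;
for my $line (@input1_lines) {
    my @cols = split ' ', $line;
    next unless @cols >= 4;    # skip short or blank lines

    # Key on column 4 and collect every column-1 value related to it.
    push @{ $hoa{ $cols[3] } }, $cols[0];
}

# Show what was collected.
for my $key (sort keys %hoa) {
    print "$key => @{ $hoa{$key} }\n";
}
```

The push autovivifies the inner array the first time a key is seen, so no separate initialization is needed.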

Then I'd start going through input2, building a new multilevel hash based on the array elements from %hoa, with sub-keys from the new file, so I would be assigning values like this:

# from the first line: frog-n about adage-n 8.8016
for my $key ( @{ $hoa{'frog-n'} } ) {
    $newhash{$key}{about}{'adage-n'} += 8.8016;
}

That sums repeated patterns as it goes, and it doesn't matter whether they are consecutive. When it's done, walk through that second hash and print it out in whatever format you like. There are still details to work out (for example, if you really want the individual values displayed next to their sum, you may want to store them in an array and add them up in the last step), but that's the basic structure.
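Here is a minimal sketch of the whole second pass, using the store-in-an-array variation so the individual values can be printed beside their sum. Only the first input2 line comes from your post; the other sample lines, the assumed four-field layout, and the names @input2_lines and @report are mine, so adjust them to your real data.

```perl
use strict;
use warnings;

# Lookup built from input1 (hard-coded here to keep the sketch short).
my %hoa = (
    'frog-n'      => ['alligator-n'],
    'crocodile-n' => ['alligator-n'],
);

# Hypothetical input2 lines; only the first one is from the original post.
# Assumed layout: word, relation, other word, score.
my @input2_lines = (
    "frog-n about adage-n 8.8016",
    "crocodile-n about adage-n 1.5",
    "frog-n like toad-n 2.25",
    "newt-n about adage-n 4.0",    # no entry in %hoa, so it is skipped
);

my %newhash;
for my $line (@input2_lines) {
    my ($word, $rel, $other, $score) = split ' ', $line;
    next unless defined $score && exists $hoa{$word};

    # File the score under every term related to $word.
    for my $key ( @{ $hoa{$word} } ) {
        push @{ $newhash{$key}{$rel}{$other} }, $score;
    }
}

# Last step: sum each stored list and format one report line per pattern.
my @report;
for my $key (sort keys %newhash) {
    for my $rel (sort keys %{ $newhash{$key} }) {
        for my $other (sort keys %{ $newhash{$key}{$rel} }) {
            my @parts = @{ $newhash{$key}{$rel}{$other} };
            my $sum = 0;
            $sum += $_ for @parts;
            push @report,
                "$key $rel $other $sum (" . join(' + ', @parts) . ")";
        }
    }
}
print "$_\n" for @report;
```

If you only need the totals, replace the push with += on a plain scalar and drop the summing loop at the end.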

Aaron B.