http://www.perlmonks.org?node_id=960408


in reply to Select only desired features from a text

The general procedure is to first read the file that contains the interesting mapping, read the mapping into a hash, and then traverse the second file and do the transformation of these lines based on the hash.

Something like this:

use 5.010; use strict; use warnings; use autodie; my %map; open my $IN, '<', 'f1'; while (<$IN>) { my ($first, undef, $type, $fourth) = split; $map{$fourth} = $first if $type eq 'coord'; } close $IN; open $IN, '<', 'f2'; while (<$IN>) { my ($first, $rest) = split /\s/, $_, 2; if ($map{$first}) { print "not_$map{$first} $rest" } } close $IN;

Note that the variable names are quite terrible, because I don't know what the values stand for.

Replies are listed 'Best First'.
Re^2: Select only desired features from a text
by remluvr (Sexton) on Mar 19, 2012 at 15:15 UTC

    Thanks, this was really useful, but my problem is I don't want to have duplicates. Given this output:

    not_alligator-n about adage-n 8.8016 not_alligator-n appearance-1 broad-j 11.9640 not_alligator-n coord albino-n 6.7667 not_alligator-n be jumper-n 6.0272 not_alligator-n be key-n 3.8779 not_alligator-n of body-n 8.3063 not_alligator-n of bone-n 20.7982 not_alligator-n of book-n 0.4229 not_alligator-n be key-n 3.2572 not_alligator-n of chorus-n 24.9515 not_alligator-n of book-n 2.3460 not_alligator-n obj sit-v 3.1857 not_alligator-n obj size-v 57.3257 not_alligator-n obj skewer-v 6.1105

    I'd like for not_alligator-n be key-n 3.8779 and not_alligator-n be key-n 3.2572 to appear just once, but with their score summed up.
    How can I achieve that?
    Thanks
    Giulia

      Use a second hash to store those (partial) lines that you've already seen, and only print out those lines that aren't in the hash yet.

      You might want to consider loading the whole thing into a database if it's that large and you need to do a lot of key lookup (e.g. to avoid dupes) as you process the data, particularly if you need to sort on it in different ways or pull out subsets based on certain conditions.