PerlMonks  

Re: Select only desired features from a text

by moritz (Cardinal)
on Mar 19, 2012 at 13:10 UTC (#960408)


in reply to Select only desired features from a text

The general procedure is to first read the file that contains the mapping you are interested in and store that mapping in a hash, then traverse the second file and transform its lines based on the hash.

Something like this:

use 5.010;
use strict;
use warnings;
use autodie;

my %map;

# First pass: build the lookup from f1, keeping only the 'coord' lines.
open my $IN, '<', 'f1';
while (<$IN>) {
    my ($first, undef, $type, $fourth) = split;
    $map{$fourth} = $first if $type eq 'coord';
}
close $IN;

# Second pass: rewrite the lines of f2 whose first field is in the map.
open $IN, '<', 'f2';
while (<$IN>) {
    my ($first, $rest) = split /\s/, $_, 2;
    if ($map{$first}) {
        print "not_$map{$first} $rest";
    }
}
close $IN;

Note that the variable names are quite terrible, because I don't know what the values stand for.


Re^2: Select only desired features from a text
by remluvr (Sexton) on Mar 19, 2012 at 15:15 UTC

    Thanks, this was really useful, but my problem is I don't want to have duplicates. Given this output:

    not_alligator-n about adage-n 8.8016
    not_alligator-n appearance-1 broad-j 11.9640
    not_alligator-n coord albino-n 6.7667
    not_alligator-n be jumper-n 6.0272
    not_alligator-n be key-n 3.8779
    not_alligator-n of body-n 8.3063
    not_alligator-n of bone-n 20.7982
    not_alligator-n of book-n 0.4229
    not_alligator-n be key-n 3.2572
    not_alligator-n of chorus-n 24.9515
    not_alligator-n of book-n 2.3460
    not_alligator-n obj sit-v 3.1857
    not_alligator-n obj size-v 57.3257
    not_alligator-n obj skewer-v 6.1105

    I'd like not_alligator-n be key-n 3.8779 and not_alligator-n be key-n 3.2572 to appear just once, with their scores summed.
    How can I achieve that?
    Thanks
    Giulia

      You might want to consider loading the whole thing into a database if it's that large and you need to do a lot of key lookup (e.g. to avoid dupes) as you process the data, particularly if you need to sort on it in different ways or pull out subsets based on certain conditions.

      Use a second hash to store those (partial) lines that you've already seen, and only print out those lines that aren't in the hash yet.
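
Since the question asks for the scores of duplicate lines to be summed rather than for the repeats to be dropped, the second hash can hold a running total instead of a "seen" flag. Here is a minimal sketch of that idea. It assumes every line has exactly four whitespace-separated fields with a numeric score last, as in the sample output above; the sub name sum_duplicates and the four-decimal output format are my own choices, not something from the original script.

```perl
use strict;
use warnings;

# Sum the scores of lines that share the same first three fields.
# Assumption: four whitespace-separated fields per line, score last.
sub sum_duplicates {
    my @lines = @_;
    my (%total, @order);
    for my $line (@lines) {
        my @fields = split ' ', $line;
        next unless @fields == 4;    # skip blank or malformed lines
        my $score = pop @fields;
        my $key   = join ' ', @fields;
        # Remember the order in which each key was first seen.
        push @order, $key unless exists $total{$key};
        $total{$key} += $score;
    }
    # One line per key, in first-seen order, with the summed score.
    return map { sprintf '%s %.4f', $_, $total{$_} } @order;
}

print "$_\n" for sum_duplicates(
    'not_alligator-n be key-n 3.8779',
    'not_alligator-n of book-n 0.4229',
    'not_alligator-n be key-n 3.2572',
);
```

With those three lines this prints "not_alligator-n be key-n 7.1351" followed by "not_alligator-n of book-n 0.4229". Note that nothing can be printed until all input has been read, because a later line may still add to an earlier total. In the script above that would mean collecting the transformed lines in the second loop and printing the totals after it.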
