Re^2: Select only desired features from a text

by remluvr (Sexton)
on Mar 19, 2012 at 15:15 UTC

in reply to Re: Select only desired features from a text
in thread Select only desired features from a text

Thanks, this was really useful, but my problem is that I don't want duplicates in the result. Given this output:

not_alligator-n about adage-n 8.8016
not_alligator-n appearance-1 broad-j 11.9640
not_alligator-n coord albino-n 6.7667
not_alligator-n be jumper-n 6.0272
not_alligator-n be key-n 3.8779
not_alligator-n of body-n 8.3063
not_alligator-n of bone-n 20.7982
not_alligator-n of book-n 0.4229
not_alligator-n be key-n 3.2572
not_alligator-n of chorus-n 24.9515
not_alligator-n of book-n 2.3460
not_alligator-n obj sit-v 3.1857
not_alligator-n obj size-v 57.3257
not_alligator-n obj skewer-v 6.1105

I'd like "not_alligator-n be key-n 3.8779" and "not_alligator-n be key-n 3.2572" to appear just once, with their scores summed.
How can I achieve that?

Re^3: Select only desired features from a text
by moritz (Cardinal) on Mar 19, 2012 at 18:03 UTC

    Use a second hash to store those (partial) lines that you've already seen, and only print out those lines that aren't in the hash yet.
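
The seen-hash idea can be pushed one step further to get the summed scores the question asks for: instead of only recording that a partial line was seen, keep a running total under it. The thread does not show the original script, so what follows is a minimal standalone sketch in Python, assuming each record is exactly four whitespace-separated fields with the score last. The function name `sum_duplicate_scores` and the sample list are hypothetical.

```python
from collections import defaultdict

def sum_duplicate_scores(lines):
    """Merge records that share the first three fields, summing their scores."""
    # Plain dicts (and defaultdict) keep insertion order, so each merged
    # record stays at the position where its key was first seen.
    totals = defaultdict(float)
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # assumption: anything that is not a 4-field record is skipped
        target, relation, feature, score = fields
        totals[(target, relation, feature)] += float(score)
    return [
        f"{tgt} {rel} {feat} {total:.4f}"
        for (tgt, rel, feat), total in totals.items()
    ]

# A few records taken from the question, including both duplicated keys.
sample = [
    "not_alligator-n be jumper-n 6.0272",
    "not_alligator-n be key-n 3.8779",
    "not_alligator-n of book-n 0.4229",
    "not_alligator-n be key-n 3.2572",
    "not_alligator-n of book-n 2.3460",
]
merged = sum_duplicate_scores(sample)
for record in merged:
    print(record)
```

Note that the sample data has a second duplicated key, "not_alligator-n of book-n", which gets merged the same way. Nothing can be printed until all input has been read, because a later line may still add to an earlier total.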

Re^3: Select only desired features from a text
by bitingduck (Chaplain) on Mar 19, 2012 at 15:29 UTC

    If the data is that large and you need to do a lot of key lookups (e.g. to avoid dupes) as you process it, you might want to consider loading the whole thing into a database, particularly if you need to sort it in different ways or pull out subsets based on certain conditions.
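
As a rough illustration of the database route, here is a sketch using SQLite through Python's standard `sqlite3` module. The reply names no particular database, so the choice of SQLite, the table name `features`, and the column names are all assumptions made for the example. Once the rows are loaded, summing the duplicates is a single `GROUP BY` query.

```python
import sqlite3

# Hypothetical sample rows: (target, relation, feature, score), taken from the question.
records = [
    ("not_alligator-n", "be", "jumper-n", 6.0272),
    ("not_alligator-n", "be", "key-n", 3.8779),
    ("not_alligator-n", "of", "book-n", 0.4229),
    ("not_alligator-n", "be", "key-n", 3.2572),
    ("not_alligator-n", "of", "book-n", 2.3460),
]

conn = sqlite3.connect(":memory:")  # in-memory database; pass a file path to persist it
conn.execute(
    "CREATE TABLE features (target TEXT, relation TEXT, feature TEXT, score REAL)"
)
conn.executemany("INSERT INTO features VALUES (?, ?, ?, ?)", records)

# One output row per distinct (target, relation, feature), with scores summed.
# ORDER BY MIN(rowid) keeps the groups in first-seen order.
rows = conn.execute(
    "SELECT target, relation, feature, SUM(score) "
    "FROM features "
    "GROUP BY target, relation, feature "
    "ORDER BY MIN(rowid)"
).fetchall()

for target, relation, feature, total in rows:
    print(f"{target} {relation} {feature} {total:.4f}")

conn.close()
```

The same table can then be sorted or filtered with further queries (for example a `WHERE relation = 'of'` clause) without re-reading the input file.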

