PerlMonks  

Select only desired features from a text

by remluvr (Sexton)
on Mar 19, 2012 at 11:25 UTC ( #960394=perlquestion )
remluvr has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone.
Here I am with a new problem I can't solve.
I have two input files. One contains a list of semantic relations structured like the following (let's call it INPUT1):

alligator-n amphibian_reptile attri long-j
alligator-n amphibian_reptile attri old-j
alligator-n amphibian_reptile coord crocodile-n
alligator-n amphibian_reptile coord frog-n
alligator-n amphibian_reptile event walk-v
alligator-n amphibian_reptile hyper animal-n

The other one (let's call it INPUT2) looks like this (obviously just a very reduced version):

frog-n about adage-n 8.8016
frog-n appearance-1 broad-j 11.9640
frog-n coord albino-n 6.7667
frog-n be jumper-n 6.0272
frog-n be key-n 3.8779
frog-n of body-n 8.3063
frog-n of bone-n 20.7982
frog-n of book-n 0.4229
crocodile-n be key-n 3.2572
crocodile-n of chorus-n 24.9515
crocodile-n of book-n 2.3460
crocodile-n obj sit-v 3.1857
crocodile-n obj size-v 57.3257
crocodile-n obj skewer-v 6.1105
animal-n coord-1 investigation-n 0.9666
animal-n coord-1 irrigation-n 2.6058
animal-n coord-1 isolation-n 1.4074
animal-n coord-1 isotope-n 2.7420

I need to find the rows of INPUT1 whose relation (the third field) is "coord", and then search INPUT2 for rows whose first field matches the fourth field of those INPUT1 rows. In this case that gives crocodile-n and frog-n. I have to build another file that looks like INPUT2 but contains every row whose first field is crocodile-n or frog-n, with that first field replaced by not_alligator-n. If a row with the same second and third fields has already been found, I must not repeat it, but add its score to the one I already have.
I understand this explanation is not really clear, so here is an example of the desired output:

not_alligator-n about adage-n 8.8016
not_alligator-n appearance-1 broad-j 11.9640
not_alligator-n coord albino-n 6.7667
not_alligator-n be jumper-n 6.0272
not_alligator-n be key-n 7.1351(3.8779+3.2572)
not_alligator-n of body-n 8.3063
not_alligator-n of chorus-n 24.9515
not_alligator-n of bone-n 20.7982
not_alligator-n of book-n 2.7689(0.4229+2.3460)
not_alligator-n obj sit-v 3.1857
not_alligator-n obj size-v 57.3257
not_alligator-n obj skewer-v 6.1105

I have no idea where to start. It has been less than a month since I started using Perl again, and I still have a lot to learn.
Every suggestion, tip, or indication of what to do would be really appreciated.
I need it because I'm analyzing some statistical measures to be used on semantic relations for my PhD thesis.
Thanks to all
Giulia

Re: Select only desired features from a text
by RichardK (Priest) on Mar 19, 2012 at 12:24 UTC

    Well, that depends on how many lines there are in the second file. The easiest way is to store the matched records in a hash.

    You might find it useful to look at the perl data structures cookbook perldsc

    BTW, there is lots of documentation shipped with your copy of perl - try 'man perl' or 'perldoc perl' ;)

      The problem is, it is a 4 GB file.
Re: Select only desired features from a text
by moritz (Cardinal) on Mar 19, 2012 at 13:10 UTC

    The general procedure is to first read the file that contains the interesting mapping, read the mapping into a hash, and then traverse the second file and do the transformation of these lines based on the hash.

    Something like this:

    use 5.010;
    use strict;
    use warnings;
    use autodie;

    # Pass 1: remember each "coord" word (4th field) and the word it maps to (1st field).
    my %map;
    open my $IN, '<', 'f1';
    while (<$IN>) {
        my ($first, undef, $type, $fourth) = split;
        $map{$fourth} = $first if $type eq 'coord';
    }
    close $IN;

    # Pass 2: print the rows of the second file whose first field is in the map.
    open $IN, '<', 'f2';
    while (<$IN>) {
        my ($first, $rest) = split /\s/, $_, 2;
        if ($map{$first}) {
            print "not_$map{$first} $rest"
        }
    }
    close $IN;

    Note that the variable names are quite terrible, because I don't know what the values stand for.

      Thanks, this was really useful, but my problem is I don't want to have duplicates. Given this output:

      not_alligator-n about adage-n 8.8016
      not_alligator-n appearance-1 broad-j 11.9640
      not_alligator-n coord albino-n 6.7667
      not_alligator-n be jumper-n 6.0272
      not_alligator-n be key-n 3.8779
      not_alligator-n of body-n 8.3063
      not_alligator-n of bone-n 20.7982
      not_alligator-n of book-n 0.4229
      not_alligator-n be key-n 3.2572
      not_alligator-n of chorus-n 24.9515
      not_alligator-n of book-n 2.3460
      not_alligator-n obj sit-v 3.1857
      not_alligator-n obj size-v 57.3257
      not_alligator-n obj skewer-v 6.1105

      I'd like "not_alligator-n be key-n 3.8779" and "not_alligator-n be key-n 3.2572" to appear just once, with their scores summed.
      How can I achieve that?
      Thanks
      Giulia

        You might want to consider loading the whole thing into a database if it's that large and you need to do a lot of key lookup (e.g. to avoid dupes) as you process the data, particularly if you need to sort on it in different ways or pull out subsets based on certain conditions.

        Use a second hash to store those (partial) lines that you've already seen, and only print out those lines that aren't in the hash yet.
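        A minimal sketch of that second-hash idea, adapted so that a repeated line adds its score to the stored total instead of being dropped. The names %total and @order, the inlined sample rows, and the contents of %map are assumptions for illustration, not part of the code posted earlier in the thread.

```perl
use strict;
use warnings;

# Assumed mapping built from INPUT1: coord word => word it was listed under.
my %map = ('frog-n' => 'alligator-n', 'crocodile-n' => 'alligator-n');

# A few INPUT2 rows, inlined here so the sketch runs on its own.
my @rows = (
    'frog-n be key-n 3.8779',
    'frog-n of book-n 0.4229',
    'crocodile-n be key-n 3.2572',
    'crocodile-n of book-n 2.3460',
    'animal-n coord-1 isotope-n 2.7420',
);

my (%total, @order);    # second hash: partial line => summed score
for my $row (@rows) {
    my ($first, $rel, $third, $score) = split ' ', $row;
    next unless exists $map{$first};
    my $seen_key = "not_$map{$first} $rel $third";
    push @order, $seen_key unless exists $total{$seen_key};    # keep first-seen order
    $total{$seen_key} += $score;
}

my @output = map { sprintf '%s %.4f', $_, $total{$_} } @order;
print "$_\n" for @output;
```

        Only the distinct matching rows are held in memory, so the size of the hash depends on the number of matches rather than on the size of the input file. For the real data, the loop over @rows would become a while loop over the file handle.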

Re: Select only desired features from a text
by JavaFan (Canon) on Mar 19, 2012 at 15:44 UTC
    I'm getting the impression that the same question, with similar data, is asked here every few days. The only thing that seems to change is the name of the animal.

    Given the size of the file, and the fact it seems you need to do this over and over again, I'd say take a 2-day basic SQL course, load your data in a database, and run some SQL queries.

    Considering how you're struggling with Perl, the 2-day investment should pay for itself in about 2.1 days!

      Thanks for your suggestion, really useful in this moment

Re: Select only desired features from a text
by aaron_baugher (Deacon) on Mar 19, 2012 at 17:01 UTC

    The basic answer is to make a hash from the relationships in input1, and use that to parse and process the information you need from input2. If I understand your problem, in this case I would probably create a hash of arrays, keyed on the values from column4, so I'd have something like this:

    %hoa = (
        'frog-n'      => ['alligator-n'],
        'crocodile-n' => ['alligator-n'],
    );

    (I'd use a hash of arrays instead of a simple hash because I assume other values from column1 could have a relationship with 'frog-n'. If that's not true, then this could be a simple hash.) Even if input1 is 4GB, since you're only interested in parts of certain lines, your hash may be much smaller.
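    One way the hash of arrays might be filled from input1 is sketched below. The rows are inlined (an assumption, so the example runs without files), and the column layout is taken from the sample in the question.

```perl
use strict;
use warnings;

# INPUT1 rows from the question, inlined so the sketch runs on its own.
my @input1 = (
    'alligator-n amphibian_reptile attri long-j',
    'alligator-n amphibian_reptile coord crocodile-n',
    'alligator-n amphibian_reptile coord frog-n',
    'alligator-n amphibian_reptile hyper animal-n',
);

my %hoa;    # column-4 word => list of column-1 words it is "coord" with
for my $row (@input1) {
    my ($first, undef, $type, $fourth) = split ' ', $row;
    next unless $type eq 'coord';
    push @{ $hoa{$fourth} }, $first;    # the array is created on first use
}

for my $word (sort keys %hoa) {
    print "$word => @{ $hoa{$word} }\n";
}
```

    If the same pair could appear more than once in input1, the arrays would collect duplicates; a hash of hashes would avoid that.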

    Then I'd start going through input2, building a new multilevel hash based on the array elements from %hoa, with sub-keys from the new file, so I would be assigning values like this:

    # from the first line: frog-n about adage-n 8.8016
    for my $key (@{ $hoa{'frog-n'} }) {
        $newhash{$key}{about}{'adage-n'} += 8.8016;
    }

    That will sum up repeated patterns as it goes, and it won't matter if they are consecutive. When it's done, go through that second hash and print it out in whatever format you like. There are still details to work out (like if you really want the sum elements displayed next to the sum like that, you may want to store them as an array and sum them in the last step), but that's the basic structure.
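    A minimal sketch of that structure, using the variant where the individual scores are kept in an array and summed only when printing. The inlined rows, the @report name, and the sorted output order are assumptions for illustration; the original output order in the question is not reproduced.

```perl
use strict;
use warnings;

# Assumed result of the input1 pass (the %hoa structure shown earlier).
my %hoa = ('frog-n' => ['alligator-n'], 'crocodile-n' => ['alligator-n']);

# A few input2 rows, inlined so the sketch runs on its own.
my @input2 = (
    'frog-n be key-n 3.8779',
    'frog-n of body-n 8.3063',
    'crocodile-n be key-n 3.2572',
    'animal-n coord-1 isotope-n 2.7420',
);

my %newhash;    # target => relation => word => [scores]
for my $row (@input2) {
    my ($first, $rel, $word, $score) = split ' ', $row;
    next unless $hoa{$first};
    push @{ $newhash{$_}{$rel}{$word} }, $score for @{ $hoa{$first} };
}

my @report;
for my $target (sort keys %newhash) {
    for my $rel (sort keys %{ $newhash{$target} }) {
        for my $word (sort keys %{ $newhash{$target}{$rel} }) {
            my @parts = @{ $newhash{$target}{$rel}{$word} };
            my $sum   = 0;
            $sum += $_ for @parts;
            my $shown = sprintf '%.4f', $sum;
            # Show the summands only when more than one row was merged.
            $shown .= '(' . join('+', @parts) . ')' if @parts > 1;
            push @report, "not_$target $rel $word $shown";
        }
    }
}
print "$_\n" for @report;
```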

    Aaron B.
    My Woefully Neglected Blog, where I occasionally mention Perl.

      Aaron, thanks a lot. I tried writing my code based on your suggestions and I succeeded. Thanks!!!!
