PerlMonks  

Help with pushing into a hash

by jemswira (Novice)
on Aug 29, 2012 at 16:55 UTC (#990508)
jemswira has asked for the wisdom of the Perl Monks concerning the following question:

Hi there monks.

I'm trying to combine two databases into one for my computational bio project. The first database (Test) looks something like this:

Q197F8 ORFNames=IIV3-002R
Q197F7 ORFNames=IIV3-003L
Q6GZX2 ORFNames=FV3-003R
Q6GZX1 ORFNames=FV3-004R
Q197F5 ORFNames=IIV3-005L
Q6GZX0 ORFNames=FV3-005R ;PF02393
Q91G88 ORFNames=IIV6-006L ;PF12299;PF04383
Q6GZW9 ORFNames=FV3-006R

That's just a small part of it. The six characters (.{6}) before ORFNames are the accession number. The second file, Test2, has this format:

Q197F8 | PF04947.9
Q91G88 | PF01486.12 PF00319.13

The format I need in the end is this:

Q197F8 IIV3-002R PF04947.9
Q91G88 IIV6-006L PF01486.12 PF00319.13

Since the first database test was going to be much larger than the second, I wanted to make it into an array list with the accession numbers and names, then go down the list of accession numbers in Test2 and print them out. Like so:

#!/usr/bin/perl
use warnings;
use strict;

open DATA, "C:\\Users\\Jems\\Desktop\\Perl\\test\\test.txt" or die $!;
use Modern::Perl;
use Data::Dump qw/dump/;

our %data;
my $ac;

while (<DATA>) {
    my @splitted= split(/=|;/);
    foreach (@splitted){
        if (/^(.{6})\sORFNames/) {
            $ac = $1;
            chomp ($ac);
            next;
        }
        if (/^(.+)\s\n/) {
            #print "$ac $1\n";
            push @{ $data{$1} }, $ac;
            next;
            #print @{$data{$ac}} if exists $data{$ac};
        }
        if (/^(.+)\s;PF/) {
            push @{ $data{$1} }, $ac;
            next;
        }
        next;}
    next;
}

my $acn;
open ACTIVATOR, "C:\\Users\\Jems\\Desktop\\Perl\\test\\Test2.txt" or die $!;
open ACTIVOUT, ">C:\\Users\\jems\\Desktop\\Perl\\test\\ActivACNPF.txt" or die $!;
select ACTIVOUT;
while ($acn= <ACTIVATOR>){
    if($acn =~ m/^(......)\s\|/){
        my $ab = $1;
        chomp ($ab);
        #print "$ab";
        print "$acn | @{$data{$ab}}\n" if exists $data{$ab};
        next;
    }
}

print STDOUT "DONE ACTIV";

The commented-out prints were my test output. However, I just can't get the print at line 44 (the final one, inside the second while loop) to print what I want. The commented-out print at line 23 (the one that dereferences $data{$ac}) prints nothing, but the commented-out print at line 19 (the one that prints $ac and $1) shows that both values are correct. Also, if I print %data, it does return values. Am I checking something wrongly?

Thanks monks!

Re: Help with pushing into a hash
by Kenosis (Priest) on Aug 29, 2012 at 18:13 UTC

    Here's an option to consider:

    use Modern::Perl;
    use File::Slurp qw/read_file write_file/;

    my $test     = 'test.txt';
    my $test2    = 'test2.txt';
    my $activout = 'ActivACNPF.txt';
    my @lines;

    my %data = map { /(.+)\s+\|\s+(.+)/; $1 => $2 } read_file $test2;

    for ( read_file $test ) {
        /(.+)\s+.+=([^\s]+)/;
        push @lines, "$1 $2 $data{$1}\n" if $data{$1};
    }

    write_file $activout, @lines;

    Output to file:

    Q197F8 IIV3-002R PF04947.9
    Q91G88 IIV6-006L PF01486.12 PF00319.13

    %data is initialized using the captured data from test2.txt as key/value pairs. Next, the 'keys' and associated 'values' are captured from test.txt, and the completed line is pushed onto @lines if a matching key is found. Finally, @lines is written to ActivACNPF.txt.
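    The same lookup-then-merge idea can be sketched with core Perl only, for anyone without File::Slurp. This is not Kenosis's code: the helper name merge_records, the in-memory sample lines, and the stricter patterns are assumptions made for illustration. It checks each match before using the captures.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): merge ORF names and Pfam IDs
# by accession number, skipping any line that does not match.
sub merge_records {
    my ($orf_lines, $pf_lines) = @_;

    # Build the lookup from the smaller file: accession => Pfam IDs.
    my %pf_for;
    for my $line (@$pf_lines) {
        next unless $line =~ /^(\S+)\s+\|\s+(.+?)\s*$/;
        $pf_for{$1} = $2;
    }

    # Walk the larger file and keep only accessions found in the lookup.
    my @merged;
    for my $line (@$orf_lines) {
        next unless $line =~ /^(\S+)\s+ORFNames=(\S+)/;
        push @merged, "$1 $2 $pf_for{$1}" if exists $pf_for{$1};
    }
    return @merged;
}

# Sample lines copied from the question.
my @orf = (
    "Q197F8 ORFNames=IIV3-002R\n",
    "Q197F7 ORFNames=IIV3-003L\n",
    "Q91G88 ORFNames=IIV6-006L ;PF12299;PF04383\n",
);
my @pf = ( "Q197F8 | PF04947.9\n", "Q91G88 | PF01486.12 PF00319.13\n" );

print "$_\n" for merge_records( \@orf, \@pf );
```

    With the sample lines this prints the two merged records in the format asked for. Keying the hash on the accession number is what lets the second loop find its match.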

    Hope this helps!

    Update: Replaced a single-line map with a multi-line for to improve readability.

      Thanks so much! It works like a charm. On a side note, how do I remove the decimal place? I tried using (PF.{5}) instead of the (.+), but it would only return the last PF value and nothing else.

        Ask yourself: what do you want to match?

        In a regular expression, a dot matches any character (except a newline, by default).
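        One way to act on that hint, as a sketch only: say exactly what an ID looks like instead of using dots, and let a /g match in list context return every ID on the line. The assumption here (taken from the sample data, not confirmed by the thread) is that each ID is "PF" followed by exactly five digits.

```perl
use strict;
use warnings;

my $line = "Q91G88 | PF01486.12 PF00319.13";

# In list context, a /g match returns every captured group on the line.
# Assumption: each ID is "PF" plus exactly five digits, as in the sample.
my @ids = $line =~ /(PF\d{5})/g;

print "@ids\n";    # PF01486 PF00319
```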

        You're welcome, jemswira!

        To remove the decimal values in the test2.txt data, try changing the following:

        my %data = map { /(.+)\s+\|\s+(.+)/; $1 => $2 } read_file $test2;

        to:

        my %data = map { s/\.\d+//g; /(.+)\s+\|\s+(.+)/; $1 => $2 } read_file $test2;

        New output to file:

        Q197F8 IIV3-002R PF04947
        Q91G88 IIV6-006L PF01486 PF00319

        The substitution at the beginning of the map block will globally remove a decimal point followed by one or more digits. Since only the test2.txt values (not keys) contain decimal points, this should work.
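        If the keys ever did contain a dot, the same substitution could be applied to the captured value alone. A small sketch of that variant follows; the variable names $acc and $ids and the inline sample lines are illustrative, not from the thread.

```perl
use strict;
use warnings;

my %data;
for my $line ( "Q197F8 | PF04947.9\n", "Q91G88 | PF01486.12 PF00319.13\n" ) {
    # Skip anything that is not "accession | values".
    next unless $line =~ /^(\S+)\s+\|\s+(.+?)\s*$/;
    my ( $acc, $ids ) = ( $1, $2 );

    # Strip ".9", ".12", ... from the value only; the key is left untouched.
    $ids =~ s/\.\d+//g;
    $data{$acc} = $ids;
}

print "$_ => $data{$_}\n" for sort keys %data;
```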

    Re: Help with pushing into a hash
    by 2teez (Priest) on Aug 29, 2012 at 18:15 UTC
      Hi,

      The script below could do what you want, like so:

      use warnings;
      use strict;

      my ( $file1, $file2 ) = @ARGV;
      my $matched_word = {};

      open my $fh, '<', $file1 or die "can't open file: $!";
      while (<$fh>) {
          s/^\s+?|\s+?$//;
          if (m{(.+?)\s+?.+?=(.+?)\s+?.*?$}) {
              push @{ $matched_word->{$1} }, $2;
          }
      }
      close $fh or die "can't close file: $!";

      open $fh, '<', $file2 or die "can't open file: $!";
      while (<$fh>) {
          s/^\s+?|\s+?$//;
          my ( $value1, $value2 ) = split /\s+?\|\s+?/, $_;
          print $value1, " ", @{ $matched_word->{$value1} }, " ", $value2, $/
            if exists $matched_word->{$value1};
      }
      close $fh or die "can't close file: $!";

      OUTPUT

      Q197F8 IIV3-002R PF04947.9
      Q91G88 IIV6-006L PF01486.12 PF00319.13

      I'd also like to point out a few other things that are worth looking out for:
      1. Please use the three-argument form of open.
      2. Please don't use "DATA" as a filehandle name; it is used by Perl itself (see SelfLoader).
      3. Use lexical variables instead of BAREWORDs for filehandles.
      4. You might not need Modern::Perl, since you have already used use warnings; use strict; (or vice versa).
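
      Points 1 and 3 can be sketched together as below. The temporary file only makes the example runnable on its own, and the names $in, $path and @accessions are illustrative rather than taken from the script above.

```perl
use strict;
use warnings;
use File::Temp qw/tempfile/;

# Create a throwaway input file so the sketch runs on its own.
my ( $tmp_fh, $path ) = tempfile( UNLINK => 1 );
print {$tmp_fh} "Q197F8 | PF04947.9\n";
close $tmp_fh or die "can't close $path: $!";

# Three-argument open with a lexical filehandle instead of a bareword.
open my $in, '<', $path or die "can't open $path: $!";
my @accessions;
while ( my $line = <$in> ) {
    push @accessions, $1 if $line =~ /^(\S+)\s+\|/;
}
close $in or die "can't close $path: $!";

print "@accessions\n";
```

      Keeping the mode ('<') separate from the file name means an odd character in the name cannot change how the file is opened.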

    Re: Help with pushing into a hash
    by CountZero (Bishop) on Aug 29, 2012 at 19:56 UTC
      I did it like this:

      use Modern::Perl;
      use Data::Dump qw/dump/;

      my %acn_database;

      while (<DATA>) {
          last if /END/;
          my ($acn, $orfname) = split /\sORFNames=|\s/;
          $acn_database{$acn} = [$orfname];
      }

      while (<DATA>) {
          my ($acn, @pf_data) = split /\s\|\s|\s/;
          push @{$acn_database{$acn}}, @pf_data;
      }

      say dump(\%acn_database);

      __DATA__
      Q197F8 ORFNames=IIV3-002R
      Q197F7 ORFNames=IIV3-003L
      Q6GZX2 ORFNames=FV3-003R
      Q6GZX1 ORFNames=FV3-004R
      Q197F5 ORFNames=IIV3-005L
      Q6GZX0 ORFNames=FV3-005R ;PF02393
      Q91G88 ORFNames=IIV6-006L ;PF12299;PF04383
      Q6GZW9 ORFNames=FV3-006R
      END
      Q197F8 | PF04947.9
      Q91G88 | PF01486.12 PF00319.13

      I have assumed that the stray space at the first line of your first file is just a typo.

      Note that this will still do the right thing if you have multiple lines in your second file with the same accession number: their data will just be added to the right array in the hash.
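
      A small sketch of that behaviour, using the same split and push. The two sample lines repeating Q91G88 are made up for illustration and do not come from the real files.

```perl
use strict;
use warnings;

# Start with one entry, as the first loop would have created it.
my %acn_database = ( Q91G88 => ['IIV6-006L'] );

# Two lines for the same accession, as might appear in the second file.
for my $line ( "Q91G88 | PF01486.12\n", "Q91G88 | PF00319.13\n" ) {
    my ( $acn, @pf_data ) = split /\s\|\s|\s/, $line;
    push @{ $acn_database{$acn} }, @pf_data;    # appends, never overwrites
}

print "@{ $acn_database{Q91G88} }\n";    # IIV6-006L PF01486.12 PF00319.13
```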

      CountZero

      "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      My blog: Imperial Deltronics
