http://www.perlmonks.org?node_id=11119283

v15 has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have in this example 2 files, each of which is a 9 column file. Format of both the files are same. This is how my input dummy files look like: file1.tab
chr10 94055 94554 2 2 1 1 0 23 chr10 95850 122380 2 2 1 1 0 15 chr10 181243 196457 1 1 1 1 0 21 chr10 181243 225933 1 1 1 4 0 21 chr10 181500 225933 1 1 1 15 0 25 chr10 226069 255828 1 1 1 53 0 22 chr10 255948 267233 1 1 1 2 0 12 chr10 255989 267134 1 1 1 65 0 25 chr10 255989 282777 1 1 1 14 0 23 chr10 264440 267134 1 1 1 1 0 25
file2.tab
chr10 94055 94554 2 2 1 1 0 21 chr10 94055 94743 2 2 1 1 0 20 chr10 94666 94743 2 2 1 3 0 20 chr10 94853 95347 2 2 1 2 0 25 chr10 181243 225933 1 1 1 3 0 21 chr10 181500 225933 1 1 1 7 0 10 chr10 226069 255828 1 1 1 38 0 22 chr10 255989 267134 1 1 1 29 0 24 chr10 255989 282777 1 1 1 15 0 23 chr10 267297 282777 1 1 1 44 0 24
The columns in the files are tab separated. I want to combine the 2 files in a way that first 4 columns is used as a key and the value is the 7th column. If the key is missing in either file, the value for that key for that file should be 0. This is the output I would like:
chr fivep threep strand file1.tab file2.tab chr10 181500 225933 - 15 7 chr10 264440 267134 - 1 0 chr10 94055 94554 - 1 1 chr10 181243 225933 - 4 3 chr10 94666 94743 - 0 3 chr10 255989 282777 - 14 15 chr10 94853 95347 - 0 2 chr10 255948 267233 - 2 0 chr10 95850 122380 - 1 0 chr10 267297 282777 - 0 44 chr10 226069 255828 - 53 38 chr10 94055 94743 - 0 1 chr10 255989 267134 - 65 29 chr10 181243 196457 - 1 0
My code which is below, works perfectly fine, but I open my input files twice. Is there a way where I just have to open up the input files once? Code:
#!/usr/bin/perl use strict; use warnings; my @files = glob("file*.tab"); my %output = (); foreach my $file ( @files ){ open(my $fh,'<',$file) or die $!; while(<$fh>){ chomp; my @s = split /\t/; die("_ERROR_ not 9 columns [ $_ ]\n") if (@s != 9); $output{$s[0]."\t".$s[1]."\t".$s[2]."\t".$s[3]} = []; } close $fh; } foreach my $file ( @files ){ foreach my $k(keys %output){ push @{$output{$k}},0; } open(my $fh,'<',$file) or die $!; while(<$fh>){ chomp; my @s = split /\t/; die("_ERROR_ not 9 columns [ $_ ]\n") if (@s != 9); my @array = @{$output{$s[0]."\t".$s[1]."\t".$s[2]."\t".$s[3]}} +; $array[-1] = $s[6]; $output{$s[0]."\t".$s[1]."\t".$s[2]."\t".$s[3]} = \@array; } close $fh; } print join("\t",'chr','fivep','threep','strand',@files),"\n"; foreach my $k(keys %output){ my @s = split /\t/,$k; my $chr = $s[0]; my $fivep = $s[1]; my $threep = $s[2]; my $strand = $s[3]; next if($strand =~ /^0$/); if($strand =~ /^0$/){ print join("\t",$chr,$fivep,$threep,"+",@{$output{$chr."\t".$f +ivep."\t".$threep."\t".$strand}} ),"\n"; }else{ print join("\t",$chr,$fivep,$threep,"-",@{$output{$chr."\t".$f +ivep."\t".$threep."\t".$strand}} ),"\n"; } }
I have several input files and many more lines than the current dummy files I have shown here.

Replies are listed 'Best First'.
Re: open input files once
by rnewsham (Curate) on Jul 14, 2020 at 06:49 UTC

    If you don't want to read the files twice you need to think about your data structure and how you turn it into the desired output. Something like this should work, read the data into a temporary data structure with the file as a secondary key. Then process it to fill in the blanks at output time.

    #!/usr/bin/perl use strict; use warnings; my @files = glob("file*.tab"); my %data; for my $file ( @files ) { open my $fh, '<', $file or die $!; while ( <$fh> ) { chomp; my @columns = split /\t/; die "_ERROR_ not 9 columns [ $_ ]\n" if @columns != 9; my $key = join( "\t", @columns[0..3] ); $data{$key}->{$file} = $columns[6]; } close $fh; } print join("\t", 'chr', 'fivep', 'threep', 'strand', @files), "\n"; for my $key ( keys %data ) { my ( $chr, $fivep, $threep, $strand ) = split /\t/, $key; next if ( $strand =~ /^0$/ ); my @output; for my $file ( @files ) { push @output, $data{$key}->{$file} // 0; } print join("\t", $chr, $fivep, $threep, "-", @output ), "\n"; }

    I also cleaned up the code a little and added some whitespace around things as I find it makes it easier to read.

      Whitespace to make things easier to read is *completely* within the eyes of the beholder. Most often, I find *removing* extraneous redundant annoying empty lines improving the readability and maintainability. Also cleaning up is only useful if it matches the style of the surrounding project/script/module(s) after the cleanup. e.g. your cleaned-up code would not fit my standards and my code would not fit yours..

      And if you clean up, why not go further?

      foreach my $key (grep { !m/\t0$/ } keys %data) { my ($chr, $fivep, $threep, $strand) = split m/\t/ => $key; say join "\t" => $chr, $fivep, $threep, "-", map { $data{$key}{$_} // 0 } @files; }

      Less lines more to the point, no need for empty space as each line is clear on itself.

      Personally, I'd use Text::CSV_XS with sep => "\t" like this:

      use Text::CSV_XS qw( csv ); my @key = qw( chr fivep threep strand ); my @files = qw( file1.tab file2.tab ); my %c7; foreach my $file (@files) { csv (in => $file, out => undef, sep => "\t", strict => 1, on_in => sub { $c7{join ":" => @{$_[1]}[0..3]}{$file} = $_[1 +][6] } ); } say join "\t" => @key, @files; foreach my $key (sort keys %c7) { say join "\t" => (split m/:/ => $key), map { $c7{$key}{$_} // 0 } +@files; }

      Enjoy, Have FUN! H.Merijn

        Oh I agree everyone has their own preferences. My comment was mainly to point out why I changed other things as I was finding their code a little unreadable in my pre-caffine state.

      Hi,

      Can you explain this line of code:

      push @output, $data{$key}->{$file} // 0

      What is // doing in the above code? Also is there a way I can get a sorted output where column 1 is sorted, then column2 and then column3?

        You should be more specific in your definition of "sorted". Look at my solution elsewhere in this thread. If you mean that the 2nd, 3rd and 4th column being sorted numeric, that isn't very hard either: use pack/unpack

        my @key = qw( chr fivep threep strand ); my @files = qw( file1.tab file2.tab ); my %c7; foreach my $file (@files) { csv ( in => $file, out => undef, sep => "\t", on_in => sub { $c7{pack "A10l>l>l>", @{$_[1]}[0..3]}{$file} = $_ +[1][6] } ); } say join "\t" => @key, @files; foreach my $key (sort keys %c7) { say join "\t" => (unpack "A10l>l>l>" => $key), map { $c7{$key}{$_} + // 0 } @files; }

        Enjoy, Have FUN! H.Merijn
        What is // doing in the above code?

        // in this case is the Logical Defined Or (as opposed to the empty regular expression, for example in "$foo =~ //;"). The expression $data{$key}{$file} // 0 is the same as defined($data{$key}{$file}) ? $data{$key}{$file} : 0

Re: open input files once
by Marshall (Canon) on Jul 14, 2020 at 10:44 UTC
    I am not quite sure if I replicated your expected output.
    Read file1 into data struct leaving placeholder for the 2nd value in file2.
    Read file2, replace placeholder with 2nd value if key exists otherwise make a 0 for 1st value.
    Fiddle with this to get what you want. Basic idea is only read each file once.

    Update: Below I used a single string, but an anon array of 2 elements would also get the job done.

    use strict; use warnings; use Inline::Files; use Data::Dumper; $|=1; my %data; while (<FILE1>) { next unless /^\S/; #skip blank lines my (@tokens) = (split( " ",$_))[0,1,2,3,6]; my $value = pop @tokens; $data{"@tokens"} = "$value 0"; } while (<FILE2>) { next unless /^\S/; #skip blank lines my (@tokens) = (split( " ",$_))[0,1,2,3,6]; my $value2 = pop @tokens; if ($data{"@tokens"}) { $data{"@tokens"}=~ s/0$/$value2/; } else { $data{"@tokens"} = "0 $value2"; } } foreach (sort keys %data) { print "$_: $data{$_} \n"; } =prints: chr10 181243 196457 1: 1 0 chr10 181243 225933 1: 4 3 chr10 181500 225933 1: 15 7 chr10 226069 255828 1: 53 38 chr10 255948 267233 1: 2 0 chr10 255989 267134 1: 65 29 chr10 255989 282777 1: 14 15 chr10 264440 267134 1: 1 0 chr10 267297 282777 1: 0 44 chr10 94055 94554 2: 1 1 chr10 94055 94743 2: 0 1 chr10 94666 94743 2: 0 3 chr10 94853 95347 2: 0 2 chr10 95850 122380 2: 1 0 =cut __FILE1__ chr10 94055 94554 2 2 1 1 0 23 chr10 95850 122380 2 2 1 1 0 15 chr10 181243 196457 1 1 1 1 0 21 chr10 181243 225933 1 1 1 4 0 21 chr10 181500 225933 1 1 1 15 0 25 chr10 226069 255828 1 1 1 53 0 22 chr10 255948 267233 1 1 1 2 0 12 chr10 255989 267134 1 1 1 65 0 25 chr10 255989 282777 1 1 1 14 0 23 chr10 264440 267134 1 1 1 1 0 25 __FILE2__ chr10 94055 94554 2 2 1 1 0 21 chr10 94055 94743 2 2 1 1 0 20 chr10 94666 94743 2 2 1 3 0 20 chr10 94853 95347 2 2 1 2 0 25 chr10 181243 225933 1 1 1 3 0 21 chr10 181500 225933 1 1 1 7 0 10 chr10 226069 255828 1 1 1 38 0 22 chr10 255989 267134 1 1 1 29 0 24 chr10 255989 282777 1 1 1 15 0 23 chr10 267297 282777 1 1 1 44 0 24
Re: open input files once
by tybalt89 (Monsignor) on Jul 14, 2020 at 19:11 UTC

    Works for any number of files...

    #!/usr/bin/perl use strict; # https://perlmonks.org/?node_id=11119283 use warnings; @ARGV = <file*.tab>; # or just default from command line my $column = 0; my %key; while( <> ) { my @fields = split; $key{ join "\t", @fields[0 .. 3] }[$column] = $fields[6]; eof and ++$column; } for ( sort keys %key ) { print join( "\t", (split)[0 .. 2], '-', map $_ // 0, @{ $key{$_} }[ 0 .. $column - 1 ] ), "\n"; }
Re: open input files once
by jwkrahn (Abbot) on Jul 14, 2020 at 20:45 UTC

    This seems to work:

    #!/usr/bin/perl use strict; use warnings; my @header = qw/ chr fivep threep strand /; my %output; @ARGV = glob 'file*.tab'; while ( <> ) { tr/\t// == 8 or die "_ERROR_ not 9 columns: $_"; my ( $key, $val ) = /^ ( [^\t]+ \t [^\t]+ \t [^\t]+ \t [^\t]+ ) \t + [^\t]+ \t [^\t]+ \t ( [^\t]+ ) /x; $output{ $key }[ @header - 4 ] = $val || 0; if ( eof ) { push @header, $ARGV; } } print join( "\t", @header ), "\n"; for my $key ( keys %output ) { my $new = $key =~ s/ \t \K ( [^\t]+ ) \z / $1 eq '0' ? '+' : '-' / +rex; $output{ $key }[ $_ ] //= 0 for 0 .. $#header - 4; print join( "\t", $new, @{ $output{ $key } } ), "\n"; }
Re: open input files once
by BillKSmith (Monsignor) on Jul 14, 2020 at 19:25 UTC
    The following approach is similar to your first pass. The entire output hash is built in this pass. Every time I see a new key, rather than initializing it with an empty array, I use an array of zeros and overwrite the one corresponding to the file. Every time I find an existing key, I overwrite the zero which corresponds to both the file and that key. The resulting hash of arrays is passed to the print section. It prints one line for each key. It corrects the format of srand then prints the corrected key and the values of the data array.
    use strict; use warnings; use feature 'state'; my @files = glob "file*.tab"; my %output; foreach my $file (@files) { open my $fh, '<', $file or die $!; state $fn = -1; $fn++; while (<$fh>) { chomp; my @s = split /\t/; die "_ERROR_ not 9 colmns [$-]\n" if @s != 9; my $key = join "\t", @s[0..3]; $output{$key} = [('0') x @files] if (!exists $output{$key}); $output{$key}[$fn] = $s[6]; } close $fh; } print join("\t",'chr','fivep','threep','strand',@files),"\n"; foreach my $key (sort keys %output) { my $values = $output{$key}; my $temp_key = $key; $temp_key =~ s/[^\t]+$/ ($& eq 0) ? '+' : '-'/e; print join( "\t", $temp_key, @$values ), "\n"; }
    Bill
Re: open input files once
by Anonymous Monk on Jul 14, 2020 at 17:00 UTC
    There is, indeed, no need at all for the first loop. You don't need to initialize $output with empty arrays before replacing those empty arrays with real ones.