open input files once

v15 has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have 2 files in this example, each of which is a 9-column file. The format of both files is the same. This is what my dummy input files look like:

file1.tab

chr10    94055    94554    2    2    1    1    0    23
chr10    95850    122380    2    2    1    1    0    15
chr10    181243    196457    1    1    1    1    0    21
chr10    181243    225933    1    1    1    4    0    21
chr10    181500    225933    1    1    1    15    0    25
chr10    226069    255828    1    1    1    53    0    22
chr10    255948    267233    1    1    1    2    0    12
chr10    255989    267134    1    1    1    65    0    25
chr10    255989    282777    1    1    1    14    0    23
chr10    264440    267134    1    1    1    1    0    25

file2.tab

chr10    94055    94554    2    2    1    1    0    21
chr10    94055    94743    2    2    1    1    0    20
chr10    94666    94743    2    2    1    3    0    20
chr10    94853    95347    2    2    1    2    0    25
chr10    181243    225933    1    1    1    3    0    21
chr10    181500    225933    1    1    1    7    0    10
chr10    226069    255828    1    1    1    38    0    22
chr10    255989    267134    1    1    1    29    0    24
chr10    255989    282777    1    1    1    15    0    23
chr10    267297    282777    1    1    1    44    0    24

The columns in the files are tab separated. I want to combine the 2 files so that the first 4 columns are used as a key and the value is the 7th column. If the key is missing in either file, the value for that key for that file should be 0. This is the output I would like:

chr    fivep    threep    strand    file1.tab    file2.tab
chr10    181500    225933    -    15    7
chr10    264440    267134    -    1    0
chr10    94055    94554    -    1    1
chr10    181243    225933    -    4    3
chr10    94666    94743    -    0    3
chr10    255989    282777    -    14    15
chr10    94853    95347    -    0    2
chr10    255948    267233    -    2    0
chr10    95850    122380    -    1    0
chr10    267297    282777    -    0    44
chr10    226069    255828    -    53    38
chr10    94055    94743    -    0    1
chr10    255989    267134    -    65    29
chr10    181243    196457    -    1    0

My code, which is below, works perfectly fine, but I open my input files twice. Is there a way to open the input files only once? Code:

#!/usr/bin/perl
use strict;
use warnings;

my @files = glob("file*.tab");
my %output = ();

foreach my $file ( @files ){
    open(my $fh,'<',$file) or die $!; 
    while(<$fh>){
        chomp;
        my @s = split /\t/;
        die("_ERROR_ not 9 columns [ $_ ]\n") if (@s != 9);
        $output{$s[0]."\t".$s[1]."\t".$s[2]."\t".$s[3]} = [];
    }
    close $fh;
}

foreach my $file ( @files ){

    foreach my $k(keys %output){
        push @{$output{$k}},0;
    }

    open(my $fh,'<',$file) or die $!;
    while(<$fh>){
        chomp;
        my @s = split /\t/;
        die("_ERROR_ not 9 columns [ $_ ]\n") if (@s != 9);

        my @array = @{$output{$s[0]."\t".$s[1]."\t".$s[2]."\t".$s[3]}};

        $array[-1] = $s[6];

        $output{$s[0]."\t".$s[1]."\t".$s[2]."\t".$s[3]} = \@array;
    }
    close $fh;
}

print join("\t",'chr','fivep','threep','strand',@files),"\n";

foreach my $k(keys %output){
    my @s = split /\t/,$k;
    my $chr = $s[0];
    my $fivep = $s[1];
    my $threep = $s[2];
    my $strand = $s[3];
    next if($strand =~ /^0$/);
    if($strand =~ /^0$/){
        print join("\t",$chr,$fivep,$threep,"+",@{$output{$chr."\t".$fivep."\t".$threep."\t".$strand}} ),"\n";
    }else{
        print join("\t",$chr,$fivep,$threep,"-",@{$output{$chr."\t".$fivep."\t".$threep."\t".$strand}} ),"\n";
    }
}

I have several input files and many more lines than the current dummy files I have shown here.


Replies are listed 'Best First'.
Re: open input files once by rnewsham (Curate) on Jul 14, 2020 at 06:49 UTC
If you don't want to read the files twice you need to think about your data structure and how you turn it into the desired output. Something like this should work: read the data into a temporary data structure with the file as a secondary key, then process it to fill in the blanks at output time.

#!/usr/bin/perl
use strict;
use warnings;

my @files = glob("file*.tab");
my %data;

for my $file ( @files ) {
    open my $fh, '<', $file or die $!;
    while ( <$fh> ) {
        chomp;
        my @columns = split /\t/;
        die "_ERROR_ not 9 columns [ $_ ]\n" if @columns != 9;
        my $key = join( "\t", @columns[0..3] );
        $data{$key}->{$file} = $columns[6];
    }
    close $fh;
}

print join("\t", 'chr', 'fivep', 'threep', 'strand', @files), "\n";

for my $key ( keys %data ) {
    my ( $chr, $fivep, $threep, $strand ) = split /\t/, $key;
    next if ( $strand =~ /^0$/ );
    my @output;
    for my $file ( @files ) {
        push @output, $data{$key}->{$file} // 0;
    }
    print join("\t", $chr, $fivep, $threep, "-", @output ), "\n";
}

I also cleaned up the code a little and added some whitespace around things, as I find it makes it easier to read.
Re^2: open input files once by Tux (Canon) on Jul 14, 2020 at 10:25 UTC
Whitespace to make things easier to read is completely within the eyes of the beholder. Most often, I find removing extraneous redundant annoying empty lines improving the readability and maintainability. Also cleaning up is only useful if it matches the style of the surrounding project/script/module(s) after the cleanup. e.g. your cleaned-up code would not fit my standards and my code would not fit yours.

And if you clean up, why not go further?

foreach my $key (grep { !m/\t0$/ } keys %data) {
    my ($chr, $fivep, $threep, $strand) = split m/\t/ => $key;
    say join "\t" => $chr, $fivep, $threep, "-", map { $data{$key}{$_} // 0 } @files;
}

Less lines more to the point, no need for empty space as each line is clear on itself.

Personally, I'd use Text::CSV_XS with `sep => "\t"` like this:

use Text::CSV_XS qw( csv );

my @key   = qw( chr fivep threep strand );
my @files = qw( file1.tab file2.tab );
my %c7;
foreach my $file (@files) {
    csv (in => $file, out => undef, sep => "\t", strict => 1,
        on_in => sub { $c7{join ":" => @{$_[1]}[0..3]}{$file} = $_[1][6] });
}
say join "\t" => @key, @files;
foreach my $key (sort keys %c7) {
    say join "\t" => (split m/:/ => $key), map { $c7{$key}{$_} // 0 } @files;
}

Enjoy, Have FUN! H.Merijn
Re^3: open input files once by rnewsham (Curate) on Jul 14, 2020 at 11:47 UTC
Oh I agree, everyone has their own preferences. My comment was mainly to point out why I changed other things, as I was finding their code a little unreadable in my pre-caffeine state.
Re^2: open input files once by v15 (Sexton) on Jul 14, 2020 at 18:10 UTC
Hi, can you explain this line of code: `push @output, $data{$key}->{$file} // 0`? What is `//` doing in the above code? Also, is there a way I can get a sorted output where column 1 is sorted, then column 2 and then column 3?
Re^3: open input files once by Tux (Canon) on Jul 14, 2020 at 18:58 UTC
You should be more specific in your definition of "sorted". Look at my solution elsewhere in this thread. If you mean that the 2nd, 3rd and 4th columns should be sorted numerically, that isn't very hard either: use `pack`/`unpack`

my @key   = qw( chr fivep threep strand );
my @files = qw( file1.tab file2.tab );
my %c7;
foreach my $file (@files) {
    csv (
        in    => $file,
        out   => undef,
        sep   => "\t",
        on_in => sub { $c7{pack "A10l>l>l>", @{$_[1]}[0..3]}{$file} = $_[1][6] }
        );
}
say join "\t" => @key, @files;
foreach my $key (sort keys %c7) {
    say join "\t" => (unpack "A10l>l>l>" => $key), map { $c7{$key}{$_} // 0 } @files;
}

Enjoy, Have FUN! H.Merijn
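If "sorted" means column 1 as a string, then columns 2 and 3 as numbers, one alternative to packed keys is a plain multi-key comparator on the tab-joined key. This is only a sketch: the sub name `by_position` and the sample keys are made up for illustration, and it assumes keys joined with tabs as in the scripts earlier in the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: compare two tab-joined keys,
# column 1 as a string, columns 2 and 3 as numbers.
sub by_position {
    my ($x, $y) = @_;
    my @l = split /\t/, $x;
    my @r = split /\t/, $y;
    return $l[0] cmp $r[0] || $l[1] <=> $r[1] || $l[2] <=> $r[2];
}

# A few keys taken from the sample data, deliberately out of order.
my @keys = (
    "chr10\t95850\t122380\t2",
    "chr10\t181243\t225933\t1",
    "chr10\t94055\t94743\t2",
    "chr10\t94055\t94554\t2",
    "chr10\t181243\t196457\t1",
);

my @sorted = sort { by_position($a, $b) } @keys;
print "$_\n" for @sorted;
```

Note that column 1 is compared as text here, so `chr10` sorts before `chr2`; a natural chromosome order would need its own comparison step.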
Re^3: open input files once by haukex (Archbishop) on Jul 14, 2020 at 20:11 UTC
"What is // doing in the above code?"

`//` in this case is the Logical Defined Or (as opposed to the empty regular expression, for example in "`$foo =~ //;`"). The expression `$data{$key}{$file} // 0` is the same as `defined($data{$key}{$file}) ? $data{$key}{$file} : 0`.
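A small sketch of why `//` rather than `||` matters for this kind of data: a count of 0 is defined but false, so the two operators disagree on it. The hash `%count` and the two helper subs are hypothetical names used only for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';   # the // operator itself needs Perl 5.10 or later

# Hypothetical counts: file2.tab holds a real 0, file3.tab has no entry at all.
my %count = ( 'file1.tab' => 15, 'file2.tab' => 0 );

# Falls back only when the value is undefined.
sub with_defined_or { my ($file) = @_; return $count{$file} // 'missing' }

# Falls back whenever the value is false, which includes 0.
sub with_logical_or { my ($file) = @_; return $count{$file} || 'missing' }

for my $file (qw( file1.tab file2.tab file3.tab )) {
    say join "\t", $file, with_defined_or($file), with_logical_or($file);
}
```

With `//` the stored 0 for file2.tab survives, while `||` would replace it with the fallback.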
Re: open input files once by Marshall (Canon) on Jul 14, 2020 at 10:44 UTC
I am not quite sure if I replicated your expected output. Read file1 into data struct leaving placeholder for the 2nd value in file2. Read file2, replace placeholder with 2nd value if key exists otherwise make a 0 for 1st value. Fiddle with this to get what you want. Basic idea is only read each file once.

Update: Below I used a single string, but an anon array of 2 elements would also get the job done.

use strict;
use warnings;
use Inline::Files;
use Data::Dumper;
$|=1;

my %data;

while (<FILE1>)
{
    next unless /^\S/;  #skip blank lines
    my (@tokens) = (split( " ",$_))[0,1,2,3,6];
    my $value = pop @tokens;
    $data{"@tokens"} = "$value 0";
}

while (<FILE2>)
{
    next unless /^\S/;  #skip blank lines
    my (@tokens) = (split( " ",$_))[0,1,2,3,6];
    my $value2 = pop @tokens;
    if ($data{"@tokens"})
    {
        $data{"@tokens"}=~ s/0$/$value2/;
    }
    else
    {
        $data{"@tokens"} = "0 $value2";
    }
}

foreach (sort keys %data)
{
    print "$_: $data{$_} \n";
}

=prints:
chr10 181243 196457 1: 1 0
chr10 181243 225933 1: 4 3
chr10 181500 225933 1: 15 7
chr10 226069 255828 1: 53 38
chr10 255948 267233 1: 2 0
chr10 255989 267134 1: 65 29
chr10 255989 282777 1: 14 15
chr10 264440 267134 1: 1 0
chr10 267297 282777 1: 0 44
chr10 94055 94554 2: 1 1
chr10 94055 94743 2: 0 1
chr10 94666 94743 2: 0 3
chr10 94853 95347 2: 0 2
chr10 95850 122380 2: 1 0
=cut

__FILE1__
chr10 94055 94554 2 2 1 1 0 23
chr10 95850 122380 2 2 1 1 0 15
chr10 181243 196457 1 1 1 1 0 21
chr10 181243 225933 1 1 1 4 0 21
chr10 181500 225933 1 1 1 15 0 25
chr10 226069 255828 1 1 1 53 0 22
chr10 255948 267233 1 1 1 2 0 12
chr10 255989 267134 1 1 1 65 0 25
chr10 255989 282777 1 1 1 14 0 23
chr10 264440 267134 1 1 1 1 0 25
__FILE2__
chr10 94055 94554 2 2 1 1 0 21
chr10 94055 94743 2 2 1 1 0 20
chr10 94666 94743 2 2 1 3 0 20
chr10 94853 95347 2 2 1 2 0 25
chr10 181243 225933 1 1 1 3 0 21
chr10 181500 225933 1 1 1 7 0 10
chr10 226069 255828 1 1 1 38 0 22
chr10 255989 267134 1 1 1 29 0 24
chr10 255989 282777 1 1 1 15 0 23
chr10 267297 282777 1 1 1 44 0 24
Re: open input files once by tybalt89 (Monsignor) on Jul 14, 2020 at 19:11 UTC
Works for any number of files...

#!/usr/bin/perl

use strict; # https://perlmonks.org/?node_id=11119283
use warnings;

@ARGV = <file*.tab>; # or just default from command line
my $column = 0;
my %key;
while( <> )
  {
  my @fields = split;
  $key{ join "\t", @fields[0 .. 3] }[$column] = $fields[6];
  eof and ++$column;
  }
for ( sort keys %key )
  {
  print join( "\t", (split)[0 .. 2], '-',
    map $_ // 0, @{ $key{$_} }[ 0 .. $column - 1 ] ), "\n";
  }
Re: open input files once by jwkrahn (Abbot) on Jul 14, 2020 at 20:45 UTC
This seems to work:

#!/usr/bin/perl
use strict;
use warnings;

my @header = qw/ chr fivep threep strand /;
my %output;

@ARGV = glob 'file*.tab';

while ( <> ) {
    tr/\t// == 8 or die "_ERROR_ not 9 columns: $_";
    my ( $key, $val ) = /^ ( [^\t]+ \t [^\t]+ \t [^\t]+ \t [^\t]+ ) \t [^\t]+ \t [^\t]+ \t ( [^\t]+ ) /x;
    $output{ $key }[ @header - 4 ] = $val || 0;
    if ( eof ) {
        push @header, $ARGV;
    }
}

print join( "\t", @header ), "\n";

for my $key ( keys %output ) {
    my $new = $key =~ s/ \t \K ( [^\t]+ ) \z / $1 eq '0' ? '+' : '-' /rex;
    $output{ $key }[ $_ ] //= 0 for 0 .. $#header - 4;
    print join( "\t", $new, @{ $output{ $key } } ), "\n";
}
Re: open input files once by BillKSmith (Monsignor) on Jul 14, 2020 at 19:25 UTC
The following approach is similar to your first pass. The entire output hash is built in this pass. Every time I see a new key, rather than initializing it with an empty array, I use an array of zeros and overwrite the one corresponding to the file. Every time I find an existing key, I overwrite the zero which corresponds to both the file and that key. The resulting hash of arrays is passed to the print section. It prints one line for each key. It corrects the format of strand, then prints the corrected key and the values of the data array.

use strict;
use warnings;
use feature 'state';

my @files = glob "file*.tab";
my %output;

foreach my $file (@files) {
    open my $fh, '<', $file or die $!;
    state $fn = -1;
    $fn++;
    while (<$fh>) {
        chomp;
        my @s = split /\t/;
        die "_ERROR_ not 9 columns [$_]\n" if @s != 9;
        my $key = join "\t", @s[0..3];
        $output{$key} = [('0') x @files] if (!exists $output{$key});
        $output{$key}[$fn] = $s[6];
    }
    close $fh;
}

print join("\t",'chr','fivep','threep','strand',@files),"\n";

foreach my $key (sort keys %output) {
    my $values = $output{$key};
    my $temp_key = $key;
    $temp_key =~ s/[^\t]+$/ ($& eq 0) ? '+' : '-'/e;
    print join( "\t", $temp_key, @$values ), "\n";
}

Bill
Re: open input files once by Anonymous Monk on Jul 14, 2020 at 17:00 UTC
There is, indeed, no need at all for the first loop. You don't need to initialize `%output` with empty arrays before replacing those empty arrays with real ones.
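The reason no initialization pass is needed is autovivification: assigning to an element through a hash entry that does not exist yet creates the array reference on the spot. A minimal sketch, assuming one slot per file; the `@rows` list of [key, file index, value] triples is made up here to stand in for lines read from the files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %output;   # never pre-populated with empty arrays

# Hypothetical parsed rows: [ key, file index, column 7 value ]
my @rows = (
    [ "chr10\t94055\t94554\t2", 0, 1 ],
    [ "chr10\t94055\t94554\t2", 1, 1 ],
    [ "chr10\t94666\t94743\t2", 1, 3 ],   # present in the second file only
);

for my $row (@rows) {
    my ($key, $fn, $value) = @$row;
    $output{$key}[$fn] = $value;   # creates $output{$key} as an array ref if needed
}

# Slots that were never assigned are undef; turn them into 0 at print time.
for my $key (sort keys %output) {
    print join("\t", $key, map { $_ // 0 } @{ $output{$key} }[0 .. 1]), "\n";
}
```

The gaps are filled when printing, so a single pass over each file is enough.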
