http://www.perlmonks.org?node_id=596074

Marsel has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I'd like to merge several data matrices into one. The input is a set of text files, each with a header as the first line giving the names of the columns, and on each following line an identifier followed by the numerical data for that row.
The columns differ from file to file (but not always ...) and the rows are almost the same in all the files (but not always ...). The total size is about 10000 columns and 55000 lines.

The simplest way to explain what I want to do is the script below:
#!/usr/bin/perl -w
use strict;
use warnings;

my %data;

foreach my $filename (@ARGV) {
    open( GSE, '<', $filename ) or die "Cannot open $filename: $!\n";
    my @samples;
    T: while ( my $ligne = <GSE> ) {
        $ligne =~ s/[\r\n]//g;
        my @t = split( /\t/, $ligne );
        if ( $. == 1 ) {
            shift(@t);          # drop the identifier column header
            @samples = @t;      # the rest are the column (sample) names
            next T;
        }
        my $probe = shift(@t);
        for ( my $i = 0; $i < @t; $i++ ) {
            $data{$probe}{ $samples[$i] } = $t[$i];
            $data{'samples'}{ $samples[$i] }++;   # remember every column seen
        }
    }
    close(GSE);
}

my @samples = keys %{ $data{'samples'} };
print "Probe\t" . join( "\t", @samples ) . "\n";
foreach my $probe (keys %data) {
    unless ( $probe eq 'samples' ) {
        print "$probe\t" . join( "\t", @{ $data{$probe} }{@samples} ) . "\n";
    }
}

But in fact, even with 8 GB of RAM and 8 GB of swap, it's too much, due to the space needed to store a hash, I think. So do I have to leave Perl and go to C, or is there another solution?


Thanks a lot in advance for any clues,

Marcel

Replies are listed 'Best First'.
Re: How to deal with Huge data
by lin0 (Curate) on Jan 23, 2007 at 12:27 UTC
Re: How to deal with Huge data
by kyle (Abbot) on Jan 23, 2007 at 12:32 UTC

    I think DBM::Deep would help. With it, you'd basically put your %data on disk. Your code would be mostly the same, except (a small sketch follows the list):

    • my(%data) would be my $data = DBM::Deep->new( "foo.db" );
    • $data{x} would be $data->{x}
    • keys %data would be keys %$data
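    Putting those substitutions together, a minimal self-contained sketch (the database file name and the sample values below are only illustrations, not the OP's data):

    use strict;
    use warnings;
    use DBM::Deep;

    # the nested structure now lives on disk instead of in RAM
    # ("merge.db" is an illustrative file name)
    my $data = DBM::Deep->new( "merge.db" );

    # writes go straight to the disk file
    $data->{'probe_1'}{'sample_A'} = 42;
    $data->{'samples'}{'sample_A'}++;

    # reads look just like ordinary nested-hash access
    my @samples = keys %{ $data->{'samples'} };
    for my $probe (keys %$data) {
        next if $probe eq 'samples';
        print "$probe\t", join( "\t", @{ $data->{$probe} }{@samples} ), "\n";
    }
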
Re: How to deal with Huge data
by glasswalk3r (Friar) on Jan 23, 2007 at 12:39 UTC

    First of all, what is the nature of the text file? Does it have repeated keys inside the same file?

    Using hashes does consume a lot of memory... but you can always divide and conquer. ;-)

    Supposing that you have two different files that you want to merge, you could try to reduce each one first by looking for repeated keys and summing the numeric data (a rough sketch of that step follows below).

    Of course, this depends on whether each file can be reduced to an acceptable size. If it can't, you could consider working with slices of the file, or using a database: since a database holds the data on disk, this should work. You can use any database, but DBD::DBM looks ideal for your needs.
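    Here is a quick sketch of that reduction for a single file (the file name and the choice to sum duplicate rows are assumptions for illustration):

    use strict;
    use warnings;

    # collapse repeated row identifiers by summing their numeric columns
    my %sum;
    open my $in, '<', 'file1.txt' or die "file1.txt: $!";
    chomp( my $header = <$in> );
    while ( my $line = <$in> ) {
        $line =~ s/[\r\n]+$//;
        my ( $probe, @values ) = split /\t/, $line;
        for my $i ( 0 .. $#values ) {
            $sum{$probe}[$i] = ( $sum{$probe}[$i] || 0 ) + $values[$i];
        }
    }
    close $in;

    # write the reduced file: one line per unique identifier
    print "$header\n";
    print join( "\t", $_, @{ $sum{$_} } ), "\n" for sort keys %sum;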

    What do you mean by saying "The cols are different in each file (but not always ...)"? Does that mean the columns are in different places? It is easier to normalize that first (put all columns and values in a previously defined position) and then start working. For instance, once the files have their columns in the correct positions, you can even forget about jumping over the first line: the program can easily print the column names in the output file later.

    You're using hash references inside hashes... the more complicated the structures you use, the more memory the program will require.

    Some other tips:

    1. If you're on a UNIX OS, convert the DOS newline characters (\r\n) to UNIX newlines (\n) before you start processing the files. That way you can drop the regular expression and simply use chomp to remove the newline.
    2. Do not declare a variable with my every time the program enters a loop. Declare the variable before the loop and then, inside the loop, once the variable is no longer needed, just clear the value it holds. This is faster.
    3. Use Benchmark to check whether such micro-optimisations actually pay off; a rough sketch follows this list.
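    For instance, a rough cmpthese() comparison of tip 2 (the loop bodies below are stand-ins for illustration, not the original code):

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    my $line = join "\t", 'probe', 1 .. 20;    # dummy input line

    cmpthese( -2, {
        my_inside => sub {
            for ( 1 .. 100 ) {
                my @t = split /\t/, $line;     # fresh lexical each pass
            }
        },
        my_outside => sub {
            my @t;
            for ( 1 .. 100 ) {
                @t = split /\t/, $line;        # reuse the same array
                @t = ();
            }
        },
    } );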
    Alceu Rodrigues de Freitas Junior
    ---------------------------------
    "You have enemies? Good. That means you've stood up for something, sometime in your life." - Sir Winston Churchill
      Do not declare a variable with my every time the program enters a loop. Declare the variable before the loop and then, inside the loop, once the variable is no longer needed, just clear the value it holds. This is faster.

      The OP is doing I/O. How could this possibly matter, if it's even true?

        I didn't quite understand what "OP" means, but anyway... the tip is a bit off-topic, since it's not related to the memory issue. But it is a tip anyway.

        Doing things like the code below:

        my @t;
        T: while ( my $line = <GSE> ) {
            $line =~ s/[\r\n]//g;
            @t = split( /\t/, $line );
            if ( $. == 1 ) {
                shift(@t);
                @samples = @t;
                next T;
            }
            # ... work with @t here ...
            @t = ();    # clear the contents, but keep the variable allocated
        }

        should avoid allocating and freeing memory every time the variable would otherwise be created and destroyed. If it really does not work like that, please let me know.

        Alceu Rodrigues de Freitas Junior
        ---------------------------------
        "You have enemies? Good. That means you've stood up for something, sometime in your life." - Sir Winston Churchill
Re: How to deal with Huge data
by Thelonius (Priest) on Jan 23, 2007 at 13:56 UTC
    The key thing to do is to have your data sorted by your key (in this case the first column of your data). Then you only need to keep a small amount of data in memory.

    By the way, when the column headers of your input files are not all distinct (apart from the first column), your program's output depends on the order in which the files are named on the command line.

    Here's one way to process the data. I haven't tested this at all except to check that it passes perl -c.

    #!/usr/bin/perl -w
    use strict;
    use warnings;

    my @fileHeaders;
    my %usedHeaders;
    my $sortname = "sortfile.tmp";
    # If you have more than one disk, consider putting this on
    # a disk different from most of your data.
    # Beware of disk space.
    open SORTFILE, ">", $sortname or die "Cannot create $sortname: $!\n";

    # first pass, write to sortfile
    foreach my $filename (@ARGV) {
        open GSE, "<", $filename or die "Cannot open $filename: $!\n";
        my $headerline = <GSE>;
        chomp($headerline);
        $headerline =~ s/\r$//;
        my @thisfileHeaders = split /\t/, $headerline;
        shift @thisfileHeaders;
        push @fileHeaders, \@thisfileHeaders;
        my $filekey = sprintf "%04d", $#fileHeaders;
        while (<GSE>) {
            y/\r//d;
            s/^([^\t]*)/$1\t$filekey/;
            print SORTFILE $_ or die "print SORTFILE failed: $!\n";
        }
        if ($. > 1) {    # if any data past headers
            $usedHeaders{$_}++ for @thisfileHeaders;
        }
        close GSE;
    }
    close SORTFILE or die "close SORTFILE failed: $!\n";
    # Yes, close can fail. Buffered I/O, for one thing.

    system("sort -o sortfile.tmp sortfile.tmp") == 0
        or die "Sort failed!\n";

    # second pass; consolidate data from multiple files
    my $prevkey;
    open SORTFILE, "<", $sortname or die "Cannot open $sortname: $!\n";
    my @samples = sort keys %usedHeaders;
    print "Probe\t" . join( "\t", @samples ) . "\n";
    my %data;
    while ( my $ligne = <SORTFILE> ) {
        chomp($ligne);
        my @t = split( /\t/, $ligne );
        my $probe   = shift @t;
        my $filenum = shift @t;
        if ( !$prevkey || $probe ne $prevkey ) {
            dumpdata();
            $prevkey = $probe;
        }
        @data{ @{ $fileHeaders[$filenum] } } = @t;
    }
    dumpdata();

    sub dumpdata {
        if ( defined($prevkey) ) {
            print "$prevkey\t", join( "\t", @data{@samples} ), "\n";
        }
        $data{$_} = "" for @samples;    # initialize to blanks so no warnings on printing undef
    }
    Update: small edits.
Re: How to deal with Huge data
by roboticus (Chancellor) on Jan 23, 2007 at 14:09 UTC
    Marsel:

    Another way you might be able to do the job is with a file merge. To do so, sort both files on the key(s) of interest, then read records in order and merge them as appropriate.

    Example:

    #!/usr/bin/perl -w
    use strict;
    use warnings;

    open F1,  'sort -k3 mergefile.1|' or die "opening file 1";
    open F2,  'sort -k2 mergefile.2|' or die "opening file 2";
    open OUF, '>', 'mergefile.out'    or die "opening output file";

    my @in1;
    my @in2;

    sub getrec1 {
        @in1 = ();
        if ( !eof(F1) ) {
            (@in1) = split /\t/, <F1>;
            chomp $in1[2];
        }
    }

    sub getrec2 {
        @in2 = ();
        if ( !eof(F2) ) {
            (@in2) = split /\t/, <F2>;
            chomp $in2[2];
        }
    }

    sub write1 {
        print OUF "$in1[2]\t$in1[0]\t$in1[1]\tnull\tnull\n";
        getrec1;
    }

    sub write2 {
        print OUF "$in2[1]\tnull\tnull\t$in2[0]\t$in2[2]\n";
        getrec2;
    }

    sub writeboth {
        print OUF "$in1[2]\t$in1[0]\t$in1[1]\t$in2[0]\t$in2[2]\n";
        getrec1;
        getrec2;
    }

    # Prime the pump
    getrec1;
    getrec2;

    while (1) {
        last if $#in1 < 0 and $#in2 < 0;
        if ( $#in1 < 0 or $#in2 < 0 ) {
            # Only one file is left...
            write2 if $#in1 < 0;
            write1 if $#in2 < 0;
        }
        elsif ( $in1[2] eq $in2[1] ) {
            # Matching records, merge & write 'em
            writeboth;
        }
        elsif ( $in1[2] lt $in2[1] ) {
            # unmatched item in file 1, write it & get next rec
            write1;
        }
        else {
            # unmatched item in file 2, write it & get next rec
            write2;
        }
    }
    Example output:

    root@swill ~/PerlMonks $ cat mergefile.1
    15      20      foo
    22      30      bar
    30      33      baz
    14      22      fubar
    root@swill ~/PerlMonks $ cat mergefile.2
    alpha   baz     17.30
    gamma   foobar  22.35
    gamma   bar     19.01
    delta   fromish 33.03
    sigma   bear    14.56
    root@swill ~/PerlMonks $ ./file_merge.pl
    root@swill ~/PerlMonks $ cat mergefile.out
    bar     22      30      gamma   19.01
    baz     30      33      alpha   17.30
    bear    null    null    sigma   14.56
    foo     15      20      null    null
    foobar  null    null    gamma   22.35
    fromish null    null    delta   33.03
    fubar   14      22      null    null
    root@swill ~/PerlMonks $
    --Roboticus
Re: How to deal with Huge data
by NiJo (Friar) on Jan 23, 2007 at 20:36 UTC
    So you are joining two files on a common field?

    That sounds like you want DBD::CSV, which treats text files as database tables. Your code turns into a single SQL query. Handling of variable column order comes for free, as DBD::CSV reads the column names from the first line of each file.
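    A rough sketch of that idea (the file names, column names, and separator settings below are assumptions, not taken from the original post):

    use strict;
    use warnings;
    use DBI;

    # each tab-separated .txt file in f_dir becomes a table named after the file
    my $dbh = DBI->connect( 'dbi:CSV:', undef, undef, {
        f_dir        => '.',
        f_ext        => '.txt/r',
        csv_sep_char => "\t",
        RaiseError   => 1,
    } ) or die $DBI::errstr;

    # join two hypothetical files on their shared identifier column
    my $sth = $dbh->prepare(
        'SELECT file1.Probe, file1.sample1, file2.sample2
           FROM file1 LEFT JOIN file2 ON file1.Probe = file2.Probe'
    );
    $sth->execute;
    while ( my @row = $sth->fetchrow_array ) {
        print join( "\t", map { defined $_ ? $_ : 'null' } @row ), "\n";
    }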

Re: How to deal with Huge data
by bsb (Priest) on Jan 24, 2007 at 00:06 UTC
    You could use DBD::SQLite to store and process the data and possibly "sqlite3" itself to load the data.
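    A rough sketch of the loading step with DBD::SQLite (the file, table, and column names are made up; a real loader would loop over every input file, and the merge itself would then be a query or a second pass):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=merge.db', '', '',
                            { RaiseError => 1, AutoCommit => 0 } );

    # store the matrix in "long" form: one row per (probe, sample, value)
    $dbh->do('CREATE TABLE IF NOT EXISTS expr (probe TEXT, sample TEXT, value REAL)');
    my $ins = $dbh->prepare('INSERT INTO expr (probe, sample, value) VALUES (?, ?, ?)');

    open my $fh, '<', 'file1.txt' or die "file1.txt: $!";
    chomp( my $header = <$fh> );
    my ( undef, @samples ) = split /\t/, $header;   # first header cell is the probe column
    while ( my $line = <$fh> ) {
        chomp $line;
        my ( $probe, @values ) = split /\t/, $line;
        $ins->execute( $probe, $samples[$_], $values[$_] ) for 0 .. $#values;
    }
    close $fh;
    $dbh->commit;
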
Re: How to deal with Huge data
by TOD (Friar) on Jan 23, 2007 at 16:12 UTC
    Umm... what if you use a spreadsheet application and simply import the files? Transpose the matrix, so that rows become columns and columns become rows. And if you still see the need to fiddle with a Perl algorithm, export the result to TSVs, because then you can manipulate what were originally columns with the standard array functions, such as grep, map, and so on.

    Your approach via hash tables seems to be basically wrong. Try references (pointers) instead.