Re: How to deal with Huge data
by lin0 (Curate) on Jan 23, 2007 at 12:27 UTC
Re: How to deal with Huge data
by kyle (Abbot) on Jan 23, 2007 at 12:32 UTC
I think DBM::Deep would help. With it, you'd put your %data on disk, basically. Your code would be mostly the same except:
- my(%data) would be my $data = DBM::Deep->new( "foo.db" );
- $data{x} would be $data->{x}
- keys %data would be keys %$data
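A minimal sketch of those substitutions (DBM::Deep is a CPAN module; the file name foo.db is from the parent comment, the keys are made up for illustration):

```perl
use strict;
use warnings;
use DBM::Deep;

# Instead of: my %data;
my $data = DBM::Deep->new( "foo.db" );   # the hash now lives on disk

# Instead of: $data{probe1}{sample1} = 42;
$data->{probe1}{sample1} = 42;           # nested writes autovivify as usual

# Instead of: for my $k (keys %data) { ... }
for my $k ( keys %$data ) {
    print "$k => $data->{$k}{sample1}\n";
}
```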
Re: How to deal with Huge data
by glasswalk3r (Friar) on Jan 23, 2007 at 12:39 UTC
First of all, what is the nature of the text file? Does it have repeated keys inside the same file?
Using hashes does consume a lot of memory... but you can always divide and conquer. ;-)
Supposing that you have two different files that you want to merge, you could try to shrink each one first by looking for repeated keys and summing their numeric data.
Of course, this depends on whether each file can be reduced to an acceptable size. If not, you could consider working with slices of the file, or using a database: since a database holds the data on disk, this should work. You can use any database, but DBD::DBM looks ideal for your needs.
What do you mean by "The cols are different in each file (but not always ...)"? That the columns are in different places? It is easier to normalize that first (put all columns and values in predefined positions) and then start working. For instance, once the files have the columns in the correct positions, you can even forget about jumping over the first line: the program can easily print the column names in the output file later.
You're using hash references inside hashes... the more complicated the structures you use, the more memory the program will require.
Some other tips:
- If you're on a UNIX OS, convert the DOS newlines (\r\n) to UNIX newlines (\n) before processing the files. Then you can avoid a regular expression and just use chomp to remove the newline.
- Do not declare a variable with my every time the program enters a loop. Declare the variable before the loop and then, inside the loop, once the variable is no longer needed, just clear the value it holds. This is faster.
- Use Benchmark.
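The second tip can be checked with the core Benchmark module rather than taken on faith; a hypothetical micro-benchmark (the sample data and sub names are made up):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @lines = ("probe\t1.5\t2.5\n") x 1000;

cmpthese( 200, {
    # lexical declared afresh inside the loop
    my_inside => sub {
        for my $line (@lines) {
            my @t = split /\t/, $line;
        }
    },
    # lexical hoisted out of the loop and cleared instead
    my_outside => sub {
        my @t;
        for my $line (@lines) {
            @t = split /\t/, $line;
            @t = ();
        }
    },
} );
```

cmpthese prints a rate table comparing the two variants, so the "this is faster" claim can be measured on your own perl and data.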
Alceu Rodrigues de Freitas Junior
---------------------------------
"You have enemies? Good. That means you've stood up for something, sometime in your life." - Sir Winston Churchill
I didn't quite understand what "OP" means, but anyway... the tip is a bit off-topic, since it's not related to the memory issue. But it is a tip anyway.
Doing things like the code below:
my @samples;
my @t;
T: while ( my $line = <GSE> ) {
    $line =~ s/[\r\n]//g;
    @t = split( /\t/, $line );
    if ( $. == 1 ) {
        shift(@t);
        @samples = @t;
        next T;
    }
    # ... process @t here ...
    @t = ();
}
should avoid a memory allocation every time the variable would otherwise be created and destroyed. If it really does not work like that, please let me know.
Alceu Rodrigues de Freitas Junior
---------------------------------
"You have enemies? Good. That means you've stood up for something, sometime in your life." - Sir Winston Churchill
Re: How to deal with Huge data
by Thelonius (Priest) on Jan 23, 2007 at 13:56 UTC
The key thing to do is to have your data sorted by your key (in this case the first column of your data). Then you only need to keep a small amount of data in memory.
By the way, in the case where the column headers of your input files are not all different (except for the first), your program's output depends on the order the files are named in the command line.
Here's one way to process the data. I haven't tested this at all except to check that it passes perl -c.
#!/usr/bin/perl -w
use strict;
use warnings;

my @fileHeaders;
my %usedHeaders;
my $sortname = "sortfile.tmp";
# If you have more than one disk, consider putting this file
# on a disk different from most of your data.
# Beware of disk space.
open SORTFILE, ">", $sortname
    or die "Cannot create $sortname: $!\n";

# first pass: write to sortfile
foreach my $filename (@ARGV) {
    open GSE, "<", $filename
        or die "Cannot open $filename: $!\n";
    my $headerline = <GSE>;
    chomp($headerline);
    $headerline =~ s/\r$//;
    my @thisfileHeaders = split /\t/, $headerline;
    shift @thisfileHeaders;
    push @fileHeaders, \@thisfileHeaders;
    my $filekey = sprintf "%04d", $#fileHeaders;
    while (<GSE>) {
        y/\r//d;
        s/^([^\t]*)/$1\t$filekey/;
        print SORTFILE $_
            or die "print SORTFILE failed: $!\n";
    }
    if ($. > 1) {    # if any data past headers
        $usedHeaders{$_}++ for @thisfileHeaders;
    }
    close GSE;
}
close SORTFILE
    or die "close SORTFILE failed: $!\n";
# Yes, close can fail. Buffered I/O, for one thing.
system("sort -o sortfile.tmp sortfile.tmp") == 0
    or die "Sort failed!\n";

# second pass: consolidate data from multiple files
my $prevkey;
open SORTFILE, "<", $sortname
    or die "Cannot open $sortname: $!\n";
my @samples = sort keys %usedHeaders;
print "Probe\t" . join( "\t", @samples ) . "\n";
my %data;
while (my $ligne = <SORTFILE>) {
    chomp($ligne);
    my @t = split( /\t/, $ligne );
    my $probe   = shift @t;
    my $filenum = shift @t;
    if (!$prevkey || $probe ne $prevkey) {
        dumpdata();
        $prevkey = $probe;
    }
    @data{ @{ $fileHeaders[$filenum] } } = @t;
}
dumpdata();

sub dumpdata {
    if (defined($prevkey)) {
        print "$prevkey\t", join( "\t", @data{@samples} ), "\n";
    }
    # initialize to blanks so no warnings on printing undef
    $data{$_} = "" for @samples;
}
Update: small edits.
Re: How to deal with Huge data
by roboticus (Chancellor) on Jan 23, 2007 at 14:09 UTC
Marsel:
Another way you might be able to do the job is with a file merge. To do so, sort both files on the key(s) of interest, then read records in order and merge them as appropriate.
Example:
#!/usr/bin/perl -w
use strict;
use warnings;

open F1,  'sort -k3 mergefile.1|' or die "opening file 1";
open F2,  'sort -k2 mergefile.2|' or die "opening file 2";
open OUF, '>', 'mergefile.out'    or die "opening output file";

my @in1;
my @in2;

sub getrec1 {
    @in1 = ();
    if (!eof(F1)) {
        (@in1) = split /\t/, <F1>;
        chomp $in1[2];
    }
}

sub getrec2 {
    @in2 = ();
    if (!eof(F2)) {
        (@in2) = split /\t/, <F2>;
        chomp $in2[2];
    }
}

sub write1 {
    print OUF "$in1[2]\t$in1[0]\t$in1[1]\tnull\tnull\n";
    getrec1;
}

sub write2 {
    print OUF "$in2[1]\tnull\tnull\t$in2[0]\t$in2[2]\n";
    getrec2;
}

sub writeboth {
    print OUF "$in1[2]\t$in1[0]\t$in1[1]\t$in2[0]\t$in2[2]\n";
    getrec1;
    getrec2;
}

# Prime the pump
getrec1;
getrec2;
while (1) {
    last if $#in1 < 0 and $#in2 < 0;
    if ($#in1 < 0 or $#in2 < 0) {
        # Only one file is left...
        write2 if $#in1 < 0;
        write1 if $#in2 < 0;
    }
    elsif ($in1[2] eq $in2[1]) {
        # Matching records, merge & write 'em
        writeboth;
    }
    elsif ($in1[2] lt $in2[1]) {
        # unmatched item in file 1, write it & get next rec
        write1;
    }
    else {
        # unmatched item in file 2, write it & get next rec
        write2;
    }
}
Example output:
root@swill ~/PerlMonks
$ cat mergefile.1
15 20 foo
22 30 bar
30 33 baz
14 22 fubar
root@swill ~/PerlMonks
$ cat mergefile.2
alpha baz 17.30
gamma foobar 22.35
gamma bar 19.01
delta fromish 33.03
sigma bear 14.56
root@swill ~/PerlMonks
$ ./file_merge.pl
root@swill ~/PerlMonks
$ cat mergefile.out
bar 22 30 gamma 19.01
baz 30 33 alpha 17.30
bear null null sigma 14.56
foo 15 20 null null
foobar null null gamma 22.35
fromish null null delta 33.03
fubar 14 22 null null
root@swill ~/PerlMonks
$
--Roboticus
Re: How to deal with Huge data
by NiJo (Friar) on Jan 23, 2007 at 20:36 UTC
So you are joining two files on a common field?
That looks like a job for DBD::CSV, which treats text files as database tables. Your code turns into a single SQL query, and handling of variable column order comes for free, since DBD::CSV reads the column names from the first line of each file.
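A sketch of that idea, assuming DBD::CSV (with its DBI and SQL::Statement dependencies) is installed; the file names, column names, and values below are made up for illustration:

```perl
use strict;
use warnings;
use DBI;

# Two tiny illustrative files standing in for the real data sets.
open my $fh, '>', 'file1.tsv' or die $!;
print $fh "probe\tsample_a\np1\t1.5\np2\t3.0\n";
close $fh;
open $fh, '>', 'file2.tsv' or die $!;
print $fh "probe\tsample_b\np1\t2.5\np2\t4.0\n";
close $fh;

# DBD::CSV presents each file in f_dir as a table; column names
# are taken from the first line of each file.
my $dbh = DBI->connect( "dbi:CSV:", undef, undef, {
    f_dir        => ".",
    f_ext        => ".tsv",
    csv_sep_char => "\t",
    RaiseError   => 1,
} );

# The whole merge becomes one SQL join instead of hash bookkeeping.
my $sth = $dbh->prepare(q{
    SELECT file1.probe, file1.sample_a, file2.sample_b
    FROM   file1 JOIN file2 ON file1.probe = file2.probe
});
$sth->execute;
while ( my @row = $sth->fetchrow_array ) {
    print join( "\t", @row ), "\n";
}
```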
Re: How to deal with Huge data
by bsb (Priest) on Jan 24, 2007 at 00:06 UTC
You could use DBD::SQLite to store and process the data, and possibly the sqlite3 command-line tool itself to load it.
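A minimal sketch of the DBD::SQLite route (the schema, table name, and values are invented for illustration; real loading would read the input files in a loop):

```perl
use strict;
use warnings;
use DBI;

# One row per (probe, sample, value) triple keeps the schema trivial.
my $dbh = DBI->connect( "dbi:SQLite:dbname=huge.db", "", "",
                        { RaiseError => 1, AutoCommit => 0 } );

$dbh->do(q{
    CREATE TABLE IF NOT EXISTS expr (
        probe  TEXT,
        sample TEXT,
        value  REAL
    )
});

my $ins = $dbh->prepare("INSERT INTO expr (probe, sample, value) VALUES (?, ?, ?)");
$ins->execute( 'p1', 's1', 1.5 );   # in real use, loop over the input lines
$ins->execute( 'p1', 's2', 2.5 );
$dbh->commit;                        # batched commits keep loading fast

# The merge becomes a query instead of an in-memory hash:
my $rows = $dbh->selectall_arrayref(
    "SELECT probe, sample, value FROM expr ORDER BY probe, sample"
);
print join( "\t", @$_ ), "\n" for @$rows;
```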
Re: How to deal with Huge data
by TOD (Friar) on Jan 23, 2007 at 16:12 UTC
umm... what if you use a spreadsheet application and simply import the files? Transpose the matrix, so that rows become columns and columns become rows. And if you still see the need to fiddle with a Perl algorithm, export the result to TSVs: then you can manipulate what were originally columns with the standard array functions, such as grep, map, and so on.
Your approach via hash tables seems basically wrong. Try references instead.