biohisham has asked for the wisdom of the Perl Monks concerning the following question:
So I have this Perl program reading a very large file, building hashes and keeping line number counters. The Ubuntu 20.04 system (with 16 GB RAM and 4 GB swap) still kills the program because the hash eats the RAM. My workaround is to use MCE to loop through the file in chunks using 16 parallel workers.
The program builds the hashes and increments the counter alright, but the problem is that the variables inside mce_loop_f{} are not available outside it, so the rest of the code cannot use the counter to run through the hash and print the hash values to an output table. How can I make these variables and hashes available?
I could not find a solution for this in the many MCE docs, except a mention of how Perl may re-spawn: "It is possible that Perl may create a new code ref on subsequent runs causing MCE models to re-spawn. One solution to this is to declare global variables, referenced by workers, with 'our' instead of 'my'."
use strict;
use warnings;
use MCE::Loop;
use Data::Dumper;

my %hash;
our %hash2;
our $counter1 = 0;
our $counter2 = 0;

open (my $fh, "<", "DATA_F") or die($!);

print "printing counter1\n";
while (<$fh>) {
    my ($k, $v) = split;
    $hash{$k} = $v;
    print $counter1++, $/;
}
print "done printing counter1\n";

print "printing counter2\n";
MCE::Loop::init {
    use_slurpio => 1,
    max_workers => 16,
    init_relay  => 0,
};
mce_loop_f {
    # With use_slurpio, each worker receives a scalar ref to its chunk.
    my ($mce, $file, $id) = @_;
    open (my $ifh, "<", $file) or die("$!");
    while (<$ifh>) {
        my ($k, $v) = split;
        $hash2{$k} = $v;
        print $counter2++, $/;
    }
} $fh;
print "done printing counter2\n";

print $counter1, "=counter1 final\n";
print $counter2, "=counter2 final\n";
print Dumper(\%hash);
print Dumper(\%hash2);
printing counter1
0
1
2
done printing counter1
printing counter2
0
1
2
done printing counter2
3=counter1 final
0=counter2 final # How can I get MCE to make $counter2 available to print()?
$VAR1 = {
'2' => 'two',
'1' => 'one',
'3' => 'three'
};
$VAR1 = {}; #How can I get MCE to make %hash2 available to Dumper()?
Something or the other, a monk since 2009
Re: MCE: How to access variables globally
by marioroy (Prior) on Dec 20, 2021 at 08:11 UTC
Greetings, fellow biohisham,
This is adapted from your example and kcott's. One can provide a callback function for gather.
See kcott's example for the input file.
use strict;
use warnings;
use autodie;
use MCE::Loop;
use Data::Dumper;

my $data_file = 'DATA_F.dat';
my (%global_hash1, %global_hash2);
my ($global_counter1, $global_counter2) = (0, 0);

MCE::Loop::init {
    use_slurpio => 1,
    max_workers => 8,
    init_relay  => 0,
    gather      => sub {
        # Runs in the manager process; merges each worker's results.
        my ($counter, $hash_ref) = @_;
        $global_counter2 += $counter;
        while (my ($k, $v) = each %{$hash_ref}) {
            $global_hash2{$k} = $v;
        }
    },
};

print "# printing counter1\n";
{
    open (my $fh, '<', $data_file);
    while (<$fh>) {
        my ($k, $v) = split;
        $global_hash1{$k} = $v;
        print ++$global_counter1, $/;
    }
}
print "# done printing counter1\n";

print "# printing counter2\n";
mce_loop_f {
    my ($mce, $chunk_file, $chunk_id) = @_;
    my ($wid, $counter, %hash) = (MCE->wid, 0);
    my $output = "# worker $wid\n";
    open my $fh, '<', $chunk_file or die "$!";   # in-memory handle (slurpio)
    while (<$fh>) {
        my ($k, $v) = split;
        $hash{$k} = $v;
        $output .= (++$counter) . $/;
    }
    close $fh;
    MCE->gather($counter, \%hash);   # send this worker's results to the manager
    MCE->print($output);
} $data_file;
MCE::Loop->finish();
print "# done printing counter2\n";

print "counter1 final: ", $global_counter1, $/;
print "counter2 final: ", $global_counter2, $/;
print "hash1: ", Dumper(\%global_hash1);
print "hash2: ", Dumper(\%global_hash2);
Output
Greetings, fellow monks,
This one is an MCE::Relay demonstration. Well, my friends, MCE::Relay boggles my mind. What a treat :)
See kcott's example for the input file.
use strict;
use warnings;
use autodie;
use MCE::Loop;
use Data::Dumper;

my $data_file = 'DATA_F.dat';
my (%global_hash1, %global_hash2);
my ($global_counter1, $global_counter2) = (0, 0);

MCE::Loop::init {
    use_slurpio => 1,
    max_workers => 8,
    init_relay  => 0,
    gather      => sub {
        my ($counter, $hash_ref) = @_;
        $global_counter2 += $counter;
        while (my ($k, $v) = each %{$hash_ref}) {
            $global_hash2{$k} = $v;
        }
    },
};

print "# printing counter1\n";
{
    open (my $fh, '<', $data_file);
    while (<$fh>) {
        my ($k, $v) = split;
        $global_hash1{$k} = $v;
        print ++$global_counter1, $/;
    }
}
print "# done printing counter1\n";

print "# printing counter2\n";
mce_loop_f {
    my ($mce, $chunk_file, $chunk_id) = @_;
    my ($wid, $counter, %hash) = (MCE->wid, 0);
    my $output = "# worker $wid\n";

    # Count this chunk's lines, receive the running total of lines that
    # preceded this chunk, then relay the updated total to the next worker.
    my $numlines   = ${ $chunk_file } =~ tr/\n//;
    my $relaycount = MCE->relay_recv;
    MCE::relay { $_ += $numlines };

    open my $fh, '<', $chunk_file or die "$!";   # in-memory handle (slurpio)
    while (<$fh>) {
        my ($k, $v) = split;
        $hash{$k} = $v;
        $output .= (++$counter + $relaycount) . $/;
    }
    close $fh;
    MCE->gather($counter, \%hash);
    MCE->print($output);
} $data_file;
MCE::Loop->finish();
print "# done printing counter2\n";

print "counter1 final: ", $global_counter1, $/;
print "counter2 final: ", $global_counter2, $/;
print "  relay final: ", MCE->relay_final, $/;
print "hash1: ", Dumper(\%global_hash1);
print "hash2: ", Dumper(\%global_hash2);
Output
Re: MCE: How to access variables globally
by kcott (Archbishop) on Dec 19, 2021 at 19:56 UTC
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use MCE::Loop;
use Data::Dumper;

my $data_file = 'DATA_F.dat';
my (%hash, %hash2);

{
    open (my $fh, '<', $data_file);
    while (<$fh>) {
        my ($k, $v) = split;
        $hash{$k} = $v;
    }
}

MCE::Loop::init {
    use_slurpio => 1,
    max_workers => 16,
    init_relay  => 0,
};

# Each worker gathers key/value pairs from its slurped chunk; the
# gathered list is returned by mce_loop_f and assigned to %hash2.
%hash2 = mce_loop_f {
    MCE->gather(split ' ', $$_);
} $data_file;

print Dumper \%hash;
print Dumper \%hash2;
See MCE and MCE::Loop for an explanation of what I've done there.
The rest of the Perl code is very straightforward but, of course, do ask if there's anything you don't understand.
With this input (which I think should be the same as your original "DATA_F" input):
$ cat DATA_F.dat
1 one
2 two
3 three
I get this output:
$VAR1 = {
'2' => 'two',
'3' => 'three',
'1' => 'one'
};
$VAR1 = {
'3' => 'three',
'1' => 'one',
'2' => 'two'
};
I ran another test with much larger input.
Due to the amount of data, it's in the spoiler.
%hash2 = mce_loop_f {
    MCE->gather(split ' ', $$_);
} $data_file;
Because (and I could be totally wrong in assuming so) gather will only return %hash2 in this instance, while I am also interested in returning $counter2. The docs show that gather can be called multiple times, but doing so would complicate teasing apart the returned output from mce_loop_f{}. It would be great if gather could behave a bit like a sub, so WYSIWYG. (A working sketch using the gather callback follows the hypothetical code below.)
# hypothetical code:
# alas, if gather could gather two or more data types
(%hash2, $counter2) = mce_loop_f {
    my $internal_counter2 = 0;
    $internal_counter2++;
    MCE->gather(split(' ', $$_), $internal_counter2);  # return both data types
} $DATA_F;
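For what it's worth, the gather callback in marioroy's reply above already delivers this: gather can pass several values, and the manager-side callback receives them all together. A minimal sketch along those lines (the file name and worker count are placeholders):

use strict;
use warnings;
use MCE::Loop;

my %hash2;
my $counter2 = 0;

MCE::Loop::init {
    use_slurpio => 1,
    max_workers => 4,
    gather      => sub {
        # Runs in the manager process, so plain variables are in scope here.
        my ($pairs_ref, $count) = @_;
        @hash2{ keys %$pairs_ref } = values %$pairs_ref;
        $counter2 += $count;
    },
};

mce_loop_f {
    my ($mce, $chunk_ref, $chunk_id) = @_;
    my %pairs;
    my $count = 0;
    open my $cfh, '<', $chunk_ref or die $!;   # in-memory handle over the chunk
    while (<$cfh>) {
        my ($k, $v) = split;
        $pairs{$k} = $v;
        $count++;
    }
    MCE->gather(\%pairs, $count);              # both values reach the callback
} 'DATA_F.dat';
MCE::Loop->finish;

print $counter2, "=counter2 final\n";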
Something or the other, a monk since 2009
$ perl -E 'my %x = (a=>1, b=>2, c=>3); say "Count: ", 0+keys(%x)'
Count: 3
If the situation is more complex than that (non-unique keys, lines skipped for some reason, and so on), you'll need to provide more information, or I'm only guessing (and I don't really want to waste time doing that).
You should show some sample input: keep it short but still realistic with example exception cases.
Then show the expected output from that input.
If duplicate keys are encountered, should they be skipped, or should their value overwrite the previous one? Other reasons that lines might be skipped: they're blank, they're comments, they don't match /^\S+\s+\S+$/, or something else. (Both duplicate policies appear in the sketch at the end of this reply.)
What else is special that I should know about?
When I saw your OP code, I thought the first (non-MCE) loop, and the two counters, were just for testing.
Clearly, that was a poor guess; please help me out here.
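For illustration only, a hypothetical filter covering the cases above; the skip rules and the first-wins duplicate policy are assumptions, not your spec:

# Hypothetical line filtering; adjust the rules to the real data.
while (<$fh>) {
    next if /^\s*$/;                 # skip blank lines
    next if /^\s*#/;                 # skip comment lines
    next unless /^\S+\s+\S+$/;       # require exactly "key value"
    my ($k, $v) = split;
    $hash{$k} //= $v;                # first value wins; use = to overwrite instead
}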
Re: MCE: How to access variables globally
by 1nickt (Canon) on Dec 19, 2021 at 11:29 UTC
See the doc for MCE::Shared.
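To make that concrete, a minimal sketch assuming MCE::Shared's hash and scalar constructors (file name and worker count are placeholders):

use strict;
use warnings;
use MCE::Loop;
use MCE::Shared;

# Shared containers live with the shared-memory manager; every worker
# and the main process see the same data.
my $hash2    = MCE::Shared->hash();
my $counter2 = MCE::Shared->scalar(0);

MCE::Loop::init { use_slurpio => 1, max_workers => 4 };

mce_loop_f {
    my ($mce, $chunk_ref, $chunk_id) = @_;
    open my $cfh, '<', $chunk_ref or die $!;   # in-memory handle over the chunk
    while (<$cfh>) {
        my ($k, $v) = split;
        $hash2->set($k, $v);
        $counter2->incr;
    }
} 'DATA_F.dat';
MCE::Loop->finish;

print $counter2->get, "=counter2 final\n";
my %plain = %{ $hash2->export };   # plain Perl copy for Dumper() and friends

Note that each set/incr is a round trip to the shared manager, so for a very large file, batching per-chunk results through gather (as in marioroy's reply above) should be considerably faster.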
Hope this helps!
The way forward always starts with a minimal test.
Re: MCE: How to access variables globally
by etj (Priest) on Dec 19, 2021 at 15:26 UTC
I can't tell from your question whether these hashes really hold numerical data (which would make this more like a pandas-style "data frame"; see Data::Frame for a Perl equivalent). If it really is just numerical stuff, you could try PDL, which (after you convert your currently-shaped data to a single memory-mapped data block) can memory-map a file of any size with PDL::IO::FastRaw, and which, as of 2.063, will automatically pthread (using all available CPU cores) operations over higher dimensions.
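A minimal sketch of the memory-mapping part, assuming the data has already been converted to a raw file of doubles (the file name and dimensions are made up):

use strict;
use warnings;
use PDL;
use PDL::IO::FastRaw;

# Map 100 million doubles straight from disk; the header-less form of
# mapfraw needs the dimensions and element type spelled out.
my $pdl = mapfraw('data.raw', { Dims => [100_000_000], Datatype => double });

print $pdl->sum, "\n";   # operations page data in from disk as needed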
Re: MCE: How to access variables globally
by Marshall (Canon) on Dec 19, 2021 at 10:22 UTC
So I have this Perl program reading a very large file, building hashes and keeping line number counters.
Sounds to me like you should put the data into a DB and use the DB to make these histograms. The data set is so large that in-memory Perl hashes are not the best solution. A good DB will spawn multiple threads to handle big requests, and you will be able to control the amount of memory the DB has available to it.
Having multiple workers read separate parts of a single disk file typically doesn't help, because the bottleneck is the maximum read bandwidth of that single disk. And having multiple workers doesn't magically make new physical memory appear.
Update: If you are doing significant bio work, then a machine with 16GB RAM is WAY, WAY underpowered. Seriously think about buying hardware with more RAM, a lot more RAM, and SSDs instead of the spinning kind.
TBH, I could not wrap my head around how to dynamically write to DBs like Storable and DBM::Deep (I am assuming this is what you mean by DB?). The hash structure in the original program is HoAs that grow:
$ProbState{$State}[$IndexLine] = $linelist[$i+1];
Yup, 16GB is underpowered for a 3-year-old laptop, but I am stuck with it. Would you please kindly direct me to an example or working code that demonstrates how I can use a DB?
UPDATE: I had some luck with DBM::Deep, where I can store and retrieve from the database; however, the write step seems to be very slow, although it keeps the memory footprint acceptable.
Something or the other, a monk since 2009
I was thinking of an SQL DB. My recommendation would be to start with the simplest thing, an SQLite DB. It doesn't have fancy multi-threading to process SQL commands, but it does use standard SQL, and the code you write for it can be reused with more capable DBs like Postgres, etc. If SQL (pronounced "See-quel") is a foreign word to you, all the more reason to start with something simple. The installation, configuration, and management of a DB server can get complicated; SQLite doesn't require any of that. In addition, it is possible to dynamically vary the memory footprint available to SQLite. I've used that feature before and it works. It will use the amount of memory that it has; it might get a lot slower with less memory, but it won't "blow up".
I have no idea what your raw data looks like or what end report you are trying to generate. Showing failed Perl code is not enough. You will have to back up and explain the initial problem (show some abbreviated example data) and an example of an end result. Then perhaps I or other monks can suggest an appropriate table structure.
Don't underestimate SQLite. One project that I'm working on now has 3 directories with 3,000+ files in each. Creating the table for each directory takes about 8 seconds for 1M rows; processing each directory's worth of data takes under 2 seconds. I am telling you, a 3M-row DB is nothing. How many million lines do you have, and how much data is on each line? It could very well be that instead of a single complete "de-normalized" table, you wind up with multiple tables that are related somehow. For one project, I wound up using the URL that the data came from as the link between 2 tables. It didn't have to be that way, but it was sufficient for that project. DBs often use integer values as "keys" that link tables, but it doesn't have to be that way.
I don't know enough to advise further.
UPDATE: re: "UPDATE: I had some luck with DBM::Deep where I can store and retrieve from the database however the write step seems to be very slow although it keeps the memory footprint acceptable."
The "write step" to a DB is fast. The "commit step" is slow and DB journaling does take time. I have never used DBM::Deep. With SQLite as with other SQL db's, you want to: $db->begin_work;, then do millions of inserts, then do $dbh->commit; A single transaction can have millions of rows. I do recommend switching to SQLite.
SQLite is written in C; DBM::Deep is written in pure Perl. SQLite is by far the most widely used DB in the world: it's in your cell phone, it's in your browser, it's everywhere. There are many performance tweaks for SQLite, often at the expense of compromising ACID properties, and sometimes that is completely appropriate. In my example referenced above, I could speed things up by stopping journaling, starting async writes, etc. In my case, 30 seconds is "fast enough" and I don't worry about DB creation time; 3 vs 30 seconds is the same to me. The "big lifting" is making a million inserts one transaction. I could make my app faster if I really wanted or needed to (which I don't).
Recommendation: Switch to SQLite. Faster, more options, gigantic user base, very stable. It already comes with the DBI module, so it is already on your machine.
Another update, re: memory usage. DBM::Deep is a Perl program; SQLite is a C program, and it can and does play by different memory-management rules. I am not sure if this limit still exists, but at one time SQLite was capped at 2GB of RAM, and its default is way, way less than that. In one application that I targeted at WinXP, I ran SQLite's memory up to the extravagant level of 500MB for one "expensive" indexing operation and then back down again afterwards. A Perl program cannot give memory back to the O/S, but SQLite can. None of this matters for DB creation, but if you needed, say, to make a histogram of a column with 50M rows, more memory probably would help. I would have to look back at previous code to find the commands for adjusting memory, but they do exist. My XP application was targeted at a machine with at most 1-2GB of RAM. For your app, I would run the memory footprint up to 2GB and not worry about it after that.
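If memory serves, the knob is the page-cache PRAGMA; a sketch continuing from the $dbh handle above (SQLite reads a negative cache_size as approximately that many KiB):

# Raise the cache for one expensive operation, then drop it back down.
$dbh->do('PRAGMA cache_size = -2000000');   # roughly 2 GB for the heavy step
# ... build the index or run the histogram query here ...
$dbh->do('PRAGMA cache_size = -2000');      # back to roughly 2 MB afterwards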
Re: MCE: How to access variables globally
by beautyfulman (Sexton) on Dec 19, 2021 at 15:47 UTC
Re: MCE: How to access variables globally
by Anonymous Monk on Dec 19, 2021 at 11:28 UTC