beherasan has asked for the wisdom of the Perl Monks concerning the following question:
I am trying to match a set of small sequences (~7,000 of them) against a much larger set of sequences stored in a hash (~100 GB). As this is a huge task, I thought of using multi-threading to search 40 sequences at a time. The script is able to create the threads, but RAM usage becomes a problem once 40 threads are running. It seems each thread takes its own copy of the hash (each ~100 GB), and because of this the program gets killed partway through. I am using a system with 512 GB of RAM and 88 hardware threads.
Is there a memory-efficient way of doing this?
Thank you,
Santosh
#!/usr/bin/perl
use strict;
use threads;
my $num_of_threads = 40;
my @peptides=(); ## Store the peptides to search in array
open(IN,"peptides.txt") or die "Could not open the file:$!\n";
while(<IN>)
{
chomp;
$_=~s/\r//g;
push(@peptides,$_);
}
close IN;
my %hashNR=(); ## Store the Sequence in this hash
my %hashRes=(); ## Store the matched results
my $nrid="";
open(REF,"NR.fasta") or die "Could not open the file:$!\n";
while(<REF>)
{
chomp;
$_=~s/\r//g;
if(/^>/)
{
$nrid=(split /\s/)[0];
}
else
{
$hashNR{$nrid}.="$_";
}
}
close REF;
my @allIDS=(keys %hashNR);
my $L = scalar(@allIDS);
print "Reference Reading Completed\n";
my $j= 0;
while($j < scalar(@peptides))
{
my @threads = initThreads();
foreach(@threads)
{
last if $j >= scalar(@peptides); ## guard: peptide count may not divide evenly by 40
my $pep = $peptides[$j];
$_ = threads -> create(\&doOperation,$pep,$L);
$j++;
}
foreach(@threads)
{
$_ -> join() if ref $_; ## skip slots that never received a thread
}
}
open(OUT,">Outfile.txt") or die "Could not create the file:$!\n";
foreach my $k (keys %hashRes)
{
print OUT "$k\t$hashRes{$k}\n";
}
close OUT;
###############################################
## Subroutine for initializing the Thread array
sub initThreads
{
my @initThreads;
for(my $i=1;$i<=$num_of_threads;$i++)
{
push(@initThreads,$i);
}
return @initThreads;
}
## Task run by each threads
sub doOperation
{
my @allp = @_;
my $id = threads -> tid();
for (my $i=0; $i<$allp[1]; $i++)
{
if($hashNR{$allIDS[$i]}=~/$allp[0]/)
{
$hashRes{$allp[0]}.=",$allIDS[$i]";
}
}
threads -> exit();
}
Re: Problem in RAM usage while threading the program
by marioroy (Prior) on Dec 20, 2019 at 08:41 UTC
Hi beherasan,
I'm hoping you're able to run on a Unix platform. If so, there's a way to run without consuming lots of memory. Imagine that :) There is a fast embedded database named Kyoto Cabinet. See more info here and here, on PerlMonks. The latter provides installation instructions.
This post is for folks who don't have access to the gigantic Fasta file or would rather generate something on the fly. I shuffle the characters in order to have unique records. The OP will not need these scripts and can skip to Part Three over here.
Part One: create a fake Fasta file for simulation (144 MiB)
# https://www.perlmonks.org/?node_id=11110379
# usage: perl create_fasta.pl > NR.fasta
use strict;
use warnings;
use List::Util 'shuffle';
# https://www.ncbi.nlm.nih.gov/nuccore/NR_165028.1?report=fasta
my $nr_partial_seq = <<'EOSEQ';
TGGCTCAGAACGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGAGAAGGATCCTTCGGGATCTG
GAGAGCGGCGGACGGGTGAGTAACGCGTGGGAACGTGCCCTTAGGTGGGGGACAACAGTTGGAAACGACT
GCTAATACCGCATAAGCCAATTTGGGAAAGCCTTCGGGCGCCTTGGGATCGGCCCGCGTTAGATTAGGTA
GTTGGTGGGGTAAAGGCCTACCAAGCCTACGATCTATAGCTGGTTTGAGAGAATGATCAGCCACACTGGG
ACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCGCAATGGGGGAAACCCTGAC
GCAGCCATGCCGCGTGGGTGAAGAAGGCCTTCGGGTTGTAAAGCCCTTTCAACGGTGAAGATGATGACTG
TAGCCGTAGAAGAAGCCCCGGCTAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGGGGCTAGCGTTGTT
CGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGACTGACCAGTCAGGGGTGAAATCCCGAGGCTCAACCT
CGGAACTGCCTTTGATACTGTCAGTCTAGAGTCTGTGAGAGGATGACGGAATACCCAGTGTAGAGGTGAA
ATTCGTAGATATTGGGTAGAACACCGGTGGCGAAGGCGGTCATCTGGCGCAGCACTGACGCTGAGGCGCG
AAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTGTGCTAGCCGTCG
GGAGTTAGGCTCTCGGTGGCGCCGCTAACGCATTAAGCACACCGCCTGGGGAGTACGGTCGCAAGATTAA
AACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGCAGA
ACCTTACCAACCCTTGACATGTGAAGTTTGGTTAGTGGAGACACTTTCCTTCAGTTCGGCTGGCTTCAAC
ACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACC
CTCACCCTTAGTTGCCATCATTTAGTTGGGCACTCTAGGGGAACTGCCGGTGACAAGCCGGAGGAAGGCG
GGGATGACGTCAAGTCCTCATGGCCCTTATGGGTTGGGCTACACACGTGCTACAATGGCGACTACAGAGG
GCAGCGAGGCGGCAACGCCAAGCAAATCCCAAAAAGTCGTCTCAGTTCGGATTGTTCTCTGCAACTCGAG
AGCATGAAGGCGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTAC
ACACCGCCCGTCACACCATGGGAGTTGGTTTTACCCGAAGACGGTGGGCTAACCGTAAGGAAGCAGCCGG
CCACGGTAAGATCAGCGACTGGGGTGAAGTC
EOSEQ
$nr_partial_seq =~ s/\s//g;
srand 42;
for ( 1..1e5 ) {
my $nr = sprintf('%06d', $_);
my $seq = join '', shuffle(split //, $nr_partial_seq), "\n";
print ">NR_${nr} Fake 16S ribosomal RNA, partial sequence\n";
while ( $seq ) {
my $s = substr($seq, 0, 70, '');
print "$s\n";
}
}
Part Two: create a fake peptides file
# https://www.perlmonks.org/?node_id=11110379
# usage: perl create_peptides.pl > peptides.txt
use strict;
use warnings;
use List::Util 'shuffle';
# https://www.ncbi.nlm.nih.gov/nuccore/NR_165028.1?report=fasta
my $nr_partial_seq = <<'EOSEQ';
TGGCTCAGAACGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGAGAAGGATCCTTCGGGATCTG
GAGAGCGGCGGACGGGTGAGTAACGCGTGGGAACGTGCCCTTAGGTGGGGGACAACAGTTGGAAACGACT
GCTAATACCGCATAAGCCAATTTGGGAAAGCCTTCGGGCGCCTTGGGATCGGCCCGCGTTAGATTAGGTA
GTTGGTGGGGTAAAGGCCTACCAAGCCTACGATCTATAGCTGGTTTGAGAGAATGATCAGCCACACTGGG
ACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCGCAATGGGGGAAACCCTGAC
GCAGCCATGCCGCGTGGGTGAAGAAGGCCTTCGGGTTGTAAAGCCCTTTCAACGGTGAAGATGATGACTG
TAGCCGTAGAAGAAGCCCCGGCTAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGGGGCTAGCGTTGTT
CGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGACTGACCAGTCAGGGGTGAAATCCCGAGGCTCAACCT
CGGAACTGCCTTTGATACTGTCAGTCTAGAGTCTGTGAGAGGATGACGGAATACCCAGTGTAGAGGTGAA
ATTCGTAGATATTGGGTAGAACACCGGTGGCGAAGGCGGTCATCTGGCGCAGCACTGACGCTGAGGCGCG
AAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTGTGCTAGCCGTCG
GGAGTTAGGCTCTCGGTGGCGCCGCTAACGCATTAAGCACACCGCCTGGGGAGTACGGTCGCAAGATTAA
AACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGCAGA
ACCTTACCAACCCTTGACATGTGAAGTTTGGTTAGTGGAGACACTTTCCTTCAGTTCGGCTGGCTTCAAC
ACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACC
CTCACCCTTAGTTGCCATCATTTAGTTGGGCACTCTAGGGGAACTGCCGGTGACAAGCCGGAGGAAGGCG
GGGATGACGTCAAGTCCTCATGGCCCTTATGGGTTGGGCTACACACGTGCTACAATGGCGACTACAGAGG
GCAGCGAGGCGGCAACGCCAAGCAAATCCCAAAAAGTCGTCTCAGTTCGGATTGTTCTCTGCAACTCGAG
AGCATGAAGGCGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTAC
ACACCGCCCGTCACACCATGGGAGTTGGTTTTACCCGAAGACGGTGGGCTAACCGTAAGGAAGCAGCCGG
CCACGGTAAGATCAGCGACTGGGGTGAAGTC
EOSEQ
$nr_partial_seq =~ s/\s//g;
srand 42;
my $len = length($nr_partial_seq) - 20;
for ( 1..100 ) {
my $seq = join '', shuffle(split //, $nr_partial_seq), "\n";
print substr($seq, rand($len), 15), "\n";
}
That covers the two input files. The usage is at the top of each script. The next post contains the create_db and search_db scripts.
Regards, Mario
This is a continuation of Re: Problem in RAM usage while threading the program.
Update 1: Added parallel chunking demonstration.
Update 2: Inline::C demonstrations can be found here.
Part Three: create the Kyoto Cabinet DB
This is where the fun begins. The # after the file name is how options are passed; see the perldoc for more info. Notice the split line: there is no reason to split the entire sequence header line.
# https://www.perlmonks.org/?node_id=11110379
# usage: perl create_db.pl
use strict;
use warnings;
use KyotoCabinet;
unlink 'db.kch';
# construct the database object
my $db = KyotoCabinet::DB->new();
# open the database
if (! $db->open('db.kch#msiz=128m', $db->OWRITER | $db->OCREATE)) {
die "open error (db): ", $db->error;
}
open my $fh, '<', 'NR.fasta' or die "open error: $!\n";
my $nrid = '';
my $seq = '';
while ( <$fh> ) {
chomp; $_ =~ s/\r$//g;
if ( /^>/ ) {
$db->set($nrid, $seq) if $seq;
$nrid = ( split /\s/, $_, 2 )[0];
$seq = '';
}
else {
$seq .= $_;
}
}
$db->set($nrid, $seq) if $seq;
close $fh;
$db->close;
Part Four: search DB, serial code
One might like to measure the difference between running serially and in parallel. This makes use of a cursor in Kyoto Cabinet. The substr strips the leading comma when displaying the value to STDOUT.
# https://www.perlmonks.org/?node_id=11110379
# usage: perl search_db.pl > Outfile.txt
use strict;
use warnings;
use KyotoCabinet;
# construct the database object
my $db = KyotoCabinet::DB->new();
# open the database
if (! $db->open('db.kch#msiz=128m', $db->OREADER | $db->ONOLOCK)) {
die "open error (db): ", $db->error;
}
my $cur = $db->cursor;
open my $fh, '<', 'peptides.txt' or die "open error: $!\n";
while ( my $pep = <$fh> ) {
my $ids = ''; chomp $pep;
$cur->jump; # first record
while ( my ($key, $val) = $cur->get(1) ) {
$ids .= ",$key" if index($val, $pep) >= 0;
}
print "$pep\t", substr($ids, 1), "\n" if $ids;
}
close $fh;
$cur->disable;
$db->close;
Part Five: search DB, parallel code
Here it is. This is the parallel code; it's practically the serial code above wrapped into MCE. Each worker opens its own handle to the DB inside the user_begin block, and likewise closes it inside the user_end block. MCE::Relay ensures that workers write serially to STDOUT, so there is no reason for workers to send output to the manager process or to store results in a shared hash (i.e. MCE::Shared).
# https://www.perlmonks.org/?node_id=11110379
# usage: perl search_db_mce.pl > Outfile.txt
use strict;
use warnings;
use KyotoCabinet;
use MCE;
# construct the database object
my $db = KyotoCabinet::DB->new();
my $cur;
my $mce = MCE->new(
max_workers => MCE::Util::get_ncpu(),
chunk_size => 1,
init_relay => 1,
user_begin => sub {
# open the database
if (! $db->open('db.kch#msiz=128m', $db->OREADER | $db->ONOLOCK)) {
die "open error (db): ", $db->error;
}
$cur = $db->cursor;
},
user_end => sub {
# close the database
$cur->disable;
$db->close;
},
user_func => sub {
my $pep = $_; chomp $pep;
my $ids = '';
$cur->jump; # first record
while ( my ($key, $val) = $cur->get(1) ) {
$ids .= ",$key" if index($val, $pep) >= 0;
}
# output serially, one worker at a time
MCE::relay {
print "$pep\t", substr($ids, 1), "\n" if $ids;
};
}
);
$mce->process('peptides.txt');
$mce->shutdown;
Part Six: search DB, parallel code with chunking
Chunking reduces the time by another 25% ~ 30% for this use case. This is beneficial on machines with many cores.
# https://www.perlmonks.org/?node_id=11110379
# usage: perl search_db_chunk.pl > Outfile.txt
use strict;
use warnings;
use KyotoCabinet;
use MCE;
# construct the database object
my $db = KyotoCabinet::DB->new();
my $cur;
my $mce = MCE->new(
max_workers => MCE::Util::get_ncpu(),
chunk_size => 7,
init_relay => 1,
user_begin => sub {
# open the database
if (! $db->open('db.kch#msiz=128m', $db->OREADER | $db->ONOLOCK)) {
die "open error (db): ", $db->error;
}
$cur = $db->cursor;
},
user_end => sub {
# close the database
$cur->disable;
$db->close;
},
user_func => sub {
my ( $mce, $chunk_ref, $chunk_id ) = @_;
my %ret; chomp @$chunk_ref;
$cur->jump; # first record
while ( my ($key, $val) = $cur->get(1) ) {
for ( @$chunk_ref ) {
$ret{$_} .= ",$key" if index($val, $_) >= 0;
}
}
# output serially, one worker at a time
MCE::relay {
for ( @$chunk_ref ) {
print "$_\t", substr($ret{$_}, 1), "\n" if $ret{$_};
}
};
}
);
$mce->process('peptides.txt');
$mce->shutdown;
Outfile.txt
GAAGGACTGGGACCA >NR_000001,>NR_006611
AGGCTGCGGCAGGAC >NR_062102
GTGAGCCGGGCAGAG >NR_089584
AGGGGGGGTTGCTGA >NR_036454,>NR_068535,>NR_097889
CTGACATGCGGCGCA >NR_087289
GTGCATCGATGGCCG >NR_005535
GGGGTCAAGCGAACC >NR_076289,>NR_087856
TGAGACGGCGAACCT >NR_064242
AGCGACAAAGGAAAC >NR_045865
AGGTGCAACCATGGA >NR_046602,>NR_056869
GAGTAAACCGCGCGA >NR_093455
AACGACTGAACAGCG >NR_070693
ACGCGTAATTCGATA >NR_080086
GATGAGCGGAGCACT >NR_070118
CGTAGCGAAACCGAG >NR_092384,>NR_098291
GGGGGGGAGGTCCGA >NR_021671,>NR_036907,>NR_080961
AGGGAGGGGGGTTGT >NR_026207
ATGGGGCAGACGCGA >NR_072314
Regards, Mario
This is a continuation of Re: Problem in RAM usage while threading the program.
Update: Added Inline::CPP code.
Part Seven: Inline::C and Inline::CPP demonstrations
There was one more thing to try: Inline::C and Inline::CPP on Linux, to see what improvements they would bring.
Inline::C
# https://fallabs.com/kyotocabinet/api/
use Inline C => config => inc => '-I/usr/local/include';
use Inline C => config => libs => '-L/usr/local/lib -lkyotocabinet';
use Inline C => <<'EOC';
#include <string.h>
#include <kclangc.h>
KCDB *db;
KCCUR *cur;
// open the database
int open_db(char* file) {
db = kcdbnew();
if (!kcdbopen(db, file, KCOREADER | KCONOLOCK)) return 0;
cur = kcdbcursor(db);
return 1;
}
// close the database
int close_db() {
if (cur) kccurdel(cur);
if (!kcdbclose(db)) return 0;
return 1;
}
// search records containing substring
SV * search_db(char* substr) {
AV *ret = newAV();
char *kbuf, *vbuf;
size_t ksiz, vsiz;
const char *cvbuf;
kccurjump(cur);
while ((kbuf = kccurget(cur, &ksiz, &cvbuf, &vsiz, 1)) != NULL) {
if (strstr(cvbuf, substr) != NULL) {
av_push(ret, newSVpvn(kbuf, ksiz));
}
kcfree(kbuf);
}
return newRV_noinc((SV *) ret);
}
EOC
Inline::CPP
# https://fallabs.com/kyotocabinet/api/
use Inline CPP => config => inc => '-I/usr/local/include';
use Inline CPP => config => libs => '-L/usr/local/lib -lkyotocabinet';
use Inline CPP => <<'EOCPP';
#undef do_open
#undef do_close
#include <string.h>
#include <kcpolydb.h>
using namespace std;
using namespace kyotocabinet;
PolyDB db;
DB::Cursor *cur;
// open the database
int open_db(char* file) {
if (!db.open(file, PolyDB::OREADER | PolyDB::ONOLOCK)) return 0;
cur = db.cursor();
return 1;
}
// close the database
int close_db() {
if (cur) delete cur;
if (!db.close()) return 0;
return 1;
}
// search records containing substring
SV * search_db(char* substr) {
AV *ret = newAV();
string ckey, cvalue;
cur->jump();
while (cur->get(&ckey, &cvalue, true)) {
if (strstr(cvalue.c_str(), substr) != NULL) {
av_push(ret, newSVpvn(ckey.c_str(), ckey.length()));
}
}
return newRV_noinc((SV *) ret);
}
EOCPP
Serial
# https://www.perlmonks.org/?node_id=11110379
# usage: perl search_db_inline_c.pl > Outfile.txt
use strict;
use warnings;
# insert the Inline::C or Inline::CPP code here
open_db('db.kch#msiz=128m') or die "db.kch: open error\n";
open my $fh, '<', 'peptides.txt' or die "open error: $!\n";
while ( my $pep = <$fh> ) {
chomp $pep; my $ids = search_db($pep);
print "$pep\t", join(',', @$ids), "\n" if @$ids;
}
close $fh;
close_db();
Parallel
# https://www.perlmonks.org/?node_id=11110379
# usage: perl search_db_inline_c_mce.pl > Outfile.txt
use strict;
use warnings;
# insert the Inline::C or Inline::CPP code here
use MCE;
my $mce = MCE->new(
max_workers => MCE::Util::get_ncpu(),
chunk_size => 1,
init_relay => 1,
user_begin => sub {
open_db('db.kch#msiz=128m') or die "db.kch: open error\n";
},
user_end => sub {
close_db();
},
user_func => sub {
my $pep = $_; chomp $pep;
my $ids = search_db($pep);
# output serially, one worker at a time
MCE::relay {
print "$pep\t", join(',', @$ids), "\n" if @$ids;
};
}
);
$mce->process('peptides.txt');
$mce->shutdown;
Outfile.txt
GAAGGACTGGGACCA >NR_000001,>NR_006611
AGGCTGCGGCAGGAC >NR_062102
GTGAGCCGGGCAGAG >NR_089584
AGGGGGGGTTGCTGA >NR_036454,>NR_068535,>NR_097889
CTGACATGCGGCGCA >NR_087289
GTGCATCGATGGCCG >NR_005535
GGGGTCAAGCGAACC >NR_076289,>NR_087856
TGAGACGGCGAACCT >NR_064242
AGCGACAAAGGAAAC >NR_045865
AGGTGCAACCATGGA >NR_046602,>NR_056869
GAGTAAACCGCGCGA >NR_093455
AACGACTGAACAGCG >NR_070693
ACGCGTAATTCGATA >NR_080086
GATGAGCGGAGCACT >NR_070118
CGTAGCGAAACCGAG >NR_092384,>NR_098291
GGGGGGGAGGTCCGA >NR_021671,>NR_036907,>NR_080961
AGGGAGGGGGGTTGT >NR_026207
ATGGGGCAGACGCGA >NR_072314
Benchmark - 8-core VM, CentOS 7.7
62.883 seconds - op's code
33.594 seconds - search_db.pl
6.367 seconds - search_db_mce.pl
4.991 seconds - search_db_chunk.pl
8.093 seconds - search_db_inline_c.pl
1.530 seconds - search_db_inline_c_mce.pl
Benchmark - 8-core VM, Xubuntu 18.04.3
36.684 seconds - search_db.pl
7.403 seconds - search_db_mce.pl
5.986 seconds - search_db_chunk.pl
11.009 seconds - search_db_inline_c.pl
2.188 seconds - search_db_inline_c_mce.pl
Benchmark - 8-core, macOS Mojave 10.14.6
28.857 seconds - search_db.pl
6.130 seconds - search_db_mce.pl
5.219 seconds - search_db_chunk.pl
27.322 seconds - search_db_inline_c.pl
5.808 seconds - search_db_inline_c_mce.pl
This completes the exercise. It boggles my mind comparing CentOS vs. Xubuntu; ditto for Inline::C running faster on Linux than on macOS. Testing was done with Perl 5.30.1. The virtualization is handled by VMware Fusion.
Regards, Mario
Thanks for the laugh, karlgoethebier :) From time to time, I search for MCE and threads on PerlMonks. This looked interesting, and I thought it a great use case for Kyoto Cabinet. See also Tokyo Dystopia.
It is mind-boggling to me to see many processes running near 100% CPU utilization even though IO is involved. The OP stated the machine has 512 GB of memory. The beauty of it all is that the 100 GiB NR.fasta file is read into the file system cache by the operating system (on Linux the entire file is cached, memory permitting), so subsequent scans read from cache. Workers do not make an extra copy because they read through a memory-mapped I/O handle, meaning one can run with many workers. The limiting factor is how fast the memory subsystem is.
Well, I enjoyed writing the demonstrations including one with chunking.
Lots of love, Mario
Does the Kyoto DB run in memory? Your benchmark doesn't compare against an in-memory hash ... disk is slow.
Update: Do not run the in-memory demo below; unfortunately, each worker makes a copy. Another possibility, on Linux platforms, is taking the prior demonstration and opening the Kyoto DB file in /dev/shm. Follow-up: From testing, storing in /dev/shm is not necessary. The Linux OS typically caches the entire file into memory, if memory is available, and subsequent scans pull from the file system cache. For extra performance, see the parallel chunking demonstration on the same page.
# open the database
if (! $db->open('/dev/shm/db.kch#msiz=128m', $db->OREADER)) {
die "open error (db): ", $db->error;
}
Yes, of course ++ when the data fits in memory. The following constructs an in-memory DB. Combine the script from Part Three with Part Four or Part Five.
# https://www.perlmonks.org/?node_id=11110379
# usage: perl search_in_memory.pl > Outfile.txt
use strict;
use warnings;
use KyotoCabinet;
use MCE;
###########################################################
# Part Three
###########################################################
# construct the database object
my $db = KyotoCabinet::DB->new();
# open the database - in memory (i.e. *)
if (! $db->open('*#msiz=128m', $db->OREADER | $db->OWRITER | $db->OCREATE)) {
die "open error (db): ", $db->error;
}
open my $fh, '<', 'NR.fasta' or die "open error: $!\n";
my $nrid = '';
my $seq = '';
while ( <$fh> ) {
chomp; $_ =~ s/\r$//g;
if ( /^>/ ) {
$db->set($nrid, $seq) if $seq;
$nrid = ( split /\s/, $_, 2 )[0];
$seq = '';
}
else {
$seq .= $_;
}
}
$db->set($nrid, $seq) if $seq;
close $fh;
###########################################################
# Part Five
###########################################################
my $cur;
my $mce = MCE->new(
max_workers => MCE::Util::get_ncpu(),
chunk_size => 1,
init_relay => 1,
user_begin => sub {
$cur = $db->cursor;
},
user_end => sub {
$cur->disable;
},
user_func => sub {
my $pep = $_; chomp $pep;
my $ids = '';
$cur->jump; # first record
while ( my ($key, $val) = $cur->get(1) ) {
$ids .= ",$key" if index($val, $pep) >= 0;
}
# output serially, one worker at a time
MCE::relay {
print "$pep\t", substr($ids, 1), "\n" if $ids;
};
}
);
$mce->process('peptides.txt');
$mce->shutdown;
$db->close;
Regards, Mario
Re: Problem in RAM usage while threading the program
by bliako (Monsignor) on Dec 19, 2019 at 13:06 UTC
Regarding the hash copied in each thread: this is the normal behaviour with perl threads (Edit: as Anonymous Monk already said). But you can declare your hash as my %hashNR :shared; and hopefully this will be enough. However, your thread strategy can become much leaner; for example, employ a pool of threads so that you are not continually creating threads, as Anonymous Monk said. The answer by BrowserUK regarding a thread pool will help you: How to create thread pool of ithreads
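As a minimal sketch of the shared-hash idea (toy data standing in for the real sequences; requires a threads-enabled perl):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use threads::shared;

# One shared copy of the data: threads read the same storage instead of
# each cloning its own copy of the (in real life, ~100 GB) hash.
my %hashNR :shared = ( '>NR_1' => 'ACGTACGT', '>NR_2' => 'TTTTACGT' );
my %hashRes :shared;

my @thr = map {
    my $pep = $_;
    threads->create( sub {
        for my $id ( sort keys %hashNR ) {
            lock %hashRes;    # serialize updates to the result hash
            $hashRes{$pep} .= ",$id" if index( $hashNR{$id}, $pep ) >= 0;
        }
    } );
} qw(ACGT TTTT);

$_->join for @thr;
print "$_\t", substr( $hashRes{$_}, 1 ), "\n" for sort keys %hashRes;
```

With :shared, the interpreter keeps a single copy of the hash behind the scenes, so spawning more threads does not multiply the memory footprint.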
Re: Problem in RAM usage while threading the program
by karlgoethebier (Abbot) on Dec 19, 2019 at 17:27 UTC
Perhaps take a look at MCE and MCE::Shared? Regards, Karl
«The Crux of the Biscuit is the Apostrophe»
perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'Help
I swear I was gonna reply to you by saying that as we speak marioroy is probably cooking up one of his virtual supercomputers with MCE to accomplish this task exponentially faster and more efficiently than the OP can possibly imagine. I then started thinking about those folks who complain that Perlmonks is not a code writing service, and it occurred to me that a free code writing service could propel Perl as immensely as CPAN itself and that we should do that as much as possible! And then it happened: Re: Problem in RAM usage while threading the program
💬 🐪
Exponentially doesn't work like that.
Re: Problem in RAM usage while threading the program
by Anonymous Monk on Dec 19, 2019 at 08:50 UTC
It seems like each thread is taking a copy of hash (and each hash is ~100 GB
Hi
Sounds about right; see threads::shared, and study Re^5: threading a perl script? (node_id 901037) and Thread::Queue...
continually creating 40 new threads is not as fast as creating only 40 threads :)
In Perl, initThreads can be written as my @initThreads = 1..$num_of_threads; but there is no need to maintain a list of threads at all, since the threads module already does that:
threads->list(threads::joinable)
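A hedged sketch of the pool idea, using Thread::Queue so the threads are created once and pull work from a shared queue (toy work items, not the OP's peptides):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# Create the pool once and feed it a queue of work items, rather than
# spawning a fresh batch of threads for every group of 40 peptides.
my $q = Thread::Queue->new( 1 .. 20 );
$q->end;    # no more items: dequeue returns undef once drained

my @pool = map {
    threads->create( sub {
        my $sum = 0;
        while ( defined( my $item = $q->dequeue ) ) { $sum += $item }
        return $sum;    # each worker returns its partial result
    } );
} 1 .. 4;

my $total = 0;
$total += $_->join for @pool;
print "$total\n";    # 210 = 1 + 2 + ... + 20
```

In the OP's case the queue items would be peptides, and each worker would return (or relay) its matches instead of a sum.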
Re: Problem in RAM usage while threading the program
by GrandFather (Saint) on Dec 20, 2019 at 07:37 UTC
Simplistically, you don't get any advantage for computation-bound tasks by using more threads than there are cores in the processor. For typical high-end processors that will be between 8 and 16 threads.
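For illustration, a small sketch of capping the worker count at the core count; the /proc/cpuinfo probe is Linux-specific and merely an assumption about the OP's platform:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count logical cores on Linux by counting "processor :" lines;
# fall back to 1 when /proc/cpuinfo is unavailable.
my $ncpu = 1;
if ( open my $fh, '<', '/proc/cpuinfo' ) {
    $ncpu = grep { /^processor\s*:/ } <$fh>;
    close $fh;
    $ncpu = 1 if $ncpu < 1;
}

# Cap the requested worker count (the OP's 40) at the core count.
my $requested = 40;
my $workers   = $requested < $ncpu ? $requested : $ncpu;
print "spawning $workers workers on $ncpu logical cores\n";
```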
Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
Re: Problem in RAM usage while threading the program
by NERDVANA (Deacon) on Dec 20, 2019 at 05:32 UTC
Perl ithreads run multiple perl interpreters in the same process, each with its own separate copy of variables. It gives you that "threading feeling", but in some ways it is actually less efficient than just running multiple processes. Perl behaves well with fork(), so if you use a non-threaded perl to allocate the data structure and then fork a bunch of children, Linux will let the processes share any page of RAM that hasn't been written to, and you'll get a lot of memory savings. Also, a perl built without the ithreads feature runs notably faster.
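A minimal sketch of that fork() approach, with a toy hash standing in for the 100 GB structure (Unix-only; the copy-on-write saving holds as long as children only read the hash):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the big structure once in the parent. On Linux, forked children
# share its memory pages copy-on-write, so there is no per-child copy
# while the pages are only read.
my %hashNR   = ( '>NR_1' => 'ACGTACGT', '>NR_2' => 'TTTTACGT' );
my @peptides = qw(ACGT TTTT);

my @pids;
for my $pep (@peptides) {
    my $pid = fork;
    die "fork failed: $!" unless defined $pid;
    if ( $pid == 0 ) {    # child: read-only scan, prints its own results
        my @hits = grep { index( $hashNR{$_}, $pep ) >= 0 } sort keys %hashNR;
        print "$pep\t", join( ',', @hits ), "\n";
        exit 0;
    }
    push @pids, $pid;
}

# parent: reap children and note any failures
my $failed = 0;
for my $pid (@pids) {
    waitpid $pid, 0;
    $failed++ if $?;
}
```

In a real run each child would handle a slice of the peptide list and write to its own output file (or a pipe back to the parent), since children cannot update the parent's variables.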