I was running a simple, almost one-shot, throw-away script, and therefore used basic tools. Because there were thousands of files to process, I decided to parallelize. With a small test suite and "no-op" SSCCE code, the output is sometimes OK:
use strict;
use warnings;
use feature 'say';
use threads;
use Thread::Queue;
use CAM::PDF;
use File::Find;
my $q = Thread::Queue-> new;
my @gang = map async( sub {
    while ( defined( my $f = $q-> dequeue )) {
        say threads-> tid, ' ', $f;
        my $pdf = CAM::PDF-> new( $f ) or die;
    }
}), 1 .. 2;
find( sub {
    -f and /\.pdf$/i and $q-> enqueue( $File::Find::name )
}, './1' );
$q-> end;
$_-> join for @gang;
__END__
1 ./1/1/106/10627.pdf
2 ./1/1/107/10703.pdf
2 ./1/1/186/18673.pdf
1 ./1/1/209/20946.pdf
2 ./1/1/26/2656.pdf
1 ./1/1/33/3384.pdf
2 ./1/1/57/5742.pdf
1 ./1/1/58/5869.pdf
2 ./1/1/63/6395.pdf
1 ./1/1/70/7099.pdf
1 ./1/1/74/7466.pdf
But sometimes not (example 1, one worker dead):
1 ./1/1/106/10627.pdf
2 ./1/1/107/10703.pdf
Thread 2 terminated abnormally: *****Undefined subroutine &Compress::Zlib::ParseParameters called at C:/strawberry-perl-5.28.0.1-32bit-PDL/perl/lib/Compress/Zlib.pm line 366.
1 ./1/1/186/18673.pdf
1 ./1/1/209/20946.pdf
1 ./1/1/26/2656.pdf
1 ./1/1/33/3384.pdf
1 ./1/1/57/5742.pdf
1 ./1/1/58/5869.pdf
1 ./1/1/63/6395.pdf
1 ./1/1/70/7099.pdf
1 ./1/1/74/7466.pdf
Example 2 (both workers dead, but for a different reason):
1 ./1/1/106/10627.pdf
2 ./1/1/107/10703.pdf
Thread 2 terminated abnormally: *****Global symbol "@ISA" requires explicit package name (did you forget to declare "my @ISA"?) at C:/strawberry-perl-5.28.0.1-32bit-PDL/perl/site/lib/Text/PDF/Filter.pm line 342.
Global symbol "@basedict" requires explicit package name (did you forget to declare "my @basedict"?) at C:/strawberry-perl-5.28.0.1-32bit-PDL/perl/site/lib/Text/PDF/Filter.pm line 343.
Global symbol "@basedict" requires explicit package name (did you forget to declare "my @basedict"?) at C:/strawberry-perl-5.28.0.1-32bit-PDL/perl/site/lib/Text/PDF/Filter.pm line 351.
Global symbol "@basedict" requires explicit package name (did you forget to declare "my @basedict"?) at C:/strawberry-perl-5.28.0.1-32bit-PDL/perl/site/lib/Text/PDF/Filter.pm line 374.
Compilation failed in require at C:/strawberry-perl-5.28.0.1-32bit-PDL/perl/site/lib/CAM/PDF.pm line 5608.
Thread 1 terminated abnormally: *****Global symbol "@ISA" requires explicit package name (did you forget to declare "my @ISA"?) at C:/strawberry-perl-5.28.0.1-32bit-PDL/perl/site/lib/Text/PDF/Filter.pm line 342.
Global symbol "@basedict" requires explicit package name (did you forget to declare "my @basedict"?) at C:/strawberry-perl-5.28.0.1-32bit-PDL/perl/site/lib/Text/PDF/Filter.pm line 343.
Global symbol "@basedict" requires explicit package name (did you forget to declare "my @basedict"?) at C:/strawberry-perl-5.28.0.1-32bit-PDL/perl/site/lib/Text/PDF/Filter.pm line 351.
Global symbol "@basedict" requires explicit package name (did you forget to declare "my @basedict"?) at C:/strawberry-perl-5.28.0.1-32bit-PDL/perl/site/lib/Text/PDF/Filter.pm line 374.
Compilation failed in require at C:/strawberry-perl-5.28.0.1-32bit-PDL/perl/site/lib/CAM/PDF.pm line 5608.
Actually, CAM::PDF "as is" is coded to issue a single warning (large source file!), but with the filter undefined it becomes broken and useless and floods the terminal with thousands of further warnings. I therefore prepended the line that issues the warning with
die '*****' . $@;
so the output is as shown above. My impression is that the threads are trying to read the same source files at the same time: CAM::PDF requires Text::PDF::Filter, which requires Compress::Zlib, so some sort of race condition occurs and a file is read only partially (the read fails).
Is that even possible? I thought files could safely be opened for reading by different processes, and that the OS would "arbitrate" "parallel" access to them. Is that not the case in general, or only with require?
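The assumption about plain concurrent reads can be checked in isolation. The following is only a minimal sketch using core modules: several threads open the same temporary file (a hypothetical stand-in for a module source file) through separate handles and read it in full. It exercises only open and read, so it says nothing about whatever require does on top of that.

```perl
use strict;
use warnings;
use threads;
use File::Temp qw(tempfile);

# Write a throw-away file; it merely stands in for a module source file.
my ( $fh, $path ) = tempfile( UNLINK => 1 );
print {$fh} "line $_\n" for 1 .. 10_000;
close $fh or die "close: $!";
my $expected = -s $path;    # size on disk, in bytes

# Four threads read the same file at once, each through its own handle.
my @readers = map {
    threads->create( sub {
        open my $in, '<', $path or die "open: $!";
        binmode $in;        # count raw bytes, so the result matches -s
        local $/;           # slurp mode
        my $data = <$in>;
        close $in;
        return length $data;
    } );
} 1 .. 4;

my @sizes = map { $_->join } @readers;
print "expected $expected bytes, readers got: @sizes\n";
```

If every reader reports the full size, a partial read of the file itself is probably not the explanation, and the suspicion shifts to what happens while a module is being compiled.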
If it's not the case, is it common knowledge (which I missed) that the main thread should take care to "pre-require" all modules possibly needed by several workers before spawning them?
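For illustration, the "pre-require" idea could look like the sketch below. It is not a confirmed fix for the errors above. It relies on the fact that each ithread starts as a clone of its parent interpreter, including %INC and already compiled code, so a module loaded before the workers are created is not compiled again inside them. Compress::Zlib is used here only because it is a core module; in the real script the candidates would presumably be Text::PDF::Filter and anything else CAM::PDF loads lazily.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# Load lazily required modules in the main thread, BEFORE any worker exists.
# Compress::Zlib is a stand-in (an assumption for this sketch) for whatever
# the workers would otherwise require at run time.
require Compress::Zlib;

my $q = Thread::Queue->new;

my @gang = map {
    threads->create( sub {
        my $count = 0;
        while ( defined( my $item = $q->dequeue ) ) {
            # The module is already listed in the cloned %INC, so this
            # require returns at once instead of compiling the file.
            require Compress::Zlib;
            $count++;
        }
        return $count;
    } );
} 1 .. 2;

$q->enqueue($_) for 1 .. 10;
$q->end;

my $total = 0;
for my $thr (@gang) {
    my ($n) = $thr->join;    # number of items this worker handled
    $total += $n;
}
print "processed $total items\n";
```

The only design point is the ordering: the require sits above the thread creation, so no two workers can be compiling the same module at the same moment.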
(Note, if someone wants to run tests: the PDFs are of the "compressed xref table" variety (they are a client's files, which I won't share). With other (simple xref table) files, the sub containing line 5608 won't be called, i.e. Text::PDF::Filter won't be required, when simply reading a file.)