PerlMonks  

using parallel processing to concatenate a string, where order of concatenation doesn't matter

by tphyahoo (Vicar)
on Oct 18, 2006 at 10:48 UTC ( #579015=perlquestion )
tphyahoo has asked for the wisdom of the Perl Monks concerning the following question:

I am experimenting with using threading, or forking, or whatever you want to call it, to speed things up that are slow, but where order of execution doesn't matter.

Here we have an artificial example, where I want to concatenate some letters, but the order of concatenation doesn't matter.

I thought I could use Parallel::ForkManager to achieve my aim like in the following, but as the tests demonstrate, this doesn't work.

The sub concatenate_serial does what I want, but is slow. It takes 9 seconds total, because I slowed things artificially with sleep(3). The other method, concatenate_parallel, finishes in 3 seconds, but it doesn't concatenate the letters.

I think I am probably making an error in thinking, or misunderstanding how threads, or forks, or whatever you call them, are supposed to work.

**************
UPDATE: I see that the problem is that after each fork I am in a separate process, a copy of the parent process. And the variable I want to build the concat string with isn't shared across these processes.

I see that's the problem. What I don't have is a way to share these variables across processes. Well, I suppose one way would be to write the concat string to a permanent store on the hard drive, like a file or a db. But I'm wondering if there's a more elegant way, something that works "just with ram".

ANOTHER UPDATE: Okay, so I guess threads works "just with ram". Could someone throw some code my way that I could plug in to convert my serial code into parallel code (well, to be really pedantic, parallel code that can't be relied on to execute sequentially)?
**************

Please help me understand this!

I withdraw to my little cell now for further study and meditation.....

use strict;
use warnings;
use Parallel::ForkManager;
use Test::More qw( no_plan );
use Data::Dumper;

my $letters = [ qw(a b c) ];

# the order of the concatenated letters doesn't matter, as long as we get all three letters
# (index returns 0 when 'b' comes first, so the check is >= 0)
my $content;
ok( index( $content = concatenate_parallel($letters), 'b' ) >= 0, "it contains b, total content: $content" ); # fast, but doesn't work
ok( index( $content = concatenate_serial($letters), 'b' ) >= 0, "it contains b, total content: $content" ); # works, but slow

# works, but slow
sub concatenate_serial {
    my $letters = shift;
    my $content = '';
    foreach my $letter ( @$letters ) {
        sleep 3; # something happens that takes noticeable time, like fetch a url
        $content .= $letter;
    }
    print "serial content: $content\n";
    return $content
}

# fast, but doesn't work
sub concatenate_parallel {
    my $letters = shift;
    my $pm = Parallel::ForkManager->new(10);
    my $content = '';
    foreach my $letter ( @$letters ) {
        $pm->start and next;
        sleep 3; # something happens that takes time, like fetch a url
        $content .= $letter; # runs in the child; the parent's $content never sees this
        $pm->finish;
    }
    $pm->wait_all_children;
    return $content
}

UPDATE: I changed the title to "using parallel processing to...." where originally it was "using Parallel::ForkManager to...". Seems like that module doesn't do what I want, but hopefully something with threads does.

UPDATE 2: Now I'm wondering if I could do this with MapReduce, pipin' fresh on cpan... Could there be ThreadedMapReduce (and/or ForkedMapReduce) instead of DistributedMapReduce?

Re: using Parallel::ForkManager to concatenate a string, where order of concatenation doesn't matter
by RMGir (Prior) on Oct 18, 2006 at 11:14 UTC
    You're misunderstanding what Parallel::ForkManager does, I'm afraid. (As long as it actually does use fork, which seems likely from the name).

    When fork() (the underlying system call) is called, the process is split into 2 identical copies, and the fork call returns in each copy, indicating by its return value which copy is which. (That's a simplified view, but it'll do for now)

    So your concatenate_parallel function is effectively adding a letter to $content in each subprocess, but that doesn't affect the parent process (your main application) where $content remains unchanged.

    You can find a lot of good info online about how fork works, what the variants are, what gets copied for subprocesses and what doesn't. It's a good thing to understand.
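To make the point about fork() concrete, here is a minimal sketch using only core Perl (no Parallel::ForkManager). The variable name mirrors the question; everything else is illustrative. The child appends to its own copy of $content, and the parent's copy is untouched.

```perl
use strict;
use warnings;

my $content = '';

my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ($pid == 0) {
    # Child process: this changes the child's private copy only.
    $content .= 'a';
    exit 0;
}

# Parent process: wait for the child, then look at our own copy.
waitpid($pid, 0);
print "parent sees: '$content'\n";   # prints: parent sees: ''
```

This is the same effect the failing test in the question shows: the work happens, but in a copy that disappears when the child exits.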

    By the way, if you'd been using threads instead, you'd have been on the right path. Threads are simultaneous paths of execution in the _same_ process, so different threads could modify $content. But without using synchronization mechanisms, you're right, the letters would arrive in a jumbled, unpredictable order.

    Good luck!


    Mike
      Thanks, you are surely right (see my update).

      So ok, how do I do this with threads in the cleanest possible way?

      Basically what I am after is a way to transform the serial code into the parallel code with a minimum amount of fuss, and a minimum amount of typing. And be reasonably certain that it will work.

        If you are on UNIX, take a look at shmread and the associated links in perlfunc.
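As a rough sketch of the shmread route, assuming a UNIX-like system with System V shared memory available: each child writes its letter into its own one-byte slot of a shared segment, so no locking is needed. The slot layout and variable names are assumptions made for this example, not something from the thread.

```perl
use strict;
use warnings;
use IPC::SysV qw(IPC_PRIVATE IPC_RMID S_IRUSR S_IWUSR);

my @letters = qw(a b c);
my $size    = scalar @letters;

# One byte of shared memory per letter, private to this process tree.
my $id = shmget(IPC_PRIVATE, $size, S_IRUSR | S_IWUSR);
defined $id or die "shmget failed: $!";

my @pids;
for my $i (0 .. $#letters) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        sleep 1;    # stand-in for slow work such as fetching a URL
        shmwrite($id, $letters[$i], $i, 1) or die "shmwrite failed: $!";
        exit 0;
    }
    push @pids, $pid;
}
waitpid($_, 0) for @pids;

my $content;
shmread($id, $content, 0, $size) or die "shmread failed: $!";
shmctl($id, IPC_RMID, 0)         or die "shmctl failed: $!";
print "shared content: $content\n";   # prints: shared content: abc
```

Because every child owns a fixed slot, the result comes out in slot order here. A single shared string that all children append to would need a semaphore or similar on top of this.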
Re: using Parallel::ForkManager to concatenate a string, where order of concatenation doesn't matter
by Hue-Bond (Priest) on Oct 18, 2006 at 11:21 UTC

    You expect the timeline of your processes to be:

         ____
        /    \
    ---<------>---
        \____/

    But it really is:

         _________
        /
    ---<----------
        \_________

    The problem is that the updated $content dies with its process. You need some sort of IPC. Welcome :^).

    Update: Or threads, like RMGir correctly suggests.
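One possible shape for the IPC mentioned above, sketched with core Perl only: a forking open ('-|') gives the parent a read handle connected to each child's STDOUT. The sub name concatenate_forked is made up for this example, and the one-second sleep stands in for the slow work.

```perl
use strict;
use warnings;

sub concatenate_forked {
    my ($letters) = @_;
    my @readers;

    for my $letter (@$letters) {
        # Forking open: returns 0 in the child, the child's pid in the parent.
        my $pid = open(my $from_child, '-|');
        die "cannot fork: $!" unless defined $pid;
        if ($pid == 0) {
            sleep 1;          # stand-in for slow work
            print $letter;    # child's STDOUT is the pipe to the parent
            exit 0;
        }
        push @readers, $from_child;
    }

    # All children are already running; now collect what each one wrote.
    my $content = '';
    for my $fh (@readers) {
        local $/;             # slurp everything the child printed
        my $chunk = <$fh>;
        $content .= $chunk if defined $chunk;
        close $fh;            # also waits for that child
    }
    return $content;
}

my $content = concatenate_forked([qw(a b c)]);
print "forked content: $content\n";   # prints: forked content: abc
```

The children sleep concurrently, so the whole call takes about one second rather than three. Results are read in launch order, which is fine when order does not matter.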

    --
    David Serrano

Re: using Parallel::ForkManager to concatenate a string, where order of concatenation doesn't matter
by cdarke (Prior) on Oct 18, 2006 at 11:22 UTC
    fork() is part of UNIX architecture, and creates a new process. It is also a Perl built-in and is emulated in ActiveState Perl on Windows as creation of a new thread, which might be confusing you. So, which OS?
    A process has its own area of virtual memory, and no one else can access that (unless invited). A process is an instance of a running program, but is also a container for threads, in that all threads share the same virtual memory area (process address space). So they can stomp on each other if you are not careful.
    Processes are slow to create and destroy, but are robust. Communication between processes is (relatively) slow, because it involves a kernel call. Threads are faster to create and destroy, and provide faster communication between them, but are more difficult to program because you always have to remember that all threads use the same (off-stack) data areas.
      I'm on unix.

      And I see now that I can't do what I want using forks, and probably need threads. (see update).

Re: using parallel processing to concatenate a string, where order of concatenation doesn't matter
by blazar (Canon) on Oct 18, 2006 at 11:39 UTC
    I am experimenting with using threading, or forking, or whatever you want to call it, to speed things up that are slow, but where order of execution doesn't matter.

    Beware of doing so for that reason alone. It may make sense if your main program has other things to do, like responding to user input (but that's a whole other story), or if you actually have more than one CPU (hyperthreading included), or if those things are slow but not due to CPU-boundedness, i.e. if they comprise responding to network connections. Otherwise you won't see any performance gain from splitting your logic among threads or processes.

      > or if those things are slow but not due to CPU-boundedness, > i.e. if they comprise responding to network connections

      yes, ultimately this will wind up fetching urls with WWW::Mechanize. I was doing this with LWP::Parallel::UserAgent, but doing things this way ties me to that particular user agent, and the more I use it the more I pine for WWW::Mechanize, which makes so many things so easy.

      But really, my goal is to gain an understanding for how to take arbitrary code and "parallelize" it. I've been reading a lot of lisp propaganda lately, and if we were in lisp world there would probably be a macro to do the transformation of the code. But we're not, so the macro has to be my brain... ah, to turn code a into code b I have to do such and so. That kind of understanding is actually what I'm after.

        You might want to have a look at POE about how "parallelizing" is done without threads, in a cooperative multitasking way. The POE kernel doesn't deal out time slices (thus virtually parallelizing in the way an operating system kernel does), but expects its programs (or "sessions") to behave nicely and give back control whenever appropriate.
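To illustrate the cooperative idea in general terms, here is a toy round-robin scheduler in plain Perl. It is not POE and does not use POE's API; make_task and the step counts are invented for the example. Each task does one small step per call and hands control back by returning.

```perl
use strict;
use warnings;

# Build a task: a closure that "works" for $steps calls, then delivers
# its letter. It returns true while it still has work left.
sub make_task {
    my ($letter, $steps, $content_ref) = @_;
    return sub {
        if (--$steps > 0) {
            return 1;                 # not finished: yield control
        }
        $$content_ref .= $letter;     # last step: deliver the result
        return 0;                     # finished
    };
}

my $content = '';
my @tasks = (
    make_task('a', 3, \$content),
    make_task('b', 1, \$content),
    make_task('c', 2, \$content),
);

# Scheduler: run every task once per round, keep the unfinished ones.
while (@tasks) {
    @tasks = grep { $_->() } @tasks;
}
print "cooperative content: $content\n";   # prints: cooperative content: bca
```

Nothing here runs simultaneously, so no locking is needed. The letters arrive in completion order, and a task that never returns would stall all the others, which is the "behave nicely" requirement described above.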

        --shmem

        _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                      /\_¯/(q    /
        ----------------------------  \__(m.====·.(_("always off the crowd"))."·
        ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re: using parallel processing to concatenate a string, where order of concatenation doesn't matter
by diotalevi (Canon) on Oct 18, 2006 at 14:15 UTC

    If Parallel::Queue's threading worked for me I'd suggest the following.

    use threads::shared;
    use Parallel::Queue;

    print concatenate_parallel( ['a' .. 'z'], 4 ) . "\n";

    sub concatenate_parallel {
        my $result :shared;
        my @input = map {
            my $string = $_;
            sub { $result .= $string; };
        } @{ shift @_ };
        my $max_threads = shift @_;

        my $mgr = Parallel::Queue->construct( 'thread' );
        $mgr->runqueue( $max_threads, @input );

        return $result;
    }

    ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊

      Thanks, that's intriguing, but my perl doesn't seem to like that code.
      hartman@ds0207:~/pfmArena> perl test2.pl
      Bogus Parallel::Queue: "fork" and "thread" are exclusive at test2.pl line 17
      hartman@ds0207:~/pfmArena> perl -e '@ARGV=("test2.pl"); while (<>) {die $_ if $. == 17}'
      my $mgr = Parallel::Queue->construct( 'thread' );
      hartman@ds0207:~/pfmArena>
      hartman@ds0207:~/pfmArena> perl -V | grep -i thread
      osname=linux, osvers=2.6.16, archname=i586-linux-thread-multi
      config_args='-ds -e -Dprefix=/usr -Dvendorprefix=/usr -Dinstallusrbinperl -Dusethreads -Di_db -Di_dbm -Di_ndbm -Di_gdbm -Duseshrplib=true -Doptimize=-O2 -march=i586 -mtune=i686 -fmessage-length=0 -Wall -D_FORTIFY_SOURCE=2 -g -Wall -pipe'
      usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
      cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBUGGING -fno-strict-aliasing -pipe -Wdeclaration-after-statement -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
      cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBUGGING -fno-strict-aliasing -pipe -Wdeclaration-after-statement'
      libs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
      perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
      dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/usr/lib/perl5/5.8.8/i586-linux-thread-multi/CORE'
      PERL_MALLOC_WRAP THREADS_HAVE_PIDS USE_ITHREADS
      /usr/lib/perl5/5.8.8/i586-linux-thread-multi
      /usr/lib/perl5/site_perl/5.8.8/i586-linux-thread-multi
      /usr/lib/perl5/vendor_perl/5.8.8/i586-linux-thread-multi
      hartman@ds0207:~/pfmArena> perl -V | grep -i ithread
      usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
      PERL_MALLOC_WRAP THREADS_HAVE_PIDS USE_ITHREADS
      hartman@ds0207:~/pfmArena>
      Any ideas?

        Tell it to use threading by default: use Parallel::Queue 'thread';.

        ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊

Re: using parallel processing to concatenate a string, where order of concatenation doesn't matter
by neilwatson (Curate) on Oct 18, 2006 at 14:35 UTC
Re: using parallel processing to concatenate a string, where order of concatenation doesn't matter
by BrowserUk (Pope) on Oct 18, 2006 at 17:04 UTC

    Try this. I've used a randomly varying 'work pause', because if the pause were constant, the threads would obviously finish in the same order as they started.

    Update: Applied locks per ikegami's post below.

    #! perl -slw
    use strict;
    use threads;
    use threads::shared;

    my $content : shared = '';

    sub concatenate_parallel {
        my $letter = shift;
        sleep 1 + rand 2; ## Do stuff that takes time
        {
            lock $content;
            $content .= $letter;
        }
    }

    print scalar localtime;
    my @threads = map{
        threads->create( \&concatenate_parallel, $_ )
    } 'a' .. 'c';
    $_->join for @threads;
    print $content;
    print scalar localtime;

    __END__
    c:\test>579015
    Wed Oct 18 18:01:20 2006
    cab
    Wed Oct 18 18:01:22 2006

    c:\test>579015
    Wed Oct 18 18:01:24 2006
    bca
    Wed Oct 18 18:01:26 2006

    c:\test>579015
    Wed Oct 18 18:01:27 2006
    abc
    Wed Oct 18 18:01:29 2006

    c:\test>579015
    Wed Oct 18 18:01:30 2006
    abc
    Wed Oct 18 18:01:32 2006

    If you don't like the shared buffer being passed to the threads through closure--it smacks of globals--then you could use this version that passes a reference to the shared buffer to the threads as an argument and dereferences it when appending:

    #! perl -slw
    use strict;
    use threads;
    use threads::shared;

    sub concatenate_parallel {
        my( $contentRef, $letter ) = @_;
        sleep 1 + rand 2; ## Do stuff that takes time
        {
            lock $contentRef;
            $$contentRef .= $letter for 1 .. 1e4;
        }
    }

    my $content : shared = '';
    my $contentRef = \$content;

    print scalar localtime;
    my @threads = map{
        threads->create( \&concatenate_parallel, $contentRef, $_ );
    } 'a' .. 'z';
    $_->join for @threads;
    print $content;
    print scalar localtime;

    __END__
    c:\test>579015
    Thu Oct 19 00:33:11 2006
    cbefghlmqsvwadijknoprtuxyz
    Thu Oct 19 00:33:13 2006

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Is .= atomic, or do you have a race condition?

        On my single-CPU machine, only one thread runs at a time, so there is no race condition. On a multi-CPU machine, you will need to read the docs on threads::shared::lock() and use it.
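On the atomicity question, a minimal sketch of the locking pattern, assuming a perl built with ithreads. The names $total and add_many are invented for the example. An update that reads a shared value and writes it back is the kind of operation that lock() protects, and taking the lock costs little, so it is a reasonable default whatever the CPU count.

```perl
use strict;
use warnings;
use threads;
use threads::shared;

my $total :shared = 0;

sub add_many {
    my ($times) = @_;
    for (1 .. $times) {
        lock($total);          # held until the end of this loop body
        $total = $total + 1;   # read-modify-write, safe under the lock
    }
}

# Four threads, each adding 1000 to the same shared counter.
my @workers = map { threads->create(\&add_many, 1000) } 1 .. 4;
$_->join for @workers;

print "total: $total\n";   # prints: total: 4000
```

The lock is advisory and is released automatically when the enclosing block ends, which is why the examples above wrap the append in its own bare block.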


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

Node Type: perlquestion [id://579015]
Approved by wazoox
Front-paged by diotalevi