Threading read access to a filedescriptor

by smferris (Beadle)
on Jan 29, 2001 at 22:35 UTC ( [id://55035] )

smferris has asked for the wisdom of the Perl Monks concerning the following question:

Ok,

Simplified, short version:
How do I multithread read access to a single file? (using fork)

Long version: 8)
Perl 5.005, Sun Solaris and Linux (RH 7)

I have a requirement to parse and load a flat file into an RDBMS. Perl will need to scrub the data, and the files can be large (a couple million records). I thought, heck, let's multithread this thing! How hard could it be? Here's some sample code that I thought would work (copied from memory and commented):

--- Code snippet
#!/usr/bin/perl

package SMF::Threader;    # Used for IPC between processes.

# Create a SMF::Threader object for memory sharing
sub new {
    my $class = shift;
    my $self  = {};
    open($self->{filehandle}, "test.dat") || die $!;
    bless $self, $class;
}

# Every process will have to wait her turn to get a record
sub lock {
    my ($self, $pid) = @_;
    push(@{ $self->{waits} }, $pid);
    until ($self->{waits}[0] == $pid) {
        ;    # waiting for my turn
    }
    1;
}

# Release the next process
sub unlock {
    my $self = shift;
    shift(@{ $self->{waits} });
    1;
}

# Get a record from the filehandle
sub fetch {
    my ($self, $pid) = @_;
    $self->lock($pid);
    # Can anyone tell me how to combine these next 2 lines?
    # <$self->{filehandle}> is a syntax problem
    my $fh  = $self->{filehandle};
    my $row = <$fh>;
    $self->unlock;
    return $row;
}

1;

package main;
use POSIX;

my $new = new SMF::Threader;

for (1 .. 2) {    # Fork 2 processes
    unless (fork) {
        open(OUT, ">" . $$ . ".out");
        while (my $record = $new->fetch($$)) {
            # Record format is "0000000000abcdefg..xyz"
            my ($num, $alpha) = unpack("a10 a26", $record);
            print $record unless length($alpha) == 26;
        }
        close OUT;
        exit;
    }
    sleep 1;    # I don't think this is necessary because of my
                # locking method, but... just in case.
}

my $child;
do {
    $child = waitpid(-1, POSIX::WNOHANG);    # Is WNOHANG not exported??
} until $child == -1;
exit;
--- Code snippet

My assumption was that if I build $new (SMF::Threader) in the parent and use it in each child, it would create a memory segment shared between the processes. Is that true? The problem is that the processes don't always get a complete record. (RS=newline) What am I overlooking? Am I going to have to use a semaphore to keep track of the locks? I think I will still build that into SMF::Threader (named something different), as I might have a reason to reuse it for database read access. (MUCH LATER) 8) Any problems you see with that? (CORBA? Definitely overkill, I think.)

All help will be greatly appreciated!

Shawn M Ferris
Oracle DBA - Time Warner Telecom

Replies are listed 'Best First'.
(tye)Re: Threading read access to a filedescriptor
by tye (Sage) on Jan 29, 2001 at 23:10 UTC

    No, fork() doesn't share much of anything between processes. The new process inherits copies of everything, including open file descriptors. In particular, fork() never creates shared memory[1].

    Another thing that isn't shared is the buffering that efficient one-line-at-a-time reading of a file requires. So <$fh> isn't going to do a very good job of distributing lines between processes, because each process will read much more than just the next line and buffer it.
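
    A minimal, untested way to watch this happen (assuming a line-oriented test.dat):

--- Code snippet
#!/usr/bin/perl
# Two children read the same inherited filehandle.  Each child's stdio
# buffer slurps a big block on its first read, so lines come out in
# long per-process runs, and a block boundary can tear a record in two.
open(FH, "test.dat") or die $!;
for (1 .. 2) {
    next if fork;                 # parent keeps looping; child reads
    while (my $line = <FH>) {
        print "$$: $line";        # watch the long runs of one pid
    }
    exit;
}
1 while wait != -1;               # reap both children
--- Code snippet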

    Now, the current file position can be shared between file descriptors. My first guess would have been that it isn't shared after a simple fork(), but you seem to imply otherwise. And if fork() didn't share it, I don't know how you'd go about sharing it (perhaps by passing an open file descriptor over a socket?).

    To do something like this I'd resort to a pipe, with a record length preceding each record so that the readers can efficiently read an entire record. This requires a writer process that reads the input file and puts the records onto the pipe. (If the records are always short, you could rely on the behavior of pipes with multiple readers under Unix: have the writer put each record onto the pipe with a single syswrite() per record, and each reader then gets a single record per sysread(). But I'd probably avoid that type of fragile solution.)
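
    An untested sketch of what I mean (the 4-byte "N" length header and the lock file are my own choices; a reader holds a lock across the header-and-body pair of reads so the two sysread()s stay together):

--- Code snippet
#!/usr/bin/perl
# The writer prefixes each record with its length; a reader takes a
# lock, reads the header, then reads exactly that many bytes, so a
# record is never torn between processes.
use Fcntl qw(:flock);

pipe(READER, WRITER) or die "pipe: $!";

for (1 .. 2) {                            # two reader processes
    next if fork;
    close WRITER;
    open(LOCK, ">>lockfile") or die $!;   # own descriptor, so flock works
    my ($head, $record);
    while (1) {
        flock(LOCK, LOCK_EX) or die $!;
        my $ok = sysread(READER, $head, 4);
        # (a real version would loop on short reads)
        sysread(READER, $record, unpack("N", $head)) if $ok;
        flock(LOCK, LOCK_UN);
        last unless $ok;
        # ... scrub and load $record here ...
    }
    exit;
}

# The parent is the writer.
close READER;
open(IN, "test.dat") or die $!;
while (my $row = <IN>) {
    syswrite(WRITER, pack("N", length $row) . $row);
}
close WRITER;                             # readers see EOF and finish
1 while wait != -1;
--- Code snippet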

    Alternatively, the writer process could have a separate pipe to each reader and pick between the pipes using select(); this avoids the need to write a record length in front of each record.
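
    That variant might look something like this (untested; the lexical filehandles here want a perl newer than the 5.005 mentioned above):

--- Code snippet
#!/usr/bin/perl
# One pipe per child means a single reader per pipe, so no length
# prefix and no locking; select() (via IO::Select) hands the next
# record to whichever child's pipe can accept a write.
use IO::Select;

my @write_ends;
for (1 .. 2) {
    pipe(my $r, my $w) or die "pipe: $!";
    if (fork) {                      # parent keeps the write end
        close $r;
        push @write_ends, $w;
    } else {                         # child is the sole reader of $r
        close $w;
        close $_ for @write_ends;    # drop write ends from earlier loops
        while (my $record = <$r>) {
            # ... scrub and load $record here ...
        }
        exit;
    }
}

my $sel = IO::Select->new(@write_ends);
open(IN, "test.dat") or die $!;
while (my $row = <IN>) {
    my ($ready) = $sel->can_write;   # block until some pipe has room
    syswrite($ready, $row);          # unbuffered, to match select()
}
close $_ for @write_ends;            # each child sees EOF on its pipe
1 while wait != -1;
--- Code snippet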

    [1] Well, it probably makes shared read-only memory that gets copied by an exception handler when and if a process ever tries to write to it. But this is really just an optimization trick, not a way to share anything between processes.

            - tye (but my friends call me "Tye")

      The consensus is that the instance of $new is going to be cloned rather than shared. Bummer. However, I don't see this as a problem yet.

      The docs say that open filehandles will be dup()ed so that closing is handled properly (closing one doesn't close the other), although the seek pointer IS shared between the processes.

      My problem is still that I have to keep them from reading at the same instant. Now to implement it.

      ( Keeping in mind I'd really like to stick with default/stock perl modules.)

      My first thought is to use semaphores, but I don't see an easy interface such as pop/shift. Or am I making it too complicated? I'll keep working on it.
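
      Roughly what I'm picturing, as an untested sketch (IPC::SysV and IPC::Semaphore come with the stock distribution; a single semaphore acts as a mutex around the read, and the file and record handling are placeholders):

--- Code snippet
#!/usr/bin/perl
use IPC::SysV qw(IPC_PRIVATE IPC_CREAT S_IRWXU);
use IPC::Semaphore;

# Create the semaphore set (one semaphore) before forking, so every
# child inherits the same id.
my $sem = IPC::Semaphore->new(IPC_PRIVATE, 1, S_IRWXU | IPC_CREAT)
    or die "semget failed: $!";
$sem->setval(0, 1);    # 1 = unlocked

sub sem_lock   { $sem->op(0, -1, 0) }    # P operation: take it (blocks)
sub sem_unlock { $sem->op(0,  1, 0) }    # V operation: release it

open(FH, "test.dat") or die $!;
for (1 .. 2) {
    next if fork;
    while (1) {
        sem_lock();
        my $record = <FH>;    # serialized, but note tye's caveat above:
        sem_unlock();         # each process still buffers a big block here
        last unless defined $record;
        # ... scrub and load $record ...
    }
    exit;
}
1 while wait != -1;
$sem->remove;    # clean the semaphore out of the system
--- Code snippet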

      But if someone would like to chime in with a better example of locking processes using semaphores, I'd appreciate it! 8)

      Thx, SMF 8)
        "perldoc -f flock"

        But most of the rest of my post still applies.

                - tye (but my friends call me "Tye")
Re: Threading read access to a filedescriptor
by Fastolfe (Vicar) on Jan 30, 2001 at 00:14 UTC
    You probably don't want to be sharing the open filehandle between processes, since each process will be moving that filehandle around, and reading arbitrary bits of data.

    Ignoring alternatives that force you to re-think your design fundamentally, I might suggest that you fork first, then open the file. Have each thread know how much of the file it's going to be reading, and what thread number it is, and then have it seek to the right spot in the file, find the next new line, and read until it passes the starting point for the next thread.
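
    An untested sketch of that scheme (file name and worker count are placeholders):

--- Code snippet
#!/usr/bin/perl
# Each worker opens the file itself, seeks to its own byte range,
# aligns to the next line boundary, and reads until it passes the
# start of the next worker's range.
my $file    = "test.dat";
my $workers = 2;
my $size    = -s $file;

for my $n (0 .. $workers - 1) {
    next if fork;                               # parent keeps forking
    my $start = int($size * $n / $workers);
    my $end   = int($size * ($n + 1) / $workers);
    open(my $fh, "<", $file) or die "$file: $!";
    if ($start > 0) {
        seek($fh, $start - 1, 0);
        <$fh>;          # finish the line that straddles the boundary
    }
    while (tell($fh) < $end and defined(my $record = <$fh>)) {
        # ... scrub and load $record ...
    }
    exit;
}
1 while wait != -1;                             # reap the workers
--- Code snippet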

    I have no idea how easy it will be to add all of this to the DBM file, though.

Re: Threading read access to a filedescriptor
by jeroenes (Priest) on Jan 29, 2001 at 23:01 UTC
    In reply to Sorting data that don't fit in memory, tilly told me to try BerkeleyDB for playing with large amounts of data. I have implemented things and I'm not totally through yet, but I can already say that Berkeley shows impressive performance. You can download it at http://www.sleepycat.com; the distribution carries the Perl interface. This DB is scalable and portable. Don't forget to specify a large cache.
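
    For instance, an untested sketch with the BerkeleyDB module that comes in that distribution (file name and cache size are my own choices):

--- Code snippet
#!/usr/bin/perl
# Open a btree with a 64 MB cache; a large cache is what makes
# Berkeley fast on big data sets.
use BerkeleyDB;

my $db = BerkeleyDB::Btree->new(
    -Filename  => "scratch.db",
    -Flags     => DB_CREATE,
    -Cachesize => 64 * 1024 * 1024,
) or die "cannot open scratch.db: $BerkeleyDB::Error";

$db->db_put("key", "value");            # store a pair
$db->db_get("key", my $value);          # fetch it back into $value
print "$value\n";
--- Code snippet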

    Hope this helps,

    Jeroen
    "We are not alone"(FZ)
