http://www.perlmonks.org?node_id=1092634

pimperator has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks

The script I am writing will run on a Windows machine. It works on a Mac but fails on Windows, and I'm not sure whether the cause is an incompatibility between the commands on the two systems.

Overview of the script: I have a lot of binary 'chp' files that need to be converted to txt files. The only way to convert these files is through a proprietary program.

The script takes an array of the file paths and chops it up into smaller arrays of 20. For each batch of 20 files it forks 20 child processes, each of which passes the text-conversion command to system(). Then it parses each converted file and extracts the data I want.

#!/usr/bin/perl
use strict;
use File::Find;
use Parallel::ForkManager;

my @toParse; # array of chip files

foreach(@toParse){
    my $encryptKey = $_; #Encrypted for patients
    my @name = split /\//, $fileManifest{$encryptKey}; #hash contains the directories
    my $id = pop(@name); # @name is an array of the directory, last element is the sample ID
    my $perlOut = join "\/",@name; # directory for perl commands
    my @safeID = split / /, $id;
    my $safeDirect = $perlOut."/".$safeID[0]; # hipaa compliant
    next if($safeID[0] =~ /cgo/ig);

    #-----------------------------------------------------------------
    # Prepare the commands for conversion
    #-----------------------------------------------------------------
    my $cychpCommand = $fileManifest{$encryptKey}; #directory of the cychp file
    $cychpCommand =~ s/\//\\/g; # Format the directory for windows and for the affy program
    $cychpCommand = "\"".$cychpCommand."\"";
    my $cychpTxtOut = join "\\", @name; #CYCHP TXT OUTPUT DIRECTORY
    $cychpTxtOut ="\"".$cychpTxtOut."\"";
    my $affyCommand = "apt-chp-to-txt -o ".$cychpTxtOut." ".$cychpCommand;

    unless(exists($cychpLog{$encryptKey})) {
        my $rmCom = $perlOut."/apt-chp-to-txt.log";
        push @toConvert1, $affyCommand; # To send commands for conversion
        push @toConvert2, $rmCom; # To remove the log files
        $cychpLog{$encryptKey}++;
        $timestamp = localtime(time);
        open LOG, ">>".$logFileName;
        print LOG "$encryptKey\t$safeID[0]\t$safeDirect\t-\t-\t$timestamp\n";
        close(LOG);
    }
}

#-----------------------------------------------------------------
# Begin forking
#-----------------------------------------------------------------
my $maxProcs = scalar(@toConvert1);
my $pm = new Parallel::ForkManager($maxProcs);
foreach(@toConvert1){
    my $pid = $pm-> start and next; # FORK Begin
    system($_); # CONVERSION COMMAND
    $pm->finish; # End of FORK
}
$pm->wait_all_children; # Blocks until all children are finished

foreach(@toConvert2){
    unlink $_; #Delete log files made by the conversion program
}

foreach(@toParse){
    my $txtFile = $_.".txt";
    open IN, $txtFile or die "Cannot open [$txtFile]\n";
    while(<IN>){
        parsing happens here
    }
}

So my question is as follows:

Is my code converting the files and then waiting until they are all finished? If not, why not? How can I accomplish this on a Windows machine? I cannot convert all the files at once because the resulting files are too large. I have to convert a few, parse them, and then unlink them. The error I'm getting is that the proprietary program cannot open the log file. I presume the log file has been deleted, because I unlink the log files right after the conversion takes place. The only explanation I can think of is that the script moves on without waiting for the forked processes to finish.

Replies are listed 'Best First'.
Re: Is my code forking the way I want it to? Forking and System Calls
by SimonPratt (Friar) on Jul 08, 2014 at 08:35 UTC

    OK, from what I can see, you are forking off a separate process for each and every file that you need to process: the ForkManager limit is set to the number of commands, so they all run at once. This is really inefficient and will probably cause your code to run very slowly (more slowly than running a single process in series). I strongly recommend you rethink your forking strategy (i.e., build it to not be parallel, then parallelise it when you have it working). Also bear in mind that when you have processes that are fairly busy (more than 50% core utilisation, on average), having more than one thread per physical core in your machine is likely to cause your code to Go Slow.
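
    A minimal sketch of the non-parallel baseline suggested above, using only core Perl. The name run_serially is made up for illustration, and the running perl interpreter stands in for apt-chp-to-txt so the example can be run anywhere. system() does not return until the command has exited, so checking its return value also answers the question of whether each conversion really finished.

```perl
use strict;
use warnings;

# Run each command one at a time and collect the ones that failed.
# system() blocks until the child has exited, so nothing downstream
# can start before a conversion is complete.
sub run_serially {
    my @commands = @_;
    my @failed;
    for my $cmd (@commands) {
        my $status = system($cmd);
        # $status is -1 if the command could not be started;
        # otherwise a non-zero value means it exited abnormally.
        push @failed, $cmd if $status != 0;
    }
    return @failed;
}

# Stand-in commands (assumption): the current perl replaces apt-chp-to-txt.
my $perl = $^X;
my @demo = (qq{"$perl" -e "exit 0"}, qq{"$perl" -e "exit 3"});
my @failed = run_serially(@demo);
print scalar(@failed), " command(s) failed\n";
```

    Once this works serially, the same list of commands can be handed to Parallel::ForkManager with a small fixed limit instead of one process per file.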

    Next, unless all of your files are in individual folders, the likely cause of your error is that you will have multiple instances of your proprietary conversion program attempting to write to the same log file. This may or may not be the actual cause of your issue, but it should be simple to test.
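
    One way to test the shared-log theory before running anything is to group the input files by directory. This is a sketch only: shared_log_dirs and the sample paths are hypothetical, and it assumes, as the posted script does, that apt-chp-to-txt.log ends up in the same directory as the file being converted.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);

# Return every directory that holds more than one input file. Converters
# started in parallel for those files would share one apt-chp-to-txt.log
# (assumption: the log is written next to the input, as in the script).
sub shared_log_dirs {
    my @files = @_;
    my %count;
    $count{ dirname($_) }++ for @files;
    my @dirs = grep { $count{$_} > 1 } keys %count;
    @dirs = sort @dirs;
    return @dirs;
}

# Hypothetical paths for illustration only.
my @files = (
    'data/runA/sample1.cychp',
    'data/runA/sample2.cychp',
    'data/runB/sample3.cychp',
);
my @shared = shared_log_dirs(@files);
print "$_\n" for @shared;
```

    If the list comes back non-empty, running the files in those directories one after another (or in separate batches) would show whether the log collision is the real cause.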

    The final point is that you say you don't have enough space to convert all of your files before parsing and unlinking them. However, your code as it stands does exactly that: it converts all the files, then unlinks the log files, then parses the output.
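
    The convert-a-few, parse, then delete cycle described in the question could be structured roughly like this. It is a sketch under stated assumptions: process_in_batches and its three callbacks are hypothetical stand-ins for the apt-chp-to-txt call, the parsing loop and the unlink step, and the demo only records the order of operations.

```perl
use strict;
use warnings;

# Process items in batches: convert a batch, parse it, clean up, repeat.
# $convert, $parse and $cleanup are hypothetical callbacks.
sub process_in_batches {
    my ($batch_size, $items, $convert, $parse, $cleanup) = @_;
    my @queue   = @$items;               # copy so the caller's list survives
    my $batches = 0;
    while (my @batch = splice(@queue, 0, $batch_size)) {
        $convert->($_) for @batch;       # every conversion in the batch finishes first
        $parse->($_)   for @batch;       # then the text output is read
        $cleanup->($_) for @batch;       # then the large files are removed
        $batches++;
    }
    return $batches;
}

# Demo: record the order in which the steps run for three items, two per batch.
my @log;
my $n = process_in_batches(
    2, [qw(a b c)],
    sub { push @log, "convert $_[0]" },
    sub { push @log, "parse $_[0]" },
    sub { push @log, "cleanup $_[0]" },
);
print "$_\n" for @log;
```

    With this shape, at most one batch of converted text files exists on disk at any time, which is the constraint the original post describes.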

Re: Is my code forking the way I want it to? Forking and System Calls
by DrHyde (Prior) on Jul 08, 2014 at 10:48 UTC

    First, reduce your code to the absolute bare minimum that exhibits the problem. This will probably lead to you solving it yourself, but if it doesn't it'll make it easier for us to help you. Then, show us *all* of that bare minimum. As it stands, your code won't even compile (and that's ignoring the bit you cut out and replace with "parsing happens here").

    However, a couple of pointers to get you started. First, get rid of the fork()ing. Windows doesn't do it very well: it has no native fork(), so Perl emulates it with threads. Second, Perl on Windows opens files in text mode by default, translating CRLF line endings as it reads and writes. You say that you have to parse binary 'chp' files, whatever they are, but your code says that you're opening and parsing .txt files. If the binary files are the ones you mean to open, then you need to call binmode() on the filehandle to read them properly. You might need to do that anyway even if they are text files, if your parsing code cares about line endings.

      I have to convert the chp files using the system call first and then open the txt file that it converted
        OK. You still might have to care about line endings though.
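
        A small sketch of parsing that does not care which line ending the converter wrote. The helper name strip_eol and the tab-separated sample data are assumptions for illustration; the real output format of the conversion program is not shown in the thread. An in-memory filehandle is used so the example behaves the same on every platform.

```perl
use strict;
use warnings;

# Remove a trailing LF or CRLF, whichever the line has.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;
    return $line;
}

# Hypothetical tab-separated content with Windows line endings.
my $data = "id\tvalue\r\nA1\t0.5\r\n";
open my $in, '<', \$data or die "Cannot open in-memory file: $!";
binmode($in);    # read bytes as-is, no CRLF translation
my @rows = map { [ split /\t/, strip_eol($_) ] } <$in>;
close $in;
print join('|', @$_), "\n" for @rows;
```

        Stripping the ending explicitly means the last field never carries a stray carriage return, whether or not the handle was opened in binary mode.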