http://www.perlmonks.org?node_id=1022065


in reply to require sugesstions for writing a perl script

I would definitely recommend that you launch one Perl process that scans for the files and processes them (forking or not forking, that's not the issue I have in mind here), rather than having a shell script launch Perl 1,000 times, which means starting the interpreter and compiling your program each time.

I had such a case at my job 8 or 9 years ago. I wrote a rather simple Perl script to reprocess a very large amount of input data. I thought the data would be coming as one big file or possibly a few big files, so my script was initially designed to process just one file. The colleague who was using my script came to me and said: "It is awfully slow, it takes hours and hours, we can't use your script." I was surprised, because I knew my script was perfectly able to process the expected amount of data in a dozen minutes or so.

After looking at the problem with him, I figured out that the data was in fact coming in the form of tens (or possibly hundreds) of thousands of small files, so my colleague (who did not know Perl) had decided to write a shell script to launch my Perl script again and again, once for each incoming file. Now, the Perl script first had to load a large parameter file (about 250,000 telephone numbers) into memory before processing the input files. The result was that, each time, the script had to load this large parameter file in order to process one small data file (each of them on the order of 1,000 to 2,000 lines), which was of course utterly inefficient. I just changed my script so that it would be launched only once, load the parameter file once, and then process all the data files in the relevant directory. That worked perfectly well and we no longer had a performance problem: the processing of the data ran 60 or 70 times faster, if I remember the figures right.
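To give an idea of the shape of that change, here is a minimal sketch, not the actual script from back then. The file layout (one telephone number per line in the parameter file, semicolon-separated records with the number in the first field in the data files), the sub names and the "count the matching lines" processing are all assumptions made up for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Load the parameter file ONCE into a hash for fast lookups.
# Assumed format: one telephone number per line.
sub load_numbers {
    my ($param_file) = @_;
    open my $fh, '<', $param_file or die "Cannot open $param_file: $!";
    my %known;
    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;    # skip empty lines
        $known{$line} = 1;
    }
    close $fh;
    return \%known;
}

# Hypothetical per-file processing: count the lines whose first
# semicolon-separated field is a known number.
sub process_file {
    my ($data_file, $known) = @_;
    open my $fh, '<', $data_file or die "Cannot open $data_file: $!";
    my $matches = 0;
    while (my $line = <$fh>) {
        chomp $line;
        my ($number) = split /;/, $line;
        $matches++ if defined $number && exists $known->{$number};
    }
    close $fh;
    return $matches;
}

# Process every plain file of a directory within the same Perl process,
# reusing the hash that was loaded once.
sub process_directory {
    my ($dir, $known) = @_;
    opendir my $dh, $dir or die "Cannot open directory $dir: $!";
    my @files = sort grep { -f "$dir/$_" } readdir $dh;
    closedir $dh;
    my %result;
    $result{$_} = process_file("$dir/$_", $known) for @files;
    return \%result;
}

# Main part: runs only when a parameter file and a data directory are given.
if (@ARGV == 2) {
    my ($param_file, $dir) = @ARGV;
    my $known  = load_numbers($param_file);          # expensive step, done once
    my $result = process_directory($dir, $known);    # cheap step, done per file
    print "$_: $result->{$_}\n" for sort keys %$result;
}
```

You would call it once, with the parameter file and the data directory as arguments, instead of once per data file. The point is only that load_numbers sits outside the loop over the files.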

I am telling you this story for 2 reasons:

1. It is not efficient to launch Perl 1,000 times if you can do it differently.

2. Knowing exactly what the program is doing is of paramount importance. The time spent loading the parameter file was almost totally irrelevant compared to the amount of data to be processed if done only once or a couple of times, but became a major bottleneck when it had to be done tens or hundreds of thousands of times. So tell us what your validation script is doing.