
Best Practices for Uncompressing/Recompressing Files?

by biosysadmin (Deacon)
on Aug 10, 2003 at 22:47 UTC ( #282694=perlquestion )
biosysadmin has asked for the wisdom of the Perl Monks concerning the following question:

Right now I'm revising a script that has to do the following:
  • iterate over a list of ~600 files
  • uncompress each of the files using gzip
  • reformat each uncompressed file using a proprietary program over which I have no control
  • recompress each file so that our disk doesn't run out of space
Right now, I'm doing this all sequentially, one step after another and man is it SLOW. There are lots of ways to optimize this, the easiest and most obvious of which is to 'use Thread'. However, my Perl wasn't compiled with Thread support, and I'm reluctant to replace it on my system when so much of what we do depends on Perl. :(

A few questions that I have:
  • Should I even try to optimize this without using Thread? Any ideas on how to do this?
  • Is recompiling Perl with threading support that big of a deal?
  • How usable is Perl threading for a task such as this? It's not enabled in a default compile of Perl; is it a stable feature or a semi-usable hack?
Being lazy, I've left my script in its current state until I've had time to step back and really look at it. I've thought about using some system("gunzip $myfile &") style calls, but I can't figure out how to make it synchronize the way I want without some unnecessary assumptions about file size and how it relates to compression/uncompression speed.

Also, I'm a long time reader of PM, and first time poster, I just made a small donation to the cause. Much thanks to the Perl Monks community for making this such a valuable resource. :)

Replies are listed 'Best First'.
Re: Best Practices for Uncompressing/Recompressing Files?
by BrowserUk (Pope) on Aug 11, 2003 at 01:58 UTC

    With 4 cpus, there ought to be some benefit available from parallelising the process, but I doubt you would see any benefit from using threads rather than processes for this. Threads only really come into their own if there is a need to share data. If your reformatter could accept its input from a ram buffer, threads might make sense, but when the interchange medium has to be disk, processes will serve you better.

    You say that the process seems to be cpu-bound "whilst decompressing"; is that on just 1 of the cpus?

    You indicate that there are 600 files and a total uncompressed size of 120GB. That implies a file size of around 200 MB? If this has to run on a single disc, I think I would sacrifice 0.5 - 1.0 GB of my RAM to a RAM drive. I would then use:

    1. One process to unzip the files onto the RAM drive.
    2. The second process to do the re-formatting
    3. A third process to zip the re-formatted file back to the hard drive.

    By using a RAM disc to store the intermediate files, you should reduce the competition for the one drive.

    If the re-formatting process is slow, then a second process performing that function might help, but that's a suck-it-and-see test.

    By splitting the overall task into 3, you stand the best chance of overlapping the cpu-intensive parts with the IO-bound parts. Have each of the processes controlled by watching the RAM drive.

    1. The first process would decompress two files onto the RAM drive and then wait for one to disappear before it started on the third.
    2. The second (and maybe third) process(es) would wait for a decompressed file to appear and then run the formatting process on it, deleting the input file once it is done.
    3. The last process, waits for the reformatted file to appear and zips it back to the harddrive.

    This means that you have 3 files on the RAM drive at a time: two waiting to be re-formatted, one waiting to be zipped. It also means that each stage is event driven and self-limiting, giving the best chance of extracting the maximum throughput.
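
    The self-limiting first stage described above could be sketched roughly as follows, using only core Perl. The names count_files and wait_for_slot, the polling interval and the ".part" suffix for files that are still being written are all assumptions made for illustration; none of them come from the original suggestion.

```perl
use strict;
use warnings;

# Count the finished files in $dir. Files still being written are assumed
# to carry a ".part" suffix (a hypothetical convention) and are ignored.
sub count_files {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my @ready = grep { -f "$dir/$_" && !/\.part\z/ } readdir $dh;
    closedir $dh;
    return scalar @ready;
}

# Block until $dir holds fewer than $max finished files, checking again
# every $poll seconds (default 1).
sub wait_for_slot {
    my ($dir, $max, $poll) = @_;
    $poll = 1 unless defined $poll;
    sleep $poll while count_files($dir) >= $max;
    return;
}
```

    The decompressing process would call wait_for_slot($ramdir, 2) before starting each file, write to "name.part" and rename it to "name" once complete, so the next stage never picks up a half-written file. Giving each stage its own directory on the RAM drive keeps the counts independent.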

    Just throwing lots of threads or processes at it, especially if those processes are all doing the complete task, is unlikely to benefit you as you would have no way of controlling them in any meaningful way. The chances are that each of your threads would end up hitting the disk at the same time, slowing the io-bound parts, and more than 1 cpu-intensive process/thread per cpu will slow things down with context switching.

    This kind of assumes that the box will be dedicated to this task whilst it is running. It also makes a lot of (hopefully not too wild) assumptions about your set up and processing.

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
    If I understand your problem, I can solve it! Of course, the same can be said for you.

      Most of this I agree with, except where to split the process boundary: I would make the whole task a separate process, forked out up to 4 at a time in parallel. His CPU throttling is happening at the time of compress and decompress. As far as storing the uncompressed files in RAM goes, I think that the RAM is more useful in this case for the zlib/gzip operation, which tends to slow down a lot when starved. Of course, who knows about the disk IO =) he could have just an array of JBOD striped on multiple interfaces =)...

        Just felt like adding this after re-reading the thread: Sorry about not being specific about the setup that I'm working with, but I actually have 2 separate drives:
        • A SCSI RAID array which is the source and eventual destination of the files
        • A single SCSI drive that holds the files temporarily.
        So, the files start on drive 1, go to drive 2, and are reformatted back to drive 1.
Re: Best Practices for Uncompressing/Recompressing Files?
by PodMaster (Abbot) on Aug 10, 2003 at 23:37 UTC
      Given the recent questions re parallelism (see also Forking Issues) is it time for a tutorial on the topic? It could cover the use of Parallel::ForkManager to 'parallelize' these types of problems. Is a more experienced monk prepared to take the job on?


Re: Best Practices for Uncompressing/Recompressing Files?
by liz (Monsignor) on Aug 11, 2003 at 00:26 UTC
    ...Right now, I'm doing this all sequentially, one step after another and man is it SLOW...

    If it's slow because the disks can't go faster (indicated by using less than near 100% of the CPU), then parallelizing may not be a good idea. This will cause the heads on the disk to have to do still more (very slow) seeks than they already have to do.

    The only way to speed up things in this case, is to cause fewer head movements. If the proprietary program can read from STDIN, or write to STDOUT, then a (de-)compressing pipe seems like a good idea. If you have more than one physical disk, try storing the uncompressed file on a disk different from where the compressed file is located.

    If possible, add more RAM to the machine (if you can't add a new disk). Having disk blocks in memory reads much faster than having to re-read from disk (just after having been extracted).

    If it's slow because of the CPU being used 100%, then you have a problem that cannot be solved apart from throwing more CPU at it.


    Fixed typo. Indeed, we don't want paralyzed disks ;-)

      I want to second this:
      The only way to speed up things in this case, is to cause fewer head movements. If the proprietary program can read from STDIN, or write to STDOUT, then a (de-)compressing pipe seems like a good idea. If you have more than one physical disk, try storing the uncompressed file on a disk different from where the compressed file is located.
      Note that you should not decompress the files in place because you then have to recompress them. You can, e.g. "gunzip <file.gz >/tmp/file"
        Ack, bad idea. He's running on Solaris, where /tmp is a virtual filesystem that dumps to RAM/swap. Switching one evil (disk I/O) for another (memory starvation and swapping out) is not a great idea. As noted in one of his replies above, his CPU is near 100% during the run, so it looks like CPU contention.

      ... parallizing may not be a good idea

      I have to agree, if there's one thing you don't want to do, that is to paralyse your disks! :)


Re: Best Practices for Uncompressing/Recompressing Files?
by dws (Chancellor) on Aug 10, 2003 at 23:42 UTC
    ... so that our disk doesn't run out of space

    Given the amount of time you're pouring into this, and the cost of that time, both direct and indirect, I would think a business case could be made for buying more disk.

Re: Best Practices for Uncompressing/Recompressing Files?
by waswas-fng (Curate) on Aug 10, 2003 at 23:47 UTC
    You can always install a thread-enabled version of perl in a non-default location just for this script. Or you can fork, as is stated above. The question is what platform and how many CPUs is this running on? If it is decompressing and compressing, there may be a CPU bottleneck to begin with (hence it is slow), so threading or forking may make performance worse if the threads are fighting for time on the same CPU. Forced context switches can play not-so-fun games with your speed. If you have enough CPUs on the box then fork or thread; if not, consider taking dws's advice and make a business case for more disk space.

      I definitely could install a threaded version of Perl in a non-standard location, not a bad idea.

      It's running on a Sun Enterprise 450 server with 4 CPU's and 4 gigs of RAM, which makes me think that parallelizing could give me great performance gains. When uncompressing files, the CPU usage is always at 99-100% as viewed by top, so the operation appears to be cpu limited.

      As far as disk space goes, my lab is a Bioinformatics lab, and I'm installing more disk space this week. :) Unfortunately, I can't use it as temporary space for this project (I would need about 120gigs to uncompress all of the files). :(

      I'm thinking that I'll just try Parallel::ForkManager; if I can find the time while on vacation next week I may even write up a tutorial on the subject. Thanks for the tips everyone. :)
        I think you will find that the fork method on that box will give you better performance than threading. If you have sole access to that box while this is running, limit your fork to 4 processes; any more and you will see diminishing returns as the CPUs will need to context switch between the uncompress processes. If you end up using gzip, you may want to look at a GNU version; I have seen 20% speed-ups over the vanilla Sun binary. You may also want to investigate the difference the different compress flags make on your files: if these are DNA tag files, a small dictionary with low compression may buy you the same compression ratio (or close) with way less CPU time.
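
        For anyone who wants to try the "at most 4 processes" approach without installing anything, here is a minimal sketch using only core fork and wait. It is a stand-in for Parallel::ForkManager, not that module's API; the name run_limited and the callback interface are assumptions made for illustration.

```perl
use strict;
use warnings;
use POSIX ();

# Run $worker->($item) in a child process for every item, keeping at most
# $max_children alive at once. Returns the number of children that failed.
sub run_limited {
    my ($max_children, $worker, @items) = @_;
    my ($running, $failed) = (0, 0);

    for my $item (@items) {
        # All slots busy: wait for any one child to finish first.
        if ($running >= $max_children) {
            wait();
            $running--;
            $failed++ if $? != 0;
        }

        my $pid = fork();
        die "fork failed: $!" unless defined $pid;

        if ($pid == 0) {
            # Child: do the work, report success or failure via exit code.
            my $ok = eval { $worker->($item); 1 };
            warn $@ unless $ok;
            # _exit skips END blocks and destructors inherited from the parent.
            POSIX::_exit($ok ? 0 : 1);
        }
        $running++;
    }

    # Reap the children that are still running.
    while ($running > 0) {
        wait();
        $running--;
        $failed++ if $? != 0;
    }
    return $failed;
}
```

        A call such as run_limited(4, \&process_one_file, @files), where process_one_file is your own (hypothetical) sub that uncompresses, reformats and recompresses a single file, keeps four files in flight on a 4-CPU box. Because the children leave through POSIX::_exit, anything they print to a buffered handle should be flushed or closed explicitly inside the worker.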

Re: Best Practices for Uncompressing/Recompressing Files?
by mpd (Monk) on Aug 10, 2003 at 23:40 UTC
    In addition, how are you actually doing this process? I've had success with IO::Zlib in the past, which should help prevent you from having to use system().
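
    As a rough illustration of doing the gzip steps from inside Perl rather than through system(), here is a sketch built on IO::Zlib. The sub names gunzip_file and gzip_file and the 64 KB block size are assumptions for the example; the proprietary reformatter would still have to be run as an external program between the two calls.

```perl
use strict;
use warnings;
use IO::Zlib;

my $BLOCK = 64 * 1024;    # bytes per read; an arbitrary choice

# Decompress $gz into the plain file $plain.
sub gunzip_file {
    my ($gz, $plain) = @_;
    my $in = IO::Zlib->new($gz, 'rb') or die "Cannot read $gz";
    open(my $out, '>', $plain) or die "Cannot write $plain: $!";
    binmode $out;
    while (1) {
        my $buf = '';
        my $n = $in->read($buf, $BLOCK);
        die "Read error on $gz" if !defined $n || $n < 0;
        last if $n == 0;    # end of compressed stream
        print {$out} $buf;
    }
    $in->close;
    close($out) or die "Cannot close $plain: $!";
    return;
}

# Compress the plain file $plain into $gz.
sub gzip_file {
    my ($plain, $gz) = @_;
    open(my $in, '<', $plain) or die "Cannot read $plain: $!";
    binmode $in;
    my $out = IO::Zlib->new($gz, 'wb') or die "Cannot write $gz";
    my $buf;
    while (read($in, $buf, $BLOCK)) {
        $out->print($buf);
    }
    close $in;
    $out->close;
    return;
}
```

    Whether this beats an external gzip is something to benchmark on your own files; the main gain is error handling and not having to build shell command strings.
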
Re: Best Practices for Uncompressing/Recompressing Files?
by fglock (Vicar) on Aug 10, 2003 at 23:06 UTC

    If you can't use threads, there is an easy way to parallelize: you could split the list in small batches, and then call the script on each batch.
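
    Splitting the list is only a few lines of Perl. A minimal sketch follows; the sub name batches is made up for the example, and how you launch the script on each batch is left to you.

```perl
use strict;
use warnings;

# Split @items into consecutive batches of at most $size elements each.
# Returns a list of array references.
sub batches {
    my ($size, @items) = @_;
    die "Batch size must be positive" unless $size > 0;
    my @batches;
    push @batches, [ splice(@items, 0, $size) ] while @items;
    return @batches;
}
```

    With roughly 600 files and 4 CPUs, batches(150, @files) gives four lists, each of which can be handed to its own copy of the script. Fixed batches can finish at very different times if file sizes vary a lot, so smaller batches handed out as workers become free may balance better.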

Re: Best Practices for Uncompressing/Recompressing Files?
by sgifford (Prior) on Aug 11, 2003 at 07:13 UTC

    Here's what I would do.

    First, write your program to take a list of filenames on the command-line, process all of them, then exit. This gives you maximum flexibility in calling the script.

    Second, install GNU xargs and find if you don't already have them. Linux/BSD will come with these; commercial Unices won't.

    Now you have everything you need to parallelize this process. Simply use the -P flag to GNU xargs:

      find . -type f -print0 |xargs -0 -P 4
    will start up to 4 copies of your program in parallel, feeding each of them as long a list of files as will work. When one batch finishes, another will be started with another batch.
      find . -type f -print0 |xargs -0 -n 1 -P 6
    will start up 6 copies of your program in parallel, processing one file each. When one copy finishes, the next will be started. You can vary this process and experiment by writing the file list to another file, then processing chunks of this. If your filenames don't have spaces in them, you can use simple tools like split, head, and tail to do this; otherwise you'll have to write short Perl scripts to deal with a null-terminated list of files.

    I would also consider using pipes and/or Compress::Zlib to minimize disk I/O. If you're decompressing to a temp file, then converting this and writing to another file, then compressing the written file, you're effectively writing the file to disk twice uncompressed, and once compressed. Further, while the blocks should mostly be in your buffer cache so not actually read from disk, the different copies of the file are wasting memory with multiple copies of the same file. If you could turn this into something like:

      gunzip -c <file.gz |converter |gzip -c >newfile.gz
      mv newfile.gz file.gz
    you would only write the file to disk once compressed, and never uncompressed. This should save you tons of I/O and buffer cache memory (although, as always, YMMV and you should benchmark to see for sure).
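
    If you would rather drive that pipeline from the Perl script itself, something along these lines should work, assuming (as the pipeline above already does) that the converter reads STDIN and writes STDOUT. The sub name convert_in_place and the temporary file naming are assumptions for the sketch.

```perl
use strict;
use warnings;

# Stream $gz through $converter and replace $gz with the recompressed result.
# $converter is the name or path of a program that filters STDIN to STDOUT.
sub convert_in_place {
    my ($gz, $converter) = @_;
    my $tmp = "$gz.new.$$";

    # The file names are passed to sh as positional parameters ($1, $2, $3),
    # so names containing spaces or quotes need no escaping.
    # Caveat: the status reflects only the last command in the pipeline
    # (gzip); a failing converter can go unnoticed with plain sh.
    my $status = system('sh', '-c',
        'gunzip -c < "$1" | "$2" | gzip -c > "$3"',
        'sh', $gz, $converter, $tmp);

    if ($status != 0) {
        unlink $tmp;
        die "Pipeline failed for $gz (status $status)";
    }
    rename($tmp, $gz) or die "Cannot rename $tmp to $gz: $!";
    return;
}
```

    Because the original is only replaced after the pipeline returns success, an interrupted run leaves the old file.gz intact. Given the caveat about the exit status, it is worth sanity-checking the converter's output some other way before trusting the replacement.
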
      Just for the record, Solaris native xargs supports -P and -0. Solaris also comes with find.


        Huh. I just scanned the manpage for Solaris 8 find and xargs, and they don't mention this. These arguments also give errors when I try them from the command-line:

        bash-2.04$ uname -a
        SunOS 5.8 Generic_108528-18 sun4u sparc SUNW,UltraAX-e2
        bash-2.04$ find . -print0
        find: bad option -print0
        find: path-list predicate-list
        bash-2.04$ find . -print |xargs -P 4 echo
        xargs: illegal option -- P
        xargs: Usage: xargs: [-t] [-p] [-e[eofstr]] [-E eofstr] [-I replstr] [-i[replstr]] [-L #] [-l[#]] [-n # [-x]] [-s size] [cmd [args ...]]
        bash-2.04$
        bash-2.04$ find . -print |xargs -0 echo
        xargs: illegal option -- 0
        xargs: Usage: xargs: [-t] [-p] [-e[eofstr]] [-E eofstr] [-I replstr] [-i[replstr]] [-L #] [-l[#]] [-n # [-x]] [-s size] [cmd [args ...]]
        bash-2.04$
        bash-2.04$ which find
        /usr/bin/find
        bash-2.04$ which xargs
        /usr/bin/xargs

        Perhaps the GNU versions are provided in later versions of Solaris?

Re: Best Practices for Uncompressing/Recompressing Files?
by bart (Canon) on Aug 11, 2003 at 13:04 UTC
    To be honest, unless you can spread the total processing over multiple CPUs, I don't think parallelizing the script is going to speed this program up one single bit. On the contrary.
Re: Best Practices for Uncompressing/Recompressing Files?
by coreolyn (Parson) on Aug 11, 2003 at 14:18 UTC

    Just to make sure the obvious wasn't overlooked: make sure you are running gzip silently. The overhead created by STDOUT on the zip/unzip process is significant.

      Make sure you are running gzip silently. The overhead created by STDOUT on the zip/unzip process is significant

      Uhh . . . huh?

      Even with --verbose, gzip only prints one line for every file it processes. That's hardly what I'd call significant. Especially not in a case like this where each file averages 200M of uncompressed data!

      "My two cents aren't worth a dime.";

      Unfortunately I don't have time to write a benchmark and can only speak from experience. On one large tar & zip routine that we run here in production, we shaved 15 seconds off the operation just by going silent.

Re: Best Practices for Uncompressing/Recompressing Files?
by husker (Chaplain) on Aug 11, 2003 at 14:21 UTC
    The best practice is to not recompress. As someone else mentioned, your quickest / easiest / cheapest fix is to not alter your machine or your Perl environment, but your algorithm. Re-compressing your file is an unnecessary waste. At worst, uncompress a COPY of the compressed file, process it, then throw it away. Do it in /tmp, or not in /tmp, whatever works best for you. Just don't recompress everything.

    After that, you must know your system bottleneck before you can decide what else to change. Threading may or may not be a good idea.

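    BTW, the "system" call waits only for the command it launches. With a trailing "&" that command is the shell, which returns as soon as gunzip is in the background, so you are left with nothing to synchronize on and your idea won't work as written. But, if you are not recompressing as the last step, it doesn't really matter. :)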

Re: Best Practices for Uncompressing/Recompressing Files?
by jonadab (Parson) on Aug 12, 2003 at 00:51 UTC

    Out of curiosity, are you uncompressing all the files first, then processing them all, then compressing them all, or are you taking each file and uncompressing, processing, and recompressing it before moving on to the next file?

    That may or may not make a difference for performance, but it will absolutely make a difference for the disk space used while your script runs.

    $;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}} split//,".rekcah lreP rehtona tsuJ";$\=$ ;->();print$/
      Definitely... I took these steps for my own situation... instead of uncompressing, for which there is no such thing... I utilized disc decompression.

Node Type: perlquestion [id://282694]
Approved by Paladin
Front-paged by derby