
Re^4: Handling very big gz.Z files

by mbethke (Hermit)
on Feb 07, 2013 at 16:08 UTC ( id://1017686 )

in reply to Re^3: Handling very big gz.Z files
in thread Handling very big gz.Z files

Yup, "gz.Z" is strange indeed, although I don't think the extra compress would gain anything :)

Compression is even more interesting on the huge machines we have nowadays than it used to be: someone found it's usually faster to compress memory pages that would be "swapped" and keep them in RAM than to write them to disk. The same goes for almost anything else disk-based, since CPU speed has grown much faster than disk speed.

The BNC the OP is dealing with has 100 million word forms and would fit in memory on most machines, but meanwhile Google has raised the bar to a trillion word forms. They don't distribute that as text, but even their n-gram lists are 24 GB gzipped. If your HD sustains 100 MB/s, that's 4 minutes just to read the compressed data into memory, or 8 minutes if it's twice the size uncompressed. On a single core I can zcat at 154 MB/s, so it's simply faster to keep the data gzipped and unzip it on the fly. Unzipping to a tempfile and reading that back is much slower on all but the fastest SSDs.
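For the record, a minimal sketch of the unzip-on-the-fly idiom in Perl, using a list-form pipe open. The file name corpus.txt.gz is hypothetical; gzip -dc is equivalent to zcat and a bit more portable across platforms:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count whitespace-separated word forms in a gzipped file, streaming
# through a pipe so the uncompressed data never touches the disk.
sub count_word_forms {
    my ($file) = @_;
    # List-form open avoids the shell; gzip -dc behaves like zcat.
    open(my $fh, '-|', 'gzip', '-dc', $file)
        or die "Cannot spawn gzip: $!";
    my $words = 0;
    while (my $line = <$fh>) {
        my @tokens = split ' ', $line;   # split ' ' skips leading whitespace
        $words += @tokens;
    }
    close $fh or die "gzip failed on $file: $!";
    return $words;
}

# Hypothetical input file for illustration.
print count_word_forms('corpus.txt.gz'), "\n" if -e 'corpus.txt.gz';
```

A nice side effect of the pipe is that gzip runs as a separate process, so on a multicore machine the decompression overlaps with whatever the Perl loop is doing.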
