PerlMonks  

Re: Handling very big gz.Z files

by flexvault (Parson)
on Feb 06, 2013 at 16:43 UTC (#1017468)


in reply to Handling very big gz.Z files

Welcome albascura,

Paging through the whole file with 'more' will take a long time before you can see whether the format is correct at the end of the file. I don't use 'more' myself (I use 'pg' to display a page at a time), so 'more' may have an option to jump to the end of the file, but I would do the following:

gunzip -c bnc.xml.gz.Z > vi.testlist               # I use 'vi.' for any temp lists
tail -100 vi.testlist | more                       # Check if end of file is correct
cat vi.testlist | perl provalong.pl testlist.txt
rm vi.testlist                                     # clean-up

If it works, then you can try the original, and if that doesn't work, you at least have a temporary workaround until you find the specific problem.
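
A minimal sketch of the same check without the temporary file, assuming a POSIX shell with gzip installed. The sample file and its contents are made up so the commands run as written; substitute the real archive name when trying it on the OP's data. 'gzip -t' reads the whole archive and verifies it, and piping 'gunzip -c' into 'tail' shows the end of the data without writing it to disk.

```shell
#!/bin/sh
# Build a tiny stand-in archive (hypothetical content, not the real BNC data).
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.xml.gz"
printf '<doc>\n<w>one</w>\n<w>two</w>\n</doc>\n' | gzip -c > "$sample"

# gzip -t decompresses the whole archive in memory and checks its integrity;
# it writes no output file and returns non-zero if the archive is damaged.
if gzip -t "$sample"; then
    archive_status="ok"
else
    archive_status="damaged"
fi
echo "archive: $archive_status"

# Look at the end of the decompressed data straight from the pipe.
last_line=$(gunzip -c "$sample" | tail -n 1)
echo "last line: $last_line"

rm -rf "$tmpdir"   # clean-up
```

On a very large file both steps still have to read the entire archive, so this saves disk space and one write pass rather than the decompression time itself.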

Note: My background is AIX, where '.Z' is used by the 'compress/uncompress' system commands and '.gz' by the 'gzip/gunzip' system commands. Are you sure that the file wasn't created that way? 'compress' gets an additional 10% compression over 'gzip', and when disk drives were small that was a big deal. Today it is not worth the CPU cycles.
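
One way to find out how a file was really compressed, whatever its suffix says, is to look at its first two bytes: gzip files start with 1f 8b and 'compress' files with 1f 9d. The sketch below assumes a POSIX shell with 'od' and gzip available; 'detect_format' is a hypothetical helper name and the sample file is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: report the compression format from the two magic bytes.
detect_format() {
    # od prints the first two bytes as hex; tr strips the padding whitespace.
    magic=$(od -An -tx1 -N2 "$1" | tr -d '[:space:]')
    case "$magic" in
        1f8b) echo gzip ;;
        1f9d) echo compress ;;
        *)    echo unknown ;;
    esac
}

# Made-up sample so the sketch runs as-is; point it at the real file instead.
tmpfile=$(mktemp)
printf 'hello\n' | gzip -c > "$tmpfile"

fmt=$(detect_format "$tmpfile")
echo "format: $fmt"

rm -f "$tmpfile"   # clean-up
```

If the outer layer turns out to be 'compress', the check can be repeated on the decompressed stream to see whether a gzip layer sits underneath.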

Good Luck...Ed

"Well done is better than well said." - Benjamin Franklin


Re^2: Handling very big gz.Z files
by mbethke (Hermit) on Feb 07, 2013 at 05:16 UTC
    My background is AIX, where '.Z' is used by the 'compress/uncompress' system commands and '.gz' by the 'gzip/gunzip' system commands. Are you sure that the file wasn't created that way? 'compress' gets an additional 10% compression over 'gzip', and when disk drives were small that was a big deal. Today it is not worth the CPU cycles.

    OT: I've yet to see the file that compress crunches to a smaller size than gzip. Actually I thought for a long time (before I heard of the patents) that everyone had ditched compress for gzip because compress sucks so badly in comparison. Today, people burn a lot more CPU cycles using lzma, xz & Co. for a much better compression than either.

      mbethke,

      I think we agree!

      What I was referring to is that 'gzip' does a great job of compressing text, and the result is a binary file. That file can then be compressed further by 'compress'. But I haven't done that since the RT or early RS/6000 days. I don't even know if 'compress' exists on AIX 6.1 or 7.1 (my in-house box with AIX 5.2 has it), but I found it "funny" to see the ".gz.Z" and remembered when it was done. I pointed it out in case the file was being created differently than the OP thought.

      Just last week I fired up a Debian AMD box with 8 cores and four 2 TB drives.

      Why bother with compression!

      Regards...Ed

      "Well done is better than well said." - Benjamin Franklin

        Yup, "gz.Z" is strange indeed, although I don't think the extra compress would gain anything :)

        Compression is even more interesting on the huge machines we have nowadays than it was before, since someone found it's usually faster to compress memory that would otherwise be swapped and keep it in RAM than to write it to disk. The same goes for anything else disk-based, as CPU speed has grown much faster than disk speed. The BNC the OP is dealing with has 100 million word forms and would fit in memory on most machines, but meanwhile Google has raised the bar to a trillion word forms. They don't distribute that as text, but even their n-gram lists are 24 GB gzipped. If your HD sustains 100 MB/s, that's 4 minutes just to read it into memory, or 8 if it's twice the size uncompressed. But on a single core I can zcat at 154 MB/s, so it's simply faster to keep the stuff gzipped and unzip on the fly. Unzipping to a tempfile and reading that back is much slower on all but the fastest SSDs.
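
The read-time figures above can be reproduced with a bit of shell arithmetic, followed by a sketch of the "unzip on the fly" pattern. The assumptions are mine: 24 GB is taken as 24000 MB, the variable names are made up, and 'wc -l' stands in for whatever program actually consumes the data (the OP's Perl script, for instance).

```shell
#!/bin/sh
# Disk read time at a sustained 100 MB/s, using the sizes quoted in the post.
size_mb=24000          # 24 GB gzipped, taken as 24000 MB (assumption)
disk_mb_per_s=100

compressed_min=$(( size_mb / disk_mb_per_s / 60 ))
uncompressed_min=$(( 2 * size_mb / disk_mb_per_s / 60 ))   # twice the size
echo "read gzipped:      ${compressed_min} min"
echo "read uncompressed: ${uncompressed_min} min"

# Unzip on the fly: the consumer reads the decompressed stream from a pipe,
# so no temporary file is written or read back. The sample data is made up.
tmpfile=$(mktemp)
printf 'a b\nc d\ne f\n' | gzip -c > "$tmpfile"

line_count=$(gunzip -c "$tmpfile" | wc -l | tr -d '[:space:]')
echo "lines seen by consumer: $line_count"

rm -f "$tmpfile"   # clean-up
```

Whether the pipe wins in practice depends on how the decompression rate compares with the disk's sustained rate on the machine at hand, so it is worth timing both variants.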
