http://www.perlmonks.org?node_id=902713

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Esteemed monks

Please forgive my ignorance as I am a Perl beginner. I have a file which is a bgzip file. I would like to read the contents of this file in a text editor. However, when I unzip the file (using the bgzip -d option for decompression) the file is not readable. Am I missing some crucial piece of information? How do I read these files? Do I have to do something like read them into Perl (or another programming language) and convert them manually? If this is the case, please could you advise how I do this or provide pointers in the right direction (as I'm not trying to get anyone to do my work for me).

Thank you very much.

Replies are listed 'Best First'.
Re: (OT) help reading a bgzip file
by Mr. Muskrat (Canon) on May 03, 2011 at 17:26 UTC
Re: (OT) help reading a bgzip file
by Utilitarian (Vicar) on May 03, 2011 at 13:52 UTC
    Without more info it's hard to say.
    What does file unzipped_file say the resulting data format is?

    PS: I assume you mean bzip2 -d rather than bgzip -d ;) which on most systems can also be invoked as bunzip2

    print "Good ",qw(night morning afternoon evening)[(localtime)[2]/6]," fellow monks."
      Hi. No, I don't mean bzip2. This program is called bgzip. I have never used it before; I just know it is a compression/decompression tool. This is its page on SourceForge: sourceforge.net/projects/bgzip/. This is the file info. The test.txt.gz file is the file created by 'bgzipping' test.txt
      bgzip test.txt
      file test.txt.gz
      test.txt.gz: gzip compressed data, extra field
        Ah, OK, new one on me - anyway, what does the decompressed file that won't load into your editor say it is when you run file on it?

        print "Good ",qw(night morning afternoon evening)[(localtime)[2]/6]," fellow monks."

        Just thought I'd add a note here since this is popping up fairly high on Google searches for bgzip. Bgzip uses the BGZF format, which is a fully backward-compatible but application-specific extension of gzip. In other words, you can unzip a bgzipped file with gunzip, but you can't create one with gzip.
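
        Since the original question asked about reading such a file from a program, here is a minimal sketch in Python (standard library only) of what "unzip with gunzip" looks like in code. It assumes the payload is plain text; the function name read_bgzf_lines and the demo data are made up for illustration. Two concatenated gzip members stand in for a bgzipped file here, on the assumption that a gzip reader treats BGZF's series of blocks the same way.

```python
import gzip
import io


def read_bgzf_lines(source):
    """Yield text lines from a bgzipped (or ordinary gzip) file.

    `source` may be a file name or a binary file object. The gzip
    module keeps reading across concatenated gzip members, which is
    how a BGZF file looks to a plain gzip reader.
    """
    with gzip.open(source, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            yield line.rstrip("\n")


# Hypothetical demo data: two gzip members back to back.
demo = io.BytesIO(gzip.compress(b"chr1\t100\n") + gzip.compress(b"chr2\t200\n"))
for line in read_bgzf_lines(demo):
    print(line)
```

        If the decompressed content still looks like garbage in an editor, the data inside is probably binary (BAM, for example) rather than text, and no amount of unzipping will make it readable.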

        What bgzip adds is block-level compression. You can use the library to compress and uncompress input data in blocks, which provides a level of random access to the compressed file. The format was developed by Bob Handsaker of the Broad Institute for use in genomics/bioinformatics applications. It has been modified and used by Bob and Heng Li (also currently at the Broad) in next-generation sequence alignment and sequence variant analysis tools developed as part of the 1000 Genomes Project. Applications such as the BAM file format, samtools, and tabix use bgzip/BGZF to compress sequence alignment and sequence variant files and allow rapid random access to the data compressed within those files.
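
        To make the block idea concrete, here is a rough Python sketch (standard library only) based on the block layout described in the SAM specification linked below: each block is a gzip member whose extra field carries a 'BC' subfield holding the total block size minus one. The function names are invented for this example, and this is not the code bgzip itself uses. It builds a couple of tiny blocks, finds where each one starts, and decompresses a single block without touching the others.

```python
import gzip
import struct
import zlib


def make_bgzf_block(data):
    """Build one BGZF-style block from a small payload (illustration only)."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate, no wrapper
    cdata = comp.compress(data) + comp.flush()
    # 12-byte gzip header + 6-byte extra field + payload + 8-byte trailer.
    bsize = 12 + 6 + len(cdata) + 8
    if bsize - 1 > 0xFFFF:
        raise ValueError("payload too large for a single block")
    header = struct.pack("<BBBBIBBH", 0x1F, 0x8B, 8, 4, 0, 0, 0xFF, 6)
    extra = struct.pack("<BBHH", ord("B"), ord("C"), 2, bsize - 1)
    trailer = struct.pack("<II", zlib.crc32(data) & 0xFFFFFFFF, len(data))
    return header + extra + cdata + trailer


def block_offsets(buf):
    """Return the byte offset at which each block starts."""
    offsets = []
    pos = 0
    while pos < len(buf):
        if buf[pos:pos + 2] != b"\x1f\x8b":
            raise ValueError("no gzip magic at offset %d" % pos)
        xlen = struct.unpack_from("<H", buf, pos + 10)[0]
        sub, end, bsize = pos + 12, pos + 12 + xlen, None
        while sub < end:  # walk the extra subfields looking for 'BC'
            si1, si2, slen = struct.unpack_from("<BBH", buf, sub)
            if (si1, si2) == (66, 67) and slen == 2:
                bsize = struct.unpack_from("<H", buf, sub + 4)[0] + 1
            sub += 4 + slen
        if bsize is None:
            raise ValueError("gzip member without a BC subfield")
        offsets.append(pos)
        pos += bsize
    return offsets


def read_block(buf, offset):
    """Decompress only the block that starts at `offset`."""
    member = zlib.decompressobj(31)  # 31 = expect a gzip wrapper
    return member.decompress(buf[offset:])


def decompress_all(buf):
    """What gunzip would do: decompress every block in order."""
    return gzip.decompress(buf)
```

        A tool that wants random access only has to remember which block offset holds which records and can then jump straight to that block; presumably this is the kind of bookkeeping an index such as tabix's provides, though the details are in the linked documentation rather than in this sketch.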

        There are Perl libraries that provide an API to BAM files and to the tabix library.

        http://search.cpan.org/~lds/Bio-SamTools-1.33/lib/Bio/DB/Bam/Alignment.pm
        http://samtools.sourceforge.net/tabix.shtml

        For more information see:

        http://samtools.sourceforge.net/
        http://samtools.sourceforge.net/SAM1.pdf

        The SeqAnswers.com forum would be a good place for questions about the format and its applications, as the authors and many users are active there.