PerlMonks  

index for large text file

by cafeblue (Novice)
on Mar 28, 2011 at 05:44 UTC (#895841=perlquestion)
cafeblue has asked for the wisdom of the Perl Monks concerning the following question:

I have large text files (5G-30G), like below:
    @HWUSI-EAS1734_0032_FC620F7AAXX:5:1:18184:1176#CGATGT/1
    GGATTTCTCGTGGANACCATTTGTTGGTCAANNNNNNNNNNGTGTTNGNCTTCANNGNNATTGAAAATGNTCATTCGTGGCTATTTTCGCNNNNNATNNNN
    +HWUSI-EAS1734_0032_FC620F7AAXX:5:1:18184:1176#CGATGT/1
    gggfggggfgeeecB```^]gffgegadcgBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
    @HWUSI-EAS1734_0032_FC620F7AAXX:5:1:1934:1185#CGATGT/1
    GTCATCCTTAATTANCGTATGTGCTCTTCCTNCNNNNNNNNGCTGCTANTTATTTCTNNGCAGCTTTGCTCTTATTAGTTACGAACATGCCNNNNTANNNN
    +HWUSI-EAS1734_0032_FC620F7AAXX:5:1:1934:1185#CGATGT/1
    acdad`^ddd^aa^B_\VZZfcfccaffBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
    ..........
Every 4 lines form a block, and all the blocks are alike. All the even lines are the same length. I need to extract random lines from these files, so I build index files for these large text files, with a script like below:
    if (-e "$ARGV[0].idx") {
        open (INDEXFQ1, "$ARGV[0].idx") or die $!;
    }
    else {
        open (INDEXFQ1, "+>$ARGV[1].idx") or die $!;
        build_index(*FQ1, *INDEXFQ1);
    }
The problem is that whenever I print lines with large line numbers, the output is defective. The print code is like below:
print OQ10_1 line_with_index(*FQ1, *INDEXFQ1, $line);
There is no error message, but the lines with large line numbers come out defective, like below:
    741:20058#ATCACG/1
    GTTCGTGAGAGCTCTAGGTTGTCGTCTCCCAGTCAACTATGGTCGCTGTAACGCGCTGACTT
    41:20058#ATCACG/1
    dgggg_ddadbaggedbXdd]^[UVYX]XR_BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
Can anybody help? Thank you! Sorry, here are the two subs, build_index and line_with_index:
    sub build_index {
        my $data_file  = shift;
        my $index_file = shift;
        my $offset     = 0;
        while (<$data_file>) {
            print $index_file pack("N", $offset);
            $offset = tell($data_file);
        }
    }

    sub line_with_index {
        my $data_file   = shift;
        my $index_file  = shift;
        my $line_number = shift;
        my $size;
        my $i_offset;
        my $entry;
        my $d_offset;
        $size = length(pack("N", 0));
        $i_offset = $size * ($line_number-1);
        seek($index_file, $i_offset, 0) or return;
        read($index_file, $entry, $size);
        $d_offset = unpack("N", $entry);
        seek($data_file, $d_offset, 0);
        return scalar(<$data_file>);
    }

Re: index for large text file
by moritz (Cardinal) on Mar 28, 2011 at 06:25 UTC
    When all the blocks are of the same length, you don't even need to build an index.
    # the usual precautions
    use strict;
    use warnings;
    # automatically die when open() or so fails
    use autodie;

    my $filename = shift @ARGV;
    open my $handle, '<', $filename;

    # determine the block length
    my $first_block = '';
    $first_block .= <$handle> for 1..4;
    my $block_length = length $first_block;

    sub get_line {
        my ($handle, $block_length, $line) = @_;
        # assume the file starts with line 1:
        seek $handle, $block_length * ($line - 1), 0;
        my $result;
        my $read_len = read $handle, $result, $block_length;
        if ($read_len == $block_length) {
            return $result
        }
        else {
            die "Error while reading file: expected $block_length bytes, but got $read_len instead";
        }
    }

    (untested)

      thank you for your advice!!

      But the odd lines are not all the same length, so your solution might give a wrong result. Still, you gave me good advice!

      I changed the pack parameter from "N" to "L". Maybe "L" is large enough to store the line number, while "N" is not long enough.

      thank you!

        I changed the pack parameter from "N" to "L". Maybe "L" is large enough to store the line number, while "N" is not long enough.

        Both "N" and "L" are 32-bit unsigned integers, so that's not going to make a (useful) difference.  They only differ with respect to byte order:

        • "N" — 32-bit unsigned integer in big-endian byte order
        • "V" — 32-bit unsigned integer in little-endian byte order
        • "L" — 32-bit unsigned integer in native byte order of the architecture perl is running on
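        The sizes and byte orders listed above can be checked directly. A minimal sketch, using nothing beyond core pack and unpack:

```perl
use strict;
use warnings;

# Each of these templates packs one unsigned 32-bit integer (4 bytes).
for my $template (qw(N V L)) {
    printf "%s: %d bytes\n", $template, length(pack($template, 1));
}

# "N" and "V" hold the same value and differ only in byte order.
printf "N: %s\n", unpack("H*", pack("N", 1));    # 00000001
printf "V: %s\n", unpack("H*", pack("V", 1));    # 01000000

# The largest value that survives a round trip is 2**32 - 1.
my $max32 = 2**32 - 1;
print "round trip ok\n" if unpack("N", pack("N", $max32)) == $max32;
```

        Because all three templates are 4 bytes wide, swapping one for another cannot make room for offsets in a 5G-30G file.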

        A 32-bit unsigned int can hold values up to 4294967295 (2**32 - 1, just under 4G). If you need to store larger values, you could use the "Q" template (64-bit), if your build of perl supports it.  Otherwise, or if you want to save space, you could just "add" another single byte ("C"), so you have 5 bytes / 40 bits in total, which would be able to handle indices of up to around 1 Terabyte:

        my $i = 78187493530;
        write_index($i);
        print read_index();   # 78187493530

        sub write_index {
            my $i = shift;
            open my $f, ">", "myindex" or die $!;
            my $pi = pack("CN", $i / 2**32, $i % 2**32);
            print $f $pi;   # writes 5 bytes
            close $f;
        }

        sub read_index {
            open my $f, "<", "myindex" or die $!;
            read $f, my $pi, 5;
            my ($C, $N) = unpack "CN", $pi;
            return $C * 2**32 + $N;
        }

        This works even with 32-bit-int perls, because then numbers larger than 32-bit are handled as floats internally.
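        For comparison, on a perl with 64-bit integer support the two subs from the original question could be adapted along these lines. This is only a sketch under assumptions: the names build_index64 and line_with_index64 are made up for illustration, lexical filehandles are passed instead of typeglobs, and the "Q>" template (big-endian 64-bit) needs perl 5.10 or later.

```perl
use strict;
use warnings;

# "Q>" packs an unsigned 64-bit integer in big-endian byte order.
# It raises an exception on perls built without 64-bit integer support.
my $ENTRY_SIZE = length(pack("Q>", 0));    # 8 bytes per index entry

# Write one 64-bit byte offset per line of the data file.
sub build_index64 {
    my ($data_fh, $index_fh) = @_;
    my $offset = 0;
    while (<$data_fh>) {
        print {$index_fh} pack("Q>", $offset);
        $offset = tell($data_fh);
    }
    return;
}

# Look up the offset of line $line_number (1-based) and return that line,
# or nothing if the index has no complete entry for it.
sub line_with_index64 {
    my ($data_fh, $index_fh, $line_number) = @_;
    seek($index_fh, $ENTRY_SIZE * ($line_number - 1), 0) or return;
    my $got = read($index_fh, my $entry, $ENTRY_SIZE);
    return unless defined $got && $got == $ENTRY_SIZE;
    seek($data_fh, unpack("Q>", $entry), 0) or return;
    return scalar <$data_fh>;
}
```

        An index written with "N" has to be rebuilt before switching, since the entry size changes from 4 to 8 bytes.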

Re: index for large text file
by Anonymous Monk on Mar 28, 2011 at 06:33 UTC
    Some ideas
    • use autodie, 3 argument open, lexical file handles
      use autodie;
      open my($stuff), '<', 'in.txt';    # with autodie, no need to "or die"
      ... <$stuff> ...
      close $stuff;                      # with autodie, no need to "or die"
    • post a runnable program, not fragments, see How do I post a question effectively?
    • document said program (expected input, expected output, how the actual output differs from expected), so it makes sense to a stranger like me :)

      for example, you say no error output information, but the line in large line number is defective, like below:

      but then you don't explain the format, or how your output is defective

      This entire reply is anti-helpful, serving only to distract from the simple, one-line solution to the OP's problem: the "N" format to pack is 32-bit, so it won't work with offsets larger than 2^32. The "Q" format is 64-bit, so it will.
        Really?

        If moritz makes the same suggestion, minutes before I do, it's helpful.

        If davido asks for clarification, minutes after I do, it's helpful.

        A few minutes later, it's clear moritz misunderstood the original question, so I guess asking for clarification was helpful after all.

        A few minutes later, cafeblue clarifies the missing details.

        A few hours later Eliya answers the question.

        A few more hours, you show up, to call my reply anti-helpful and as serving only to distract from the simple, one-line solution to the OP's problem.

        Anti-helpful? How would you classify your posting, redundant?

Re: index for large text file
by davido (Archbishop) on Mar 28, 2011 at 06:49 UTC

    Typeglob filehandles make my eyes bleed. I know they're out there in the wild. But if you can avoid creating new code that uses them you'll be happier in the long run.

    If you could post what your expected output is, as opposed to what you're getting, that would be helpful. Also, when you say all blocks are identical, do you mean identical? If that's the case, you have 30 gigs of millions of identically repeated four-line blocks? That doesn't make any sense. I re-read the question a few times and just couldn't come up with a concept of what you mean to say. ...probably my fault. But could you clarify what your dataset looks like, what output you're getting, and what you're expecting?


    Dave

      Thank you. Maybe I should not have used the word "block"; I should have said "pattern".

      As in the text file I gave in the first post, the first line starts with the symbol "@" and the third line starts with "+". The rest of those two lines is identical within one group of four lines, but differs from one group to the next.

      The even lines are all the same length throughout the whole text file.

      The whole file follows this pattern.

      I am a newbie. Sorry for my carelessness.

Re: index for large text file
by vkon (Deacon) on Mar 28, 2011 at 07:52 UTC
    Also check whether your perl version can deal correctly with large files:
    D:\>perl -V
    Summary of my perl5 (revision 5 version 13 subversion 11) configuration:
      Platform:
        osname=MSWin32, osvers=5.1, archname=MSWin32-x86-perlio
        uname=''
        config_args='undef'
        hint=recommended, useposix=true, d_sigaction=undef
        useithreads=undef, usemultiplicity=undef
        useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    ....
    Or just
    D:\>perl -V:uselargefiles
    uselargefiles='define';
    If it isn't 'define', then upgrade.
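    The same information can also be read from inside a script through the core Config module. A small sketch; the eval probe for "Q" is an extra check added here, not something taken from the output above:

```perl
use strict;
use warnings;
use Config;    # core module exposing the build-time configuration as %Config

# uselargefiles is 'define' when this perl was built with large file support.
my $largefiles = defined $Config{uselargefiles}
    && $Config{uselargefiles} eq 'define';

# ivsize is the size in bytes of perl's native integer type.
my $ivsize = $Config{ivsize};

# pack("Q", ...) dies on perls without 64-bit integer support, so probe it.
my $has_quad = eval { pack("Q", 0); 1 } ? 1 : 0;

printf "large file support: %s\n", $largefiles ? 'yes' : 'no';
printf "integer size: %d bytes\n", $ivsize;
printf "pack 'Q' usable: %s\n", $has_quad ? 'yes' : 'no';
```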
Re: index for large text file
by GrandFather (Cardinal) on Mar 28, 2011 at 09:32 UTC

    Would this data be better stored in a database? To answer that you need to think about how the files are generated and used. If you look them up much more often than they are generated, a database may be very worthwhile. If relatively small numbers of records change from time to time, that may be another good reason to use a database. If you have control over generation of the current file, that will make using a database easier.

    True laziness is hard work
      I do not think a database would be appropriate, considering that this file type (FASTQ) can easily have 200,000,000 records (200,000,000 x 4 lines). We (biologists) can often have hundreds if not thousands of these files.
Re: index for large text file
by umasuresh (Hermit) on Mar 28, 2011 at 14:19 UTC
    If I understand your problem correctly:
    1. each group of 4 lines has a key, e.g.
       5:1:1934:1185#CGATGT/1
    2. with values:
       sequence: GTCATCCTTAATTANCGTATGTGCTCTTCCTNCNNNNNNNNGCTGCTANTTATTTCTNNGCAGCTTTGCT
       its ASCII code: acdad`^ddd^aa^B_\VZZfcfccaffBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
    It might be easier to grow a Hash of Array and access random keys to print the lines: e.g.
    my %HoA = (
        '5:1:1934:1185#CGATGT/1' => [
            'GTCATCCTTAATTANCGTATGTGCTCTTCCTNCNNNNNNNNGCTGCTANTTATTTCTNNGCAGCTTTGCT',
            'acdad`^ddd^aa^B_\VZZfcfccaffBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB',
        ],
    );
    NOTE: untested
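    A sketch of how such a hash of arrays might be filled from a filehandle, assuming strict four-line records. The helper name read_fastq_into_hash is made up for illustration, and holding every record in memory is only realistic for files much smaller than the 5G-30G ones in the question.

```perl
use strict;
use warnings;

# Read four-line FASTQ records into a hash of arrays:
#   read ID (header without the leading "@") => [ sequence, quality ]
sub read_fastq_into_hash {
    my ($fh) = @_;
    my %HoA;
    while (defined(my $header = <$fh>)) {
        my $sequence  = <$fh>;
        my $separator = <$fh>;    # the "+..." line, not stored
        my $quality   = <$fh>;
        last unless defined $quality;    # ignore a truncated final record
        chomp($header, $sequence, $quality);
        (my $id = $header) =~ s/^\@//;
        $HoA{$id} = [ $sequence, $quality ];
    }
    return \%HoA;
}
```

    A random record can then be picked with something like my @ids = keys %$records; my $id = $ids[rand @ids];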
