PerlMonks
Large file split into smaller using sysread()

by rkshyam (Acolyte)
on Mar 27, 2012 at 15:57 UTC [id://961967]

rkshyam has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am new to Perl. My requirement is to split a huge file (3GB) into multiple 200MB files. I have used Perl's built-in read()/sysread() functions. The file does get split into 200MB pieces, and the basic logic of the code I have written works. But since I am using read(), the data is transferred in raw bytes, so each output file ends with a chunk that does not terminate at an end of line, and the next file does not start at the beginning of a line. Please help!! If I use read(), can I make each transfer end exactly at an end of line?

The output looks like this:

2011/11/30 @ 18:08:52,103 @ -> GetLogicalServerByIdTask-124191 - Impl.addLogEntry
2011/11/30 @ 18:08:52,112 @ -> [ActiveMQ Session Task
2011/11/30 @ 18:12:12,042 @ -> ActiveMQ Session Task - WARN le for synchronizing its resources.

my $filesize_in_MB    = 0;
my $file_size_compare = 100;
my $filename;
my $filesize;
my $block_size          = 131072;
my $file_size_sorted    = -s $file_sorted;
my $file_size_sorted_MB = $file_size_sorted / ( 1024 * 1024 );
my $buffer;
my $count = 15;

open FH_sort, "$file_sorted";

for ( my $i = 1; $i <= $count; $i++ ) {
    while ( $filesize_in_MB <= $file_size_compare ) {
        my $rv = read( FH_sort, $buffer, $block_size );    #or die "$?";
        #my $rv = read( FH_sort, $buffer, $block_size, O_APPEND );    #or die "$?";
        #print $rv;
        if ( !eof(FH_sort) && ( $rv <= $block_size ) ) {
            open FH_split, ">>sort_split$i" or die "$!";
            print FH_split $buffer;
            $filename       = "sort_split$i";
            $filesize       = -s $filename;
            $filesize_in_MB = $filesize / ( 1024 * 1024 );
            close FH_split;
        }
        else {
            open FH_split, ">>sort_split$i" or die "$!";
            print FH_split $buffer;
            close FH_split;
            last;
        }
    }
    $filesize_in_MB = 0;
}
close FH_sort;

Replies are listed 'Best First'.
Re: Large file split into smaller using sysread()
by kejohm (Hermit) on Mar 28, 2012 at 02:10 UTC

    If I am understanding your question correctly, you want to split a 3GB text file into 200MB parts, but some lines are being split between files.

    The functions read() and sysread() operate on fixed-size chunks of bytes (or characters) and have no concept of lines. So, if you read exactly 200MB of data, the boundary will almost certainly fall in the middle of a line.

    One way to do it would be to read in lines, instead of characters, from the big file and print them to a new file part, keeping track of the number of characters read so far. When the character count goes over 200MB, you close the current file part and open the next one. Here is an example:

    #!perl
    # Untested
    use 5.012;

    my $partsize = 200 * 1024 * 1024;

    my $file = shift or die 'no file';
    open my $in, '<', $file
        or die "Can't open '$file' for reading: $!";

    my $part = 1;
    my $size = 0;
    open my $out, '>', "$file.part$part"
        or die "Can't open '$file.part$part' for writing: $!";

    while (<$in>) {
        print $out $_;
        $size += length $_;
        if ( $size >= $partsize ) {
            close $out;
            $part++;
            open $out, '>', "$file.part$part"
                or die "Can't open '$file.part$part' for writing: $!";
            $size = 0;
        }
    }
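    Since the thread title asks about sysread() specifically, here is an untested sketch (not from the original thread) of the same idea with block reads: read a block, cut the output at the last newline in the buffer, and carry the leftover partial line into the next part, so no line is ever split across files. Parts may run slightly over the target by up to one block, and the helper name split_file and its parameters are this sketch's own invention.

```perl
#!perl
# Untested sketch: sysread() block reads that never split a line.
use strict;
use warnings;

sub split_file {
    my ( $file, $partsize, $blocksize ) = @_;
    open my $in, '<', $file or die "Can't open '$file' for reading: $!";
    binmode $in;

    my $part    = 1;
    my $written = 0;
    my $tail    = '';    # partial line carried over between blocks
    open my $out, '>', "$file.part$part" or die "Can't open part: $!";

    while (1) {
        my $n = sysread $in, my $buf, $blocksize;
        die "sysread failed: $!" unless defined $n;
        last unless $n;    # 0 bytes read means end of file

        $buf  = $tail . $buf;
        $tail = '';
        if ( $written + length($buf) >= $partsize ) {
            # cut at the last newline so this part ends on a line boundary
            my $cut = rindex $buf, "\n";
            if ( $cut >= 0 ) {
                print $out substr $buf, 0, $cut + 1;
                $tail = substr $buf, $cut + 1;
            }
            else {
                $tail = $buf;    # no newline in this block; carry it all
            }
            close $out;
            $part++;
            $written = 0;
            open $out, '>', "$file.part$part" or die "Can't open part: $!";
        }
        else {
            print $out $buf;
            $written += length $buf;
        }
    }
    print $out $tail if length $tail;    # flush any trailing partial line
    close $out;
    return $part;                        # number of parts created
}

split_file( $ARGV[0], 200 * 1024 * 1024, 131072 ) if @ARGV;
```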

      Hi kejohm, This great solution worked very well for me. Thanks a lot !!!
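For completeness (an aside not in the original thread): on systems with GNU coreutils, the split utility can already do this from the command line. Its -C (--line-bytes) option limits each piece to at most SIZE bytes without ever breaking a line across files. The input file name sorted.log is a placeholder.

```shell
# GNU coreutils split: pieces of at most 200MB each; -C guarantees that
# no line is broken across files. Output is sort_split.aa, sort_split.ab, ...
split -C 200M sorted.log sort_split.
```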

Node Type: perlquestion [id://961967]
Approved by herveus