PerlMonks  

Split file into 4 smaller ones

by Anonymous Monk
on Feb 07, 2013 at 22:26 UTC ( [id://1017730] )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, is there a quick way to split a large file (around 10 GB) into 4 smaller, roughly equally sized files? If so, will the content of each of the smaller files be "continuous", i.e. will each file's content begin where the previous file's stops?

Replies are listed 'Best First'.
Re: Split file into 4 smaller ones
by BrowserUk (Patriarch) on Feb 08, 2013 at 00:42 UTC

    This will create output files split1 through split4. They will vary slightly in size so that each file contains only complete lines. The memory usage is minimal:

    #! perl -sw
    use strict;

    open I, '<', $ARGV[0] or die $!;
    seek I, 0, 2;
    my $n = int( tell( I ) / 4 );
    seek I, 0, 0;

    my $s = $n;
    for my $i ( 1 .. 4 ) {
        open O, '>', "split$i" or die $!;
        while( tell( I ) < $s ) {
            print O scalar <I>;
        }
        $s += $n;
        close O;
    }
    __END__
    C:\test>dir words.txt
    17/07/2011  15:52         1,941,858 words.txt

    C:\test>split4 words.txt

    C:\test>dir split*
    08/02/2013  00:41           485,470 split1
    08/02/2013  00:41           485,458 split2
    08/02/2013  00:41           485,468 split3
    08/02/2013  00:41           485,462 split4
    08/02/2013  00:38               298 split4.pl

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Split file into 4 smaller ones
by davido (Cardinal) on Feb 07, 2013 at 22:32 UTC

    Use the -s operator to determine the file's size, and set the $/ special variable to 1/4 of the size returned by -s. Then read in a while() loop, just as you always would.

    See perlvar for an explanation of how to set $/, and perlfunc -X to learn about -s.

    ...or, again use -s, then seek to specific locations and use sysread with a length of 1/4 of the total. But then beware of off-by-one errors.

    Of course, with any of these methods you're going to run into some memory constraints: reading 25% of a 10 GB file will consume 2.5 GB. It might be better to set $/ to 1/16th of the file size, and then read/write four times for each output file.
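    A minimal sketch of the 1/16th-of-file variant described above. It relies on the fact that setting $/ to a reference to an integer switches <$in> from line reads to fixed-length reads of that many bytes (see perlvar); the file names demo.dat and partN are placeholders, and the small demo input is generated inline in place of the real 10 GB file:

```perl
use strict;
use warnings;

# Demo setup: generate a small stand-in for the 10 GB input.
my $file = 'demo.dat';
open my $fh, '>:raw', $file or die $!;
print {$fh} "line $_\n" for 1 .. 1000;
close $fh;

my $size   = -s $file;
my $pieces = 4;

# $/ set to a *reference* to an integer makes <$in> do
# fixed-length reads of that many bytes (perlvar).
local $/ = \( int( $size / ( $pieces * 4 ) ) + 1 );

open my $in, '<:raw', $file or die $!;
for my $i ( 1 .. $pieces ) {
    open my $out, '>:raw', "part$i" or die $!;
    for ( 1 .. 4 ) {    # four reads per piece => at most ~size/16 in memory
        my $chunk = <$in>;
        last unless defined $chunk;
        print {$out} $chunk;
    }
    close $out;
}
close $in;
```

    The split points fall at arbitrary byte offsets, not line boundaries, so lines can straddle two output files; that is the trade-off against BrowserUk's line-preserving version below.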


    Dave

      I always find it a bit silly when people suggest reading such large amounts of data into memory. Yes, I guess you can do that nowadays, but I'm not sure you always should. I prefer to conserve memory where possible.

      use strict;
      use warnings;
      use POSIX 'ceil';

      my ( $infn, $prefix ) = @ARGV;

      my $buffer_size  = 64 * 1024;
      my $size         = -s $infn;
      my $bytes_wanted = ceil( $size / 4 );

      sub copy {
          my ( $in, $out, $bytes ) = @_;
          my ( $buffer, $bytes_read );
          $bytes_read = sysread( $in, $buffer, $bytes );
          print $out $buffer;
          return $bytes_read;
      }

      open my $in, '<', $infn or die $!;
      for my $outfn ( 1 .. 4 ) {
          open my $out, '>', $prefix . $outfn or die $!;
          my $bytes = 0;
          while ( $bytes + $buffer_size < $bytes_wanted ) {
              $bytes += copy( $in, $out, $buffer_size );
          }
          copy( $in, $out, $bytes_wanted - $bytes );
          close $out;
      }
      close $in;
Re: Split file into 4 smaller ones
by Anonymous Monk on Feb 08, 2013 at 07:13 UTC

    May I recommend the Unix utility split(1)?

    
    NAME
         split -- split a file into pieces
    
    SYNOPSIS
         split [-a suffix_length] [-b byte_count[k|m] | -l line_count | -n
               chunk_count] [file [name]]
    
    DESCRIPTION
         The split utility reads the given file and breaks it up into files of
         1000 lines each.  If file is a single dash or absent, split reads from
         the standard input.  file itself is not altered.
    
         The options are as follows:
    
    [...]
    
         -n      Split file into chunk_count smaller files.
    

    Okay, maybe that was a bad suggestion: the GNU version did not gain the split-to-n-chunks feature until coreutils 8.8, so older installations lack it.
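    On systems with GNU coreutils 8.8 or later (and on BSDs whose split supports -n), the one-liner does work. A quick sketch, where big.dat is a small stand-in for the real 10 GB input:

```shell
# Create a small demo file standing in for the large input.
seq 1 1000 > big.dat

# Split into 4 byte-equal chunks: big.dat.aa .. big.dat.ad
split -n 4 big.dat big.dat.

# Or balance sizes while keeping lines intact:
split -n l/4 big.dat part.
```

    Concatenating the pieces in order reproduces the original file exactly; with the l/4 form the split points additionally fall on line boundaries.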

Node Type: perlquestion [id://1017730]
Approved by davido
Front-paged by MidLifeXis