PerlMonks  

How to split big files with Perl ?

by zalezny (Novice)
on Dec 26, 2014 at 17:20 UTC ( #1111413=perlquestion )

zalezny has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perl Gurus, does anybody know how to split big files (for example, 10GB) into multiple small ones? For example, I would like to take each big file in my backup folder and split it into small pieces. A 10GB file backup.dat would need to be split into:

backup.dat.aa

backup.dat.ab

backup.dat.ac

Is there any library in Perl for splitting files based on size? Or maybe some compression parameter to split files automatically if they are bigger than size XX? Thanks in advance for your support! Zalezny

Replies are listed 'Best First'.
Re: How to split big files with Perl ?
by GotToBTru (Prior) on Dec 26, 2014 at 17:50 UTC
    man split
    1 Peter 4:10
Re: How to split big files with Perl ?
by pme (Monsignor) on Dec 26, 2014 at 17:55 UTC
    Hi zalezny,

    Have you tried the 'split' command that ships with Fedora? It can be called from Perl like this:

    `split filename`;
    Regards
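
    A slightly more defensive sketch of the same idea: backticks discard the exit status and go through the shell, so list-form system() is safer. This assumes coreutils split(1) is on PATH (it is on Fedora); the file name and the 100-byte piece size are demo values created here just so the example is self-contained — for real use you would pass an existing file and a size like 1G.

    ```perl
    use strict;
    use warnings;

    # Demo input file (250 bytes) so the sketch runs on its own;
    # in practice $file would be the existing backup file.
    my $file = 'backup.dat';
    open my $fh, '>', $file or die "cannot create $file: $!";
    print {$fh} 'x' x 250;
    close $fh;

    # List-form system() avoids shell quoting problems and lets us
    # check the exit status. -b takes a piece size (e.g. 1G); the
    # trailing "$file." sets the prefix, so pieces come out exactly
    # as the OP wants: backup.dat.aa, backup.dat.ab, backup.dat.ac
    system( 'split', '-b', '100', $file, "$file." ) == 0
        or die "split failed: exit status $?";
    ```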
Re: How to split big files with Perl ?
by herveus (Parson) on Dec 26, 2014 at 17:28 UTC
    Howdy!

    Have you tried looking on CPAN for something like, say, "split"?

    yours,
    Michael
      Not really, I asked only uncle Google, but he didn't provide me any sensible answer. Unfortunately, I'm not using CPAN on that server, only packages from the Fedora repository (ughggg...). It would be perfect if you could send me some hint ;).
        Howdy!

        Um...I did provide a pretty broad hint.

        yours,
        Michael
Re: How to split big files with Perl ?
by james28909 (Deacon) on Dec 26, 2014 at 18:28 UTC
    Get the length of the file, divide that by how many times you want to split it, then read it into a buffer and write it to a file :)
    use strict;
    use warnings;
    open my $fh, '<', 'filename.dat';
    binmode($fh);
    my $len = -s $fh;
    my $split_length = $length / 5;    # would split 10gb into 2gb chunks
    my $split_fh = $fh . 'split';      # creates 'filename.split'
    my $num = '1';
    for ( 1 .. 5 ) {
        read $fh, $buf, $split_length;
        open my $out_file, '>', $split_fh . $num;   # creates '$filename.split000, 001, 002 etc
        binmode($out_file);
        print $out_file, $buf;
        close($out_file);
        $num++;
    }
    close($fh);
    I am sure there are other ways to do it. It is completely untested code.

      I'm sorry but this is really not good.

      Aside from the fact that it doesn't compile, what is my $split_fh = "$fh" . 'split'; supposed to do? print $buf $outfile; or opening $out_file in read mode are pretty obvious errors. No error handling on open or read is also not great.

      Do you think that reading 2GB of the input file into memory at a time is a very efficient way to go about it?

      What happens when the size of the file is not exactly divisible by 5?

        Well, honestly, it was like I said: purely untested code, just an example. I did not intend it as a copy-and-paste example. All this does is read the file, then make another file on the fly, appending 001++ to the name, that's all. I will however revise it and make corrections so users can copy and paste it.
        I stand corrected, it will take more than what I posted to be able to split it up. What I was considering was taking a 10gb file and splitting it into exactly 4gb chunks. I think that would require reading 1 byte at a time and writing the buf to outfile until a counter reaches the 4gb limit. That way it is not filling the memory with all this data at one time and would work smoothly. I'll see what I can cook up.

      This works much better :)

      This splits the file into 2gb chunks. I have tested it on about 25-30 ISOs I have stored on my PC and it works great, though sometimes writing performance is a little slow. You can also change how many GB to split into by changing the iterator's limit.
      use strict;
      use warnings;
      files();

      sub files {
          foreach (@ARGV) {
              print "processing $_\n";
              open my $fh, '<', $_ || die "cannot open $_ $!";
              binmode($fh);
              my $num      = '000';
              my $iterator = 0;
              split_file( $fh, $num, $_, $iterator );
          }
      }

      sub split_file {
          my ( $fh, $num, $name, $iterator ) = @_;
          my $split_fh = "$name" . '.split';
          open( my $out_file, '>', $split_fh . $num ) || die "cannot open $split_fh$num $!";
          binmode($out_file);
          while (1) {
              $iterator++;
              my $buf;
              read( $fh, $buf, 32 );
              print( $out_file $buf );
              my $len = length $buf;
              if ( $iterator == 67108864 ) {    # split into 2gb chunks
                  $iterator = 0;
                  $num++;
                  split_file( $fh, $num, $name );
              }
              elsif ( $len !~ "32" ) {
                  last;
              }
          }
      }
      Works pretty quickly! It split almost 5gb in 4.4333 mins. I do see a decrease in performance sometimes, though other times it writes very quickly. Go ahead and test it on one of your ISOs. What would be the most efficient read/write buffer?

        The most efficient block size will depend on lots of things, but the memory page size of your OS will likely be the most significant. 32 bytes is way too small, I'd start with 4k or 8k and go up from there. Why not try several different multiples of 4K and see which one works best for you?

        Also, read returns the number of bytes actually read so there's really no need to use length.

        my $len = read($in,$buf,4*1024); ...

        And $len is an integer so it would be better to use the numeric not equal '!=' rather than the pattern match operator.

        Thanks for taking the time to update. Some points to review:

        • Calling split_file recursively means that your stack will fill up as the number of chunks goes up. You've got one buffer per sub call, so that's probably the source of the memory usage and slowdown you reported.
        • Your algorithm/logic, even though it works, is confusing, and actually can possibly go wrong: Right after you read from the file, you use $iterator to determine whether to call split_file again - I think you need to look at $len first. Keeping a running count of the bytes written to the current chunk and comparing it to the desired chunk size might be better. Also, inside the while(1) loop, you don't seem to consider what happens after the call to split_file - the loop keeps going! In fact, if the file being split is exactly divisible by the chunk size, you create one final .splitNNN file that is empty.
        • This is not correct: open my $fh, '<', $_ || die "cannot open $_ $!";, since it gets parsed as open(my $fh, '<', ($_ || die("cannot open $_ $!"))); (you can see this by running perl -MO=Deparse,-p -e 'open my $fh, "<", $_ || die "cannot open $_ $!";'). Either write open my $fh, '<', $_ or die "cannot open $_ $!"; (or has lower precedence) or write open( my $fh, '<', $_ ) || die "cannot open $_ $!";
        • You're still not checking the return value of read, which is undef on error.
        • The code could also use a bit of cleanup. Just a couple of examples: The name $split_fh is a bit confusing, and you could append $num to it right away. In split_file you set $iterator = 0; but then don't use it in the recursive call to split_file.

        I think this might be one of those situations where it would make sense to take a step back and try to work the best approach out without a computer - how would you solve this problem on paper?

        But anyway, I am glad you took the time to work on and test your code! Tested code is important for a good post.
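
        A minimal sketch on paper along the lines of the points above, assuming nothing beyond core Perl: a flat loop instead of recursion, one shared buffer, a running per-chunk byte count, read()'s return value checked for undef, and "or die" rather than "|| die". The demo input file and the 1000-byte chunk size are invented here so the sketch is self-contained; in real use you would point $in_name at the big file and set $chunk_size to e.g. 2**31 for 2GB pieces.

        ```perl
        use strict;
        use warnings;

        # Create a small demo input (2500 bytes) so this runs standalone.
        my $in_name = 'demo.dat';
        open my $mk, '>', $in_name or die "cannot create $in_name: $!";
        binmode $mk;
        print {$mk} 'x' x 2500;
        close $mk;

        my $chunk_size = 1000;        # bytes per output piece (demo value)
        my $buf_size   = 64 * 1024;   # read block size, a multiple of 4K

        open my $in, '<', $in_name or die "cannot open $in_name: $!";
        binmode $in;

        my ( $num, $written, $out ) = ( 0, 0, undef );
        while (1) {
            my $len = read $in, my $buf, $buf_size;
            die "read failed: $!" unless defined $len;   # undef means error
            last if $len == 0;                           # numeric EOF test
            my $pos = 0;
            while ( $pos < $len ) {
                if ( !defined $out ) {
                    # Open the next chunk lazily: an exactly-divisible
                    # input file leaves no trailing empty chunk behind.
                    my $name = sprintf '%s.split%03d', $in_name, $num++;
                    open $out, '>', $name or die "cannot open $name: $!";
                    binmode $out;
                    $written = 0;
                }
                # Write only as much as the current chunk still has room for.
                my $take = $chunk_size - $written;
                $take = $len - $pos if $take > $len - $pos;
                print {$out} substr( $buf, $pos, $take ) or die "write failed: $!";
                $pos     += $take;
                $written += $take;
                if ( $written == $chunk_size ) {
                    close $out or die "close failed: $!";
                    undef $out;
                }
            }
        }
        if ( defined $out ) { close $out or die "close failed: $!" }
        close $in;
        ```

        With the 2500-byte demo input this produces demo.dat.split000 and demo.dat.split001 at 1000 bytes each and demo.dat.split002 at 500 bytes, and the memory footprint stays at one read buffer regardless of file or chunk size.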

Re: How to split big files with Perl ?
by sundialsvc4 (Abbot) on Dec 28, 2014 at 19:58 UTC

    Given that there is a split command (at least, on most Unixes), I would be very strongly inclined to try to use it, hoping that its implementation is most efficient.   (In any case, it is an existing implementation of a classic Thing That Is Already Done.™)

    “Splitting a file” is never a thing that should call for recursion:   all that you’re really doing is reading from one file and switching from one output-file to the next one at specified intervals.   I’ve seen lots of ways to do that, probably the most-elaborate ones (not in Perl ...) using memory-mapped files to actually exploit the virtual-memory subsystem’s I/O capabilities as the means of reading from the target and getting the data (in one step) where it needs to go.

    Still, in my mind, it all comes back to the same thing:   this is A Classic Thing That Has Already Been Done.™   Search for an existing tool that you can rely-upon, and use it, to avoid having to write-and-debug “yet another” piece of software to do such a trivial task.   Surely you can find one that will meet your project’s performance expectations.

      "...in my mind, it all comes back to the same thing ..."

      Holy shit, nothing but the truth. You got it!

      «The Crux of the Biscuit is the Apostrophe»

Node Type: perlquestion [id://1111413]
Approved by herveus