
reversing files line by line

by naturalsciences (Beadle)
on Feb 14, 2010 at 15:00 UTC ( #823128=perlquestion )
naturalsciences has asked for the wisdom of the Perl Monks concerning the following question:

I need to reverse the order of lines in a HUGE text file, so I can't just slurp the filehandle into an array and call reverse. My plan was to read the file in line by line and then write those lines out in a way that appends each one to the beginning of a file. The code I smashed together from bits and pieces found on the web looks like this:
    #!/usr/bin/perl -w
    use strict;
    use warnings;

    my $file = 'hugefiletobereversed';
    open FH, 'hugefiletobereversed';
    while (our $newline = <FH>) {
        {
            local @ARGV = ($file);
            local $^I   = '.bac';
            while (<>) {
                if ($. == 1) {
                    print "$newline\n";
                }
                else {
                    print;
                }
            }
        }
    }
What it actually does is take only the last line, print it at the front of the file, and then generate a lot of empty lines. Can anybody help me with this code or suggest a better one? JP

Update: Thanks for your replies. I found the module File::ReadBackwards, which soothed all my problems in a sec :D Though I'm still curious what the problem with my original code is.

And no, it is not some kind of programming homework. I have to work with several bioinformatics programs for my thesis, and being academic software they are usually not at all written with compatibility in mind, so I have to reformat the output files from some of them to use as input files for others. Sometimes the formatting differences are small and easy; sometimes the differences are large but the solutions are easy. This time the difference was small (reverse the file) but the solution just didn't come to my mind :D Also, the files tend to be several GB in size, so any solution that involves reading the whole file into memory is out (which makes my life harder).

Replies are listed 'Best First'.
Re: reversing files line by line
by Corion (Pope) on Feb 14, 2010 at 15:08 UTC

    There are (by now) four potential approaches to this problem, but as it mostly sounds like homework to me, I'll only outline the general approaches:

    The first approach is to observe that reverse(A+B) = reverse(B) + reverse(A), so you can split your long file into many smaller lists, reverse these smaller lists, and then join the reversed lists back together. If you want to do this in a recursive approach, you could also observe that reverse(a) == a for a single element and treat that as a stopping condition, although that would be quite inefficient, as would be the observation that reverse(a+TAIL) == reverse(TAIL)+a.
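    A minimal sketch of this chunk-and-concatenate idea (file names, temp-file naming, and the chunk size below are illustrative, not from the thread):

```perl
use strict;
use warnings;

# Split the input into chunks of $chunk_lines lines, reverse each chunk
# into its own temp file, then concatenate the temp files last-to-first:
# reverse(A+B) = reverse(B) + reverse(A).
sub reverse_file {
    my ($in, $out, $chunk_lines) = @_;
    open my $fh, '<', $in or die "open $in: $!";
    my @tmp;    # one temp file per chunk
    while (!eof $fh) {
        my @lines;
        while (defined(my $l = <$fh>)) {
            push @lines, $l;
            last if @lines >= $chunk_lines;
        }
        my $tmp = "$out.chunk" . scalar(@tmp);
        open my $t, '>', $tmp or die "open $tmp: $!";
        print {$t} reverse @lines;    # reverse within the chunk
        close $t;
        push @tmp, $tmp;
    }
    close $fh;

    # emit chunks in reverse order to complete the reversal
    open my $o, '>', $out or die "open $out: $!";
    for my $tmp (reverse @tmp) {
        open my $t, '<', $tmp or die "open $tmp: $!";
        print {$o} $_ while <$t>;
        close $t;
        unlink $tmp;
    }
    close $o;
}
```

    Only $chunk_lines lines are ever in memory at once; the price is one extra copy of the data on disk while the temp files exist.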

    The second approach, which works as long as you have enough core memory for storing a list of numbers, would be to read through the whole file, noting the offset of each line (as told via tell), and then to reverse that list of numbers, seek to each offset in the file and to then write out the line from there into the new file. If you are careful with your memory usage, or if you are tricky and store your offsets packed in a string, for example as 4-byte or 8-byte tuples, or maybe even as just the (presumed shorter) list of line-lengths, you can reduce the amount of memory you need to store the information and still reverse the whole file.
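    A rough sketch of this offset-list idea, using a plain array of offsets (the packed-string trick mentioned above would replace @offsets with pack/unpack on a single string; names here are illustrative):

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Pass 1: record where every line starts, via tell.
# Pass 2: walk the offsets backwards, seek to each line and copy it out.
sub reverse_by_offsets {
    my ($in, $out) = @_;
    open my $fh, '<', $in or die "open $in: $!";
    my @offsets = (0);                  # the first line starts at 0
    push @offsets, tell $fh while <$fh>;
    pop @offsets;                       # the final tell is EOF, not a line start

    open my $o, '>', $out or die "open $out: $!";
    for my $off (reverse @offsets) {
        seek $fh, $off, SEEK_SET or die "seek: $!";
        my $line = <$fh>;               # read exactly one line from there
        print {$o} $line;
    }
    close $_ for $o, $fh;
}
```

    Memory use is one number per line rather than the whole file; the cost is one seek per line during the second pass.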

    A third approach would be to not store any information and just seek a bit backwards from the current location and read data up to the current location, then scan for a line break. If one is found, output the line from there. Seek further backwards.
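    This third approach is roughly what File::ReadBackwards does internally. A simplified sketch (block size and names are illustrative; a partial line at a block boundary is carried over to the next, earlier block):

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET SEEK_END);

# Read the file in fixed-size blocks from the end towards the start,
# split each block into lines, and emit the complete lines in reverse.
sub reverse_backwards_scan {
    my ($in, $out, $blk) = @_;
    $blk ||= 4096;
    open my $fh, '<', $in  or die "open $in: $!";
    open my $o,  '>', $out or die "open $out: $!";
    seek $fh, 0, SEEK_END;
    my $pos  = tell $fh;
    my $tail = '';    # possibly-partial first line of the previous block
    while ($pos > 0) {
        my $want = $pos >= $blk ? $blk : $pos;
        $pos -= $want;
        seek $fh, $pos, SEEK_SET or die "seek: $!";
        read($fh, my $buf, $want) == $want or die "short read on $in";
        $buf .= $tail;
        my @lines = split /(?<=\n)/, $buf;    # split but keep the newlines
        # unless we are at the very start, the first piece may be partial
        $tail = $pos > 0 ? shift @lines : '';
        print {$o} reverse @lines;
    }
    close $_ for $o, $fh;
}
```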

    A fourth approach would be to use File::ReadBackwards.

      I found this module before I read your post, and I must admit it worked really well. If I get some free time on my hands, I'll try to nibble at this problem without modules (just out of curiosity). Thanks for the other solutions as well; I seem to recall seeing seek used for something similar.
Re: reversing files line by line
by bart (Canon) on Feb 14, 2010 at 16:25 UTC

    Assuming you don't mind that the file gets read several times over, this is the way I would do it. It ought to be fairly efficient.

    1. In a first pass, you must get an array of offsets of where lines start in the file. However, you don't need the offset for every single line.

      Divide the file into chunks of reasonable size. For example, if you allow at most about 500k of the file in memory at once, first seek to a position near 500_000 in the file, read one (incomplete) line, call tell on the handle, and store the result in an array. Add 500_000 to that offset and repeat until you get past the file size. You end up with an array of offsets where lines start. Note that the first value in the array must be 0; the last value ought to be the length of the file.

    2. Loop through that array backwards, each time taking one offset and the next offset. seek to the lower value; read a block of as many bytes as the difference between the two. Reverse the lines in that block and write them to a new file, each time appending to what you wrote there last time.
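    The two passes above might be sketched like this (file names and the chunk size are illustrative; bart suggests ~500k, a smaller value works the same way):

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Pass 1: collect line-start offsets roughly $chunk bytes apart.
# Pass 2: walk the offset pairs backwards, reversing each block's lines.
sub reverse_chunked {
    my ($in, $out, $chunk) = @_;
    open my $fh, '<', $in or die "open $in: $!";
    my $size = -s $in;

    my @offsets = (0);
    my $target = $chunk;
    while ($target < $size) {
        seek $fh, $target, SEEK_SET or die "seek: $!";
        <$fh>;                       # finish the (likely partial) line
        my $off = tell $fh;
        last if $off >= $size;
        push @offsets, $off;         # a known line start
        $target = $off + $chunk;
    }
    push @offsets, $size;            # sentinel: end of file

    open my $o, '>', $out or die "open $out: $!";
    for my $i (reverse 0 .. $#offsets - 1) {
        my $len = $offsets[$i + 1] - $offsets[$i];
        seek $fh, $offsets[$i], SEEK_SET or die "seek: $!";
        read($fh, my $buf, $len) == $len or die "short read on $in";
        print {$o} reverse split /(?<=\n)/, $buf;    # reverse lines in block
    }
    close $_ for $o, $fh;
}
```

    Compared to one offset per line, this stores only one offset per ~$chunk bytes, at the cost of holding one block in memory at a time.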

Re: reversing files line by line
by afoken (Abbot) on Feb 14, 2010 at 19:45 UTC

    In other words: You want to reimplement tac in Perl. Why? (Windows is no excuse here, there are also implementations of tac for Windows.)


    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: reversing files line by line
by repellent (Priest) on Feb 14, 2010 at 19:49 UTC
    You could also use tac.
Re: reversing files line by line
by Marshall (Abbot) on Feb 15, 2010 at 07:52 UTC
    The ways to reverse lines get a bit messy because they involve merging segments together (provided, of course, as your problem statement says, that you can't fit all lines into memory at once). As it turns out, well-written system sort routines already handle huge files, and they do all this merging by splitting the job up into smaller pieces.

    My suggestion would be to make a pass through this huge file, adding a number in the first column while generating a new file. Then use a system sort on that file, sorting on this first column in descending order. Then make a pass through the sorted file to remove the "index" number that was added in the first pass. And you have a reversed file.

    This idea may sound "stupid", but you may be surprised at how fast this can work. It is dependent upon a lot of things. But you can write the code in some number of minutes and let it run while you are pondering more efficient ways. And of course there are more efficient ways!

    I don't want to get too involved in the more complex ways until I hear back about how fast and what limitations were involved with this "stupid idea".

    I just propose that you write some simple code and launch it off on its job, while thinking about other ways.

    Update: This "tac" idea is also good. It may or may not be faster, depending upon how it is implemented. The main point is try the simple things first.
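    The number-sort-strip pass might look something like this (it shells out to sort(1), which spills to temp files on disk and so copes with multi-GB inputs; file names and the tab separator are illustrative):

```perl
use strict;
use warnings;

# Decorate each line with its line number, sort numerically descending
# on that number with the system sort, then strip the number back off.
sub reverse_via_sort {
    my ($in, $out) = @_;
    open my $fh,  '<', $in             or die "open $in: $!";
    open my $num, '>', "$in.numbered"  or die "open $in.numbered: $!";
    print {$num} "$.\t$_" while <$fh>;   # decorate: prepend line number
    close $_ for $num, $fh;

    # "\t" interpolates to a real tab character in the shell command
    system("sort -t '\t' -k1,1nr '$in.numbered' > '$in.sorted'") == 0
        or die "sort failed: $?";

    open my $srt, '<', "$in.sorted" or die "open $in.sorted: $!";
    open my $o,   '>', $out         or die "open $out: $!";
    while (<$srt>) {
        s/^\d+\t//;                      # undecorate: strip the number
        print {$o} $_;
    }
    close $_ for $o, $srt;
    unlink "$in.numbered", "$in.sorted";
}
```

    It reads and writes the data a few extra times, but every piece of it is sequential I/O plus one external sort, which is exactly what sort(1) is tuned for.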
