Edit huge file

by AI Cowboy (Sexton)
on Jun 18, 2013 at 22:02 UTC (#1039674=perlquestion)
AI Cowboy has asked for the wisdom of the Perl Monks concerning the following question:

I need to edit a large (500+ megabyte) file in Perl, removing one line near the beginning of the file and one near the end. I need to do this for a great many files, so performance matters; reading every file and editing/reprinting it could take days.

How can I find two lines in the file and remove them without reading the entire file and taking a huge amount of time?

Re: Edit huge file
by frozenwithjoy (Curate) on Jun 18, 2013 at 22:18 UTC

    First thing that pops into mind for me if you really want to avoid reading/re-writing entire file:

    • Read in file until you find line near start you want to delete (note line number using $.)
    • Read in file in reverse until you find line near end to delete (can use File::ReadBackwards)
    • Use Tie::File to represent file as an array
    • Use splice to get rid of unwanted lines
    • Then untie @array;

    Disclaimer: I have limited experience with Tie::File, so not sure how it performs w/ larger files.
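
    A minimal sketch of the Tie::File steps listed above, assuming the line positions are already known as 0-based indexes. The helper name delete_lines_by_index is hypothetical. Note that Tie::File still has to shift the remainder of the file when a line is spliced out, so this is a convenience, not a way to avoid the rewrite.

```perl
use strict;
use warnings;
use Tie::File;

# Hypothetical helper: delete the given 0-based line indexes from $path in place.
sub delete_lines_by_index {
    my ($path, @indexes) = @_;

    # Represent the file as an array; each element is one line without its newline.
    tie my @lines, 'Tie::File', $path
        or die "Cannot tie $path: $!";

    # Work from the highest index down so earlier deletions do not shift later ones.
    for my $i (sort { $b <=> $a } @indexes) {
        splice @lines, $i, 1 if $i < @lines;
    }

    untie @lines;
    return;
}
```

    Called as delete_lines_by_index('big.txt', 2, 7_999_999), it would drop the 3rd and 8 millionth lines; the file name is only an example.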

      Disclaimer: I have limited experience with Tie::File, so not sure how it performs w/ larger files.

      Horribly!


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        Thanks. For my own future reference, do you have a suggestion for a file size that is too large to use with Tie::File?
Re: Edit huge file
by BrowserUk (Pope) on Jun 18, 2013 at 22:24 UTC
    How can I find two lines of the file, remove them, without reading the entire file ... ?

    You don't!(*)

    But with files that size it needn't take long at all. This removes the 3rd and 8 millionth lines in a 500MB/8 million line file in 10 seconds:

    C:\test>dir 500MB.csv
    31/07/2012  20:22       536,870,913 500MB.csv

    C:\test>wc -l 500MB.csv
    8388608 500MB.csv

    [23:19:43.35] C:\test>perl -nle"$.==3 || $.==8e6 and next; print" 500MB.csv >nul

    [23:19:53.65] C:\test>

    (* If your file contains fixed length lines, then you can, but you still need to move most of the file to close up the gap at the beginning, so it doesn't gain you much.)

Re: Edit huge file
by davido (Archbishop) on Jun 18, 2013 at 22:39 UTC

    One point that needs to be made: while you *can* truncate a file at the end, you cannot just remove a few lines or bytes from the beginning of a file. To do that, the file has to be rewritten. So while you could cheaply remove that line near the end, you will not remove the line at the beginning without going through the process of rewriting the entire file.


    Dave
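
    To illustrate the cheap half of that point, here is a sketch that removes a line near the end by rewriting only the tail of the file and then truncating. The helper name remove_line_near_end and its $tail_bytes parameter are hypothetical; it assumes the target line lies entirely within the last $tail_bytes bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: remove the first line matching $regex found within the
# last $tail_bytes bytes of $path. Only the tail of the file is read and rewritten.
sub remove_line_near_end {
    my ($path, $regex, $tail_bytes) = @_;

    open my $fh, '+<', $path or die "Cannot open $path: $!";
    binmode $fh;

    my $size  = -s $fh;
    my $start = $size > $tail_bytes ? $size - $tail_bytes : 0;
    seek $fh, $start, 0 or die "Cannot seek in $path: $!";

    # Unless we started at byte 0, skip the (probably partial) line we landed in.
    scalar <$fh> if $start > 0;
    my $keep_from = tell $fh;

    # Collect the tail lines, dropping the first one that matches.
    my (@kept, $removed);
    while (my $line = <$fh>) {
        if (!$removed && $line =~ $regex) {
            $removed = 1;
            next;
        }
        push @kept, $line;
    }

    if ($removed) {
        # Rewrite the tail without the matched line, then cut off the leftover bytes.
        seek $fh, $keep_from, 0 or die "Cannot seek in $path: $!";
        print {$fh} @kept;
        truncate($fh, tell($fh)) or die "Cannot truncate $path: $!";
    }

    close $fh or die "Cannot close $path: $!";
    return $removed ? 1 : 0;
}
```

    The line near the beginning still needs the full rewrite described above; this only covers the line near the end.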

Re: Edit huge file
by rjt (Deacon) on Jun 18, 2013 at 23:38 UTC

    There is no way to remove bytes from the middle of a file without rewriting it, and there is no way to rewrite it without reading it. However, 500MiB is really nothing, unless you need to access it over a (slow) network or very slow local media such as tape, which is, thankfully, finally becoming less common. With typical consumer-level HDDs from a few years ago, you're probably looking at a few seconds.

    The basic structure would be something like this:

    while (<>) { print unless /^pattern to skip$/; }

    If you need to do something based on line number, use $. to get the current line number.

    This also lends itself fairly readily to a one-liner if your logic is relatively simple:

        $ perl -nle 'print unless /^pattern to skip$/ or $. == 10' filename.txt
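
    Since the question involves many files, the same loop could be wrapped in a small script that writes to a temporary file and renames it over the original. This is only a sketch: filter_file is a hypothetical helper, and the line number 3 and the "END MARKER" pattern are placeholders for whatever identifies the two unwanted lines.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use File::Basename qw(dirname);

# Hypothetical helper: stream $path through a filter, dropping every line for
# which $skip->($line, $line_number) returns true, then replace the original.
sub filter_file {
    my ($path, $skip) = @_;

    open my $in, '<', $path or die "Cannot read $path: $!";

    # Create the temporary file next to the original so that rename
    # stays on one filesystem.
    my ($out, $tmp) = tempfile(DIR => dirname($path));

    while (my $line = <$in>) {
        next if $skip->($line, $.);    # $. is the current line number of $in
        print {$out} $line;
    }

    close $in;
    close $out or die "Cannot finish writing $tmp: $!";
    rename $tmp, $path or die "Cannot replace $path: $!";
    return;
}

# Usage sketch: process every file named on the command line.
filter_file($_, sub { $_[1] == 3 || $_[0] =~ /^END MARKER$/ }) for @ARGV;
```

    One caveat: tempfile creates the new file with restrictive permissions, so the replaced file may not keep the original's mode or ownership.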
Re: Edit huge file
by AI Cowboy (Sexton) on Jun 18, 2013 at 23:58 UTC
    Alright, thanks all; without a method to do this more cleanly I can just write my own program to read the files, rewrite them without the lines, etc. etc..... A little disappointed there's not a better way to do this, oh well though.

      Keep in mind that this isn't a Perl shortcoming. It's how files work, across any language, and across every operating system I've used (which is certainly not every, but a large enough sampling to see a trend. ;)


      Dave

      AI Cowboy:

      If you just want information on the first line to disappear and don't really have any particular reason that you must move the data in the file, you could always overwrite it with blanks. You could do that *very quickly*:

      #!/usr/bin/perl
      use strict;
      use warnings;
      use autodie;

      # Open file in read/write mode
      open my $FH, '+<', 'tmp.txt';

      # Skip the first line
      my $t = <$FH>;

      # Remember the starting location of the second line
      my $pos = tell $FH;

      # Read the second line
      $t = <$FH>;

      # Rewind back to the start of the second line and obliterate it
      seek $FH, $pos, 0;
      print $FH "*" x (length($t)-length($/));

      Here's a quick demonstration:

      $ head -5 tmp.txt
      You are on the edge of a breath-taking view. Far below you
      is an active volcano, from which great gouts of molten lava
      come surging out, cascading back down into the depths. The
      glowing rock fills the farthest reaches of the cavern with a
      blood-red glare, giving every- thing an eerie, macabre

      $ perl obliterate_second_line.pl

      $ head -5 tmp.txt
      You are on the edge of a breath-taking view. Far below you
      ************************************************************
      come surging out, cascading back down into the depths. The
      glowing rock fills the farthest reaches of the cavern with a
      blood-red glare, giving every- thing an eerie, macabre

      ...roboticus

      When your only tool is a hammer, all problems look like your thumb.

        I used software that had its "database" in a set of text files, and that is how it removed data, awaiting a repack. It would identify the record as a deleted record, record the length, then fill the rest with some padding character of some sort to remove the old data.

        --MidLifeXis

Re: Edit huge file
by zork42 (Monk) on Jun 19, 2013 at 15:44 UTC
    rjt + BrowserUk : what is the 'l' doing in your one-liners that start with:
    perl -nle"..."
    ?
    I've read the perlrun documentation but I'm not really much wiser.
      what is the 'l' doing in your one-liners that start with: perl -nle"..."

      Literally, it autochomps input and auto-appends $\ for print statements.

      Generically, it is useful when using print as it avoids having to append "\n". Also, when comparing with literals, the autochomping avoids having to add the "\n" into the literals.

      In the specific case of my one-liner in this thread, it is redundant and actually slows the solution down by ~20%. So good catch++.
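
      For reference, here is roughly what the -n and -l switches amount to when written out longhand. It is a sketch only: filter_with_l is a hypothetical name, and the literal 'keep' stands in for whatever is being compared.

```perl
use strict;
use warnings;

# Roughly what `perl -nle 'print if $_ eq "keep"'` does, written out longhand.
# Reads lines from $in and prints the matching ones to $out.
sub filter_with_l {
    my ($in, $out) = @_;

    local $\ = $/;                  # -l: print appends the input record separator

    while (my $line = <$in>) {      # -n: implicit loop over the input lines
        chomp $line;                # -l: strip the trailing newline first
        # Thanks to the chomp, the literal needs no "\n";
        # thanks to $\, neither does the print.
        print {$out} $line if $line eq 'keep';
    }
    return;
}
```

      When every line is printed unchanged, the chomp and the re-appended newline cancel out, which is why the switch was redundant in the one-liner above.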

Re: Edit huge file
by LanX (Canon) on Jun 19, 2013 at 16:50 UTC
Re: Edit huge file
by pvaldes (Chaplain) on Jun 19, 2013 at 22:34 UTC

    I am needed to edit a large, 500+ megabyte file in Perl, to remove one line near the beginning of the file, and one near the end

    (... This problem reminds me of the head and tail programs...) You could use head to obtain the line, use read to pick up just the first X characters of the file, and then overwrite this line, maybe also using seek. Just some untested ideas...
