PerlMonks  

scalable chomping

by xorl (Deacon)
on Oct 29, 2008 at 13:56 UTC ( [id://720227] )

xorl has asked for the wisdom of the Perl Monks concerning the following question:

So I have this rather large datafile. Unfortunately it somehow got corrupted. There are random newlines all over it. And what really should be the new line char is a +. What should be the record separator has turned into one of three different characters. Personally I don't believe the data is even correct, but the boss says to try and recover it anyway.

So I was thinking of doing something like this: open the file, loop through it chomping out the newlines, and stuff it all into a variable. Then do a regex (or probably more than one) on that variable to replace the chars. Then finally write that variable out to the output file.

The thing is we're looking at a pretty large file and stuffing that much data into a single variable seems like a pretty good way of crashing my box.
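
Something like this untested sketch is what I had in mind (file names made up):

    # read the whole file into one scalar -- this is the part I'm worried about
    open(my $in, '<', 'corrupted.dat') or die "open: $!";
    my $data = do { local $/; <$in> };   # slurp mode
    close $in;
    $data =~ s/\n//g;                    # chomp out the stray newlines
    $data =~ tr/+/\n/;                   # '+' was really the newline char
    open(my $out, '>', 'recovered.dat') or die "open: $!";
    print $out $data;
    close $out;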

So are there any better ways of doing this?

Thanks in advance.

Re: scalable chomping
by ccn (Vicar) on Oct 29, 2008 at 14:02 UTC

    Read the file line by line using $/ = '+'. Then apply regexps to each "line" to remove \n and replace the record separators.

    Here is a one-liner as an example:

    perl -l -0x2B -pe 's/\n//g;s/[XYZ]/;/g' corruptedfile > recoveredfile
    where 'X', 'Y', and 'Z' are the characters to be replaced with the record separator ';'.
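
    The same idea in script form (a sketch; it reads STDIN or the files on the command line, writes STDOUT, and 'X', 'Y', 'Z' again stand in for the actual corrupted separators):

    local $/ = '+';              # read "lines" terminated by '+'
    while (my $rec = <>) {
        chomp $rec;              # strip the trailing '+'
        $rec =~ s/\n//g;         # remove the stray newlines
        $rec =~ s/[XYZ]/;/g;     # restore the record separator
        print $rec, "\n";        # '+' really meant newline
    }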

      If X, Y, and Z can legitimately appear in the file, you are going to have to do more work. Keep track of the records in which you have "fixed" substitutions, and what the original character was. You will then have a list of 'known suspect values' as well as a way to get back to the original value.

      The best approach (short of retrieval from a backup) would be to do as much parsing and sanity checking on the data as you can while you process the file. Trivial/obvious fixes can be automated, but anything questionable needs to be flagged for human intervention.
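
      A rough sketch of that kind of audit trail (hypothetical throughout -- the [XYZ] class and the log format are stand-ins):

      my %suspect;                       # record number => original chars
      my $recno = 0;
      while (my $rec = <>) {
          $recno++;
          while ($rec =~ /([XYZ])/g) {   # note every questionable character
              push @{ $suspect{$recno} }, $1;
          }
          $rec =~ s/[XYZ]/;/g;           # the automated "fix"
          print $rec;
      }
      # dump the list of known suspect records for human review
      for my $n (sort { $a <=> $b } keys %suspect) {
          warn "record $n: replaced @{ $suspect{$n} }\n";
      }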

      Good luck. I think you'll need it :/.


      TGI says moo

Re: scalable chomping
by mpeever (Friar) on Oct 29, 2008 at 14:14 UTC
    So I have this rather large datafile. Unfortunately it somehow got corrupted. There are random newlines all over it. And what really should be the new line char is a +. What should be the record separator has turned into one of three different characters. Personally I don't believe the data is even correct, but the boss says to try and recover it anyway.

    I understand you need to do what the boss says, but if the corruption is as random as it sounds, you can't reasonably expect to recover the data. The problem is a lack of pattern: if you try to sub out your "random" \n characters, for example, you might find that some were random insertions while others were random substitutions. And the record separator that's turned into one of three other characters: do those characters legitimately appear anywhere else?

    Unless you have a pattern for the corruption, trying to recover it is basically blind luck. Especially since attempting to recover it via Perl is certainly going to rely on you applying patterns...

    And the truly problematic case is when you get your data to what looks to be correct: how can you tell for sure?

    I'm not trying to dump on your efforts, but I've done a lot of this sort of thing, and at some point the only reasonable course of action is to restore from backup.

    Now if you have been able to determine a pattern for the corruption, then you stand a very good chance of recovery. I'm just more than a little terrified by your description.

Re: scalable chomping
by NiJo (Friar) on Oct 29, 2008 at 18:45 UTC
    sed 's/X/Y/g' data.txt | sed 's/a/b/g' > out.txt
    should be scalable in both programming and execution time. I've not seen a benchmark shootout on substitutions...
Re: scalable chomping
by picabotwo (Initiate) on Oct 29, 2008 at 20:58 UTC
    You are going to have to open the file and process it line by line.

        my $input  = "oldlog";
        my $output = "newlog";
        open(INPUT,  "<", $input)  || die "$!\n";
        open(OUTPUT, ">", $output) || die "$!\n";
        while (my $line = <INPUT>) {
            chomp($line);   # remove newlines, or apply whatever regex you need
            # do the line-by-line processing here
            print OUTPUT $line;
        }
        close INPUT;
        close OUTPUT;

    This will open your log file, do something to it line by line, and dump the results into another log file. Note the while loop: a foreach over <INPUT> would read the whole file into memory first. The regex stuff is the hard part.
Re: scalable chomping
by brycen (Monk) on Oct 30, 2008 at 23:37 UTC
    You're right about crashing: you need to process huge files in chunks. If + really is the "end of line" character, it's easy to change the input record separator ($/) in Perl. If it's more complex, you have to pick some other line-end character, or use sysread() to read the file in fixed-size chunks.
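
    For the chunked route, a sketch using sysread() with a carry-over buffer (chunk size and file name are made up):

    open(my $in, '<', 'corrupted.dat') or die "open: $!";
    my $buf = '';
    while (sysread($in, my $chunk, 64 * 1024)) {
        $buf .= $chunk;
        # peel off complete '+'-terminated records; keep any partial tail
        while ($buf =~ s/^(.*?)\+//s) {
            my $rec = $1;
            $rec =~ s/\n//g;             # drop the stray newlines
            print $rec, "\n";
        }
    }
    print "$buf\n" if length $buf;       # whatever was left at EOF
    close $in;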
