PerlMonks  

scalable chomping

by xorl (Deacon)
on Oct 29, 2008 at 13:56 UTC ( [id://720227] )

xorl has asked for the wisdom of the Perl Monks concerning the following question:

So I have this rather large datafile. Unfortunately it somehow got corrupted. There are random newlines all over it. And what really should be the new line char is a +. What should be the record separator has turned into one of three different characters. Personally I don't believe the data is even correct, but the boss says to try and recover it anyway.

So I was thinking of doing something like this: open the file, loop through it chomping out the newlines, and stuff it all into a variable. Then do a regex (or probably more than one) on that variable to replace the chars. Then finally write that variable out to the output file.

The thing is we're looking at a pretty large file and stuffing that much data into a single variable seems like a pretty good way of crashing my box.
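
Something like this untested sketch is what I had in mind (file names made up):

    # read the whole file into one scalar -- this is the part I'm worried about
    open(my $in, '<', 'corrupted.dat') or die "open: $!";
    my $data = do { local $/; <$in> };   # slurp mode
    close $in;
    $data =~ s/\n//g;                    # chomp out the stray newlines
    $data =~ tr/+/\n/;                   # '+' was really the newline char
    open(my $out, '>', 'recovered.dat') or die "open: $!";
    print $out $data;
    close $out;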

So are there any better ways of doing this?

Thanks in advance.

Re: scalable chomping
by ccn (Vicar) on Oct 29, 2008 at 14:02 UTC

    Read the file line by line using $/ = '+'. Then apply regexps to each "line" to remove \n and replace the record separators.

    Here is a one-liner as an example:

    perl -l -0x2B -pe 's/\n//g;s/[XYZ]/;/g' corruptedfile > recoveredfile
    where 'X', 'Y', and 'Z' are the characters to be replaced with the record separator ';'.
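
    The same idea in script form (a sketch; it reads STDIN or the files on the command line, writes STDOUT, and 'X', 'Y', 'Z' again stand in for the actual corrupted separators):

    local $/ = '+';              # read "lines" terminated by '+'
    while (my $rec = <>) {
        chomp $rec;              # strip the trailing '+'
        $rec =~ s/\n//g;         # remove the stray newlines
        $rec =~ s/[XYZ]/;/g;     # restore the record separator
        print $rec, "\n";        # '+' really meant newline
    }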

      If X, Y, and Z can legitimately appear in the file, you are going to have to do more work. Keep track of the records in which you have "fixed" substitutions, and what the original character was. You will then have a list of 'known suspect values' as well as a way to get back to the original value.

      The best approach (short of retrieval from a backup) would be to do as much parsing and sanity checking on the data as you can while you process the file. Trivial/obvious fixes can be automated, but anything questionable needs to be flagged for human intervention.
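
      A rough sketch of that kind of audit trail (hypothetical throughout -- the [XYZ] class and the log format are stand-ins):

      my %suspect;                       # record number => original chars
      my $recno = 0;
      while (my $rec = <>) {
          $recno++;
          while ($rec =~ /([XYZ])/g) {   # note every questionable character
              push @{ $suspect{$recno} }, $1;
          }
          $rec =~ s/[XYZ]/;/g;           # the automated "fix"
          print $rec;
      }
      # dump the list of known suspect records for human review
      for my $n (sort { $a <=> $b } keys %suspect) {
          warn "record $n: replaced @{ $suspect{$n} }\n";
      }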

      Good luck. I think you'll need it :/.


      TGI says moo

Re: scalable chomping
by mpeever (Friar) on Oct 29, 2008 at 14:14 UTC
    So I have this rather large datafile. Unfortunately it somehow got corrupted. There are random newlines all over it. And what really should be the new line char is a +. What should be the record separator has turned into one of three different characters. Personally I don't believe the data is even correct, but the boss says to try and recover it anyway.

    I understand you need to do what the boss says, but if the corruption is as random as it sounds, you can't reasonably expect to recover the data. The problem is a lack of pattern: if you try to sub out your "random" \n characters, for example, you might find that some were random insertions while others were random substitutions. And the record separator that's turned into one of three other characters: do those characters legitimately appear anywhere else?

    Unless you have a pattern for the corruption, trying to recover it is basically blind luck. Especially since attempting to recover it via Perl is certainly going to rely on you applying patterns...

    And the truly problematic case is when you get your data to what looks to be correct: how can you tell for sure?

    I'm not trying to dump on your efforts, but I've done a lot of this sort of thing, and at some point the only reasonable course of action is to restore from backup.

    Now if you have been able to determine a pattern for the corruption, then you stand a very good chance of recovery. I'm just more than a little terrified by your description.

Re: scalable chomping
by NiJo (Friar) on Oct 29, 2008 at 18:45 UTC
    sed 's/X/Y/g' data.txt | sed 's/a/b/g' > out.txt
    should be scalable in both programming and execution time. I've not seen a benchmark shootout on substitutions...
Re: scalable chomping
by picabotwo (Initiate) on Oct 29, 2008 at 20:58 UTC
    You are going to have to open the file and process it line by line.

        my $input  = "oldlog";
        my $output = "newlog";
        open(INPUT,  "<", $input)  || die "$!\n";
        open(OUTPUT, ">", $output) || die "$!\n";
        while (my $line = <INPUT>) {
            chomp($line);   # remove newlines, or apply whatever regex you need
            # do the line-by-line processing here
            print OUTPUT $line;
        }
        close INPUT;
        close OUTPUT;

    This will open your log file, do something to it line by line, and dump the results into another log file. Note the while loop: a foreach over <INPUT> would read the whole file into memory first. The regex stuff is the hard part.
Re: scalable chomping
by brycen (Monk) on Oct 30, 2008 at 23:37 UTC
    You're right about crashing: you need to process huge files in chunks. If + really is the "end of line" character, it's easy to change the input record separator ($/) in Perl. If it's more complex, you have to pick some other line-end character, or use sysread() to read the file in fixed-size chunks.
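
    For the chunked route, a sketch using sysread() with a carry-over buffer (chunk size and file name are made up):

    open(my $in, '<', 'corrupted.dat') or die "open: $!";
    my $buf = '';
    while (sysread($in, my $chunk, 64 * 1024)) {
        $buf .= $chunk;
        # peel off complete '+'-terminated records; keep any partial tail
        while ($buf =~ s/^(.*?)\+//s) {
            my $rec = $1;
            $rec =~ s/\n//g;             # drop the stray newlines
            print $rec, "\n";
        }
    }
    print "$buf\n" if length $buf;       # whatever was left at EOF
    close $in;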
