in reply to scalable chomping

Read the file record by record by setting $/ = '+'.

Then apply regexes to each record to remove \n and replace the record separators.

Here is a one-liner as an example:

perl -l -0x2B -pe 's/\n//g;s/[XYZ]/;/g' corruptedfile > recoveredfile
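where 'X', 'Y', and 'Z' are the characters to be replaced with the record separator ';'. The -0x2B switch sets $/ to '+' (hex 2B); -l chomps that separator from each record and, because it comes before -0x2B, leaves the output record separator as a newline.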

Re^2: scalable chomping
by TGI (Parson) on Oct 29, 2008 at 17:14 UTC

    If X, Y, and Z can legitimately appear in the file, you are going to have to do more work. Keep track of the values in which you have made "fixed" substitutions, and of what the original character was. You will then have a list of 'known suspect values' as well as a way to get the original value back.

    The best approach (short of retrieval from a backup) would be to do as much parsing and sanity checking on the data as you can while you process the file. Trivial or obvious fixes can be automated, but anything questionable needs to be flagged for human intervention.

    Good luck. I think you'll need it :/.

    TGI says moo