So I have this rather large datafile. Unfortunately it somehow got corrupted. There are random newlines all over it. And what really should be the new line char is a +. What should be the record separator has turned into one of three different characters. Personally I don't believe the data is even correct, but the boss says to try and recover it anyway.

I understand you need to do what the boss says, but if the corruption is as random as it sounds, you can't reasonably expect to recover the data. The problem is the lack of a pattern: if you try to substitute out your "random" \n characters, for example, you might find that some were random insertions while others were random substitutions. And the record separator that has turned into one of three other characters: do those characters also appear legitimately anywhere in the data?

Unless you have a pattern for the corruption, trying to recover the data is basically blind luck, especially since any attempt to recover it via Perl is certainly going to rely on you applying patterns...

And the truly problematic case is when you get your data to what looks to be correct: how can you tell for sure?

I'm not trying to dump on your efforts, but I've done a lot of this sort of thing, and at some point the only reasonable course of action is to restore from backup.

Now if you have been able to determine a pattern for the corruption, then you stand a very good chance of recovery. I'm just more than a little terrified by your description.
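
One rough way to look for such a pattern before committing to any substitution is to count how often each suspect character occurs, then try a repair and check whether every record comes out with the same number of fields. The sketch below is only an illustration under stated assumptions: `+` as the stand-in newline comes from the question, but the three separator characters (`|`, `~`, `^`), the sample string, and the helper names `count_chars` and `field_histogram` are all made up. A consistent field count is a hint that the pattern holds. It does not prove the recovered data is correct.

```perl
use strict;
use warnings;

# Count how often each candidate character occurs in a chunk of text.
sub count_chars {
    my ($text, @chars) = @_;
    my %count;
    for my $ch (@chars) {
        # "= () =" forces list context so we get the number of matches
        $count{$ch} = () = $text =~ /\Q$ch\E/g;
    }
    return \%count;
}

# Trial repair: drop the literal newlines, split records on the stand-in
# record terminator, then tally how many fields each record yields when
# split on any of the candidate separators.
sub field_histogram {
    my ($text, $rec_end, @seps) = @_;
    (my $joined = $text) =~ s/\r?\n//g;
    my $sep_class = join '', map { quotemeta($_) } @seps;
    my %hist;
    for my $record (split /\Q$rec_end\E/, $joined) {
        next unless length $record;
        my @fields = split /[$sep_class]/, $record, -1;
        $hist{ scalar @fields }++;
    }
    return \%hist;
}

# Made-up sample: three records, one broken by a stray newline.
my $sample = "a|b|c+d~e\n~f+g^h^i+";
my @seps   = ('|', '~', '^');   # hypothetical separator characters

my $counts = count_chars($sample, '+', @seps);
print "$_ => $counts->{$_}\n" for sort keys %$counts;

my $hist = field_histogram($sample, '+', @seps);
print "$hist->{$_} record(s) with $_ field(s)\n" for sort { $a <=> $b } keys %$hist;
```

If the histogram shows a single dominant field count with only a few outliers, those outliers are the records to inspect by hand. If the counts are spread all over, that suggests the separator characters also occur legitimately in the data, or that the corruption really is random.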

In reply to Re: scalable chomping by mpeever
in thread scalable chomping by xorl
