http://www.perlmonks.org?node_id=805395

puterboy has asked for the wisdom of the Perl Monks concerning the following question:

I have a set of records in plaintext where each record begins with "# file" and that line doesn't appear anywhere but the start of the record. Records should be separated by a blank line (\n\n) but some records are missing the space.

I thought the following perl one-liner would work but it doesn't. What am I doing wrong???

 cat <record file> | perl -p -e "s/(?=\w)\n# file/\n\n# file/sg"

My problem is that somehow I can't seem to search across the newline which is what I thought the /s was supposed to help with

Re: Adding back missing newlines between records
by ikegami (Patriarch) on Nov 06, 2009 at 06:30 UTC

    My problem is that somehow I can't seem to search across the newline which is what I thought the /s was supposed to help with

    The s modifier makes . match every character, including the newline, which it doesn't match by default. It is useless here since your pattern doesn't contain a . at all.

    The -p causes the expression to be applied to each line of input. You're trying to match something you haven't read yet! One way of fixing this is to change the definition of "line" so that the whole file is read at once (-0777).

    Then there's the issue that /(?=\w)\n/ will never match. How can the next character be both a word character and a newline?

    perl -0777pe's/(?<!\n)\n# file/\n\n# file/g' record_file
      perl -pe 's/(?<!\n)\n# file/\n\n# file/g' record_file

      I don't see how that is supposed to work. The -p flag creates a while(<>) loop around the code specified for the -e flag(with print; as the last line in the while loop). The s/// operator in your code is going to operate on the $_ variable, and the diamond operator(<>) will assign each line in the file to $_ one line at a time.

      As far as I can tell, at some point $_ will be equal to the string "# file\n", and the previous string will have been "hello world\n" (i.e. not "\n" as desired). Your regex is looking for "\n# file" preceded by a "\n". First, because it seems to me that the diamond operator will produce the line "# file\n", your regex won't match because there is no "\n# file" in that line. Second, it looks to me like you are doing a negative lookbehind beyond the start of the string. How is that supposed to work?

        I don't see how that is supposed to work.

        It's supposed to work because, as ikegami pointed out, you use the -0777 switch to make the interpreter slurp the whole file in one go, the equivalent of undefining $/ in a script. Thus, the global replace operates on a single string which is the whole file and the while implied by -p only iterates once.

        I hope this is helpful.

        Cheers,

        JohnGG

        It's the magic -0777 option, which sets the input record separator. The digits after -0 are normally the octal code of the separator character; no character has the value 0777, so by convention that value means "no separator at all" (an undefined $/), and perl reads the whole file in one go instead of line by line.
        The -p flag creates a while(<>) loop around the code specified for the -e flag(with print; as the last line in the while loop)

        Actually, that's not quite accurate. According to what I read, the while loop looks like this:

        LINE: while (<>) {
            # your code goes here
        } continue {
            print or die "-p destination: $!\n";
        }

        A continue block gets executed the instant before the loop condition is evaluated. So 'redo' does not cause the continue block to execute, but 'next' does, and a normal iteration of the loop causes the continue block to execute as well.

        This works for me:

        perl -pe 'if($_ eq "\n"){$n=1;next;} if($n){$n=0;next;}else{s/# file/\n# file/;}' data1.txt
      Thanks for the code and the helpful explanation. I have read 'man perlre' many times but as you pointed out I missed several points there. Thanks for the clarification.