PerlMonks  

Iterator to parse multiline string with \\n terminator

by three18ti (Scribe)
on Oct 06, 2013 at 05:46 UTC ( #1057121=perlquestion )
three18ti has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I would like to parse a file that has single lines split into multiple lines by terminating the line with a "\".

I basically want to read a file line by line and grab all continued lines at once.

I was thinking about something along the lines of:

#!/usr/bin/perl
use strict;
use warnings;
use IO::File;
use 5.010;

my $filename = shift @ARGV;
my $fh = IO::File->new($filename, 'r');

my $fh_iterator = sub {
    my $fh = shift;
    my $line = $fh->getline;
};

while (my $line = $fh_iterator->($fh)) {
    # do stuff to $line
}

Where I get stuck is the logic to grab the next line if the line is terminated with a /\\\n/... My first thought was some kind of recursion:

$line .= $fh_iterator->($fh) if $line =~ /\\\n/

(Maybe I should be using a named sub instead of a closure. Eventually I would like to include this as part of an object, but I'm trying to work out the parsing logic first.)

I could just be overthinking the problem...

Thanks for your thoughts

Edit: After some further thinking, I tried using a named sub instead of a closure:

sub fh_iterator {
    my $fh = shift;
    my $line = $fh->getline;
    $line .= fh_iterator($fh) if $line =~ /\\\n/;
    return $line;
}

But each time fh_iterator is called, it's going to clobber $line... So this doesn't do what I'd expect. I'd like to preserve the \ but should probably chomp the $fh->getline somehow.

Re: Iterator to parse multiline string with \\n terminator
by Athanasius (Monsignor) on Oct 06, 2013 at 05:56 UTC

    No need for recursion. Just change the anonymous sub as follows (untested):

    my $fh_iterator = sub {
        my $fh = shift;
        my $line = $fh->getline();
        $line .= $fh->getline() while $line =~ m{\\$};
        return $line;
    };

    Update: Here is a tested script which eliminates the Use of uninitialized value warning reported in the post below:

    use strict;
    use warnings;
    use IO::File;

    my $filename = shift @ARGV;
    my $fh = IO::File->new($filename, 'r');

    sub fh_iterator {
        my $fh = shift;
        my $line = $fh->getline();
        if (defined $line) {
            $line .= $fh->getline() while $line =~ m{\\$};
        }
        return $line;
    }

    while (my $line = fh_iterator($fh)) {
        print $line;
    }

    Output:

    16:29 >perl 738_SoPW.pl test.file
    foo \
    bar \
    baz
    single line

    16:29 >

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      EDIT: Oh, I'm an idiot... I forgot to change:
      $line .= $fh_iterator while $line =~ m{\\$};

      To:

      $line .= $fh->getline while $line =~ m{\\$};

      However, the code does return the error below on the last line of my test file...

      Use of uninitialized value $line in pattern match (m//) at parser3.pl line 17, <GEN0> line 5.

      Begin Original Post

      Hmm... well, I get a new error at least:

      Use of uninitialized value $line in pattern match (m//) at parser3.pl line 17, <GEN0> line 5.

      Here's the accompanying code and test file:

      #!/usr/bin/perl
      use strict;
      use warnings;
      use IO::File;
      use 5.010;

      my $filename = shift @ARGV;
      my $fh = IO::File->new($filename, 'r');

      sub fh_iterator {
          my $fh = shift;
          my $line = $fh->getline;
          $line .= fh_iterator($fh) while $line =~ m{\\$};
      }

      while (my $line = fh_iterator $fh ) {
          print $line;
      }
      __END__

      test.file

      foo \
      bar \
      baz
      single line

      WRT your edit

      What do you think about a return instead of an if? e.g.:

      sub fh_iterator {
          my $fh = shift;
          my $line = $fh->getline();
          return $line unless $line;
          $line .= $fh->getline() while $line =~ m{\\$};
          return $line;
      }

      I don't think there's any functional difference, but one may be more readable than the other...

      Thanks for your help!

        return $line unless $line;

        Make that:

        return $line unless defined $line;

        Theoretically, the last line could be missing its newline and contain only "0". Then $line would be false, and your code would silently ignore it.
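
To make the point above concrete, here is a minimal sketch, not taken from any of the posts. It reads the same data twice, once stopping on a false line and once stopping on an undefined line. The sample text, the in-memory filehandle, and the helper name read_all are assumptions made purely for illustration.

```perl
use strict;
use warnings;

# Hypothetical in-memory "file": the last line is "0" with no trailing newline.
my $data = "foo \\\nbar\n0";

# Hypothetical helper: collect logical lines, ending the loop either on a
# false value ($stop_on_false true) or only on an undefined value.
sub read_all {
    my ($stop_on_false) = @_;
    open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
    my @lines;
    while (1) {
        my $line = <$fh>;
        last if $stop_on_false ? !$line : !defined $line;
        # Append continuation lines while the line ends in a backslash.
        while ($line =~ m{\\$}) {
            my $next = <$fh>;
            last unless defined $next;   # file ended in mid-continuation
            $line .= $next;
        }
        push @lines, $line;
    }
    close $fh;
    return @lines;
}

my @by_truth   = read_all(1);   # the final "0" is treated as end of input
my @by_defined = read_all(0);   # the final "0" is returned as a line

print scalar(@by_truth), " vs ", scalar(@by_defined), " lines\n";
```

The inner loop also checks that the continuation read succeeded, so a file whose last line ends in a backslash does not loop forever.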

        ++tinita for highlighting the important difference between testing for definedness and testing for truth (see perlsyn#Truth-and-Falsehood).

        But even with the correction I prefer my version. Readability is in the eye of the programmer, but the first high-level programming subject I took at Uni (in Pascal!) emphasised structured programming, and this has remained with me. I prefer a function to have a single exit point (at the end) where possible. In Perl this is not always optimum, so I’ve had to learn to be flexible. But when — as in this case — the structured version is as straightforward as the non-structured one, I prefer the former. YMMV.

        As always in Perl, TMTOWTDI.

        Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: Iterator to parse multiline string with \\n terminator
by kcott (Abbot) on Oct 06, 2013 at 06:34 UTC

    G'day three18ti,

    In the absence of seeing a context requiring anything more complex, I'd probably code something along these lines:

    #!/usr/bin/env perl
    use strict;
    use warnings;

    my $re = qr{^(.*)(?<![\\])[\\]\n$};
    my $line = '';

    while (<DATA>) {
        if (/$re/) {
            $line .= $1;
            next;
        }
        $line .= $_;
        print $line;
        $line = '';
    }

    __DATA__
    Line 1 Part A \
    Line 1 Part B \
    Line 1 Part C
    Line 2 ALL
    Line 3 Part X \
    Line 3 Part Y
    Line 4 END WITH BACKSLASH \\
    Line 5 LAST Z

    Output:

    Line 1 Part A Line 1 Part B Line 1 Part C
    Line 2 ALL
    Line 3 Part X Line 3 Part Y
    Line 4 END WITH BACKSLASH \\
    Line 5 LAST Z

    That code could easily be adapted for an iterator if one is required for your application.
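
One possible shape for such an adaptation is sketched below. It is an illustration rather than kcott's own code: the name make_line_iterator, the in-memory filehandle, and the sample text are assumptions. The regex is the one from the loop above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a closure that yields one logical line per call.
sub make_line_iterator {
    my ($fh) = @_;
    # A single trailing backslash (not preceded by another) marks a continuation.
    my $re = qr{^(.*)(?<![\\])[\\]\n$};
    return sub {
        my $line = '';
        while (defined(my $chunk = <$fh>)) {
            if ($chunk =~ $re) {
                $line .= $1;        # keep the text before the backslash
                next;               # go and fetch the continuation
            }
            return $line . $chunk;  # the logical line is complete
        }
        # End of file: hand back a dangling continuation if any, else undef.
        return length($line) ? $line : undef;
    };
}

# Usage, with an in-memory filehandle standing in for a real file.
my $text = "Part A \\\nPart B\nSingle\nEnds with two \\\\\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my $next_line = make_line_iterator($fh);

my @records;
while (defined(my $record = $next_line->())) {
    push @records, $record;
}
print @records;
```

The calling loop tests with defined so that a record consisting of just "0" is not mistaken for end of input.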

    If you're not familiar with negative look-behind assertions ((?<!pattern)), they're documented under Look-Around Assertions in "perlre: Extended Patterns".

    -- Ken

      Neat! Thanks for the link.

      I've been reading Higher Order Perl and was just reading the chapter on Lexers where MJD makes use of look-behind assertions. This actually helps make more sense of what I was reading.

      What is the difference between next and redo in this context? A user below had a similar solution but used redo instead of next.

        The difference is that redo does not re-evaluate the loop condition (in this case: "(<DATA>)", which fetches the next line) before evaluating the loop body again, whereas next does.

        This is why in jwkrahn's solution, the next line is fetched manually before calling redo:

        $_ .= <$fh>;

        The advantage of jwkrahn's solution with redo, is that the implicit variable $_ can be used to store the complete multiline record.

        The advantage of kcott's solution with next is that there is only one place where the <> operator for fetching the next line is used (inside the loop condition). However, re-evaluating the loop condition also resets $_, so in this case a custom variable needs to be declared above the loop to store the current record.
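
The two styles can be compared side by side on the same input. The sketch below is an illustration only: the sample text, the in-memory filehandle, and the variable names are assumptions, and the eof guard in the redo variant is an addition that the original snippet does not have.

```perl
use strict;
use warnings;

my $text = "a \\\nb\nc\n";   # hypothetical sample: "a \" continues onto "b"

# Variant 1: next. The loop condition runs again and overwrites $_,
# so the record is accumulated in a separate variable.
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my @with_next;
my $record = '';
while (<$fh>) {
    chomp;
    if (s/\\$//) {
        $record .= $_;
        next;                 # re-evaluates <$fh>, replacing $_
    }
    push @with_next, $record . $_;
    $record = '';
}
close $fh;

# Variant 2: redo. The loop condition is skipped, so $_ survives
# and the following line has to be read by hand.
open $fh, '<', \$text or die "Cannot open in-memory file: $!";
my @with_redo;
while (<$fh>) {
    chomp;
    if (s/\\$// && !eof($fh)) {
        $_ .= <$fh>;          # manual read; $_ keeps growing
        redo;                 # restart the body without touching $_
    }
    push @with_redo, $_;
}
close $fh;

print "$_\n" for @with_next, @with_redo;
```

Both variants collect the same two records here; they differ only in where the partial record lives while it is being assembled.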

Re: Iterator to parse multiline string with \\n terminator
by CountZero (Bishop) on Oct 06, 2013 at 07:42 UTC
    If your file is not terribly huge and/or you have enough memory, this is an alternative solution:
    use Modern::Perl;

    my $file;
    {
        local $/;    # undef enables slurp mode ('' would be paragraph mode)
        $file = <DATA>;
    }
    $file =~ s/\\\n/ /gs;
    my @lines = split /\n/, $file;
    say for @lines;

    __DATA__
    First line
    Second line (part1)\
    Second line (first continuation)\
    Second line (second continuation)
    Third line (part1)\
    Third line (first continuation)\
    Third line (second continuation)
    Fourth line (part1)\
    Fourth line (first continuation)\
    Fourth line (second continuation)
    Output:
    First line
    Second line (part1) Second line (first continuation) Second line (second continuation)
    Third line (part1) Third line (first continuation) Third line (second continuation)
    Fourth line (part1) Fourth line (first continuation) Fourth line (second continuation)

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics

      Thanks!

      At the moment I'm only parsing a file that's a few megs / a few thousand lines. However, I may end up parsing much larger files, so memory could become a problem in the future.

Re: Iterator to parse multiline string with \\n terminator
by jwkrahn (Monsignor) on Oct 06, 2013 at 08:26 UTC
    open my $fh, '<', $filename or die "Cannot open '$filename' because: $!";

    while ( <$fh> ) {
        chomp;
        if ( s/\\$// ) {
            $_ .= <$fh>;
            redo;
        }
        # now complete line in $_
    }
      What's the difference between using redo and using next as a similar example above does?

        next goes back to:

        while ( <$fh> ) {

        And therefore reads the next line into $_ while redo goes back to the line after that leaving $_ unchanged.

Re: Iterator to parse multiline string with \\n terminator
by Laurent_R (Parson) on Oct 06, 2013 at 08:58 UTC

    Hi,

    as a side note, your anonymous function does not really act as a closure:

    my $fh = IO::File->new($filename, 'r');

    my $fh_iterator = sub {
        my $fh = shift;
        my $line = $fh->getline;
    };

    while (my $line = $fh_iterator->($fh)) {
        # do stuff to $line
    }

    because you pass $fh to the sub on every call (and you have to, since the sub gets a fresh copy of $fh each time it is called).

    If you want it to act as a closure, you may do something like this (untested):

    sub create_iterator {
        my $filename = shift;
        my $fh = IO::File->new($filename, 'r');
        return sub { my $line = $fh->getline; };
    }

    my $fh_iterator = create_iterator($filename);
    while (my $line = $fh_iterator->()) {
        # do stuff to $line
    }

    Now $fh really is a persistent variable within the sub's scope, so this is a real closure.

      What exactly makes it not a closure? Is it that I'm passing a variable each time? If it was a new copy of $fh, wouldn't $fh_iterator->($fh) always return the same line (since it's creating a copy of the $fh object, on next passing it would be a copy of the original)?

      Thanks for setting me straight, I always like learning new things.

        $fh is a file handle, i.e. it is actually an iterator on a file, so that each time you read from $fh, you get the next line. In your sub, my $fh = shift; creates a new copy of $fh each time the sub is called. It still works because $fh "knows" which is the next line to read from the file. But your anonymous sub is not a closure. The alternative code I wrote keeps its own copy of $fh: my anonymous sub actually closes on $fh. Please note that an anonymous function is not necessarily a closure, and a closure does not necessarily have to be anonymous (although it is often the case).

        You might want to have a look at this: Closure on Closures.
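
A small sketch may help show what "closes on $fh" means in practice. It is an illustration only: it takes a string and opens an in-memory filehandle instead of using IO::File with a file name, and the sample data is hypothetical. Two iterators are created, and each one remembers its own handle without being passed anything.

```perl
use strict;
use warnings;

# Hypothetical factory: each returned sub closes over its own $fh.
sub create_iterator {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    return sub { return scalar <$fh> };
}

my $first  = create_iterator("a1\na2\n");
my $second = create_iterator("b1\nb2\n");

# Neither call passes a handle; each sub remembers the one it captured.
my @seen = ($first->(), $second->(), $first->(), $second->());
print @seen;
```

Interleaving the calls shows that the two subs keep separate read positions, which is the persistent per-closure state described above.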

Re: Iterator to parse multiline string with \\n terminator
by Lennotoecom (Pilgrim) on Oct 06, 2013 at 12:55 UTC
    while (<DATA>) {
        s/[\\\n]//g;
        $line .= $_;
    }
    print $line;
    __DATA__
    fist line\
    second line\
    third line\
    fourth
    no?

      Not quite. Putting \\\n into a character class causes every occurrence of either a backslash or a newline to be removed. This results in a file consisting of just a single line, which is not what is wanted. Remove the character class:

      #! perl
      use strict;
      use warnings;

      my $file = '';
      while (<DATA>) {
          s{\\\n}{};
          $file .= $_;
      }
      print $file;

      __DATA__
      first line
      second line \
      third line \
      fourth line
      fifth line

      Output:

      13:10 >perl 738a_SoPW.pl
      first line
      second line third line fourth line
      fifth line

      13:13 >

      Note that the substitution operates on each line of input in turn, and a single input line can contain no more than one backslash-newline sequence. So, the /g modifier is not needed.
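
The difference between the character class and the two-character sequence can be checked directly. The sketch below is an illustration with a hypothetical sample string. Because it works on the whole text at once rather than line by line, it does use /g.

```perl
use strict;
use warnings;

my $input = "one \\\ntwo\nthree\n";   # hypothetical sample text

# Character class: deletes every backslash and every newline individually.
(my $with_class = $input) =~ s/[\\\n]//g;

# Two-character sequence: deletes only a backslash directly followed by a newline.
(my $with_sequence = $input) =~ s/\\\n//g;

print "class:    [$with_class]\n";
print "sequence: [$with_sequence]\n";
```

With the character class everything collapses onto one line, while the sequence form joins only the continued line and leaves the other newlines in place.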

      Hope that helps,

      Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

        My brackets were just a typo.

Node Type: perlquestion [id://1057121]
Approved by Athanasius
Front-paged by mtmcc