http://www.perlmonks.org?node_id=684949

Tanoti has asked for the wisdom of the Perl Monks concerning the following question:

I have some text from a third-party app which I'm storing in a single variable that needs parsing. I am using split /\n/, $app_text; to break it into lines for processing. I'm looking for "Field:Value" lines and ignoring everything else. For most of the text this is fine; however, the external app is borking some of the fields and putting a \n after the colon, meaning the Value for that field ends up in the next array element. Here's some sample code:

#!/usr/bin/perl
use strict;

my $app_text = "one:partridge\ntwo:\nturtle doves\nthree:french hens\n";

foreach my $line (split /\n/, $app_text) {
    print "$line\n";
}

Produces:
one:partridge
two:
turtle doves
three:french hens

How can I tell split to split on the \n but not if it is preceded by a colon, so that I get two:turtle doves for the second array element in the above example?

Many thanks,
John

Replies are listed 'Best First'.
Re: Pattern match for split() - need match but not match syntax
by citromatik (Curate) on May 06, 2008 at 14:11 UTC

    You are almost there; include the condition inside the split pattern:

    use strict;

    my $app_text = "one:partridge\ntwo:\nturtle doves\nthree:french hens\n";

    foreach my $line (split /(?<!:)\n/, $app_text) {
        $line =~ s/\n//g;    # Eliminate internal "\n"s
        print "$line\n";
    }

    Outputs

    one:partridge
    two:turtle doves
    three:french hens

    Update: Corrected the split pattern to use lookbehinds, see perlre

    citromatik
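
For readers new to the construct: (?<!:) is a zero-width negative lookbehind, so the pattern matches a newline only when the character before it is not a colon. The following is a minimal sketch, not part of the original reply, that compares the lookbehind split with one possible alternative that rejoins the broken fields before splitting. The helper names split_with_lookbehind and repair_then_split are hypothetical, and the input is the sample text from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: split with the negative lookbehind, then drop the
# newline that is still embedded after the colon in broken fields.
sub split_with_lookbehind {
    my ($text) = @_;
    my @lines = split /(?<!:)\n/, $text;
    s/:\n/:/ for @lines;
    return @lines;
}

# Hypothetical alternative: repair the broken fields first, then split on
# every remaining newline.
sub repair_then_split {
    my ($text) = @_;
    $text =~ s/:\n/:/g;
    return split /\n/, $text;
}

my $app_text = "one:partridge\ntwo:\nturtle doves\nthree:french hens\n";
print "$_\n" for split_with_lookbehind($app_text);
```

On the sample text both helpers return the same three elements. Which one reads better is a matter of taste; the second avoids the lookbehind entirely at the cost of modifying a copy of the text first.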

      Thanks, that works a treat and has saved a lot of case-specific workaround code. I had played with the lookbehind syntax but couldn't get it to work, so I thought I was on the wrong track!

      John
Re: Pattern match for split() - need match but not match syntax
by Narveson (Chaplain) on May 06, 2008 at 14:59 UTC

    As long as you have to think about regexes anyway, you can use a regex that parses your text at the same time that it's splitting it.

    my $LINE_PATTERN = qr{
        ([^:]+)     # capture everything before ...
        :\s*        # the colon and any newline or other whitespace,
        ([^\n]+)    # then capture everything before
        \n          # the next newline
    }msx;

    my $app_text = "one:partridge\ntwo:\nturtle doves\nthree:french hens\n";

    while ($app_text =~ /$LINE_PATTERN/g) {
        print "$1: $2\n";
    }

    If you were planning to put the fields in a hash, you can do it all at once:

    my %value_of = $app_text =~ /$LINE_PATTERN/g;

    while (my ($field, $value) = each %value_of) {
        print "$field: $value\n";
    }
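
One caveat worth illustrating: the question mentions ignoring lines that are not "Field:Value" pairs, and a field-name class of [^:]+ can run across newlines, so a stray line would be absorbed into the next field name. The sketch below is an assumed variant, not the pattern from the reply: it anchors each match to the start of a line and excludes newlines from the field name. The name $FIELD_PATTERN and the "-- report --" line are hypothetical.

```perl
use strict;
use warnings;

# Assumed variant of the pattern above: anchored to line starts, and the
# field name may not contain a newline, so unrelated lines are skipped.
my $FIELD_PATTERN = qr{
    ^ ([^:\n]+)   # field name, from the start of a line up to the colon
    : \n?         # the colon, plus the stray newline if present
    ([^\n]+)      # value, the rest of that line
}mx;

# Hypothetical input with a line that is not a Field:Value pair.
my $app_text = "-- report --\none:partridge\ntwo:\nturtle doves\n";

# In list context, a /g match returns every captured pair as a flat list.
my %value_of = $app_text =~ /$FIELD_PATTERN/g;
print "$_: $value_of{$_}\n" for sort keys %value_of;
```

As with any hash-based approach, a repeated field name would keep only its last value, and the original field order is not preserved.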

      While there's nothing wrong with your $LINE_PATTERN regex, I think it would be simpler to keep the record and field/value processing separate. To my eye it looks tidier and easier to maintain, but others may disagree.

      use strict;
      use warnings;
      use Data::Dumper;

      my $app_text = qq{one:partridge\ntwo:\nturtle doves\nthree:french hens\n};

      my %fvPairs =
          map { split m{:\n?} }        # field/value split within each record
          map { split m{(?<!:)\n} }    # record split, keeping "field:\nvalue" together
          $app_text;

      print Data::Dumper->Dumpxs( [ \ %fvPairs ], [ q{*fvPairs} ] );

      produces ...

      %fvPairs = (
                   'three' => 'french hens',
                   'one' => 'partridge',
                   'two' => 'turtle doves'
                 );

      Cheers,

      JohnGG
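
A possible refinement of the two-stage split, offered as a sketch rather than as part of the original reply: if a value can itself contain a colon, splitting on every colon yields more than two pieces per record and throws the key/value pairing out of step. Passing a limit of 2 to split keeps everything after the first colon together. The "time" field below is a hypothetical input chosen to show this.

```perl
use strict;
use warnings;

# Hypothetical input: the "time" value itself contains a colon.
my $app_text = qq{one:partridge\ntime:\n12:30\n};

my %fvPairs =
    map { split m{:\n?}, $_, 2 }    # at most two pieces: field, value
    map { split m{(?<!:)\n} }       # records, split as in the reply above
    $app_text;

print "$_ => $fvPairs{$_}\n" for sort keys %fvPairs;
```

Without the limit, the second record would split into three pieces ("time", "12", "30") and the resulting list would no longer form clean pairs.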

Re: Pattern match for split() - need match but not match syntax
by GrandFather (Saint) on May 06, 2008 at 22:23 UTC

    Would you perhaps be better off using Text::xSV or Text::CSV to do the parsing for you?


    Perl is environmentally friendly - it saves trees