How to improve regex for parsing equals delimited data

by Lotus1 (Chaplain)
on May 11, 2012 at 02:44 UTC (#969892)
Lotus1 has asked for the wisdom of the Perl Monks concerning the following question:

I found out that one of my programs wasn't parsing the field values correctly because of unexpected whitespace inside the field values. What I came up with works but seems cumbersome. I used a positive lookahead after each of the first three fields.

Any suggestions for how to improve this or a better approach?

#!/usr/bin/perl
use warnings;
use strict;

while(<DATA>) {
    chomp;
    # original regex
    #if( /[A-Za-z]+\s*=\s*(\S*)\s*[A-Za-z]+\s*=\s*(\S*)\s*[A-Za-z]+\s*=\s*(\S*)\s*[A-Za-z]+\s*=\s*(\S*)/i ) {

    # Now trying to allow spaces in the fields without gobbling the field names (which never have spaces)
    # This is getting cumbersome quickly.
    if( /^[A-Za-z]+\s*=\s*(.+?)(?=\s*[A-Za-z]+\s*=)\s*[A-Za-z]+\s*=\s*(.+?)(?=\s*[A-Za-z]+\s*=)\s*[A-Za-z]+\s*=\s*(.+?)(?=\s*[A-Za-z]+\s*=)\s*[A-Za-z]+\s*=\s*(.+?)\s*$/ ) {
        print "$1,$2,$3,$4\n";
    }
    else {
        print ",,,,not recognized: $_\n";
    }
}
__DATA__
FIELDA = ONEAL FIELDB = RELAY FIELDC = L1208 FIELDD = ALTS
FIELDA = OSSIPEE FIELDB = DISC FIELDC = SOH: 169879251 FIELDD = DISC
FIELDA = OSSIPEE FIELDB = RELAY FIELDC = L1201 FIELDD = ALTS
FIELDA = OSSIPEE FIELDB = RELAY FIELDC = L1203 FIELDD = ALTS

Here is the desired output.

ONEAL,RELAY,L1208,ALTS
OSSIPEE,DISC,SOH: 169879251,DISC
OSSIPEE,RELAY,L1201,ALTS
OSSIPEE,RELAY,L1203,ALTS

Re: How to improve regex for parsing equals delimited data
by NetWallah (Abbot) on May 11, 2012 at 03:25 UTC
    Use split, then zap the empty first field.
    The idea is that the delimiter is not just "=", but something like " FIELDx = ".
    my @fld = split /\s*FIELD\w = /;
    shift @fld; # Zap empty first field
    print join(",",@fld ),"\n";

                 I hope life isn't a big joke, because I don't get it.
                       -SNL
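
A minimal sketch of why the first element has to be dropped: the record begins with a delimiter match, so split returns an empty string as its first field. The sub name split_fields is made up for illustration, and the sample record is taken from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: split one record on " FIELDx = " style delimiters.
sub split_fields {
    my ($line) = @_;
    return split /\s*FIELD\w = /, $line;
}

my @fld = split_fields('FIELDA = OSSIPEE FIELDB = DISC FIELDC = SOH: 169879251 FIELDD = DISC');

# The record starts with a delimiter, so element 0 is the empty string.
printf "%d elements, first is '%s'\n", scalar @fld, $fld[0];

shift @fld;    # drop the empty leading field
print join(',', @fld), "\n";
```

Forgetting the shift leaves every value one position to the right of where a fixed four-element read expects it.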

      my @fld = split /\s*FIELD\w = /;
      shift @fld; # Zap empty first field
      print join(",",@fld ),"\n";

      my @fld = /FIELD\w\s*=\s*(\S+)/g;
      print join( ',', @fld ), "\n";

        His fields can have spaces: "SOH: 169879251".

        -sauoq
        "My two cents aren't worth a dime.";
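
To make the objection concrete, here is a small sketch that runs both approaches on the record from the question that contains a space in a value. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $line = 'FIELDA = OSSIPEE FIELDB = DISC FIELDC = SOH: 169879251 FIELDD = DISC';

# \S+ stops at the first whitespace character, so the third value is cut short.
my @non_space = $line =~ /FIELD\w\s*=\s*(\S+)/g;
print join(',', @non_space), "\n";    # OSSIPEE,DISC,SOH:,DISC

# Splitting on the field names keeps embedded spaces in the values.
my (undef, @by_split) = split /\s*[A-Za-z]+\s*=\s*/, $line;
print join(',', @by_split), "\n";     # OSSIPEE,DISC,SOH: 169879251,DISC
```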

      Same idea in one operation, a little more robust, and using his original [A-Za-z]+ for fields...

      my ($toss, @list) = split /\s*[A-Za-z]+\s*\=\s*/;

      -sauoq
      "My two cents aren't worth a dime.";

      Thanks, this worked well. I tried split at one point but didn't realize everything was shifted by one so I missed the fourth field. This is easy to follow also.

Re: How to improve regex for parsing equals delimited data
by tobyink (Abbot) on May 11, 2012 at 07:12 UTC

    Personally I'd parse it into a data structure first, and then use that data structure to generate the output:

    use Modern::Perl;
    use String::Trim;
    use Data::Dumper;
    $Data::Dumper::Sortkeys = 1;
    $Data::Dumper::Terse = 1;

    my @rows;
    while (<DATA>) {
        chomp;
        trim;
        my @F = split /\s*=\s*/;
        push @rows, [map {
            my %x = (field => $F[$_ - 1], value => $F[$_]);
            $x{field} =~ s/.*\s+(\S+)$/$1/ unless $_ == 1;
            $x{value} =~ s/\s*\S+$// unless $_ == $#F;
            \%x;
        } 1 .. $#F];
    }

    print Dumper \@rows;

    for (@rows) {
        say join q(,), map { $_->{value} } @$_;
    }

    __DATA__
    FIELDA = ONEAL FIELDB = RELAY FIELDC = L1208 FIELDD = ALTS
    FIELDA = OSSIPEE FIELDB = DISC FIELDC = SOH: 169879251 FIELDD = DISC
    FIELDA = OSSIPEE FIELDB = RELAY FIELDC = L1201 FIELDD = ALTS
    FIELDA = OSSIPEE FIELDB = RELAY FIELDC = L1203 FIELDD = ALTS
    perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'

      Hmmm... I wish that the result of a pre-increment was an lvalue. That would allow this:

      my %x = (field => $F[$_ - 1], value => $F[$_]);

      to become this...

      my %x = (field => $F[(--$_)++], value => $F[$_]);

      or maybe even...

      my %x = (field => $F[--$_++], value => $F[$_]);

      Now that would be a fun idiom! As things stand, this works:

      my %x = (field => $F[--$_], value => $F[++$_]);

      That said, my initial boring version is probably more readable.

      Update: I've just remembered the secret inchworm-on-a-stick operator. This is one of those rare opportunities it's actually useful:

      my %x = (field => $F[~-$_], value => $F[$_]);
      perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
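
A quick check of the inchworm-on-a-stick identity, as a sketch: ~-$n negates $n and then flips every bit, which comes out as $n - 1. This assumes $n is a positive integer, which holds for the indexes 1 .. $#F used above; it does not give -1 for 0.

```perl
use strict;
use warnings;

# ~-$n is parsed as ~(-$n); for positive integers the result equals $n - 1.
for my $n (1 .. 4) {
    printf "%d -> %d\n", $n, ~-$n;
}

# Used as an array index, it picks the element just before position $n.
my @F = qw(a b c d);
my $n = 2;
print "$F[~-$n] $F[$n]\n";    # b c
```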
      my @F = split /\s*=\s*/;
      push @rows, [map {
          my %x = (field => $F[$_ - 1], value => $F[$_]);
          $x{field} =~ s/.*\s+(\S+)$/$1/ unless $_ == 1;
          $x{value} =~ s/\s*\S+$// unless $_ == $#F;
          \%x;
      } 1 .. $#F];

      You could improve that a great deal if you just used one of the splits already given, with the slight modification of capturing the field name. There's no need for all the conditionals, substitutions, array indexing, and length checking. It's more readable too:

      my ($toss, @list) = split /\s*([A-Za-z]+)\s*\=\s*/;
      my @row;
      while (@list) {
          push @row, { field => shift @list, value => shift @list };
      }
      push @rows, \@row;

      -sauoq
      "My two cents aren't worth a dime.";
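
A self-contained sketch of the capturing split, wrapped in a hypothetical parse_record sub so it can be run on its own. It pairs names and values with splice instead of two shifts, a small substitution that avoids relying on evaluation order inside the hash constructor. The input is assumed to be already chomped.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one record into a list of { field, value } hashes.
sub parse_record {
    my ($line) = @_;

    # The capture group makes split return the field names along with the
    # values; the empty leading element is discarded via undef.
    my (undef, @list) = split /\s*([A-Za-z]+)\s*=\s*/, $line;

    my @row;
    while (@list) {
        my ($field, $value) = splice @list, 0, 2;
        push @row, { field => $field, value => $value };
    }
    return \@row;
}

my $row = parse_record('FIELDA = OSSIPEE FIELDB = DISC FIELDC = SOH: 169879251 FIELDD = DISC');
print join(',', map { "$_->{field}=$_->{value}" } @$row), "\n";
```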
Re: How to improve regex for parsing equals delimited data
by jwkrahn (Monsignor) on May 11, 2012 at 18:03 UTC

    Another way to do it:

    while ( <DATA> ) {
        $_ = reverse;
        my @fld;
        unshift @fld, scalar reverse $1 while s/(.+?)\s*=\s*\S+\s*//;
        print join( ',', @fld ), "\n";
    }
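
The reversal trick may be easier to follow on a single record. The sketch below wraps the same loop in a hypothetical parse_reversed sub; it relies on the question's statement that field names never contain whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the reverse-and-strip loop for one record.
sub parse_reversed {
    my ($line) = @_;
    my $rev = reverse $line;    # scalar context: reverses the characters

    my @fld;
    # In the reversed string each value comes before its "=" and field name,
    # so a non-greedy match up to the next "=" plus one \S+ word isolates it.
    # Values are found last-to-first, hence unshift rather than push.
    unshift @fld, scalar reverse $1
        while $rev =~ s/(.+?)\s*=\s*\S+\s*//;
    return @fld;
}

print join(',', parse_reversed('FIELDA = OSSIPEE FIELDB = DISC FIELDC = SOH: 169879251 FIELDD = DISC')), "\n";
```

Reversing turns "find where the value ends" into "match one whitespace-free name after the equals sign", which is why no lookahead is needed.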
