http://www.perlmonks.org?node_id=977655


in reply to Re^2: Regex help
in thread Regex help

There are number of ways to do it with straight regular expressions, but the technology you are feeding this into (and thus the necessary input-output mapping) is unfamiliar to me. Parsing one of your lines may be as easy as /^\s*(\S+)\s+(\S+)\s*$/, but maybe this needs the m modifier depending on context. This particular expression will fail on all lines other than your data lines, since all it does is (thanks to YAPE::Regex::Explain):
The regular expression: (?m-isx:^\s*(\S+)\s+(\S+)\s*$) matches as follows: NODE EXPLANATION ---------------------------------------------------------------------- (?m-isx: group, but do not capture (with ^ and $ matching start and end of line) (case- sensitive) (with . not matching \n) (matching whitespace and # normally): ---------------------------------------------------------------------- ^ the beginning of a "line" ---------------------------------------------------------------------- \s* whitespace (\n, \r, \t, \f, and " ") (0 or more times (matching the most amount possible)) ---------------------------------------------------------------------- ( group and capture to \1: ---------------------------------------------------------------------- \S+ non-whitespace (all but \n, \r, \t, \f, and " ") (1 or more times (matching the most amount possible)) ---------------------------------------------------------------------- ) end of \1 ---------------------------------------------------------------------- \s+ whitespace (\n, \r, \t, \f, and " ") (1 or more times (matching the most amount possible)) ---------------------------------------------------------------------- ( group and capture to \2: ---------------------------------------------------------------------- \S+ non-whitespace (all but \n, \r, \t, \f, and " ") (1 or more times (matching the most amount possible)) ---------------------------------------------------------------------- ) end of \2 ---------------------------------------------------------------------- \s* whitespace (\n, \r, \t, \f, and " ") (0 or more times (matching the most amount possible)) ---------------------------------------------------------------------- $ before an optional \n, and the end of a "line" ---------------------------------------------------------------------- ) end of grouping ----------------------------------------------------------------------
and can be shown to work on this example with
#!/usr/bin/perl -w use strict; use Data::Dumper; $_ = <<'EOT'; Port 1 Database Assignments Region Data Type # Records GLOBAL -- LOCAL -- BUF -- D1 Unused D2 Unused D3 Unused D4 Unused D5 Unused D6 Unused D7 Unused D8 Unused A1 Unused A2 Unused A3 Unused USER Unused EOT my %hash; while (/^\s*(\S+)\s+(\S+)\s*$/mg) { $hash{$1} = $2; } print Dumper \%hash;
However, it'll break pretty quickly if your input is not representative; e.g. if you Region or Data Type contain white space (this looks fixed width to me) or if # Records is not null.

#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Replies are listed 'Best First'.
Re^4: Regex help
by jayto (Acolyte) on Jun 21, 2012 at 15:11 UTC
    Thanks this was really helpful, you pretty much did all the work for me. I am trying to understand your regex expression which will take a bit. One more questions, is there no way to record the "# Record" column cells as being blank?

      Recording the "# Record" column cells as being blank is easy; the hard part would be changing the expression to record non-blank entries. You could just add an empty pair of parentheses, a la /^\s*(\S+)\s+(\S+)\s*()$/. Given that none of the lines contain three non-space blocks, you might even say /^\s*(\S+)\s+(\S+)\s*(\S*)\s*$/. Keep in mind that given these are such general expressions, it is absolutely essential that you test this against real input and be skeptical of the results.


      #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.