Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number

Regexp glitch while parsing record in CSV

by greenhorn (Sexton)
on Jul 17, 2000 at 04:21 UTC ( #22799=perlquestion: print w/replies, xml ) Need Help??

greenhorn has asked for the wisdom of the Perl Monks concerning the following question:

A regular expression I'm using hasn't been doing what I expect, and so far I'm baffled if I can see why.

The input file is a CSV file with 12 fields per record. The task is to get the contents of field 4. Field 1 always has content. Fields 2 and 3 might be blank. Field 4 always has content. Sometimes the field separators (commas) are preceded by or followed by spaces. I want to ensure these spaces are not included in the string extracted from field 4. One approach, as the file is being read line-by-line:

@record = split( /\s*,\s*/o, $_); $field4 = $record[3];

The regexp used with split does effectively filter out white space before and after the field separators. But despite the expected greater efficiency of "split", I found that in this particular script, a regular expression makes the script run noticeably faster (damned if I know why):

($s) = $_ =~ m/^\s*[^,]+,[^,]*,[^,]*,\s*([^,]+)\s*,/o; # "$s" now contains the contents of field #4

I would have thought the inclusion of "\s*" before and after the parens would ensure that 0+ leading and trailing spaces within the field would be removed.

But it isn't working. When field 4 has leading and trailing spaces in it, they are included in what's returned by the regular expression. Then I have to use:   $s =~ s/^\s+|\s+$//g;

to get rid of the spaces. There's something that's not working in the regexp--but I'm not seeing the error. If someone here does, I'd be much obliged for your advice about it.

(This to Ovid: This script is similar to--but simpler than--the one discussed in the other recent thread. Here again the regular expression seems to out-perform "split" every time. Dunno what to make of it.)

Replies are listed 'Best First'.
Re: Regexp glitch while parsing record in CSV
by turnstep (Parson) on Jul 17, 2000 at 04:39 UTC
    Looks to me like your regexp is saying:
    ... \s*([^,]+)\s*,... Find some whitespace, or perhaps none at all. Find anything that is not a comma, at least once, which *includes* whitespace, then... Find some more whitespace, perhaps none Find a comma

    Since perl is greedy, it takes the whitespace into the parens, and leaves nothing for the \s* afterwards to match.

    I came up with this:

    which seems to do the trick. Breaking it down:
    (?: ## Group, but do not store it into $1 [^,]* ## Anything that is not a comma , ## Followed by a comma \s* ## Followed by possible whitespace ){3} ## Find three of these (the first three) (.*?) ## Match any character, but don't be so greedy about it \s* ## Possible whitespace , ## Stops at first comma, because we are not being greedy

    There is probably a better way to write it, but I'm tired and this seems to work.

      <kbd>> m#(?:[^,]*,\s*){3}(.*?)\s*,#</kbd>

      This worked, thanks--and taught me a couple of things about regular expressions that I'd seen in several of the books but hadn't yet understood.

      Still perplexed by the failure of the other regexp. Even though<kbd> [^,]+ </kbd>does indeed include spaces, I'm perplexed by why the<kbd> \s* </kbd>preceding and following it fail to catch spaces when they exist in those locations in the string.<kbd> \s* </kbd>does catch them when it is used this way:

      <kbd>split /\s*,\s*/ , $_</kbd>

      Thanks again.

        The asterisk can be a little tricky. It means, match zero or more of the preceding. In a construct like \s*[^,]+\s*, the character class does match the spaces and the comma. The + is greedy, too.
      I have two problems with the above regex. This first is that everything except the commas is optional.
      #!/usr/bin/perl $_ = ",,,,"; print "Good\n" if /(?:[^,]*,\s*){3}(.*?)\s*,/;
      The above code will print "Good\n". If you've worked with CSV data for any length of time you know that the odds are good that sooner or later you'll get a line with all commas (at least I have). Depending upon what is done with the data, you could have serious data corruption.

      Also, I would try to avoid the (.*?)\s*, construct. It's not terribly specific and can cause problems. ([^,]+)\s*, is very specific and is more appropriate. In fact, if you know that the data you are capturing won't have any embedded spaces or tabs (and I'm assuming that everything is on one line), you can use ([^, \t]+),.

      In this case, I don't feel that it will cause a problem with how your regex is crafted, but subtle errors can creep in down the road as maintenance occurs. Your regex is fine because the whitespace behind it is optional, but the negated character class is almost always preferable because it states exactly what you want.

      Consider the following problem: you want to print the first field of comma-delimited text if the last character prior to the comma is a sharp (#), but you don't want to capture the sharp. If the data doesn't fit this format, you want the regex to fail completely. The following regex looks fine at first glance:

      print "$1\n" if /^(.+?)#,/;
      It is, however, a bad choice. The negated character class is proper:
      #!/usr/bin/perl $_ = "test1, test2#,test3"; print "$1\n" if /^(.+?)#,/; # Returns a false positive print "$1\n" if /^([^,]+)#,/; # This fails, as we expect
      The first regex above will print test1, test2. I'm not trying to sound picky, but any time I see the .* or .+ used in a regex, I always look for a way to remove it because it's not terribly precise.
Re: Regexp glitch while parsing record in CSV
by Shoeboy (Sexton) on Jul 17, 2000 at 07:37 UTC
    Have you considered optimizing split a little bit. Your regex finds 4 commas and returns whatever's between the 3rd and 4th. Your split finds 11 commas and returns a list of 12 items. It's no suprise that the regex is faster. If you're only trying to return a part of a string, a regex seems to be more appropriate than split. Still, if you like the idea of using split (it is shorter and more readable) -- try limiting split to 5 items like so:
    @record = split( /\s*,\s*/o, $_, 5); $field4 = $record[3];
    I don't know if split will speed up from this, but it's worth a try.
    perl -e "do {kill $java, $ada, $cobol, $pascal, $csh;} until die 'Just another perl hacker';"
      I have found this collection of replies downright fascinating. Bloody regular expressions, anyway. :)

      Given some of the regexp pitfalls that have become obvious as I read through the replies, I find myself thinking that--TMTOWTDI notwithstanding--in this case the "correct" approach might well be the one you've suggested:

      <kbd>@record = split( /\s*,\s*/o, $_, 5);
      $field4 = $record[3];</kbd>

      Judging by q&d benchmarking I did, restricting the split to 5 fields did speed things up a bit. So this approach, while not as sexy as the ones using the regular expressions:
      1) gets the job done;
      2) isn't breathtakingly fast--but isn't dismally slow, either;
      3) is immediately understandable; a complex regexp might lead, later on, to some head-scratching...;
      4) ensures, in a simple way, that the leading and trailing spaces are removed. Once more, thanks to all of you for the extensive feedback.

Re: Regexp glitch while parsing record in CSV
by lhoward (Vicar) on Jul 17, 2000 at 04:29 UTC
    This is slightly off-topic, but have you considered using Text::CSV or DBD::CSV for parsing the file? Both of those modules are really convenient to use.
      Concerning the modules on CPAN: I had a look at the links you provided (thanks for that). As often happens, I find that I don't grasp the reference material--I don't get the syntax. I will have to spend more time with sundry Perl books I have here to see if I can learn how such modules work.

      Adding to my confusion: the links within the page to which you referred me appear to point to .PM files, but they download as HTML files, not as .PM files. Obviously I'll need to work out how to "use" CPAN as well. (Aside from all this, I'm quite knowledgeable. <g>)

(jjhorner)Regexp glitch while parsing record in CSV
by jjhorner (Hermit) on Jul 17, 2000 at 06:35 UTC

    I'm not getting the same results as everyone else:

    #!/usr/bin/perl -w use strict; use Benchmark; my $string = " field1 , field2 , field3 , field4 , field5 , field6 , field7 , field8 , field9 , field10 , field11 , field12 "; sub regexp { $string =~ m#^(\s*)(\w*?)(\s*), (\s*)(\w*?)(\s*),(\s*)(\w*?)(\s*), (\s*)(\w*?)(\s*),(\s*)(\w*?)(\s*), (\s*)(\w*?)(\s*),(\s*)(\w*?)(\s*), (\s*)(\w*?)(\s*),(\s*)(\w*?)(\s*), (\s*)(\w*?)(\s*),(\s*)(\w*?)(\s*), (\s*)(\w*?)(\s*)$#o; my $s = $11; print $s; } sub splitter { my $s = (split(/\s*,\s*/, $string))[3]; print $s; }


    [22:31:17] Orders, Master? ./ Benchmark: timing 100000 iterations of regexpression, splitting... regexpression: 12 wallclock secs (12.78 usr + 0.00 sys = 12.78 CPU) @ + 7824.73/s (n=100000) splitting: 8 wallclock secs ( 7.90 usr + 0.00 sys = 7.90 CPU) @ 12 +658.23/s (n=100000)

    Is there something wrong with my code? I know that isn't the best regular expression ever created, but I would like to know why it doesn't beat the split.

    Note: To whomever I saw with the shell prompt like I have above, that is so cool!

    J. J. Horner
    Linux, Perl, Apache, Stronghold, Unix
      Taking out the capturing parenthesis around the whitespace helped a lot. Here's the benchmark results:
      Benchmark: timing 50000 iterations of regex, split... regex: 3 wallclock secs ( 3.03 usr + 0.01 sys = 3.04 CPU) split: 4 wallclock secs ( 3.79 usr + 0.00 sys = 3.79 CPU)
(Ovid) RE: Regexp glitch while parsing record in CSV
by Ovid (Cardinal) on Jul 17, 2000 at 21:30 UTC
    Couple of comments:
    1. Reread lhoward's post. Those two modules might save you a lot of headaches. Your regex, for example, will not allow for embedded or escaped commas (assuming you ever get any).
    2. @record = split( /\s*,\s*/o, $_); is better written as @record = split /\s*,\s*/;.
    3. I'm not commenting on the meat of your node as you already have tons of good comments about that.
    The /o modifier is not necessary if an interpolated variable is not in the regex. Perl automatically compiles the regex only once if it's a straight regex with no variable interpolation (I've made this mistake in posts here, too. See Mastering Regular Expressions, second edition, p159).

    The /o modifier can be useful if you have a regex like /\s+${someword}ing/o. If that variable's value never changes, then using /o to force the regex compiler to only compile the expression once is good. However, if you use it to compile an expression with a variable whose value may change, then you're in trouble. The variable's value in the regex will be stuck in the first value and never get updated. The following code illustrates the danger:

    while (<SOMEFILE>) { chomp; $found = 1 if $user =~ /$_/o; }
    If $user isn't the first one in SOMEFILE, then $user will never match. Incidentally, you can get around this limitation with anonymous subroutines. merlyn has an excellent article about this.

    This node was brought to you by the letter "O".

Re: Regexp glitch while parsing record in CSV
by Anonymous Monk on Jul 18, 2000 at 00:28 UTC

    Just a quick note to add to Ovid's fine comments.

    You can use ([^,]+?) to have the best of both worlds (robust and specific as well as not being so greedy as to suck up the trailing white space before the \s* has a chance to get it.

    You can also get fancier and help the regex engine match quicker by doing something like /^\s*([^,]*[^,\s])\s*,/, which (I think) will reduce the amount of backtracking possible (is it called "forwardtracking" in the case of non-greedy regex elements? ;> ). Note that this, like the last ones, doesn't match empty values.

    You can support empty values with /^\s*((?:[^,]*[^,\s])?)\s*,/, but then you are going a long way for what may be little benefit.

RE: Regexp glitch while parsing record in CSV
by mrmick (Curate) on Jul 17, 2000 at 15:38 UTC
    If you are only concerned about the leading and trailing whitespace in the 4th field, I suggest the following (or something similar):
    my @record = split( /,/); my $field4 = $record[3]; $field4 =~ /^\s*//; $field4 =~ /\s*$//;
    I know that this isn't exactly a sophisticated regex but we are only working with one field. This is still faster than performing the replacement on all fields, I'm sure.


Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://22799]
Approved by root
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chilling in the Monastery: (3)
As of 2022-10-01 08:32 GMT
Find Nodes?
    Voting Booth?
    I prefer my indexes to start at:

    Results (126 votes). Check out past polls.