Looking for appropriate Regex

by better (Acolyte)
on Apr 07, 2013 at 09:07 UTC (#1027336=perlquestion)
better has asked for the wisdom of the Perl Monks concerning the following question:


I am looking for an appropriate regular expression.

After parsing a csv file, commas are missing within the strings:

A B 1234 a,b -> A B 1234 a b

A B 1234 c,e -> A B 1234 c e

Not knowing how to avoid the problem in the first place, I thought I would fix it afterwards. I tried to define a regex that finds the affected spots in the parsed strings and substitutes them, like this:

$id =~ s/[a-z] [a-z]/[a-z],[a-z]/;

But this doesn't work!

Has anybody any idea how to solve this problem?


Replies are listed 'Best First'.
Re: Looking for appropriate Regex
by james2vegas (Chaplain) on Apr 07, 2013 at 09:21 UTC
    Not sure I follow exactly what the question is, but have you tried using Text::CSV to parse your CSV?

      Yes, I did. Thanks!

      Here is the code I use for parsing the CSV file:

      #! /usr/local/bin/perl
      #
      # script opens and parses a CSV file,
      # removes carriage return
      # and writes all into a new text file
      #
      # tested: --ok!

      use strict;
      use warnings;
      use Text::CSV;

      # Input CSV filename
      my $file = $ARGV[0];
      if (!$ARGV[0]) {
          $file = './data/IDs_test.csv';    # Default
      }
      if (!-f $file) {
          print "Can not find $file: $!\n";
          exit 1;
      }

      # Parsing CSV
      local $/ = "\r\n";    # add windows carriage return to perl's eol (newline)
      my $csv = Text::CSV->new({ binary => 1, eol => $/ });
      open(my $fhCSV, '<', $file) or die "Can not open $file: $!\n";
      open(my $fhOUT, '>', './data/IDs.txt') or die "Can not open: $!\n";
      while (my $line = <$fhCSV>) {
          if ($csv->parse($line)) {
              my @fields = $csv->fields();
              chomp(@fields);
              print $fhOUT "@fields\n";
          }
          else {
              warn "Line could not be parsed: $line\n";
          }
      }
      print "CSV parsed and saved as text file: /data/IDs.txt!";
      close $fhCSV;
      close $fhOUT;
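
      One possible explanation for the missing commas, sketched under the assumption that the ID fields are not quoted in the CSV file: the parser then treats the comma inside a,b as a field separator, and interpolating @fields in a double-quoted string joins the pieces with the list separator $", which defaults to a single space. The field values and the helper names join_with_space and join_with_comma below are made up for illustration; they are not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical result of parsing the unquoted line: A B 1234 a,b
my @fields = ('A B 1234 a', 'b');

# Interpolating an array in a string joins its elements with $"
# (a single space by default), so the comma does not come back.
sub join_with_space {
    my @f = @_;
    return "@f";
}

# An explicit join with a comma restores the separator.
sub join_with_comma {
    my @f = @_;
    return join ',', @f;
}

print join_with_space(@fields), "\n";    # A B 1234 a b
print join_with_comma(@fields), "\n";    # A B 1234 a,b
```

      If this assumption holds, the comma was never deleted by the parser; it was consumed as a separator and replaced by a space on output.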
Re: Looking for appropriate Regex
by hdb (Monsignor) on Apr 07, 2013 at 09:56 UTC

    Try this.

    $id =~ s/([a-z]) ([a-z])/$1,$2/;


      Thanks a lot!

      It works!

      But, oops, I found a few exceptions, like: A B 1234 a c d

      I tried to fix this by simply extending your script:

      $id =~ s/([a-z]) ([a-z]) ([a-z])/$1,$2,$3/;

      But this doesn't work!


      Oh yes, it does!

      I just swapped both lines, so that the extended line precedes your line. And voilà!
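
      A minimal sketch of why the order of the two substitutions matters. The sub names short_rule_first and long_rule_first are hypothetical; the regexes are the ones quoted above. If the two-letter rule runs first, it turns "a c d" into "a,c d", and the three-letter rule then has nothing left to match.

```perl
use strict;
use warnings;

# Two-letter rule first, then three-letter rule (the order that failed).
sub short_rule_first {
    my ($id) = @_;
    $id =~ s/([a-z]) ([a-z])/$1,$2/;
    $id =~ s/([a-z]) ([a-z]) ([a-z])/$1,$2,$3/;
    return $id;
}

# Three-letter rule first, then two-letter rule (the order that worked).
sub long_rule_first {
    my ($id) = @_;
    $id =~ s/([a-z]) ([a-z]) ([a-z])/$1,$2,$3/;
    $id =~ s/([a-z]) ([a-z])/$1,$2/;
    return $id;
}

print short_rule_first('A B 1234 a c d'), "\n";    # A B 1234 a,c d
print long_rule_first('A B 1234 a c d'), "\n";     # A B 1234 a,c,d
```

      Note that this approach needs one more rule for every additional letter, so an ID with four letters would still slip through.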

Re: Looking for appropriate Regex
by Loops (Curate) on Apr 07, 2013 at 09:22 UTC

    There are a number of ways to craft a regular expression to do this. One is to use look-behind and look-ahead assertions. These make sure the characters surrounding the space character match, but will leave them alone and not include them in the text to be replaced:

    $id =~ s/(?<=[a-z]) (?=[a-z])/,/;

    However, the results of any regex are likely to be spotty. If instead you use a module like Text::CSV to parse your CSV files, you should be able to receive your data clean, with all commas intact.
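
    As a hedged illustration of the look-around approach: because the assertions do not consume the surrounding letters, adding the /g modifier lets the same pattern handle any number of space-separated letters in one pass. The sub name restore_commas and the sample IDs are assumptions made for this sketch.

```perl
use strict;
use warnings;

# Replace every single space that sits between two lowercase letters.
# The look-behind and look-ahead only check the neighbours, so each
# letter can serve as the right edge of one match and the left edge
# of the next.
sub restore_commas {
    my ($id) = @_;
    $id =~ s/(?<=[a-z]) (?=[a-z])/,/g;
    return $id;
}

print restore_commas($_), "\n"
    for 'A B 1234 a b', 'A B 1234 a c d', 'A B 1234 a <1>';
```

    The same pattern would also put a comma between two ordinary lowercase words, which is one way the results can turn out spotty.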

Re: Looking for appropriate Regex
by ww (Archbishop) on Apr 07, 2013 at 15:31 UTC

    If the sample data are exactly as you posted them (hard to tell since you didn't use code tags),

    A B 1234 a,b -> ... A B 1234 c,e -> ....

    then it appears to me that you are dealing NOT with CSV but, rather, with SPACE-SEPARATED values (or perhaps TSV, if something in the writeup process transformed tabs to single spaces). If that's the case, you can't solve your problem by treating the data as if it were comma-separated.

    OTOH (and again presuming that the OP reflects the data precisely), could this actually be CSV data with just two fields?

    If you didn't program your executable by toggling in binary, it wasn't really programming!

      Well, I parsed variably structured strings from a CSV file. They consist of different characters, such as letters, numbers, spaces, brackets, minus signs and commas.

      The CSV file is a database output in the form of an Excel worksheet, saved as a CSV file. The first column contains fields with a series of letters and numbers, called IDs, e.g.

      |A B 1234 a <1>| |A B 1234 a,b|

      In order to look up the image files corresponding to the IDs, I parse the IDs of the CSV file into a text file, which will later be opened as a file handle. During the parsing of the IDs, the comma between comma-separated letters (like a,b or c,e) that are part of the ID is accidentally (?) deleted.

      Just to point out the context: The parsing is the first step of a file import, see nodes:

      Read text file - Encoding problem?

      Looking up elements of an array in another array!
Re: Looking for appropriate Regex (Text::CSV)
by Anonymous Monk on Apr 07, 2013 at 09:22 UTC

      Hello again, Anonymous Monk!

      Thanks a lot for your extensive response!

      I'm fighting not to get buried in this avalanche of wisdom ;-)

      Seriously, I appreciate your help, which seems to guide me through my whole project.

      I can already find hints in your list of links for the next step: parsing a CSV into a hash, which I will need for tagging the image files I imported before (see my node:

      Looking up elements of an array in another array!)

