|
benlaw has asked for the wisdom of the Perl Monks concerning the following question:
Hi all,
I try to pick some data within CSV file, when it has no response....~ I think I have met the limitation of regex, is there any method I can do in regex ? Thanks a lot
#!perl -w
while(<DATA>){
$count = ($_ =~ tr/\,//);
print $count;
if($_ =~ /A\,(.*?)\,(.*?)\,(.*?)\,(.*?)\,(.*?)\,(.*?)\,.*?\,.*?\,.
+*?\,.*?\,(.*?)\,.*?\,.*?\,.*?\,.*?\,.*?\,.*?\,.*?\,.*?\,.*?\,.*?\,.*?
+\,.*?\,.*?\,.*?\,.*?\,.*?\,.*?\,.*?\,(.*?)\,.*?\,.*?\,.*?\,.*?\,.*?\,
+.*?\,.*?\,.*?\,.*?\,.*?\,.*?\,.*?\,.*?\,.*?\,.*?\,.*?\,.*?\,.*?\,.*?\
+,.*?\,.*?\,/){
print $_;
}
}
__DATA__
Action,userID,DEPT,FolderLocation,Place,First Name,Surname,Dept2,subje
+ct,brief,Phone,InternetEmail,AccessCode,AccessID,A,IA,SC,UC,PC,JD,ED,
+JC,G,SCA,FNA,OAD,AAK,MBB,SACTA,Index,STATE,Fax,uID,LDisa,Second Intro
+,BossName,EXphoneCall,OXCode,NameID,Section,Com,status,final Am,Speci
+alID,unit,BFIT,Old Show,Second mail alias,footer,Build,Notes Server,C
+ellPhone
X,LLXEAS2,,,"Wah Woo, Section A",Lamshi,Coo,DSA,No vell,,,,,,,,,,HKA,,
+,,,,,,,,,,Important,,,,,,,A,LLXEAS2,CWW,,NAW,,,,,,,"Welcome to Coo Du
+k, ltd.",,,
X,LLXEAS,,,"Wah Woo, Section A",Lamshi,Coo,DSA,No vell,,,,,,,,,,HKA,,,
+,,,,,,,,,Important,,,,,,,A,LLXEAS,CWW,,NAW,,,,,,,"Welcome to Coo Duk,
+ ltd.",,,
G,CHXAS,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,sam.mo@hotmail.co
+m,,,,
Re: Is there any regex limitation?
by moritz (Cardinal) on Jul 24, 2011 at 12:03 UTC
|
The problem is that your regex can cause a lot of backtracking. Since (.*?) is also allowed to match a comma, a failed match will mean that each .*? group tries to match a different number of characters. Since you have so many of these groups, there's a huge number of possible matches (all of which will fail), which is why the regex engine takes so much time.
The better approach is of course to use Text::CSV, which will also handle the quoting inside a field for you.
Another efficient but less robust approach is to use split.
| [reply] [d/l] [select] |
Re: Is there any regex limitation?
by Jim (Curate) on Jul 24, 2011 at 17:14 UTC
|
#!perl
use strict;
use warnings;
use Text::CSV_XS;
my $csv = Text::CSV_XS->new( { binary => 1 } );
while (my $record = $csv->getline(*DATA)) {
my @values = @$record;
for (my $i = 0; $i <= $#values; $i++) {
printf "(%d)\t[%s]\n", $i + 1, $values[$i];
}
}
exit 0;
__DATA__
Action,userID,DEPT,FolderLocation,Place,First Name,Surname,Dept2,subje
+ct,brief,Phone,InternetEmail,AccessCode,AccessID,A,IA,SC,UC,PC,JD,ED,
+JC,G,SCA,FNA,OAD,AAK,MBB,SACTA,Index,STATE,Fax,uID,LDisa,Second Intro
+,BossName,EXphoneCall,OXCode,NameID,Section,Com,status,final Am,Speci
+alID,unit,BFIT,Old Show,Second mail alias,footer,Build,Notes Server,C
+ellPhone
X,LLXEAS2,,,"Wah Woo, Section A",Lamshi,Coo,DSA,No vell,,,,,,,,,,HKA,,
+,,,,,,,,,,Important,,,,,,,A,LLXEAS2,CWW,,NAW,,,,,,,"Welcome to Coo Du
+k, ltd.",,,
X,LLXEAS,,,"Wah Woo, Section A",Lamshi,Coo,DSA,No vell,,,,,,,,,,HKA,,,
+,,,,,,,,,Important,,,,,,,A,LLXEAS,CWW,,NAW,,,,,,,"Welcome to Coo Duk,
+ ltd.",,,
G,CHXAS,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,sam.mo@hotmail.co
+m,,,,
Here's the output of the script.
Neither your regular expression pattern nor split will parse your example CSV records correctly.
| [reply] [d/l] [select] |
Re: Is there any regex limitation?
by Anonymous Monk on Jul 24, 2011 at 12:06 UTC
|
What limitation are you talking about? Use split
my @fudge = split /,/, $_
Or a better idea is to use Text::CSV or Text::xSV | [reply] [d/l] |
|
|
| [reply] |
|
|
You can't use split to parse CSV records—at least not the general form of CSV records represented in benlaw's example. It won't work.
Actually yes you can
Using split as I've shown is as-good as benlaws intended solution, with the benefit of no backtracking
But its a solved problem, so Text::CSV or Text::xSV
| [reply] |
|
|
|
|
| |
|
How does the split solution you posted handle the following portion of the OP's input?
"Wah Woo, Section A"
I can tell you how it does: One field gets ("Wah Woo), the next field gets ( Section A"), and the embedded but significant comma gets gobbled up. Or to put it simply: It handles it wrong.
() parens added to disambiguate whitespace in the example output
| [reply] [d/l] [select] |
|
|
How does the split solution you posted handle the following portion of the OP's input?
obviously it doesn't -- just like the OPs actual code
use Text::CSV or Text::xSV
| [reply] |
Re: Is there any regex limitation?
by Marshall (Canon) on Jul 24, 2011 at 22:14 UTC
|
This CSV file is complex enough that I think using the DBI in conjunction with the DBD::CSV module is certainly worth considering. This enables the use of standard SQL statements to extract, update, create information in the .CSV file. Later if you want to move the data to a real DB, the same code will work. For that next step, I recommend SQLite as a great DB that provides about 90% of what a "real DB" provides. I show some code below.
A few notes on the code... There are many ways to write the connect statement. I show one way and also some of the defaults on purpose. The file I made with the OP's data is DataBase.CSV - that file is a verbatim copy of the posted data. With DBD::CSV normally each file is a table so I set default extension to be .csv so that I didn't have to type that in the SQL statements. RaiseError=1 is very common and with that you don't have a bunch of "or die" statements in the code. Of course there is no username or password for a .csv file so those are undef in the connect statement.
I show how to make a simple query to print a couple of values. The DB returns a NULL string "" if there is no data for that column.
I got confused about which columns actually had data or not, so that was a good opportunity to show another type of query where I printed out the column name and value for every column that had data.
Anyway this approach does require a lot of upfront learning, but I've found it to be a very good way of dealing with CSV files. The classic book: Programming the Perl DBI" by Alligator Descartes and Tim Bunce is highly recommended.
Update: I added an example of updating the .csv file to the previous code. Also corrected a mistake, csv_eol => "\n", the \n was in single quotes before which caused problems with the update operation! Normally, I would not even have these kind of default parameters in the code, but I wanted to demonstrate that there are many parameters that can be tweaked and show a few of them.
I do not claim that the DBI approach executes faster than the CVS_XS parser, indeed DBD::CSV uses it. The advantage of this DBI approach is that it is more flexible than other approaches. And when things grow in size or complexity, the DBI/SQL code can be directly used in a real database where it will definitely run much faster than with CVS_XS code.
The big disadvantage of the approach shown is that some knowledge of SQL and the Perl DBI is needed. The total scope of SQL and the Perl DBI is huge, but for jobs like this only about 1/2 dozen functions are needed. So learn what you need and learn more when needed.
In the code below, I use "pod" directives (like "=cut") to embed the output within the Perl code itself. I hope that this does not confuse anybody. I do it this way so that you have a runnable set of code that includes the expected output.
#!/usr/bin/perl -w
use strict;
use DBI; ## need to have DBD::CSV installed for this script!
my $dbh = DBI->connect ("DBI:CSV:", undef, undef, # no uname, pwd
{
f_dir => './', # default current directory
f_ext => ".csv", # DB files end in .csv
csv_eol => "\n", # default (used for writing)
csv_sep_char => ',', # default of course is comma!
RaiseError => 1, # errors are fatal (with description)
PrintError => 0, # default is 0 not needed if RaiseErro
+r=1
# non-zero prints error and continue
+s
}
) or die "Cannot connect: " . $DBI::errstr;
my $st_basic_actions = $dbh->prepare("SELECT action,userID,Place
FROM DataBase
ORDER BY userID");
$st_basic_actions->execute();
printf "%-10s %-10s %-10s\n", 'Action','userID','Place';
while (my ($action,$userID,$Place) = $st_basic_actions->fetchrow)
{
$Place ||= 'unknown'; # use 'unknown' if null string, ''
printf "%-10s %-10s %-10s\n", $action,$userID,$Place
}
=outputs ################
Action userID Place
G CHXAS unknown
X LLXEAS Wah Woo, Section A
X LLXEAS2 Wah Woo, Section A
=cut ################
my $st = $dbh->prepare ("SELECT * FROM DataBase");
$st->execute();
while (my $href = $st->fetchrow_hashref())
{
print "****\n";
foreach my $key (sort keys %$href)
{
print "$key=$href->{$key}\n" if $href->{$key};
}
}
=outputs ###############
****
action=X
dept2=DSA
first_name=Lamshi
footer=Welcome to Coo Duk, ltd.
nameid=LLXEAS2
oxcode=A
pc=HKA
place=Wah Woo, Section A
section=CWW
state=Important
status=NAW
subject=No vell
surname=Coo
userid=LLXEAS2
****
action=X
dept2=DSA
first_name=Lamshi
footer=Welcome to Coo Duk, ltd.
nameid=LLXEAS
oxcode=A
pc=HKA
place=Wah Woo, Section A
section=CWW
state=Important
status=NAW
subject=No vell
surname=Coo
userid=LLXEAS
****
action=G
second_mail_alias=sam.mo@hotmail.com
userid=CHXAS
=cut
# **CODE UPDATE**: adding an "update" to show modification is possible
$st = $dbh->prepare ("UPDATE DataBase
SET action='Done'
WHERE userID='LLXEAS2'");
my $rows_modified = $st->execute();
print "\nRows Modified from X to Done = $rows_modified\n";
## now re-running previous query, can re-use a prepared statement
## handle
$st_basic_actions->execute();
printf "%-10s %-10s %-10s\n", 'Action','userID','Place';
while (my ($action,$userID,$Place) = $st_basic_actions->fetchrow)
{
$Place ||= 'unknown'; # use 'unknown' if null string, ''
printf "%-10s %-10s %-10s\n", $action,$userID,$Place
}
##########################
=this new section of code prints:
Rows Modified from X to Done = 1
Action userID Place
G CHXAS unknown
X LLXEAS Wah Woo, Section A
Done LLXEAS2 Wah Woo, Section A
=cut
############################
| [reply] [d/l] |
Re: Is there any regex limitation?
by JavaFan (Canon) on Jul 24, 2011 at 20:25 UTC
|
| [reply] |
|
|