Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation

Data validation and blank spaces in tab formatted csv file

by Ma (Novice)
on Oct 24, 2013 at 13:36 UTC ( #1059455=perlquestion: print w/replies, xml ) Need Help??
Ma has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have developed a script which takes the tab formatted csv file and validates couple of fields and then generates valid and invalid tab separated csv files. Code is pasted below. The issue is, when there are no values, those records should not go to invalid. However, if the checked field have empty spaces and no values, this should go to the invalid file. This means if the value is never provided (null), it should be in the valid file. If no values are provided but the field contains empty spaces, this is considered invalid record and go to that file. I have done lot of attempts, still not able to handle the null and blank spaces issues. The records simply go to invalid when the checked fields contains null or blank spaces. Any help is greatly appreciated. Last time, I got so much help that I was at least able to write this script.
#!/usr/bin/perl # use strict; use warnings; use Diagnostics; use Text::CSV_XS; #use Regexp::Common::time; my $file_to_parse = "AddressDevCHUBldgProj_LM.csv"; my $file_to_parse_valid = "AddressDevCHUBldgProj_LM_valid.csv"; my $file_to_parse_invalid = "AddressDevCHUBldgProj_LM_invalid.csv"; open(FH, '<', $file_to_parse) or error("cannot open file ($!)"); open(FH1, '>', $file_to_parse_valid) or error("Cannot open file for wr +ite($!)"); open(FH2, '>', $file_to_parse_invalid) or error("Cannot open file for +write($!)"); while (<FH>) { chomp; my ($chu, $dev, $development_name, $bldg_no, $bldg_name, $unit_id, + $proj, $proj_name, $bdsz, $st_no, $street_name, $apt_no, $p_code, $un_st, $mkt, $max_rent, $fut_movein, $p, $last_ +move_in, $last_move_out, $pgm, $clt_no, $mgr_cd, $manager, $unit_status_dt, $cont_occ_dt, $pending_move_out_dt, $unit_status_ +desc) = split("\t"); $max_rent =~ s/,//g; #print "$dev\t$bldg_no\t$unit_id\t$proj\t$max_rent\t\t$clt_no\t\t\t$la +st_move_in\t\t$last_move_out\t\t$unit_status_dt\t\t$pending_move_out_ +dt\n"; print "dev = $dev\n"; print "bldg_no = $bldg_no\n"; print "unit_id = $unit_id\n"; print "proj = $proj\n"; print "max_rent = $max_rent\n"; print "clt_no = $clt_no\n"; print "last_move_in = $last_move_in\n"; print "last_move_out = $last_move_out\n"; print "unit_status_dt = $unit_status_dt\n"; print "pending_move_out_dt = $pending_move_out_dt\n"; if ( ($dev =~/^[0-9]+$/) && ($bldg_no=~/^[0-9]+$/) && ($unit_id=~/ +^[0-9]+$/) && ($proj=~/^[0-9]+$/) && ($max_rent=~/^[0-9]+(\.[0-9][0-9]?)$/) && ($clt_no=~/^[0-9]+$/) && ($last_move_in=~ s#(\d+)/(\d+)/(\d+)#$1/$2/$3#) && ($last_move_out=~ s#(\d+)/(\d+)/(\d+)#$1/$2/$3#) && ($unit_status_dt=~ s#(\d+)/(\d+)/(\d+)#$1/$2/$3#) && ($pending_move_out_dt=~ s#(\d+)/(\d+)/(\d+)#$1/$2/$3#) ) { print FH1 "$_", "$chu\t$dev\t$development_name\t$bldg_no\t$bldg_ +name\t$unit_id\t$proj\t$proj_name\t$bdsz\t$st_no\t$street_name\t$apt_ +no\t$p_code\t$un_st\t$mkt\t$max_rent\t$fut_movein\t$p\t$last_move_in\ +t$last_move_out\t$pgm\t$clt_no\t$mgr_cd\t$manager\t$unit_status_dt\t$ +cont_occ_dt\t$pending_move_out_dt\t$unit_status_desc", "\n"; + } else { print FH2 "$_", "$chu\t$dev\t$development_name\t$bldg_no\t$bldg_ +name\t$unit_id\t$proj\t$proj_name\t$bdsz\t$st_no\t$street_name\t$apt_ +no\t$p_code\t$un_st\t$mkt\t$max_rent\t$fut_movein\t$p\t$last_move_in\ +t$last_move_out\t$pgm\t$clt_no\t$mgr_cd\t$manager\t$unit_status_dt\t$ +cont_occ_dt\t$pending_move_out_dt\t$unit_status_desc", "\n"; } } close (FH); close (FH1); close (FH2); exit;

Replies are listed 'Best First'.
Re: Data validation and blank spaces in tab formatted csv file
by keszler (Priest) on Oct 24, 2013 at 13:44 UTC
    I'm a bit confused - you use Text::CSV_XS;, but don't use it. I would think that the blank_is_undef and empty_is_undef options of Text::CSV would be useful to you.
      Thanks, I have not used Text:CSV_XS. Can the existing code be changed to accommodate the requirements for null and blank spaces.

        For output, that would be similar to input

        my $csv_o = Text::CSV_XS->new ({ binary => 1, sep_char => "\t", auto_d +iag => 1, eol => $/ }); open my $fhv, ">", $file_to_parse_valid or error ("Cannot open file +for write: $!"); open my $fhi, ">", $file_to_parse_invalid or error ("Cannot open file +for write: $!"); while () { if ("valid") { $csv_o->print ($fhv, [ $chu, $dev, $development_name, $bldg_no, $bldg_name, $unit +_id, $proj, $proj_name, $bdsz, $st_no, $street_name, $apt_no, $p_code, + $un_st, $mkt, $max_rent, $fut_movein, $p, $last_move_in, $last_move_out, + $pgm, $clt_no, $mgr_cd, $manager, $unit_status_dt, $cont_occ_dt, $pending +_move_out_dt, $unit_status_desc ]); } else { $csv_o->print ($fhi, [ $chu, $dev, $development_name, $bldg_no, $bldg_name, $unit +_id, $proj, $proj_name, $bdsz, $st_no, $street_name, $apt_no, $p_code, + $un_st, $mkt, $max_rent, $fut_movein, $p, $last_move_in, $last_move_out, + $pgm, $clt_no, $mgr_cd, $manager, $unit_status_dt, $cont_occ_dt, $pending +_move_out_dt, $unit_status_desc ]); }

        Enjoy, Have FUN! H.Merijn
Re: Data validation and blank spaces in tab formatted csv file
by Tux (Abbot) on Oct 24, 2013 at 14:12 UTC

    What keszler said.

    my $csv = Text::CSV_XS->new ({ binary => 1, sep_char => "\t", auto_dia +g => 1 }); open my $fh_in, "<", $file_to_parse or die "$file_to_parse: $!"; while (my $row = $csv->getline ($fh_in)) { my ($chu, $dev, $development_name, $bldg_no, $bldg_name, $unit_id, + $proj, $proj_name, $bdsz, $st_no, $street_name, $apt_no, $p_code, $un +_st, $mkt, $max_rent, $fut_movein, $p, $last_move_in, $last_move_out, $pg +m, $clt_no, $mgr_cd, $manager, $unit_status_dt, $cont_occ_dt, $pending_mov +e_out_dt, $unit_status_desc) = @$row;

    Enjoy, Have FUN! H.Merijn
      Thank you keszler and Merijin. You have transformed my code to be much better. Can you provide example of one field testing for blanks and empty. For example, if the $unit_status_dt contains data or empty string, put that record into valid file. But if $unit_status_dt contains blank spaces and no data, put that into invalid file. I very much appreciate one example and then I will develop the remaining from there.

        There are several approaches. You can skip those spaces while parsing resulting in empty fields:

        my $csv = Text::CSV_XS->new ({ binary => 1, sep_char => "\t", auto_diag => 1, allow_whitespace => 1, });

        but as you are dealing with tab separated data, I'd personally would coose to do it inside the loop

        while (my $row = $csv->getline ($fh)) { # Check if the 5th field contains data if ($row->[4] =~ m/\S/) { # more than just whitespace $csv_o->print ($fhv, $row); } else { # sorry, this is not filled: invalid $csv_o->print ($fhi, $row); }

        Enjoy, Have FUN! H.Merijn

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1059455]
Approved by marto
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others studying the Monastery: (8)
As of 2017-10-17 10:50 GMT
Find Nodes?
    Voting Booth?
    My fridge is mostly full of:

    Results (225 votes). Check out past polls.