Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid

Cleaning Data Between Specified Columns

by enoch (Chaplain)
on Jan 27, 2003 at 18:54 UTC ( #230306=perlquestion: print w/replies, xml ) Need Help??

enoch has asked for the wisdom of the Perl Monks concerning the following question:

I was charged with writing a program that would "clean" data in a file (remove characters denoted bad, squash apostrophes). The program would accept a varaible number of arguments. The first would be a path to a file. The remaining arguments would be a series of column markers indicating between which columns to do the data cleaning. For example, if you wanted to clean the file data.txt but only between columns 0 to 30, 44 to 63, and 97 to 111, you would call the program like so:
% ./ data.txt 0-30 44-63 97-111
Here is how I implemented it (changing strings to character arrays, using splice, data munging appropriately, and putting the data back). I was wondering if any monks would have done it differently.
#!/usr/bin/perl use warnings; use strict; ## ## This program accepts the name of a file from the command line ## and a variable length of extra arguments specifying between whic +h ## columns to examine or which fields in a delimited file to examin +e. ## It, then, processes the file replacing any apostrophe with nothi +ng ## (that is, it squashes any appearance of an apostrophe turning "d +on't" ## into "dont" and "O'Connor" into "OConnor"). It, then, replaces ## anything that is not a alpha-numeric, pipe, new line, or dash wi +th a space ## my $fileToCleanse = shift or die "Usage $0 <fileName> <fromColumn - toColumn> " . "<fromColumn - toColumn>... where 'fileName' is the name " . "of the file to cleanse and the other parameters specify " . "the range of columns to cleanse"; open INPUT, $fileToCleanse . ORIG_DATA_FILE_EXT or die "Could not open $fileToCleanse" . ORIG_DATA_FILE_EXT . " for reading because:\n$!\t\n"; my $fileContents = ''; my @columnSpanArray = (); # build a two dimensional array # to hold each one of the column index paramter pairs my $index = 0; foreach(@ARGV) { ($columnSpanArray[$index][0], $columnSpanArray[$index][1]) = split '-', $_; $index++; } while(my $line = <INPUT>) { my @chars = split '', $line; # split the line into an array of +chars foreach my $parameters (@columnSpanArray) { # if the end of the line occurs before the parameter # specified to cleanse to, only cleanse until end of line # for example, if we are to cleanse from 45 to 115 # but the line is only 65 characters long, only cleanse up til + 65 my $endOrLineLength = (length($line) > $$parameters[1]) ? $$parameters[1] : length($line); # go to next loop if the paramters exceed the line length next if $$parameters[0] >= $endOrLineLength; # take a slice of the array between the columns to examine my $tmpString = join '', @chars[$$parameters[0]..$endOrLineLen +gth-1]; $tmpString =~ s/(.)'(.)/$1$2/g; # squash apostrophe +s $tmpString =~ tr/a-zA-Z0-9\n\|\-/ /c; # remove bad characte +rs # put it back into the array from which we got it splice(@chars, $$parameters[0], $endOrLineLength-$$parameters[ +0], split '', $tmpString); } # store the cleansed data as a string $fileContents .= join '', @chars; } # end while INPUT close INPUT; # print back the cleansed data to the original file name open OUTPUT, ">$fileToCleanse" or die "Could not open $fileToCleanse.cleansed for reading because:\n +$!\t\n"; print OUTPUT $fileContents; close OUTPUT;

Replies are listed 'Best First'.
Re: Cleaning Data Between Specified Columns
by Fletch (Bishop) on Jan 27, 2003 at 19:51 UTC

    You can use the fact that the return value from substr() is an lvalue and do something like:

    { local *tmp = \substr( $source, $start, $len ); $tmp =~ y/'//d; $tmp =~ y/a-zA-Z0-9\n\|-/ /c; }
      Damn. That's great. :) And with a little change it's strict compliant and shorter too:
      { local *_ = \substr $source, $start, $len; y/a-zA-Z0-9\n\|-/ /c; y/'//d; }
      Update: changed order of transliterations to account for the deletion of characters causing extra characters to shift in.

      Makeshifts last the longest.

Re: Cleaning Data Between Specified Columns
by BrowserUk (Patriarch) on Jan 27, 2003 at 21:50 UTC

    Using substr for this is kinda tricky. If one of the earlier ranges contains 's, and you delete them, that screws up the indexes for later ranges. One solution is to substitute a known-not-present char (I used \x7F) for ' whilst processing the ranges and then remove these from the resultant line.

    Update:A modified version to deal with replacing 's with spaces at the end of the field rather than deleting them entirely. Makes use of Fletch's neat trick and Aristotle's enhancement to it, now that is possible.

    #! perl -sw use warnings; use strict; open( FILE, '<', shift) or die "Couldn't open $::FILE; $!"; @ARGV = map{ [ split(/-/, $_, 2) ] } @ARGV or die 'usage $0: file c1-c2 [ c1-c2 [ ... ] ] >modified_file'; while (my $line = <FILE>) { for ( @ARGV ) { next if $_->[0] > length $line; $_->[1] = length $line if $_->[1] > length $line; local *_ = \substr($line, $_->[0], $_->[1]-$_->[0] + 1); tr[a-zA-Z0-9\n\|\-'][ ]c; $_ .= ' ' x tr['][]d; } print $line; } close FILE;

    Original version

    Examine what is said, not who speaks.

    The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.


      I was bitten by my squashing of apostrophes. Because it was a fixed width file, squashing the apostrophes caused the width's to change. I changed the regex to:
      s/(.)'(.\B*)/$1$2 /g
      So, that spacing was added for each apostrophe I pulled from any field.


        In that case, there's no need for this line

        $tmpString =~ s/(.)'(.)/$1$2/g; # squash apostrophes

        As the next line

        $tmpString =~ tr/a-zA-Z0-9\n\|\-/ /c; # remove bad characters

        will convert them to spaces anyway?

        Examine what is said, not who speaks.

        The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

      Using "known nonexistant" characters is just asking for trouble.. it's a practice I've come to regard as a huge red flag. In this particular case and with Perl being Perl, the proper solution is surprising but very neat. Fletch++

      Makeshifts last the longest.

        Sorry Aristotle. Fletch's (partial) solution, neat as the technique is, falls foul of the fact that deleting the apostrophies in a one range, causes all the subsequent columns to shift.

        Examine what is said, not who speaks.

        The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

Re: Cleaning Data Between Specified Columns
by Aragorn (Curate) on Jan 27, 2003 at 20:41 UTC
    Taking one of the suggestions above, keeping your program flow intact, and mixing it a bit with my personal style, I come up with this:

    #!/usr/bin/perl use strict; use warnings; # Store the ranges in a list of hashes. my @Ranges = (); foreach my $arg (@ARGV) { my ($start, $end) = split(/-/, $arg); push @Ranges, { start => $start, end => $end }; } # Use standard input/output so we can pipe files through it. while (my $line = <STDIN>) { # We go through the columns in reverse so the offsets will be # correct. foreach my $range (reverse @Ranges) { my ($start, $end) = ($range->{start}, $range->{end}); my $line_len = length($line); $end = $line_len if $line_len < $end; next if $start > $end; # Skip out-of-bound ranges my $nchars = $end - $start; substr($line, $start-1, $end-$start) = cleanse(substr($line, $start-1, $end-$start)); } print $line; } # Remove apostrophes and other unwanted characters. sub cleanse { my $string = shift; $string =~ s/'//g; $string =~ tr/a-zA-Z0-9\n\|\-/ /c; return $string; }

    It could use some extra checking on the program arguments (ranges), though.


Re: Cleaning Data Between Specified Columns
by Anonymous Monk on Jan 27, 2003 at 19:24 UTC
    You could do something like the following:
    substr($line, $pos1, $pos2) = cleanse(substr($line, $pos1, $pos2));
    You would have to work from the high indicies to the low ones in order to avoid messing up your offset counts.
      The third parameter should be a length, not a position.
      substr($line, $pos1, $pos2-$pos1+1) = cleanse(substr($line, $pos1, $po +s2-$pos1+1));
      Update:Here is some sample code, simplified in its input and output. The "local *tmp" trick by Fletch would also work, but I found it wouldn't pass strict without a "use vars qw($tmp);", so I went back to the simpler temporary variable idea.
      use strict; my @pos = (0,5,15,20); while(my $line = <>) { my $i = $#pos; while ($i >= 1) { my $pos1 = $pos[$i - 1]; my $pos2 = $pos[$i]; my $len = $pos2 - $pos1 + 1; my $part = substr( $line, $pos1, $len ); $part =~ y/'//d; $part =~ y/a-zA-Z0-9\n\|-/ /c; substr( $line, $pos1, $len ) = $part; $i -= 2; } # end while $i print $line; } # end while

Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://230306]
Approved by virtualsue
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chanting in the Monastery: (4)
As of 2023-12-09 14:56 GMT
Find Nodes?
    Voting Booth?
    What's your preferred 'use VERSION' for new CPAN modules in 2023?

    Results (38 votes). Check out past polls.