http://www.perlmonks.org?node_id=230306

enoch has asked for the wisdom of the Perl Monks concerning the following question:

I was charged with writing a program that would "clean" data in a file (remove characters denoted bad, squash apostrophes). The program would accept a variable number of arguments. The first would be a path to a file. The remaining arguments would be a series of column markers indicating between which columns to do the data cleaning. For example, if you wanted to clean the file data.txt but only between columns 0 to 30, 44 to 63, and 97 to 111, you would call the program like so:
% ./cleanse.pl data.txt 0-30 44-63 97-111
Here is how I implemented it (changing strings to character arrays, using splice, data munging appropriately, and putting the data back). I was wondering if any monks would have done it differently.
#!/usr/bin/perl
use warnings;
use strict;

##
## This program accepts the name of a file from the command line
## and a variable length of extra arguments specifying between which
## columns to examine or which fields in a delimited file to examine.
## It, then, processes the file replacing any apostrophe with nothing
## (that is, it squashes any appearance of an apostrophe turning "don't"
## into "dont" and "O'Connor" into "OConnor"). It, then, replaces
## anything that is not an alphanumeric, pipe, new line, or dash with a space
##

my $fileToCleanse = shift
    or die "Usage $0 <fileName> <fromColumn - toColumn> "
         . "<fromColumn - toColumn>... where 'fileName' is the name "
         . "of the file to cleanse and the other parameters specify "
         . "the range of columns to cleanse";

# ORIG_DATA_FILE_EXT is not defined in this listing; it is assumed to be
# a constant, declared elsewhere, holding the extension of the original data file
open INPUT, $fileToCleanse . ORIG_DATA_FILE_EXT
    or die "Could not open $fileToCleanse" . ORIG_DATA_FILE_EXT
         . " for reading because:\n$!\t\n";

my $fileContents = '';
my @columnSpanArray = ();

# build a two dimensional array
# to hold each one of the column index parameter pairs
my $index = 0;
foreach (@ARGV) {
    ($columnSpanArray[$index][0], $columnSpanArray[$index][1]) = split '-', $_;
    $index++;
}

while (my $line = <INPUT>) {
    my @chars = split '', $line;   # split the line into an array of chars

    foreach my $parameters (@columnSpanArray) {
        # if the end of the line occurs before the parameter
        # specified to cleanse to, only cleanse until end of line
        # for example, if we are to cleanse from 45 to 115
        # but the line is only 65 characters long, only cleanse up to 65
        my $endOrLineLength = (length($line) > $$parameters[1])
                            ? $$parameters[1]
                            : length($line);

        # go to next loop if the parameters exceed the line length
        next if $$parameters[0] >= $endOrLineLength;

        # take a slice of the array between the columns to examine
        my $tmpString = join '', @chars[$$parameters[0]..$endOrLineLength-1];

        $tmpString =~ s/(.)'(.)/$1$2/g;          # squash apostrophes
        $tmpString =~ tr/a-zA-Z0-9\n\|\-/ /c;    # remove bad characters

        # put it back into the array from which we got it
        splice(@chars, $$parameters[0], $endOrLineLength-$$parameters[0],
               split '', $tmpString);
    }

    # store the cleansed data as a string
    $fileContents .= join '', @chars;
} # end while INPUT
close INPUT;

# print back the cleansed data to the original file name
open OUTPUT, ">$fileToCleanse"
    or die "Could not open $fileToCleanse for writing because:\n$!\t\n";
print OUTPUT $fileContents;
close OUTPUT;
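For comparison, here is a minimal sketch of one way the same per-line cleaning might be done without splitting each line into a character array, using four-argument substr to replace a column range in place. The names parse_ranges and cleanse_line are hypothetical helpers introduced only for this illustration, and the stricter validation of the range arguments is an assumption, not something the program above does. The cleaning rules (the apostrophe substitution and the tr character set) are copied from the code above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn arguments such as "0-30" into [from, to] pairs.
# Unlike the original split on '-', this rejects malformed arguments.
sub parse_ranges {
    my @ranges;
    for my $arg (@_) {
        my ($from, $to) = $arg =~ /^(\d+)-(\d+)$/
            or die "Bad column range '$arg' (expected from-to)\n";
        push @ranges, [$from, $to];
    }
    return @ranges;
}

# Hypothetical helper: cleanse the given column ranges of a single line.
# As in the original, 'from' is inclusive and 'to' is exclusive.
sub cleanse_line {
    my ($line, @ranges) = @_;
    for my $range (@ranges) {
        my ($from, $to) = @$range;

        # never read past the end of the line
        my $end = $to < length($line) ? $to : length($line);
        next if $from >= $end;

        my $len   = $end - $from;
        my $field = substr($line, $from, $len);

        $field =~ s/(.)'(.)/$1$2/g;        # squash apostrophes (same rule as above)
        $field =~ tr/a-zA-Z0-9\n|\-/ /c;   # replace bad characters with a space

        # four-argument substr replaces the range in place
        substr($line, $from, $len, $field);
    }
    return $line;
}

# small demonstration on an in-memory line
my @ranges = parse_ranges('0-6');
print cleanse_line("don't! keep#this\n", @ranges);
```

Both versions share one behavior that may matter for fixed-width data: squashing an apostrophe shortens the line, so any later range is applied to columns that have shifted left. If the file must stay fixed-width, padding the cleaned field back to its original length before putting it back would be one option.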
Enoch