enoch has asked for the wisdom of the Perl Monks concerning the following question:
I was charged with writing a program that would "clean" data in a file (remove characters denoted bad, squash apostrophes). The program would accept a varaible number of arguments. The first would be a path to a file. The remaining arguments would be a series of column markers indicating between which columns to do the data cleaning. For example, if you wanted to clean the file data.txt but only between columns 0 to 30, 44 to 63, and 97 to 111, you would call the program like so:
% ./cleanse.pl data.txt 0-30 44-63 97-111
Here is how I implemented it (changing strings to character arrays, using splice, data munging appropriately, and putting the data back). I was wondering if any monks would have done it differently.
#!/usr/bin/perl
use warnings;
use strict;
##
## This program accepts the name of a file from the command line
## and a variable length of extra arguments specifying between whic
+h
## columns to examine or which fields in a delimited file to examin
+e.
## It, then, processes the file replacing any apostrophe with nothi
+ng
## (that is, it squashes any appearance of an apostrophe turning "d
+on't"
## into "dont" and "O'Connor" into "OConnor"). It, then, replaces
## anything that is not a alpha-numeric, pipe, new line, or dash wi
+th a space
##
my $fileToCleanse = shift
or die "Usage $0 <fileName> <fromColumn - toColumn> " .
"<fromColumn - toColumn>... where 'fileName' is the name " .
"of the file to cleanse and the other parameters specify " .
"the range of columns to cleanse";
open INPUT, $fileToCleanse . ORIG_DATA_FILE_EXT
or die "Could not open $fileToCleanse" .
ORIG_DATA_FILE_EXT . " for reading because:\n$!\t\n";
my $fileContents = '';
my @columnSpanArray = ();
# build a two dimensional array
# to hold each one of the column index paramter pairs
my $index = 0;
foreach(@ARGV)
{
($columnSpanArray[$index][0], $columnSpanArray[$index][1]) =
split '-', $_;
$index++;
}
while(my $line = <INPUT>)
{
my @chars = split '', $line; # split the line into an array of
+chars
foreach my $parameters (@columnSpanArray)
{
# if the end of the line occurs before the parameter
# specified to cleanse to, only cleanse until end of line
# for example, if we are to cleanse from 45 to 115
# but the line is only 65 characters long, only cleanse up til
+ 65
my $endOrLineLength = (length($line) > $$parameters[1])
? $$parameters[1]
: length($line);
# go to next loop if the paramters exceed the line length
next if $$parameters[0] >= $endOrLineLength;
# take a slice of the array between the columns to examine
my $tmpString = join '', @chars[$$parameters[0]..$endOrLineLen
+gth-1];
$tmpString =~ s/(.)'(.)/$1$2/g; # squash apostrophe
+s
$tmpString =~ tr/a-zA-Z0-9\n\|\-/ /c; # remove bad characte
+rs
# put it back into the array from which we got it
splice(@chars, $$parameters[0], $endOrLineLength-$$parameters[
+0],
split '', $tmpString);
}
# store the cleansed data as a string
$fileContents .= join '', @chars;
} # end while INPUT
close INPUT;
# print back the cleansed data to the original file name
open OUTPUT, ">$fileToCleanse" or
die "Could not open $fileToCleanse.cleansed for reading because:\n
+$!\t\n";
print OUTPUT $fileContents;
close OUTPUT;
Enoch
Re: Cleaning Data Between Specified Columns
by Fletch (Bishop) on Jan 27, 2003 at 19:51 UTC
|
{
local *tmp = \substr( $source, $start, $len );
$tmp =~ y/'//d;
$tmp =~ y/a-zA-Z0-9\n\|-/ /c;
}
| [reply] [d/l] [select] |
|
Damn. That's great. :) And with a little change it's strict compliant and shorter too:
{
local *_ = \substr $source, $start, $len;
y/a-zA-Z0-9\n\|-/ /c;
y/'//d;
}
Update: changed order of transliterations to account for the deletion of characters causing extra characters to shift in.
Makeshifts last the longest. | [reply] [d/l] |
Re: Cleaning Data Between Specified Columns
by BrowserUk (Patriarch) on Jan 27, 2003 at 21:50 UTC
|
Using substr for this is kinda tricky. If one of the earlier ranges contains 's, and you delete them, that screws up the indexes for later ranges. One solution is to substitute a known-not-present char (I used \x7F) for ' whilst processing the ranges and then remove these from the resultant line.
Update:A modified version to deal with replacing 's with spaces at the end of the field rather than deleting them entirely. Makes use of Fletch's neat trick and Aristotle's enhancement to it, now that is possible.
#! perl -sw
use warnings;
use strict;
open( FILE, '<', shift) or die "Couldn't open $::FILE; $!";
@ARGV = map{ [ split(/-/, $_, 2) ] } @ARGV
or die 'usage $0: file c1-c2 [ c1-c2 [ ... ] ] >modified_file';
while (my $line = <FILE>) {
for ( @ARGV ) {
next if $_->[0] > length $line;
$_->[1] = length $line if $_->[1] > length $line;
local *_ = \substr($line, $_->[0], $_->[1]-$_->[0] + 1);
tr[a-zA-Z0-9\n\|\-'][ ]c;
$_ .= ' ' x tr['][]d;
}
print $line;
}
close FILE;
Original version
Examine what is said, not who speaks.
The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead. | [reply] [d/l] [select] |
|
++BrowserUK!
I was bitten by my squashing of apostrophes. Because it was a fixed width file, squashing the apostrophes caused the width's to change. I changed the regex to:
s/(.)'(.\B*)/$1$2 /g
So, that spacing was added for each apostrophe I pulled from any field.
enoch | [reply] [d/l] |
|
$tmpString =~ s/(.)'(.)/$1$2/g; # squash apostrophes
As the next line
$tmpString =~ tr/a-zA-Z0-9\n\|\-/ /c; # remove bad characters
will convert them to spaces anyway?
Examine what is said, not who speaks.
The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead. | [reply] [d/l] [select] |
|
|
| [reply] |
|
Sorry Aristotle. Fletch's (partial) solution, neat as the technique is, falls foul of the fact that deleting the apostrophies in a one range, causes all the subsequent columns to shift.
Examine what is said, not who speaks.
The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.
| [reply] |
|
Re: Cleaning Data Between Specified Columns
by Aragorn (Curate) on Jan 27, 2003 at 20:41 UTC
|
Taking one of the suggestions above, keeping your program flow intact, and mixing it a bit with my personal style, I come up with this:
#!/usr/bin/perl
use strict;
use warnings;
# Store the ranges in a list of hashes.
my @Ranges = ();
foreach my $arg (@ARGV) {
my ($start, $end) = split(/-/, $arg);
push @Ranges, { start => $start, end => $end };
}
# Use standard input/output so we can pipe files through it.
while (my $line = <STDIN>) {
# We go through the columns in reverse so the offsets will be
# correct.
foreach my $range (reverse @Ranges) {
my ($start, $end) = ($range->{start}, $range->{end});
my $line_len = length($line);
$end = $line_len if $line_len < $end;
next if $start > $end; # Skip out-of-bound ranges
my $nchars = $end - $start;
substr($line, $start-1, $end-$start) =
cleanse(substr($line, $start-1, $end-$start));
}
print $line;
}
# Remove apostrophes and other unwanted characters.
sub cleanse {
my $string = shift;
$string =~ s/'//g;
$string =~ tr/a-zA-Z0-9\n\|\-/ /c;
return $string;
}
It could use some extra checking on the program arguments (ranges), though.
Arjen
| [reply] [d/l] |
Re: Cleaning Data Between Specified Columns
by Anonymous Monk on Jan 27, 2003 at 19:24 UTC
|
You could do something like the following:
substr($line, $pos1, $pos2) = cleanse(substr($line, $pos1, $pos2));
You would have to work from the high indicies to
the low ones in order to avoid messing up your offset counts.
| [reply] [d/l] |
|
The third parameter should be a length, not a position.
substr($line, $pos1, $pos2-$pos1+1) = cleanse(substr($line, $pos1, $po
+s2-$pos1+1));
Update:Here is some sample code, simplified in its input and output. The "local *tmp" trick by Fletch would also work, but I found it wouldn't pass strict without a "use vars qw($tmp);", so I went back to the simpler temporary variable idea.
use strict;
my @pos = (0,5,15,20);
while(my $line = <>) {
my $i = $#pos;
while ($i >= 1) {
my $pos1 = $pos[$i - 1];
my $pos2 = $pos[$i];
my $len = $pos2 - $pos1 + 1;
my $part = substr( $line, $pos1, $len );
$part =~ y/'//d;
$part =~ y/a-zA-Z0-9\n\|-/ /c;
substr( $line, $pos1, $len ) = $part;
$i -= 2;
} # end while $i
print $line;
} # end while
| [reply] [d/l] [select] |
|
|