Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer
 
PerlMonks  

resub

by graff (Chancellor)
on Nov 08, 2002 at 09:02 UTC ( [id://211379]=sourcecode: print w/replies, xml ) Need Help??
Category: Utility Scripts
Author/Contact Info David Graff <graff@ldc.upenn.edu>
Description: Do any number of global regex substitutions uniformly over any number of text files, and correctly handle all character encodings supported by Perl 5.8.0, with optional conversion of data from one encoding to another. (update: fixed checks for valid regexes)
#!/usr/bin/perl

=head1 NAME

resub -- applies regex substitutions on selected files

=head1 SYNOPSIS

  resub -h  (or resub -?) : list supported character encodings and exi
+t

  resub [-r enc] [-w enc] [-o .ext] [-s '/old/new/[ise]'] [-f regex.li
+st] file.name[s]

  cat filename.list | resub [options] stdin  (or resub [options] - )

=head1 SAMPLE USAGE

  grep -l badword *.txt | resub -s '/badword/good word/' -   # fix a s
+pelling error 

  resub -f edit.list *.txt      # fix lots of things at once

  resub -s /Bill/George/ -o .updated  crank.letter 
          # keeps crank.letter, creates crank.letter.updated

=head1 DESCRIPTION

When a particular string replacement (or set of replacements) needs to
be applied to one or more files, resub makes it possible to apply all
the replacements with a single command line.  The specification of the
editing to be done can be provided as one or more "-s /old/new/"
options on the command line, and/or listed in a separate text file
whose name is given with the "-f" option.  Both "-s" and "-f" edit
specs can be used in combination.  All edits will be applied globally
to the content of each text file, in the order given.

Resub requires Perl 5.8.0 or later, in order to use the Encode module
and the internal Unicode representation of characters.  Data files and
edit.list files can use any character encoding supported by Encode
(the complete list of supported character sets is listed with the "-h"
option).  Also, the edited data can be written in any encoding (so
character code conversion is simple).  The default encoding for
reading and writing text is "utf8" (which means that plain ASCII text
will be treated correctly during default operation).  Alternate
encodings for input or output can be selected with the "-r" and "-w"
options, respectively.  Any regex-list file(s) given with the "-f"
option must use the same encoding as the input text file(s).

Specify regex instructions just as you would in Perl, using curly
braces ( {pattern}{replacement} ) or an arbitrary delimiter character
(/p/r/, %p%r%, etc).  The "i", "e" and "s" flags can be added after
the final delimiter.  Note that the "g" flag is always attached to the
expression before applying it to the data, to do global replacements.
When placing regex instructions in a file for use with "-f", put one
regex per line (they will be applied in top-down order).

Names (or wildcard) for one or more files to edit can be given on the
command line; or, instead of given any file name as a command line
arg, use "-" or "stdin", and resub will expect to read a list of file
names from STDIN.

By default, resub will first create a temporary file to hold the
edited version of each text file, then (if there are no errors) it
will rename the temporary to the original file name, so that the
original file content is overwritten.  You can preserve the original
file(s) by using the "-o .ext" option: in this case, whatever string
follows "-o" will be appended to each original input file name, and
the resulting name will be used to receive the edited data.

=head1 CHARACTER ENCODING ISSUES

If "-r" is used to specify a given encoding for the input text
file(s), resub will use Encode to convert the data, including
regex-list files, if any, to unicode.  Whether the input data is
already utf8 or some other encoding, it is possible that a data file
or regex-list may contain byte sequences that are invalid or undefined
for the selected input encoding.  When reading a regex-list file,
resub will report and skip any single regex that contains bad
character data (but other regexes will be used).  When reading from a
text file, encountering bad character data will cause the editing of
that file to be aborted, and the original file will remain unaltered,
and resub will report the problem (but then go on to edit other files,
if any).

If some non-unicode encoding has been selected for output with the
"-w" option, resub will convert the data accordingly.  Again, it is
possible that the data may contain characters that cannot be
represented in the character set chosen for output.  As with input,
the editing of the file is aborted, and the original file remains
unaltered, the problem is reported (and other files, if any, are
handled).

=cut

require 5.008_000;

use strict;
use Encode;

# we will read whole files into memory, so let's be careful about file
+ size:
use constant MAX_FILE_SIZE => 2 ** 26;  # that's about 67 MB (okay for
+ most unix boxes)

$0 =~ s%^.*/%%;
my $Usage =
    "Usage:\n\n".
    " $0 [-r enc] [-w enc] [-o .ext] [-s '/old/new/[i]'] [-f edspec.fi
+le] file.name[*]\n\n".
    " $0 -h  or -? : list available character encodings for reading (-
+r) and writing (-w)\n".
    "    default -r = utf8; default -w = same as -r (that is, -r by it
+self means '-rw')\n\n".
    " Does in-place edit of text file(s) by applying regex substitutio
+n(s)\n".
    " or, using '-o .ext', writes edited output to 'text.file(s).ext' 
+instead\n\n".
    " Multiple '-s ...' and/or '-f ...' args can be combined in one co
+mmand\n".
    " ALL substitutions are applied globally, in order, throughout eac
+h file\n\n".
    " Use '-' or 'stdin' as the file.name to read list of file names f
+rom stdin\n\n";

my $newext = ".resub.$ENV{USER}.$$";     # default extension for temp 
+file
my @enclist = Encode->encodings(":all"); # list of all supported chara
+cter sets

my (@subs,@regexs,@filelist,@regfile);
my ($renc,$wenc) = ("","");

# The ordering of "-s /old/new/" and "-f edspec.file" options may be
# significant, so we will do our own command-line option processing,
# rather than the usual "Getopt" module.

listEncodings() if ( grep /^-+[h\?]/, @ARGV );  # this will exit when 
+done

while ( @ARGV > 1 && $ARGV[0] =~ /^-([rwosf])$/ )
{
    my $argtyp = $1;
    if ( $argtyp eq "s" ) {
    if ( $ARGV[1] =~ /^\{.*?\}\s*\{.*\}[ise]*$/ or  # check for regex 
+delimiters
         ( $ARGV[1] =~ /^(.)(.*)\1(.*)\1[ise]*$/ && (length($2) or len
+gth($3)))) {
        push( @regexs, $ARGV[1] );
    } else {
        die "Invalid regex on command line: $ARGV[1]\n";
    }
    }
    elsif ( $argtyp eq "f" ) {
    open( INP, "$ARGV[1]" ) || die "Unable to open editspec.file $ARGV
+[1]\n";
    while (<INP>) {
        s/\#.*//;             # delete commentary, if any
        next if ( /^\s*$/ );  # skip empty lines
        s/^\s*//;          # delete leading and trailing whitespace, i
+f any
        s/\s*$//;
        if ( /^\{.*?\}\s*\{.*\}[ise]*$/ or     # check for suitable re
+gex delimiters
         ( /^(.)(.*)\1(.*)\1[ise]*$/ && (length($2) or length($3)))) {
        push( @regexs, $_ );
        $regfile[$#regexs] = "$ARGV[1] at line $.";
        } else {
        warn "Ignored invalid regex in $ARGV[1] at line $.\n";
        }
    }
    close INP;
    }
    elsif ( $argtyp eq "o" ) {
    $newext = $ARGV[1];
    }
    elsif ( $argtyp eq "r" ) {
    $renc = $ARGV[1];
    }
    elsif ( $argtyp eq "w" ) {
    $wenc = $ARGV[1];
    }
    shift;
    shift;
}

if ( @ARGV == 1 && $ARGV[0] =~ /^(-|stdin)$/ ) {
    shift;
    while (<>) {
    chomp;
    push( @filelist, $_ );
    }
} else {
    @filelist = @ARGV;
}

die $Usage unless ( @regexs && @filelist );
$renc = "utf8" if ( $renc eq "" );
$wenc = $renc  if ( $wenc eq "" );
listEncodings() unless ( grep( /$renc/, @enclist ) && grep( /$wenc/, @
+enclist ));

foreach my $i ( 0 .. $#regexs ) {
    $_ = $regexs[$i];
    if ( exists( $regfile[$i] ) and $renc ne "utf8" ) {
    unless ( eval "\$_ = decode( \$renc, \$_, Encode::FB_CROAK )" ) {
        warn "Bad character data for $renc in $regfile[$i]: skipped th
+at regex\n";
        next;
    }
    }
    push @subs, eval "sub { s$_ }" or die $@;
}

$/ = undef; # nullify record-separator variable, to read whole file(s)
+ into a scalar

foreach my $file ( @filelist )
{
    unless ( -f $file ) {
    warn "$file is not a data file -- skipped this file\n";
    next;
    }
    unless ( -s _ < MAX_FILE_SIZE ) {
    warn "$file is too big (break it up or use some other tool) -- ski
+pped this file\n";
    next;
    }
    unless ( open( INP, $file ) && ( $newext ne ".resub.$ENV{USER}.$$"
+ || -w $file )) {
    warn "Unable to open $file for editing -- skipped this file\n";
    next;
    }
    binmode( INP, ":utf8" ) if ( $renc eq "utf8" );  # (see pod::perlu
+niinto man page)
    $_ = <INP>;  # $_ will contain all data in $file
    close INP;

    unless ( $renc eq "utf8" or eval "\$_ = decode( \$renc, \$_, Encod
+e::FB_CROAK )" ) {
    warn "Bad character data for $renc in $file -- skipped this file\n
+";
    next;
    }
    foreach my $resub ( @subs ) {
    $resub->();
    }
    unless ( $wenc eq "utf8" or eval "\$_ = encode( \$wenc, \$_, Encod
+e::FB_CROAK )" ) {
    warn "$file contains characters unwritable in $wenc -- skipped thi
+s file\n";
    next;
    }

    my $newfile = $file.$newext;
    unless ( open( OUT, ">$newfile" )) {
    warn "Unable to open output file -- $file unchanged\n";
    next;
    }
    binmode( OUT, ":utf8" ) if ( $wenc eq "utf8" );  # (see pod::perlu
+niinto man page)
    print OUT or die "Unable to write data to $newfile -- $file unchan
+ged";
    close OUT or die "Ack! Error occurred when closing $newfile -- $fi
+le unchanged";
    my $mode = (stat $file)[2] & 0777;
    chmod $mode,$newfile;
    rename $newfile, $file if ( $newext eq ".resub.$ENV{USER}.$$" );
}

sub listEncodings
{   # user is asking for help: list all available encodings
    my $colwidth = length( (sort {length($b) <=> length($a)} @enclist)
+[0] ) + 2;
    my $ncol = int( 80/$colwidth );
    my $nrow = int( scalar(@enclist)/$ncol );
    $nrow++ if ( scalar(@enclist) % $ncol );
    my $fmt = "%-${colwidth}s";

    print $Usage;
    foreach my $r ( 0 .. $nrow ) {
    foreach my $c ( 0 .. $ncol ) {
        my $i = $c * $nrow + $r;
        printf( $fmt, $enclist[$i] );
    }
    print "\n";
    }
    exit( 0 );
}

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: sourcecode [id://211379]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others sharing their wisdom with the Monastery: (3)
As of 2024-11-03 03:59 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    chatterbot is...






    Results (13 votes). Check out past polls.