PerlMonks  

Getopt, regexen, and newlines

by apotheon (Deacon)
on Oct 12, 2005 at 18:08 UTC ( [id://499624]=perlquestion )

apotheon has asked for the wisdom of the Perl Monks concerning the following question:

I've been poking around through manpages, perldoc, Perlmonks Super Search, and Google, and I haven't yet found an answer to this problem: Getopt::Std and Getopt::Long don't seem to play well with escape characters inserted into regular expressions. Do these Getopt modules automatically "sanitize" escape characters? Assuming that's so (and it certainly seems to be), is there some way to route around that if I want it to stop sanitizing them?

This came up because of a throw-away script I wrote for a coworker yesterday to replace double-newlines with single-newlines in a file. I did it the obvious way: I had it open the file by way of a filehandle, gyrated my way to dumping its contents into a scalar, and hard-coded a newline-reducing regex substitution ($foo =~ s/\n\n/\n/g).

I then decided that, like all Perl hack(er)s, I needed to have a wholly redundant text replacement utility of my very own, so I expanded upon the script's functionality by making it take CLI arguments (by way of Getopt::Std) and by adding help switch functionality to explain how it works for the next time I need it and can't remember. By the time I was done, it did everything I wanted except what I'd originally designed it to do: perform substitutions with newlines. From prototype to obsolete in less than an hour. Microsoft, eat your heart out.

Anyway, I refer you back to my first paragraph's questions, and hope for some help. Here's the relevant code:

#!/usr/bin/perl

use strict;
use Getopt::Std;

my (%argument, @contents, $contents);

getopts('hd:s:', \%argument);

if ($argument{h}) {
    helptext()
}else{
    open(FILEHANDLE, "< $ARGV[0]") or die "cannot open file: $!";
    $contents = do{local $/; <FILEHANDLE>;};
    $contents =~ s/$argument{d}/$argument{s}/g;
    print $contents;
    close(FILEHANDLE);
}

sub helptext {
    print <<"EOT";

=====

syntax: frep [-h] -d <string> [-s <string>] <file>

    -h      prints this help text and exits: invoking the help
            argument causes all other arguments to be discarded
            by this utility

    -d      takes string as input, searches for that string:
            replaces with string from -s or with an empty string
            if no -s argument is specified

    -s      takes string as input, substitutes it for string
            specified by the -d argument -- if not specified,
            text matching the -d argument will simply be deleted

    <file>  specifies [path and] name of file to take as input,
            on whose contents this utility operates

description: fts (aka "file text substitution") takes a file's
contents as input and operates upon them, doing a simple find and
replace operation globally throughout the file. The results are
dumped to STDOUT, so the original file is untouched. If you want
the original file to be overwritten with the new contents, use a
shell redirect.

bugs: Unfortunately, for reasons that are still a mystery to me,
the -s argument does not handle newline escape characters
(specifically, "\\n") properly. Such escape characters are
"sanitized" by the Getopt::Std module. I have discovered that the
Getopt::Long module seems to exhibit the same behavior. Maybe
someday I'll bother to fix this.

credits: Chad L. Perrin (author)
         Perlmonks community (contributors)

license: This utility released under CCD CopyWrite. See following
URL for details. http://ccd.apotheon.org

EOT
}

That's it. Oh, yeah, and if anyone has any suggestions, comments, criticisms, complaints, or flames relating to the code itself, even if they don't answer my actual questions here, I'd love the feedback.

NOTE: The shell is not the (only) problem, here. Yes, it sanitizes unquoted escape characters, but it does not sanitize quoted escape characters. The script, on the other hand, does sanitize quoted escape characters, which leads me back to the original problem.

print substr("Just another Perl hacker", 0, -2);
- apotheon
CopyWrite Chad Perrin

Replies are listed 'Best First'.
Re: Getopt, regexen, and newlines
by Roy Johnson (Monsignor) on Oct 12, 2005 at 18:23 UTC
    The shell is probably what is sanitizing your arguments. How did you try to call it? With single-quotes around the strings, I hope?

    Incidentally, there is a shell tool that will squeeze newlines.

    tr -s '\n' < file > file.squeezed
    I'm afraid your script has no real advantage over the one-liner:
    perl -pe 's/from-expression/to-expression/g' file > new_file
    or the essentially-the-same
    sed 's/from-expression/to-expression/g' file > new_file

    Caution: Contents may have been coded under pressure.

      See my response to Corion above this one, re: the shell.

      This script does have an advantage over the Perl and sed one-liners you posted, however: You don't have to know either Perl or sed to use it, which was sorta the point.

      print substr("Just another Perl hacker", 0, -2);
      - apotheon
      CopyWrite Chad Perrin

        You don't have to know either Perl or sed to use it
        You don't have to know Perl or sed to use the one-liners, either. You just need to know those particular commands. The amount of knowledge isn't much different in any case. The one advantage of your script is that it will explain its calling syntax (if the user remembers the -h argument).

        I'd probably rewrite it without Getopts. Just take three parameters. Spit out the help if there aren't three parameters. Users can specify '' for an empty parameter. Something like:

        my ($from, $to) = (shift, shift);
        @ARGV == 1 or die "Usage: $0 from_string to_string filename\n";
        while (<>) {
            s/\Q$from/$to/g;
            print;
        }

        Caution: Contents may have been coded under pressure.
Re: Getopt, regexen, and newlines
by Corion (Patriarch) on Oct 12, 2005 at 18:19 UTC

    Did you check what your Perl script actually gets in @ARGV? The following will clarify to you what I suspect is the cause:

    BEGIN { local $" = "] ["; print "Command line: [@ARGV]\n"; }

    Most likely, your shell interprets the backslash (and other regex metacharacters) as a special character.... Of course, different shells handle the commandline differently, so the set of metacharacters may range from +/&;*"^% (Win32) to +/&!$\'"()[]{} (many *sh variants).

      The shell does indeed sanitize the arguments, unless I quote them. However, quoting them doesn't change the effect of the script, which still ends up giving me sanitized escape characters rather than newlines.

      Of course, I should have mentioned as much in the original post, and I'll go add something about that as a note now. I'll rep++ you for pointing that out.

      print substr("Just another Perl hacker", 0, -2);
      - apotheon
      CopyWrite Chad Perrin

        Backslashed characters are just backslashed characters unless Perl reads them in a double-quotish context. To do that, see efficient char escape sequence substitution.
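
To illustrate the point above: a string that arrives through @ARGV holds a literal backslash followed by a letter, and nothing turns that pair into a newline unless the script does so itself. The following is only a minimal sketch of one way to do that, not the code from the linked node; the `unescape` helper and the `%escape` table are hypothetical names, and the table covers just a handful of escapes.

```perl
use strict;
use warnings;

# Hypothetical lookup table (an assumption, not from the thread):
# the letter after the backslash maps to the character it stands for.
my %escape = (
    n    => "\n",
    t    => "\t",
    r    => "\r",
    0    => "\0",
    '\\' => "\\",
);

# Hypothetical helper: translate the escapes listed above inside a plain
# string. Any other backslash sequence is left exactly as it was.
sub unescape {
    my ($string) = @_;
    $string =~ s/\\([ntr0\\])/$escape{$1}/g;
    return $string;
}

# What the script would see after being called with: -s '\n\n'
my $from_shell = '\n\n';

printf "%d characters before, %d after\n",
    length $from_shell, length unescape($from_shell);
```

Applied to the original script, a helper like this would be called on $argument{s} before the substitution runs.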

        Caution: Contents may have been coded under pressure.
Re: Getopt, regexen, and newlines
by JediWizard (Deacon) on Oct 12, 2005 at 19:32 UTC

    Just a note:

    $contents = do{local $/; <FILEHANDLE>;};

    Should use less memory and (I believe) run faster than:

    @contents = <FILEHANDLE>; $contents = join "", @contents;

    P.S. see perlvar for info on $/
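
As a rough sketch of the slurp idiom above, here it is wrapped in a sub with a lexical filehandle and three-argument open. The `slurp` name is hypothetical and not part of the original script; this is one common way to write it, not the only one.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not from the thread): read a whole
# file into a single scalar.
sub slurp {
    my ($path) = @_;
    open my $fh, '<', $path or die "cannot open $path: $!";
    local $/;               # undefined record separator: one read gets it all
    my $contents = <$fh>;
    close $fh;
    return $contents;
}
```

Because `local $/` is scoped to the sub, the record separator is restored as soon as `slurp` returns, so line-by-line reads elsewhere in the program are unaffected.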


    They say that time changes things, but you actually have to change them yourself.

    —Andy Warhol

Re: Getopt, regexen, and newlines
by parv (Parson) on Oct 12, 2005 at 20:25 UTC

    It seems that the problem is not in Getopt::Long, but in the fact that a substitution string taken from outside the program is treated as a plain string, not as an escape sequence. An escape sequence as the replacement does work, however, when it is hard-coded in the s///. Or something like that.

    The following code -- tested in genuine xterm, bash3, perl 5.8.7, and Getopt::Long 2.68 -- will put \n as itself, not as a newline, if I specify -out as an escape sequence. It did the expected when I specified the replacement string as Ctrl-J.

    use warnings; use strict;

    use Getopt::Long qw(:config gnu_compat no_ignore_case no_debug );

    # Get regexen|strings for substitution.
    my ($in , $out);
    GetOptions( 'i|in=s' => \$in , 'o|out=s' => \$out) or usage();
    usage() unless defined $in && defined $out;

    printf "in: '%s' out: '%s'\n" , $in , $out;

    my $text;
    while (<STDIN>) {
        $text .= $_;
    }
    chomp $text;

    printf "before:\n'%s'\n" , $text;
    $text =~ s/$in/$out/g;
    printf "after:\n'%s'\n" , $text;

    sub usage {
        die "specify strings for -in & -out options.\n";
    }

    Update: For a user to be able to specify an escape sequence in the replacement, use a hash, something like the last message in the Regular Expression To Match Escape Sequences thread.

    - Parv
Re: Getopt, regexen, and newlines
by apotheon (Deacon) on Oct 12, 2005 at 23:02 UTC

    I don't know why I didn't think of this earlier. I've just added some code to the script that iteratively replaces \\ with \ in %argument before the arguments are dumped into the regex at the heart of this little utility. I guess that was just too simple for me to think of right away.

    Here's the inserted code:

    foreach my $key (keys %argument) { $argument{$key} =~ s/\\\\/\\/g; }

    I just inserted that between the getopts() statement and my if statement. The complete script now looks something like this:

    print substr("Just another Perl hacker", 0, -2);
    - apotheon
    CopyWrite Chad Perrin

      Above code when run w/ @ARGV of -d '\n\n' -s '\n' on file containing (there is an empty line after dot)...
      polka
      
      
      dot
      
      

      prints...

      polka\n
      dot\n
      

      ...BTW this input works as expected: -d 'o' -s 'O'; no surprise there.

      Am I specifying the input correctly, or does the script still not do what you initially wanted?

        Actually, I think you're right. I'll have to take another look at it. I haven't paid any attention to this for some time, and pretty much forgot all about it until now.

        Thanks for the reminder, and pointing out the (new?) problem. As I said, I'll have a look.

        EDIT: Okay, now I'm really confused. That regex will fix the input for what's to be substituted, so that you can just delete any instances of two linebreaks (for instance), but it won't properly handle the substitution if you've got escape characters in the -s argument.

        Yeah, I don't know what it's doing.
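
One likely explanation for the asymmetry described above, consistent with the earlier replies in this thread: the -d value is compiled as a regex, and the regex engine itself reads the two characters \n as "match a newline". The -s value is only interpolated into the replacement, and the interpolated text is not parsed a second time, so \n stays a backslash and an "n". The sketch below assumes the arguments arrive as '\n\n' and '\n'; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $text = "polka\n\ndot\n\n";

# What a quoted shell argument delivers: literal backslash-n pairs.
my ($find, $replace) = ('\n\n', '\n');

# The pattern side works, because the regex engine interprets \n.
# The replacement side inserts the two characters \ and n unchanged.
(my $literal = $text) =~ s/$find/$replace/g;

# Converting the replacement to a real newline first gives the
# originally intended result.
(my $real_replace = $replace) =~ s/\\n/\n/g;
(my $fixed = $text) =~ s/$find/$real_replace/g;

print "without conversion: [$literal]\n";
print "with conversion:    [$fixed]\n";
```

The first result contains backslash-n pairs where the newlines used to be, which matches the output parv reported; the second has each pair of newlines reduced to one.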

        print substr("Just another Perl hacker", 0, -2);
        - apotheon
        CopyWrite Chad Perrin
