http://www.perlmonks.org?node_id=1082405

RonW has asked for the wisdom of the Perl Monks concerning the following question:

I am processing an input format that can use either @ or \ to introduce inline directives. When a literal @ or \ is needed in the input, either character can be used to escape either character, i.e. @@ or \@ yields a literal @, and \\ or @\ yields a literal \.

Currently, I replace the escape sequences with placeholders, then extract the directives, then replace the placeholders with the intended literal occurrences of @ and \ in the string.

s/(?<![\\\@])[\\\@]\@/\x11/g;
s/(?<![\\\@])[\\\@]\\/\x12/g;
while (/[\\\@]([_A-Za-z]+)/)
{
    print "Extracted code '$1'\n";
    s/[\\\@]$1//;
}
s/\x11/\@/g;
s/\x12/\\/g;

I'm sure there's a better way, but my search-fu is lacking. And so is my regex-fu. (And there's likely an input file that will break this.)

(and no, it's not LaTeX, despite the similarities)

Replies are listed 'Best First'.
Re: better way to escape escapes (1 pass)
by tye (Sage) on Apr 16, 2014 at 01:42 UTC

    Often better to do such things in a single pass.

    s{[\\@]([\\@]|[_A-Za-z]+)}{
        my $term = $1;
        if( $term =~ /[\\@]/ ) {
            $term   # Replace repeated escapes with second of pair
        } else {
            print "Extracted code '$term'\n";
            ''      # Replace extracted codes with nothing.
        }
    }ge;
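To see how the single-pass substitution behaves on real input, here is a minimal sketch that wraps it in a helper. The name strip_directives and the sample string are assumptions for illustration only. The variant collects the extracted codes in a list instead of printing them, and uses a ternary as the final expression so the replacement value is explicit.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the single-pass substitution: returns the
# cleaned text followed by the list of directive names that were removed.
sub strip_directives {
    my ($text) = @_;
    my @codes;
    $text =~ s{[\\@]([\\@]|[_A-Za-z]+)}{
        my $term      = $1;
        my $is_escape = $term =~ /[\\@]/;
        push @codes, $term unless $is_escape;   # remember each directive
        $is_escape ? $term : '';                # escaped char stays, directive goes
    }ge;
    return ( $text, @codes );
}

my ( $clean, @codes ) = strip_directives('xxx@@xx\@xx@abc\@xxx@efg@\xxx');
print "$clean\n";    # xxx@xx@xx@xxx\xxx
print "@codes\n";    # abc efg
```

Because every escape pair and every directive is consumed in one left-to-right scan, no placeholder characters are needed.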

    - tye        

Re: better way to escape escapes
by kcott (Archbishop) on Apr 16, 2014 at 03:00 UTC

    G'day RonW,

    The first thing to do, to make your code more readable and maintainable, is to assign those regex fragments with the escapes to meaningful named variables.

    An '@' character doesn't need to be escaped in a character class, so you can start with something like:

    my $any_slash_or_at = qr{ [\\@] }x;

    In my code below, you'll see that having done that means the rest of the code has almost no backslashes at all. This hopefully makes the code a lot more readable now and, six months or more down the track, when you or someone else needs to make a change.

    I then started to build up more complex regexes based on $any_slash_or_at. Again, this helps with readability and future maintenance.

    I ran a test, replacing your

    s/(?<![\\\@])[\\\@]\@/\x11/g; s/(?<![\\\@])[\\\@]\\/\x12/g;

    with

    s/$slash_or_at__at/DC1/g; s/$slash_or_at__slash/DC2/g;

    The output looks fine: I'll leave you to make similar changes in the while loop.

    You asked about "a better way". Building up the regex from fragments hopefully goes some way towards this. I've provided an alternative which uses a less complex regex (although still built from the initial $any_slash_or_at fragment) and requires only a single substitution: if nothing else, that gives you another option.

    [Everything I've provided should work with Perl v5.8; there may be better solutions if you have a more recent Perl version. As I didn't know what version you're working with, I didn't pursue these other potential solutions.]

    Here's the test script:

    #!/usr/bin/env perl -l

    use strict;
    use warnings;

    my $test_string = '\\\\\\ \\\\@ \\@\\ \\@@ @\\\\ @\\@ @@\\ @@@ \\\\ \\@ @\\ @@ \\ @';

    my $any_slash_or_at = qr{ [\\@] }x;
    my $lone_slash_or_at = qr{ (?<!$any_slash_or_at) $any_slash_or_at }x;
    my $slash_or_at__at = qr{ $lone_slash_or_at \@ }x;
    my $slash_or_at__slash = qr{ $lone_slash_or_at \\ }x;

    # Your code - a lot easier to read
    print 'Testing your code:';
    $_ = $test_string;
    print;
    s/$slash_or_at__at/DC1/g;
    s/$slash_or_at__slash/DC2/g;
    print;

    # An alternative with a less complex regex and a single substitution
    my %replace = ( '@' => 'DC1', '\\' => 'DC2' );
    my $escape_slash_or_at = qr{ $any_slash_or_at ( $any_slash_or_at ) }x;

    print 'Testing simultaneous replacements:';
    $_ = $test_string;
    print;
    s/$escape_slash_or_at/$replace{$1}/g;
    print;

    Output:

    Testing your code:
    \\\ \\@ \@\ \@@ @\\ @\@ @@\ @@@ \\ \@ @\ @@ \ @
    DC2\ DC2@ DC1\ DC1@ DC2\ DC2@ DC1\ DC1@ DC2 DC1 DC2 DC1 \ @
    Testing simultaneous replacements:
    \\\ \\@ \@\ \@@ @\\ @\@ @@\ @@@ \\ \@ @\ @@ \ @
    DC2\ DC2@ DC1\ DC1@ DC2\ DC2@ DC1\ DC1@ DC2 DC1 DC2 DC1 \ @

    -- Ken

Re: better way to escape escapes (updated)
by LanX (Saint) on Apr 16, 2014 at 01:09 UTC
    Would be easier if you provided sample data.

    The basic idea is to use alternation ("or" conditions) that matches an escaped sequence first, then a directive, and finally any single character.

    With a /g modifier within a while condition the match will start where the last one ended.

    Group only on your directives.

    update

    2 proofs of concept:

    The second one w/o while loop.

    Please note that you need to check for defined: every escape or single character yields an undefined capture, because those alternatives are ungrouped.

    DB<120> $str='xxx@@xx\@xx@abc\@xxx@efg@\xxx'
     => "xxx\@\@xx\\\@xx\@abc\\\@xxx\@efg\@\\xxx"

    DB<121> print ( defined $1 ? "$1\t" : "") while ( $str =~ m/ (?: [\\\@]{2} | ( \@\w+ ) | . ) /xg )
    @abc    @efg

    DB<122> grep { defined } $str =~ m/ (?: [\\\@]{2} | ( \@\w+ ) | . ) /xg
     => ("\@abc", "\@efg")

    update

    more efficient:

    DB<135> grep {defined} $str =~ m/ (?: [\\\@]{2}+ | ( [\\\@]\w+ ) | [^\\\@]+ ) /xg
     => ("\@abc", "\@efg")
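For use outside the debugger, the same alternation can be sketched as a small script. The helper name extract_directives is an assumption made for this example, and the pattern is a commented variant of the one above with a plain {2} quantifier.

```perl
use strict;
use warnings;

# Hypothetical helper built on the alternation idea: in list context, m//g
# returns the capture for every match, which is undef whenever a
# non-capturing branch matched, so grep keeps only the directives.
sub extract_directives {
    my ($str) = @_;
    return grep { defined } $str =~ m{
        (?: [\\\@]{2}          # escaped escape: consumed, nothing captured
          | ( [\\\@] \w+ )     # directive: the only capturing branch
          | [^\\\@]+           # run of ordinary characters
        )
    }xg;
}

my @found = extract_directives('xxx@@xx\@xx@abc\@xxx@efg@\xxx');
print "@found\n";    # @abc @efg
```

Ordering matters here: the escape-pair branch comes first, so a sequence such as \@ is consumed as an escape before the directive branch can see it.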

    Cheers Rolf

    ( addicted to the Perl Programming Language)

Re: better way to escape escapes
by andal (Hermit) on Apr 16, 2014 at 09:23 UTC

    Do you have to use a regexp? In the past, I've solved this issue by splitting the string on the "escapes", then walking over all the parts, merging two consecutive "escapes" into a single literal character and placing it back into the text.

    Something like this:

    my @parts = split /([\\@])/, $input;
    my $txt = shift @parts;
    while(@parts > 1)
    {
        shift @parts;
        my $t = shift @parts;
        if($t eq '')
        {
            # get the escaped "escape"
            $txt .= shift @parts;
            # get the text that follows it
            $txt .= shift @parts;
            next;
        }
        $t =~ s/^(\w+)//;
        my $cmd = $1;
        $txt .= "::$cmd\::$t";
    }
    $txt .= shift @parts if @parts;
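A runnable sketch of the same split-and-walk idea, wrapped in a sub, might look like the following. The name expand_escapes, the -1 limit passed to split, and the handling of a stray escape character are assumptions added for illustration; they are not part of the snippet above.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the split-and-walk idea. Directives are
# rewritten as ::name:: purely for illustration.
sub expand_escapes {
    my ($input) = @_;

    # The capturing group makes split return the separators as well, so the
    # list alternates: text, escape, text, escape, ..., text.
    # A limit of -1 keeps trailing empty fields, so that shape always holds.
    my @parts = split /([\\@])/, $input, -1;
    my $txt = @parts ? shift @parts : '';

    while (@parts) {
        my $esc = shift @parts;    # the escape character
        my $t   = shift @parts;    # the text that follows it

        if ( $t eq '' && @parts ) {
            $txt .= shift @parts;  # the escaped "escape", kept literally
            $txt .= shift @parts;  # the text that follows it
        }
        elsif ( $t =~ s/^(\w+)// ) {
            $txt .= "::${1}::$t";  # directive name, then the remaining text
        }
        else {
            $txt .= $esc . $t;     # assumption: a stray escape stays as-is
        }
    }
    return $txt;
}

print expand_escapes('abc@def@@ghi'), "\n";    # abc::def::@ghi
```

An empty field between two separators is what marks an escaped escape, which is why trailing empty fields must not be dropped by split.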

      Divide and Conquer looks a lot simpler. Also looks promising.

      Thanks.

      FYI, examples of the data interpretations:

      abc@def@@ghi => abcXef@ghi
      abc@def\@ghi => abcXef@ghi
      abc\def\\ghi => abcXef\ghi
      abc\def@\ghi => abcXef\ghi

      where d represents any directive and X represents the result.
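The four interpretations above can be checked with a short sketch. It assumes, purely for illustration, that the directive names are known in advance; here the only (hypothetical) directive is d, which expands to X. The %directive table and the interpret sub are invented for this example.

```perl
use strict;
use warnings;

# Assumption for illustration only: the set of directive names is known in
# advance. Here the single hypothetical directive 'd' expands to 'X'.
my %directive = ( d => 'X' );

# Longest names first, so that a longer name wins over a shorter prefix.
my $names = join '|', map { quotemeta($_) }
            sort { length $b <=> length $a } keys %directive;

sub interpret {
    my ($text) = @_;
    # One pass: an escape character followed either by another escape
    # character (kept literally) or by a known directive name (expanded).
    $text =~ s{ [\\@] (?: ([\\@]) | ($names) ) }
              { defined $1 ? $1 : $directive{$2} }gex;
    return $text;
}

for my $in ( 'abc@def@@ghi', 'abc@def\@ghi', 'abc\def\\\\ghi', 'abc\def@\ghi' ) {
    print "$in => ", interpret($in), "\n";
}
```

With real data the directive table would hold the actual names, and the expansion could be any code rather than a fixed string.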
      Thanks. Much better.
Re: better way to escape escapes
by Anonymous Monk on Apr 16, 2014 at 00:11 UTC