better way to escape escapes

RonW has asked for the wisdom of the Perl Monks concerning the following question:

I am processing an input format that can use either @ or \ to introduce inline directives. When a literal @ or \ is needed in the input, either character can be used to escape either character, i.e. @@ or \@ or \\ or @\

Currently, I replace the escape sequences with placeholders, then extract the directives, then replace the placeholders with the intended literal occurrences of @ and \ in the string.

    s/(?<![\\\@])[\\\@]\@/\x11/g;
    s/(?<![\\\@])[\\\@]\\/\x12/g;
    while (/[\\\@]([_A-Za-z]+)/)
    {
        print "Extracted code '$1'\n";
        s/[\\\@]$1//;
    }
    s/\x11/\@/g;
    s/\x12/\\/g;

I'm sure there's a better way, but my search-foo is lacking. And so is my regex-foo. (And there's likely an input file that will break this.)

(and no, it's not LaTeX, despite the similarities)
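
On the worry about an input that breaks this: one candidate appears to be four escape characters in a row. The sketch below wraps the code above in a sub (the name `placeholder_unescape` is made up for illustration, and the `print` of extracted codes is dropped) and feeds it `@@@@`, which should mean two literal `@`. Because the lookbehind still sees the first pair, the second pair is never treated as an escape.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# The placeholder approach from the question, wrapped in a sub
# (hypothetical name) so it can be run on a single string.
sub placeholder_unescape {
    my ($text) = @_;
    local $_ = $text;
    s/(?<![\\\@])[\\\@]\@/\x11/g;    # escaped @  -> placeholder
    s/(?<![\\\@])[\\\@]\\/\x12/g;    # escaped \  -> placeholder
    while (/[\\\@]([_A-Za-z]+)/) {   # strip directives
        s/[\\\@]$1//;
    }
    s/\x11/\@/g;                     # restore literal @
    s/\x12/\\/g;                     # restore literal \
    return $_;
}

# Two escaped '@' in a row: the intended result is '@@'.
print placeholder_unescape('@@@@'), "\n";    # prints @@@
```

An input that already contains a literal \x11 or \x12 byte would also be altered by the final two substitutions.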


Replies are listed 'Best First'.
Re: better way to escape escapes (1 pass) by tye (Sage) on Apr 16, 2014 at 01:42 UTC
Often better to do such things in a single pass.

    s{[\\@]([\\@]|[_A-Za-z]+)}{
        my $term = $1;
        if( $term =~ /[\\@]/ ) {
            $term   # Replace repeated escapes with second of pair
        } else {
            print "Extracted code '$term'\n";
            ''      # Replace extracted codes with nothing.
        }
    }ge;

- tye
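
To try the single-pass idea on a whole string, it can be wrapped in a sub. This is only a sketch: the name `strip_directives` and the choice to return the directive names in a list (instead of printing them) are assumptions made for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical wrapper around the single-pass substitution:
# returns the unescaped text followed by the directive names found.
sub strip_directives {
    my ($text) = @_;
    my @codes;
    $text =~ s{[\\@]([\\@]|[_A-Za-z]+)}{
        my $term = $1;
        if ($term =~ /[\\@]/) {
            $term;                 # escaped pair: keep the second character
        }
        else {
            push @codes, $term;    # directive: record it, replace with nothing
            '';
        }
    }ge;
    return ($text, @codes);
}

# Single-quoted, so '\\\\' is two backslashes in the actual string.
my ($text, @codes) = strip_directives('abc@def@@ghi\\\\jkl@@@@');
print "$text\n";     # abc@ghi\jkl@@
print "@codes\n";    # def
```

Because each match consumes both characters of an escape pair, a run such as `@@@@` comes out as `@@` with no lookbehind needed.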
Re: better way to escape escapes by kcott (Archbishop) on Apr 16, 2014 at 03:00 UTC
G'day RonW,

The first thing to do, to make your code more readable and maintainable, is to assign those regex fragments with the escapes to meaningfully named variables. An '`@`' character doesn't need to be escaped in a character class, so you can start with something like:

    my $any_slash_or_at = qr{ [\\@] }x;

In my code below, you'll see that having done that means the rest of the code has almost no backslashes at all. This hopefully makes the code a lot more readable now and, six months or more down the track, when you or someone else needs to make a change.

I then started to build up more complex regexes based on `$any_slash_or_at`. Again, this helps with readability and future maintenance.

I ran a test, replacing your

    s/(?<![\\\@])[\\\@]\@/\x11/g;
    s/(?<![\\\@])[\\\@]\\/\x12/g;

with

    s/$slash_or_at__at/DC1/g;
    s/$slash_or_at__slash/DC2/g;

The output looks fine. I'll leave you to make similar changes in the `while` loop.

You asked about "a better way". Building up the regex from fragments hopefully goes some way towards this. I've provided an alternative which uses a less complex regex (although still built from the initial `$any_slash_or_at` fragment) and requires only a single substitution: if nothing else, that gives you another option.

[Everything I've provided should work with Perl v5.8; there may be better solutions if you have a more recent Perl version. As I didn't know what version you're working with, I didn't pursue these other potential solutions.]

Here's the test script:

    #!/usr/bin/env perl -l

    use strict;
    use warnings;

    my $test_string = '\\\\\\ \\\\@ \\@\\ \\@@ @\\\\ @\\@ @@\\ @@@ \\\\ \\@ @\\ @@ \\ @';

    my $any_slash_or_at = qr{ [\\@] }x;
    my $lone_slash_or_at = qr{ (?<!$any_slash_or_at) $any_slash_or_at }x;
    my $slash_or_at__at = qr{ $lone_slash_or_at \@ }x;
    my $slash_or_at__slash = qr{ $lone_slash_or_at \\ }x;

    # Your code - a lot easier to read
    print 'Testing your code:';
    $_ = $test_string;
    print;
    s/$slash_or_at__at/DC1/g;
    s/$slash_or_at__slash/DC2/g;
    print;

    # An alternative with a less complex regex and a single substitution
    my %replace = ( '@' => 'DC1', '\\' => 'DC2' );
    my $escape_slash_or_at = qr{ $any_slash_or_at ( $any_slash_or_at ) }x;

    print 'Testing simultaneous replacements:';
    $_ = $test_string;
    print;
    s/$escape_slash_or_at/$replace{$1}/g;
    print;

Output:

    Testing your code:
    \\\ \\@ \@\ \@@ @\\ @\@ @@\ @@@ \\ \@ @\ @@ \ @
    DC2\ DC2@ DC1\ DC1@ DC2\ DC2@ DC1\ DC1@ DC2 DC1 DC2 DC1 \ @
    Testing simultaneous replacements:
    \\\ \\@ \@\ \@@ @\\ @\@ @@\ @@@ \\ \@ @\ @@ \ @
    DC2\ DC2@ DC1\ DC1@ DC2\ DC2@ DC1\ DC1@ DC2 DC1 DC2 DC1 \ @

-- Ken
Re: better way to escape escapes (updated) by LanX (Saint) on Apr 16, 2014 at 01:09 UTC
Would be easier if you provided sample data.

The basic idea is to use or-conditions (alternation) which match first an escaped sequence, secondly a directive, and then a single character. With a /g modifier within a while condition, each match starts where the last one ended. Group only on your directives.

Update: two proofs of concept, the second one without a while loop. Please notice that you need to check for defined, since every escape or single character produces an undefined (because ungrouped) capture.

    DB<120> $str='xxx@@xx\@xx@abc\@xxx@efg@\xxx'
     => "xxx\@\@xx\\\@xx\@abc\\\@xxx\@efg\@\\xxx"

    DB<121> print ( defined $1 ? "$1\t" : "") while ( $str =~ m/ (?: [\\\@]{2} | ( \@\w+ ) | . ) /xg )
    @abc    @efg

    DB<122> grep { defined } $str =~ m/ (?: [\\\@]{2} | ( \@\w+ ) | . ) /xg
     => ("\@abc", "\@efg")

Update: more efficient:

    DB<135> grep {defined} $str =~ m/ (?: [\\\@]{2}+ | ( [\\\@]\w+ ) | [^\\\@]+ ) /xg
     => ("\@abc", "\@efg")

Cheers Rolf

( addicted to the Perl Programming Language)
Re: better way to escape escapes by andal (Hermit) on Apr 16, 2014 at 09:23 UTC
Do you have to use regexp? In the past, I've solved this issue by splitting the string on "escapes", then walking all the parts, merging 2 subsequent "escapes" into a character and placing them back into the text. Something like this:

    my @parts = split /([\\@])/, $input;
    my $txt = shift @parts;
    while(@parts > 1)
    {
        shift @parts;
        my $t = shift @parts;
        if($t eq '')
        {
            # get the escaped "escape"
            $txt .= shift @parts;
            # get the text that follows it
            $txt .= shift @parts;
            next;
        }
        $t =~ s/^(\w+)//;
        my $cmd = $1;
        $txt .= "::$cmd\::$t";
    }
    $txt .= shift @parts if @parts;
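
A self-contained variant of the split-and-walk idea might look like the sketch below. It is not the exact code above. The sub name `mark_directives` is made up, and three behaviours are assumptions: `split` is given a limit of -1 so an escaped pair at the very end of the input is still seen, a match is checked before `$1` is used, and an escape character not followed by a name is kept literally.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical sub: rewrites each directive as ::name:: and unescapes
# doubled escape characters, walking the pieces produced by split.
sub mark_directives {
    my ($input) = @_;
    # Limit -1 keeps trailing empty fields, so after the leading text
    # the list is always (escape, text) pairs.
    my @parts = split /([\\@])/, $input, -1;
    my $txt = shift @parts;
    while (@parts) {
        my $esc = shift @parts;        # the escape character
        my $t   = shift @parts;        # the text that followed it
        if ($t eq '' && @parts) {
            $txt .= shift @parts;      # the escaped character itself
            $txt .= shift @parts;      # plain text after the pair
        }
        elsif ($t =~ s/^(\w+)//) {
            my $cmd = $1;
            $txt .= "::${cmd}::$t";    # directive followed by its trailing text
        }
        else {
            $txt .= $esc . $t;         # assumption: a lone escape stays literal
        }
    }
    return $txt;
}

print mark_directives('abc@def@@ghi'), "\n";    # abc::def::@ghi
print mark_directives('abc\def@\ghi'), "\n";    # abc::def::\ghi
```

Note that `\w+` takes the whole word after the escape as the directive name; a format with fixed-length directive names would need a different pattern there.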
Re^2: better way to escape escapes by RonW (Parson) on Apr 16, 2014 at 14:50 UTC
Divide and Conquer looks a lot simpler. Also looks promising. Thanks.

FYI, examples of the data interpretations:

    abc@def@@ghi => abcXef@ghi
    abc@def\@ghi => abcXef@ghi
    abc\def\\ghi => abcXef\ghi
    abc\def@\ghi => abcXef\ghi

where d represents any directive and X represents the result.
Re^2: better way to escape escapes by RonW (Parson) on Apr 16, 2014 at 16:06 UTC
Thanks. Much better.
Re: better way to escape escapes by Anonymous Monk on Apr 16, 2014 at 00:11 UTC
Hard to say without knowing more about the format. Oodles of s///ubstitutions is very easy to get wrong and very hard to get right: http://tt3.template-toolkit.org/talks/tt3-lpw2009/slides/slide17.html

See also the real discouragement, "Oh Yes You Can Use Regexes to Parse HTML!", and the real encouragement, "Re^2: parsing XML fragments (xml log files) with... a regex".
