Re: split on delimiter unless escaped
by moritz (Cardinal) on Nov 09, 2010 at 08:11 UTC
|
A typical approach is not to split, but to parse the chunks you want to preserve.
A regex for that is:
my $re = qr{
    (?>             # don't backtrack into this group
        !.          # either the escape character,
                    # followed by any other character
      |             # or
        [^!;\n]     # a character that is neither escape
                    # nor split character
    )+
}x;
while ($str =~ /($re)/g) {
    print "Chunk '$1'\n";
}
This technique is fairly general. It also works, for example, for quoted strings, where a backslash escapes the quote character so that it does not terminate the string.
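To illustrate the quoted-string case, here is a minimal sketch using the same pattern shape. It assumes double-quoted strings with a backslash as the escape character; the variable names and the sample input are made up for the example.

```perl
use strict;
use warnings;

# Same shape as the chunk regex: an escape pair, or any character
# that is neither the escape character nor the terminator.
my $quoted = qr{
    "                 # opening quote
    (
        (?>           # don't backtrack into this group
            \\.       # a backslash followed by any character
          |           # or
            [^"\\]    # a character that is neither quote nor backslash
        )*
    )
    "                 # closing quote
}xs;

# Hypothetical sample input containing escaped quotes.
my $input = 'say "hello \"world\"" and "bye"';

my @strings;
while ($input =~ /$quoted/g) {
    push @strings, $1;    # body of the string, escapes still in place
}
print "String '$_'\n" for @strings;
```

The captured text still contains the backslashes; removing them is a separate step, just as with the chunks above.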
You can read more about it in Mastering Regular Expressions by Jeffrey E.F. Friedl, a book I can warmly recommend.
Update: Added \n to the negated character class; mr_mischief pointed out that it is probably closer to the desired output that way.
|
Not quite. "+" means you can't have empty fields. And if you change it to "*", you can get one empty field too many. That's why my solution is slightly different.
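A small sketch that makes the difference visible, assuming the chunk pattern from the parent post and a made-up input with one empty field in the middle:

```perl
use strict;
use warnings;

# The chunk alternatives from the parent post, without a quantifier.
my $chunk = qr{ (?> !. | [^!;\n] ) }x;

# Hypothetical input: three fields, the middle one empty.
my $str = 'a;;b';

my @plus = $str =~ /((?:$chunk)+)/g;   # "+" can never return an empty field
my @star = $str =~ /((?:$chunk)*)/g;   # "*" also matches the empty string

print scalar(@plus), " chunks with +: ", join('|', map { "[$_]" } @plus), "\n";
print scalar(@star), " chunks with *: ", join('|', map { "[$_]" } @star), "\n";
```

With "+" the empty middle field is lost; with "*" the pattern returns more chunks than the input has fields, because it can match the empty string at more than one position.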
|
Hi ikegami,
Thanks for your example. I'm still trying to figure it all out. I'm running it as below, and it doesn't quite do what I want. I only want the escape character to be treated specially when it is part of a run matching !+; (one or more escape characters directly before a delimiter), i.e. a!!b should stay a!!b, whereas a!!!;b should become a!;b.
Also, I seem to be getting an empty field at the end. One or more semicolons at the end seem to be parsed properly, though.
One test string (the third sample below) returns a blank result, and I'm not sure why.
sub dequote {
    my $x = $_[0];
    $x =~ s/!(.)/$1/sg;
    return $x;
}
while (<>) {
    chomp;
    my @fields = map dequote($_), /\G((?:[^!;]+|!.)*)(?:;|\z)/sg;
    print "$_ => " . join( '|', @fields ) . "\n";
    # print "$_ => @fields\n";
}
Sample results:
aval!!!!;bval => aval!!|bval|
aval!!!!!;bval => aval!!;bval|
a!!val!!!!!;bval! =>
!a!!!val!!!!!;bval!! => a!val!!;bval!|
a!val!;bva!l; => aval;bval|
a!!val!!;;bv!!al;; => a!val!||bv!al||
|
|
Ah, neat. I made a couple of small modifications, and now I'm very close. The remaining problem is how to capture an empty field: where two delimiter characters are next to each other, I should emit an empty chunk instead of no chunk. Still stuck on that. What I have so far:
my $re = qr{
    (?>            # don't backtrack into this group
        !!         # either the escape character,
                   # followed by an escape character
      |            # or
        !;         # escape followed by delimiter char
      |            # or
        [^;\n]     # a character that is neither a delimiter
                   # character nor a newline
    )+
}x;
while (<>) {
    chomp;
    $str = $_;
    print "$_\n";
    while ($str =~ /($re)/g) {
        print " Chunk '$1' => ";
        $s = $1;
        $s =~ s/!!(?=(!|;))/!/g;
        print "$s\n";
    }
}
Example of paired delimiters (;;)
a!!val!!;;bv!!al;;
Chunk 'a!!val!!' => a!!val!!
Chunk 'bv!!al' => bv!!al
|
So the following seems to do exactly what I want, but doesn't handle empty fields. It might not matter, because my input shouldn't have any empty fields. I'll probably just check that my input string doesn't begin or end with a delimiter, or have two consecutive delimiters in the middle anywhere. If it does, it's bad input, and I can just throw it out. Would still be fun to know how to handle empty fields, though...
my $re = qr{
    (?>            # don't backtrack into this group
        !!         # either the escape character,
                   # followed by an escape character
      |            # or
        !;         # escape followed by delimiter char
      |            # or
        [^;\n]     # a character that is neither a delimiter
                   # character nor a newline
    )+
}x;
while (<>) {
    chomp;
    my @aray;
    $str = $_;
    print "$_\n ";
    while ($str =~ /($re)/g) {
        $s = $1;
        $s =~ s/!!(?=(!|;|\z))/!/g;
        push( @aray, $s );
    }
    print join(' | ', @aray) . "\n";
}
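On the empty-field question, one possible approach is to match each field together with its terminator and stop once the terminator is the end of the string. The following is only a sketch: split_escaped is a made-up name, and it assumes the simpler rule that "!" escapes whatever character follows it, not the narrower !+; rule described above. A dangling "!" at the very end of the input is not handled specially.

```perl
use strict;
use warnings;

# split_escaped is a hypothetical helper name, not from this thread.
# Assumption: "!" escapes whatever single character follows it.
sub split_escaped {
    my ($str) = @_;
    my @fields;
    # Each iteration consumes one (possibly empty) field plus its terminator.
    while ($str =~ /\G((?:[^!;]|!.)*)(;|\z)/sg) {
        my ($field, $sep) = ($1, $2);
        $field =~ s/!(.)/$1/sg;    # drop the escape character
        push @fields, $field;
        last if $sep eq '';        # terminator was end of string: stop here
    }
    return @fields;
}

for my $line ('a;;b', ';a', 'a!;b;', 'a!!;b') {
    print "$line => ", join('|', map { "[$_]" } split_escaped($line)), "\n";
}
```

Because the terminator is captured, an empty match before a ";" still advances the position, and the explicit "last" prevents a second empty match at the end of the string.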
Re: split on delimiter unless escaped
by ikegami (Patriarch) on Nov 09, 2010 at 04:11 UTC
|
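# Strip the escape character, keeping the character it protects.
sub dequote {
    my $x = $_[0];
    $x =~ s/!(.)/$1/sg;
    return $x;
}
# Each match is one field: a run of plain characters and escape pairs,
# terminated by a delimiter or the end of the string ($_ is the input).
my @fields = map dequote($_), /\G((?:[^!;]+|!.)*)(?:;|\z)/sg;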
Re: split on delimiter unless escaped
by mr_mischief (Monsignor) on Nov 09, 2010 at 06:08 UTC
|
I'm not sure why you're promoting single delimiters into multiples of the same character. It seems to me that would make things more difficult rather than easier. You probably want to use something that's not in your data at all. I'm guessing you're wanting the '!' rather than the more usual '\' because you're wanting to avoid escaping your escape in Perl or because you're dealing with a data format that someone else already set.
Here's a quick poke at what you describe, although if you get any fancier than this you'd probably want to work on a real (if minimal) parser.
while ( <> ) {
    chomp;
    s/!;/\x001/g;
    s/;/\x000/g;
    s/\x001/;/g;
    s/!!/!/g;
    @a = split /\x000/;
    print '{' . (join '}{', @a) . "}\n";
}
__END__
foo;bar;baz
fred!;flintstone;barney!!rubble
eggs!;spam!;toast!;spam;bacon!!;eggs!;spam!!toast!;spam;spamspam!!eggs!!;spam
Which prints:
{foo}{bar}{baz}
{fred;flintstone}{barney!rubble}
{eggs;spam;toast;spam}{bacon!;eggs;spam!toast;spam}{spamspam!eggs!;spam}
|
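That gets bacon!!;eggs wrong: the input should split after the escaped escape (giving bacon! and eggs), but the code above yields the single field bacon!;eggs.
Fix: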
while (<>) {
    chomp;
    s/\x{00}/\x{00}0/g;
    s/!!/\x{00}1/g;
    s/!;/\x{00}2/g;
    my @a = split /;/;
    for (@a) {
        s/\x{00}2/;/g;
        s/\x{00}1/!/g;
        s/\x{00}0/\x{00}/g;
    }
    ...
}
I also fixed the inability to have char 00 in the data.
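For experimenting with the placeholder idea, it can be wrapped in a function. This is a sketch only: the name split_with_placeholders is made up, and the -1 limit passed to split is an addition that keeps trailing empty fields, which the loop above does not do.

```perl
use strict;
use warnings;

# split_with_placeholders is a hypothetical name for illustration.
sub split_with_placeholders {
    my ($line) = @_;
    # Protect literal NULs first, then escaped escapes, then escaped delimiters.
    $line =~ s/\x{00}/\x{00}0/g;
    $line =~ s/!!/\x{00}1/g;
    $line =~ s/!;/\x{00}2/g;
    # Any ";" still present is a real delimiter.
    # Assumption: -1 is added here to keep trailing empty fields.
    my @fields = split /;/, $line, -1;
    for (@fields) {
        s/\x{00}2/;/g;
        s/\x{00}1/!/g;
        s/\x{00}0/\x{00}/g;    # restore literal NULs last
    }
    return @fields;
}

print '{', join('}{', split_with_placeholders($_)), "}\n"
    for 'bacon!!;eggs', 'fred!;flintstone;barney!!rubble';
```

The order matters in both directions: "!!" has to be protected before "!;" so that "!!;" is read as an escaped escape followed by a delimiter, and the NUL placeholder has to be restored last so that a literal NUL followed by a digit in the data is not mistaken for one of the other placeholders.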
Re: split on delimiter unless escaped
by JavaFan (Canon) on Nov 09, 2010 at 10:09 UTC
|
What you're doing isn't just splitting. You're also replacing escaped escapes with the escape character (it seems you're replacing every !! with !). You cannot do that with split alone; you'll have to parse. I think this does what you want:
#!/usr/bin/perl
use 5.010;
use strict;
use warnings;
while (<DATA>) {
    chomp;
    my @a;
    my $i = 0;
    while (/(!.|.)/g) {
        my $char = $1;
        if ($char eq ";") {
            $i++;
            next;
        }
        $char = $1 if $char =~ /!(.)/;
        $a[$i] .= $char;
    }
    say "@a";
}
__DATA__
foo;bar
foo!;bar;baz
foo!!;bar!;baz;qux
foo!!!;bar!!;baz!;qux;quux
foo!!!!;bar!!!;baz!!;qux!;quux;garply
foo!!!!!;bar!!!!;baz!!!;qux!!;quux!;garply;waldo
Output:
foo bar
foo;bar baz
foo! bar;baz qux
foo!;bar! baz;qux quux
foo!! bar!;baz! qux;quux garply
foo!!;bar!! baz!;qux! quux;garply waldo
Re: split on delimiter unless escaped
by derby (Abbot) on Nov 09, 2010 at 10:43 UTC
|
Re: split on delimiter unless escaped
by sundialsvc4 (Abbot) on Nov 09, 2010 at 13:24 UTC
|
I recall dealing with a situation like that, and what I wound up doing was to just go ahead and split on the desired character. Then, I simply examined the preceding element in the list, to see if it ended with an escape-character. If so, I combined the two elements and moved on.
I am sure that there is “regex mojo” that would have done it. (There’s a Golf section in this site for more reasons than just amusement ...) But what I did was clear-enuf, and it worked.
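A minimal sketch of that split-then-rejoin idea, assuming ";" as the delimiter and "!" as the escape character as elsewhere in this thread. The name split_and_rejoin is made up, and the final unescaping step is an addition for illustration.

```perl
use strict;
use warnings;

# split_and_rejoin is a hypothetical name; ";" and "!" are assumed to be
# the delimiter and the escape character.
sub split_and_rejoin {
    my ($str) = @_;
    my @fields;
    for my $piece (split /;/, $str, -1) {
        # An odd run of trailing "!" means the ";" we split on was escaped.
        if (@fields && $fields[-1] =~ /(!+)\z/ && length($1) % 2) {
            $fields[-1] .= ";$piece";
        }
        else {
            push @fields, $piece;
        }
    }
    s/!(.)/$1/sg for @fields;    # finally drop the escape characters
    return @fields;
}

print join(' | ', split_and_rejoin('foo!!;bar!;baz;qux')), "\n";
```

Counting the trailing escape characters, not just checking for one, is what keeps an escaped escape such as foo!!;bar from being glued back together.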
Re: split on delimiter unless escaped
by ibravos (Initiate) on Mar 09, 2012 at 10:44 UTC
|
I use the encode-decode technique given by Roberto Ierusalimschy in Programming in Lua, 20.4 – Tricks of the Trade. The book is freely available on the net. Here are the functions (translated from Lua to Perl):
{
    my $re = qr/\\(.)/;
    sub encode ($) {
        my $str = $_[0];
        $str =~ s/$re/{ sprintf '\\%03d', ord($1) }/ge;
        return $str;
    }
}
{
    my $re = qr/\\(\d{3})/;
    sub decode ($) {
        my $str = $_[0];
        $str =~ s/$re/{ '\\' . chr($1) }/ge;
        return $str;
    }
}
For example, after this line
my @splitted = map { decode $_ } split ',', encode('hel\,lo,world');
@splitted becomes ('hel\,lo', 'world').
I believe this is the most universal and easy to use technique.
Cheers