PerlMonks  

split on delimiter unless escaped

by yrp001 (Initiate)
on Nov 09, 2010 at 03:57 UTC ( [id://870238]=perlquestion )

yrp001 has asked for the wisdom of the Perl Monks concerning the following question:

I'm looking for good regex mojo. I'd like to split text on a delimiter, a semicolon, say. I also want to be able to escape my delimiter, using an exclamation point, say. And I'd like to be able to escape my escape. I'd prefer to only consider my escape character as an escape character if it immediately precedes my delimiter character or another escape character. To illustrate, I'd like to end up with:

;        split
!;       ;
!!;      !split
!!!;     !;
!!!!;    !!split
!!!!!;   !!;
etc.

Here's my mojo-lacking effort so far:

while(<>) {
    chomp;
    s/(?<!!);/ ;;;; /g;
    s/(?<!!)!;/;/g;
    s/(?<!!)!!;/! ;;;; /g;
    s/(?<!!)!!!;/!;/g;
    s/(?<!!)!!!!;/!! ;;;; /g;
    s/(?<!!)!!!!!;/!!;/g;
    @a = split(/ ;;;; /);
    print "@a\n";
}

Not so elegant. Could a wiser regex master lend a hand?

Replies are listed 'Best First'.
Re: split on delimiter unless escaped
by moritz (Cardinal) on Nov 09, 2010 at 08:11 UTC
    A typical approach is not to split, but to parse the chunks you want to preserve.

    A regex for that is:

    my $re = qr{
        (?>             # don't backtrack into this group
            !.          # either the escape character,
                        # followed by any other character
          |             # or
            [^!;\n]     # a character that is neither escape
                        # nor split character
        )+
    }x;

    while ($str =~ /($re)/g) {
        print "Chunk '$1'\n";
    }

    This technique is fairly general, and works for example for quoted strings, where the backslash can escape the quote character to not terminate the string.
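
    To illustrate the quoted-string case mentioned above, here is a minimal sketch of the same "escape plus any character, or any ordinary character" pattern applied to double-quoted strings with backslash escapes. The helper name quoted_strings and the sample text are made up for this example; the captured text still contains its backslashes.

```perl
use strict;
use warnings;

# Hypothetical helper: return the contents of every double-quoted string
# in $text, where a backslash escapes whatever character follows it.
sub quoted_strings {
    my ($text) = @_;
    my @found;
    # Inside the quotes: runs of ordinary characters, or a backslash plus
    # one character. The atomic group (?>...) avoids needless backtracking.
    while ($text =~ /"((?>[^"\\]+|\\.)*)"/gs) {
        push @found, $1;
    }
    return @found;
}

# The escaped quote inside the first string does not end it.
print "$_\n" for quoted_strings('say "a\"b" and "c"');
```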

    You can read more about it in Mastering Regular Expressions by Jeffrey E.F. Friedl, a book I can warmly recommend.

    Update: Added \n to the negated character class; mr_mischief pointed out that it is probably closer to the desired output that way.

    Perl 6 - links to (nearly) everything that is Perl 6.
      Not quite. "+" means you can't have empty fields. And if you change it to "*", you can get one too many empty fields. That's why my solution is slightly different.

        Hi ikegami,

        Thanks for your example. I'm still trying to figure it all out. I'm running it as below, and it doesn't seem to quite do what I want. I only want the escape character to be treated specially if it's in !+; - i.e. a!!b should be a!!b, whereas a!!!;b should be a!;b.

        Also, I seem to be getting an empty field at the end. One or more semicolons at the end seem to be parsed properly, though.

        One test string returns a blank result. ?

        sub dequote {
            my $x = $_[0];
            $x =~ s/!(.)/$1/sg;
            return $x;
        }

        while(<>) {
            chomp;
            my @fields = map dequote($_), /\G((?:[^!;]+|!.)*)(?:;|\z)/sg;
            print "$_ => " . join( '|', @fields ) . "\n";
            # print "$_ => @fields\n";
        }

        Sample results:

        aval!!!!;bval => aval!!|bval|
        aval!!!!!;bval => aval!!;bval|
        a!!val!!!!!;bval! => 
        !a!!!val!!!!!;bval!! => a!val!!;bval!|
        a!val!;bva!l; => aval;bval|
        a!!val!!;;bv!!al;; => a!val!||bv!al||

      Ah, neat. I made a couple of small modifications, and now I'm very close. The trouble left now is how to capture an empty field, i.e. where I have two delimiter characters next to each other I should emit an empty chunk instead of no chunk. Still stuck on that. What I have so far:

      my $re = qr{
          (?>             # don't backtrack into this group
              !!          # either the escape character,
                          # followed by an escape character
            |             # or
              !;          # escape followed by delimiter char
            |             # or
              [^;\n]      # a character that is neither delimiter
                          # character or a newline
          )+
      }x;

      while(<>) {
          chomp;
          $str = $_;
          print "$_\n";
          while ($str =~ /($re)/g) {
              print " Chunk '$1' => ";
              $s = $1;
              $s =~ s/!!(?=(!|;))/!/g;
              print "$s\n";
          }
      }

      Example of paired delimiters (;;)

      a!!val!!;;bv!!al;;
       Chunk 'a!!val!!' => a!!val!!
       Chunk 'bv!!al' => bv!!al

        So the following seems to do exactly what I want, but doesn't handle empty fields. It might not matter, because my input shouldn't have any empty fields. I'll probably just check that my input string doesn't begin or end with a delimiter, or have two consecutive delimiters in the middle anywhere. If it does, it's bad input, and I can just throw it out. Would still be fun to know how to handle empty fields, though...

        my $re = qr{
            (?>             # don't backtrack into this group
                !!          # either the escape character,
                            # followed by an escape character
              |             # or
                !;          # escape followed by delimiter char
              |             # or
                [^;\n]      # a character that is neither delimiter
                            # character or a newline
            )+
        }x;

        while(<>) {
            chomp;
            my @aray;
            $str = $_;
            print "$_\n ";
            while ($str =~ /($re)/g) {
                $s = $1;
                $s =~ s/!!(?=(!|;|\z))/!/g;
                push( @aray, $s );
            }
            print join(' | ', @aray) . "\n";
        }
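
        On the empty-field question, one possible approach (a sketch, not anything posted in this thread) is to let split capture each delimiter together with the run of '!' in front of it, then decide from the length of that run whether the ';' was escaped. The helper name split_escaped is made up here, and the sketch assumes that '!' not directly followed by ';' (including at the end of the line) is kept literally.

```perl
use strict;
use warnings;

# Hypothetical helper: split $text on ';', treating a run of '!' directly
# before a ';' as escapes. n bangs leave int(n/2) literal bangs; an odd n
# escapes the ';', an even n lets it split. Empty fields are kept.
sub split_escaped {
    my ($text) = @_;
    my @fields = ('');
    # The capturing group makes split return the "bangs + ';'" tokens too;
    # the -1 limit keeps trailing empty fields.
    my @tokens = split /(!*;)/, $text, -1;
    for my $i (0 .. $#tokens) {
        my $tok = $tokens[$i];
        if ($i % 2 == 0) {              # even index: text between delimiters
            $fields[-1] .= $tok;
            next;
        }
        my $bangs = length($tok) - 1;   # odd index: '!' run plus the ';'
        $fields[-1] .= '!' x int($bangs / 2);
        if ($bangs % 2) { $fields[-1] .= ';' }     # escaped delimiter
        else            { push @fields, '' }       # real delimiter
    }
    return @fields;
}

print join('|', split_escaped('a!!val!!;;bv!!al;;')), "\n";
```

        Because the work is done per token, a!!b passes through untouched while a!!!;b becomes a!;b, and two adjacent delimiters produce an empty field between them.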
Re: split on delimiter unless escaped
by ikegami (Patriarch) on Nov 09, 2010 at 04:11 UTC

    One way:

    sub dequote {
        my $x = $_[0];
        $x =~ s/!(.)/$1/sg;
        return $x;
    }

    my @fields = map dequote($_), /\G((?:[^!;]+|!.)*)(?:;|\z)/sg;
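
    Elsewhere in the thread this pattern is reported to give an extra empty field at the end, and a blank result when the line ends in a lone '!'. A hedged variant, assuming the extra field comes from the pattern matching once more at end of string: run the same match in a loop and stop as soon as \z has matched. The name split_fields and the '!.?' tweak (which lets a trailing '!' stand as a literal) are additions for this sketch, not part of the code above.

```perl
use strict;
use warnings;

# Hypothetical variant of the pattern above: same field grammar, but the
# loop ends explicitly once the end of the string has been reached.
sub split_fields {
    my ($line) = @_;
    my @fields;
    while ($line =~ /\G((?>[^!;]+|!.?)*)(;|\z)/sg) {
        my ($field, $end) = ($1, $2);   # copy before s/// overwrites $1, $2
        $field =~ s/!(.)/$1/sg;         # drop each escaping '!'
        push @fields, $field;
        last if $end eq '';             # \z matched: that was the last field
    }
    return @fields;
}

print join('|', split_fields('aval!!!!;bval')), "\n";
```

    Note that this keeps the semantics of the code above, where '!' escapes any following character, so a!!b comes out as a!b.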
Re: split on delimiter unless escaped
by mr_mischief (Monsignor) on Nov 09, 2010 at 06:08 UTC

    I'm not sure why you're promoting single delimiters into multiples of the same character. It seems to me that would make things more difficult rather than easier. You probably want to use something that's not in your data at all. I'm guessing you're wanting the '!' rather than the more usual '\' because you're wanting to avoid escaping your escape in Perl or because you're dealing with a data format that someone else already set.

    Here's a quick poke at what you describe, although if you get any fancier than this you'd probably want to work on a real (if minimal) parser.

    while ( <> ) {
        chomp;
        s/!;/\x001/g;
        s/;/\x000/g;
        s/\x001/;/g;
        s/!!/!/g;
        @a = split /\x000/;
        print '{' . (join '}{', @a) . "}\n";
    }
    __END__
    foo;bar;baz
    fred!;flintstone;barney!!rubble
    eggs!;spam!;toast!;spam;bacon!!;eggs!;spam!!toast!;spam;spamspam!!eggs!!;spam

    Which prints:

    {foo}{bar}{baz}
    {fred;flintstone}{barney!rubble}
    {eggs;spam;toast;spam}{bacon!;eggs;spam!toast;spam}{spamspam!eggs!;spam}

      As you've demonstrated, it fails to split

      bacon!!;eggs

      Fix

      while (<>) {
          chomp;
          s/\x{00}/\x{00}0/g;
          s/!!/\x{00}1/g;
          s/!;/\x{00}2/g;
          my @a = split /;/;
          for (@a) {
              s/\x{00}2/;/g;
              s/\x{00}1/!/g;
              s/\x{00}0/\x{00}/g;
          }
          ...
      }

      I also fixed the inability to have char 00 in the data.
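
      For anyone who wants to try the placeholder idea in isolation, here is a self-contained sketch of the same three-step encoding wrapped in a function. The name split_with_placeholders, the single-pass decode through a lookup hash, and the -1 limit on split (to keep trailing empty fields) are choices made for this example rather than part of the fix above.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the placeholder technique: hide the escaped
# sequences behind NUL-prefixed codes, split, then restore them.
sub split_with_placeholders {
    my ($line) = @_;
    $line =~ s/\x00/\x{00}0/g;   # protect any NUL already in the data
    $line =~ s/!!/\x{00}1/g;     # escaped escape
    $line =~ s/!;/\x{00}2/g;     # escaped delimiter
    my %plain = (0 => "\x00", 1 => '!', 2 => ';');
    my @fields = split /;/, $line, -1;        # -1 keeps trailing empty fields
    s/\x00([012])/$plain{$1}/g for @fields;   # decode every code in one pass
    return @fields;
}

print '{', join('}{', split_with_placeholders('bacon!!;eggs')), "}\n";
```

      As in the code above, every !! becomes !, wherever it appears in the field.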

Re: split on delimiter unless escaped
by JavaFan (Canon) on Nov 09, 2010 at 10:09 UTC
    What you're doing isn't just splitting. You're also replacing escaped escapes with the escape (it seems you're replacing every !! with !). You cannot do that with just a split; you'll have to parse. I think this does what you want:
    #!/usr/bin/perl

    use 5.010;
    use strict;
    use warnings;

    while (<DATA>) {
        chomp;
        my @a;
        my $i = 0;
        while (/(!.|.)/g) {
            my $char = $1;
            if ($char eq ";") {
                $i++;
                next;
            }
            $char = $1 if $char =~ /!(.)/;
            $a[$i] .= $char;
        }
        say "@a";
    }

    __DATA__
    foo;bar
    foo!;bar;baz
    foo!!;bar!;baz;qux
    foo!!!;bar!!;baz!;qux;quux
    foo!!!!;bar!!!;baz!!;qux!;quux;garply
    foo!!!!!;bar!!!!;baz!!!;qux!!;quux!;garply;waldo
    Output:
    foo bar
    foo;bar baz
    foo! bar;baz qux
    foo!;bar! baz;qux quux
    foo!! bar!;baz! qux;quux garply
    foo!!;bar!! baz!;qux! quux;garply waldo
Re: split on delimiter unless escaped
by derby (Abbot) on Nov 09, 2010 at 10:43 UTC

    ... and the obligatory link to Text::CSV.

    -derby
Re: split on delimiter unless escaped
by sundialsvc4 (Abbot) on Nov 09, 2010 at 13:24 UTC

    I recall dealing with a situation like that, and what I wound up doing was to just go ahead and split on the desired character. Then, I simply examined the preceding element in the list, to see if it ended with an escape character. If so, I combined the two elements and moved on.

    I am sure that there is “regex mojo” that would have done it. (There’s a Golf section in this site for more reasons than just amusement ...) But what I did was clear enough, and it worked.
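
    A rough sketch of that split-then-merge idea, under a couple of assumptions: the delimiter is ';', the escape is '!', and a piece counts as "ending with an escape" only when it ends in an odd number of '!' (so that !!; still splits). The name split_then_merge is invented for this example, and the escape characters are left in the output; unescaping would be a separate pass.

```perl
use strict;
use warnings;

# Hypothetical helper: split on every ';', then glue a piece back onto
# the previous field whenever the piece before it ended in an escaping '!'.
sub split_then_merge {
    my ($line) = @_;
    my @fields;
    my $escaped = 0;    # true if the previous piece ended in an odd '!' run
    for my $piece (split /;/, $line, -1) {
        if ($escaped) { $fields[-1] .= ";$piece" }   # delimiter was escaped
        else          { push @fields, $piece }
        my ($bangs) = $piece =~ /(!*)\z/;            # trailing run of '!'
        $escaped = length($bangs) % 2;
    }
    return @fields;
}

print join(' | ', split_then_merge('a!;b;c')), "\n";
```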

Re: split on delimiter unless escaped
by ibravos (Initiate) on Mar 09, 2012 at 10:44 UTC
    I use the encode-decode technique given by Roberto Ierusalimschy in Programming in Lua, 20.4 – Tricks of the Trade. The book is freely available on the net. Here are the functions (translated from Lua to Perl):
    {
        my $re = qr/\\(.)/;
        sub encode ($) {
            my $str = $_[0];
            $str =~ s/$re/{ sprintf '\\%03d', ord($1) }/ge;
            return $str
        }
    }
    {
        my $re = qr/\\(\d{3})/;
        sub decode ($) {
            my $str = $_[0];
            $str =~ s/$re/{ '\\' . chr($1) }/ge;
            return $str
        }
    }
    For example, after this line
    my @splitted = map { decode $_ } split ',', encode('hel\,lo,world')
    @splitted becomes ('hel\,lo', 'world'). I believe this is the most universal and easy to use technique. Cheers
