solution wanted for break-on-spaces (w/specifics)

by perl-diddler (Chaplain)
on Oct 23, 2021 at 20:53 UTC (#11137926=perlquestion)

perl-diddler has asked for the wisdom of the Perl Monks concerning the following question:

I want to break input on spaces using an RE, subject to these additional rules:
1) Spaces between single or double quotes are not break points.
2) A single backslash (BS) removes any special meaning from the next character.
3) An unterminated quoted string is considered terminated by end-of-string.

I think (hope) that's it. I've evolved an example I found on this board, but don't remember the original URL. It had some failing, where it didn't satisfy the above additional rules. That example is:

qr{((?>\\[^\\])|"(?:\\\\"|[^\\"]*)"|(?:'[^\\']*'|\S+))+}

My test inputs, one per line in the form "case/line# answer,text" (the test prog strips '#' comment lines), are:

# ln#, ans, test text
1 3,This is simple.
2 3,This is "so very simple".
3 4,This "is so" very simple.
4 2,This 'isn\'t nice.'
5 2,This "isn\"t nice."
6 3,This 'isn\\'t nice.'
7 3,This "isn\\"t nice."
8 2,This "isnt wrong."
9 2,This 'isnt wrong.'
10 3,This 'isn\\'t wrong.
11 3,This "isn\\"t wrong.
12 3,This isn\'t horrible.
13 3,This isn\"t terrible.
14 3,This \"isnt unnice.\"
15 3,This \'isnt unnice.\'
16 2,This 'is not unnice.'
17 2,This "is not unnice."
18 3,a "bb cc" d

So far, my sample test-runs show lines/cases 4+5 to be in error.
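
A minimal sketch of why cases 4 and 5 likely come out one piece too many, assuming the quoted-string branches of the regex above are responsible: `'[^\\']*'` has no way to step over a backslash, so at the opening quote it gives up and `\S+` takes over, which then stops at the first space. The names `$posted`, `$variant` and `quoted_at_first_quote` are hypothetical and exist only for this illustration; the variant branch is one possible repair, not necessarily the intended one.

```perl
use strict;
use warnings;

my $text = q{This 'isn\'t nice.'};    # test case 4; q{} keeps the backslash

# Single-quote branch as posted: a backslash inside the quotes cannot be consumed.
my $posted = qr/'[^\\']*'/;

# Hypothetical variant: an escape pair (backslash + any char) is consumed as one unit.
my $variant = qr/'(?:\\.|[^\\'])*'/;

# Try the pattern at the first quote character only, the way the tokenizer would.
sub quoted_at_first_quote {
    my ( $re, $str ) = @_;
    return $str =~ /^[^'"]*($re)/ ? $1 : undef;
}

print "posted:  ", quoted_at_first_quote( $posted,  $text ) // 'no match', "\n";
print "variant: ", quoted_at_first_quote( $variant, $text ) // 'no match', "\n";
# posted:  no match
# variant: 'isn\'t nice.'
```

The same reasoning applies to the double-quote branch and case 5.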

That is the main problem. The rest is specific to my testing program, which I don't care about, but which can be used to run the sample RE against the test cases (they are included in the source of the testing program). The test prog's output follows (note: as with the test prog itself, I don't care about the format; it was just to show me passes and fails):

ResByLn:{ln=>1, wanted=>3, got=>[3]},[" p "]
ResByLn:{ln=>2, wanted=>3, got=>[3]},[" p "]
ResByLn:{ln=>3, wanted=>4, got=>[4]},[" p "]
ResByLn:{ln=>4, wanted=>2, got=>[3]},["FAIL:<This "isn\"t nice.">"]
ResByLn:{ln=>5, wanted=>2, got=>[3]},["FAIL:<This 'isn\\'t nice.'>"]
ResByLn:{ln=>6, wanted=>3, got=>[3]},[" p "]
ResByLn:{ln=>7, wanted=>3, got=>[3]},[" p "]
ResByLn:{ln=>8, wanted=>2, got=>[2]},[" p "]
ResByLn:{ln=>9, wanted=>2, got=>[2]},[" p "]
ResByLn:{ln=>10, wanted=>3, got=>[3]},[" p "]
ResByLn:{ln=>11, wanted=>3, got=>[3]},[" p "]
ResByLn:{ln=>12, wanted=>3, got=>[3]},[" p "]
ResByLn:{ln=>13, wanted=>3, got=>[3]},[" p "]
ResByLn:{ln=>14, wanted=>3, got=>[3]},[" p "]
ResByLn:{ln=>15, wanted=>3, got=>[3]},[" p "]
ResByLn:{ln=>16, wanted=>2, got=>[2]},[" p "]
ResByLn:{ln=>17, wanted=>2, got=>[2]},[" p "]
ResByLn:{ln=>18, wanted=>3, got=>[3]},[" p "]

The test prog was originally written to handle multiple REs, but I threw out all the ones that were less successful than the one remaining. That's why some things, like 'got', are bracketed as array values in the test-prog output.

For FAIL cases, I had it print the failing text, just to be sure I was failing on the line I thought I was.

The test prog takes an optional '-f' that filters output to only display the FAILing cases.

I repeat -- my main problem is finding an RE that I can use to break the test-cases into the correct number of capture-args (as broken by unquoted spaces). Am including my test prog, below for those that want to use it or see what I did. But the help I need is in fixing the 'RE' so all the test cases work. Thanks for any help!

#!/usr/bin/perl
use P;
# vim=:SetNumberAndWidth

if ( -t 0) { # unlikely to provide reliable test case
  open(STDIN, "<&=", "main::DATA") || die "opening internal tests: $!";
}
my @lines = grep { defined $_ and ! /^\s*#/ } (<main::DATA>);
my @regex = ( qr{((?>\\[^\\])|"(?:\\\\"|[^\\"]*)"|(?:'[^\\']*'|\S+))+} );

my $ln=0;
my $norm=0;
my @ResByLn;

sub lnout($$$) {
  my ($ans,$outp, $lnp) = @_;
  bless {wanted=>$ans, got=>$outp, ln=>$lnp}, q(ResByLn:);
}

sub txt($) {
  local $_=shift;
  my (undef, undef,$txt)=m{^\s*(\d+)\s+(\d+),(.*)};
  $txt;
}

my $only_fails = @ARGV && $ARGV[0] eq '-f' ? 1 : 0;

for (@lines) {
  ++$ln;
  my ($lnnum, $ans,$_)=m{^\s*(\d+)\s+(\d+),(.*)};
  my @got;
  for (my $r=0; $r<@regex; ++$r) {
    my $reg = $regex[$r];
    my @out = grep {$_ } m{$reg}g;
    my $cnt = 0+@out;
    push @got, $cnt;
  }
  $lnnum and push @outs, lnout($ans, \@got, $lnnum);
}

for my $o (@outs) {
  my $ans = $o->{wanted};
  my @out = @{$o->{got}};
  my $ln = $o->{ln};
  my @rts;  #results
  for (my $i=0;$i<@out;++$i) {
    $rts[$i] = do {
      if ($ans==$out[$i]) { " p " }
      else {
        "FAIL:" . do { my $txt = txt($lines[$ln]); chomp $txt; "<$txt>"; };
      }
    };
  }
  my @output_args=("%s,%s",$o,\@rts);
  # * Shows line-number, result-wanted, result-got + test-status
  # * if test-status is FAIL, shows corresponding test-txt between <> (carats)
  #
  unless ($only_fails) {
    P @output_args;
  } else {
    my @fails = grep /FAIL/, P @output_args;
    foreach (@fails) { P "%s", $_; }
  }
}
# vim: ts=2 sw=2 ai number
__DATA__
# ln#, ans, test text
1 3,This is simple.
2 3,This is "so very simple".
3 4,This "is so" very simple.
4 2,This 'isn\'t nice.'
5 2,This "isn\"t nice."
6 3,This 'isn\\'t nice.'
7 3,This "isn\\"t nice."
8 2,This "isnt wrong."
9 2,This 'isnt wrong.'
10 3,This 'isn\\'t wrong.
11 3,This "isn\\"t wrong.
12 3,This isn\'t horrible.
13 3,This isn\"t terrible.
14 3,This \"isnt unnice.\"
15 3,This \'isnt unnice.\'
16 2,This 'is not unnice.'
17 2,This "is not unnice."
18 3,a "bb cc" d
__END__

P.S. I could probably do this easily in a parser, but I'm trying to fit it into an RE, as I think it should be possible, and/or maybe I'm just a masochist. :-0

P.P.S. I originally had 4,5,6,7 as wrong, but realized 6+7 were right, so I corrected things (I hope).

Replies are listed 'Best First'.
Re: solution wanted for break-on-spaces (w/specifics)
by AnomalousMonk (Bishop) on Oct 24, 2021 at 07:59 UTC

    Building on kcott's approach (and his test cases and their underlying assumptions), here's a regex-based solution. I've added a few test cases of my own, but their validity is questionable because I don't fully understand perl-diddler's requirements. No attempt has been made to compare performance.

    Win8 Strawberry 5.8.9.5 (32) Sun 10/24/2021 3:14:25
    C:\@Work\Perl\monks
    >perl
    use strict;
    use warnings;

    use Test::More;
    use Test::NoWarnings;

    sub pp { local $" = '| |'; "|@{$_[0]}|"; }  # for output pretty-printing

    my @tests = (
      q{all '- and "-quotes properly balanced},
      [ q{This is simple.},            [ q{This}, q{is}, q{simple.} ] ],
      [ q{ This is simple. },          [ q{This}, q{is}, q{simple.} ] ],
      [ q{This is "so very simple".},  [ q{This}, q{is}, q{"so very simple".} ] ],
      [ q{This "is so" very simple.},  [ q{This}, q{"is so"}, q{very}, q{simple.} ] ],
      [ q{This 'isn\'t nice.'},        [ q{This}, q{'isn\'t nice.'} ] ],
      [ q{This "isn\"t nice."},        [ q{This}, q{"isn\"t nice."} ] ],
      [ q{This 'isn\\\\'t nice.'},     [ q{This}, q{'isn\\\\'t}, q{nice.'} ] ],
      [ q{This "isn\\\\"t nice."},     [ q{This}, q{"isn\\\\"t}, q{nice."} ] ],
      [ q{This 'is not unnice.'},      [ q{This}, q{'is not unnice.'} ] ],
      [ q{This "is not unnice."},      [ q{This}, q{"is not unnice."} ] ],
      [ q{a "bb cc" d},                [ q{a}, q{"bb cc"}, q{d} ] ],
      q{UNbalanced '- and "-quotes at absolute end of string},
      [ q{This is "so very simple},    [ q{This}, q{is}, q{"so very simple} ] ],
      [ q{This 'isn\'t nice.},         [ q{This}, q{'isn\'t nice.} ] ],
      [ q{This "isn\"t nice.},         [ q{This}, q{"isn\"t nice.} ] ],
      [ q{This 'isn\\\\'t nice.},      [ q{This}, q{'isn\\\\'t}, q{nice.} ] ],
      [ q{This "isn\\\\"t nice.},      [ q{This}, q{"isn\\\\"t}, q{nice.} ] ],
      [ q{This 'is not unnice.},       [ q{This}, q{'is not unnice.} ] ],
      [ q{This "is not unnice.},       [ q{This}, q{"is not unnice.} ] ],
      'what about these questionable cases?',
      [ q{is this"really so"simple now?}, [ q{is}, q{this"really so"simple}, q{now?} ] ],
      [ q{is this"really so" now?},       [ q{is}, q{this"really so"}, q{now?} ] ],
      [ q{is "really so"simple now?},     [ q{is}, q{"really so"simple}, q{now?} ] ],
      [ q{is this'really so'simple now?}, [ q{is}, q{this'really so'simple}, q{now?} ] ],
      [ q{is this'really so' now?},       [ q{is}, q{this'really so'}, q{now?} ] ],
      [ q{is 'really so'simple now?},     [ q{is}, q{'really so'simple}, q{now?} ] ],
      );

    my @additional = qw(Test::NoWarnings);  # each of these adds 1 test

    plan 'tests' => (scalar grep { ref eq 'ARRAY' } @tests) + @additional;

    # an escape \ escapes ANY character.
    my $rx_dq = qr{ " [^\\"]* (?: \\. [^\\"]*)* (?: " | \z) }xms;
    my $rx_sq = qr{ ' [^\\']* (?: \\. [^\\']*)* (?: ' | \z) }xms;
    my $rx_q  = qr{ $rx_dq | $rx_sq }xms;

    # match quoted or non-space substrings. alt order critical!
    # my $rx_extract = qr{ $rx_q \S* | \S+ }xms;  # for non-questionable cases
    my $rx_extract = qr{ [^'"\s]* $rx_q [^'"\s]* | \S+ }xms;

    VECTOR:
    for my $ar_vector (@tests) {
      if (not ref $ar_vector) {
        note $ar_vector;
        next VECTOR;
      }

      my ($string, $ar_expected) = @$ar_vector;

      my @got = $string =~ m{ $rx_extract }xmsg;
      is_deeply \@got, $ar_expected, "|$string| -> " . pp $ar_expected;
    }  # end for VECTOR
    ^Z
    1..25
    # all '- and "-quotes properly balanced
    ok 1 - |This is simple.| -> |This| |is| |simple.|
    ok 2 - | This is simple. | -> |This| |is| |simple.|
    ok 3 - |This is "so very simple".| -> |This| |is| |"so very simple".|
    ok 4 - |This "is so" very simple.| -> |This| |"is so"| |very| |simple.|
    ok 5 - |This 'isn\'t nice.'| -> |This| |'isn\'t nice.'|
    ok 6 - |This "isn\"t nice."| -> |This| |"isn\"t nice."|
    ok 7 - |This 'isn\\'t nice.'| -> |This| |'isn\\'t| |nice.'|
    ok 8 - |This "isn\\"t nice."| -> |This| |"isn\\"t| |nice."|
    ok 9 - |This 'is not unnice.'| -> |This| |'is not unnice.'|
    ok 10 - |This "is not unnice."| -> |This| |"is not unnice."|
    ok 11 - |a "bb cc" d| -> |a| |"bb cc"| |d|
    # UNbalanced '- and "-quotes at absolute end of string
    ok 12 - |This is "so very simple| -> |This| |is| |"so very simple|
    ok 13 - |This 'isn\'t nice.| -> |This| |'isn\'t nice.|
    ok 14 - |This "isn\"t nice.| -> |This| |"isn\"t nice.|
    ok 15 - |This 'isn\\'t nice.| -> |This| |'isn\\'t| |nice.|
    ok 16 - |This "isn\\"t nice.| -> |This| |"isn\\"t| |nice.|
    ok 17 - |This 'is not unnice.| -> |This| |'is not unnice.|
    ok 18 - |This "is not unnice.| -> |This| |"is not unnice.|
    # what about these questionable cases?
    ok 19 - |is this"really so"simple now?| -> |is| |this"really so"simple| |now?|
    ok 20 - |is this"really so" now?| -> |is| |this"really so"| |now?|
    ok 21 - |is "really so"simple now?| -> |is| |"really so"simple| |now?|
    ok 22 - |is this'really so'simple now?| -> |is| |this'really so'simple| |now?|
    ok 23 - |is this'really so' now?| -> |is| |this'really so'| |now?|
    ok 24 - |is 'really so'simple now?| -> |is| |'really so'simple| |now?|
    ok 25 - no warnings


    Give a man a fish:  <%-{-{-{-<

      Re: "No attempt has been made to compare performance." Absolutely! I thought about rolling that in, but the question was already long and complex. I totally agree that it should be measured and factored in; however, there are a few sayings around that, such as "premature optimization is a bane". It is similar to priorities in code development: 1) get something working, 2) then look at other issues (like perf, etc).

      Of the solutions I've seen, both seem like they wouldn't be too different as they use similar methodology. A multi-state parser might be faster than an RE, but maybe not if written in interpreted perl code. One might have to go to 'XS' to gain speed that way.
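
One way to settle the regex-versus-parser question would be the core Benchmark module. The sketch below is only an illustration under stated assumptions: the regex is the one posted by tybalt89 elsewhere in this thread, while `split_loop`, `split_regex` and the sample input are hypothetical and written just for this comparison. Timings depend on the machine and Perl build, so no numbers are claimed here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Regex tokenizer; the pattern is the one from tybalt89's reply in this thread.
my $token_re = qr/(?:
      '(?: \\. | [^'\\] )*'   # single-quoted string
    | "(?: \\. | [^"\\] )*"   # double-quoted string
    | ['"].*                  # unmatched quote runs to end of line
    | \\.                     # escaped character
    | \S                      # any other non-space character
    )+/x;

sub split_regex {
    my ($str) = @_;
    return [ $str =~ /$token_re/g ];
}

# Hand-written state machine (hypothetical, for comparison only).
sub split_loop {
    my ($str) = @_;
    my @tokens;
    my ( $cur, $quote, $escaped ) = ( '', '', 0 );
    for my $ch ( split //, $str ) {
        if    ($escaped)        { $cur .= $ch; $escaped = 0; }
        elsif ( $ch eq '\\' )   { $cur .= $ch; $escaped = 1; }
        elsif ($quote)          { $cur .= $ch; $quote = '' if $ch eq $quote; }
        elsif ( $ch =~ /['"]/ ) { $cur .= $ch; $quote = $ch; }
        elsif ( $ch =~ /\s/ )   { push @tokens, $cur if length $cur; $cur = ''; }
        else                    { $cur .= $ch; }
    }
    push @tokens, $cur if length $cur;
    return \@tokens;
}

# Sample line repeated to give each call some real work.
my $sample = q{This "is so" very 'isn\'t nice.' simple\ text};
my $input  = join ' ', ($sample) x 20;

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese( -1, {
    regex => sub { split_regex($input) },
    loop  => sub { split_loop($input) },
} );
```

Before trusting any timing, it is worth confirming that both candidates return the same token list for the inputs being measured.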

Re: solution wanted for break-on-spaces (w/specifics)
by hippo (Bishop) on Oct 23, 2021 at 22:57 UTC

    Here are a few suggestions to make the code clearer and perhaps then garner more helpful answers:

    • use strict
    • use warnings
    • use Test::More instead of trying to roll your own testing framework
    • Avoid prototypes
    • Avoid localising $_
    • Avoid capture groups which you never use
    • Avoid P. It's fine in your own code but here it is unnecessary (or would be if you used Test::More) and is another barrier to help.
    • Pick a formatting scheme and stick to it. Random whitespace doesn't help.

    In summary, help us to help you.


    🦛

      re strict/warnings -- they were there and got deleted as I deleted chunks of template-prefix code...*oops*.

      Test::More is what I use for testing not random development -- Test::More is a heavy-weight solution for testing a few example RE's against lines in a file.

      prototypes -- avoid? only when I need to avoid them to make it work. Most of my prototypes are documentary -- in that I put them on Class-methods where they aren't used, with the expectation that the "this" ptr doesn't count.

      localising $_ -- I localise it if I change its value in a sub -- I don't want to create side effects. In code cleanup I'll often replace them with "my $var"s.

      capture groups -- don't think there were any such that I didn't use. I use (?:...) if I don't use the result.

      Avoid P? If I don't use it, who would? ;-)

      As for being able to 'help' me -- I'm beyond help, but anyone who tried to write a regex seemed to have no problem giving me clues about things that worked or things to try.

        Test::More is what I use for testing not random development -- Test::More is a heavy-weight solution for testing a few example RE's against lines in a file.

        Test::More is in Core so everyone has it and everyone who writes any significant amount of Perl has used it and is familiar with it. The same is not true of your hand-rolled testing framework so when I look at your example code I have to first analyse your testing framework not least because it might be responsible for the underlying problem your code exhibits.

        If Test::More is too "heavy-weight" for you then you can always use the ultra-light Test::Simple instead.
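
For comparison, a count-only check of the kind the original test prog performs could look roughly like this with Test::More. This is a sketch, not a drop-in replacement: the regex and the five cases are copied from the question, while `count_tokens` and `%known_bad` are hypothetical names. Cases 4 and 5 are wrapped in a TODO block because the question reports them as failing.

```perl
use strict;
use warnings;
use Test::More;

# Regex exactly as posted in the question.
my $re = qr{((?>\\[^\\])|"(?:\\\\"|[^\\"]*)"|(?:'[^\\']*'|\S+))+};

# Count how many pieces the regex finds in one line of text.
sub count_tokens {
    my ($text) = @_;
    my $count = () = $text =~ /$re/g;
    return $count;
}

# [ case number, wanted count, text ], copied from the question's data.
my @cases = (
    [ 1,  3, q{This is simple.} ],
    [ 3,  4, q{This "is so" very simple.} ],
    [ 4,  2, q{This 'isn\'t nice.'} ],
    [ 5,  2, q{This "isn\"t nice."} ],
    [ 18, 3, q{a "bb cc" d} ],
);
my %known_bad = ( 4 => 1, 5 => 1 );    # reported as failing in the question

for my $case (@cases) {
    my ( $ln, $want, $text ) = @$case;
  TODO: {
        # A defined $TODO marks the test as an expected failure.
        local $TODO = $known_bad{$ln} ? 'fails with the regex as posted' : undef;
        is( count_tokens($text), $want, "case $ln: <$text>" );
    }
}

done_testing;
```

Swapping `is` on a count for `is_deeply` on the token list would turn this into a qualitative test.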

        prototypes -- avoid?

        Yes, avoid!

        localising $_ -- I localise it if I change its value in a sub -- I don't want to create side effects. In code cleanup I'll often replace them with "my $var"s.
        capture groups -- don't think there were any such that I didn't use. I use (?:...) if I don't use the result.

        Here is your subroutine txt:

        sub txt($) { local $_=shift; my (undef, undef,$txt)=m{^\s*(\d+)\s+(\d+),(.*)}; $txt; }

        It unnecessarily localizes $_ and discards 2 capture groups. Instead it could be written thus:

        sub txt { shift =~ /^\s*\d+\s+\d+,(.*)/; return $1; }

        No need to mess with $_ or declare any lexical variables at all. No need for 3 capture groups when all you want is one. No need for prototypes either.

        Of course you are entirely free to ignore these suggestions but the harder you make it for others to read or run your code the less likely they are to want to unpick it all.


        🦛

Re: solution wanted for break-on-spaces (w/specifics)
by kcott (Bishop) on Oct 24, 2021 at 05:00 UTC

    G'day perl-diddler,

    Testing for the number of elements is a weak test; you really need qualitative tests as well. In addition, that would have told us what you expected (and allowed better answers).

    Your title has "break-on-spaces" (plural) but all your tests only use single spaces. In my code below, I added an additional test to show that q{This is simple.} and q{     This  is   simple. } both produce the same output. I guessed that is what you would've wanted; if not, you'll need to advise us.

    Writing code for purely academic reasons is absolutely fine; I do it myself. Having said that, the regex you presented is unwieldy, difficult to read, and maintenance would, I suspect, be an error-prone nightmare. I've provided an alternative solution below which mostly just uses Perl's string handling functions. When you have a working regex solution, I'd be interested to see a benchmark.

    You indicated that you'd encountered problems with lines 4-7; and later amended that to just 6-7. I suspect you may have run into problems with escaping, particularly \\ and \\\\. Take a look at my ok N lines 7-10: I've just made a guess at what I thought you wanted.

    I've included most of your tests; you can, of course, add the remainder yourself. I didn't see the benefit of tests 8 and 9; and I thought that tests 10-15 potentially had issues with escaped backslashes, so it's perhaps best to wait for clarification from you on that score.

    Here's the code:

    #!/usr/bin/env perl

    use strict;
    use warnings;

    use Test::More;

    my @tests = (
        [q{This is simple.}, [q{This}, q{is}, q{simple.}]],
        [q{ This is simple. }, [q{This}, q{is}, q{simple.}]],
        [q{This is "so very simple".}, [q{This}, q{is}, q{"so very simple".}]],
        [q{This "is so" very simple.}, [q{This}, q{"is so"}, q{very}, q{simple.}]],
        [q{This 'isn\'t nice.'}, [q{This}, q{'isn\'t nice.'}]],
        [q{This "isn\"t nice."}, [q{This}, q{"isn\"t nice."}]],
        [q{This 'isn\\'t nice.'}, [q{This}, q{'isn\\'t nice.'}]],
        [q{This "isn\\"t nice."}, [q{This}, q{"isn\\"t nice."}]],
        [q{This 'isn\\\\'t nice.'}, [q{This}, q{'isn\\\\'t}, q{nice.'}]],
        [q{This "isn\\\\"t nice."}, [q{This}, q{"isn\\\\"t}, q{nice."}]],
        [q{This 'is not unnice.'}, [q{This}, q{'is not unnice.'}]],
        [q{This "is not unnice."}, [q{This}, q{"is not unnice."}]],
        [q{a "bb cc" d}, [q{a}, q{"bb cc"}, q{d}]],
    );

    plan tests => 0+@tests;

    for my $test (@tests) {
        my ($raw_str, $exp) = @$test;
        my $str = ($raw_str =~ /^\s*(.*?)\s*$/)[0];
        my $got = [];
        my $str_len = length $str;
        my ($unbroken, $in_quote, $escape, $in_space) = ('', '', 0, 0);
        my $quote_re = qr{(['"])};

        for my $str_index (0 .. $str_len - 1) {
            my $char = substr $str, $str_index, 1;

            if ($escape) {
                $unbroken .= $char;
                $escape = 0;
                next;
            }

            if ($char eq qq{\\}) {
                $escape = 1;
                $unbroken .= $char;
                next;
            }

            if ($char =~ $quote_re) {
                my $quote = $char;

                if ($in_quote) {
                    $in_quote = '' if $in_quote eq $quote;
                }
                else {
                    $in_quote = $quote;
                }

                $unbroken .= $char;
                next;
            }

            if ($char eq ' ') {
                next if $in_space;

                if ($in_quote) {
                    $unbroken .= $char;
                }
                else {
                    $in_space = 1;
                }
            }
            else {
                $unbroken .= $char;
                $in_space = 0;
                next;
            }

            if ($in_space) {
                push @$got, $unbroken;
                $unbroken = '';
            }
        }

        push @$got, $unbroken;

        is_deeply($got, $exp, qq{<$raw_str>: } . join('|', @$exp));
    }

    Here's the output:

    $ ./pm_11137926_str_parse.pl
    1..13
    ok 1 - <This is simple.>: This|is|simple.
    ok 2 - < This is simple. >: This|is|simple.
    ok 3 - <This is "so very simple".>: This|is|"so very simple".
    ok 4 - <This "is so" very simple.>: This|"is so"|very|simple.
    ok 5 - <This 'isn\'t nice.'>: This|'isn\'t nice.'
    ok 6 - <This "isn\"t nice.">: This|"isn\"t nice."
    ok 7 - <This 'isn\'t nice.'>: This|'isn\'t nice.'
    ok 8 - <This "isn\"t nice.">: This|"isn\"t nice."
    ok 9 - <This 'isn\\'t nice.'>: This|'isn\\'t|nice.'
    ok 10 - <This "isn\\"t nice.">: This|"isn\\"t|nice."
    ok 11 - <This 'is not unnice.'>: This|'is not unnice.'
    ok 12 - <This "is not unnice.">: This|"is not unnice."
    ok 13 - <a "bb cc" d>: a|"bb cc"|d

    — Ken

      > Testing for the number of elements is a weak test; you really need qualitative tests as well.

      > I've included most of your tests;

      I think the best way to test this, is to create these strings from joining @expected arrays.

      By generating these arrays one can make sure to cover all edge cases.

      As a side product you'll define a formal grammar. Like:

      • how are unpaired quotes to be handled?
      • what about multiple whitespaces in a row?
      • what about multi-line input?
      • what about whitespace at start and end of string?
      It would also help testing sub-regexes individually.

      Crafting the strings by hand is error prone, because there are far too many cases to handle.
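
The idea of deriving the inputs from the expected arrays can be sketched as follows. Everything here is an assumption made for illustration: the `@atoms` list, the `make_cases` and `failures` helpers, and the choice of gaps are hypothetical, and a real suite would want more atoms (unbalanced quotes, leading and trailing whitespace, and so on). The last lines show that even this small set catches a naive whitespace split.

```perl
use strict;
use warnings;

# Hypothetical building blocks: each is one token that must survive intact.
my @atoms = (
    q{plain},       q{esc\"aped},
    q{"two words"}, q{'two words'},
    q{"a\"b c"},    q{'a\'b c'},
);

# Build [ input, expected ] pairs from every ordered pair of atoms,
# joined by one or by several spaces.
sub make_cases {
    my @cases;
    for my $first (@atoms) {
        for my $second (@atoms) {
            for my $gap ( ' ', '   ' ) {
                push @cases, [ join( $gap, $first, $second ), [ $first, $second ] ];
            }
        }
    }
    return @cases;
}

# Return the cases where $splitter->($input) differs from the expected tokens.
sub failures {
    my ($splitter) = @_;
    return grep {
        my ( $input, $expected ) = @$_;
        join( "\0", $splitter->($input) ) ne join( "\0", @$expected );
    } make_cases();
}

my @cases = make_cases();
my @bad   = failures( sub { split ' ', $_[0] } );    # naive whitespace split
printf "%d generated cases, naive split fails %d of them\n",
    scalar @cases, scalar @bad;
# 72 generated cases, naive split fails 64 of them
```

Because the expected list is the source of truth, each new rule in the informal grammar only needs a new atom or a new gap, not a hand-crafted string.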

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

        Why do I need qualitative tests? I just wanted to know if the RE's broke the line into the expected number of sections. The original test strings were read from a data file, which was several pared down representations of what one might find as attr-value fields after an initial xml or html element.

        How are unpaired quotes handled? That's really a bit undefined, but I thought terminating them at the end of the "string" would be most forgiving. For multi-whitespace -- I would assume shell semantics. Multi-line input -- in some larger, more general case, lf+cr are both types of white space, but I didn't want to clutter my question and test cases. As for whitespace prefixes and suffixes -- in both cases, there is no "non-whitespace" before or after (respectively) those, so they make no difference in the final answer.

        As I tried to stress, the program wasn't really important, it was just something I threw together over a few hours that grew by "whim", to test the regex's against the input lines in the test-data.txt file. It wasn't meant as a formal test harness.

Re: solution wanted for break-on-spaces (w/specifics)
by tybalt89 (Prior) on Oct 24, 2021 at 18:44 UTC
    #!/usr/bin/perl

    use strict; # https://perlmonks.org/?node_id=11137926
    use warnings;
    use Data::Dump 'dd';

    my @tests = (
      # q{all '- and "-quotes properly balanced},
      [ q{This is simple.},            [ q{This}, q{is}, q{simple.} ] ],
      [ q{ This is simple. },          [ q{This}, q{is}, q{simple.} ] ],
      [ q{This is "so very simple".},  [ q{This}, q{is}, q{"so very simple".} ] ],
      [ q{This "is so" very simple.},  [ q{This}, q{"is so"}, q{very}, q{simple.} ] ],
      [ q{This 'isn\'t nice.'},        [ q{This}, q{'isn\'t nice.'} ] ],
      [ q{This "isn\"t nice."},        [ q{This}, q{"isn\"t nice."} ] ],
      [ q{This 'isn\\\\'t nice.'},     [ q{This}, q{'isn\\\\'t}, q{nice.'} ] ],
      [ q{This "isn\\\\"t nice."},     [ q{This}, q{"isn\\\\"t}, q{nice."} ] ],
      [ q{This 'is not unnice.'},      [ q{This}, q{'is not unnice.'} ] ],
      [ q{This "is not unnice."},      [ q{This}, q{"is not unnice."} ] ],
      [ q{a "bb cc" d},                [ q{a}, q{"bb cc"}, q{d} ] ],
      # q{UNbalanced '- and "-quotes at absolute end of string},
      [ q{This is "so very simple},    [ q{This}, q{is}, q{"so very simple} ] ],
      [ q{This 'isn\'t nice.},         [ q{This}, q{'isn\'t nice.} ] ],
      [ q{This "isn\"t nice.},         [ q{This}, q{"isn\"t nice.} ] ],
      [ q{This 'isn\\\\'t nice.},      [ q{This}, q{'isn\\\\'t}, q{nice.} ] ],
      [ q{This "isn\\\\"t nice.},      [ q{This}, q{"isn\\\\"t}, q{nice.} ] ],
      [ q{This 'is not unnice.},       [ q{This}, q{'is not unnice.} ] ],
      [ q{This "is not unnice.},       [ q{This}, q{"is not unnice.} ] ],
      # 'what about these questionable cases?',
      [ q{is this"really so"simple now?}, [ q{is}, q{this"really so"simple}, q{now?} ] ],
      [ q{is this"really so" now?},       [ q{is}, q{this"really so"}, q{now?} ] ],
      [ q{is "really so"simple now?},     [ q{is}, q{"really so"simple}, q{now?} ] ],
      [ q{is this'really so'simple now?}, [ q{is}, q{this'really so'simple}, q{now?} ] ],
      [ q{is this'really so' now?},       [ q{is}, q{this'really so'}, q{now?} ] ],
      [ q{is 'really so'simple now?},     [ q{is}, q{'really so'simple}, q{now?} ] ],
      [ q{is really\\ so\\ simple now?},  [ q{is}, q{really\\ so\\ simple}, q{now?} ] ],
      );

    my $regex = qr/(?:
      '(?: \\. | [^'\\] )*'   # single quoted string
      |
      "(?: \\. | [^"\\] )*"   # double quoted string
      |
      ['"].*                  # unmatched quote
      |
      \\.                     # escaped character
      |
      \S                      # single non-space character
      )+/x;

    my $passcount = 0;
    for ( @tests )
      {
      my ( $string, $want ) = @$_;
      my @out = $string =~ /$regex/g;
      local $" = "\0"x5; # just some array element boundary separator
      "@$want" eq "@out" ? $passcount++ :
        dd "$string => FAILED got", \@out, ' wanted ', $want;
      }
    print "$passcount of @{[scalar @tests]} passed\n";

    Outputs:

    25 of 25 passed
        BTW, on the no-backtracking -- that was a later addition one of about 10-15 alterations in the statement I tried over time.
      Your regex was perfect. FWIW, I put it in my original prog (some bugs fixed in the prog), as the 2nd regex in the regex array. The reason I had them and the outputs in arrays was to compare several RE's. But I ended up with just the one as it passed the most cases. So lines for cases 3 and 4 (w/4+5 being the two that didn't pass in the regex I originally posted)
      ResByLn:{ln=>3, wanted=>4, got=>[4, 4]},[" p ", " p "]
      ResByLn:{ln=>4, wanted=>2, got=>[3, 2]},["FAIL:<4>", " p "]
      The gots were count I got from the regex's, with your RE being in the 2nd position. The last brackets contained the p/f for each regex against that statement. So yours were 'p' straight down the 2nd column. Thanks. I had spaces in the earlier revisions of the re's, but I wasn't sure I had the 'x' flag applied to the sub-re's that needed them.

      I guess each outer layer of the RE's flags get propagated to inner RE's.
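
That guess can be checked directly. In Perl a pattern compiled with qr// keeps the flags it was compiled with, and the flags of an enclosing pattern apply only to the text written in that enclosing pattern, not to interpolated qr// objects. A minimal sketch (the names `$inner` and `$outer` are only for this illustration):

```perl
use strict;
use warnings;

my $inner = qr/a b/;                # compiled WITHOUT /x: the space is literal
my $outer = qr/ \A $inner \z /x;    # /x covers only the text written here

# The stringified form shows that the inner pattern carries its own flag group.
print "inner stringifies as: $inner\n";

print "matches 'a b': ", ( "a b" =~ $outer ? "yes" : "no" ), "\n";    # yes
print "matches 'ab':  ", ( "ab"  =~ $outer ? "yes" : "no" ), "\n";    # no
```

So each sub-regex that relies on insignificant whitespace needs its own /x; patterns interpolated as plain strings, by contrast, do take on the flags of the pattern they are pasted into.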

      I'm not sure if you were asking a question about your third group above, where you wrote: " 'what about these questionable cases?',"? I'm not sure what is questionable about them. In my use case, neither 'q{}' nor '?' has special meaning. Only the quotes and backslash were meta chars. So in the first line, I see 3 fields in both of the first 2 cases:

      [ q{is this"really so"simple now?}, [ q{is}, q{this"really so"simple}, + q{now?} ] ], ^ ^ ^ +^
      Both expressions had 2 breaks -- yielding 3 parts in each. Does that make sense?

      One rule I forgot to list, though, at least your example handled it as expected, was what to do with overlapping quotes, and not making a quote of a different type have 'meta' properties. I.e.:

      this "is a' test" of weird' stuff
      I may be wrong but I don't think most here would split that into 3 parts, as most of us are used to meta-properties of characters being disabled or modified within quotes, so the single quote above wouldn't start a quoted sub-expression overlapping with double quoted part. That would effectively make "is a' test" of weird' all 1 "word" as all the spaces are between quotes of some type. While that would be "a" way of interpreting overlapping quoted sections, I don't know how expected or useful it would be. Need to study your example and some others, but wanted to make some response. Just that about 3-4 other things cropped up and need attention just after I posted this...
Re: solution wanted for break-on-spaces (w/specifics)
by LanX (Sage) on Oct 23, 2021 at 22:05 UTC
    Your regex is messy.

    Using the /x flag (see perlre), plus

    • linebreaks
    • space
    • comments

    would make things far more readable! ( Not only for you ...:)

    Consider also composing your regex from smaller parts thru interpolation of variables.

    Anyway, from what I can spot, you are treating " and ' very differently.

    Since your requirements are fuzzy, I don't dare guess what you really want.

    Cheers Rolf

Re: solution wanted for break-on-spaces (w/specifics)
by vr (Curate) on Oct 23, 2021 at 23:06 UTC
    use strict;
    use warnings;
    use feature 'say';

    # use Regexp::Common;
    # ^^^ Not used. I'm so lazy, I just peeked at $RE{quoted}
    # to construct the "$quoted" expression below, by slightly
    # modifying it (see "$") to satisfy the third clause.
    # And actually 2nd test case below is to test how it works,
    # it seems there's not a similar one among your 18.

    my $quoted = qr/
        (?:(?|
            (?:(?<!\\)\")(?:[^\\\"]*(?:\\.[^\\\"]*)*)(?:\"|$)|
            (?:(?<!\\)\')(?:[^\\\']*(?:\\.[^\\\']*)*)(?:\'|$)
        ))
    /x;

    my $re = qr/(?:$quoted|[^ ])+\K(?: |$)/;

    my @tests = (
        q(This 'isn\'t nice.'),
        q(This 'isn\'t nice.),
        q(This \"isnt unnice.\"),
    );

    for my $t ( @tests ) {
        say "[$_]" for split $re, $t;
    }

    __END__
    [This]
    ['isn\'t nice.']
    [This]
    ['isn\'t nice.]
    [This]
    [\"isnt]
    [unnice.\"]

    10 minutes update: aargh, added negative look-behind to cover your 14th case (and added my third). Maybe there are more to add. Further: it's more tricky, 6 (and 7) are split in 3, but wrong, groups. Will look into that later. False alarm? Will see yet later :)

    Next morning update. As LanX pointed out, negative look-behind for just a single backslash isn't enough. Then to save this answer (I like how the "keep" \K meta-character helps in regexp for split, it's kind of interesting), maybe it's easier to revert $quoted to as it was borrowed from $RE{quoted}, and tweak the $re:

    my $quoted = qr/
        (?:(?|
            (?:\")(?:[^\\\"]*(?:\\.[^\\\"]*)*)(?:\"|$)|
            (?:\')(?:[^\\\']*(?:\\.[^\\\']*)*)(?:\'|$)
        ))
    /x;

    my $re = qr/
        (?:
            (?:\\\\)+
            |
            (?:\\[^ ])
            |
            $quoted
            |
            [^ ]
        )+
        \K
        (?: \ | $ )
    /x;

    I hope it works now, my 1st attempt at this "update" was broken (see, but better not -- nothing interesting -- below. Sorry for the mess.). But further, it's unclear whether to split on escaped space, or several spaces in a row.

    And later (final(?)) update: Sigh... damn lack of practice. So this:

    my $quoted = qr/
        (?:(?|
            (?:\")(?:[^\\\"]*(?:\\.[^\\\"]*)*)(?:\"|$)
            |
            (?:\')(?:[^\\\']*(?:\\.[^\\\']*)*)(?:\'|$)
        ))
    /x;

    my $re = qr/
        (?:
            (?:\\.)+
            |
            $quoted
            |
            [^ \\"']+
        )*
        \K
        (?: \ | $ )+
    /x;

    # and later:

    my $got = [ split $re, $str ];

    passes all tests in LanX's later answer except #2 and is somewhat optimized.

    About test #2: consensus is "the brief is unclear", must split-like behaviour generate an empty leading field for #2? Expression to split on is definitely not missing nor space literal. If, nevertheless, it must not (as my solution does, failing #2), then my bad, but still, yeah, this regexp is "working" and can be used to literally split on. :)

      I'm not sure about this

      (?:(?<!\\)\")

      I read it as doublequote which is not preceded by backslash

      But what about an escaped backslash \\" or two \\\\" ... ?
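
The objection can be made concrete with a small experiment. This is a sketch under assumptions: `$lookbehind` isolates just the opening-quote test from the regex above, `$parity` is one hypothetical alternative that skips an even run of backslashes, and `quote_offset` is a helper invented for the demonstration.

```perl
use strict;
use warnings;

# Opening double quote "not preceded by a backslash" (the look-behind in question).
my $lookbehind = qr/(?<!\\)(")/;

# Hypothetical alternative: allow any even number of backslashes before the quote.
my $parity = qr/(?<!\\)(?:\\\\)*(")/;

# Offset of the quote character the pattern settles on, or -1 if none.
sub quote_offset {
    my ( $re, $str ) = @_;
    return $str =~ $re ? $-[1] : -1;
}

# Characters: say \\"hi"   (an escaped backslash, then a real opening quote)
my $text = q{say \\\\"hi"};

print "look-behind: ", quote_offset( $lookbehind, $text ), "\n";    # 9, the closing quote
print "parity:      ", quote_offset( $parity,     $text ), "\n";    # 6, the opening quote
```

The look-behind rejects the quote at offset 6 because the character before it is a backslash, even though that backslash is itself escaped.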

      I'd rather try something like (Untested pseudocode)

      s/^(?:$escaped|$quoted|\S)*\K\s+/\n/g

      and

      $escaped = qr/\\./;

      $quoted = qr/
          (['"])                 # start
          (?:
              $escaped | [^\1]
          )*                     # inside
          \1                     # end, probably \g-1 better
      /x;

      NB: I didn't cover the case of unclosed quotes, which is unclear anyway.

      Cheers Rolf

      update

      tested - fails - good night! :)

      update

      see Re: solution wanted for break-on-spaces (w/specifics) for "working" example

Re: solution wanted for break-on-spaces (w/specifics)
by LanX (Sage) on Oct 24, 2021 at 12:00 UTC
    in continuation to Re^2: solution wanted for break-on-spaces (w/specifics):

    Please note how readable and maintainable the regexes become now!

    This solves AnomalousMonk's test case here but is easily adaptable to various interpretations.

    (I disagree in the case of unbalanced quotes, I'd rather ignore them. For this to happen drop the $-branch commented with "EOL".)

    use v5.12;
    use warnings;
    use Test::More;

    my $escaped = qr/\\./;

    my $quoted = qr/
        (['"])              # --- start-quote
        (?:                 # --- inside
            $escaped        #     any escape-pair
        |
            .               #     anything else
        )*?                 #     non-greedy
        (?:                 # --- end
            \g{-1}          #     same quote
        |
            $               #     EOL ends missing pair
        )
    /x;

    my $re = qr/
        (?:
            $escaped        # any escape pair
        |
            $quoted         # any quoted string
        |
            \S              # any none whitespace
        )+                  # at least once
    /x;

    my $str = q{This "is so" very simple.};

    my @tests = (
      # q{all '- and "-quotes properly balanced},
      [ q{This is simple.},            [ q{This}, q{is}, q{simple.} ] ],
      [ q{ This is simple. },          [ q{This}, q{is}, q{simple.} ] ],
      [ q{This is "so very simple".},  [ q{This}, q{is}, q{"so very simple".} ] ],
      [ q{This "is so" very simple.},  [ q{This}, q{"is so"}, q{very}, q{simple.} ] ],
      [ q{This 'isn\'t nice.'},        [ q{This}, q{'isn\'t nice.'} ] ],
      [ q{This "isn\"t nice."},        [ q{This}, q{"isn\"t nice."} ] ],
      [ q{This 'isn\\\\'t nice.'},     [ q{This}, q{'isn\\\\'t}, q{nice.'} ] ],
      [ q{This "isn\\\\"t nice."},     [ q{This}, q{"isn\\\\"t}, q{nice."} ] ],
      [ q{This 'is not unnice.'},      [ q{This}, q{'is not unnice.'} ] ],
      [ q{This "is not unnice."},      [ q{This}, q{"is not unnice."} ] ],
      [ q{a "bb cc" d},                [ q{a}, q{"bb cc"}, q{d} ] ],
      # q{UNbalanced '- and "-quotes at absolute end of string},
      [ q{This is "so very simple},    [ q{This}, q{is}, q{"so very simple} ] ],
      [ q{This 'isn\'t nice.},         [ q{This}, q{'isn\'t nice.} ] ],
      [ q{This "isn\"t nice.},         [ q{This}, q{"isn\"t nice.} ] ],
      [ q{This 'isn\\\\'t nice.},      [ q{This}, q{'isn\\\\'t}, q{nice.} ] ],
      [ q{This "isn\\\\"t nice.},      [ q{This}, q{"isn\\\\"t}, q{nice.} ] ],
      [ q{This 'is not unnice.},       [ q{This}, q{'is not unnice.} ] ],
      [ q{This "is not unnice.},       [ q{This}, q{"is not unnice.} ] ],
      # 'what about these questionable cases?',
      [ q{is this"really so"simple now?}, [ q{is}, q{this"really so"simple}, q{now?} ] ],
      [ q{is this"really so" now?},       [ q{is}, q{this"really so"}, q{now?} ] ],
      [ q{is "really so"simple now?},     [ q{is}, q{"really so"simple}, q{now?} ] ],
      [ q{is this'really so'simple now?}, [ q{is}, q{this'really so'simple}, q{now?} ] ],
      [ q{is this'really so' now?},       [ q{is}, q{this'really so'}, q{now?} ] ],
      [ q{is 'really so'simple now?},     [ q{is}, q{'really so'simple}, q{now?} ] ],
      );

    plan tests => 0+@tests;

    for my $test (@tests) {
        my ($str, $exp) = @$test;
        my $got;

        push @$got, $& while ($str =~ /$re/g);

        is_deeply($got, $exp, qq{<$str>: } . join('|', @$exp));
    }

    -*- mode: compilation; default-directory: "d:/tmp/pm/" -*-
    Compilation started at Sun Oct 24 14:00:21

    C:/Strawberry/perl/bin\perl.exe -w d:/tmp/pm/break_not_quoted.pl
    1..24
    ok 1 - <This is simple.>: This|is|simple.
    ok 2 - < This is simple. >: This|is|simple.
    ok 3 - <This is "so very simple".>: This|is|"so very simple".
    ok 4 - <This "is so" very simple.>: This|"is so"|very|simple.
    ok 5 - <This 'isn\'t nice.'>: This|'isn\'t nice.'
    ok 6 - <This "isn\"t nice.">: This|"isn\"t nice."
    ok 7 - <This 'isn\\'t nice.'>: This|'isn\\'t|nice.'
    ok 8 - <This "isn\\"t nice.">: This|"isn\\"t|nice."
    ok 9 - <This 'is not unnice.'>: This|'is not unnice.'
    ok 10 - <This "is not unnice.">: This|"is not unnice."
    ok 11 - <a "bb cc" d>: a|"bb cc"|d
    ok 12 - <This is "so very simple>: This|is|"so very simple
    ok 13 - <This 'isn\'t nice.>: This|'isn\'t nice.
    ok 14 - <This "isn\"t nice.>: This|"isn\"t nice.
    ok 15 - <This 'isn\\'t nice.>: This|'isn\\'t|nice.
    ok 16 - <This "isn\\"t nice.>: This|"isn\\"t|nice.
    ok 17 - <This 'is not unnice.>: This|'is not unnice.
    ok 18 - <This "is not unnice.>: This|"is not unnice.
    ok 19 - <is this"really so"simple now?>: is|this"really so"simple|now?
    ok 20 - <is this"really so" now?>: is|this"really so"|now?
    ok 21 - <is "really so"simple now?>: is|"really so"simple|now?
    ok 22 - <is this'really so'simple now?>: is|this'really so'simple|now?
    ok 23 - <is this'really so' now?>: is|this'really so'|now?
    ok 24 - <is 'really so'simple now?>: is|'really so'simple|now?

    Compilation finished at Sun Oct 24 14:00:21

    Cheers Rolf
