http://www.perlmonks.org?node_id=11120124

Background

I found an error in the documentation of split (see "Function Split, bug or error in the documentation?").

I had trouble understanding the documentation of split, so I decided to write an emulation of split to understand it better.

I discovered that split does not behave like a normal subroutine (see "Split does not behave like a subroutine").

This is my attempt at an emulation of split. I hope it helps someone!

Module Fake::Split

The module implements split using the match operator m{}. It also includes utilities to debug the emulation.

use strict;
use warnings;
use 5.010;

package Fake::Split;
use Exporter 'import';
our @EXPORT = (
    qw( splitF splitF_explain splitF_match_pos splitF_pos splitF_pos2txt
      splitF_pos2list ),
    qw( splitF_info splitF_test splitF_debug splitF_case ),
);
our @EXPORT_OK = qw();    # symbols to export on request
use Data::Dump qw(dump dd ddx);

my $debug = !!0;
sub splitF_debug { $debug = shift && !!1; }
my $info_level = 0;
sub splitF_info { $info_level = +shift; }

sub info {
    return if $info_level <= 0;
    if   ( $info_level <= 1 || $debug ) { say STDERR @_ }
    else                                { say @_ }
}

# Create a list with offsets (pos) for the start (using @LAST_MATCH_START / @-)
# and end (@LAST_MATCH_END / @+) of matches.
# The list contains: FIELD, [ SEPARATOR ], FIELD, [ SEPARATOR ], ...
# FIELD     = field_start, field_end
# SEPARATOR = separator_start, separator_end, CAPTURE_1, CAPTURE_2, ...
# CAPTURE_n = capture_n_start, capture_n_end
# Can not handle /PATTERN/ as first parameter.
# Only compiled qr/PATTERN/ and strings can be used.
sub splitF_match_pos {
    my ( $pat, $str_ref ) = @_;
    my @res;
    my $pos_last = 0;
    my $pat_re   = qr{$pat};

    # while ( my $rv = $$str_ref =~ m{$pat_re}gc ) {   # does not work
    my $str = $$str_ref;
    while ( my $rv = $str =~ m{$pat_re}gc ) {
        push @res, $pos_last, $-[0];    # the field
        my @sep = ( $-[0], $+[0] );     # the separator
        for my $ix ( 1 .. $#+ ) {       # captures in the separator
            if ( defined $-[$ix] ) {
                push @sep, $-[$ix], $+[$ix];
            }
            else {
                push @sep, -$-[0], -$+[0];    # or undef, undef
            }
        }
        push @res, [@sep];
        $pos_last = $+[0];
    }
    push @res, $pos_last, length($str);    # rest of string
    return \@res;
}

sub splitF_case {    # /PATTERN/ with special treatment in split
    my $pat    = shift;
    my $pat_re = qr{$pat};
    my %case;
    $case{$_} = !!0 for qw( pat_line_begin AWK_emul zero_width);
    if ( $pat eq '^' || $pat eq qr{^} ) {

        # split() silently optimizes /^/ to mean /^/m
        # ^ Matches at the beginning of the string (or line, if /m is used)
        info('PATTERN is /^/');
        $case{pat_line_begin} = !!1;
    }
    elsif ( $pat eq ' ' ) {    # emulation of AWK

        # / / and qr/ / is NOT emulation of AWK
        $case{AWK_emul} = !!1;
        info("PATTERN is ' ' BUT qr{\\s+} is used(AWK_emul)");
    }
    elsif ( '' =~ m{$pat_re} ) {
        info("PATTERN matches between chars");
        $case{zero_width} = !!1;
    }
    return \%case;
}

# Uses the generic splitF_match_pos. This routine handles split specific things.
sub splitF_pos {
    my ( $pat, $str_ref ) = @_;
    my %case = %{ splitF_case($pat) };
    if ( $case{AWK_emul} ) {

        # any contiguous whitespace (not just a single space character)
        # is used as a separator;
        $pat = qr{\s+};
    }
    if ( $case{pat_line_begin} ) {

        # split() silently optimizes /^/ to mean /^/m
        $pat = qr{^}m;
    }
    return \%case, [undef] if !defined $$str_ref;
    return \%case, splitF_match_pos( $pat, $str_ref );
}

sub splitF_case2txt {
    my $case_href = shift;
    my %case      = %$case_href;
    my $case_txt  = '';
    $case_txt .= $_ for map { $case{$_} ? "$_ " : () } sort keys %case;
    return $case_txt;
}

# Create a textual representation of the output from splitF_pos()
sub splitF_pos2txt {
    my $str_ref   = shift;
    my $case_href = shift;    # return value from splitF_pos()
    my $pos_aref  = shift;    # return value from splitF_pos()
    my $ix_last   = @$pos_aref;
    if ( defined $pos_aref && @$pos_aref == 1 ) {
        return ['ERROR $str is undefined'], splitF_case2txt($case_href);
    }
    my @res;

    my sub substring_1 {
        my $start = shift;
        my $end   = shift;
        return substr( $$str_ref, $start, $end - $start );
    }

    my sub separator_1 {
        my $pos_aref = shift;
        my $ix_last  = @$pos_aref;
        my @sep;
        my $ix = 0;
        while ( $ix < $ix_last ) {
            my ( $start, $end ) =
              ( $pos_aref->[$ix], $pos_aref->[ $ix + 1 ] );

            # negative pos indicates unmatched capture group
            push @sep, '<',
              $start < 0 ? 'undef' : substring_1( $start, $end ), '>';
            $ix += 2;
        }
        return 'S' . join '', @sep;
    }
    my $ix = 0;
    while ( $ix < $ix_last ) {
        my ( $field_start, $field_end, $sep_aref ) =
          @$pos_aref[ $ix .. $ix + 3 ];
        $ix += 3;
        push @res, 'F<' . substring_1( $field_start, $field_end ) . '>';
        last if ( not defined $sep_aref );
        push @res, separator_1($sep_aref);
    }
    return \@res, splitF_case2txt($case_href);
}

# The key part of the emulation of split.
# Create a list of strings from the output of splitF_pos().
# The created list of strings is ( should be ) the same as that from split
sub splitF_pos2list {
    my $par_nof   = @_;
    my $str_ref   = shift;
    my $case_href = shift;    # return value from splitF_pos()
    my $pos_aref  = shift;    # return value from splitF_pos()
    my $limit     = shift;
    if ( !( defined $str_ref && !$$str_ref eq '' ) ) {
        return ();
    }
    my %case = %$case_href;
    $case{$_} = !!0
      for qw( zero_width capturing lim_neg lim_omitted lim_positive );
    if ( $par_nof < 3 || $par_nof > 4 ) {
        die 'too few or too many parameters';
    }
    elsif ( $par_nof == 3 ) {
        $case{lim_omitted} = !!1;    # or $limit == 0
    }
    elsif ( $par_nof == 4 ) {
        if ( $limit < 0 ) {
            $case{lim_neg}     = !!1;
            $case{lim_omitted} = !!0;
        }
        elsif ( $limit > 0 ) {
            $case{lim_positive} = !!1;
            $case{lim_omitted}  = !!0;
        }
        else {
            $case{lim_omitted} = !!1;    # or $limit == 0
        }
    }

    my sub substring {
        my $start = shift;
        my $end   = shift;
        return substr( $$str_ref, $start, $end - $start );
    }
    my @res;

    my sub separator {
        my $pos_aref = shift;
        my $ix_last  = @$pos_aref - 1;
        my $ix       = 2;    # skip the match of the separator
        while ( $ix < $ix_last ) {
            my ( $start, $end ) =
              ( $pos_aref->[$ix], $pos_aref->[ $ix + 1 ] );

            # negative pos indicates unmatched capture group
            push @res, $start < 0 ? undef : substring( $start, $end );
            $ix += 2;
        }
    }
    my $ix         = 0;
    my $ix_last    = @$pos_aref;
    my $str_length = length $$str_ref;
    my $field_nof  = 0;
    while ( $ix < $ix_last ) {
        my ( $field_start, $field_end, $sep_aref ) =
          @$pos_aref[ $ix .. $ix + 3 ];
        my $last_field = $ix >= $ix_last - 2;
        if ( $ix == 0 ) {    # first field + separator
            if ($last_field) {
                info('Only one field');
                push @res, substring( $field_start, $str_length );
                last;
            }
            $case{zero_width} = !!1 if $sep_aref->[1] == 0;
            $case{capturing}  = @$sep_aref >= 3;
            if ( $case{zero_width} ) {

                # "a zero-width match at the beginning never produces
                # an empty field"
                info('SKIP first field, zero width separator');
                next;
            }
            if (
                   $field_start == 0
                && $field_end == 0
                && ( ( !$case{lim_omitted} && !$case{lim_positive} )
                    || $case{AWK_emul} )
              )
            {
                info('SKIP first field+separator, both are empty');
                next;
            }
        }
        if (   $field_start == $field_end
            && $sep_aref
            && $field_start == $sep_aref->[0]
            && $field_start == $sep_aref->[1] )
        {
            info('SKIP intermediate field+separator, both are empty');
            next;
        }
        $field_nof++;
        if ( $case{lim_positive} && $field_nof >= $limit ) {
            info("Reached nof field limit $limit");
            push @res, substring( $field_start, $str_length );
            last;
        }
        if (   $last_field
            && $field_start == $field_end
            && $case{lim_positive}
            && $case{lim_neg} )
        {
            info('SKIP the last field');
            last;
        }

        # add field to result
        push @res, substring( $field_start, $field_end );

        # add separator to result
        next if ( !defined $sep_aref || !$case{capturing} );
        separator($sep_aref);
    }
    continue { $ix += 3; }    # $ix for next field
    my @removed;
    if ( $case{lim_omitted} ) {
        push @removed, pop @res
          while ( @res && ( !defined $res[-1] || $res[-1] eq '' ) );
    }
    info( 'REMOVE from end: ', dump @removed ) if @removed;
    return @res;
}

sub split_core {    # facade to core split
    info "\nsplit_core", dump @_;
    my $par_nof = @_;
    if ( $par_nof == 0 ) {
        warn 'ERROR';
    }
    elsif ( $par_nof == 1 ) {
        return split $_[0];
    }
    elsif ( $par_nof == 2 ) {
        return split $_[0], $_[1];
    }
    elsif ( $par_nof == 3 ) {
        return split $_[0], $_[1], $_[2];
    }
    else { warn 'ERROR' }
}

sub splitF_test($;$$) {    # test emulation of split
    info( "\nsplitF_test", dump @_ );
    my @rv_F = splitF(@_);
    my @rv_c = split_core(@_);
    info( 'split_c: ', dump @rv_c );
    return [@rv_F], [@rv_c], dump @_;
}

sub splitF_explain($;$$) {    # explains the output from splitF_pos()
    info( "\nsplitF_explain", dump @_ );
    my ( $case_href, $pos_aref ) = splitF_pos( $_[0], \$_[1] );
    return splitF_pos2txt( \$_[1], $case_href, $pos_aref );
}

# Emulation of split
# in /PATTERN/, the first parameter, the match operator /STRING/ must be
# replaced with qr/STRING/, the compiled regular expression.
sub splitF($;$$) {
    info( "\nsplitF", dump @_ );
    my ( $case_href, $pos_aref ) = splitF_pos( $_[0], \$_[1] );
    my @rv =
      splitF_pos2list( \$_[1], $case_href, $pos_aref,
        @_ > 2 ? $_[2] : () );
    info( 'splitF: ', dump @rv );
    return @rv;
}
!!1;
Used terms:
Split divides a string into fields and separators. The /PATTERN/ matches the separators. The parts before, between and after the separators are the fields.
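To make these terms concrete, here is a small illustration with core split. The sample string is my own and not taken from the tests. Without a capture group only the fields come back; with a capture group the captured separator text is returned between the fields.

```perl
use strict;
use warnings;

my $str = 'a:b:c';

# Fields only: the separators matched by the pattern are discarded.
my @fields = split /:/, $str;

# A capture group in the pattern makes split return the captured
# separator text between the fields.
my @fields_and_seps = split /(:)/, $str;

print join( '|', @fields ),          "\n";    # a|b|c
print join( '|', @fields_and_seps ), "\n";    # a|:|b|:|c
```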
sub splitF_match_pos

splitF_match_pos returns a reference to a list of string offsets (pos) marking where the matches start and end. The list consists of groups: the start and end offset of a field, followed by a reference to an array. That array holds the start and end offset of the separator and, for each capture group in the pattern, the start and end offset of that capture.
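The layout of that list may be easier to see in a reduced form. The following is only a sketch: field_and_sep_pos is a hypothetical helper, a simplified stand-in for splitF_match_pos that records field and separator offsets but ignores capture groups.

```perl
use strict;
use warnings;

# Hypothetical, simplified stand-in for splitF_match_pos:
# records field and separator offsets only, no capture groups.
sub field_and_sep_pos {
    my ( $re, $str ) = @_;    # $str is a copy, the caller's pos() is untouched
    my @res;
    my $pos_last = 0;
    while ( $str =~ /$re/g ) {
        push @res, $pos_last, $-[0];    # field: from end of previous match to this match
        push @res, [ $-[0], $+[0] ];    # separator: start and end of this match
        $pos_last = $+[0];
    }
    push @res, $pos_last, length $str;    # the rest of the string is the last field
    return \@res;
}

my $pos = field_and_sep_pos( qr/:/, 'a:b:c' );

# $pos is [ 0, 1, [ 1, 2 ], 2, 3, [ 3, 4 ], 4, 5 ]
```

Each field is two plain numbers and each separator is a nested array, which is why the consumers in the module step through the list three elements at a time.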

sub splitF_case

Identifies the patterns that need special treatment in split.
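For reference, this is how core split itself treats those special patterns. The sample strings are my own; the behaviour shown is the documented behaviour of split.

```perl
use strict;
use warnings;

# A single-space string emulates awk: leading whitespace is skipped and
# any run of whitespace separates fields.
my @awk = split ' ', "  a b\tc ";

# The regex / / matches exactly one space, so a leading space produces
# a leading empty field.
my @one_space = split / /, ' a b';

# /^/ is treated as /^/m, so the string is split at the start of each line.
my @lines = split /^/, "a\nb\nc";

# A pattern that matches the empty string splits between characters.
my @chars = split //, 'abc';

print scalar(@awk), ' ', scalar(@one_space), ' ',
  scalar(@lines), ' ', scalar(@chars), "\n";    # 3 3 3 3
```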

sub splitF_pos

This routine handles split specific things. It uses the more generic splitF_match_pos.

sub splitF_pos2txt

The output is a textual presentation of the output from sub splitF_pos.

sub splitF_pos2list

Creates a list of strings from the output of sub splitF_pos. The output is (should be) the same as that from split.

sub splitF_test($;$$)

Can be used to test the split emulation. See below!

sub splitF($;$$)

This is the emulation of split.

If PATTERN, the first argument to split, is a match operator /STRING/ it must be replaced with qr/STRING/, a compiled regular expression.

The forms split /PATTERN/ (with EXPR omitted) and split (with no arguments) are not supported.

My test of the module

I have based my tests on the file t/op/split.t in the Perl source code distribution and on the examples in split.

splitF_test_test.pl

This script uses most of the /PATTERN/,EXPR,LIMIT combinations used in split.t. I had to change all /STRING/ to qr/STRING/. (I have not found any way to emulate how split delays the evaluation of its first argument.)

use strict;
use warnings;
use 5.010;
use Test::More;
use lib 'lib';
use Fake::Split;
splitF_info( 1 );
plan tests => 73;
is_deeply splitF_test(" ", "a b c");
is_deeply splitF_test("^", "a\nb\nc");
is_deeply splitF_test(qr/:/, undef);
is_deeply splitF_test(":", "a:b:c");
is_deeply splitF_test(qr/:b:/, "a:b:c");
is_deeply splitF_test(qr//, "abc\n");
is_deeply splitF_test(qr/:/, "a:b:c::::");
is_deeply splitF_test(" ", " a b\tc \t d ");
is_deeply splitF_test(qr/ */, "foo bar bie\tdoll");
is_deeply splitF_test(qr/ /, "a b c");
is_deeply splitF_test(" ", "1 2 3 4 5 6", 3);
is_deeply splitF_test(" ", "1 2 3 4 5 6", 4);
is_deeply splitF_test(qr/:/, "1:2:3:4:5:6:::", 999);
is_deeply splitF_test(" ", "1 2 3 4 5 6", 2);
is_deeply splitF_test(qr/,|(-)/, "1-10,20,,,");
is_deeply splitF_test(qr/,|(-)/, "1-10,20,,,", 10);
is_deeply splitF_test(qr/x/, "", -1);
is_deeply splitF_test(qr/x/, "", 1);
is_deeply splitF_test(qr/(p+)/, "", -1);
is_deeply splitF_test(qr/.?/, "", -1);
is_deeply splitF_test(qr/^a/m, "a b a\na d a", 20);
is_deeply splitF_test(qr/a$/m, "a b a\na d a", 20);
is_deeply splitF_test(qr/^aa/m, "aa b aa\naa d aa", 20);
is_deeply splitF_test(qr/aa$/m, "aa b aa\naa d aa", 20);
is_deeply splitF_test(qr/\s*:\s*/, "a : b :c: d");
is_deeply splitF_test(1, "p1q1r1s");
is_deeply splitF_test(qr/^/, "ab\ncd\nef\n");
is_deeply splitF_test(qr/\A/, "ab\ncd\nef\n");
is_deeply splitF_test(qr/(?=\w)/, "rm b");
is_deeply splitF_test(qr//, v1.20.300.4000.50000.4000.300.20.1);
is_deeply splitF_test(qr/\x{FE}/, "\xFF\xFE\xFD");
is_deeply splitF_test(qr/(\x{FE}\xFE)/, "\xFF\xFF\xFE\xFE\xFD\xFD");
is_deeply splitF_test(qr//, "\x{4D2}{\x{929}");
is_deeply splitF_test(qr/A/, "\x{4D2}A\x{929}");
is_deeply splitF_test(qr//, "\x{B36C}\x{5A8C}\x{FF5B}\x{5079}\x{505B}");
is_deeply splitF_test(qr/\x40/, " \@\x80\x{100}\x80\@ ");
is_deeply splitF_test(qr/(?^u:\x{100})/, " \@\x80\x{100}\x80\@ ");
is_deeply splitF_test(qr/(?^u:\x{80}\x{100}\x{80})/, " \@\x80\x{100}\x80\@ ");
is_deeply splitF_test(qr/\x40\x{80}/, " \@\x80\x{100}\x80\@ ");
is_deeply splitF_test(qr/[\x40\x{80}]+/, " \@\x80\x{100}\x80\@ ");
is_deeply splitF_test(qr//, "ABC\x{263A}");
is_deeply splitF_test(qr/\xFE/, "\xFF\xFE\xFD");
is_deeply splitF_test(qr/\s+/, "hello cruel world");
is_deeply splitF_test(qr/ll/, "hello cruel world");
is_deeply splitF_test(qr/(A)|B/, "1B2");
is_deeply splitF_test(qr/\r?\n/, "\x{10F1FF}\n");
is_deeply splitF_test(qr/[,]/, "readin,database,readout");
is_deeply splitF_test(qr/[, ]+/, "a,b");
is_deeply splitF_test(qr/(?^u:ä)/, "a\xE4b");
is_deeply splitF_test(qr/(?^u:ä)/, "axb");
is_deeply splitF_test(qr/,/, "");
is_deeply splitF_test(qr/,/, ",,,,,");
is_deeply splitF_test(" \0 ", "ABC \0 FOO \0 XYZ");
is_deeply splitF_test(qr/ \0 /, "ABC \0 FOO \0 XYZ");
is_deeply splitF_test(1, "", {});
is_deeply splitF_test(qr/::/, "Font::GlyphNames");
is_deeply splitF_test(" ", "foo bar");
is_deeply splitF_test(qr/ /, "foo bar");
is_deeply splitF_test(qr/\s/, " a b c ");
is_deeply splitF_test(qr/ /, " a b c ");
is_deeply splitF_test(" ", " a b c ");
is_deeply splitF_test(" ", " a \tb c ");
is_deeply splitF_test(" ", " foo ");
is_deeply splitF_test(qr/ /, " foo ");
is_deeply splitF_test(qr//, undef, 0);
is_deeply splitF_test(qr//, "foobarbaz");
is_deeply splitF_test(qr//, "abc");
#is_deeply splitF_test(qr/-(?{ $c++ })/, "a-b-c");
is_deeply splitF_test(qr/:/, "a:b:c");
is_deeply splitF_test(qr/:/, "a:b:c:d:e");
is_deeply splitF_test(qr/-/, "-");
is_deeply splitF_test(" ", "");
is_deeply splitF_test("", "ab");
is_deeply splitF_test(";", "a;b");

In the line is_deeply splitF_test(" ", "a b c"); the call to splitF_test returns a list that is suitable input for is_deeply.

The output from splitF_test consists of two anonymous arrays, one with the output from the emulation and one with the output from split, plus a string with a dump of the arguments /PATTERN/,EXPR,LIMIT.

My observations and questions

The use of the Perl variable $#+

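When walking the capture groups with @- and @+, it is important to use $#+ as the upper index. $#+ is the number of capture groups in the pattern of the last successful match, whereas $#- is only the index of the last group that actually matched. See my sub splitF_match_pos above.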
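A small example of why the distinction matters, using a pattern of my own in which the second capture group does not take part in the match:

```perl
use strict;
use warnings;

my ( $groups_in_pattern, $last_matched_group, @spans );
if ( 'a' =~ /(a)|(b)/ ) {
    $groups_in_pattern  = $#+;    # 2: both groups of the pattern are counted
    $last_matched_group = $#-;    # 1: group 2 did not take part in the match
    for my $ix ( 1 .. $#+ ) {
        push @spans,
          defined $-[$ix] ? join( '-', $-[$ix], $+[$ix] ) : 'undef';
    }
}

# @spans is ( '0-1', 'undef' )
print "$groups_in_pattern $last_matched_group @spans\n";
```

Looping to $#- instead would silently drop the trailing unmatched group, and split does return undef for such a group.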

Inconsistency between m{} and split

The need for this in sub splitF_pos2list

if (   $field_start == $field_end
    && $sep_aref
    && $field_start == $sep_aref->[0]
    && $field_start == $sep_aref->[1] )
{
    info('SKIP intermediate field+separator, both are empty');
    next;
}
indicates an undocumented inconsistency!?

Arguments used together with Regexp Quote-Like Operators

I have several times found limitations on what can be an argument to Regexp Quote-Like Operators.

One example is

# while ( my $rv = $$str_ref =~ m{$pat_re}gc ) {   does not work
# but this works:
my $str = $$str_ref;
while ( my $rv = $str =~ m{$pat_re}gc ) {

Are those limitations documented anywhere?

A split function which behaves like a perl subroutine?

The current split is a list operator with a lot of surprises and special cases.

What about a parallel alternative, less heavily optimized, that behaves like a normal subroutine? Perhaps a string study function that takes a pattern and a reference to a string and returns a list of positions (without splitting the string into substrings). An optional parameter could be used to select the special cases.

A string study iterator would also be useful.
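To show roughly what I have in mind, here is a minimal sketch of such an iterator. The name match_iter and its interface are hypothetical; nothing like it exists in core Perl. It returns a closure that yields the start and end offset of the next match and an empty list when there are no more matches.

```perl
use strict;
use warnings;

# Hypothetical iterator: each call returns ( start, end ) of the next
# match, or an empty list once the matches are exhausted.
sub match_iter {
    my ( $re, $str ) = @_;    # works on a private copy of the string
    my $done = 0;
    return sub {
        return if $done;
        if ( $str =~ /$re/g ) {
            return ( $-[0], $+[0] );
        }
        $done = 1;            # a failed //g match resets pos, so remember we are finished
        return;
    };
}

my $next = match_iter( qr/:/, 'a:b:c' );
my @seen;
while ( my ( $start, $end ) = $next->() ) {
    push @seen, "$start-$end";
}
print "@seen\n";              # 1-2 3-4
```

Because the closure owns its copy of the string, the pos of the caller's scalar is neither read nor changed.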

Re: Split fake, an emulation of split
by vr (Curate) on Aug 01, 2020 at 19:28 UTC

    To address one question:

    ...limitations on what can be an argument to Regexp Quote Like Operators

    # while ( my $rv = $$str_ref =~ m{$pat_re}gc ) { does not work

    But it does. All tests in the suite pass with this line restored. You probably used the same scalar variable in succession with other tests (not shown), and the "c" modifier (useless here) left that scalar's pos set after the very first splitF call, so further calls produced bogus results. So the bad (or good?) news is that there is no undocumented limitation. The really good news, I think, is that if you had omitted "c" but otherwise written the line as above, all tests (shown or not) would pass. Even so, such an implementation would lead to subtle and evil bugs: the pos of the referent (the splitF argument, as opposed to a disposable lexical copy) would be (a) reset and, maybe worse, (b) taken into account if it had been set before the splitF call.
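A minimal sketch of the effect described above. count_gc and count_g are hypothetical helpers written for this illustration; they are not part of Fake::Split.

```perl
use strict;
use warnings;

sub count_gc {    # hypothetical: counts matches with /gc on the caller's scalar
    my ( $re, $str_ref ) = @_;
    my $n = 0;
    $n++ while $$str_ref =~ /$re/gc;
    return $n;
}

sub count_g {     # hypothetical: the same loop without /c
    my ( $re, $str_ref ) = @_;
    my $n = 0;
    $n++ while $$str_ref =~ /$re/g;
    return $n;
}

my $s         = 'a:b:c';
my $first_gc  = count_gc( qr/:/, \$s );    # 2
my $pos_after = pos $s;                    # 4: /c keeps pos after the failed match
my $second_gc = count_gc( qr/:/, \$s );    # 0: the search resumes at offset 4

my $t        = 'a:b:c';
my $first_g  = count_g( qr/:/, \$t );      # 2
my $second_g = count_g( qr/:/, \$t );      # 2: the failed match reset pos

print "$first_gc $second_gc $first_g $second_g\n";    # 2 0 2 2
```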

      Thanks vr for the answer!

      Your answer has made me understand why I have several times "found" limitations on what can be an argument to Regexp Quote-Like Operators.

      The (useless here) "c" modifier, in combination with other errors, has fooled me. I also wrongly thought that the string used in $str =~ m{$pat} could not change during the match.

      The error message "Use of uninitialized value in pattern match (m//)" made me believe that the input was the problem, not the value after one or more matches.

      use strict;
      use warnings;
      use 5.010;
      use Data::Dump qw(dump dd ddx);
      use Test::More;

      sub match {    # $pat, $str_ref
          warn "\nmatch", dump @_;
          my $pat     = $_[0];
          my $str_ref = $_[1];
          my @res;
          say STDERR '$str_ref before loop: ', dump $str_ref;
          while ( my $rv = $$str_ref =~ m{$pat}g ) {
              say STDERR 'in loop: ', dump $str_ref, $rv, @{^CAPTURE};
              push @res, @{^CAPTURE};
          }
          say STDERR 'match: ', ( dump @res ), "\n";
          return @res;
      }

      sub spliT {    # $pat, $str_ref
          warn "\nspliT", dump @_;
          my @rv = split $_[0], ${ $_[1] };
          say STDERR 'split: ', ( dump @rv ), "\n";
          return @rv;
      }

      $_ = 'ERROR?';
      /(.+)/;    # set $1
      my $match_OK = [ ':', ':' ];
      my $split_OK = [ "a", ":", "b", ":", "c" ];

      is_deeply [ match( qr{(:)}, \'a:b:c' ) ], [ ':', ':' ], q{\'a:b:c'};

      my $dollar1_ref;
      {
          my $str = 'a:b:c';
          $_ = $str;
          /(.+)/;
          $dollar1_ref = \$1;
          warn "\n\$dollar1_ref: ", dump $dollar1_ref;
          is_deeply [ match( qr{(:)}, $dollar1_ref ) ], $match_OK,
            q{$dollar1_ref};
          warn "NOK because \$\$dollar1_ref is changed by m{}\n";
          warn "\n\$dollar1_ref: ", dump $dollar1_ref;
          is_deeply [ spliT( qr{(:)}, $dollar1_ref ) ], $split_OK,
            q{spliT( qr{(:)}, $dollar1_ref)};
          warn "\n\$dollar1_ref: ", dump $dollar1_ref;
          warn "\n ===== leave scope";
      }
      warn "\n\$dollar1_ref: ", dump $dollar1_ref;    # $dollar1_ref = \undef

      # is_deeply [ match( qr{(:)}, $dollar1_ref ) ], $match_OK, q{$dollar1_ref};
      # is_deeply [ spliT( qr{(:)}, $dollar1_ref ) ], $split_OK,
      #   q{spliT( qr{(:)}, $dollar1_ref};
      done_testing();

      __DATA__
      output:

      match(qr/(:)/, \"a:b:c") at pm_5.pl line 17.
      $str_ref before loop: \"a:b:c"
      in loop: (\"a:b:c", 1, ":")
      in loop: (\"a:b:c", 1, ":")
      match: (":", ":")
      ok 1 - \'a:b:c'

      $dollar1_ref: \"a:b:c" at pm_5.pl line 51.

      match(qr/(:)/, \"a:b:c") at pm_5.pl line 17.
      $str_ref before loop: \"a:b:c"
      in loop: (\":", 1, ":")
      match: ":"
      not ok 2 - $dollar1_ref
      # Failed test '$dollar1_ref'
      # at pm_5.pl line 53.
      # Structures begin differing at:
      # $got->[1] = Does not exist
      # $expected->[1] = ':'
      NOK because $$dollar1_ref is changed by m{}

      $dollar1_ref: \"a:b:c" at pm_5.pl line 56.

      spliT(qr/(:)/, \"a:b:c") at pm_5.pl line 31.
      split: ("a", ":", "b", ":", "c")
      ok 3 - spliT( qr{(:)}, $dollar1_ref)

      $dollar1_ref: \"a:b:c" at pm_5.pl line 60.

      ===== leave scope at pm_5.pl line 61.

      $dollar1_ref: \"ERROR?" at pm_5.pl line 64.
      1..3
      # Looks like you failed 1 test of 3.

      I am a little bit surprised that split could handle split $_[0], ${ $_[1] };

      Thank you very much!!

Re: Split fake, an emulation of split
by ikegami (Patriarch) on Jul 31, 2020 at 14:29 UTC

    I found an error in the documentation of split

    No you didn't. As documented, trailing empty fields are only removed when the third operand is zero (the default).
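A short illustration of the rule stated above, with a sample string of my own: trailing empty fields are dropped only when LIMIT is omitted or zero.

```perl
use strict;
use warnings;

my $str = 'a,b,,,';

my @default  = split /,/, $str;        # LIMIT omitted: trailing empty fields removed
my @zero     = split /,/, $str, 0;     # LIMIT 0: same as omitted
my @negative = split /,/, $str, -1;    # negative LIMIT: trailing empty fields kept
my @large    = split /,/, $str, 10;    # large positive LIMIT: trailing empty fields kept

print scalar(@default), ' ', scalar(@zero), ' ',
  scalar(@negative), ' ', scalar(@large), "\n";    # 2 2 5 5
```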

      use strict;
      use warnings;
      use 5.010;

      # From https://perldoc.perl.org/5.30.1/functions/split.html:
      # and each of these:
      print join(':', split(//, 'abc', 3)), "\n";
      print join(':', split(//, 'abc', 4)), "\n";
      # produces the output a:b:c

      __DATA__
      Output:
      a:b:c
      a:b:c:

        Ah yes. You weren't very clear, so I misunderstood what you meant. That is indeed a documentation bug. Did you report it?

      If something behaves differently from its documentation, then it is either an implementation or a documentation error. The discrepancy between actual behaviour and documentation is a fact, as the example is taken straight from split.

      Greetings,
      -jo

      $gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$