Split(), Initial Spaces, & a limit?

cmv has asked for the wisdom of the Perl Monks concerning the following question:

Monks-

I'm having difficulty figuring out how to get split to do its job with data that has initial spaces, when I need to set a limit.

Consider the following:

use strict;
use warnings;

use Data::Dumper;

my @data = ( "  56 1752.eps", "  56 2613.eps", "  56 3469.eps",
"   8 INPUT000", "  16 INPUT001", "  16 INPUT002", "  96 MTA.1.ps",
"  96 MTA.6.ps", "  80 MTA.7.ps", "  32 head.eps", "   8 labs",
"   0 lib", "   8 mkexe.bat", " 112 out", "   0 screenshots",
"8720 trace.exe", "  16 trace.pl", "   8 tracehosts",
"1160 trace.041409.exe", "1160 trace.orig.exe",
);

foreach (@data) {
    print STDERR Dumper(split), "\n";
}

OUTPUT SAMPLE:
$VAR1 = '56';
$VAR2 = '1752.eps';

$VAR1 = '56';
$VAR2 = '2613.eps';

$VAR1 = '56';
$VAR2 = '3469.eps';

This works wonderfully (as stated in the Camel book): the initial spaces in the data are ignored, and each call to split returns a list with two elements!

Now I would like to limit the number of fields that split produces. Notice the changes in the data below; I still want each split to return a list with two elements:

use strict;
use warnings;

use Data::Dumper;

my @data = ( "  56 1752.eps a b", "  56 2613.eps", "  56 3469.eps",
"   8 INPUT000 a b", "  16 INPUT001", "  16 INPUT002", "  96 MTA.ps",
"  96 MTA.6.ps a b", "  80 MTA.7.ps", "  32 head.eps", "   8 labs",
"   0 lib a b", "   8 mkexe.bat", " 112 out", "   0 screenshots",
"8720 trace.exe a b", "  16 trace.pl", "   8 tracehosts",
"1160 trace.041409.exe a b", "1160 trace.orig.exe",
);

foreach (@data) {
    print STDERR Dumper(split /\s+/, $_, 2), "\n";
}

OUTPUT SAMPLE:
$VAR1 = '';
$VAR2 = '56 1752.eps a b';

$VAR1 = '';
$VAR2 = '56 2613.eps';

$VAR1 = '';
$VAR2 = '56 3469.eps';

Well, split is returning a list with two elements in every case, but in the case of the lines with initial spaces, it returns a null for the first element.

How can I get the second case to ignore leading spaces like the first case did?

Many thanks...

-Craig

Re: Split(), Initial Spaces, & a limit? by ikegami (Patriarch) on Jul 20, 2010 at 16:03 UTC
As documented, the default for the first argument of `split` is `' '`. ("As a special case, specifying a PATTERN of space (`' '`) will split on white space just as split with no arguments does.")

$ perl -MData::Dumper -e'
   $_=" 56 1752.eps a b";
   print Dumper split;
'
$VAR1 = '56';
$VAR2 = '1752.eps';
$VAR3 = 'a';
$VAR4 = 'b';

$ perl -MData::Dumper -e'
   $_=" 56 1752.eps a b";
   print Dumper split " ";
'
$VAR1 = '56';
$VAR2 = '1752.eps';
$VAR3 = 'a';
$VAR4 = 'b';

The next problem is that "2" is wrong for the third argument. You'd want to use "3" and ignore the last value returned.

$ perl -MData::Dumper -e'
   $_=" 56 1752.eps a b";
   print Dumper split " ", $_, 2;
'
$VAR1 = '56';
$VAR2 = '1752.eps a b';

$ perl -MData::Dumper -e'
   $_=" 56 1752.eps a b";
   print Dumper split " ", $_, 3;
'
$VAR1 = '56';
$VAR2 = '1752.eps';
$VAR3 = 'a b';

Solutions:

for (@data) {
    # Pick any one of these; they are alternatives, not a sequence.
    my @a = (split)[0,1];
    my @a = (split " ", $_, 3)[0,1];
    my ($x, $y) = split(" ", $_, 3);
    my ($x, $y) = split;
    ...
}

`split` is optimised so that it doesn't do any unnecessary work for the last one: when the result is assigned to a list of scalars and no limit is given, Perl uses a limit one larger than the number of variables.

You could also avoid `split` entirely.

my ($x, $y) = /^\s*(\S+)\s+(\S+)/;
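To make the effect of the limit concrete, here is a minimal sketch that runs the variants discussed above on one sample record. The variable names (`$line`, `@two`, `$size`, `$name`, `$size_re`, `$name_re`) are made up for illustration and do not come from the thread.

```perl
use strict;
use warnings;

# Sample record with leading spaces and extra trailing fields.
my $line = "  56 1752.eps a b";

# LIMIT 2: at most two fields, so the remainder stays glued to the second one.
my @two = split ' ', $line, 2;

# LIMIT 3: the first two fields come out clean; the third (unassigned)
# field would hold the leftovers "a b".
my ($size, $name) = split ' ', $line, 3;

# Regex alternative: capture the first two whitespace-separated tokens.
my ($size_re, $name_re) = $line =~ /^\s*(\S+)\s+(\S+)/;

print "limit 2: [", join('|', @two), "]\n";
print "limit 3: [$size|$name]\n";
print "regex:   [$size_re|$name_re]\n";
```

Which variant fits depends on whether "two elements" means "the first two fields" (limit 3 or the regex) or "first field plus everything else" (limit 2).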
Re: Split(), Initial Spaces, & a limit? by jethro (Monsignor) on Jul 20, 2010 at 15:57 UTC
Use `split ' ',$_,2`. The documentation on split says:

    As a special case, specifying a PATTERN of space (' ') will split on white space just as "split" with no arguments does. Thus, "split(' ')" can be used to emulate awk's default behavior, whereas "split(/ /)" will give you as many null initial fields as there are leading spaces. A "split" on "/\s+/" is like a "split(' ')" except that any leading whitespace produces a null first field. A "split" with no arguments really does a "split(' ', $_)" internally.
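The three patterns named in that documentation excerpt can be compared side by side. This is only an illustrative sketch; the array names (`@awk_style`, `@regex_run`, `@single_space`) are invented here.

```perl
use strict;
use warnings;

my $line = "  56 1752.eps";    # two leading spaces

# ' ' is the special case: leading whitespace is skipped entirely.
my @awk_style    = split ' ',   $line;

# /\s+/ treats the leading run as a separator, giving one empty first field.
my @regex_run    = split /\s+/, $line;

# / / splits on every single space, giving one empty field per leading space.
my @single_space = split / /,   $line;

printf "%-8s %d fields\n", "' '",     scalar @awk_style;
printf "%-8s %d fields\n", '/\s+/',   scalar @regex_run;
printf "%-8s %d fields\n", '/ /',     scalar @single_space;
```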
Re: Split(), Initial Spaces, & a limit? by Anonymous Monk on Jul 20, 2010 at 15:54 UTC
use strict;
use warnings;

use Data::Dumper;

my @data = ( "  56 1752.eps a b", "  56 2613.eps", "  56 3469.eps",
"   8 INPUT000 a b", "  16 INPUT001", "  16 INPUT002", "  96 MTA.ps",
"  96 MTA.6.ps a b", "  80 MTA.7.ps", "  32 head.eps", "   8 labs",
"   0 lib a b", "   8 mkexe.bat", " 112 out", "   0 screenshots",
"8720 trace.exe a b", "  16 trace.pl", "   8 tracehosts",
"1160 trace.041409.exe a b", "1160 trace.orig.exe",
);

foreach (@data) {
    print STDERR Dumper( (split)[0,1] ), "\n";
}
__END__
$VAR1 = '56';
$VAR2 = '1752.eps';

$VAR1 = '56';
$VAR2 = '2613.eps';

...
Bug in Sort::Fields? by cmv (Chaplain) on Jul 20, 2010 at 17:08 UTC
Folks-

Anon, ++jethro, and ++ikegami, thanks for the great responses.

The reason I asked this question is that I'm seeing this problem when I use Sort::Fields. The script below will show what I mean:

use strict;
use warnings;
use Sort::Fields;
use Data::Dumper;

my @data = ( "  56 1752.eps", "  56 2613.eps", "  56 3469.eps",
"   8 INPUT000", "  16 INPUT001", "  16 INPUT002", "  96 MTA.1.ps",
"  96 MTA.6.ps", "  80 MTA.7.ps", "  32 head.eps", "   8 labs",
"   0 lib", "   8 mkexe.bat", " 112 out", "   0 screenshots",
"8720 trace.exe", "  16 trace.pl", "   8 tracehosts",
"1160 trace.041409.exe", "1160 trace.orig.exe",
);

# Initial spaces in column 1 don't sort the same as...
my @sorted = fieldsort( ['1n'], @data);
print STDERR "First sorted DUMP:\n", Dumper(\@sorted), "\n";

my @data2 = grep s/^/1 /, @data;
# ...initial spaces in column 2!
@sorted = fieldsort( ['2n'], @data2);
print STDERR "Second sorted DUMP:\n", Dumper(\@sorted), "\n";

You'll see in the output that the two `fieldsort()` calls sort differently. I contacted the module owner about it, but in the meantime was trying to figure out how to fix it on my own.

If you look in the `make_fieldsort()` sub in the Sort::Fields code, you'll see the nested map commands. I'm just getting comfortable with map, but this nested one is really throwing me for a loop (heh). I just can't seem to come up with the right solution here.

Any help for a poor, confused, I-only-seem-to-be-able-to-understand-non-nested-map-commands type person?

Thanks

-Craig
Re: Bug in Sort::Fields? by ikegami (Patriarch) on Jul 20, 2010 at 17:29 UTC
# Initial spaces in column 1 don't sort the same as...

It's impossible for a column to have initial spaces when whitespace is your delimiter. The first field of most of @data is `""`.

use strict;
use warnings;
use Sort::Fields;
use Data::Dumper;

my @data = ( "  56 1752.eps", "  56 2613.eps", "  56 3469.eps",
"   8 INPUT000", "  16 INPUT001", "  16 INPUT002", "  96 MTA.1.ps",
"  96 MTA.6.ps", "  80 MTA.7.ps", "  32 head.eps", "   8 labs",
"   0 lib", "   8 mkexe.bat", " 112 out", "   0 screenshots",
"8720 trace.exe", "  16 trace.pl", "   8 tracehosts",
"1160 trace.041409.exe", "1160 trace.orig.exe",
);

s/^\s+// for @data;

my @sorted = fieldsort( ['1n'], @data);
print(Dumper(\@sorted));

By the way, you were using `grep` as `map`, and you were clobbering @data in the process.
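The closing remark about `grep` can be demonstrated with a short sketch. It assumes nothing beyond core Perl; the names `@original`, `@clobbered`, `@via_grep`, and `@via_map` are hypothetical and chosen only for this illustration.

```perl
use strict;
use warnings;

my @original  = ("  56 1752.eps", "   8 labs");
my @clobbered = @original;

# grep aliases $_ to each element, so s/// edits @clobbered in place.
# The substitution always succeeds, so every element is also returned.
my @via_grep = grep s/^/1 /, @clobbered;

# map with an explicit copy leaves the source array untouched.
my @via_map = map { my $copy = $_; $copy =~ s/^/1 /; $copy } @original;

print "source after grep: [$clobbered[0]]\n";
print "source after map:  [$original[0]]\n";
print "map result:        [$via_map[0]]\n";
```

For a plain prefix like this one, `map { "1 $_" } @original` would do the same job without any substitution.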
Re^2: Bug in Sort::Fields? by cmv (Chaplain) on Jul 20, 2010 at 18:39 UTC
ikegami-

I'm sorry, but I don't believe I understand your point. It seems that all you did to fix the problem was to remove the initial spaces in the original data.

In my opinion Sort::Fields should sort the data the same way, regardless of where the data is (field 1 or field 2). If you try to numerically sort the output of an 'ls -s' command, you can see the problem clearly:

use strict;
use warnings;
use Sort::Fields;
use Data::Dumper;

my @data = `ls -s`;
chomp(@data);
my @sorted = fieldsort( ['1n'], @data);
print(Dumper(\@sorted));

This doesn't do what is intended, and is why I made the report to the author. I'm sure I could remove the initial spaces for Data::Dumper, then put them back after it's done, but that doesn't seem right to me.

-Craig
Re: Bug in Sort::Fields? by ikegami (Patriarch) on Jul 20, 2010 at 19:32 UTC
Alternatively, you can change the definition of a field.

use strict;
use warnings;
use Sort::Fields;
use Data::Dumper;

my @data = ( "  56 1752.eps", "  56 2613.eps", "  56 3469.eps",
"   8 INPUT000", "  16 INPUT001", "  16 INPUT002", "  96 MTA.1.ps",
"  96 MTA.6.ps", "  80 MTA.7.ps", "  32 head.eps", "   8 labs",
"   0 lib", "   8 mkexe.bat", " 112 out", "   0 screenshots",
"8720 trace.exe", "  16 trace.pl", "   8 tracehosts",
"1160 trace.041409.exe", "1160 trace.orig.exe",
);

my @sorted = fieldsort( "".qr/(?<!^)(?<!\s)\s+/, ['1n'], @data);
print(Dumper(\@sorted));
Re^2: Bug in Sort::Fields? by cmv (Chaplain) on Jul 20, 2010 at 20:07 UTC
ikegami++

Brilliant! I believe I understand the theory here. Now I just have to go off and figure out the specifics of what `"".qr/(?<!^)(?<!\s)\s+/` is actually doing. I'll get it after a while, and will learn a lot in doing so, no doubt!

Nicely done!
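For readers puzzling over the same pattern, here is a rough sketch of how it behaves when handed to plain `split` rather than to Sort::Fields. The variable names (`$sep`, `@fields`) are invented for this example. The `"" .` in front of `qr//` turns the compiled regex into an ordinary string, presumably because `fieldsort` expects its separator as a string; that reading is an assumption, not something confirmed in the thread.

```perl
use strict;
use warnings;

# (?<!^)   : the match may not start at the beginning of the string
# (?<!\s)  : the character before the match may not be whitespace,
#            so a match can only begin at the first character of a run
# \s+      : the whitespace run itself
my $sep = qr/(?<!^)(?<!\s)\s+/;

my $line   = "  56 1752.eps";
my @fields = split $sep, $line;

# The leading spaces are never matched as a separator,
# so they stay attached to the first field.
print "field 1: [$fields[0]]\n";
print "field 2: [$fields[1]]\n";

# Leading whitespace is ignored in numeric context.
print "numeric: ", $fields[0] + 0, "\n";
```

Because the first field keeps its leading spaces instead of becoming an empty string, a numeric sort on field 1 sees the number rather than `""`.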

