cmv has asked for the wisdom of the Perl Monks concerning the following question:
Monks-
I'm having difficulty figuring out how to get split to do its job with data that has initial spaces, when I need to set a limit.
Consider the following:
use strict;
use warnings;
use Data::Dumper;
my @data = ( " 56 1752.eps", " 56 2613.eps", " 56 3469.eps",
" 8 INPUT000", " 16 INPUT001", " 16 INPUT002", " 96 MTA.1.ps",
" 96 MTA.6.ps", " 80 MTA.7.ps", " 32 head.eps", " 8 labs",
" 0 lib", " 8 mkexe.bat", " 112 out", " 0 screenshots",
"8720 trace.exe", " 16 trace.pl", " 8 tracehosts",
"1160 trace.041409.exe", "1160 trace.orig.exe",
);
foreach (@data) {
    print STDERR Dumper(split), "\n";
}
OUTPUT SAMPLE:
$VAR1 = '56';
$VAR2 = '1752.eps';
$VAR1 = '56';
$VAR2 = '2613.eps';
$VAR1 = '56';
$VAR2 = '3469.eps';
This works wonderfully (as stated in the Camel book): the initial spaces in the data are ignored and each call to split returns a list with two elements!
Now, I would like to add a limit to the number of fields that split returns. Notice the changes in the data below; I still want each split to return a list with two elements:
use strict;
use warnings;
use Data::Dumper;
my @data = ( " 56 1752.eps a b", " 56 2613.eps", " 56 3469.eps",
" 8 INPUT000 a b", " 16 INPUT001", " 16 INPUT002", " 96 MTA.ps",
" 96 MTA.6.ps a b", " 80 MTA.7.ps", " 32 head.eps", " 8 labs",
" 0 lib a b", " 8 mkexe.bat", " 112 out", " 0 screenshots",
"8720 trace.exe a b", " 16 trace.pl", " 8 tracehosts",
"1160 trace.041409.exe a b", "1160 trace.orig.exe",
);
foreach (@data) {
    print STDERR Dumper(split /\s+/, $_, 2), "\n";
}
OUTPUT SAMPLE:
$VAR1 = '';
$VAR2 = '56 1752.eps a b';
$VAR1 = '';
$VAR2 = '56 2613.eps';
$VAR1 = '';
$VAR2 = '56 3469.eps';
Well, split is returning a list with two elements in every case, but for the lines with initial spaces, it returns an empty string as the first element.
How can I get the second case to ignore leading spaces like the first case did?
Many thanks...
-Craig
Re: Split(), Initial Spaces, & a limit?
by ikegami (Patriarch) on Jul 20, 2010 at 16:03 UTC
As documented, the default for the first argument of split is ' '. ("As a special case, specifying a PATTERN of space (' ') will split on white space just as split with no arguments does.")
$ perl -MData::Dumper -e'
$_=" 56 1752.eps a b";
print Dumper split;
'
$VAR1 = '56';
$VAR2 = '1752.eps';
$VAR3 = 'a';
$VAR4 = 'b';
$ perl -MData::Dumper -e'
$_=" 56 1752.eps a b";
print Dumper split " ";
'
$VAR1 = '56';
$VAR2 = '1752.eps';
$VAR3 = 'a';
$VAR4 = 'b';
The next problem is that "2" is wrong for the third argument. You'd want to use "3" and ignore the last value returned.
$ perl -MData::Dumper -e'
$_=" 56 1752.eps a b";
print Dumper split " ", $_, 2;
'
$VAR1 = '56';
$VAR2 = '1752.eps a b';
$ perl -MData::Dumper -e'
$_=" 56 1752.eps a b";
print Dumper split " ", $_, 3;
'
$VAR1 = '56';
$VAR2 = '1752.eps';
$VAR3 = 'a b';
Solutions:
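for (@data) {
    # These are alternatives; use any one of them.
    my @a = (split)[0,1];
    my @a = (split " ", $_, 3)[0,1];
    my ($x, $y) = split(" ", $_, 3);
    my ($x, $y) = split;
    ...
}
split is optimised so that it doesn't do any unnecessary work for the last one: when assigning to a list of scalars, it behaves as if the limit were one more than the number of variables. You could also avoid split entirely.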
my ($x, $y) = /^\s*(\S+)\s+(\S+)/;
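As a rough illustration of the optimisation mentioned above, the sketch below compares a plain list assignment with the explicit limit of 3. The variable names ($size, $name, $rest and so on) are made up for this example.

```perl
use strict;
use warnings;

my $line = " 56 1752.eps a b";

# Assigning to two scalars: split behaves as if LIMIT were 3,
# so the trailing "a b" is never broken into separate fields.
my ($size, $name) = split ' ', $line;

# The explicit equivalent: ask for 3 fields and keep the remainder.
my ($size2, $name2, $rest) = split ' ', $line, 3;

print "implicit: $size $name\n";
print "explicit: $size2 $name2 (rest: $rest)\n";
```

Both forms give the same first two fields; the explicit one just lets you look at the unsplit remainder.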
Re: Split(), Initial Spaces, & a limit?
by jethro (Monsignor) on Jul 20, 2010 at 15:57 UTC
Use split ' ',$_,2. The documentation on split says:
As a special case, specifying a PATTERN of space (' ') will split on white space just as
"split" with no arguments does. Thus, "split(' ')" can be used to emulate awk's default
behavior, whereas "split(/ /)" will give you as many null initial fields as there are leading
spaces. A "split" on "/\s+/" is like a "split(' ')" except that any leading whitespace
produces a null first field. A "split" with no arguments really does a "split(' ', $_)"
internally.
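The three cases in that documentation excerpt can be checked side by side. This is only a small sketch; the sample string (with two leading spaces) and the array names are chosen for illustration.

```perl
use strict;
use warnings;

my $line = "  56 1752.eps";    # two leading spaces (sample value)

my @awk   = split ' ',   $line;   # leading whitespace is skipped
my @regex = split /\s+/, $line;   # one empty leading field
my @space = split / /,   $line;   # one empty field per leading space

# Expected field counts: 2, 3 and 4 respectively.
print scalar(@awk),   " fields: [", join('][', @awk),   "]\n";
print scalar(@regex), " fields: [", join('][', @regex), "]\n";
print scalar(@space), " fields: [", join('][', @space), "]\n";
```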
Re: Split(), Initial Spaces, & a limit?
by Anonymous Monk on Jul 20, 2010 at 15:54 UTC
use strict;
use warnings;
use Data::Dumper;
my @data = ( " 56 1752.eps a b", " 56 2613.eps", " 56 3469.eps",
" 8 INPUT000 a b", " 16 INPUT001", " 16 INPUT002", " 96 MTA.ps",
" 96 MTA.6.ps a b", " 80 MTA.7.ps", " 32 head.eps", " 8 labs",
" 0 lib a b", " 8 mkexe.bat", " 112 out", " 0 screenshots",
"8720 trace.exe a b", " 16 trace.pl", " 8 tracehosts",
"1160 trace.041409.exe a b", "1160 trace.orig.exe",
);
foreach (@data) {
    print STDERR Dumper( (split)[0,1] ), "\n";
}
__END__
$VAR1 = '56';
$VAR2 = '1752.eps';
$VAR1 = '56';
$VAR2 = '2613.eps';
...
Bug in Sort::Fields?
by cmv (Chaplain) on Jul 20, 2010 at 17:08 UTC
Folks-
Anon, ++jethro, and ++ikegami, thanks for the great responses.
The reason I asked this question is because I'm seeing this problem when I use Sort::Fields. The script below will show what I mean:
use strict;
use warnings;
use Sort::Fields;
use Data::Dumper;
my @data = ( " 56 1752.eps", " 56 2613.eps", " 56 3469.eps",
" 8 INPUT000", " 16 INPUT001", " 16 INPUT002", " 96 MTA.1.ps",
" 96 MTA.6.ps", " 80 MTA.7.ps", " 32 head.eps", " 8 labs",
" 0 lib", " 8 mkexe.bat", " 112 out", " 0 screenshots",
"8720 trace.exe", " 16 trace.pl", " 8 tracehosts",
"1160 trace.041409.exe", "1160 trace.orig.exe",
);
# Initial spaces in column 1 don't sort the same as...
my @sorted = fieldsort( ['1n'], @data);
print STDERR "First sorted DUMP:\n", Dumper(\@sorted), "\n";
my @data2 = grep s/^/1 /, @data;
# ...initial spaces in column 2!
@sorted = fieldsort( ['2n'], @data2);
print STDERR "Second sorted DUMP:\n", Dumper(\@sorted), "\n";
You'll see in the output that the two fieldsort() calls sort differently. I contacted the module owner about it, but in the meantime I was trying to figure out how to fix it on my own.
If you look at the make_fieldsort() sub in the Sort::Fields code, you'll see the nested map commands. I'm just getting comfortable with map, but this nested one is really throwing me for a loop (heh). I just can't seem to come up with the right solution here.
Any help for a poor, confused, I-only-seem-to-be-able-to-understand-non-nested-map-commands type person?
Thanks
-Craig
|
use strict;
use warnings;
use Sort::Fields;
use Data::Dumper;
my @data = ( " 56 1752.eps", " 56 2613.eps", " 56 3469.eps",
" 8 INPUT000", " 16 INPUT001", " 16 INPUT002", " 96 MTA.1.ps",
" 96 MTA.6.ps", " 80 MTA.7.ps", " 32 head.eps", " 8 labs",
" 0 lib", " 8 mkexe.bat", " 112 out", " 0 screenshots",
"8720 trace.exe", " 16 trace.pl", " 8 tracehosts",
"1160 trace.041409.exe", "1160 trace.orig.exe",
);
s/^\s+// for @data;
my @sorted = fieldsort( ['1n'], @data);
print(Dumper(\@sorted));
By the way, you were using grep as map, and you were clobbering @data in the process.
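For the grep remark, a minimal sketch of one way to build the prefixed list with map while leaving @data alone. The two-element sample list and the temporary $s are just for illustration.

```perl
use strict;
use warnings;

my @data = ( " 56 1752.eps", "  8 INPUT000" );

# map builds a new list; working on the lexical copy $s means the
# substitution never touches the elements of @data.
my @data2 = map { my $s = $_; $s =~ s/^/1 /; $s } @data;

print "original: '$_'\n" for @data;
print "prefixed: '$_'\n" for @data2;
```

With grep s/^/1 /, the substitution runs on $_, which is aliased to each element of @data, so the original array is modified as a side effect.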
|
ikegami-
I'm sorry, but I don't believe I understand your point. It seems that all you did to fix the problem was to remove the initial spaces in the original data.
In my opinion Sort::Fields should sort the data the same way, regardless of where the data is (field 1 or field 2). If you try to numerically sort the output of an 'ls -s' command, you can see the problem clearly:
use strict;
use warnings;
use Sort::Fields;
use Data::Dumper;
my @data = `ls -s`; chomp(@data);
my @sorted = fieldsort( ['1n'], @data);
print(Dumper(\@sorted));
This doesn't do what is intended, which is why I made the report to the author. I'm sure I could remove the initial spaces for Sort::Fields, then put them back after it's done, but that doesn't seem right to me.
-Craig
|
Alternatively, you can change the definition of a field.
use strict;
use warnings;
use Sort::Fields;
use Data::Dumper;
my @data = ( " 56 1752.eps", " 56 2613.eps", " 56 3469.eps",
" 8 INPUT000", " 16 INPUT001", " 16 INPUT002", " 96 MTA.1.ps",
" 96 MTA.6.ps", " 80 MTA.7.ps", " 32 head.eps", " 8 labs",
" 0 lib", " 8 mkexe.bat", " 112 out", " 0 screenshots",
"8720 trace.exe", " 16 trace.pl", " 8 tracehosts",
"1160 trace.041409.exe", "1160 trace.orig.exe",
);
my @sorted = fieldsort( "".qr/(?<!^)(?<!\s)\s+/, ['1n'], @data);
print(Dumper(\@sorted));
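To see what that separator pattern does, here is a small sketch that uses plain split and sort rather than Sort::Fields, so it says nothing about the module's internals. The names $sep, @fields and @sample are made up for the example. The (?<!^) part refuses a match at the very start of the string, and (?<!\s) refuses a match that would begin in the middle of a run of whitespace, so leading spaces are never treated as a separator and stay attached to the first field.

```perl
use strict;
use warnings;

# Same pattern as above, kept as a compiled regex for use with split.
my $sep = qr/(?<!^)(?<!\s)\s+/;

# Leading whitespace is not a separator, so there is no empty first field.
my @fields = split $sep, "  56 1752.eps";
print "[", join('][', @fields), "]\n";

# Leading whitespace is ignored when a string is used as a number,
# so the first field still compares correctly with <=>.
my @sample = ( " 56 1752.eps", "  8 INPUT000", "8720 trace.exe", "  0 lib" );
my @sorted = sort { (split $sep, $a)[0] <=> (split $sep, $b)[0] } @sample;
print "$_\n" for @sorted;
```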
|
ikegami++
Brilliant!
I believe I understand the theory here. Now I just have to go off and figure out the specifics of what "".qr/(?<!^)(?<!\s)\s+/ is actually doing. I'll get it after a while, and will learn a lot in doing so, no doubt!
Nicely done!