PerlMonks  

Split(), Initial Spaces, & a limit?

by cmv (Chaplain)
on Jul 20, 2010 at 15:44 UTC

cmv has asked for the wisdom of the Perl Monks concerning the following question:

Monks-

I'm having difficulty figuring out how to get split to do its job with data that has initial spaces, when I need to set a limit.

Consider the following:

    use strict;
    use warnings;
    use Data::Dumper;

    my @data = (
        " 56 1752.eps",
        " 56 2613.eps",
        " 56 3469.eps",
        " 8 INPUT000",
        " 16 INPUT001",
        " 16 INPUT002",
        " 96 MTA.1.ps",
        " 96 MTA.6.ps",
        " 80 MTA.7.ps",
        " 32 head.eps",
        " 8 labs",
        " 0 lib",
        " 8 mkexe.bat",
        " 112 out",
        " 0 screenshots",
        "8720 trace.exe",
        " 16 trace.pl",
        " 8 tracehosts",
        "1160 trace.041409.exe",
        "1160 trace.orig.exe",
    );

    foreach (@data) {
        print STDERR Dumper(split), "\n";
    }

    OUTPUT SAMPLE:
    $VAR1 = '56';
    $VAR2 = '1752.eps';

    $VAR1 = '56';
    $VAR2 = '2613.eps';

    $VAR1 = '56';
    $VAR2 = '3469.eps';

This works wonderfully (as stated in the Camel book): the initial spaces in the data are ignored and each call to split returns a list with two elements!

Now, I would like to add a limit to the number of fields that split will split on. Notice the changes in the data below, and I still want each split to return a list with two elements:

    use strict;
    use warnings;
    use Data::Dumper;

    my @data = (
        " 56 1752.eps a b",
        " 56 2613.eps",
        " 56 3469.eps",
        " 8 INPUT000 a b",
        " 16 INPUT001",
        " 16 INPUT002",
        " 96 MTA.ps",
        " 96 MTA.6.ps a b",
        " 80 MTA.7.ps",
        " 32 head.eps",
        " 8 labs",
        " 0 lib a b",
        " 8 mkexe.bat",
        " 112 out",
        " 0 screenshots",
        "8720 trace.exe a b",
        " 16 trace.pl",
        " 8 tracehosts",
        "1160 trace.041409.exe a b",
        "1160 trace.orig.exe",
    );

    foreach (@data) {
        print STDERR Dumper(split /\s+/, $_, 2), "\n";
    }

    OUTPUT SAMPLE:
    $VAR1 = '';
    $VAR2 = '56 1752.eps a b';

    $VAR1 = '';
    $VAR2 = '56 2613.eps';

    $VAR1 = '';
    $VAR2 = '56 3469.eps';

Well, split is returning a list with two elements in every case, but in the case of the lines with initial spaces, it returns a null for the first element.

How can I get the second case to ignore leading spaces like the first case did?

Many thanks...

-Craig

Replies are listed 'Best First'.
Re: Split(), Initial Spaces, & a limit?
by ikegami (Patriarch) on Jul 20, 2010 at 16:03 UTC

    As documented, the default for the first argument of split is ' '. ("As a special case, specifying a PATTERN of space (' ') will split on white space just as split with no arguments does.")

        $ perl -MData::Dumper -e'
            $_=" 56 1752.eps a b";
            print Dumper split;
        '
        $VAR1 = '56';
        $VAR2 = '1752.eps';
        $VAR3 = 'a';
        $VAR4 = 'b';

        $ perl -MData::Dumper -e'
            $_=" 56 1752.eps a b";
            print Dumper split " ";
        '
        $VAR1 = '56';
        $VAR2 = '1752.eps';
        $VAR3 = 'a';
        $VAR4 = 'b';

    The next problem is that "2" is wrong for the third argument. You'd want to use "3" and ignore the last value returned.

        $ perl -MData::Dumper -e'
            $_=" 56 1752.eps a b";
            print Dumper split " ", $_, 2;
        '
        $VAR1 = '56';
        $VAR2 = '1752.eps a b';

        $ perl -MData::Dumper -e'
            $_=" 56 1752.eps a b";
            print Dumper split " ", $_, 3;
        '
        $VAR1 = '56';
        $VAR2 = '1752.eps';
        $VAR3 = 'a b';

    Solutions:

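        for (@data) {
            # Alternatives; pick any one of these four:
            my @a = (split)[0,1];
            my @a = (split " ", $_, 3)[0,1];
            my ($x, $y) = split(" ", $_, 3);
            my ($x, $y) = split;
            ...
        }

    split is optimised so that it doesn't do any unnecessary work for the last one: when the result is assigned to a list and no limit is given, split behaves as though the limit were one larger than the number of variables in the list. You could also avoid split entirely.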

    my ($x, $y) = /^\s*(\S+)\s+(\S+)/;
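
A minimal sketch (not from the thread) that runs the alternatives above side by side on one sample line from the question. The variable names are illustrative only; every approach should yield the same two fields.

```perl
use strict;
use warnings;

# Sample line taken from the question; variable names are illustrative.
my $line = " 56 1752.eps a b";

# 1. Split everything, keep the first two fields.
my @slice = (split ' ', $line)[0, 1];

# 2. Explicit limit of 3: the third field holds the remainder and is dropped.
my @limited = (split ' ', $line, 3)[0, 1];

# 3. List assignment: fields beyond the second are discarded.
my ($size, $name) = split ' ', $line;

# 4. No split at all: capture the first two runs of non-whitespace.
my ($rsize, $rname) = $line =~ /^\s*(\S+)\s+(\S+)/;

print join('|', @slice),          "\n";
print join('|', @limited),        "\n";
print join('|', $size, $name),    "\n";
print join('|', $rsize, $rname),  "\n";
```

Each of the four lines printed is `56|1752.eps`.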
Re: Split(), Initial Spaces, & a limit?
by jethro (Monsignor) on Jul 20, 2010 at 15:57 UTC

    Use split ' ',$_,2. The documentation on split says:

    As a special case, specifying a PATTERN of space (' ') will split on white space just as "split" with no arguments does. Thus, "split(' ')" can be used to emulate awk's default behavior, whereas "split(/ /)" will give you as many null initial fields as there are leading spaces. A "split" on "/\s+/" is like a "split(' ')" except that any leading whitespace produces a null first field. A "split" with no arguments really does a "split(' ', $_)" internally.
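
To make the quoted paragraph concrete, here is a small sketch comparing the three patterns on a line with two leading spaces. The sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;

# Two leading spaces, so the difference between / / and /\s+/ is visible.
my $line = "  56 1752.eps";

my @awk    = split ' ',   $line;   # special case: leading whitespace is skipped
my @single = split / /,   $line;   # one empty field per leading space
my @runs   = split /\s+/, $line;   # the leading run yields one empty first field

for my $case ( [ "' '" => \@awk ], [ '/ /' => \@single ], [ '/\s+/' => \@runs ] ) {
    my ($label, $fields) = @$case;
    printf "%-6s %d fields: %s\n", $label, scalar @$fields,
        join(', ', map { "'$_'" } @$fields);
}
```

The `' '` pattern returns 2 fields, `/ /` returns 4 (two of them empty), and `/\s+/` returns 3 (the first one empty).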
Re: Split(), Initial Spaces, & a limit?
by Anonymous Monk on Jul 20, 2010 at 15:54 UTC
        use strict;
        use warnings;
        use Data::Dumper;

        my @data = (
            " 56 1752.eps a b",
            " 56 2613.eps",
            " 56 3469.eps",
            " 8 INPUT000 a b",
            " 16 INPUT001",
            " 16 INPUT002",
            " 96 MTA.ps",
            " 96 MTA.6.ps a b",
            " 80 MTA.7.ps",
            " 32 head.eps",
            " 8 labs",
            " 0 lib a b",
            " 8 mkexe.bat",
            " 112 out",
            " 0 screenshots",
            "8720 trace.exe a b",
            " 16 trace.pl",
            " 8 tracehosts",
            "1160 trace.041409.exe a b",
            "1160 trace.orig.exe",
        );

        foreach (@data) {
            print STDERR Dumper( (split)[0,1] ), "\n";
        }
        __END__
        $VAR1 = '56';
        $VAR2 = '1752.eps';

        $VAR1 = '56';
        $VAR2 = '2613.eps';

        ...
Bug in Sort::Fields?
by cmv (Chaplain) on Jul 20, 2010 at 17:08 UTC
    Folks-

    Anon, ++jethro, and ++ikegami, thanks for the great responses.

    The reason I asked this question is because I'm seeing this problem when I use Sort::Fields. The script below will show what I mean:

        use strict;
        use warnings;
        use Sort::Fields;
        use Data::Dumper;

        my @data = (
            " 56 1752.eps",
            " 56 2613.eps",
            " 56 3469.eps",
            " 8 INPUT000",
            " 16 INPUT001",
            " 16 INPUT002",
            " 96 MTA.1.ps",
            " 96 MTA.6.ps",
            " 80 MTA.7.ps",
            " 32 head.eps",
            " 8 labs",
            " 0 lib",
            " 8 mkexe.bat",
            " 112 out",
            " 0 screenshots",
            "8720 trace.exe",
            " 16 trace.pl",
            " 8 tracehosts",
            "1160 trace.041409.exe",
            "1160 trace.orig.exe",
        );

        # Initial spaces in column 1 don't sort the same as...
        my @sorted = fieldsort( ['1n'], @data);
        print STDERR "First sorted DUMP:\n", Dumper(\@sorted), "\n";

        my @data2 = grep s/^/1 /, @data;

        # ...initial spaces in column 2!
        @sorted = fieldsort( ['2n'], @data2);
        print STDERR "Second sorted DUMP:\n", Dumper(\@sorted), "\n";

    You'll see in the output that the two fieldsort() calls sort the data differently. I contacted the module owner about it, but in the meantime I was trying to figure out how to fix it on my own.

    If you look in the make_fieldsort() sub in the Sort::Fields code, you'll see the nested map commands. I'm just getting comfortable with map, but this nested one is really throwing me for a loop (heh). I just can't seem to come up with the right solution here.

    Any help for a poor, confused, I-only-seem-to-be-able-to-understand-non-nested-map-commands type person?

    Thanks

    -Craig

      # Initial spaces in column 1 don't sort the same as...

      It's impossible for a column to have initial spaces when spaces is your delimiter. The first field of most of @data is "".

          use strict;
          use warnings;
          use Sort::Fields;
          use Data::Dumper;

          my @data = (
              " 56 1752.eps",
              " 56 2613.eps",
              " 56 3469.eps",
              " 8 INPUT000",
              " 16 INPUT001",
              " 16 INPUT002",
              " 96 MTA.1.ps",
              " 96 MTA.6.ps",
              " 80 MTA.7.ps",
              " 32 head.eps",
              " 8 labs",
              " 0 lib",
              " 8 mkexe.bat",
              " 112 out",
              " 0 screenshots",
              "8720 trace.exe",
              " 16 trace.pl",
              " 8 tracehosts",
              "1160 trace.041409.exe",
              "1160 trace.orig.exe",
          );

          s/^\s+// for @data;

          my @sorted = fieldsort( ['1n'], @data);
          print(Dumper(\@sorted));

      By the way, you were using grep as map, and you were clobbering @data in the process.
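
The remark about grep can be illustrated with a short sketch (not from the thread; the two-element list and the variable names are assumptions for the example). Inside both grep and map, `$_` is an alias to the original element, so a substitution edits the source array in place, whereas building a new string leaves it alone.

```perl
use strict;
use warnings;

my @data = (" 56 1752.eps", "8720 trace.exe");

# map with string interpolation builds new strings; @data is untouched.
my @prefixed = map { "1 $_" } @data;

# grep aliases $_ to each element, so s/// inside it edits the array in place.
# A copy is used here so the effect can be compared against @data.
my @copy    = @data;
my @grepped = grep { s/^/1 / } @copy;

print "data:     ", join(' | ', @data),     "\n";
print "prefixed: ", join(' | ', @prefixed), "\n";
print "copy:     ", join(' | ', @copy),     "\n";   # modified by the grep above
```

After this runs, `@data` still holds the original strings, while `@copy` has had `1 ` prepended to every element.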

        ikegami-

        I'm sorry, but I don't believe I understand your point. It seems that all you did to fix the problem was to remove the initial spaces in the original data.

        In my opinion Sort::Fields should sort the data the same way, regardless of where the data is (field 1 or field 2). If you try to numerically sort the output of an 'ls -s' command, you can see the problem clearly:

            use strict;
            use warnings;
            use Sort::Fields;
            use Data::Dumper;

            my @data = `ls -s`;
            chomp(@data);

            my @sorted = fieldsort( ['1n'], @data);
            print(Dumper(\@sorted));

        This doesn't do what is intended, and is why I made the report to the author. I'm sure I could remove the initial spaces for Sort::Fields, then put them back after it's done, but that doesn't seem right to me.

        -Craig
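
For comparison, a sketch that sorts such lines numerically on the first awk-style field without Sort::Fields and without altering the lines. This is an alternative technique (a Schwartzian transform over `split ' '`), not what the module does internally; the sample lines stand in for `ls -s` output and are made up.

```perl
use strict;
use warnings;

# Stand-in for `ls -s` output; sizes are right-aligned with leading spaces.
my @lines = ("  56 1752.eps", "   8 labs", "8720 trace.exe", " 112 out");

# Decorate each line with its first whitespace-delimited field, sort
# numerically on that key, then strip the decoration again.
my @sorted =
    map  { $_->[1] }
    sort { $a->[0] <=> $b->[0] }
    map  { [ (split ' ', $_)[0], $_ ] }
    @lines;

print "$_\n" for @sorted;
```

The lines come out in the order 8, 56, 112, 8720, each with its original leading spaces intact.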

      Alternatively, you can change the definition of a field.
          use strict;
          use warnings;
          use Sort::Fields;
          use Data::Dumper;

          my @data = (
              " 56 1752.eps",
              " 56 2613.eps",
              " 56 3469.eps",
              " 8 INPUT000",
              " 16 INPUT001",
              " 16 INPUT002",
              " 96 MTA.1.ps",
              " 96 MTA.6.ps",
              " 80 MTA.7.ps",
              " 32 head.eps",
              " 8 labs",
              " 0 lib",
              " 8 mkexe.bat",
              " 112 out",
              " 0 screenshots",
              "8720 trace.exe",
              " 16 trace.pl",
              " 8 tracehosts",
              "1160 trace.041409.exe",
              "1160 trace.orig.exe",
          );

          my @sorted = fieldsort( "".qr/(?<!^)(?<!\s)\s+/, ['1n'], @data);
          print(Dumper(\@sorted));

        ikegami++

        Brilliant!

        I believe I understand the theory here. Now I just have to go off and figure out the specifics of what "".qr/(?<!^)(?<!\s)\s+/ is actually doing. I'll get it after a while, and will learn a lot in doing so, no doubt!

        Nicely done!
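
As a starting point for working out that pattern, here is a sketch using plain split rather than Sort::Fields. The reading is this: `(?<!^)` rejects a match that starts at the very beginning of the string, and `(?<!\s)` rejects one that starts in the middle of a whitespace run, so leading whitespace is never treated as a separator. The `"".` prefix presumably just stringifies the compiled regex so that fieldsort receives a plain string rather than a reference; that is an assumption about the module, not something checked here.

```perl
use strict;
use warnings;

# Separator from the reply above: a run of whitespace that neither starts
# at the beginning of the string nor continues an earlier whitespace run.
my $sep = qr/(?<!^)(?<!\s)\s+/;

my $line = "  56 1752.eps";

my @plain  = split /\s+/, $line;   # leading run counts as a separator
my @custom = split $sep,  $line;   # leading run stays inside field 1

print "plain:  ", join(', ', map { "'$_'" } @plain),  "\n";
print "custom: ", join(', ', map { "'$_'" } @custom), "\n";

# Leading whitespace is ignored when a string is used as a number,
# so the first custom field still compares as 56.
print "numeric: ", $custom[0] + 0, "\n";
```

With `/\s+/` the first field is the empty string; with the custom separator it is `'  56'`, which is 56 in numeric context, so a numeric sort on field 1 sees the real size.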
