
Re^4: multicolumn extraction

by Kenosis (Priest)
on Jun 03, 2012 at 17:15 UTC (#974171)

in reply to Re^3: multicolumn extraction
in thread multicolumn extraction

Your reply makes sense, sauoq. I see that I assumed too much in taking the OP's sample fields to mean that fields never contain spaces, so it would be best to split on the known field delimiter. Indeed, it would be disastrous if the first field contained spaces. Good catch, and thank you for bringing this to my attention.
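To illustrate the point about delimiters, here is a minimal sketch. The sample record and its values are invented for the example and are not taken from the OP's data.

```perl
use strict;
use warnings;

# Hypothetical record: three tab-delimited fields, the first containing a space.
my $line = "New York\t8336817\tUSA";

# Splitting on whitespace breaks the first field in two.
my @by_space = split ' ', $line;

# Splitting on the known delimiter keeps each field intact.
my @by_tab = split /\t/, $line;

printf "whitespace: %d fields, second is '%s'\n", scalar @by_space, $by_space[1];
printf "tab:        %d fields, second is '%s'\n", scalar @by_tab,   $by_tab[1];
```

With the whitespace split, every column index after the first field is off by one, which is why a slice such as [ 1 .. 2 ] would silently return the wrong columns.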


I was curious to see whether there was any speed difference between splitting all or splitting some columns, so I ran the following, which creates and splits a 20 column x 10000 row file:

use Modern::Perl;
use Benchmark qw(cmpthese);

my $entry       = "aaaaaaaaaaaaaaaaa";
my $columnsFile = 'columns.txt';

open my $file, ">$columnsFile" or die $!;
do { print $file "$entry\t" x 19; say $file $entry } for 1 .. 10000;
close $file;

sub splitAll {
    open my $file, "<$columnsFile" or die $!;
    while (<$file>) { my @columns = split /\t/; }
    close $file;
}

sub splitSome {
    open my $file, "<$columnsFile" or die $!;
    while (<$file>) { my @columns = ( split /\t/ )[ 1 .. 2 ]; }
    close $file;
}

cmpthese(
    -5,
    {
        splitAll  => sub { splitAll() },
        splitSome => sub { splitSome() }
    }
);


            Rate  splitAll splitSome
splitAll  19.8/s        --      -21%
splitSome 25.1/s       27%        --

In this case, splitting only some shows a significant speed advantage--and with this relatively small file. I ran the script many times, getting as high as 31% for splitSome and as low as 21%--but always showing that splitSome is significantly faster.

Replies are listed 'Best First'.
Re^5: multicolumn extraction
by GrandFather (Sage) on Jun 03, 2012 at 22:33 UTC

    "Significantly faster" means a saving on the order of a few tens of seconds for something run several times a day over a long period, or about half an hour for something run just once. For almost all practical use cases the trivial difference you demonstrate is just that - trivial. A little extra juice, which may be useful, can be squeezed out by stopping the split early, rather than just slicing the result, which only avoids copying a few extra list elements:

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    my $kFName = 'delme.txt';

    test();

    sub test {
        my $entry = 'a' x 18;

        open my $fOut, '>', $kFName or die "Can't create $kFName: $!\n";
        print $fOut "$entry\t" x 19, "\n" for 1 .. 10000;
        close $fOut;

        cmpthese(
            -5,
            {
                splitAll   => sub {splitAll()},
                splitLimit => sub {splitLimit()},
                splitSlice => sub {splitSlice()},
            }
        );
    }

    sub splitAll {
        open my $fIn, '<', $kFName or die "Can't open $kFName: $!\n";
        while (<$fIn>) { my @columns = split /\t/; }
        close $fIn;
    }

    sub splitSlice {
        open my $fIn, '<', $kFName or die "Can't open $kFName: $!\n";
        while (<$fIn>) { my @columns = (split /\t/)[1 .. 2]; }
        close $fIn;
    }

    sub splitLimit {
        open my $fIn, '<', $kFName or die "Can't open $kFName: $!\n";
        while (<$fIn>) { my @columns = (split /\t/, $_, 4)[1 .. 2]; }
        close $fIn;
    }


                 Rate   splitAll splitSlice splitLimit
    splitAll   5.60/s         --       -36%       -73%
    splitSlice 8.75/s        56%         --       -59%
    splitLimit 21.1/s       276%       141%         --
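Why the limit is 4 when only columns 1 and 2 are wanted may not be obvious, so here is a small sketch of what the third argument to split does. The six-column record is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical six-column record.
my $line = join "\t", qw(c0 c1 c2 c3 c4 c5);

# With a limit of 4, split produces three complete fields and leaves
# the remainder of the line, tabs included, in the fourth.
my @limited = split /\t/, $line, 4;
my @wanted  = @limited[1 .. 2];

# A limit of 3 is one too few: the third piece would then hold
# column 2 plus the unsplit rest of the line.
my @too_short = split /\t/, $line, 3;

print scalar(@limited), " pieces, last is '$limited[3]'\n";
print "wanted: @wanted\n";
print "limit 3, third piece: '$too_short[2]'\n";
```

In general the limit needs to be the highest wanted index plus two, so that the last wanted column is terminated by a real split rather than swallowing the tail of the line.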

    However, even the worst performing variant is still so fast that it is simply not worth worrying about, even if you were running it several thousand times a day every day of the year. And none of these solutions is actually useful for parsing CSV. To do that in a reasonably robust way you should really use something like Text::CSV, which is about ten times slower than any of the benchmarked solutions, but has the huge advantage that it may actually give correct results for anything other than the trivial test data used by this test.
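A rough sketch of the failure mode, using an invented record whose first field contains the separator. Text::ParseWords is used here only as a core-module stand-in for a quote-aware parser; that substitution is an assumption of this example. It is not a full CSV parser (it does not handle embedded newlines, and it treats apostrophes as quotes), so Text::CSV remains the better choice for real data.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);    # core module, stand-in only

# Hypothetical CSV record whose first field contains a comma.
my $line = '"Smith, John",42,Wellington';

# Naive split treats the comma inside the quotes as a separator.
my @naive = split /,/, $line;

# A quote-aware parser keeps the quoted field whole (0 = strip the quotes).
my @aware = parse_line(',', 0, $line);

print scalar(@naive), " fields from split\n";
print scalar(@aware), " fields from parse_line, first is '$aware[0]'\n";
```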

    True laziness is hard work

      Your example and explanation are well done, suggesting that the difference between the two splits I timed--even if 'statistically significant'--really makes little or no practical difference.

      I see, now, that these timing routines can be a red herring that detracts from achieving both correct results and readability.

      Thanks, GrandFather.

      And none of these solutions is actually useful for parsing CSV.

      True enough! But then, he didn't say it was CSV. He said it was "tab delimited".

      CSV is a rather more robust format that permits fields in which you can escape your separator, have embedded newlines, etc. "Tab delimited" generally means a file where records are delimited by newlines, fields are delimited by tabs, and fields may not contain newlines or tabs. It's common. And splitting on tabs works very nicely for it without requiring additional dependencies and overhead.
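As a minimal sketch of reading records under that definition (the records themselves are made up): chomp removes the record delimiter, and a negative limit stops split from discarding trailing empty fields, which it otherwise does by default.

```perl
use strict;
use warnings;

# Hypothetical records; the second ends with an empty field.
my @lines = ("id\tname\tnote\n", "7\tsauoq\t\n");

my @rows;
for my $line (@lines) {
    chomp $line;                              # drop the record delimiter
    push @rows, [ split /\t/, $line, -1 ];    # -1 keeps trailing empty fields
}

# For comparison: the default drops the empty trailing field.
my @default = split /\t/, "7\tsauoq\t";

print scalar(@{ $rows[1] }), " fields with limit -1\n";
print scalar(@default), " fields with the default\n";
```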

      "My two cents aren't worth a dime.";
