PerlMonks  

Re: selecting columns from a tab-separated-values file

by Kenosis (Priest)
on Jan 21, 2013 at 23:27 UTC (#1014521)


in reply to selecting columns from a tab-separated-values file

Consider slicing a split to get only the elements you need, instead of splitting the entire line, as it benchmarks significantly faster:

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    my $line = "FIRST\tMIDDLE\tLAST\tSTRNO\tSTRNAME\tCITY\tSTATE\tZIP" . "\tFOO" x 42;

    sub trySplit {
        my @capture = split /\t/, $line;
    }

    sub trySplitSlice {
        my @capture = ( split /\t/, $line )[ 0, 2, 5 ];
    }

    sub trySplitSliceLimit {
        my @capture = ( split /\t/, $line, 7 )[ 0, 2, 5 ];
    }

    cmpthese(
        -5,
        {
            trySplit           => sub { trySplit() },
            trySplitSlice      => sub { trySplitSlice() },
            trySplitSliceLimit => sub { trySplitSliceLimit() }
        }
    );

Results:

                           Rate  trySplit  trySplitSlice  trySplitSliceLimit
    trySplit           110337/s        --           -46%                -84%
    trySplitSlice      204730/s       86%             --                -71%
    trySplitSliceLimit 708158/s      542%           246%                  --

Update: Have added choroba's trySplitSliceLimit() option to the benchmarking.

Update II: Thanks to AnomalousMonk, have appended "\tFOO" x 42 to the original string to create a string with 50 tab-delimited fields. This effectively shows the speed increase using trySplitSliceLimit().

Update III: Changed splitting on ' ' to \t. Thanks CountZero.
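
For what it's worth, here is a minimal sketch of how the limited slice might be applied line by line to real input. The in-memory filehandle, the sample rows, and the name select_columns() are all made up for illustration; none of them come from the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: return fields 0, 2 and 5 (FIRST, LAST, CITY) of one TSV line.
sub select_columns {
    my ($line) = @_;
    chomp $line;
    # A LIMIT of 7 stops splitting once the sixth field (index 5) is isolated.
    return ( split /\t/, $line, 7 )[ 0, 2, 5 ];
}

# Made-up sample data in an in-memory filehandle, standing in for a real file.
my $tsv = "Rip\tX\tVan Winkle\t1\tMain St\tNew York\tNY\t10001\n"
        . "Ann\tQ\tLee\t22\tOak Ave\tSalem\tOR\t97301\n";
open my $fh, '<', \$tsv or die "Cannot open in-memory file: $!";

my @rows;
while ( my $line = <$fh> ) {
    push @rows, [ select_columns($line) ];
}
close $fh;

# Print the selected columns, tab-separated, one record per line.
print join( "\t", @$_ ), "\n" for @rows;
```

This assumes every line has at least six fields; shorter lines would leave some or all of the requested fields missing or undef, so real code would probably want a check there.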


Re^2: selecting columns from a tab-separated-values file
by choroba (Abbot) on Jan 21, 2013 at 23:36 UTC
    Limiting the split makes it even faster:
    sub trySplitSliceLimit { my @capture = (split ' ', $line, 7)[ 0, 2, 5 ]; }
    لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      Yes--excellent! Will modify my posting to include this option.

      This turns out to make the biggest difference, far and away. Of course, it depends on what fields are chosen - in particular, where the rightmost one is.
Re^2: selecting columns from a tab-separated-values file
by AnomalousMonk (Abbot) on Jan 22, 2013 at 02:27 UTC

    How about using a 'record' string with more fields to better show the effect of a split limit? Just append something like "\tFOO" x 42 to the existing eight-field string.

      Yes--impressive! Doing so makes a huge difference in the benchmarking. Will update the update...

Re^2: selecting columns from a tab-separated-values file
by ibm1620 (Beadle) on Jan 22, 2013 at 03:58 UTC

    This is nice. Couple of questions:

    What's the first arg to split()? It appears to be a single blank char. How does that work to split upon tab chars?

    In trySplitSliceLimit, wouldn't it be better to set LIMIT to 3, or in general to the number of fields you expect to extract?

    And one observation for the record: the indices can appear in any order. To extract LAST, FIRST, and CITY, you'd write [2, 0, 5]

      What's the first arg to split()? It appears to be a single blank char. How does that work to split upon tab chars?

      It's a space enclosed within single quotes. It tells split to split on whitespace, e.g., \t, \n, space.

      In trySplitSliceLimit, wouldn't it be better to set LIMIT to 3, or in general to the number of fields you expect to extract?

      It should be set to one more than the number of fields that must be split off to reach the rightmost field you want. For example, using your original string:

      "FIRST\tMIDDLE\tLAST\tSTRNO\tSTRNAME\tCITY\tSTATE\tZIP"
         1      2      3      4       5      6    7 ---->

      my @capture = ( split /\t/, $line, 7 )[ 2, 0, 5 ];

      You want LAST FIRST CITY and CITY is the sixth field. Setting the LIMIT to seven returns the first six fields, with the unsplit remainder of the string as the seventh element. The slice is then used on those seven to get only the three you want.
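
The LIMIT rule described above could be wrapped up so it is computed from the indices rather than counted by hand. This is only a sketch: pick_fields() is a hypothetical name, and it assumes zero-based, non-negative indices.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical helper: pick the given zero-based columns from a tab-separated
# line, splitting no further than the rightmost requested column.
sub pick_fields {
    my ( $line, @indices ) = @_;
    # (highest index + 1) fields are needed, plus one element for the unsplit remainder.
    my $limit = max(@indices) + 2;
    return ( split /\t/, $line, $limit )[@indices];
}

my $line    = "FIRST\tMIDDLE\tLAST\tSTRNO\tSTRNAME\tCITY\tSTATE\tZIP";
my @capture = pick_fields( $line, 2, 0, 5 );    # LIMIT works out to 7 here
print "@capture\n";                             # prints "LAST FIRST CITY"
```

The indices are passed straight into the slice, so they can be given in whatever order the output should have.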

      And one observation for the record: the indices can appear in any order. To extract LAST, FIRST, and CITY, you'd write [2, 0, 5]

      You're correct!

      Update: Changed splitting on ' ' to \t. Thanks CountZero.

        What's the first arg to split()? It appears to be a single blank char. How does that work to split upon tab chars?
        It's a space enclosed within single quotes. It tells split to split on whitespace, e.g., \t, \n, space.
        That is a dangerous thing to do. It works for your example data, but it will break on real-world data where you will have LAST names like Van Winkle and CITYs called New York.
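
A small sketch of the difference, using a made-up record (not from the original data) that has spaces inside the LAST and CITY fields:

```perl
use strict;
use warnings;

# Made-up record with embedded spaces in LAST ("Van Winkle") and CITY ("New York").
my $line = "Rip\tX\tVan Winkle\t1\tMain St\tNew York\tNY\t10001";

# ' ' splits on any run of whitespace, so spaces inside a field also act as separators.
my @by_space = ( split ' ', $line, 7 )[ 0, 2, 5 ];

# /\t/ splits on tabs only, so fields containing spaces stay intact.
my @by_tab = ( split /\t/, $line, 7 )[ 0, 2, 5 ];

print join( '|', @by_space ), "\n";    # prints "Rip|Van|Main"
print join( '|', @by_tab ),   "\n";    # prints "Rip|Van Winkle|New York"
```

With ' ' the field boundaries shift as soon as any field contains a space, so the slice silently picks up the wrong pieces instead of failing.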

        CountZero

        "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

        My blog: Imperial Deltronics

        Deleted--replied to myself. Time for sleep...
