Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

Re: selecting columns from a tab-separated-values file

by BrowserUk (Pope)
on Jan 22, 2013 at 04:59 UTC ( #1014614=note: print w/replies, xml ) Need Help??


in reply to selecting columns from a tab-separated-values file

What most the responses so far seem to have missed is that the cast of the in-ram spliting and joining of records, pales into insignificance when compared to the time required for the read-a-record, write-a-record flip-flopping of the read head back and forth across the disk.

There are two possible ways to speed up your processing:

  1. Overlap the spliting & joining with either the reading or the writing (or both).

    This can be achieved by using two or more threads or processes.

  2. Read and write larger chunks of the file to minimise the number of seeks the read heads need to make.

    Interleaving 80GB/4k = 20 million reads, and (~) 200,000 write means (at least) 400,000 track to track seeks.

    If you increase the read & write chunk sizes to 1MB each, that can be reduced to ~1,500 track to track seeks.

Try these:

  • ibuf.pl:
    #! perl -sw use strict; our $B //= 64; $B *= 1024; my $ibuf = ''; while( sysread( *STDIN, $ibuf, $B, length $ibuf ) ) { my $p = 1+rindex( $ibuf, "\n" ); my $rem = substr( $ibuf, $p ); substr( $ibuf, $p ) = ''; open my $RAM, '<', \$ibuf; print while <$RAM>; $ibuf = $rem; }
  • obuf.pl:
    #! perl -sw use strict; our $B //= 64; $B *= 1024; my $obuf; while( <> ) { my @f = split chr(9); $obuf .= join( chr(9), @f[2,0,5] ) . "\n"; if( length( $obuf ) > $B ) { print $obuf; $obuf = ''; } } print $obuf; ## Corrected C&P error. See [lotus1]'s post below.

    Usage: ibuf -B=1024 < in.tsv | obuf.pl -B=1024 > out.tsv.

    Based on my experiments, it might be possible to more than halve your processing time, though YMMV.

    Experiment with adjusting the -B=nnn (in kbs) parameters up & down in unison and independently to find the sweet spot on your system.

    Be aware, bigger is not always better. 1024 for both seems to work quite well on my system, anything larger slows it down.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^2: selecting columns from a tab-separated-values file
by Lotus1 (Vicar) on Jan 22, 2013 at 12:03 UTC

    I just tried your solution with an input file of 100,000 lines but I only got 65537 lines of output. Here is the command line I used to run the test:

    perl ibuf.pl -B=1024 < testdata.csv | perl obuf.pl -B=1024 > out.csv

      Damn! I missed/dropped the last line of obuf.pl when I C&P's the code:

      print $obuf;

      Now corrected above, Thank you.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
Re^2: selecting columns from a tab-separated-values file
by ibm1620 (Monk) on Jan 22, 2013 at 18:10 UTC
    My Perl program was consuming 100% of one CPU core, leading me to think it was unlikely to be I/O-bound.

      The construction of the arrays from the split lines consumes a fair amount of CPU. As does the allocation of the buffers for reading into and writing from. As does the searching of the 4K buffers to locate the line ends. And the search of the lines to find the tabs.

      But still, the minimum elapsed time is dominated by the time it takes to a) read 80 GB from disk; whilst simultaneously writing ~5 GB back to disk. On my system 2.6 GHz 4-core Q6600 + 7200 RPM SATA2 disk with 16MB ram, that reading and writing process -- with no other processing; searching; memory allocations etc. -- requires a minimum of 56 minutes.

      The difference between that and your 5 hours comes in two parts.

      1. The internal processing of split & join.
      2. The waiting for IO.

      If you can overlap the IO and the internal processing, you can claw back some of that difference.

      Have you tried my two processes solution?


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        I'm back at work and have tested the two-process solution. It took 60 seconds to pass 10M (M=million) records. Then I pulled the logic for splitting and joining the records out of obuf and into ibuf (thus eliminating obuf) and ran the same test, and it ran in 62 seconds. (In both cases the output was to /dev/null.)<\p>

        I reran the tests sending output to an actual file in the same directory as the input, and obtained exactly the same runtimes.

        In ALL cases I observed the CPU of the process that was doing the split/join to peg at 100%.

        So I have to conclude that disk I/O is negligible for this program, in my environment.

        I've not yet tried the two-process solution, but I intend to. One concern I have is that, at some point fairly early on, it seems like the "pipeline" would get full, at which point the two processes would have to operate in lock-step.

        Another thought I've had is that, since the process appears to me to be CPU-bound, it might be worth forking several children and distributing the work across them. Each child would have to write to a separate output file, which admittedly would increase the possibility of disk head thrashing, but I think it's worth a try.

Re^2: selecting columns from a tab-separated-values file
by Anonymous Monk on Jan 23, 2013 at 04:33 UTC

    Read and write larger chunks of the file to minimise the number of seeks the read heads need to make.

    Sometimes there's the luxury of having two hard drive spindles. Use one for reading, the other for writing, and find your I/O-bound application much sped up. (It works even when the other spindle is over a network share.) Or the other kind of luxury of having an SSD: costless seeks.

    A reply falls below the community's threshold of quality. You may see it by logging in.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1014614]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others examining the Monastery: (3)
As of 2019-10-20 01:06 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Notices?