
Re^3: Reading files, skipping very long lines...

by pjf (Curate)
on Sep 30, 2005 at 01:03 UTC (#496301)

in reply to Re^2: Reading files, skipping very long lines...
in thread Reading files, skipping very long lines...

G'day Excalibor,

All the suggestions so far have been fantastic, and it sounds like all you really need now is a very-fast 'discard line' subroutine.

Be aware that however efficient your code is, you'll be limited by the speed of the I/O operations provided by your operating system. If you have to read 380MB from disk, that is going to take some time no matter how you process it.

If possible, set your program running and take a look at what your system is doing. If you're on a Unix-flavoured system, then top and time can help a lot. If you're hitting 100% CPU usage, and a lot of that is userland time, then a tighter reading loop may help. If you're not seeing 100% CPU usage, or you're seeing a very high amount of system time, then you're probably I/O bound, and you'll need faster disks, hardware, and/or filesystems for your program's performance to improve.
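If you'd rather take the same readings from inside a program than from top or time, here is a minimal sketch in C, assuming a POSIX system. It uses getrusage() and clock_gettime(); the run_times struct and the read_run_times() and report_run_times() names are made up for this example.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <time.h>

/* Hypothetical helper type: readings for the current process, in seconds. */
struct run_times {
    double user;  /* CPU time spent in user mode */
    double sys;   /* CPU time spent in the kernel on our behalf */
    double wall;  /* elapsed time on a monotonic clock */
};

static double tv_seconds(struct timeval tv)
{
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Fill *out with the current readings. Returns 0 on success, -1 on failure. */
int read_run_times(struct run_times *out)
{
    struct rusage ru;
    struct timespec ts;

    assert(out != NULL);
    if (getrusage(RUSAGE_SELF, &ru) != 0) {
        return -1;
    }
    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0) {
        return -1;
    }
    out->user = tv_seconds(ru.ru_utime);
    out->sys  = tv_seconds(ru.ru_stime);
    out->wall = (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
    return 0;
}

/* Print how much user, system and wall-clock time passed between two readings. */
void report_run_times(const struct run_times *before,
                      const struct run_times *after, FILE *to)
{
    fprintf(to, "user %.3fs  sys %.3fs  wall %.3fs\n",
            after->user - before->user,
            after->sys - before->sys,
            after->wall - before->wall);
}
```

Take one reading before your read loop and one after it. If user plus sys comes close to wall, you're CPU bound; if wall is much larger than the two combined, the process spent most of its time waiting, most likely on I/O.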

Assuming that you are CPU bound, you can potentially write your 'discard line' subroutine in C, which allows it to be very fast and compact. Here's an example using Inline::C:

use Inline 'C';

# Example, skip a line of input from STDIN:
skip_line();

# Look! The next line is read fine by Perl.
print scalar <STDIN>;

__END__
__C__

/* Read (and discard) until we find a newline */

/* NOTE: This will loop endlessly if it hits EOF
 * before finding a newline. Caveat lector.
 */

void skip_line() {
    while( getchar() != '\n' ) { }
}

I haven't benchmarked that, but it should be both very memory efficient and fast. Be aware of the problem you will encounter if skip_line() hits EOF before a newline; unless you're very sure of your input file, you'll want to improve upon the sample code provided here.
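One possible improvement, as a sketch rather than a tested drop-in: stop at EOF as well as at a newline, and tell the caller which one happened. The name skip_line_safe(), the FILE * parameter and the return value are all additions for illustration. The original reads stdin and returns nothing.

```c
#include <assert.h>
#include <stdio.h>

/* Discard characters from fp up to and including the next newline.
 * Returns 1 if a newline was consumed, 0 if EOF (or a read error)
 * was reached first, so the caller can stop instead of spinning. */
int skip_line_safe(FILE *fp)
{
    int c;  /* int, not char, so that EOF stays distinguishable from data */

    assert(fp != NULL);
    while ((c = getc(fp)) != EOF) {
        if (c == '\n') {
            return 1;
        }
    }
    return 0;
}
```

Calling skip_line_safe(stdin) matches what the original does on well-formed input, and returns 0 on a final line with no trailing newline rather than looping forever. This is standalone C; it doesn't address whether C stdio and Perl's own buffered reads can safely share one handle, so check that on your real input before relying on it under Inline::C.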

If you do benchmark, keep in mind that any caching by the CPU may make a significant difference to your end results.

All the very best,
