Re: About text file parsing

by bliako (Prior)
on Aug 29, 2018 at 10:52 UTC (#1221297)

in reply to About text file parsing

If you put the file on a RAM disk (as TheloniusMonk suggested; see reply below) and also keep your results in RAM (push @sample...), you will need more RAM to run it, possibly a lot more. Copying the file to the RAM disk, which the OS will do for you via cp, will also take some time (although reading it line-by-line from a normal disk with your script will possibly take longer). Then there are SSDs and mechanical drives, and the time benefit will differ with each. This is the easiest approach, since it needs no extra code from you.

The additional benefit if you go the RAM disk way is that you can keep your input files on the RAM disk for multiple perl runs, until the next reboot or until you remove them from RAM. So the second time you run a similar script to find different patterns, you will see a better time benefit because the input is already in RAM.
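A minimal sketch of the copy-once, read-many idea, assuming Linux, where /dev/shm is normally a RAM-backed tmpfs. The file names and the two grep runs are made up for illustration; they stand in for the real input and for repeated runs of your script.

```shell
# Assumption: /dev/shm is a RAM-backed tmpfs (typical on Linux).
# Elsewhere this falls back to an ordinary temp directory, with no RAM benefit.
ramdir=/dev/shm
[ -d "$ramdir" ] && [ -w "$ramdir" ] || ramdir=$(mktemp -d)

# Made-up input standing in for the real file on a normal disk.
src=$(mktemp)
printf '%s\n' 'geneA 12' 'geneB 7' > "$src"

# One-off cost: the copy itself has to read the file from disk once.
ramcopy="$ramdir/input_$$.txt"
cp "$src" "$ramcopy"

# Every later run reads the RAM copy instead of the disk.
first_run=$(grep -c 'gene' "$ramcopy")
second_run=$(grep -c 'geneB' "$ramcopy")
echo "$first_run $second_run"

# The copy occupies RAM until it is removed (or until reboot).
rm -f "$ramcopy" "$src"
```

Whether this pays off depends on how many runs reuse the same copy; for a single pass the cp may cost about as much as it saves.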

If you go the parallel way (as Discipulus mentioned), you are bound by the total IO bandwidth of your hard disk, so the speedup may well be less than the number of parallel workers. I am not sure whether splitting the file and copying the pieces to physically different hard disks will get you benefits. If the content of your file is just separate lines that do not depend on each other (e.g. it is not XML spanning multiple lines), you can break that large file into smaller chunks (and keep it that way) and see if that helps parallelisation (in conjunction with storing the chunks on different disks). Edit: split -l 1000000 input.txt will split the input into chunks of 1000000 lines each (on unix).
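As a rough sketch of splitting and then working on the chunks in parallel: the tiny input, the chunk size of 4, the chunk_ prefix, and the use of wc -l as the per-chunk worker are all assumptions for illustration. The real worker would be your perl script.

```shell
workdir=$(mktemp -d)

# Made-up 10-line input standing in for the large file.
awk 'BEGIN { for (i = 1; i <= 10; i++) print "line", i }' > "$workdir/input.txt"

# Chunks of 4 lines each (the post uses 1000000); files are named chunk_aa, chunk_ab, ...
split -l 4 "$workdir/input.txt" "$workdir/chunk_"

# One background worker per chunk; wc -l stands in for the real per-chunk script.
for chunk in "$workdir"/chunk_*; do
    wc -l < "$chunk" > "$chunk.count" &
done
wait    # block until every worker has finished

# Combine the per-chunk results.
total=$(cat "$workdir"/chunk_*.count | awk '{ s += $1 } END { print s }')
chunks=$(ls "$workdir"/chunk_?? | wc -l | tr -d ' ')
echo "$chunks chunks, $total lines"
```

All workers here still read from the same disk, so this only shows the mechanics. Any speedup depends on the IO limits described above.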

If you have a lot of Not sample CC lines, you can filter them out before running all the different regexes on each line of input, or even before running that perl script: for example, via grep -v 'Not sample CC' input.txt | perl ... or with a perl one-liner filter, though I am not sure perl beats grep. Of course, the lines to be filtered out need to share a common pattern for this to work.
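A small sketch of the pre-filter, with made-up sample lines. The awk stage is only a stand-in for the perl script, which would sit at the end of the pipe in the same way.

```shell
# Made-up input: two wanted lines and two lines to be filtered out.
input=$(mktemp)
printf '%s\n' 'geneA 12' 'Not sample CC' 'geneB 7' 'Not sample CC' > "$input"

# Stage 1 drops the unwanted lines; stage 2 (awk, standing in for the
# perl script) only ever sees the remaining lines.
kept=$(grep -v 'Not sample CC' "$input" | awk '{ print $1 }')

# How many lines the expensive stage was spared.
dropped=$(grep -c 'Not sample CC' "$input")

printf '%s\n' "$kept"
rm -f "$input"
```

Since the pattern here is a literal string rather than a regex, grep -F is also an option. Whether it is measurably faster on your data is something to time rather than assume.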

And finally, if you do manage to remove all the Not sample CC lines, it is worth trying the following to see if it is faster (caveat: results in %inp will be in no particular order, not the order of insertion as with arrays, and a repeated key keeps only its last value):

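    # three-argument open with a lexical filehandle, and fail loudly if the file is missing
    open(my $fh, '<', 'test.txt') or die "Cannot open test.txt: $!";
    my %inp;
    while (my $line = <$fh>) {
        # key: shortest leading text before the first whitespace run; value: the next non-space field
        $inp{$1} = $2 if $line =~ /^(.+?)\s+(\S+)/;
    }
    close($fh);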

Edit: If you want to pass the output of your command above to another command for further processing, you do not have to wait for each process to finish and collect all its output before starting the next one. That problem was solved a long time ago: it is called a pipeline, and it is essentially what you see in unix style as cmd1 | cmd2 | cmd3 ... . cmd1 starts outputting results as soon as it reads its input (if it is a simple program like yours above). Its output is immediately read by cmd2, which in turn emits its output as soon as it has read the first line, and so on to cmd3. You get the first output as soon as cmd1 has read the first line of input, plus the propagation time. So you save a lot of time and you have results coming out almost immediately. The proviso is that processing one line or chunk of input must be independent of the following lines of input.
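A toy version of such a pipeline, written as three shell functions so that each stage is visible. The function bodies and the sample data are hypothetical; the point is only that every stage handles one line at a time and never needs to see the lines that follow.

```shell
# Hypothetical stages standing in for cmd1 | cmd2 | cmd3.

# Producer: emits made-up sample lines.
cmd1() { printf '%s\n' 'geneA 12' 'geneB 7'; }

# Per-character transform: upper-cases whatever passes through.
cmd2() { tr 'a-z' 'A-Z'; }

# Per-line transform: prints each result as soon as its line arrives.
cmd3() {
    while read -r key value; do
        printf '%s=%s\n' "$key" "$value"
    done
}

result=$(cmd1 | cmd2 | cmd3)
printf '%s\n' "$result"
```

Note that some tools buffer their output when writing to a pipe, so in practice results may arrive in bursts rather than strictly line by line.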

Replies are listed 'Best First'.
Re^2: About text file parsing
by TheloniusMonk (Sexton) on Aug 29, 2018 at 12:48 UTC
    I meant OP could store the arrays as files on a RAM disk, not the input file necessarily - though that is an interesting extra idea.
