
If you put the file on a RAM disk (as TheloniusMonk suggested, see the reply below) and also keep your results in RAM (push @sample...), you will need more RAM to run it, possibly a lot more. Copying the file to the RAM disk, which the OS will do for you via cp, also takes some time (although reading it line by line from a normal disk with your script may well take longer). The time benefit will also differ between SSDs and mechanical drives. This is the easiest approach, since it requires no extra code from you.

An additional benefit of the RAM disk way is that you can keep your input files on the RAM disk across multiple perl runs, until the next reboot or until you remove them from RAM. So the second time you run a similar script to find different patterns, you will see a bigger time benefit, because the input is already in RAM.
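A minimal sketch of the copy-once, search-many idea. It assumes Linux, where /dev/shm is usually a RAM-backed tmpfs mount; the sample file and the two searches are hypothetical stand-ins for your real input and scripts.

```shell
#!/bin/sh
# Assumption: Linux, where /dev/shm is usually a RAM-backed tmpfs mount.
# If it is missing, fall back to an ordinary temp directory (no speed-up then).
if [ -d /dev/shm ] && [ -w /dev/shm ]; then
    ramdir=$(mktemp -d /dev/shm/parse.XXXXXX)
else
    ramdir=$(mktemp -d)
fi

# Hypothetical stand-in for the real large input file.
src=$(mktemp)
printf '%s\n' 'alpha 10' 'Not sample CC' 'beta 20' > "$src"

# Pay the copy cost once...
cp "$src" "$ramdir/input.txt"

# ...then run as many different searches as you like against the RAM copy.
run1=$(grep -c 'alpha' "$ramdir/input.txt")
run2=$(grep -c 'Not sample CC' "$ramdir/input.txt")
echo "alpha lines: $run1, skipped lines: $run2"

# The copy occupies RAM until it is removed (or the machine reboots).
rm -rf "$ramdir" "$src"
```

In real use you would leave the copy in place between runs and remove it only when you are done with it.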

If you go the parallel way (as Discipulus mentioned), you are bound by the total IO bandwidth of your hard disk, so the speed-up may be less than simply dividing the run time by the number of parallel workers. I am not sure whether splitting the file and copying the pieces to physically different hard disks will gain you anything. If the content of your file is just separate lines which do not depend on each other (e.g. it is not XML spanning multiple lines), then you can break that large file into smaller chunks (and keep it that way) and see if that helps parallelisation (in conjunction with storing the chunks on different disks). Edit: split -l 1000000 input.txt will split the input into chunks of 1000000 lines each (in unix).
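A small sketch of split plus one background worker per chunk. The 10-line input, the chunk_ prefix and the grep -c worker are all assumptions for illustration; with a real file you would use something like the 1000000-line chunks above and your own script as the worker.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Hypothetical stand-in for the large input: 10 independent lines.
awk 'BEGIN { for (i = 1; i <= 10; i++) print "row " i }' > "$workdir/input.txt"

# Split into chunks of 4 lines each: chunk_aa, chunk_ab, chunk_ac.
split -l 4 "$workdir/input.txt" "$workdir/chunk_"
chunk_count=$(ls "$workdir"/chunk_* | wc -l | tr -d ' ')

# One background worker per chunk; here each just counts matching lines.
for f in "$workdir"/chunk_*; do
    grep -c 'row' "$f" > "$f.count" &
done
wait    # block until every worker has finished

# Combine the per-chunk results.
total=$(cat "$workdir"/chunk_*.count | awk '{ s += $1 } END { print s }')
echo "chunks: $chunk_count, matching lines: $total"
```

Whether this is actually faster depends on the disk, as noted above: workers reading from the same drive still share its bandwidth.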

If a lot of your lines are Not sample CC lines, you can filter them out before running all the different regexes on each line of input, or even before running the perl script at all: for example, via grep -v 'Not sample CC' input.txt | perl ... or with a perl one-liner filter, though I am not sure perl beats grep. Of course, the lines to be filtered out need to share a common pattern that a single regex can match.
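A sketch of the pre-filter on made-up data. The awk stage is only a stand-in for the downstream perl script, and -F (treat the pattern as a fixed string, not a regex) is an optional extra that fits here because the marker is literal text.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Hypothetical sample input: three useful lines, two to be dropped.
printf '%s\n' 'alpha 10' 'Not sample CC' 'beta 20' 'Not sample CC' 'gamma 30' \
    > "$workdir/input.txt"

# grep -v drops the unwanted lines before the expensive stage sees them.
# awk stands in for the real perl script that would do the heavy parsing.
kept=$(grep -F -v 'Not sample CC' "$workdir/input.txt" | awk '{ print $1 "=" $2 }')
printf '%s\n' "$kept"

kept_count=$(printf '%s\n' "$kept" | wc -l | tr -d ' ')
```

Timing both variants (for instance with time) on a real file is the only way to tell whether grep or a perl filter wins on your data.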

And finally, if you do manage to remove all the Not sample CC lines, it is worth trying the following to see if it is faster (caveat: the results in %inp will be in no particular order, not in order of insertion as with arrays):

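open(my $fh, '<', 'test.txt') or die "Cannot open test.txt: $!";
my %inp = ();
while (<$fh>) {
    # shortest possible first field is the key, the next non-space run is the value
    if (/^(.+?)\s+(\S+)/) { $inp{$1} = $2 }
}
close($fh);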

Edit: If you want to pass the output of your command above to another command for further processing, the problem of waiting for one process to finish before the next can start was solved a long time ago. It is called a pipeline, and it is what you see in the unix style cmd1 | cmd2 | cmd3 ... . cmd1 starts producing output as soon as it reads its input (if it is a simple program like yours above); that output is read immediately by cmd2, which emits its own output as soon as it has processed the first line, and so on to cmd3. The first result therefore appears as soon as cmd1 has read its first line of input, plus the propagation time through the pipeline. So you save a lot of time and results start coming out almost immediately. The proviso is that processing one line or chunk of input must be independent of the lines that follow.
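A sketch that makes the streaming visible. The three shell functions are hypothetical stand-ins for cmd1, cmd2 and cmd3; the producer pauses for a second mid-stream, and the last stage stamps each line with its arrival time, so the first line can be seen to come out before the producer has finished.

```shell
#!/bin/sh
# cmd1: a slow producer (stand-in for a program still working through its input).
producer() {
    printf '%s\n' 'alpha 10'
    sleep 1
    printf '%s\n' 'Not sample CC'
    printf '%s\n' 'beta 20'
}

# cmd2: drop unwanted lines, one line at a time.
drop_noise() {
    while IFS= read -r line; do
        case $line in
            'Not sample CC'*) ;;                  # skip
            *) printf '%s\n' "$line" ;;           # pass through
        esac
    done
}

# cmd3: prefix each line with the time (epoch seconds) at which it arrived.
stamp() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date +%s)" "$line"
    done
}

result=$(producer | drop_noise | stamp)
printf '%s\n' "$result"
```

One thing to watch for with real tools: many programs buffer their output in blocks when writing to a pipe, which delays what the next stage sees. GNU grep has --line-buffered for this, and in Perl setting $| = 1 turns on autoflush for the selected output handle.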

In reply to Re: About text file parsing by bliako
in thread About text file parsing by dideod.yang
