Re: perl performance vs egrep

by exussum0 (Vicar)
on Jan 23, 2005 at 13:55 UTC ( [id://424384] )


in reply to perl performance vs egrep

Let's step back and discuss what you'd like to happen. You are looking for some arbitrary text in 10 files. When you do a basic search of a file using grep, some data is slurped into memory from disc (usually) and then the text is searched for in memory.
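
In Perl terms, that basic sequential search amounts to something like the sketch below. The pattern and the use of @ARGV for the file list are illustrative, not from the original question:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $pattern = qr/some arbitrary text/;   # illustrative pattern

    for my $file (@ARGV) {
        # read each file from disc, scan it in memory line by line
        open my $fh, '<', $file or do { warn "$file: $!\n"; next };
        while (my $line = <$fh>) {
            print "$file: $line" if $line =~ $pattern;
        }
        close $fh;
    }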

If you can split this into many jobs, even as 10 egrep processes, you may be able to do better. In any modern OS, while one process is waiting on the disc, the CPU is mostly idle for that moment. Another process could grab the CPU for that time and "do stuff".

Have you tried running the searches in parallel? I doubt egrep does multiprocess/threaded searching on its own, though I wouldn't rule it out.
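
One way to try it from Perl is to fork one child per file, so a child can burn CPU on its pattern match while a sibling is blocked on the disc. A minimal sketch (the pattern is illustrative):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $pattern = qr/some arbitrary text/;   # illustrative pattern

    for my $file (@ARGV) {
        my $pid = fork;
        die "fork failed: $!" unless defined $pid;
        next if $pid;                  # parent: launch the next child

        # child: search its one file, then exit
        open my $fh, '<', $file or exit 1;
        while (my $line = <$fh>) {
            print "$file: $line" if $line =~ $pattern;
        }
        exit 0;
    }

    # parent: wait until every child has finished
    1 while wait != -1;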

----
Give me strength for today.. I will not talk it away..
Just for a moment.. It will burn through the clouds.. and shine down on me.

Replies are listed 'Best First'.
Re^2: perl performance vs egrep
by ambrus (Abbot) on Jan 23, 2005 at 14:18 UTC

    I wouldn't think threaded searching would help. The operating system knows that many programs read files sequentially, so if you read some part of a file from disk, the OS will probably read further ahead while the disk is free, so that the program can access the rest of it faster. (You may even be able to tune this behaviour with the posix_fadvise function.)
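
    Perl's core modules don't expose posix_fadvise directly, but the CPAN module IO::AIO wraps it. Assuming IO::AIO is installed, hinting sequential access might look like this (the file name is illustrative):

        use strict;
        use warnings;
        use IO::AIO;    # CPAN module that wraps posix_fadvise(2)

        open my $fh, '<', 'some.log' or die "open: $!";

        # Hint that we will read the whole file sequentially, so the
        # kernel can read ahead more aggressively (length 0 = to EOF).
        IO::AIO::fadvise $fh, 0, 0, IO::AIO::FADV_SEQUENTIAL;

        while (my $line = <$fh>) {
            # ... scan the line for the pattern ...
        }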

      My first thought about making the searching multi-threaded is that the disk then has to read from multiple files (one for each thread). This probably means the read head on the hard drive will have to move around the surface of the disk more than it would if it just read each file sequentially.

      It is hard to predict which approach will give the best read performance. I think it's reasonable to assume that your OS and filesystem try to keep the files stored sequentially on the disk, so I would expect that searching each file in sequence is probably faster.

      You might want to time running 'wc' on all of the files in sequence vs. at different levels of concurrency to see what works best for reading the data.
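
      A rough harness for that experiment, reading the files from Perl rather than shelling out to 'wc' (a sketch, assuming a plain fork per file; note that the OS page cache from the first pass will skew the second unless you use a fresh set of files or drop caches between passes):

          use strict;
          use warnings;
          use Time::HiRes qw(gettimeofday tv_interval);

          my @files = @ARGV;

          # Read a file end to end, discarding the data.
          sub slurp {
              open my $fh, '<', $_[0] or return;
              1 while <$fh>;
          }

          # Pass 1: all files in sequence.
          my $t0 = [gettimeofday];
          slurp($_) for @files;
          printf "sequential: %.2fs\n", tv_interval($t0);

          # Pass 2: one child per file, all at once.
          $t0 = [gettimeofday];
          for my $file (@files) {
              my $pid = fork;
              die "fork failed: $!" unless defined $pid;
              if (!$pid) { slurp($file); exit 0 }
          }
          1 while wait != -1;
          printf "concurrent: %.2fs\n", tv_interval($t0);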

      The whole point is that your disk is probably much slower than your CPU, so it will be the much bigger bottleneck, especially if you start moving the read head around a lot.
      Buffering will help, no doubt, but at some point the process will have to block to read in a couple of bytes or a large chunk of data. Also, with dual CPUs, having both chug away could be advantageous.

      ----
      Give me strength for today.. I will not talk it away..
      Just for a moment.. It will burn through the clouds.. and shine down on me.
