reading file

by saranperl (Initiate)
on Aug 04, 2009 at 06:59 UTC
saranperl has asked for the wisdom of the Perl Monks concerning the following question:

I have a file which is 1 GB in size and contains email IDs. I want to remove the duplicate emails from that file. I tried "open file ">file name" while(file){}" but it takes too much time. Please guide me on how to process it faster. My RAM size is 2 GB.

Re: reading file
by vinoth.ree (Parson) on Aug 04, 2009 at 07:03 UTC

    What is your RAM size?

      2 GB

        Whenever you update your post, please let us know what you changed. You did update your post, right?

Re: reading file
by moritz (Cardinal) on Aug 04, 2009 at 07:31 UTC
    I tried "open file ">file name" while(file){}" but it takes too much time

    That's understandable: while (file) { ... } is an infinite loop that doesn't read anything. I suggest using while (<file>) { ... } instead.
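
    A minimal sketch of a corrected loop that also weeds out duplicates as it reads (the filenames are placeholders of mine, not from the original post; note that %seen has to hold every unique address, so a 1 GB input will eat well into your 2 GB of RAM):

        use strict;
        use warnings;

        open my $in,  '<', 'emails.txt'      or die "Can't read emails.txt: $!";
        open my $out, '>', 'emails_uniq.txt' or die "Can't write emails_uniq.txt: $!";

        my %seen;
        while ( my $line = <$in> ) {    # <$in> reads one line per pass
            chomp $line;
            print {$out} "$line\n" unless $seen{$line}++;   # first sighting only
        }
        close $out;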

      OK, OK, I missed the <>. But even when I read it like that, it still takes a long time.
Re: reading file
by dsheroh (Parson) on Aug 04, 2009 at 09:07 UTC
    If you profile your program, you will probably find (after removing the infinite loop that moritz pointed out) that it spends most of its time waiting for data to be read from or written to the hard drive. File I/O takes a certain amount of time (and is relatively slow), and there's not really anything you can do about that in Perl. In some cases, operating-system-level tuning may help somewhat, but even that rarely does much, since this is primarily a hardware limitation.
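
    For example, one option (my suggestion, not necessarily what dsheroh used; any profiler will do) is Devel::NYTProf from CPAN:

        $ perl -d:NYTProf yourscript.pl   # yourscript.pl stands in for your program
        $ nytprofhtml                     # turns the nytprof.out dump into an HTML report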

    To illustrate this slowness:

    $ time cat [a random 700M file] > /dev/null
    real    0m46.649s
    user    0m0.020s
    sys     0m2.278s
    For a 1G file, we could reasonably expect the time required to be on the order of 67 seconds (46.6 s scaled by 1000/700). And that's just to read the contents of the file and throw them away without doing any processing.

    For comparison, try copying your existing file to a new file using your operating system's normal file-copy tools. Your program does a copy plus some extra work on top, so you can reasonably assume there's no way it can run faster than a plain copy.
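
    For example (the filenames are placeholders):

        $ time cp emails.txt emails_copy.txt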

    If you're not just running into an I/O limit, then anything which can be done in your code to speed things up would be inside the while loop, which you haven't shown us.

      60 seconds is considerable, but on my side it is taking around 15 minutes. Please tell me another way to read the file.
        Show us the actual code you are running. Put your script between "<code>" and "</code>" tags when you post it.
Re: reading file
by roboticus (Canon) on Aug 04, 2009 at 12:49 UTC
    saranperl:

    You don't even need to use perl for that. If you're on a *NIX box, you could use the following command:

    sort -u original_file >new_file

    But since you're here, you might want to know how to do it in Perl. You could do it in roughly the same way the sort command does (a sketch follows the list):

    1. Read the file into an array
    2. Sort the array, so all duplicates will be next to each other
    3. Scan through the array and remove adjacent duplicates
    4. Write the array to the new output file
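
    A sketch of those four steps (the filenames are made up for illustration):

        use strict;
        use warnings;

        # 1. Read the file into an array.
        open my $in, '<', 'original_file' or die "Can't read original_file: $!";
        chomp( my @emails = <$in> );
        close $in;

        # 2. Sort the array so all duplicates end up next to each other.
        @emails = sort @emails;

        # 3. Scan through, keeping an element only when it differs from the last kept.
        my @unique;
        for my $email (@emails) {
            push @unique, $email if !@unique or $email ne $unique[-1];
        }

        # 4. Write the array to the new output file.
        open my $out, '>', 'new_file' or die "Can't write new_file: $!";
        print {$out} "$_\n" for @unique;
        close $out;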

    Give it a try and let us know if and where you get stuck!

    ...roboticus
