PerlMonks  

Re^4: Comparing and getting information from two large files and appending it in a new file

by perlkhan77 (Acolyte)
on Mar 31, 2012 at 23:37 UTC ( #962809 )


in reply to Re^3: Comparing and getting information from two large files and appending it in a new file
in thread Comparing and getting information from two large files and appending it in a new file

Thanks graff. The code takes 4115 wallclock secs (4112.37 usr + 0.28 sys = 4112.65 CPU) to run on one methylome file similar to file2 in the comments above.

I have been programming for quite some time, but I have mostly used cookbook solutions to make things work, and that seems to have stunted my learning of Perl (and of programming as a whole) when it comes to techniques like indexing. What would be a good way to learn such tricks as indexing etc.? Thanks again for all your help.


Re^5: Comparing and getting information from two large files and appending it in a new file
by graff (Chancellor) on Apr 01, 2012 at 00:13 UTC
    What would be a good way to learn such tricks as indexing etc.

    In the context of setting up a table on a reasonably mature relational database engine (MySQL, PostgreSQL, etc.), creating an index on a given field is simply a matter of telling the database engine that you want that field to be indexed. The engine handles the rest for you -- that's one of the things that makes database engines so attractive. For example:

    CREATE TABLE genome (
        id     int AUTO_INCREMENT PRIMARY KEY,
        start  int,
        end    int,
        strand varchar(6),
        ...            -- include other fields as needed
        INDEX (start),
        INDEX (end)
        ...            -- index other fields that are often useful in queries
    )
    Once you declare that those two integer fields are to be indexed, MySQL takes care of all the back-end work to make sure that queries like the following execute with a minimum of time spent searching and comparing:
    -- assume your current input data record has a "position" value of 8765:
    select * from genome where start <= 8765 and end >= 8765
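    To make the idea concrete, here is a minimal sketch of the same table and range query driven from Perl through DBI. It assumes DBD::SQLite is available (SQLite needs no server; MySQL or PostgreSQL work the same way through DBI, just with a different DSN). The table, column names, and the toy rows are taken from or invented for the example above, not from the poster's real data; "end" is quoted because it is a reserved word in several SQL dialects.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# In-memory SQLite database for illustration; swap the DSN for
# 'dbi:mysql:...' or 'dbi:Pg:...' to use a server-based engine.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1 });

$dbh->do(q{
    CREATE TABLE genome (
        id     INTEGER PRIMARY KEY,
        start  INTEGER,
        "end"  INTEGER,
        strand TEXT
    )
});

# Declaring the indexes is all you do; the engine maintains them.
$dbh->do('CREATE INDEX idx_start ON genome (start)');
$dbh->do('CREATE INDEX idx_end   ON genome ("end")');

# Two toy rows standing in for real annotation records.
my $ins = $dbh->prepare(
    'INSERT INTO genome (start, "end", strand) VALUES (?, ?, ?)');
$ins->execute(@$_) for [100, 5000, '+'], [8000, 9000, '-'];

# Indexed range lookup for position 8765.
my $rows = $dbh->selectall_arrayref(
    'SELECT start, "end", strand FROM genome
     WHERE start <= ? AND "end" >= ?', undef, 8765, 8765);
printf "%d..%d (%s)\n", @$_ for @$rows;
```

    With the rows above, only the second record spans position 8765, so the query returns a single row.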
    As for the benchmark results you reported: do you consider those to be good or bad? How much longer is that than the time it takes just to read the larger input file and do little else? E.g., how long does it take to run a simple one-liner, like:  perl -ne '$sz+=length(); END {print $sz,"\n"}'
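    The same baseline measurement can be done as a small script using only core modules, which makes the "read and do little else" timing explicit. This is a sketch, not graff's code; the default file name 'methylome.txt' is a hypothetical placeholder for the real input file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Read a file line by line and total its bytes -- the minimal work
# any line-oriented solution must do, hence a floor for run time.
sub scan_bytes {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $sz = 0;
    $sz += length while <$fh>;
    close $fh;
    return $sz;
}

if (@ARGV) {
    my $t0    = [gettimeofday];
    my $bytes = scan_bytes($ARGV[0]);
    printf "%d bytes in %.2f wallclock secs\n",
           $bytes, tv_interval($t0);
}
```

    Comparing this floor against the 4115-second figure tells you how much of the run time is the algorithm rather than sheer I/O.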
