Re^2: Comparing and getting information from two large files and appending it in a new file

by perlkhan77 (Acolyte)
on Mar 31, 2012 at 05:01 UTC


in reply to Re: Comparing and getting information from two large files and appending it in a new file
in thread Comparing and getting information from two large files and appending it in a new file

Hi graff, I was going over your code and I think there are a few issues near the beginning that you might suggest a way out of.

$methrange{$bgn}{$end} = $_;

Here you are using begin and end as keys of the HoH %methrange, but some CDS entries in the file can have the same start and end while belonging to different gene IDs. How can we account for such anomalies using your code? Also, I tried running your code for about half an hour and it did not produce any result, so efficiency is still an issue here. What would you say about trying DBD::SQLite or MySQL for this task? Would the overhead be more or less?
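
For illustration (made-up coordinates and gene IDs), this is the kind of overwrite I mean:

    use strict;
    use warnings;

    # Two hypothetical CDS lines sharing the same start/end range
    # but belonging to different gene IDs:
    my @lines = (
        "Gm10 src CDS 100 200 . + . gene_id GeneA;",
        "Gm10 src CDS 100 200 . + . gene_id GeneB;",
    );

    my %methrange;
    for ( @lines ) {
        my ( $bgn, $end ) = (split)[3,4];
        $methrange{$bgn}{$end} = $_;   # the GeneB line silently replaces GeneA
    }

    print scalar keys %{ $methrange{100} }, "\n";   # prints 1 -- only one entry survives
    print "$methrange{100}{200}\n";                 # the GeneB line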


Re^3: Comparing and getting information from two large files and appending it in a new file
by graff (Chancellor) on Mar 31, 2012 at 22:42 UTC
    I tried running your code for about half an hour and it did not produce any result, so efficiency is still an issue here.

    Given the larger sample of data that you gave for the first file, it's clear that the code in my initial reply wasn't working as intended. Sorry about that. I'll give it one more try and let you know how that comes out, but in the meantime...

    What would you say about trying DBD::SQLite or MySQL for this task? Would the overhead be more or less?

    Actually, when you've got a clear layout of the logical structure of the inputs, the decision process, and the desired outputs, it's reasonably likely that a relational-table/SQL-based solution will help a lot. It has the same prerequisite as writing a Perl script that gets the right answers: having the right way to describe the task. Once you have that, it'll probably be a lot easier to write a suitable SQL statement to express the conditions and accomplish the operations that need to be done. If nothing else, just being able to deal with relevant subsets of the data at a time, rather than having to slog through one huge, monolithic stream, would be a win.

    I don't know enough about SQLite to say how good it is at optimizing queries, but MySQL is easy to set up, easy to use, and quite effective for very large data sets (and it's very well documented). As long as you apply indexing to the table fields that are used a lot in queries, you should see pretty zippy results. You get a lot of built-in efficiency for free.
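
    For example, a rough sketch of that setup from Perl, using DBI with DBD::SQLite (table, column, and file names here are illustrative; I use "fin" rather than "end" because END is a reserved word in SQLite):

        use strict;
        use warnings;
        use DBI;

        # File-backed SQLite database, created on first connect
        my $dbh = DBI->connect( "dbi:SQLite:dbname=meth.db", "", "",
                                { RaiseError => 1, AutoCommit => 0 } );

        # One row per range from the first file
        $dbh->do( "CREATE TABLE IF NOT EXISTS methrange
                       ( bgn INTEGER, fin INTEGER, line TEXT )" );
        $dbh->do( "CREATE INDEX IF NOT EXISTS idx_bgn ON methrange (bgn)" );
        $dbh->do( "CREATE INDEX IF NOT EXISTS idx_fin ON methrange (fin)" );

        my $ins = $dbh->prepare( "INSERT INTO methrange (bgn, fin, line) VALUES (?,?,?)" );
        open( my $gtf, '<', 'Methylation.gtf' ) or die $!;
        while ( <$gtf> ) {
            next unless /^\s*Gm10/;
            my ( $bgn, $end ) = (split)[3,4];
            $ins->execute( $bgn, $end, $_ );
        }
        $dbh->commit;

        # For each position from the second file, fetch only the covering ranges
        my $qry = $dbh->prepare( "SELECT line FROM methrange WHERE bgn <= ? AND fin >= ?" );
        $qry->execute( 8765, 8765 );   # a made-up position
        while ( my ( $line ) = $qry->fetchrow_array ) {
            print $line;
        }
        $dbh->disconnect;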

Re^3: Comparing and getting information from two large files and appending it in a new file
by graff (Chancellor) on Mar 31, 2012 at 23:21 UTC
    ... you are using begin and end as keys of the HoH %methrange, but some CDS entries in the file can have the same start and end while belonging to different gene IDs. How can we account for such anomalies using your code?

    Here's a version that will keep track of all possible matches, even when two entries from the first file happen to have identical "Start - End" ranges. I tested that by adding a "dummy" entry to your larger data sample for the "Methylation.gtf" file, copying one of the lines that would match, and changing the "Gene" field to make it distinguishable.

    Below are the updated code, the two input files, and the resulting output (file name "res_11268_10.txt" as per the OP code). If I remove my "test" line from the gtf file, I get just two lines of output instead of three. (I also ran the OP code on the same data, and got the same result, once I handled the initial spaces in the gtf data.)
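
    In outline, the change keys %methrange on start, end, and the full input line, so identical ranges no longer collide. A sketch of that logic (the second file's name and column positions are placeholders, not the exact posted code):

        use strict;
        use warnings;

        my %methrange;

        # First file: remember every matching line for each start/end range
        open( my $gtf, '<', 'Methylation.gtf' ) or die $!;
        while ( <$gtf> ) {
            next unless /^\s*Gm10/;
            my ( $bgn, $end, $methcount ) = (split)[3,4,10];
            $methrange{$bgn}{$end}{$_} = $methcount;   # third key level: the whole line
        }

        # Second file: for each position, count the class on every covering range
        my %methhash;
        open( my $meth, '<', 'methylome.txt' ) or die $!;   # placeholder file name
        while ( <$meth> ) {
            my ( $position, $class ) = (split)[1,3];        # placeholder columns
            for my $bgn ( keys %methrange ) {
                next if $position < $bgn;
                for my $end ( keys %{ $methrange{$bgn} } ) {
                    next if $position > $end;
                    for my $match ( keys %{ $methrange{$bgn}{$end} } ) {
                        $methhash{$match}{$class}++;
                        $methhash{$match}{methcount} = $methrange{$bgn}{$end}{$match};
                    }
                }
            }
        }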

    I'd be curious how the timing looks on a larger data set (but I expect a proper mysql solution would run faster).

      Thanks, graff. The code takes around 4115 wallclock secs (4112.37 usr + 0.28 sys = 4112.65 CPU) to run one methylome file similar to file2 in the comments above.

      I have been programming for quite some time, but I have mostly used cookbook solutions to make things work, and that seems to have stunted my learning of Perl, and of programming as a whole, when it comes to techniques like indexing. What would be a good way to learn such tricks as indexing, etc.? Thanks again for all your help.

        What would be a good way to learn such tricks as indexing, etc.?

        In the context of setting up a table on a reasonably mature relational database engine (MySQL, Postgres, etc.), creating an index on a given field is simply a matter of telling the database engine that you want that field to be indexed. The engine handles the rest for you -- that's one of the things that makes database engines so attractive. For example:

        CREATE TABLE genome (
            id int AUTO_INCREMENT PRIMARY KEY,
            start int,
            end int,
            strand varchar(6),
            ...                -- include other fields as needed
            INDEX (start),
            INDEX (end)
            ...                -- index other fields that are often useful in queries
        )
        By declaring that those two integer fields are to be indexed, MySQL will take care of all the back-end work to make sure that queries like the following execute with a minimum amount of time spent searching and comparing:
        -- assume your current input data record has a "position" value of 8765:
        select * from genome where start <= 8765 and end >= 8765
        As for the benchmark results you reported, do you consider those to be good or bad? How much longer is that than the time it takes just to read the larger input file and do little else? (E.g., how long does it take to run a simple one-liner like:  perl -ne '$sz+=length(); END {print $sz,"\n"}' ?)

      By indexing tricks I was referring to Perl; I am aware of the indexing that happens in MySQL. Also, the time I gave in the previous post was for your previous program. Your new program takes 4297 wallclock secs (4294.19 usr + 0.29 sys = 4294.48 CPU).

      As for the benchmark results you reported, do you consider those to be good or bad? How much longer is that than the time it takes just to read the larger input file and do little else?

      It does not take long to read the larger file; what takes longer is the looping and the matching. Previously there were two loops (a for inside a while), but now there are three (a for inside a for inside a while) because of the HoHoH. As a quick-fix solution this is OK, but I will eventually have to use MySQL. Even in MySQL I would have to update a particular class count every time a position falls within that range for that gene, but hopefully it would still be faster than this.
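
      In DBI terms I imagine that per-position update as a single indexed statement; a rough sketch (table, column, and file names are made up, with one counter column per class):

          use strict;
          use warnings;
          use DBI;

          # Hypothetical schema: a "genome" table with indexed start/end
          # columns and one counter column per class (cg, chg, chh).
          my $dbh = DBI->connect( "dbi:mysql:database=meth", "user", "secret",
                                  { RaiseError => 1 } );

          # One prepared statement per class, reused for every input record
          my %upd;
          for my $class ( qw( cg chg chh ) ) {
              $upd{$class} = $dbh->prepare(
                  "UPDATE genome SET $class = $class + 1 WHERE start <= ? AND end >= ?" );
          }

          open( my $in, '<', 'methylome.txt' ) or die $!;   # made-up file name
          while ( <$in> ) {
              my ( $position, $class ) = (split)[1,3];      # made-up columns
              next unless $upd{ lc $class };
              $upd{ lc $class }->execute( $position, $position );
          }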

      Thanks again for all the help.

      Hi graff, one more thing about the result file being produced: the number of lines with the Gm10 assembly in Methylation.gtf is 26637, so the final result should have that same number of lines, with a value of 0 for those genes that have no CG, CHG, or CHH count. Right now it prints only 8626 lines, including the header. Sorry to trouble you about this, but could you let me know what changes I should make in the code to make that possible?

      Thanks again

        To get one line of output for every line in your first input file, only a few changes are needed:
        ...
        my %methrange;
        my %methhash;   # this line had been further down in my prev. version; just move it up
        ...
        if ( /^\s*Gm10/ ) {
            my ( $bgn, $end, $methcount ) = (split)[3,4,10];
            $methrange{$bgn}{$end}{$_} = undef;
            $methhash{$_}{methcount} = $methcount;   # moved up from below
        }
        ...
        for my $end ( keys %{$methrange{$bgn}} ) {
            if ( $position <= $end ) {
                for my $match ( keys %{$methrange{$bgn}{$end}} ) {
                    $methhash{$match}{$class}++;   # methcount was moved from here
                }
            }
        }
        ...
        As for the benchmark, you said that your OP version "took forever". Was "forever" more than an hour and a half? (Did my version yield any improvement at all?) Do you have specific constraints about how much time can be taken up by a single run? If not, I'd say focus more on making sure the output is correct, rather than how long it takes to produce the output.

      Yours is a lot faster than my code; mine did not yield a result even after running for six hours. Thanks.
