
Parsing file

by maverick.usb (Initiate)
on Apr 09, 2010 at 04:46 UTC (#833697)
maverick.usb has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys,

I have a large text file that holds a bunch of records. Each record starts with a header line that begins with the ">" character, followed by several lines containing sequences of letters. For example, a file might look like:

    >TEXT ID=2L TEXT
    ABCDEDKGFKGJDJED
    ALDKDKKFJFJGF
    >TEXT ID=3R TEXT
    FGDFGKDFSKDSLD
    FGFDGDSFFDG
    FDGF

I can send you the actual file if you would like to see it.

All I want is a Perl script that takes as input a file in the following format:

    ID:START-STOP SEQUENCE

where ID is an identifier for the record (in the example above, record 1 has the identifier 2L and record 2 has the identifier 3R), and START and STOP are the first and last positions of a subsequence in the record. Then there is a space and a SEQUENCE. I want the script to use the ranges to pull the subsequence out of the big record file and then check whether it matches the SEQUENCE in the input.

The big record file is too large to load into memory all at once, so it needs to be streamed. The output would be the subsequence from the record, a space, and YES or NO depending on whether it matches.

Thanks

Replies are listed 'Best First'.
Re: Parsing file
by cdarke (Prior) on Apr 09, 2010 at 08:53 UTC
    Assuming you have not written any code yet, this should get you started:
    use warnings; and use strict;
    open the file
    Write a while loop that reads each record
    Within the loop, check if the record is of interest, probably using a regular expression with the m// match operator.
    "pull out the subsequence" (whatever that means).
    End the loop.
    close the input file.
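
    The steps above could be sketched roughly as follows. This is only one possible reading of the question, not a tested solution for the real data: it assumes START and STOP are 1-based and inclusive, that the record ID appears in the header as ID=..., and that a record's sequence is its lines joined together. The sub names read_queries and check_records are made up for illustration, the sample data follows the example in the question, and in-memory filehandles stand in for the real files. Results come out in the order the records appear in the big file.

```perl
use strict;
use warnings;

# Parse query lines of the form "ID:START-STOP SEQUENCE" into a hash keyed by ID.
sub read_queries {
    my ($fh) = @_;
    my %queries;
    while (my $line = <$fh>) {
        next unless $line =~ /^(\S+):(\d+)-(\d+)\s+(\S+)/;
        push @{ $queries{$1} }, { start => $2, stop => $3, seq => $4 };
    }
    return \%queries;
}

# Read the big file one record at a time; return one result line per query.
sub check_records {
    my ($fh, $queries) = @_;
    my @out;
    local $/ = "\n>";                       # a record ends where the next header begins
    while (my $record = <$fh>) {
        chomp $record;                      # drop the trailing "\n>" separator, if any
        $record =~ s/^>//;                  # only the first record still carries its ">"
        my ($header, @lines) = split /\n/, $record;
        next unless defined $header && $header =~ /\bID=(\S+)/;
        my $hits = $queries->{$1} or next;  # skip records nobody asked about
        my $sequence = join '', @lines;
        $sequence =~ s/\s+//g;              # remove stray whitespace such as "\r"
        for my $q (@$hits) {
            # Assumption: START and STOP are 1-based and inclusive.
            my $len = $q->{stop} - $q->{start} + 1;
            my $in_range = $q->{start} >= 1 && $len > 0
                        && $q->{start} <= length $sequence;
            my $sub = $in_range ? substr($sequence, $q->{start} - 1, $len) : '';
            push @out, "$sub " . ($sub eq $q->{seq} ? 'YES' : 'NO');
        }
    }
    return @out;
}

# Demo on the sample from the question; replace the string references
# with file names to read real files.
my $records = ">TEXT ID=2L TEXT\nABCDEDKGFKGJDJED\nALDKDKKFJFJGF\n"
            . ">TEXT ID=3R TEXT\nFGDFGKDFSKDSLD\nFGFDGDSFFDG\nFDGF\n";
my $queries = "2L:15-18 EDAL\n3R:1-4 FGDA\n";

open my $qfh, '<', \$queries or die "Cannot open queries: $!";
open my $rfh, '<', \$records or die "Cannot open records: $!";
print "$_\n" for check_records($rfh, read_queries($qfh));
close $qfh;
close $rfh;
# prints:
# EDAL YES
# FGDF NO
```

    Only the query list and one record are held in memory at any time. Setting $/ to "\n>" is what makes each read return a whole record instead of a single line.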
Re: Parsing file
by BioLion (Curate) on Apr 09, 2010 at 10:39 UTC

    Apart from what other people have said, I thought I should add that many of the things you want to do are fairly standard, and there is a lot of help out there already - check out perl and bioinformatics as a start.

    On BioPerl you will probably find there are parsers for your format, and once you have your sequences as a standard object, there are many methods for retrieving subsequences, matching, etc.

    If your files are really that huge, I would say that you probably need to process them one record at a time, or use a database - again there is help on BioPerl for doing this, but I think you probably won't need to go that far.
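
    If even a single record is more than you want to hold in memory, a step that stops well short of a database is to read line by line and track the position within the current record, keeping only the pieces that overlap a requested range. The sketch below is one way to do that using core Perl only. It assumes 1-based inclusive positions and an ID=... field in the header; the sub name stream_check and the shape of the query hash are invented for this example.

```perl
use strict;
use warnings;

# Stream the record file line by line, keeping only the pieces of each
# requested range. $queries looks like { ID => [ { start, stop, seq }, ... ] }.
sub stream_check {
    my ($fh, $queries) = @_;
    my (@out, $active, $offset);

    # Report every query belonging to the record that just ended.
    my $flush = sub {
        for my $q (@{ $active || [] }) {
            my $got = delete $q->{got};
            $got = '' unless defined $got;
            push @out, "$got " . ($got eq $q->{seq} ? 'YES' : 'NO');
        }
    };

    while (my $line = <$fh>) {
        $line =~ s/\s+\z//;                   # strip newline and trailing whitespace
        if ($line =~ /^>/) {                  # header line: a new record starts
            $flush->();
            $active = $line =~ /\bID=(\S+)/ ? $queries->{$1} : undef;
            $offset = 0;                      # letters of this record seen so far
            next;
        }
        next unless $active;                  # nobody asked about this record

        # This line covers positions $first..$last of the record (1-based).
        my ($first, $last) = ($offset + 1, $offset + length $line);
        for my $q (@$active) {
            my $from = $q->{start} > $first ? $q->{start} : $first;
            my $to   = $q->{stop}  < $last  ? $q->{stop}  : $last;
            next if $from > $to;              # range does not touch this line
            $q->{got} .= substr($line, $from - $first, $to - $from + 1);
        }
        $offset = $last;
    }
    $flush->();                               # the last record has no following header
    return @out;
}

# Demo on the sample from the question, read from an in-memory filehandle.
my $data = ">TEXT ID=2L TEXT\nABCDEDKGFKGJDJED\nALDKDKKFJFJGF\n"
         . ">TEXT ID=3R TEXT\nFGDFGKDFSKDSLD\nFGFDGDSFFDG\nFDGF\n";
open my $fh, '<', \$data or die "Cannot open data: $!";
print "$_\n" for stream_check($fh, {
    '2L' => [ { start => 15, stop => 18, seq => 'EDAL' } ],
    '3R' => [ { start => 1,  stop => 4,  seq => 'FGDA' } ],
});
close $fh;
# prints:
# EDAL YES
# FGDF NO
```

    Memory use here depends on the number and length of the requested ranges rather than on the size of any record, at the cost of a little bookkeeping per line.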

    Let us see what you have so far in terms of code, and an example of input and desired output too. HTH.

Re: Parsing file
by nagalenoj (Friar) on Apr 09, 2010 at 06:26 UTC
Re: Parsing file
by umasuresh (Hermit) on Apr 09, 2010 at 14:29 UTC
