PerlMonks  

how to isolate text in a text file.

by undergradguy (Novice)
on Dec 14, 2018 at 02:05 UTC ( [id://1227228]=perlquestion )

undergradguy has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks. I am an undergrad who has started to teach myself Perl from scratch. I know nothing about coding or anything similar, but this has grabbed my attention. So I come to you this day to seek your knowledge and guidance. I have a text file that has a DNA string in it. I want a way to isolate the DNA string and ignore the rest of the text file. Is this possible? Thank you for your time.

Replies are listed 'Best First'.
Re: how to isolate text in a text file.
by davido (Cardinal) on Dec 14, 2018 at 03:29 UTC

    Yes, possible.

    I don't know what your DNA strings might look like, or how big the input file may be. Those are relevant details that would need to be considered in crafting a solution. When indicating what a DNA string should look like, we would need to know if it can contain special characters embedded within it such as \n (newline), and if there is more than one per file. We could guess at that sort of thing, and enough DNA-related questions come up that we may actually get lucky and arrive at a sensible understanding of the problem. But it's better to be clear about it up front so we're not chasing a changing spec.

    Once you isolate that (or those) DNA sequence(s) I suspect there might be additional information you also need to gather, so rather than give the specification in dribs and drabs, let's hear all of it.

    Start out by reading perlintro, perlre and perlretut. Those will get you pointed in the right direction. Then take a stab at coming up with your own solution, and as you encounter hangups that you cannot get past, please do post them here with a small self-contained snippet of code (not necessarily your entire program, but enough code that we can run it). The snippet of code should be accompanied by sample input and expected output. You should then describe what problem you are having, and what aspect of the code you posted you cannot figure out. With such well-written questions we are happy to help.

    If you want to learn Perl, we can help you get there. Good books include Learning Perl (OReilly) and Intermediate Perl (OReilly). And there's also the Learn Perl in about two hours and thirty minutes website that can really help too.

    I would say the required reading for what you are trying to accomplish is at minimum that 2h30m website, and perlintro, and perlretut. That's about 4.5 hours of reading, which is not much to ask of someone who wants to learn to program, and who wants to enlist the help of those who have put in the time. Clearly it's only a start, but it will give you enough of a background to be able to ask sensible questions about those parts that are still opaque.


    Dave

      Dave, thank you for the suggestions. I will go check them out tomorrow before work. These are some good leads for me to follow and try out. I will post my progress when I make some. I can only do this in my free time, so it may take a while. Giovanni
Re: how to isolate text in a text file.
by 1nickt (Canon) on Dec 14, 2018 at 04:29 UTC

    Hi, welcome to Perl, the One True Religion. Please start with perlintro. It'll take you less than an hour and afterwards you'll be able to understand the concepts demonstrated below.

    Yes, this is possible using a Regular Expression to perform a Pattern Match.

    Copy this into a file and run it with $ perl <filename>:

    use strict; use warnings; use feature 'say';

    my $txt = q{
    Uninteresting line.
    More boring stuff.
    Here's the AATCCGCTAG string.
    Afterwards, more drivel.
    Footnotes, etc.
    };

    my @lines = split "\n", $txt;

    my $counter = 0;

    for my $line ( @lines ) {
        $counter++;
        next unless $line =~ /(\b[ACGT]+\b)/;
        say "Found $1 in line $counter";
    }

    __END__

    See perlrequick for the beginner's regular expression tutorial.
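Since the original question is about a text file rather than a string, the same match can be applied while reading line by line. The following is a minimal sketch under stated assumptions: the helper name find_dna_lines is hypothetical, the demo reads from an in-memory filehandle so it runs without any file on disk, and a "DNA string" is assumed to be an uppercase run of A, C, G and T standing alone as a word.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): scan a filehandle line by line
# and return [line_number, sequence] pairs for every run of A/C/G/T
# that stands alone as a word.
sub find_dna_lines {
    my ($fh) = @_;
    my @found;
    while ( my $line = <$fh> ) {
        # m//g in list context returns every captured match on the line
        for my $seq ( $line =~ /\b([ACGT]+)\b/g ) {
            push @found, [ $., $seq ];    # $. is the current input line number
        }
    }
    return @found;
}

# Demo on an in-memory filehandle. For a real file one would instead use
#   open my $fh, '<', 'dna.txt' or die "Cannot open dna.txt: $!";
my $sample = "Uninteresting line.\nHere's the AATCCGCTAG string.\nFootnotes, etc.\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
printf "Found %s in line %d\n", $_->[1], $_->[0] for find_dna_lines($fh);
close $fh;
```

Note that this pattern will also report ordinary words made only of those letters, such as a capital "A" standing alone, so real input may need a minimum length or a stricter definition.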

    Hope this helps!


    The way forward always starts with a minimal test.

      that's a nice script, 1nickt....

      $ ./1.dna.pl
      Found AATCCGCTAG in line 4
      $ cat 1.dna.pl
      #!/usr/bin/perl -w
      use 5.011;

      my $txt = q{
      Uninteresting line.
      More boring stuff.
      Here's the AATCCGCTAG string.
      Afterwards, more drivel.
      Footnotes, etc.
      };

      my @lines = split "\n", $txt;

      my $counter = 0;

      for my $line ( @lines ) {
          $counter++;
          next unless $line =~ /(\b[ACGT]+\b)/;
          say "Found $1 in line $counter";
      }

      __END__
      $
Re: how to isolate text in a text file.
by BillKSmith (Monsignor) on Dec 14, 2018 at 04:00 UTC
    Most questions concerning DNA input involve a file conforming to the fasta format (or a subset of it). In this case, we usually recommend a module to parse it.
    Bill
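To give a feel for what such a parser has to do, here is a minimal hand-rolled sketch. It is an illustration only, not a substitute for a tested module: the helper name read_fasta is hypothetical, and it assumes the simplest form of the format, where each record is a header line starting with ">" followed by one or more sequence lines.

```perl
use strict;
use warnings;

# Hypothetical helper: parse FASTA text from a filehandle into a list of
# [header, sequence] pairs. Assumes each record starts with a '>' header
# line followed by one or more lines of sequence.
sub read_fasta {
    my ($fh) = @_;
    my @records;
    while ( my $line = <$fh> ) {
        chomp $line;
        $line =~ s/\r\z//;                  # tolerate Windows line endings
        next if $line eq '';                # skip blank lines
        if ( $line =~ /\A>(.*)/ ) {
            push @records, [ $1, '' ];      # header: start a new record
        }
        elsif (@records) {
            $records[-1][1] .= $line;       # sequence line: append to current record
        }
    }
    return @records;
}

# Demo on an in-memory filehandle with two made-up records.
my $fasta = ">seq1 example\nAATCCG\nCTAG\n>seq2\nGATTACA\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";
print "$_->[0]: $_->[1]\n" for read_fasta($fh);
close $fh;
```

The point of the sketch is that a sequence may be split across several lines, which is why a line-by-line regex match alone is not enough for this format.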
Re: how to isolate text in a text file.
by bliako (Monsignor) on Dec 14, 2018 at 09:30 UTC

    The comments say it all. Good luck.

    bw bliako

    #!/usr/bin/env perl

    # author: bliako
    # for: https://perlmonks.org/?node_id=1227228
    # date: 14/12/2018

    use strict;
    use warnings;

    # our $biologists = new alchemists() and fork until pop(@the::bubble); do not tell until tied $all;

    # Below, set the complete pathname of the file with DNA strings
    my $filename = "dna.txt";

    # a DNA string is a sequence (any order) of A,C,G,T of at
    # least 5 items in length. Adjust 5 to suit you and
    # not detect TATA FOR NOW
    # or GATTAGA or ATTAC or CACA or GAGA or TAGA
    # similar detection to https://perlmonks.org/?node_id=1227232 by 1nickt
    my $regex = q/([ACGT]{5,})/;

    #
    # nothing to change below
    #
    my $FH;
    open($FH, '<', $filename) or die "opening file: $!";
    # slurp the file
    my $contents = undef;
    {local $/ = undef; $contents = <$FH> }
    close $FH;

    print "Possible DNA:\n$1\nEnd\n\n" while $contents =~ /$regex/gi
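The matching step of the script above can be isolated and tried on a short string. This is a minimal sketch, assuming a hypothetical helper dna_candidates that takes the slurped text and the minimum length as arguments and also records where each match starts. Because the pattern has no word boundaries and is case-insensitive, it will pick up letter runs inside ordinary words too, which the demo string is chosen to show.

```perl
use strict;
use warnings;

# Hypothetical helper: return every run of at least $min_len letters drawn
# from A, C, G, T (case-insensitive) found in $contents, together with the
# offset at which the run starts.
sub dna_candidates {
    my ( $contents, $min_len ) = @_;
    my @hits;
    while ( $contents =~ /([ACGT]{$min_len,})/gi ) {
        # $-[1] holds the start offset of capture group 1
        push @hits, { seq => $1, offset => $-[1] };
    }
    return @hits;
}

# Demo: "attack" contains a five-letter run that looks like DNA.
my $contents = "The attack on GATTACA.";
printf "%-8s at offset %d\n", $_->{seq}, $_->{offset}
    for dna_candidates( $contents, 5 );
```

Raising the minimum length, or adding \b anchors as in 1nickt's version, trades missed short sequences for fewer false positives.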

      I'm happy to see this post and will follow this thread. I'm interested in genealogy, but have had no place to start with it with perl. A few questions:

      Q1) If that text were the entire gene sequence for homo sapiens, how long would it be? How much variation is there in our populations of the cardinality of the gene sequence which represents a given person? Does it vary according to sex?

      Q2) For analyzing one's 23andMe data, how is that represented and searched?

      Q3) Is there a namespace in cpan developed for frequent tasks in this realm?

      Thank you for your comments,

        Q3) Is there a namespace in cpan developed for frequent tasks in this realm?

        Well there is BioPerl, the source for which is on CPAN. There are plenty of other dists in the Bio tree as well.

        Hi Aldebaran,

        I am by no means an expert in biology; I have only touched some of its more interesting parts through bioinformatics. There is a huge collection of computational and data resources available through the software package R, which is not Perl but is still free software. Through R one can download sequences, compare them, etc., BUT it has a steep learning curve, almost impossible to climb. Then of course there is the great effort of BioPerl at www.bioperl.org . Where I worked we all used R though, so I have not had much exposure to it.

        I will start with Vavilov's observation that the greatest genetic variation for a species is found where the species evolved. I don't know how much of this has been verified by data, but people certainly think the theory is valid. The greatest variation in the Homo sapiens (HS) genome is in the sub-Saharan region. Update: Vavilov was mainly talking about plants; transferring this to nomadic animals and humans may be a bit misleading, because animals travel and so correlating location and DNA will not work in many cases, especially today.

        Then I will mention that genetic variation starts because of mutations from external factors, like radiation and chemical exposure - not necessarily dramatic exposure: note that it was happening thousands of years ago when pollution was not a problem. As for radiation, we get it from the stars and from the Earth itself. It is believed (as in "who knows?", typical of the field of bios) that HS (mammal) females have a fixed stock of eggs and release one each month, whereas HS males replenish sperm at regular, short intervals. This means, I think, that female eggs accumulate mutations over their reproductive life, whereas a sperm cell that gets a mutation but is not used is soon replaced, and the mutation has no chance of being passed on to the offspring. (BTW mutations can lead to good as well as bad phenotypes (traits); probability favours the bad - as usual ;( .) So, I say, the contributions of the female and the male to their offspring's DNA are qualitatively different, as they reflect nature's past events at different time scales. Maybe related is that today we live twice as long as 5000 years ago (well, only in some countries, unfortunately), meaning more accumulation of mutations.

        In conjunction with mutations there are also mating and environmental selection. Some cells in the gonads of a member of the species get a mutation. The member finds a mate with a different mutation, they try to mate, and if successful the offspring inherits the mutation. The offspring grows (or dies if the mutation kills it) and then the environment takes over in test-driving it, sometimes crashing the individual into Kaeadas, the Chasm of Death, before it has had a chance to mate, whereas other times it crowns it king of the jungle, fertilising hundreds of eggs and passing on its genetic footprint. So, lots of chance events leading to lots of genetic variation.

        Mendel's genetic theories are too simplistic for today's data - if they were ever true outside his monastery's walls. But they are still taught in school, spreading the belief that a single gene's ON/OFF state is responsible for one physical characteristic (and if we find that gene we will cure blahblah - please donate, yeah right). That is so much rubbish installed in human brains that it will take a while to remove it and look at genetics as a complex system where everything plays a role and things are less binary than thought. This is mainly because people in the field are practical: if they can't handle the complexity, they will create a less complex universe and happily live in that. Looking at genetics as a whole remains a challenge, both in persuading biologists to even consider it and in applying it in practice and dealing with its immense complexity.

        Wikipedia says that typical genetic variation is at 0.6% of the 3.2 billion bases comprising the HS DNA. Now a base is one of A, T, G, C. A gene is a small sub-sequence of bases (10^2-10^6) in the whole genome and is the blueprint for cells to make proteins. So a defective gene, means a defective recipe which may lead to defective proteins.

        Proteins serve a few roles. I think two important roles are acting as messengers for signals between and inside cells, and causing or catalysing chemical reactions. A lot of their function and properties come from their physical structure and electro-chemical properties. Therefore, a molecule placed wrongly in the protein may cause it to like water rather than hate it (a major distinction between proteins) and function completely differently. However, there are examples of extreme fault-tolerance and fault-intolerance. Again, even with lots of data, scientific conclusions tend to differ depending on the year, the place and who made them. There is something for everyone when the machinery is oiled by Doleros Inc.

        I like to think of biological systems as an analogue computer. But I see nothing exciting in cracking it apart for the hack-challenge.

        Instead I get more excitement from working on the macro scales of society. There is a reason why, in the old story, Prometheus gave Fire to humans rather than genetically engineering them to have their middle finger act as a lighter.

        bw, bliako

Re: how to isolate text in a text file.
by AnomalousMonk (Archbishop) on Dec 16, 2018 at 06:09 UTC
    I have a text file that has a DNA string in it. I want a way to isolate the DNA string and ignore the rest of the text file.

    What's a DNA string? What's in the rest of the text that makes it differ from a DNA string so that we can ignore it? Clarity is the soul of programming.

    The following code is intentionally pitched above what I think is the level of your current understanding of Perl in the hope that it may spur your curiosity.

    Script extract_dna_seq_1.pl (runs under Perl version 5.8.9):

    Output:
    c:\@Work\Perl\monks\undergradguy>perl extract_dna_seq_1.pl
    sequence 'GATTACA'
    captured 5 sequences: 'A' 'ATCG' 'a' 'AATCCGCTGATT' 'act'
    captured 2 codon sequences: 'AATCCGCTGATT' 'act'

    The first attempt to define a DNA sequence regex
        my $base = qr{ [ATCGatcg] }xms;
        my $sequence = qr{ \b $base+ \b }xms;
    extracts
        5 sequences: 'A' 'ATCG' 'a' 'AATCCGCTGATT' 'act'
    from the text. Some of these, "A", "a" and "act", are clearly part of the text and not really DNA sequences. In an effort to refine sequence extraction, one can recognize that codon base-pair triplets are often the unit of interest and define
        my $codon  = qr{ $base{3} }xms;
        my $codons = qr{ \b $codon+ \b }xms;
    to extract codon sequences instead. This reduces spurious output, but "act" is still recognized as a valid sequence while "ATCG" is not (because it's not an exact number of codons long!). Well, that's life. Better knowledge of the data will allow greater refinement of the matching regexes and better extraction performance.
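The codon refinement can be tried in isolation. This is a minimal sketch, not the original script: the sample text in $text is invented so that it contains the same five candidate strings discussed above, and the repetition is written as (?: $base ){3} with an explicit group, which is assumed here to be equivalent to the $base{3} form and avoids any ambiguity with a hash subscript.

```perl
use strict;
use warnings;

# Invented sample text containing the candidates discussed above.
my $text = 'A sample: ATCG and a longer AATCCGCTGATT, act now.';

my $base   = qr{ [ATCGatcg] }xms;
my $codon  = qr{ (?: $base ){3} }xms;         # exactly three bases
my $codons = qr{ \b (?: $codon )+ \b }xms;    # whole word made of one or more codons

# m//g in list context returns every match of the pattern.
my @codon_seqs = $text =~ m{ $codons }xmsg;

printf "captured %d codon sequences: %s\n",
    scalar @codon_seqs, join ' ', map { "'$_'" } @codon_seqs;
```

With this text, only the twelve-base string and "act" have a length that is a multiple of three, so "A", "ATCG" and "a" drop out.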

    Some random notes:

    • The statements
          my $base = qr{ [ATCGatcg] }xms;
          my $sequence = qr{ \b $base+ \b }xms;
      illustrate the concept of "factoring" for regexes; factoring is also useful in designing functions and classes and is also a manifestation of the DRY principle.
    • The compound assignment statement
          my $n_captured =
          my @captures =
          $text =~ m{ $sequence }xmsg;
      captures a series of substrings from a match and then captures the number of substrings captured. What's going on here? For this, see:
      • The behavior of a  m//g match (see the  /g modifier in perlop (update: m// and qr// discussed in Regexp Quote-Like Operators section)) in list context (as imposed by assignment to the array; see Context tutorial); and
      • The behavior of an array (or list) in scalar context (imposed by assignment of the array to the scalar).
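The two behaviors listed above can be seen in a short, self-contained sketch. The sample text in $text is an assumption made for illustration; the regexes and the compound assignment are the ones quoted in this post.

```perl
use strict;
use warnings;

# Invented sample text for illustration.
my $text = 'A sample: ATCG and a longer AATCCGCTGATT, act now.';

my $base     = qr{ [ATCGatcg] }xms;
my $sequence = qr{ \b $base+ \b }xms;

# Right to left: the m//g match runs in list context because it is assigned
# to an array, so it returns every match. The array assignment is then
# evaluated in scalar context by the outer assignment, which yields the
# number of elements assigned.
my $n_captured =
my @captures =
    $text =~ m{ $sequence }xmsg;

print "captured $n_captured sequences: @captures\n";
```

The design point is that no explicit loop or counter is needed: context alone decides whether the match returns a list of substrings or a count.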
    Others have posted links to much useful info; please take advantage of it. That's all for now from me; perhaps more later. Please don't hesitate to post any questions you may have.

    Update: Slight wording, paragraph reorganization; should not be significant.


    Give a man a fish:  <%-{-{-{-<

      Wow! This is way more than I had hoped for monks. I was only looking for directions, I didn’t think the community would be this willing to help some dumb undergrad. Thank you all and this has more than soured my curiosity.
        Apologies that was typed on my phone while out.
        soured my curiosity

        ?

Re: how to isolate text in a text file.
by stevieb (Canon) on Dec 14, 2018 at 02:12 UTC

    Nice to meet you undergradguy, and welcome to the Monastery.

    What you're asking is definitely possible. Problem is, we're not a code writing service here. You need to show what you've tried and attempted first, *then* we help with the specific issues you're having.

    If you've come here in hopes to have homework questions answered, you're in the wrong place.

    You haven't even presented the question that requires an answer.

    It actually scares me to know that there are DNA examination people out in the field that started just like this... begging for answers.

      I am doing this all on my own. I am not working with others or on real projects, I am simply trying to learn how to do these things for my own benefit.
      It actually scares me to know that there are DNA examination people out in the field that started just like this... begging for answers.

      Generally (and not for this or any specific case whose details are unknown to me) my feeling too!

      But I still post an answer ... in the hope that they will discover that promised elixir a day sooner ... and that they won't change their minds the next day, after we have drunk it ...

      bw, bliako

Re: how to isolate text in a text file.
by Anonymous Monk on Dec 14, 2018 at 07:24 UTC
