PerlMonks  

Little pattern problem...

by bioinformatics (Friar)
on Aug 22, 2003 at 17:25 UTC (#285872=perlquestion)
bioinformatics has asked for the wisdom of the Perl Monks concerning the following question:

Hello Friends!!!
What is the best way to pattern match an unknown pattern? Allow me to explain... I have a file that contains a series of data values (microarray probe sets, to be specific) that I need to sort through. Technically, there should be 11 "probes" for each target (ex. 154115_at=target name), but there are not. Since the probes share a commonality (the target name), I need the program to take the target name from the first line and compare it to successive lines until one doesn't match. When that occurs, the mismatched data needs to become the new pattern to compare against. (The matching data needs to be further parsed and put on one tab-delimited line, but I know how to do that.) I'm familiar with pattern matching. However, I don't know how to designate an "unknown" pattern in Perl, since I can't go and write 22,000-some-odd patterns :-). A sample input file:
>probe:MOE430A:1415670_at(target name):549:177; Interrogation_Position=2436; Antisense;
GGCTGATCACATCCAAAAAGTCATG(probe sequence)
>probe:MOE430A:1415670_at:549:177; Interrogation_Position=2513; Antisense;
GAGGAAACGTTCACCCTGTCTACTA
>probe:MOE430A:1415670_at:467:433; Interrogation_Position=2521; Antisense;
GTTCACCCTGTCTACTATCAAGACA
>probe:MOE430A:1415670_at:254:643; Interrogation_Position=2533; Antisense;
TACTATCAAGACACTCGAAGAGGCT
>probe:MOE430A:1415670_at:54:269; Interrogation_Position=2556; Antisense;
CTGTGGGCAATATTGTGAAGTTCCT
>probe:MOE430A:1415670_at:405:339; Interrogation_Position=2583; Antisense;
GAATGCATCCTTGTGAGAGGTCAGA
>probe:MOE430A:1415670_at:60:395; Interrogation_Position=2597; Antisense;
GAGAGGTCAGACAAAGTGCCAGAAA
>probe:MOE430A:1415670_at:284:165; Interrogation_Position=2619; Antisense;
AAAACAAGAACACCCACACGCTGCT
>probe:MOE430A:1415670_at:622:145; Interrogation_Position=2634; Antisense;
ACACGCTGCTGCTAGCTGGAGTATT
>probe:MOE430A:1415670_at:291:661; Interrogation_Position=2804; Antisense;
TATCTTGTCCAACACTACGTCGAAG
>probe:MOE430A:1415670_at:146:701; Interrogation_Position=2956; Antisense;
TTGTCACCATGCCTGCAAGGAGAGA
>probe:MOE430A:1415671_at:116:525; Interrogation_Position=1156; Antisense;
GGAACAGGAATGTCGCAACATCGTA
>probe:MOE430A:1415671_at:655:137; Interrogation_Position=1173; Antisense;
ACATCGTATGGATTGCTGAGTGCAT
>probe:MOE430A:1415671_at:398:139; Interrogation_Position=1232; Antisense;
Any help is most appreciated!
Bioinformatics

Re: Little pattern problem...
by CombatSquirrel (Hermit) on Aug 22, 2003 at 17:34 UTC
    Have a look at the node "Regex with variables?". From what I understood, this might be what you need.
    Cheers, CombatSquirrel.
    Update: Or maybe in your case, the following might help as well:
    #!perl
    use strict;
    use warnings;

    my ($curr, $probe, $target, $pos, $sense);
    for (<DATA>) {
        if (/>probe:(\w+):(\w+):\d+:\d+;\s+Interrogation_Position=(\d+);\s+(\w+);/) {
            ($probe, $target, $pos, $sense) = ($1, $2, $3, $4);
            if ($curr and $curr ne $target) {
                ### do processing for mismatch here
                $curr = $target;
            }
            else {
                $curr = $target if (!$curr);
                ### do processing for target here
            }
        }
        else {
            ### do processing for probe sequence here
        }
    }
    __DATA__
    >probe:MOE430A:1415670_at:549:177; Interrogation_Position=2436; Antisense;
    GGCTGATCACATCCAAAAAGTCATG
    >probe:MOE430A:1415670_at:549:177; Interrogation_Position=2513; Antisense;
    GAGGAAACGTTCACCCTGTCTACTA
    >probe:MOE430A:1415670_at:467:433; Interrogation_Position=2521; Antisense;
    GTTCACCCTGTCTACTATCAAGACA
    >probe:MOE430A:1415670_at:254:643; Interrogation_Position=2533; Antisense;
    TACTATCAAGACACTCGAAGAGGCT
    >probe:MOE430A:1415670_at:54:269; Interrogation_Position=2556; Antisense;
    CTGTGGGCAATATTGTGAAGTTCCT
    >probe:MOE430A:1415670_at:405:339; Interrogation_Position=2583; Antisense;
    GAATGCATCCTTGTGAGAGGTCAGA
    >probe:MOE430A:1415670_at:60:395; Interrogation_Position=2597; Antisense;
    GAGAGGTCAGACAAAGTGCCAGAAA
    >probe:MOE430A:1415670_at:284:165; Interrogation_Position=2619; Antisense;
    AAAACAAGAACACCCACACGCTGCT
    >probe:MOE430A:1415670_at:622:145; Interrogation_Position=2634; Antisense;
    ACACGCTGCTGCTAGCTGGAGTATT
    >probe:MOE430A:1415670_at:291:661; Interrogation_Position=2804; Antisense;
    TATCTTGTCCAACACTACGTCGAAG
    >probe:MOE430A:1415670_at:146:701; Interrogation_Position=2956; Antisense;
    TTGTCACCATGCCTGCAAGGAGAGA
    >probe:MOE430A:1415671_at:116:525; Interrogation_Position=1156; Antisense;
    GGAACAGGAATGTCGCAACATCGTA
    >probe:MOE430A:1415671_at:655:137; Interrogation_Position=1173; Antisense;
    ACATCGTATGGATTGCTGAGTGCAT
    >probe:MOE430A:1415671_at:398:139; Interrogation_Position=1232; Antisense;
Re: Little pattern problem...
by VSarkiss (Monsignor) on Aug 22, 2003 at 17:42 UTC

    I don't quite understand what you're trying to do, but I'll try to help in spite of that. ;-)

    It sounds like you may not need to use regular expressions at all, just simple string equality. If parts of the input string are fixed (like the colon delimiters), and you need to see if a piece changes, you can split the input line, and just do a string compare (using eq, that is).

    Unless I've completely misunderstood what you're trying to do, which is quite possible....
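
A minimal sketch of the split-and-compare idea above, assuming each record header starts with ">probe:" and that the target name is the third colon-separated field. The helper names target_of and group_targets are hypothetical, chosen only for illustration. It counts how many consecutive headers share each target name, using ne rather than a regex for the comparison.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the target name (third colon-separated field) of a header line,
# or undef if the line is not a ">probe:" header.
sub target_of {
    my ($line) = @_;
    return unless $line =~ /^>probe:/;
    my @fields = split /:/, $line;
    return $fields[2];
}

# Group consecutive headers by target name using plain string comparison.
# Returns a list of [ target_name, probe_count ] pairs in file order.
sub group_targets {
    my @lines = @_;
    my (@groups, $current);
    for my $line (@lines) {
        my $target = target_of($line);
        next unless defined $target;        # skip sequence lines
        if (!defined $current or $target ne $current) {
            push @groups, [ $target, 0 ];   # a new target starts here
            $current = $target;
        }
        $groups[-1][1]++;                   # one more probe for this target
    }
    return @groups;
}

my @sample = (
    '>probe:MOE430A:1415670_at:549:177; Interrogation_Position=2436; Antisense;',
    'GGCTGATCACATCCAAAAAGTCATG',
    '>probe:MOE430A:1415670_at:549:177; Interrogation_Position=2513; Antisense;',
    'GAGGAAACGTTCACCCTGTCTACTA',
    '>probe:MOE430A:1415671_at:116:525; Interrogation_Position=1156; Antisense;',
    'GGAACAGGAATGTCGCAACATCGTA',
);

# Print each target name and its probe count, tab separated.
print "$_->[0]\t$_->[1]\n" for group_targets(@sample);
```

Because the comparison is a plain string test, no target names need to be known in advance; whatever appears in the third field becomes the value to compare against.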

      Sorry for not explaining things better... I have a hard time doing that sometimes. What I'm trying to do is take the target name (ex. 100000_at) and use that as the object to pattern match against. The target name is the only thing that the data sets have in common (I want all the 10000_at, 10001_at sets etc. together). By pattern matching, I hope to be able to print out the interrogation positions (where the probe sequences are found in the genome) and sequences (ACTGs in repetition) for each probe that belongs to a single target. The way the file is arranged, it would seem like I could just use a counter and have it chop up the data every so many lines, but there are apparently a few targets with a varying number of probes. Since that is the case, I have to separate the probe sets based on their relevant target name. I can then import the output file into Excel and have a spreadsheet of all the targets, with one row dedicated to the "probes" (strands of DNA that are attached to a glass slide) that match that target.
      Bioinformatics
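
The output described above (one spreadsheet row per target, holding each probe's interrogation position and sequence) could be sketched roughly as follows. This assumes each record is a ">probe:" header line followed by a sequence line, as the sample input suggests. The helper name rows_by_target and the column layout (target name, then alternating position and sequence) are hypothetical choices for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build one tab-delimited row per target: the target name followed by
# position/sequence pairs, in the order the targets first appear.
sub rows_by_target {
    my @lines = @_;
    my (@order, %row, $target, $pos);
    for my $line (@lines) {
        chomp $line;
        if ($line =~ /^>probe:[^:]+:([^:]+):\d+:\d+;\s*Interrogation_Position=(\d+);/) {
            ($target, $pos) = ($1, $2);
            # Remember first-seen order so the output is stable.
            push @order, $target unless exists $row{$target};
            $row{$target} ||= [];
        }
        elsif (defined $target and $line =~ /^([ACGT]+)$/) {
            # Sequence line: attach it to the most recent header.
            push @{ $row{$target} }, $pos, $1;
        }
    }
    return map { join "\t", $_, @{ $row{$_} } } @order;
}

my @sample = (
    ">probe:MOE430A:1415670_at:549:177; Interrogation_Position=2513; Antisense;\n",
    "GAGGAAACGTTCACCCTGTCTACTA\n",
    ">probe:MOE430A:1415670_at:467:433; Interrogation_Position=2521; Antisense;\n",
    "GTTCACCCTGTCTACTATCAAGACA\n",
    ">probe:MOE430A:1415671_at:116:525; Interrogation_Position=1156; Antisense;\n",
    "GGAACAGGAATGTCGCAACATCGTA\n",
);

print "$_\n" for rows_by_target(@sample);
```

Keying a hash on the target name means a target with fewer or more than 11 probes simply produces a shorter or longer row, so no line counter is needed.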
Re: Little pattern problem...
by BrowserUk (Pope) on Aug 22, 2003 at 19:40 UTC

    I think this does what you describe.

    #! perl -slw
    use strict;

    my( $target_name, %probes );

    $_ = <DATA>; # Prime the pump.
    do {
        # extract the target name
        # Note: \d{3} needs three-digit fields after _at:, so a header such as
        # ...:54:269 does not match and $target_name keeps its previous value.
        $target_name = $1 if m[( \d{7} ) _at: \d{3} : \d{3} ]x;

        while( m[$target_name] ) {
            # process the record containing the current target name
            my $probe = <DATA>; # Read the probe
            chomp $probe;
            # save it in an HoA keyed by the target name
            push @{ $probes{ $target_name } }, $probe;
            last unless defined( $_ = <DATA> ); # get the next line;
        }
    } until eof DATA; # till done

    print $_, ' : ', join( ', ', @{ $probes{ $_ } } ) for keys %probes;
    __DATA__
    probe:MOE430A:1415670_at:549:177; Interrogation_Position=2513; Antisense;
    GAGGAAACGTTCACCCTGTCTACTA
    probe:MOE430A:1415670_at:467:433; Interrogation_Position=2521; Antisense;
    GTTCACCCTGTCTACTATCAAGACA
    probe:MOE430A:1415670_at:254:643; Interrogation_Position=2533; Antisense;
    TACTATCAAGACACTCGAAGAGGCT
    probe:MOE430A:1415670_at:54:269; Interrogation_Position=2556; Antisense;
    CTGTGGGCAATATTGTGAAGTTCCT
    probe:MOE430A:1415670_at:405:339; Interrogation_Position=2583; Antisense;
    GAATGCATCCTTGTGAGAGGTCAGA
    probe:MOE430A:1415670_at:60:395; Interrogation_Position=2597; Antisense;
    GAGAGGTCAGACAAAGTGCCAGAAA
    probe:MOE430A:1415670_at:284:165; Interrogation_Position=2619; Antisense;
    AAAACAAGAACACCCACACGCTGCT
    probe:MOE430A:1415670_at:622:145; Interrogation_Position=2634; Antisense;
    ACACGCTGCTGCTAGCTGGAGTATT
    probe:MOE430A:1415670_at:291:661; Interrogation_Position=2804; Antisense;
    TATCTTGTCCAACACTACGTCGAAG
    probe:MOE430A:1415670_at:146:701; Interrogation_Position=2956; Antisense;
    TTGTCACCATGCCTGCAAGGAGAGA
    probe:MOE430A:1415671_at:116:525; Interrogation_Position=1156; Antisense;
    GGAACAGGAATGTCGCAACATCGTA
    probe:MOE430A:1415671_at:655:137; Interrogation_Position=1173; Antisense;
    ACATCGTATGGATTGCTGAGTGCAT
    probe:MOE430A:1415671_at:398:139; Interrogation_Position=1232; Antisense;
    GGCTGATCACATCCAAAAAGTCATG

    Output

    P:\test>285872
    1415671 : GGAACAGGAATGTCGCAACATCGTA, ACATCGTATGGATTGCTGAGTGCAT, GGCTGATCACATCCAAAAAGTCATG
    1415670 : GAGGAAACGTTCACCCTGTCTACTA, GTTCACCCTGTCTACTATCAAGACA, TACTATCAAGACACTCGAAGAGGCT, CTGTGGGCAATATTGTGAAGTTCCT, GAATGCATCCTTGTGAGAGGTCAGA, GAGAGGTCAGACAAAGTGCCAGAAA, AAAACAAGAACACCCACACGCTGCT, ACACGCTGCTGCTAGCTGGAGTATT, TATCTTGTCCAACACTACGTCGAAG, TTGTCACCATGCCTGCAAGGAGAGA

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
    If I understand your problem, I can solve it! Of course, the same can be said for you.

      In response to a /msg.

      $target_name = $1 if m[( \d{7} ) _at: \d{3} : \d{3} ]x;

      If the regex matches $_ (i.e. the line read in from the DATA file), then the 7-digit number '\d{7}' is captured (because of the brackets) into the Perl special variable $1. Because the regex matched, the if condition is true, and so the value of $1 is assigned to the variable $target_name.

      If that isn't clear, then I suggest you find and read the documents perlrequick and perlretut, particularly the sections entitled "Extracting matches" in both. You should have copies of these on your system, but the above links will take you to the latest versions in case you haven't. They won't take long to read, and they do a much better job of explaining this stuff than I would.

      I hope that clarifies things a little.
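
A small sketch of the capture behaviour described above, using the same regex. The wrapper name extract_target is hypothetical and the two input lines are taken from the sample data. One header has three-digit fields after the target name and one has a two-digit field, which shows why the assignment is guarded by the if: when the match fails, nothing is captured.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the captured 7-digit target number, or undef when the regex fails.
sub extract_target {
    my ($line) = @_;
    # The /x modifier lets the pattern contain whitespace for readability;
    # the parentheses capture the seven digits into $1.
    return $1 if $line =~ m[( \d{7} ) _at: \d{3} : \d{3} ]x;
    return;
}

my $hit  = extract_target('probe:MOE430A:1415670_at:549:177; Interrogation_Position=2513; Antisense;');
my $miss = extract_target('probe:MOE430A:1415670_at:54:269; Interrogation_Position=2556; Antisense;');

print defined $hit  ? "captured $hit\n"  : "no match\n";
print defined $miss ? "captured $miss\n" : "no match\n";
```

The second line does not match because \d{3} requires exactly three digits right after "_at:", and that header has only two there.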



Re: Little pattern problem...
by johndageek (Hermit) on Aug 22, 2003 at 21:03 UTC
    Does this do what you want?

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $start_target_name = "";
    for (<DATA>) {
        ## line to process?
        if (/^>probe:/) {
            ## first record processing
            ($start_target_name) = /.*:(\d\d\d+_at:).*/ if ($start_target_name eq "");
            ## Got new target? or got not last target?
            if (!/.*:$start_target_name.*/) {
                print " do neat stuff for prior target $start_target_name\n";
                ## set up for next target
                ($start_target_name) = /.*:(\d\d\d+_at:).*/;
            }
        }
    }
    ## process last set
    print " do neat stuff for prior target $start_target_name\n";
    __DATA__
    >probe:MOE430A:1415670_at:549:177; Interrogation_Position=2436; Antisense;
    GGCTGATCACATCCAAAAAGTCATG
    >probe:MOE430A:1415670_at:549:177; Interrogation_Position=2513; Antisense;
    GAGGAAACGTTCACCCTGTCTACTA
    >probe:MOE430A:1415670_at:467:433; Interrogation_Position=2521; Antisense;
    GTTCACCCTGTCTACTATCAAGACA
    >probe:MOE430A:1415670_at:254:643; Interrogation_Position=2533; Antisense;
    TACTATCAAGACACTCGAAGAGGCT
    >probe:MOE430A:1415670_at:54:269; Interrogation_Position=2556; Antisense;
    CTGTGGGCAATATTGTGAAGTTCCT
    >probe:MOE430A:1415670_at:405:339; Interrogation_Position=2583; Antisense;
    GAATGCATCCTTGTGAGAGGTCAGA
    >probe:MOE430A:1415670_at:60:395; Interrogation_Position=2597; Antisense;
    GAGAGGTCAGACAAAGTGCCAGAAA
    >probe:MOE430A:1415670_at:284:165; Interrogation_Position=2619; Antisense;
    AAAACAAGAACACCCACACGCTGCT
    >probe:MOE430A:1415670_at:622:145; Interrogation_Position=2634; Antisense;
    ACACGCTGCTGCTAGCTGGAGTATT
    >probe:MOE430A:1415670_at:291:661; Interrogation_Position=2804; Antisense;
    TATCTTGTCCAACACTACGTCGAAG
    >probe:MOE430A:1415670_at:146:701; Interrogation_Position=2956; Antisense;
    TTGTCACCATGCCTGCAAGGAGAGA
    >probe:MOE430A:1415671_at:116:525; Interrogation_Position=1156; Antisense;
    GGAACAGGAATGTCGCAACATCGTA
    >probe:MOE430A:1415671_at:655:137; Interrogation_Position=1173; Antisense;
    ACATCGTATGGATTGCTGAGTGCAT
    >probe:MOE430A:1415671_at:398:139; Interrogation_Position=1232; Antisense;

    Good luck!
    dageek
