http://www.perlmonks.org?node_id=994367

randomhero1270 has asked for the wisdom of the Perl Monks concerning the following question:

Hi! I am very new to Perl. I love programming but I am just not very good at it. I am trying to use regexes to read a file which contains a bunch of these:

R00005: 00330: C01010 => C00011
R00005: 00791: C01010 => C00011
R00005: 01100: C01010 <=> C00011
R00006: 00770: C00022 => C00900
R00008: 00362: C06033 => C00022
R00008: 00660: C00022 => C06033
R00010: 00500: C01083 => C00031
R00013: 00630: C00048 => C01146
R00013: 01100: C00048 <=> C01146

What it needs to do is print the R_____ and then whichever => (or <=>) follows it. For example, it would need to find R00008 and print =>. I don't really understand how regexes work, so I started with this:

use strict;
open(DNA, 'reaction_mapformula.lst');
while(my $protein = <DNA>) {
    if(my $protien =~ m/^R\d\d\d\d\d$/){
        print "it";
    }else{
        print "no";
    }
}

and it just prints nononononononononononono etc... I thought that was right? Any help is appreciated, thank you!

Replies are listed 'Best First'.
Re: Regex help
by ww (Archbishop) on Sep 18, 2012 at 21:51 UTC

    The "$" at the end of your regex means "match only if the 'R' followed by five digits is at the END of your line of data". That said, there are already numerous guesses as to what your data might really look like (Hint: code tags!), so here's another. It may be relevant if your data looks like that in the code below:

    #!/usr/bin/perl
    use 5.014;
    #994367

    my @data = (
        'R00005: 00330: C01010 => C00011',
        'R00005: 00791: C01010 => C00011',
        'R00005: 01100: C01010 <=> C00011',
        'R00006: 00770: C00022 => C00900',
        'R00008: 00362: C06033 => C00022',
        'R00008: 00660: C00022 => C06033',
        'R00010: 00500: C01083 => C00031',
        'R00013: 00630: C00048 => C01146',
        'R00013: 01100: C00048 <=> C01146',
    );

    for my $data (@data) {
        if ( $data =~ /(R\d{5}.+)/ ) {    # Match on and capture any line which
                                          # contains the sequence R, 5 digits, and
                                          # something more.
                                          # The parens in the regex capture the
                                          # match to $1 (but are not used here),
                                          # which could be used in other applications.
            say "\t Match on data $data. Hooray!";
        } else {
            say "No match on data $data";
        }
    }

    # do it again for those beginning "R00005" ONLY
    say "\n\n doing it again for 'R00005' ONLY";

    for my $data (@data) {
        if ( $data =~ /(R0{4})(5)(.+)/ ) {
            say "\t Match on data " . $1 . $2 . $3 . " Hooray!";   # 3 captures, print'em
                                          # NOT a good practice;
                                          # illustrates ONLY one aspect
                                          # of regex captures
        } else {
            say "No match on data $data";
        }
    }

    =head output:
             Match on data R00005: 00330: C01010 => C00011. Hooray!
             Match on data R00005: 00791: C01010 => C00011. Hooray!
             Match on data R00005: 01100: C01010 <=> C00011. Hooray!
             Match on data R00006: 00770: C00022 => C00900. Hooray!
             Match on data R00008: 00362: C06033 => C00022. Hooray!
             Match on data R00008: 00660: C00022 => C06033. Hooray!
             Match on data R00010: 00500: C01083 => C00031. Hooray!
             Match on data R00013: 00630: C00048 => C01146. Hooray!
             Match on data R00013: 01100: C00048 <=> C01146. Hooray!


     doing it again for 'R00005' ONLY
             Match on data R00005: 00330: C01010 => C00011 Hooray!
             Match on data R00005: 00791: C01010 => C00011 Hooray!
             Match on data R00005: 01100: C01010 <=> C00011 Hooray!
    No match on data R00006: 00770: C00022 => C00900
    No match on data R00008: 00362: C06033 => C00022
    No match on data R00008: 00660: C00022 => C06033
    No match on data R00010: 00500: C01083 => C00031
    No match on data R00013: 00630: C00048 => C01146
    No match on data R00013: 01100: C00048 <=> C01146
    =cut

    See perlretut.

Re: Regex help
by hbm (Hermit) on Sep 18, 2012 at 21:25 UTC

    Note also you have the following:

    while(my $protein = ...
        if(my $protien =~ ...
                   ^^

    The 'if' will never match because 'my $protien' declares a brand-new, empty variable instead of testing the line you just read. In the 'if', correct the spelling and drop 'my':

    if ($protein =~ ...
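
A minimal sketch of why that matters, assuming the asker's data format (the sample line is copied from the question; $buggy, $fixed and @warnings are names made up for this illustration). With 'use warnings' switched on, Perl itself points at the problem:

```perl
use strict;
use warnings;

# Collect warnings instead of printing them, so we can inspect them below.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

my $protein = "R00008: 00362: C06033 => C00022\n";

# Buggy form from the question: 'my' creates a new, undefined variable,
# so the regex is applied to undef and cannot match.
my $buggy = 0;
if ( my $protien =~ m/^R\d\d\d\d\d/ ) {
    $buggy = 1;
}

# Fixed form: test the variable that actually holds the line.
my $fixed = 0;
if ( $protein =~ m/^R\d\d\d\d\d/ ) {
    $fixed = 1;
}

print "buggy matched: $buggy\n";
print "fixed matched: $fixed\n";
print "first warning: $warnings[0]" if @warnings;
```

The asker's script had 'use strict' but not 'use warnings', which is probably why nothing complained about the uninitialized value.
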
Re: Regex help
by Kenosis (Priest) on Sep 18, 2012 at 22:48 UTC

    Welcome to PerlMonks, randomhero1270! I think you did well on both your question and script...

    Given your data set, consider the following:

    use strict;
    use warnings;

    while ( my $protein = <DATA> ) {
        # Skip lines that don't match, so $1 and $2 are never left over
        # from an earlier line
        next unless $protein =~ /^([^:]+).+\s+([<=>]+)\s+/;
        print "$1 - $2\n";
    }

    __DATA__
    R00005: 00330: C01010 => C00011
    R00005: 00791: C01010 => C00011
    R00005: 01100: C01010 <=> C00011
    R00006: 00770: C00022 => C00900
    R00008: 00362: C06033 => C00022
    R00008: 00660: C00022 => C06033
    R00010: 00500: C01083 => C00031
    R00013: 00630: C00048 => C01146
    R00013: 01100: C00048 <=> C01146

    Output:

    R00005 - =>
    R00005 - =>
    R00005 - <=>
    R00006 - =>
    R00008 - =>
    R00008 - =>
    R00010 - =>
    R00013 - =>
    R00013 - <=>

    The regex:

    /^([^:]+).+\s+([<=>]+)\s+/
     ^  ^    ^       ^
     |  |    |       |
     |  |    |       + - Capture characters from this class enclosed by 1+ spaces
     |  |    + - Keep going, matching any character except \n
     |  + - Capture characters that are not :
     + - Start at the beginning

    The () notation creates captures. In this case $1 will contain the captured text, like R00008; $2 will contain the captured characters <, =, or >.
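
If it helps, here is a sketch of the same capture idea wrapped in a small sub, so that a line which doesn't match is skipped rather than printed with leftover captures. The name parse_reaction and the slightly stricter pattern are my own assumptions for illustration, based only on the sample data in the question:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns (ID, arrow) for a line
# that looks like the sample data, or an empty list if it does not.
sub parse_reaction {
    my ($line) = @_;

    # In list context a successful match returns its captures.
    my ( $id, $arrow ) = $line =~ /^(R\d{5}):.*\s(<?=>)\s/
        or return;
    return ( $id, $arrow );
}

for my $line (
    "R00008: 00362: C06033 => C00022\n",
    "R00013: 01100: C00048 <=> C01146\n",
    "not a reaction line\n",
) {
    if ( my ( $id, $arrow ) = parse_reaction($line) ) {
        print "$id - $arrow\n";
    }
    else {
        print "skipped: $line";
    }
}
```

Assigning the match to a list of variables avoids relying on $1 and $2 after the fact, which keep their old values when a later match fails.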

    Hope this helps!

      That was exactly what I needed, I played around with mine before I looked at this and I was very close. Thank you for showing what the regex means I really appreciate that!

        You're very welcome, randomhero1270!

Re: Regex help
by nemesdani (Friar) on Sep 18, 2012 at 21:12 UTC
    ^R\d\d\d\d\d$ means that you are looking for a line which contains nothing but Rxxxxx. That's not what you want, I think.
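
To see the difference the anchors make, here is a minimal sketch (the sample line is copied from the question; the variable names are made up for illustration):

```perl
use strict;
use warnings;

my $line = 'R00008: 00362: C06033 => C00022';

# Anchored at both ends: the whole line must be exactly R plus 5 digits.
my $whole_line = $line =~ /^R\d{5}$/ ? 'match' : 'no match';

# Anchored at the start only: the line just has to begin with R plus 5 digits.
my $prefix = $line =~ /^R\d{5}/ ? 'match' : 'no match';

# A line holding only the ID satisfies the fully anchored pattern.
my $bare = 'R00008' =~ /^R\d{5}$/ ? 'match' : 'no match';

print "whole line: $whole_line\n";    # no match
print "prefix:     $prefix\n";        # match
print "bare ID:    $bare\n";          # match
```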

    If the records are in separate lines, then use

    /(R\d\d\d\d\d)(.*)=>(.*)/

    $3 will contain whatever comes after =>.
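
A quick sketch of what each capture ends up holding with that pattern, using two lines from the question (split_on_arrow is a made-up name for illustration). Note that on a <=> line the "<" lands in $2, because the pattern only looks for "=>":

```perl
use strict;
use warnings;

# Hypothetical wrapper around the pattern above: returns the three captures,
# or an empty list when the line does not match.
sub split_on_arrow {
    my ($line) = @_;
    return my @captures = $line =~ /(R\d\d\d\d\d)(.*)=>(.*)/;
}

for my $line (
    'R00008: 00362: C06033 => C00022',
    'R00013: 01100: C00048 <=> C01146',
) {
    if ( my ( $id, $middle, $after ) = split_on_arrow($line) ) {
        print "id=[$id] middle=[$middle] after=[$after]\n";
    }
}
# prints:
# id=[R00008] middle=[: 00362: C06033 ] after=[ C00022]
# id=[R00013] middle=[: 01100: C00048 <] after=[ C01146]
```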

    I'm too lazy to be proud of being impatient.