http://www.perlmonks.org?node_id=219504

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am attempting to extract information from a text file across a number of lines. I can't find the correct regular expression syntax to match when it comes across = or () or '.' or any other non-alphanumeric characters. '.' only seems to match alphanumeric characters. Can anyone help me? For example:
#!/usr/bin/perl
use strict;

my $string = "LOCUS SOY4KDPP 545 bp PLN 01-FEB-2000
DEFINITION BIO5 gene complete cds EC=1.7.7.1.
ACCESSION D17396
VERSION D17396.1 GI:498167
KEYWORDS protein; insulin-like protein; leginsulin.
//";

if ($string =~ /DEFINITION\s\s([\w\s]+)ACCESSION/) {
    print "$1\n";
}
exit;

Replies are listed 'Best First'.
Re: Matching '=' and other non alphanumeric characters using regular expressions
by pg (Canon) on Dec 13, 2002 at 07:04 UTC
    When I looked at your string, I realized that it is actually a set of key-value pairs. My suggestion is to save the list returned by m// into a hash.

    With a few small modifications, the following code should work for you:
    use strict;

    my $string = "DEFINITION BIO5 gene complete cds EC=1.7.7.1.\n"
               . "ACCESSION D17396\n"
               . "VERSION D17396.1 GI:498167\n"
               . "KEYWORDS protein; insulin-like protein;";

    # save the result right into a hash
    my %KVpairs = ($string =~ /(\w+)\s+(.*)/mg);

    foreach (keys %KVpairs) {
        print "[$_] = $KVpairs{$_}\n";
    }
Re: Matching '=' and other non alphanumeric characters using regular expressions
by MarkM (Curate) on Dec 13, 2002 at 06:19 UTC

    '.' will (normally) match all characters except for the newline character.

    The regular expression that you are looking for is probably:

    if ($string =~ /^DEFINITION[ \t]+([^\r\n]*)/mg) {

    In English, this would be: a line in $string that begins with the literal string 'DEFINITION', followed by any amount of simple white space (' ' and '\t'), followed by a string of characters that does not include '\r' or '\n'. This will grab all characters to the end of the line, but not the end-of-line character sequence itself.

    I choose to use [^\r\n]* instead of .* as I regularly have to ensure that my code will function equally well under both UNIX and WIN32. Using [^\r\n]* instead of .* allows me to ensure that '\r' is not picked up at the end of $1.
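    A minimal sketch of that difference, assuming a hypothetical two-line record with Windows (CRLF) line endings; the sample text and variable names are illustrative, not taken from the original file:

```perl
use strict;
use warnings;

# Hypothetical sample: GenBank-style lines ending in "\r\n".
my $record = "DEFINITION  BIO5 gene complete cds EC=1.7.7.1.\r\n"
           . "ACCESSION   D17396\r\n";

# '.' stops at "\n" but still swallows the "\r" in front of it.
my ($with_dot) = $record =~ /^DEFINITION[ \t]+(.*)/m;

# The negated class stops at the first "\r" or "\n".
my ($with_class) = $record =~ /^DEFINITION[ \t]+([^\r\n]*)/m;

printf "dot:   trailing CR? %s\n", ($with_dot   =~ /\r\z/ ? 'yes' : 'no');
printf "class: trailing CR? %s\n", ($with_class =~ /\r\z/ ? 'yes' : 'no');
```

    On data with plain "\n" line endings, both patterns capture the same text.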

Re: Matching '=' and other non alphanumeric characters using regular expressions
by blahblahblah (Priest) on Dec 13, 2002 at 06:18 UTC
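    In addition to what the others have pointed out, if you use '.' to match across multiple lines you'll need to add an 's' to the end of the regex so that '.' also matches newlines. It does no harm on your original pattern, where \s already matches newlines: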
    if ($string =~ /DEFINITION\s\s([\w\s]+)ACCESSION/s){ print "$1\n"; }
    Also, if you simply want to pick up everything between "DEFINITION " and "ACCESSION", another way to do it without worrying about knowing every possible character is like this:
    $string =~ /DEFINITION\s\s(.*?)ACCESSION/s
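    A small sketch of the /s behaviour, assuming a hypothetical record whose DEFINITION wraps onto a second line (the sample text and the whitespace cleanup at the end are illustrative additions):

```perl
use strict;
use warnings;

# Hypothetical record: the DEFINITION value continues on a second line.
my $record = "DEFINITION  BIO5 gene complete cds\n"
           . "            EC=1.7.7.1.\n"
           . "ACCESSION   D17396\n";

# Without /s, '.' cannot cross the newline, so the match fails.
my $without_s = $record =~ /DEFINITION\s\s(.*?)ACCESSION/ ? 1 : 0;

# With /s, '.' also matches "\n"; the non-greedy .*? stops at the
# first ACCESSION it can reach.
my ($definition) = $record =~ /DEFINITION\s\s(.*?)ACCESSION/s;

# Collapse the line wrap into single spaces and trim the ends.
$definition =~ s/\s+/ /g;
$definition =~ s/^ | $//g;

print "matched without /s: $without_s\n";
print "definition: $definition\n";
```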
Re: Matching '=' and other non alphanumeric characters using regular expressions
by krujos (Curate) on Dec 13, 2002 at 06:07 UTC
    You need to match the '.' and '=' characters. If there is a general form for where those occur, you could plug it into your script. Without knowing that, a simple suggestion is to just add the '=' and '.' to the character class in your regex.
    if ($string =~ /DEFINITION\s\s([\w\s=\.]+)ACCESSION/){
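    To see why the wider character class helps, here is a short sketch; the sample text (including the two spaces after DEFINITION) is an assumption modeled on the question:

```perl
use strict;
use warnings;

# Sample text modeled on the question; exact spacing is assumed.
my $text = "DEFINITION  BIO5 gene complete cds EC=1.7.7.1.\n"
         . "ACCESSION   D17396\n";

# [\w\s]+ cannot get past '=' or '.', so ACCESSION is never reached.
my $narrow_matches = $text =~ /DEFINITION\s\s([\w\s]+)ACCESSION/ ? 1 : 0;

# With '=' and '.' in the class the match can run up to ACCESSION.
# Inside [...] a '.' is literal, so the backslash is optional.
my ($wide) = $text =~ /DEFINITION\s\s([\w\s=.]+)ACCESSION/;
$wide =~ s/\s+\z//;    # drop the trailing newline that \s picked up

print "narrow class matched: $narrow_matches\n";
print "wide class captured:  $wide\n";
```

    Any other punctuation that can appear in the field (such as ';' or '-') would have to be added to the class in the same way.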
    Good luck. Josh
Re: Matching '=' and other non alphanumeric characters using regular expressions
by Bukowski (Deacon) on Dec 13, 2002 at 12:16 UTC
    Hmm I smell a bioinformatics question :)

    You're obviously trying to parse out information from GenBank records.

    Have you come across Lincoln Stein's Creating a Bioinformatics Nation?

    One of the points is that there is massive duplication of effort. Every Perl-using biologist has written a GenBank parser!! I have, my colleagues have... friends of colleagues have!

    Run, don't walk, to BioPerl and save yourself the time and effort - the parsers you require are already there for you :)

    Bukowski - aka Dan (dcs@black.hole-in-the.net)
    "Coffee for the mind, Pizza for the body, Sushi for the soul" -Userfriendly

Re: Matching '=' and other non alphanumeric characters using regular expressions
by jjdraco (Scribe) on Dec 13, 2002 at 06:06 UTC
    \W matches a non-word character
    \C matches a single byte (deprecated in Perl 5.20 and removed in 5.24, so avoid it in new code)
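    As a quick illustration of \W, this sketch pulls the non-word characters out of a sample value (the value is taken from the question's DEFINITION line; the variable names are illustrative):

```perl
use strict;
use warnings;

my $value = "EC=1.7.7.1.";

# \W matches any character that is not a letter, digit, or underscore.
my @non_word = $value =~ /(\W)/g;

print "non-word characters: @non_word\n";
```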


    jjdraco
    learning Perl one statement at a time.