Matching '=' and other non alphanumeric characters using regular expressions

by Anonymous Monk
on Dec 13, 2002 at 05:58 UTC (#219504=perlquestion)
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am attempting to extract information from a text file across a number of lines. I can't find the correct regular expression syntax to match '=', '()', '.', or any other non-alphanumeric character. '.' only seems to match alphanumeric characters. Can anyone help me? For example:
#! /usr/bin/perl
use strict;
my $string = "LOCUS SOY4KDPP 545 bp PLN 01-FEB-2000
DEFINITION BIO5 gene complete cds EC=1.7.7.1.
ACCESSION D17396
VERSION D17396.1 GI:498167
KEYWORDS protein; insulin-like protein; leginsulin.
//";
if ($string =~ /DEFINITION\s\s([\w\s]+)ACCESSION/){
    print "$1\n";
}
exit;

Re: Matching '=' and other non alphanumeric characters using regular expressions
by jjdraco (Scribe) on Dec 13, 2002 at 06:06 UTC
    \W matches a non-word character
    \C matches a single byte (a C char), even under Unicode; note that it was removed in Perl 5.24
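
As a rough sketch of what \W picks up, the snippet below splits a made-up sample (`$sample` is an assumption borrowed from the EC number in the question, not the real file) into non-word characters and runs of word characters. Since \w covers only letters, digits and underscore, both '=' and '.' fall under \W.

```perl
use strict;
use warnings;

# Hypothetical sample text, chosen only for illustration.
my $sample = "EC=1.7.7.1.";

# In list context with /g, the match returns every captured piece.
my @non_word  = $sample =~ /(\W)/g;     # single non-word characters
my @word_runs = $sample =~ /(\w+)/g;    # runs of word characters

print "non-word:  @non_word\n";
print "word runs: @word_runs\n";
```

Note that \W also matches whitespace, so on its own it is usually too broad for picking out punctuation.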


    jjdraco
    learning Perl one statement at a time.
Re: Matching '=' and other non alphanumeric characters using regular expressions
by krujos (Curate) on Dec 13, 2002 at 06:07 UTC
    You need to match the . and = characters. If there is a general form for where those occur, you could plug that into your script. Without knowing that, a simple suggestion is to just add the = and the period to the character class in your regex.
    if ($string =~ /DEFINITION\s\s([\w\s=\.]+)ACCESSION/){
    Good luck. Josh
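
A minimal runnable sketch of that suggestion follows. The sample record is an assumption: it puts two spaces after DEFINITION (which is what the `\s\s` in the pattern expects) and one field per line. Inside a character class the dot is already literal, so the backslash is optional.

```perl
use strict;
use warnings;

# Assumed sample record: two spaces after DEFINITION, one field per line.
my $string = "DEFINITION  BIO5 gene complete cds EC=1.7.7.1.\n"
           . "ACCESSION D17396\n";

my $definition;

# '=' and '.' are added to the class so the EC number no longer stops the match.
if ($string =~ /DEFINITION\s\s([\w\s=.]+)ACCESSION/) {
    $definition = $1;
    $definition =~ s/\s+\z//;    # \s also captured the newline before ACCESSION
    print "$definition\n";
}
```

The class still has to list every punctuation character that can appear, so a ';' or ':' in the definition would break the match again.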
Re: Matching '=' and other non alphanumeric characters using regular expressions
by blahblahblah (Priest) on Dec 13, 2002 at 06:18 UTC
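    In addition to what the others have pointed out, I think you'll want to add an 's' to the end if you use '.' to match across multiple lines, since /s lets '.' match newlines (a class like [\w\s] already matches them, so here it is harmless), like this: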
    if ($string =~ /DEFINITION\s\s([\w\s]+)ACCESSION/s){ print "$1\n"; }
    Also, if you simply want to pick up everything between "DEFINITION " and "ACCESSION", another way to do it without worrying about knowing every possible character is like this:
    $string =~ /DEFINITION\s\s(.*?)ACCESSION/s
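
To illustrate the difference the /s modifier makes, here is a small sketch under the assumption that the record has a newline before ACCESSION and two spaces after DEFINITION (the sample string is made up for the example). Without /s the dot cannot cross the newline, so the first pattern fails; with /s the non-greedy `.*?` stops at the first ACCESSION.

```perl
use strict;
use warnings;

# Assumed sample record with a newline before ACCESSION.
my $string = "DEFINITION  BIO5 gene complete cds EC=1.7.7.1.\n"
           . "ACCESSION D17396\n";

# Without /s, '.' refuses to match "\n", so ACCESSION is never reached.
my $matched_without_s = ($string =~ /DEFINITION\s\s(.*?)ACCESSION/) ? 1 : 0;

# With /s, '.' matches any character, including "\n".
my ($definition) = $string =~ /DEFINITION\s\s(.*?)ACCESSION/s;

# Trim the newline that was captured just before ACCESSION.
$definition =~ s/\s+\z// if defined $definition;

print "matched without /s: $matched_without_s\n";
print "captured with /s:   $definition\n" if defined $definition;
```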
Re: Matching '=' and other non alphanumeric characters using regular expressions
by MarkM (Curate) on Dec 13, 2002 at 06:19 UTC

    '.' will (normally) match all characters except for the newline character.

    The regular expression that you are looking for is probably:

    if ($string =~ /^DEFINITION[ \t]+([^\r\n]*)/mg) {

    In English, this would be: A line in $string that begins with the literal string 'DEFINITION' followed by any amount of simple white space (' ' and '\t') followed by a string of characters that do not include '\r' or '\n'. This will grab all characters to the end of the line, but not the end-of-line character sequence itself.

    I choose to use [^\r\n]* instead of .* as I regularly have to ensure that my code will function equally well under both UNIX and WIN32. Using [^\r\n]* instead of .* allows me to ensure that '\r' is not picked up at the end of $1.
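
The point about '\r' can be sketched as follows, assuming a sample record with Windows-style CRLF line endings (the string is invented for the example). By default '.' excludes only "\n", so `.*` keeps the trailing carriage return, while the negated class stops before it.

```perl
use strict;
use warnings;

# Assumed sample with CRLF line endings, as read in binary mode on WIN32.
my $string = "LOCUS SOY4KDPP\r\n"
           . "DEFINITION BIO5 gene complete cds EC=1.7.7.1.\r\n"
           . "ACCESSION D17396\r\n";

# '.' matches "\r", so the capture ends with a stray carriage return.
my ($with_dot) = $string =~ /^DEFINITION[ \t]+(.*)/m;

# [^\r\n]* stops before either end-of-line character.
my ($with_class) = $string =~ /^DEFINITION[ \t]+([^\r\n]*)/m;

printf "dot:   %d chars, ends with CR: %s\n",
    length($with_dot), ($with_dot =~ /\r\z/ ? "yes" : "no");
printf "class: %d chars\n", length($with_class);
```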

Re: Matching '=' and other non alphanumeric characters using regular expressions
by pg (Canon) on Dec 13, 2002 at 07:04 UTC
    When I looked at your string, I realized that it is actually a kind of key-value pair. My suggestion is to save the result returned from m// into a hash.

    With a few small modifications to the following code, you can easily make it work for you:
    use strict;
    my $string = "DEFINITION BIO5 gene complete cds EC=1.7.7.1.
    ACCESSION D17396
    VERSION D17396.1 GI:498167
    KEYWORDS protein; insulin-like protein;";
    my %KVpairs = ($string =~ /(\w+)\s+(.*)/mg); # save the result right into a hash
    foreach (keys %KVpairs) {
        print "[$_] = $KVpairs{$_}\n";
    }
Re: Matching '=' and other non alphanumeric characters using regular expressions
by Bukowski (Deacon) on Dec 13, 2002 at 12:16 UTC
    Hmm I smell a bioinformatics question :)

    You're obviously trying to parse out information from GenBank records.

    Have you come across Lincoln Stein's "Creating a Bioinformatics Nation"?

    One of the points is that there is massive duplication of effort. Every Perl-using biologist has written a GenBank parser!! I have, my colleagues have, friends of colleagues have!

    Run, don't walk, to BioPerl and save yourself the time and effort - the parsers you require are already there for you :)

    Bukowski - aka Dan (dcs@black.hole-in-the.net)
    "Coffee for the mind, Pizza for the body, Sushi for the soul" -Userfriendly
