
Regex Extraction Help

by invaderzard (Acolyte)
on Aug 09, 2012 at 15:32 UTC ( #986543=perlquestion )
invaderzard has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I'm trying to extract some information from a .dat file. The format of the information I'm trying to extract is this.

DR Pfam; PF00070; Pyr_redox; 2.

What I want is the PF00070 inside.

I tried several regexes, like (/(?<=DR\s\s\sPfam;\s\w\w\d\d\d\d\d)(.*)$/)



But none of them seem to work. Any help would be appreciated. Thanks!

Replies are listed 'Best First'.
Re: Regex Extraction Help
by Kenosis (Priest) on Aug 09, 2012 at 16:27 UTC

    Here's another option:

    use Modern::Perl;

    my $dat  = 'DR Pfam; PF00070; Pyr_redox; 2.';
    my $info = (split '; ', $dat)[1];
    say $info;



    Hope this helps!

      invaderzard, just to be clear: this solution by Kenosis is by far the quicker and easier one, and it's what I'd use any time I just need a quick split on a field separator.

      But there is one caveat to keep in mind: split does not test the format of the input. So if you want the second field of a record that looks like this:

      $record = 'A;B;C;D';
      $second_field = (split ';', $record)[1];

      this works. However, it "works" just as well for inputs like:

      A;B
      #foo;B;ar
      ;B

      All of the above inputs would leave a B in $second_field. That might be correct in a particular case, but in general we don't want to silently accept malformed records. So if we are, say, iterating over records, it's better to test and capture with a regexp in an if:

      if($record =~ m/^.;(.);.;.$/) { $second_field = $1; }

      Now this will only set $second_field if the record matches the format of four single-character fields delimited by single semicolons. Note that . also matches a semicolon, so this still matches even if the input is ';;;;;;;'. ;)
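To make the difference concrete, here is a minimal sketch comparing the two approaches on the inputs above. The helper names (second_by_split, second_dot, second_strict) are made up for this illustration, and second_strict is an assumed variant that swaps . for [^;] so that a semicolon can never pass as field content.

```perl
use strict;
use warnings;

# Hypothetical helper: second field via split, no format check at all.
sub second_by_split {
    my ($record) = @_;
    my @fields = split /;/, $record;
    return $fields[1];
}

# The regexp from the post: . matches any character, semicolons included.
sub second_dot {
    my ($record) = @_;
    return $record =~ m/^.;(.);.;.$/ ? $1 : undef;
}

# Assumed stricter variant: [^;] refuses a semicolon as field content.
sub second_strict {
    my ($record) = @_;
    return $record =~ m/^[^;];([^;]);[^;];[^;]$/ ? $1 : undef;
}

for my $record ('A;B;C;D', 'A;B', '#foo;B;ar', ';B', ';;;;;;;') {
    my @results = map { defined $_ ? "'$_'" : 'undef' }
        second_by_split($record), second_dot($record), second_strict($record);
    printf "%-10s split=%-6s dot=%-6s strict=%s\n", $record, @results;
}
```

Only 'A;B;C;D' passes the strict check. split happily returns B for the three malformed records, and the dot version accepts ';;;;;;;' with a semicolon as its "field".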

      Have fun with regexen. They're cool. ;)

      So long,

        You make a good point about splitting on a field separator within possibly malformed records. Based upon the OP's regex, it appears that the pattern's stable, with one space after the semicolon. However, we can ask split to 'test' the format of the input, at least as far as the spacing around the separator goes, like this:

        my $info = (split /\s*;\s*/, $dat)[1];

        This will return the info the OP wants, whether there are spaces before or after the semi-colon, or not.
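A quick sketch of that claim, using a hypothetical helper second_field and three made-up spacing variants of the OP's line:

```perl
use strict;
use warnings;

# Hypothetical helper: second field, tolerating any whitespace
# around the semicolons.
sub second_field {
    my ($dat) = @_;
    my @fields = split /\s*;\s*/, $dat;
    return $fields[1];
}

for my $dat (
    'DR Pfam; PF00070; Pyr_redox; 2.',      # one space after each semicolon
    'DR Pfam;PF00070;Pyr_redox;2.',         # no spaces at all
    'DR Pfam ;  PF00070 ; Pyr_redox ; 2.',  # spaces on both sides
) {
    print second_field($dat), "\n";         # PF00070 each time
}
```

Note that this only normalizes spacing. It still says nothing about whether the record has the expected number of fields.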

        And within a regex on the OP's data:

        use Modern::Perl;

        my $dat = 'DR Pfam; PF00070; Pyr_redox; 2.';
        $dat =~ /;\s*(\w+)\s*;.+;/ and say $1;    # prints PF00070

        It was a good call to address this issue...

Re: Regex Extraction Help
by Ratazong (Monsignor) on Aug 09, 2012 at 15:43 UTC


    Your main issue seems to be that you capture too much, or the wrong things. Reminder: you need to put the thing you want to get in round brackets, which is the \w\w\d\d\d\d\d in this case. So your regex could look like

    (I don't see the need for look-aheads or look-behinds here.)
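A minimal sketch of the capture-group approach described above. This is one possible form under stated assumptions, not necessarily the regex originally posted: pfam_id is a hypothetical helper name, and \s+ is an assumption so that one or three spaces after DR both match.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the Pfam accession from a DR line,
# or undef if the line does not have the expected shape.
sub pfam_id {
    my ($line) = @_;
    # Match the literal prefix, then capture two word characters
    # followed by five digits, terminated by a semicolon.
    return $line =~ /^DR\s+Pfam;\s*(\w\w\d{5});/ ? $1 : undef;
}

print pfam_id('DR   Pfam; PF00070; Pyr_redox; 2.'), "\n";    # PF00070
```

The parentheses go around the part to extract, and the fixed text around it is matched but not captured, so no look-behind is needed.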

    HTH, Rata

    update Flexx: why do you assume that the second field of a semicolon-separated file is meant? I agree that the specification is very vague. However, from the examples given by invaderzard, it seems that the text DR   Pfam as well as the format (2 letters, 5 digits) are the important parts ...

      Well, I guess all the fields are variable, and what invaderzard meant was to get that second field.

      So I'd suggest this:

      # assuming the raw data is in $line.
      $line =~ m/^[^;]*;\s*([^;]*?)\s*;/;
      # $1 now holds whatever is between the first and second
      # semicolon, leading and trailing spaces trimmed.

      Now, what am I doing here?

      First I say: Let's start at the beginning (^). This is important, since we can't exclude the possibility that the pattern repeats in one instance of $line.

      Next, I say: give me zero or more non-semicolon characters ([^;]*), followed by exactly one semicolon (;).

      Now our "cursor" is, so to speak, in the second field. We say, well, there might or might not be some leading space (\s*). Then comes the data we want, and that's why we use parentheses to capture it. What do we want to capture? Well, again, anything that is not a semicolon ([^;]*?), but this time non-greedily (using the *? quantifier). That's because we want any trailing space to go into the \s* that follows, instead of it being captured. Lastly, we require that the field is terminated by exactly one semicolon (;).
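The same walkthrough can be written into the pattern itself with the /x modifier, which lets the regexp carry whitespace and comments. This is only a sketch, and second_field_trimmed is a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: second field with surrounding space trimmed,
# or undef if the line has fewer than two semicolons.
sub second_field_trimmed {
    my ($line) = @_;
    return $line =~ m/
        ^ [^;]* ;     # first field, up to and including its semicolon
        \s*           # optional leading space of the second field
        ( [^;]*? )    # the field itself, as few characters as possible
        \s* ;         # optional trailing space, then the closing semicolon
    /x ? $1 : undef;
}

print second_field_trimmed('DR Pfam;   PF00070   ; Pyr_redox; 2.'), "\n";    # PF00070
```

With a greedy [^;]* in the capture instead, the trailing spaces would end up inside $1, since \s* is happy to match nothing.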

      If you want to capture other fields as well, then a solution using split, as suggested elsewhere in this thread, is a more efficient way of doing it. If you want just a few fields of a long CSV record (which this seems to be, only delimited by semicolons instead of commas), then you could also expand on the regexp above, which might be a bit more performant than split. But I didn't really check that with benchmarks. It's just an inkling I have, and it depends very much on the length of the input and the number of fields in it.
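As a sketch of what "expanding on the regexp" could look like, here is a version that captures the second and third fields in one match. The helper name second_and_third is made up for this example, and no claim about its speed relative to split is implied.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (second, third) field, both trimmed,
# or an empty list if the line has fewer than three semicolons.
sub second_and_third {
    my ($line) = @_;
    my @captures = $line =~ m/^[^;]*;\s*([^;]*?)\s*;\s*([^;]*?)\s*;/;
    return @captures;
}

my ($id, $name) = second_and_third('DR Pfam; PF00070; Pyr_redox; 2.');
print "$id $name\n";    # PF00070 Pyr_redox
```

In list context a successful match returns the captured groups, and a failed match returns the empty list, so the caller can test the result directly.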


      « Flexx: why do you assume that the second field of a semicolon-separated file is meant? »

      Umm.. Well because it looks like a CSV format? Experience seeing a bit of a problem and getting what the requirement is (a/k/a "getting 'all' the information from the customer" ;)?

      And it appeared that putting that first field in the regexp was more out of confusion as to how to "get to" the second field, something I often see when someone learns how to use regular expressions. Along with too much use of .* to pull in fields, BTW, when "not the separator" ([^;]*) is often more correct, or even needed. Things get worse once quoting has to be considered, of course.
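A small sketch of why .* tends to pull in the wrong field. Both helper names are hypothetical, and both patterns are meant to grab "the second field" of the OP's line:

```perl
use strict;
use warnings;

# Greedy .* runs to the end of the line and backtracks only as far
# as it must, so it skips past the field we actually wanted.
sub field_with_dotstar {
    my ($line) = @_;
    return $line =~ /^.*;\s*(.*);/ ? $1 : undef;
}

# [^;]* can never cross a separator, so it stays inside its field.
sub field_with_negated_class {
    my ($line) = @_;
    return $line =~ /^[^;]*;\s*([^;]*);/ ? $1 : undef;
}

my $line = 'DR Pfam; PF00070; Pyr_redox; 2.';
print field_with_dotstar($line), "\n";          # Pyr_redox
print field_with_negated_class($line), "\n";    # PF00070
```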

      But yeah, it was just an educated guess.

      So long,

Node Type: perlquestion [id://986543]
Approved by Ratazong