in reply to Regex Extraction Help


Your main issue seems to be that you capture too much, or the wrong things. Reminder: you need to put the part you want to extract in parentheses ... which is the \w\w\d\d\d\d\d in this case. So your regex could look like

(I don't see the need for look-aheads or look-behinds here.)
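
A minimal sketch of the capture-group idea, not necessarily the exact regex meant above. Only the DR   Pfam prefix and the two-letters-plus-five-digits shape come from this thread; the sample line, the ID PF00069, and the helper name extract_pfam_id are assumptions made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample line; only the "DR   Pfam" prefix and the
# 2-letters-plus-5-digits shape are taken from the discussion.
my $line = 'DR   Pfam; PF00069; Pkinase; 1.';

# Return the captured ID, or undef if the line does not match.
sub extract_pfam_id {
    my ($text) = @_;
    # The parentheses capture just the part we want into $1.
    return $text =~ /DR\s+Pfam;\s*(\w\w\d{5})\b/ ? $1 : undef;
}

my $id = extract_pfam_id($line);
print defined $id ? "$id\n" : "no match\n";
```

Anchoring on the literal DR   Pfam text keeps the pattern from picking up similarly shaped IDs on other lines.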

HTH, Rata

Update, Flexx: why do you assume that the second field of a semicolon-separated file is meant? I agree that the specification is very vague. However, from the examples given by invaderzard, it seems that the text DR   Pfam as well as the format (2 letters, 5 digits) are the important parts ...

Re^2: Regex Extraction Help
by Flexx (Pilgrim) on Aug 09, 2012 at 17:02 UTC

    Well, I guess all the fields are variable, and what invaderzard meant was to get that second field.

    So I'd suggest this:

    # assuming the raw data is in $line.
    $line =~ m/^[^;]*;\s*([^;]*?)\s*;/;
    # $1 now holds whatever is between the first and second
    # semicolon, leading and trailing spaces trimmed.

    Now, what am I doing here?

    First I say: Let's start at the beginning (^). This is important, since we can't exclude the possibility that the pattern repeats in one instance of $line.

    Next, I say: give me zero or more non-semicolon characters ([^;]*), followed by exactly one semicolon (;).

    Now our "cursor" is, so to speak, in the second field. There may or may not be some leading space (\s*). Then comes the data we want, which is why we use parentheses to capture it. What do we want to capture? Again, anything that is not a semicolon ([^;]*?), but this time non-greedily (using the *? quantifier). That is because we want any trailing space to go into the \s* that follows instead of being captured. Lastly, we require that the field is terminated by exactly one semicolon (;).
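
    To see the pieces work together, here is a small sketch that wraps the regex in a helper. The name second_field and the input lines are made up for illustration.

```perl
use strict;
use warnings;

# Returns the second semicolon-separated field with surrounding
# whitespace trimmed, or undef if the line has fewer than two semicolons.
sub second_field {
    my ($line) = @_;
    return $line =~ m/^[^;]*;\s*([^;]*?)\s*;/ ? $1 : undef;
}

# Hypothetical input lines, made up for illustration.
for my $line ('first;  second field  ; third', 'a;b;c', 'no separators') {
    my $field = second_field($line);
    printf "%-32s => %s\n", "'$line'", defined $field ? "'$field'" : 'undef';
}
```

    Note that inner spaces ("second field") survive; only the leading and trailing whitespace is left out of the capture.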

    If you want to capture other fields as well, then a solution using split, as has been suggested below, is a more efficient way of doing it. If you want just a few fields of a long CSV record (which this seems to be, only delimited by semicolons instead of commas), then you could also expand on the regexp above, which might be a bit more performant than split. But I didn't check that with benchmarks; it's just a hunch, and it depends heavily on the length of the input and the number of fields in it.
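
    For comparison, a rough sketch of the split variant. The helper name fields_of and the sample record are assumptions for illustration, and this naive version does not handle quoted fields.

```perl
use strict;
use warnings;

# Split a semicolon-delimited record into whitespace-trimmed fields.
# The -1 limit keeps trailing empty fields instead of dropping them.
sub fields_of {
    my ($line) = @_;
    my @fields = split /;/, $line, -1;
    for (@fields) { s/^\s+//; s/\s+$//; }
    return @fields;
}

# Hypothetical record, made up for illustration.
my @fields = fields_of('DR   Pfam; PF00069; Pkinase; 1.');
print "second field: $fields[1]\n";
```

    Whether this or the regexp is faster for a given input could be checked with the core Benchmark module; no timing claim is made here.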


Re^2: Regex Extraction Help
by Flexx (Pilgrim) on Aug 15, 2012 at 21:52 UTC
    « Flexx: why do you assume that the second field of an semicolon-seperated file is meant? »

    Umm.. Well because it looks like a CSV format? Experience seeing a bit of a problem and getting what the requirement is (a/k/a "getting 'all' the information from the customer" ;)?

    And it appeared that putting that first field in the regexp was more out of confusion as to how to "get to" the second field, something I often see when someone is learning regular expressions. Along with too much use of .* to pull in fields, BTW, when "not the separator" ([^;]*) is often more correct, or even needed. Things get worse once quoting has to be considered, of course.
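
    A small sketch of the .* problem, using a made-up record. Both patterns look like they grab "the field after the first semicolon", but only one of them does.

```perl
use strict;
use warnings;

# Hypothetical record, made up for illustration.
my $line = 'one;two;three;four';

# Greedy .* runs as far as it can, so the leading part spans several fields.
my ($greedy)  = $line =~ /^.*;(.*);/;
# [^;]* cannot cross a separator, so each piece stays within one field.
my ($bounded) = $line =~ /^[^;]*;([^;]*);/;

print "greedy:  $greedy\n";    # prints "greedy:  three"
print "bounded: $bounded\n";   # prints "bounded: two"
```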

    But yeah, it was just an educated guess.

    So long,