Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change

Match and Extract String with Regex

by monkfan (Curate)
on Nov 17, 2007 at 01:39 UTC ( #651361=perlquestion: print w/replies, xml ) Need Help??
monkfan has asked for the wisdom of the Perl Monks concerning the following question:

Dear fellow monks,
I have these two strings:
my $str1 = 'AT1G71260'; my $str2 = 'AT1G71260_|_chr1';
How can I extract element of $str2 that match with $str1, such that in the end the extraction just give this as results:
my $results = 'AT1G71260';
Note that these two strings can contain any characters (even whitespace, semicolon(;), etc) and of any lengths.


Replies are listed 'Best First'.
Re: Match and Extract String with Regex
by ikegami (Pope) on Nov 17, 2007 at 01:47 UTC

    Are you asking to find the Longest Common Subsequence (LCS) of the strings? Searching for that should reveal nodes on the topic.

    Or if all you want to do is check if $str1 is in $str2, then index will do the trick.

    if (index($str2, $str1) >= 0) { ... }
      Searching for (Longest Common Subsequence) should reveal nodes on the topic.
      ... only if they understand how/why Longest Common Subsequence is a generalisation of Longest Common Substring.

      While it's an interesting area in which to educate oneself, String::LCSS does exactly what you need.


        String::LCSS does exactly what you need.

        I don't want to bash String::LCSS, but the implementation seems to be the naive O(n^3) algorithm instead of the O(mn) dynamic programming solution ( A quick and dirty (and not thoroughly tested) implementation is much faster (although probably buggy).

        sub lcss2 { my ($s, $t) = @_; my $z = 0; my $m = length $s; my $n = length $t; my @S = (undef, split(//, $s)); my @T = (undef, split(//, $t)); my @L; my @ret; for my $i ( 1 .. $m ) { for my $j ( 1 .. $n ) { if ($S[$i] eq $T[$j]) { $L[$i-1][$j-1] ||= 0; $L[$i][$j] = $L[$i-1][$j-1] + 1; if ($L[$i][$j] > $z) { $z = $L[$i][$j]; @ret = (); } if ($L[$i][$j] == $z) { push @ret,substr($s, ($i-$z), $z); } } } } # warn Dumper \@L; return join '*', @ret; }
        my $s1 = '6'x 200 . 'zyzxx'; my $s2 = '5'x 200 . 'abczyzefg'; my $count = 1; timethese($count, { 'String::LCSS' => sub { String::LCSS::lcss( $s1, $s2 ) }, 'dynprog' => sub { lcss2( $s1, $s2 )}, });
        Update: Took the opportunity to learn XS and wrote String::LCSS_XS.
Re: Match and Extract String with Regex
by mwah (Hermit) on Nov 17, 2007 at 09:00 UTC

    If it's only that simple kind of problem that you mentioned (find one string in another string), then Ikegami's solution (index) should suffice. If you need to do that by regex, try:

    ... my $str1 = 'AT1G7. +126[0]*'; my $str2 = 'AT1G7. +126[0]*_|_chr1'; ... (my $results) = $str2 =~ /\Q$str1/g; print $results if $results; ...



Re: Match and Extract String with Regex
by lima1 (Curate) on Nov 17, 2007 at 10:13 UTC
    If you just want to extract the TAIR id, you could use this regex:
    my ($tair_id) = $str2 =~ /(AT\dG\d{5})/i;
    If you want to fetch the alternative splice form suffixes as well (e.g. AT1G71260.1), you could use:
    my ($tair_id) = $str2 =~ /(AT\dG\d{5}(?:\.\d+)?)/i;

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://651361]
Approved by ikegami
[Corion]: choroba: I use spod5, which also has that support, and also implements its own kinda-make stuff
[haukex]: But that module I just linked to assumes that most verbatim blocks are runnable code, I have other modules where that's not the case, so there I just copy-and-paste the synopsis into the author tests...
[haukex]: not the most efficient, but then again, I don't have that many modules on CPAN :-)
[Corion]: haukex: Yes, but if it's only supposed to run on my machine, I can be far more liberal with how I extract the code etc.
[Corion]: haukex: Yes - I see the benefit of using Dist::Zilla for people with 150+ modules on CPAN, but I don't see it for myself, and I'm always put off from contributing to such modules because they require a lot of toolchain setup that I don't want to ...
[Corion]: ... spend time on if I only want to provide a short patch
[haukex]: Corion: Yes exactly, in the author tests I don't worry about portability as much, I also don't list the author tests' dependencies in Makefile.PL
[haukex]: I figure someone who wants to contribute will know how to install the missing modules ;-) Not the nicest way to go but I don't think many people are using my modules yet

How do I use this? | Other CB clients
Other Users?
Others cooling their heels in the Monastery: (11)
As of 2017-02-27 12:37 GMT
Find Nodes?
    Voting Booth?
    Before electricity was invented, what was the Electric Eel called?

    Results (385 votes). Check out past polls.