Match and Extract String with Regex

monkfan has asked for the wisdom of the Perl Monks concerning the following question:

Re: Match and Extract String with Regex by ikegami (Patriarch) on Nov 17, 2007 at 01:47 UTC
Are you asking to find the Longest Common Subsequence (LCS) of the strings? Searching for that should reveal nodes on the topic. Or if all you want to do is check if `$str1` is in `$str2`, then `index` will do the trick. `if (index($str2, $str1) >= 0) { ... }`
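For illustration, a minimal sketch of the `index` check. The helper name `contains` is hypothetical, and the sample strings are the ones quoted elsewhere in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): true if $needle occurs in $haystack.
sub contains {
    my ($haystack, $needle) = @_;
    # index returns the 0-based offset of the first occurrence, or -1 if absent
    return index($haystack, $needle) >= 0;
}

my $str1 = 'AT1G7. +126[0]';
my $str2 = 'AT1G7. +126[0]_|_chr1';

print "found at offset ", index($str2, $str1), "\n" if contains($str2, $str1);
```

Because `index` compares literal characters, the `.`, `+` and `[0]` in `$str1` need no escaping here.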
Re^2: Match and Extract String with Regex by erroneousBollock (Curate) on Nov 17, 2007 at 09:25 UTC
"Searching for (Longest Common Subsequence) should reveal nodes on the topic." ... only if they understand how/why Longest Common Subsequence is a generalisation of Longest Common Substring. While it's an interesting area in which to educate oneself, String::LCSS does exactly what you need. -David
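To make the distinction concrete: a common subsequence keeps characters in order but may skip over gaps, while a common substring must be contiguous. Below is a rough sketch of the standard dynamic programming computation of the subsequence length. The function name `lcs_length` and the sample strings are assumptions made for this illustration.

```perl
use strict;
use warnings;

# Sketch: length of the longest common subsequence of $s and $t
# (characters in the same order, gaps allowed). Keeps only one previous row.
sub lcs_length {
    my ($s, $t) = @_;
    my @S = split //, $s;
    my @T = split //, $t;
    my @prev = (0) x (@T + 1);
    for my $i (1 .. scalar @S) {
        my @cur = (0) x (@T + 1);
        for my $j (1 .. scalar @T) {
            $cur[$j] = $S[$i-1] eq $T[$j-1]
                ? $prev[$j-1] + 1                                     # extend a match
                : ($cur[$j-1] > $prev[$j] ? $cur[$j-1] : $prev[$j]);  # carry the best so far
        }
        @prev = @cur;
    }
    return $prev[-1];
}

print lcs_length('ABCBDAB', 'BDCABA'), "\n";   # prints 4 (e.g. "BCBA")
```

For the same pair of strings the longest common substring is only two characters long (for example "AB"), which is why the two problems should not be confused.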
Re^3: Match and Extract String with Regex by lima1 (Curate) on Nov 17, 2007 at 16:58 UTC
"String::LCSS does exactly what you need." I don't want to bash String::LCSS, but its implementation seems to be the naive O(n^3) algorithm instead of the O(mn) dynamic programming solution (http://en.wikipedia.org/wiki/Longest_common_substring_problem). A quick and dirty (not thoroughly tested, and probably buggy) implementation is much faster: `sub lcss2 { my ($s, $t) = @_; my $z = 0; my $m = length $s; my $n = length $t; my @S = (undef, split(//, $s)); my @T = (undef, split(//, $t)); my @L; my @ret; for my $i ( 1 .. $m ) { for my $j ( 1 .. $n ) { if ($S[$i] eq $T[$j]) { $L[$i-1][$j-1] ||= 0; $L[$i][$j] = $L[$i-1][$j-1] + 1; if ($L[$i][$j] > $z) { $z = $L[$i][$j]; @ret = (); } if ($L[$i][$j] == $z) { push @ret, substr($s, ($i-$z), $z); } } } } return @ret ? $ret[0] : ''; }` Benchmark: `use Benchmark qw(timethese); use String::LCSS; my $s1 = '6' x 200 . 'zyzxx'; my $s2 = '5' x 200 . 'abczyzefg'; my $count = 1; timethese($count, { 'String::LCSS' => sub { String::LCSS::lcss( $s1, $s2 ) }, 'dynprog' => sub { lcss2( $s1, $s2 ) }, });` Update: Took the opportunity to learn XS and wrote String::LCSS_XS.
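For readers who want the dynamic programming idea laid out line by line, here is a sketch under stated assumptions. It is not the String::LCSS or String::LCSS_XS implementation; the name `lcss_dp` is hypothetical, and on ties it simply returns the first longest match found in the first string.

```perl
use strict;
use warnings;

# Sketch: longest common substring by dynamic programming, O(m*n) time.
# Only the previous table row is kept, so memory is O(n).
# Returns the first longest match found in $s, or '' if there is none.
sub lcss_dp {
    my ($s, $t) = @_;
    my @S = split //, $s;
    my @T = split //, $t;
    my ($best_len, $best_end) = (0, 0);   # length and (exclusive) end offset in $s
    my @prev = (0) x (@T + 1);
    for my $i (1 .. scalar @S) {
        my @cur = (0) x (@T + 1);
        for my $j (1 .. scalar @T) {
            next unless $S[$i-1] eq $T[$j-1];
            # a match extends the common run that ended at ($i-1, $j-1)
            $cur[$j] = $prev[$j-1] + 1;
            if ($cur[$j] > $best_len) {
                ($best_len, $best_end) = ($cur[$j], $i);
            }
        }
        @prev = @cur;
    }
    return substr($s, $best_end - $best_len, $best_len);
}

print lcss_dp('6' x 200 . 'zyzxx', '5' x 200 . 'abczyzefg'), "\n";   # prints "zyz"
```

Keeping a single previous row instead of the full table trades the ability to list every longest match for lower memory use, which seemed a reasonable choice for a sketch.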
Re^4: Match and Extract String with Regex by erroneousBollock (Curate) on Nov 18, 2007 at 04:10 UTC
Re: Match and Extract String with Regex by mwah (Hermit) on Nov 17, 2007 at 09:00 UTC
If it's only that simple kind of problem that you mentioned (find one string in another string), then ikegami's solution (index) should suffice. If you need to do that by regex, try: `... my $str1 = 'AT1G7. +126[0]'; my $str2 = 'AT1G7. +126[0]_|_chr1'; ... (my $results) = $str2 =~ /\Q$str1/g; print $results if $results; ...` Regards mwa
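A small sketch of why the `\Q` matters here, with a hypothetical helper name `extract_literal`. Without quoting, the pattern `AT1G7. +126[0]` would treat `.` as "any character", `+` as a quantifier and `[0]` as a character class, so it would not match the literal text at all.

```perl
use strict;
use warnings;

# Hypothetical helper: return the literal occurrence of $needle in $haystack, or undef.
sub extract_literal {
    my ($haystack, $needle) = @_;
    # \Q...\E escapes regex metacharacters so that they match themselves
    my ($found) = $haystack =~ /(\Q$needle\E)/;
    return $found;
}

my $str1 = 'AT1G7. +126[0]';
my $str2 = 'AT1G7. +126[0]_|_chr1';

my $hit = extract_literal($str2, $str1);
print "$hit\n" if defined $hit;
```

Using an explicit capture group avoids relying on `/g` in list context to get the matched text back.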
Re: Match and Extract String with Regex by lima1 (Curate) on Nov 17, 2007 at 10:13 UTC
If you just want to extract the TAIR id, you could use this regex: `my ($tair_id) = $str2 =~ /(AT\dG\d{5})/i;` If you want to fetch the alternative splice form suffixes as well (e.g. AT1G71260.1), you could use: `my ($tair_id) = $str2 =~ /(AT\dG\d{5}(?:\.\d+)?)/i;`
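A quick way to try the second regex, as a sketch: the helper name `tair_id` and the sample inputs are made up for illustration and are not from the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the first TAIR-style locus id out of $text,
# including an optional splice form suffix such as ".1". Returns undef if none.
sub tair_id {
    my ($text) = @_;
    my ($id) = $text =~ /(AT\dG\d{5}(?:\.\d+)?)/i;
    return $id;
}

# Sample inputs are assumptions for this example
print tair_id('AT1G71260_|_chr1'), "\n";     # prints "AT1G71260"
print tair_id('AT1G71260.1_|_chr1'), "\n";   # prints "AT1G71260.1"
```

The `(?:...)?` group is non-capturing and optional, so the same pattern handles ids with and without the suffix.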

