Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Regexp string concat

by Murcia (Monk)
on Jun 14, 2004 at 15:05 UTC ( #366569=perlquestion: print w/ replies, xml ) Need Help??
Murcia has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

I have following problem ... and it is no HOMEWORK (I am 38 years old) eg:

my $string ="EICHENBAUMSCHULE"; my @query = qw(EIC BAUM UMS CHU LE);
1. task
find matches
2. task
make longest query_string out off single short overlapping 
(eg. BAUM and UMS) or touching (eg. CHU and LE) matches
e.g.

EICHENBAUMSCHULE
EIC
      BAUM
        UMS
           CHU
              LE
concatenate now
     BAUMSCHULE
result should be
@new_query = qw(EIC, BAUMSCHULE);

Thanks in advance

Murcia

Updated - Steve_p made changes to the formatting -- flux8 too.

Comment on Regexp string concat
Select or Download Code
Re: Regexp string concat
by Happy-the-monk (Monsignor) on Jun 14, 2004 at 15:15 UTC
    (please fix that missing closing </pre>-tag in your node. Thank you.)

    Sounds like homework to me. I don't think having your homework done by the community is the right way to get through school - or life for that matter.

    Tell us your thoughts on that matter and tell us your difficulties or problems, how you tried to solve them, where your insecurities are: we will help you along then.

    Cheers, Sören

      Sören, thanks about your comment ... I found a more or less working solution, but I do not like the code I wrote ... and do not want to annoy with it so I start just giving the task The real task is coming from a Genome project ... Displaying peptides in a protein sequence ...
Re: Regexp string concat
by davidj (Priest) on Jun 14, 2004 at 16:53 UTC
    questions regarding the resulting query_string:
    do the overlapping and/or touching matches have to be in the same order in which they are indicated in the @query array. That is, given:
    my $string ="EICHENBAUMSCHULE"; my @query = qw(EIC BAUM UMS CHU LE);
    would the following also match
    <BR> BAUMSLECHU # the CHU and LE are not in the same order as indicated in + @query

    davidj
Re: Regexp string concat
by BrowserUk (Pope) on Jun 14, 2004 at 17:00 UTC

    This doesn't handle any error conditions, like if the query contains non-matches, but it may give you a starting point.

    #! perl -slw use strict; sub display{ my $string = shift; return $string . $/, map{ ' ' x index( $string, $_ ) . "$_\n" } @_; } my $string ="EICHENBAUMSCHULE"; my @query = qw( EIC BAUM UMS CHU LE ); print display( $string, @query ); __END__ P:\test>366569 EICHENBAUMSCHULE EIC BAUM UMS CHU LE

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
      but this does not give me the longest concatenated string "BAUMSCHULE"
Re: Regexp string concat
by dragonchild (Archbishop) on Jun 14, 2004 at 17:16 UTC
    First - doesn't BioPerl have a few functions that will do this?

    As for code ... you still don't have all the requirements laid out yet. Specifically:

    • Do the matches have to be in the order specified? For example, what should @new_query contain if $string = 'EICHENLEBAUMSCHU'?
    • What happens if something in @query isn't there? For example, $string = 'XXX'.
    • What happens if there is more than one match? For example, $string = 'EICBAUMEIC'.
    • What happens if something isn't there and there's two matches on something else?
    • What happens if one element in @query fits into another element? For example, @query = qw( BAUM AU ).
    • What happens if there are two possible solutions? For example, $string = 'BAABAB' and @query = qw( AA BA ).

    I'm sure there are other possibilities that I didn't think of in the first five minutes of looking at the problem. Remember - if you can't explain the problem to a teddy bear, you can't explain the problem to a computer.

    ------
    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

    I shouldn't have to say this, but any code, unless otherwise stated, is untested

      • First - doesn't BioPerl have a few functions that will do this?
      • I don't think so.
      As for code ... you still don't have all the requirements laid out yet. Specifically:
      • * Do the matches have to be in the order specified? For example, what should @new_query contain if $string = 'EICHENLEBAUMSCHU'?
      • @new_query = qw( EIC, LEBAUMSCHU);
      • * What happens if something in @query isn't there? For example, $string = 'XXX'.
      • @new_query = @query
      • * What happens if there is more than one match? For example, $string = 'EICBAUMEIC'.
      • will not be the case
      • * What happens if something isn't there and there's two matches on something else?
      • remove double
      • * What happens if one element in @query fits into another element? For example, @query = qw( BAUM AU ).
      • @new_query = qw(BAUM)
      • * What happens if there are two possible solutions? For example, $string = 'BAABAB' and @query = qw( AA BA ).
      • will not be the case
      • I'm sure there are other possibilities that I didn't think of in the first five minutes of looking at the problem. Remember - if you can't explain the problem to a teddy bear, you can't explain the problem to a computer. >thanks for good comments Murcia
        Heh. Your requirements seemingly contract each other. Specifically, you state that @new_query = @query if a value in @query isn't matched. However, you say that if something isn't matched and something else is matched twice, remove the double. But, you say that there can never be a double match.

        In other words, it doesn't sound like you can explain the requirements to a teddybear. I would spend some time with pencil and paper before going much further. Then, when you have pencil and paper down, convert those into test cases before writing any code.

        ------
        We are the carpenters and bricklayers of the Information Age.

        Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

        I shouldn't have to say this, but any code, unless otherwise stated, is untested

Re: Regexp string concat
by CombatSquirrel (Hermit) on Jun 14, 2004 at 17:37 UTC
    I hacked this piece of code together. It may work for you or it may not. I think it is pretty straightforward, except for the two-level Schwartzian transform-like anonymous array nesting (Data::Dumper should help, though)...
    #!perl use strict; use warnings; my $string = "EICHENBAUMSCHULE"; my @query = qw/EIC BAUM UMS CHU LE/; @query = sort { $a->[1] <=> $b->[1] || $a->[2] <=> $b->[2] } grep { $_->[1] > -1 } map { [$_, index($string, $_), index($string, $_) + length($ +_) - 1] } @query; my @matches; @query or die "No matches"; my $i = -1; my $p = -2; for (@query) { if ($p + 1 < $_->[1]) { ++$i; } push @{$matches[$i]}, $_; $p = $_->[2]; } @matches = sort { $b->[0] <=> $a->[0] } map { [$_->[-1]->[2] - $_->[0]->[1] + 1, $_] } @matches; print "Longest continuous match has length $matches[0]->[0]\n\n"; print "$string\n"; for (@{$matches[0]->[1]}) { print ' ' x $_->[1] . $_->[0] . "\n"; }
    Hope this helped.
    CombatSquirrel.
    Entropy is the tendency of everything going to hell.
Re: Regexp string concat
by BrowserUk (Pope) on Jun 14, 2004 at 18:13 UTC

    Sorry for the false start, hopefully this does the job.

    #! perl -slw use strict; use List::Util qw[ reduce ]; sub longestComposite{ my( $string, $qRef ) = @_; my $copy = ' ' x length $string; $string =~ s[($_)]{ my $len = length $1; substr( $copy, pos( $string ), $len ) = $1; $1; }ge for @$qRef; return reduce{ length $a > length $b ? $a : $b } split ' ', $copy; } my $string ="EICHENBAUMSCHULE"; my @query = qw( EIC BAUM UMS CHU LE ); print longestComposite( $string, \@query ); __END__ P:\test>366569 BAUMSCHULE

    No promises about efficiency if you're doing this on large volumes of data.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
Re: Regexp string concat
by delirium (Chaplain) on Jun 14, 2004 at 21:32 UTC
    I think we can solve this without doing any fancy sorting or iterating. How about a simple regular expression approach using upper and lower case characters as our markers?

    #!/usr/bin/perl use strict; use warnings; $\ = $/; my $string = 'eichenbaumschule'; my @query = qw(baum ums eic chu le); $string =~ s/$_/uc $_/ieg for @query; @query = $string =~ /([A-Z]+)/g; print for @query; __OUTPUT__ EIC BAUMSCHULE
      For a slight speed boost, use compiled regular expressions:
      my $string = 'eichenbaumschule'; my @query = qw(baum ums eic chu le); @query = map { qr/($_)/i } @query; $string =~ s/$_/\U$1/g for @query;
      (I get ~6000 iterations/sec this way vs. ~3000 your way).

      Update: Just realized, this is only a factor if you do the same substitutions multiple times in the same program...if you're only doing this once per run, then you may as well do it delirium's way :-)

      This is really neat++.

      As the target is genome work which usually involves large volumes of big strings, using split '[a-z]+', string; to avoid the capture brackets will save a little time.

      sub delirium{ my( $string, $qRef ) = @_; $string =~ tr[A-Z][a-z]; $string =~ s/$_/uc $_/ieg for @$qRef; return reduce{ length $a > length $b ? $a : $b } split '[a-z]+',; }

      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail
        An optimisation I've thought of for this, even though the thread is long dead, is to lc the original string, and have all the strings to match as upper case, then all you need do is: $string =~ s/$_/$_/i for @$qRef; I'm sure the regex is considerably faster for having taken the executable bit out of it.
Re: Regexp string concat
by Jasper (Chaplain) on Jun 15, 2004 at 14:41 UTC
    Bit-wise oring is your friend. Although the answer comes out lower-cased :( - unless I change those pesky spaces to '@'s - which I just did.
    my $string ="EICHENBAUMSCHULE"; my @query = qw(EIC BAUM UMS CHU LE); my $matchline = '@' x length $string; for $match(@query) { my $spacer = index $string, $match; $matchline |= (('@' x $spacer) . $match) if $spacer > -1; } print for sort { length($b) <=> length($a) } $matchline =~ /[^@]+/g;

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://366569]
Approved by Happy-the-monk
Front-paged by delirium
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others wandering the Monastery: (4)
As of 2014-11-25 02:13 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My preferred Perl binaries come from:














    Results (148 votes), past polls