Look-behind regex to

Kozz has asked for the wisdom of the Perl Monks concerning the following question:

Most wise monks: I know that aspects of my questions have been touched on in various other nodes, but I've done my homework and node research, so please don't shred me to bits. I throw myself on the mercy of your monkship.

The book "Mastering Regular Expressions" is on my wishlist, but thus far I consider myself to be perhaps only an intermediate-skilled monk with regexes. My goal is to create a sub that would "hotlink" text containing http or mailto URIs. This has been done in another node, but not to the extent to which I'd like, and doesn't cover all cases. I've read up on perlman:perlre, Email::Valid as well, and saw the nightmare-ish super-long regex, which was far more complex than I was hoping for. The thing is that I don't really want/need to validate the domain and mx record for emails or whatever - just want to identify those things that LOOK LIKE URIs and hyperlink them accordingly. Perhaps there are other modules which could be used in conjunction with one another to identify potential URIs and deal with all the different formats/cases? See the code below.

#!/usr/bin/perl

use strict;

my $string = q{
  An email address is foo-master@bar.com, but
  this http://foo@bar.com/ and
  this ftp://foo:baz@bar.com/ are
  not emails.
};

my $desired_output = q{
  An email address is <A HREF="mailto:foo-master@bar.com">foo-master@bar.com</A>, but
  this <A HREF="http://foo@bar.com/">http://foo@bar.com/</A> and
  this <A HREF="ftp://foo:baz@bar.com/">ftp://foo:baz@bar.com/</A> are
  not emails.
};

foreach($string){

     # make http or ftp hyperlinks first
     s#((ht|f)tp://[\S]+)#<A HREF="$1">$1</A>#isg;

     # now make email addresses hyperlinks
     # but don't "email-ify" http or ftp URIs

     s#((?<!tp://)[a-z0-9\-\_\.]+\@[a-z0-9\-]+(\.[a-z0-9\-]+)+)#<A HREF="mailto:$&">$&</A>#isg;

     if( m#(?<!tp://)[a-z0-9\-\_\.]+\@[a-z0-9\-]+(\.[a-z0-9\-]+)+#isg){
         print "Yes, it matches ($&).\n";
     }else{
         print "It does not match.\n";
     }

     print;

}

If you try the code, you'll see that the actual output of the above regex is

Yes, it matches (foo-master@bar.com).

  An email address is <A HREF="mailto:foo-master@bar.com">foo-master@bar.com</A>, but
  this <A HREF="http://f<A HREF="mailto:oo@bar.com">oo@bar.com</A>/">http://f<A HREF="mailto:oo@bar.com">oo@bar.com</A>/</A> and
  this <A HREF="ftp://foo:<A HREF="mailto:baz@bar.com">baz@bar.com</A>/">ftp://foo:<A HREF="mailto:baz@bar.com">baz@bar.com</A>/</A> are
  not emails.

:-( I thought I could surely get this figured out with a moderately simple regex, but the test-cases I've used here have proved quite difficult. How close am I? Or am I going about it the wrong way, or can it not be done with these test cases? I thank you in advance for your consideration. --Kozz
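
A note on why the look-behind alone cannot work here, as a minimal sketch in core Perl. The assertion `(?<!tp://)` is only checked at the position where a match attempt begins. When the attempt starting at the `f` of `foo@bar.com` is rejected, the engine retries one character later, where the five preceding characters are `p://f`, so the assertion passes. The helper name `first_email_like` is made up for illustration; the pattern is the one from the question with the redundant escapes dropped.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The email pattern from the question, look-behind included.
my $email_re = qr{(?<!tp://)[a-z0-9\-_.]+\@[a-z0-9\-]+(?:\.[a-z0-9\-]+)+}i;

# Hypothetical helper: return the first email-looking substring, or undef.
sub first_email_like {
    my ($text) = @_;
    return $text =~ /($email_re)/ ? $1 : undef;
}

for my $text ('http://foo@bar.com/', 'ftp://foo:baz@bar.com/') {
    my $hit = first_email_like($text);
    print "$text -> ", defined $hit ? $hit : '(no match)', "\n";
}
# Prints:
#   http://foo@bar.com/ -> oo@bar.com
#   ftp://foo:baz@bar.com/ -> baz@bar.com
```

The second case shows that even a stricter look-behind would not be enough: `baz` is preceded by `:`, not by `tp://`, so nothing about the start position reveals that it sits inside a URI.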


using Email::Find and URI::Find by gav^ (Curate) on Mar 21, 2002 at 22:28 UTC
If you look at Email::Find and URI::Find you'll find 2 modules that do the job, eg

    use Email::Find;
    my $finder = Email::Find->new(
        sub {
            my($email, $orig_email) = @_;
            my($address) = $email->format;
            return qq|<a href="mailto:$address">$orig_email</a>|;
        },
    );
    $finder->find(\$text);

and

    use URI::Find;
    find_uris($text, sub {
        my($uri, $orig_uri) = @_;
        return qq|<a href="$uri">$orig_uri</a>|;
    });

Examples pulled from the docs. You might also find HTML::FromText handy for formatting text (and converting URLs etc).

Hope this helps.

gav^
Re: using Email::Find and URI::Find by Kozz (Friar) on Mar 21, 2002 at 22:48 UTC
Thank you for the tips. The URI Unfortunately, Email::Find chokes on those complex URIs containing usernames in those cases. Example output from that module:

    An email address is <a href="mailto:foo-master@bar.com">foo-master@bar.com</a>, but
    this http:<a href="mailto://foo@bar.com">//foo@bar.com</a>/ and
    this ftp://foo:<a href="mailto:baz@bar.com">baz@bar.com</a>/ are
    not emails.

I wonder if one would have to copy some Email::Find code and modify it with a negative zero-width look-behind for (ht|f)tp:// ?

(p.s. my apologies for the unfinished node title! I got caught up with the code.)
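
An alternative to bolting a look-behind onto the email pattern, sketched here in core Perl only: match URIs and bare addresses in a single substitution, with the URI branch listed first. Because the URI branch consumes the whole `scheme://...` token, an address embedded in it is never offered to the email branch. The sub name `hotlink` and the two patterns are illustrative assumptions, not code from Email::Find or URI::Find, and like the original they take any run of non-whitespace after `://` as part of the URI.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: hotlink URIs and bare email addresses in one pass.
sub hotlink {
    my ($text) = @_;

    my $uri_re   = qr{(?:ht|f)tp://\S+}i;
    my $email_re = qr{[a-z0-9\-_.]+\@[a-z0-9\-]+(?:\.[a-z0-9\-]+)+}i;

    # At each position the URI branch is tried before the email branch,
    # and a successful match moves the search past the whole token.
    $text =~ s{($uri_re)|($email_re)}{
        defined $1 ? qq{<A HREF="$1">$1</A>}
                   : qq{<A HREF="mailto:$2">$2</A>}
    }ge;

    return $text;
}

print hotlink(q{
  An email address is foo-master@bar.com, but
  this http://foo@bar.com/ and
  this ftp://foo:baz@bar.com/ are
  not emails.
});
```

Trailing punctuation directly after a URI (for example a full stop at the end of a sentence) would still be swallowed by `\S+`; trimming it is left out of this sketch.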
Re: Re: using Email::Find and URI::Find by hossman (Prior) on Mar 21, 2002 at 22:59 UTC
"I wonder if one would have to copy some Email::Find code and modify it with a negative zero-width look-behind for (ht|f)tp:// ?"

It's even easier than that. The docs for Email::Find have a section entitled "SUBCLASSING" that explains how you can make your own version with a different regex or validation function. You can create a basic subclass that just defines a new regex (with spaces before and after, and/or enclosed in "<...>", whatever you want).
•Re: Re: using Email::Find and URI::Find by merlyn (Sage) on Mar 22, 2002 at 15:35 UTC
Not an answer, but I had to chuckle at the phrase: negative zero-width look-behind

-- Randal L. Schwartz, Perl hacker
Re: Look-behind regex to by erikharrison (Deacon) on Mar 21, 2002 at 22:49 UTC
While you are working on this solution, it might be a good idea to remember that there are other valid protocols (little used, yes) you might want to watch for. Gopher is the only one to spring to mind at the moment, but I'd bet <insert small amount of currency of choice here> that there are more.

Cheers,
Erik
URI protocols by Kozz (Friar) on Mar 21, 2002 at 23:03 UTC
Good point. My code above doesn't cover nearly all the possibilities: http, https, ftp, gopher, ssh, telnet, finger, news, irc, and so many that are far more obscure or rarely used these days. I will likely expand it to include these options, despite the fact 99.9% of the end-users of the system I'm building will never use those "non-www" protocols. ;)
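
One way the expansion could look, as a rough sketch: keep the schemes in a list and build the alternation from it, so adding a protocol is a one-word change. The names `@schemes`, `$uri_re` and `link_uris` are illustrative. The sketch assumes every scheme is written with `://`, which is not true of forms such as `news:comp.lang.perl`, so those would need separate handling.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Scheme list taken from the post above; extend as needed.
my @schemes = qw(http https ftp gopher ssh telnet finger news irc);

# Build "http|https|ftp|..." and require the "://" separator after it.
my $scheme_alt = join '|', map { quotemeta } @schemes;
my $uri_re     = qr{\b(?:$scheme_alt)://\S+}i;

# Hypothetical helper: wrap every matching URI in an anchor tag.
sub link_uris {
    my ($text) = @_;
    $text =~ s{($uri_re)}{<A HREF="$1">$1</A>}g;
    return $text;
}

print link_uris("See gopher://example.org/ or https://example.org/x\n");
```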

