PerlMonks  

Look-behind regex to

by Kozz (Friar)
on Mar 21, 2002 at 22:21 UTC ( #153444=perlquestion )
Kozz has asked for the wisdom of the Perl Monks concerning the following question:

Most wise monks: I know that aspects of my questions have been touched on in various other nodes, but I've done my homework and node research, so please don't shred me to bits. I throw myself on the mercy of your monkship.

The book "Mastering Regular Expressions" is on my wishlist, but thus far I consider myself to be perhaps only an intermediate-skilled monk with regexes. My goal is to create a sub that would "hotlink" text containing http or mailto URIs. This has been done in another node, but not to the extent to which I'd like, and doesn't cover all cases. I've read up on perlman:perlre, Email::Valid as well, and saw the nightmare-ish super-long regex, which was far more complex than I was hoping for. The thing is that I don't really want/need to validate the domain and mx record for emails or whatever - just want to identify those things that LOOK LIKE URIs and hyperlink them accordingly. Perhaps there are other modules which could be used in conjunction with one another to identify potential URIs and deal with all the different formats/cases? See the code below.

#!/usr/bin/perl
use strict;

my $string = q{
An email address is foo-master@bar.com,
but this http://foo@bar.com/
and this ftp://foo:baz@bar.com/
are not emails.
};

my $desired_output = q{
An email address is <A HREF="mailto:foo-master@bar.com">foo-master@bar.com</A>,
but this <A HREF="http://foo@bar.com/">http://foo@bar.com/</A>
and this <A HREF="ftp://foo:baz@bar.com/">ftp://foo:baz@bar.com/</A>
are not emails.
};

foreach ($string) {
    # make http or ftp hyperlinks first
    s#((ht|f)tp://[\S]+)#<A HREF="$1">$1</A>#isg;

    # now make email addresses hyperlinks
    # but don't "email-ify" http or ftp URIs
    s#((?<!tp://)[a-z0-9\-\_\.]+\@[a-z0-9\-]+(\.[a-z0-9\-]+)+)#<A HREF="mailto:$&">$&</A>#isg;

    if ( m#(?<!tp://)[a-z0-9\-\_\.]+\@[a-z0-9\-]+(\.[a-z0-9\-]+)+#isg ) {
        print "Yes, it matches ($&).\n";
    } else {
        print "It does not match.\n";
    }
    print;
}
If you try the code, you'll see that the actual output of the above regex is:

Yes, it matches (foo-master@bar.com).

An email address is <A HREF="mailto:foo-master@bar.com">foo-master@bar.com</A>,
but this <A HREF="http://f<A HREF="mailto:oo@bar.com">oo@bar.com</A>/">http://f<A HREF="mailto:oo@bar.com">oo@bar.com</A>/</A>
and this <A HREF="ftp://foo:<A HREF="mailto:baz@bar.com">baz@bar.com</A>/">ftp://foo:<A HREF="mailto:baz@bar.com">baz@bar.com</A>/</A>
are not emails.
:-( I thought I could surely get this figured out with a moderately simple regex, but the test-cases I've used here have proved quite difficult. How close am I? Or am I going about it the wrong way, or can it not be done with these test cases? I thank you in advance for your consideration. --Kozz
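
A note on why the look-behind in the code above does not help: `(?<!tp://)` is only checked at the position where a match attempt starts. When the attempt at the `f` of `foo@bar.com` is rejected, the engine moves one character forward and matches `oo@bar.com` instead, which is what the actual output shows. One possible way around this, sketched here under assumptions, is to match URIs and email-looking strings in a single pass, so text already consumed as a URI is never scanned again. The sub name `hotlink` is made up for illustration, and the patterns are the loose ones from the question (non-capturing groups aside), not a validated URI or address grammar.

```perl
use strict;
use warnings;

# Hypothetical helper: link URIs and email-looking strings in one pass.
sub hotlink {
    my ($text) = @_;

    # Loose patterns adapted from the question; they only "look like" URIs/emails.
    my $uri   = qr{(?:ht|f)tp://\S+}i;
    my $email = qr{[a-z0-9\-_.]+\@[a-z0-9\-]+(?:\.[a-z0-9\-]+)+}i;

    # The URI alternative is tried first at each position. Whatever it
    # consumes is skipped by /g, so an email inside a URI is never re-matched.
    $text =~ s{($uri)|($email)}{
        defined($1) ? qq{<A HREF="$1">$1</A>}
                    : qq{<A HREF="mailto:$2">$2</A>}
    }ge;

    return $text;
}

my $string = q{An email address is foo-master@bar.com, but this http://foo@bar.com/ and this ftp://foo:baz@bar.com/ are not emails.};
print hotlink($string), "\n";
```

Because `\S+` is greedy, punctuation that directly follows a URI (a trailing comma, for instance) would be swallowed into the link; the sample text avoids that case.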

using Email::Find and URI::Find
by gav^ (Curate) on Mar 21, 2002 at 22:28 UTC
    If you look at Email::Find and URI::Find you'll find two modules that do the job, e.g.:
    use Email::Find;

    my $finder = Email::Find->new(
        sub {
            my($email, $orig_email) = @_;
            my($address) = $email->format;
            return qq|<a href="mailto:$address">$orig_email</a>|;
        },
    );
    $finder->find(\$text);
    and
    use URI::Find;

    find_uris($text, sub {
        my($uri, $orig_uri) = @_;
        return qq|<a href="$uri">$orig_uri</a>|;
    });
    Examples pulled from the docs. You might also find HTML::FromText handy for formatting text (and converting URLs etc).

    Hope this helps.

    gav^

      Thank you for the tips. Unfortunately, Email::Find chokes on those complex URIs containing usernames. Example output from that module:

      An email address is <a href="mailto:foo-master@bar.com">foo-master@bar.com</a>,
      but this http:<a href="mailto://foo@bar.com">//foo@bar.com</a>/
      and this ftp://foo:<a href="mailto:baz@bar.com">baz@bar.com</a>/
      are not emails.
      I wonder if one would have to copy some Email::Find code and modify it with a negative zero-width look-behind for (ht|f)tp:// ?

      (p.s. my apologies for the unfinished node title! I got caught up with the code.)
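
On the look-behind idea: it can be made to work if it rejects every start position that sits in the middle of a token, rather than only the position right after `tp://`. A minimal sketch, assuming it is acceptable to skip addresses glued to a preceding `:` or `/` (so a literal `mailto:foo@bar.com` in the text would be left alone). The character class in the look-behind is an illustrative choice and is not taken from Email::Find.

```perl
use strict;
use warnings;

# Email-looking pattern from the original post, prefixed with a
# one-character look-behind: a match may not start right after an
# address character, '@', ':' or '/'.
my $email_re = qr{(?<![a-z0-9\-_.\@:/])[a-z0-9\-_.]+\@[a-z0-9\-]+(?:\.[a-z0-9\-]+)+}i;

my $text = q{An email address is foo-master@bar.com, but this http://foo@bar.com/ and this ftp://foo:baz@bar.com/ are not emails.};

# Collect every email-looking match; the user-info parts of the URIs are skipped.
my @found = $text =~ /($email_re)/g;
print "$_\n" for @found;
```

The engine can no longer slide forward to `oo@bar.com`, because that start position is preceded by `f`, which the look-behind forbids.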

        I wonder if one would have to copy some Email::Find code and modify it with a negative zero-width look-behind for (ht|f)tp:// ?

        It's even easier than that.

        The docs for Email::Find have a section entitled "SUBCLASSING" that explains how you can make your own version with a different regex or validation function. You can create a basic subclass that just defines a new regex (with spaces before and after, and/or enclosed in "<...>", whatever you want).

Re: Look-behind regex to
by erikharrison (Deacon) on Mar 21, 2002 at 22:49 UTC

    While you are working on this solution, it might be a good idea to remember that there are other valid protocols (little used, yes) you might want to watch for. Gopher is the only one to spring to mind at the moment, but I'd bet a *insert small amount of currency of choice here* that there are more.

    Cheers,
    Erik
      Good point. My code above doesn't cover nearly all the possibilities: http, https, ftp, gopher, ssh, telnet, finger, news, irc, and so many that are far more obscure or rarely used these days. I will likely expand it to include these options, despite the fact 99.9% of the end-users of the system I'm building will never use those "non-www" protocols. ;)
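
One way to expand the scheme coverage without hand-editing the regex each time is to build the alternation from a list. This is only a sketch: the `@schemes` list is an assumption drawn from the protocols named above, restricted to those normally written as `scheme://...`, so forms such as `news:comp.lang.perl` are not covered.

```perl
use strict;
use warnings;

# Schemes that are normally written as scheme://... (illustrative list).
my @schemes = qw(http https ftp gopher ssh telnet irc);

# Build one alternation from the list; quotemeta guards against
# regex metacharacters in any scheme name added later.
my $alt    = join '|', map { quotemeta($_) } @schemes;
my $uri_re = qr{\b(?:$alt)://\S+}i;

my $text  = q{See gopher://gopher.example.org/ or https://www.example.org/ for details.};
my @found = $text =~ /($uri_re)/g;
print "$_\n" for @found;
```

Adding a scheme then means adding one word to the list, and the same `$uri_re` can replace the `(ht|f)tp://` part of the original substitution.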
