You don't need to build a full-blown parser for a case like this. You're not going to get it with a simple regular expression search and replacement, though. A very simple loop and split, with some thought about what you're actually doing, will get you most of the way.
There are some problems inherent in marking up text that hasn't had proper markup maintained throughout its lifetime, though. In conversation just now, for example, you put a comma right after a valid URL. Read literally, that comma is part of the hostname, which is clearly invalid, so we can separate it out if you only want to worry about the root path of the resource. However, a comma, semicolon, slash, equals sign, percent sign, period, plus, or question mark can be part of a URI (and therefore a URL) even though none of them can be part of a hostname. Then there are non-Latin paths. Don't even get me started on non-Latin domain names, because that's a whole book's worth of special considerations.
If you want to handle common URLs most of the time without getting tripped up by all the punctuation that can be part of a resource path, you can usually get away with assuming the last character of a URL will be a letter, a number, or a slash. That doesn't validate what's between the start and the end, and there are cases in which a period, colon, equals sign, or question mark is the last character of a valid URL.
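To make that assumption concrete, here is a minimal sketch of just the trimming step. The helper name split_trailing is made up for this illustration, and it treats an underscore as a valid last character too, since \w includes it.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): split a whitespace-delimited
# token into the part assumed to be the URL and any trailing punctuation.
# Assumption: a URL ends in a letter, digit, underscore, or slash.
sub split_trailing {
    my ($token) = @_;
    if ( $token =~ m{\A(.*[\w/])([^\w/]*)\z}s ) {
        return ( $1, $2 );
    }
    return ( '', $token );    # nothing URL-like in the token at all
}

for my $token ( 'http://perlmonks.org,',
                'http://perlmonks.org/?node_id=891739,' ) {
    my ( $url, $rest ) = split_trailing($token);
    print "url: $url   trailing: $rest\n";
}
```

Punctuation in the middle of the token is left alone; only what follows the last word character or slash gets peeled off.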
I'll skip most of the discussion of usernames and passwords in URLs, as well as protocol schemes and URNs. You probably shouldn't have those as results, and you probably don't want them easy to click if you do.
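If you do want to keep such URLs from becoming clickable, one possible check is sketched below. The helper name has_userinfo and the example.com address are invented for the illustration. It only looks for an "@" in the authority part, that is, between "://" and the first slash, question mark, or hash.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the authority part of a URL (everything
# between "://" and the first "/", "?", or "#") contains an "@", which
# means a username, and possibly a password, is embedded in it.
sub has_userinfo {
    my ($url) = @_;
    return $url =~ m{\A[a-z][a-z0-9+.-]*://[^/?#\s]*@}i ? 1 : 0;
}

for my $url ( 'http://perlmonks.org/',
              'ftp://user:secret@example.com/pub' ) {
    printf "%-35s %s\n", $url,
        has_userinfo($url) ? 'leave as plain text' : 'ok to link';
}
```

An "@" later in the path or query string does not trigger it, because the character class stops at the first slash, question mark, or hash.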
So, this leaves us with a fairly simple spec that is nonetheless more involved than a single search and replace. As with many things in Perl, you may find a way to shoehorn it into the substitution operator in just one statement, but that would be more complex and funky than breaking it into pieces.
So, do I have code for you? Well, it's not pretty...
#!/usr/local/bin/perl
use strict;
use warnings;

my $match  = qr{perl}i;                            # search term to bold
my $scheme = qr{(?:http|telnet|ftp|chrome)://};    # schemes worth linking

print "\n\n";
while ( <> ) {
    chomp;
    my @words  = split /\s+/;
    my $output = '';
    foreach my $word ( @words ) {
        if ( $word =~ m{$scheme} ) {
            my $before = my $after = '';
            # Peel off anything glued to the front, such as an opening quote.
            if ( $word !~ m{^$scheme} ) {
                $word =~ s{^(.*?)($scheme)}{$2};
                $before = $1;
            }
            # Peel off trailing punctuation: the URL is assumed to end
            # in a letter, digit, underscore, or slash.
            if ( $word =~ m{[^\w/]+$} ) {
                $word =~ s{([^\w/]+)$}{};
                $after = $1;
            }
            # Build the href before bolding so no markup lands inside it.
            my $str = '<a href="' . $word . '">';
            if ( $word =~ m{$match} ) {
                $word =~ s{($match)}{<b>$1</b>};
            }
            $str .= $word . '</a>';
            $word = $before . $str . $after;
        } elsif ( $word =~ m{$match} ) {
            $word =~ s{($match)}{<b>$1</b>};
        }
        $output .= ' ' . $word;
    }
    print $output . "\n";
}
print "\n\n";
This code will take the following test file:
I have a simple, site-specific search engine.
It finds matches in a MySQL database using LIKE.
Then it bolds the search terms before displaying the results to the user. And it also renders URLs clickable.
So, say the user has searched for "perlmonks" and the page contains "you should all go to http://perlmonks.org, it's great!".
We find the term in the database, we bold it, we render the URL clickable and display it to the user, but at this point, the HTML has become this:
I have a simple, site-specific search engine.
It finds matches in a MySQL database using LIKE.
Then it bolds the search terms before displaying the results to the user. And it also renders URLs clickable.
So, say the user has searched for "perlmonks" and the page contains "you should all go to http://perlmonks.org/?node_id=891739, it's great!".
We find the term in the database, we bold it, we render the URL clickable and display it to the user, but at this point, the HTML has become this:
I have a simple, site-specific search engine.
It finds matches in a MySQL database using LIKE.
Then it bolds the search terms before displaying the results to the user. And it also renders URLs clickable.
So, say the user has searched for "perlmonks" and the page contains "you should all go to http://perlmonks.org/?node_id=;user=, it's great!".
We find the term in the database, we bold it, we render the URL clickable and display it to the user, but at this point, the HTML has become this:
With this test file, you get this output:
I have a simple, site-specific search engine.
It finds matches in a MySQL database using LIKE.
Then it bolds the search terms before displaying the results to the user. And it also renders URLs clickable.
So, say the user has searched for "<b>perl</b>monks" and the page contains "you should all go to <a href="http://perlmonks.org">http://<b>perl</b>monks.org</a>, it's great!".
We find the term in the database, we bold it, we render the URL clickable and display it to the user, but at this point, the HTML has become this:
I have a simple, site-specific search engine.
It finds matches in a MySQL database using LIKE.
Then it bolds the search terms before displaying the results to the user. And it also renders URLs clickable.
So, say the user has searched for "<b>perl</b>monks" and the page contains "you should all go to <a href="http://perlmonks.org/?node_id=891739">http://<b>perl</b>monks.org/?node_id=891739</a>, it's great!".
We find the term in the database, we bold it, we render the URL clickable and display it to the user, but at this point, the HTML has become this:
I have a simple, site-specific search engine.
It finds matches in a MySQL database using LIKE.
Then it bolds the search terms before displaying the results to the user. And it also renders URLs clickable.
So, say the user has searched for "<b>perl</b>monks" and the page contains "you should all go to <a href="http://perlmonks.org/?node_id=;user">http://<b>perl</b>monks.org/?node_id=;user</a>=, it's great!".
We find the term in the database, we bold it, we render the URL clickable and display it to the user, but at this point, the HTML has become this:
Notice the slight defect here? It's one of those edge cases. There's an equals sign at the very end of that last URL, and it has been left outside the link. PerlMonks, without the '=' to indicate a value for the key user, does this (non-shortcut links): http://perlmonks.org?node_id=6364;user but with the equals sign it does this instead: http://perlmonks.org?node_id=6364;user= Again, notice a subtle difference? When there's an empty value, the form field is empty. When there's no value, there's a pre-populated monk nickname. It even seems to be randomly chosen for your (mild) amusement. That symbol is significant even at the end of a URL, although it will rarely be the final character. Well, that and some quirky programming within PM in this instance, no doubt.
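If that particular edge case matters for your data, one possible tweak is to hand a trailing '=' back to the URL whenever the URL already contains a query string. This is only a sketch under that assumption, with a made-up helper name, and it trades one edge case for another: text that really does follow such a URL with an equals sign would now be pulled into the link.

```perl
use strict;
use warnings;

# Hypothetical variant of the trailing-punctuation trim: same assumption
# as before (a URL ends in a letter, digit, underscore, or slash), except
# that a '=' immediately after a URL containing '?' is kept in the URL.
sub split_keep_equals {
    my ($token) = @_;
    my ( $url, $rest ) = ( '', $token );
    if ( $token =~ m{\A(.*[\w/])([^\w/]*)\z}s ) {
        ( $url, $rest ) = ( $1, $2 );
    }
    # Only give the '=' back when there is a query string for it to belong to.
    if ( $url =~ /\?/ && $rest =~ s/\A=// ) {
        $url .= '=';
    }
    return ( $url, $rest );
}

my ( $url, $rest ) = split_keep_equals('http://perlmonks.org/?node_id=;user=,');
print "url: $url   trailing: $rest\n";
```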
Is this good enough? Well, maybe. Only you can tell, and probably only after some use against your data. It only took a couple of minutes to get it that far, though. For actually properly handling all the edge cases, the 80/20 rule in this case is probably more like the 99.999/0.001 rule.