PerlMonks  

Re: Bolding search terms ... which might be URLs?

by mr_mischief (Prior)
on Mar 11, 2011 at 06:07 UTC (#892590)


in reply to Bolding search terms ... which might be URLs?

You don't need to build a full-blown parser for a case like this. You're not going to get it with a simple regular expression search and replacement, though. A very simple loop and split, with some thought about what you're actually doing, will help.

There are some problems inherent in marking up text that hasn't had proper markup maintained throughout its lifetime, though. In conversation just now, for example, you put a comma right after a valid URL. Since the comma lands in the hostname as written and is clearly invalid there, we can separate it out if you only want to worry about the root path of the resource. However, a comma, semicolon, slash, equals sign, percent sign, period, plus sign, or question mark can be part of a URI (and therefore a URL) even though none of them can be part of a hostname. Then there are non-Latin paths. Don't even get me started on non-Latin domain names, because that's a whole book's worth of special considerations.

If you want to handle common URLs most of the time without getting tripped up by all the punctuation that can be part of a resource path, you can usually get away with assuming the last character of a URL will be a letter, a number, or a slash. That doesn't validate what's between the start and the end, and there are cases in which a period, colon, equals sign, or question mark is the last character of a valid URL.
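A minimal sketch of that heuristic on its own, assuming a hypothetical helper named split_trailing_punct (it is not part of the script further down). It treats anything after the last letter, digit, underscore, or slash as ordinary sentence punctuation:

```perl
use strict;
use warnings;

# Hypothetical helper: split a token into ( url, trailing punctuation ),
# assuming a URL ends in a letter, a digit, an underscore, or a slash.
sub split_trailing_punct {
    my ($token) = @_;
    my $trailing = '';
    if ( $token =~ s{([^\w/]+)\z}{} ) {
        $trailing = $1;    # what the substitution just removed
    }
    return ( $token, $trailing );
}

for my $token ( 'http://perlmonks.org,', 'http://perlmonks.org/?node_id=891739!".' ) {
    my ( $url, $rest ) = split_trailing_punct($token);
    print "url=[$url] rest=[$rest]\n";
}
```

This deliberately gets the rarer cases wrong: a URL that really does end in a period, colon, equals sign, or question mark loses that character.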

I'll skip most of the discussion of usernames and passwords in URLs in addition to the protocol scheme and URN. You probably shouldn't have those as results and you probably don't want them easy to click if you do.

So, this leaves us with a somewhat simple spec that is however more involved than a single search and replace. As with many things, with Perl's regular expressions you may find a way to shoehorn it into the substitution operator in just one statement, but that would be more complex and funky than breaking it into pieces.

So, do I have code for you? Well, it's not pretty...

    #!/usr/local/bin/perl
    use strict;
    use warnings;

    my $match  = qr{perl}i;                             # the search term to bold
    my $scheme = qr{(?:http|telnet|ftp|chrome)://};     # schemes we will link

    print "\n\n";
    while ( <> ) {
        chomp;
        my @words  = split /\s+/;
        my $output = '';
        foreach my $word ( @words ) {
            if ( $word =~ m{$scheme} ) {
                my $before = my $after = '';
                # Peel off anything glued to the front of the URL, such as a quote.
                if ( $word !~ m{^$scheme} ) {
                    $word =~ s{^(.*?)($scheme)}{$2};
                    $before = $1;
                }
                # Peel off trailing punctuation: anything but word characters or a slash.
                if ( $word =~ m{[^\w/]+$} ) {
                    $word =~ s{([^\w/]+)$}{};
                    $after = $1;
                }
                # The href gets the bare URL; only the link text gets the bolding.
                my $str = '<a href="' . $word . '">';
                if ( $word =~ m{$match} ) {
                    $word =~ s{($match)}{<b>$1</b>};
                }
                $str .= $word . '</a>';
                $word = $before . $str . $after;
            }
            elsif ( $word =~ m{$match} ) {
                $word =~ s{($match)}{<b>$1</b>};
            }
            $output .= ' ' . $word;
        }
        print $output . "\n";
    }
    print "\n\n";

This code will take the following test file:

    I have a simple, site-specific search engine.
    It finds matches in a MySQL database using LIKE.
    Then it bolds the search terms before displaying the results to the user.
    And it also renders URLs clickable.
    So, say the user has searched for "perlmonks" and the page contains "you should all go to http://perlmonks.org, it's great!".
    We find the term in the database, we bold it, we render the URL clickable and display it to the user, but at this point, the HTML has become this:
    I have a simple, site-specific search engine.
    It finds matches in a MySQL database using LIKE.
    Then it bolds the search terms before displaying the results to the user.
    And it also renders URLs clickable.
    So, say the user has searched for "perlmonks" and the page contains "you should all go to http://perlmonks.org/?node_id=891739, it's great!".
    We find the term in the database, we bold it, we render the URL clickable and display it to the user, but at this point, the HTML has become this:
    I have a simple, site-specific search engine.
    It finds matches in a MySQL database using LIKE.
    Then it bolds the search terms before displaying the results to the user.
    And it also renders URLs clickable.
    So, say the user has searched for "perlmonks" and the page contains "you should all go to http://perlmonks.org/?node_id=;user=, it's great!".
    We find the term in the database, we bold it, we render the URL clickable and display it to the user, but at this point, the HTML has become this:

With this test file, you get this output:

     I have a simple, site-specific search engine.
     It finds matches in a MySQL database using LIKE.
     Then it bolds the search terms before displaying the results to the user.
     And it also renders URLs clickable.
     So, say the user has searched for "<b>perl</b>monks" and the page contains "you should all go to <a href="http://perlmonks.org">http://<b>perl</b>monks.org</a>, it's great!".
     We find the term in the database, we bold it, we render the URL clickable and display it to the user, but at this point, the HTML has become this:
     I have a simple, site-specific search engine.
     It finds matches in a MySQL database using LIKE.
     Then it bolds the search terms before displaying the results to the user.
     And it also renders URLs clickable.
     So, say the user has searched for "<b>perl</b>monks" and the page contains "you should all go to <a href="http://perlmonks.org/?node_id=891739">http://<b>perl</b>monks.org/?node_id=891739</a>, it's great!".
     We find the term in the database, we bold it, we render the URL clickable and display it to the user, but at this point, the HTML has become this:
     I have a simple, site-specific search engine.
     It finds matches in a MySQL database using LIKE.
     Then it bolds the search terms before displaying the results to the user.
     And it also renders URLs clickable.
     So, say the user has searched for "<b>perl</b>monks" and the page contains "you should all go to <a href="http://perlmonks.org/?node_id=;user">http://<b>perl</b>monks.org/?node_id=;user</a>=, it's great!".
     We find the term in the database, we bold it, we render the URL clickable and display it to the user, but at this point, the HTML has become this:

Notice the slight defect here? It's one of those edge cases. There's an equals sign at the very end of that last URL. PerlMonks, without the '=' to indicate a value for the key user, does this (non-shortcut links): http://perlmonks.org?node_id=6364;user but with the equals it will do this: http://perlmonks.org?node_id=6364;user=. Again, notice a subtle difference? When there's an empty value, the form field is empty. When there's no value, there's a pre-populated monk nickname. It even seems to be randomly chosen for your (mild) amusement. This is due to the significance of that symbol even at the end of a URL despite the fact that it will rarely be the final character. Well, that and some quirky programming within PM in this instance no doubt.
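One way to soften this particular edge case, sketched under the assumption that a '=' directly after the URL should stay with it whenever the URL already contains a query string (a '?'). The helper name trim_url_tail is hypothetical, and the rule is a guess at what the writer meant, not something any RFC promises:

```perl
use strict;
use warnings;

# Hypothetical helper: strip trailing punctuation from a URL token, but put
# a leading '=' back onto the URL when the URL has a query string.
sub trim_url_tail {
    my ($token) = @_;
    my $tail = '';
    if ( $token =~ s{([^\w/]+)\z}{} ) {
        $tail = $1;
        # "user=" (empty value) differs from "user" (no value), so keep '='.
        if ( $token =~ m{\?} && $tail =~ s{\A=}{} ) {
            $token .= '=';
        }
    }
    return ( $token, $tail );
}

my ( $url, $tail ) = trim_url_tail('http://perlmonks.org/?node_id=;user=,');
print "$url\n";       # http://perlmonks.org/?node_id=;user=
print "[$tail]\n";    # [,]
```

It trades one mistake for another: a sentence that really does put an equals sign right after a URL with a query string will now have it swallowed into the link.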

Is this good enough? Well, maybe. Only you can tell, and probably only after some use against your data. It only took a couple of minutes to get it that far, though. For actually properly handling all the edge cases, the 80/20 rule in this case is probably more like the 99.999/0.001 rule.


Re^2: Bolding search terms ... which might be URLs?
by Anonymous Monk on Mar 11, 2011 at 08:33 UTC

      Thanks for the pointers. Those still cannot deal with every case properly for arbitrary text. It's not a matter of getting the code right. It's a matter of there being too little information in the arbitrary text to be sure how to mark it up.

      A valid URI can easily be formed with a comma, semicolon, colon, question mark, or period at the end of it. They are often not the URI intended, though, as people use English punctuation around their URIs without separating them. There are important differences between the URI with and without those characters in some cases.
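To make the ambiguity concrete, here is a small sketch that lists every reading of a token instead of picking one. The helper name candidate_uris and the example.com URL are assumptions made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: list every URI the writer might have meant, from the
# token as written down to the token with all trailing punctuation removed.
sub candidate_uris {
    my ($token) = @_;
    my @candidates = ($token);
    while ( $token =~ s{[,;:?.]\z}{} ) {
        push @candidates, $token;
    }
    return @candidates;
}

print "$_\n" for candidate_uris('http://example.com/a?b=c;.');
```

Each line printed is a valid URI in its own right, and nothing in the text says which one was intended.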

      The manual for the first one you list punts on non-Latin characters, too. Regexp::Common::URI::ftp's docs state that there's no well-defined standard across the RFCs for an FTP URI. You can get closer and closer, but you're just not going to get 100%. The only way to be sure you've marked something up entirely properly with URIs is to visit the URI and make sure the expected content is delivered.

      According to the RFCs, a URI such as http://foo.com does not necessarily even need to redirect to the resource http://foo.com/ if the owner of the site doesn't wish it to. You just can't be sure with arbitrary text and no markup that you are introducing links correctly all the time.

        Thanks very much indeed for that work, Mr Mischief. I appreciate it hugely. Sorry I haven't been back to this thread for a while. You've been really helpful. For what it's worth, my users are very unlikely to post edge-case URLs like the ones discussed here, or non-ASCII domain names.
