http://www.perlmonks.org?node_id=891739

Cody Fendant has asked for the wisdom of the Perl Monks concerning the following question:

I have a simple, site-specific search engine.

It finds matches in a MySQL database using LIKE.

Then it bolds the search terms before displaying the results to the user. And it also renders URLs clickable.

So, say the user has searched for "perlmonks" and the page contains "you should all go to http://perlmonks.org, it's great!".

We find the term in the database, we bold it, we render the URL clickable and display it to the user, but at this point, the HTML has become this:

you should all go to <a href="http://<b>perlmonks</b>.org">http://<b>perlmonks</b>.org</a>, it's great!

Which is of course an invalid URL. How would the Monks approach this problem? ...

Update: I've got some useful and some ... non-useful replies to this. Anonymous monk doesn't seem to understand the question at all. JavaFan doesn't either, although that's my fault because my code was wrong. I updated it.

ikegami and mr_mischief, thank you, but I guess it comes down to this: I can't do this with a simple regular expression, can I? I need some kind of parsing where the regular expression would only be applied to things not inside HTML brackets. Or can I do it with an evaluated RHS?
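
For what it's worth, the "evaluated RHS" idea can be sketched roughly as follows. This is only an illustration, not code from the thread: the helper name bold_outside_tags is made up, and it assumes the text has already been turned into simple HTML in which no attribute value contains a ">" character. The pattern matches either a whole tag or the search term; the /e replacement puts tags back untouched and wraps only the term.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): bold $term in $html while
# leaving anything inside <...> untouched.
# Assumption: no attribute value contains a '>' character.
sub bold_outside_tags {
    my ($html, $term) = @_;
    my $re = quotemeta $term;    # treat the search term as literal text

    # Alternative 1 swallows a complete tag and is put back unchanged.
    # Alternative 2 is the search term in ordinary text and gets wrapped.
    $html =~ s{(<[^>]*>)|($re)}{ defined $1 ? $1 : "<b>$2</b>" }gie;
    return $html;
}

my $html = q{go to <a href="http://perlmonks.org">http://perlmonks.org</a>, it's great!};
print bold_outside_tags($html, 'perlmonks'), "\n";
# prints: go to <a href="http://perlmonks.org">http://<b>perlmonks</b>.org</a>, it's great!
```

Because the tag alternative comes first and consumes the whole tag, the term inside the href attribute is never seen by the second alternative.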

Re: Bolding search terms ... which might be URLs?
by ikegami (Patriarch) on Mar 07, 2011 at 04:59 UTC

    Then it bolds the search terms before displaying the results to the user. And it also renders URLs clickable.

    Do it in the opposite order, and skip tags when deciding if something should be bolded or not.
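
One way to read this suggestion, as a rough sketch only: make the URLs clickable first, then split the result on tags and bold the term only in the pieces that are not tags. The helper names linkify and bold_text_only are invented for illustration, and the URL pattern is a deliberately naive assumption (http:// followed by non-whitespace, ending in a word character or slash).

```perl
use strict;
use warnings;

# Step 1 (hypothetical helper): wrap http:// URLs in anchors.
# Assumption: a URL runs up to the next whitespace and ends in a word
# character or a slash, so trailing punctuation stays outside the link.
sub linkify {
    my ($text) = @_;
    $text =~ s{(http://\S*[\w/])}{<a href="$1">$1</a>}g;
    return $text;
}

# Step 2 (hypothetical helper): bold the term, skipping tags.
sub bold_text_only {
    my ($html, $term) = @_;
    my $re = quotemeta $term;

    # The capturing group makes split return the tags as well, so the
    # list alternates: text, tag, text, tag, ...
    my @parts = split /(<[^>]*>)/, $html;
    for my $i ( 0 .. $#parts ) {
        next if $i % 2;    # odd indices hold the captured tags
        $parts[$i] =~ s{($re)}{<b>$1</b>}gi;
    }
    return join '', @parts;
}

my $text = q{you should all go to http://perlmonks.org, it's great!};
print bold_text_only( linkify($text), 'perlmonks' ), "\n";
# prints: you should all go to <a href="http://perlmonks.org">http://<b>perlmonks</b>.org</a>, it's great!
```

The href stays intact because the whole opening tag sits at an odd index and is never touched by the substitution.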

Re: Bolding search terms ... which might be URLs?
by mr_mischief (Monsignor) on Mar 07, 2011 at 03:45 UTC

    Make the link to the proper URL in the href field. Then bold the search terms in the URL and place that in the link text. Alternately, you could just put the same text (the URL) in both places and make the whole link bold. People shouldn't complain too much about that, but I'd prefer the former.
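
A minimal sketch of the first option, assuming the URL has already been isolated from the surrounding text and contains no characters that need HTML escaping. The function name anchor_for is hypothetical. The href gets the untouched URL; only a copy used as the link text is bolded.

```perl
use strict;
use warnings;

# Hypothetical helper: build an anchor whose href is the plain URL and
# whose visible text has the search term bolded.
# Assumption: $url needs no HTML escaping (no &, <, > or quotes).
sub anchor_for {
    my ($url, $term) = @_;
    my $re = quotemeta $term;

    # Work on a copy so the href keeps the original URL.
    ( my $label = $url ) =~ s{($re)}{<b>$1</b>}gi;
    return qq{<a href="$url">$label</a>};
}

print anchor_for( 'http://perlmonks.org', 'perlmonks' ), "\n";
# prints: <a href="http://perlmonks.org">http://<b>perlmonks</b>.org</a>
```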

Re: Bolding search terms ... which might be URLs?
by Anonymous Monk on Mar 07, 2011 at 08:07 UTC

Re: Bolding search terms ... which might be URLs?
by JavaFan (Canon) on Mar 07, 2011 at 13:12 UTC

    Uhm, http://<b>perlmonks</b>.org is displayed in HTML as http://perlmonks.org, which is a valid URL. Now, it's not "clickable", but then, http://perlmonks.org isn't marked up to be "clickable" either. For that, it needs to be wrapped into an anchor, like <a href="http://perlmonks.org">http://perlmonks.org</a>. But <a href="http://perlmonks.org">http://<b>perlmonks</b>.org</a> is valid HTML. And, in most browsers, "clickable": http://perlmonks.org.

    So, where exactly is your problem?

Re: Bolding search terms ... which might be URLs?
by mr_mischief (Monsignor) on Mar 11, 2011 at 06:07 UTC

    You don't need to build a full-blown parser for a case like this. You're not going to get it with a simple regular expression search and replacement, though. A very simple loop over a split of the text, with some thought about what you're actually doing, will help.

    There are some problems inherent in marking up text that hasn't had the proper markup maintained throughout its lifetime, though. In conversation just now, for example, you put a comma right after a valid URL. The way it's written, the comma would be part of the hostname, and since that's clearly invalid, we can separate it out if you only want to worry about the root path of the resource. However, a comma, semicolon, slash, equals sign, percent sign, period, plus, or question mark can be part of a URI (and therefore a URL) even though they are not part of a hostname. Then there are non-Latin paths. Don't even get me started on non-Latin domain names, because that's a whole book's worth of special considerations.

    If you want to handle common URLs most of the time without getting tripped up by all the punctuation that can be part of a resource path, you can usually get away with assuming the last character of a URL will be a letter, a number, or a slash. That doesn't validate what's between the start and the end, and there are cases in which a period, colon, equals sign, or question mark is the last character of a valid one.

    I'll skip most of the discussion of usernames and passwords in URLs in addition to the protocol scheme and URN. You probably shouldn't have those as results and you probably don't want them easy to click if you do.

    So, this leaves us with a somewhat simple spec that is however more involved than a single search and replace. As with many things, with Perl's regular expressions you may find a way to shoehorn it into the substitution operator in just one statement, but that would be more complex and funky than breaking it into pieces.

    So, do I have code for you? Well, it's not pretty...

    #!/usr/local/bin/perl
    use strict;
    use warnings;

    my $match = qr{perl}i;

    print "\n\n";
    while ( <> ) {
        chomp;
        my @words  = split /\s+/;
        my $output = '';
        foreach my $word ( @words ) {
            if ( $word =~ m{http://} ) {
                my $before = my $after = '';
                # Peel off anything glued to the front of the URL.
                if ( $word !~ m{^(?:http://|telnet://|ftp://|chrome://)} ) {
                    $word =~ s{^(.*?)(http://|telnet://|ftp://|chrome://)}{$2};
                    $before = $1;
                }
                # Peel off trailing characters that are not a letter, digit, or slash.
                if ( $word =~ m{[^\w\d/]+$} ) {
                    $word =~ s{([^\w\d/]+)$}{};
                    $after = $1;
                }
                # The href gets the clean URL; only the link text is bolded.
                my $str = '<a href="' . $word . '">';
                if ( $word =~ m{$match} ) {
                    $word =~ s{($match)}{<b>$1</b>};
                }
                $str .= $word . '</a>';
                $word = $before . $str . $after;
            }
            elsif ( $word =~ m{$match} ) {
                $word =~ s{($match)}{<b>$1</b>};
            }
            $output .= ' ' . $word;
        }
        print $output . "\n";
    }
    print "\n\n";

    This code will take the following test file:

    I have a simple, site-specific search engine.
    It finds matches in a MySQL database using LIKE.
    Then it bolds the search terms before displaying the results to the user. And it also renders URLs clickable.
    So, say the user has searched for "perlmonks" and the page contains "you should all go to http://perlmonks.org, it's great!".
    We find the term in the database, we bold it, we render the URL clickable and display it to the user, but at this point, the HTML has become this:
    I have a simple, site-specific search engine.
    It finds matches in a MySQL database using LIKE.
    Then it bolds the search terms before displaying the results to the user. And it also renders URLs clickable.
    So, say the user has searched for "perlmonks" and the page contains "you should all go to http://perlmonks.org/?node_id=891739, it's great!".
    We find the term in the database, we bold it, we render the URL clickable and display it to the user, but at this point, the HTML has become this:
    I have a simple, site-specific search engine.
    It finds matches in a MySQL database using LIKE.
    Then it bolds the search terms before displaying the results to the user. And it also renders URLs clickable.
    So, say the user has searched for "perlmonks" and the page contains "you should all go to http://perlmonks.org/?node_id=;user=, it's great!".
    We find the term in the database, we bold it, we render the URL clickable and display it to the user, but at this point, the HTML has become this:

    With this test file, you get this output:

    I have a simple, site-specific search engine.
    It finds matches in a MySQL database using LIKE.
    Then it bolds the search terms before displaying the results to the user. And it also renders URLs clickable.
    So, say the user has searched for "<b>perl</b>monks" and the page contains "you should all go to <a href="http://perlmonks.org">http://<b>perl</b>monks.org</a>, it's great!".
    We find the term in the database, we bold it, we render the URL clickable and display it to the user, but at this point, the HTML has become this:
    I have a simple, site-specific search engine.
    It finds matches in a MySQL database using LIKE.
    Then it bolds the search terms before displaying the results to the user. And it also renders URLs clickable.
    So, say the user has searched for "<b>perl</b>monks" and the page contains "you should all go to <a href="http://perlmonks.org/?node_id=891739">http://<b>perl</b>monks.org/?node_id=891739</a>, it's great!".
    We find the term in the database, we bold it, we render the URL clickable and display it to the user, but at this point, the HTML has become this:
    I have a simple, site-specific search engine.
    It finds matches in a MySQL database using LIKE.
    Then it bolds the search terms before displaying the results to the user. And it also renders URLs clickable.
    So, say the user has searched for "<b>perl</b>monks" and the page contains "you should all go to <a href="http://perlmonks.org/?node_id=;user">http://<b>perl</b>monks.org/?node_id=;user</a>=, it's great!".
    We find the term in the database, we bold it, we render the URL clickable and display it to the user, but at this point, the HTML has become this:

    Notice the slight defect here? It's one of those edge cases. There's an equals sign at the very end of that last URL. PerlMonks, without the '=' to indicate a value for the key user, does this (non-shortcut links): http://perlmonks.org?node_id=6364;user but with the equals it will do this: http://perlmonks.org?node_id=6364;user=. Again, notice a subtle difference? When there's an empty value, the form field is empty. When there's no value, there's a pre-populated monk nickname. It even seems to be randomly chosen for your (mild) amusement. This is due to the significance of that symbol even at the end of a URL despite the fact that it will rarely be the final character. Well, that and some quirky programming within PM in this instance no doubt.

    Is this good enough? Well, maybe. Only you can tell, and probably only after some use against your data. It only took a couple of minutes to get it that far, though. For actually properly handling all the edge cases, the 80/20 rule in this case is probably more like the 99.999/0.001 rule.

        Thanks for the pointers. Those still cannot deal with every case properly for arbitrary text. It's not a matter of getting the code right. It's a matter of there being too little information in the arbitrary text to be sure how to mark it up.

        A valid URI can easily be formed with a comma, semicolon, colon, question mark, or period at the end of it. They are often not the URI intended, though, as people use English punctuation around their URIs without separating them. There are important differences between the URI with and without those characters in some cases.

        The manual for the first one you list punts on non-Latin characters, too. Regexp::Common::URI::ftp's docs state that there's no well-defined standard across the RFCs for an FTP URI. You can get closer and closer, but you're just not going to get 100%. The only way to be sure you've marked something up entirely properly with URIs is to visit the URI and make sure the expected content is delivered.

        According to the RFCs, a URI such as http://foo.com does not necessarily even need to redirect to the resource http://foo.com/ if the owner of the site doesn't wish it to. You just can't be sure with arbitrary text and no markup that you are introducing links correctly all the time.