PerlMonks  

You don't need to build a full-blown parser for a case like this. You're not going to get it with a simple regular expression search and replace, though. A very simple loop and split, with some thought about what you're actually doing, will help.

There are some problems inherent in marking up text that hasn't had proper markup maintained throughout its lifetime, though. In conversation just now, for example, you put a comma right after a valid URL. As written, the comma lands directly after the hostname, where it is clearly invalid, so we can separate it out if you only want to worry about the root path of the resource. However, a comma, semicolon, slash, equals sign, percent sign, period, plus sign, or question mark can be part of a URI (and therefore a URL) even though they are not part of a hostname. Then there are non-Latin paths. Don't even get me started on non-Latin domain names, because that's a whole book's worth of special considerations.

If you want to handle common URLs most of the time without getting tripped up by all the punctuation that can be part of a resource path, you can often get away with assuming the last character of a URL will be a letter, a number, or a slash. That doesn't validate what's between the start and the end, and there are cases in which a period, colon, equals sign, or question mark is the last character of a valid URL.
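As a rough sketch of that "ends in a letter, a number, or a slash" assumption, here is a small helper that splits a whitespace-delimited token into the likely URL and whatever punctuation trails it. The name split_trailing and the sample tokens are made up for illustration, and note that \w also lets an underscore through.

```perl
use strict;
use warnings;

# Hypothetical helper: assume a URL ends in a word character or a slash,
# and treat anything after that as punctuation belonging to the sentence.
sub split_trailing {
    my ($token) = @_;
    my $trailing = '';
    if ( $token =~ s{([^\w/]+)$}{} ) {
        $trailing = $1;    # what the substitution just removed
    }
    return ( $token, $trailing );
}

for my $token ( 'http://perlmonks.org,', 'http://perlmonks.org/' ) {
    my ( $url, $trailing ) = split_trailing($token);
    print "url=[$url] trailing=[$trailing]\n";
}
# prints:
# url=[http://perlmonks.org] trailing=[,]
# url=[http://perlmonks.org/] trailing=[]
```

This only decides where the URL probably stops. It makes no attempt to check that what remains is a valid URL.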

I'll skip most of the discussion of usernames and passwords in URLs, as well as protocol schemes and URNs. You probably shouldn't have those as results, and you probably don't want them easy to click if you do.

So, this leaves us with a fairly simple spec that is nonetheless more involved than a single search and replace. As with many things, with Perl's regular expressions you may find a way to shoehorn it into the substitution operator in just one statement, but that would be more complex and funky than breaking it into pieces.
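To give a feel for what the one-statement route might look like, here is a sketch using a single s///gex. The function name markup_line, the scheme list, and the search term perl are assumptions chosen for illustration, and this is one possible shape rather than a recommended one. It bolds every occurrence of the term outside a URL, but only the first occurrence in a link's visible text.

```perl
use strict;
use warnings;

my $match = qr{perl}i;    # assumed search term

# Hypothetical one-statement version: link URLs and bold the search term
# in a single substitution over the whole line.
sub markup_line {
    my ($line) = @_;
    $line =~ s{
        ( (?:http|ftp|telnet|chrome):// \S*? )  # a URL, matched lazily
        (?= [^\w/]* (?: \s | \z ) )             # stop before trailing punctuation
      |
        ( $match )                              # or the search term on its own
    }{
        if ( defined $1 ) {
            my $url = $1;
            ( my $label = $url ) =~ s{($match)}{<b>$1</b>};
            qq{<a href="$url">$label</a>};
        }
        else {
            "<b>$2</b>";
        }
    }gex;
    return $line;
}

print markup_line(q{go to http://perlmonks.org, it's great!}), "\n";
```

It fits in one statement, but the lookahead and the code block in the replacement take a good deal more squinting to follow than a plain loop does.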

So, do I have code for you? Well, it's not pretty...

    #!/usr/local/bin/perl
    use strict;
    use warnings;

    my $match = qr{perl}i;    # the search term to bold

    print "\n\n";

    while ( <> ) {
        chomp;
        my @words = split /\s+/;
        my $output = '';

        foreach my $word ( @words ) {
            if ( $word =~ m{http://} ) {
                my $before = my $after = '';

                # Peel off anything glued to the front of the URL, such as a quote.
                if ( $word !~ m{^(?:http://|telnet://|ftp://|chrome://)} ) {
                    $word =~ s{^(.*?)(http://|telnet://|ftp://|chrome://)}{$2};
                    $before = $1;
                }

                # Peel off trailing punctuation: assume a URL ends in a
                # word character or a slash.
                if ( $word =~ m{[^\w\d/]+$} ) {
                    $word =~ s{([^\w\d/]+)$}{};
                    $after = $1;
                }

                # Build the link first, then bold the term in the visible text only.
                my $str = '<a href="' . $word . '">';
                if ( $word =~ m{$match} ) {
                    $word =~ s{($match)}{<b>$1</b>};
                }
                $str .= $word . '</a>';
                $word = $before . $str . $after;
            }
            elsif ( $word =~ m{$match} ) {
                $word =~ s{($match)}{<b>$1</b>};
            }
            $output .= ' ' . $word;
        }
        print $output . "\n";
    }

    print "\n\n";

This code will take the following test file:

    I have a simple, site-specific search engine.
    It finds matches in a MySQL database using LIKE.
    Then it bolds the search terms before displaying the results to the user.
    And it also renders URLs clickable.
    So, say the user has searched for "perlmonks" and the page contains "you should all go to http://perlmonks.org, it's great!".
    We find the term in the database, we bold it, we render the URL clickable and display it to the user, but at this point, the HTML has become this:

    I have a simple, site-specific search engine.
    It finds matches in a MySQL database using LIKE.
    Then it bolds the search terms before displaying the results to the user.
    And it also renders URLs clickable.
    So, say the user has searched for "perlmonks" and the page contains "you should all go to http://perlmonks.org/?node_id=891739, it's great!".
    We find the term in the database, we bold it, we render the URL clickable and display it to the user, but at this point, the HTML has become this:

    I have a simple, site-specific search engine.
    It finds matches in a MySQL database using LIKE.
    Then it bolds the search terms before displaying the results to the user.
    And it also renders URLs clickable.
    So, say the user has searched for "perlmonks" and the page contains "you should all go to http://perlmonks.org/?node_id=;user=, it's great!".
    We find the term in the database, we bold it, we render the URL clickable and display it to the user, but at this point, the HTML has become this:

With this test file, you get this output:

    I have a simple, site-specific search engine.
    It finds matches in a MySQL database using LIKE.
    Then it bolds the search terms before displaying the results to the user.
    And it also renders URLs clickable.
    So, say the user has searched for "<b>perl</b>monks" and the page contains "you should all go to <a href="http://perlmonks.org">http://<b>perl</b>monks.org</a>, it's great!".
    We find the term in the database, we bold it, we render the URL clickable and display it to the user, but at this point, the HTML has become this:

    I have a simple, site-specific search engine.
    It finds matches in a MySQL database using LIKE.
    Then it bolds the search terms before displaying the results to the user.
    And it also renders URLs clickable.
    So, say the user has searched for "<b>perl</b>monks" and the page contains "you should all go to <a href="http://perlmonks.org/?node_id=891739">http://<b>perl</b>monks.org/?node_id=891739</a>, it's great!".
    We find the term in the database, we bold it, we render the URL clickable and display it to the user, but at this point, the HTML has become this:

    I have a simple, site-specific search engine.
    It finds matches in a MySQL database using LIKE.
    Then it bolds the search terms before displaying the results to the user.
    And it also renders URLs clickable.
    So, say the user has searched for "<b>perl</b>monks" and the page contains "you should all go to <a href="http://perlmonks.org/?node_id=;user">http://<b>perl</b>monks.org/?node_id=;user</a>=, it's great!".
    We find the term in the database, we bold it, we render the URL clickable and display it to the user, but at this point, the HTML has become this:

Notice the slight defect here? It's one of those edge cases. The equals sign at the very end of that last URL got left outside the link. Without the '=' to indicate a value for the key user, PerlMonks does this (non-shortcut links): http://perlmonks.org?node_id=6364;user but with the equals sign it does this: http://perlmonks.org?node_id=6364;user=. Again, notice a subtle difference? When there's an empty value, the form field is empty. When there's no value, there's a pre-populated monk nickname. It even seems to be randomly chosen for your (mild) amusement. That symbol is significant even at the end of a URL, despite the fact that it will rarely be the final character. Well, that and some quirky programming within PM in this instance, no doubt.
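If that trailing equals sign matters for your data, one possible tweak is to hand it back to the URL whenever the URL already contains a query string. This is only a sketch of that idea: the helper name split_trailing_keep_equals is made up, and the "has a question mark" test is an assumption that will misfire on text where an equals sign really does belong to the surrounding sentence.

```perl
use strict;
use warnings;

# Hypothetical helper: strip trailing punctuation as before, but if the
# URL has a query string and the stripped part begins with '=', put the
# '=' back so an empty value like "user=" survives.
sub split_trailing_keep_equals {
    my ($token) = @_;
    my $trailing = '';
    if ( $token =~ s{([^\w/]+)$}{} ) {
        $trailing = $1;
        if ( $token =~ m{\?} && $trailing =~ s{^=}{} ) {
            $token .= '=';
        }
    }
    return ( $token, $trailing );
}

my ( $url, $trailing )
    = split_trailing_keep_equals('http://perlmonks.org/?node_id=;user=,');
print "url=[$url] trailing=[$trailing]\n";
# prints: url=[http://perlmonks.org/?node_id=;user=] trailing=[,]
```

A URL without a query string, such as http://perlmonks.org followed by a comma, is handled exactly as before.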

Is this good enough? Well, maybe. Only you can tell, and probably only after some use against your data. It only took a couple of minutes to get it that far, though. For actually properly handling all the edge cases, the 80/20 rule in this case is probably more like the 99.999/0.001 rule.


In reply to Re: Bolding search terms ... which might be URLs? by mr_mischief
in thread Bolding search terms ... which might be URLs? by Cody Fendant
