Re: Separating multiple keyword search input

Having seen a couple of your posts, I'd like to make a suggestion before I get to my main response: you will have a lot more luck getting help if you would take the time to reduce your problem to simplest possible terms. For most questions related to language usage, it is rare that you can't illustrate the problem in less than 20 lines.

Think about it: the helpful monks have only so much time to offer. Given the choice between a small, succinct tidbit and a long, incoherent, out-of-context, "I'll just dump some code into a question and someone will be able to fully grasp it and tell me what is going wrong" question... which do you think they'll answer?

There are resources that describe how to ask a question so that it will get answered; consider availing yourself of them.

Anyway. Rant concluded.

So far as I can infer from your other postings, you have two sets of data ("normal" and "premium"); you want to accept keywords off a web form, search through those sets for those keywords, then display the results (keeping the results from the two sets distinct).

It further looks like that data might be large and stored in a MySQL database. The data is also apparently structured into tab-separated fields of: unknown, title, description, unknown, unknown, keywords.

Without knowing your entire system (and I don't want to learn it -- I just want you to think about it), we can start biting off chunks. The two phases of processing you show above are interpreting form parameters, then searching through some result lines (which it's not clear where you got them from.)

Finding matches from a file.

# find_hits $file_name, @keywords;
#
# Returns a list of "score: line" strinsg.  Example:
#
# If file "foo.txt" contains:
#
#   Larry Wall, Programming Perl, Reference, x, y, perl
#   Peter Scott, Perl Medic, Legacy herding, z, a, perl
#
# A call to this function might look like this:
#
#   my @hits = find_hits "foo.txt", "medic";
#
# And it would return a one-element list:
#
#  '32: Peter Scot...a, perl'

sub find_hits( $ @ )
{
    my ( $file, @keywords ) = @_;

    # compose a regex for quick rejection of non-matching lines:
    my $any_keyword_re = join '|', map quotemeta $_, @keywords;
    $any_keyword_re = qr/$any_keyword_re/i;

    # and a keyword for the whole phrase
    my $phrase_re = join '\W+', @keywords;
    $phrase_re = qr/$phrase_re/i;

    # open input file for read
    open my $in, $file or croak "opening $file for read: $!";

    my @rv; # for accumulating return values

    while ( <$in> )
    {
        # reject lines with no matches out of hand
        next unless m/$any_keyword_re/;

        # any match at all is one point.
        my $score = 1;

        # split into fields for further scoring.
        my ( undef, $title, $desc, undef, undef, $keys ) =
            split /\t/;

        # title matches are worth 5 points each
        while ( $title =~ m/$any_keyword_re/g ) { $score += 5 }

        # description matches are only 1 point
        while ( $desc =~ m/$any_keyword_re/g ) { $score += 1 }

        # keyword matches are 4 points
        while ( $keys =~ m/$any_keyword_re/g ) { $score += 4 }

        # phrase matches (against entire line) are 10 points
        while ( m/$phrase_re/g ) { $score += 10 }

        # multiple matches are worth 10x the number
        # of keywords that matched.
        my $n_matches = () = m/$any_keyword_re/g; # see perlfaq4
        if ( $n_matches > 1 ) { $score += 10*$n_matches }

        # finally, format $score and save for returning
        # to the caller
        push @rv, sprintf "%03d: %s", $score, $_;
    }

    return @rv;
}
[download]

Ok, so now we address your other main issue, that of doing a substitution in your template. Since we can have a variable number of responses to each, but you only have one template variable, I'll assume that we can wedge all our answers into that one spot. We'll do this by joining together all the hits we found.

# validate our keywords.
unless ( defined $fields{keywords} &&
         fields{keywords} ne '' )
{
    # complain, bail out of this run.
}

my @keywords = split ' ', $fields{keywords};

# find actual matches in highest- to lowest-score order
my @normal_hits  = reverse sort find_hits "normal_list.txt", @keywords
+;
my @premium_hits = reverse sort find_hits "premium_list.txt", @keyword
+s;

# keep lists reasonable
if ( @normal_hits  > 100 ) { splice @normal_hits,  100 }
if ( @premium_hits > 100 ) { splice @premium_hits, 100 }

# now join them together for presentation:
my $normal_hits  = join '<br />', @normal_hits;
my $premium_hits = join '<br />', @premium_hits;

# finally, do the now-obvious substitution:
$template =~ s/%%normalresults%%/$normal_hits/;
$template =~ s/%%premiumresults%%/$premium_hits/;
[download]

Hopefully this gets you closer. There are all sorts of places that this code might need tweaking: if there are possibly a huge number of hits, you can't store them all in memory, so you'll need to keep track of just the top 100 (or whatever) the whole way. Handling special characters in the return strings for printing to HTML. How to handle more advanced CSV files (e.g. double quotes protecting a comma that is in a field.)

None of these are insurmountable, but until and unless you develop the skill to break down problems into simpler, more digestable chunks (both for asking questions and for writing the solution in the first place), no amount of cookbookery will help you. Good luck.

Comment on Re: Separating multiple keyword search input Select or Download Code

Replies are listed 'Best First'.
Re: Re: Separating multiple keyword search input by Dente (Acolyte) on May 03, 2004 at 23:44 UTC
This definetly gets me closer to the solution. I apologize for submitting code from mid point as I thought that is where my problem was initially, however, upon further disection, I feel the problem needs to be addressed early on. Here is code I thought would have worked for me, kepping in mind that I want a site search engine returning the result-like appearance of google. `open (SIDX, "$data_dir/search.idx"); open (SIDX2, "$data_dir/search2.idx"); my @sidx = <SIDX>; my @sidx2 = <SIDX2>; @skeyw = grep {$_ ne ""} @skeyw; @premiumkeyw = grep {$_ ne ""} @premiumkeyw; my $regexp = join('\|', @skeyw, @premiumkeyw); $regexp = qr/$regexp/; (@skeyw) = split(/ /,$fields{'keywords'}); # SPLITTING SIDX FILE (@premiumskeyw) = split(/ /,$fields{'keywords'}); ## SPLITTING SIDX2 F +ILE $nrkeywords = push(@skeyw); $premiumnrkeywords = push(@premiumskeyw);` [download]	[reply] [d/l]
Re^3: Separating multiple keyword search input by tkil (Monk) on May 13, 2004 at 07:10 UTC
First, apologies for taking so long to get back to you on this. (And it took even longer, as my browser decided to nose dive after I wrote up this response the first time. Sigh.) Looking at this code, though, I again have to say that it is almost a complete non-sequitor to me. I think I understand your overall goals, but your actual program construction seems to be very haphazard. I can't figure out your larger issues without grasping your entire system (which I'm not equipped to do!), but here are a few comments on your "programming in the small": `open (SIDX, "$data_dir/search.idx"); open (SIDX2, "$data_dir/search2.idx"); my @sidx = <SIDX>; my @sidx2 = <SIDX2>;` [download] Assuming that these are what you're actually searching through, you probably don't want to load them into memory up front like this. Build up your comparison / scoring function, then apply it to each line of input, keeping only those that match to a certain level. (This all depends on what portion of the index you expect to return; in a typical situation, I would assume that only a small part of the index is relevant to any given query, so I would want to have only those values in memory.) `(@skeyw) = split(/ /,$fields{'keywords'}); # SPLITTING SIDX FILE (@premiumskeyw) = split(/ /,$fields{'keywords'}); ## SPLITTING SIDX2 F +ILE` [download] At the very least, the comments are grossly inaccurate. Also, you're splitting the same value into two different arrays. Finally, you are using a pattern which can result in the null strings that you later have to filter out; a better split pattern can fix that. The most trivial change is that simple hash keys don't need to be in quotes. All together, we can just say: `my @keywords = split ' ', $fields{keywords};` [download] The use of `' '` has special meaning to `split`: break it up on one or more whitespace characters, and ignore leading whitespace as well. This should guarantee that you have no null keywords in the array, making your checks unnecessary. (Side note: don't skimp on variable names. Use an editor that does expansions if you have to, but I personally find `@keywords` more evocative and self-descriptive than `@skeyw`. Same comment applies to filehandle names, and pretty much every other name in your program...) `$nrkeywords = push(@skeyw); $premiumnrkeywords = push(@premiumskeyw);` [download] Assuming you just want to get the number of elements in those arrays (which, as pointed out above, are the same array, so you don't need to do it twice!), you can simply evaluate the array in scalar context. The slightly long way of saying that is: `my $n_keywords = scalar @keywords;` [download] I say this is the long way, because the `scalar` is unnecessary. You're asasigning something into a scalar, and that puts that thing into a scalar context. So, we could have written: `my $n_keywords = @keywords;` [download] Either way, you certainly don't need `push`. Finally, now that you know there is only one set of keywords, you can construct the regex out of them. One point to consider is whether thes keywords are themselves regexes, or if they are just plain text. If the latter, we want to make sure we neutralize any characters that are special to the regex engine. The built-in function `quotemeta` is just the ticket. Putting it all together, we have: `my $regex = join '\|', map { quotemeta $_ } @keywords; $regex = qr/$regex/;` [download] I addressed this in more detail in my original response, but now you can basically do: `my @std_hits = grep { m/$regex/ } <STD_IDX>; my @prem_hits = grep { m/$regex/ } <PREM_IDX>;` [download] Hopefully this has helped you further. I do worry that there are archetectural issues that I am unable to address, and untill you get the chance to rethink what you are doing (as opposed to just making random changes and hoping for the best), you're not going to have much luck. Regardless, I hope it works out for you.	[reply] [d/l] [select]