No-Lifer has asked for the wisdom of the Perl Monks concerning the following question:
Dear all,
Yes, my search script is giving me gyp again. It's actually because I don't know enough about what can and can't be done. I'm usually OK given a bit of help to get started, so I think a push in the right direction is required!
My search script is a little too "rigid". If I search for "regular expression", it searches the documents on my site for the EXACT string, "regular expression" (both the words appearing on the page next to each other).
I wish to make it more flexible - allowing it to find, say, instances of "regular" AND "expression", not necessarily beside each other on the page.
My form code is pretty much standard -
if ($ENV{'REQUEST_METHOD'} eq 'POST') {
    read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});
    @pairs = split(/&/, $buffer);
    foreach $pair (@pairs) {
        ($name, $value) = split(/=/, $pair);
        $value =~ tr/+/ /;
        $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
        $value =~ s/["]//gi;
        $value =~ s/[+]/ /gi;
        $FORM{$name} = $value;
    }
}
$keyword=$FORM{keyword};
How would I go about creating an "AND" sort of search? I'm assuming it's something to do with what I've got above.
Many thanks (again)!
Re: Form validation/Search script
by JediWizard (Deacon) on Oct 24, 2005 at 18:27 UTC
I highly recommend use CGI; as it will (among other things) make reading your parameters much easier:
use CGI qw(:standard);
my %params = ();
foreach my $param (param()) {
    $params{$param} = param($param);
}
It just makes life easier for us all. See CGI.
They say that time changes things, but you actually have to change them yourself. Andy Warhol
Re: Form validation/Search script
by ikegami (Patriarch) on Oct 24, 2005 at 18:37 UTC
To do a regexp search for "foo" and "bar", in any order, any distance from each other, one would use ^(?=.*foo)(?=.*bar). In the following snippet, a regexp of that form is constructed dynamically:
my $text = 'Perl is a general-purpose programming language originally developed for text manipulation and now used for a wide range of tasks including system administration, web development, network programming, GUI development, and more.';
# Text source: perlintro [http://perldoc.perl.org/perlintro.html]
foreach my $keywords (
    'Perl development',
    'Perl sucks',
) {
    # One lookahead per keyword, all anchored at the start of the text
    my $re = '^' .
        join '',
        map { "(?=.*\\b$_\\b)" }
        map quotemeta,
        split ' ',
        $keywords;
    print("$re\n");
    print("$keywords: ", $text =~ /$re/ ? "match" : "no match", "\n");
}
outputs
^(?=.*\bPerl\b)(?=.*\bdevelopment\b)
Perl development: match
^(?=.*\bPerl\b)(?=.*\bsucks\b)
Perl sucks: no match
Note: The use of \b is questionable. What if the keywords start or end with characters that don't match \w? This issue is left unresolved.
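One possible way around the \b problem, offered only as a sketch and not as part of the original reply: replace each \b with a lookaround that says "no word character on this side". The helper name build_and_regex and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# Build an "all keywords, any order" pattern. Instead of \b, each keyword is
# guarded by (?<!\w) and (?!\w): not preceded and not followed by a word
# character. This still works when a keyword starts or ends with punctuation.
sub build_and_regex {
    my ($keywords) = @_;
    my $re = '^' . join '',
        map { "(?=.*(?<!\\w)$_(?!\\w))" }
        map { quotemeta }
        split ' ', $keywords;
    return qr/$re/s;    # /s lets .* cross line breaks
}

my $text = 'I like C++ and Perl.';
for my $keywords ('C++ Perl', 'lik Perl') {
    my $re = build_and_regex($keywords);
    print "$keywords: ", ($text =~ $re ? "match" : "no match"), "\n";
}
```

With \b, the keyword "C++" would fail here because there is no word boundary between "+" and the following space. Whether "C" alone should also match inside "C++" is a policy decision this sketch does not settle.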
By the way, you should use the core module CGI instead of handling the CGI request in your own code. It's much more reliable and maintainable. You should also check whether a module already does what I just coded.
Re: Form validation/Search script
by Limbic~Region (Chancellor) on Oct 24, 2005 at 18:49 UTC
No-Lifer,
This is the first node I have seen you write on this topic, so I am unfamiliar with the history. It sounds like you are building a rudimentary search engine. It also sounds like you want to do this on your own instead of using a pre-built wheel.
There is nothing wrong with this approach in general, but it is also sometimes useful to learn about existing technology:
Limbic-Region,
Cheers for the reply. As to what I'm doing, you're spot on. It's actually a bit of coursework I'm working on for University. I have a mandatory "Introduction to perl/cgi" class which I'm sucking at, but I'm determined to get it finished quite soon.
So, yes, I'm building a very simple search engine to go through a few pages (try www.ally.nu - searching for "perl"). I've got a few bits and bobs working, thanks to the other Monks here, and nearly have an application I could submit. Bearing in mind that they're not expecting miracles from us - we're not programming students!
I'm trying to do it in the most straightforward way possible - this is my first experience with perl *shudder*.
Two things are stumping me at the moment: first, how to display an "error" page if "submit" is pressed without any form data; and second, the question above, an "AND" type search.
My full code is below - I know it's a complete mess, will tidy it up at the end! Thank goodness we're not being assessed on code pretty-ness, purely on search engine function!
#!/usr/bin/perl -w
# The following code deals with the form data
if ($ENV{'REQUEST_METHOD'} eq 'POST') {
    read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});
    @pairs = split(/&/, $buffer);
    foreach $pair (@pairs) {
        ($name, $value) = split(/=/, $pair);
        $value =~ tr/+/ /;
        $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
        $value =~ s/["]//gi;
        $value =~ s/[+]/ /gi;
        $FORM{$name} = $value;
    }
}
$keyword=$FORM{keyword};
chdir("/home/1008/gnicoll/www.abernyte.net/public_html");
opendir(DIR, ".");
print "Content-type: text/html\n\n";
print"<STYLE>";
print"BODY {FONT-FAMILY: arial,sans-serif}";
print"TD {FONT-FAMILY: arial,sans-serif}";
print"DIV {FONT-FAMILY: arial,sans-serif}";
print"P {FONT-FAMILY: arial,sans-serif}";
print"A {FONT-FAMILY: arial,sans-serif}";
print"UNKNOWN {COLOR: #0000cc}";
print"</STYLE>";
print"<BODY bgColor=#ffffff topMargin=2 marginheight=2>";
print"<TABLE cellSpacing=2 cellPadding=0 width=100% border=0>";
print"<TBODY>";
print"<TR>";
print"<TD width=1% height=69 vAlign=top><a href=http://www.ally.nu><IMG height=59 alt=Go to Noogle Home hspace=3
src=http://www.ally.nu/logo.gif width=143 vspace=5 border=0></a></TD>";
print"<TD width=868></TD>";
print"</TR>";
print"</TBODY>";
print"</TABLE>";
print"<TABLE cellSpacing=0 cellPadding=0 width=100% border=0>";
print"<TBODY>";
print"<TR>";
print"<TD bgColor=#3366cc><IMG height=1 width=1></TD></TR></TBODY></TABLE>";
print"<TABLE cellSpacing=0 cellPadding=2 width=100% border=0>";
print"<TBODY>";
print"<TR>";
print"<TD bgColor=#e5ecf9 colSpan=4><B>Search Results</B> - Your Search for the keyword(s)
<strong>$keyword</strong> returned the following results:</TD></TR></TBODY></TABLE><BR>";
print"<TABLE cellSpacing=0 cellPadding=2 width=100% border=0>";
print"<TBODY>";
print"<TR>";
print"<TD width=133 rowspan=2 vAlign=top noWrap bgColor=#ffffff><P><SMALL><A href=http://www.ally.nu>Noogle
Home</A><BR><BR><A href=http://www.ally.nu/docs>Documentation</A><BR><br><A
href=http://www.ally.nu/credits>Credits</A><br><BR><A href=http://www.ally.nu/docs/faq>FAQ<br></A><BR><A
href=http://www.ally.nu/quiz>Quiz</A><BR>";
print"<BR><BR>";
print"</SMALL></P></TD>";
print"<TD width=1 height=37 vAlign=bottom></TD>";
print"<TD width=1 rowspan=2 vAlign=bottom background=http://www.ally.nu/dot2.gif><IMG height=1
src=http://www.ally.nu/dot2.gif width=1></TD>";
print"<TD width=1 vAlign=bottom></TD>";
print"<TD width=100% valign=top><P><B><FONT size=-1>Search Results</FONT></B></P></TD>";
print"</TR>";
print"<TR>";
print"<TD height=598 vAlign=bottom></TD>";
print"<TD vAlign=bottom></TD>";
print"<TD valign=top></p>";
print"<p></p>";
print"<p></p>";
print"<p></p>";
while($file = readdir(DIR))
{
next if ($file !~ /.html/);
open(FILE, $file);
$foundone = 0;
$title = "";
while (<FILE>)
{
if (/$keyword/i)
{
$foundone = 1;
}
if(/<title>/)
{
chop;
$title = $_;
$title =~ s/<title>//g;
$title =~ s/<\/title>//g;
}
if(/<TITLE>/)
{
chop;
$title = $_;
$title =~ s/<TITLE>//g;
$title =~ s/<\/TITLE>//g;
}
if($title eq "")
{
$title = $file;
}
if(/<META NAME="description" CONTENT="/i)
{
chop;
$content = $_;
$content =~ s/<META NAME="description" CONTENT="//g;
$content =~ s/">//g;
}
if(/<META NAME="author" CONTENT="/i)
{
chop;
$author = $_;
$author =~ s/<META NAME="author" CONTENT="//g;
$author =~ s/">//g;
}
if($content eq "")
{
$content = "No Meta-tag page information available";
}
if($author eq "")
{
$author = "No Meta-tag author information available";
}
$count++ while /$keyword/ig;
}
if($foundone)
{
print "<A HREF=/$file>$title</A><br>";
print"<table width=100% border=0 align=center bgcolor=#e5ecf9>";
print"<tr>";
print"<td height=10><font size=-1><b>Results</b>: <i>$count</i> occurrence(s) of the word(s) <i>\"$keyword\"</i>
on this page.<br> <b>Page Description</b>: $content<br><b>Page Author</b>: $author<br><b>URL</b>:<font
color=#008000>http://www.ally.nu/$file</td>";
print"</tr>";
print"</table>";
print"<br>";
$count = 0;
$listed=1;
}
close(FILE);
}
if($listed ne 1)
{print "<p><br>Sorry, your search returned <b>$foundone</b> results. <A HREF=/index.html>Search
Again?</A>";}
else
{print "<P><br>Do you want a <A HREF=/index.html>new search?</A>";}
print"</TD>";
print"</TR>";
print"</TBODY>";
print"</TABLE>";
print"<BR>";
print"<CENTER>";
print"<TABLE cellSpacing=0 cellPadding=0 width=100% border=0>";
print"<TBODY>";
print"<TR>";
print"<TD bgColor=#3366cc><IMG height=1 width=1></TD></TR></TBODY></TABLE>";
print"<TABLE cellSpacing=0 cellPadding=2 width=100% bgColor=#e5ecf9 border=0>";
print"<TBODY>";
print"<TR>";
print"<TD noWrap bgColor=#e5ecf9>";
print"<TABLE cellSpacing=0 cellPadding=0 width=100% border=0>";
print"<TBODY>";
print"<TR>";
print"<TD noWrap align=middle><FONT size=-1>©2005 Noogle - Napier University Server Side Languages Coursework <A
href=http://www.ally.nu>Noogle Home</A> - <A href=http://www.ally.nu/docs>Documentation</A> - <A
href=http://www.ally.nu/credits>Credits</A> - <A href=http://www.ally.nu/docs/faq>FAQ</A> - <A
href=http://www.ally.nu/quiz>Quiz</A></FONT></TD></TR></TBODY></TABLE></TD></TR></TBODY></TABLE></CENTER></BODY></
HTML>";
closedir(DIR);
exit;
I've also cannibalised quite a bit from other scripts - it's all cobbled together really. But at least I'm understanding what's happening!
Cheers,
NL.
Re: Form validation/Search script
by Zaxo (Archbishop) on Oct 24, 2005 at 18:28 UTC
That depends on what you're searching. A database will more or less do it for you via DBI and SQL. Text search in files may use system grep or an index of some kind, or can use pure perl. Detail what you want.
If a regex is needed, you can join the search terms with '|' to make an alternation in the regex. That will find a match each time it finds one term.
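As a rough sketch of the alternation idea (the helper name build_or_regex and the sample line are invented for illustration), the terms can be escaped and joined like this. Note that this gives an "any of these" (OR) search, which is not the same as requiring every term to be present.

```perl
use strict;
use warnings;

# Join the escaped search terms with '|' to get an "any of these" pattern.
# Assumes at least one term; an empty string would match everywhere.
sub build_or_regex {
    my ($keywords) = @_;
    my $alternation = join '|', map { quotemeta } split ' ', $keywords;
    return qr/(?:$alternation)/i;
}

my $re   = build_or_regex('regular expression');
my $line = 'A regular expression is an expression of a pattern.';

# In list context a /g match returns every hit, so this counts them all.
my $count = () = $line =~ /$re/g;
print "$count match(es)\n";    # prints "3 match(es)"
```

The count goes up once per term found, so a page that mentions only "regular" many times would still score highly.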
Your "standard" treatment of the posted form is not so standard any more. Just saying,
use CGI;
my $q = CGI->new;
takes care of all that.
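To make that a little more concrete, here is a minimal sketch of reading the keyword field with CGI and refusing a blank submission. The field name keyword and the /index.html link come from the script posted in this thread; the helper clean_keyword and the wording of the messages are assumptions.

```perl
use strict;
use warnings;

# Trim the submitted value; return undef when nothing usable was entered.
sub clean_keyword {
    my ($raw) = @_;
    return unless defined $raw;
    (my $keyword = $raw) =~ s/^\s+|\s+$//g;
    return length($keyword) ? $keyword : undef;
}

# GATEWAY_INTERFACE is set by the web server for CGI requests, so this part
# only runs when the script is invoked as a CGI program.
if ($ENV{GATEWAY_INTERFACE}) {
    require CGI;
    my $q       = CGI->new;
    my $keyword = clean_keyword(scalar $q->param('keyword'));

    print $q->header('text/html');
    if (!defined $keyword) {
        print '<p>Please enter a keyword. <a href="/index.html">Search again?</a></p>';
        exit;
    }
    # Escape the user's input before echoing it back into the page.
    print '<p>Searching for <strong>', $q->escapeHTML($keyword), '</strong></p>';
}
```

Checking the value before any searching starts keeps the "nothing entered" case separate from the "nothing found" case.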
Re: Form validation/Search script
by wfsp (Abbot) on Oct 24, 2005 at 19:47 UTC
Hi No-Lifer!
One way to go about this is to first build a list of keywords and then all you need do is a lookup.
You could use HTML::TokeParser to extract the text from each page and something like this to extract the words (what are your plans for common words, accents, hyphens, numbers etc.?).
Then load the words into a DB (I use DBM::Deep) with a reference to each file that contains each word (I use another D::D for the file refs).
All you have to bear in mind is to apply the same rules to the words submitted as you did when you built the index.
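The index-then-lookup idea can be sketched with plain in-memory hashes. This is only an illustration under simplifying assumptions: the page names and text are invented, the text is assumed to be already extracted from the HTML, and a real version would store the index on disk (for example in DBM::Deep, as described above) rather than rebuild it on every run.

```perl
use strict;
use warnings;

# The same normalisation rule is used for indexing and for queries:
# lowercase, then split on anything that is not a letter or digit.
sub words_of {
    my ($text) = @_;
    return grep { length } split /[^a-z0-9]+/, lc $text;
}

# Hypothetical pages: file name => text already extracted from the HTML.
my %pages = (
    'intro.html' => 'Perl regular expression basics',
    'cgi.html'   => 'Handling forms with Perl and CGI',
    'regex.html' => 'A regular guide to every expression',
);

# Build the index: word => { file => 1, ... }.
my %index;
for my $file (keys %pages) {
    $index{$_}{$file} = 1 for words_of($pages{$file});
}

# AND search: keep only the files listed under every query word.
sub search_all {
    my ($query) = @_;
    my @words = words_of($query);
    return () unless @words;
    my %hits;
    for my $word (@words) {
        $hits{$_}++ for keys %{ $index{$word} || {} };
    }
    my @found = grep { $hits{$_} == @words } keys %hits;
    return sort @found;
}

print join(', ', search_all('Regular EXPRESSION')), "\n";
# prints "intro.html, regex.html"
```

Because both the pages and the query go through words_of, "Regular EXPRESSION" finds the same files as "regular expression".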
Then perhaps HTML::Template to format the results and CGI::Session to display them a page at a time.
At the moment I build the index locally and upload it.
There are about 2k pages and the index comes in at a shade under 42k words (12MB). And, fingers crossed, it's working well.
I would be interested in seeing an outline of how you plan to go about this. We may be able to make some suggestions before you commit yourself to any particular strategy.
Good luck!
After writing this I saw your reply to Limbic~Region. My advice above still stands.
Re: Form validation/Search script
by marto (Cardinal) on Oct 24, 2005 at 20:13 UTC
Re: Form validation/Search script
by cees (Curate) on Oct 25, 2005 at 00:33 UTC
If you want to build a search mechanism that will scale to lots of pages, then you need to use an indexer of some sort. Have a look at CGI::Application::Search, which integrates the swish-e search index into a CGI app using the CGI::Application framework.