PerlMonks
Cleaning up text for indexing in DB
by TVSET (Chaplain)
on Jul 16, 2003 at 13:03 UTC ( [id://274799]=perlquestion )
TVSET has asked for the wisdom of the Perl Monks concerning the following question:
Dear Monks,
I am working on a knowledge base for our network operations center. The knowledge base is a collection of categorized questions and answers, sometimes with attachments. One of its functions is searching through the items. Results should be sorted by relevance, with the highest-ranked item at the top of the list. To achieve this, I count the occurrences of every word in the knowledge item text and store these counts in a separate table. The structure is somewhat like this:
Now, the tricky part is that both HTML and plain text are allowed. That means I need to somehow identify whether the item being entered is text or HTML (I am thinking of a radio selection), and then clean out all the HTML stuff if needed, plus all the punctuation and other crap. All the solutions I have in mind are somewhat messy and ugly (regexps, etc). HTML parsing modules can be applied to a certain degree, but no one forces people to enter valid HTML. Secondly, all the dots, quotes, semicolons, etc. give me a headache. Any suggestions from the wise ones? :)
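To make the problem concrete, here is a minimal sketch of the regexp-style cleanup and word counting described above, using core Perl only. The subroutine name `count_words` and the `$is_html` flag (standing in for the radio selection) are hypothetical names chosen for illustration, and the definition of a "word" is an assumption. The tag stripping is deliberately crude, so treat this as a starting point rather than a finished solution.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash ref mapping each lowercased word
# to the number of times it occurs in $text.
# $is_html is an assumed flag, e.g. set from the radio selection.
sub count_words {
    my ($text, $is_html) = @_;

    if ($is_html) {
        # Crude tag stripping. It tolerates sloppy markup, but it will
        # misfire on a '>' inside an attribute value or a comment.
        $text =~ s/<[^>]*>/ /g;

        # Decode only a handful of common entities (assumed list).
        my %entity = (amp => '&', lt => '<', gt => '>', quot => '"', nbsp => ' ');
        $text =~ s/&(amp|lt|gt|quot|nbsp);/$entity{$1}/g;
    }

    my %count;
    # Assumption: a "word" is a run of ASCII letters or digits, with
    # inner apostrophes allowed so that "don't" stays one word.
    # All other punctuation acts as a separator.
    for my $word (lc($text) =~ /[a-z0-9]+(?:'[a-z0-9]+)*/g) {
        $count{$word}++;
    }
    return \%count;
}

# Usage: count the words of a small HTML fragment and list them.
my $counts = count_words('<p>Reset the <b>router</b>; then reset the modem.</p>', 1);
printf "%-8s %d\n", $_, $counts->{$_} for sort keys %$counts;
```

The resulting hash could then be written to the word-count table one row per word. Non-ASCII text, hyphenated words, and broken tags such as an unclosed `<` are not handled here and would need their own rules.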