http://www.perlmonks.org?node_id=32285

xjar has asked for the wisdom of the Perl Monks concerning the following question:

Hello all. I need to write a program that compares two HTML documents to determine whether they are similar enough to be considered "the same". Here is what I was thinking of doing (keep in mind, I'm a neophyte, so if my ideas are pretty poor, be kind):

  • Read each document into an array, line by line
  • Strip the newline from each array element
  • Concatenate the array elements into a string variable, so that in the end each variable holds an entire document
  • Take a substr() of each variable, say 100 characters starting 150 characters in. If the two substrings match, treat the documents as the same.
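
The steps above can be sketched roughly as follows. This is only a minimal illustration of the idea as described, not a tested solution: the names read_document and same_window are made up for the example, the offset of 150 and length of 100 are just the figures from the list, and treating a document that is too short for the window as "not the same" is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: read a file line by line into an array,
# strip the newlines, and join everything into one string.
sub read_document {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my @lines = <$fh>;
    close $fh;
    chomp @lines;              # remove the trailing newline from every element
    return join '', @lines;    # one string holding the whole document
}

# Hypothetical helper: compare a fixed window of each document.
# Defaults (150 in, 100 long) are the example values from the list.
sub same_window {
    my ($doc_a, $doc_b, $offset, $length) = @_;
    $offset = 150 unless defined $offset;
    $length = 100 unless defined $length;

    # Assumption: if either document is too short to contain the
    # whole window, report "not the same" instead of comparing.
    return 0 if length($doc_a) < $offset + $length;
    return 0 if length($doc_b) < $offset + $length;

    return substr($doc_a, $offset, $length) eq substr($doc_b, $offset, $length) ? 1 : 0;
}
```

It would be called along the lines of same_window(read_document('a.html'), read_document('b.html')), where the file names are placeholders. Note that this only looks at one 100-character window, so two documents that differ anywhere else would still be reported as the same.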

Now, I'm not sure how efficient this will be, especially with copying everything from an array into a string variable. Can anyone provide me with some ideas, or even (hehe) a module that can help with this?

Much thanks, xjar