http://www.perlmonks.org?node_id=433250


in reply to IRC log search

I'm assuming that you will have a large volume of data to process. If so, a database is the way to go to speed up searching and data crunching. However, the database schema will depend on what you are trying to get out of your project. What kind of reporting or searching will you be performing?

If you are unsure about the schema, start by inserting each line into a database (I prefer MySQL) and then use the database's analysis queries to help optimize its structure. The MySQL documentation at http://dev.mysql.com/doc/mysql/en/optimizing-database-structure.html is a good place to start.

On the Perl side, DBI, DBD::mysql, and Class::DBI are a few good modules to help you talk to your database. There are certainly more modules and examples on CPAN.

Hope this helps.

Re^2: IRC log search
by Anonymous Monk on Feb 22, 2005 at 05:38 UTC

    In goes one or several search terms, perhaps sometimes constrained to the recent n months or so. Out comes the day and the line number where someone mentioned all the terms. With those data I can look up the immediate context from the log file.

    Currently I don't treat the nickname at the start of each line any differently from the spoken words. This means a nickname can be given as an additional search term to restrict results to what a certain person said.

    Is that enough to make a schema?

    Yes, good pointers. I looked at DBI, which is straightforward, but I don't understand Class::DBI on a conceptual level. I'm still reading the MySQL guide.

      Based on what I know, I would say the schema should be minimally:

      id, int(11) auto_increment
      nickname, varchar(100)
      message, text
      timestamp, int(11) (epoch seconds)
      dateStamp, datetime

      Some would argue that you don't need two timestamp fields. However, in my experience I sometimes need to search on an epoch time range (timestamp) and at other times want to use MySQL's date functions to search between certain dates (dateStamp).

      My logic for this database is:

      • You can limit queries based on nickname
      • You can limit queries based on time range
      • You can perform certain matches based on text
      • You can combine any or all of the three
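
      As a rough illustration of the schema and the three kinds of filter, here is a minimal sketch. It is written in Python against an in-memory SQLite database purely so that it runs without a server. The table name irc_log, the sample rows, and the helper search() are made up for the example, and the column types are SQLite stand-ins for the MySQL ones. The same SQL shape should carry over to DBI, which also uses ? placeholders.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE irc_log (
        id        INTEGER PRIMARY KEY AUTOINCREMENT,
        nickname  TEXT    NOT NULL,
        message   TEXT    NOT NULL,
        timestamp INTEGER NOT NULL,  -- epoch seconds
        dateStamp TEXT    NOT NULL   -- 'YYYY-MM-DD HH:MM:SS' (UTC assumed)
    )
""")

# Hypothetical sample lines: (nickname, message, epoch seconds).
sample = [
    ("alice", "has anyone used Class::DBI with mysql?", 1108861200),
    ("bob",   "I use plain DBI with mysql",             1108951200),
    ("alice", "DBI and mysql work fine for me",         1109030460),
    ("carol", "sqlite is enough for small logs",        1109030520),
]
for nick, msg, ts in sample:
    # Derive the date column from the epoch value so the two always agree.
    stamp = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    conn.execute(
        "INSERT INTO irc_log (nickname, message, timestamp, dateStamp) VALUES (?, ?, ?, ?)",
        (nick, msg, ts, stamp),
    )

def search(conn, terms, nickname=None, since=None, until=None):
    """Return rows whose message contains ALL terms, optionally limited
    by nickname and by an epoch range (since inclusive, until exclusive)."""
    clauses, params = [], []
    for term in terms:
        # Note: % and _ inside a term are not escaped in this sketch.
        clauses.append("message LIKE ?")
        params.append(f"%{term}%")
    if nickname is not None:
        clauses.append("nickname = ?")
        params.append(nickname)
    if since is not None:
        clauses.append("timestamp >= ?")
        params.append(since)
    if until is not None:
        clauses.append("timestamp < ?")
        params.append(until)
    where = " AND ".join(clauses) or "1"
    sql = ("SELECT id, nickname, dateStamp, message FROM irc_log "
           f"WHERE {where} ORDER BY timestamp")
    return conn.execute(sql, params).fetchall()

# Everything alice said that mentions both terms:
for row in search(conn, ["DBI", "mysql"], nickname="alice"):
    print(row)
```

      Substring matching with LIKE cannot use an ordinary index, so on a large log it will scan every row. If that turns out to be too slow, a full-text index on the message column is probably the next thing to look at.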

      I've included an ID field. It is there as common database practice and to leave room for future growth. You will probably need to write a sub or script that parses the raw log and inserts each line into the database properly. That may get tricky depending on how your IRC client or logger formats the timestamps in the logs.
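
      To give an idea of what such a parser might look like, here is a small sketch in Python. The line format "YYYY-MM-DD HH:MM:SS <nick> message" and the assumption that the times are UTC are guesses on my part, so the regular expression will need adjusting to whatever your logger actually writes. The names LINE_RE and parse_line are made up for the example.

```python
import re
from datetime import datetime, timezone

# Assumed (hypothetical) log format: "2005-02-22 05:38:00 <nick> message"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) <([^>]+)> (.*)$")

def parse_line(line):
    """Split one raw log line into the schema's fields.
    Returns None for lines that are not ordinary messages
    (joins, parts, topic changes and so on)."""
    m = LINE_RE.match(line.rstrip("\n"))
    if m is None:
        return None
    stamp, nickname, message = m.groups()
    # Treat the logged time as UTC; change this if your logs use local time.
    when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return {
        "nickname": nickname,
        "message": message,
        "timestamp": int(when.timestamp()),  # epoch seconds
        "dateStamp": stamp,                  # kept as written in the log
    }

print(parse_line("2005-02-22 05:38:00 <alice> has anyone tried Class::DBI?"))
print(parse_line("*** alice has joined #perl"))
```

      Lines that do not match return None, so the loader can skip or count them instead of inserting half-parsed rows.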

      Hope this helps.