PerlMonks
That is a good question. The logs are uploaded to a central server for processing inside a .tgz file. Once the archive is unpacked, there is no guarantee that the loader script sees the hours in sequential order: the log files are processed in parallel, with work handed out from a queue to child processing scripts. So the DB is the only point of reference.

Storing a byte count would also not work, because the files are rotated (supposedly once a day) and the file starts over at 0. On some systems logrotate does not always run, so those files just keep growing. It could also be that hours 3, 5 and 6 are processed and THEN hour 4 gets processed, so the byte count from hour 6 would not apply to hour 4. Since hour 4's data was already included in the hour 5 and hour 6 uploads, there is no need to pull it out and load it again, but without checking the dates for inclusion in the range there is no way to guarantee that.

Given all of this, dates are the only reliable means of knowing whether data has been processed before. Even if I could skip ahead by a byte offset and process only the dates after it, I would still need to check those dates against the range, and on some of the busier systems there are a lot of them each hour. Hence the original question: how do I make the lookup faster/more efficient?

In reply to Re^2: search through hash for date in a range
by bfdi533
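
For illustration only, here is a minimal sketch of one way the "is this date already in a processed range?" check could be made cheaper. It assumes the processed ranges can be read from the DB once per run as [start, end] pairs in epoch seconds (inclusive), which the post does not confirm. The names merge_ranges and is_covered are hypothetical. The idea is to merge overlapping ranges into a sorted, disjoint list and then binary-search that list for each log timestamp.

```perl
use strict;
use warnings;

# Merge possibly overlapping [start, end] ranges (epoch seconds, inclusive;
# an assumed representation) into a sorted list of disjoint ranges.
sub merge_ranges {
    my @sorted = sort { $a->[0] <=> $b->[0] } @_;
    my @merged;
    for my $r (@sorted) {
        if (@merged && $r->[0] <= $merged[-1][1]) {
            # Overlaps the previous range: extend its end if needed.
            $merged[-1][1] = $r->[1] if $r->[1] > $merged[-1][1];
        }
        else {
            push @merged, [ @$r ];    # copy so the caller's data is untouched
        }
    }
    return \@merged;
}

# Binary search: true if $ts falls inside any of the disjoint, sorted ranges.
sub is_covered {
    my ($ranges, $ts) = @_;
    my ($lo, $hi) = (0, $#$ranges);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        if    ($ts < $ranges->[$mid][0]) { $hi = $mid - 1 }
        elsif ($ts > $ranges->[$mid][1]) { $lo = $mid + 1 }
        else                             { return 1 }
    }
    return 0;
}

# Example: three processed ranges, two of which overlap.
my $done = merge_ranges([100, 200], [150, 300], [500, 600]);
for my $ts (250, 400) {
    print "$ts: ", (is_covered($done, $ts) ? "skip" : "load"), "\n";
}
```

Each lookup costs O(log n) in the number of merged ranges, so out-of-order hours (hour 4 arriving after hours 5 and 6) are handled without scanning every stored range per log line.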