http://www.perlmonks.org?node_id=1055178

Cosmic37 has asked for the wisdom of the Perl Monks concerning the following question:

Dear Brethren, I have been trying to think of a better way to do this but I am afraid I am still too ignorant of perl. I love perl when I use it but I am an intermittent user so please forgive me, etc. I have time stamped data with date and time somewhere within each line of interesting data in potentially rather huge data files of many Gigabytes with millions of lines. I am testing with smaller files of less than 100000 lines.

In FILEA the data is logged less frequently than once per second (actually more like twice per minute but could vary). The other files (FILEB, FILEC, FILED) have data lines logged at once per second or even greater but there is always possibility of missing data. In principle all data files overlap in logging times for my analysis period. I want to append each line of data of FILEA with data from the all other files which has the same time stamp (if there are more than one line with same timestamp I don't mind taking the first or last one as long as the sample has same timestamp, ie logged at the same second).

The question is whether anyone can recommend a good method to do this. My simplistic thinking is that I could read all files into respective arrays and trawl through the arrays element by element matching timestamps and concatenating data in an output array before saving that to an output file. However, I feel sure {despite not actually being sure!} that there is a cleverer and quicker way to do this task. Can someone please enlighten me?

A minor complication is that timestamping format may differ per file. eg in FILEA, FILEB and FILEC the date and time components could be matched with m/^\d\.+(\d{2}):(\d{2}):(\d{2})\.+(\d{2})(\d{2})(\d{2})\d{2}.abc/ where $hours=$1; $min=$2; $sec=$3; $daym=$4; $mon=$5-1; $year=100+$6;

Whereas in FILED the date and time components could be matched with m/^\"(\d{4})-(\d{2})-(\d{2})\s+(\d{2}):(\d{2}):(\d{2})\.+/ where $hours=$4; $min=$5; $sec=$6; $daym=$3;$mon=$2-1;$year=$1-1900;

Sorry in advance!

  • Comment on matching datetimestamps and concatenating data where timestamps match from multiple large datafiles

Replies are listed 'Best First'.
Re: matching datetimestamps and concatenating data where timestamps match from multiple large datafiles
by CountZero (Bishop) on Sep 22, 2013 at 12:15 UTC
    You are trying to re-invent a database and query engine.

    Keeping all this data in memory will not work unless you have a computer with a huge memory to avoid the repeatedly swapping in and out of your data and even then you still have to write a program to efficiently search through all those arrays.

    I would do it as follows:

    1. Read each file into its own database table and index the field with the timestamp. You cannot make the timestamp the key of each record as you seem to have multiple records with the same timestamp in each file. Use a module such as Date::Parse or DateTime::Format::DateParse to turn the timestamp into a standard format that will be the same in all reocrds.
    2. Use standard SQL to match the timestamps in your FILEA-table with the timestamps in the FILEB-, FILEC-, ... tables.

    Alternatively but only if all the files have timestamps in strict time-sequential order, you can open a filehandle to each of the files and on a line-by-line basis, loop through FILEA, extract the timestamp, transform the timestamp into a standard format and then iterate through all other files checking their timestamps (after transforming those also into the same standard format) and output the record when the timestamps match until you hit a timestamp past the timestamp of the main loop. Then you do the same for the next FILE until all FILEs have passed the timestamp of the main file and you go to the next record in the main file.

    CountZero

    A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
Re: matching datetimestamps and concatenating data where timestamps match from multiple large datafiles
by CountZero (Bishop) on Sep 22, 2013 at 12:46 UTC
    For the "simple" case where you have all files in strict time-sequential order:
    use Modern::Perl; open my $FILEA, '<', './FILEA.TXT' or die $!; open my $FILEB, '<', './FILEB.TXT' or die $!; open my $FILEC, '<', './FILEC.TXT' or die $!; open my $FILED, '<', './FILED.TXT' or die $!; my $ts_B = 0; my $ts_C = 0; my $ts_D = 0; while ( my $ts_A = <$FILEA> ) { chomp $ts_A; say "A:$ts_A matches B:$ts_B" if $ts_A == $ts_B; while ( $ts_B = <$FILEB> ) { chomp $ts_B; last if $ts_B > $ts_A; next if $ts_B < $ts_A; say "A:$ts_A matches B:$ts_B"; } say "A:$ts_A matches C:$ts_C" if $ts_A == $ts_C; while ( $ts_C = <$FILEC> ) { chomp $ts_C; last if $ts_C > $ts_A; next if $ts_C < $ts_A; say "A:$ts_A matches C:$ts_C"; } say "A:$ts_A matches D:$ts_D" if $ts_A == $ts_D; while ( $ts_D = <$FILED> ) { chomp $ts_D; last if $ts_D > $ts_A; next if $ts_D < $ts_A; say "A:$ts_A matches D:$ts_D"; } }
    Output
    A:102 matches B:102 A:264 matches C:264 A:403 matches D:403 A:403 matches D:403 A:560 matches D:560 A:560 matches D:560 A:744 matches B:744 A:744 matches B:744 A:827 matches C:827 A:915 matches D:915 A:915 matches D:915 A:1018 matches C:1018 A:1180 matches D:1180 ...
    The files I used don't have timestamps, just random numbers in an ever increasing sequence, but the logic remains the same.

    CountZero

    A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics

      Thank you all so much for your comments and guidance. Thank you to those who pointed out the very important caveat that certain logic requires monotonically increasing time within the datafiles - on this occasion this is the case thankfully.

      During my experimentation I had a lot of problems with the following and I am guessing some of you will be able to tell me why as I am once again eeegnorant! The error I get is only at the second match even though $fha and $fhb get set as expected if I print them to screen. The error reported is "Use of uninitialized value in pattern match (m//) at whatever.pl line whatever, <INPUTB> line whatever". I suppose I also need to chomp the input lines before concatenating them but the point is that the match is failing and I can't see why... Many thanks in advance for any suggestions.

      ... open OUTPUT, $outputFile or die "Can't open $outputFile for writing $!\n"; open INPUTA, $inputFileA or die "Can't open $inputFileA for writing $!\n"; open INPUTB, $inputFileB or die "Can't open $inputFileB for writing $!\n"; while($fha=<INPUTA>){ print "a: ",$fha; if($fha=~m/(\d{2}):(\d{2}):(\d{2}).+(\d{2})(\d{2})(\d{2})\d{2}_\d{ +2}\.abc/){ $timea=timegm($3,$2,$1,$4,$5,$6); $output_line=$fha; until($timeb>$timea){ $fhb=<INPUTB>; print "b: ",$fhb; if($fhb=~m/(\d{2}):(\d{2}):(\d{2}).+(\d{2})(\d{2})(\d{2})\d{2} +_\d{2}\.abc/){ $timeb=timegm($3,$2,$1,$4,$5,$6); if($timeb=$timea){ $output_line.=$fhb; } } } print OUTPUT $output_line,"\n"; } } close OUTPUT; close INPUTA; close INPUTB;

      I'm imagining :-D that each file opened has a pointer or equivalent indicating which line we are at. Then if we make a match on the first line of fileA the "while ( $ts_B = <$FILEB> )" statement is not scanning from the beginning of FILEB but from where it previously left off when we are seeking a match for the second line of FILEA. Is this correct? This is obviously important for efficiency and speed as I don't want to be starting all over and scanning through the whole file for each timestamp.

        Indeed, each filehandle is independent of all others and will keep its own reference to where it was in the file it is connected to. So it has no need to rescan from the very beginning of the file to get the next line.

        CountZero

        A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

        My blog: Imperial Deltronics
Re: matching datetimestamps and concatenating data where timestamps match from multiple large datafiles
by hdb (Monsignor) on Sep 22, 2013 at 12:09 UTC

    UPDATE: As CountZero points out below, there is a problem with my do {...} blocks, so please check his proposal Re: matching datetimestamps and concatenating data where timestamps match from multiple large datafiles which should run as intended.

    It sounds like your data is sorted ascending with respect to time. Under this assumption I propose to use the following idea of a script:

    use strict; use warnings; open my $fileA, ... or die; open my $fileB, ... or die; # same with the other files my $lineB = ''; while( my $lineA = <$fileA> ) { chomp $lineA; next unless $lineA =~ m/^\d\.+(\d{2}):(\d{2}):(\d{2})\.+(\d{2})(\d{2 +})(\d{2})\d{2}.abc/; my $dateA = ... ; # construct date from $1, $2 etc do { chomp $lineB; next unless $lineB =~ m/^\d\.+(\d{2}):(\d{2}):(\d{2})\.+(\d{2})(\d +{2})(\d{2})\d{2}.abc/; my $dateB = ... ; # construct date ... next if $dateB < $dateA; # use some date and time library for thes +e comparisons last if $dateB > $dateA; # ... $lineA .= $lineB; } while( $lineB = <$fileB> ); # same with fileC etc... print "$lineA\n"; }
      Did you try your code?

      One cannot use next or last inside a do { ... } block. As the docs say : "contrary to popular belief do{} blocks, which do NOT count as loops".

      It is not a loop, so you cannot use loop-control commands. You can double the curly braces and put a bare block inside the do { ... } block. Bare blocks are loops that execute once. The loop-control commands will work ... somewhat, but last will not do as you might expect: it will exit the inner (bare) block, only to fall into the outer (do) block and hence has the same effect as next.

      CountZero

      A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      My blog: Imperial Deltronics

        I did not try my code as it is only a sketch of how to start.

        Thanks for pointing out this flaw, I was not aware of it. I did put the while at end to avoid the duplicate checking of the times. You have solved this directly in your code by adding another check.

Re: matching datetimestamps and concatenating data where timestamps match from multiple large datafiles
by RichardK (Parson) on Sep 22, 2013 at 12:19 UTC

    I assume that your files must be already sorted as they're log files, then what you are trying to do is a type of Merge Sort

    You can then process you files one line at a time. Read the current line for fileA and decide if you want to write the current lines for fileB and fileC, and when you write a line from a file get the next one. Keep going until you run out of lines.

Re: matching datetimestamps and concatenating data where timestamps match from multiple large datafiles
by wjw (Priest) on Sep 22, 2013 at 20:24 UTC

    I have to agree with the database approach brought up earlier. I occasionally run into these issues where correlation between large files is required and have yet to be able to beat the combo of perl and DBI/DBD. SQLite has been a life-saver

    • for ease of use
    • quick implementation
    • a short learning curve
    • as well as performance savings

    Best of luck in whatever approach you use...

    • ...the majority is always wrong, and always the last to know about it...
    • The Spice must flow...
    • ..by my will, and by will alone.. I set my mind in motion
Re: matching datetimestamps and concatenating data where timestamps match from multiple large datafiles
by Marshall (Canon) on Sep 23, 2013 at 06:44 UTC
    One way to do this is to adopt a "standard" format that is ASCII sortable.

    Example: "2012-10-06 1649".
    This is Oct 6, 2012 at 16:49 GMT (year-month-date time).

    The leading zero's are important!
    "2012-10-6 1649" won't sort in the same ASCII order.

    The above format sorts in ASCII order. No problem. That also means that you can compare Date X vs Date Y!.
    Add seconds as you wish "1649:05", etc. As long as you have leading zero's, ASCII sort order will work. "1649 05" is fine also as long as the format is consistent with space between the minutes and seconds and leading zero's.

    In general, log files should use GMT or what is called UTC time. This is time zone independent.

    If you use a DB, same idea: normalize to GMT/UTC time - that is what goes into the DB.

    Basically, normalize the date/times to a standard ASCII sortable format using GMT/UTC time. Put that into a DB or not. You can sort/search on these times in ASCII format if you want to. Perl sorts ASCII fields like a rocket. If you have a time zone independent date/time format, the DB will also sort/search it like a rocket.

      Example: "2012-10-06 1649".

      While you nearly accurately describe the right track, you did not name it.

      ISO 8601 is one of the standards that actually contain a lot of sanity, and there are even a few cpan modules helping you to implement it. 2013-09-23T09:12:55

      Cheers, Sören

      Créateur des bugs mobiles - let loose once, run everywhere.
      (hooked on the Perl Programming language)