Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I have a list of coordinates and a subroutine that tells me, for each coordinate pair, which of many JSON files I must retrieve some content from.
The values in the JSON files change periodically.
Could you please give me some pointers on how to cache the JSON information in a hash (I saw there is File::Cache) and how to monitor the JSON files for changes so that the hash is updated?
Thank you!

Re: Caching files
by choroba (Archbishop) on Jan 24, 2020 at 14:50 UTC
    Tell us more. What OS are you on? On Linux, I've had good experience with inotify to watch a directory tree for changes. It doesn't detect changes made by mmap, though, so tell us also how the JSON files change. Other OSes use different notification tools.

    Checking the modification time, (stat)[9], would be slower than running a notification tool, but it should still be faster than reading the file every time. This method might also fail to invalidate the cache properly if mmap was used to modify the files.

    Given the coordinates, do you know what JSON file holds the relevant information, or does this periodically change as well? If the latter, cache both the value and the filename. (But what would happen if there was conflicting information in two JSON files?)

    What kind of information do the JSON files provide? If it's a structure that JSON can represent, you can store the decoded structure in the hash directly.

    $cache{$x}{$y}{$z} = $decoded_structure;
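
    A minimal sketch of that idea, assuming the core JSON::PP module and a made-up payload (the keys "start" and "end" are hypothetical, since we haven't seen your files yet):

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

my %cache;

# Hypothetical JSON payload; the real files may look different.
my $json = '{"start":"2020-01-24T00:00","end":"2020-01-25T00:00"}';

# Decode once and keep the resulting Perl structure in the hash.
my ($x, $y, $z) = (3, 7, 1);
$cache{$x}{$y}{$z} = decode_json($json);

# Later lookups reuse the decoded structure; nothing is parsed again.
my $window_end = $cache{$x}{$y}{$z}{end};
print "$window_end\n";
```

    The point is only that the hash value can be any decoded structure, so later queries become plain hash lookups.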

    BTW, File::Cache is now discouraged and recommends Cache::Cache which itself is not actively developed anymore and recommends CHI instead.

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
      I'm on Linux, and a cron process writes, at regular intervals, the time window of validity of some forecast data for a number of geographical tiles.
      Each tile has a corresponding JSON file, written by the cron process, that stores, for example, its time window.
      The data generated by the process are 4D arrays saved to ASCII files with a regular structure, so value(i,j,k,t) is queried by direct access with seek(), after a function taking (i,j,k,t) as input computes the byte offset of the value.
      At the moment, each query consists of finding the tile in which a given point falls and then reading that tile's JSON file.
      I wonder if there is a way to preload all the JSON files into a hash and then update the entries when the files change after the cron process runs.
      Below are parts of the code, which may help to understand the situation. In practice, I'd like to cache the sub _get_tile_info().
      
      
      use POSIX qw(floor);    # floor() is used in _find_cell
      
      get_data();
      
      #-------------
      sub get_data {
      
          my %args = @_;
          my @coords = @{$args{Coords} || []};
      
          # _get_tile_and_ids returns a reference to a list of [tile, i, j] triples
          my $tiles_and_ids = _get_tile_and_ids(%args);
      
          foreach my $point (@$tiles_and_ids) {
              my $data = _extract_ts(Point=>$point,WS2D=>1,WD2D=>1,TEMP2D=>1);
          }
      
      }
      
      #----------------------
      sub _get_tile_and_ids {
      
          my %args = @_;
      
          my @coords = @{$args{Coords} || []};
          my @results;
      
          foreach my $pair (@coords) {
      
              # Assumption: each pair is an array reference [x, y]
              my ($xx,$yy) = @$pair;
      
              my ($status,$tile,$ii,$jj) = _find_tile_and_ids(X=>$xx,Y=>$yy);
              push @results,[$tile,$ii,$jj];
      
          }
      
          return \@results;
      
      }
      
      #-----------------------
      sub _find_tile_and_ids {
      
          my %args = @_;
      
          my $x = $args{X};
          my $y = $args{Y};
      
          # Find tile
      
          ....
      
          my ($icell,$jcell) = _find_cell(X=>$x,Y=>$y,Xmin=>$xll_tile,Ymin=>$yll_tile,Dxy=>$info{dxy});
      
          return('',$tile,$icell,$jcell);
      
      }
      
      #----------------------
      sub _find_cell {
      
          my %args = @_;
      
          my $x = $args{X};
          my $y = $args{Y};
          my $xmin = $args{Xmin};
          my $ymin = $args{Ymin};
          my $dxy = $args{Dxy};
      
          my $ii = floor(($x - $xmin) / $dxy) + 1;
          my $jj = floor(($y - $ymin) / $dxy) + 1;
      
          return ($ii,$jj);
      
      }
      
      #----------------
      sub _extract_ts {
      
          my %args = @_;
          my $point = $args{Point} || die;
      
          my ($tile,$ii,$jj) = @$point;    # $point is a [tile, i, j] triple
          my %file = (
              WS2D => "$tile/ws3d.dat",
              TEMP2D => "$tile/temp3d.dat",
          );
      
          my $tile_info = _get_tile_info(Tile=>$tile);
      
          ....
      
      }
      
      #-------------------
      sub _get_tile_info {
      
          my %args = @_;
          my $tile = $args{Tile};
      
          my $json_file = "$tile/info.json";
          my $tile_info = read_file($json_file);
      
          return $tile_info;
      
      }
      
        OK, we still miss some of the details, but let's have some fun.

        I created a Makefile like this:

        Now, you can run

        make simulate_cron
        to generate the input data and start modifying them randomly.

        Then, run

        make query
        in a different terminal. The Perl program is the following:
        #!/usr/bin/perl
        use warnings;
        use strict;
        use feature qw{ say };

        use Cpanel::JSON::XS qw{ decode_json };

        my %cache;
        for (1 .. 1000) {
            my @queries = map [ map int 1 + rand 10, 1, 2 ], 1 .. 50;
            for my $query (@queries) {
                my ($x, $y) = @$query;
                # delete $cache{$x}{$y};  # <- Uncomment to simulate no cache.
                my $value;
                if (exists $cache{$x}{$y}
                    && (stat "$x-$y.json")[9] == $cache{$x}{$y}{last}
                ) {
                    $value = $cache{$x}{$y}{value};
                } else {
                    open my $in, '<', "$x-$y.json" or die $!;
                    $cache{$x}{$y}{last} = (stat $in)[9];
                    $value = $cache{$x}{$y}{value}
                        = decode_json(do { local $/; <$in> })->[2];
                }
                say "$x, $y: $value";
            }
        }

        With the delete line uncommented, it takes about 0.400s to finish. With the line commented out, it runs in under 0.100s, i.e. slightly more than 4 times faster.

        Notes:

        1. The simulation uses mv to create the JSON files so the change is atomic. If we wrote to the file directly instead, we could get occasional errors when reading it.
        2. We store the modification time before we read the value. There's a race condition: the value may change after we retrieved the modification time, but before we read the value. But it doesn't break the code: we return the correct value, but we might read it from the file once more next time.
        3. I guess the cron process doesn't change all the files all the time, so the real benefit of this kind of cache might be much smaller in your real environment.
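
        To tie this back to your code, here is a minimal sketch of a cached _get_tile_info, assuming each tile directory holds an info.json as in your post. Unlike your read_file call, it returns the decoded structure rather than the raw text. The helper _write_tile_info is hypothetical and only mimics an atomic writer (temporary file plus rename, like mv in the simulation). JSON::PP is used because it is in core; Cpanel::JSON::XS exports the same decode_json.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json encode_json);
use File::Temp qw(tempfile);

my %tile_cache;    # tile => { mtime => ..., info => decoded structure }

# Return the decoded info.json of a tile, re-reading the file only
# when its modification time differs from the cached one.
sub _get_tile_info {
    my %args = @_;
    my $tile = $args{Tile};
    my $json_file = "$tile/info.json";

    my $entry = $tile_cache{$tile};
    if ($entry) {
        my $mtime = (stat $json_file)[9];
        return $entry->{info}
            if defined $mtime && $mtime == $entry->{mtime};
    }

    # Cache miss or stale entry: store the mtime first, then read.
    open my $in, '<', $json_file or die "Cannot open $json_file: $!";
    my $mtime = (stat $in)[9];
    my $info  = decode_json(do { local $/; <$in> });
    $tile_cache{$tile} = { mtime => $mtime, info => $info };
    return $info;
}

# Hypothetical writer: write to a temporary file in the same directory,
# then rename it over info.json so readers never see a partial file.
sub _write_tile_info {
    my ($tile, $info) = @_;
    my ($out, $tmp) = tempfile('info.XXXXXX', DIR => $tile);
    print {$out} encode_json($info);
    close $out or die "Cannot close $tmp: $!";
    rename $tmp, "$tile/info.json" or die "Cannot rename $tmp: $!";
}
```

        Since (stat)[9] has one-second resolution, two writes within the same second could go unnoticed. That should not matter for a cron job that runs at most once per minute.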

        map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]