My program parses JSON files of varying sizes, some over 10 MB; each file contains a JSON array that can hold hundreds or even thousands of elements. I unwind these into a CSV file of just the few columns I want (only a fraction of each JSON element). Simply unwinding the JSON seems to take a long time, so I'm wondering whether there are any tips for optimizing this process. In particular, am I likely to be losing time by using 'slurp' mode instead of some other read method? I'm not sure, since each file comes as one ridiculously long line anyway. Any suggestions, or is this as good as it gets?
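For reference, the kind of "other read method" I was thinking of is a single sysread of the whole file rather than $/-based slurping. A rough, untested sketch (the sub name _slurp_raw is just a placeholder, and I haven't benchmarked it):

sub _slurp_raw {
    my $jfile = shift;
    # Open in raw mode: decode_json expects UTF-8 octets, so no layer is needed
    open my $fh, '<:raw', $jfile or die "Can't open $jfile: $!";
    my $size = -s $fh;                        # file size in bytes
    sysread($fh, my $json, $size) == $size    # one read call for the whole file
        or die "Short read on $jfile: $!";
    close $fh;
    return $json;
}

Here is the current code: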
sub _process_json {
    my $jfile = shift;
    my $json;
    {
        local $/;    # Enable 'slurp' mode; might struggle with larger JSONs (10+ MB)
        open my $fh, "<", $jfile or die "Can't open $jfile: $!";
        $json = <$fh>;
        close $fh;
    }
    my $json_data = decode_json($json);

    # Go through each interaction (twitter message)
    my $interactions = $json_data->{interactions};    # Reference to an array of hashes
    for my $value (@$interactions) {
        my $tweetid = $value->{twitter}{id};
        if (exists $duplicates{$tweetid}) {
            $duplicate_count++;
            next;    # Skip duplicates
        }
        else {
            $duplicates{$tweetid} = 1;    # Record the id; the value itself is unused
            $tweets_file_count++;
        }
        # Dates are of the form 'Fri, 01 Mar 2013 01:21:14 +0000'
        my $created_at  = epoch_sec($value->{twitter}{created_at});
        my $klout       = $value->{klout}{score} // "";    # Optional in DS JSONs
        my $screen_name = $value->{twitter}{user}{screen_name};
        my $text        = decode_entities($value->{twitter}{text});

        # Formatting for the final output
        $text =~ s/\R/\t/g;    # Replace linebreaks with tabs
        $text =~ s/"/""/g;     # Double embedded quotes for CSV
        print $out_file
            "$tweetid,",
            "$created_at,",
            "$klout,",
            "$screen_name,",
            "\"$text\"",
            "\n";
    }    # END for (each tweet)
}    # END _process_json
use Inline C => q@
#include <time.h>      /* strptime, timegm, struct tm */
#include <string.h>    /* strlen */

/* Convert a date like 'Fri, 01 Mar 2013 01:21:14 +0000' to UTC epoch seconds. */
int epoch_sec(char *date) {
    char *tz_str = date + 26;    /* points at the '+0000' timezone offset */
    struct tm tm = {0};
    int tz;
    if ( strlen(date) != 31 ||
         strptime(date, "%a, %d %b %Y %T", &tm) == NULL ||
         sscanf(tz_str, "%d", &tz) != 1 )
    {
        printf("Invalid date %s\n", date);
        return 0;
    }
    /* timegm() treats tm as UTC; subtract the parsed offset (HHMM) in seconds */
    return timegm(&tm) -
        (tz < 0 ? -1 : 1) * (abs(tz)/100*3600 + abs(tz)%100*60);
}
@;
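For completeness, a quick sanity check of the converter looks like this; the expected value is just the sample date above expressed as UTC epoch seconds:

print epoch_sec('Fri, 01 Mar 2013 01:21:14 +0000'), "\n";    # should print 1362100874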