
Optimize Large, Complex JSON decoding

by Endless (Beadle)
on Sep 18, 2013 at 22:46 UTC ( [id://1054744]=perlquestion )

Endless has asked for the wisdom of the Perl Monks concerning the following question:

My program parses JSON files of varying sizes, some over 10 MB. The files contain JSON arrays that can hold hundreds or even thousands of elements. I unwind these into a CSV file containing just the few columns I want (only a fraction of each JSON element). It seems to take a long time to simply unthread the JSON files, so I'm wondering if there are any tips on optimizing this process. In particular, am I likely to be losing time by using 'slurp' mode instead of some other read method? I'm not sure, because each file comes as one ridiculously long line anyway. Any suggestions, or is this as good as it gets?
sub _process_json {
    my $jfile = shift;
    my $json;
    {
        local $/; # Enable 'slurp' mode
        # xxx Might have trouble with larger jsons (10+ mb)
        open my $fh, "<", "$jfile";
        $json = <$fh>;
        close $fh;
    }
    my $json_data = decode_json($json);

    # Go through each interaction (twitter message)
    my @interactions = $json_data -> {'interactions'}; # A scalar of an array of hashes
    while ( (my $key, my $value) = each $interactions[0] ) {
        my $tweetid = $value -> {'twitter'} -> {'id'};
        if (exists $duplicates{$tweetid}){
            $duplicate_count++;
            next; # Skip duplicates
        }else{
            $duplicates{$tweetid} = ();
            $tweets_file_count++;
        }
        # Dates of form 'Fri, 01 Mar 2013 01:21:14 +0000'
        my $created_at = epoch_sec($value -> {'twitter'} -> {'created_at'});
        my $klout = ($value -> {'klout'} -> {'score'}) // ""; # Optional in DS jsons
        my $screen_name = $value -> {'twitter'} -> {'user'} -> {'screen_name'};
        my $text = decode_entities($value -> {'twitter'} -> {'text'});

        # Formatting for the final output
        $text =~ s/\R/\t/g; # Remove linebreaks
        $text =~ s/"/""/g;  # Swap quotations
        print $out_file "$tweetid,", "$created_at,", "$klout,", "$screen_name,", "\"$text\"", "\n";
    } #END while (each tweet)
} #END _process_json

use Inline C => q@
int epoch_sec(char * date) {
    char *tz_str = date + 26;
    struct tm tm;
    int tz;
    if ( strlen(date) != 31
         || strptime(date, "%a, %d %b %Y %T", &tm) == NULL
         || sscanf(tz_str, "%d", &tz) != 1) {
        printf("Invalid date %s\n", date);
        return 0;
    }
    return timegm(&tm) - (tz < 0 ? -1 : 1)*(abs(tz)/100*3600 + abs(tz)%100*60);
}
@;

Replies are listed 'Best First'.
Re: Optimize Large, Complex JSON decoding
by Anonymous Monk on Sep 19, 2013 at 00:20 UTC

    It seems to take a long time to simply unthread the JSON files, so I'm wondering if there are any tips on optimizing this process.

    What is "a long time"? How did you determine the bottleneck?

    Because on my really old laptop (9 years old), it takes 0.96875 seconds to slurp+decode+foreach 189279 "records" from a 21M JSON file

    If I add in some Time::Piece strftime/strptime it goes to 9.984375 seconds

    I don't see room for improvement, although it looks like you could reduce the memory requirement with JSON::Streaming::Reader
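
One way to answer the "how did you determine the bottleneck" question is to time each stage separately. The following is only a rough sketch: it uses the core modules JSON::PP and Time::HiRes (the original post does not say which JSON module supplies decode_json, so JSON::PP is an assumption), and the 1000 synthetic records are a hypothetical stand-in for a slurped file.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json encode_json);   # assumption: core module, swapped in for the poster's JSON module
use Time::HiRes qw(time);                   # sub-second wall-clock timestamps

# Hypothetical stand-in for one slurped input file: 1000 fake records.
my $json = encode_json({
    interactions => [
        map { +{ twitter => { id => "$_", text => "tweet $_" } } } 1 .. 1000
    ],
});

my %elapsed;

# Stage 1: decoding only.
my $t0   = time;
my $data = decode_json($json);
$elapsed{decode} = time - $t0;

# Stage 2: walking the decoded records only.
$t0 = time;
my $count = 0;
for my $item (@{ $data->{interactions} }) {
    my $id = $item->{twitter}{id};
    $count++;
}
$elapsed{iterate} = time - $t0;

printf "%-8s %.6f s\n", $_, $elapsed{$_} for sort keys %elapsed;
print "records: $count\n";
```

Adding further stages (date parsing, entity decoding, printing) in the same way should show which one dominates. For a finer breakdown, a profiler such as Devel::NYTProf is the usual tool.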

Re: Optimize Large, Complex JSON decoding
by Anonymous Monk on Sep 19, 2013 at 00:54 UTC
    Do not, for example, copy the "interactions" into a separate array (@interactions) ... iterate directly over the elements in the array within the decoded JSON content.
      Thanks for your suggestions. As I'm still learning Perl, what do you recommend instead of
      my @interactions = $json_data -> {'interactions'}; # A scalar of an array of hashes
      I am parsing only 11846 records per second.
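
A possible answer to the follow-up question, as a sketch only: dereference the array reference in place and loop over its elements with foreach. In the original code, @interactions ends up holding a single element (the reference itself), and each on an array reference was an experimental feature from Perl 5.14 that was removed in 5.24. The sample JSON and the names %seen and @rows below are hypothetical, and JSON::PP is again an assumed stand-in for the poster's JSON module.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);   # assumption: core module, swapped in for the poster's JSON module

# Hypothetical miniature input with the same shape as the poster's files.
my $json = '{"interactions":['
         . '{"twitter":{"id":"1","text":"a"}},'
         . '{"twitter":{"id":"2","text":"b"}},'
         . '{"twitter":{"id":"1","text":"dup"}}]}';
my $json_data = decode_json($json);

my %seen;   # tweet ids already emitted
my @rows;   # CSV lines collected for output

# Iterate directly over the decoded array; no intermediate copy, no each().
for my $value (@{ $json_data->{interactions} }) {
    my $twitter = $value->{twitter};        # look up the nested hash once
    my $tweetid = $twitter->{id};
    next if $seen{$tweetid}++;              # skip duplicates
    push @rows, join ',', $tweetid, qq{"$twitter->{text}"};
}

print "$_\n" for @rows;
```

Caching $value->{twitter} in a lexical also saves a repeated hash lookup per field, although whether that is measurable depends on the data.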
