PerlMonks  

Re^2: Quick 'n dirty extraction of JSON from an HTML page

by davebaker (Pilgrim)
on Mar 08, 2021 at 22:20 UTC ( [id://11129353]=note )


in reply to Re: Quick 'n dirty extraction of JSON from an HTML page
in thread Quick 'n dirty extraction of JSON from an HTML page

Yes, it certainly does give me what I need. Thanks, John!

Some of the JavaScript seems to be using key/value specifications that aren't valid JSON because the keys aren't quoted strings, e.g.

var renderer = new US.Opportunity.OpportunityRenderViewModel({ opportunity: opportunity, currentJobBoardId: "6162c253-9d81-da08-c252-d43d2fcb8345", isViewingInternal: false });
... so I changed the regular expression to be
m/\((\{".*?\})\)/gms
(throwing in a leading quotation mark, in order to find only JSON that has a quoted initial key).

I also played with the possibility that the HTML page would contain more than one block of JSON, and changed your code to be

for my $json ( $scrape =~ m/\((\{".*?\})\)/gms ) {
    my $ref = decode_json $json;    # iterate over the captures; $1 would hold only the last match here
    print Dumper $ref;
}
...so as to find and print for me each of multiple JSON blocks (not shown here). Love it!
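
For anyone who wants to try the multi-block extraction without a live page, here is a minimal self-contained sketch. The sample HTML and the constructor names (Foo, Bar, Baz) are made up for illustration, and JSON::PP is assumed only because it ships with Perl; any module exporting decode_json should behave the same way. It uses a while loop so that the match runs in scalar context and $1 refers to the block found by the current iteration.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Hypothetical stand-in for the scraped page; a real $scrape would come from an HTTP fetch.
my $scrape = <<'HTML';
<script>
var a = new Foo({"id": 1, "name": "first"});
var b = new Bar({ unquoted: true });
var c = new Baz({"id": 2,
  "name": "second"});
</script>
HTML

my @decoded;

# In scalar context with /g, each iteration resumes after the previous match.
# The leading quotation mark in the pattern skips the block with the unquoted key.
while ( $scrape =~ m/\((\{".*?\})\)/gms ) {
    push @decoded, decode_json($1);
}

printf "%d: %s\n", $_->{id}, $_->{name} for @decoded;
```

Note that the non-greedy pattern stops at the first "})" it sees, so a block containing a nested object followed directly by a closing parenthesis would be cut short. That is in keeping with the quick 'n dirty spirit, but worth knowing.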

Replies are listed 'Best First'.
Re^3: Quick 'n dirty extraction of JSON from an HTML page
by tobyink (Canon) on Mar 09, 2021 at 14:29 UTC

    Consider using the original regexp, which doesn't require keys to be quoted, and parsing the JSON using Cpanel::JSON::XS with relaxed mode turned on.
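
    As a rough sketch of decoding with unquoted keys: the example below swaps in the core module JSON::PP and its allow_barekey option instead of Cpanel::JSON::XS, purely so that it runs on a stock Perl. Check the Cpanel::JSON::XS documentation for exactly what its relaxed mode accepts. The sample text is modelled on the object literal earlier in the thread, minus the "opportunity: opportunity" pair, whose value is a variable reference that no JSON decoder could be expected to resolve.

```perl
use strict;
use warnings;
use JSON::PP;

# Object literal with unquoted keys, modelled on the example in this thread.
my $text = '{ currentJobBoardId: "6162c253-9d81-da08-c252-d43d2fcb8345", isViewingInternal: false }';

# allow_barekey lets the decoder accept keys that are not quoted strings.
my $decoder = JSON::PP->new->allow_barekey;
my $data    = $decoder->decode($text);

print "$data->{currentJobBoardId}\n";
print $data->{isViewingInternal} ? "internal\n" : "not internal\n";
```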

    Javascript objects can of course still include values which cannot be encoded into JSON, for example:

    var obj = { "some_key": Date.now(), "other_key": function () { console.log("Hello world"); } };

    So if your Javascript objects contain things like this, you'll be out of luck. You might want to wrap your JSON decoding in try/catch or eval.
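
    One possible shape for the eval wrapper is sketched below, again assuming JSON::PP with allow_barekey as a stand-in decoder. The helper name try_decode is hypothetical. It returns undef when decoding dies, which is unambiguous here only because the candidates are object literals and never a bare JSON null.

```perl
use strict;
use warnings;
use JSON::PP;

my $decoder = JSON::PP->new->allow_barekey;

# Hypothetical helper: returns the decoded structure, or undef if decode() died.
# The error message is available in $@ right after the eval if it is needed.
sub try_decode {
    my ($text) = @_;
    my $data;
    my $ok = eval { $data = $decoder->decode($text); 1 };
    return $ok ? $data : undef;
}

my @candidates = (
    '{"some_key": 1}',
    '{ "some_key": Date.now(), "other_key": function () { console.log("Hello world"); } }',
);

for my $text (@candidates) {
    my $data = try_decode($text);
    print defined $data ? "decoded\n" : "skipped\n";
}
```

    The second candidate is skipped rather than aborting the whole run, so any well-formed blocks on the same page still get processed.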
