I would like to be able to extract the principal information from each of several web pages that are job openings (one per page) on a particular employer's career site. Each page is created by a combination of a JavaScript front end and certain JSON information that is embedded in the page. Once I can extract the JSON, I think I can use one of the many CPAN JSON modules to turn the JSON into Perl data structures I can use to reformat the data for each job. Basically, I'd scrape each job for repurposing, with the employer's permission.
The page that contains links to each of the job openings is here: https://recruiting.ultipro.com/NEW1020/JobBoard/6162c253-9d81-da08-c252-d43d2fcb8345/?q=&o=postedDateDesc&w=&wc=&we=&wpst=
Each page containing a particular job opening is produced by clicking on a job title on that page.
So an example of the JSON data that I'd like to munge is this excerpt from one such job page (not the page that lists all of the jobs):
<script>
$(function () {
var opportunity = new
US.Opportunity.CandidateOpportunityDetail({"Id":"10eb1d6c-359b
+-4f10-84d0-
ca2525d88cce","Title":"Relationship
Manager","Featured":false,"FullTime":true,"HoursPerWeek":null,
+"JobCategoryName":"Qualified
Client Services","Locations":[{"Id":"dd1188b1-18d2-5e8d-9f93-a
+adbe1a3fd22","LocalizedName":"CA
- Remote","LocalizedLocationId":null,"LocalizedDescription":"C
+A - Remote","Address":
{"Line1":null,"Line2":null,"City":"Walnut Creek","State":
{"Name":"California","Code":"CA"},"PostalCode":null,"Country":
+{"Id":"ab896de2-
c528-41b0-90a7-5eed39797103","Name":"United
States","Code":"USA"}},"DisplayName":true,"DisplayLocationId":
+false,"DisplayDescription":true,
"DisplayAddress":false,"DisplayStreetAddress":false,"Coordinat
+es":
{"Longitude":-120.9614611155792,"Latitude":37.584818420647},"S
+hapes":null,"SourceOfTruth":1,"I
sAvailableForOpportunities":true},{"Id":"1945a6cf-0d3b-5b2b-a7
+bf-
dd8dbb9a7b53","LocalizedName":"CA -
Folsom","LocalizedLocationId":null,"LocalizedDescription":"CA
+- Folsom","Address":{"Line1":"35
Iron Point Circle","Line2":"Suite 300","City":"Folsom","State"
+:
{"Name":"California","Code":"CA"},"PostalCode":"95630","Countr
+y":{"Id":"ab896de2-
c528-41b0-90a7-5eed39797103","Name":"United
States","Code":"USA"}},"DisplayName":true,"DisplayLocationId":
+false,"DisplayDescription":true,
"DisplayAddress":true,"DisplayStreetAddress":false,"Coordinate
+s":
{"Longitude":-121.14320436989884,"Latitude":38.643310785875464
+},"Shapes":null,"SourceOfTruth":
1,"IsAvailableForOpportunities":true},{"Id":"ab91588e-c732-56b
+4-9671-
e5daab085388","LocalizedName":"CA - Los
Angeles","LocalizedLocationId":null,"LocalizedDescription":"CA
+ - Los Angeles
Wilshire","Address":{"Line1":"12424 Wilshire Blvd.","Line2":"S
+uite 870","City":"Los
Angeles","State":{"Name":"California","Code":"CA"},"PostalCode
+":"90025","Country":
{"Id":"ab896de2-c528-41b0-90a7-5eed39797103","Name":"United
States","Code":"USA"}},"DisplayName":true,"DisplayLocationId":
+false,"DisplayDescription":true,
"DisplayAddress":false,"DisplayStreetAddress":false,"Coordinat
+es":
{"Longitude":-118.47060174630806,"Latitude":34.041507422395},"
+Shapes":null,"SourceOfTruth":1,"
IsAvailableForOpportunities":true},{"Id":"dadf3d11-17f2-5753-
b719-3291aeeccc69","LocalizedName":"CA -
Fresno","LocalizedLocationId":null,"LocalizedDescription":"CA
+- Fresno","Address":
{"Line1":"7519 North Ingram Avenue","Line2":"Suite 106","City"
+:"Fresno","State":
{"Name":"California","Code":"CA"},"PostalCode":"93711","Countr
+y":{"Id":"ab896de2-
c528-41b0-90a7-5eed39797103","Name":"United
States","Code":"USA"}},"DisplayName":true,"DisplayLocationId":
+false,"DisplayDescription":true,
"DisplayAddress":false,"DisplayStreetAddress":false,"Coordinat
+es":
{"Longitude":-119.80186387305908,"Latitude":36.846081098189643
+},"Shapes":null,"SourceOfTruth":
1,"IsAvailableForOpportunities":true}],"PostedDate":"2021-03-0
+3T16:52:20.236Z","UpdatedDate":"
2021-03-03T16:52:56.265Z","RequisitionNumber":"RELAT03025","De
+scription":"\u003cp\u003e
\u003cstrong\u003e\u003cem\u003eWho We Are\u003c/em\u003e\u003
+c/strong\u003e\u003c/p\u003e\n
\u003cp\u003eNewport helps companies offer their associates a
+more secure financial future
through retirement plans, insurance and consulting services. N
+ewport offers comprehensive plan
solutions and consulting expertise to plan sponsors and the ad
+visors who serve them. As a
provider and partner, Newport is independent, experienced and
+responsive.\u003c/p\u003e\n
\u003cp\u003e\u0026nbsp;\u0026nbsp;\u003c/p\u003e\n\u003cp\u00
+3e\u003cstrong\u003eJob
Description\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003ePro
+vides pro-active service and
communications to retirement plan clients. This includes provi
+ding client support,
documentation and record keeping, preparation of plan statemen
+ts, communication of plan
information to client, and assists with the modification and e
+nhancement of plan
administration processes, within the limits of established pol
+icy.\u003c/p\u003e\n\u003cp
\u003e\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u00
+3eEssential Functions \u003c/
strong\u003e\u003cem\u003eReasonable accommodations may be mad
+e to enable individuals with
disabilities to perform these essential functions\u003c/em\u00
+3e.\u003c/p\u003e\n\u003cul
\u003e\n\u003cli\u003eProvides support to clients through a nu
+mber of channels including
phone, letters and emails to quickly resolve the request\u003c
+/li\u003e\n\u003c/ul\u003e\n
\u003cul\u003e\n\u003cli\u003eActs in a pro-active manner with
+ assigned clients and advisors
to ensure retention as well as inspire client dedication and e
+ngagement to develop positive
relationships\u003c/li\u003e\n\u003cli\u003eResponsible for in
+terpreting plan documents for
client plan administration.\u0026nbsp;\u003c/li\u003e\n\u003cl
+i\u003eProvides calculations and
amounts to plan sponsors, communicates fund actions, consults
+with clients to answer
inquiries, researches and resolves issues, provides legal upda
+tes, and responds to requests
for specialized reports\u003c/li\u003e\n\u003cli\u003eAssists
+plan sponsor and intermediaries
on the utilization of web-based applications and delivers web
+demonstrations for financial
advisors and plan sponsors.\u003c/li\u003e\n\u003cli\u003eWork
+s with clients to correct and
fund payroll items and manages distribution requests.\u003c/li
+\u003e\n\u003cli
\u003eCoordinates plan compliance testing with the compliance
+team.\u003c/li\u003e\n\u003cli
\u003eParticipates in sales finals presentations and promotes
+cross-sell opportunities as
needed\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u0026nbs
+p;\u003c/p\u003e\n\u003cp\u003e
\u003cstrong\u003eSupervisory Responsibilities (none)\u003c/st
+rong\u003e\u003c/p\u003e\n
\u003cp\u003e\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003e\
+u003c/p\u003e\n\u003cp\u003e
\u003cstrong\u003eRequired Education, Experience and Certifica
+tes, Licenses, Registrations
\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u0
+03eBachelor\u0026rsquo;s degree
in business related filed or combination of education and indu
+stry experience\u003c/li\u003e\n
\u003cli\u003e3-5 years of total experience in Retirement Serv
+ices, with emphasis in the daily
401(k) environment, 403b or IRA areas\u003c/li\u003e\n\u003cli
+\u003eStrong MS Office Skills
with an emphasis in Excel\u003c/li\u003e\n\u003c/ul\u003e\n\u0
+03cp\u003e\u003cstrong
\u003ePreferred (but not required) education or skills for thi
+s role are\u003c/strong\u003e
\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003ePreferred ASPPA
+or CEBS\u003c/li\u003e\n\u003c/
ul\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003cp\u003
+e\u003cstrong\u003eCompetencies
\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u0
+03eThrives in a fast-paced
environment\u003c/li\u003e\n\u003cli\u003eEmbraces personal gr
+owth and wants to be challenged
in deadline-driven and multi-component environment\u003c/li\u0
+03e\n\u003cli\u003eExcellent
communication skills both written and verbal\u003c/li\u003e\n\
+u003cli\u003eBuilds
collaborative relationships\u003c/li\u003e\n\u003c/ul\u003e\n\
+u003cul\u003e\n\u003cli
\u003eEffective time management and organization skills\u003c/
+li\u003e\n\u003cli
\u003eDemonstrates initiative\u003c/li\u003e\n\u003cli\u003eFo
+rward thinking\u003c/li\u003e\n
\u003cli\u003eFosters teamwork\u003c/li\u003e\n\u003cli\u003eR
+esults drive/oriented\u003c/li
\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003
+e\n\u003cp\u003e\u003cstrong
\u003eTRAVEL:\u0026nbsp; 10\u003c/strong\u003e%.\u003c/p\u003e
+\n\u003cp\u003e\u003cstrong
\u003e\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u0
+03e\u003cstrong\u003eOTHER
DUTIES\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003ePlease n
+ote this job description is not
designed to cover or contain a comprehensive listing of activi
+ties, duties or responsibilities
that are required of the employee for this job. Duties, respon
+sibilities and activities may
change at any time with or without notice.\u003c/p\u003e\n\u00
+3cp\u003e\u0026nbsp;\u003c/p
\u003e\n\u003cp\u003e\u003cspan\u003e\u003cstrong\u003eEQUAL O
+PPORTUNITY EMPLOYER\u003c/strong
\u003e\u003c/span\u003e\u003c/p\u003e\n\u003cp\u003eNewport of
+fers for employment are
conditioned upon satisfactory completion of our employment scr
+eening process (including, but
not limited to, a review of past employment and education reco
+rds, background investigation,
and/or credit check and fingerprints.)\u003c/p\u003e\n\u003cp\
+u003e\u0026nbsp;\u003c/p\u003e\n
\u003cp\u003eNewport unequivocally rejects racism and discrimi
+nation of any kind and fosters
an environment of belonging to provide access and opportunity
+for all.\u0026nbsp; As an Equal
Opportunity Employer we do not discriminate on the basis of ra
+ce, religion, color, sex, sexual
orientation, gender identify, gender expression, national orig
+in, age, non-disqualifying
physical or mental disability, veteran status, or any other ba
+sis covered by applicable law.
\u0026nbsp;All employment is decided on the basis of qualifica
+tions, merit, and business need.
\u003c/p
\u003e","EqualOpportunityEmployerDescription":null,"PayTranspa
+rencyPolicyStatement":null,"Matc
hScore":1.0,"HasApplied":false,"ApplicationJobBoardName":null,
+"ApplicationJobBoardId":null,"Da
teApplied":null,"Salaried":true,"CompensationAmount":null,"Pub
+lishingStatus":1,"Links":
[],"BehaviorCriteria":[],"MotivationCriteria":[],"EducationCri
+teria":
[],"LicenseAndCertificationCriteria":[],"SkillCriteria":[],"Wo
+rkExperienceCriteria":
[],"JobBoardMemberships":[{"JobBoardId":"6489e35d-ba29-b1c3-92
+d3-
acb1a86c1453","PublishedInternal":true,"PublishedExternal":fal
+se,"ExternalPostedDate":null,"In
ternalPostedDate":"2021-03-05T23:08:36.109Z"},{"JobBoardId":"6
+162c253-9d81-da08-c252-
d43d2fcb8345","PublishedInternal":true,"PublishedExternal":tru
+e,"ExternalPostedDate":"2021-03-
05T23:08:36.109Z","InternalPostedDate":"2021-03-05T23:08:36.10
+9Z"}],"AssessmentUri":null,"Asse
ssmentStatus":null,"OpportunityIsClosed":false,"TravelRequired
+":null,"TravelDescription":null,
"SupervisorName":null,"Assessments":
[],"ApplicationId":null,"CompensationAnnualMinimum":null,"Comp
+ensationAnnualMaximum":null,"Com
pensationHourlyMinimum":null,"CompensationHourlyMaximum":null,
+"CompensationCurrency":null});
var applicantSourceId = null;
if (applicantSourceId) {
US.utils.sessionStorage.setItem("applicantSourceId", appli
+cantSourceId); }
var renderer = new US.Opportunity.OpportunityRenderViewModel({
opportunity: opportunity,
currentJobBoardId: "6162c253-9d81-da08-c252-d43d2fcb8345",
isViewingInternal: false
});
US.CurrentOpportunityDetailViewModel = new US.Opportunity.Oppo
+rtunityDetailViewModel({
currentJobBoardId: "6162c253-9d81-da08-c252-d43d2fcb8345",
opportunity: opportunity,
renderer: renderer,
candidatePresenceState: null,
opportunityApplyRedirectUrl: "/NEW1020/JobBoard/6162c253-9
+d81-da08-c252-d43d2fcb8345/
Account/Register?redirectUrl=%2FNEW1020%2FJobBoard%2F6162c
+253-9d81-da08-c252-
d43d2fcb8345%2FOpportunityApply%3FopportunityId%3D10eb1d6c
+-359b-4f10-84d0-ca2525d88cce
\u0026cancelUrl=%2FNEW1020%2FJobBoard%2F6162c253-9d81-da08
+-c252-
d43d2fcb8345%2FOpportunityDetail%3FopportunityId%3D10eb1d6
+c-359b-4f10-84d0-ca2525d88cce",
opportunityApplyOnBehalfRedirectUrl: "/NEW1020/JobBoard/61
+62c253-9d81-da08-c252-
d43d2fcb8345/Recruiter/Candidates",
opportunitiesUrl: "/NEW1020/JobBoard/6162c253-9d81-da08-c2
+52-d43d2fcb8345",
tenantAlias: "NEW1020",
featureConfigurationGroups: [{"Id":"001605e9-e513-bcd7-6a0
+5-
b020c4e16539","Name":"Recruitment.OpportunityManagement.Pu
+blishingAndJobBoards","Features"
:
[{"Name":"FeaturedOpportunities","Enabled":true,"HelpToolt
+ipMessageKey":null,"TurnOffWarni
ngMessageKey":null,"ConsentMessageKey":null,"ConsentTitleK
+ey":null,"ToggleableFeature":nul
l},
{"Name":"Approvals","Enabled":false,"HelpTooltipMessageKey
+":null,"TurnOffWarningMessageKey
":null,"ConsentMessageKey":null,"ConsentTitleKey":null,"To
+ggleableFeature":null},
{"Name":"Parallel","Enabled":false,"HelpTooltipMessageKey"
+:null,"TurnOffWarningMessageKey"
:null,"ConsentMessageKey":null,"ConsentTitleKey":null,"Tog
+gleableFeature":null},
{"Name":"IncludeHiringManagersInOnboardingOwnerField","Ena
+bled":true,"HelpTooltipMessageKe
y":null,"TurnOffWarningMessageKey":null,"ConsentMessageKey
+":null,"ConsentTitleKey":null,"T
oggleableFeature":null},
{"Name":"FTE","Enabled":false,"HelpTooltipMessageKey":"Rec
+ruitmentAdministrator.FieldConfi
gurationManager.FeatureConfiguration.Recruitment.Opportuni
+tyManagement.PublishingAndJobBoa
rds.FTEHelpTooltip","TurnOffWarningMessageKey":"Recruitmen
+tAdministrator.FieldConfiguratio
nManager.FeatureConfiguration.Recruitment.OpportunityManag
+ement.PublishingAndJobBoards.FTE
DisableWarningMessage","ConsentMessageKey":null,"ConsentTi
+tleKey":null,"ToggleableFeature"
:null},
{"Name":"Evergreen","Enabled":true,"HelpTooltipMessageKey"
+:"RecruitmentAdministrator.Field
ConfigurationManager.FeatureConfiguration.Recruitment.Oppo
+rtunityManagement.PublishingAndJ
obBoards.EvergreenHelpTooltip","TurnOffWarningMessageKey":
+null,"ConsentMessageKey":null,"C
onsentTitleKey":null,"ToggleableFeature":null},
{"Name":"IncludeHiringManagersInRecruiterField","Enabled":
+false,"HelpTooltipMessageKey":nu
ll,"TurnOffWarningMessageKey":null,"ConsentMessageKey":nul
+l,"ConsentTitleKey":null,"Toggle
ableFeature":null}]},{"Id":"772f9900-a307-4d31-
b15e-9e9052f1c897","Name":"Recruitment.OpportunityManageme
+nt.PageFeatures","Features":
[{"Name":"PersonalizedJobSearch","Enabled":false,"HelpTool
+tipMessageKey":"RecruitmentAdmin
istrator.FieldConfigurationManager.FeatureConfiguration.Re
+cruitment.OpportunityManagement.
PageFeatures.PersonalizedJobSearchTooltip","TurnOffWarning
+MessageKey":null,"ConsentMessage
Key":null,"ConsentTitleKey":null,"ToggleableFeature":null}
+,
{"Name":"JobSearchAgent","Enabled":true,"HelpTooltipMessag
+eKey":"RecruitmentAdministrator.
FieldConfigurationManager.FeatureConfiguration.Recruitment
+.OpportunityManagement.PageFeatu
res.JobSearchAgentTooltip","TurnOffWarningMessageKey":null
+,"ConsentMessageKey":"Recruitmen
tAdministrator.FieldConfigurationManager.FeatureConfigurat
+ion.Recruitment.OpportunityManag
ement.PageFeatures.JobSearchAgentConsentMessage","ConsentT
+itleKey":"RecruitmentAdministrat
or.FieldConfigurationManager.FeatureConfiguration.Recruitm
+ent.OpportunityManagement.PageFe
atures.JobSearchAgentConsentTitle","ToggleableFeature":nul
+l}]}],
linkedInRedirectUrl: "https://recruiting.ultipro.com/NEW10
+20/Opportunity/
ApplyWithLinkedIn?jobBoardId=6162c253-9d81-da08-c252-
d43d2fcb8345\u0026opportunityId=10eb1d6c-359b-4f10-84d0-ca
+2525d88cce",
currentUserRequiresReconsent: false,
userIsRecOrHM: false,
loggedInPersonName: "",
assessmentsUrl: "/NEW1020/JobBoard/6162c253-9d81-da08-c252
+-d43d2fcb8345/ApplicationAssessments"
});
});
I wonder if other Monks are using such a technique to scrape embedded JSON data from a web page.
I see many JSON modules on CPAN, but I'm not finding any that will take ugly HTML and filter it for the embedded JSON.
I see Randal Schwartz' marvelous regex-that-would-be-king that would seem to meet my need at https://perlmonks.org/?node_id=995856, and perlancar's use of it to make the JSON::Decode::Regex module, but I haven't been able to make 'em work. I can provide details here if you'd like, but I'll skip them because I recognize how brittle the regex approach must be. (But if I'm wrong and it's worth pursuing, please let me know.)
Moving on to what must be the "right" way to do it, it appears that I'd learn how to use Selenium to basically process all the JavaScript and get an HTML page that would be parseable by Mojo::DOM, if I want to stick with Perl.
I also see many API software-as-a-service vendors -- a whole industry, practically -- where the vendors essentially have figured all this out, and are happy to extract data from web pages, turn it into JSON, and make it accessible, for a fee, via an API. That's another way to go, but I'd love to be able to do it myself, especially since I see some nice JSON data already hanging on the tree in my target HTML pages.
I also see articles for doing this sort of thing using Python and node, but so far I haven't found a similarly comprehensive article using Perl -- e.g. https://levelupprogramming.net/how-to-scrap-data-from-javascript-based-website-using-python-selenium-and-headless-web-driver-531c7fe0c01f and https://dev.to/princepeterhansen/how-to-scrape-html-from-a-website-built-with-javascript-mjn
What do you think? Should I go with Selenium and Mojo::DOM? I also see Dave Cross' book on using Selenium and Perl, so I'd probably tap that as a resource.
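For what it's worth, the embedded JSON can probably be pulled out without a browser and without a regex that understands JSON. Here is a minimal sketch, assuming the page source has already been fetched and that the object literal passed to CandidateOpportunityDetail is valid JSON on its own. It locates the marker with index, then hands everything after the opening parenthesis to decode_prefix from JSON::PP, which parses one JSON value and ignores the JavaScript that follows. The $html sample is a shortened stand-in for the excerpt above, and extract_json_after is a hypothetical helper name.

```perl
use strict;
use warnings;
use JSON::PP;

# Shortened stand-in for a fetched job page (assumption: the real page
# keeps the JSON argument intact, as in the excerpt above).
my $html = <<'HTML';
<script>
$(function () {
var opportunity = new US.Opportunity.CandidateOpportunityDetail({"Id":"10eb1d6c","Title":"Relationship Manager","Locations":[{"LocalizedName":"CA - Remote"},{"LocalizedName":"CA - Folsom"}],"Description":"\u003cp\u003eWho We Are\u003c/p\u003e"});
var applicantSourceId = null;
});
</script>
HTML

# Hypothetical helper: decode the JSON value that follows the first
# occurrence of "<marker>(" in $text. Returns nothing if the marker is absent.
sub extract_json_after {
    my ($text, $marker) = @_;
    my $needle = $marker . '(';
    my $start  = index($text, $needle);
    return if $start < 0;
    my $tail = substr($text, $start + length $needle);
    # decode_prefix parses a single JSON value and stops there, so the
    # trailing ");" and the rest of the script do not cause an error.
    my ($data, $consumed) = JSON::PP->new->decode_prefix($tail);
    return $data;
}

my $job = extract_json_after($html, 'CandidateOpportunityDetail');
print "$job->{Title}\n";
print "  $_->{LocalizedName}\n" for @{ $job->{Locations} };
print "$job->{Description}\n";
```

The \u003c style escapes are decoded by the JSON parser, so the Description field comes back as ordinary HTML that a module such as Mojo::DOM could then process. If the site ever renames the constructor, only the marker string would need to change.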
|
According to the Perl documentation, Perl will always match at the earliest possible position within the string. This is true of any typical alternation, as can be seen here: https://perldoc.perl.org/perlrequick.
However, this is not what I need. I need to specify a priority of match without regard to position. An approximation of what I need is illustrated in the following (perhaps poor) example:
$line = qq~I'm looking for the end of a sentence, where possible. However, in some cases, I'll need to go with a non-conventional "end" to it, such as:
"Here's a quote by a famous person which is supposed to exceed forty words and is therefore required to be set apart as a separate, indented paragraph per APA style." (Famous, 1999)
Note that the regex needs to look for the full end of the sentence, if it exists: it cannot simply stop at the colon unless there is no further part to the sentence provided in that paragraph.~;
$line =~ s/^
(.*?)
(
(?:[.?!"]) #FIRST PRIORITY
|
(?:[:;-]) #SECOND PRIORITY
|
(?:\n|\r|\z|$) #LAST PRIORITY
)
/<span class="s">$1$2</span>/gmx;
For the above, the desired sentence matches should be:
- I'm looking for the end of a sentence, where possible.
- However, in some cases, I'll need to go with a non-conventional "end" to it, such as:
- "Here's a quote by a famous person which is supposed to exceed forty words and is therefore required to be set apart as a separate, indented paragraph per APA style."
- (Famous, 1999)
- Note that the regex needs to look for the full end of the sentence, if it exists: it cannot simply stop at the colon unless there is no further part to the sentence provided in that paragraph.
As the example illustrates, the sentences should break at the first colon, but not at the second, as there is a higher-priority break-point, the period.
Is it possible to mandate a match priority such that the first one, irrespective of position, will be looked for first, and only upon failure would the next priority be sought, and so on? I have a case where my entire regex is failing on this issue, and I just cannot think of a good way to resolve it. It would not work, in my case, to use two separate regexes, as it would destroy the correct ordering of the sequences matched.
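One way this kind of priority might be expressed, as a sketch rather than a full sentence splitter: put the lazy scan inside each alternative instead of in front of the group. At a given starting position Perl tries the alternatives left to right, so the first branch gets to search the whole line for its terminator before the second branch is tried at all. The sample text below is a simplified, hypothetical stand-in (no quotation marks, no abbreviations), and confining each scan to one line with [^\n] is an assumption about what "where possible" should mean.

```perl
use strict;
use warnings;

# Simplified, hypothetical sample text.
my $text = "First sentence. Then a list follows, such as:\n"
         . "Note the colon here: it must not end the sentence.\n"
         . "no terminator at all\n";

my @sentences;
while ($text =~ m{
        \G \s*                   # resume where the previous match ended
        (?:
            ( [^\n]*? [.?!] )    # first priority: sentence punctuation on this line
          | ( [^\n]*? [:;]  )    # second priority: colon or semicolon
          | ( [^\n]+        )    # last priority: whatever is left of the line
        )
    }gx) {
    # Exactly one of the three captures is defined for each match.
    push @sentences, defined $1 ? $1 : defined $2 ? $2 : $3;
}

print "- $_\n" for @sentences;
```

In this sketch the colon ends a sentence only on the first line, where no period follows it; on the second line the first branch finds the period, so the colon branch is never consulted.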
Edit:
Perhaps this will be a better example/illustration.
Point 1.3.4: A piece of text.
Point 1.3.5: A piece of text.
Point 1.3.6: Another piece of text. Point 1.3.6: For some reason this piece of text isn't finished yet.
Point 1.3.6: In fact, this piece of text even broke into a new line.
Point 1.3.7: Finally, a new piece of text.
Now, it's easy to see that there are four points here. But the computer might not "see" four as it reads each of the "Point" notations.
How could these points be captured such that each substitution will operate on the FULL point at once, not just a portion of a point?
In other words, Point 1.3.6 needs to include three such notations spanning two separate lines.
I have coded it something like this:
$line =~ s~^
(
Point\s(\d+)\.(\d+)\.(\d+)
(.*?)
)
(?=
(?:Point\s
(?:\d+)\.(?:\d+)\.(?!\4)
) #1 Priority
|
(?:\z|$) #2 Priority
)
~$processthis->()~egmx;
However, the #2 Priority match, because it matches first, ends up trumping the #1 Priority match, and any of the chunks of the form illustrated by Point 1.3.6 end up truncated.
Again, this is just an illustration, but perhaps it more clearly explains the priority issues. Moving the (.*?) into the forward-looking assertion(s), as some suggested I try, did not bring about the desired results for me.
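A sketch of one possible way around this, under a few assumptions: with /m, $ matches at the end of every line, so the lazy (.*?) stops at the first line end no matter where that alternative sits in the lookahead. Dropping /m and $ in favour of \z alone, adding /s so that the dot can cross newlines, and comparing the whole label rather than only its last number lets a chunk run until a Point with a different label or the true end of the string. The loop below only collects the chunks; the same pattern could presumably drive a substitution.

```perl
use strict;
use warnings;

my $text = <<'TEXT';
Point 1.3.4: A piece of text.
Point 1.3.5: A piece of text.
Point 1.3.6: Another piece of text. Point 1.3.6: For some reason this piece of text isn't finished yet.
Point 1.3.6: In fact, this piece of text even broke into a new line.
Point 1.3.7: Finally, a new piece of text.
TEXT

my @chunks;
while ($text =~ m{
        (                                # capture 1: the whole chunk
            Point \s (\d+\.\d+\.\d+) :   # capture 2: the full label
            .*?                          # extend lazily, across newlines via /s
        )
        (?=
            Point \s (?!\2:) \d+\.\d+\.\d+ :   # a Point with a different label
          | \z                                 # or the real end of the string
        )
    }gsx) {
    push @chunks, $1;
}

print scalar(@chunks), " chunks\n";
print "---\n$_" for @chunks;
```

Here the order of the lookahead alternatives no longer matters, because \z cannot match anywhere before the end of the string; the lazy quantifier simply stops at the earliest position where either alternative succeeds.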
|