PerlMonks  

perl Mojo DOM CSS syntax issues

by Anonymous Monk
on Jan 27, 2024 at 07:37 UTC ( [id://11157299]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I inherited some code that extracts specific data from HTML. The HTML has changed, and the code no longer parses and pushes the text I need. I was up till 3AM last night trying everything. As for the CSS syntax, there is a find for an anchor in the HTML and then a find for several classes to pull text, about 12 per page. The script is so large that I'm having trouble making a tear-out to test, so I have to run the whole thing every time, which is part of the problem as well. Please find the broken part of the code and the HTML below. Thank you very much in advance for any insight and guidance you can provide.

This is what I have from the script:

    print "curent working directory is $ENV{PWD}, \n";
    chdir ( "$config->{'data_path'}/" );
    @files7 = grep { -f } glob("*.html");
    foreach (@files7){
        print "Newparse2 == Parsing file: $_ \n";
        $$temp_content = path($_)->slurp_utf8;
        $$output_csv .= qq|"$_"\n\n|;
        $$name4url .= qq|"$_"\n\n|;
        my $dom1 = Mojo::DOM->new( $$temp_content );
        my $r1 = $dom1->find('[class="JMWMJ"]');
        #my $r1 = $dom1->find('.JMWMJ');
        #my $r1 = $dom1->find('div.JMWMJ');
        print "1.1YYY - Config WebSite is $config->{'website'}\n";
        foreach my $block ( @{$r1}){
            $block =~ s#\s-\s|\s\|\s#%%#g;
            my $dom2 = my $r2 = my $r3 = my $r4 = my $r5 = undef;
            my $res = {};
            my @d = ();
            $dom2 = Mojo::DOM->new( $block );
            my $r101;
            my @columns;
            my @columns101;
            #####added stuff here 01142012
            print "1.2YYY - Config WebSite is $config->{'website'}\n";
            ##$r2 = $dom2->find('CVA68e.qXLe6d.fuLhoc.ZWRArf', 'qXLe6d FrIlee') -> map( sub{ $_->text } );
            ## $r2 = $dom2->at('h3.LC20lb.MBeuO.DKV0Md','h3.BNeawe.vvjwJb.AP7Wnd','h3.CVA68e.qXLe6d',
            ## ,'h3.CVA68e.qXLe6d.fuLhoc.ZWRArf','h3.qXLe6d.FrIlee', 'h3.toI8Rb.OSrXXb.usbThf')
            $r2 = $dom2->find('h3')
                ->each(sub {
                    push @columns, join '|', map { $_->all_text } $_->find('span')->each;
                });

This is the HTML that the code is supposed to parse:

    <div class="JMWMJ"><div class="toI8Rb OSrXXb usbThf">Sam Namett, MD - Physician - Interventional Orthopedics ...Exosomes are nanovesicles (30-200 nm) found in extracellular space of various cell types, and in biofluids; having diverse functions including intracellular ...</div></div><div class="Xxy7Vb"><div class="BtwlAd"><g-img style="width:16px;height:16px"><img id="dimg_8" src="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="

I'm trying to extract the text from that class. The info looks like this, and there are multiple such blocks on each HTML page:

Sam Namett, MD - Physician - Interventional Orthopedics ...Exosomes are nanovesicles (30-200 nm) found in extracellular space of various cell types, and in biofluids; having diverse functions including intracellular ...

Thanks again for any insight you can provide.

Replies are listed 'Best First'.
Re: perl Mojo DOM CSS syntax issues
by NERDVANA (Curate) on Jan 27, 2024 at 19:51 UTC
    Like stevieb says, this script sounds highly prone to breakage. In particular,  <div class="JMWMJ"><div class="toI8Rb OSrXXb usbThf"> looks like it was auto-generated by some tool on the remote end, so you can expect those classes to change to new random strings any time the remote side gets recompiled.

    Scraping data from HTML and/or text is best avoided, but I have found times where it was the only way, or even where the cost of avoiding it was higher than the cost of maintaining it. For instance, I know one company that provides their API for $40,000/year; the alternative was for users of that system to export reports of their own data and send them to us to scrape and import into our system, and it was actually much more economical to pay me several dozen hours to make a really clever importer that identifies the data with heuristics (and repair it a few times when the format changed) than to pay that annual API fee.

    (It's been working now for about 6 years since the last time I needed to edit it, actually. Some of that project is now on CPAN as Data::TableReader and Data::TableReader::Decoder::HTML, though that relies on HTML table elements and it looks like you need to match DIVs.)

    So. Supposing that you have a legitimate case of really needing to scrape the data, here's my advice:

    1. Write a perl module that does nothing more and nothing less than take the html and extract the data from it.
    2. Design that code to focus on the structure of the HTML and not those generated div class names... unless the class names are really the only thing available.
    3. Write a unit test that starts with a reduced snippet of the HTML and verifies that it gets back the expected data.
    4. Document all your logic about the scraping.
    5. Refactor the top-level script to use that module for the step of extracting the data.

    The idea here is to isolate the ugly high-maintenance piece of the script and give it a unit test so that later you can quickly compare what changed and get it working again without the top level script getting in your way. It's also way easier to explain the problem to a new developer when they can see and use the unit test.

    Your module will look something like this:

        package ScrapeMyData;
        use Moo;
        use Mojo::DOM;
        use v5.36;

        has input => ( is => 'ro', required => 1 );

        =head2 parse

        This method returns the extracted data from L</input>.

        It first looks for blah blah to identify the start of the data, then
        looks for blah blah blah blah to identify the individual lines.

        Unfortunately I couldn't identify a pattern in the DIVs so I'm matching
        the class names and this is very likely to break.

        TODO: scan the file for the most common class="..." and then guess
        that that class is the one used on data rows.

        =cut

        sub parse($self) {
            my $dom = Mojo::DOM->new( $self->input );
            ...
            # return an arrayref of data
        }

        1;

    And the unit test:

        use Test2::V0;
        use v5.36;
        use ScrapeMyData;

        my $scraper= ScrapeMyData->new(input => <<~'END');
          <html>
          <head>...</head>
          <body>
          ...
          <div class="JMWMJ"><div class="toI8Rb OSrXXb usbThf">
          Sam Namett, MD - Physician - Interventional Orthopedics ...Exosomes are nanovesicles (30-200 nm) found in extracellular space of various cell types, and in biofluids; having diverse functions including intracellular ...
          </div></div>
          ...
          </body>
          </html>
          END

        is( $scraper->parse,
            [
                # ...
                {
                    author => 'Sam Namett',
                    title  => 'Interventional Orthopedics ...',
                },
                # ...
            ],
            'parse'
        );

        done_testing;

    Then back to the original script:

        foreach (@files7){
            print "Newparse2 == Parsing file: $_ \n";
            my $scraper= ScrapeMyData->new(input => path($_)->slurp_utf8);
            my $data= $scraper->parse;
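
    For what it's worth, here is one way the extraction step might look if it goes by structure rather than by the generated class names. It is only a sketch under assumptions: extract_entries is a made-up name, and the heuristic (a div that is the only child of another div, contains no further markup, and has some text) is guessed from the single snippet posted, so it would need checking against the real pages.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical helper (not from the original script): returns an arrayref
# of text snippets, chosen by structure instead of by class name.
sub extract_entries {
    my ($html) = @_;
    my $dom = Mojo::DOM->new($html);
    my @entries;

    # Assumption: each data row is a div that is the only child of a div.
    for my $leaf ( $dom->find('div > div:only-child')->each ) {
        next if $leaf->children->size;    # skip wrappers that hold more markup
        my $text = $leaf->text;
        $text =~ s/^\s+|\s+$//g;          # trim surrounding whitespace
        push @entries, $text if length $text;
    }
    return \@entries;
}
```

    Inside the module, parse could then simply return extract_entries( $self->input ), and the unit test stays the place where the heuristic gets checked.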

    Hope that helps.

Re: perl Mojo DOM CSS syntax issues
by marto (Cardinal) on Jan 28, 2024 at 09:51 UTC

    It'd help if you could provide a better example of the HTML. You could try something like this:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use feature 'say';
        use Mojo::DOM;

        my $html = '<div class="JMWMJ"><div class="toI8Rb OSrXXb usbThf">Sam Namett, MD - Physician - Interventional Orthopedics ...Exosomes are nanovesicles (30-200 nm) found in extracellular space of various cell types, and in biofluids; having diverse functions including intracellular ...</div></div>
        <div class="JMWMJ"><div class="toI8Rb OSrXXb usbThf">Dr. Bombay - Physician - witch doctor ...canned laughter ...</div></div>
        <div class="JMWMJ"><div class="toI8Rb OSrXXb usbThf">Dr. Who - time lord - previously good TV show ...</div></div>';

        my $dom = Mojo::DOM->new( $html );

        for my $entry ( $dom->find('div.JMWMJ')->each ){
            say $entry->all_text;
        }

    Outputting:

        Sam Namett, MD - Physician - Interventional Orthopedics ...Exosomes are nanovesicles (30-200 nm) found in extracellular space of various cell types, and in biofluids; having diverse functions including intracellular ...
        Dr. Bombay - Physician - witch doctor ...canned laughter ...
        Dr. Who - time lord - previously good TV show ...

    Posting a more complete example (or example URL) would be beneficial. Super Search for more Mojo::DOM goodness.
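
    If the entries then need breaking into columns, the substitution in the original script (s#\s-\s|\s\|\s#%%#g) suggests the separators are " - " and " | ". A minimal sketch of doing that with split on the extracted text follows; split_fields is a hypothetical name, and the assumption that those two separators are the only ones comes from that one regex.

```perl
use strict;
use warnings;

# Hypothetical helper: split one scraped line into fields on " - " or " | ",
# the same separators the original script's substitution looks for.
sub split_fields {
    my ($text) = @_;
    return [ split /\s-\s|\s\|\s/, $text ];
}

my $fields = split_fields('Dr. Who - time lord - previously good TV show ...');
print join('|', @$fields), "\n";    # prints: Dr. Who|time lord|previously good TV show ...
```

    Because the pattern requires whitespace on both sides of the hyphen, something like "30-200 nm" stays in one piece.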

Re: perl Mojo DOM CSS syntax issues
by stevieb (Canon) on Jan 27, 2024 at 15:51 UTC

    @files7, Newparse2, $r1, $dom2... those are what I would call "magic numbers"... hard coded information inside code that really shouldn't be there, and that will eventually cause issues. Seems you're currently facing them.

    You really want to fetch information from this website through a proper web API.

    You can work many nights until 0300 hrs to patch this "large script" until you think it works, until it breaks again, and it will break again.

    It's fragile, very poorly written, hard to understand at a glance, and what I'd call easy to break. I also think I'd be safe to say there are no unit tests to compare revisions.

    This appears to be an X-Y problem. I would be hesitant to change this script in case it broke something irrelevant to the piece that you want to change. Instead, I'd write a new one to focus on the piece of information you need. At best, ensure you keep revisions of the current script as you change it. Depending on how big it is, with numbered variables like that, any change may break something far away.

    Sorry to be the bearer of bad news, but my fix would be to fix it properly, not put a band-aid on it so that the next person has to deal with comments as informative as "#added stuff here 01142012" *

    * - If I had a client where I read a comment like that and they didn't permit me to fix things properly, I'd fire them.
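
    A small standalone tear-out along those lines might look like the sketch below, so a selector can be tried against one saved page without running the big script. The name try_selector and the command-line handling are assumptions made for illustration. Note that the HTML snippet as posted contains no h3 or span elements, so a find('h3') on such a block would come back empty whatever the outer selector is.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical helper: return the text of every element matching $selector.
sub try_selector {
    my ($html, $selector) = @_;
    return [ Mojo::DOM->new($html)->find($selector)->map('all_text')->each ];
}

# Usage sketch (file name is an assumption):
#   perl try_selector.pl page.html 'div.JMWMJ'
if (@ARGV == 2) {
    my ($file, $selector) = @ARGV;
    open my $fh, '<:encoding(UTF-8)', $file or die "Cannot open $file: $!";
    my $html = do { local $/; <$fh> };    # slurp the whole file
    close $fh;
    print "$_\n" for @{ try_selector($html, $selector) };
}
```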

      "You really want to fetch information from this website through a proper web API."

      Not every website has its own API, hence people using a parser to achieve what they want, especially one that can cope with CSS selectors. Not everyone posting here is a professional programmer, nor has the luxury of 'firing' people asking them to do work.

Re: perl Mojo DOM CSS syntax issues
by bliako (Monsignor) on Jan 28, 2024 at 20:20 UTC

    NERDVANA has a good point in Re: perl Mojo DOM CSS syntax issues, saying that:

    In particular, <div class="JMWMJ"><div class="toI8Rb OSrXXb usbThf"> looks like it was auto-generated by some tool on the remote end, so you can expect those classes to change to new random strings any time the remote side gets recompiled.

    In the long term this is going to be a problem. But a problem which has a solution that can be automated fully. As opposed to the problem of the website changing its structure by adding/removing divs for example.

    The solution to div classes/ids being renamed is to keep some HTML documents from the website from a time when your program worked, and diff their attributes against the current HTML. The diff will tell you how the div classes/ids changed, and you can pass that info to your script to revise its anchors.

    Here is my 3AM-whipped-up-code which utilises XML::Diff -- which, despite its name, works for any DOM flavour, html included:

        use strict;
        use warnings;
        use XML::Diff;

        my $html1 =<<EOH;
        <html>
        <body>
        <div id="1">
          <div id="2"></div>
        </div>
        </body>
        </html>
        EOH

        my $html2 =<<EOH;
        <html>
        <body>
        <div id="4">
          <div id="5"></div>
        </div>
        </body>
        </html>
        EOH

        my $diff = XML::Diff->new();
        my $diffgram = $diff->compare(
            -old => $html1,
            -new => $html2,
        );
        print $diffgram;

    and the result reveals the changed div ids:

        <?xml version="1.0"?>
        <xvcs:diffgram xmlns:xvcs="http://www.xvcs.org/">
          <xvcs:update id="2" first-child-of="/html/body">
            <xvcs:attr-update name="id" old-value="1" new-value="4"/>
          </xvcs:update><xvcs:update id="1" first-child-of="/html/body/div">
            <xvcs:attr-update name="id" old-value="2" new-value="5"/>
          </xvcs:update></xvcs:diffgram>

    The so-called diffgram can tell your program its new anchors. With the new anchors automatically fixed, all you have to do is deal with structural changes in the website. Which is a Sisyphean task with a Herculean twist but, hey, no standards and no APIs or obfuscating important information inside unstructured HTML is how Capitalism creates jobs for the plebes and profits for the bosses. What the legend did not tell us is that every time Sisyphus' rock rolls back down the hill some fatcat makes a few drachmas.
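
    Where the pages are too messy for an XML parser, the same idea can be roughed out with Mojo::DOM alone. This is only a sketch and not what XML::Diff does internally: class_renames is a made-up name, and it assumes the old and new pages have exactly the same element structure, pairing elements purely by their position in document order and giving up as soon as that assumption fails.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical helper: map old class attribute values to new ones by
# pairing elements at the same position in document order.
# Returns a hashref, or undef if the two documents differ in structure.
sub class_renames {
    my ($old_html, $new_html) = @_;
    my @old = Mojo::DOM->new($old_html)->find('*')->each;
    my @new = Mojo::DOM->new($new_html)->find('*')->each;
    return unless @old == @new;    # element count differs: positions can't be trusted

    my %renamed;
    for my $i (0 .. $#old) {
        return unless $old[$i]->tag eq $new[$i]->tag;    # different tag: structure changed
        my $before = $old[$i]->attr('class') // '';
        my $after  = $new[$i]->attr('class') // '';
        $renamed{$before} = $after if $before ne $after;
    }
    return \%renamed;
}
```

    The returned hash (old class string => new class string) is what the scraper would use to revise its selectors; an undef result means a structural change that still needs a human.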

    bw, bliako

Re: perl Mojo DOM CSS syntax issues
by InfiniteSilence (Curate) on Jan 28, 2024 at 00:01 UTC

    There is a bunch of stuff written in Perl that has lots and lots of functionality I do not want or need so I use the parts I want and skip the other stuff. I could rewrite it but why?

    In this case you have supplied us with an input file that does not appear to be valid HTML, is incomplete, or both. I took the liberty of closing a few tags just to get the parser to stop complaining:

        <r>
          <div class="JMWMJ">
            <div class="toI8Rb OSrXXb usbThf">Sam Namett, MD - Physician - Interventional Orthopedics ...Exosomes are nanovesicles (30-200 nm) found in extracellular space of various cell types, and in biofluids; having diverse functions including intracellular ...
            </div>
            <div class="Xxy7Vb">
              <div class="BtwlAd">
                <g-img style="width:16px;height:16px">
                  <img id="dimg_8" src="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" />
                </g-img>
              </div>
            </div>
          </div>
        </r>

    Now to parse and grab a single element:

        perl -MXML::Simple -e 'my $x = qq~./limit.html~; system cat =>$x; my $rx = XMLin($x); use Data::Dumper; print Dumper \$rx; print qq~\n\n$rx->{div}->{div}->[0]->{content}~;'

    Produces...

        ...
        $VAR1 = \{
                    'div' => {
                               'div' => [
                                          {
        ...
        Sam Namett, MD - Physician - Interventional Orthopedics ...Exosomes are nanovesicles (30-200 nm) found in extracellular space of various cell types, and in biofluids; having diverse functions including intracellular ...

    The most interesting part being print qq~\n\n$rx->{div}->{div}->[0]->{content}~;

    My point is this: if your input is legit then parsing and extracting from it becomes easier. Perhaps, rather than rewrite a big script, work on cleaning up the input for now and work on the script later when you have more time.

    Celebrate Intellectual Diversity

Node Type: perlquestion [id://11157299]
Approved by marto