PerlMonks  

Re: Parsing and searching HTML code

by Your Mother (Chancellor)
on Jul 26, 2012 at 18:57 UTC (#983911)


in reply to Parsing and searching HTML code

Regular expressions are very fragile for HTML parsing. Here's a parser-based example using XML::LibXML:

use strictures;
use XML::LibXML;
use open qw(:std :utf8);
use YAML;

my $dom = XML::LibXML->load_html( string => do { local $/; <DATA> },
                                  keep_blanks => 0 );

my @advisories;
# Only select <p/>s that have a PDF link inside.
for my $p ( map { $_->parentNode }
            $dom->findnodes(q{//p//a[contains(@href,'.pdf')]}) )
{
    my %tmp;
    for my $kid ( $p->childNodes )
    {
        if ( $kid->nodeName eq "a" )
        {
            $tmp{pdf} = { title => $kid->textContent,
                          href  => $kid->getAttribute("href") };
        }
        elsif ( not $tmp{pdf} )
        {
            # You'd have to do some shuffling to handle <br/>->\n here.
            $tmp{heading} .= $kid->textContent;
        }
        else
        {
            ( $tmp{date} = $kid->textContent ) =~ s/[)(\n\r]//g;
        }
    }
    s/[\s,]+\Z// for $tmp{heading}, $tmp{date};
    push @advisories, \%tmp;
}

print YAML::Dump(\@advisories);

exit 0;

__DATA__
# YOUR HTML FRAGMENT HERE

Snippet of output

---
- date: 'March 21, 2011'
  heading: |-
    RealFlex TechnologiesMultiple Vulnerabilities in RealFlex RealWin
  pdf:
    href: /control_systems/pdf/ICS-ALERT-11-080-04.pdf
    title: ICS-ALERT-11-080-04
- date: 'April 20, 2011'
  heading: RealFlex RealWin Multiple Vulnerabilities
  pdf:
    href: /control_systems/pdf/ICSA-11-110-01.pdf
    title: ICSA-11-110-01
...
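
The run-together "RealFlex TechnologiesMultiple" in the first heading is the kind of thing the <br/> comment in the code is pointing at. Here is a minimal sketch of one way to do that shuffling: swap each <br/> for a newline text node before collecting the heading text. The HTML fragment and the helper name heading_with_breaks are made up for illustration and are not the original poster's data, so treat this as a starting point rather than a drop-in fix.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical fragment, invented for this sketch (not the OP's HTML).
my $html = q{<p>Vendor Name<br/>Product Title, }
         . q{<a href="/docs/example.pdf">EXAMPLE-01</a> (January 1, 2000)</p>};

# Hypothetical helper: returns the heading text of the first <p>,
# with every <br/> turned into a newline.
sub heading_with_breaks {
    my ($source) = @_;
    my $dom = XML::LibXML->load_html( string => $source, keep_blanks => 0 );
    my ($p) = $dom->findnodes('//p') or return;

    # Replace each <br/> element with a text node holding "\n".
    $_->replaceNode( $dom->createTextNode("\n") ) for $p->findnodes('.//br');

    # Collect text from the children that come before the PDF link.
    my $heading = '';
    for my $kid ( $p->childNodes ) {
        last if $kid->nodeName eq 'a';
        $heading .= $kid->textContent;
    }
    $heading =~ s/[\s,]+\z//;    # same trailing cleanup as the main script
    return $heading;
}

print heading_with_breaks($html), "\n";
```

In the main script the same replaceNode loop could run on each $p before the childNodes loop; the rest of the logic would stay as it is.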

I do understand that regular expressions seem more accessible at first and can solve many specific, one-off problems, but putting in the time to climb the learning curve of any of the good HTML/XML parsers will repay you greatly over time.
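
To make the fragility point concrete, here is a small sketch comparing a quick regex with the XPath approach on two links that a browser would treat the same. Both HTML strings and both helper names (href_by_regex, href_by_parser) are hypothetical, and the regex is just one plausible example of what people tend to write, not anything from the original question.

```perl
use strict;
use warnings;
use XML::LibXML;

# Two hypothetical variants of the same link: the second adds an
# attribute before href and uses single quotes.
my @variants = (
    q{<p><a href="/docs/example.pdf">EXAMPLE-01</a></p>},
    q{<p><a class="doc" href='/docs/example.pdf'>EXAMPLE-01</a></p>},
);

# A typical quick regex: assumes href comes first and is double-quoted.
sub href_by_regex {
    my ($html) = @_;
    return $html =~ /<a href="([^"]+\.pdf)"/ ? $1 : undef;
}

# The parser does not care about attribute order or quoting style.
sub href_by_parser {
    my ($html) = @_;
    my $dom = XML::LibXML->load_html( string => $html );
    my ($link) = $dom->findnodes(q{//a[contains(@href,'.pdf')]});
    return $link ? $link->getAttribute('href') : undef;
}

for my $html (@variants) {
    printf "regex: %-20s parser: %s\n",
        href_by_regex($html)  // '(no match)',
        href_by_parser($html) // '(no match)';
}
```

The regex can of course be patched to cope with the second variant, but each patch tends to cover one more markup quirk, while the parser version needs no changes.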


Re^2: Parsing and searching HTML code
by jayto (Acolyte) on Jul 26, 2012 at 20:32 UTC
    Thanks for showing me that, I'm probably going to be using your post as a reference in the future, but I already finished my program and I moved on to the next part... Parsing the PDF file...
