PerlMonks  

Re: Parsing and searching HTML code

by Your Mother (Canon)
on Jul 26, 2012 at 18:57 UTC (#983911)


in reply to Parsing and searching HTML code

Regular expressions are very fragile for HTML parsing. Here's a parser-based example using XML::LibXML:

    use strictures;
    use XML::LibXML;
    use open qw(:std :utf8);
    use YAML;

    my $dom = XML::LibXML->load_html( string => do { local $/; <DATA> },
                                      keep_blanks => 0 );

    my @advisories;
    # Only select <p/>s that have a PDF link inside.
    for my $p ( map { $_->parentNode }
                $dom->findnodes(q{//p//a[contains(@href,'.pdf')]}) )
    {
        my %tmp;
        for my $kid ( $p->childNodes )
        {
            if ( $kid->nodeName eq "a" )
            {
                $tmp{pdf} = { title => $kid->textContent,
                              href  => $kid->getAttribute("href") };
            }
            elsif ( not $tmp{pdf} )
            {
                # You'd have to do some shuffling to handle <br/>->\n here.
                $tmp{heading} .= $kid->textContent;
            }
            else
            {
                ( $tmp{date} = $kid->textContent ) =~ s/[)(\n\r]//g;
            }
        }
        s/[\s,]+\Z// for $tmp{heading}, $tmp{date};
        push @advisories, \%tmp;
    }

    print YAML::Dump(\@advisories);

    exit 0;

    __DATA__
    # YOUR HTML FRAGMENT HERE

Snippet of output:

    ---
    - date: 'March 21, 2011'
      heading: |-
        RealFlex TechnologiesMultiple Vulnerabilities in RealFlex RealWin
      pdf:
        href: /control_systems/pdf/ICS-ALERT-11-080-04.pdf
        title: ICS-ALERT-11-080-04
    - date: 'April 20, 2011'
      heading: RealFlex RealWin Multiple Vulnerabilities
      pdf:
        href: /control_systems/pdf/ICSA-11-110-01.pdf
        title: ICSA-11-110-01
    ...
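The comment in the code above notes that turning <br/> into a newline takes "some shuffling". One way that might look is sketched below. The helper name text_with_breaks and the sample markup are hypothetical, not part of the original program; the idea is to walk the child nodes yourself instead of calling textContent, which contributes nothing for an empty <br/> element.

```perl
use strict;
use warnings;
use XML::LibXML;    # exports the XML_*_NODE constants by default

# Hypothetical helper: like textContent, but each <br> becomes "\n".
sub text_with_breaks {
    my ($node) = @_;
    my $text = "";
    for my $kid ( $node->childNodes ) {
        if ( $kid->nodeType == XML_ELEMENT_NODE ) {
            # A <br> contributes a newline; any other element is descended into.
            $text .= $kid->nodeName eq "br" ? "\n" : text_with_breaks($kid);
        }
        elsif ( $kid->nodeType == XML_TEXT_NODE ) {
            $text .= $kid->nodeValue;
        }
        # Comments and other node types are ignored.
    }
    return $text;
}

# Assumed sample markup, for illustration only.
my $dom = XML::LibXML->load_html(
    string => '<p>RealFlex Technologies<br/>Multiple Vulnerabilities</p>' );
my ($p) = $dom->findnodes('//p');
print text_with_breaks($p), "\n";
```

In the loop above, the heading branch could then append text_with_breaks($kid) for element children, and a literal newline when $kid itself is a <br>, instead of $kid->textContent.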

I do understand that regular expressions seem more accessible at first and can solve many specific, one-off problems, but the time you put into climbing the learning curve of any of the good HTML/XML parsers will repay you greatly over time.
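To make the fragility concrete, here is a small sketch comparing a naive regex with an XPath query. The markup and the regex are made up for illustration: the same kind of PDF link is written three ways (plain, with an extra attribute and single quotes, and with a line break inside the tag).

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical markup: three equivalent ways of writing a PDF link.
my $html = <<'HTML';
<p><a href="/docs/a.pdf">A</a></p>
<p><a class="x" href='/docs/b.pdf'>B</a></p>
<p><a
   href="/docs/c.pdf">C</a></p>
HTML

# A naive regex that expects exactly: <a href="...pdf"
my @by_regex = $html =~ /<a href="([^"]+\.pdf)"/g;

# The parser deals with quoting, attribute order, and whitespace for us.
my $dom = XML::LibXML->load_html( string => $html );
my @by_parser = map { $_->getAttribute("href") }
                $dom->findnodes(q{//a[contains(@href,'.pdf')]});

print "regex found:  @by_regex\n";
print "parser found: @by_parser\n";
```

The regex only matches the first link, while the XPath query returns all three. The regex could be patched for each variation, but every new variation in the markup needs another patch.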


Re^2: Parsing and searching HTML code
by jayto (Acolyte) on Jul 26, 2012 at 20:32 UTC
    Thanks for showing me that. I'll probably use your post as a reference in the future, but I've already finished my program and moved on to the next part: parsing the PDF file...
