Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight

Re: Parsing and searching HTML code

by Your Mother (Canon)
on Jul 26, 2012 at 18:57 UTC ( #983911=note: print w/ replies, xml ) Need Help??

in reply to Parsing and searching HTML code

Regular expressions are very fragile for HTML parsing. Here's a parser (XML::LibXML) based example–

use strictures; use XML::LibXML; use open qw(:std :utf8); use YAML; my $dom = XML::LibXML->load_html( string => do { local $/; <DATA> }, keep_blanks => 0 ); my @advisories; # Only select <p/>s that have a PDF link inside. for my $p ( map { $_->parentNode } $dom->findnodes(q{//p//a[contains(@ +href,'.pdf')]}) ) { my %tmp; for my $kid ( $p->childNodes ) { if ( $kid->nodeName eq "a" ) { $tmp{pdf} = { title => $kid->textContent, href => $kid->getAttribute("href") }; } elsif ( not $tmp{pdf} ) { # You'd have to do some shuffling to handle <br/>->\n here +. $tmp{heading} .= $kid->textContent; } else { ( $tmp{date} = $kid->textContent ) =~ s/[)(\n\r]//g; } } s/[\s,]+\Z// for $tmp{heading}, $tmp{date}; push @advisories, \%tmp; } print YAML::Dump(\@advisories); exit 0; __DATA__ # YOUR HTML FRAGMENT HERE

Snippet of output

--- - date: 'March 21, 2011' heading: |- RealFlex TechnologiesMultiple Vulnerabilities in RealFlex RealWin pdf: href: /control_systems/pdf/ICS-ALERT-11-080-04.pdf title: ICS-ALERT-11-080-04 - date: 'April 20, 2011' heading: RealFlex RealWin Multiple Vulnerabilities pdf: href: /control_systems/pdf/ICSA-11-110-01.pdf title: ICSA-11-110-01 ...

I do understand that regular expressions seem more accessible at first and can solve many specific/one-off problems but putting in the time to get up the learning curve of any of the good HTML/XML parsers will repay greatly over time.

Comment on Re: Parsing and searching HTML code
Select or Download Code
Re^2: Parsing and searching HTML code
by jayto (Acolyte) on Jul 26, 2012 at 20:32 UTC
    Thanks for showing me that, I'm probably going to be using your post as a reference in the future, but I already finished my program and I moved on to the next part... Parsing the PDF file...

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://983911]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others contemplating the Monastery: (8)
As of 2014-10-01 00:51 GMT
Find Nodes?
    Voting Booth?

    How do you remember the number of days in each month?

    Results (386 votes), past polls