Re: Parsing and searching HTML code

by Your Mother (Chancellor)
on Jul 26, 2012 at 18:57 UTC (#983911)

in reply to Parsing and searching HTML code

Regular expressions are very fragile for HTML parsing. Here's a parser-based example using XML::LibXML:

use strictures;
use XML::LibXML;
use open qw(:std :utf8);
use YAML;

my $dom = XML::LibXML->load_html(
    string      => do { local $/; <DATA> },
    keep_blanks => 0,
);

my @advisories;

# Only select <p/>s that have a PDF link inside.
for my $p ( map { $_->parentNode } $dom->findnodes(q{//p//a[contains(@href,'.pdf')]}) )
{
    my %tmp;
    for my $kid ( $p->childNodes )
    {
        if ( $kid->nodeName eq "a" )
        {
            $tmp{pdf} = {
                title => $kid->textContent,
                href  => $kid->getAttribute("href"),
            };
        }
        elsif ( not $tmp{pdf} )
        {
            # You'd have to do some shuffling to handle <br/>->\n here.
            $tmp{heading} .= $kid->textContent;
        }
        else
        {
            ( $tmp{date} = $kid->textContent ) =~ s/[)(\n\r]//g;
        }
    }
    s/[\s,]+\Z// for $tmp{heading}, $tmp{date};
    push @advisories, \%tmp;
}

print YAML::Dump(\@advisories);
exit 0;

__DATA__
# YOUR HTML FRAGMENT HERE

Snippet of output

---
- date: 'March 21, 2011'
  heading: |-
    RealFlex TechnologiesMultiple Vulnerabilities in RealFlex RealWin
  pdf:
    href: /control_systems/pdf/ICS-ALERT-11-080-04.pdf
    title: ICS-ALERT-11-080-04
- date: 'April 20, 2011'
  heading: RealFlex RealWin Multiple Vulnerabilities
  pdf:
    href: /control_systems/pdf/ICSA-11-110-01.pdf
    title: ICSA-11-110-01
...

I do understand that regular expressions seem more accessible at first and can solve many specific/one-off problems, but putting in the time to climb the learning curve of any of the good HTML/XML parsers will repay you greatly over time.
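To make the fragility point concrete, here is a minimal sketch. The markup and the naive pattern are hypothetical, made up for illustration and not taken from the page discussed in this thread. It writes the same kind of PDF link three ways and compares what a simple regex finds with what XML::LibXML finds.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical markup: the same kind of link written three different ways.
my $html = <<'HTML';
<p>First <a href="/docs/a.pdf">A</a></p>
<p>Second <a class="x" href='/docs/b.pdf'>B</a></p>
<p>Third <A HREF="/docs/c.pdf"
   >C</A></p>
HTML

# A naive pattern (an assumption about what a first attempt might look like):
# it expects a lowercase tag, href as the first attribute, and double quotes.
my @by_regex = $html =~ /<a href="([^"]+\.pdf)"/g;

# The HTML parser normalizes tag/attribute case, quoting, and attribute order,
# so one XPath expression covers all three spellings.
my $dom = XML::LibXML->load_html( string => $html );
my @by_parser = map { $_->getAttribute("href") }
    $dom->findnodes(q{//a[contains(@href,'.pdf')]});

printf "regex:  %d link(s)\n", scalar @by_regex;
printf "parser: %d link(s)\n", scalar @by_parser;
```

The regex could of course be patched for each variation, but every patch is another assumption about how the page happens to be written; the XPath version only assumes there is an a element with ".pdf" somewhere in its href.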

Replies are listed 'Best First'.
Re^2: Parsing and searching HTML code
by jayto (Acolyte) on Jul 26, 2012 at 20:32 UTC
    Thanks for showing me that. I'll probably use your post as a reference in the future, but I already finished my program and moved on to the next part: parsing the PDF file...
