
parsing html

by qingxia (Novice)
on Mar 21, 2013 at 20:08 UTC (#1024817, perlquestion)
qingxia has asked for the wisdom of the Perl Monks concerning the following question:


I am pretty new to Perl, so please forgive me if this is too obvious a question.

The string I would like to parse has the following structure:
<td some text here> N </td>
and I want to extract N using  $line =~ /<td(.*?)>(.*?)<\/td>/  but occasionally the string appears split across several lines, for example
<td some text here>
N
</td>
and the code I'm using misses that N. Is there any way to work around this? Thanks in advance. shawn

Replies are listed 'Best First'.
Re: parsing html
by tobyink (Abbot) on Mar 21, 2013 at 21:24 UTC

    McA's answer is technically correct, but going down the regexp route is likely to cause you more pain further down the road.

    For example, have you considered HTML where a greater-than sign legitimately occurs in an attribute?

    <td title="n > 5">n greater than 5</td>

    Are you aware that the </td> closing tag is optional (as per the HTML 3.2, HTML 4 and HTML 5 specs)? The following is therefore legitimate:

    <tr> <td>1 <td>2 <td>3</td> </tr>

    You're better off using one of the many HTML parsing modules on CPAN which will already cover these corner cases.
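To make the two corner cases above concrete, here is a small sketch that runs the regex from the question (with the 's' modifier) against both example snippets. It uses only core Perl; the variable names are illustrative and not from the thread.

```perl
use strict;
use warnings;

# The pattern from the question, with /s so '.' also matches newlines.
my $re = qr/<td(.*?)>(.*?)<\/td>/s;

# Case 1: a '>' inside an attribute value ends the "tag" too early.
my ($attrs, $text) = '<td title="n > 5">n greater than 5</td>' =~ $re;
print "attrs=[$attrs] text=[$text]\n";

# Case 2: optional </td> tags make three cells look like one.
my @pairs = '<tr> <td>1 <td>2 <td>3</td> </tr>' =~ /$re/g;
my $cell_count = @pairs / 2;    # two captures per match
print "cells found: $cell_count, first text=[$pairs[1]]\n";
```

In the first case the attribute capture stops at the '>' inside the title, so the cell text comes back as ' 5">n greater than 5'. In the second, the three cells are reported as a single cell containing '1 <td>2 <td>3'.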

    package Cow { use Moo; has name => (is => 'lazy', default => sub { 'Mooington' }) } say Cow->new->name
Re: parsing html
by kennethk (Abbot) on Mar 21, 2013 at 21:31 UTC

    McA's solution will fix your immediate question, but if you are parsing HTML in anything other than an educational or 1-off context, I would suggest you use a CPAN module rather than reinvent the wheel; perhaps HTML::Parser or Mojo::DOM would be helpful. HTML in the wild is notoriously hard to handle in a general way.
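As a rough illustration of the module route, the sketch below collects cell text with HTML::Parser (a CPAN module, so it must be installed). The sample HTML and the handler logic are assumptions made for this example. Note that HTML::Parser is a tokenizer and does not supply missing end tags, so the sketch treats each new <td> as the start of a fresh cell.

```perl
use strict;
use warnings;
use HTML::Parser;

# Made-up sample combining a '>' in an attribute, omitted </td> tags,
# and a cell that spans several lines.
my $html = qq{<tr> <td title="n > 5">1 <td>2 <td>\n3\n</td> </tr>};

my @cells;
my $in_td = 0;

my $parser = HTML::Parser->new(
    api_version => 3,
    # A new <td> starts a fresh cell, whether or not the last one was closed.
    start_h => [ sub {
        my ($tag) = @_;
        if ($tag eq 'td') { push @cells, ''; $in_td = 1 }
    }, 'tagname' ],
    # Text may arrive in several chunks, so append rather than assign.
    text_h => [ sub { $cells[-1] .= $_[0] if $in_td }, 'dtext' ],
    # </td> or </tr> ends the current cell.
    end_h => [ sub {
        my ($tag) = @_;
        $in_td = 0 if $tag eq 'td' || $tag eq 'tr';
    }, 'tagname' ],
);

$parser->parse($html);
$parser->eof;

# Trim surrounding whitespace from each cell.
s/^\s+|\s+$//g for @cells;
print join('|', @cells), "\n";
```

A tree-building module would handle nesting and implied end tags more thoroughly; this is only meant to show the general shape of an event-based approach.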

    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

      And to tobyink and kennethk: it is actually a dataset which I need to prepare for the next stage of analysis. It comes in as several HTML files, and each of them contains a rather stable pattern like:

      id xxx borrower xxx date xxx ...

      I want to convert them into a standard format that can be read by commercial statistical software such as Stata, e.g.

      id borrower date ...
      xxx xxxx xxxx

      It is a little too time-consuming to do in Excel, so I switched to Perl, as I would really like to learn it; learning by doing is more fun. You could call it a one-off project, because I will (I hope) not parse HTML frequently, but thank you anyway for the suggestion; I totally agree with you. Best regards, sh
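For a one-off conversion like the one described, a sketch along the following lines might be a starting point. Everything about the input is an assumption: the thread does not show the real files, so the sample records, the alternating label/value cell layout, and the helper name cells() are all hypothetical. The output is tab-separated text with a header row, a format Stata can import.

```perl
use strict;
use warnings;

# Hypothetical input: one HTML fragment per record, with label and value
# cells alternating. The real files may be laid out differently.
my @records = (
    "<tr><td class=a>id</td><td>\n101\n</td><td>borrower</td><td>Smith</td>"
      . "<td>date</td><td>2013-03-21</td></tr>",
    "<tr><td>id</td><td>102</td><td>borrower</td><td>Jones</td>"
      . "<td>date</td><td>2013-03-22</td></tr>",
);

# Return the trimmed text of every <td> cell in a fragment.
sub cells {
    my ($html) = @_;
    my @cells = $html =~ /<td[^>]*>(.*?)<\/td>/gs;
    s/^\s+|\s+$//g for @cells;
    return @cells;
}

my (@columns, @lines);
for my $html (@records) {
    my @cells = cells($html);
    my %value = @cells;    # label => value pairs
    if (!@columns) {
        # Column order comes from the labels of the first record.
        @columns = @cells[ grep { $_ % 2 == 0 } 0 .. $#cells ];
        push @lines, join("\t", @columns);
    }
    push @lines, join("\t", map { $value{$_} // '' } @columns);
}

print "$_\n" for @lines;
```

Redirecting the output to a file gives one header line followed by one line per record.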

        When I said "1-off context", this is exactly what I meant; a quick script to process 1 set of data. I wholly support your choice of regex for this task.

Re: parsing html
by McA (Priest) on Mar 21, 2013 at 20:15 UTC

    The regex modifier 's' should do the trick; it makes '.' match newlines as well:

    $line =~ /<td(.*?)>(.*?)<\/td>/s
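A short sketch of the difference the modifier makes, using a made-up multi-line value for $line. It assumes the whole cell is already in $line (for example because the file was read in one piece rather than line by line); if the input is read one line at a time, no single line contains both <td and </td>, and the modifier cannot help.

```perl
use strict;
use warnings;

# Made-up sample: the cell is spread over three lines.
my $line = "<td some text here>\nN\n</td>";

# Without /s, '.' does not match "\n", so the match fails.
my @without = $line =~ /<td(.*?)>(.*?)<\/td>/;

# With /s, '.' matches any character, including "\n".
my @with = $line =~ /<td(.*?)>(.*?)<\/td>/s;

my $n = @with ? $with[1] : '';
$n =~ s/^\s+|\s+$//g;    # strip the surrounding newlines and spaces

print "without /s: ", scalar(@without), " captures\n";
print "with /s: N = [$n]\n";
```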


      Thanks to McA. It works well.
