
Parsing HTML Pages

by SteveBo (Initiate)
on Feb 07, 2013 at 02:41 UTC (#1017547)
SteveBo has asked for the wisdom of the Perl Monks concerning the following question:

I am new to Perl and am stuck on some code. I am trying to parse the HTML pages of some college athletic directories to get a list of all the coaches. I want to walk through the page until I match a "section" regex, grab each row (tr) that follows until I hit the next match for the section regex, then start again. I have worked out how to get the sport and the first coach under each heading, but I cannot work out how to grab all the coaches under a heading up to the next heading match. See my code below:

while ($content =~ m/<(td|th).*?<strong>(<.*?>)?(&nbsp;)?(?<sport>([(]?(Women('|&#39;)s|Men('|&#39;)s|W|M)?[)]?(\s|&nbsp;)?)?(Archery|Badminton|Baseball|Basketball|Bowling|Cross Country|Track (&|&amp;|[Aa]nd) Field|Equestrian|Fencing|Field Hockey|Football|Golf|Gymnastics|Ice Hockey|Lacrosse|Rowing|Rifle|Rugby|Skiing|Soccer|Softball|Squash|Swimming ([Aa]nd |\s?[\/-]\s?|)Diving|Swimming|Diving|Synchronized Swim|Team Handball|Handball|Tennis|Volleyball|Water Polo|Wrestling)(\s?[-,]\s?)?\s?[(]?(Women(('|&#39;)s)?|Men(('|&#39;)s)?|W|M)?[)]?):?(.*?)?<\/(strong|br)>/gi) {
    print $+{sport} . "\r\n";
    if ($content =~ m/\G.*?<tr.*?>(.*?)<\/tr>/sgc) {
        my $coach_info = $1;
        while ($coach_info =~ m/<td.*?>(.*?)<\/td>/mg) {
            print $1 . "\r\n";
        }
    }
}

Any ideas?

Replies are listed 'Best First'.
Re: Parsing HTML Pages
by Your Mother (Bishop) on Feb 07, 2013 at 03:16 UTC
Re: Parsing HTML Pages
by Kenosis (Priest) on Feb 07, 2013 at 04:01 UTC

    Welcome to PerlMonks, SteveBo!

    You've likely seen the document that Your Mother rightly linked to; it tries to dissuade people from using a regex to parse HTML.

    There are excellent HTML parsing modules you can use, and a recent discussion of some of them can be found at this PM node: HTML Parser suggestions.

    Hope this helps!
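
As a rough illustration of the parser-based approach suggested in this reply, here is a minimal sketch using HTML::TreeBuilder from CPAN. It rests on several assumptions that are not in the original post: the directory is a single flat table (no nested tables), each sport heading sits in its own row inside a <strong> element, and the coach rows follow as ordinary <td> rows. The $SPORT_RE pattern, the coaches_by_sport function, and the sample names are all hypothetical stand-ins; the short pattern would be swapped for the full sport regex from the question.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical stand-in for the long sport regex in the question.
my $SPORT_RE = qr/\b(Baseball|Basketball|Football|Soccer|Tennis)\b/i;

# Returns a list of [ $sport, [ [cell, ...], ... ] ] pairs in page order.
# Assumes a flat table where a heading row contains <strong>Sport</strong>.
sub coaches_by_sport {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my (@sections, $current);

    # look_down in list context returns every matching <tr> in document order.
    for my $row ($tree->look_down(_tag => 'tr')) {
        # Scalar context: first <strong> inside this row, or undef.
        my $strong = $row->look_down(_tag => 'strong');
        if ($strong && $strong->as_trimmed_text =~ $SPORT_RE) {
            # A heading row starts a new section.
            $current = [ $strong->as_trimmed_text, [] ];
            push @sections, $current;
            next;
        }
        next unless $current;    # skip rows before the first heading

        # Every other row is collected under the most recent heading.
        my @cells = map { $_->as_trimmed_text } $row->look_down(_tag => 'td');
        push @{ $current->[1] }, \@cells if @cells;
    }

    $tree->delete;    # release the parse tree
    return @sections;
}

# Made-up sample data, only to exercise the function.
my $sample = <<'HTML';
<table>
  <tr><th colspan="2"><strong>Baseball</strong></th></tr>
  <tr><td>Pat Example</td><td>Head Coach</td></tr>
  <tr><td>Lee Sample</td><td>Assistant Coach</td></tr>
  <tr><th colspan="2"><strong>Women's Soccer</strong></th></tr>
  <tr><td>Sam Placeholder</td><td>Head Coach</td></tr>
</table>
HTML

for my $section (coaches_by_sport($sample)) {
    my ($sport, $rows) = @$section;
    print "$sport\n";
    print "  ", join(' | ', @$_), "\n" for @$rows;
}
```

Because the loop walks the rows in order and remembers the current heading, every row up to the next heading is collected, which is the part the original regex loop stops short of. Real directory pages will likely need adjustments, for example headings that are not in <strong> or cells padded with non-breaking spaces.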
