PerlMonks  

Parsing HTML Pages

by SteveBo (Initiate)
on Feb 07, 2013 at 02:41 UTC (#1017547=perlquestion)
SteveBo has asked for the wisdom of the Perl Monks concerning the following question:

I am new to Perl and am stuck on some code. I am trying to parse the HTML pages of some college athletic directories to get a list of all the coaches. I want to walk through the page until I match a "section" regex, grab each row (tr) after it until I find another match for my section regex, then start again. I have figured out how to get the sport and the first coach under a heading, but I cannot work out how to grab all the coaches under a heading, up to the next heading match. See my code below:

while ($content =~ m/<(td|th).*?<strong>(<.*?>)?(&nbsp;)?(?<sport>([(]?(Women('|&#39;)s|Men('|&#39;)s|W|M)?[)]?(\s|&nbsp;)?)?(Archery|Badminton|Baseball|Basketball|Bowling|Cross Country|Track (&|&amp;|[Aa]nd) Field|Equestrian|Fencing|Field Hockey|Football|Golf|Gymnastics|Ice Hockey|Lacrosse|Rowing|Rifle|Rugby|Skiing|Soccer|Softball|Squash|Swimming ([Aa]nd |\s?[\/-]\s?|)Diving|Swimming|Diving|Synchronized Swim|Team Handball|Handball|Tennis|Volleyball|Water Polo|Wrestling)(\s?[-,]\s?)?\s?[(]?(Women(('|&#39;)s)?|Men(('|&#39;)s)?|W|M)?[)]?):?(.*?)?<\/(strong|br)>/gi) {
    print $+{sport} . "\r\n";
    if ($content =~ m/\G.*?<tr.*?>(.*?)<\/tr>/sgc) {
        my $coach_info = $1;
        while ($coach_info =~ m/<td.*?>(.*?)<\/td>/mg) {
            print $1 . "\r\n";
        }
    }
}

Any ideas?
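
The code above stops after one coach because the inner `if` matches a single `<tr>` and then hands control back to the outer loop. One possible way around that, using only core Perl, is to record where every heading matches first and then loop over all the rows between one heading and the next. The sketch below is only an illustration: the three-sport `$heading_re`, the sub name `coaches_by_sport`, and the sample table are placeholders that do not come from the post, and the long sport alternation from the question would need to be substituted in. It also assumes each heading sits in its own table row, and the usual caveats about matching HTML with regexes still apply.

```perl
use strict;
use warnings;

# Hypothetical, shortened heading pattern (assumption); the full sport
# alternation from the question would go inside the named capture.
my $heading_re = qr{<(?:td|th)[^>]*>\s*<strong>\s*(?<sport>Baseball|Basketball|Soccer)\s*</strong>}i;

sub coaches_by_sport {
    my ($content) = @_;

    # Pass 1: remember each heading's sport and its start/end offsets.
    my @headings;
    while ($content =~ /$heading_re/g) {
        push @headings, { sport => $+{sport}, start => $-[0], end => $+[0] };
    }

    # Pass 2: a section runs from the end of one heading to the start of
    # the next heading (or to the end of the page for the last one).
    my @sections;
    for my $i (0 .. $#headings) {
        my $from  = $headings[$i]{end};
        my $to    = $i < $#headings ? $headings[$i + 1]{start} : length $content;
        my $chunk = substr $content, $from, $to - $from;

        # Loop over every complete row in the section, not just the first.
        my @rows;
        while ($chunk =~ m{<tr[^>]*>(.*?)</tr>}sg) {
            my $row   = $1;
            my @cells = $row =~ m{<td[^>]*>(.*?)</td>}sg;
            push @rows, \@cells if @cells;
        }
        push @sections, { sport => $headings[$i]{sport}, rows => \@rows };
    }
    return \@sections;
}

# Made-up sample input for demonstration only.
my $html = join '',
    '<table>',
    '<tr><td><strong>Baseball</strong></td></tr>',
    '<tr><td>Ann Lee</td><td>Head Coach</td></tr>',
    '<tr><td>Bob Ray</td><td>Assistant Coach</td></tr>',
    '<tr><th><strong>Soccer</strong></th></tr>',
    '<tr><td>Cy Orr</td><td>Head Coach</td></tr>',
    '</table>';

for my $section (@{ coaches_by_sport($html) }) {
    print "$section->{sport}\n";
    print "  ", join(' | ', @$_), "\n" for @{ $section->{rows} };
}
```

Splitting the work into two passes avoids sharing `pos()` between the outer and inner matches on `$content`, which is what makes the `\G` version hard to reason about.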

Re: Parsing HTML Pages
by Your Mother (Canon) on Feb 07, 2013 at 03:16 UTC
Re: Parsing HTML Pages
by Kenosis (Priest) on Feb 07, 2013 at 04:01 UTC

    Welcome to PerlMonks, SteveBo!

    You've likely noted the document that Your Mother rightly linked to; it tries to dissuade people from using a regex to parse HTML.

    There are excellent HTML parsing modules you can use; several were recently shared in this PM node: HTML Parser suggestions.
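
To make the module route concrete, here is a minimal sketch of the "heading row, then coach rows" walk using HTML::TreeBuilder from CPAN. That module is my pick for illustration and is not necessarily one of those named in the linked node. The short `$sport_re`, the sub name `coaches_from_tree`, and the sample table are likewise assumptions, and the sketch treats any row whose first `<strong>` text is a sport name as a heading.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;    # CPAN module, not part of core Perl

# Hypothetical short sport list (assumption); extend as needed.
my $sport_re = qr/^\s*(?:Baseball|Basketball|Soccer)\s*$/i;

sub coaches_from_tree {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    my @sections;
    # look_down in list context returns every matching element in
    # document order, so rows arrive in the order they appear on the page.
    for my $row ($tree->look_down(_tag => 'tr')) {
        # In scalar context look_down returns the first match or undef.
        my $strong = $row->look_down(_tag => 'strong');
        if ($strong && $strong->as_text =~ $sport_re) {
            # A heading row starts a new section.
            push @sections, { sport => $strong->as_text, rows => [] };
            next;
        }
        next unless @sections;    # skip rows before the first heading
        my @cells = map { $_->as_text } $row->look_down(_tag => 'td');
        push @{ $sections[-1]{rows} }, \@cells if @cells;
    }
    $tree->delete;                # release the parse tree explicitly
    return \@sections;
}

# Made-up sample input for demonstration only.
my $page = join '',
    '<table>',
    '<tr><td><strong>Baseball</strong></td></tr>',
    '<tr><td>Ann Lee</td><td>Head Coach</td></tr>',
    '<tr><td>Bob Ray</td><td>Assistant Coach</td></tr>',
    '<tr><th><strong>Soccer</strong></th></tr>',
    '<tr><td>Cy Orr</td><td>Head Coach</td></tr>',
    '</table>';

for my $section (@{ coaches_from_tree($page) }) {
    print "$section->{sport}\n";
    print "  ", join(' | ', @$_), "\n" for @{ $section->{rows} };
}
```

Because the parser deals with attributes, entities, and odd whitespace, the only pattern left to maintain is the sport list itself. Nested tables would need extra care, since `look_down` also returns rows from inner tables.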

    Hope this helps!
