PerlMonks  

How do I extract text from an HTML page?

by jamiel (Initiate)
on Aug 03, 2003 at 04:19 UTC (#280396)
jamiel has asked for the wisdom of the Perl Monks concerning the following question:

Imagine that I have a document named finalgrades.html, located at http://somedomain.com/finalgrades.html, and it contains the following text:

<html>
<head> <title></title> </head>
<body>
<b>Guy 1</b>, grade: 100<br>
<b>Guy 2</b>, grade: 70<br>
<b>Guy 3</b>, grade: 98<br>
</body>
</html>

I want a Perl script (to run as a CGI) that takes the value after "grade:" and before "<br>" on the line containing a name that I choose (for example "Guy 1"), and prints it to the screen when run as a CGI on my web page, something like this:

<html>
<head> <title>CONGRATULATIONS TO THE PARTICIPANT Guy 1</title> </head>
<body>
<h1>Guy 1</h1>
Has just got a "100" as a grade for this Seminar. Congratulations!!!<br>
</body>
</html>

edited: Sun Aug 3 15:19:23 2003 by jeffa - formatting, linkafied link
edited: Mon Mar 13 10:25:23 2006 by jamiel - formatting, spelling, etc...

Re: How do I extract text from an HTML page?
by bobn (Chaplain) on Aug 03, 2003 at 05:22 UTC

    Use the CGI.pm module to generate your HTML.

    There are many modules to parse HTML. HTML::TokeParser::Simple looks promising but there are others.
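As a rough illustration of the parser-module approach, here is a minimal sketch. It assumes the HTML::TokeParser module from the HTML-Parser distribution on CPAN (the class that HTML::TokeParser::Simple builds on) and the sample page from the question; the sub name grade_for is made up for this example.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Return the grade listed for $wanted, or undef if the name is not found.
# (grade_for is a hypothetical helper name, not part of any module.)
sub grade_for {
    my ($html, $wanted) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Can't create parser: $!";

    # Each entry looks like: <b>NAME</b>, grade: GRADE<br>
    while ($parser->get_tag('b')) {
        my $name = $parser->get_trimmed_text('/b');
        next unless $name eq $wanted;

        $parser->get_tag('/b');                # step past the closing </b>
        my $rest = $parser->get_text('br');    # text up to the next <br>
        return $1 if $rest =~ /grade:\s*(\d+)/;
    }
    return undef;
}

# Sample data copied from the question.
my $html = <<'HTML';
<html> <head> <title></title> </head> <body>
<b>Guy 1</b>, grade: 100<br>
<b>Guy 2</b>, grade: 70<br>
<b>Guy 3</b>, grade: 98<br>
</body> </html>
HTML

print "Guy 2: ", grade_for($html, 'Guy 2'), "\n";
```

Because the parser works on tokens rather than on lines, this should keep working if the page is reflowed onto one line or the tags change case.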

    You can check these and others at http://search.cpan.org.

    Update: fixed links.

    --Bob Niederman, http://bob-n.com
Re: How do I extract text from an HTML page?
by CountZero (Bishop) on Aug 03, 2003 at 16:06 UTC

    Well, whatever you do, the one way not to go is to regex the HTML code yourself. That only works for the simplest and most regular HTML, and it will break before you know it.
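To make the fragility concrete, here is a small sketch in plain Perl. The pattern is one plausible line-anchored regex for the '<b>NAME</b>, grade: GRADE<br>' format, and the variant lines are invented examples of markup changes a real page might pick up; neither comes from an actual page.

```perl
use strict;
use warnings;

# Line-anchored pattern for '<b>NAME</b>, grade: GRADE<br>'.
my $pattern = qr{^<b>([^<>]+)</b>, grade: (\d+)<br>$};

# True if the line matches the expected format exactly.
sub matches {
    my ($line) = @_;
    return $line =~ $pattern ? 1 : 0;
}

my @lines = (
    '<b>Guy 1</b>, grade: 100<br>',                 # the exact expected format
    '<B>Guy 1</B>, grade: 100<BR>',                 # upper-case tags
    '<b class="name">Guy 1</b>, grade: 100<br>',    # an added attribute
    '<b>Guy 1</b>, grade: 100<br />',               # XHTML-style line break
);

printf "%-45s %s\n", $_, matches($_) ? 'match' : 'no match' for @lines;
```

Only the first line matches. Each of the other three is still valid markup that a browser renders identically, which is why hand-written regexes tend to fail on real pages.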

    Another approach is to go to the source of the data behind that web page. Assuming that the page is generated from some database, can't you go directly to that database and query the data from there?

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: How do I extract text from an HTML page?
by ido50 (Scribe) on Aug 03, 2003 at 17:53 UTC
    Assuming that the format for every grade is '<b>NAME</b>, grade: GRADE<br>' and that no other lines happen to begin with the same format (well, actually you don't have to worry about that if you use the regex I wrote; at least I think so), you can use this little snippet, which uses a regex to create the HTML page. You can easily alter it to just print the page instead of creating an HTML file:

    # $path is the full path to the directory where the file resides.
    open(FILE, "<$path/finalgrades.html") or die "Can't open file: $!";
    while (<FILE>) {
        # Capture the name ($1) and the grade ($2).
        # m{...} avoids having to escape the "/" in </b>.
        if (m{^\s*<b>([^<>]+)</b>, grade: (\d+)<br>\s*$}) {
            my $name  = $1;
            my $grade = $2;
            open(HTML, ">$path/$name.html") or die "$!";
            print HTML "<html><head><title>CONGRATULATIONS TO THE PARTICIPANT $name</title></head><body><h1>$name</h1> has just received \"$grade\" as a grade for this seminar. Congratulations!!!<br></body></html>";
            close(HTML) or die "$!";
        }
    }
    close(FILE) or die "$!";

    Note that you may want to use CGI.pm to print the HTML tags instead of printing them directly. Go to http://search.cpan.org/author/LDS/CGI.pm-2.93/CGI.pm for more info on the CGI module.
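For what it's worth, generating the page with CGI.pm might look roughly like the sketch below. It assumes CGI.pm is installed (it is no longer bundled with recent Perls), and congrats_page is a made-up helper name. The name and grade are passed in directly, whereas a real CGI would presumably read the name from a request parameter.

```perl
use strict;
use warnings;
use CGI;

# Build the congratulations page for one participant and return it as a string.
# (congrats_page is a hypothetical helper, not part of CGI.pm.)
sub congrats_page {
    my ($name, $grade) = @_;
    my $q = CGI->new('');    # empty query string: this sketch reads no form input
    return join '',
        $q->header(-type => 'text/html'),
        $q->start_html(-title => "CONGRATULATIONS TO THE PARTICIPANT $name"),
        $q->h1($name),
        $q->p(qq{Has just got a "$grade" as a grade for this Seminar. Congratulations!!!}),
        $q->end_html;
}

print congrats_page('Guy 1', 100);
```

Letting the module emit the header and the opening and closing tags avoids the missing-quote and missing-semicolon slips that creep into hand-written print statements.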

    --------------------------
    Live fat, die young
      A few corrections to my post above:
      1. The parenthetical should read "(Well actually you don't have to worry about it if you...".
      2. The print HTML "bla bla" statement needs a terminating ";", and a "close HTML" statement should follow it.

      ----------------------
      Live fat, die young
        Last two corrections (not my day today):
        1. The print HTML "bla bla" statement also needs its terminating double quote.
        2. The "[^<>]" in the regex has to be a capturing group that matches the whole name: "([^<>]+)".
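To see why the name needs both the parentheses and a quantifier, here is a minimal sketch in plain Perl. A bare "[^<>]" matches exactly one character and sets no capture variable, while "([^<>]+)" captures the whole name as $1 and leaves the grade in $2. The sub name grades_from_lines is invented for this example, and the input lines are the ones from the question.

```perl
use strict;
use warnings;

# Map each name to its grade, one '<b>NAME</b>, grade: GRADE<br>' entry per line.
# (grades_from_lines is a hypothetical helper name.)
sub grades_from_lines {
    my (@lines) = @_;
    my %grade_for;
    for my $line (@lines) {
        # ([^<>]+) captures the whole name as $1; (\d+) captures the grade as $2.
        if ($line =~ m{^\s*<b>([^<>]+)</b>, grade: (\d+)<br>\s*$}) {
            $grade_for{$1} = $2;
        }
    }
    return %grade_for;
}

my %grades = grades_from_lines(
    '<b>Guy 1</b>, grade: 100<br>',
    '<b>Guy 2</b>, grade: 70<br>',
    '<b>Guy 3</b>, grade: 98<br>',
);

print "$_: $grades{$_}\n" for sort keys %grades;
```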

        ----------------------
        Live fat, die young
