Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

How to dump a webpage content into a excel

by ghosh123 (Monk)
on Apr 09, 2013 at 08:14 UTC ( #1027671=perlquestion: print w/ replies, xml ) Need Help??
ghosh123 has asked for the wisdom of the Perl Monks concerning the following question:

Hi
I am trying to extract the content from a webpage and put the same into a xls file. The webpage contains some data in a tabular format which is of my interest. The page also has some 'a href' tags and images which can be ignored.
I need some help in terms of writing the data into the xls file. Please help

Below is the code

use Spreadsheet::WriteExcel; require LWP::UserAgent; LWP::UserAgent(); my $request = HTTP::Request->new(GET=> "http://dvtk.com"); my $response = $ua->request($request); if(!$response->is_success()) { print $response->error_as_HTML(); die "No cpan module found"; } my @data = split(/\n/, $response->content()); # Create a new Excel file my $filename = "/home/report/test.xls"; my $workbook = Spreadsheet::WriteExcel->new($filename); # Add a worksheet my $worksheet = $workbook->add_worksheet('exec2'); while (defined ($_ = shift@data)) { print "$_ \n"; $worksheet->write(0,0, $_); }

The html content I get from the site is as follows

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-U +S"> <head> <title>Cell Collections</title> <link rel="stylesheet" typ +e="text/css" href="/css/tree.css" /> <script src="/js/tree.js" type=" +text/javascript"></script> <meta http-equiv="Content-Type" content="t +ext/html; charset=iso-8859-1" /> </head> <body bgcolor="#FFFFFF"> <im +g src="/images/dvtk/IFX_LOGO.gif" alt="[IFX]" /><br /> <TABLE><TR><TD +><img align="left" src="/images/dvtk/logo100.gif" alt="[LOGO]" /><TD><h1>Cell collection status</h1><h3>AnalogIP:Test1 command: exec</h3></TD></TR></TABLE> <br /><p /><a href="collstat.pl"><b>[Collection overview]</b></a>&nbsp +;&nbsp;<a href="collstat.pl?path=/home/micado/autan_c65fla/dss/dss.co +mmon.default/units/macro/dvtk/results">[Overview for this collection]</a>&nbsp;&nbsp;<a href="cells.pl?path=/home/micado/autan_c65fla/dss/dss.common.default/u +nits/macro/dvtk/results">[All cells of this collection]</a>&nbsp;&nbsp;<a href="report.pl?path=/home +/micado/autan_c65fla/dss/dss.common.default/units/macro/dvtk/results" +>[Custom HTML Report]</a><p /><br /> <FONT SIZE=-1><I>(Click on the hyperlinks in the table headers to see +information on methods [if available])</I></FONT><BR> <TABLE BORDER=1 +><TR BGCOLOR="#D0D0D0"><TH ALIGN=LEFT NOWRAP>Cell</TH><TH ALIGN=LEFT>cdl_prefix</TH><TH ALIGN=LEFT>drc</TH><TH ALIGN=LEFT>drc_prefix</TH><TH ALIGN=LEFT>gds_ab +stract</TH><TH ALIGN=LEFT>gds_prefix</TH><TH ALIGN=LEFT><A HREF="meth +od.pl?path=/home/micado/autan_c65fla/dss/dss.common.default/units/mac +ro/dvtk/results;task=layverDoc;command=exec">layverDoc</A></TH><TH ALIGN=LEFT><A HREF="method.pl?path=/home/micado/autan_c65fla/dss/dss.c +ommon.default/units/macro/dvtk/results;task=layverShadow;command=exec +">layverShadow</A></TH><TH ALIGN=LEFT>lvs</TH><TH ALIGN=LEFT>lvs_prefix</TH><TH ALIGN=LEFT>macroL +ib</TH><TH ALIGN=LEFT>ocapi</TH><TH ALIGN=LEFT><A HREF="method.pl?pat +h=/home/micado/autan_c65fla/dss/dss.common.default/units/macro/dvtk/r +esults;task=prefix;command=exec">prefix</A></TH></TR> <TR CLASS='node' ID='clock_x2' VALIGN=top><TD ALIGN=LEFT NOWRAP><A CLA +SS=blank HREF="" ONCLICK="toggleRow(this.parentNode.parentNode.parentNode, 'clock_x2'); + return false;">&nbsp;<IMG ID="IMGclock_x2" SRC="/images/dvtk/minus.gif" BORDER=0>&nbsp;<B>clock_x2</B></A><B></B></TD><TD BGCOLOR="#D0D0D0"></TD><TD BGCOLOR="#D0D0D0"></TD><TD BGCOLOR="#D0D0D0 +"></TD><TD BGCOLOR="#D0D0D0"></TD><TD BGCOLOR="#D0D0D0"></TD><TD BGCO +LOR="#D0D0D0"></TD><TD BGCOLOR="#D0D0D0"></TD><TD BGCOLOR="#D0D0D0">< +/TD><TD BGCOLOR="#D0D0D0"></TD><TD BGCOLOR="#D0D0D0"></TD><TD BGCOLOR +="#D0D0D0"></TD><TD BGCOLOR="#D0D0D0"></TD></TR> <TR CLASS='node' ID='clock_x2@clock_x2' VALIGN=top><TD ALIGN=LEFT NOWRAP>&nbsp;&nbsp;&nbsp;<IMG ID="IMGclock_x2@clock_x2" SRC="/images/dvtk/bullet.gif" BORDER=0>&nbsp;<B>clock_x2</B><B></B></TD><TD><FONT COLOR="green">OK</FONT></TD><TD>-</TD><TD><FONT COLOR="green">OK</FONT></TD><TD><FONT COLOR="green">OK</FONT></TD><TD><FONT COLOR="green">OK</FONT></TD><TD><FONT COLOR="green">OK</FONT></TD><TD><FONT COLOR="green">OK</FONT></TD><TD>-</TD><TD><FONT COLOR="green">OK</FONT></TD><TD><FONT COLOR="green">OK</FONT></TD><TD><FONT COLOR="green">OK</FONT></TD><TD><FONT COLOR="green">OK</FONT></TD></TR> </TABLE> <BR><hr> <TABLE BORDER=0 CELLSPACING=0 CELLPADDING=0 WIDTH=100%> <TR> <TD WIDTH=42%> <FONT SIZE=-2>&copy; Copyright(c) 1996 - 2013 by , Team<BR> All rights reserved.</FONT></TD> <TD WIDTH=42%> <FONT SIZE=-2>Last modified Tue Apr 9 08:44:32 MEST 2013 <BR> by <A HREF="mailto:user@gmail.com">usergroup</A> </TD> <TD> <IMG BORDER=0 SRC="/images/dvtk/dvtk_dev_team.jpg" ALIGN=RIGHT HSPACE=0 VSPACE=0 ALT="[dvtk]"> </TD> </TR> </TABLE> <p /><br /> </body>

Comment on How to dump a webpage content into a excel
Select or Download Code
Re: How to dump a webpage content into a excel
by hdb (Prior) on Apr 09, 2013 at 08:28 UTC

    HTML::TableExtract is useful for extracting the information from the HTML table.

Re: How to dump a webpage content into a excel
by hdb (Prior) on Apr 09, 2013 at 11:22 UTC

    Using HTML::TableExtract is relatively straightforward but the content of your tables is weird. Play with the keep_html parameter in line 8, putting it to 0 or 1 whatever suits you best.

    use strict; use warnings; use HTML::TableExtract; my $response; { local $/ = undef; $response = <DATA>; } my $te = HTML::TableExtract->new( keep_html => 0 ); $te->parse($response); foreach my $ts ($te->tables) { print "\nTable found at ", join(',', $ts->coords), ":\n"; foreach my $row ($ts->rows) { foreach my $col (@$row) { if( defined $col ) { print "\t$col"; } else { print "\t---"; } } print "\n"; } } __DATA__ <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-U +S"> <head> <title>Cell Collections</title> <link rel="stylesheet" typ +e="text/css" href="/css/tree.css" /> <script src="/js/tree.js" type=" +text/javascript"></script> <meta http-equiv="Content-Type" content="t +ext/html; charset=iso-8859-1" /> </head> <body bgcolor="#FFFFFF"> <im +g src="/images/dvtk/IFX_LOGO.gif" alt="[IFX]" /><br /> <TABLE><TR><TD +><img align="left" src="/images/dvtk/logo100.gif" alt="[LOGO]" /><TD><h1>Cell collection status</h1><h3>AnalogIP:Test1 command: exec</h3></TD></TR></TABLE> <br /><p /><a href="collstat.pl"><b>[Collection overview]</b></a>&nbsp +;&nbsp;<a href="collstat.pl?path=/home/micado/autan_c65fla/dss/dss.co +mmon.default/units/macro/dvtk/results">[Overview for this collection]</a>&nbsp;&nbsp;<a href="cells.pl?path=/home/micado/autan_c65fla/dss/dss.common.default/u +nits/macro/dvtk/results">[All cells of this collection]</a>&nbsp;&nbsp;<a href="report.pl?path=/home +/micado/autan_c65fla/dss/dss.common.default/units/macro/dvtk/results" +>[Custom HTML Report]</a><p /><br /> <FONT SIZE=-1><I>(Click on the hyperlinks in the table headers to see +information on methods [if available])</I></FONT><BR> <TABLE BORDER=1 +><TR BGCOLOR="#D0D0D0"><TH ALIGN=LEFT NOWRAP>Cell</TH><TH ALIGN=LEFT>cdl_prefix</TH><TH ALIGN=LEFT>drc</TH><TH ALIGN=LEFT>drc_prefix</TH><TH ALIGN=LEFT>gds_ab +stract</TH><TH ALIGN=LEFT>gds_prefix</TH><TH ALIGN=LEFT><A HREF="meth +od.pl?path=/home/micado/autan_c65fla/dss/dss.common.default/units/mac +ro/dvtk/results;task=layverDoc;command=exec">layverDoc</A></TH><TH ALIGN=LEFT><A HREF="method.pl?path=/home/micado/autan_c65fla/dss/dss.c +ommon.default/units/macro/dvtk/results;task=layverShadow;command=exec +">layverShadow</A></TH><TH ALIGN=LEFT>lvs</TH><TH ALIGN=LEFT>lvs_prefix</TH><TH ALIGN=LEFT>macroL +ib</TH><TH ALIGN=LEFT>ocapi</TH><TH ALIGN=LEFT><A HREF="method.pl?pat +h=/home/micado/autan_c65fla/dss/dss.common.default/units/macro/dvtk/r +esults;task=prefix;command=exec">prefix</A></TH></TR> <TR CLASS='node' ID='clock_x2' VALIGN=top><TD ALIGN=LEFT NOWRAP><A CLA +SS=blank HREF="" ONCLICK="toggleRow(this.parentNode.parentNode.parentNode, 'clock_x2'); + return false;">&nbsp;<IMG ID="IMGclock_x2" SRC="/images/dvtk/minus.gif" BORDER=0>&nbsp;<B>clock_x2</B></A><B></B></TD><TD BGCOLOR="#D0D0D0"></TD><TD BGCOLOR="#D0D0D0"></TD><TD BGCOLOR="#D0D0D0 +"></TD><TD BGCOLOR="#D0D0D0"></TD><TD BGCOLOR="#D0D0D0"></TD><TD BGCO +LOR="#D0D0D0"></TD><TD BGCOLOR="#D0D0D0"></TD><TD BGCOLOR="#D0D0D0">< +/TD><TD BGCOLOR="#D0D0D0"></TD><TD BGCOLOR="#D0D0D0"></TD><TD BGCOLOR +="#D0D0D0"></TD><TD BGCOLOR="#D0D0D0"></TD></TR> <TR CLASS='node' ID='clock_x2@clock_x2' VALIGN=top><TD ALIGN=LEFT NOWRAP>&nbsp;&nbsp;&nbsp;<IMG ID="IMGclock_x2@clock_x2" SRC="/images/dvtk/bullet.gif" BORDER=0>&nbsp;<B>clock_x2</B><B></B></TD><TD><FONT COLOR="green">OK</FONT></TD><TD>-</TD><TD><FONT COLOR="green">OK</FONT></TD><TD><FONT COLOR="green">OK</FONT></TD><TD><FONT COLOR="green">OK</FONT></TD><TD><FONT COLOR="green">OK</FONT></TD><TD><FONT COLOR="green">OK</FONT></TD><TD>-</TD><TD><FONT COLOR="green">OK</FONT></TD><TD><FONT COLOR="green">OK</FONT></TD><TD><FONT COLOR="green">OK</FONT></TD><TD><FONT COLOR="green">OK</FONT></TD></TR> </TABLE> <BR><hr> <TABLE BORDER=0 CELLSPACING=0 CELLPADDING=0 WIDTH=100%> <TR> <TD WIDTH=42%> <FONT SIZE=-2>&copy; Copyright(c) 1996 - 2013 by , Team<BR> All rights reserved.</FONT></TD> <TD WIDTH=42%> <FONT SIZE=-2>Last modified Tue Apr 9 08:44:32 MEST 2013 <BR> by <A HREF="mailto:user@gmail.com">usergroup</A> </TD> <TD> <IMG BORDER=0 SRC="/images/dvtk/dvtk_dev_team.jpg" ALIGN=RIGHT HSPACE=0 VSPACE=0 ALT="[dvtk]"> </TD> </TR> </TABLE> <p /><br /> </body>

    You still need to replace the print commands with code that puts the data into your spreadsheet.

Re: How to dump a webpage content into a excel
by hdb (Prior) on Apr 09, 2013 at 11:41 UTC

    This how the spreadsheet part could look like:

    use strict; use warnings; use HTML::TableExtract; use Spreadsheet::WriteExcel; my $filename = "table.xls"; my $workbook = Spreadsheet::WriteExcel->new($filename); my $response; { local $/ = undef; $response = <DATA>; } my $te = HTML::TableExtract->new( keep_html => 0 ); $te->parse($response); my $counter; foreach my $ts ($te->tables) { my $worksheet = $workbook->add_worksheet("Table ".$counter++); my $nrow=0; foreach my $row ($ts->rows) { my $ncol=0; foreach my $col (@$row) { $worksheet->write($nrow,$ncol++,$col) if defined $col; } $nrow++; } }

      Hi
      Thanks for the solution. It is working.
      But what if I want to dump these 3 tables (in this current html content)in one worksheet instead of 3 worksheets.
      I removed the $workbook-> add_worksheet line from the loop and dumping these all in one work sheet . But the 1st and last tables are not properly shown. How can I get rid of that ?
      Also how can I show the column headers in bold ?

        Did you put the 3 tables elsewhere on the sheet? Or do you overwrite them? For bold format you need to look into the documentation of Spreadsheet::WriteExcel.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1027671]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others having an uproarious good time at the Monastery: (6)
As of 2015-07-02 00:41 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The top three priorities of my open tasks are (in descending order of likelihood to be worked on) ...









    Results (25 votes), past polls