Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

Re: How to dump a webpage content into a excel

by hdb (Prior)
on Apr 09, 2013 at 11:22 UTC ( #1027706=note: print w/ replies, xml ) Need Help??


in reply to How to dump a webpage content into a excel

Using HTML::TableExtract is relatively straightforward but the content of your tables is weird. Play with the keep_html parameter in line 8, putting it to 0 or 1 whatever suits you best.

use strict; use warnings; use HTML::TableExtract; my $response; { local $/ = undef; $response = <DATA>; } my $te = HTML::TableExtract->new( keep_html => 0 ); $te->parse($response); foreach my $ts ($te->tables) { print "\nTable found at ", join(',', $ts->coords), ":\n"; foreach my $row ($ts->rows) { foreach my $col (@$row) { if( defined $col ) { print "\t$col"; } else { print "\t---"; } } print "\n"; } } __DATA__ <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-U +S"> <head> <title>Cell Collections</title> <link rel="stylesheet" typ +e="text/css" href="/css/tree.css" /> <script src="/js/tree.js" type=" +text/javascript"></script> <meta http-equiv="Content-Type" content="t +ext/html; charset=iso-8859-1" /> </head> <body bgcolor="#FFFFFF"> <im +g src="/images/dvtk/IFX_LOGO.gif" alt="[IFX]" /><br /> <TABLE><TR><TD +><img align="left" src="/images/dvtk/logo100.gif" alt="[LOGO]" /><TD><h1>Cell collection status</h1><h3>AnalogIP:Test1 command: exec</h3></TD></TR></TABLE> <br /><p /><a href="collstat.pl"><b>[Collection overview]</b></a>&nbsp +;&nbsp;<a href="collstat.pl?path=/home/micado/autan_c65fla/dss/dss.co +mmon.default/units/macro/dvtk/results">[Overview for this collection]</a>&nbsp;&nbsp;<a href="cells.pl?path=/home/micado/autan_c65fla/dss/dss.common.default/u +nits/macro/dvtk/results">[All cells of this collection]</a>&nbsp;&nbsp;<a href="report.pl?path=/home +/micado/autan_c65fla/dss/dss.common.default/units/macro/dvtk/results" +>[Custom HTML Report]</a><p /><br /> <FONT SIZE=-1><I>(Click on the hyperlinks in the table headers to see +information on methods [if available])</I></FONT><BR> <TABLE BORDER=1 +><TR BGCOLOR="#D0D0D0"><TH ALIGN=LEFT NOWRAP>Cell</TH><TH ALIGN=LEFT>cdl_prefix</TH><TH ALIGN=LEFT>drc</TH><TH ALIGN=LEFT>drc_prefix</TH><TH ALIGN=LEFT>gds_ab +stract</TH><TH ALIGN=LEFT>gds_prefix</TH><TH ALIGN=LEFT><A HREF="meth +od.pl?path=/home/micado/autan_c65fla/dss/dss.common.default/units/mac +ro/dvtk/results;task=layverDoc;command=exec">layverDoc</A></TH><TH ALIGN=LEFT><A HREF="method.pl?path=/home/micado/autan_c65fla/dss/dss.c +ommon.default/units/macro/dvtk/results;task=layverShadow;command=exec +">layverShadow</A></TH><TH ALIGN=LEFT>lvs</TH><TH ALIGN=LEFT>lvs_prefix</TH><TH ALIGN=LEFT>macroL +ib</TH><TH ALIGN=LEFT>ocapi</TH><TH ALIGN=LEFT><A HREF="method.pl?pat +h=/home/micado/autan_c65fla/dss/dss.common.default/units/macro/dvtk/r +esults;task=prefix;command=exec">prefix</A></TH></TR> <TR CLASS='node' ID='clock_x2' VALIGN=top><TD ALIGN=LEFT NOWRAP><A CLA +SS=blank HREF="" ONCLICK="toggleRow(this.parentNode.parentNode.parentNode, 'clock_x2'); + return false;">&nbsp;<IMG ID="IMGclock_x2" SRC="/images/dvtk/minus.gif" BORDER=0>&nbsp;<B>clock_x2</B></A><B></B></TD><TD BGCOLOR="#D0D0D0"></TD><TD BGCOLOR="#D0D0D0"></TD><TD BGCOLOR="#D0D0D0 +"></TD><TD BGCOLOR="#D0D0D0"></TD><TD BGCOLOR="#D0D0D0"></TD><TD BGCO +LOR="#D0D0D0"></TD><TD BGCOLOR="#D0D0D0"></TD><TD BGCOLOR="#D0D0D0">< +/TD><TD BGCOLOR="#D0D0D0"></TD><TD BGCOLOR="#D0D0D0"></TD><TD BGCOLOR +="#D0D0D0"></TD><TD BGCOLOR="#D0D0D0"></TD></TR> <TR CLASS='node' ID='clock_x2@clock_x2' VALIGN=top><TD ALIGN=LEFT NOWRAP>&nbsp;&nbsp;&nbsp;<IMG ID="IMGclock_x2@clock_x2" SRC="/images/dvtk/bullet.gif" BORDER=0>&nbsp;<B>clock_x2</B><B></B></TD><TD><FONT COLOR="green">OK</FONT></TD><TD>-</TD><TD><FONT COLOR="green">OK</FONT></TD><TD><FONT COLOR="green">OK</FONT></TD><TD><FONT COLOR="green">OK</FONT></TD><TD><FONT COLOR="green">OK</FONT></TD><TD><FONT COLOR="green">OK</FONT></TD><TD>-</TD><TD><FONT COLOR="green">OK</FONT></TD><TD><FONT COLOR="green">OK</FONT></TD><TD><FONT COLOR="green">OK</FONT></TD><TD><FONT COLOR="green">OK</FONT></TD></TR> </TABLE> <BR><hr> <TABLE BORDER=0 CELLSPACING=0 CELLPADDING=0 WIDTH=100%> <TR> <TD WIDTH=42%> <FONT SIZE=-2>&copy; Copyright(c) 1996 - 2013 by , Team<BR> All rights reserved.</FONT></TD> <TD WIDTH=42%> <FONT SIZE=-2>Last modified Tue Apr 9 08:44:32 MEST 2013 <BR> by <A HREF="mailto:user@gmail.com">usergroup</A> </TD> <TD> <IMG BORDER=0 SRC="/images/dvtk/dvtk_dev_team.jpg" ALIGN=RIGHT HSPACE=0 VSPACE=0 ALT="[dvtk]"> </TD> </TR> </TABLE> <p /><br /> </body>

You still need to replace the print commands with code that puts the data into your spreadsheet.


Comment on Re: How to dump a webpage content into a excel
Download Code

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1027706]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chilling in the Monastery: (9)
As of 2015-07-07 16:07 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The top three priorities of my open tasks are (in descending order of likelihood to be worked on) ...









    Results (91 votes), past polls