Beefy Boxes and Bandwidth Generously Provided by pair Networks
Do you know where your variables are?
 
PerlMonks  

Fetching table from website using HTML::TableExtract

by sachin raj aryan (Acolyte)
on Mar 29, 2017 at 11:40 UTC ( #1186362=perlquestion: print w/replies, xml ) Need Help??

sachin raj aryan has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks , I am looking for some help as i got stuck while scrapping website for data present in table Below is my html file data which i am fetching .I took it from developer option

<tbody><tr><th colspan="15" style="COLOR:RED;FONT-SIZE:12pt; FONT-WEIG +HT:BOLD; TEXT-ALIGN:center;">Amount </th></tr> <tr><th rowspan="2">Region</th><th colspan="2">Level 31.03.2016</th><t +h colspan="3">Sanction/Renewal<br>01.04.2016 to 28.02.2017</th><th co +lspan="2">Level 28.02.2017</th><th colspan="3">Sanction/Renewal<br>Du +ring Current Month</th><th colspan="2">Level 26.03.2017</th><th colsp +an="2">Growth<br>as on<br>26.03.2017</th></tr><tr><th>No.</th><th>Bal +ance</th><th>No.</th><th>Limit</th><th>Balance</th><th>No.</th><th>Ba +lance</th><th>No.</th><th>Limit</th><th>Balance</th><th>No.</th><th>B +alance</th><th>GDM</th><th>GUM</th></tr><tr> <td style="TEXT-ALIGN:LEFT;"><a href="AGL_CROPC0820201.HT +M">TEMPORARY-01</a></td><td>19600</td><td>288.36</td><td>14306</td><t +d>272.25</td><td>194.22</td><td>19246</td><td>284.53</td><td>989</td> +<td>19.02</td><td>12.94</td><td>19450</td><td>290.33</td><td>5.80</td +><td>1.97</td></tr><tr> <td style="TEXT-ALIGN:LEFT;"><a href="AGL_CROPC0820202.HT +M">TEMPORARY-02</a></td><td>17417</td><td>167.40</td><td>9466</td><td +>123.61</td><td>99.40</td><td>16717</td><td>167.24</td><td>823</td><t +d>11.71</td><td>9.11</td><td>16721</td><td>169.51</td><td>2.27</td><t +d>2.11</td></tr><tr> <td style="TEXT-ALIGN:LEFT;"><a href="AGL_CROPC0820203.HT +M">TEMPORARY-03</a></td><td>13545</td><td>180.62</td><td>8395</td><td +>144.63</td><td>110.32</td><td>12675</td><td>179.13</td><td>333</td>< +td>7.38</td><td>5.38</td><td>12630</td><td>180.13</td><td>1.00</td><t +d>-0.49</td></tr><tr> <td style="TEXT-ALIGN:LEFT;"><a href="AGL_CROPC0820204.HT +M">TEMPORARY-04</a></td><td>21826</td><td>225.82</td><td>10249</td><t +d>133.52</td><td>113.51</td><td>21558</td><td>230.69</td><td>624</td> +<td>10.07</td><td>7.84</td><td>21524</td><td>233.99</td><td>3.30</td> +<td>8.17</td></tr><tr> <td style="TEXT-ALIGN:LEFT;"><a href="AGL_CROPC0820205.HT +M">TEMPORARY-05</a></td><td>41299</td><td>736.24</td><td>34023</td><t +d>732.70</td><td>601.55</td><td>40822</td><td>732.78</td><td>3177</td +><td>76.32</td><td>60.46</td><td>40794</td><td>736.45</td><td>3.67</t +d><td>0.21</td></tr><tr style="BACKGROUND-COLOR:YELLOW;FONT-WEIGHT:BO +LD;"> <td style="TEXT-ALIGN:LEFT;">TEMPORARY-TOTAL</td><td>1136 +87</td><td>1598.44</td><td>76439</td><td>1406.71</td><td>1119.00</td> +<td>111018</td><td>1594.37</td><td>5946</td><td>124.50</td><td>95.73< +/td><td>111119</td><td>1610.41</td><td>16.04</td><td>11.97</td></tr>< +/tbody>

#!usr/bin/perl ####extracting table having table id #### use Modern::Perl; use WWW::Mechanize; use HTML::TableExtract; open(my $OUT, '>>', 'papa') or die "Could not open file $!"; my $mech = WWW::Mechanize->new(); $mech->get('http://xxx.com/tempo/TYPE_CAT/AGL_CROPC0820200.HTM'); my $html_string = $mech->content(); my $te = HTML::TableExtract->new();####extracting all table #### $te->parse($html_string); foreach my $ts ( $te->tables ) { print "Table (", join( ',', $ts->coords ), "):\n"; foreach my $row ( $ts->rows ) { $OUT-> print( join( ',', @$row ), "\n"); } }

below is my output from abv fetched data in which top 9 lines are not properly formatted

Amount,,,,,,,,,,,,,, Region,Level 31.03.2016,,Sanction/Renewal 01.04.2016 to 28.02.2017,,,Level 28.02.2017,,Sanction/Renewal During Current Month ,,,Level 26.03.2017,,Growth as on 26.03.2017, ,No.,Balance,No.,Limit,Balance,No.,Balance,No.,Limit,Balance,No.,Balan +ce,GDM,GUM TEMPORARY-01,19600,288.36,14306,272.25,194.22,19246,284.53,989,19.02,1 +2.94,19450,290.33,5.80,1.97 TEMPORARY-02,17417,167.40,9466,123.61,99.40,16717,167.24,823,11.71,9.1 +1,16721,169.51,2.27,2.11 TEMPORARY-03,13545,180.62,8395,144.63,110.32,12675,179.13,333,7.38,5.3 +8,12630,180.13,1.00,-0.49 TEMPORARY-04,21826,225.82,10249,133.52,113.51,21558,230.69,624,10.07,7 +.84,21524,233.99,3.30,8.17 TEMPORARY-05,41299,736.24,34023,732.70,601.55,40822,732.78,3177,76.32, +60.46,40794,736.45,3.67,0.21 TEMPORARY-TOTAL,113687,1598.44,76439,1406.71,1119.00,111018,1594.37,59 +46,124.50,95.73,111119,1610.41,16.04,11.97
SO i dont know how to extract data with row span and column span.

Replies are listed 'Best First'.
Re: Fetching table from website using HTML::TableExtract
by Corion (Patriarch) on Mar 29, 2017 at 11:42 UTC

    I don't think HTML::TableExtract can do that for you.

    Why not fix the data in Perl once you've extracted it in table form?

      An alternative to the cell() method in HTML::TableExtract::Table is the space() method. It is largely similar to cell(), except when given coordinates of a cell that was covered due to rowspan or colspan effects, it will return the contents of the cell that was covering that space rather than undef. So if, for example, cell (0,0) had a rowspan of 2 and colspan of 2, cell(1,1) would return undef and space(1,1) would return the same content as cell(0,0) or space(0,0).

        Can you please help me with some example ......i am first time using table extract ..

      actually not sure how to fix the data as data i am getting from webpage directly .

        Corion doesn't mean you fix the data in the webpage directly, but after you have extracted it with HTML::TableExtract which is properly inserting empty fields for data cells skipped due to colspan or rowspan. But you have multiline fields:

        The second row of your $ts->rows is

        Region,Level 31.03.2016,,Sanction/Renewal 01.04.2016 to 28.02.2017,,,Level 28.02.2017,,Sanction/Renewal During Current Month ,,,Level 26.03.2017,,Growth as on 26.03.2017,

        After Level 31.02.2016 there's an empty field because of colspan="2". The next field is

        Sanction/Renewal During Current Month

        so all you have to do is removing trailing whitespace/newlines from each field:

        foreach my $ts ( $te->tables ) { print "Table (", join( ',', $ts->coords ), "):\n"; foreach my $row ( $ts->rows ) { s/[\s\n]+\z/ for @$row; # <--- here # s/\n/ /gs for @$row; # uncomment if you want to convert # multiline fields into single line $OUT-> print( join( ',', @$row ), "\n"); } }
        perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
Re: Fetching table from website using HTML::TableExtract
by AppleFritter (Vicar) on Mar 29, 2017 at 14:07 UTC

    below is my output from abv fetched data in which top 9 lines are not properly formatted

    Could you expound on this? What do you mean by "not properly formatted" -- what format would you expect? I.e., given e.g. the table above, what would you want for the output to be?

      hi APple

      I want my these line

      Amount,,,,,,,,,,,,,, Region,Level 31.03.2016,,Sanction/Renewal 01.04.2016 to 28.02.2017,,,Level 28.02.2017,,Sanction/Renewal During Current Month ,,,Level 26.03.2017,,Growth as on 26.03.2017, ,No.,Balance,No.,Limit,Balance,No.,Balance,No.,Limit,Balance,No.,Balan +ce,GDM,GUM

      in below format

      line 1 :Amount line 2: Region,Level 31.03.2016,Sanction/Renewal 01.04.2016 to 28.02.2 +017,Level 28.02.2017,Sanction/Renewal During Current Month,Level 26.0 +3.2017,Growth as on 26.03.2017 line 3:No.,Balance,No.,Limit,Balance,No.,Balance,No.,Limit,Balance,No. +,Balance,GDM,GUM

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1186362]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others perusing the Monastery: (2)
As of 2022-01-17 02:24 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    In 2022, my preferred method to securely store passwords is:












    Results (50 votes). Check out past polls.

    Notices?