Hey monks, I'd like to fill a database with values that I grab from 50,000 HTML documents. There is no API available, and I can't decide which method to use to parse the HTML structure.

Right now I have all the files saved locally (later on, fetching them directly over the web would be great), and they look like this:

... (the usual html, head, body tags, a table, some text)

<table width=75%><tr><td width=50%><table width=95%><tr><td width=45% valign=top>
<table width=100% cellspacing=0 cellpadding=0><tr bgcolor=#DFDFDF><td colspan=2 height=30><font size=4><center>tool1_name</center></font></td></tr>
<tr bgcolor=#999999><td width=70%>
<b>heading_1</b>
</td><td width=30%></td></tr>
<tr bgcolor=#DFDFDF><td><font size=1>drill diameter:</font></td> <td><font size=1>936</font></td></tr>

<tr bgcolor=#CCCCCC><td><font size=1>drill depth:</font></td> <td><font size=1>20</font></td></tr>
<tr bgcolor=#DFDFDF><td><font size=1>drill speed:</font></td> <td><font size=1>4</font></td></tr>
<tr bgcolor=#CCCCCC><td><font size=1>drill material:</font></td> <td><font size=1>506</font></td></tr>
<tr bgcolor=#DFDFDF><td><font size=1>height:</font></td> <td><font size=1>502</font></td></tr>
<tr bgcolor=#CCCCCC><td><font size=1>width:</font></td> <td><font size=1>6</font></td></tr>

<tr bgcolor=#DFDFDF><td><font size=1>angle:</font></td> <td><font size=1>2.76</font></td></tr>
<tr bgcolor=#CCCCCC><td><font size=1>cooling liquid:</font></td> <td><font size=1>14</font></td></tr>
<tr bgcolor=#DFDFDF><td><font size=1>manufactured in:</font></td> <td><font size=1>27</font></td></tr>
<tr bgcolor=#CCCCCC><td><font size=1>lane code:</font></td> <td><font size=1>76</font></td></tr>

<tr bgcolor=#DFDFDF><td><font size=1>quality test 1:</font></td> <td><font size=1>581 (11.4%)</font></td></tr>
<tr bgcolor=#CCCCCC><td><font size=1>quality procedure:</font></td> <td><font size=1>19,021</font></td></tr>
<tr bgcolor=#DFDFDF><td><font size=1>quality test 2:</font></td> <td><font size=1>843 (90.1%)</font></td></tr>
<tr bgcolor=#CCCCCC><td><font size=1>package worth:</font></td> <td><font size=1>$257,524</font></td></tr>
<tr bgcolor=#DFDFDF><td><font size=1>single unit worth:</font></td> <td><font size=1>$90,945</font></td></tr>

<tr bgcolor=#CCCCCC><td><font size=1>colour:</font></td> <td><font size=1>48</font></td></tr>
<tr bgcolor=#DFDFDF><td><font size=1>coating:</font></td> <td><font size=1>2,602</font></td></tr>
</table><br>
<table width=100% cellspacing=0 cellpadding=0><tr bgcolor=#999999><td width=70%>
<b>sells</b>
</td><td width=30%></td></tr>
<tr bgcolor=#DFDFDF><td><font size=1>sold this month:</font></td> <td><font size=1>118</font></td></tr>

<tr bgcolor=#CCCCCC><td><font size=1>sold in plant A:</font></td>
(...)

There are about 110 unique values in 12 tables that I have to grab. Each page always contains two sets of these values: first the values (110 values in 12 tables) of a reference drill, then the values I'm actually interested in.

So how do I parse these files and read all these values (stripped of dollar signs, commas, and percentages) as quickly as possible?
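
For the stripping part, something like the following is roughly what I have in mind. It is only a sketch: the helper name clean_value is made up, and it assumes every value looks like one of the formats in the sample above (a plain number, a number with commas, a dollar amount, or a number followed by a percentage in parentheses). I'm not sure whether the percentage should be kept or dropped, so the sketch returns it separately.

```perl
use strict;
use warnings;

# Hypothetical helper (the name clean_value is my own invention).
# Returns (number, percent) for a raw cell value, or an empty list
# if the value does not look like any format seen in the sample.
sub clean_value {
    my ($raw) = @_;
    my ($number, $percent) =
        $raw =~ /^\s*\$?([\d,]*\.?\d+)\s*(?:\(([\d.]+)%\))?\s*$/
        or return;
    $number =~ tr/,//d;            # remove thousands separators
    return ($number, $percent);    # $percent is undef when absent
}

for my $raw ('936', '2.76', '19,021', '$257,524', '581 (11.4%)') {
    my ($number, $percent) = clean_value($raw);
    printf "%-12s => %s%s\n", $raw, $number,
        defined $percent ? " (percent: $percent)" : '';
}
```

Returning an empty list for anything unexpected would at least let me log odd values instead of silently storing garbage in the database.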

I guess I'd use File::Slurp to read each file into a scalar and then HTML::TableExtract (but how do I get the second occurrence?). Or should I use a regex (again, how do I get the second occurrence?) or a template (how?)?

I'd be very grateful for your ideas, and I would really appreciate code snippets, as I am really new to Perl (I'm replacing a bash script (yep) right now...).
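
To make the regex idea concrete, here is a minimal sketch of what I mean by "second occurrence". It assumes that every label appears exactly once per set (so the second time a label shows up, it belongs to the drill I care about) and that all rows follow the td/font pattern from the sample. The $html string stands in for one slurped file, and the values in its second pair of rows (35 and $1,200) are invented for illustration.

```perl
use strict;
use warnings;

# Stand-in for one slurped file: reference drill rows first, then the
# rows of the drill of interest (those two values are made up).
my $html = <<'HTML';
<tr bgcolor=#DFDFDF><td><font size=1>drill depth:</font></td> <td><font size=1>20</font></td></tr>
<tr bgcolor=#CCCCCC><td><font size=1>package worth:</font></td> <td><font size=1>$257,524</font></td></tr>
<tr bgcolor=#DFDFDF><td><font size=1>drill depth:</font></td> <td><font size=1>35</font></td></tr>
<tr bgcolor=#CCCCCC><td><font size=1>package worth:</font></td> <td><font size=1>$1,200</font></td></tr>
HTML

my (%seen, %value);
# Walk over every label/value row in document order.
while ($html =~ m{<td><font size=1>([^<]+):</font></td>\s*<td><font size=1>([^<]*)</font></td>}g) {
    my ($label, $raw) = ($1, $2);
    next unless ++$seen{$label} == 2;   # keep only the second occurrence
    $value{$label} = $raw;
}

print "$_ = $value{$_}\n" for sort keys %value;
```

If the same label can turn up in more than one of the 12 tables within a set, counting occurrences per label like this would break, and I'd need to track which table I am in as well.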

Thanks!

In reply to how to quickly parse 50000 html documents? by brengo

	PerlMonks