Particular HTML contents to CSV or DB

by nicpon (Initiate)
on Aug 25, 2005 at 14:49 UTC ( [id://486574]=perlquestion )

nicpon has asked for the wisdom of the Perl Monks concerning the following question:

I have about 3000 html files on my drive; each of them has an item name, description, size, price etc. I want to move it all into a database. How should I create a script that would do the following, or maybe you guys have seen a script like this:

  1. take one html file at a time and scan through it
  2. pick the item name, description
  3. drop the html tags (this can be done at any time, I guess)
  4. insert the picked values into a new csv file

Since I'm new to perl, I'm still struggling with regular expressions and I think they are crucial for this little project. So the big question is: how do I pick the lines I want from these html files? I know there is some pattern, i.e. the description always starts with "Description:"

janitored by ybiC: Retitle from "Looking for script help" to help site search results, also minor format and layout tweaks for legibility.

Replies are listed 'Best First'.
Re: Particular HTML contents to CSV or DB
by pboin (Deacon) on Aug 25, 2005 at 15:04 UTC

    Well, you need to start out by breaking your job down into smaller and smaller parts. Once you get them down small enough, each one will be a simple little problem to solve. Off the bat, you might want to start with a list like this:

    1. Connect to the DB
    2. Find each of the 3000 files.
        For each file:
      1. Open the file
      2. Parse the file
      3. Insert a record
      4. Update some counter/stats
      5. Close the file
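The steps above can be sketched roughly as follows. This is only a skeleton under stated assumptions: `parse_page` and `process_files` are hypothetical names, the only field extracted is the text after the "Description:" label mentioned in the question, and the database insert is left as a comment because the schema is unknown.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical parser stub: returns a hash ref of the fields found in a page.
# Only the "Description:" label is taken from the question; the assumption
# that the value follows the label (possibly after some tags) is a guess.
sub parse_page {
    my ($html) = @_;
    my %item;
    # Skip any tags directly after the label, then capture text up to the next tag.
    ($item{description}) = $html =~ /Description:\s*(?:<[^>]+>\s*)*([^<]+)/;
    $item{description} =~ s/\s+$// if defined $item{description};
    return \%item;
}

# Open, parse and close each file in turn; return the parsed records.
sub process_files {
    my (@files) = @_;
    my @records;
    for my $file (@files) {
        my $fh;
        unless (open $fh, '<', $file) {
            warn "Can't open $file: $!";
            next;
        }
        my $html = do { local $/; <$fh> };   # read the whole file at once
        close $fh;
        push @records, parse_page($html);    # a DB insert would go here
    }
    return @records;
}

my @records = process_files(@ARGV);
printf "Parsed %d file(s)\n", scalar @records;
```

Run it as `perl script.pl *.html`; each step can then be replaced with something more specific once the real page layout is known.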

    It seems like your question is really 'How do I program' more than anything. And, that's such a big topic, that in a way it's harder to answer than 'I asked FunctionX for a result and I got Y instead of Z' type question.

    So, just take a deep breath, break it down into teeny-tiny parts, and post specific questions.

    Good luck.

Re: Particular HTML contents to CSV or DB
by pbeckingham (Parson) on Aug 25, 2005 at 15:07 UTC

    You probably need a combination of HTML parsing, and then regular expressions to isolate the data you need to put in the database. You need to look into HTML parsing, DBI and regular expressions.

    Are all those files regular? What I mean is are they all highly structured, perhaps because they were generated by a program? Could you post a sample?

    pbeckingham - typist, perishable vertebrate.
Re: Particular HTML contents to CSV or DB
by sk (Curate) on Aug 25, 2005 at 15:09 UTC
    Depends on how complicated your HTML file is.

    Check out HTML::TokeParser for parsing HTML file in a nice way.

    However if your HTML file is as simple as <HTML><BODY><Tag>Stuff1=val1,Stuff2=Val2</tag></BODY> </HTML> then I would just do a simple regex and we can help you if we see the HTML page.
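For a file as simple as the one-line example above, a regex approach might look like this. It is a sketch only: `extract_pairs` is a hypothetical helper, and it assumes a single known tag wrapping comma-separated `key=value` pairs, which is unlikely to hold for real product pages.

```perl
use strict;
use warnings;

# Pull "key=value" pairs out of the first occurrence of a known tag.
# Tag matching is case-insensitive, so <Tag> ... </tag> is accepted.
sub extract_pairs {
    my ($html, $tag) = @_;
    my ($body) = $html =~ m{<\Q$tag\E[^>]*>(.*?)</\Q$tag\E>}is
        or return;                      # no such tag: return an empty list
    my %pairs;
    for my $chunk (split /,/, $body) {
        my ($key, $value) = split /=/, $chunk, 2;
        $pairs{$key} = $value if defined $value;
    }
    return %pairs;
}

my %pairs = extract_pairs(
    '<HTML><BODY><Tag>Stuff1=val1,Stuff2=Val2</tag></BODY> </HTML>', 'Tag');
print "$_ => $pairs{$_}\n" for sort keys %pairs;
```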

    Regarding reading 3000 files, that is very simple:

    my @filehandles;
    foreach my $file (@ARGV) {
        # open each file named on the command line and keep its handle
        open my $fh, "<", $file or die "Can't open $file ($!)";
        push @filehandles, $fh;
    }

    Once you have your handles you can loop through them. Actually, you can do this whole thing in a while loop instead of storing the handles, something like  while ($i++ < 3000) { open (blah,blah); do_stuff with blah; }

    There is also glob if your filenames are not counter based (i.e. not file1, file2, etc.).
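A minimal sketch of the glob approach, which also gives a list of all the files in one folder. `html_files_in` is a hypothetical helper, and the `.html` extension is an assumption about how the files are named. Note that glob splits its pattern on whitespace, so a directory name containing spaces would need different handling (opendir/readdir, for instance).

```perl
use strict;
use warnings;

# Return the .html files directly inside $dir, sorted by name.
sub html_files_in {
    my ($dir) = @_;
    my @files = glob("$dir/*.html");
    @files = sort @files;
    return @files;
}

my $dir = @ARGV ? $ARGV[0] : '.';
print "$_\n" for html_files_in($dir);
```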

    For the DB part, you can use DBI. Once you have the values you can insert them into a table easily. You can also create a CSV file and then just load it into the DB (most DBs support that).
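If you go the CSV route, the main thing to get right is quoting, since descriptions may contain commas or quotes. Here is a small sketch using only core Perl; `csv_line` is a hypothetical helper and the sample values are made up. A CPAN module such as Text::CSV would be the more thorough choice.

```perl
use strict;
use warnings;

# Build one CSV line: fields containing a quote, comma or newline are
# wrapped in double quotes, with embedded quotes doubled.
sub csv_line {
    my @fields = @_;
    for (@fields) {
        $_ = '' unless defined $_;
        if (/[",\r\n]/) {
            s/"/""/g;
            $_ = qq{"$_"};
        }
    }
    return join(',', @fields) . "\n";
}

# Made-up sample record: name, description, price.
print csv_line('Widget', 'Say "hi", ok', 9.99);
```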



Re: Particular HTML contents to CSV or DB
by nicpon (Initiate) on Aug 25, 2005 at 15:44 UTC
    All the html files were generated in php. Here is the link to give an idea of what the html file looks like. What I would need from that is the bold name and then the rest of the info from that product (all the fields under the product name). So, I could first take the name and the rest of the info and insert it into a new file with comma-separated values (I can use csv since then I can easily import it into the database, and this way I don't have to worry about the connection from the script) and then strip all the html. Or the other way would be to first take just that part of each html file and insert it into a new file, since the product info always starts with and ends with . My other question is: how do I get a list of all the files in the folder?

      As the pages were generated programmatically, I think you would be more successful scraping the relevant data out of the surrounding markup. This avoids traversing the DOM or mucking around with regular expressions. See


      time was, I could move my arms like a bird and...
Re: Particular HTML contents to CSV or DB
by jZed (Prior) on Aug 25, 2005 at 15:47 UTC
    The answer to your question depends on how the information is stored in the HTML file. If, for example, you have the data in HTML tables, DBD::AnyData can read the HTML tables as if they were database tables and write the data out as CSV. It would take about four lines of code total to go from the HTML tables to CSV tables. If your data is not in HTML tables, what format is it in? Is each item on a separate line of an HTML list, or always preceded by some word, etc.? You can use general tools like HTML::Parser, but there are also more specific tools for more specific structures, and we have no way of knowing which of those to recommend without more details of your context.
Re: Particular HTML contents to CSV or DB
by nicpon (Initiate) on Aug 25, 2005 at 16:07 UTC
    The files look exactly like the link I had in my previous post. The data is stored in every single file. It's not dynamic, if that's the question. The data is in tables, but these tables have no name or anything. jZed, you can view the context here
Re: Particular HTML contents to CSV or DB
by tphyahoo (Vicar) on Aug 25, 2005 at 16:22 UTC
    I have had good results parsing html with HTML::TreeBuilder and HTML::Element, fetching the elements I want with the look_down() function. Here's some code that might help you get started with the file fetching using File::Find
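A minimal sketch of file fetching with File::Find, not necessarily the code the poster had in mind. `find_html_files` is a hypothetical helper, and matching on `.htm`/`.html` is an assumption about the file names. Unlike glob, this also descends into subdirectories.

```perl
use strict;
use warnings;
use File::Find;

# Walk $dir recursively and collect the full path of every .htm/.html file.
sub find_html_files {
    my ($dir) = @_;
    my @found;
    find(sub {
        return unless -f $_;              # plain files only
        return unless /\.html?\z/i;       # keep .htm and .html
        push @found, $File::Find::name;   # path including the directory
    }, $dir);
    @found = sort @found;
    return @found;
}

my $dir = @ARGV ? $ARGV[0] : '.';
print "$_\n" for find_html_files($dir);
```

Each path returned here can then be handed to whatever parser you settle on, one file at a time.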
Re: Particular HTML contents to CSV or DB
by nicpon (Initiate) on Aug 25, 2005 at 18:19 UTC
    I now have all the files that need to be parsed in an array, and my code loops through that array and opens one file at a time. How should I now parse each file so that it extracts the lines I need from a single html file?

Node Type: perlquestion [id://486574]
Approved by holli