PerlMonks  

Parsing HTML files

by ajju
on Nov 17, 2010 at 22:18 UTC (node #872067)
ajju has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a set of HTML files in a folder. Using a Perl program, I want to read each HTML file in that folder. After reading a file, I want to extract the images and the text in the div tag located directly below each img tag; other div tags should be ignored. The text inside that div tag should be stored in a separate text file. I need your help, monks...

Re: Parsing HTML files
by kcott (Abbot) on Nov 17, 2010 at 22:33 UTC

    This should get you started:

    1. Use your specification to generate some pseudo-code.
    2. Use your pseudo-code to write some Perl code.

    -- Ken

      I want to know about the modules available for parsing, and the code that could satisfy my requirement.
        Any and all modules available for parsing would satisfy your requirement.
Re: Parsing HTML files
by roboticus (Canon) on Nov 17, 2010 at 22:41 UTC

    ajju:

    In addition to kcott's notes, you'll also want to read over HTML::Parser to get a head start on parsing HTML files. The File::Find module will help you dig through a set of directories (you didn't specify subdirectories, but I wasn't sure).

    ...roboticus

    Update: Your Mother suggests a couple of other modules. Since I've used HTML::Parser exactly once, you may want to consider those suggestions instead. (I'm not much on GUI and/or web stuff, so my opinions should be suitably discounted in those topics.)
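    As a rough illustration of the File::Find suggestion above, here is a minimal sketch that collects every HTML file below a folder. The sub name find_html_files and the choice of matching both .html and .htm are assumptions made for the example, not something the question specified.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Return a sorted list of every .html/.htm file below $dir,
# subdirectories included.
sub find_html_files {
    my ($dir) = @_;
    my @files;
    find(
        sub {
            # Inside the callback, $_ is the bare file name and
            # $File::Find::name is the path including $dir.
            push @files, $File::Find::name if -f $_ && /\.html?\z/i;
        },
        $dir,
    );
    return sort @files;
}

# Hypothetical usage: pass the folder on the command line,
# or fall back to the current directory.
print "$_\n" for find_html_files(@ARGV ? $ARGV[0] : '.');
```

    If the files all sit in one flat folder, a plain glob("$dir/*.html") would probably do just as well.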

Re: Parsing HTML files
by ww (Bishop) on Nov 17, 2010 at 22:42 UTC

    What you really need is a little guidance and a lot of self-help... and perhaps a bit of reading about the local norms; that is, On asking for help, How do I post a question effectively?, and I know what I mean. Why don't you?.

    As to what kind of other help you need -- well, we're at a bit of a disadvantage here, as we don't have a really clear idea of the structure of the HTML; any notion of how you obtained it (or do you need help downloading the webpage(s)?); what you have tried to accomplish your goal (or, IOW, are you bringing this plea to us because you ...

    1. ...know nothing about programming;
    2. ...don't know enough to find the errors in the code you wrote;
    3. ...have a problem comprehending some specific piece of documentation;
    4. ... or something else?

    or, come to think of it, because you mistakenly believe PM is a free code-writing service?

    And, BTW, along with sample data and your code, you might try to make your post somewhat less difficult to interpret ... and somewhat less like a badly paraphrased homework assignment. (If it is indeed "homework", disclosing that will also stand you in better stead than having us learn it later.)

      I just want to know how to accomplish this in Perl... please give me some tips to start the code.
        We just did.

        Oh! "to start the code" -- well, that's a different matter.

        We generally recommend starting with a hashbang line and a pair of strictures:

        #!/usr/bin/perl
        use strict;
        use warnings;

        (code goes here)
      I am new to Perl...
        Thank you, Ken and roboticus... your messages are very helpful to me and gave me support... I will start in that way...
Re: Parsing HTML files
by Your Mother (Canon) on Nov 17, 2010 at 23:14 UTC

    HTML::TreeBuilder or HTML::TokeParser::Simple are both going to be much friendlier than HTML::Parser which I recommend you avoid. If you get stuck come back with some sample code you're working with and someone here will certainly help work it out. You can search around in here for numerous usage examples of those packages.
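    For a feel of the token-based style, here is a small sketch with HTML::TokeParser::Simple that lists the src attribute of every img tag. The sub name image_sources and the sample markup are made up for illustration; the real files may well look different.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Return the src attribute of every <img> tag found in an HTML string.
sub image_sources {
    my ($html) = @_;
    my $parser = HTML::TokeParser::Simple->new(string => $html);
    my @src;
    while (my $token = $parser->get_token) {
        next unless $token->is_start_tag('img');
        my $src = $token->get_attr('src');
        push @src, $src if defined $src;    # skip <img> tags without a src
    }
    return @src;
}

# Hypothetical sample input, only to show the call.
my $sample = '<div><img src="a.png"><div>First</div><img src="b.png"></div>';
print "$_\n" for image_sources($sample);
```

    To read from a file instead of a string, the constructor can be given file => $path.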

Re: Parsing HTML files
by aquarium (Curate) on Nov 18, 2010 at 00:27 UTC
    As you've already gathered from the responses, it's pretty difficult to recommend anything with such a sketchy specification. It's easy enough to recommend any of the parser modules, but because there are so many of them, each suited to particular needs and situations, you're going to have to elaborate a bit further. If you think you're doing something new and creative, then don't worry, nobody here will steal your idea. Give us a context for your specification, i.e. what you are really trying to achieve, and more detail on the specs for the HTML files, how many there are, etc. An example extract of one of the HTML files would be most useful.
    Your question as originally posed sounds dodgy. Either come forth with it, or don't ask. Additionally, whilst beginners are welcome in the forum, you need to show some aptitude for learning Perl, e.g. show some attempted code or such. Even an absolute Perl (or other language) novice would be expected to have read some books/resources/perldoc. Otherwise you're just wasting our time and yours.
    the hardest line to type correctly is: stty erase ^H
Re: Parsing HTML files
by Tux (Monsignor) on Nov 18, 2010 at 07:46 UTC

    With HTML::TreeBuilder, as Your Mother already mentioned, you can do so, but please keep in mind that HTML may change. I have several monitors running that parse HTML constantly, and I have to change the code on a very regular basis because the people that generate or maintain the HTML keep changing it. So, one piece of advice: be very, very defensive in your parsing strategy and don't hardcode the sequence of events: the generator might add a div tag in between or swap the sequence of text and image.

    use HTML::TreeBuilder;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse_content ($html);
    foreach my $img ($tree->look_down (_tag => "img")) {
        my $p = $img->parent;
        $p->tag eq "div" or next; # <img> not inside a <div>
        my $txt = $p->as_text;
    }

    As you can see, this module offers you all the rope you need to hang yourself or do what you need. It also offers a nice way to generate nicely formatted HTML from parsed trees:

    print $tree->as_HTML (undef, " ", {});

    Enjoy, Have FUN! H.Merijn
      Hi Tux, I added my $html="htmlfilepath"; to your code. Running it gives the error below:

        Use of uninitialized value in subroutine entry at C:/Perl/site/lib/HTML/TreeBuilder.pm line 121.

        If you had taken the time to read the documentation, e.g. using "perldoc HTML::TreeBuilder", you would have seen, if the method name parse_content wasn't obvious enough already, that to parse a file, you should use the parse_file method.


        Enjoy, Have FUN! H.Merijn
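        To make the parse_content versus parse_file difference concrete, here is a sketch that parses a file by path and writes the text of the div that directly follows each img to a text file. It reads "directly below" as "the next sibling element", which is only one possible interpretation of the original question; the sub names and the "<file>.txt" output naming are likewise assumptions.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# For each <img>, return [src, text], where text comes from the element
# immediately following the image, provided that element is a <div>.
sub image_captions {
    my ($tree) = @_;
    my @found;
    for my $img ($tree->look_down(_tag => 'img')) {
        # right() in list context returns all following siblings. Text
        # nodes are plain strings, so keep only element objects.
        my ($next) = grep { ref $_ } $img->right;
        next unless $next && $next->tag eq 'div';
        push @found, [ $img->attr('src') // '', $next->as_text ];
    }
    return @found;
}

# parse_file takes a path (parse_content expects the HTML itself).
# Results are written to "<path>.txt", one "src<TAB>text" line per image.
sub extract_to_text {
    my ($path) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file($path) or die "Cannot parse $path: $!";
    my @found = image_captions($tree);
    $tree->delete;    # release the tree when done

    open my $out, '>', "$path.txt" or die "Cannot write $path.txt: $!";
    print {$out} "$_->[0]\t$_->[1]\n" for @found;
    close $out or die "Cannot close $path.txt: $!";
    return scalar @found;
}

# Hypothetical usage: perl extract.pl page1.html page2.html
extract_to_text($_) for @ARGV;
```

        If the div actually wraps the img, as the earlier snippet assumes, swap the sibling lookup for $img->parent.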
      Totally agree that scraping HTML is quite bad and unstable. My rough guide for scraping, from most to least desirable:
      • Don't scrape... find out whether there's a RESTful way to get the results instead, via some API or an alternate (e.g. XML format) URL.
      • If the HTML is well formed (i.e. XHTML) then it will be almost guaranteed to be well structured, with proper closing tags etc., so use one of the XML-based parsers. You can easily get to specific elements via a well-defined hierarchy once parsed.
      • This stuff gets ugly when you start slurping whole HTML into a scalar and progressively find markers where to suck bits into desired variables for inspection. Even here you can code a bit defensively by picking sane markers, e.g. "id" or "class" attributes. Never anchor to text containing inline HTML styles or other bits that are likely to change frequently, like inline JavaScript or such.
      Finally, if you end up producing HTML in the CGI, don't mix actual output with styling. Write a stylesheet instead. Producing a well-formed XHTML document in the CGI without inline styles provides a later opportunity to use the output of the CGI via another CGI or whatever. It's also much easier to change the output of the CGI via a stylesheet, rather than digging in Perl code.
      There are frameworks for doing even fancier scraping, where you end up running a browser engine server side, to pretend that your program is a browser. This is necessary when a website dynamically produces most of its output with JavaScript. And naturally, because JavaScript is browser/client-side code, you won't see the results of that unless you run it. This is pretty horrid stuff. Although you can do automatic login and traverse a website and results... it typically breaks as soon as absolutely anything changes on the website. A good/helpful website, even if dynamically fancy rendered with JavaScript, should provide a RESTful API to get data out. But some companies still insist on not being very helpful.
      the hardest line to type correctly is: stty erase ^H
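      One way to pick a "sane marker" as described above is to select elements by their class attribute rather than by their position. The sketch below does this with HTML::TreeBuilder's look_down; the class name "caption" and the sample markup are purely hypothetical.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Collect the text of every <div> whose class attribute contains the
# (hypothetical) class name "caption", wherever it sits in the tree.
sub caption_texts {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @text = map { $_->as_text }
        $tree->look_down(
            _tag  => 'div',
            class => qr/(?:^|\s)caption(?:\s|$)/,   # also matches class="caption big"
        );
    $tree->delete;    # release the tree when done
    return @text;
}

# Made-up sample input, only to show the call.
my $sample = <<'HTML';
<div class="item">
  <img src="a.png">
  <div class="caption big">Caption A</div>
</div>
<div class="note">Unrelated text</div>
HTML

print "$_\n" for caption_texts($sample);
```

      This keeps working if the generator inserts extra wrappers or reorders the image and the text, as long as the class name itself stays put.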
Re: Parsing HTML files
by chrestomanci (Priest) on Nov 18, 2010 at 09:22 UTC

    Adding to Your Mother's suggestion to use HTML::TreeBuilder:

    I have found it useful in the past to use a GUI HTML tree inspector such as Firebug, or the Inspect Element tool in Google Chrome.

    Using such a tool will quickly tell you how the element you are interested in sits within the HTML structure, and will quickly tell you about the div and other useful tags that are above it in the html tree.

    Contrary to what Tux said, I have not found that changing structure is much of a problem, because the HTML code of most big websites these days is generated out of CMS databases by computer programs, so the structure tends to be very consistent. Occasionally a site will have a major redesign, but the rest of the time the sites are very stable. I guess the situation is different if you are dealing with hand-created HTML on small websites.
