PerlMonks  

How to grab a portion of file with regex

by romy_mathew (Beadle)
on Mar 14, 2013 at 19:02 UTC ( #1023530=perlquestion )
romy_mathew has asked for the wisdom of the Perl Monks concerning the following question:

Hi, can you help me grab a portion of a web page via regex? Below is what I am trying to do.
<div class="fk-srch-item fk-inf-scroll-item"> <div class="line fksd-bodytext "> <div class="unit left fk-sitem-image-section"> <div class="line"> <div class="unit num fksd-smalltext"> <span class="sno-div">1.</span> </div> <div class="lastUnit rposition"> <a href="/software-engineering-practitioner-s-approach-7th/p/itmczynwvwmjatgq?pid=9780071267823&query=Roger+Pressman&srno=s_1&ref=1d31579d-d54b-4c3e-8abf-100b33486a91&otracker=from-search"> <img onerror="img_onerror(this);" data-error-url="http://img1.flixcart.com/img/thumb/book.jpg" height="100" width="100" data-src="http://img7.flixcart.com/image/book/8/2/3/software-engineering-a-practioner-s-approach-100x100-imad96yhpzgnapyz.jpeg" src="http://img6.flixcart.com/www/prod/images/gray_1x1-81857055.gif" onload="lzld(this)" alt="Buy Book Software Engineering : A Practitioner's Approach 7th Edition by " title="Software Engineering : A Practitioner's Approach 7th Edition by "></img> </a> </div> </div> </div> <div class="line bmargin10"> <h2 class="fk-srch-item-title fksd-bodytext unboldtext"><a href="/software-engineering-practitioner-s-approach-7th/p/itmczynwvwmjatgq?pid=9780071267823&query=Roger+Pressman&srno=s_1&ref=1d31579d-d54b-4c3e-8abf-100b33486a91&otracker=from-search" class="fk-srch-title-text fksd-bodytext">Software Engineering : A Practitioner's Approach 7th Edition (Paperback)</a></h2> <span class='fk-item-authorinfo-text fksd-smalltext'>by <a href="/author/roger-pressman?query=Roger+Pressman&vertical=books&otracker=from-search"><b>Roger</b> <b>Pressman</b></a></span>

Below is the code I am currently using to fetch the portion.
while($page =~m/class="lastUnit.*\n.*\s+.*\n.*\n.*\n.*\n/ig) { print "sample Text\n" }
The code is meant to fetch the portion of the page from class="lastUnit" to class="line bmargin10".
Can anyone suggest a good regex for selecting the above section?
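
For reference, the newline-counting pattern above is fragile because it depends on exactly how the page is wrapped. A minimal sketch of a more tolerant regex follows, assuming the whole page is already in one string; the sample HTML and the helper name grab_portions are hypothetical and only stand in for the real page. It uses the /s modifier so that "." also matches newlines, and a non-greedy ".*?" so that each match stops at the first closing marker.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical, shortened stand-in for the fetched page.
my $page = <<'HTML';
<div class="unit num fksd-smalltext"> <span class="sno-div">1.</span> </div>
<div class="lastUnit rposition">
<a href="/some-book/p/abc"> <img alt="cover"></img> </a>
</div> </div> </div>
<div class="line bmargin10">
<h2><a href="/some-book/p/abc">Some Book</a></h2>
HTML

# Return every chunk that starts at a div whose class begins with "lastUnit"
# and ends just before the next <div class="line bmargin10">.
sub grab_portions {
    my ($html) = @_;
    my @portions;
    # /s: "." matches newlines; /i: case-insensitive; /g: find all matches.
    while ($html =~ m{(<div\s+class="lastUnit.*?)<div\s+class="line\s+bmargin10"}gis) {
        push @portions, $1;
    }
    return @portions;
}

print "Matched portion:\n$_\n" for grab_portions($page);
```

This still breaks as soon as the site reorders attributes or switches quote style, which is why the replies below recommend a real HTML parser.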

Re: How to grab a portion of file with regex
by swkronenfeld (Hermit) on Mar 14, 2013 at 19:14 UTC
    Parsing HTML with regexes is really hard. You are better off using an HTML Parser and coding from there.
Re: How to grab a portion of file with regex (don't)
by Anonymous Monk on Mar 14, 2013 at 20:02 UTC
Re: How to grab a portion of file with regex
by kielstirling (Scribe) on Mar 15, 2013 at 00:01 UTC
    Hi,

    It is generally not recommended to use regex matches to parse HTML files.

    Instead, as swkronenfeld pointed out, it's better to use the CPAN module HTML::Parser.

    Below is an example of its usage.
    #!/usr/bin/perl
    use Modern::Perl;
    use autodie;
    use HTML::Parser ();

    # Call start() for every opening tag, passing the tag name and attribute hash.
    my $p = HTML::Parser->new(
        start_h => [\&start, 'tagname, attr'],
    );

    # The HTML file to parse is given as the first command-line argument.
    open my $fh, '<', shift;
    $p->parse_file($fh);
    $fh->close;

    sub start {
        my ($tag_name, $attrs) = @_;
        return unless $tag_name eq 'div';
        say 'sample Text'
            if exists $attrs->{class}
            and $attrs->{class}
            and $attrs->{class} =~ /^lastUnit.*/;
    }
    -Kiel

      Instead as swkronenfeld pointed out its better to use the CPAN module HTML::Parser

      Not by much. HTML::Parser is very low-level; use a DOM parser that supports XPath.

        Well, instead of trolling, why not supply a working example to help?

        It's always the Anonymous Monk lacking the courage to put a name to a comment.
        And for HTML files that are 9,000 GB in size?
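
The suggestion in this thread to use a DOM parser that supports XPath could look roughly like the sketch below. The thread names no module, so the choice of the CPAN module HTML::TreeBuilder::XPath is an assumption, and the sample HTML is a hypothetical, shortened stand-in for the real page.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;   # CPAN module; an assumed choice, not named in the thread

# Hypothetical, shortened stand-in for the fetched page.
my $html = <<'HTML';
<div class="line">
  <div class="unit num fksd-smalltext"><span class="sno-div">1.</span></div>
  <div class="lastUnit rposition">
    <a href="/some-book/p/abc"><img alt="cover" src="cover.gif"></a>
  </div>
</div>
<div class="line bmargin10"><h2><a href="/some-book/p/abc">Some Book</a></h2></div>
HTML

# Build an in-memory tree of the whole document.
my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Select every div whose class attribute contains "lastUnit".
my @hrefs;
for my $div ($tree->findnodes('//div[contains(@class, "lastUnit")]')) {
    print $div->as_HTML, "\n";                 # the selected portion as markup
    my $link = $div->look_down(_tag => 'a');   # first <a> inside that div, if any
    push @hrefs, $link->attr('href') if $link;
}
print "link: $_\n" for @hrefs;

$tree->delete;   # release the tree's memory
```

A DOM parser reads the entire document into memory, so for extremely large files the event-driven HTML::Parser approach shown earlier remains the more practical option.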
