PerlMonks  

extract string between 2 elements

by satishchandra (Initiate)
on Feb 17, 2011 at 22:59 UTC

satishchandra has asked for the wisdom of the Perl Monks concerning the following question:

Hi, this is my program. I am trying to extract the string between the elements.
if(open(MYFILE,"htmlfile.txt")) {
    $line = <MYFILE>;
    @array = ($line =~ m/<.[^>]*>/g);
    print "ARRAY iS @array\n";
    for($i=0;$i<@array;$i++){
        print "The Element $i is $array[$i]\n";
    }
    for($j=0;$j<@array;$j++){
        for($k=$j+1;$k<@array;$k++){
            if($array[$j] eq $array[$k]){
                print " substring($array[$j],$array[$k])\n";
            }
        }
    }
}

Replies are listed 'Best First'.
Re: extract string between 2 elements
by ww (Archbishop) on Feb 18, 2011 at 03:10 UTC
    NetWallah nailed it (if, indeed, we understand what you're trying to accomplish), because parsing HTML with homebrew regexen is simply too easy to screw up.

    IOW, use an appropriate module, which having stood the test of at least some time (and the terrors of CPAN's testing process) is more apt to be reliable than the one-off the newbie invents.

    However, maybe you really meant something like this?

    #!/usr/bin/perl
    use strict;
    use warnings;
    # 888817

    my @array;
    my $file = "888817.txt";
    open FH, '<', $file or die "Can't open $file: $!";
    # while ( $file ) {
    my @line = <FH>;
    for my $line (@line) {
        if ( $line =~ /^\n/ ) {
            next;
        } else {
            (my $found) = $line =~ m/<.[^>]*>/g;
            print "\$found: $found \n";
            push @array, $found;
        }
    }
    for ( my $i=0; $i<@array; $i++ ) {
        print "The Element $i is $array[$i]\n";
    }
    for ( my $j=0; $j<@array; $j++ ) {
        for ( my $k=$j+1; $k<@array; $k++ ) {
            if ( $array[$j] eq $array[$k] ) {
                print "substring($array[$j],$array[$k])\n";
            }
        }
    }

    Where the data looked like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <meta name="DESCRIPTION" content="Abcdef Hose Co. #1 -- protecting the Abcdef, New York" />
    <link type="text/css" rel="stylesheet" href="NHC1.css" />
    <link rel="shortcut icon" href="http://Abcdef.org/favicon.ico" />
    <title>(ww)Abcdef Hose Company #2 - Home</title>
    </head>
    <body>
    <div id="title">
    <span style="color: #cc0000; background-color: black;">Address: </span> 26 New Avenue, Abcdef, NY &nbsp; <span style="color: #cc0000; background-color: black;">phone: </span>nnn.nnn.nnnn</div>
    <address>
    Abcdef Hose Co. #1<br />
    </address>
    <br />
    <p style="color: black; background-color: transparent; line-height: 99%;">Do you have something that you would like to see on the website? If so, let us know (use the email link below) and we will try to incorporate it.</p>
    <p><a href="mailto:ww@Abcdef.org"><img src="gfx/box.gif" alt="contact webmaster" width="43" height="55" />Email webmaster</a></p>
    </div> <!-- end div left sidebar (lsb) -->
    <div id="main">
    <div id="main_header" style="width:100%;">
    <h1 style="text-align: center;">Abcdef Hose Company #1</h1>
    <img src="gfx/hoseco2.jpg" alt="Shoulder Patch: Abcdef Hose Company #2" width="235" height="255" hspace="250" />
    </div> <!-- end main_header -->
    <div id="news">
    <div style="text-align:left;" class="style1"><strong>Latest Hot Stuff...</strong></div>
    <p>Special Drill, 10am, Tuesday, 14 May: MA companies at Hoovertown Mall</p>
    </div> <!-- end div news -->
    </div> <!-- end div main -->
    </body>
    </html>

    producing this output:

    $found: <?xml version="1.0" encoding="UTF-8"?>
    $found: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    $found: <html xmlns="http://www.w3.org/1999/xhtml">
    $found: <head>
    $found: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    $found: <meta name="DESCRIPTION" content="Abcdef Hose Co. #1 -- protecting the Abcdef, New York" />
    $found: <link type="text/css" rel="stylesheet" href="NHC1.css" />
    $found: <link rel="shortcut icon" href="http://Abcdef.org/favicon.ico" />
    $found: <title>
    $found: </head>
    $found: <body>
    $found: <div id="title">
    $found: <span style="color: #cc0000; background-color: black;">
    $found: <address>
    $found: <br />
    $found: </address>
    $found: <br />
    $found: <p style="color: black; background-color: transparent; line-height: 99%;">
    $found: <p>
    $found: </div>
    $found: <div id="main">
    $found: <div id="main_header" style="width:100%;">
    $found: <h1 style="text-align: center;">
    $found: <img src="gfx/hoseco2.jpg" alt="Shoulder Patch: Abcdef Hose Company #2" width="235" height="255" hspace="250" />
    $found: </div>
    $found: <div id="news">
    $found: <div style="text-align:left;" class="style1">
    $found: <p>
    $found: </div>
    $found: </div>
    $found: </body>
    $found: </html>
    The Element 0 is <?xml version="1.0" encoding="UTF-8"?>
    The Element 1 is <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    The Element 2 is <html xmlns="http://www.w3.org/1999/xhtml">
    The Element 3 is <head>
    The Element 4 is <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    The Element 5 is <meta name="DESCRIPTION" content="Abcdef Hose Co. #1 -- protecting the Abcdef, New York" />
    The Element 6 is <link type="text/css" rel="stylesheet" href="NHC1.css" />
    The Element 7 is <link rel="shortcut icon" href="http://Abcdef.org/favicon.ico" />
    The Element 8 is <title>
    The Element 9 is </head>
    The Element 10 is <body>
    The Element 11 is <div id="title">
    The Element 12 is <span style="color: #cc0000; background-color: black;">
    The Element 13 is <address>
    The Element 14 is <br />
    The Element 15 is </address>
    The Element 16 is <br />
    The Element 17 is <p style="color: black; background-color: transparent; line-height: 99%;">
    The Element 18 is <p>
    The Element 19 is </div>
    The Element 20 is <div id="main">
    The Element 21 is <div id="main_header" style="width:100%;">
    The Element 22 is <h1 style="text-align: center;">
    The Element 23 is <img src="gfx/hoseco2.jpg" alt="Shoulder Patch: Abcdef Hose Company #2" width="235" height="255" hspace="250" />
    The Element 24 is </div>
    The Element 25 is <div id="news">
    The Element 26 is <div style="text-align:left;" class="style1">
    The Element 27 is <p>
    The Element 28 is </div>
    The Element 29 is </div>
    The Element 30 is </body>
    The Element 31 is </html>
    substring(<br />,<br />)
    substring(<p>,<p>)
    substring(</div>,</div>)
    substring(</div>,</div>)
    substring(</div>,</div>)
    substring(</div>,</div>)
    substring(</div>,</div>)
    substring(</div>,</div>)

    ....(in which, I still see no rhyme nor reason, but as they say: diff'rent strokes for diff'rent folks).
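For comparison with the module-based approach recommended at the top of this reply, here is a minimal sketch that pulls the text between an opening and a closing tag with a parser instead of a regex. It assumes the HTML::TokeParser module (from the HTML-Parser distribution on CPAN, not core Perl) is installed; the input string is a hypothetical snippet modeled on the sample data, and the variable names are illustrative only.

```perl
use strict;
use warnings;
use HTML::TokeParser;   # assumption: HTML-Parser distribution is installed from CPAN

# Hypothetical input modeled on the sample data, held in a string.
my $html = '<html><head><title>(ww)Abcdef Hose Company #2 - Home</title></head></html>';

# Passing a reference to a scalar tells the parser to read from the string.
my $parser = HTML::TokeParser->new( \$html )
    or die "Can't create parser";

my $title;
# get_tag skips ahead to the next <title> start tag, if there is one.
if ( $parser->get_tag('title') ) {
    # get_trimmed_text collects the text up to the named end tag.
    $title = $parser->get_trimmed_text('/title');
    print "title text: $title\n";
}
```

The same two calls should work for other tag names; whether this fits the original poster's goal depends on data that was never shown in the thread.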

Re: extract string between 2 elements
by NetWallah (Canon) on Feb 18, 2011 at 02:04 UTC
    If you are looking to capture HTML tags, this regex may help you get started:
    ~m/<(.[^>]*)>/g
    However, this is easily fooled by text disguised as tags. Use a real HTML/XML parser for production work.
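To illustrate both points, here is a small sketch of the capturing form of that regex in list context, run on a line that also contains plain-text angle brackets. The sample line is hypothetical and not taken from the original poster's file.

```perl
use strict;
use warnings;

# Hypothetical sample line: real tags plus a plain-text "a < b and c > d".
my $line = '<p class="note">if a < b and c > d then <b>stop</b></p>';

# In list context with /g, the match returns every captured group,
# i.e. whatever sits between each '<' and the next '>'.
my @tags = $line =~ m/<(.[^>]*)>/g;

# Print each capture in brackets so leading and trailing spaces are visible.
print "[$_]\n" for @tags;
```

The second capture is ` b and c `, which is ordinary text rather than a tag. That is the kind of false positive a real parser avoids.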

         Syntactic sugar causes cancer of the semicolon.        --Alan Perlis

Re: extract string between 2 elements
by jwkrahn (Abbot) on Feb 17, 2011 at 23:17 UTC

    Your problem is probably due to the fact that you are only reading the first line from the file and none of the others.
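The difference can be sketched as follows: reading from a filehandle in scalar context returns a single line, while a `while` loop keeps reading until end of file. The input here is a hypothetical three-line string opened as an in-memory file so the sketch is self-contained; with a real file, the path would replace `\$text`.

```perl
use strict;
use warnings;

# Hypothetical three-line input, held in memory instead of htmlfile.txt.
my $text = "<html>\n<head>\n<title>Demo</title>\n";

open my $fh, '<', \$text or die "Can't open in-memory file: $!";

# Scalar context: reads exactly one line, as the original program does.
my $first = <$fh>;
chomp $first;

# A while loop reads the remaining lines one at a time.
my @rest;
while ( my $line = <$fh> ) {
    chomp $line;
    push @rest, $line;
}
close $fh;

print "first line: $first\n";
print "remaining lines: ", scalar(@rest), "\n";
```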

      Hey, yes, I am reading only one line. I just want to try with one line, as the same logic applies for n lines. I am trying to extract a string between two tags. Is my program correct?
        Nope, doesn't look like it is.
        # Where's "use strict", "use warnings"?
        # Where's your error checking for open?
        if (open(MYFILE,"htmlfile.txt")) {
            $line = <MYFILE>;
            # Don't think this regex is going to do what you want. Should probably
            # focus on this first and see if @array contains what you expect.
            # We might be able to suggest a better regex if we actually saw the real data.
            @array = ($line =~ m/<.[^>]*>/g);
            print "ARRAY iS @array\n";
            for ($i=0;$i<@array;$i++){
                print "The Element $i is $array[$i]\n";
            }
            for ($j=0; $j<@array; $j++){
                for ($k=$j+1; $k<@array; $k++){
                    # What's this you're trying to do here?
                    if ($array[$j] eq $array[$k]){
                        # Another hrm...
                        print " substring($array[$j],$array[$k])\n";
                    }
                }
            }
        }
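If the goal really is the text between an opening tag and its matching closing tag on a single line, one possible sketch uses a backreference so the closing tag must carry the same name as the opening one. This is an assumption about what the original poster wanted, the input line is hypothetical, and the pattern breaks on nested tags of the same name or on content that spans lines, so it is no substitute for a parser.

```perl
use strict;
use warnings;

# Hypothetical one-line input; the real htmlfile.txt was never shown.
my $line = '<title>Home</title><h1 class="big">Welcome</h1>';

my %text_for;
# <(\w+)   opening tag name, captured as $1
# [^>]*>   the rest of the opening tag, attributes included
# (.*?)    the shortest possible run of text, captured as $2
# </\1>    a closing tag with the same name as $1
while ( $line =~ m{<(\w+)[^>]*>(.*?)</\1>}g ) {
    $text_for{$1} = $2;
    print "$1: $2\n";
}
```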

Node Type: perlquestion [id://888817]
Approved by Corion