PerlMonks  

getting and printing form values etc from html stripping out all else

by kalkisong (Initiate)
on Feb 24, 2010 at 19:21 UTC ( [id://825146]=perlquestion )

kalkisong has asked for the wisdom of the Perl Monks concerning the following question:

I've been working on some code that searches several HTML pages on my site, pulls out specific lines of HTML, reformats them the way I want, and prints the result to a single page. Right now it sort of works: I can get my lines, but it prints each entire line, and I am not sure how to write the code to get exactly what I want. For example, the current Perl code below finds the lines I am after on the HTML pages, but it prints the whole line, like this:
<img src="http://www.mysite/graphics/blue.jpg" alt="Hey" width="100" height="100" ><br>yada yada
I really just want it to print something like this: <g:image_link>http://www.mysite/graphics/blue.jpg</g:image_link> on one line.
I am also grabbing hidden form fields and trying to print just the values from them. How do I strip out <br> <hr> <b> etc.?
I am stuck and not really sure what to do next. Thank you for any assistance.
    #!/usr/bin/perl -T

    # Set variables
    #############################################################################
    $output_file = "../../cgi/generator/source-output.html"; #Chmod 766
    #############################################################################
    #Pages to Scan
    #############################################################################
    my @files= qw|
    ../../page1.html
    ../../page2.html
    |;
    my @allfiles;
    for my $filename(@files){
        open FILE, $filename || die "Cannot open $filename for reading: $!\n";
        push @allfiles, $_ while (<FILE>);
        @lines = @allfiles;
        $line = @allfiles;
        close FILE;
    }
    #output generator
    #############################################################################
    open (OUTPUT,">$output_file") || die "Can't Open $output_file: $!\n";
    printf OUTPUT "Generator CGI Tester\n\n";
    s/\<[^\<]+\>//;
    foreach $line (@lines) {
        $print_flag = 0;
        s/\<[^\<]+\>//;
        #generator variables
        #############################################################################
        ###Image
        if ($line =~ m/<img src/) {
            $img = $line;
            printf OUTPUT "$img";
            $print_flag = 1;
        }
        ###Form action
        if ($line =~ m/<input type=\"hidden\" name=\"description\"/) {
            $hidden = $line;
            printf OUTPUT "$hidden";
            $print_flag = 1;
        }
    }
    close (OUTPUT);
    # Redirect browser to generated page
    print "Location: $output_file\n\n";
    exit;

Replies are listed 'Best First'.
Re: getting and printing form values etc from html stripping out all else
by pemungkah (Priest) on Feb 24, 2010 at 21:18 UTC
    The big problem is that the tool you're using (regular expressions) is terribly bad at parsing things like HTML. I'd strongly recommend one of the HTML parsing modules (HTML::Parser or HTML::TreeBuilder) to do this job. They take care of the messy business of actually understanding the HTML and let you concentrate on stuff like "I want the contents of this tag".
      Thank you. I've looked over the links and the info for HTML::Parser etc.; however, I am unsure how to implement it in my current code. I am fairly new to this. I'll keep reading up on it, and any more assistance would be appreciated.
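For readers wondering what the parser route might look like in practice, here is a minimal sketch using HTML::TokeParser, which ships with the HTML-Parser distribution on CPAN (so it must be installed). The sample markup, the helper name `extract_fields`, and the `name: value` output format for hidden fields are assumptions made for illustration; they are not taken from the original poster's site.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;    # part of the HTML-Parser distribution (not core Perl)

# Sample markup standing in for one of the site's pages (an assumption).
my $html = <<'HTML';
<p><img src="http://www.mysite/graphics/blue.jpg" alt="Hey" width="100" height="100" ><br>yada yada</p>
<form><input type="hidden" name="description" value="A blue square"></form>
HTML

# extract_fields is a hypothetical helper name, not part of any module.
sub extract_fields {
    my ($markup) = @_;

    # Passing a reference to a string parses that string instead of a file.
    my $parser = HTML::TokeParser->new( \$markup )
        or die "Cannot set up parser: $!";

    my @out;

    # get_tag skips ahead to the next <img> or <input> start tag and
    # returns [ $tagname, \%attributes, ... ]; everything else is ignored.
    while ( my $tag = $parser->get_tag( 'img', 'input' ) ) {
        my ( $name, $attr ) = @$tag;

        if ( $name eq 'img' && defined $attr->{src} ) {
            push @out, "<g:image_link>$attr->{src}</g:image_link>";
        }
        elsif ( $name eq 'input'
            && lc( $attr->{type} // '' ) eq 'hidden'
            && defined $attr->{value} )
        {
            my $field = $attr->{name} // '(unnamed)';
            push @out, "$field: $attr->{value}";
        }
    }
    return @out;
}

print "$_\n" for extract_fields($html);
```

Because the parser hands back attributes as a hash, tags such as <br>, <hr> and <b> never need to be stripped by hand: they are simply never asked for.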
Re: getting and printing form values etc from html stripping out all else
by ww (Archbishop) on Feb 25, 2010 at 02:13 UTC

    For your first requirement, a regex is probably safe and effective, since (unless I'm having a Sr. moment) the html 4.x standard does not allow an image tag with a literal ">" inside the tag.

    One way to approach the job, therefore, is to extend your regex with non-greedy (aka "minimally greedy") matching and a negated character class. Here's a sketch, minus file-handling, CGI, etc:

    #!/usr/bin/perl
    use strict;
    use warnings;
    #825146

    my @line = <DATA>;
    for my $line(@line) {
        chomp $line;
        if ( $line =~ m/(<img .*?[^>]+)/ ){
            print "<g:image_link> " . $1 . "> </g:image_link>\n";
        } else {
            print "\t nope: $line \n";    # you may want to send this to a different file
        }
    }

    __DATA__
    <p><img src="http://www.mysite/graphics/blue.jpg" alt="Hey" width="100" height="100" ><br>yada yada</p>
    <p><img src="../grapics/blue1.gif" alt="Yo" width="200" height="75"></p>
    <p>foobar with no img</p>
    <blockquote><img width="75" height="75" src="blue2.png"></blockquote>

    Output:

    <g:image_link> <img src="http://www.mysite/graphics/blue.jpg" alt="Hey" width="100" height="100" > </g:image_link>
    <g:image_link> <img src="../grapics/blue1.gif" alt="Yo" width="200" height="75"> </g:image_link>
         nope: <p>foobar with no img</p>
    <g:image_link> <img width="75" height="75" src="blue2.png"> </g:image_link>
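The sketch above wraps the whole <img> tag, whereas the original question asked for only the URL between the <g:image_link> tags. One possible variation, using core Perl only, captures just the quoted src value. The helper name `image_links` is hypothetical, and the pattern assumes the src value is quoted and that no ">" appears inside the tag before it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# image_links is a hypothetical helper name used only for this sketch.
sub image_links {
    my (@lines) = @_;
    my @links;
    for my $line (@lines) {
        # Inside one <img ...> tag, find a src attribute and capture its
        # quoted value. [^>]*? stops the search at the end of the tag, and
        # \1 requires the closing quote to match the opening one.
        while ( $line =~ m/<img\b[^>]*?\ssrc\s*=\s*(["'])(.*?)\1/gi ) {
            push @links, "<g:image_link>$2</g:image_link>";
        }
    }
    return @links;
}

# Same sample lines as the DATA section above.
my @sample = (
    '<p><img src="http://www.mysite/graphics/blue.jpg" alt="Hey" width="100" height="100" ><br>yada yada</p>',
    '<p><img src="../grapics/blue1.gif" alt="Yo" width="200" height="75"></p>',
    '<p>foobar with no img</p>',
    '<blockquote><img width="75" height="75" src="blue2.png"></blockquote>',
);

print "$_\n" for image_links(@sample);
```

This prints one <g:image_link> line per image and silently skips lines without an <img> tag. It works regardless of where src sits among the attributes, but it remains a regex working on markup, so unquoted attribute values or tags split across lines would be missed.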

    BUT take the advice from pemungkah above: Use a parser! Trying to deal with all the possible unwanted tags in a form with regexen is going to get you deeper and deeper into complexities.

    And if you're planning to read user input from a form, for heaven's sake, read about untainting. You really don't want to let the fumble-fingered or malicious run around loose in your playground.
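As a rough illustration of untainting: under taint mode (the -T switch in the original script), data from outside the program cannot be used in risky operations until it has been extracted through a regex capture. The helper name `untaint_page_name` and the rule for what counts as an acceptable page name are assumptions for this sketch; a real site would choose its own whitelist.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# untaint_page_name is a hypothetical helper. The allowed pattern (letters,
# digits, underscore, hyphen, then ".html") is an assumption for this sketch.
sub untaint_page_name {
    my ($raw) = @_;
    return unless defined $raw;

    # Text captured by a regex is treated as untainted under -T, so the
    # pattern should accept only what is known to be safe, anchored at both
    # ends so nothing can be smuggled in before or after it.
    if ( $raw =~ /\A([A-Za-z0-9_-]+\.html)\z/ ) {
        return $1;
    }
    return;    # reject everything else
}

for my $candidate ( 'page1.html', '../../etc/passwd', 'page2.html; rm -rf /' ) {
    my $safe = untaint_page_name($candidate);
    print defined $safe ? "ok:       $safe\n" : "rejected: $candidate\n";
}
```

The point of the design is to describe what is allowed rather than try to list what is forbidden: anything that does not match the pattern exactly is refused.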

Re: getting and printing form values etc from html stripping out all else
by Anonymous Monk on Feb 25, 2010 at 02:54 UTC
