use strict;
use warnings;
use HTML::TreeBuilder;
my $str = '<html><title>GAL7</title>
<body bgcolor=white>
<h2 align=center>GAL7</h2>
<hr>
<form method="post" action="/cgi-bin/SCPD/getgene2?GAL7" enctype="appl
+ication/x-www-form-urlencoded">
<input type="submit" name="action" value="Get mapped sites" />
<input type="submit" name="action" value="Get putative sites" />
<input type="submit" name="action" value="Get interg
enic region" /><br />
<input type="submit" name="action" value="Retrieve sequence" />Start<-
+ATG
<input type="text" name="start" value="-450" size="5" maxlength="5" />
+ATG->End
<input type="text" name="end" value="50" size="5" maxlength="5" />
<div></div></form>
<hr>
<pre>
>YBR018C GAL7 275433 275933
TTTGATATCACTCACAACTATTGCGAAGCGCTTCAGTGAAAAAATCATAA
GGAAAAGTTGTAAATATTATTGGTAGTATTCGTTTGGTAAAGTAGAGGGG
GTAATTTTTCCCCTTTATTTTGTTCATACATTCTTAAATTGCTTTGCCTC
TCCTTTTGGAAAGCTATACTTCGGAGCACTGTTGAGCGAAGGCTCATTAG
ATATATTTTCTGTCATTTTCCTTAACCCAAAAATAAGGGAAAGGGTCCAA
AAAGCGCTCGGACAACTGTTGACCGTGATCCGAAGGACTGGCTATACAGT
GTTCACAAAATAGCCAAGCTGAAAATAATGTGTAGCTATGTTCAGTTAGT
TTGGCTAGCAAAGATATAAAAGCAGGTCGGAAATATTTATGGGCATTATT
ATGCAGAGCATCAACATGATAAAAAAAAACAGTTGAATATTCCCTCAAAA
ATGACTGCTGAAGAATTTGATTTTTCTAGCCATTCCCATAGACGTTACAA
</pre>';
my $tree = HTML::TreeBuilder->new;
$tree->parse ($str);
print $_->as_text () . "\n" for $tree->find ('pre');
Prints:
>YBR018C GAL7 275433 275933
TTTGATATCACTCACAACTATTGCGAAGCGCTTCAGTGAAAAAATCATAA
GGAAAAGTTGTAAATATTATTGGTAGTATTCGTTTGGTAAAGTAGAGGGG
GTAATTTTTCCCCTTTATTTTGTTCATACATTCTTAAATTGCTTTGCCTC
TCCTTTTGGAAAGCTATACTTCGGAGCACTGTTGAGCGAAGGCTCATTAG
ATATATTTTCTGTCATTTTCCTTAACCCAAAAATAAGGGAAAGGGTCCAA
AAAGCGCTCGGACAACTGTTGACCGTGATCCGAAGGACTGGCTATACAGT
GTTCACAAAATAGCCAAGCTGAAAATAATGTGTAGCTATGTTCAGTTAGT
TTGGCTAGCAAAGATATAAAAGCAGGTCGGAAATATTTATGGGCATTATT
ATGCAGAGCATCAACATGATAAAAAAAAACAGTTGAATATTCCCTCAAAA
ATGACTGCTGAAGAATTTGATTTTTCTAGCCATTCCCATAGACGTTACAA
Update: Fixed link
DWIM is Perl's answer to Gödel
| [reply] [d/l] [select] |
Well, given the sample of data you've shown, this would do what you want (assuming the string is in $_):
s{.*<pre>}{}s;
That is, delete everything up to and including the "pre" tag. Note the "m" modifier at the end, so that "." is allowed to match "\n".
Now, if there's also a </pre> tag that you're not showing us, and more html data after that, you'll probably want to get rid of that as well:
s{</pre>.*}{}s;
Of course, if a given html page contains more than one "pre" segment, you'll need to be more careful. Ultimately, you might need to actually read the manual page for an HTML parsing module, and start using it, because that would be the preferred approach for this sort of thing.
But if the data are consistently as simple as your sample, a couple regex substitutions will probably suffice.
(updated my regexes to use the "s" modifier as intendedm, rather than the "m" modifier. Thanks, mreece!!) | [reply] [d/l] [select] |
Note the "m" modifier at the end, so that "." is allowed to match "\n".
you have that backwards! /s allows . to match \n, not /m.
| [reply] [d/l] [select] |
use HTML::Parser;
my $VAR1 = '<html><title>GAL7</title>
<body bgcolor=white>
<h2 align=center>GAL7</h2><hr>
<form method="post" action="/cgi-bin/SCPD/getgene2?GAL7" enctype="appl
+ication/x-www-form-urlencoded">
<input type="submit" name="action" value="Get mapped sites" /><input t
+ype="submit" name="action" value="Get putative sites" /><input type="
+submit" name="action" value="Get interg
enic region" /><br /><input type="submit" name="action" value="Retriev
+e sequence" />Start<-ATG <input type="text" name="start" value="-450"
+ size="5" maxlength="5" />ATG->End <inp
ut type="text" name="end" value="50" size="5" maxlength="5" /><div></d
+iv></form><hr>
<pre>
>YBR018C GAL7 275433 275933
TTTGATATCACTCACAACTATTGCGAAGCGCTTCAGTGAAAAAATCATAA
GGAAAAGTTGTAAATATTATTGGTAGTATTCGTTTGGTAAAGTAGAGGGG
GTAATTTTTCCCCTTTATTTTGTTCATACATTCTTAAATTGCTTTGCCTC
TCCTTTTGGAAAGCTATACTTCGGAGCACTGTTGAGCGAAGGCTCATTAG
ATATATTTTCTGTCATTTTCCTTAACCCAAAAATAAGGGAAAGGGTCCAA
AAAGCGCTCGGACAACTGTTGACCGTGATCCGAAGGACTGGCTATACAGT
GTTCACAAAATAGCCAAGCTGAAAATAATGTGTAGCTATGTTCAGTTAGT
TTGGCTAGCAAAGATATAAAAGCAGGTCGGAAATATTTATGGGCATTATT
ATGCAGAGCATCAACATGATAAAAAAAAACAGTTGAATATTCCCTCAAAA
ATGACTGCTGAAGAATTTGATTTTTCTAGCCATTCCCATAGACGTTACAA
</pre>Some other stuff</body></html>';
sub default_start
{
my ($self, $tagname) = @_;
if ( $tagname eq 'pre' )
{
$self->handler(text => \&get_text, "self,dtext");
$self->handler(end => \&end_text, "self,tagname");
}
}
sub get_text
{
my ($self, $text) = @_;
if ( not exists $self->{_text} )
{
$self->{_text} = $text;
}
else
{
$self->{_text} .= $text;
}
}
sub end_text
{
my ( $self, $tagname) = @_;
if ( $tagname eq 'pre' )
{
$self->handler(text => '');
$self->handler(start => '');
$self->handler(end => '');
}
}
my $parser = HTML::Parser->new(start_h => [\&default_start,'self,tagna
+me']);
$parser->parse($VAR1);
print $parser->{_text};
This might have the advantage over using other parsers if you are dealing with large documents as it doesn't build a preparsed representation of the documentation before handing the events to you.
/J\ | [reply] [d/l] |
PRE tag within text area | [reply] |
use HTML::TokeParser::Simple;
my $p = HTML::TokeParser::Simple->new( file => 'test_data.html' );
my $t; #token;
my @text;
#get all text between pre tags
while ($t = $p->get_token) {
next unless $t->is_start_tag('pre');
my $content;
while ($t = $p->get_token) {
last if $t->is_end_tag('pre');
$content .= $t->as_is;
}
push @text, $content;
}
I'm guessing this isn't the fastest approach... but hey, TMTOWTDI.
| [reply] [d/l] |
if you want to do this with regular expressions, which is in most cases a bad idea (arguably unless you know the precise structure of your html, such as being darned certain there won't be nested or unmatched tags, etc) .. consider:
## OP specified 'last <pre>' tag,
## so assume there can be more than one <pre>..</pre> block
## find all <pre> blocks, using non-greedy .*? and also
## get \n in the case where the html ends with a newline and no </pre>
## anchor to non-capturing match for closing </pre> or end of string
my @pre = ( $VAR1 =~ m{<pre>(.*?\n?)(?:</pre>|$)}isg );
## we want the last one
my $new_output = pop @pre;
| [reply] [d/l] |