Extracting Text After <pre> tag in HTML

monkfan has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
Given this type of HTML (stored in a variable), how can I extract the string after the last <pre> tag?

$VAR1 = '<html><title>GAL7</title>
<body bgcolor=white>
<h2 align=center>GAL7</h2><hr>
<form method="post" action="/cgi-bin/SCPD/getgene2?GAL7" enctype="application/x-www-form-urlencoded">
<input type="submit" name="action" value="Get mapped sites" /><input type="submit" name="action" value="Get putative sites" /><input type="submit" name="action" value="Get interg
enic region" /><br /><input type="submit" name="action" value="Retrieve sequence" />Start<-ATG <input type="text" name="start" value="-450" size="5" maxlength="5" />ATG->End <inp
ut type="text" name="end" value="50" size="5" maxlength="5" /><div></div></form><hr>
<pre>
>YBR018C  GAL7  275433  275933
TTTGATATCACTCACAACTATTGCGAAGCGCTTCAGTGAAAAAATCATAA
GGAAAAGTTGTAAATATTATTGGTAGTATTCGTTTGGTAAAGTAGAGGGG
GTAATTTTTCCCCTTTATTTTGTTCATACATTCTTAAATTGCTTTGCCTC
TCCTTTTGGAAAGCTATACTTCGGAGCACTGTTGAGCGAAGGCTCATTAG
ATATATTTTCTGTCATTTTCCTTAACCCAAAAATAAGGGAAAGGGTCCAA
AAAGCGCTCGGACAACTGTTGACCGTGATCCGAAGGACTGGCTATACAGT
GTTCACAAAATAGCCAAGCTGAAAATAATGTGTAGCTATGTTCAGTTAGT
TTGGCTAGCAAAGATATAAAAGCAGGTCGGAAATATTTATGGGCATTATT
ATGCAGAGCATCAACATGATAAAAAAAAACAGTTGAATATTCCCTCAAAA
ATGACTGCTGAAGAATTTGATTTTTCTAGCCATTCCCATAGACGTTACAA
';

such that it returns simply:

my $new_output = '
>YBR018C  GAL7  275433  275933        #this fasta marker line is to be kept
TTTGATATCACTCACAACTATTGCGAAGCGCTTCAGTGAAAAAATCATAA
GGAAAAGTTGTAAATATTATTGGTAGTATTCGTTTGGTAAAGTAGAGGGG
GTAATTTTTCCCCTTTATTTTGTTCATACATTCTTAAATTGCTTTGCCTC
TCCTTTTGGAAAGCTATACTTCGGAGCACTGTTGAGCGAAGGCTCATTAG
ATATATTTTCTGTCATTTTCCTTAACCCAAAAATAAGGGAAAGGGTCCAA
AAAGCGCTCGGACAACTGTTGACCGTGATCCGAAGGACTGGCTATACAGT
GTTCACAAAATAGCCAAGCTGAAAATAATGTGTAGCTATGTTCAGTTAGT
TTGGCTAGCAAAGATATAAAAGCAGGTCGGAAATATTTATGGGCATTATT
ATGCAGAGCATCAACATGATAAAAAAAAACAGTTGAATATTCCCTCAAAA
ATGACTGCTGAAGAATTTGATTTTTCTAGCCATTCCCATAGACGTTACAA
';

I couldn't figure out how to create a mechanism that can distinguish between an HTML tag and the 'fasta' marker, which begins with ">".

Regards,
Edward


Replies are listed 'Best First'.
Re: Extracting Text After <pre> tag in HTML by GrandFather (Saint) on Sep 22, 2006 at 01:28 UTC
Use HTML::TreeBuilder:

use strict;
use warnings;
use HTML::TreeBuilder;

my $str = '<html><title>GAL7</title>
<body bgcolor=white>
<h2 align=center>GAL7</h2>
<hr>
<form method="post" action="/cgi-bin/SCPD/getgene2?GAL7" enctype="application/x-www-form-urlencoded">
<input type="submit" name="action" value="Get mapped sites" />
<input type="submit" name="action" value="Get putative sites" />
<input type="submit" name="action" value="Get interg
enic region" /><br />
<input type="submit" name="action" value="Retrieve sequence" />Start<-ATG
<input type="text" name="start" value="-450" size="5" maxlength="5" />ATG->End
<input type="text" name="end" value="50" size="5" maxlength="5" />
<div></div></form>
<hr>
<pre>
>YBR018C  GAL7  275433  275933
TTTGATATCACTCACAACTATTGCGAAGCGCTTCAGTGAAAAAATCATAA
GGAAAAGTTGTAAATATTATTGGTAGTATTCGTTTGGTAAAGTAGAGGGG
GTAATTTTTCCCCTTTATTTTGTTCATACATTCTTAAATTGCTTTGCCTC
TCCTTTTGGAAAGCTATACTTCGGAGCACTGTTGAGCGAAGGCTCATTAG
ATATATTTTCTGTCATTTTCCTTAACCCAAAAATAAGGGAAAGGGTCCAA
AAAGCGCTCGGACAACTGTTGACCGTGATCCGAAGGACTGGCTATACAGT
GTTCACAAAATAGCCAAGCTGAAAATAATGTGTAGCTATGTTCAGTTAGT
TTGGCTAGCAAAGATATAAAAGCAGGTCGGAAATATTTATGGGCATTATT
ATGCAGAGCATCAACATGATAAAAAAAAACAGTTGAATATTCCCTCAAAA
ATGACTGCTGAAGAATTTGATTTTTCTAGCCATTCCCATAGACGTTACAA
</pre>';

my $tree = HTML::TreeBuilder->new;
$tree->parse ($str);
$tree->eof ();    # signal end of input so the tree is complete

# Print the text content of every <pre> element
print $_->as_text () . "\n" for $tree->find ('pre');

Prints:

>YBR018C  GAL7  275433  275933
TTTGATATCACTCACAACTATTGCGAAGCGCTTCAGTGAAAAAATCATAA
GGAAAAGTTGTAAATATTATTGGTAGTATTCGTTTGGTAAAGTAGAGGGG
GTAATTTTTCCCCTTTATTTTGTTCATACATTCTTAAATTGCTTTGCCTC
TCCTTTTGGAAAGCTATACTTCGGAGCACTGTTGAGCGAAGGCTCATTAG
ATATATTTTCTGTCATTTTCCTTAACCCAAAAATAAGGGAAAGGGTCCAA
AAAGCGCTCGGACAACTGTTGACCGTGATCCGAAGGACTGGCTATACAGT
GTTCACAAAATAGCCAAGCTGAAAATAATGTGTAGCTATGTTCAGTTAGT
TTGGCTAGCAAAGATATAAAAGCAGGTCGGAAATATTTATGGGCATTATT
ATGCAGAGCATCAACATGATAAAAAAAAACAGTTGAATATTCCCTCAAAA
ATGACTGCTGAAGAATTTGATTTTTCTAGCCATTCCCATAGACGTTACAA

Update: Fixed link

DWIM is Perl's answer to Gödel
Re: Extracting Text After <pre> tag in HTML by graff (Chancellor) on Sep 22, 2006 at 01:35 UTC
Well, given the sample of data you've shown, this would do what you want (assuming the string is in $_):

`s{.*<pre>}{}s;`

That is, delete everything up to and including the "pre" tag. Note the "s" modifier at the end, so that "." is allowed to match "\n".

Now, if there's also a `</pre>` tag that you're not showing us, and more html data after that, you'll probably want to get rid of that as well:

`s{</pre>.*}{}s;`

Of course, if a given html page contains more than one "pre" segment, you'll need to be more careful. Ultimately, you might need to actually read the manual page for an HTML parsing module, and start using it, because that would be the preferred approach for this sort of thing. But if the data are consistently as simple as your sample, a couple of regex substitutions will probably suffice.

(Updated my regexes to use the "s" modifier as intended, rather than the "m" modifier. Thanks, mreece!!)
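A minimal sketch of the two substitutions from the reply above, assuming the intended patterns are `.*<pre>` and `</pre>.*` with the "s" modifier. The miniature page in `$html` is made up for illustration; it holds two "pre" blocks so the effect of the greedy `.*` is visible. Because the patterns match the literal text `<pre>` and `</pre>`, the ">" that starts the fasta header is never mistaken for part of a tag.

```perl
use strict;
use warnings;

# Hypothetical miniature page: two <pre> blocks with other markup around them.
my $html = "<html><body><pre>\nfirst\n</pre><hr><pre>\n>ID1  GENE\nACGT\n</pre>trailer</body></html>\n";

my $text = $html;
$text =~ s{.*<pre>}{}s;     # greedy .* with /s: delete through the LAST <pre>
$text =~ s{</pre>.*}{}s;    # delete the closing tag and everything after it

print $text;
```

What remains is the body of the last "pre" block, including the newline that followed the opening tag. If the page had text containing `</pre>` after the last block, the second substitution would still cut at the first `</pre>` it finds, which is the one that closes that block.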
Re^2: Extracting Text After <pre> tag in HTML by mreece (Friar) on Sep 22, 2006 at 20:13 UTC
graff wrote: Note the "m" modifier at the end, so that "." is allowed to match "\n".

You have that backwards! `/s` allows . to match `\n`, not `/m`.
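To make the difference concrete, here is a small sketch that runs the same substitution with no modifier, with `/s`, and with `/m`. The sample string and the variable names are invented for the demonstration.

```perl
use strict;
use warnings;

# Made-up sample: one line of text before the <pre> tag.
my $str = "line one\n<pre>\nline two\n";

# No modifier: "." stops at "\n", so only the tag itself is removed.
(my $no_s = $str) =~ s{.*<pre>}{};

# /s: "." also matches "\n", so everything through the tag is removed.
(my $with_s = $str) =~ s{.*<pre>}{}s;

# /m: only changes where ^ and $ match; "." still stops at "\n".
(my $with_m = $str) =~ s{.*<pre>}{}m;

print "no modifier leaves 'line one' in place\n" if $no_s   =~ /line one/;
print "/s removes 'line one'\n"                  if $with_s !~ /line one/;
print "/m behaves like no modifier here\n"       if $with_m eq $no_s;
```

Only the `/s` version deletes the text that precedes the tag on earlier lines, which is what the original question needs.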
Re: Extracting Text After <pre> tag in HTML by gellyfish (Monsignor) on Sep 22, 2006 at 21:25 UTC
Just for the sake of completeness here's how you might do it with HTML::Parser:

use HTML::Parser;

my $VAR1 = '<html><title>GAL7</title>
<body bgcolor=white>
<h2 align=center>GAL7</h2><hr>
<form method="post" action="/cgi-bin/SCPD/getgene2?GAL7" enctype="application/x-www-form-urlencoded">
<input type="submit" name="action" value="Get mapped sites" /><input type="submit" name="action" value="Get putative sites" /><input type="submit" name="action" value="Get interg
enic region" /><br /><input type="submit" name="action" value="Retrieve sequence" />Start<-ATG <input type="text" name="start" value="-450" size="5" maxlength="5" />ATG->End <inp
ut type="text" name="end" value="50" size="5" maxlength="5" /><div></div></form><hr>
<pre>
>YBR018C  GAL7  275433  275933
TTTGATATCACTCACAACTATTGCGAAGCGCTTCAGTGAAAAAATCATAA
GGAAAAGTTGTAAATATTATTGGTAGTATTCGTTTGGTAAAGTAGAGGGG
GTAATTTTTCCCCTTTATTTTGTTCATACATTCTTAAATTGCTTTGCCTC
TCCTTTTGGAAAGCTATACTTCGGAGCACTGTTGAGCGAAGGCTCATTAG
ATATATTTTCTGTCATTTTCCTTAACCCAAAAATAAGGGAAAGGGTCCAA
AAAGCGCTCGGACAACTGTTGACCGTGATCCGAAGGACTGGCTATACAGT
GTTCACAAAATAGCCAAGCTGAAAATAATGTGTAGCTATGTTCAGTTAGT
TTGGCTAGCAAAGATATAAAAGCAGGTCGGAAATATTTATGGGCATTATT
ATGCAGAGCATCAACATGATAAAAAAAAACAGTTGAATATTCCCTCAAAA
ATGACTGCTGAAGAATTTGATTTTTCTAGCCATTCCCATAGACGTTACAA
</pre>Some other stuff</body></html>';

# Start-tag handler: once a <pre> is seen, begin collecting text
# and watch for the matching end tag.
sub default_start
{
    my ($self, $tagname) = @_;

    if ( $tagname eq 'pre' )
    {
        $self->handler(text => \&get_text, "self,dtext");
        $self->handler(end  => \&end_text, "self,tagname");
    }
}

# Text handler: accumulate the decoded text on the parser object.
sub get_text
{
    my ($self, $text) = @_;

    if ( not exists $self->{_text} )
    {
        $self->{_text} = $text;
    }
    else
    {
        $self->{_text} .= $text;
    }
}

# End-tag handler: at </pre>, switch all handlers off again.
sub end_text
{
    my ( $self, $tagname) = @_;

    if ( $tagname eq 'pre' )
    {
        $self->handler(text  => '');
        $self->handler(start => '');
        $self->handler(end   => '');
    }
}

my $parser = HTML::Parser->new(start_h => [\&default_start,'self,tagname']);

$parser->parse($VAR1);
$parser->eof;    # flush anything still buffered

print $parser->{_text};

This might have the advantage over using other parsers if you are dealing with large documents as it doesn't build a preparsed representation of the document before handing the events to you.

/J\
Re^2: Extracting Text After <pre> tag in HTML by Anonymous Monk on Mar 30, 2007 at 08:41 UTC
How can I make a PRE tag work within a text area?
Re: Extracting Text After <pre> tag in HTML by radiantmatrix (Parson) on Sep 27, 2006 at 15:02 UTC
In case you haven't been given enough Ways To Do It, my first thought was HTML::TokeParser::Simple:

use HTML::TokeParser::Simple;

my $p = HTML::TokeParser::Simple->new( file => 'test_data.html' );
my $t;       # current token
my @text;    # all text found between pre tags

while ($t = $p->get_token) {
    next unless $t->is_start_tag('pre');
    my $content;
    while ($t = $p->get_token) {
        last if $t->is_end_tag('pre');
        $content .= $t->as_is;
    }
    push @text, $content;
}

I'm guessing this isn't the fastest approach... but hey, TMTOWTDI.

<–radiant.matrix–>
A collection of thoughts and links from the minds of geeks
The Code that can be seen is not the true Code
I haven't found a problem yet that can't be solved by a well-placed trebuchet
Re: Extracting Text After <pre> tag in HTML by mreece (Friar) on Sep 22, 2006 at 20:28 UTC
If you want to do this with regular expressions, which is in most cases a bad idea (arguably unless you know the precise structure of your HTML, such as being darned certain there won't be nested or unmatched tags, etc.), consider:

## OP specified 'last <pre>' tag,
## so assume there can be more than one <pre>..</pre> block

## find all <pre> blocks, using non-greedy .*? and also
## get \n in the case where the html ends with a newline and no </pre>
## anchor to non-capturing match for closing </pre> or end of string
my @pre = ( $VAR1 =~ m{<pre>(.*?\n?)(?:</pre>|$)}isg );

## we want the last one
my $new_output = pop @pre;
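A short worked example of this approach, assuming the intended pattern is `m{<pre>(.*?\n?)(?:</pre>|$)}isg`. The input in `$html` is invented: it has one closed "pre" block and a second, upper-case one that is never closed and runs to the end of the string, which is the case the `\n?` and `$` parts are there to handle.

```perl
use strict;
use warnings;

# Hypothetical input: a closed block, then an unclosed <PRE> ending in a newline.
my $html = "<p>intro</p><pre>\nfirst block\n</pre><hr><PRE>\n>ID2  GENE\nACGT\n";

# List context plus /g returns the captured body of every block.
# /i accepts <PRE>, /s lets "." cross newlines, .*? stops at the nearest </pre>.
my @pre = ( $html =~ m{<pre>(.*?\n?)(?:</pre>|$)}isg );

my $count = @pre;         # number of blocks found
my $last  = pop @pre;     # body of the last block

print "found $count blocks\n";
print $last;
```

The optional `\n` matters for the unclosed block: `$` can match just before a final newline, so without `\n?` the captured text would lose its trailing line ending.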

