PerlMonks  

Regular Expression Help

by Anonymous Monk
on Sep 01, 2001 at 12:35 UTC ( #109622=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I wrote some code that goes through files, searching for a pattern in the text each file holds. If it's matched, the contents of the file are added to an array. Later every value in the array is printed out. Before it is printed out, though, I'd like to replace the pattern with itself surrounded by the <b> and </b> tags. However, if the pattern is in an HTML tag already, I don't want the bold tags applied to it.

I've tried several times to write a successful regular expression to check if the pattern is within an HTML tag, and if it is not, to apply the bold tags to it. I've failed every time though. If anyone could help me with this, I'd really appreciate it.

Replies are listed 'Best First'.
Re: Regular Expression Help
by tachyon (Chancellor) on Sep 01, 2001 at 14:54 UTC

    Hi, it sounds like the files you are parsing are HTML. There is a great little widget called HTML::TokeParser that parses the tokens (tags) in an HTML file. Say you want to find "foo", here are some examples:

    <p>Here is a foo
    <p>Here is another foo.
    <p>Here <b>is a foo in bold</b>
    <p><a href="http://foo.com"> foo.com </a>

    and here is how that renders in a browser - this is that exact HTML:

    Here is a foo

    Here is another foo.

    Here is a foo in bold

    foo.com

    Some of the problems include the fact that the opening tag may not be on the same line, since newlines are ignored when HTML content is rendered. There may or may not be a closing tag. Also note that technically everything is 'within' some sort of tags, so you need to specify what you want more exactly. Whether you mean 'within' as in the href example or something else, TokeParser is your friend.

    To show you how useful it is here is a little TokeParser example that finds all the heading tags (h1 h2 h3 h4) in an HTML doc, gets the trimmed text between the opening and closing tag (minus other tags), color codes it and then prints out the color coded headings producing a quick and dirty index. Anyway if you only want to look for stuff in the text this makes it easy! You can rebuild the line from the tokens.

    So have a look at the docs for TokeParser. It breaks everything down into little bits. Once you have done this, testing whether it is in a tag (whatever you mean by that) is easy, as TokeParser has done all the work for you. A regex solution will almost always be a kludge and broken in some cases. Reliability == TokeParser

    #!/usr/bin/perl -w
    use strict;
    use HTML::TokeParser;

    my $dir  = "c:/windows/desktop/book/";
    my $file = $dir."work_index.htm";
    my $p = HTML::TokeParser->new($file) || die "Can't open $file: $!";
    my %font = ( h1 => '#0000ff', h2 => '#0000a0', h3 => '#000060', h4 => '#000000');

    while (my $token = $p->get_tag(qw(h1 h2 h3 h4))) {
        my $open  = $token->[0];
        my $close = '/'.$open;
        my $text  = $p->get_trimmed_text($close);
        print "<$open><font color='$font{$open}'>$text</font><$close>\n";
    }
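
    Applied to the original question, the same module can be used roughly as follows. This is only a sketch, not code from this thread: the helper name bold_outside_tags and the sample HTML are made up for illustration, and it assumes the whole document is already in a string. It walks the token stream with get_token, rewrites only text tokens, and copies every tag through untouched, so a match inside an href is left alone.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: wrap every match of $pattern in <b>...</b>, but only
# inside text tokens. Tags, comments, declarations and processing
# instructions are copied through unchanged.
sub bold_outside_tags {
    my ($html, $pattern) = @_;

    # A reference to a scalar tells TokeParser to parse the string itself.
    my $p = HTML::TokeParser->new(\$html)
        or die "Can't parse document";

    my $out = '';
    while (my $token = $p->get_token) {
        my $type = $token->[0];
        if ($type eq 'T' && !$token->[2]) {
            # Ordinary text; $token->[2] is set for script/style content,
            # which is left alone here.
            my $text = $token->[1];
            $text =~ s/($pattern)/<b>$1<\/b>/g;
            $out .= $text;
        }
        elsif ($type eq 'S')  { $out .= $token->[4] }   # start tag, original text
        elsif ($type eq 'E')  { $out .= $token->[2] }   # end tag, original text
        elsif ($type eq 'PI') { $out .= $token->[2] }   # processing instruction
        else                  { $out .= $token->[1] }   # 'T' (data), 'C', 'D'
    }
    return $out;
}

my $html = '<p>Here is a foo <p><a href="http://foo.com"> foo.com </a>';
print bold_outside_tags($html, qr/foo/), "\n";
# prints: <p>Here is a <b>foo</b> <p><a href="http://foo.com"> <b>foo</b>.com </a>
```

    Note that the foo in the link text is still bolded, because it is text rather than part of a tag. If "within a tag" should also cover text between <a> and </a>, the loop would need to track which elements are currently open.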

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Regular Expression Help
by George_Sherston (Vicar) on Sep 01, 2001 at 13:38 UTC
    Other monks may have wiser suggestions: if I were doing this, I'd strip all the tags before putting the text in the array, using something like
    $text =~ s/<.+?>//g; push @array, $text;
    Then when printing out,
    for (@array) {print '<b>' . $_ . '</b>'}
    ... or whatever other formatting you might want to use.

    Not sure this is what you want to do, but if it is, then this will do it. "Those people who like this kind of thing will find it the kind of thing they like", as Abraham Lincoln says.

    George Sherston
      In my own experience, something like s/<\/?\w+(.*?(\".*?\")?(\'.*?\')?)*?>//g is usually nicer for stripping tags, because often >s have a habit of slipping into automatically generated hrefs. And occasionally javascript event properties.

      I'm not certain the above works perfectly, because I just hacked it up in a couple of minutes, but it should be ok.
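
      To see why a stray > matters, here is a small sketch comparing the simple s/<.+?>//g strip with a quote-aware one. The quote-aware regex below is a simpler illustrative alternative, not the one posted above, and the sub names and sample link are made up for the demonstration.

```perl
use strict;
use warnings;

# Naive strip, as in the parent reply: the tag ends at the first '>' seen.
sub strip_naive {
    my ($text) = @_;
    $text =~ s/<.+?>//g;
    return $text;
}

# Quote-aware strip (illustrative alternative): inside a tag, a quoted
# attribute value is consumed as one unit, so a '>' inside the quotes
# does not end the tag.
sub strip_quote_aware {
    my ($text) = @_;
    $text =~ s/<(?:[^>"']|"[^"]*"|'[^']*')*>//g;
    return $text;
}

my $html = q{<a href="go.cgi?a>b">link</a>};
print strip_naive($html), "\n";        # prints: b">link
print strip_quote_aware($html), "\n";  # prints: link
```

      Even the quote-aware version is still a regex kludge: comments, unbalanced quotes and script blocks can all fool it.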

Re: Regular Expression Help
by ton (Friar) on Sep 01, 2001 at 20:39 UTC
    I use XML::Parser for all my XML (and therefore HTML) parsing needs. Be warned that you need to have expat installed on your machine.

    Good luck!

    -Ton
    -----
    Be bloody, bold, and resolute; laugh to scorn
    The power of man...

      HTML (3.0, 4.0, etc.) is not a subset of XML, at least not until you get to the XHTML stage. XML is a restricted subset of SGML, while HTML is an application of SGML. The main reason I bring up this point is that HTML is, by and large, not well-formed. I'm willing to bet most XML parsers will choke on a common HTML page, simply because most HTML pages aren't structured properly. A <P> tag without a corresponding </P> tag would probably be the second most common offense, and <IMG SRC="blah.gif"> doesn't have a slash terminator; neither of these is smiled upon in XML.

      Granted, it's a moot point if you hand-craft the HTML code going into your programs, but if you're analyzing other websites, assuming that they have properly-structured HTML is probably an unwise programming move, IMO.
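
      A quick way to check this is to feed both kinds of markup to XML::Parser. This is a sketch only: the helper name is_well_formed and the two sample strings are invented for the test, and it assumes XML::Parser (and expat) are installed. parse() dies on the first well-formedness error, so the call is wrapped in eval.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: true if XML::Parser accepts the string as
# well-formed XML, false if parse() died on an error.
sub is_well_formed {
    my ($markup) = @_;
    my $parser = XML::Parser->new;
    return eval { $parser->parse($markup); 1 } ? 1 : 0;
}

# Unclosed <P> and <IMG>, as in everyday HTML.
my $typical_html = '<html><body><P>one<P>two<IMG SRC="blah.gif"></body></html>';

# The same content written the XHTML way.
my $xhtml_style  = '<html><body><p>one</p><p>two</p><img src="blah.gif"/></body></html>';

print "typical HTML: ", (is_well_formed($typical_html) ? "parsed" : "rejected"), "\n";
print "XHTML style:  ", (is_well_formed($xhtml_style)  ? "parsed" : "rejected"), "\n";
# prints:
# typical HTML: rejected
# XHTML style:  parsed
```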

      andre germain
      "Wherever you go, there you are."

        There are several ways to go from HTML to XML, so you can use XML tools with it:

        • install XML::PYX (and HTML::TreeBuilder) and do pyxhtml file.html | pyxw > file.xml,
        • use tidy. Just do tidy --output-xhtml yes file.html > file.xml. Note that you can get a Perl wrapper for it: sl-tidy.pl

        Note that if you are only working with HTML, it might not be really useful to convert everything to XML, and you might want to use HTML::Parser instead.
