PerlMonks  

regex to select text rather than remove HTML tags

by Anonymous Monk
on Jan 23, 2012 at 12:02 UTC ( [id://949383]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

In the example below I want to select the word 'hello' from the first three items but not from the fourth:
1.hello> 2.<hello 3.hello <hello>
I have the regex \bhello\b(?![ ^\]\w:-]*?>) but it doesn't work for "hello>". Please help.

Re: regex to select text rather than remove HTML tags
by Anonymous Monk on Jan 23, 2012 at 12:08 UTC

    You could maybe use  /^\d+\..*?hello.*$/m

    It means

    use YAPE::Regex::Explain;
    print YAPE::Regex::Explain->new( qr/^(\d+\..*?hello.*)$/m )->explain;
    __END__
    The regular expression:

    (?m-isx:^(\d+\..*?hello.*)$)

    matches as follows:

    NODE                     EXPLANATION
    ----------------------------------------------------------------------
    (?m-isx:                 group, but do not capture (with ^ and $
                             matching start and end of line) (case-
                             sensitive) (with . not matching \n)
                             (matching whitespace and # normally):
    ----------------------------------------------------------------------
    ^                        the beginning of a "line"
    ----------------------------------------------------------------------
    (                        group and capture to \1:
    ----------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
    ----------------------------------------------------------------------
    \.                       '.'
    ----------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
    ----------------------------------------------------------------------
    hello                    'hello'
    ----------------------------------------------------------------------
    .*                       any character except \n (0 or more times
                             (matching the most amount possible))
    ----------------------------------------------------------------------
    )                        end of \1
    ----------------------------------------------------------------------
    $                        before an optional \n, and the end of a
                             "line"
    ----------------------------------------------------------------------
    )                        end of grouping
    ----------------------------------------------------------------------

Re: regex to select text rather than remove HTML tags
by sundialsvc4 (Abbot) on Jan 23, 2012 at 13:38 UTC

    Not being too much of a “golfer,” I tend to solve such problems in two steps:   first, I look for the string-structure that I am looking for, then I look for “hello...” within that string.

    One issue you should consider is that, right now, you have no clearly defined beginning or ending delimiter: where does the string begin, and where does it end? In such a case, the less-than and greater-than characters are the only reliable anchor points you have, so split() and pos() become your friends (along with the /i and /g modifiers of a regex). You might be able to construct the argument (and therefore a program) that what you really have here is a string that is “split by” either of these two characters. You iterate through the string, looking for these characters and noting their positions. You decide whether a string of interest could be “beginning” or “ending,” and you extract the pieces for a closer look with substr().
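
A minimal sketch of that position-based idea, under stated assumptions: the helper name find_hello_offsets is hypothetical, and the rule "skip a match only when it is directly preceded by '<' and directly followed by '>'" is one reading of the question, not something confirmed in the thread. It uses a /gi match with pos() to locate each candidate and substr() to inspect the neighbouring characters; split() is not needed for this variant.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the start offsets of every 'hello' that is
# NOT wrapped as a complete <hello> tag (assumed rule, see lead-in).
sub find_hello_offsets {
    my ($text) = @_;
    my @offsets;
    while ( $text =~ /hello/gi ) {
        my $end   = pos($text);               # offset just past this match
        my $start = $end - length('hello');   # offset where the match begins
        # Characters immediately around the match ('' at the string edges).
        my $before = $start > 0           ? substr( $text, $start - 1, 1 ) : '';
        my $after  = $end < length($text) ? substr( $text, $end, 1 )       : '';
        next if $before eq '<' && $after eq '>';    # complete <hello>: skip
        push @offsets, $start;
    }
    return @offsets;
}

# Prints "2,12,20": the first three occurrences, but not the one in <hello>.
print join( ',', find_hello_offsets('1.hello> 2.<hello 3.hello <hello>') ), "\n";
```

Working with offsets rather than one dense pattern keeps each decision (where the word is, what surrounds it) visible and easy to adjust if the rule turns out to be more involved.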

    Really, the true challenge of this kind of algorithm is “ruggedly and completely defining it.”   It probably will be a two-part solution.   (“First, find the strings, then, see if they’re interesting.”)   After you have used perldoc and then maybe a few experimental programs to confirm in your own mind how these various Perl tools work, spend some serious thought-time defining your algorithm.   It might not be entirely trivial.   I would go so far as to recommend constructing a series of test-cases with test-strings, and build a Test::More test suite to actually and completely test it.   You could easily construct a subtly flawed algorithm, bang it a few times, say, “yep, it seems to work,” and find that you are totally-wrong when your code goes into production.   It happens.   (A lot.)   And, it’s not pretty or fun.   The “extra” time needed to “prove it!!” will be worthwhile.
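
A Test::More suite along those lines might start like the sketch below. The function selects_hello and its placeholder body are hypothetical stand-ins for whatever algorithm is eventually chosen; the four test strings are the ones from the original question.

```perl
use strict;
use warnings;
use Test::More;

# Hypothetical function under test. The body is only a placeholder so the
# suite runs; swap in the real algorithm once it is defined.
sub selects_hello {
    my ($str) = @_;
    return ( $str =~ /hello/ && $str !~ /<hello>/ ) ? 1 : 0;
}

# Each case: input string, expected result, short description.
my @cases = (
    [ 'hello>',  1, 'trailing > only' ],
    [ '<hello',  1, 'leading < only' ],
    [ 'hello',   1, 'bare word' ],
    [ '<hello>', 0, 'complete tag' ],
);

for my $case (@cases) {
    my ( $input, $expected, $name ) = @$case;
    is( selects_hello($input), $expected, "$name: '$input'" );
}

done_testing();
```

Adding a row to @cases for every awkward input that turns up is a cheap way to pin the definition down before the code goes anywhere important.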

Re: regex to select text rather than remove HTML tags
by JavaFan (Canon) on Jan 23, 2012 at 14:21 UTC
    Untested:
    !/<hello>/ and /(hello)/ and print $1
      I need a plain regex that can be used as a condition. What I have come up with is: \bhello\b(?! ^\\w:-]*?>) Please help.
Re: regex to select text rather than remove HTML tags
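
For a single pattern usable as a condition, one possible sketch uses two lookaround alternatives. This pattern is an assumption and does not come from any reply in the thread: it accepts 'hello' when it is not preceded by '<', or when it is not followed by '>', so only a 'hello' that has both neighbours is rejected.

```perl
use strict;
use warnings;

# Assumed pattern (not from the thread): match 'hello' unless it is both
# directly preceded by '<' and directly followed by '>'.
my $re = qr/(?<!<)\bhello\b|\bhello\b(?!>)/;

# The four strings from the question.
for my $s ( 'hello>', '<hello', 'hello', '<hello>' ) {
    printf "%-8s %s\n", $s, ( $s =~ $re ? 'selected' : 'skipped' );
}
```

The first three strings are reported as selected and '<hello>' as skipped. The pattern only looks at the characters directly adjacent to the word, so something like '<hello world>' would still be selected; handling general tag content would need a different approach.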
by Veer (Initiate) on Jan 23, 2012 at 12:33 UTC
    That did not work. I want to select all of the following combinations: "<hello", "hello", and "hello>", but not "<hello>". Thanks for your help.
Re: regex to select text rather than remove HTML tags
by Veer (Initiate) on Jan 23, 2012 at 12:34 UTC
    That did not work. I want the following combinations to be selected: "<hello", "hello>", and "hello", but not "<hello>". Thanks for your help.

        The code seemed to work for me.

        Using pm_txt.txt as input for pm_regex.pl:

        pm_txt.txt

        1.hello>
        2.<hello
        3.hello
        <hello>

        pm_regex.pl

        use strict;
        use warnings;

        # Read the input file named on the command line.
        my $filename = shift or die "Usage: $0 FILENAME\n";
        open my $fh, '<', $filename or die "Could not open '$filename': $!\n";

        while (my $line = <$fh>) {
            chomp $line;
            # A line matches when it starts with "<digits>." and contains 'hello'.
            if ($line =~ /^\d+\..*?(hello).*$/) {
                print "In $line $1 matches\n";
            }
            else {
                print "$line doesn't match\n";
            }
        }
        close $fh;

        Running perl pm_regex.pl pm_txt.txt produced the output:

        In 1.hello> hello matches
        In 2.<hello hello matches
        In 3.hello hello matches
        <hello> doesn't match
