PerlMonks  

Re-define Word Boundary?

by JimJ (Acolyte)
on Nov 20, 2003 at 23:04 UTC (#308744)
JimJ has asked for the wisdom of the Perl Monks concerning the following question:

I've got a large database of parts that I need to search. I break all the records into words using:

while ( $rec =~ /\b\w+\b/g ) { $word = $&; }

That works well, except I don't want it to treat a slash ("/") as a word break. This is because a lot of words are notations like "a/c", "4-5/16", etc. What I'd like to do is redefine what "\b" is. I know that is not possible, but I'm trying to figure out a technique that would let me create something like the "\b" assertion, only one I could define myself. For example, I want to break a word at a space, comma, or period, but not at a slash.

Also, what characters does the "\b" assertion encompass? Is it any character that is not alpha, digit or underscore?

Replies are listed 'Best First'.
Re: Re-define Word Boundary?
by sauoq (Abbot) on Nov 20, 2003 at 23:20 UTC
    Also, what characters does the "\b" assertion encompass?

    It doesn't encompass any characters. It is zero-width and matches the break between word and non-word characters. In your example, it is completely superfluous. You could just as easily have used /\w+/g instead.

    If you want to include other characters, you can just use your own character class. You stated outright that you want to include a slash ("/"), but one of your examples, "4-5/16", indicates you'd like a dash too. So, try m![-/\w]+!g and see if that does the trick.
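A minimal sketch of both points above, assuming a made-up sample record (`$rec` here is hypothetical; only the notations "a/c" and "4-5/16" come from the question). It compares the original pattern with and without `\b`, then tries the custom character class:

```perl
use strict;
use warnings;

# Hypothetical sample record; the part notations come from the question.
my $rec = 'pump, a/c 4-5/16 bolt.';

my @with_b    = $rec =~ /\b\w+\b/g;    # the original pattern
my @without_b = $rec =~ /\w+/g;        # same pattern with \b dropped
my @custom    = $rec =~ m![-/\w]+!g;   # dash and slash count as word characters

print "with \\b:    @with_b\n";
print "without \\b: @without_b\n";
print "custom:     @custom\n";
```

Note that with this class a dash or slash standing on its own would also be returned as a "word", so real data may need an extra filter.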

    -sauoq
    "My two cents aren't worth a dime.";
    
Re: Re-define Word Boundary?
by davido (Archbishop) on Nov 20, 2003 at 23:11 UTC
    If I recall, a \b word boundary is synonymous with, or defined as, the following:

    /(?: (?<!\w)(?=\w) | (?<=\w)(?!\w) )/x

    That means that you can create your own version using zero-width lookahead and lookbehind assertions along with character classes.
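As a quick sanity check of that equivalence, the sketch below collects every offset at which each zero-width pattern matches and compares the two lists. The helper name `boundary_offsets` and the sample strings are made up for illustration, and a handful of samples is of course not a proof:

```perl
use strict;
use warnings;

my $builtin    = qr/\b/;
my $lookaround = qr/(?: (?<!\w)(?=\w) | (?<=\w)(?!\w) )/x;

# Collect every offset at which a zero-width pattern matches.
sub boundary_offsets {
    my ($re, $text) = @_;
    my @offsets;
    while ($text =~ /$re/g) {
        push @offsets, pos($text);
    }
    return @offsets;
}

# Hypothetical sample strings.
for my $sample ('a/c 4-5/16', 'foo', '&foo&', '') {
    my $got_builtin    = join ',', boundary_offsets($builtin,    $sample);
    my $got_lookaround = join ',', boundary_offsets($lookaround, $sample);
    my $verdict = $got_builtin eq $got_lookaround ? 'same' : 'DIFFERENT';
    print "'$sample': $verdict\n";
}
```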

    You may have a look at the entire Why do zero width assertions care about lookahead/behind? thread for further discussion on the \b metacharacter and zero-width assertions.


    Dave


    "If I had my life to live over again, I'd be a plumber." -- Albert Einstein
Re: Re-define Word Boundary?
by Abigail-II (Bishop) on Nov 20, 2003 at 23:32 UTC
    What I'd like to do is redefine what "\b" is.

    You can, by overloading regular expressions. The following is just a quick hack I threw together, but I think it catches most cases, including /[\b]/ and /\\b/.

    #!/usr/bin/perl

    use strict;
    use warnings;
    no warnings qw /syntax/;

    BEGIN {
        my $B = '(?:(?<=[\w/])(?=[^\w/])|' .
                '(?<=[^\w/])(?=[\w/])|'    .
                '^(?=[\w/])|(?<=[\w/])$)';

        $^H |= 0x30000;
        $^H {qr} = sub {
            local $_ = $_ [1];
            s/(\[\^?]?[^]]*]|[^\\]+|\\.)/$1 eq '\b' ? $B : $1/eg;
            $_;
        }
    }

    while (<DATA>) {
        chomp;
        print "$_: ";
        print $& if /\b\w+\b/;
        print "\n";
    }

    __DATA__
    foo
    &foo&
    &foo/bar&

    foo: foo
    &foo&: foo
    &foo/bar&:

    Abigail

      Please don't do that, unless you are planning on sticking with the same (tested) perl version forever. perldoc perlvar says:
      $^H WARNING: This variable is strictly for internal use only. Its availability, behavior, and contents are subject to change without notice.
      and
      %^H WARNING: This variable is strictly for internal use only. Its availability, behavior, and contents are subject to change without notice.
      See perldoc perlre "Creating custom RE engines" for the documented way to overload regular expressions.
Re: Re-define Word Boundary?
by thelenm (Vicar) on Nov 20, 2003 at 23:27 UTC

    \w matches alphanumerics and underscore. \b is effectively the same as using lookbehinds and lookaheads like this:

    (?:(?<=\w)(?=\W|\z)|(?:(?<=\W)|(?<=\A))(?=\w))

    Update: Hmm, or even nicer, as merlyn posted in •Re: Why do zero width assertions care about lookahead/behind? (code examples also updated),

    (?:(?<!\w)(?=\w)|(?<=\w)(?!\w))

    So to make a specialized version of \b that views "-" and "/" as "word characters" (sort of), you might use something like this:

    (?:(?<![\w/-])(?=[\w/-])|(?<=[\w/-])(?![\w/-]))

    So maybe something like this will suit you?

    my $w = '\w/-';
    my $b = "(?:(?<![$w])(?=[$w])|(?<=[$w])(?![$w]))";
    my @words = ($rec =~ /${b}[$w]+${b}/g);

    I've tested this a little but not a lot, and it seems all right. You'll want to verify it yourself before you go using it for anything important :-)
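For anyone who wants to run that, here is a self-contained sketch along the same lines. The variable names (`$class`, `$boundary`) and the sample record are made up for illustration; the class is kept in its own variable so that the scalar is never followed directly by `[`, which Perl could otherwise read as an array subscript:

```perl
use strict;
use warnings;

# Characters treated as "word characters": \w plus slash and dash.
my $class    = '[\w/-]';
my $boundary = qr/(?:(?<!$class)(?=$class)|(?<=$class)(?!$class))/;

# Hypothetical sample record.
my $rec = 'a/c unit, 4-5/16 bolt.';

my @words = $rec =~ /${boundary}${class}+${boundary}/g;
my @plain = $rec =~ /${class}+/g;   # same class, no boundary assertions

print "with custom boundary: @words\n";
print "class only:           @plain\n";
```

On this sample the two lists come out the same, which fits the point made elsewhere in the thread that the boundary assertions add nothing around a greedy run of the class. The custom boundary only earns its keep when something other than that class sits between the two assertions.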

    -- Mike

    --
    XML::Simpler does not require XML::Parser or a SAX parser. It does require File::Slurp.
    -- grantm, perldoc XML::Simpler

Re: Re-define Word Boundary?
by thospel (Hermit) on Nov 21, 2003 at 03:55 UTC
    You're not actually USING the \b, since you will match the same things without them (the \w+ will start at the earliest possible moment and stop at the last moment, both of which points will be \b boundaries).

    Now assuming you want to actually match "a/c" in the case of your example, you are also changing the meaning of \w, and your new \b still isn't used, so you can just as well write:

    while ($rec =~ m![\w/]+!g) { $word = $&; }
Re: Re-define Word Boundary?
by fletcher_the_dog (Friar) on Nov 21, 2003 at 16:16 UTC
    You might want to use split:
    my @words = split /[\s,.]+/,$rec;
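A small sketch of how that behaves, assuming made-up sample records. One thing to watch for: if the record starts with a separator, split returns an empty leading field, so the second half filters those out:

```perl
use strict;
use warnings;

# Hypothetical sample records.
my $rec     = 'a/c unit, 4-5/16 bolt.';
my $leading = ' a/c unit';

# Break at whitespace, commas and periods; slashes and dashes are kept.
my @words = split /[\s,.]+/, $rec;

# A leading separator yields an empty first field; drop empty fields.
my @raw      = split /[\s,.]+/, $leading;
my @filtered = grep { length } @raw;

print scalar(@words), " words: @words\n";
print scalar(@raw), " raw fields, ", scalar(@filtered), " after filtering\n";
```

Because "." is a separator here, a value such as "1.5" would be split in two, so this only fits data without decimal points.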
use overload qr
by ambrus (Abbot) on Nov 21, 2003 at 19:17 UTC

    You can use overload to change the meaning of regexps. It works like this: you can redefine the meaning of //, s///, qr//... literals so that perl applies some transformation to them before compiling them to internal form. This way, you can change \b's to some look{ahead,behind} expression, or better still, use some other escape sequence for that.

    I have no example code for this. I am sure I have once seen one, but I can't find it.

    Update: as ysth said, that code is in the perlre pod. Sorry.

      Abigail already posted it in the raw, overload.pm-less fashion.
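For reference, a rough sketch of the overload.pm route, using `overload::constant` with the `qr` key as in the perlre section mentioned above. It is deliberately simpler than Abigail's version: the handler does a plain substitution of `\b`, so it would also mangle `[\b]` and an escaped `\\b`. Treat it as an illustration of the mechanism, not a drop-in solution. The replacement pattern and the sample strings are assumptions carried over from earlier in the thread:

```perl
use strict;
use warnings;
use overload ();

my @hits;
{
    BEGIN {
        # Replacement for \b that treats "/" as a word character.
        my $boundary = '(?:(?<![\w/])(?=[\w/])|(?<=[\w/])(?![\w/]))';

        # Rewrite the constant parts of every regex literal compiled in
        # the rest of this enclosing block. Naive: every \b is replaced.
        overload::constant(qr => sub {
            my ($source) = @_;
            $source =~ s/\\b/$boundary/g;
            return $source;
        });
    }

    for my $text ('foo', '&foo&', '&foo/bar&') {
        push @hits, $text =~ /\b\w+\b/ ? $& : '';
    }
}

# Outside the block, \b has its usual meaning again.
my ($plain) = '&foo/bar&' =~ /(\b\w+\b)/;

print "inside:  ", join(', ', map { "'$_'" } @hits), "\n";
print "outside: '$plain'\n";
```

The handler only sees the literal text of a pattern at compile time, so anything interpolated into a regex at run time is not rewritten.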
