PerlMonks  

I think regex Should Help Here... but How!?

by ozboomer (Pilgrim)
on Feb 15, 2014 at 11:59 UTC (#1075042)
ozboomer has asked for the wisdom of the Perl Monks concerning the following question:

I'm falling-over with another regex problem that would appear to be fairly straightforward; I just can't seem to get my head to work with regex very well :(

I simply want to match a string like "comp.hw." against other strings that might be like "comp.hw.new" or "comp.sw.old", etc. The thinking might be to say that I want to match on the "2nd occurrence of a '.'"... and to retain the string up to that point... and then proceed to match the resultant string against whatever.

perlre... and "Perl Cookbook" don't really help me too much...

So, I've resorted to rubbish like the following code:

    #!/usr/bin/perl

    $master = "comp.hw.";

    @items = (
        "comp.",
        "comp.hw.",
        "comp.hw.new.",
        "comp.hw.hw.",
        "comp.sw.old.",
        "muse.hw.new.",
        "ancient."
    );

    for ($level = 1; $level < 4; $level++) {
        printf("Match Level: $level\n");    # $level = no. of '.' to match

        @m = split(/\./, $master);
        $num_m = @m;

        $buf = "";
        for ($i = 0; $i < $level; $i++) {
            if ($i >= $num_m) { last; }
            $buf .= $m[$i] . ".";
        }

        printf("Master: >%s<\n", $buf);

        foreach $str (@items) {
            if ($str =~ /^$buf/i) {
                printf("\t%s MATCHES %s\n", $str, $buf);
            } else {
                printf("\t%s DOES NOT MATCH %s\n", $str, $buf);
            }
        }

        printf("\n");
    }

...which produces the following output:-

Match Level: 1
Master: >comp.<
	comp. MATCHES comp.
	comp.hw. MATCHES comp.
	comp.hw.new. MATCHES comp.
	comp.hw.hw. MATCHES comp.
	comp.sw.old. MATCHES comp.
	muse.hw.new. DOES NOT MATCH comp.
	ancient. DOES NOT MATCH comp.

Match Level: 2
Master: >comp.hw.<
	comp. DOES NOT MATCH comp.hw.
	comp.hw. MATCHES comp.hw.
	comp.hw.new. MATCHES comp.hw.
	comp.hw.hw. MATCHES comp.hw.
	comp.sw.old. DOES NOT MATCH comp.hw.
	muse.hw.new. DOES NOT MATCH comp.hw.
	ancient. DOES NOT MATCH comp.hw.

Match Level: 3
Master: >comp.hw.<
	comp. DOES NOT MATCH comp.hw.
	comp.hw. MATCHES comp.hw.
	comp.hw.new. MATCHES comp.hw.
	comp.hw.hw. MATCHES comp.hw.
	comp.sw.old. DOES NOT MATCH comp.hw.
	muse.hw.new. DOES NOT MATCH comp.hw.
	ancient. DOES NOT MATCH comp.hw.

Any thoughts!?

Thanks.

Edit: Added "comp." to the test array to complete that part of the picture... and have included the output produced by the code -OzB

Replies are listed 'Best First'.
Re: I think regex Should Help Here... but How!?
by graff (Chancellor) on Feb 15, 2014 at 15:55 UTC
    I'm afraid I don't understand what you're trying to do. You said:

    I want to match on the "2nd occurrence of a '.'"... and to retain the string up to that point…

    That's all clear enough, but:

    and then proceed to match the resultant string against whatever.

    Are you using "resultant" to mean "remaining"? And what does "match against whatever" mean? Can you give an example of a "before" and "after" to make it clear? Something like "given this string as input: … I want to have one variable set to … and another set to … (and another set to …)"

    As for your sample code, your "$master" only has two "."-delimited parts, so when you loop over it three times, the 2nd and 3rd iterations are the same. That certainly does seem pointless, but there's nothing in your post that says what the point is supposed to be.

Re: I think regex Should Help Here... but How!?
by kcott (Chancellor) on Feb 16, 2014 at 05:26 UTC

    G'day ozboomer,

    "I just can't seem to get my head to work with regex very well :("
    ...
    "perlre... and "Perl Cookbook" don't really help me too much..."

    Before delving into perlre, I'd recommend reading through perlrequick and perlretut.

    I believe the following code does what you want. (See Notes at the end.)

        #!/usr/bin/env perl -l

        use strict;
        use warnings;

        my $master = 'comp.hw.';

        my @items = qw{
            comp. comp.hw. comp.hw.new. comp.hw.hw. comp.sw.old. muse.hw.new. ancient.
            comp  comp.hw  comp.hw.new  comp.hw.hw  comp.sw.old  muse.hw.new  ancient
        };

        for my $level (1 .. split /\./, $master, -1) {
            print "Match Level: $level";
            my $re = '^' . join('\.' => (split /\./, $master, $level + 1)[0 .. $level - 1]);
            $re .= $re =~ /\.$/ ? '[^.]' : '(?:[.]|$)';
            print "Matching: /$re/";
            print "\t$_" for grep { /$re/ } @items;
        }

    Output:

        Match Level: 1
        Matching: /^comp(?:[.]|$)/
                comp.
                comp.hw.
                comp.hw.new.
                comp.hw.hw.
                comp.sw.old.
                comp
                comp.hw
                comp.hw.new
                comp.hw.hw
                comp.sw.old
        Match Level: 2
        Matching: /^comp\.hw(?:[.]|$)/
                comp.hw.
                comp.hw.new.
                comp.hw.hw.
                comp.hw
                comp.hw.new
                comp.hw.hw
        Match Level: 3
        Matching: /^comp\.hw\.[^.]/
                comp.hw.new.
                comp.hw.hw.
                comp.hw.new
                comp.hw.hw

    Notes:

    • Your OP had all @items with a terminal dot; your clarification later in the thread suggests this isn't the case: I've used both in my code to show it handles either case.
    • Programmatically determining the levels, rather than using a hard-coded value, allows the script to be reused with any value for $master.
    • Note that I've used the strict and warnings pragmata. I strongly recommend you get into the habit of doing this also.
    • Overall, your code has a C-like feel to it (e.g. instances of for (;;) and printf()); I've provided a more Perl-like way of doing it.
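
    As a rough illustration of the pattern-building idea in the code above, here is a minimal sketch that wraps it in a sub. The name level_pattern is hypothetical, and the sketch makes assumptions the code above does not: it ignores a trailing dot on the category, escapes each word with quotemeta, and matches case-insensitively (as the OP's /i did). It is a variant for discussion, not a replacement for the code above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): build a compiled pattern that
# matches the first $level dot-delimited words of $category, followed by
# either a dot or the end of the string.
sub level_pattern {
    my ($category, $level) = @_;

    # Drop empty fields so 'comp.hw.' and 'comp.hw' behave the same (assumption).
    my @words = grep { length } split /\./, $category;
    return if $level < 1 || $level > @words;

    # quotemeta() keeps any regex metacharacters in the words literal.
    my $prefix = join '\.', map { quotemeta($_) } @words[0 .. $level - 1];
    return qr/\A$prefix(?:\.|\z)/i;
}

my $re = level_pattern('comp.hw.', 2);
for my $item (qw(comp comp.hw comp.hw.new COMP.HW. comp.hwx comp.sw.old)) {
    printf "%-12s %s\n", $item, ($item =~ $re ? 'MATCHES' : 'does not match');
}
```

    Anchoring on "dot or end of string" is what stops 'comp.hw' from matching 'comp.hwx'. For a level deeper than the category has words, the sub returns nothing, so the caller can skip that level.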

    -- Ken

Re: I think regex Should Help Here... but How!?
by AnomalousMonk (Chancellor) on Feb 15, 2014 at 12:51 UTC

    One possible regex-based approach:

        c:\@Work\Perl>perl -wMstrict -le
        "my $master = 'comp.hw.new.';
         ;;
         my @items = qw(
           comp.hw.  comp.hw.new.  comp.hw.hw.
           comp.sw.old.  muse.hw.new.  ancient.
           );
         ;;
         LEVEL:
         for my $level (1 .. 4) {
           print qq{Match Level: $level};
           ;;
           my ($buf) = $master =~ m{ \A (?: [^.]* [.]){$level} }xmsg
             or print qq{no master match at level $level};
           next LEVEL unless defined $buf;
           print qq{ Master: '$buf'};
           ;;
           for my $str (@items) {
             if ($str =~ m{ \A (?i) \Q$buf\E }xms) {
               print qq{ '$str' matches '$buf'};
               }
             else {
               print qq{ '$str' DOES NOT match '$buf'};
               }
             }
           }
         "
        Match Level: 1
         Master: 'comp.'
         'comp.hw.' matches 'comp.'
         'comp.hw.new.' matches 'comp.'
         'comp.hw.hw.' matches 'comp.'
         'comp.sw.old.' matches 'comp.'
         'muse.hw.new.' DOES NOT match 'comp.'
         'ancient.' DOES NOT match 'comp.'
        Match Level: 2
         Master: 'comp.hw.'
         'comp.hw.' matches 'comp.hw.'
         'comp.hw.new.' matches 'comp.hw.'
         'comp.hw.hw.' matches 'comp.hw.'
         'comp.sw.old.' DOES NOT match 'comp.hw.'
         'muse.hw.new.' DOES NOT match 'comp.hw.'
         'ancient.' DOES NOT match 'comp.hw.'
        Match Level: 3
         Master: 'comp.hw.new.'
         'comp.hw.' DOES NOT match 'comp.hw.new.'
         'comp.hw.new.' matches 'comp.hw.new.'
         'comp.hw.hw.' DOES NOT match 'comp.hw.new.'
         'comp.sw.old.' DOES NOT match 'comp.hw.new.'
         'muse.hw.new.' DOES NOT match 'comp.hw.new.'
         'ancient.' DOES NOT match 'comp.hw.new.'
        Match Level: 4
        no master match at level 4

    Updates:

    1. Oops... Added  \Q...\E to original  m{ \A (?i) $buf }xms because synthesized pattern contains literal  '.' (period) characters. Output unchanged. (quotemeta-ing would be a good idea in any event.)
    2. graff has observed below that the written specs in ozboomer's OP are just a tad bit vague, and I have to agree. What I did was to assume the output of the code (however rubbishy) given in the OP was correct and, with my understanding of the specs, write code giving essentially the same output.

Re: I think regex Should Help Here... but How!?
by ozboomer (Pilgrim) on Feb 15, 2014 at 23:21 UTC

    Many thanks for the postings, folks.. and profuse apologies for the somewhat vague requirements -- I really shouldn't try to be overly creative at (what is for me) such a late hour...!

    Maybe things would be clearer if I step back some and explain the original issue...

    I have some records where one of the fields is a category, of sorts. It can be a simple, single item ('comp') or it can be a composite ('muse.new').

    I start doing my processing by running through the complete set of records (in the 100s of 100+ character records, so not large), noting the record category in a hash (and storing other data besides). As a by-product, I might note that category strings can range from 1 to, say, 3 dot-delimited elements; for example, the formats of the categories might match one of 'aaa' (no dots), 'bbbb.cc' (one dot) and 'dd.eeee.f' (two dots).

    Now, let's define a 'level' as the number of 'words' in the category (as determined by the dot delimiters). So, a 'Level 1' category would include 'comp' and 'muse' but would NOT include '' (null) nor 'comp.hw'.

    Similarly, a 'Level 2' category would include 'comp.hw' and 'muse.new' but would NOT include 'garden.hw.new' nor 'magic.ancient.toys.tin'.

    ...and so it goes through all the 'levels' that I'd found in my initial pass through all the data records.

    So, in some sort of pseudocode, we might progress like:-

        j = 1
        while (j <= 3)
            hash = (null)
            test list = (null)

            for each item in master category list
                if category is a 'Level j'
                    add category to test list
                endif
            endfor

            for each data record
                get record category, record data
                for each test category
                    if record category matches test category at level 'j'
                        hash{test category} += record data
                    endif
                endfor
            endfor

            foreach key in hash
                output hash(key)
            endfor

            j += 1
        endwhile

    Thus, we'd end up with an output that is something like:

    At Level 1:
       comp = 100   (includes comp, comp.hw, comp.sw...)
       muse = 200   (includes muse, muse.new, ...)
       
    At Level 2:
       comp.hw = 100   (includes comp.hw, NOT comp...)
       comp.sw = 200   (includes comp.sw, comp.sw.old, comp.sw.new...)
       
       ...
    

    Does that make things clearer?
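
    The pseudocode above could be sketched in Perl roughly as follows. This is a simplification rather than a line-by-line translation: instead of building a test list per level and matching every record against it, it adds each record's value to every one of that record's own prefixes. One consequence is that it also reports prefixes that never occur as a category on their own. The sample records and the name totals_by_level are made up for illustration, and lowercasing the category is an assumption carried over from the case-insensitive match in the OP.

```perl
use strict;
use warnings;

# Hypothetical sample records ("category,value"); not the poster's real data.
my @records = (
    'comp,10',
    'comp.hw,20',
    'comp.sw.old,30',
    'comp.sw.new,40',
    'muse.new,50',
);

# Returns a hash ref of the form { level => { prefix => total } }.
sub totals_by_level {
    my @lines = @_;
    my %total;
    for my $line (@lines) {
        my ($category, $value) = split /,/, $line;
        my @words = split /\./, lc $category;

        # A record with N words contributes to levels 1 .. N only,
        # so 'comp' never counts towards 'comp.hw'.
        for my $level (1 .. scalar @words) {
            my $prefix = join '.', @words[0 .. $level - 1];
            $total{$level}{$prefix} += $value;
        }
    }
    return \%total;
}

my $total = totals_by_level(@records);
for my $level (sort { $a <=> $b } keys %$total) {
    print "At Level $level:\n";
    for my $prefix (sort keys %{ $total->{$level} }) {
        printf "   %s = %d\n", $prefix, $total->{$level}{$prefix};
    }
}
```

    With the sample data, 'comp' totals 100 at Level 1 (all four comp records) and 'comp.sw' totals 70 at Level 2. No regex is needed for the matching itself; split does the work.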

      Hi ozboomer.

      Some questions:

      1. Does the code of your OP, however inelegant it may be, do what you want done? (For instance, I notice that it will never give you the plain  'comp' (no dot) permutation of the master string that your post above seems to require: "It can be a simple, single item ('comp') ..."). If the code does what you want, are you looking for better ways to do the same thing?
      2. If the code does not do what you want, how does the output that is produced differ from the output you want? Please give concrete, concise comparisons.
      3. Your post above mentions data records. Can you give a few concise examples of these records?
      4. Does the code of my previous post do anything to address your concerns? If so, how so? If not, how not?
      As it's the weekend, please be patient awaiting responses.

        (see the updates I've made to the OP)

        Sure, the ugly code produces the output I want -- that the supplied category (albeit, with an additional trailing dot), which comes from the list of possible categories, does match the category in the record (again, doctored with a trailing dot), when considering the 'level of matching' required.

        Although it's not my current application, perhaps think of the number of postings in the Usenet hierarchies. The data might be:

        comp.lang.c,100
        comp.lang.beta,23
        comp.lang.java.help,123
        comp.object,12
        alt.3d,12
        alt.animals.llama,1423
        ...
        

        The types of question I'm looking to answer:

        "How many postings are there in the 'comp' hierarchy and below?"

        For this question, we can say:

        Matches: comp, comp.lang, comp.lang.c... (the group names all start with 'comp')
        
        Do not Match: alt, alt.3d, alt.animals.llama... (the group names do not start with 'comp')
        

        "How many postings are there in the 'alt.*' hierarchy and below?"

        For this question, we can say:

        Matches: alt.3d, alt.animals.llama... (the group names all start with 'alt.{something}' and {something} is non-null)
        
        Do not match: alt, comp, comp.lang.c... (the group names do NOT start with 'alt.{something}' and {something} is non-null)
        

        Conceptually, it's such a simple thing: "Does RECORD CAT start with the TEST string?" ...

        TEST         RECORD          MATCHES?
        
        comp         comp.lang       Yes
        comp         comp            Yes
        comp         comp.hw         No
        
        comp.lang    comp            No
        comp.lang    comp.lang       Yes
        comp.lang    comp.lang.c     Yes
        comp.lang    comp.lang.c++   Yes
        comp.lang    alt             No
        comp.lang    alt.test        No
        

        This is why I was thinking there must be a simple regex thing to say "give me the first 2 items from the category string" (using parentheses and a dot or end-of-string as the separator - 'comp.lang') and I'll compare that to the start of the record string (in a simple regex: /^$rec_string\.*$/ or something).....
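
        A minimal sketch of both halves of that idea, under some assumptions: the helper names first_words and in_hierarchy are hypothetical, words are taken to be non-empty runs of non-dot characters, and the comparison ignores case. It follows the rule as stated in prose ("Does RECORD CAT start with the TEST string?", at a dot boundary), so by that rule TEST 'comp' against RECORD 'comp.hw' comes out as a match.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first $n dot-delimited words of $category,
# or undef if it has fewer than $n words.
sub first_words {
    my ($category, $n) = @_;
    return undef if $n < 1;
    my $more = $n - 1;    # words after the first, each preceded by a dot
    return $category =~ /\A([^.]+(?:\.[^.]+){$more})(?:\.|\z)/ ? $1 : undef;
}

# Hypothetical helper: true if $record equals $test or continues it at a
# dot boundary. \Q...\E keeps the dots (and things like '+') literal.
sub in_hierarchy {
    my ($record, $test) = @_;
    return $record =~ /\A\Q$test\E(?:\.|\z)/i ? 1 : 0;
}

my $test = first_words('comp.lang.c', 2);    # 'comp.lang'
for my $record (qw(comp comp.lang comp.lang.c comp.lang.c++ alt alt.test)) {
    printf "%-10s %-14s %s\n", $test, $record,
        in_hierarchy($record, $test) ? 'Yes' : 'No';
}
```

        The (?:\.|\z) at the end is the "dot or end-of-string as the separator" part; without it, 'comp.lang' would also match something like 'comp.language'.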

        ...shaking his head in bewilderment...

Re: I think regex Should Help Here... but How!?
by karlgoethebier (Prior) on Feb 15, 2014 at 13:05 UTC
    "...to retain the string up to that point..."

    Perhaps something like this matches: /comp\.hw\.(.+)/. What you want should be in $1.

    Update: Maybe I missed something about the specs. If so, I regret it. I posted this in a hurry, not good.

    Regards, Karl

    «The Crux of the Biscuit is the Apostrophe»

Re: I think regex Should Help Here... but How!?
by RichardK (Parson) on Feb 15, 2014 at 12:30 UTC

    As you're only trying to match the beginning of a string with no wildcards, using index seems like a simpler approach.

    foreach (@items) { say "$_ match" if index($_,$master) == 0; }

      This from ozboomer's OP

      if ($str =~ /^$buf/i) { ... }
      indicates a case-insensitive match is required, so index would not be appropriate.
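
      That said, one possible way to keep index and still ignore case is to fold both strings first. A minimal sketch, assuming plain ASCII category names (for full Unicode case folding, fc from Perl 5.16 onwards would be the safer choice); the name starts_with_ci is hypothetical:

```perl
use strict;
use warnings;

# Case-insensitive "starts with" using index(): lc() folds both sides first,
# and a return value of 0 means the prefix was found at the very start.
sub starts_with_ci {
    my ($str, $prefix) = @_;
    return index(lc($str), lc($prefix)) == 0 ? 1 : 0;
}

my $master = 'comp.hw.';
for my $item ('Comp.HW.new.', 'comp.hw.', 'comp.sw.old.') {
    printf "%s %s\n", $item, starts_with_ci($item, $master) ? 'match' : 'no match';
}
```

      Because index treats its arguments as plain strings, the dots need no escaping here.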

Re: I think regex Should Help Here... but How!?
by ozboomer (Pilgrim) on Feb 16, 2014 at 12:32 UTC

    Again, many thanks for the assistance.. and methinks I need to knuckle-down and get organized to study some regex *properly*..!

    Anyway, here are the two functions I'm comparing (at the bottom of this posting); my original 'TestMCa' and a new sub based on kcott's suggestion, 'TestMCb'.

    I know there's a lot of repetition of the same thing being done with these sorts of subs (I repeatedly work-out the '1st level' of the old $master string each time I call the sub, which could be done differently, if I was really worried about it) ...but I'll probably re-work that at some time... but for the moment, both of these subs work Ok.

    So, I've been trying them both in my 'real' code... but you could just as easily check them out with some code that reads a file of Usenet groups, as I mentioned before... and the whole record is the 'category'.

    I did some testing with each sub... matching categories on the '1st level' only ('one word')... and processing about 700 records. I ran the program a few times with each option, so as to get rid of any variances of caching, program RAM space, etc... and the results follow:-

        TestMCa: ELAPSED: 0.0775  0.0767  0.0751  0.0758 
        TestMCb: ELAPSED: 0.0658  0.0643  0.0641  0.0659 
                  Saving:    15%     16%     15%     13%
    

    I'd used the Time::HiRes module to simply check the elapsed time to run each sub, so the results are in seconds.
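
    For anyone wanting to repeat this kind of measurement, a minimal sketch of elapsed-time timing with Time::HiRes might look like the following. The sub work is a hypothetical stand-in for TestMCa or TestMCb, which are not reproduced here.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical stand-in for the sub being timed.
sub work {
    my $n = 0;
    $n += $_ for 1 .. 100_000;
    return $n;
}

my $t0      = [gettimeofday];      # start time as [seconds, microseconds]
my $result  = work();
my $elapsed = tv_interval($t0);    # floating-point seconds since $t0

printf "ELAPSED: %.4f\n", $elapsed;
```

    For comparing two subs over many iterations, the core Benchmark module (for example its cmpthese function) is another option and tends to smooth out run-to-run noise.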

    Like I so often see in Perl, you can write something pretty awful and something pretty clever and the performance will not be very different... but I guess the difference can be significant if you're doing an operation a few million times. In my application, the difference is practically nothing... and I'll only invoke the program a handful of times in some analysis.. so it's really a moot point, I guess, to have even worried about the rubbishy sub... *hmmm*

    Still, it's all good to learn... and it gives me another boot to get onto working with regex 'seriously' :)

    Oh... kcott - a couple of notes for you... It was the whole point of the exercise to get the code to determine the strings to match; hence, providing a 'level' argument to the subs. I take your point about 'use strict' and 'use warnings' ... and I admit I'm slack about that. ...and true enough about the "C-like feel" in my coding -- but my home node might provide some insight into where that comes from :)

    Thanks again, everyone, for all your help. I appreciate it a lot.

      I've just revisited this thread and noticed your new post. (As it was a reply to your OP, I didn't see it earlier.)

      "...and true enough about the "C-like feel" in my coding -- but my home node might provide some insight into where that comes from :)"

      I check home nodes before replying: it often gives a hint as to how to frame the answer. And, yes, I did see "Programming in C since 1989, ..." as well as the Location :-)

      -- Ken
