PerlMonks  

Re: I think regex Should Help Here... but How!?

by ozboomer (Friar)
on Feb 15, 2014 at 23:21 UTC


in reply to I think regex Should Help Here... but How!?

Many thanks for the postings, folks, and profuse apologies for the somewhat vague requirements. I really shouldn't try to be overly creative at (what is for me) such a late hour!

Maybe things would be clearer if I step back a bit and explain the original issue.

I have some records where one of the fields is a category, of sorts. It can be a simple, single item ('comp') or it can be a composite ('muse.new').

I start my processing by running through the complete set of records (a few hundred records of 100+ characters each, so not large), noting each record's category in a hash (and storing other data besides). As a by-product, I note that category strings can range from 1 to, say, 3 dot-delimited elements; for example, a category might look like 'aaa' (no dots), 'bbbb.cc' (one dot) or 'dd.eeee.f' (two dots).

Now, let's define a 'level' as the number of 'words' in the category (as determined by the dot delimiters). So, a 'Level 1' category would include 'comp' and 'muse' but would NOT include '' (null) nor 'comp.hw'.

Similarly, a 'Level 2' category would include 'comp.hw' and 'muse.new' but would NOT include 'garden.hw.new' nor 'magic.ancient.toys.tin'.

...and so it goes through all the 'levels' that I'd found in my initial pass through all the data records.
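
The 'level' definition above can be sketched in a few lines of Perl. This is only an illustration: category_level is a hypothetical helper name, and it assumes an empty category counts as level 0 (so it never qualifies as 'Level 1').

```perl
use strict;
use warnings;

# Hypothetical helper: the number of dot-delimited 'words' in a category.
# An undefined or empty category is treated as level 0 (an assumption).
sub category_level {
    my ($category) = @_;
    return 0 if !defined $category || $category eq '';
    my @words = split /\./, $category;
    return scalar @words;
}

for my $cat ('comp', 'comp.hw', 'garden.hw.new', 'magic.ancient.toys.tin', '') {
    printf "%-26s level %d\n", "'$cat'", category_level($cat);
}
```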

So, in some sort of pseudocode, we might progress like:

j = 1
while (j <= 3)
   hash = (null)
   test list = (null)

   for each item in master category list
      if category is a 'Level j'
         add category to test list
      endif
   endfor

   for each data record
      get record category, record data
      for each test category
         if record category matches test category at level 'j'
            hash{test category} += record data
         endif
      endfor
   endfor

   foreach key in hash
      output hash(key)
   endfor

   j += 1
endwhile

Thus, we'd end up with an output that is something like:

At Level 1:
   comp = 100   (includes comp, comp.hw, comp.sw...)
   muse = 200   (includes muse, muse.new, ...)
   
At Level 2:
   comp.hw = 100   (includes comp.hw, NOT comp...)
   comp.sw = 200   (includes comp.sw, comp.sw.old, comp.sw.new...)
   
   ...
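
One possible rendering of that pseudocode in Perl follows. It is a sketch under stated assumptions, not the code from the original post: the sample records and the name totals_at_level are made up for illustration, the master category list is taken to be the set of categories seen in the records, and a record is assumed to "match" a test category when it equals it or continues it after a dot.

```perl
use strict;
use warnings;

# Hypothetical sample records: [category, count]. Not the poster's data.
my @records = (
    [ 'comp',        10 ],
    [ 'comp.hw',     20 ],
    [ 'comp.sw',     30 ],
    [ 'comp.sw.old',  5 ],
    [ 'muse',         3 ],
    [ 'muse.new',     7 ],
);

# Total the counts for every category found at the given level. A record
# contributes when its category equals the test category or sits below it.
sub totals_at_level {
    my ($level, @recs) = @_;

    # Master category list (unique), then the test list for this level.
    my %seen  = map { $_->[0] => 1 } @recs;
    my @tests = grep { $_ ne '' && ($_ =~ tr/.//) + 1 == $level } keys %seen;

    my %total;
    for my $rec (@recs) {
        my ($cat, $count) = @$rec;
        for my $test (@tests) {
            # Compare on a dot boundary so 'comp' does not match 'compost'.
            $total{$test} += $count
                if $cat eq $test || index($cat, "$test.") == 0;
        }
    }
    return %total;
}

for my $level (1 .. 3) {
    my %total = totals_at_level($level, @records);
    print "At Level $level:\n";
    print "   $_ = $total{$_}\n" for sort keys %total;
    print "\n";
}
```

With the sample data this reports comp = 65 and muse = 10 at Level 1; comp.hw = 20, comp.sw = 35 and muse.new = 7 at Level 2; and comp.sw.old = 5 at Level 3.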

Does that make things clearer?

Re^2: I think regex Should Help Here... but How!?
by AnomalousMonk (Archbishop) on Feb 16, 2014 at 00:49 UTC

    Hi ozboomer.

    Some questions:

    1. Does the code of your OP, however inelegant it may be, do what you want done? (For instance, I notice that it will never give you the plain 'comp' (no dot) permutation of the master string that your post above seems to require: "It can be a simple, single item ('comp') ...".) If the code does what you want, are you looking for better ways to do the same thing?
    2. If the code does not do what you want, how does the output that is produced differ from the output you want? Please give concrete, concise comparisons.
    3. Your post above mentions data records. Can you give a few concise examples of these records?
    4. Does the code of my previous post do anything to address your concerns? If so, how so? If not, how not?
    As it's the weekend, please be patient awaiting responses.

      (see the updates I've made to the OP)

      Sure, the ugly code produces the output I want: it confirms that the supplied category (albeit with an additional trailing dot), which comes from the list of possible categories, matches the category in the record (again, doctored with a trailing dot) at the required 'level of matching'.

      Although it's not my current application, perhaps think of the number of postings in the Usenet hierarchies. The data might be:

      comp.lang.c,100
      comp.lang.beta,23
      comp.lang.java.help,123
      comp.object,12
      alt.3d,12
      alt.animals.llama,1423
      ...
      

      The types of question I'm looking to answer:

      "How many postings are there in the 'comp' hierarchy and below?"

      For this question, we can say:

      Matches: comp, comp.lang, comp.lang.c... (the group names all start with 'comp')
      
      Do not Match: alt, alt.3d, alt.animals.llama... (the group names do not start with 'comp')
      

      "How many postings are there in the 'alt.*' hierarchy and below?"

      For this question, we can say:

      Matches: alt.3d, alt.animals.llama... (the group names all start with 'alt.{something}' and {something} is non-null)
      
      Do not match: alt, comp, comp.lang.c... (the group names do NOT start with 'alt.{something}' and {something} is non-null)
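
The 'alt.*' question (strictly below 'alt', with a non-null {something}) can be expressed as a single anchored regex. A minimal sketch, where is_below is a hypothetical helper name and \Q...\E is used so that any dots in the parent name are matched literally:

```perl
use strict;
use warnings;

# Hypothetical helper: true when $group lies strictly below $parent,
# i.e. it is "$parent." followed by at least one more character.
sub is_below {
    my ($parent, $group) = @_;
    return $group =~ /^\Q$parent\E\..+/ ? 1 : 0;
}

for my $group (qw(alt alt.3d alt.animals.llama comp comp.lang.c)) {
    printf "%-20s %s\n", $group,
        is_below('alt', $group) ? 'match' : 'no match';
}
```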
      

      Conceptually, it's such a simple thing: "Does RECORD CAT start with the TEST string?" ...

      TEST         RECORD          MATCHES?
      
      comp         comp.lang       Yes
      comp         comp            Yes
      comp         comp.hw         Yes
      
      comp.lang    comp            No
      comp.lang    comp.lang       Yes
      comp.lang    comp.lang.c     Yes
      comp.lang    comp.lang.c++   Yes
      comp.lang    alt             No
      comp.lang    alt.test        No
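
The "does RECORD start with TEST?" check can be sketched as one regex, provided the match is forced to end on a dot or at the end of the string. The helper name starts_with_category is made up for this example, and the cases are a selection of rows from the table above.

```perl
use strict;
use warnings;

# Hypothetical helper: true when $record equals $test or continues it
# after a dot, so 'comp.lang' matches 'comp.lang.c++' but not 'comp'.
sub starts_with_category {
    my ($test, $record) = @_;
    return $record =~ /^\Q$test\E(?:\.|\z)/ ? 1 : 0;
}

my @cases = (
    [ 'comp',      'comp.lang' ],
    [ 'comp',      'comp' ],
    [ 'comp.lang', 'comp' ],
    [ 'comp.lang', 'comp.lang' ],
    [ 'comp.lang', 'comp.lang.c' ],
    [ 'comp.lang', 'comp.lang.c++' ],
    [ 'comp.lang', 'alt' ],
    [ 'comp.lang', 'alt.test' ],
);

for my $case (@cases) {
    my ($test, $record) = @$case;
    printf "%-12s %-15s %s\n", $test, $record,
        starts_with_category($test, $record) ? 'Yes' : 'No';
}
```

The \Q...\E quoting matters here: without it, the dot in 'comp.lang' would match any character.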
      

      This is why I was thinking there must be a simple regex thing to say "give me the first 2 items from the category string" (using parentheses and a dot or end-of-string as the separator - 'comp.lang') and I'll compare that to the start of the record string (in a simple regex: /^$rec_string\.*$/ or something).....
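
The "give me the first 2 items" idea might look something like this. It is a sketch only: first_words is a hypothetical helper, and it assumes words are non-empty runs of non-dot characters.

```perl
use strict;
use warnings;

# Hypothetical helper: the first $n dot-delimited words of $category,
# or undef when the category has fewer than $n words.
sub first_words {
    my ($category, $n) = @_;
    my $more = $n - 1;    # words after the first, each preceded by a dot
    return $category =~ /^([^.]+(?:\.[^.]+){$more})(?:\.|\z)/ ? $1 : undef;
}

my $test = 'comp.lang';
for my $record (qw(comp comp.lang comp.lang.c comp.lang.c++ alt.test)) {
    my $head = first_words($record, 2);
    my $hit  = defined $head && $head eq $test;
    printf "%-15s %s\n", $record, $hit ? 'Yes' : 'No';
}
```

Truncating the record to the level of the test string and comparing with eq avoids interpolating the test string into a pattern at all.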

      ...shaking his head in bewilderment...

        I had thought I had given a 'pure' regex approach (for what it's worth) that satisfied your original request, one that can easily be adjusted for the terminal-dot versus no-terminal-dot alternatives; which of these you require is a point I still do not quite grasp. Your response to kcott's reply below indicates you are satisfied with the code you have now, so I will not comment further along these lines.

        However, I would encourage you to become familiar with regular expression techniques and be wildered no more! In addition to the valuable links given by others in this thread, I have found Jeffrey Friedl's (admittedly rather expensive) book Mastering Regular Expressions to be very helpful; see his site.

        Update: Or perhaps I should have said "be less wildered", for even though I've been using and studying regexes a long time now, I still regularly trip over them and fall flat on my face! But hang in there and enlightenment will come.

        Nope, just couldn't leave it alone. Here's a solution to analyzing the Usenet hierarchies data. Note this is cumulative: repeated entries add together.
