PerlMonks  

Simple RegEx Substring Extraction from a Delimited Text Record

by ozboomer (Friar)
on Mar 15, 2006 at 05:11 UTC (#536777=perlquestion)

ozboomer has asked for the wisdom of the Perl Monks concerning the following question:

A simple query, this one, I think... but of interest to those who aren't too flash on regex (like me!)

I have a text record and I would like to extract the 2nd field from it, viz:

SLOT3=4,4,2!INT=115!VC=4!CS=270!PK=/

Now, the simple but kind-of-obscure way to do it might be:

    $rec = "SLOT3=4,4,2!INT=115!VC=4!CS=270!PK=/";
    ($int_spec) = (split("!", $rec))[1];
    $int = $int_spec;
    $int =~ s/INT=//;
    printf("\$int: $int\n");

The somewhat better way, although more obscure for people not used to regex, might be:

    $rec = "SLOT3=4,4,2!INT=115!VC=4!CS=270!PK=/";
    $int = ($rec =~ /(INT=([0-9]*)!)/)[1];
    printf("\$int: $int\n");

Perhaps there's an even more elegant way to do it using regex?

At any rate, I generally go for the simplest way of doing something, as the PCs we use these days are getting pretty quick (yes, Win32 development again)... but I want to get some more practice with using regex.

I looked in the on-line docs, amongst the monks' articles and in the O'Reilly regex and cookbook books but couldn't find something simple... So I thought I'd drop a question in here.

Something else... In terms of maintenance, what is the 'best'(!?) way to go? Assume my maintainer is an expert in regex or go for the "lowest common denominator" (assuming performance isn't critical and simplicity is king)?

Would appreciate any thoughts....

Replies are listed 'Best First'.
Re: Simple RegEx Substring Extraction from a Delimited Text Record
by Samy_rio (Vicar) on Mar 15, 2006 at 05:47 UTC

    Hi ozboomer, Try this,

    use strict;
    use warnings;

    my ($int) = (split "!(?:INT=)?", "SLOT3=4,4,2!INT=115!VC=4!CS=270!PK=/")[1];
    print "Method 1 :\t\$int: $int\n";

    my ($int1) = ("SLOT3=4,4,2!INT=115!VC=4!CS=270!PK=/") =~ m/\!INT=([^\!]+)/;
    print "Method 2 :\t\$int: $int1\n";

    Comparison of the above methods:

    use strict;
    use warnings;
    use Benchmark 'cmpthese';

    cmpthese(-1, {
        method1 => 'my ($int) = (split "!(?:INT=)?", "SLOT3=4,4,2!INT=115!VC=4!CS=270!PK=/")[1]',
        method2 => 'my ($int) = ("SLOT3=4,4,2!INT=115!VC=4!CS=270!PK=/") =~ m/\!INT=([^\!]+)/',
    });

    __END__
                Rate method1 method2
    method1 133487/s      --    -64%
    method2 366033/s    174%      --

    Regards,
    Velusamy R.


    eval"print uc\"\\c$_\""for split'','j)@,/6%@0%2,`e@3!-9v2)/@|6%,53!-9@2~j';

Re: Simple RegEx Substring Extraction from a Delimited Text Record
by duckyd (Hermit) on Mar 15, 2006 at 05:50 UTC
    For parsing a delimited text format like the one you've described, use split. Even if the next person who has to maintain your code is a "regex expert", assuming they know perl they'll expect split to be used in cases like this. As for your example, I find the regex example very confusing. It's certainly not suitable for obtaining anything except the second field, and requires that the second field be of a particular form. If you wanted all fields from a regex, you'd need something like:
    /([^!]*)!([^!]*)!([^!]*)!([^!]*)!([^!]*)/
    which is surely a lot less readable or maintainable than
    split /!/
    If you just want the numbers between INT= and the following !, you might do something more like:
    my ($int) = $rec =~ /!INT=([0-9]+)!/;
    Note that doesn't guarantee that the value will come from the second field, but neither did your example. If you wanted to do that, you might use:
    my ($int) = $rec =~ /^[^!]*!INT=([0-9]+)!/;
    One final note, you should be using print rather than printf in your examples.
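
A minimal sketch of how the two patterns above behave, assuming the sample record from the question. The variable names `$anywhere`, `$second` and `$matched` are only illustrative. The point it tries to show is that the capture comes back only when the match is evaluated in list context (parentheses around the variable); in scalar context you get a true/false result instead.

```perl
use strict;
use warnings;

my $rec = 'SLOT3=4,4,2!INT=115!VC=4!CS=270!PK=/';

# List context (note the parentheses around the variable): the match
# returns its captures, so $anywhere receives the digits themselves.
my ($anywhere) = $rec =~ /!INT=([0-9]+)!/;

# Anchored form: skip exactly one field from the start of the record,
# then require INT= to follow, so only the second field can match.
my ($second) = $rec =~ /^[^!]*!INT=([0-9]+)!/;

# Scalar context: the same match yields only a success flag.
my $matched = $rec =~ /!INT=([0-9]+)!/;

print "anywhere: $anywhere\n";
print "second:   $second\n";
print "matched:  $matched\n";
```

If INT= were moved to the third field, `$anywhere` would still be set while `$second` would come back undefined, which is the difference between the two patterns.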

      Interestingly, in perl6 I would think it natural to begin with something such as:

      grammar foo_db {
          rule key    { <[A-Z]> }
          rule scalar { <[^,]>+ }
          rule list   { [<scalar> ,]+ <scalar> }
          rule term   { <key> = [<list> | <scalar>] }
          rule record { [<term> <[|]>]* <term> }
      }

      -- to exactly specify the records of the database, rather than either loosely accept the data or make it hard to quickly determine how loosely (or, perhaps, wrongly) the data is taken.

      Although I like to do this in perl5, it's not quite so easy to take a verifying regex and make it only accept certain keys, or fail on invalid values. As it isn't so easy or terse or maintainable, split// is preferred, and ultimately very strenuous testing is preferred.

      Please don't take that last paragraph as an indictment.

Re: Simple RegEx Substring Extraction from a Delimited Text Record
by ayrnieu (Beadle) on Mar 15, 2006 at 05:38 UTC
    ($int) = $rec =~ /INT=(\d*)(?=!|$)/;

    Or you can internalize your DB into structured data when you load it, rather than pass around such strings. Or you can consider Data::Record.

    I tend to make a blessed array for the database, and fill it with objects for the records. At least at first.
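
One way that "blessed array of record objects" idea might look, as a rough sketch only. The class names `RecordDB` and `Record` and the methods `add` and `get` are hypothetical (they are not from this thread or from any CPAN module), and the parsing assumes every field has the KEY=VALUE form seen in the sample record.

```perl
use strict;
use warnings;

{
    # Hypothetical record class: one object per delimited line.
    package Record;

    sub new {
        my ($class, $line) = @_;
        # Assumes every '!'-separated field looks like KEY=VALUE.
        my %field = map { split /=/, $_, 2 } split /!/, $line;
        return bless \%field, $class;
    }

    sub get {
        my ($self, $key) = @_;
        return $self->{$key};
    }
}

{
    # Hypothetical database class: a blessed array holding Record objects.
    package RecordDB;

    sub new {
        my ($class) = @_;
        return bless [], $class;
    }

    sub add {
        my ($self, $line) = @_;
        push @$self, Record->new($line);
        return $self;    # allow chaining
    }
}

package main;

my $db = RecordDB->new;
$db->add('SLOT3=4,4,2!INT=115!VC=4!CS=270!PK=/');

print 'INT of first record: ', $db->[0]->get('INT'), "\n";
```

The string is parsed once at load time; after that the rest of the program asks records for values by name instead of re-running a regex on the raw line.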

Re: Simple RegEx Substring Extraction from a Delimited Text Record
by reasonablekeith (Deacon) on Mar 15, 2006 at 09:54 UTC
    I'm a bit late to this post, but I couldn't resist adding the following. Why not just split it all into a hash?
    #!/bin/perl -w
    use strict;

    my $string = 'SLOT3=4,4,2!INT=115!VC=4!CS=270!PK=/';
    my %parsed_values = map { split '=' } split '!', $string;
    print $parsed_values{'INT'};
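
A slightly more defensive variant of the same split-into-a-hash idea, as a sketch. It assumes (this is not stated in the thread) that a value might itself contain an `=` and that a field without any `=` should simply be skipped; `@slot3` is only an illustrative name.

```perl
use strict;
use warnings;

my $string = 'SLOT3=4,4,2!INT=115!VC=4!CS=270!PK=/';

my %parsed_values =
    # Limit of 2: split on the first '=' only, so a value may contain '='.
    map  { my ($k, $v) = split /=/, $_, 2; ($k => $v) }
    # Assumption: silently ignore any field that is not KEY=VALUE.
    grep { /=/ }
    split /!/, $string;

# Comma-separated values can be split again when they are needed.
my @slot3 = split /,/, $parsed_values{SLOT3};

print "INT: $parsed_values{INT}\n";
print "SLOT3 has ", scalar(@slot3), " values\n";
```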
    ---
    my name's not Keith, and I'm not reasonable.
Re: Simple RegEx Substring Extraction from a Delimited Text Record
by NetWallah (Canon) on Mar 15, 2006 at 06:43 UTC
    Your approach should depend on what idea you consider the more important to communicate.

    • Is it important/relevant that this is the SECOND field?
    • Is the NAME (INT) important/significant?
    • Is this a general-purpose program/subroutine, where you may want to accept non-numeric or multiple values for INT?
    I would probably put in a WHILE loop, parse the record chunk by chunk using a regex with the "g" flag, and store the results in a hash.
    (Too lazy to write the code- whoring for XP).
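
For what it's worth, the while-loop approach described above might look something like this. It is a sketch only: the pattern assumes keys contain neither `!` nor `=` and that values run up to the next `!`, and `%field` is just an illustrative name.

```perl
use strict;
use warnings;

my $rec = 'SLOT3=4,4,2!INT=115!VC=4!CS=270!PK=/';
my %field;

# With /g in scalar context, each pass of the loop resumes matching
# where the previous match ended, so we walk the record one
# KEY=VALUE chunk at a time.
while ($rec =~ /([^!=]+)=([^!]*)/g) {
    $field{$1} = $2;
}

print "INT: $field{INT}\n";
```

Unlike a fixed-position split, this does not care which field INT lives in, and the values are not required to be numeric.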

         "For every complex problem, there is a simple answer ... and it is wrong." --H.L. Mencken

Re: Simple RegEx Substring Extraction from a Delimited Text Record
by ozboomer (Friar) on Mar 15, 2006 at 11:52 UTC
    As always, it's great to see how a simple question can generate a lot of chatter... and is a testament to how keen we all get about good ol' Perl :)

    With all the good ideas presented, it seems a re-working of my original split usage might suit me best, as I'm likely to want to refer to each of the data items by name elsewhere. The actual structure in the file is pretty convoluted (well, multiple multi-line records in many 'regions' of the same text file) but this method is likely to get me farthest, I think. I always find regex strange, not coming from a 'predominantly Unix' background... but the facility can certainly do a lot... in accomplished hands (not yet mine!).

    Many thanks for the suggestions, everyone.

      "I'm likely to want to refer to each of the data items by name elsewhere."

      This screams hash, IMO. See reasonablekeith's suggestion.

      ----------
      Using perl 5.8.6 unless otherwise noted. Apache/2.0.54 unless otherwise noted. Fedora Core 4 (2.6.11-1.1369_FC4) unless otherwise noted.
        However... contrary to a strange student in an AI class I took, hashing is not necessarily the answer to everything. But in this case, it is the answer :) .
