More Regular Expressions (text data handling)

by graq (Curate)
on Dec 04, 2001 at 16:33 UTC (#129313=perlquestion)
graq has asked for the wisdom of the Perl Monks concerning the following question:

I am writing a script that takes some raw data (text), extracts a set of information and inserts it into a database. Wow, big surprise, you all say. Yeah, well, I have a feeling that I am not doing this particularly well.

The thing that is making it non-trivial is that the rules for the data are not as trivial as usual. I don't think I will explain the rules, but rather give an example.

# DATA: Start
Graq
Agnostic
Number: 634321
age: 27 hair colour: black
height: 73 weight: 123
legs: 2 arms: 2
jameson
bells
guinness
favourite
detests
likes
# DATA: End
(Name is Graq, Number is 634321, age 27, jameson is my fave drink, detest bells etc)

Now this data is also surrounded by further noise, and may contain extra blank lines. But the 'Number' tag is unique, and there are always exactly 70 lines of (non-empty) relevant data, so I can index that and grab the lines I need.

# CODE: Start
#!perl -w
use strict;

my $pasteRaw;
while(<STDIN>)
{
    # Remove spaces from before ':' to avoid splitting on 'hair colour'.
    substr($_, 0, 1+index($_, ":")) =~ tr/ //d;
    $pasteRaw .= $_;
}
my @pasteSplitAll = split( "[^:] |\n", $pasteRaw ); # [1] See below

# Remove empty parts of the array.
my @pasteNotEmpty = grep { $_ ne "" } @pasteSplitAll;

# Defined an index for splicing later.
my $index = 0;

# Set the index (looking for 'Number')
$pasteNotEmpty[$_] =~ /^Number/ and $index = $_ and last for 0..$#pasteNotEmpty;

my @pasteUseful = splice( @pasteNotEmpty, $index-2, 70 );
$pasteUseful[$_] =~ s/^\w+:// for 0..$#pasteUseful;
# CODE: End
So this gives me an array of 'useful' data. A big thanks to people in CB for some of the individual lines in there, but it is all starting to look a little clumsy (and I am stripping some unwanted values at [1])

So.. to my point. I am looking for some help in handling this data, preferably into a hash, so that I can do stuff with it.

Anything ranging from help on the individual REs to new approaches on tackling the problem as a whole. Am I just kicking a dead horse and might as well write something a lot less generic?

<a href="http://www.graq.co.uk">Graq</a>

Re: More Regular Expressions (text data handling)
by frankus (Priest) on Dec 04, 2001 at 16:55 UTC

    For brevity and clarity in the question: Using the __DATA__ label will enable folks to run this easily.
    Update: Agh! the gotcha here is there are some lines with 2 key value pairs on em :(

    #!/usr/bin/perl -w
    use strict;
    use Data::Dumper;

    for(grep {/\w/}<DATA>){
        # g repeats the regex, e executes the Perl in the substitution,
        # returns the number of matches into the condition.
        # $1 and $2 are the bracketed matches in the regex in order.
        unless( s/^(^[a-z ]+) *: *([\w\d]+)/$_{$1}=$2/ige ){
            chomp;
            if (/:/){ # key part
                $_{$_}=join(',',@_); # make new key item
                @_=()
            }
            else { # list part
                push @_,$_
            }
        }
    }
    print Dumper(\%_);
    __DATA__
    Graq:
    Agnostic:
    Number: 634321
    age: 27 hair colour: black
    height: 73 weight: 123
    legs: 2 arms: 2
    jameson
    bells
    guinness
    favourite:
    detests:
    likes:

    --

    Brother Frankus.

    ¤

      Agh! the gotcha here is there are some lines with 2 key value pairs on em :(

      Yes, sorry, I should have expanded on the data. Especially the noise either side of the data. Hence the indexing step. Below is a data example closer to a true example.

      __DATA__
      NOISE noise Graq 121212 rubbish: values
      Graq
      Agnostic
      Number: 634321
      age: 27 hair colour: black
      height: 73 weight: 123
      legs: 2 arms: 2
      balls: 1
      aminals: 3 leftandright
      rugby
      cute
      "more noise here - and don't forget the blank lines..."
      __END_DATA__
      The first guaranteed unique identifier is the line beginning with Number (i.e. /^Number:/).

      Lines before the colon-separated (:) set of values are key-less (hard-coded keys must be used); values after it are sub-values of the preceding colon-separated values.

      So, if you like, {arms}->{2}->{leftandright}, {balls}->{1}->{rugby} ..
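
One way that nesting could be built, as a rough sketch only: collect every "key: value" pair, collect the bare values that follow, and hang the bare values off the last pairs in order. The helper name `parse_bottom`, the placeholder leaf value 1, and the assumption that each bare value belongs to exactly one of the trailing pairs are mine, not part of the data description. It also ignores a final key with a missing value.

```perl
use strict;
use warnings;

# Hypothetical helper: build {key}{value}{sub-value} from the tail of a record.
# Assumes the bare sub-values line up, in order, with the LAST pairs seen.
sub parse_bottom {
    my (@lines) = @_;
    my (@pairs, @subs);
    for my $line (@lines) {
        # Strip every leading "key: value" pair off the line first.
        while ($line =~ s/^\s*([^:]+?):\s+(\S+)//) {
            push @pairs, [$1, $2];
        }
        # Whatever is left (if anything) is a bare sub-value.
        push @subs, $1 if $line =~ /(\S+)/;
    }

    # Pairs before this offset have no sub-value (e.g. "legs: 2").
    my $offset = @pairs - @subs;
    return {} if $offset < 0;

    my %tree;
    for my $i (0 .. $#subs) {
        my ($key, $value) = @{ $pairs[$offset + $i] };
        $tree{$key}{$value}{ $subs[$i] } = 1;   # 1 is only a placeholder leaf
    }
    return \%tree;
}

my $tree = parse_bottom(
    'legs: 2 arms: 2',
    'balls: 1',
    'aminals: 3 leftandright',
    'rugby',
    'cute',
);
print join(', ', sort keys %$tree), "\n";
```

If the real data ever has more bare values than pairs, this sketch gives up and returns an empty hash rather than guessing.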

      Is it becoming any clearer?? :-(

      <a href="http://www.graq.co.uk">Graq</a>

Re: More Regular Expressions (text data handling)
by buckaduck (Chaplain) on Dec 04, 2001 at 18:16 UTC
    It's hard to know how to parse this in a general sense if you won't tell us the "rules" that are used to produce the data. As it is, your dataset is very inconsistent.

    Some lines have a field name and a value:

    Number: 634321
    Some lines have two field/value pairs, with no separator!
    age: 27 hair colour: black
    Some lines have a value, with no field name:
    Graq Agnostic
    And toward the end you have a series of values, followed by a series of field names:
    jameson bells guinness favourite detests likes

    If there are in fact any real rules governing the data, you should tell us. Better yet, you should if possible change your data structure to something easier. I'll bet that you could parse this easily:

    # DATA: Start name: Graq religion: Agnostic Number: 634321 age: 27 hair colour: black height: 73 weight: 123 legs: 2 arms: 2 favorite: jameson detests: bells likes: guinness # DATA: End
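
To illustrate why a one-pair-per-line layout is easy to handle, here is a minimal sketch. The helper name `parse_simple` is made up for this example, and it assumes exactly one "key: value" pair per line with the key being everything before the first colon.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "key: value" lines into a hash reference.
sub parse_simple {
    my ($text) = @_;
    my %record;
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*#/;    # skip the "# DATA: Start/End" markers
        # Key: everything before the first colon. Value: the rest, trimmed.
        next unless $line =~ /^\s*([^:]+?)\s*:\s*(.*?)\s*$/;
        $record{$1} = $2;
    }
    return \%record;
}

my $record = parse_simple(<<'END');
# DATA: Start
name: Graq
Number: 634321
hair colour: black
# DATA: End
END

print "$_ => $record->{$_}\n" for sort keys %$record;
```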

    buckaduck

      Well said.

      Unfortunately I cannot provide an actual example.

      So here goes:

      1. I have no control over the incoming data. It will effectively be pasted by someone in a <TEXTAREA>.
      2. The data lines may contain sporadic empty lines, which are to be ignored.
      3. There are useless lines and characters before the useful data starts.
      4. There are useless lines and characters after the useful data ends.
      5. The first point at which a line can be uniquely identified as useful is a line that starts with the characters 'Number:'.
        This identifier (index) is not at the start of the data.
      6. The data can be split into 4 sections:
        1. TOP
          A set of values with no keys. These are always in the same place relative to each other.
          So, in my previous example, you could safely say $result{Name}='Graq';.
          There is only one value per line.

        2. MIDDLE
          A set of key-value pairs separated by a colon and space /: /.
          Some lines have 2 key-value pairs on them (never more).
          Keys may contain spaces, values may not.
        3. BOTTOM
          1. A set of key-value pairs separated by a colon and space.
            One key-value pair per line (see below).
          2. A set of values. These values correspond to the key-value pairs in (i).
            One value per line (see below).
            NB: The first line of (ii) will be on the same line as the last line of (i), separated by a space.
        4. END
          A single key-value pair (colon and space separated), where the value may be missing.

      Note on lines with multiple key-value pairs:

      1. The TOP and MIDDLE never mix.
      2. MIDDLE values may have multiple entries.
      3. MIDDLE and BOTTOM(i) may overlap.
      4. BOTTOM(i) and BOTTOM(ii) always overlap.
      5. BOTTOM(ii) and END never mix.
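
A rough sketch of how rules 2 and 5 plus the TOP and MIDDLE sections might be handled, offered as an illustration rather than a finished parser. The function name `parse_top_and_middle` and the key "Religion" are invented for this example (only "Name" is given above), and the sketch assumes the TOP values sit directly above the 'Number:' line. It stops at the first line that does not start with a pair, so BOTTOM(ii) and END are not handled here.

```perl
use strict;
use warnings;

# Hypothetical hard-coded keys for the key-less TOP lines.
my @TOP_KEYS = qw(Name Religion);

sub parse_top_and_middle {
    my ($text) = @_;

    # Rule 2: ignore empty lines.
    my @lines = grep { /\S/ } split /\n/, $text;

    # Rule 5: the first line starting with "Number:" anchors the record.
    my ($anchor) = grep { $lines[$_] =~ /^Number:/ } 0 .. $#lines;
    return unless defined($anchor) && $anchor >= @TOP_KEYS;

    my %result;

    # TOP: one key-less value per line, directly above the anchor (assumed).
    @result{@TOP_KEYS} = @lines[ $anchor - @TOP_KEYS .. $anchor - 1 ];

    # MIDDLE: one or two "key: value" pairs per line. Keys may contain
    # spaces, values may not, so a value is a run of non-space characters.
    for my $line (@lines[ $anchor .. $#lines ]) {
        last unless $line =~ /^[^:]+:\s+\S/;    # first line without a pair
        while ($line =~ /\s*([^:]+?):\s+(\S+)/g) {
            $result{$1} = $2;
        }
    }
    return \%result;
}

my $result = parse_top_and_middle(<<'END');
NOISE noise Graq 121212
rubbish: values

Graq
Agnostic
Number: 634321
age: 27 hair colour: black
height: 73 weight: 123
rugby
END

print "$_ => $result->{$_}\n" for sort keys %$result;
```

Anchoring on 'Number:' first means the noise above the record never reaches the pair-matching regex, which is why "rubbish: values" does not end up in the hash.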

      Please don't ask why this is :(

      <a href="http://www.graq.co.uk">Graq</a>

      edited by footpad, ~Tue Dec 4 14:42:09 2001 (GMT)

Re: More Regular Expressions (text data handling)
by joealba (Hermit) on Dec 04, 2001 at 20:05 UTC
    Sorry, but you can't consistently parse data that is THIS inconsistent. Blank lines are easy to ignore, but how can you ignore lines with "noise" at the start without a solid way to denote the start of your data?

    Try your best to clean up the incoming data. Until then, here are some parsing tricks that might help you keep some of your code maintainable (note I didn't say fast).
    my %KEY_PARSER = (
        "number" => {
            START_COMMAND => qr{number:\s*}i,
            VALUE_MATCH   => qr{\d+},
        },
        "hair color" => {
            START_COMMAND => qr{hair colou?r:\s*}i,
            VALUE_MATCH   => qr{[\w\s]+},
        },
        "height" => {
            START_COMMAND => qr{height:\s*}i,
            VALUE_MATCH   => qr{\d+},
        },
        "weight" => {
            START_COMMAND => qr{weight:\s*}i,
            VALUE_MATCH   => qr{\d+},
        },
    );

    foreach my $line (grep {/\w/} <DATA>) {
        foreach (keys %KEY_PARSER) {
            while ($line =~ /$KEY_PARSER{$_}{START_COMMAND}/) {
                # Leave the loop if the value does not match; otherwise the
                # line is never shortened and this loop would not terminate.
                last unless $line =~ s/($KEY_PARSER{$_}{START_COMMAND})\s*($KEY_PARSER{$_}{VALUE_MATCH})//;
                my ($key,$value) = ($1,$2);
                chomp ($key,$value);
                print "Found KEY: $key = $value\n";
            }
        }
    }
    Crap, even when munged by the magical Perl, still smells like crap.
      As I stated earlier, the noise is before and after and it is possible to identify an index and work from there.
      The number of sets of data is always 70 and the index /^Number:/ is always the third piece of data (after blank lines are removed).
      So you can, somewhat, ignore that for the question. I was including it for completeness.

      <a href="http://www.graq.co.uk">Graq</a>

        As I see it, you need lookaheads in a regex:

        Since the line before Number contains the name, and the person's details are terminated by the name again,
        you can write something that grabs the name and the text between two instances of it.

        You could then make a hash of names with the value being a hash of details, does that sound good?
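
A sketch of that idea, under assumptions that may not hold for the real data: the name is a single word on the line directly before 'Number:', and the same word appears again on a line of its own after the details. The function name `records_by_name` is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: hash of names, each holding a hash of details.
sub records_by_name {
    my ($text) = @_;
    my %people;

    # /m lets ^ and $ match at line boundaries, /s lets . cross newlines.
    # (\w+) is the name line, the lazy (.*?) is the details, and the
    # lookahead (?=^\1$) stops at the next line that repeats the name.
    while ($text =~ /^(\w+)\n(Number:.*?)(?=^\1$)/msg) {
        my ($name, $details) = ($1, $2);
        my %detail;
        for my $line (split /\n/, $details) {
            # One or two "key: value" pairs per line.
            while ($line =~ /\s*([^:]+?):\s+(\S+)/g) {
                $detail{$1} = $2;
            }
        }
        $people{$name} = \%detail;
    }
    return \%people;
}

my $people = records_by_name(
    "noise\nGraq\nNumber: 634321\nage: 27 hair colour: black\nGraq\nmore noise\n"
);
print "$_: ", join(', ', sort keys %{ $people->{$_} }), "\n" for sort keys %$people;
```

Because the terminating name sits inside a lookahead, it is not consumed, so the next match attempt can start from that same line.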

        --

        Brother Frankus.

        ¤
