graq has asked for the
wisdom of the Perl Monks concerning the following question:
I am writing a script that takes some raw data (text), extracts a set of information and inserts it into a database. Wow, big surprise you all say. Yeah, well, I have a feeling that I am not doing this particularly well.
The thing that is making it non-trivial is that the rules for the data are not as trivial as usual. I don't think I will explain the rules, but rather give an example.
# DATA: Start
Graq
Agnostic
Number: 634321
age: 27 hair colour: black
height: 73
weight: 123
legs: 2 arms: 2
jameson
bells
guinness
favourite
detests
likes
# DATA: End
(Name is Graq, Number is 634321, age 27, jameson is my fave drink, detest bells etc)
Now this data is also surrounded by further noise, and may contain extra blank lines.
But the 'Number' tag is unique, and there are always exactly
70 lines of (non-empty) relevant data, so I can index
that and grab the lines I need.
# CODE: Start
#!perl -w
use strict;
my $pasteRaw;
while(<STDIN>)
{
# Remove spaces from before ':' to avoid splitting on 'hair colour'.
substr($_, 0, 1+index($_, ":")) =~ tr/ //d;
$pasteRaw .= $_;
}
my @pasteSplitAll = split( "[^:] |\n", $pasteRaw ); # [1] See below
# Remove empty parts of the array.
my @pasteNotEmpty = grep { $_ ne "" } @pasteSplitAll;
# Define an index for splicing later.
my $index = 0;
# Set the index (looking for 'Number')
$pasteNotEmpty[$_] =~ /^Number/ and $index = $_ and last
for 0..$#pasteNotEmpty;
my @pasteUseful = splice( @pasteNotEmpty, $index-2, 70 );
$pasteUseful[$_] =~ s/^\w+:// for 0..$#pasteUseful;
# CODE: End
So this gives me an array of 'useful' data. A big thanks to people in CB for some of the individual lines in there, but it is all starting to look a little clumsy (and I am stripping some unwanted values at [1])
So.. to my point. I am looking for some help in handling this data, preferably into a hash, so that I can do stuff with it.
Anything ranging from help on the individual REs to new approaches on tackling the problem as a whole. Am I just kicking a dead horse and might as well write something a lot less generic?
<a href="http://www.graq.co.uk">Graq</a>
Re: More Regular Expressions (text data handling) by frankus (Priest) on Dec 04, 2001 at 16:55 UTC
For brevity and clarity in the question: Using the __DATA__ label will enable folks to run this easily.
Update: Agh! the gotcha here is there are some lines with 2 key value pairs on em :(
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
for(grep {/\w/}<DATA>){
# g repeats the regex, e executes the Perl in the substitution,
# returns the number of matches into the condition.
# $1 and $2 are the bracketed matches in the regex in order.
unless( s/^(^[a-z ]+) *: *([\w\d]+)/$_{$1}=$2/ige ){
chomp;
if (/:/){ # key part
$_{$_}=join(',',@_); # make new key item
@_=()
}
else { # list part
push @_,$_
}
}
}
print Dumper(\%_);
__DATA__
Graq:
Agnostic:
Number: 634321
age: 27 hair colour: black
height: 73
weight: 123
legs: 2 arms: 2
jameson
bells
guinness
favourite:
detests:
likes:
--
Brother Frankus.
__DATA__
NOISE
noise
Graq
121212
rubbish: values
Graq
Agnostic
Number: 634321
age: 27 hair colour: black
height: 73
weight: 123
legs: 2 arms: 2
balls: 1
aminals: 3
leftandright
rugby
cute
"more noise here - and don't forget the blank lines..."
__END_DATA__
The first guaranteed unique identifier is the line beginning with Number (ie /^Number:/).
Lines before the colon (:) separated set of values are key-less (hard coded keys must be used), values after are sub-values of the preceding : values.
So, if you like, {arms}->{2}->{leftandright}, {balls}->{1}->{rugby} ..
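One way to read that rule is that the last N key-value pairs own the N bare lines that follow them, matched up in order. Here is a minimal sketch under that assumption. The helper name attach_subvalues is made up for illustration, and it expects the noise and blank lines to have been stripped already:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: attach the trailing bare values to the key-value
# pairs that immediately precede them. Assumes keys contain no colons and
# values contain no whitespace.
sub attach_subvalues {
    my @lines = @_;
    my (@pairs, @trailing);
    for my $line (@lines) {
        my $found = 0;
        while ($line =~ /\s*([^:]+?)\s*:\s+(\S+)/g) {
            push @pairs, [$1, $2];
            $found = 1;
        }
        # A bare line only counts as a sub-value once a pair has been seen,
        # so the key-less lines at the top are skipped.
        push @trailing, $line if !$found && @pairs;
    }
    # The last $n pairs own the $n trailing values, in order.
    my $n = @trailing < @pairs ? @trailing : @pairs;
    my @owners = @pairs[ @pairs - $n .. $#pairs ];
    my %tree;
    for my $i (0 .. $n - 1) {
        my ($key, $value) = @{ $owners[$i] };
        $tree{$key}{$value} = $trailing[$i];
    }
    return \%tree;
}

my $tree = attach_subvalues(
    'Graq', 'Agnostic', 'Number: 634321', 'age: 27 hair colour: black',
    'height: 73', 'weight: 123', 'legs: 2 arms: 2', 'balls: 1',
    'aminals: 3', 'leftandright', 'rugby', 'cute',
);
print "arms/2 => $tree->{arms}{2}\n";
```

If the pairing rule is different (for example, if the sub-values belong to the first N pairs rather than the last N), only the slice that builds @owners should need changing.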
Is it becoming any clearer?? :-(
<a href="http://www.graq.co.uk">Graq</a>
Re: More Regular Expressions (text data handling) by buckaduck (Chaplain) on Dec 04, 2001 at 18:16 UTC
Number: 634321
Some lines have two field/value pairs, with no separator!
age: 27 hair colour: black
Some lines have a value, with no field name:
Graq
Agnostic
And toward the end you have a series of values, followed by a series of field names:
jameson
bells
guinness
favourite
detests
likes
If there are in fact any real rules governing the data, you should tell us. Better yet, you should if possible change your data structure to something easier. I'll bet that you could parse this easily:
# DATA: Start
name: Graq
religion: Agnostic
Number: 634321
age: 27
hair colour: black
height: 73
weight: 123
legs: 2
arms: 2
favorite: jameson
detests: bells
likes: guinness
# DATA: End
buckaduck
Well said.
Unfortunately I cannot provide an actual example.
So here goes:
- I have no control over the incoming data. It will effectively be pasted by someone in a <TEXTAREA>.
- The data lines may contain sporadic empty lines, which are to be ignored.
- There are useless lines and characters before the useful data starts.
- There are useless lines and characters after the useful data ends.
- The first unique point at which a line can be identified as being useful is that it starts with the characters 'Number:'.
This identifier (index) is not at the start of the data.
- The data can be split into 4 sections:
  - TOP
    A set of values with no keys. These are always in the same place relative to each other.
    So, in my previous example, you could safely say $result{Name}='Graq';
    There is only one value per line.
  - MIDDLE
    A set of key-value pairs separated by a colon and space /: /.
    Some lines have 2 key-value pairs on them (never more).
    Keys may contain spaces, values may not.
  - BOTTOM
    (i) A set of key-value pairs separated by a colon and space.
        One key-value pair per line (see below).
    (ii) A set of values. These values correspond to the key-value pairs in (i).
        One value per line (see below).
    NB: The first line of (ii) will be on the same line as the last line of (i), separated by a space.
  - END
    A single key-value pair (colon and space separated), where the value may be missing.
Note on lines with multiple key-value pairs:
- The TOP and MIDDLE never mix.
- MIDDLE values may have multiple entries.
- MIDDLE and BOTTOM(i) may overlap.
- BOTTOM(i) and BOTTOM(ii) always overlap.
- BOTTOM(ii) and END never mix.
Please don't ask why this is :(
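To make the MIDDLE rule concrete (colon-and-space separated pairs, at most two per line, keys may contain spaces, values may not), here is a rough sketch of how one such line could be pulled apart. The helper name parse_pairs is made up for illustration, and it deliberately ignores the END case where the value is missing:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return every "key: value" pair found on one line.
# Assumes keys contain no colons and values contain no whitespace.
sub parse_pairs {
    my ($line) = @_;
    my @pairs;
    # Lazily match the key up to the next colon, then one whitespace-free value.
    while ($line =~ /\s*([^:]+?)\s*:\s+(\S+)/g) {
        push @pairs, [$1, $2];
    }
    return @pairs;
}

my %result;
for my $line ('age: 27 hair colour: black', 'height: 73', 'legs: 2 arms: 2') {
    $result{ $_->[0] } = $_->[1] for parse_pairs($line);
}
print "$_ => $result{$_}\n" for sort keys %result;
```

Because the value is matched as a run of non-whitespace, the second key on a line starts right after it, which is what lets 'hair colour' keep its space.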
<a href="http://www.graq.co.uk">Graq</a>
edited by footpad, ~Tue Dec 4 14:42:09 2001 (GMT)
Re: More Regular Expressions (text data handling) by joealba (Hermit) on Dec 04, 2001 at 20:05 UTC
Sorry, but you can't consistently parse data that is THIS inconsistent. Blank lines are easy to ignore, but how can you ignore lines with "noise" at the start without a solid way to denote the start of your data?
Try your best to clean up the incoming data. Until then, here are some parsing tricks that might help you keep some of your code maintainable (note I didn't say fast).
my %KEY_PARSER = (
"number" => {
START_COMMAND => qr{number:\s*}i,
VALUE_MATCH => qr{\d+},
},
"hair color" => {
START_COMMAND => qr{hair colou?r:\s*}i,
VALUE_MATCH => qr{[\w\s]+},
},
"height" => {
START_COMMAND => qr{height:\s*}i,
VALUE_MATCH => qr{\d+},
},
"weight" => {
START_COMMAND => qr{weight:\s*}i,
VALUE_MATCH => qr{\d+},
},
);
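foreach my $line (grep {/\w/} <DATA>) {
    foreach (keys %KEY_PARSER) {
        # Strip each matching "key: value" pair out of the line as it is found.
        # Looping on the substitution itself avoids an endless loop when the
        # key matches but its value does not.
        while ($line =~ s/($KEY_PARSER{$_}{START_COMMAND})($KEY_PARSER{$_}{VALUE_MATCH})//) {
            my ($key,$value) = ($1,$2);
            $key =~ s/:\s*$//;    # drop the trailing colon and whitespace
            $value =~ s/\s+$//;   # drop trailing whitespace, including the newline
            print "Found KEY: $key = $value\n";
        }
    }
}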
Crap, even when munged by the magical Perl, still smells like crap.
As I see it, you need lookaheads in a regex:
Since the line before Number contains the name, and the person's details are terminated by the name again,
you can write something that grabs the name and the text between the two instances of the name.
You could then make a hash of names, with each value being a hash of details. Does that sound good?
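A rough sketch of the lookahead idea, under some assumptions: going by graq's samples, the name sits two lines above the Number line (with one key-less line in between), and blank lines are dropped first. It only anchors on the Number line and does not look for a closing repeat of the name. The sample text and the %people hash are invented for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = <<'END';
NOISE
noise

Graq
Agnostic
Number: 634321
age: 27 hair colour: black
height: 73
END

# Drop blank lines so the name is reliably two lines above "Number:".
$text =~ s/^[ \t]*\n//mg;

my %people;
# The lookahead checks that the Number line comes next without consuming it.
if ($text =~ /^(\S+)\n\S+\n(?=Number:)/m) {
    my $name = $1;
    my $rest = substr $text, $+[0];    # everything from the Number line on
    while ($rest =~ /\s*([^:\n]+?)\s*:[ \t]+(\S+)/g) {
        $people{$name}{$1} = $2;
    }
}

for my $name (sort keys %people) {
    print "$name: $_ = $people{$name}{$_}\n" for sort keys %{ $people{$name} };
}
```

This gives a hash of names where each value is a hash of details, so a second person in the same paste would simply become a second top-level key once the match is run in a loop.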
--
Brother Frankus.