Parsing semi-erratic text

SamCG has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to parse the

Item Overridden   : Earnings Per Share
Initial Value     :  (USD)
Current Value     :  ()
Overridden Value  : 160 (USD)
Effective         : 08/20/1999 through 08/20/2000
Override Type     : Data
SecurityID        : 1076665
Sedol             : 2451234
Cusip             : N66696606
ISIN              : NL0006122988

Update: Ah, I've found a potential way. `while ($bdy=~/(.*?):\s(.*?)\s\s/g)` seems to work alright. Comments on this approach?

-----------------
_{s''limp';@p=split '!','n!h!p!';s,m,s,;$s=y;$c=slice @p1;so brutally;d;$n=reverse;$c=$s**$#p;print(''.$c^chop($n))while($c/=$#p)>=1;}

Re: Parsing semi-erratic text by ikegami (Patriarch) on Aug 23, 2006 at 19:39 UTC
The data you submitted does have newlines. `while (<DATA>) { my ($key, $val) = /^\s*([^:]*?)\s*:\s*(.*?)\s*$/ or next; print("[$key:$val]\n"); }` and `while (<DATA>) { my ($key, $val) = split(/:/, $_, 2); next if not defined $val; s/^\s+//, s/\s+$// for $key, $val; print("[$key:$val]\n"); }` both do the trick.
Re^2: Parsing semi-erratic text by SamCG (Hermit) on Aug 23, 2006 at 19:50 UTC
Hrmm...perhaps an effect of my cutting and pasting? The body of my email gets read into a variable (so it's like slurping a file). I can't seem to split on newlines, and if I use a regex to count I get only one in each email (which I presume is at the end). Thank you for the implicit character class (`[^:]`) suggestion, by the way. I hate putting .* into regexes, even with the non-greedy modifier.

-----------------
_{s''limp';@p=split '!','n!h!p!';s,m,s,;$s=y;$c=slice @p1;so brutally;d;$n=reverse;$c=$s**$#p;print(''.$c^chop($n))while($c/=$#p)>=1;}
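For a body that is already slurped into one variable, the `split`-based loop above can be driven from the string instead of a filehandle. This is only a sketch: the `$body` contents are a made-up stand-in, and the guess that the email might use `\r\n` or a bare `\r` as its line ending is an assumption, not something established in the thread.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the slurped email body. The real line
# endings are unknown; mixed endings are used here only for illustration.
my $body = "Sedol             : 2451234   \r\n"
         . "Cusip             : N66696606\r"
         . "ISIN              : NL0006122988  \n";

my %field;
# Accept CRLF, bare CR, or bare LF as a line separator.
for my $line (split /\r\n?|\n/, $body) {
    my ($key, $val) = split /:/, $line, 2;
    next if not defined $val;              # skip lines without a colon
    s/^\s+//, s/\s+$// for $key, $val;     # trim both ends of key and value
    $field{$key} = $val;
}

print "[$_:$field{$_}]\n" for sort keys %field;
```

If this still yields a single "line", dumping the body with something like `Data::Dumper` (with `$Data::Dumper::Useqq = 1`) should show which separator characters are really present.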
Re^3: Parsing semi-erratic text by ikegami (Patriarch) on Aug 23, 2006 at 20:04 UTC
I agree. `.*` and `.*?` often assume the data is formatted correctly.
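A small illustration of that point, as a sketch only: the malformed line and the numeric-value pattern below are invented for the demonstration and do not come from the original data. When the rest of the pattern fails, a lazy `.*?` backtracks across a colon and quietly accepts the line, while `[^:]*?` cannot and so rejects it.

```perl
use strict;
use warnings;

# Made-up malformed input: a stray extra "label" before the real value.
my $bad = 'Sedol : junk : 2451234';

# Lazy dot: free to grow past the first colon once \d+ fails on "junk".
my ($dot_key)   = $bad =~ /^(.*?)\s*:\s*(\d+)$/;

# Negated class: can never cross a colon, so the whole match fails.
my ($class_key) = $bad =~ /^([^:]*?)\s*:\s*(\d+)$/;

print 'dot:   ', (defined $dot_key   ? "[$dot_key]"   : 'no match'), "\n";
print 'class: ', (defined $class_key ? "[$class_key]" : 'no match'), "\n";
```

Whether silently accepting or rejecting such a line is preferable depends on the data; the negated class at least makes the assumption explicit.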
Re: Parsing semi-erratic text by GrandFather (Saint) on Aug 23, 2006 at 21:07 UTC
Not quite. Consider the following (note that I've reduced the number of trailing spaces and kept the line ends, which the code then strips):

use strict;
use warnings;
use Date::EzDate;

my $str = <<DATA;
Security          : BULGY N V-
Item Overridden   : Earnings Per Share
Initial Value     :  (USD)
Current Value     :  ()
Overridden Value  : 160 (USD)
Effective         : 08/20/1999 through 08/20/2000
Override Type     : Data
SecurityID        : 1076665
Sedol             : 2451234
Cusip             : N66696606
ISIN              : NL0006122988
DATA

# Remove the line ends so the string looks like the slurped email body
$str =~ s/\n//g;

while ($str =~ /(.*?):\s(.*?)\s\s/g) {
    my ($key, $value) = ($1, $2);
    # Trim leading and trailing whitespace from both captures
    $key   =~ s/^\s*//;
    $key   =~ s/\s*$//;
    $value =~ s/^\s*//;
    $value =~ s/\s*$//;
    print ">$key: $value<\n";
}

Prints:

>Security: BULGY N V-<
>Item Overridden: Earnings Per Share<
>Initial Value: (USD)<
>Current Value: ()<
>Overridden Value: 160 (USD)<
>Effective: 08/20/1999 through 08/20/2000<
>Override Type: Data SecurityID<
>: 1076665 Sedol<
>: 2451234 Cusip<
>: N66696606 ISIN<
>: NL0006122988<

The regex `/(\w[\w ]{17}):\s+((?:(?!\w[\w ]{17}:).)*)/g` latches on to an 18-character-wide label preceding a `:` and then grabs characters up to the next label field. The result is:

>Security: BULGY N V-<
>Item Overridden: Earnings Per Share<
>Initial Value: (USD)<
>Current Value: ()<
>Overridden Value: 160 (USD)<
>Effective: 08/20/1999 through 08/20/2000<
>Override Type: Data<
>SecurityID: 1076665<
>Sedol: 2451234<
>Cusip: N66696606<
>ISIN: NL0006122988<

DWIM is Perl's answer to Gödel
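For anyone wanting to try the fixed-width regex on its own, here is a minimal sketch that collects the fields into a hash. The sample values' trailing-space counts are made up, the labels are padded to 18 characters with `sprintf` so the width is explicit, and `$label` and `%field` are names introduced only for this example.

```perl
use strict;
use warnings;

# Sample label/value pairs; the trailing spaces on the values are invented.
my @pairs = (
    [ 'Override Type', 'Data '          ],
    [ 'SecurityID',    '1076665   '     ],
    [ 'Sedol',         '2451234 '       ],
    [ 'Cusip',         'N66696606'      ],   # no separator before next label
    [ 'ISIN',          'NL0006122988  ' ],
);

# Build one newline-free string with each label padded to 18 characters.
my $str = join '', map { sprintf '%-18s: %s', @$_ } @pairs;

my $label = qr/\w[\w ]{17}/;    # a word character plus 17 word-or-space characters

my %field;
# Capture a label, then every character that does not start another label.
while ($str =~ /(${label}):\s+((?:(?!${label}:).)*)/g) {
    my ($key, $value) = ($1, $2);
    s/^\s+//, s/\s+$// for $key, $value;   # trim padding from both captures
    $field{$key} = $value;
}

print ">$_: $field{$_}<\n" for sort keys %field;
```

Because the split point is the label width rather than a run of spaces, the `Cusip` value is still separated from `ISIN` even with no whitespace between them. The approach does assume every label is exactly 18 characters wide and made only of word characters and spaces.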
