waxmop has asked for the wisdom of the Perl Monks concerning the following question:
I need to parse this page and create a hash using the codes as keys and the descriptions as values.
This is a section showing what the page looks like:
Total index B50001
Crude processing (capacity) B5610C
Primary & semifinished processing (capacity) B562A3C
Finished processing (capacity) B5640C
Manufacturing ("SIC") B00004
Manufacturing (NAICS) GMF
Durable manufacturing (NAICS) GMFD
Wood product G321 321
Nonmetallic mineral product G327 327
Primary metal G331 331
Iron and steel products G3311A2 3311,2
Fabricated metal product G332 332
Machinery G333 333
I want to build a hash that would work like this:
my $code = "GMF";
print "$code: $description_hash{$code}.\n";
That should print:
GMF: Manufacturing (NAICS).
All preceding and trailing whitespace needs to be removed from the description.
I've never been expert with regular expressions, so I'd love to see how the really smart people that hang out on this site would build that hash.
Thanks in advance!
Re: Help me write a good reg-exp for this text
by hardburn (Abbot) on Sep 05, 2003 at 15:59 UTC
Is the data of fixed width? If so, you're better off with unpack:
# @LIST_OF_ENTRIES contains your data, one line per element
my %description_hash;
foreach my $entry (@LIST_OF_ENTRIES) {
# Replace 20 with the number of characters in the value portion
my ($value, $name) = unpack('A20 A*', $entry);
$value =~ s/\A\s*//;
$value =~ s/\s*\z//;
$description_hash{$name} = $value;
}
---- I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
-- Schemer
Note: All code is untested, unless otherwise stated
Hi -
Thanks for the help. I don't understand the \A in s/\A\s*// however. Can you explain it to me?
$ perldoc perlre
The \A anchors the search at the beginning of the string. A ^ does the same by default, but under the /m modifier it also matches at the beginning of each LINE in the string. Documentation is a wonderful thing.
-- [ e d @ h a l l e y . c c ]
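To make the difference concrete, here is a minimal sketch. The two-line string and the variable names are made up for illustration and are not from the original post.

```perl
use strict;
use warnings;

# Hypothetical two-line string, used only to show where each anchor can match.
my $text = "first line\nsecond line\n";

# \A matches only at the very start of the string, even under /m.
my @start_of_string = $text =~ /\A(\w+)/mg;

# Under /m, ^ matches at the start of the string and after each embedded newline.
my @start_of_lines = $text =~ /^(\w+)/mg;

print "\\A finds: @start_of_string\n";
print "^ with /m finds: @start_of_lines\n";
```

For a value that holds a single line, as in the trimming code above, the two anchors behave the same; \A just states the intent more explicitly.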
Re: Help me write a good reg-exp for this text
by benn (Vicar) on Sep 05, 2003 at 16:07 UTC
As this looks like fixed-length formatting, I wouldn't go with a regex at all, but with...
my ($desc,$code) = unpack("A60 A*",$line); # or however long 1st field is.
$description_hash{$code}=$desc;
Cheers, Ben.
Update: as would hardburn :)
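For anyone who has not used unpack on fixed-width text, a minimal sketch of the idea follows. The width of 30 and the two sample lines are assumptions made up for the example; the real page will have its own column widths.

```perl
use strict;
use warnings;

# Hypothetical fixed-width lines: a 30-character description column, then the code.
my @lines = (
    sprintf('%-30s%s', '  Manufacturing (NAICS)', 'GMF'),
    sprintf('%-30s%s', '    Wood product',        'G321'),
);

my %description_hash;
for my $line (@lines) {
    # 'A30' takes 30 characters and strips trailing spaces; 'A*' takes the rest.
    my ($desc, $code) = unpack 'A30 A*', $line;
    $desc =~ s/\A\s+//;    # the 'A' format leaves leading whitespace alone
    $description_hash{$code} = $desc;
}

print "$_: $description_hash{$_}.\n" for sort keys %description_hash;
```

Because the 'A' format already drops trailing spaces, only the leading indentation needs a substitution.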
Re: Help me write a good reg-exp for this text
by BrowserUk (Patriarch) on Sep 05, 2003 at 16:35 UTC
#! perl -slw
use strict;
my %desc;
m[^\s*(.*?)\s+([A-Z0-9]+)\s+(?:[0-9,]+\s*)?$]
and $desc{ $2 } = $1
while <DATA>;
my $code = 'GMF';
print "$code : $desc{ $code }";
__DATA__
Total index B50001
Crude processing (capacity) B5610C
Primary & semifinished processing (capacity) B562A3C
Finished processing (capacity) B5640C
Manufacturing ("SIC") B00004
Manufacturing (NAICS) GMF
Durable manufacturing (NAICS) GMFD
Wood product G321 321
Nonmetallic mineral product G327 327
Primary metal G331 331
Iron and steel products G3311A2 3311,2
Fabricated metal product G332 332
Machinery G333 333
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
If I understand your problem, I can solve it! Of course, the same can be said for you.
Re: Help me write a good reg-exp for this text
by broquaint (Abbot) on Sep 05, 2003 at 16:10 UTC
If you don't mind losing the numbers at the end of some of the fields (or at least storing them elsewhere) you could just use a split:
use strict;
my $str = <<TXT;
Total index B50001
Crude processing (capacity) B5610C
Primary & semifinished processing (capacity) B562A3C
Finished processing (capacity) B5640C
Manufacturing ("SIC") B00004
Manufacturing (NAICS) GMF
Durable manufacturing (NAICS) GMFD
Wood product G321 321
Nonmetallic mineral product G327 327
Primary metal G331 331
Iron and steel products G3311A2 3311,2
Fabricated metal product G332 332
Machinery G333 333
TXT
my(%hash, %numbers);
for(split "\n" => $str) {
my @fields = reverse split;
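# Only a purely numeric last column (e.g. "321" or "3311,2") is a number;
# a code such as "B50001" also contains digits and must not be shifted off.
if ($fields[0] =~ /^\d+(?:,\d+)*$/) {
my $number = shift @fields;
$numbers{$fields[0]} = $number;
}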
$hash{$fields[0]} = join ' ' => reverse @fields[1 .. $#fields];
}
my $code = 'GMF';
print "$code: $hash{$code}.\n";
__output__
GMF: Manufacturing (NAICS).
So that should give you the hash you want.
HTH
_________ broquaint
That doesn't work. Many of the text descriptions have spaces in them, plus there are spaces at the beginning of most of the lines.
As this is fixed length, use unpack. If you care about the indenting, you'll also need some logic to make sure you don't keep that whitespace at the beginning of your description. If you don't, it's easy enough to strip off the indenting whitespace. To get exactly what you wanted, do something like:
# Change these to the actual column widths. Use a star at the end to get the rest.
my @column_widths = (###, ###, '*');
my $unpack_spec = join ' ', map { "A$_" } @column_widths;
my %codes;
while (<IN_FILE>)
{
chomp;
my ($desc, $code, $other_thingy) = unpack $unpack_spec, $_;
# If you want to remove the pre-pended whitespace on the description ...
$desc =~ s/^\s+//;
$codes{$code} = {
Description => $desc,
Other_Thing => $other_thingy,
};
}
my $choice = 'GMF';
print "$choice: $codes{$choice}{Description}\n";
------ We are the carpenters and bricklayers of the Information Age.
The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6
Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.
Re: Help me write a good reg-exp for this text
by shenme (Priest) on Sep 05, 2003 at 16:44 UTC
The example key you give ('GMF') to show what you want to key on is perhaps too simple? What do you want to use for one of your more complicated lines like: (some spaces removed)
Iron and steel products G3311A2 3311,2
What would you be wanting to use as the key? I might imagine at least three possibilities:
- everything in those fixed columns, thus 'G3311A2 3311,2' including the spaces between the strings,
- just one part of all that, such as just 'G3311A2',
- both parts individually as alternate keys 'G3311A2' and '3311,2'
dragonchild uses $other_thingy to capture the '3311,2' separately. But what really should be done with that part?
Iron and steel products G3311A2 3311,2
The key should be 'G3311A2' and the value should be 'Iron and steel products'. The '3311,2' information is not needed by me.
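A minimal sketch of that requirement follows: key on the code, keep the description, ignore the trailing number. It assumes columns are separated by two or more spaces while words inside a description are separated by exactly one; the sample lines and spacing are invented for the example, so check the assumption against the real page.

```perl
use strict;
use warnings;

# Hypothetical sample lines; the spacing is made up for the example.
my @lines = (
    '      Iron and steel products          G3311A2      3311,2',
    '  Manufacturing (NAICS)                GMF',
);

my %description_hash;
for my $line (@lines) {
    (my $trimmed = $line) =~ s/\A\s+|\s+\z//g;     # strip both ends
    # Split on runs of two or more spaces; any third column is simply ignored.
    my ($desc, $code) = split /\s{2,}/, $trimmed;
    next unless defined $code;                     # skip lines without a code column
    $description_hash{$code} = $desc;
}

print "$_: $description_hash{$_}.\n" for sort keys %description_hash;
```

If the file really is fixed width, unpack avoids depending on the spacing assumption at all.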
So if the data format really _is_ fixed-width columns then something like dragonchild's code would work, using
my @column_widths = (57, 17, '*');
for the widths (check against the real column widths). Although to remove the leading _and_ trailing spaces from each piece I'd do something like:
my ($desc, $code, $other_thingy) = unpack $unpack_spec, $_;
foreach my $piece ($desc, $code, $other_thingy) {
$piece =~ s/^\s+//;
$piece =~ s/\s+$//;
}
(I think that's right, hmmm, testing with dragonchild's modified code ....)
# Change these to the actual column widths. Use a star at the end to get the rest.
my @column_widths = ( 57, 17, '*');
my $unpack_spec = join ' ', map { "A$_" } @column_widths;
my %codes;
while (<DATA>)
{
chomp;
my ($desc, $code, $other_thingy) = unpack $unpack_spec, $_;
foreach my $piece ($desc, $code, $other_thingy) {
$piece =~ s/^\s+//;
$piece =~ s/\s+$//;
}
$codes{$code} = {
Description => $desc,
Other_Thing => $other_thingy,
};
}
my $choice = 'GMF';
print "$choice: $codes{$choice}{Description}\n";
$choice = 'G3311A2';
print "$choice: $codes{$choice}{Description}\n";
__DATA__
Total index B50001
Crude processing (capacity) B5610C
Primary & semifinished processing (capacity) B562A3C
Finished processing (capacity) B5640C
Manufacturing ("SIC") B00004
Manufacturing (NAICS) GMF
Durable manufacturing (NAICS) GMFD
Wood product G321 321
Nonmetallic mineral product G327 327
Primary metal G331 331
Iron and steel products G3311A2 3311,2
Fabricated metal product G332 332
Machinery G333 333
_ _ OUTPUT _ _
GMF: Manufacturing (NAICS)
G3311A2: Iron and steel products
Re: Help me write a good reg-exp for this text
by graff (Chancellor) on Sep 06, 2003 at 02:45 UTC
Here's one more (regex only) approach, which no one seems to have tried yet. I'm basing this on the assumption that the second column (which contains your intended hash key) is always separated from the first column by at least three spaces, whereas word separations within the first column are always single spaces:
#!/usr/bin/perl -w
use strict;
my %hash;
while (<DATA>) {
s/^\s+//; # remove leading whitespace
if ( /(.*?)\s{3,}(\S+)/ ) {
my ($val,$key) = ($1,$2);
$hash{$key} = $val;
}
}
print map { "$_ : $hash{$_}\n" } sort keys %hash;
__DATA__
Total index B50001
Crude processing (capacity) B5610C
Primary & semifinished processing (capacity) B562A3C
Finished processing (capacity) B5640C
Manufacturing ("SIC") B00004
Manufacturing (NAICS) GMF
Durable manufacturing (NAICS) GMFD
Wood product G321 321
Nonmetallic mineral product G327 327
Primary metal G331 331
Iron and steel products G3311A2 3311,2
Fabricated metal product G332 332
Machinery G333 333
__OUTPUT__
B00004 : Manufacturing ("SIC")
B50001 : Total index
B5610C : Crude processing (capacity)
B562A3C : Primary & semifinished processing (capacity)
B5640C : Finished processing (capacity)
G321 : Wood product
G327 : Nonmetallic mineral product
G331 : Primary metal
G3311A2 : Iron and steel products
G332 : Fabricated metal product
G333 : Machinery
GMF : Manufacturing (NAICS)
GMFD : Durable manufacturing (NAICS)
I have to confess, I've been pretty slow to get comfortable with unpack(), myself. It is certainly one of the more difficult functions to grasp (and its description in perlfunc is still a bit hard to follow).