http://www.perlmonks.org?node_id=588246

smist has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I have some data in a text file I want to process and insert into a database. It's in the format of several columns that are separated by whitespace. Some columns do not have any values. This is where my problem is. I can't figure out how to parse the data so I can insert the values from the columns that do have values into the corresponding columns in the database. Below is a link to the file I want to parse. Can someone please give me some advice on how to do something like this? Thanks! http://www.pointstreak.com/prostatstext/UHL/boxscore_340251.txt

Replies are listed 'Best First'.
Re: parsing question
by BrowserUk (Patriarch) on Dec 07, 2006 at 04:01 UTC

    As the file is in several distinct sections each with it's own format, first break the file into those sections. As it's also a fairly small file, slurp it and split it on the section separator:

    ## Slurp the file and break into sections my @sections = split '-{103}', do{ local $/; <> }; close *ARGV;

    Once you have the sections separated, you can treat each one differently.

    Rather than having to count all the spaces and manually construct the unpack formats for dealing with them--which is a PIA and they also might change--notice that each section of stats is preceeded by a header line, and that the left hand edge of the column headers forms a left edge limit for the data in the columns.

    Also notice that although some of the column titles are multiple words, every title is preceeded by at least two spaces, whilst the multi-word titles themselves contain only single spaces.

    That information allows you to use the header lines to construct the unpack formats programmically. The following subroutine takes a header line, uses it to discover the column boundaries and then uses those to construct a format:

    sub buildFmt { my $templ = shift; my @cols; push @cols, $-[ 0 ] while $templ =~ m[(?<=\s\s)(?=\S)]gc; my $p = 0; my $fmt = ''; $fmt .= 'A' . ( $_ - $p ) . ' ' and $p = $_ for @cols; return $fmt; }

    This can be reused for all the columnised (sub)sections in the file.

    By way of example, this is how to use it break down the four subsections of the 'PLAYER GAME STATISTICS':

    ## Section 2 ## Break the section into lines my @section2 = split "\n", $sections[ 1 ]; ## discard header lines; shift @section2 for 1 .. 3; ## Away ## Construct the format from the header my $fmt = buildFmt( shift @section2 ); ## Use it to parse the Away player game stats my @awayStats; push @awayStats, [ unpack $fmt, shift @section2 ] while $section2[ 0 ] + =~ m[\S]; print "@$_" for @awayStats; ## Discard blank lines shift @section2 while $section2[ 0 ] !~ m[\S]; ## Away totals ... same two steps again $fmt = buildFmt( shift @section2 ); my @awayTotals = unpack $fmt, shift @section2; print "@awayTotals"; ## Discard blank lines shift @section2 while $section2[ 0 ] !~ m[\S]; ## home ... and again $fmt = buildFmt( shift @section2 ); ## They could vary. my @homeStats; push @homeStats, [ unpack $fmt, shift @section2 ] while $section2[ 0 ] + =~ m[\S]; print "@$_" for @homeStats; ## Discard blank lines shift @section2 while $section2[ 0 ] !~ m[\S]; ## Home totals ... and again $fmt = buildFmt( shift @section2 ); my @homeTotals = unpack $fmt, shift @section2; print "@homeTotals";

    Parsing the other sections (with columnised data), is just a repeat of the above.

    The code all together as far as I've taken it:

    The output from section 2 debugging lines:

    c:\test>588245 boxscore_340251.txt 4 | Drevitch, Scott | | 1 | +1 | 4 | | | | | | | | | | | +| | 8 | Mann, Chris | | 1 | +1 | 2 | 4 | 2 | | | | | | | | | | + | 9 | Anderson, Erik | 1 | | +1 | 4 | | | | | | | | | | | | + | 11 | Lefebvre, Marc | | | -1 | | | | | | | | | | | | | + | 14 | Scott, Mark | | 1 | 0 | 5 | | | | | | | | | | | | +| 17 | Wray, Scott | | | 0 | 10 | 4 | 2 | | | | | | | | | | + | 18 | Lazarev, Yevgeny | 1 | | 0 | 1 | | | | | | | | | | | + | | 20 | Miller, Derek | | | -2 | 1 | | | | | | | | | | | | + | 21 | Fitzpatrick, Chans | | 1 | 0 | | 2 | 1 | | | | | | | | + | | | 22 | Cullaton, Brent | | 1 | +1 | 4 | | | | | | | | | | | + | | 23 | Kotsopoulos, Tommy | | 1 | -1 | 4 | 5 | | 1 | | | | | | +| | | | 24 | Morelli, David | 1 | | 0 | 2 | | | | | | | | | | | +| | 27 | Littlejohn, Frank | 1 | 1 | +1 | 3 | 6 | 3 | | | | | | 1 | + | | | | 28 | Lyke, R.C. | | | +1 | | | | | | | | | | | | | | 29 | Gajda, Tyson | | | 0 | | | | | | | | | | | | | | 30 | Tebbs, Kris | | | 0 | | | | | | | | | | | | | | 32 | Tidball, Curtis | | | 0 | | | | | | | | | | | | | + | 44 | Lupul, Dale | | | -1 | 5 | | | | | | | | | | | | | TOTALS | 5 | 7 | +1 | 46 | 23 | 9 | 1 | | | | | 1 | | | 1 | +1 | 2 | Lupandin, Andrei | | 3 | +1 | 1 | | | | | | | | | | | + | | 4 | Currie, Brent | | | -1 | | 2 | 1 | | | | | | | | | | + | 5 | Yoder, Jami | | | -1 | | 2 | 1 | | | | | | | | | | | 7 | Radoslovich, Matt | 1 | | -1 | 2 | | | | | | | | | | | + | | 8 | Pilkington, Brett | | 1 | +1 | 4 | | | | | | | | | | | + | | 9 | Granbois, Travis | | | 0 | | | | | | | | | | | | | + | 12 | Nadeau, Patrick | | | -2 | 1 | | | | | | | | | | | +| | 13 | Parsons, Don | 2 | 1 | +2 | 4 | | | | | | | | 1 | | | +| | 14 | Starke, Sean | | | 0 | 3 | | | | | | | | | | | | +| 17 | Stewart, Blake | | 1 | -2 | 1 | | | | | | | | | | | +| | 18 | Chwedoruk, Justin | | | 0 | 3 | | | | | | | | | | | + | | 19 | Woollard, Chad | 1 | 1 | +2 | 3 | 2 | 1 | | | | | | | | +| | | 20 | Durdin, Sergei | | | -2 | 2 | | | | | | | | | | | | + | 24 | Harloff, Nick | | | 0 | 3 | 4 | 2 | | | | | | | | | +| | 25 | Stauffacher, Luke | | | 0 | 4 | 7 | 1 | 1 | | | | | | | + | | | 27 | Wathier, Mathieu | | | +1 | 3 | | | | | | | | | | | + | | 41 | Sikich, Zach | | 1 | 0 | | | | | | | | | | | | | +| 83 | Tapp, Jason | | | 0 | | | | | | | | | | | | | | TOTALS | 4 | 8 | -2 | 34 | 17 | 6 | 1 | | | | | 1 | | | | +|

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Could someone please explain the following line? I haven't been able to figure out what the split function does here.
      my @sections = split '-{103}', do{ local $/; <> };

        This sets the default line-separator (held in "$/", normally set to "\n") to undef, so that whatever comes in from STDIN (<>) will get slurped up, split on the string "-{103}", and assigned to @sections.

        A short example... put the following into a script named "test.pl":

        #!/usr/bin/perl -w use strict; my @sections = split 'FOO', do {local $/; <>}; close *ARGV; print join("\n", @sections);

        Then, put the following into a file called "test.txt":

        FOObarFOObarFOObarFOObarFOObar FOObarFOObarFOObarFOObarFOObar FOObarFOObarFOObarFOObarFOObar

        Then, run the following at the command-line:

        cat test.txt | ./test.pl
Re: parsing question
by ikegami (Patriarch) on Dec 07, 2006 at 02:54 UTC
    unpack would work great here.
    my $format = join '', map { "A$_" } 3, # Player number 21, # Player name 4, # G 4, # A 6, # +/- 5, # S 6, # PIM 6, # MIN 6, # MAJ 5, # 10 5, # GM 5, # GR 5, # MT 6, # PPG 6, # SHG 6, # GWG 6, # OTG 6, # UAG 6, # ENG 5; # FG while (<$fh>) { next unless /^\d+\s/; # Skip headers. my @record = unpack($format, $_); # Parse data. $record[4] =~ s/^\s+//; # Fix "+/-" formatting. $sth->execute(@record); # Add data to the database. }
Re: parsing question
by grep (Monsignor) on Dec 07, 2006 at 03:16 UTC
    First off you need to make sure you have permission to use the copyrighted material you linked to. I would use unpack.

    use strict; use warnings; use Data::Dumper; my @fields = qw/num name G A plus S PIM MIN MAJ ten GM GR MT PPG SH +G GWG OTG UAG ENG FG/; my $pattern = "A3 A21 A4 A5 A5 A5 A5 A5 A5 A5 A5 A5 A5 A5 A5 + A5 A5 A5 A5 A5"; my @data; while (my $line = <DATA>) { chomp($line); next if ($line !~ /^\d+ /); my %hash; @hash{@fields} = unpack($pattern,$line); # unpack the data accordi +ng to the pattern push(@data,\%hash); } print Dumper \@data; ## UPDATE: Reformated code

    grep
    XP matters not. Look at me. Judge me by my XP, do you?

      First off you need to make sure you have permission to use the copyrighted material you linked to

      Copyright doesn't protect statistics. It does copy documents containing statistics (such as the linked text file), but we're not making a copy of the text file[*].

      There could be contractual (licensing, Terms of Service) issues, but no Copyright issues.

      * — Well, technically the file is copied numerous times (e.g. from the TCP stream into RAM), but Copyright allows those copies since they are necessary steps to viewing the file.

Re: parsing question
by wjw (Priest) on Dec 07, 2006 at 03:36 UTC
    Am wondering if your data base is only taking the player stats? Or are you wanting the break-downs to go into your DB as well... goalie stats, team stats etc...?

    ...the majority is always wrong, and always the last to know about it...

      I need the player stats, including the goalies. I don't need the team stats. What I'm working on is creating a very simple fantasy hockey league for the league my local team is in. It will be a free league to play in for the fans.
[OT]Re: parsing question
by japhy (Canon) on Dec 07, 2006 at 05:08 UTC
    Out of curiousity, since my Google-fu is failing me, what are the "GM", "MT", and "FG" statistics? I know all the others.

    Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
    How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
      GM is Game Misconduct, I believe. I'm not sure of the other two. I'm not using all of the stats.