
Re: Parsing and extracting data from files.

by FloydATC (Deacon)
on Apr 09, 2013 at 13:47 UTC (#1027743)

in reply to Parsing and extracting data from files.

One should not forget the usefulness of spending some time writing code that is doomed to fail because one didn't have the necessary skill and experience. I have no idea how many wheels I've reinvented (and scrapped) over the years but the result is that I have a pretty good idea how those wheels work and how they do not.

Cataloging a CD database is an excellent place to start, as long as you understand that your first solutions will look nothing like the final product, and you're prepared to see this as a positive thing rather than a waste of time.

Start with the database design. Figure out how you want to model your data, then write code to produce it. Finally, write code to do queries and reports. You'll want to start out with just regular expressions, DBI and perhaps a hash or two. And whatever you do, don't worry about character sets and encoding/decoding until you've built up some confidence. That would just discourage you.
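As a rough illustration of the "regular expressions and a hash or two" starting point, the sketch below pulls `KEY "value"` pairs out of a few lines of text into a plain hash. The sample lines and the `parse_pairs` helper are made up for this example and are not taken from a real TOC file; DBI is deliberately left out so that the snippet runs with core Perl only.

```perl
use strict;
use warnings;

# Hypothetical helper: collect KEY "value" pairs from a block of text.
# Lines that do not look like a pair are silently skipped.
sub parse_pairs {
    my ($text) = @_;
    my %field;
    for my $line (split /\n/, $text) {
        # An upper-case keyword, whitespace, then a double-quoted string.
        if ($line =~ /^\s*([A-Z_]+)\s+"([^"]*)"\s*$/) {
            $field{$1} = $2;
        }
    }
    return \%field;
}

# Made-up sample input, for illustration only.
my $sample = <<'END';
TITLE "Chopsticks"
PERFORMER "Pascal Roge"
ISRC "AABBB1122222"
END

my $track = parse_pairs($sample);
print "$_ => $track->{$_}\n" for sort keys %$track;
```

A hash like this is probably enough to experiment with before any database code is written; the same keys could later become column names.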

-- Time flies when you don't know what you're doing

Re^2: Parsing and extracting data from files.
by WhiteTraveller (Novice) on Apr 09, 2013 at 21:30 UTC

    Hello again.

    I only code because of the journey. I would say that, in the last 20 years, there has not been a single piece I am proud of. However, I generally manage to hack a workable solution, and the achievement is usually enough.

    The end result is fairly well planned, as is the SQL to get it there.

    my %album = (
        name    => "Collection Name",
        upc_ean => "123456789012",
        disc    => [{
            title   => "CD 1",
            disc_id => "12345",
            track   => [{
                title => "Track 1 title",
                isrc  => "aa-aaa-13-12345",
            }],
        }],
    );

    This summarises the structure well enough. Add in sections of CD-Text where appropriate. Few are mandatory, as far as I am concerned.
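To make the shape of that structure concrete, here is a small sketch that walks the nested hash and produces one flat row per track, which is roughly what an SQL insert would need. The `track_rows` helper and the column order are assumptions made for this example only; the values are the placeholders from the structure above.

```perl
use strict;
use warnings;

# Same shape as the %album structure above; values are placeholders.
my %album = (
    name    => "Collection Name",
    upc_ean => "123456789012",
    disc    => [{
        title   => "CD 1",
        disc_id => "12345",
        track   => [{ title => "Track 1 title", isrc => "aa-aaa-13-12345" }],
    }],
);

# Hypothetical helper: flatten album -> disc -> track into one row per track.
# Each row is [ album name, disc number, track number, track title, ISRC ].
sub track_rows {
    my ($album) = @_;
    my @rows;
    my $disc_no = 0;
    for my $disc (@{ $album->{disc} || [] }) {
        $disc_no++;
        my $track_no = 0;
        for my $track (@{ $disc->{track} || [] }) {
            $track_no++;
            push @rows, [
                $album->{name}, $disc_no, $track_no,
                $track->{title}, $track->{isrc} // '',   # ISRC is optional
            ];
        }
    }
    return @rows;
}

print join("\t", @$_), "\n" for track_rows(\%album);
```

Disc and track numbers are derived from array position here, so they do not need to be stored in the hashes themselves.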

    My current reading centres on either Marpa, Parse::RecDescent or Regexp::Grammars. I suspect one of these will do what I am looking for...

    I'll update later, when I have something cobbled together...

      Well, after some reading, I ended up attempting RecDescent (only because it ended up 1st on my list), and have started as follows:

      #!/usr/bin/perl
      use vars qw(%VARIABLE);
      use Data::Dumper;
      use Parse::RecDescent;

      $::RD_ERRORS = 1;
      $::RD_HINT   = 1;
      $::RD_WARN   = 1;
      $::RD_TRACE  = 1;

      my %album = (
          Title     => 'The Collected Works of Mozart',
          Performer => 'The Royal Symphonic Orchestra',
          Barcode   => '1234567890123',
      );
      my %hash1 = (
          Title => 'Disk 1',
          Type  => 'Audio',
          Foo   => 'bar',
      );
      my %hash2 = (
          Title => 'Disk 2',
          Type  => 'Audio',
          Foo   => 'FooBar',
      );
      my %hash3 = (
          Title     => 'Chopsticks',
          Performer => 'Pascal Roge',
          ISRC      => 'AABBB1122222',
      );                                          # Example data for illustration purposes.
      $album{'Disc'}[0] = \%hash1;                # Example data, stored as Disc[0].
      $album{'Disc'}[1] = \%hash2;                # Example data, stored as Disc[1].
      $album{'Disc'}[1]{'Track'}[0] = \%hash3;    # Disc 2 Track 1

      #=========== Start of actual parsing code ====================================
      my $file = '/home/Media/Music/tmp/01.toc';
      {
          local $/;
          undef $/;
          open my $grammarfh, '<', 'TOC.bnf' or die "Arghh! Cannot open grammar.\n";
          $grammar = <$grammarfh>;
          open my $fh, '<', $file or die "Arghh! Cannot open file.\n";
          $text = <$fh>;
      }
      my $parser = new Parse::RecDescent($grammar) || die "Bad Grammar!\n";
      my $cd = $parser->contents($text);
      push @{$album{'Disc'}}, $cd;    # Not quite right! Check.. Copy data, not store a reference.

      print Dumper(\%album);
      print Dumper(\%VARIABLE);       # Perhaps we should store the parsed info in here?
      print Dumper($cd);

      sub subroutine {
          shift;
          print "Entered Subroutine\n";
          my ($foo, $bar) = @_;
          return $foo;
      }

      It has been drafted specifically to load the grammar from an external file. That makes the grammar a little easier to edit, and also lets me reuse the same code for parsing a CUE file later. However, it is the grammar that is proving frustrating. This is what I have so far...

      #===============================================#
      # RecDescent grammar to parse a CD TOC file.    #
      #===============================================#
      {
          # Nothing here yet.
      }

      # Grammar:
      contents:   line(s?)
      #           <skip: qr/[^\S\n]/>
      line:       text { }
                | Parameter {$return = $item{'Parameter'};}
                | word foo { $main::VARIABLE{$item{'word'}}=$item{'foo'} } # not quite sure how this will be useful...
                | text
                | word { $return = $item{'word'}; }
                | BlankLine
      #         | Comment
                | <error>

      # Next line not quite right. Consider using $VARIABLE
      Parameter:  word qstring { $return = { $item{'word'} => $item{'qstring'} }; }

      # CD_TEXT is *always* followed by a <CR>, then LANGUAGE_MAP or LANGUAGE.
      # Should I be considering recursion here?
      text:       /CD_TEXT {/ { return main::subroutine(@item) }
      setting:    /LANGUAGE_MAP \d/ { print "Map\n"; }
                | /LANGUAGE \d/ { print "Lang\n"; }

      # Tokens:
      BlankLine:  <skip: q{}> /^\s+$/m
      Comment:    <skip: qr{\s* (/[*] .*? [*]/ \s*)*}x>
      word:       /\w+/
      msf:        /\d\d:\d\d:\d\d/
      newline:    /\n/
      number:     /\d+/
      qstring:    '"'/[^"]+/'"' { $return = $item[2]; }
      #qstring:   <perl_quotelike>    # See http://www.perlmonks.org/?node_id=485933
      #           { my ($marker, $quote, $text) = @{$item[0]}[0..2]; }
      foo:        /\d+.\d+.\d+/       # This will match both 14:43:00 and 38935137

      Apologies - it is quite awful at the moment, but I am too tired and confused to start tidying it up... If you have the time, I could do with a pointer or two. I have a feeling that I should be calling recursively to parse the CD_TEXT, but I am afraid I don't know RecDescent well enough.
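One way to picture the recursion being asked about, independent of Parse::RecDescent: the sketch below is a small hand-written recursive parser in core Perl, not a RecDescent grammar. It treats the input as keywords followed by either a quoted string or a `{ ... }` block, and handles a block by calling the same routine again. The function names (`tokenize`, `parse_body`, `parse_toc_text`) and the sample input are invented for this example; the input only loosely imitates the CD_TEXT and LANGUAGE names mentioned in the grammar comments and should not be read as the real TOC syntax. Bare keywords without a value are not supported.

```perl
use strict;
use warnings;

# Split the text into tokens: quoted strings, braces, and bare words.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    while ($text =~ /\G\s*(?:"([^"]*)"|([{}])|([^\s{}"]+))/gc) {
        if    (defined $1) { push @tokens, [ str  => $1 ] }
        elsif (defined $2) { push @tokens, [ $2   => $2 ] }
        else               { push @tokens, [ word => $3 ] }
    }
    $text =~ /\G\s*\z/gc
        or die "Unexpected input at offset " . (pos($text) // 0) . "\n";
    return \@tokens;
}

# Parse entries until a closing brace (or end of input at depth 0).
# An entry is  WORD... "string"  or  WORD... { entries }; the second form
# recurses, which is what allows blocks to nest to any depth.
sub parse_body {
    my ($tokens, $depth) = @_;
    my %node;
    while (@$tokens) {
        my $tok = shift @$tokens;
        if ($tok->[0] eq '}') {
            die "Unbalanced '}'\n" if $depth == 0;
            return \%node;
        }
        die "Expected a keyword, got '$tok->[1]'\n" unless $tok->[0] eq 'word';
        my @name = ($tok->[1]);
        # Collect extra bare words, e.g. the "0" in "LANGUAGE 0".
        push @name, (shift @$tokens)->[1]
            while @$tokens && $tokens->[0][0] eq 'word';
        my $key  = join ' ', @name;
        my $next = shift @$tokens or die "Unexpected end of input after '$key'\n";
        if    ($next->[0] eq 'str') { $node{$key} = $next->[1] }
        elsif ($next->[0] eq '{')   { $node{$key} = parse_body($tokens, $depth + 1) }
        else                        { die "Unexpected '}' after '$key'\n" }
    }
    die "Missing '}'\n" if $depth > 0;
    return \%node;
}

sub parse_toc_text {
    my ($text) = @_;
    return parse_body(tokenize($text), 0);
}

# Made-up sample input, for illustration only.
my $sample = <<'END';
CD_TEXT {
  LANGUAGE 0 {
    TITLE "Disk 1"
    PERFORMER "Pascal Roge"
  }
}
END

my $tree = parse_toc_text($sample);
print $tree->{CD_TEXT}{'LANGUAGE 0'}{TITLE}, "\n";
```

In a RecDescent grammar the same idea would presumably appear as a rule for a block that refers to itself through the rule for its contents, rather than as an action that calls out to a Perl subroutine.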
