PerlMonks  

Parsing and extracting data from files.

by WhiteTraveller (Novice)
on Apr 04, 2013 at 21:06 UTC
WhiteTraveller has asked for the wisdom of the Perl Monks concerning the following question:

Greetings, Venerable Monks.

I, like many, am seeking enlightenment. However, I come to your gates not seeking answers, but instead looking for guidance and advice on which path I should take. If you have a moment, I will tell you my tale...

I have been looking through my CD collection, and would really like to implement an automatic way of slurping data, and storing it in a local database. As these are the first steps, I do NOT want to do anything online - I am simply using the data available on each CD. Thus, I have a .TOC file for each CD to parse, process, and store.

So far, I can locate, and open, each file. I can also connect, and write, to my database. I am now at the stage of parsing each data-file. It is, of course, mandatory that I do this step in an enlightened way, with efficient simplicity.

Thus I come before you. How would you advise I start on this journey?

If it is of help, here are some links to the structure and layout of the file:
http://lavica.fesb.hr/cgi-bin/info2html?%28libcdio%29CDRDAO%20TOC%20Format
http://www.xbloome.com/stuff/tutorial/audio_master_cd/audiocd.toc

(If relevant, I also intend, at a later stage, to get the same script to parse .cue files. See:
http://www.hep.by/gnu/libcdio/libcdio_23.html
http://wiki.hydrogenaudio.org/index.php?title=Cue_sheet
http://en.wikipedia.org/wiki/Cue_sheet_%28computing%29#Cue_sheet_syntax
http://digitalx.org/cue-sheet/examples/

)

Re: Parsing and extracting data from files.
by Anonymous Monk on Apr 05, 2013 at 03:54 UTC
Re: Parsing and extracting data from files.
by jaredor (Deacon) on Apr 05, 2013 at 18:11 UTC

    Hola, Initiate. Mine is not a direct answer to what seems to be your petition (elegant TOC file parsing), but I can't resist pointing out a dusty path that seems aligned exactly with the direction you are going: TheDamian teaching Perl OO via creating a CD::Music class.

    That path is a bit rocky and disused. Most monks seem to prefer following Moose tracks....

      Thank you, jaredor.

      Moose looks interesting - but is, I am afraid, beyond me at this stage. However, the link to Conway is most helpful.

      However, I was actually looking for advice on the parsing of the file. Unless there is a better way, I am likely to do a regexp on the first word of each line, and use a switch statement to subsequently process and store the data.

      Thanks again.

        You're welcome!

        Here's a direct answer for you, but it's a three-part answer: The first part is vague, the second, long, and the third....

        1. Check out the code in a CPAN module that parses a text file. I would not necessarily recommend any of the CSV modules, since those are probably pretty hairy and difficult to comb through. Something more like Text::Delimited is probably a good place to start for ideas. These smaller modules may or may not be as elegant as what TheDamian would write, but presumably they work and working code is always a good place to start from.
        2. An excellent book to read for perl and CS ideas in general is Higher Order Perl, which is online and free. It builds up to parsing, so is not a quick solution for you to adopt. (And after you read HOP, then Moose will not be daunting :-)
        3. Lastly, there is nothing wrong with what you propose to do for parsing your files. Do that until it fails, then pose the problem back here. PM tends to be supportive of petitioners who arrive at the gates with actual code and you will find no lack of advice on how to make things better. The relative paucity of responses to this question is likely because you asked a general question about parsing. The answers to that could range from zen-like koans to a book. (I'm all about the trivia, so that is what I dropped on you.) I'm assuming that you are doing this as an exercise in learning perl, so in that case, go out there and reinvent some wheels!
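        For what it is worth, the "regexp on the first word, then switch" plan from the question might look roughly like the sketch below. It is only an illustration, not anyone's actual code: the sample lines, the %handler table and the helper names unquote() and current_track() are all made up here, a hash of code refs stands in for the switch statement, and a real .toc file has many more keywords (including nested CD_TEXT blocks) than this handles.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical sample input; a real .toc file carries more keywords.
my @lines = (
    'CD_DA',
    'CATALOG "1234567890123"',
    '// Track 1',
    'TRACK AUDIO',
    'ISRC "AABBB1122222"',
    'TITLE "Chopsticks"',
);

my %disc = ( track => [] );

# Dispatch table: first word of the line => handler for the rest of it.
my %handler = (
    CATALOG => sub { $disc{upc_ean} = unquote( $_[0] ) },
    TRACK   => sub { push @{ $disc{track} }, { mode => $_[0] } },
    ISRC    => sub { current_track()->{isrc}  = unquote( $_[0] ) },
    TITLE   => sub { current_track()->{title} = unquote( $_[0] ) },
);

# Strip one pair of surrounding double quotes, if present.
sub unquote {
    my ($s) = @_;
    $s =~ s/^"(.*)"$/$1/;
    return $s;
}

# Most recent track, or the disc itself if no TRACK line has been seen yet.
sub current_track {
    return @{ $disc{track} } ? $disc{track}[-1] : \%disc;
}

for my $line (@lines) {
    next if $line =~ m{^\s*(?://.*)?$};    # skip blank lines and // comments
    my ( $word, $rest ) = $line =~ /^\s*(\w+)\s*(.*?)\s*$/ or next;
    my $code = $handler{$word} or next;    # ignore keywords not handled yet
    $code->($rest);
}

print Dumper( \%disc );
```

        Adding support for another keyword is then one more entry in %handler, which keeps the main loop unchanged.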
Re: Parsing and extracting data from files.
by FloydATC (Hermit) on Apr 09, 2013 at 13:47 UTC
    One should not forget the usefulness of spending some time writing code that is doomed to fail because one didn't have the necessary skill and experience. I have no idea how many wheels I've reinvented (and scrapped) over the years but the result is that I have a pretty good idea how those wheels work and how they do not.

    Cataloging a CD database is an excellent place to start, as long as you understand that your first solutions will look nothing like the final product, and you're prepared to see this as a positive thing rather than a waste of time.

    Start with the database design. Figure out how you want to model your data, then write code to produce it. Finally, write code to do queries and reports. You'll want to start out with just regular expressions, DBI and perhaps a hash or two. And whatever you do, don't worry about character sets and encoding/decoding until you've built up some confidence. That would just discourage you.

    -- Time flies when you don't know what you're doing

      Hello again.

      I only code because of the journey. I would say that, in the last 20 years, there has not been a single piece I am proud of. However, I generally manage to hack a workable solution, and the achievement is usually enough.

      The end result is fairly well planned - as is the sql to get it there.

      my %album = (
          name    => "Collection Name",
          upc_ean => "123456789012",
          disc    => [{
              title   => "CD 1",
              disc_id => "12345",
              track   => [{
                  title => "Track 1 title",
                  isrc  => "aa-aaa-13-12345",
              }],
          }],
      );

      This summarises the structure well enough. Add in sections of CD-Text where appropriate. Few are mandatory, as far as I am concerned.
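      As a rough illustration of how that nested structure could be walked on its way into the database, here is a minimal sketch that flattens it into one row per track. The row field names (disc_no, track_no and so on) are invented for the example and are not the planned schema; the rows would still need to be handed to whatever insert statements the real design uses.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %album = (
    name    => "Collection Name",
    upc_ean => "123456789012",
    disc    => [{
        title   => "CD 1",
        disc_id => "12345",
        track   => [{
            title => "Track 1 title",
            isrc  => "aa-aaa-13-12345",
        }],
    }],
);

# Flatten into one row per track. Missing disc/track lists are treated
# as empty, since few of the fields are mandatory.
my @rows;
my $disc_no = 0;
for my $disc ( @{ $album{disc} || [] } ) {
    $disc_no++;
    my $track_no = 0;
    for my $track ( @{ $disc->{track} || [] } ) {
        $track_no++;
        push @rows, {
            album    => $album{name},
            disc_no  => $disc_no,
            disc     => $disc->{title},
            track_no => $track_no,
            track    => $track->{title},
            isrc     => $track->{isrc},    # optional, may be undef
        };
    }
}

printf "%s / %s #%d: %s\n", @{$_}{qw(album disc track_no track)} for @rows;
```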

      My current reading centres on either Marpa, Parse::RecDescent or Regexp::Grammars. I suspect one of these will do what I am looking for...

      I'll update later, when I have something cobbled together...

        Well, after some reading, I ended up attempting RecDescent (only because it ended up 1st on my list), and have started as follows:

        #!/usr/bin/perl
        use vars qw(%VARIABLE);
        use Data::Dumper;
        use Parse::RecDescent;

        $::RD_ERRORS = 1;
        $::RD_HINT   = 1;
        $::RD_WARN   = 1;
        $::RD_TRACE  = 1;

        my %album = (
            Title     => 'The Collected Works of Mozart',
            Performer => 'The Royal Symphonic Orchestra',
            Barcode   => '1234567890123',
        );
        my %hash1 = ( Title => 'Disk 1', Type => 'Audio', Foo => 'bar', );
        my %hash2 = ( Title => 'Disk 2', Type => 'Audio', Foo => 'FooBar', );
        my %hash3 = ( Title => 'Chopsticks', Performer => 'Pascal Roge', ISRC => 'AABBB1122222', );

        # Example data for illustration purposes.
        $album{'Disc'}[0] = \%hash1;              # Example data, stored as Disc[0].
        $album{'Disc'}[1] = \%hash2;              # Example data, stored as Disc[1].
        $album{'Disc'}[1]{'Track'}[0] = \%hash3;  # Disc 2 Track 1

        #=========== Start of actual parsing code ====================================
        my $file = '/home/Media/Music/tmp/01.toc';
        {
            local $/;
            undef $/;
            open my $grammarfh, '<', 'TOC.bnf' or die "Arghh! Cannot open grammar.\n";
            $grammar = <$grammarfh>;
            open my $fh, '<', $file or die "Arghh! Cannot open file.\n";
            $text = <$fh>;
        }

        my $parser = new Parse::RecDescent($grammar) || die "Bad Grammar!\n";
        my $cd = $parser->contents($text);
        push @{$album{'Disc'}}, $cd;    # Not quite right! Check.. Copy data, not store a reference.

        print Dumper(\%album);
        print Dumper(\%VARIABLE);       # Perhaps we should store the parsed info in here?
        print Dumper($cd);

        sub subroutine {
            shift;
            print "Entered Subroutine\n";
            my ($foo, $bar) = @_;
            return $foo;
        }
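        On the "copy data, not store a reference" note in the script: one possible way to do that is dclone() from the core Storable module, which makes a deep copy of a nested structure. The sketch below uses made-up data rather than real parser output, and only shows the difference between pushing the reference and pushing a clone.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Storable qw(dclone);

my %album;

# Stand-in for what the parser might return (hypothetical data).
my $cd = { Title => 'Disk 3', Track => [ { Title => 'Chopsticks' } ] };

push @{ $album{Disc} }, $cd;            # stores the same reference
push @{ $album{Disc} }, dclone($cd);    # stores an independent deep copy

# A later change to $cd shows through the shared reference only.
$cd->{Track}[0]{Title} = 'Changed later';

print $album{Disc}[0]{Track}[0]{Title}, "\n";    # shared with $cd
print $album{Disc}[1]{Track}[0]{Title}, "\n";    # unaffected copy
```

        Whether a copy is needed at all depends on whether $cd gets reused or modified after the push; if it is thrown away straight after, storing the reference is harmless.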

        It has been drafted specifically to load the grammar from an external file. That makes the grammar a little easier to edit, and also lets me reuse the same code for parsing a CUE file later. However, it is the grammar that is proving frustrating. This is what I have so far...

        #===============================================#
        #   RecDescent grammar to parse a CD TOC file.  #
        #===============================================#

        {
            # Nothing here yet.
        }

        # Grammar:
        contents:   line(s?)
        #           <skip: qr/[^\S\n]/>

        line:       text { }
                |   Parameter   {$return = $item{'Parameter'};}
                |   word foo    { $main::VARIABLE{$item{'word'}}=$item{'foo'} }   # not quite sure how this will be useful...
                |   text
                |   word        { $return = $item{'word'}; }
                |   BlankLine
        #       |   Comment
                |   <error>

        # Next line not quite right. Consider using $VARIABLE
        Parameter:  word qstring { $return = { $item{'word'} => $item{'qstring'} }; }

        # CD_TEXT is *always* followed by a <CR>, then LANGUAGE_MAP or LANGUAGE.
        # Should I be considering recursion here?
        text:       /CD_TEXT {/ { return main::subroutine(@item) }

        setting:    /LANGUAGE_MAP \d/   { print "Map\n"; }
                |   /LANGUAGE \d/       { print "Lang\n"; }

        # Tokens:
        BlankLine:  <skip: q{}> /^\s+$/m
        Comment:    <skip: qr{\s* (/[*] .*? [*]/ \s*)*}x>
        word:       /\w+/
        msf:        /\d\d:\d\d:\d\d/
        newline:    /\n/
        number:     /\d+/
        qstring:    '"'/[^"]+/'"' { $return = $item[2]; }
        #qstring:   <perl_quotelike>    # See http://www.perlmonks.org/?node_id=485933
        #           { my ($marker, $quote, $text) = @{$item[0]}[0..2]; }
        foo:        /\d+.\d+.\d+/       # This will match both 14:43:00 and 38935137

        Apologies - it is quite awful at the moment, but I am too tired and confused to start tidying it up... If you have the time, I could do with a pointer or two. I have a feeling that I should be calling recursively to parse the CD_TEXT, but I am afraid I don't know RecDescent well enough.
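        On the recursion question, the idea can be tried out without RecDescent first. The sketch below is a hand-written recursive parser in plain Perl, not a Parse::RecDescent grammar: parse_block() calls itself whenever it meets an opening brace, which is the same shape a recursive grammar rule for CD_TEXT would take. The sample text, tokenize() and parse_block() are all invented for this illustration, and it ignores comments and anything outside the CD_TEXT block.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical CD_TEXT fragment for illustration.
my $text = <<'END';
CD_TEXT {
  LANGUAGE_MAP { 0 : EN }
  LANGUAGE 0 {
    TITLE "The Collected Works"
    PERFORMER "The Royal Symphonic Orchestra"
  }
}
END

# Split into quoted strings (kept whole), braces, colons, and bare words.
sub tokenize {
    my ($input) = @_;
    return $input =~ /"[^"]*"|[{}:]|\w+/g;
}

# Parse items until the closing brace of the current block.
# $tok is an array ref that is consumed from the front.
sub parse_block {
    my ($tok) = @_;
    my %block;
    while (@$tok) {
        my $key = shift @$tok;
        return \%block if $key eq '}';    # end of this block
        my $next = shift @$tok;
        die "Unexpected end of input after '$key'\n" unless defined $next;
        if ( $next eq ':' ) {                  # map entry:  0 : EN
            $block{$key} = shift @$tok;
        }
        elsif ( $next =~ /^"(.*)"$/s ) {       # text field: TITLE "..."
            $block{$key} = $1;
        }
        elsif ( $next eq '{' ) {               # section:    NAME { ... }
            $block{$key} = parse_block($tok);  # recurse into the block
        }
        else {                                 # section:    NAME 0 { ... }
            my $brace = shift @$tok;
            die "Expected '{' after '$key $next'\n"
                unless defined $brace && $brace eq '{';
            $block{"$key $next"} = parse_block($tok);
        }
    }
    return \%block;    # top level: input exhausted
}

my @tokens = tokenize($text);
my $tree   = parse_block( \@tokens );
print Dumper($tree);
```

        Each nested brace becomes a nested hash, so the language block ends up under $tree->{CD_TEXT}{'LANGUAGE 0'}.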

Node Type: perlquestion [id://1027032]
Front-paged by Corion