Re: parsing a bibliography

by BrowserUk (Pope)
on Dec 01, 2004 at 23:46 UTC (#411615)


in reply to parsing a bibliography

It'll probably need constant tweaking for different input, but it works for the sample data. I've no idea what $thing is, though.

#! perl -slw
use strict;
$^W=0;

while( <DATA> ) {
    my( $authors, $title, $thing, $pub, $date, $comment, $no ) = m/
        ^
        -( .*? \. ) \s(?=[A-Z][a-z])
        ( .+ ) \.\s+
        ( [^:]+? ) : \s+
        (\S+), \s+
        ( \d{4} ) [^.]* \. \s+
        ( [^\[]+ )
        \[ ( \d+ ) \] \s* $
    /x;

    printf "   Author:'%s'\n"
         . "    Title:'%s'\n"
         . "   Thing?:'%s'\n"
         . "Publisher:'%s'\n"
         . "     Date:'%4d'\n"
         . "  Comment:'%s'\n"
         . "       No:'%d'\n\n",
         $authors, $title, $thing, $pub, $date, $comment, $no;
}

__DATA__
-Lightfoot, J. B. St. Paul’s Epistle to the Philippians. Grand Rapids: Zondervan, 1953 (= 1913). Classic commentary by one of the greatest English-speaking NT scholars of all time. [2]
-Martin, Ralph P. Philippians. Rev. ed.; NCB. Grand Rapids: Eerdmans, 1980. Clear and informed. [2]
-O'Brien, Peter, T. Commentary on Philippians. NIGTC. Grand Rapids: Eerdmans, 1991. Thorough and insightful comments on the Greek text. [1]
-Silva, Moisés. Philippians. Baker Exegetical Commentary. Grand Rapids: Baker, 1993. Sound comments on the Greek text. [2]
-Barth, Markus and Helmut Blanke. The Letter to Philemon: A New Translation with Notes and Commentary. Grand Rapids: Eerdmans, 2000. With over 500 pages devoted to a letter that was probably written on a single sheet of papyrus, this work will be consulted by all who want the most thorough treatment of Philemon and avoided by the rest of us. [3]
-Bruce, F. F. The Epistles to the Colossians, to Philemon, and to the Ephesians. NIC. Grand Rapids: Eerdmans, 1984. See comments under “Commentaries on Ephesians.” [2]

Output:

[23:43:10.06] P:\test>411598
   Author:'Lightfoot, J. B.'
    Title:'St. Paul’s Epistle to the Philippians'
   Thing?:'Grand Rapids'
Publisher:'Zondervan'
     Date:'1953'
  Comment:'Classic commentary by one of the greatest English-speaking NT scholars of all time. '
       No:'2'

   Author:'Martin, Ralph P.'
    Title:'Philippians. Rev. ed.; NCB'
   Thing?:'Grand Rapids'
Publisher:'Eerdmans'
     Date:'1980'
  Comment:'Clear and informed. '
       No:'2'

   Author:'O'Brien, Peter, T.'
    Title:'Commentary on Philippians. NIGTC'
   Thing?:'Grand Rapids'
Publisher:'Eerdmans'
     Date:'1991'
  Comment:'Thorough and insightful comments on the Greek text. '
       No:'1'

   Author:'Silva, Moisés.'
    Title:'Philippians. Baker Exegetical Commentary'
   Thing?:'Grand Rapids'
Publisher:'Baker'
     Date:'1993'
  Comment:'Sound comments on the Greek text. '
       No:'2'

   Author:'Barth, Markus and Helmut Blanke.'
    Title:'The Letter to Philemon: A New Translation with Notes and Commentary'
   Thing?:'Grand Rapids'
Publisher:'Eerdmans'
     Date:'2000'
  Comment:'With over 500 pages devoted to a letter that was probably written on a single sheet of papyrus, this work will be consulted by all who want the most thorough treatment of Philemon and avoided by the rest of us. '
       No:'3'

   Author:'Bruce, F. F.'
    Title:'The Epistles to the Colossians, to Philemon, and to the Ephesians. NIC'
   Thing?:'Grand Rapids'
Publisher:'Eerdmans'
     Date:'1984'
  Comment:'See comments under “Commentaries on Ephesians.” '
       No:'2'

Examine what is said, not who speaks.
"But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
"Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo         "Efficiency is intelligent laziness." -David Dunham
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon


Re^2: parsing a bibliography
by jimbojones (Friar) on Dec 02, 2004 at 00:05 UTC
    I think "thing" is the location of the publisher. Seems that Grand Rapids, (MI?) has some serious biblical publishing houses ...

    - j

Re^2: parsing a bibliography
by patrickrock (Beadle) on Dec 02, 2004 at 00:50 UTC
    BrowserUK, oh. my. stars.

    I am in awe of you. Thanks ever so much.

Re^2: parsing a bibliography
by tachyon (Chancellor) on Dec 02, 2004 at 01:42 UTC

    One small but worthwhile modification would be to open an errors file and output non-matching records to it. You could then tune or post-process these.

    open ERRS, ">error.log" or die $!;
    while( ... ) {
        # the match runs against $_ in list context and returns the captures
        my ( ... ) = m/ ... /x;
        if ( $authors ) {
            # output as desired
        }
        else {
            print ERRS "$_\n";
        }
    }
    close ERRS;

    cheers

    tachyon
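For anyone who wants to try the error-log idea end to end, here is a minimal sketch that combines it with the regex from the parent post. The helper name parse_entry and the second sample record are invented for illustration, and a lexical filehandle with three-argument open stands in for the bareword ERRS above. Treat it as one possible arrangement, not the posters' own code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The pattern from the parent post, unchanged, precompiled with qr//.
my $ENTRY = qr/
    ^
    -( .*? \. ) \s(?=[A-Z][a-z])
    ( .+ ) \.\s+
    ( [^:]+? ) : \s+
    (\S+), \s+
    ( \d{4} ) [^.]* \. \s+
    ( [^\[]+ )
    \[ ( \d+ ) \] \s* $
/x;

# Hypothetical helper: returns the seven captured fields,
# or an empty list when the line does not match.
sub parse_entry {
    my ($line) = @_;
    return $line =~ $ENTRY;
}

my @lines = (
    q{-Martin, Ralph P. Philippians. Rev. ed.; NCB. Grand Rapids: Eerdmans, 1980. Clear and informed. [2]},
    q{-Anon. A made-up entry with no publisher or year. [9]},    # invented; should not match
);

my $log = 'error.log';
open my $errs, '>', $log or die "Cannot open $log: $!";

my ( @parsed, @rejected );
for my $line (@lines) {
    my @fields = parse_entry($line);
    if (@fields) {
        push @parsed, \@fields;          # keep the fields for later output
    }
    else {
        push @rejected, $line;
        print {$errs} "$line\n";         # log the record for later tuning
    }
}
close $errs or die "Cannot close $log: $!";

printf "%d parsed, %d logged to %s\n", scalar @parsed, scalar @rejected, $log;
```

Testing on whether the match returned anything at all, rather than on $authors alone, avoids treating a record as a failure just because one captured field happens to be false.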

Re^2: parsing a bibliography
by ww (Bishop) on Dec 02, 2004 at 16:48 UTC
    wonderful! Wish I could ++ your solution repeatedly! This writeup led to a "Eureka!" moment; the kind of haze-clearing that makes PM so valuable to beginners like me.

    request: please add to our understanding by commenting the lines of the regex, especially the part of line 8 reading

    (?=[A-Z][a-z])

    (grouped but non-capture??)

    and in line 13,

    ( [^\[]+ )

    which, as I read Owl (pocket ref), means capture one-or-more of a class including not-an-open_BRKT and close_BRKT ...which doesn't make sense to me, and -- more importantly, doesn't seem to WORK that way.

      The regex commented.

      my( $authors, $title, $thing, $pub, $date, $comment, $no ) = m/
          ^
          ## Author(s): Capture the minimum needed to satisfy that:
          ##  a) It ends with a '. '
          ##  b) And the next word is not an initial
          ##     IE: Lookahead and check the next word starts with
          ##     1 uppercase *and* one lowercase character.
          -( .*? \. ) \s(?=[A-Z][a-z])
          ## Title: Greedily capture something that ends with '. '
          ( .+ ) \.\s+
          ## Location: Non-greedily capture
          ##  Ends with a ': '.
          ##  Doesn't contain a ':'
          ( [^:]+? ) : \s+
          ## Publisher:
          ##  Single word followed by a ', '
          (\S+), \s+
          ## Year: Capture Four digits
          ##  Discard anything else up to '. '
          ( \d{4} ) [^.]* \. \s+
          ## Comment: Greedy capture non-'[' characters
          ##  Ie. Stop capturing when you see a '['
          ( [^\[]+ )
          ## No: Capture 1 (or more) digits between '[' & ']'
          ##  Discard any trailing space to the EOS.
          \[ ( \d+ ) \] \s* $
      /x;
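The two constructs asked about can also be tried in isolation. The snippet below is only an illustrative sketch: the variable names and the shortened sample strings are made up for the demonstration. It shows that (?=...) checks what follows without consuming or capturing it, and that [^\[] is a negated class whose only member is the escaped '[', with the final ']' merely closing the class.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = 'Lightfoot, J. B. St. Paul';

# (?=...) is a zero-width lookahead: the pattern inside must match at
# this point, but nothing is consumed and nothing is captured.
my ($authors) = $text =~ /^ ( .*? \. ) \s (?=[A-Z][a-z]) /x;

# $+[0] is the offset where the whole match ended, so the text the
# lookahead inspected is still there, unconsumed.
my $rest = substr $text, $+[0];

printf "authors: '%s'\n", $authors;    # authors: 'Lightfoot, J. B.'
printf "rest:    '%s'\n", $rest;       # rest:    'St. Paul'

# [^\[]+ means one or more characters that are not '['.
# Capturing therefore stops just before the '[' of the number.
my ( $comment, $no ) =
    'Clear and informed. [2]' =~ /^ ( [^\[]+ ) \[ ( \d+ ) \] /x;

printf "comment: '%s'\n", $comment;    # comment: 'Clear and informed. '
printf "no:      '%s'\n", $no;         # no:      '2'
```

Note that "J." is skipped as the end of the authors because the next word, "B.", is an initial and fails the lookahead, which is the behaviour described in the comments above.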

