Re: parsing a bibliography

by BrowserUk (Patriarch)
on Dec 01, 2004 at 23:46 UTC


in reply to parsing a bibliography

It'll probably need constant tweaking for different input, but it works for the sample data. I've no idea what $thing is, though.

#! perl -slw
use strict;
$^W=0;

while( <DATA> ) {
    my( $authors, $title, $thing, $pub, $date, $comment, $no ) = m/
        ^
        -( .*? \. ) \s(?=[A-Z][a-z])
        ( .+ ) \.\s+
        ( [^:]+? ) : \s+
        (\S+), \s+
        ( \d{4} ) [^.]* \. \s+
        ( [^\[]+ )
        \[ ( \d+ ) \] \s*
        $
    /x;

    printf
        "   Author:'%s'\n" .
        "    Title:'%s'\n" .
        "   Thing?:'%s'\n" .
        "Publisher:'%s'\n" .
        "     Date:'%4d'\n" .
        "  Comment:'%s'\n" .
        "       No:'%d'\n\n",
        $authors, $title, $thing, $pub, $date, $comment, $no;
}

__DATA__
-Lightfoot, J. B. St. Paul’s Epistle to the Philippians. Grand Rapids: Zondervan, 1953 (= 1913). Classic commentary by one of the greatest English-speaking NT scholars of all time. [2]
-Martin, Ralph P. Philippians. Rev. ed.; NCB. Grand Rapids: Eerdmans, 1980. Clear and informed. [2]
-O'Brien, Peter, T. Commentary on Philippians. NIGTC. Grand Rapids: Eerdmans, 1991. Thorough and insightful comments on the Greek text. [1]
-Silva, Moisés. Philippians. Baker Exegetical Commentary. Grand Rapids: Baker, 1993. Sound comments on the Greek text. [2]
-Barth, Markus and Helmut Blanke. The Letter to Philemon: A New Translation with Notes and Commentary. Grand Rapids: Eerdmans, 2000. With over 500 pages devoted to a letter that was probably written on a single sheet of papyrus, this work will be consulted by all who want the most thorough treatment of Philemon and avoided by the rest of us. [3]
-Bruce, F. F. The Epistles to the Colossians, to Philemon, and to the Ephesians. NIC. Grand Rapids: Eerdmans, 1984. See comments under “Commentaries on Ephesians.” [2]

Output:

[23:43:10.06] P:\test>411598
   Author:'Lightfoot, J. B.'
    Title:'St. Paul’s Epistle to the Philippians'
   Thing?:'Grand Rapids'
Publisher:'Zondervan'
     Date:'1953'
  Comment:'Classic commentary by one of the greatest English-speaking NT scholars of all time. '
       No:'2'

   Author:'Martin, Ralph P.'
    Title:'Philippians. Rev. ed.; NCB'
   Thing?:'Grand Rapids'
Publisher:'Eerdmans'
     Date:'1980'
  Comment:'Clear and informed. '
       No:'2'

   Author:'O'Brien, Peter, T.'
    Title:'Commentary on Philippians. NIGTC'
   Thing?:'Grand Rapids'
Publisher:'Eerdmans'
     Date:'1991'
  Comment:'Thorough and insightful comments on the Greek text. '
       No:'1'

   Author:'Silva, Moisés.'
    Title:'Philippians. Baker Exegetical Commentary'
   Thing?:'Grand Rapids'
Publisher:'Baker'
     Date:'1993'
  Comment:'Sound comments on the Greek text. '
       No:'2'

   Author:'Barth, Markus and Helmut Blanke.'
    Title:'The Letter to Philemon: A New Translation with Notes and Commentary'
   Thing?:'Grand Rapids'
Publisher:'Eerdmans'
     Date:'2000'
  Comment:'With over 500 pages devoted to a letter that was probably written on a single sheet of papyrus, this work will be consulted by all who want the most thorough treatment of Philemon and avoided by the rest of us. '
       No:'3'

   Author:'Bruce, F. F.'
    Title:'The Epistles to the Colossians, to Philemon, and to the Ephesians. NIC'
   Thing?:'Grand Rapids'
Publisher:'Eerdmans'
     Date:'1984'
  Comment:'See comments under “Commentaries on Ephesians.” '
       No:'2'

Examine what is said, not who speaks.
"But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
"Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo         "Efficiency is intelligent laziness." -David Dunham
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

Replies are listed 'Best First'.
Re^2: parsing a bibliography
by tachyon (Chancellor) on Dec 02, 2004 at 01:42 UTC

    One small but worthwhile modification would be to open an errors file and output non-matching records to it. You could then tune the regex against them or post-process them.

    open ERRS, ">error.log" or die $!;
    while( ... ) {
        my ( ... ) = m/ ... /x;
        if ( $authors ) {
            # output as desired
        }
        else {
            print ERRS "$_\n";
        }
    }
    close ERRS;
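A minimal, self-contained sketch of that idea, assuming the regex from the parent post. The helper name `parse_entry`, the two sample records, and the in-memory log are illustrative assumptions, not part of the original code; the log is kept in a scalar only so the sketch runs without touching disk.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the seven captured fields, or an empty
# list when the record does not match. The regex is the one from the
# parent post, unchanged.
sub parse_entry {
    my( $line ) = @_;
    return $line =~ m/
        ^
        -( .*? \. ) \s(?=[A-Z][a-z])
        ( .+ ) \.\s+
        ( [^:]+? ) : \s+
        (\S+), \s+
        ( \d{4} ) [^.]* \. \s+
        ( [^\[]+ )
        \[ ( \d+ ) \] \s*
        $
    /x;
}

my @records = (
    '-Martin, Ralph P. Philippians. Rev. ed.; NCB. Grand Rapids: Eerdmans, 1980. Clear and informed. [2]',
    'This line is not a bibliography entry',
);

# Assumption: an in-memory log keeps the sketch side-effect free.
# For a real file, open with '>', 'error.log' instead.
my $errlog = '';
open my $errs, '>', \$errlog or die $!;

my @parsed;
for my $rec ( @records ) {
    my @fields = parse_entry( $rec );
    if( @fields ) { push @parsed, \@fields }      # matched: keep the fields
    else          { print {$errs} "$rec\n" }      # no match: log the raw record
}
close $errs;

printf "%d parsed, rejects:\n%s", scalar @parsed, $errlog;
```

Testing the returned list rather than `$authors` alone avoids treating a legitimately matched but false-looking field as a failure.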

    cheers

    tachyon

Re^2: parsing a bibliography
by jimbojones (Friar) on Dec 02, 2004 at 00:05 UTC
    I think "thing" is the location of the publisher. Seems that Grand Rapids (MI?) has some serious biblical publishing houses ...

    - j
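If that reading is right, the fields could be labelled directly in the pattern. The following is only a sketch, assuming Perl 5.10 or later (named captures and `%+` did not exist when this thread was written); the capture names, including `location`, are made up for illustration, and the regex is otherwise the one from the parent post.

```perl
use strict;
use warnings;
use 5.010;   # named captures and %+ need Perl 5.10 or later

# One record taken from the sample data in the parent post.
my $entry = q{-O'Brien, Peter, T. Commentary on Philippians. NIGTC. Grand Rapids: Eerdmans, 1991. Thorough and insightful comments on the Greek text. [1]};

my %rec;
if( $entry =~ m/
        ^
        -(?<authors>  .*? \. ) \s(?=[A-Z][a-z])
        (?<title>     .+ ) \.\s+
        (?<location>  [^:]+? ) : \s+
        (?<publisher> \S+ ), \s+
        (?<year>      \d{4} ) [^.]* \. \s+
        (?<comment>   [^\[]+ )
        \[ (?<no> \d+ ) \] \s*
        $
    /x ) {
    %rec = %+;   # copy the named captures before another match clobbers them
}

# Print the fields in a fixed order.
printf "%-9s '%s'\n", $_, $rec{$_}
    for qw( authors title location publisher year comment no );
```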

Re^2: parsing a bibliography
by patrickrock (Beadle) on Dec 02, 2004 at 00:50 UTC
    BrowserUK, oh. my. stars.

    I am in awe of you. Thanks ever so much.

Re^2: parsing a bibliography
by ww (Archbishop) on Dec 02, 2004 at 16:48 UTC
    wonderful! Wish I could ++ your solution repeatedly! This writeup led to a "Eureka!" moment; the kind of haze-clearing that makes PM so valuable to beginners like me.

    request: please add to our understanding by commenting lines of regex, esp that part of line8 reading

    (?=[A-Z][a-z])

    (grouped but non-capture??)

    and in line13,

    ( [^\[]+ )

    which, as I read Owl (pocket ref), means capture one-or-more of a class including not-an-open_BRKT and close_BRKT ...which doesn't make sense to me, and -- more importantly, doesn't seem to WORK that way.

      The regex commented.

      my( $authors, $title, $thing, $pub, $date, $comment, $no ) = m/
          ^
          ## Author(s): Capture the minimum needed to satisfy that:
          ##   a) It ends with a '. '
          ##   b) And the next word is not an initial
          ##      IE: Lookahead and check the next word starts with
          ##      1 uppercase *and* one lowercase character.
          -( .*? \. ) \s(?=[A-Z][a-z])

          ## Title: Greedily capture something that ends with '. '
          ( .+ ) \.\s+

          ## Location: Non-greedily capture
          ##   Ends with a ': '.
          ##   Doesn't contain a ':'
          ( [^:]+? ) : \s+

          ## Publisher:
          ##   Single word followed by a ', '
          (\S+), \s+

          ## Year: Capture four digits
          ##   Discard anything else up to '. '
          ( \d{4} ) [^.]* \. \s+

          ## Comment: Greedy capture non-'[' characters
          ##   Ie. Stop capturing when you see a '['
          ( [^\[]+ )

          ## No: Capture 1 (or more) digits between '[' & ']'
          ##   Discard any trailing space to the EOS.
          \[ ( \d+ ) \] \s*
          $
      /x;
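To make the two constructs asked about concrete, here is a small sketch that isolates each one. The shortened sample strings and the variable names are assumptions chosen for illustration; only the sub-patterns come from the regex in this thread.

```perl
use strict;
use warnings;

# (?=...) is a zero-width lookahead: it must match at that position,
# but it consumes nothing and captures nothing. Only the two plain
# parenthesised groups below produce return values.
my $entry = 'Bruce, F. F. The Epistles';
my( $authors, $rest ) =
    $entry =~ m/ ^ ( .*? \. ) \s (?=[A-Z][a-z]) ( .* ) $ /x;

# The initial 'F.' fails the lookahead (uppercase followed by a dot),
# so the author capture extends until a real word follows.
printf "authors: '%s'\n", $authors;
# The word checked by the lookahead is still available to the next group.
printf "rest:    '%s'\n", $rest;

# [^\[] is a negated class with a single member, the escaped '['.
# The final ']' only closes the class; it is not a member of it.
my $text = 'Clear and informed. [2]';
my( $comment, $no ) = $text =~ m/ ^ ( [^\[]+ ) \[ ( \d+ ) \] /x;

printf "comment: '%s'\n", $comment;   # everything before the '['
printf "no:      '%s'\n", $no;
```

So `(?=...)` groups without capturing and without moving the match position, and `[^\[]+` means "one or more characters that are not an opening bracket".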

