
Books, Magic Wands and Perl

by hsmyers (Canon)
on May 27, 2003 at 16:00 UTC ( #261033=CUFP )

For about as long as I can remember, books have been a significant part of my life. And while I won't admit to being a biblioholic, I will admit to a certain obsession! After all, I am surrounded by the proof---books! From floor to ceiling, from wall to wall, shelved and stacked, books, books, books. You remember the sorcerer's apprentice and his small problem with mops and pails? Pah!---he had nothing on me. If my books were as lively as his little friends, they would be sorting and dusting themselves instead of sitting, standing and just generally lying about.

Part of the problem of an obsession is living with it. And one of the problems of living with books is the small matter of the catalog. Between stacking, un-stacking, bookcase building and buying, acquisition and all of the rest, who has time to keep a card catalog up to date? Were I a professor with grad students, the solution would be obvious, if not particularly rewarding for the students. But I'm not---I only know grad students; I can't enslave them, and books as bait is too blatant even for them.

Once again, Perl to the rescue™!

One of the fantasies I had for years features the waving of a magic wand, followed by the creation of a complete card catalog. Over time, the catalog has transformed from a lovely piece of vintage (Victorian I think) furniture to today's more useful digital counterpart. And the ash and ebony wand has morphed into a bar code reader, but you get the general drift. Any good software solution needs to be better than the solution it replaces, even when the previous solution was almost Zen-like in its 'nothingness'! More to the point in this case, it can't involve much more work than what is involved in the normal handling of books from moment to moment, else it won't get used. The way I figure it, I can be depended on to supply a 'minimum' of information for each book, no more. The rest would have to be done by the computer. My idea of perfect: given a pile of books, take the few seconds to scan the EAN on the back and presto! Instant database entries! (EAN is the 13-digit bar code typically found on the back of most books these days.) It should also be free, extensible and fairly small as software goes.
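The step from a scanned EAN to an ISBN is small enough to show on its own. What follows is a minimal sketch, assuming a Bookland EAN that begins with 978; the name `ean_to_isbn10` is made up for illustration and is not the routine used in the script.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the script): turn a 13-digit Bookland EAN
# with the 978 prefix into a 10-character ISBN.
sub ean_to_isbn10 {
    my ($ean) = @_;
    return unless defined $ean && $ean =~ /^978(\d{9})\d$/;
    my @digits = split //, $1;

    # ISBN-10 check digit: weight the nine digits 10 down to 2, then choose
    # the value that brings the total to a multiple of 11 (10 is written X).
    my $sum    = 0;
    my $weight = 10;
    $sum += $_ * $weight-- for @digits;
    my $check = ( 11 - $sum % 11 ) % 11;

    return join( '', @digits ) . ( $check == 10 ? 'X' : $check );
}

print ean_to_isbn10('9780446611336'), "\n";    # prints 0446611336
```

The EAN's own final digit is thrown away because it is computed with a different scheme; the ISBN check digit has to be recalculated from the nine digits that survive.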

Well, sometimes magic happens...

Update: fix readmore tag!
Update: New version, remove redundant POD!
#!/perl/bin/perl
#
# -- Application converts EANs to BibTeX data.

use strict;
use warnings;
use diagnostics;
use Getopt::Long;
use Pod::Usage;
use LWP::Simple;
use XML::Simple;

our $VERSION = '0.02';

my $devtok = $ENV{'AMAZON_DEVTOK'};
my $simple = XML::Simple->new();
my %bib = (
    author    => '?',
    title     => '?',
    publisher => '?',
    address   => '?',
    edition   => '?',
    year      => '?',
    ISBN      => '?',
    image     => '?',
);
my $isbn_list;
my %lookup;
my @isbns;
my $isbn;

# 'output' takes a file name, so it needs the '=s' specifier.
GetOptions(
    'debug=i'    => \( my $debug = 0 ),
    'help|?'     => \( my $opt_help ),
    'man'        => \( my $opt_man ),
    'version'    => \( my $opt_version ),
    'lookup=s'   => \( my $opt_lookup ),
    'isbn'       => \( my $opt_isbn ),
    'output=s'   => \( my $opt_output ),
    'noappend'   => \( my $opt_noappend ),
    'location=s' => \( my $opt_location ),
    'image'      => \( my $opt_image ),
    'nolist'     => \( my $opt_nolist ),
    'authors'    => \( my $opt_authors ),
) or pod2usage(2);

if ($opt_version) {
    print " vrs. $VERSION\n";
    exit;
}
pod2usage(1) if $opt_help;
pod2usage( -verbose => 2 ) if $opt_man;

# Debug runs write to STDOUT; otherwise append to (or overwrite) the .bib file.
if ($debug) {
    *OUTPUT = *STDOUT;
}
else {
    if ($opt_output) {
        if ($opt_noappend) {
            open ( OUTPUT , ">$opt_output" )
                or die "Can't open $opt_output: $!\n";
        }
        else {
            open ( OUTPUT , ">>$opt_output" )
                or die "Can't open $opt_output: $!\n";
        }
    }
    else {
        if ($opt_noappend) {
            open ( OUTPUT , ">ean2bib.bib" )
                or die "Can't open ean2bib.bib: $!\n";
        }
        else {
            open ( OUTPUT , ">>ean2bib.bib" )
                or die "Can't open ean2bib.bib: $!\n";
        }
    }
}

# Optional publisher|address lookup table.
if ($opt_lookup) {
    open( ADDRESS, $opt_lookup ) or die "Can't open $opt_lookup: $!\n";
    while (<ADDRESS>) {
        chomp;
        my ($publisher,$address) = split(/\|/,$_);
        $lookup{$publisher} = $address;
    }
    close ( ADDRESS );
}

# Collect ISBNs from __DATA__ (debug), from STDIN, or from the named files.
if ($debug) {
    while (<DATA>) {
        chomp;
        push ( @isbns, ean2isbn($_) );
    }
}
elsif (@ARGV == 0) {
    if ($opt_isbn) {
        print "Enter ISBN+newline---^Z to end. ^C to cancel.\n";
        while (<>) {
            chomp;
            my ( $ok, $checkdigit ) = checkISBN($_);
            if ($ok) {
                push ( @isbns, $_ );
            }
            else {
                if ($checkdigit eq '-') {
                    print "Too few characters in ISBN, 10 required.\n";
                }
                elsif ($checkdigit eq '+') {
                    print "Too many characters in ISBN, 10 required.\n";
                }
                else {
                    print "Bad check digit in ISBN, one or more characters wrong.\n";
                }
            }
        }
    }
    else {
        print "Enter EAN+newline---^Z to end. ^C to cancel\n";
        while (<>) {
            chomp;
            push ( @isbns, ean2isbn($_) );
        }
    }
}
else {
    for (map { glob } @ARGV) {
        filter($_);
    }
}

# Back up the list of ISBNs unless told not to.
unless ($opt_nolist) {
    if ($opt_output) {
        open ( FILE, ">$opt_output.isbn")
            or die "Couldn't open $opt_output.isbn: $!\n";
    }
    else {
        open ( FILE, ">ean2bib.bib.isbn")
            or die "Couldn't open ean2bib.bib.isbn: $!\n";
    }
    for (@isbns) {
        print FILE $_,"\n";
    }
    close (FILE);
}

# Query Amazon in batches of 30 ISBNs and write one @book entry per result.
while (@isbns) {
    $isbn_list = join ( ",",splice(@isbns,0,30) );
    my $Amazon_webpage = get(" +vices-20&dev-t=$devtok&AsinSearch=$isbn_list&type=lite&f=xml");
    if ( $Amazon_webpage ) {
        my $root = $simple->XMLin( $Amazon_webpage );
        my $details = $root->{'Details'};
        for (@$details) {
            for ( keys %bib ) {
                $bib{$_} = '?';
            }
            ParseAmazon( $_, \%bib );
            print OUTPUT "\@book{,\n";
            print OUTPUT " author=\{", $bib{'author'}, "\}\n";
            print OUTPUT " title=\{", $bib{'title'}, "\}\n";
            print OUTPUT " publisher=\{", $bib{'publisher'}, "\}\n";
            print OUTPUT " address=\{", $bib{'address'}, "\}\n";
            print OUTPUT " year=\{", $bib{'year'}, "\}\n";
            print OUTPUT " ISBN=\{", $bib{'ISBN'}, "\}\n";
            if ($opt_location) {
                print OUTPUT " location=\{",$opt_location,"\}\n";
            }
            if ($opt_image) {
                print OUTPUT " image=\{",$bib{'image'},"\}\n";
            }
            print OUTPUT "}\n\n";
        }
    }
    else {
        print "Bad luck this time\n";
    }
}
close ( OUTPUT );

# Read one EAN (or ISBN, with -isbn) per line from a file.
sub filter {
    my $file = shift;

    open ( FILE, $file ) or die "Couldn't open $file: $!\n";
    print STDERR "$file:\n";
    while (<FILE>) {
        chomp;
        if ($opt_isbn) {
            my ( $ok, $checkdigit ) = checkISBN($_);
            if ($ok) {
                push ( @isbns, $_ );
            }
            else {
                if ($checkdigit eq '-') {
                    print STDERR "Too few characters in ISBN '$_', 10 required.\n";
                }
                elsif ($checkdigit eq '+') {
                    print STDERR "Too many characters in ISBN '$_', 10 required.\n";
                }
                else {
                    print STDERR "Bad check digit in ISBN '$_', one or more characters wrong.\n";
                }
            }
        }
        else {
            push ( @isbns, ean2isbn($_) );
        }
    }
    close ( FILE );
}

# Fill %bib from one Amazon 'Details' record, then let the Library of
# Congress and the lookup table supply whatever is still missing.
sub ParseAmazon {
    my ( $detail, $bib ) = @_;

    $bib->{'title'}     = clean( $detail->{'ProductName'} );
    $bib->{'publisher'} = clean( $detail->{'Manufacturer'} );
    $bib->{'ISBN'}      = clean( $detail->{'Asin'} );
    $bib->{'image'}     = $detail->{'ImageUrlSmall'};
    my $year = $detail->{'ReleaseDate'};
    $year =~ s/.*? (\d+)$/$1/;
    $bib->{'year'} = $year;
    my $authors = $detail->{'Authors'}->{'Author'};
    my @list;
    if ( ref($authors) eq 'ARRAY' ) {
        for (@$authors) {
            push ( @list, $_ );
        }
        unless ($opt_authors) {
            $bib->{'author'} = clean( join ( " and ", @list ) );
        }
    }
    else {
        $bib->{'author'} = clean($authors);
    }
    my $webpage = get(" +&ESN=F&startrec=1&maxrecords=10&dbname=Voyager&srchtype=1,7,2,3,3,1,4,1,5,1,6,1&term_term_1=$bib->{'ISBN'}");
    scrapeLOC($webpage,$bib);
    if ($bib->{'address'} eq '?' and $opt_lookup) {
        if (exists($lookup{$bib->{'publisher'}})) {
            $bib->{'address'} = $lookup{$bib->{'publisher'}};
        }
    }
    if ($bib->{'author'} eq '?' and scalar(@list)) {
        $bib->{'author'} = clean( join ( " and ", @list ) );
    }
}

sub GetSubfield {
    my ( $subfield, $key ) = @_;

    if ( ref($subfield) eq 'ARRAY' ) {
        for (@$subfield) {
            if ( $_->{'code'} eq $key ) {
                return clean( $_->{'content'} );
            }
        }
    }
    else {
        return clean( $subfield->{'content'} );
    }
}

# Scrape the Library of Congress result page; only fields still set to '?'
# are overwritten.
sub scrapeLOC{
    my ( $webpage, $bib ) = @_;

    if ($webpage and $webpage !~ /<b>0<\/b> records/) {
        my @list;
        my %fields;
        while ($webpage =~ /^(.*?:)/mg) {
            push(@list,$1);
        }
        # Map each field label to the label that follows it.
        for (0..@list - 2) {
            $fields{$list[$_]} = $list[$_ + 1];
        }
        $fields{$list[-1]} = '<\/PRE>';
        if (exists($fields{'Title:'})) {
            my $title = parsespan('Title:',$fields{'Title:'},$webpage);
            my @title = split(/\//,$title);
            for (@title) {
                $_ = alltrim($_);
            }
            $title = $title[0];
            my $author = $title[1];
            $author =~ s/,/ and/g;
            $author =~ s/\.$//;
            $bib->{'title'} = $title if ($title and $bib->{'title'} eq '?');
            $bib->{'author'} = $author if ($author and $bib->{'author'} eq '?');
        }
        if (exists($fields{'Author:'}) and $bib->{'author'} eq '?') {
            my $author = parsespan('Author:',$fields{'Author:'},$webpage);
            $author =~ s/\.$//;
            $bib->{'author'} = $author if $author;
        }
        if (exists($fields{'Edition:'}) and $bib->{'edition'} eq '?') {
            my $edition = parsespan('Edition:',$fields{'Edition:'},$webpage);
            $edition =~ s/(.*?)\s.*$/$1/;
            $bib->{'edition'} = $edition if $edition;
        }
        if (exists($fields{'ISBN:'}) and $bib->{'ISBN'} eq '?') {
            my $isbn = parsespan('ISBN:',$fields{'ISBN:'},$webpage);
            # Nine digits followed by a digit or X.
            $isbn =~ s/.*?(\d{9}[\dX]).*/$1/;
            $bib->{'ISBN'} = $isbn;
        }
        if (exists($fields{'Published:'})) {
            my $published = parsespan('Published:',$fields{'Published:'},$webpage);
            $published =~ /(.*?)\s:\s(.*?),.*?(\d{4}).*$/;
            $bib->{'publisher'} = $2 if ($2 and $bib->{'publisher'} eq '?');
            $bib->{'address'} = alltrim($1) if ($1 and $bib->{'address'} eq '?');
            $bib->{'year'} = $3 if ($3 and $bib->{'year'} eq '?');
        }
    }
}

# Return the text between two field labels, collapsed onto one line.
sub parsespan {
    my ($begin,$end,$s) = @_;

    $s =~ /$begin((?:(?!$begin).)*)$end/ms;
    my $span = $1;
    $span =~ s/\n/ /g;
    $span =~ s/\s\s+/ /g;
    return $span;
}

sub alltrim {
    my $s = shift;

    $s =~ s/^\s+//;
    $s =~ s/\s+$//;
    return $s;
}

# Drop the 3-digit prefix and the EAN check digit, then recompute the
# ISBN check digit.
sub ean2isbn {
    my $isbn = substr( shift, 3, 10 );

    return substr($isbn,0,9) . checkDigit($isbn);
}

sub checkISBN {
    my $isbn = shift;
    my $n = length($isbn);

    if ( $n != 10 ) {
        return ( 0, ( $n < 10 ? '-' : '+' ) );
    }
    else {
        my $cd = checkDigit($isbn);
        return ((($cd eq substr($isbn,-1,1)) ? 1 : 0), $cd);
    }
}

# ISBN-10 check digit computed from the first nine characters.
sub checkDigit {
    my @digits = split ( //, uc(shift) );
    my $sum = 0;
    my $m = 10;

    for ( 0 .. @digits - 2 ) {
        $sum += $digits[$_] * $m--;
    }
    return (qw(0 X 9 8 7 6 5 4 3 2 1))[ $sum % 11 ];
}

sub clean {
    my $s = shift;

    $s =~ s/[,:\\\/.]$//;
    $s =~ s/\s+$//;
    $s =~ s/[\x7f-\xff]//g;
    return $s;
}

=head1 NAME

 -- Application to create BibTeX data from EANs.

=head1 SYNOPSIS

 [options] filespec

 Options:
   -debug      set debug level, default is off
   -help       brief help message
   -man        full documentation
   -version    version number
   -lookup     publisher address lookup file
   -isbn       ISBN entry versus EAN entry, default is EAN entry
   -output     name of file to write bib entries to, defaults to 'ean2bib.bib'
   -noappend   turn append mode for output ON or OFF, defaults to OFF (do append)
   -location   add 'location' tag to bib entry.
   -image      add 'image' tag to bib entry using Amazon's small image
   -nolist     suppress backup of ISBNs, defaults to OFF (do backup list)
   -authors    use Library of Congress lookup instead of Amazon for 'authors' field

Switches that don't define a value can be done in long or short form.
eg:

   --man
   -m

=head1 OPTIONS

=over 8

=item B<-debug>

Display debug information as program is executed. Control is set by level
of the value passed on the command line. Default value is off (debug == 0).
Setting this to something greater than zero will also send the output to
STDOUT, overriding the normal defaults. Skips appending etc.

=item B<-help>

Print a brief help message and exit.

=item B<-man>

Print the manual page (full documentation) and exit.

=item B<-version>

Print the version number and exit.

=item B<-lookup>

Specify a 'lookup' file for publisher addresses.

=item B<-isbn>

Flag to allow ISBN entry versus EAN entry. Default is EAN entry.

=item B<-output>

Name of file to write output to. Default is 'ean2bib.bib'.

=item B<-noappend>

Flag to turn off append mode for output file. Default is OFF (do append).

=item B<-location>

Turn on 'location' tag for bib entry.

=item B<-image>

Turn on 'image' tag of bib entry. Currently 'small' image is hardcoded in.

=item B<-nolist>

Suppress backup of ISBNs, defaults to OFF (do backup list).

=item B<-authors>

Note AMAZON currently has a bug in that they think it is alright to list
'Vic Broquard', 'Broquard Vic', and 'Victor E. Broquard' as multiple
authors! Using -authors forces a LOC lookup instead of accepting the AMAZON
lookup for those cases of multiple authorship. Does not affect single
authorship lookup.

=back

=head1 DESCRIPTION

This application uses the AMAZON web services API and the Library of
Congress MARC database to create a BibTeX entry based solely on the EAN,
taken either from text files specified on the command line or entered by
way of a bar code wand at the command line.

=head2 EXAMPLE

Suppose you had a file of EANs, say eans.txt, that contained:

 9780446611336
 9780451458711
 9780446610902
 9780596004361
 9780201185379
 9780201489460
 9780764545696
 9780138482763

One EAN per line. Then say you typed 'ean2bib eans.txt' at the command
line. The result would be the creation of a file called 'ean2bib.bib.isbn'
and another called 'ean2bib.bib'. The '.bib' file would be appended to if
it already existed. On the first run however, it would look like (shortened
slightly for POD):

 @book{,
  author={Rebecca Neason}
  title={The Truest Power}
  publisher={Warner Books}
  address={?}
  year={2002}
  ISBN={0446611336}
 }

 @book{,
  author={Alan F. Troop}
  title={The Dragon Delasangre}
  publisher={Roc}
  address={New York}
  year={2002}
  ISBN={0451458710}
 }

 @book{,
  author={Katherine Kurtz}
  title={Crusade of Fire: Mystical Tales of the Knights Templar}
  publisher={Warner Books}
  address={?}
  year={2002}
  ISBN={0446610909}
 }

 @book{,
  author={Peter Prinz and Ulla Kirch-Prinz}
  title={C Pocket Reference}
  publisher={O'Reilly & Associates}
  address={?}
  year={2002}
  ISBN={0596004362}
 }
 .
 .
 .
 @book{,
  author={Vic Broquard and Broquard Vic and Victor E. Broquard}
  title={Intermediate Mfc}
  publisher={Prentice Hall PTR}
  address={Upper Saddle River, NJ}
  year={1998}
  ISBN={0138482764}
 }

In other words, the output is suitable as input into a LaTeX document as a
.bib file.

Noticeable is the occasional 'address={?}' entry. This is because of two
things; first, Amazon doesn't supply publisher address information. Second,
while the Library of Congress is better, sometimes it doesn't either!
Because of this, the --lookup option was added. This option supplies a text
file to use as an address lookup table backstop. As an example:

 A. H. Baily and Co.|London, UK
 Ace Books|New York, NY
 Ace|New York, NY
 ActiveState Tool Corp.|Vancouver, Canada, BC
 Addison Wesley|Boston, MA
 Addison Wesley|Reading, MA
 Addison-Wesley|Upper Saddle River, NJ
 Aeonian Press|Mattituck, NY
 Al. Kalmajs Prtg. Co.|Chicago, Illinois
 Aladdin|New York, NY
 Albert E. Woolum|N. Richland Hills, TX
 Alfred Kalnajs and Son|Gulfport, FL
 Alfreds Kalnajs|Chicago, Illinois
 American Chess Promotions|Macon, GA
 American Chess Promotion|Macon, GA
 American Chess Quarterly|Columbus, OH
 Andreyevski Flag|Himberg Austria
 Arbor House|New York, NY
 Aspect|New York, NY
 Atheneum|New York, NY
 Atria|New York, NY
 AvoNova|New York, NY
 Avon Books (Trd)|New York, NY
 Avon|New York, NY

Forms the first few lines of my 'address.txt' file. The format is simple;
publisher name followed by publisher address, delimited by '|'.

Also noticeable is the 'author={Vic Broquard and Broquard Vic and Victor E.
Broquard}' from the last EAN in the example. Amazon currently thinks this
is a good thing! I think it is a bug, but Amazon is bigger than I am. At
any rate, I've added the --authors option to fix this. Briefly it skips the
author information from Amazon in favor of the author information from the
Library of Congress lookup.

=head2 WHAT IS LATEX?

LaTeX is a high-quality typesetting system, with features designed for the
production of technical and scientific documentation. LaTeX is the de facto
standard for the communication and publication of scientific documents.

=head2 WHAT IS BIBTEX?

From

BibTeX is the bibliography handling tool related to the TeX/LaTeX
typesetting system (available on almost all conceivable OS platforms). When
a document is typeset with LaTeX, an auxiliary file (with extension '.aux')
is generated. This auxiliary file contains, amongst other things, a list of
references (to books, articles, research reports, web-url's,...) cited in
the original document. BibTeX takes this list, together with the name of a
style file (extension '.bst') and a list of bibliography databases
(extension '.bib') which are also given in the auxiliary file, and produces
a wonderfully formatted list of references which the LaTeX system
subsequently (and automatically) appends to your typeset document. For
those who think this sounds a bit too complicated, the benefits in terms of
quality of output (both textual and mathematical) and automatic generation
of all manner of numbering schemes (figures, equations, sections), and
document parts (table of contents, index, bibliography), together with
cross-platform availability, mean that many people consider it the _best_
current system for producing quality documents.

=head2 WHAT IS A BIB FILE?

From

There are several approaches to including a bibliography into a LaTeX
document. The easiest, but not the best is to construct a bbl file. The bbl
file, of FILE.bbl where file is the name of the LaTeX file simply contains
the command LaTeX interprets to write out the bibliography. The difficulty
with a bbl file is that it is rigid. Once constructed, in order to change
the formatting or ordering, manually rewriting the file is the only option
available.

Instead, the preferred mechanism for producing a bibliography is to
construct a bib file, i.e., a file, usually with the name FILE.bib. The bib
file is a database file, which can be processed by the utility bibtex to
produce a bbl file. The advantage is that the formatting of the bbl file is
then determined by a style template, consequently the bibliography can
easily be reformatted to meet the requirements of a particular publication.

=head2 WHAT IS AN EAN?

EAN stands for European Article Number and in this context refers to the
Bookland EAN, a thirteen digit, strictly numeric bar code for the
publishing industry. It is derived from the ISBN number assigned to a
particular title.

=head1 AUTHOR

Hugh S. Myers

=head1 BUGS

None that I know of.

=head1 TODO

 Mon May 26 10:51:13 2003 Add LCCN option for older books.
 Mon May 26 11:08:04 2003 Convert file entry and command line entry to use same code.
 Mon May 26 12:55:47 2003 Correct flaw in logic for checking EANs. DONE
 Mon May 26 13:50:10 2003 Add 'image' option for Amazon thumbnail image URLs. DONE
 Fri Jun 06 09:21:52 2003 Drop VBZOOMC.ZoomFactory in favor of a web scrape. DONE
 Fri Jun 06 09:23:00 2003 Add CS1504 option to set barcode type and location info.
 Fri Jun 06 09:45:21 2003 Begin adding debug features for testing. DONE
 Fri Jun 06 11:16:20 2003 Add 'authors' option as an Amazon bug work-around. DONE

=head1 UPDATES

 0.02 Removes dependency on ActiveX .dll. Improve documentation.

=cut

__DATA__
9780446611336
9780451458711
9780446610902
9780596004361
9780201185379
9780201489460
9780764545696
9780138482763


"Never try to teach a pig to sing...it wastes your time and it annoys the pig."

Edit by tye, change PRE to CODE around long lines.

Replies are listed 'Best First'.
Re: Books, Magic Wands and Perl
by mr_mischief (Monsignor) on Jun 02, 2003 at 00:59 UTC
    As I'm sure many others agree, this is a seriously CUFP. I like it very much.

    I do have one nit to pick. I had to read the code to know the format of the input file. While it was easy to figure out and even easy to guess correctly, a fully documented program includes such details in the documentation.

    Christopher E. Stith
    use coffee;
      Right you are (although I do plead 'beta'); look for it in an update, particularly since I plan on adding the ability to barcode location (room, case and shelf) in a 'next' version.


      "Never try to teach a pig to sing...it wastes your time and it annoys the pig."
        Like I said, it was a nit I was picking. I like the program very, very much. ++ on the original node and on your response to my response.

        I'm glad to hear it's going to be updated in both features and documentation. The documentation of programs often gets so overlooked, that it's good to know people still care.

        I, unfortunately, am one of those who often overlooks part of the documentation. That's one reason I like to point it out -- to help others keep from falling into the same pits. ;-)

        Christopher E. Stith
        use coffee;
Re: Books, Magic Wands and Perl
by SciDude (Friar) on Jun 06, 2003 at 04:34 UTC
    Am I the only person having problems with this code? So far, I have found that the code is dependent upon a Visual Basic binding to zoom. This is (my guess) a workaround due to compilation issues for zoom on Windows. My knowledge of Visual Basic is poor. Must I install something from the Visual Basic Binding Zoom page for this code to work on my Windows2000 box?

    Screen scraping with LWP::Simple and the online catalog is starting to sound much easier. This is a great idea - I just wish it were a bit more user friendly.

      No, you are not the only one bothered by a dependency on a VB (shudder) ActiveX dll. In More Books and Magic I describe a more reasonable approach. Look for version 0.02 of some time this weekend.


      "Never try to teach a pig to sing...it wastes your time and it annoys the pig."

Node Type: CUFP [id://261033]
Approved by sschneid
Front-paged by sschneid