PerlMonks
MedlineParser: to parse and load MEDLINE into a RDBMS
by BioGeek (Hermit) on Feb 27, 2005 at 23:33 UTC ( [id://434946]=CUFP )
Some time ago I was struggling to put a local copy of the OMIM database into MySQL on my computer. OMIM is a database of human genetic diseases, provided through the National Library of Medicine (NLM). Today I came across a paper describing software that does exactly that, but for MEDLINE, another NLM database. I mention it here in the hope that it helps other bioinformatics monks.

MEDLINE is the NLM's bibliographic database covering the fields of medicine, dentistry, nursing, veterinary medicine, healthcare administration, and the pre-clinical sciences, dating back to 1966. It indexes articles from more than 4,600 international journals published in the U.S. and 70 other countries. It contains the full citation information for each paper, as well as abstracts for most of them. Users usually query MEDLINE through PubMed, a web-based interface and search engine. Researchers who use MEDLINE for text mining, information extraction, or natural language processing may benefit from having a copy that they can manage locally.

Diane E. Oliver, Prof. Adam Arkin and colleagues from the universities of Stanford and Berkeley developed software tools to parse the MEDLINE data files and load their contents into a relational database. Although the task is conceptually straightforward, the size and scope of MEDLINE make it nontrivial.

The entire content of MEDLINE is available as a set of text files formatted in XML (eXtensible Markup Language). The NLM distributes these files at no cost to the licensee, but the files are large and not easily searched without additional indexing and search tools. For example, the 2003 release of MEDLINE consists of 396 files (covering citations through 2002) with a total uncompressed size of 40.8 gigabytes (GB).
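To give a feel for what "parse the XML files and load them into a relational database" means in practice, here is a minimal sketch. It is not the authors' code: it is written in Python using only the standard library, it loads into an in-memory SQLite database rather than Oracle or DB2, and the `citation` table with its three columns is a hypothetical schema made up for this illustration. The sample XML is a heavily simplified stand-in for a real MEDLINE file, and the function names `iter_citations` and `load_citations` are likewise my own. The point it tries to show is the streaming approach: with files this large, you want to handle one citation at a time instead of building the whole tree in memory.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

# Tiny stand-in for one MEDLINE file. Real files hold many citations and
# carry far more fields than the three used in this sketch.
SAMPLE_XML = b"""<MedlineCitationSet>
  <MedlineCitation>
    <PMID>1</PMID>
    <Article>
      <ArticleTitle>First example title.</ArticleTitle>
      <Abstract><AbstractText>First example abstract.</AbstractText></Abstract>
    </Article>
  </MedlineCitation>
  <MedlineCitation>
    <PMID>2</PMID>
    <Article>
      <ArticleTitle>Second example title.</ArticleTitle>
    </Article>
  </MedlineCitation>
</MedlineCitationSet>"""


def iter_citations(source):
    """Yield (pmid, title, abstract) tuples, one citation at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "MedlineCitation":
            pmid = int(elem.findtext("PMID"))
            title = elem.findtext("Article/ArticleTitle")
            # Not every citation has an abstract; findtext returns None then.
            abstract = elem.findtext("Article/Abstract/AbstractText")
            yield pmid, title, abstract
            # Discard the contents of the citation we have just processed.
            elem.clear()


def load_citations(conn, source):
    """Create the (hypothetical) one-table schema and insert all citations."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS citation ("
        "pmid INTEGER PRIMARY KEY, title TEXT, abstract TEXT)"
    )
    with conn:  # a single transaction for the whole file
        conn.executemany(
            "INSERT OR REPLACE INTO citation VALUES (?, ?, ?)",
            iter_citations(source),
        )


conn = sqlite3.connect(":memory:")
load_citations(conn, io.BytesIO(SAMPLE_XML))
rows = conn.execute(
    "SELECT pmid, title, abstract FROM citation ORDER BY pmid"
).fetchall()
print(rows)
```

A real loader would open each of the distributed files in turn and pass the file handle to `load_citations`, and it would need further tables for authors, journals, MeSH headings and so on. Wrapping each file in one transaction is only one possible choice here; I have not checked how the published tools batch their inserts.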
The Perl version of MedlineParser was run on a networked Sun Enterprise 3500 server with eight 400-MHz processors and 4 GB of RAM (for reading input files and writing intermediate output files), using Oracle 9i. It took 196 hours (8 days and 4 hours) to load MEDLINE. A similar implementation written in Java (tar.gz file), run on an Intel system (Linux) with IBM's DB2 database management system, loaded the database in 76 hours (3 days and 4 hours). There were numerous differences between the two systems, and it was not possible to test each variable independently, so the two timings do not amount to a Perl versus Java comparison. Differences in processor speed, memory, disk read-write efficiency, and the optimization methods employed by the commercial database-management systems may all have affected loading times.

The Perl code is less flexible and not as readily extensible as the object-oriented code of the Java software, but the functionality offered by the resulting database implementations is very similar. The open-source code for the most current version of MedlineParser is available at http://biotext.berkeley.edu.

Source: most of the text for this node came from the following article:
Diane E. Oliver, Gaurav Bhalotia, Ariel S. Schwartz, Russ B. Altman, Marti A. Hearst, "Tools for loading Medline into a local relational database", BMC Bioinformatics 2004 (7 Oct 2004)

Update: I originally posted the source code here, but that pushed this node over its size limit. You can find the code for parsemedline.pl here.