converting a large SGML file into a database

by dannoura (Pilgrim)
on Nov 18, 2005 at 19:36 UTC (#509920=perlquestion)
dannoura has asked for the wisdom of the Perl Monks concerning the following question:


I want to convert a ~1GB SGML file into a database. I decided to use DBM::Deep (which is pure Perl), since I've never used databases before and I don't have the time to learn about them right now. I tried my code on several smaller files and it works OK. Now I have to see if it works with the large file.

So here are my questions:

  • Is it possible to simulate the behaviour of the script without actually using the large file?
  • Will the script which uses the database be able to access it? I understand that there are some issues with files over 2GB and the database file is always much larger than the SGML file. (I'm using perl 5.8.4 on WinXP)
  • Is it possible that the script will overload the RAM?

Thanks for your help.

The relevant sub is:

sub convert {
    my ($cassis_file_entry, $db_file_entry, $status, $MW) = @_;

    my $cassis_file = $cassis_file_entry->get;
    my $db_file     = $db_file_entry->get;

    my $db = new DBM::Deep(
        file => $db_file,
        type => DBM::Deep::TYPE_ARRAY
    );
    $db->clear();
    $db->optimize();

    my $p = HTML::TokeParser->new($cassis_file);
    my $i = -1;    # Counter for @records
    my %tags = (
        pn => 'patent_no',
        ap => 'application',
        pd => 'dates',         # Issue date
        dr => 'rcrd_dates',    # Date assignment recorded
        ae => 'assignee',
        ar => 'assignor'
    );

    while (my $token = $p->get_token) {
        foreach my $tag (keys %tags) {
            if ($token->[0] eq 'S' && $tag eq $token->[1]) {
                push @{${$db}[$i]->{$tags{$tag}}}, $p->get_trimmed_text;
            }
        }
        if ($token->[0] eq 'S' && $token->[1] eq 'asn') {
            $i++;
        }
    }
    $$status = 'Done';
}
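On the first question above, one way to exercise the sub without the real file is to generate a synthetic file of whatever size is needed and watch memory and database size while it is converted. The following is only a sketch: `write_fake_sgml` is a hypothetical helper, the tag names are taken from the sub above, and the nesting and values are assumptions, since the real file layout is not shown in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: writes $n_records fake assignment records to $path.
# Tag names come from the convert sub; the nesting and the values are guesses.
sub write_fake_sgml {
    my ($path, $n_records) = @_;
    open my $out, '>', $path or die "Cannot write $path: $!";
    for my $n (1 .. $n_records) {
        print {$out} "<asn>\n",
            "<pn>", 1_000_000 + $n, "</pn>\n",
            "<ap>APP$n</ap>\n",
            "<pd>20050101</pd>\n",
            "<dr>20050201</dr>\n",
            "<ae>Assignee $n</ae>\n",
            "<ar>Assignor $n</ar>\n",
            "</asn>\n";
    }
    close $out or die "Cannot close $path: $!";
    return -s $path;    # size of the generated file in bytes
}
```

Calling something like `write_fake_sgml('fake.sgml', 100_000)` and raising the record count step by step should show how conversion time and database size grow before the full ~1GB run is attempted.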

Replies are listed 'Best First'.
Re: converting a large SGML file into a database
by Zaxo (Archbishop) on Nov 19, 2005 at 06:44 UTC

    First a design question; why should it be in a database? SGML files are pretty well-defined and few applications need a lot of them.

    Second: Being in a database suggests that a partial SGML file is good enough for most purposes. Perhaps. How do you keep them consistent?

    Large file support is strictly up to the OS. Perl will usually accommodate itself to whatever the OS allows, so long as you compile your own perl.
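Whether a particular perl was built with large file support can be read from the core `Config` module. This is a minimal sketch: `largefile_report` is a hypothetical helper name, while `uselargefiles` and `lseeksize` are standard `%Config` keys. It reports only how the interpreter was built, not what the filesystem or DBM::Deep itself will allow.

```perl
use strict;
use warnings;
use Config;

# Hypothetical helper: collects the build settings relevant to large files.
sub largefile_report {
    return {
        # True if this perl was configured with large file support.
        uselargefiles => $Config{uselargefiles} ? 1 : 0,
        # Size of a file offset in bytes; 8 means 64-bit offsets.
        lseeksize     => $Config{lseeksize},
    };
}

my $report = largefile_report();
printf "uselargefiles: %s\nlseeksize: %s bytes\n",
    $report->{uselargefiles} ? 'yes' : 'no',
    defined $report->{lseeksize} ? $report->{lseeksize} : 'unknown';
```

The same information is available from the command line with `perl -V:uselargefiles` and `perl -V:lseeksize`.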

    After Compline,

      I decided to go for a database because the SGML file data structure is not the structure I want and also because I think a database would be faster for lookup (the next step in this script is converting the data to a hash of arrays so I have a hash lookup).
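The hash-of-arrays step mentioned here could look roughly like the sketch below. It assumes records shaped like those the convert sub stores (each field an array reference) and assumes the lookup is from patent number to assignee names; `index_by_patent` is a hypothetical name, and the real key and value fields may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: builds patent_no => [ assignee names ] from an
# array of records whose fields are array references.
sub index_by_patent {
    my ($records) = @_;
    my %by_patent;
    for my $rec (@$records) {
        # Skip empty slots and records without a patent number.
        next unless $rec && $rec->{patent_no};
        for my $pn (@{ $rec->{patent_no} }) {
            push @{ $by_patent{$pn} }, @{ $rec->{assignee} || [] };
        }
    }
    return \%by_patent;
}
```

Note that an index built this way lives entirely in memory, so for a ~1GB source file its size matters as much as the size of the database file.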

      I'm not sure I understand your second point. Once I convert the SGML file into a database I don't use the SGML file any more.

      Is there any reason to assume WinXP won't handle files of that size?
