http://www.perlmonks.org?node_id=509920

dannoura has asked for the wisdom of the Perl Monks concerning the following question:

hi,

I want to convert a ~1GB SGML file into a database. I decided to use DBM::Deep (which is pure perl), since I've never used databases before and I don't have the time to learn about them right now. I tried my code on several smaller files and it works ok. Now I have to see if it works with the large file.

So here are my questions:

Thanks for your help.

The relevant sub is:

sub convert {
    my ($cassis_file_entry, $db_file_entry, $status, $MW) = @_;
    my $cassis_file = $cassis_file_entry->get;
    my $db_file     = $db_file_entry->get;
    my $db = new DBM::Deep(
        file => $db_file,
        type => DBM::Deep::TYPE_ARRAY
    );
    $db->clear();
    $db->optimize();
    my $p = HTML::TokeParser->new($cassis_file);
    my $i = -1;    # Counter for @records
    my %tags = (
        pn => 'patent_no',
        ap => 'application',
        pd => 'dates',         # Issue date
        dr => 'rcrd_dates',    # Date assignment recorded
        ae => 'assignee',
        ar => 'assignor');
    while (my $token = $p->get_token) {
        foreach my $tag (keys %tags) {
            if ($token->[0] eq 'S' && $tag eq $token->[1]) {
                push @{${$db}[$i]->{$tags{$tag}}}, $p->get_trimmed_text;
            }
        }
        if ($token->[0] eq 'S' && $token->[1] eq 'asn') {
            $i++;
        }
    }
    $$status = 'Done';
}

Replies are listed 'Best First'.
Re: converting a large SGML file into a database
by Zaxo (Archbishop) on Nov 19, 2005 at 06:44 UTC

    First a design question; why should it be in a database? SGML files are pretty well-defined and few applications need a lot of them.

    Second: Being in a database suggests that a partial SGML file is good enough for most purposes. Perhaps. How do you keep them consistent?

    Large file support is strictly up to the OS. Perl will usually accommodate itself to whatever the OS allows, so long as you compile your own perl.

    After Compline,
    Zaxo
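One way to see what a given perl binary was built with, as a rough sketch of the large-file point above: the core Config module exposes the build settings `uselargefiles` and `lseeksize`. The helper name `large_file_report` is made up for illustration. An offset size of 8 bytes generally means files beyond 2GB can be addressed, but the filesystem still has the final say.

```perl
use strict;
use warnings;
use Config;

# Hypothetical helper: summarise the large-file settings this perl was built with.
sub large_file_report {
    # 'define' means perl was compiled with large file support.
    my $enabled = ( $Config{uselargefiles} // '' ) eq 'define' ? 1 : 0;
    # Size in bytes of a file offset; 0 here means the build did not record it.
    my $offset_bytes = $Config{lseeksize} // 0;
    return { enabled => $enabled, offset_bytes => $offset_bytes };
}

my $report = large_file_report();
print "large file support compiled in: ", ( $report->{enabled} ? 'yes' : 'no' ), "\n";
print "file offset size in bytes: $report->{offset_bytes}\n";
```

The same values are available from the command line with `perl -V:uselargefiles` and `perl -V:lseeksize`.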

      I decided to go for a database because the SGML file data structure is not the structure I want and also because I think a database would be faster for lookup (the next step in this script is converting the data to a hash of arrays so I have a hash lookup).

      I'm not sure I understand your second point. Once I convert the SGML file into a database I don't use the SGML file any more.

      Is there any reason to assume WinXP won't handle files of that size?
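For the hash-of-arrays lookup mentioned in this reply, a minimal sketch might look like the following. It assumes records shaped like the ones `convert` stores (each field holding an array of strings). The sample data and the helper name `index_by` are invented for illustration, and plain in-memory arrays stand in for the DBM::Deep object.

```perl
use strict;
use warnings;

# Made-up sample records; field names follow the %tags mapping in convert().
my @records = (
    { patent_no => ['1234567'], assignee => ['Acme Corp'],  assignor => ['J. Doe'] },
    { patent_no => ['7654321'], assignee => ['Acme Corp'],  assignor => ['A. Smith'] },
    { patent_no => ['1111111'], assignee => ['Widget Inc'], assignor => ['J. Doe'] },
);

# Hypothetical helper: build a hash of arrays mapping each value of $field
# to the indices of the records that contain it.
sub index_by {
    my ( $field, $records ) = @_;
    my %index;
    for my $i ( 0 .. $#$records ) {
        # A record may lack the field entirely, so fall back to an empty list.
        for my $value ( @{ $records->[$i]{$field} || [] } ) {
            push @{ $index{$value} }, $i;
        }
    }
    return \%index;
}

my $by_assignee = index_by( 'assignee', \@records );
print join( ',', @{ $by_assignee->{'Acme Corp'} } ), "\n";    # prints 0,1
```

Storing record indices rather than copies of the records keeps the index small, which matters when the source file is around 1GB.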