I am a very non-technical user
Bioperl may be way more complicated than you need to get started( don't get me wrong, it is fabulous, just not simple to follow )
The easiest approach is to read each line of the file, and matching the words/characters up to the first colon. How you store the data really depends on what data you want and what you intend to do with it all( again, this can get extremely tricky ).
Here is an extremely simple script that reads all sequences for an entry, blast them, and writes the information to a file. If you refer to your Camel book, this script should be easy enough to figure out and extend. Remember, though, Excel can read csv( comma seperated value ) files, and you will need to read up on DBI if you want to throw data into Access.
#!/usr/bin/perl -w
use strict; ## watch typos and errors
$|++; ## output
my $begin = time; ## just seeing how long we take
## change these
my $root = 'C:\Windows\'; ## where the base of our script is
my $source = "$root/PDBFind.txt"; ## do NOT use the full PDB
+Find.TXT file, I mean it!
my $out = "$root/out.txt"; ## file to save blast results
+ to
my $blast_exe = 'C:\Blast\bl2seq.exe'; ## where our blast pr
+ogram is
my $options = '-G 11 -E 1 -W 3 -X 50 -F F -p blastp'; ## our
+ blast options
my $min_size = 10; ## don't blast shorty sequences, this isn't a
+ protein anyway
## prolly don't change these
my $blast_i = "$root/i.$$"; ## temp files to hold sequences
my $blast_j = "$root/j.$$";
## quick recap so we know what we are doing and where we are
print <<"ENDOFPRINT";
SCRIPT : $0
IN : '$source'
OUT : '$out'
BLAST : '$blast_exe $options -i "$blast_i" -j "$blast_j"'
ENDOFPRINT
## lines in PDBFind I care about
my $DELIMITER = '//';
my $ID = 'ID : ';
my $CHAIN = 'Chain : ';
my $SEQUENCE = ' Sequence : ';
my $COMPOUND = 'Compound : ';
my $SOURCE = 'Source : ';
my $HEADER = 'Header : ';
## counters
my $pdb_count = 0;
my $chain_count = 0;
my $blast_count = 0;
my $wild_count = 0;
my $stub_count = 0;
my $dna_count = 0;
## variables relevant to data I am looking for
my ( $id, $chain );
my ( @chain_seq_pairs, @chain_wild, @chain_stubby, @chain_dna );
## open files
open( SOURCE, "<$source" )
or die "Can not open $source for reading: $!";
open( OUT, ">$out" )
or die "Can not open $out for writing: $!";
## parse source and blast chains
while ( <SOURCE> )
{
chomp;
next if ( /^\s*$/ );
## match lines
if ( /^$DELIMITER/ )
{
## a holder
my @been_blasted;
## a little sanity
next unless $id;
## up counter
$pdb_count++;
## part of final counter
print OUT "${ID}${id}\n";
## blast everything
foreach my $i ( @chain_seq_pairs )
{
foreach my $j ( @been_blasted )
{
print OUT "${id} $i->[0] $j->[0]", '-' x 30, "\n";
## fill our blast i and j temp files
open( BLASTI, ">$blast_i" )
or die "Can not open $blast_i for writing: $!";
open( BLASTJ, ">$blast_j" )
or die "Can not open $blast_j for writing: $!";
print BLASTI $i->[1];
print BLASTJ $j->[1];
close( BLASTI );
close( BLASTJ );
## blast away
my @blast_results = split( "\n", `$blast_exe $options
+-i "$blast_i" -j "$blast_j"` );
## increase counter
$blast_count++;
## now cycle through lines,
## put filters, what have yous here
foreach my $line ( @blast_results )
{
print OUT "${id} $i->[0] $j->[0]| $line\n";
}
}
## save current sequence to be blasted against
push( @been_blasted, $i );
}
print OUT "\n";
## increase counters
$chain_count += scalar( @chain_seq_pairs );
$wild_count += scalar( @chain_wild );
$stub_count += scalar( @chain_stubby );
$dna_count += scalar( @chain_dna );
## clean up a few things
undef @chain_seq_pairs;
undef @chain_dna;
undef @chain_stubby;
undef @chain_wild;
undef $id;
undef $chain;
next;
}
if ( /^$ID/ )
{
$id = $';
next;
}
## These are noops, but this is how you can catch their values
## if ( /^$COMPOUND/ ) {}
## if ( /^$HEADER/ ) {}
## if ( /^$SOURCE/ ) {}
if ( /^$CHAIN/ )
{
$chain = $';
next;
}
if ( /^$SEQUENCE/ )
{
my $tmp = $';
my $bad = 0;
if ( $tmp =~ /^x+$/i )
{
## we're a straight wild card chain
## ignore because that means we can be anything
## just a bunch of aa's stuck together but
## we don't know / care about specifics
push( @chain_wild, [ $chain, $tmp ] );
$bad++;
}
if ( $tmp ne uc( $tmp ) )
{
## we're dna, crude test, but it works on the data
## DNA is lower case
push( @chain_dna, [ $chain, $tmp ] );
$bad++;
}
if ( length( $tmp ) <= $min_size )
{
## we're too short
push( @chain_stubby, [ $chain, $tmp ] );
$bad++;
}
next if $bad;
push( @chain_seq_pairs, [ $chain, $tmp ] );
next;
}
}
## clean up
close( OUT );
close( SOURCE );
unlink( $blast_i, $blast_j ); ## no error checking?!
my $end = time - $begin;
print <<"ENDOFPRINT";
PDBs found : $pdb_count
Chains found : $chain_count
Blasts performed : $blast_count
Wild sequences : $wild_count
Stubby sequences : $stub_count
DNA sequences : $dna_count
TIME : $end seconds
ENDOFPRINT
print "\nDone\n";
exit(0);
Again, this script should help you get started in parsing a PDB file. But, you will have to read up on DBI if you want to stuff the data in Access( there is plenty of material already on the site about DBI, btw ).
If you need some more help, feel free to send me an email.
Will perl for money
JJ Knitis
(404) 624-1525
gt8073a@industrialmusic.com |