I have XML files from which I need to extract certain values, so I wrote code that simply searches for the relevant text strings and extracts the values accordingly.
The problem is: how do I print the value that belongs to formal_charge (e.g. -1)? Note that formal_charge is not a tag of its own; it is the text inside a kind tag.
Here is an excerpt from one of my XML files:
<weight>18.998403205</weight>
<name>fluoride</name>
<smiles>[F-]</smiles>
<accession>W00662</accession>
<experimental_properties>
<property>
<kind>water_solubility</kind>
<value>0.00169 mg/mL at 25 °C</value>
<source/>
</property>
<predicted_properties>
<property>
<kind>formal_charge</kind>
<value>-1</value>
<source>ChemAxon</source>
</property>
The values I want are:
weight (18.9984),
name (fluoride),
accession (W00662),
formal_charge (-1)
My code:
sub load_files() {
  #get a list of all files in directory; ignore all files beginning with a . and other sub directories
  opendir(my $dh, $dirname) or die "can't opendir $dirname: $!";
  #only keep those not beginning with '.' and are files
  my @files = grep (/^[^\.]/ && -f "$dirname/$_", readdir($dh));
  #sort lexically, 'B' comes before 'a', so that output list is always in same order
  @files = sort(@files);
  closedir $dh;
  my $numfiles = 0;
  foreach my $file (@files) { #loop through the files
    $numfiles++;
    my $accefound = 0;
    my $namefound = 0;
    my $monofound = 0;
    my $chargefound = 0;
    open(my $file_fh, "< $dirname/$file") or die("$$: Error: failed to open file $dirname/$file. $!\n");
    while(<$file_fh>) { #read each line of file
      if (/(<weight>)(.+)(<\/weight>)/ && !$monofound) { #if first encounter with the tag
        $monofound = $2;
        $monofound =~ s/^\s+//; #trim leading whitespace of string
        $monofound =~ s/\s+$//; #trim trailing whitespace of string
      }
      elsif (/(<name>)(.+)(<\/name>)/ && !$namefound) { #if first encounter with the tag
        $namefound = $2;
        $namefound =~ s/^\s+//; #trim leading whitespace of string
        $namefound =~ s/\s+$//; #trim trailing whitespace of string
      }
      elsif (/(<accession>)(.+)(<\/accession>)/ && !$accefound) { #if first encounter with the tag (the tag might not be unique)
        $accefound = $2;
        $accefound =~ s/^\s+//; #trim leading whitespace of string
        $accefound =~ s/\s+$//; #trim trailing whitespace of string
      }
      elsif (/(<formal_charge>)(.+)(<\/formal_charge>)/ && !$chargefound) { #if first encounter with the tag
        $chargefound = $2;
        $chargefound =~ s/^\s+//; #trim leading whitespace of string
        $chargefound =~ s/\s+$//; #trim trailing whitespace of string
      }
    }
    print "$monofound\t$namefound\t$accefound\t$chargefound\n";
    close($file_fh) or die("$$: Error: failed to close file $dirname/$file. $!\n");
  }
}
main();
What I got is:
_OUTPUT DATA_
18.998403205 fluoride W00662 0
The charge column does not show -1; it prints "0" instead, because the file contains no formal_charge tag for my regex to match. I know I should be matching the "value" tag, but there are many "value" tags in the file, so how do I match the one that belongs to this property rather than incorrectly matching another value tag?
<property>
<kind>formal_charge</kind>
<value>-1</value>
<source>ChemAxon</source>
</property>
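One way this could work while still reading line by line is to remember that the formal_charge kind line has been seen, and then take the next value line. The following is only a sketch under assumptions: each tag sits on its own line as in the excerpt, the value follows its kind inside the same property block, and the helper name extract_formal_charge is made up for illustration. The sample data is an abridged copy of the excerpt, read through an in-memory filehandle.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the <value> that follows
# <kind>formal_charge</kind> inside the same <property> block,
# or undef if no such value is found.
sub extract_formal_charge {
    my ($fh) = @_;
    my $in_charge = 0;    # true once the formal_charge <kind> line was seen
    while (my $line = <$fh>) {
        if ($line =~ m{<kind>\s*formal_charge\s*</kind>}) {
            $in_charge = 1;
        }
        elsif ($in_charge && $line =~ m{<value>\s*(.*?)\s*</value>}) {
            return $1;    # first <value> after the matching <kind>
        }
        elsif ($line =~ m{</property>}) {
            $in_charge = 0;    # block ended without a value; reset the flag
        }
    }
    return;
}

# Abridged sample from the post (assumed layout: one tag per line)
my $sample = <<'XML';
<property>
<kind>water_solubility</kind>
<value>0.00169 mg/mL</value>
<source/>
</property>
<property>
<kind>formal_charge</kind>
<value>-1</value>
<source>ChemAxon</source>
</property>
XML

open(my $fh, '<', \$sample) or die "cannot open in-memory sample: $!";
my $charge = extract_formal_charge($fh);
close($fh);
print defined $charge ? "$charge\n" : "not found\n";
```

The same flag could be folded into the existing while loop in place of the formal_charge branch. Returning undef rather than 0 for "not found" keeps a genuine charge of 0 distinguishable from a missing one.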
I hope there is no need to involve a module, and that simply searching for the relevant strings is sufficient?
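For what it is worth, a module-free variant that does not depend on line breaks might read the whole file into one string and inspect each property block in turn. This is a sketch only: it assumes property blocks are not nested, the helper name property_value is invented here, and plain regexes remain fragile against XML features such as attributes, comments or CDATA sections.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the whole XML text as one string and returns the
# <value> of the first <property> whose <kind> equals $kind, or undef.
sub property_value {
    my ($xml, $kind) = @_;
    # Walk over each <property>...</property> block; /s lets . match newlines.
    while ($xml =~ m{<property>(.*?)</property>}sg) {
        my $block = $1;
        next unless $block =~ m{<kind>\s*\Q$kind\E\s*</kind>};
        return $1 if $block =~ m{<value>\s*(.*?)\s*</value>}s;
    }
    return;
}

# Abridged sample from the post; a real file could be slurped with
#   my $xml = do { local $/; <$file_fh> };
my $xml = <<'XML';
<property>
<kind>water_solubility</kind>
<value>0.00169 mg/mL</value>
<source/>
</property>
<property>
<kind>formal_charge</kind>
<value>-1</value>
<source>ChemAxon</source>
</property>
XML

my $formal_charge = property_value($xml, 'formal_charge');
my $solubility    = property_value($xml, 'water_solubility');
print "$formal_charge\t$solubility\n";
```

Because the kind is a parameter, the same helper could fetch any other property without adding another elsif branch.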