Re: XML Tags Stripping & Calculating checksum on it
by Anonymous Monk on Jul 16, 2008 at 07:29 UTC
|
#!/usr/bin/perl --
use strict;
use warnings;
use XML::Twig;
use Digest;
my $xml = q~<Transaction>
<MessageCode>100</MessageCode>
<ToAccountNo>12989898900</ToAccountNo>
<ToBranchCode>08876</ToBranchCode>
<FromAccountNo>S MARR SON </FromAccountNo>
<FromBranchCode>ABG89097</FromBranchCode>
<CurrencyCode>INR</CurrencyCode>
<Amount>0000000000221000000.00</Amount>
<TransactionDate>2008-07-10T12:04:25</TransactionDate>
<ValueDate>2008-07-10</ValueDate>
<CustomerRefNo>/CUST/</CustomerRefNo>
<ReferenceNo1>ADH7870822</ReferenceNo1>
<ReferenceNo2>/FRASER TRANSACTION</ReferenceNo2>
<DealerCode>1780</DealerCode>
<SalesOrganisation>1000</SalesOrganisation>
<TransactionType>N</TransactionType>
<SequenceNo>1</SequenceNo>
</Transaction>~;
{
my $ctx = Digest->new('SHA-1');# or MD5 ... ('SHA-1' also resolves to the core Digest::SHA)
my $t = XML::Twig->new(
TwigHandlers=> {
Transaction => sub {
print $_->children_text,$/;
$ctx->add($_->children_text);
}
}
);
$t->parse($xml);
print $/, $ctx->hexdigest,$/;
undef $ctx;
undef $t;
}
__END__
1001298989890008876S MARR SON ABG89097INR0000000000221000000.002008-07-10T12:04:252008-07-10/CUST/ADH7870822/FRASER TRANSACTION17801000N1
e5dc8115e0a2c22ffd034e07d4db8446b9d6696e
# Final data will be as below
1001298989890008876S MARR SON ABG89097INR0000000000221000000.002008-07-10T12:04:252008-07-10/CUST/ADH7870822/FRASER TRANSACTION17801000N1
|
Thanks for the approach, but I am wondering how I can achieve the functionality below:
next if ( $tag =~ /TotalAmount|NoOfRecords|TotalBatch|CurrentBatch|EODTransactionDate|BankFileSeqNo|TotAmount/i );
The above are some tags which I want to exclude from the checksum calculation. I am browsing the XML::Twig docs but am still not sure how to achieve this.
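One way the exclusion could look with XML::Twig is sketched below. It is a minimal, untested-against-real-data sketch: the sample record is hypothetical and shortened, the variable names ($skip, $kept) are made up for illustration, and it assumes the tags to skip are direct children of Transaction. The handler walks the children one by one instead of calling children_text, so each child can be checked against the exclusion list before its text is added to the digest.

```perl
use strict;
use warnings;
use XML::Twig;
use Digest;

# Tags to leave out of the checksum (list taken from the post above).
my $skip = qr/^(?:TotalAmount|NoOfRecords|TotalBatch|CurrentBatch|EODTransactionDate|BankFileSeqNo|TotAmount)$/i;

# Hypothetical sample record, shortened for illustration.
my $xml = q~<Transaction>
<MessageCode>100</MessageCode>
<TotalAmount>221000000.00</TotalAmount>
<CurrencyCode>INR</CurrencyCode>
<NoOfRecords>1</NoOfRecords>
</Transaction>~;

my $ctx  = Digest->new('SHA-1');
my $kept = '';    # text that went into the digest, kept only for display

my $twig = XML::Twig->new(
    twig_handlers => {
        Transaction => sub {
            my ($t, $elt) = @_;
            for my $child ($elt->children) {
                next if $child->tag =~ $skip;    # excluded tag: contributes nothing
                $kept .= $child->text;
                $ctx->add($child->text);
            }
        },
    },
);
$twig->parse($xml);

print "$kept\n";
print $ctx->hexdigest, "\n";
```

The same loop works with any object that has an add method, so a CRC could be substituted for the SHA-1 digest if that is what the other side computes.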
Re: XML Tags Stripping & Calculating checksum on it
by ady (Deacon) on Jul 16, 2008 at 08:02 UTC
|
Most of your processing time goes into file I/O, i.e. (as you indicate) the line-by-line parsing, plus writing an external file and generating the CRC on it.
Is this necessary? Couldn't you just read the transactions one by one from the file and run the CRC calculation on each XML transaction in memory? As in (untested):
Allan Dystrup
|
Thanks, but our requirement is odd: get all the data onto one line and calculate the checksum on that. I can't help it; our client uses a Java routine that computes the checksum in the same fashion, so I have to use the same method to match it. We did propose a file-based checksum instead of a string-based one to the client, with no luck.
|
First of all, a big thanks for putting effort into my query.
My purpose in calculating the checksum in the above fashion is to match the checksum from my client, who computes it this way in Java. I am trying to improve the performance of my script while keeping the final checksum the same.
If I calculate the checksum incrementally, how do I get the final checksum that our client can match?
|
If you use e.g. the String::CRC32 module, just use the incremental form $crc=crc32($additionalString,$crc), starting with $crc=0;. At the end, $crc is a 32-bit unsigned integer value that can be compared. Maybe you need a conversion into hex notation or something similar beforehand (Edit: I mean before the comparison, if you have to compare against some string format): $clear=sprintf("%X",$crc); or $clear=uc(unpack("H8", pack("N",$crc)));
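A small sketch of why the incremental form yields the same final value as a one-shot CRC over the concatenated string. As an assumption for portability, it uses crc32 from the core Compress::Zlib module rather than String::CRC32; both take the string first and the running CRC second. The chunk values are hypothetical. Note that plain %X does not zero-pad, so %08X is used here to make the sprintf form agree with the unpack("H8", ...) form for small values.

```perl
use strict;
use warnings;
use Compress::Zlib ();    # core module; stands in for String::CRC32 here

# Hypothetical chunks, e.g. the text of successive fields or records.
my @chunks = ('100', '12989898900', '08876', "INR\n");

# Incremental form: feed each chunk together with the running CRC.
my $crc = 0;
$crc = Compress::Zlib::crc32($_, $crc) for @chunks;

# One-shot form over the concatenated string, for comparison.
my $whole = Compress::Zlib::crc32(join '', @chunks);

# Two ways to get an 8-digit upper-case hex string for comparison.
my $hex_sprintf = sprintf '%08X', $crc;
my $hex_pack    = uc unpack 'H8', pack 'N', $crc;

printf "incremental=%s whole=%08X\n", $hex_sprintf, $whole;
```

Because the running value after the last chunk equals the CRC of the whole string, only the current chunk has to be held in memory.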
|
my $data = do { local( $/ ) ; <$XmlFile> } ;    # Slurp file
# Zap headers (excluded elements with their content); \2 is the captured tag name
$data =~ s!(^\s*)?\<(TotalAmount|NoOfRecords|TotalBatch|CurrentBatch|EODTransactionDate|BankFileSeqNo|TotAmount)\>(.+?)\<\/\2\>\n?!!mig;
$data =~ s!(^\s*)?</?.*?>\n?!!mig;    # Zap xml tags
my $crc = crc32($data);    # CRC remaining data (e.g. crc32 from String::CRC32)
Should be fast, but in general regexing XML/HTML is fragile and NOT recommended
(implicit assumptions about the data content are one of the traps, which may or may not apply in your case).
You should probably use XML::Twig instead!
Re: XML Tags Stripping & Calculating checksum on it
by Perlbotics (Archbishop) on Jul 16, 2008 at 09:22 UTC
|
I suggest interleaving the reading and writing of whole (processed) records or lines more tightly, keeping $tempContent short. Try to calculate the CRC incrementally; e.g. String::CRC32 or pack could be helpful here. The last chunk to contribute to the CRC will be the trailing "\n".
Re: XML Tags Stripping & Calculating checksum on it
by pajout (Curate) on Jul 16, 2008 at 22:19 UTC
|
If XML::Twig does not satisfy you, I can propose XML::Parser::Expat or a similar parser. You register your callbacks (i.e. your subroutines run when, for instance, an element start tag or a text node occurs), collect the proper strings and, as Perlbotics advised, update the CRC from time to time. The advantage is that you don't need to hold the whole collected text, and you can write very specific callbacks that reflect the required logic accurately. If you are not familiar with it, let me know and I will post an example here.
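The callback approach described above might look roughly like the following sketch. It is not the example pajout offered to post; the sample record, the shortened exclusion list and the names %skip, $skipping and $seen are all assumptions for illustration, and the core Compress::Zlib crc32 is swapped in for String::CRC32. Text is fed into the running CRC as the parser delivers it, so nothing beyond the current text fragment is kept for the checksum itself.

```perl
use strict;
use warnings;
use XML::Parser::Expat;
use Compress::Zlib ();    # core crc32; swapped in for String::CRC32

# Tags to exclude (subset of the list from the thread), lower-cased for lookup.
my %skip = map { lc($_) => 1 } qw(TotalAmount NoOfRecords TotalBatch);

# Hypothetical sample record.
my $xml = q~<Transaction>
<MessageCode>100</MessageCode>
<TotalAmount>221000000.00</TotalAmount>
<CurrencyCode>INR</CurrencyCode>
</Transaction>~;

my $crc      = 0;     # running CRC
my $skipping = 0;     # > 0 while inside an excluded element
my $seen     = '';    # collected text, kept only to show what was summed

my $parser = XML::Parser::Expat->new;
$parser->setHandlers(
    Start => sub {
        my ($p, $tag) = @_;
        # Enter (or go deeper into) an excluded element.
        $skipping++ if $skipping || $skip{ lc $tag };
    },
    End => sub {
        # Leave one level of an excluded element, if we are in one.
        $skipping-- if $skipping;
    },
    Char => sub {
        my ($p, $text) = @_;
        return if $skipping;
        return if $p->depth < 2;    # ignore whitespace directly under the root
        $seen .= $text;
        $crc = Compress::Zlib::crc32($text, $crc);
    },
);
$parser->parse($xml);
$parser->release;

printf "%s %08X\n", $seen, $crc;
```

The Char handler may be called several times for one text node, which is harmless here because the CRC is updated incrementally.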
Re: XML Tags Stripping & Calculating checksum on it
by CountZero (Bishop) on Jul 16, 2008 at 16:44 UTC
|
My comment is OT, but still... I really wonder why one wants to calculate a CRC-32 value on only the text portion of the file. To me it seems a totally wrong application of CRC-32. As you know, CRC-32 is only useful to detect "bursty" types of errors in files (say line noise in a modem transmission or hard-disk transmission errors), but then you have to look at the whole file. Only looking at the non-tag parts serves no purpose. You could have all the text OK but the tags totally goofed up, and you would never know, as the CRC-32 would still match.
Perhaps they use the CRC-32 as some kind of cryptographic check on the text data in the file (to see that it has not been tampered with), but because a CRC-32 is trivially easy to calculate, you can change the content and add a few bytes somewhere in the file to make the CRC-32 match again. As said above, it only "protects" against random, bursty types of changes, and for that you need to look at the transmitted file as a whole. So, my question remains: Why?
CountZero
"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
|
Right, I've seen this kind of checksumming before as an easy way to identify duplicates when batch files are processed. Nowadays one would use message digests (SHA, MD5, etc.) to further decrease the probability of false duplicates. Sometimes these files are imported into a DB or transcoded somehow. Then it makes (some) sense to focus on the content only, provided one can recover the correct sequence in order to recompute the checksum. Usually, those DBs contain columns like batchfilename and seqno.
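As a rough illustration of digest-based duplicate detection, the sketch below uses the core Digest::SHA module. The record strings are hypothetical stand-ins for the tag-stripped content of each batch record, and the %seen bookkeeping is an assumption about how one might track them, not something described in the thread.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1_hex);

# Hypothetical tag-stripped record contents; the third repeats the first.
my @records = (
    '10012989898900INR221000000.00',
    '10012989898901INR000000500.00',
    '10012989898900INR221000000.00',
);

my %seen;          # digest => index of the first record with that content
my @duplicates;    # pairs [first index, duplicate index]

for my $i (0 .. $#records) {
    my $digest = sha1_hex($records[$i]);
    if (exists $seen{$digest}) {
        push @duplicates, [ $seen{$digest}, $i ];
    }
    else {
        $seen{$digest} = $i;
    }
}

printf "record %d duplicates record %d\n", $_->[1], $_->[0] for @duplicates;
```

Storing only the digest next to columns such as batchfilename and seqno would allow the same comparison after the data has been imported.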
Perlbotics