XML Tags Stripping & Calculating checksum on it

by harishnuti (Beadle)
on Jul 16, 2008 at 06:58 UTC

harishnuti has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I have an unusual requirement for calculating a checksum on an XML file, as follows:

* Read the whole XML file
* Strip off all XML tags
* Bring all XML data into one line
* Call a CRC32 routine to calculate the checksum on this one line

To achieve this I am doing the following:

    my $tempContent = undef; # Holds final XML data in one line
    while (<FILE>){
        my $xml = $_;
        chomp($_);
        while($xml=~m/\<(.+?)\>(.+?)\<\/\1\>/sig){
            my $tag=lc(strip($1));
            ## Below line is the selective condition to ignore all header XML tags and consider only data
            next if ( $tag =~ /TotalAmount|NoOfRecords|TotalBatch|CurrentBatch|EODTransactionDate|BankFileSeqNo|TotAmount/i );
            my $attribs="?";
            my $value=strip($2);
            $tempContent .= qq~$value~;
        }
    }
    $tempContent .= qq~\n~;
    open(OUT,">tmp") or die "Unable to open $! \n";
    print OUT $tempContent;
    close OUT;
    CRC32 ( tmp) ; # this one can be ignored, I have no problem here

    # Showing one transaction, but in the daily file there are plenty
    __DATA__
    <Transaction>
    <MessageCode>100</MessageCode>
    <ToAccountNo>12989898900</ToAccountNo>
    <ToBranchCode>08876</ToBranchCode>
    <FromAccountNo>S MARR SON </FromAccountNo>
    <FromBranchCode>ABG89097</FromBranchCode>
    <CurrencyCode>INR</CurrencyCode>
    <Amount>0000000000221000000.00</Amount>
    <TransactionDate>2008-07-10T12:04:25</TransactionDate>
    <ValueDate>2008-07-10</ValueDate>
    <CustomerRefNo>/CUST/</CustomerRefNo>
    <ReferenceNo1>ADH7870822</ReferenceNo1>
    <ReferenceNo2>/FRASER TRANSACTION</ReferenceNo2>
    <DealerCode>1780</DealerCode>
    <SalesOrganisation>1000</SalesOrganisation>
    <TransactionType>N</TransactionType>
    <SequenceNo>1</SequenceNo>
    </Transaction>

    # Final data will be as below
    1001298989890008876S MARR SON ABG89097INR0000000000221000000.002008-07-10T12:04:252008-07-10/CUST/ADH7870822/FRASER TRANSACTION17801000N1
    # Will calculate checksum for above one line

My question is: is there a better way to do the above? Also, when the XML file contains around 10,000 or more transactions, it takes a long time, since I am processing line by line and appending to a temp variable. Please let me know if I can improve any of the above areas.

Replies are listed 'Best First'.
Re: XML Tags Stripping & Calculating checksum on it
by Anonymous Monk on Jul 16, 2008 at 07:29 UTC
    I think this is better
        #!/usr/bin/perl --
        use strict;
        use warnings;
        use XML::Twig;
        use Digest;

        my $xml = q~<Transaction>
        <MessageCode>100</MessageCode>
        <ToAccountNo>12989898900</ToAccountNo>
        <ToBranchCode>08876</ToBranchCode>
        <FromAccountNo>S MARR SON </FromAccountNo>
        <FromBranchCode>ABG89097</FromBranchCode>
        <CurrencyCode>INR</CurrencyCode>
        <Amount>0000000000221000000.00</Amount>
        <TransactionDate>2008-07-10T12:04:25</TransactionDate>
        <ValueDate>2008-07-10</ValueDate>
        <CustomerRefNo>/CUST/</CustomerRefNo>
        <ReferenceNo1>ADH7870822</ReferenceNo1>
        <ReferenceNo2>/FRASER TRANSACTION</ReferenceNo2>
        <DealerCode>1780</DealerCode>
        <SalesOrganisation>1000</SalesOrganisation>
        <TransactionType>N</TransactionType>
        <SequenceNo>1</SequenceNo>
        </Transaction>~;

        {
            my $ctx = Digest->new('SHA1');# MD5 ...
            my $t = new XML::Twig(
                TwigHandlers=> {
                    Transaction => sub {
                        print $_->children_text,$/;
                        $ctx->add($_->children_text);
                    }
                }
            );
            $t->parse($xml);
            print $/, $ctx->hexdigest,$/;
            undef $ctx;
            undef $t;
        }
        __END__
        1001298989890008876S MARR SON ABG89097INR0000000000221000000.002008-07-10T12:04:252008-07-10/CUST/ADH7870822/FRASER TRANSACTION17801000N1

        e5dc8115e0a2c22ffd034e07d4db8446b9d6696e

        # Final data will be as below
        1001298989890008876S MARR SON ABG89097INR0000000000221000000.002008-07-10T12:04:252008-07-10/CUST/ADH7870822/FRASER TRANSACTION17801000N1

      Thanks for the approach, but I am wondering how I can achieve the functionality below:

          next if ( $tag =~ /TotalAmount|NoOfRecords|TotalBatch|CurrentBatch|EODTransactionDate|BankFileSeqNo|TotAmount/i );

      These are some tags which I want to exclude from the checksum calculation. I am browsing the XML::Twig docs but am still not sure how to achieve this.
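One way the exclusion might be expressed with XML::Twig is sketched below. It is an illustration under assumptions, not a tested drop-in: the `<Batch>` root, the `<Header>` wrapper and the sample values are hypothetical, and it assumes only direct children of `<Transaction>` carry data. The handler walks the element children of each transaction and skips any tag in the exclusion list from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Tags whose values must not contribute to the checksum (list from the question).
my %skip = map { lc($_) => 1 } qw(
    TotalAmount NoOfRecords TotalBatch CurrentBatch
    EODTransactionDate BankFileSeqNo TotAmount
);

# Hypothetical sample input: <Batch> and <Header> are assumptions.
my $xml = <<'XML';
<Batch>
  <Header>
    <NoOfRecords>1</NoOfRecords>
    <TotalAmount>100.00</TotalAmount>
  </Header>
  <Transaction>
    <MessageCode>100</MessageCode>
    <TotAmount>5</TotAmount>
    <CurrencyCode>INR</CurrencyCode>
    <SequenceNo>1</SequenceNo>
  </Transaction>
</Batch>
XML

my $line = '';    # collected data, all on one line
my $twig = XML::Twig->new(
    twig_handlers => {
        Transaction => sub {
            my ($t, $trans) = @_;
            # '#ELT' restricts the walk to real elements (no text nodes).
            for my $child ($trans->children('#ELT')) {
                next if $skip{ lc $child->tag };   # excluded tag
                $line .= $child->text;
            }
            $t->purge;    # release what has been parsed so far
        },
    },
);
$twig->parse($xml);

print "$line\n";
```

Because the handler only fires for `<Transaction>`, anything in a separate header element is left out without being listed; the `%skip` lookup covers the case where an excluded tag appears inside a transaction.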
Re: XML Tags Stripping & Calculating checksum on it
by ady (Deacon) on Jul 16, 2008 at 08:02 UTC
    Most of the processing time goes into file I/O, i.e. (as you indicate) the line-by-line parsing, plus writing an external file and generating the CRC on it.

    Is this necessary? Couldn't you just read the transactions one-by-one from the file, and run the CRC-calculation on each XML-trans in memory ?? -- As in (untested) :
    Allan Dystrup

      Thanks, but our requirement is weird: all the data has to be brought into one line and the checksum calculated on that. I can't help it; our client uses a Java routine that works this way, so I have to use the same method for the checksums to match. We did propose a file-based checksum to the client instead of a string-based one, with no luck.

      First of all, a big thanks for putting effort into my query.

      My purpose in calculating the checksum this way is to match the checksum from my client, who does it this way in Java. I am trying to improve the performance of my script while keeping the final checksum the same.

      If I calculate an incremental checksum, how can I get a final checksum that our client can match?
        If you use e.g. the String::CRC32 module, just use the incremental form $crc=crc32($additionalString,$crc), starting with $crc=0;. In the end, $crc is a 32-bit unsigned integer value that can be compared. Maybe you need a conversion into hex notation or something similar beforehand (Edit: I mean before comparison, if you have to compare against some string format): $clear=sprintf("%X",$crc); or $clear=uc(unpack("H8", pack("N",$crc)));
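The property the incremental form relies on can be checked in a few lines: feeding chunks one at a time yields the same value as one CRC over the concatenated string. This sketch swaps in `crc32` from the core module Compress::Zlib instead of String::CRC32 (same calling shape, data first and running value second); the chunk values are made up for illustration.

```perl
use strict;
use warnings;
use Compress::Zlib qw(crc32);   # core module, used here in place of String::CRC32

# Hypothetical field values, followed by the trailing newline.
my @chunks = ("100", "12989898900", "08876", "INR", "\n");

# One-shot CRC over the full one-line string.
my $whole = crc32(join '', @chunks);

# Incremental CRC: pass each chunk together with the running value.
my $crc = 0;
$crc = crc32($_, $crc) for @chunks;

printf "whole=%08X incremental=%08X\n", $whole, $crc;
```

Note that `sprintf("%X", ...)` drops leading zeros while `%08X` and the `unpack("H8", ...)` form always give eight digits, so the choice has to follow whatever string format the other side produces.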
            my $data = do { local( $/ ) ; <$XmlFile> } ; # Slurp file
            # Zap headers (\2 is the tag name; group 1 is the optional indentation)
            $data =~ s!(^\s*)?\<(TotalAmount|NoOfRecords|TotalBatch|CurrentBatch|EODTransactionDate|BankFileSeqNo|TotAmount)\>(.+?)\<\/\2\>\n?!!mig;
            $data =~ s!(^\s*)?</?.*?>\n?!!mig; # Zap xml tags
            CRC32($data); # CRC remaining data

        Should be fast, but in general regex'ing x/html is fragile and NOT recommended.
        (Implicit assumptions about the data content are one of the traps, which may or may not apply in your case.)
        You probably should use XML::Twig instead!
Re: XML Tags Stripping & Calculating checksum on it
by Perlbotics (Bishop) on Jul 16, 2008 at 09:22 UTC
    I suggest interleaving the reading and writing of whole (processed) records or lines more tightly, keeping $tempContent short. Try to calculate the CRC incrementally. E.g. String::CRC32 or pack could be helpful here. The last chunk to contribute to the CRC will be the trailing "\n".
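A minimal sketch of that suggestion, assuming one `<tag>value</tag>` pair per line as in the question: each value is folded into a running CRC as soon as it is read, so nothing accumulates in memory and no temp file is written. It uses `crc32` from the core module Compress::Zlib as a stand-in for String::CRC32, and an in-memory filehandle with made-up data in place of the real file.

```perl
use strict;
use warnings;
use Compress::Zlib qw(crc32);   # stand-in for String::CRC32

# Same exclusion pattern as in the question (unanchored, case-insensitive).
my $skip = qr/TotalAmount|NoOfRecords|TotalBatch|CurrentBatch|EODTransactionDate|BankFileSeqNo|TotAmount/i;

# Hypothetical sample data read through an in-memory filehandle.
my $xml = <<'XML';
<Transaction>
<MessageCode>100</MessageCode>
<NoOfRecords>1</NoOfRecords>
<CurrencyCode>INR</CurrencyCode>
</Transaction>
XML

open my $in, '<', \$xml or die "Cannot open in-memory file: $!";

my $crc = 0;
while (my $line = <$in>) {
    # Leaf elements only: <tag>value</tag> on a single line.
    while ($line =~ m{<([^<>/]+)>(.+?)</\1>}g) {
        my ($tag, $value) = ($1, $2);
        next if $tag =~ $skip;
        $crc = crc32($value, $crc);   # fold the value into the running CRC
    }
}
close $in;

$crc = crc32("\n", $crc);             # the trailing newline goes in last

printf "%08X\n", $crc;
```

The result equals the CRC of the string `"100INR\n"` computed in one go, which is what the original script would have written to its temp file for this input.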
Re: XML Tags Stripping & Calculating checksum on it
by pajout (Curate) on Jul 16, 2008 at 22:19 UTC
    If XML::Twig does not satisfy you, I can suggest XML::Parser::Expat, or some similar parser: you can register your callbacks (i.e. run your subroutines when, for instance, an element start tag or a text node occurs), collect the proper strings and, as Perlbotics advised, recalculate the CRC from time to time.

    The advantage is that you don't need to keep the whole collected text, and you can write very specific callbacks that reflect the required logic accurately. If you are not familiar with it, let me know and I will post an example here.
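A rough sketch of what such callbacks could look like, under stated assumptions: only leaf elements (those without child elements) contribute, text is hashed as UTF-8 bytes, and `crc32` comes from the core module Compress::Zlib. The sample XML is hypothetical, and whether the Java side treats whitespace and encoding the same way would have to be checked against real data.

```perl
use strict;
use warnings;
use XML::Parser::Expat;
use Compress::Zlib qw(crc32);
use Encode qw(encode_utf8);

my %skip = map { lc($_) => 1 } qw(
    TotalAmount NoOfRecords TotalBatch CurrentBatch
    EODTransactionDate BankFileSeqNo TotAmount
);

# Hypothetical sample input.
my $xml = <<'XML';
<Transaction>
  <MessageCode>100</MessageCode>
  <NoOfRecords>1</NoOfRecords>
  <CurrencyCode>INR</CurrencyCode>
</Transaction>
XML

my $crc = 0;
my @stack;    # one entry per open element: tag, collected text, has_child flag

my $parser = XML::Parser::Expat->new;
$parser->setHandlers(
    Start => sub {
        my ($p, $tag) = @_;
        $stack[-1]{has_child} = 1 if @stack;   # the parent is not a leaf
        push @stack, { tag => $tag, text => '', has_child => 0 };
    },
    Char => sub {
        my ($p, $text) = @_;
        $stack[-1]{text} .= $text if @stack;   # text may arrive in pieces
    },
    End => sub {
        my ($p, $tag) = @_;
        my $elt = pop @stack;
        return if $elt->{has_child} or $skip{ lc $tag };
        # Assumption: the checksum is taken over UTF-8 encoded bytes.
        $crc = crc32(encode_utf8($elt->{text}), $crc);
    },
);
$parser->parse($xml);
$parser->release;

$crc = crc32("\n", $crc);    # trailing newline, as in the original script

printf "%08X\n", $crc;
```

Indentation and line breaks between elements end up in the text of the enclosing `<Transaction>`, which is discarded because it has children, so only the leaf values reach the CRC.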

Re: XML Tags Stripping & Calculating checksum on it
by CountZero (Bishop) on Jul 16, 2008 at 16:44 UTC
    My comment is OT, but still ...

    I really wonder why one wants to calculate a CRC-32 value on only the text portion of the file.

    To me it seems a totally wrong application of CRC-32. As you know, CRC-32 is only useful to detect "bursty" type of errors in files (say linenoise in a modem transmission or hard-disk transmission errors), but then you have to look at the whole file. Only looking at the non-tag parts serves no purpose. You could have all the text OK, but the tags totally goofed-up and you would never know as the CRC-32 would still match.

    Perhaps they use the CRC-32 as some kind of cryptographic check on the text data in the file (to see it has not been tampered with), but due to the trivially easy way to calculate a CRC-32, you can change the content and add a few bytes somewhere in the file which would make the CRC-32 match again.

    As said above, it only "protects" against random, bursty type of changes but for that you need to look at the transmitted file as a whole. So, my question remains: Why?
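The point about goofed-up tags is easy to demonstrate. In the sketch below, two hypothetical documents carry the same text but have their tags swapped; a text-only CRC cannot tell them apart. The helper `text_crc` is made up for this illustration and uses `crc32` from the core module Compress::Zlib.

```perl
use strict;
use warnings;
use Compress::Zlib qw(crc32);

# Hypothetical helper: strip every tag and line break, checksum the rest.
sub text_crc {
    my ($xml) = @_;
    (my $text = $xml) =~ s/<[^>]*>//g;   # remove tags
    $text =~ s/\s*\n\s*//g;              # join everything on one line
    return crc32($text);
}

my $good = "<Transaction>\n"
         . "<ToAccountNo>12989898900</ToAccountNo>\n"
         . "<Amount>221.00</Amount>\n"
         . "</Transaction>\n";

# Same values in the same order, but the two tags are swapped.
my $broken = "<Transaction>\n"
           . "<Amount>12989898900</Amount>\n"
           . "<ToAccountNo>221.00</ToAccountNo>\n"
           . "</Transaction>\n";

printf "text-only: %08X vs %08X\n", text_crc($good), text_crc($broken);
printf "whole doc: %08X vs %08X\n", crc32($good),    crc32($broken);
```

Both text-only values are identical because both documents reduce to the same string; a CRC over the whole document would almost certainly differ.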


    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      Right, I've seen this kind of checksumming before as an easy way to identify duplicates when batch files are processed. Nowadays one would use message digests (SHA, MD5, etc.) to further decrease the probability of false duplicates. Sometimes, these files are imported into a DB or transcoded somehow. Then, it makes (some) sense to focus on the content only, provided one can recover the correct sequence in order to re-compute the checksum. Usually, those DBs contain columns like batchfilename and seqno.
