PerlMonks  

get string between two < tags > in .js file (xml)

by kamchez (Initiate)
on Jul 03, 2012 at 08:53 UTC (#979623)
kamchez has asked for the wisdom of the Perl Monks concerning the following question:

Hey Guys.

I've been struggling with this for quite some time now ... I have a .js file with some text, and I need to find all "unknown" strings that sit between two tags: </Name><Symbol>XXX</Symbol>. The unknown string that I need is called XXX in this example, but it could be YYY or AAA or anything else really. The file looks like this (before the desired output) and is really large, usually around 16 MB or more:

{
    revision:1,
    roundLotSize:1,
    instrRegistry:"ABCDEFG0XXX",
    tradedCurrency:"EUR",
    priceType:"PER_UNIT",
    activationTime:0123456000000,
    inactivationTime:01234500000,
    publicationTime:123456000000,
    unpublicationTime:12345600000,
    xml:"<?xml version=\"1.0\" encoding=\"UTF-8\"?><Instrument xmlns=\"http://www.ngm.se/ns/InstrumentSchema/1.9\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema\"><Identification><OrderbookID>@ORDERBOOK_ID@</OrderbookID><Symbol>XXXX XXXXX</Symbol><ISIN>CH0000012345</ISIN><Name>Gearing Certificate Select Dividend Basket, 2008-2013</Name></Identification><LastUpdated>2012-07-02T17:48:42</LastUpdated><CFI>RWXXXX</CFI><TradingCurrency>EUR</TradingCurrency><PriceType>UNIT</PriceType><LotSize>1</LotSize><FirstTradedDate>2008-05-16</FirstTradedDate><Issuer><Name>ABC DE</Name><Symbol>XXX</Symbol><BIC>ABCDEFG0XXX</BIC><Country>CH</Country></Issuer><CSD><Name>Dummy Country xx</Name><BIC>ABCDEFG0XXX</BIC></CSD><IssuedQuantity>93</IssuedQuantity><Warrant EUSIPAClassification=\"2100\"><Underlyings><Underlying><Identification><Name>Dow Jones STOXX Select Dividend 30 Index</Name></Identification><Weight>0.25</Weight><Strike>2470.28</Strike></Underlying><Underlying><Identification><Name>Dow Jones U.S Select Dividend Price Index</Name></Identification><Weight>0.25</Weight><Strike>429.62</Strike></Underlying><Underlying><Identification><Name>Dow Jones Asia/Pacific Select Dividend 30 Index</Name></Identification><Weight>0.25</Weight><Strike>282.23</Strike></Underlying><Underlying><Identification><Name>Dummy Name Country Select Dividend Index</Name></Identification><Weight>0.25</Weight><Strike>1000</Strike></Underlying></Underlyings><PutCallIndicator>C</PutCallIndicator><CalculationPeriod><MaturityDay>2013-01-01</MaturityDay></CalculationPeriod><LastTradedDay>2013-01-01</LastTradedDay><ReimbursementDay>2013-01-01</ReimbursementDay><Arranger><Name>Some broker</Name></Arranger></Warrant></Instrument>",
    marketId:"XXXX",
    marketSegmentId:"ABCD",
    attribs:{SURVEILLANCE_DATA_CHANNEL_ID:"SD#1",MARKET_SERVER_ID:"MS#1",MARKET_DATA_CHANNEL_ID:"MD#5"},
    tickRules:"[100:100:inf]",
    altIds:{ISIN:"CH000001234", EXCHANGE_SYMBOL:"XXXX XXXXX", SECONDARY_MARKETPLACE_ASSIGNED_IDENTIFIER:"123456"}
},
{ ... },

Desired output: I need to find all of the lines containing </Name><Symbol>XYZ</Symbol> inside the xml: part of the text and create a new column named NEWCOLUMN:"XYZ". This is the desired output:

### we found this : </Name><Symbol>XXX</Symbol>
{
    revision:1,
    roundLotSize:1,
    instrRegistry:"ABCDEFG0XXX",
    tradedCurrency:"EUR",
    priceType:"PER_UNIT",
    activationTime:123456000000,
    inactivationTime:12345600000,
    publicationTime:123456000000,
    unpublicationTime:13624800000,
    xml:"<?xml version=\"1.0\" encoding=\"UTF-8\"?><Instrument xmlns=\"http://www.ngm.se/ns/InstrumentSchema/1.9\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema\"><Identification><OrderbookID>@ORDERBOOK_ID@</OrderbookID><Symbol>XXXX XXXXX</Symbol><ISIN>CH0000012345</ISIN><Name>Gearing Certificate Select Dividend Basket, 2008-2013</Name></Identification><LastUpdated>2012-07-00T00:00:00</LastUpdated><CFI>RWXXXX</CFI><TradingCurrency>XXX</TradingCurrency><PriceType>UNIT</PriceType><LotSize>1</LotSize><FirstTradedDate>2008-05-16</FirstTradedDate><Issuer><Name>ABC DE</Name><Symbol>XXX</Symbol><BIC>ABCDEFG0XXX</BIC><Country>CH</Country></Issuer><CSD><Name>Dummy Country Oy</Name><BIC>ABCDEFG0XXX</BIC></CSD><IssuedQuantity>93</IssuedQuantity><Warrant EUSIPAClassification=\"2100\"><Underlyings><Underlying><Identification><Name>Dummy STOXX Select Dividend 30 Index</Name></Identification><Weight>0.25</Weight><Strike>2470.28</Strike></Underlying><Underlying><Identification><Name>Dummy U.S Select Dividend Price Index</Name></Identification><Weight>0.25</Weight><Strike>100</Strike></Underlying><Underlying><Identification><Name>Dummy Country/Pacific Select Dividend 30 Index</Name></Identification><Weight>0.25</Weight><Strike>100.00</Strike></Underlying><Underlying><Identification><Name>Dummy Country Select Dividend Index</Name></Identification><Weight>0.25</Weight><Strike>100.00</Strike></Underlying></Underlyings><PutCallIndicator>C</PutCallIndicator><CalculationPeriod><MaturityDay>2013-01-01</MaturityDay></CalculationPeriod><LastTradedDay>2013-01-01</LastTradedDay><ReimbursementDay>2013-01-01</ReimbursementDay><Arranger><Name>Some broker</Name></Arranger></Warrant></Instrument>",
    marketId:"XXX",
    marketSegmentId:"ABCDE",
    attribs:{SURVEILLANCE_DATA_CHANNEL_ID:"SD#1",MARKET_SERVER_ID:"MS#1",MARKET_DATA_CHANNEL_ID:"MD#5"},
    tickRules:"[100:100:inf]",
    altIds:{ISIN:"CH000001234", EXCHANGE_SYMBOL:"XXXX XXXXX", SECONDARY_MARKETPLACE_ASSIGNED_IDENTIFIER:"123456"}
},

This is what I've got so far ... I've tried everything (awk, sed, etc.), but I found that this one-liner is the one that gets the correct data. I'm just not sure how to loop it through the entire text file and produce the new column:

perl -0777 -pe 's%.*</Name><Symbol>%%s;s%</Symbol>.*%%s' txtfile.js

produces: BNP

This is what I've written, but I'm not sure how to make that one-liner work inside a script:

#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;
use Text::ParseWords;

if ($#ARGV == 0) {
    open my $file, "<", $ARGV[0]
        or die "Couldn't open file '$ARGV[0]': $! \nDid you specify a valid file?";
    my ($SYMBOL);
    while (<$file>) {
        ## this obviously isn't working ... ### but this is what I want to achieve
        $SYMBOL = s%.*</Name><Symbol>%%s;s%</Symbol>.*%%s;
        print "NEWCOLUMN:\"$SYMBOL\"\n";
    }
}
else {
    print "You need to specify an input file \n";
    print "\n";
    print "They are usually located here : \n";
    print "/PATH/* \n";
    print "\n";
    print "Usage : ".basename($0)." difffile.txt \n";
    print "\n";
    exit;
}

Re: get string between two < tags > in .js file (xml)
by Anonymous Monk on Jul 03, 2012 at 09:33 UTC
      Thank you for replying ... That one-liner is not going to help me here, because it removes everything up to and including </Name><Symbol> and everything from </Symbol> onwards, leaving me with only one of the matching strings:
      perl -0777 -pe 's#.*</Name><Symbol>##s;s#</Symbol>.*##s'
      Any ideas on how to solve this the proper way?

        Any ideas on how to solve this the proper way?

        Sure, but it appears you don't want to do it that way :) (you want regex)

Re: get string between two < tags > in .js file (xml)
by jethro (Monsignor) on Jul 03, 2012 at 11:20 UTC

    You need m%%, the regular expression match operator, instead of s%%%, the regular expression substitution operator. The code below should give you a start; you can find out more in the Perl documentation (perlre).

    my ($result)= m%</Name><Symbol>([^<]*)</Symbol>%;
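
The match above captures one symbol per call. As a minimal sketch (not part of the original reply), adding the /g flag and evaluating the match in list context returns every capture at once, which covers the "find all of them" part of the question. The helper name issuer_symbols and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# Return every symbol that directly follows a closing </Name> tag.
# In list context, m//g returns the captures from all matches at once.
sub issuer_symbols {
    my ($text) = @_;
    return $text =~ m%</Name><Symbol>([^<]*)</Symbol>%g;
}

# Hypothetical sample, shaped like the xml: field in the question.
my $sample = '<Identification><Symbol>XXXX XXXXX</Symbol></Identification>'
           . '<Issuer><Name>ABC DE</Name><Symbol>XXX</Symbol></Issuer>'
           . '<Issuer><Name>FG HI</Name><Symbol>YYY</Symbol></Issuer>';

# The first <Symbol> is skipped because it does not follow </Name>.
print qq{NEWCOLUMN:"$_"\n} for issuer_symbols($sample);
```

Called in scalar context the same match would only report success or failure, so the caller needs to assign the result to a list or array.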

      Great! Thanks a lot, this is what I've got:

      use strict;
      use warnings;
      use File::Basename;
      use Text::ParseWords;

      if ($#ARGV == 0) {
          open my $file, "<", $ARGV[0]
              or die "Couldn't open file '$ARGV[0]': $! \nDid you specify a valid file?";
          while (<$file>) {
              if ($_ =~ m%</Name><Symbol>([^<]+)%) {
                  my $SYMBOL;
                  $SYMBOL = $1;
                  print "NEWCOLUMN:\"$SYMBOL\"\n";
              }
          }
      }

      Next part ... print all matches and insert them into each field

Re: get string between two < tags > in .js file (xml)
by sundialsvc4 (Abbot) on Jul 03, 2012 at 14:19 UTC

    Just do it the right way and be done. 16 megabytes is not “huge.” This is a JSON-formatted file, and within that file some of the records are in XML format. Therefore, first use a CPAN package that understands JSON. Then, feed the extracted strings into another CPAN package that understands XML. From here, an XPath query can dive right into the XML to extract from it precisely whatever you need to know. Because of XPath, you do not have to write code to pick apart the XML structure itself. You could, in less than 50 lines of “code that you actually had to write,” be looking at a robust and reliable (i.e. “real”) solution to this task. Finito!

    You are simply de-constructing the file in more or less the same way that it was originally constructed; probably using the same tool. It is, if I may say, abjectly pointless to “prove” that something can be done the wrong way, even if you “succeed.” (And, please, take this stern-sounding advice in an impersonal way, not as a flame, but as the pointed and direct admonition from an engineering colleague who deems it very important to get this point across.)
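
The XPath step described above might look roughly like the sketch below. It is only an illustration under several assumptions: it uses XML::LibXML (a CPAN module, not core Perl), it assumes the xml: value has already been pulled out of the record and had its backslash escapes removed, and the prefix "i" and the helper name issuer_symbol are invented here. The JSON-decoding step is not shown, because the sample record uses unquoted keys and a strict JSON parser may reject it as posted.

```perl
use strict;
use warnings;
use XML::LibXML;    # CPAN module, not part of core Perl

# Namespace URI taken from the xmlns attribute in the sample record.
my $NS = 'http://www.ngm.se/ns/InstrumentSchema/1.9';

# Given one unescaped XML document as a string, return the issuer's symbol.
sub issuer_symbol {
    my ($xml) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml);
    my $xpc = XML::LibXML::XPathContext->new($doc);
    # The document uses a default namespace, so XPath needs a prefix for it.
    $xpc->registerNs(i => $NS);
    return $xpc->findvalue('/i:Instrument/i:Issuer/i:Symbol');
}

# Trimmed-down, hypothetical version of the document from the question.
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<Instrument xmlns="http://www.ngm.se/ns/InstrumentSchema/1.9">
  <Identification><Symbol>XXXX XXXXX</Symbol></Identification>
  <Issuer><Name>ABC DE</Name><Symbol>XXX</Symbol></Issuer>
</Instrument>
XML

print issuer_symbol($xml), "\n";
```

Because the path names the Issuer element explicitly, the Symbol under Identification is never considered, which is the distinction the regex approach gets from anchoring on </Name>.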

      Thank you for your reply ... Yes, that would be the proper way of doing it; you are absolutely right and I will look into it. For now, here is a quick and dirty solution that solved it for me:

      use strict;
      use warnings;
      use File::Basename;
      use Text::ParseWords;

      if ($#ARGV == 0) {
          open my $file_in, "<", $ARGV[0]
              or die "Couldn't open file '$ARGV[0]': $! \nDid you specify a valid file?";
          # open up a new file to write the changes made to
          open my $file_out, ">", "$ARGV[0].new"
              or die "Can't write new file '$ARGV[0].new' : $! \nDo you have write permissions?";
          # these are our currently active market makers
          my @list=("BNP","CBK","CIT","NDS","OHD","OHM","RBN","RBS","SEK","SGA");
          while (<$file_in>) {
              ## write all changes to new file
              print $file_out $_;
              # if we find a match for any Symbols
              if ($_ =~ m%</Name><Symbol>([^<]+)%) {
                  my $SYMBOL;
                  my $MATCH;
                  $SYMBOL = $1;
                  # and the $SYMBOL matches the array @list for active market makers
                  if (grep {$_ eq $SYMBOL} @list) {
                      # Print and add the line marketMakerOrganization: $SYMBOL to the $file_out
                      print $file_out "\t\tmarketMakerOrganization:\"$SYMBOL\",\n";
                  }
              }
          }
      }
      else {
          print "You need to specify an input file \n";
          print "\n";
          print "They are usually located here : \n";
          print "/PATH/orderbooks-xx-hostname-yy.x.xxx.xx.js";
          print "\n";
          print "Usage : ".basename($0)." difffile.txt \n";
          print "\n";
          exit;
      }
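
The script above checks each symbol against @list with grep, which walks the whole list for every matching line. One common alternative, sketched here as a suggestion rather than as anything the poster wrote, is to build a hash once and test membership with a lookup. The names %is_active and extra_line_for are hypothetical, and this version also requires the closing </Symbol> tag in the pattern.

```perl
use strict;
use warnings;

# Market-maker codes copied from the script above.
my @list = qw(BNP CBK CIT NDS OHD OHM RBN RBS SEK SGA);

# Build the lookup table once: each code becomes a hash key.
my %is_active = map { $_ => 1 } @list;

# Return the extra line to write after an input line,
# or nothing if the line has no symbol or the symbol is not in the list.
sub extra_line_for {
    my ($line) = @_;
    return unless $line =~ m%</Name><Symbol>([^<]+)</Symbol>%;
    return unless $is_active{$1};
    return qq{\t\tmarketMakerOrganization:"$1",\n};
}

# Usage with a made-up input line.
my $line  = '<Issuer><Name>ABC DE</Name><Symbol>BNP</Symbol></Issuer>';
my $extra = extra_line_for($line);
print $extra if defined $extra;
```

With only ten codes the speed difference is unlikely to matter on a 16 MB file; the main gain is that the membership test reads as a single expression.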
