http://www.perlmonks.org?node_id=979623

kamchez has asked for the wisdom of the Perl Monks concerning the following question:

Hey Guys.

I've been struggling with this for quite some time now. I have a .js file with some text, and I need to find all "unknown" strings that sit between two tags: </Name><Symbol>XXX</Symbol>. The unknown string I need is called XXX in this example, but it could be YYY or AAA or anything else really. The file looks like this (before the desired output) and is really large, usually around 16 MB or larger:

{ revision:1, roundLotSize:1, instrRegistry:"ABCDEFG0XXX", tradedCurrency:"EUR", priceType:"PER_UNIT", activationTime:0123456000000, inactivationTime:01234500000, publicationTime:123456000000, unpublicationTime:12345600000, xml:"<?xml version=\"1.0\" encoding=\"UTF-8\"?><Instrument xmlns=\"http://www.ngm.se/ns/InstrumentSchema/1.9\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema\"><Identification><OrderbookID>@ORDERBOOK_ID@</OrderbookID><Symbol>XXXX XXXXX</Symbol><ISIN>CH0000012345</ISIN><Name>Gearing Certificate Select Dividend Basket, 2008-2013</Name></Identification><LastUpdated>2012-07-02T17:48:42</LastUpdated><CFI>RWXXXX</CFI><TradingCurrency>EUR</TradingCurrency><PriceType>UNIT</PriceType><LotSize>1</LotSize><FirstTradedDate>2008-05-16</FirstTradedDate><Issuer><Name>ABC DE</Name><Symbol>XXX</Symbol><BIC>ABCDEFG0XXX</BIC><Country>CH</Country></Issuer><CSD><Name>Dummy Country xx</Name><BIC>ABCDEFG0XXX</BIC></CSD><IssuedQuantity>93</IssuedQuantity><Warrant EUSIPAClassification=\"2100\"><Underlyings><Underlying><Identification><Name>Dow Jones STOXX Select Dividend 30 Index</Name></Identification><Weight>0.25</Weight><Strike>2470.28</Strike></Underlying><Underlying><Identification><Name>Dow Jones U.S Select Dividend Price Index</Name></Identification><Weight>0.25</Weight><Strike>429.62</Strike></Underlying><Underlying><Identification><Name>Dow Jones Asia/Pacific Select Dividend 30 Index</Name></Identification><Weight>0.25</Weight><Strike>282.23</Strike></Underlying><Underlying><Identification><Name>Dummy Name Country Select Dividend Index</Name></Identification><Weight>0.25</Weight><Strike>1000</Strike></Underlying></Underlyings><PutCallIndicator>C</PutCallIndicator><CalculationPeriod><MaturityDay>2013-01-01</MaturityDay></CalculationPeriod><LastTradedDay>2013-01-01</LastTradedDay><ReimbursementDay>2013-01-01</ReimbursementDay><Arranger><Name>Some broker</Name></Arranger></Warrant></Instrument>", marketId:"XXXX",
marketSegmentId:"ABCD", attribs:{SURVEILLANCE_DATA_CHANNEL_ID:"SD#1",MARKET_SERVER_ID:"MS#1",MARKET_DATA_CHANNEL_ID:"MD#5"}, tickRules:"[100:100:inf]", altIds:{ISIN:"CH000001234", EXCHANGE_SYMBOL:"XXXX XXXXX", SECONDARY_MARKETPLACE_ASSIGNED_IDENTIFIER:"123456"} }, { ... },

Desired output: I need to find all of the lines containing </Name><Symbol>XYZ</Symbol> inside the xml: part of the text and create a new column named NEWCOLUMN:"XYZ". This is the desired output:

### we found this : </Name><Symbol>XXX</Symbol>
{ revision:1, roundLotSize:1, instrRegistry:"ABCDEFG0XXX", tradedCurrency:"EUR", priceType:"PER_UNIT", activationTime:123456000000, inactivationTime:12345600000, publicationTime:123456000000, unpublicationTime:13624800000, xml:"<?xml version=\"1.0\" encoding=\"UTF-8\"?><Instrument xmlns=\"http://www.ngm.se/ns/InstrumentSchema/1.9\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema\"><Identification><OrderbookID>@ORDERBOOK_ID@</OrderbookID><Symbol>XXXX XXXXX</Symbol><ISIN>CH0000012345</ISIN><Name>Gearing Certificate Select Dividend Basket, 2008-2013</Name></Identification><LastUpdated>2012-07-00T00:00:00</LastUpdated><CFI>RWXXXX</CFI><TradingCurrency>XXX</TradingCurrency><PriceType>UNIT</PriceType><LotSize>1</LotSize><FirstTradedDate>2008-05-16</FirstTradedDate><Issuer><Name>ABC DE</Name><Symbol>XXX</Symbol><BIC>ABCDEFG0XXX</BIC><Country>CH</Country></Issuer><CSD><Name>Dummy Country Oy</Name><BIC>ABCDEFG0XXX</BIC></CSD><IssuedQuantity>93</IssuedQuantity><Warrant EUSIPAClassification=\"2100\"><Underlyings><Underlying><Identification><Name>Dummy STOXX Select Dividend 30 Index</Name></Identification><Weight>0.25</Weight><Strike>2470.28</Strike></Underlying><Underlying><Identification><Name>Dummy U.S Select Dividend Price Index</Name></Identification><Weight>0.25</Weight><Strike>100</Strike></Underlying><Underlying><Identification><Name>Dummy Country/Pacific Select Dividend 30 Index</Name></Identification><Weight>0.25</Weight><Strike>100.00</Strike></Underlying><Underlying><Identification><Name>Dummy Country Select Dividend Index</Name></Identification><Weight>0.25</Weight><Strike>100.00</Strike></Underlying></Underlyings><PutCallIndicator>C</PutCallIndicator><CalculationPeriod><MaturityDay>2013-01-01</MaturityDay></CalculationPeriod><LastTradedDay>2013-01-01</LastTradedDay><ReimbursementDay>2013-01-01</ReimbursementDay><Arranger><Name>Some broker</Name></Arranger></Warrant></Instrument>", marketId:"XXX", marketSegmentId:"ABCDE", attribs:{SURVEILLANCE_DATA_CHANNEL_ID:"SD#1",MARKET_SERVER_ID:"MS#1",MARKET_DATA_CHANNEL_ID:"MD#5"}, tickRules:"[100:100:inf]", altIds:{ISIN:"CH000001234", EXCHANGE_SYMBOL:"XXXX XXXXX", SECONDARY_MARKETPLACE_ASSIGNED_IDENTIFIER:"123456"} },
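
For illustration, here is a minimal sketch of the per-record change described above. The post does not say where the new column should go, so this assumes it is appended just before the record's closing brace. The helper name add_symbol_column and the shortened sample record are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the record with NEWCOLUMN:"<symbol>" added
# before its closing brace, or the record unchanged if no symbol is found.
sub add_symbol_column {
    my ($record) = @_;
    # Non-greedy capture of the text between </Name><Symbol> and </Symbol>.
    if ($record =~ m{</Name><Symbol>(.*?)</Symbol>}s) {
        my $symbol = $1;
        # Assumption: the record string ends with its closing brace.
        $record =~ s/\s*\}\s*\z/, NEWCOLUMN:"$symbol" }/;
    }
    return $record;
}

# Shortened, made-up record in the same shape as the sample data.
my $record = '{ revision:1, xml:"<Issuer><Name>ABC DE</Name>'
           . '<Symbol>XXX</Symbol></Issuer>", marketId:"XXXX" }';
print add_symbol_column($record), "\n";
```

Only the Issuer block matches here, because the other <Name> elements in the sample are followed by </Identification>, <BIC> or </Arranger> rather than <Symbol>.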

This is what I've got so far. I've tried everything (awk, sed, etc.), and I found that this one-liner is the one that gets the correct data, but I'm not sure how to loop it through the entire text file and produce the new column:

perl -0777 -pe 's%.*</Name><Symbol>%%s;s%</Symbol>.*%%s' txtfile.js

produces : BNP

This is what I've written, but I'm not sure how to make that one-liner work inside a script:

#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;
use Text::ParseWords;

if ($#ARGV == 0) {
    open my $file, "<", $ARGV[0]
        or die "Couldn't open file '$ARGV[0]': $! \nDid you specify a valid file?";
    my ($SYMBOL);
    while (<$file>) {
        ## this obviously isn't working ... ### but this is what I want to achieve
        $SYMBOL = s%.*</Name><Symbol>%%s;s%</Symbol>.*%%s;
        print "NEWCOLUMN:\"$SYMBOL\"\n";
    }
}
else {
    print "You need to specify an input file \n";
    print "\n";
    print "They are usually located here : \n";
    print "/PATH/* \n";
    print "\n";
    print "Usage : ".basename($0)." difffile.txt \n";
    print "\n";
    exit;
}
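
One likely reason the script misbehaves: s/// returns the number of substitutions made, not the modified string, so $SYMBOL ends up holding a count (or an empty string). The greedy .* also deletes everything up to the last </Name><Symbol> in the input, which is why the -0777 one-liner yields a single value for the whole file. A minimal sketch of an alternative is to capture with a global match instead of deleting around the value. The helper name issuer_symbols is made up for the example, and the sketch assumes each xml:"..." string sits on a single line of the file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;

# Hypothetical helper (not from the original script): returns every string
# found between </Name><Symbol> and </Symbol> in the given text.
sub issuer_symbols {
    my ($text) = @_;
    # In list context, m//g returns all captured groups; the non-greedy
    # .*? stops at the first </Symbol> instead of the last one.
    my @symbols = $text =~ m{</Name><Symbol>(.*?)</Symbol>}g;
    return @symbols;
}

if (@ARGV == 1) {
    open my $fh, '<', $ARGV[0]
        or die "Couldn't open file '$ARGV[0]': $!\n";
    # Assumption: each xml:"..." string sits on a single line of the file.
    while (my $line = <$fh>) {
        print qq{NEWCOLUMN:"$_"\n} for issuer_symbols($line);
    }
    close $fh;
}
else {
    print "Usage : ", basename($0), " difffile.txt\n";
}
```

Reading line by line keeps memory use flat for a 16 MB file. If a record's xml string can span several lines, the file would need to be split per record rather than per line.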