PerlMonks  

Split on regex, don't match partial regex

by aldo (Initiate)
on Sep 28, 2012 at 03:19 UTC ( #996103=perlquestion )
aldo has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to fix a Perl plugin for the Squeezebox server.

I need to split the following text into an id (from itag=) and a url (from url=).

These itag={id}&url={url} pairs are delimited by commas, but commas can also appear inside the url, and so can duplicate 'itag=' elements, which must stay part of the url. So I can't split on that character alone.

An example of the text (all one line as it arrives):

itag=44&url=http://o-o---preferred---sn-u5a3u5a3-h5oe---v13---lscache3.c.youtube.com/videoplayback?upn=8kbZJLkF5PA&sparams=cp%2Cid%2Cip%2Cipbits%2Citag%2Cratebypass%2Csource%2Cupn%2Cexpire&fexp=927101%2C923006%2C922401%2C920704%2C912806%2C913419%2C913546%2C913556%2C919349%2C919351%2C925109%2C919003%2C920201%2C912706&key=yt1&expire=1348823962&itag=44&ipbits=8&sver=3&ratebypass=yes&mt=1348800611&ip=92.22.37.231&mv=m&source=youtube&ms=au&cp=U0hTTVhNUV9LTENOM19QR1VKOkFyQWNVSVFNbmNL&id=1100a4b92b939cd6&type=video/webm;+codecs="vp8.0,+vorbis"&fallback_host=tc.v13.cache3.c.youtube.com&sig=8353F6329CDA8168C4F7F29E20F2AE3F6509D85F.C582D63C02534232CE8E28D5ADC5B119AAEF2963&quality=large,itag=35&url=http://o-o---preferred---sn-u5a3u5a3-h5oe---v11---lscache4.c.youtube.com/videoplayback?upn=8kbZJLkF5PA&sparams=algorithm%2Cburst%2Ccp%2Cfactor%2Cid%2Cip%2Cipbits%2Citag%2Csource%2Cupn%2Cexpire&fexp=927101%2C923006%2C922401%2C920704%2C912806%2C913419%2C913546%2C913556%2C919349%2C919351%2C925109%2C919003%2C920201%2C912706&expire=1348823962&algorithm=throttle-factor&burst=40&ip=92.22.37.231&itag=35&sver=3&key=yt1&mt=1348800611&mv=m&source=youtube&ms=au&ipbits=8&factor=1.25&cp=U0hTTVhNUV9LTENOM19QR1VKOkFyQWNVSVFNbmNL&id=1100a4b92b939cd6&type=video/x-flv&fallback_host=tc.v11.cache4.c.youtube.com&sig=885C9C098DF9D80E780177E01CF944BC4F9564FE.9A374618A2BE8C2E562C8622DCB449A7071E37BD&quality=large,itag= ...AND SO ON

I'd like the data to end up in a hash of id,url.

I first tried this, but it only splits the first found pair, and not properly.

for my $stream (split(/itag=(.*)&url=/, $streams)) { print $stream; }

I expected it to print out

44

http://o-o---preferred---sn-u5a3u5a3-h5oe---v13---lscache3.c.youtube.com/videoplayback?upn=8kbZJLkF5PA&sparams=cp%2Cid%2Cip%2Cipbits%2Citag%2Cratebypass%2Csource%2Cupn%2Cexpire&fexp=927101%2C923006%2C922401%2C920704%2C912806%2C913419%2C913546%2C913556%2C919349%2C919351%2C925109%2C919003%2C920201%2C912706&key=yt1&expire=1348823962&itag=44&ipbits=8&sver=3&ratebypass=yes&mt=1348800611&ip=92.22.37.231&mv=m&source=youtube&ms=au&cp=U0hTTVhNUV9LTENOM19QR1VKOkFyQWNVSVFNbmNL&id=1100a4b92b939cd6&type=video/webm;+codecs="vp8.0,+vorbis"&fallback_host=tc.v13.cache3.c.youtube.com&sig=8353F6329CDA8168C4F7F29E20F2AE3F6509D85F.C582D63C02534232CE8E28D5ADC5B119AAEF2963&quality=large

Re: Split on regex, don't match partial regex
by GrandFather (Cardinal) on Sep 28, 2012 at 03:25 UTC

    * is greedy. Change it to *? to get a non-greedy match.
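To illustrate the point about greediness, here is a minimal sketch using a shortened, made-up stand-in for the real stream list (the `example` URLs and the variable names are assumptions, not the OP's data). It suggests that a non-greedy quantifier alone may not be enough when `itag=` also occurs inside a url, and that matching digits directly followed by `&url=` is one way around that.

```perl
use strict;
use warnings;

# Hypothetical, shortened input: two pairs, each url containing its own "itag=".
my $streams = 'itag=44&url=http://a.example/v?x=1&itag=44&q=large,'
            . 'itag=35&url=http://b.example/v?x=2&itag=35&q=small';

# Greedy: (.*) runs on to the LAST "&url=", so only one split point is found.
my @greedy = split /itag=(.*)&url=/, $streams;

# Non-greedy: stops at the first "&url=", but a match can still begin at an
# "itag=" inside a url and run on to the next pair's "&url=".
my @lazy = split /itag=(.*?)&url=/, $streams;

# Stricter separator: start of string or a comma, then digits, then "&url=".
# split returns an empty leading field here, which is discarded via undef.
my (undef, @pairs) = split /(?:^|,)itag=(\d+)&url=/, $streams;
my %url_for = @pairs;

print scalar(@greedy), " fields with (.*)\n";
print scalar(@lazy),   " fields with (.*?)\n";
print "$_ => $url_for{$_}\n" for sort { $a <=> $b } keys %url_for;
```

Because the capturing group is part of the separator, split returns the ids interleaved with the urls, which is why the resulting list can be assigned straight to a hash.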

    True laziness is hard work
Re: Split on regex, don't match partial regex
by Anonymous Monk on Sep 28, 2012 at 03:41 UTC

    URI is URI

    use CGI;
    my $q = CGI->new(q{itag=44&url=...AND SO ON });
    print $_, ' = ', $q->param($_), "\n" for qw/ itag url /;
    __END__
    itag = 44
    url = ...AND SO ON

      Hi, the ..AND SO ON was just my way of saying that the itag/url pair repeats multiple times, separated by commas. Sorry for not explaining that properly.

        Hi, the ..AND SO ON was just my way of saying that the itag/url pair repeats multiple times, separated by commas. Sorry for not explaining that properly.

        Doesn't much matter, CGI handles repeaters
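As a rough illustration of what "repeaters" means, the sketch below collects every value of a repeated key from a query string. It is hand-rolled core Perl rather than CGI.pm (which, as far as I know, has not shipped with core Perl since 5.22), and the input string and the `%values_for` name are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical, shortened query string with a repeated "itag" key.
my $query = 'itag=44&url=http://a.example/v?x=1&itag=44&quality=large';

my %values_for;    # key => array ref holding every value seen for that key
for my $pair (split /&/, $query) {
    # A limit of 2 keeps any later "=" inside the value.
    my ($key, $value) = split /=/, $pair, 2;
    push @{ $values_for{$key} }, defined $value ? $value : '';
}

print "itag: @{ $values_for{itag} }\n";
print "url:  @{ $values_for{url} }\n";
```

Note that in this sketch the url value ends at the first unescaped `&`, so splitting on `&` only recovers a whole url while that url is still percent-encoded.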

Re: Split on regex, don't match partial regex
by 2teez (Priest) on Sep 28, 2012 at 07:14 UTC
    Hi,

    There is more than one itag id and url in the example you gave.
    Let's take that example as a single line, as given in the OP. Please note that you did not show how the line(s) are read into (or "arrive" at, to use your word) the Perl script.
    You can do it like so:

    use warnings;
    use strict;

    my $line = 'itag=44&url=http://o-o---preferred---sn-u5a3u5a3-h5oe---v13---lscache3.c.youtube.com/videoplayback?upn=8kbZJLkF5PA&sparams=cp%2Cid%2Cip%2Cipbits%2Citag%2Cratebypass%2Csource%2Cupn%2Cexpire&fexp=927101%2C923006%2C922401%2C920704%2C912806%2C913419%2C913546%2C913556%2C919349%2C919351%2C925109%2C919003%2C920201%2C912706&key=yt1&expire=1348823962&itag=44&ipbits=8&sver=3&ratebypass=yes&mt=1348800611&ip=92.22.37.231&mv=m&source=youtube&ms=au&cp=U0hTTVhNUV9LTENOM19QR1VKOkFyQWNVSVFNbmNL&id=1100a4b92b939cd6&type=video/webm;+codecs="vp8.0,+vorbis"&fallback_host=tc.v13.cache3.c.youtube.com&sig=8353F6329CDA8168C4F7F29E20F2AE3F6509D85F.C582D63C02534232CE8E28D5ADC5B119AAEF2963&quality=large,itag=35&url=http://o-o---preferred---sn-u5a3u5a3-h5oe---v11---lscache4.c.youtube.com/videoplayback?upn=8kbZJLkF5PA&sparams=algorithm%2Cburst%2Ccp%2Cfactor%2Cid%2Cip%2Cipbits%2Citag%2Csource%2Cupn%2Cexpire&fexp=927101%2C923006%2C922401%2C920704%2C912806%2C913419%2C913546%2C913556%2C919349%2C919351%2C925109%2C919003%2C920201%2C912706&expire=1348823962&algorithm=throttle-factor&burst=40&ip=92.22.37.231&itag=35&sver=3&key=yt1&mt=1348800611&mv=m&source=youtube&ms=au&ipbits=8&factor=1.25&cp=U0hTTVhNUV9LTENOM19QR1VKOkFyQWNVSVFNbmNL&id=1100a4b92b939cd6&type=video/x-flv&fallback_host=tc.v11.cache4.c.youtube.com&sig=885C9C098DF9D80E780177E01CF944BC4F9564FE.9A374618A2BE8C2E562C8622DCB449A7071E37BD&quality=large,itag= ...AND SO ON';

    if ( my @arr = ($line) =~ m/itag=(.+?)&url=(.+?=large)/ig ) {
        print join "\n" => @arr;
    }

    Output
    44
    http://o-o---preferred---sn-u5a3u5a3-h5oe---v13---lscache3.c.youtube.com/videoplayback?upn=8kbZJLkF5PA&sparams=cp%2Cid%2Cip%2Cipbits%2Citag%2Cratebypass%2Csource%2Cupn%2Cexpire&fexp=927101%2C923006%2C922401%2C920704%2C912806%2C913419%2C913546%2C913556%2C919349%2C919351%2C925109%2C919003%2C920201%2C912706&key=yt1&expire=1348823962&itag=44&ipbits=8&sver=3&ratebypass=yes&mt=1348800611&ip=92.22.37.231&mv=m&source=youtube&ms=au&cp=U0hTTVhNUV9LTENOM19QR1VKOkFyQWNVSVFNbmNL&id=1100a4b92b939cd6&type=video/webm;+codecs="vp8.0,+vorbis"&fallback_host=tc.v13.cache3.c.youtube.com&sig=8353F6329CDA8168C4F7F29E20F2AE3F6509D85F.C582D63C02534232CE8E28D5ADC5B119AAEF2963&quality=large
    35
    http://o-o---preferred---sn-u5a3u5a3-h5oe---v11---lscache4.c.youtube.com/videoplayback?upn=8kbZJLkF5PA&sparams=algorithm%2Cburst%2Ccp%2Cfactor%2Cid%2Cip%2Cipbits%2Citag%2Csource%2Cupn%2Cexpire&fexp=927101%2C923006%2C922401%2C920704%2C912806%2C913419%2C913546%2C913556%2C919349%2C919351%2C925109%2C919003%2C920201%2C912706&expire=1348823962&algorithm=throttle-factor&burst=40&ip=92.22.37.231&itag=35&sver=3&key=yt1&mt=1348800611&mv=m&source=youtube&ms=au&ipbits=8&factor=1.25&cp=U0hTTVhNUV9LTENOM19QR1VKOkFyQWNVSVFNbmNL&id=1100a4b92b939cd6&type=video/x-flv&fallback_host=tc.v11.cache4.c.youtube.com&sig=885C9C098DF9D80E780177E01CF944BC4F9564FE.9A374618A2BE8C2E562C8622DCB449A7071E37BD&quality=large
    If you then need to put your data in a hash, you could simply do
    my %hash_contain;
    if (...) {
        ...
        %hash_contain = @arr;
    }
    Since you will always have itag ids and urls (unless there are exceptions we were not told about), your hash will look like so:
    35 => ..., 44 => ...,
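The list-to-hash assignment above can be sketched in isolation as follows. The ids and urls are made-up placeholders, and `%url_for` is just an illustrative name; the point is only that a flat (id, url, id, url, ...) list pairs up into key/value entries.

```perl
use strict;
use warnings;

# Flat capture list as a /g match with two groups would return it:
# (id, url, id, url, ...). The values are hypothetical placeholders.
my @arr = (
    44, 'http://a.example/one',
    35, 'http://b.example/two',
);

# Assigning a list to a hash pairs consecutive elements as key => value.
my %url_for = @arr;

for my $id (sort { $a <=> $b } keys %url_for) {
    print "$id => $url_for{$id}\n";
}
```

One caveat with this approach: if the same itag id appears in more than one pair, the later url silently replaces the earlier one.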
    Hope this helps.

    If you tell me, I'll forget.
    If you show me, I'll remember.
    if you involve me, I'll understand.
    --- Author unknown to me

      Hi, thanks for the response. The data arrives as a long urlencoded string from an HTTP request to the YouTube gdata API. I've already decoded it.

      Your code works; however, the quality=large at the end could be quality=small, quality=medium, etc., so I can't use that in the regex. I suppose I just need to match a pattern which is

      itag=\d{1,2}&url={any number of chars until the next itag=\d{1,2}&url= pattern}

      There's also a comma separating the itag/url pairs.

      I hope that makes sense!
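The pattern described above could be sketched roughly as below. This is only one possible reading of it, under the assumptions that each pair starts with `itag=`, digits, then `&url=`, and that the last pair runs to the end of the string. It uses `\d+` rather than a one-or-two digit limit, and the shortened input with `example` hosts is made up.

```perl
use strict;
use warnings;

# Hypothetical, shortened input. Each url contains a decoded ",itag," and a
# repeated "&itag=NN", neither of which should end the url.
my $streams =
      'itag=44&url=http://a.example/v?sparams=id,itag,ip&itag=44&quality=large,'
    . 'itag=35&url=http://b.example/v?sparams=id,itag,ip&itag=35&quality=small';

my %url_for = $streams =~ m{
    itag=(\d+)&url=              # id: digits directly followed by "&url="
    (.*?)                        # url: as little as possible ...
    (?= ,itag=\d+&url= | \z )    # ... up to the next pair header or the end
}xg;

print "$_ => $url_for{$_}\n" for sort { $a <=> $b } keys %url_for;
```

The lookahead does not consume the comma, so the next match can start cleanly at the following `itag=`. Requiring the full `,itag=\d+&url=` header (or end of string) is the design choice here: a looser lookahead such as `,itag` would stop early at the `,itag,` inside the decoded sparams list.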

        however the quality=large at the end could be quality=small quality=medium etc so I can't use that in the regex

        Use this then, it works I believe:
        ...
        if ( my @arr = ($line) =~ m/itag=(.+?)&url=(.+?=.+?)(?=,itag.+?)/ig ) {
            print join "\n" => @arr;
        }
        ...
        OR
        use warnings;
        use strict;

        while (<DATA>) {
            chomp;
            if ( my @arr = /itag=(.+?)&url=(.+?=.+?)(?=,itag.+?)/ig ) {
                print join "\n" => @arr;
            }
        }
        __DATA__
        itag=44&url=http://o-o---preferred---sn-u5a3u5a3-h5oe---v13---lscache3.c.youtube.com/videoplayback?upn=8kbZJLkF5PA&sparams=cp%2Cid%2Cip%2Cipbits%2Citag%2Cratebypass%2Csource%2Cupn%2Cexpire&fexp=927101%2C923006%2C922401%2C920704%2C912806%2C913419%2C913546%2C913556%2C919349%2C919351%2C925109%2C919003%2C920201%2C912706&key=yt1&expire=1348823962&itag=44&ipbits=8&sver=3&ratebypass=yes&mt=1348800611&ip=92.22.37.231&mv=m&source=youtube&ms=au&cp=U0hTTVhNUV9LTENOM19QR1VKOkFyQWNVSVFNbmNL&id=1100a4b92b939cd6&type=video/webm;+codecs="vp8.0,+vorbis"&fallback_host=tc.v13.cache3.c.youtube.com&sig=8353F6329CDA8168C4F7F29E20F2AE3F6509D85F.C582D63C02534232CE8E28D5ADC5B119AAEF2963&quality=large,itag=35&url=http://o-o---preferred---sn-u5a3u5a3-h5oe---v11---lscache4.c.youtube.com/videoplayback?upn=8kbZJLkF5PA&sparams=algorithm%2Cburst%2Ccp%2Cfactor%2Cid%2Cip%2Cipbits%2Citag%2Csource%2Cupn%2Cexpire&fexp=927101%2C923006%2C922401%2C920704%2C912806%2C913419%2C913546%2C913556%2C919349%2C919351%2C925109%2C919003%2C920201%2C912706&expire=1348823962&algorithm=throttle-factor&burst=40&ip=92.22.37.231&itag=35&sver=3&key=yt1&mt=1348800611&mv=m&source=youtube&ms=au&ipbits=8&factor=1.25&cp=U0hTTVhNUV9LTENOM19QR1VKOkFyQWNVSVFNbmNL&id=1100a4b92b939cd6&type=video/x-flv&fallback_host=tc.v11.cache4.c.youtube.com&sig=885C9C098DF9D80E780177E01CF944BC4F9564FE.9A374618A2BE8C2E562C8622DCB449A7071E37BD&quality=large,itag=104&url=http://o-o---preferred---sn-u5a3u5a3-h5oe---v13---lscache3.c.youtube.com/videoplayback?upn=8kbZJLkF5PA&sparams=cp%2Cid%2Cip%2Cipbits%2Citag%2Cratebypass%2Csource%2Cupn%2Cexpire&fexp=927101%2C923006%2C922401%2C920704%2C912806%2C913419%2C913546%2C913556%2C919349%2C919351%2C925109%2C919003%2C920201%2C912706&key=yt1&expire=1348823962&itag=44&ipbits=8&sver=3&ratebypass=yes&mt=1348800611&ip=92.22.37.231&mv=m&source=youtube&ms=au&cp=U0hTTVhNUV9LTENOM19QR1VKOkFyQWNVSVFNbmNL&id=1100a4b92b939cd6&type=video/webm;+codecs="vp8.0,+vorbis"&fallback_host=tc.v13.cache3.c.youtube.com&sig=8353F6329CDA8168C4F7F29E20F2AE3F6509D85F.C582D63C02534232CE8E28D5ADC5B119AAEF2963&quality=small,itag=15&url=http://o-o---preferred---sn-u5a3u5a3-h5oe---v11---lscache4.c.youtube.com/videoplayback?upn=8kbZJLkF5PA&sparams=algorithm%2Cburst%2Ccp%2Cfactor%2Cid%2Cip%2Cipbits%2Citag%2Csource%2Cupn%2Cexpire&fexp=927101%2C923006%2C922401%2C920704%2C912806%2C913419%2C913546%2C913556%2C919349%2C919351%2C925109%2C919003%2C920201%2C912706&expire=1348823962&algorithm=throttle-factor&burst=40&ip=92.22.37.231&itag=35&sver=3&key=yt1&mt=1348800611&mv=m&source=youtube&ms=au&ipbits=8&factor=1.25&cp=U0hTTVhNUV9LTENOM19QR1VKOkFyQWNVSVFNbmNL&id=1100a4b92b939cd6&type=video/x-flv&fallback_host=tc.v11.cache4.c.youtube.com&sig=885C9C098DF9D80E780177E01CF944BC4F9564FE.9A374618A2BE8C2E562C8622DCB449A7071E37BD&quality=quality=small quality=medium,itag=55&url=http://o-o---preferred---sn-u5a3u5a3-h5oe---v11---lscache4.c.youtube.com/videoplayback?upn=8kbZJLkF5PA&sparams=algorithm%2Cburst%2Ccp%2Cfactor%2Cid%2Cip%2Cipbits%2Citag%2Csource%2Cupn%2Cexpire&fexp=927101%2C923006%2C922401%2C920704%2C912806%2C913419%2C913546%2C913556%2C919349%2C919351%2C925109%2C919003%2C920201%2C912706&expire=1348823962&algorithm=throttle-factor&burst=40&ip=92.22.37.231&itag=35&sver=3&key=yt1&mt=1348800611&mv=m&source=youtube&ms=au&ipbits=8&factor=1.25&cp=U0hTTVhNUV9LTENOM19QR1VKOkFyQWNVSVFNbmNL&id=1100a4b92b939cd6&type=video/x-flv&fallback_host=tc.v11.cache4.c.youtube.com&sig=885C9C098DF9D80E780177E01CF944BC4F9564FE.9A374618A2BE8C2E562C8622DCB449A7071E37BD&quality=quality=small quality=medium,itag= ...AND SO ON
        Output
        44
        http://o-o---preferred---sn-u5a3u5a3-h5oe---v13---lscache3.c.youtube.com/videoplayback?upn=8kbZJLkF5PA&sparams=cp%2Cid%2Cip%2Cipbits%2Citag%2Cratebypass%2Csource%2Cupn%2Cexpire&fexp=927101%2C923006%2C922401%2C920704%2C912806%2C913419%2C913546%2C913556%2C919349%2C919351%2C925109%2C919003%2C920201%2C912706&key=yt1&expire=1348823962&itag=44&ipbits=8&sver=3&ratebypass=yes&mt=1348800611&ip=92.22.37.231&mv=m&source=youtube&ms=au&cp=U0hTTVhNUV9LTENOM19QR1VKOkFyQWNVSVFNbmNL&id=1100a4b92b939cd6&type=video/webm;+codecs="vp8.0,+vorbis"&fallback_host=tc.v13.cache3.c.youtube.com&sig=8353F6329CDA8168C4F7F29E20F2AE3F6509D85F.C582D63C02534232CE8E28D5ADC5B119AAEF2963&quality=large
        35
        http://o-o---preferred---sn-u5a3u5a3-h5oe---v11---lscache4.c.youtube.com/videoplayback?upn=8kbZJLkF5PA&sparams=algorithm%2Cburst%2Ccp%2Cfactor%2Cid%2Cip%2Cipbits%2Citag%2Csource%2Cupn%2Cexpire&fexp=927101%2C923006%2C922401%2C920704%2C912806%2C913419%2C913546%2C913556%2C919349%2C919351%2C925109%2C919003%2C920201%2C912706&expire=1348823962&algorithm=throttle-factor&burst=40&ip=92.22.37.231&itag=35&sver=3&key=yt1&mt=1348800611&mv=m&source=youtube&ms=au&ipbits=8&factor=1.25&cp=U0hTTVhNUV9LTENOM19QR1VKOkFyQWNVSVFNbmNL&id=1100a4b92b939cd6&type=video/x-flv&fallback_host=tc.v11.cache4.c.youtube.com&sig=885C9C098DF9D80E780177E01CF944BC4F9564FE.9A374618A2BE8C2E562C8622DCB449A7071E37BD&quality=large
        104
        http://o-o---preferred---sn-u5a3u5a3-h5oe---v13---lscache3.c.youtube.com/videoplayback?upn=8kbZJLkF5PA&sparams=cp%2Cid%2Cip%2Cipbits%2Citag%2Cratebypass%2Csource%2Cupn%2Cexpire&fexp=927101%2C923006%2C922401%2C920704%2C912806%2C913419%2C913546%2C913556%2C919349%2C919351%2C925109%2C919003%2C920201%2C912706&key=yt1&expire=1348823962&itag=44&ipbits=8&sver=3&ratebypass=yes&mt=1348800611&ip=92.22.37.231&mv=m&source=youtube&ms=au&cp=U0hTTVhNUV9LTENOM19QR1VKOkFyQWNVSVFNbmNL&id=1100a4b92b939cd6&type=video/webm;+codecs="vp8.0,+vorbis"&fallback_host=tc.v13.cache3.c.youtube.com&sig=8353F6329CDA8168C4F7F29E20F2AE3F6509D85F.C582D63C02534232CE8E28D5ADC5B119AAEF2963&quality=small
        15
        http://o-o---preferred---sn-u5a3u5a3-h5oe---v11---lscache4.c.youtube.com/videoplayback?upn=8kbZJLkF5PA&sparams=algorithm%2Cburst%2Ccp%2Cfactor%2Cid%2Cip%2Cipbits%2Citag%2Csource%2Cupn%2Cexpire&fexp=927101%2C923006%2C922401%2C920704%2C912806%2C913419%2C913546%2C913556%2C919349%2C919351%2C925109%2C919003%2C920201%2C912706&expire=1348823962&algorithm=throttle-factor&burst=40&ip=92.22.37.231&itag=35&sver=3&key=yt1&mt=1348800611&mv=m&source=youtube&ms=au&ipbits=8&factor=1.25&cp=U0hTTVhNUV9LTENOM19QR1VKOkFyQWNVSVFNbmNL&id=1100a4b92b939cd6&type=video/x-flv&fallback_host=tc.v11.cache4.c.youtube.com&sig=885C9C098DF9D80E780177E01CF944BC4F9564FE.9A374618A2BE8C2E562C8622DCB449A7071E37BD&quality=quality=small quality=medium
        55
        http://o-o---preferred---sn-u5a3u5a3-h5oe---v11---lscache4.c.youtube.com/videoplayback?upn=8kbZJLkF5PA&sparams=algorithm%2Cburst%2Ccp%2Cfactor%2Cid%2Cip%2Cipbits%2Citag%2Csource%2Cupn%2Cexpire&fexp=927101%2C923006%2C922401%2C920704%2C912806%2C913419%2C913546%2C913556%2C919349%2C919351%2C925109%2C919003%2C920201%2C912706&expire=1348823962&algorithm=throttle-factor&burst=40&ip=92.22.37.231&itag=35&sver=3&key=yt1&mt=1348800611&mv=m&source=youtube&ms=au&ipbits=8&factor=1.25&cp=U0hTTVhNUV9LTENOM19QR1VKOkFyQWNVSVFNbmNL&id=1100a4b92b939cd6&type=video/x-flv&fallback_host=tc.v11.cache4.c.youtube.com&sig=885C9C098DF9D80E780177E01CF944BC4F9564FE.9A374618A2BE8C2E562C8622DCB449A7071E37BD&quality=quality=small quality=medium
