
Re: how to extract iframes from text

by B-Man (Acolyte)
on May 01, 2013 at 14:48 UTC (#1031579)

in reply to how to extract iframes from text

I actually developed some code to extract this. I'm not sure if it's the "easiest" way, but you don't need any external modules to make it work!

Yeah, I put the test string in a text file so I didn't have to worry about escaping anything in a string literal.

use strict;
use warnings;

my $string;
my $startOfIframe;
my $endOfIframe;

open TEST, "<test_string.txt" or die "Cannot open:$!";
$string = readline TEST;
close TEST;

$startOfIframe = index $string, '<iframe';
$endOfIframe   = index $string, '</iframe>';

while ( $startOfIframe != -1 ) {
    print substr( $string, $startOfIframe, $endOfIframe - $startOfIframe ) . '</iframe>' . "\n";
    $startOfIframe = index $string, '<iframe', $endOfIframe;
    $endOfIframe   = index $string, '</iframe>', $startOfIframe;
}

I'm looping through the input string, extracting the iframe data with a substr call. The offset and length of the substr call are derived from the indexes of the beginning and ending iframe tags, which are updated at each iteration of the loop.

The indexes of the opening and closing iframe tags themselves are found by starting each search at the index of the tag immediately preceding it. The loop continues until an opening iframe tag can't be found.

Derp, I meant </iframe> and not </index> in my print.

Re^2: how to extract iframes from text
by marto (Bishop) on May 01, 2013 at 15:17 UTC

    Infinite loop for this test file:

    <iframe id="derp" src="http://derp" width=800 hight=800> </iframe>


    Output:

    <iframe id="derp"</index> <iframe id="derp"</index> <iframe id="derp"</index> <iframe id="derp"</index> ....and so on

    This is why it's better to use a parser to deal with markup languages where possible.

    Update: added missing " to input file and output.
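As a rough illustration of the parser approach, here is a minimal sketch using HTML::TokeParser from CPAN (part of the HTML-Parser distribution). The thread does not say which parser the OP settled on, so the choice of module is an assumption, and iframe_sources is a hypothetical helper name. It returns the src attribute of each iframe, even when the tag spans several lines.

```perl
use strict;
use warnings;
use HTML::TokeParser;    # CPAN module, not core; chosen here as an assumption

# Hypothetical helper: return the src attribute of every iframe in $html.
sub iframe_sources {
    my ($html) = @_;

    # A reference to a scalar tells the parser to treat it as the document text.
    my $parser = HTML::TokeParser->new( \$html )
        or die "Cannot create parser";

    my @sources;

    # For a start tag, get_tag returns [ $tag, \%attr, \@attrseq, $text ].
    while ( my $tag = $parser->get_tag('iframe') ) {
        my $src = $tag->[1]{src};
        push @sources, $src if defined $src;
    }
    return @sources;
}

# Input modelled on the test file above, with the tag split across lines.
my $html = <<'HTML';
<iframe id="derp"
        src="http://derp" width=800 hight=800>
</iframe>
HTML

print "$_\n" for iframe_sources($html);
```

Because the parser tokenizes the markup itself, line breaks inside a tag and a missing closing tag do not change how the start tag's attributes are read.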

      Fair enough. This is actually easily fixable. You just have to replace a bit of code.

      $string = readline TEST;

      with

      while ( <TEST> ) { chomp $_; $string .= $_; }

      There. Now you've merged separate lines into a single string to search, and this works again. Happy?
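An alternative to merging the lines in a loop is to read the whole file in one go. The sketch below is not the poster's code: slurp_file is a hypothetical helper that uses a lexical filehandle, three-argument open, and a localized $/ (the input record separator). Unlike the chomp loop, it keeps the newlines in the string, which does not affect an index search for '<iframe' or '</iframe>'.

```perl
use strict;
use warnings;

# Hypothetical helper: read an entire file into a single string.
sub slurp_file {
    my ($path) = @_;

    open my $fh, '<', $path or die "Cannot open $path: $!";

    # With $/ undefined, one read returns the rest of the file.
    # "local" restores the previous value when the sub returns.
    local $/;
    my $content = <$fh>;
    close $fh;

    return defined $content ? $content : '';
}
```

With this in place, the read in the original script could become my $string = slurp_file('test_string.txt');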

      Edit: Oh, and if you're still missing an ending iframe tag, you can check for that in your while loop condition. Heck, you could probably tell the user where they're missing an iframe tag if you took that idea a little further.
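The idea in the edit above can be sketched roughly as follows. This is an illustration under assumptions, not the poster's actual fix: extract_iframes is a hypothetical name, and the function works on an in-memory string rather than the test file. It stops at the first opening tag with no closing tag and reports the offset, which avoids the infinite loop. It is still plain substring matching, not HTML parsing.

```perl
use strict;
use warnings;

# Hypothetical helper: return every <iframe ...>...</iframe> block in $string.
# Warns and stops at the first opening tag that has no closing tag.
sub extract_iframes {
    my ($string) = @_;
    my $close    = '</iframe>';
    my @found;

    my $start = index $string, '<iframe';
    while ( $start != -1 ) {
        my $end = index $string, $close, $start;
        if ( $end == -1 ) {
            warn "No closing </iframe> for the tag at offset $start\n";
            last;
        }

        # Take everything from the opening tag through the closing tag.
        my $length = $end + length($close) - $start;
        push @found, substr( $string, $start, $length );

        # Resume the search just past the closing tag we consumed.
        $start = index $string, '<iframe', $end + length($close);
    }
    return @found;
}

print "$_\n" for extract_iframes('<iframe id="a"></iframe> text <iframe id="b"');
```

Searching for the closing tag from the position of the opening tag, and bailing out when index returns -1, is what keeps the loop from restarting at the beginning of the string.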

        I was never unhappy. I was simply pointing out that your posted solution doesn't work. You provide no caveats for the input, and you don't cater for all valid HTML. Fundamentally, your code doesn't return what OP wants. They want the value of the src attribute, though you'd have had to read the other responses in the thread to know that.

        According to your response in the CB, I'm:

        "acting like the problem can't be fixed, and that's a load of crap."

        "basically saying my idea can't possibly be tweaked to work, so I should just use a parser, marto. The thing is, it can and was tweaked, and you're not as clever as you think."

        "missing the point. I fixed the error, and as long as there's an ending iframe tag from now on, my code works. Shit, I guess I could make sure there's an ending iframe tag too and that would prevent infinite loops caused by invalid html."

        I still think it's you who is missing the point. At no point did I suggest you couldn't write code to properly parse HTML from scratch. It'll take you a very long time to create your own parser for HTML which caters for all of its foibles, with a convenient way to access/select each valid attribute and its associated value, and which is also well tested with many test cases.

        I think perhaps the scope of the problem wasn't something you'd fully considered when posting, or when jumping to conclusions regarding my response. A vast amount of work goes into creating parsers for HTML/XML/whatever which address their requirements and shortfalls. The solution OP has chosen is well tested, and they now have access to a toolkit which makes it trivial to cater for changes in the input/source data. The goal is to write code which works well and is easy to maintain.

        On a non-technical note, I honestly don't care what you think of me. However, I ask that you take a step back and think before acting when communicating online in places such as this. If you post something that has issues, expect people to tell you about it. If you say something in the chatterbox and someone responds, take the time to try and understand what they're saying. Of course you're free to disagree, but there's no need to be rude and jump to bizarre conclusions as to what others are saying or thinking. Few regulars here will intentionally give you bad advice.
