PerlMonks  

Regexp for HTML

by gossamer (Sexton)
on Jan 12, 2024 at 19:55 UTC ( [id://11156927]=perlquestion )

gossamer has asked for the wisdom of the Perl Monks concerning the following question:

I created a problem for myself and am now hoping someone can help me fix it. I believe I used HTML::Parser to identify URLs that I wanted to strip out of some HTML for a website, but it went terribly wrong and left me with broken links like the one below. I think I now need a regexp of some kind to clean this up, but I have no idea how to write it. The string below is typical of what I need to fix, but there are variations on "widget=460", such as "widget=410", that would make an exact match more difficult.
<p><a href=";widget=460" rel="noopener" target="_blank"><img alt="Advertiser" class="banner" height="60" src="/images/articles/newsletters/Paper-v4-468x60.jpg" style="display: block; margin-left: auto; margin-right: auto;" width="468" /></a></p>
Ideas for how to approach this would be very much appreciated. I don't currently have any working code.

Replies are listed 'Best First'.
Re: Regexp for HTML
by soonix (Canon) on Jan 12, 2024 at 20:29 UTC
    You can find the canonical answer about the topic "Regex for HTML" here.

    Do you have a backup of your HTML as it was before you ripped the URLs out? Because to me it looks like your "stripped" URLs don't contain any recoverable information.

Re: Regexp for HTML
by marto (Cardinal) on Jan 12, 2024 at 20:42 UTC

    For clarity, do all of the impacted links have a distinct widget in the href? If so, do you have a string you'd like to replace it with?

      I would like to remove the entire string. Maybe an option would be to do a simple replace of the "widget=\d+" with an actual URL then use the regular HTML parser tools to delete the string?

        Here's an example using Mojo::DOM, to make life easy, even with borked HTML. It selects the links whose href begins with ;widget=460 and replaces that href with the URL for this site.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Mojo::DOM;
        use feature 'say';

        # slurp from file, get from a live site via Mojo::UserAgent etc...
        # hardcoded for example purposes
        my $html = '<p><a href="https://example.com">example.com</a></p><p><a href=";widget=460" rel="noopener" target="_blank"><img alt="Advertiser" class="banner" height="60" src="/images/articles/newsletters/Paper-v4-468x60.jpg" style="display: block; margin-left: auto; margin-right: auto;" width="468" /></a></p>';

        my $dom = Mojo::DOM->new( $html );

        # every <a> whose href starts with ";widget=460"
        for my $url ( $dom->find('a[href^=";widget=460"]')->each ){
            $url->attr('href' => 'https://perlmonks.org');
        }

        say $dom->content;

        Output:

        <p><a href="https://example.com">example.com</a></p><p><a href="https://perlmonks.org" rel="noopener" target="_blank"><img alt="Advertiser" class="banner" height="60" src="/images/articles/newsletters/Paper-v4-468x60.jpg" style="display: block; margin-left: auto; margin-right: auto;" width="468"></a></p>

        Armed with this, it'd be trivial to keep a list of widgets and their real URLs, then loop through that list, substituting each selector and URL value into the code above.
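
A minimal sketch of that loop, assuming a hypothetical %url_for hash that maps each widget id to its real URL. The ids, the URLs and the shortened sample HTML are placeholders for illustration, not data from the thread.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical widget id => real URL mapping; fill this in from your own records.
my %url_for = (
    460 => 'https://example.com/advertiser-a',
    410 => 'https://example.com/advertiser-b',
);

# Shortened sample input, hardcoded for example purposes.
my $html = '<p><a href=";widget=460"><img alt="Advertiser" src="/a.jpg"></a></p>'
         . '<p><a href=";widget=410"><img alt="Advertiser" src="/b.jpg"></a></p>';

my $dom = Mojo::DOM->new($html);

for my $widget ( sort keys %url_for ) {
    # Same "href begins with" selector as above, built once per widget id.
    for my $link ( $dom->find(qq{a[href^=";widget=$widget"]})->each ) {
        $link->attr( href => $url_for{$widget} );
    }
}

my $fixed = $dom->to_string;
print "$fixed\n";
```

Because ^= is a prefix match, an id such as 46 would also match ;widget=460. If your ids can overlap like that, an exact match (href= instead of href^=) is probably the safer selector.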

        Update: See also Re: Batch remove URLs, or super search for more examples.
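
The original question asked about removing the broken links entirely and mentioned variations such as widget=410. A minimal sketch of that, under two assumptions the thread does not confirm: every broken href is exactly ;widget= followed by digits, and a wrapper paragraph left empty by the removal should be dropped as well. The sample HTML is shortened for illustration.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Shortened sample input, hardcoded for example purposes.
my $html = '<p><a href="https://example.com">example.com</a></p>'
         . '<p><a href=";widget=460"><img alt="Advertiser" src="/a.jpg"></a></p>'
         . '<p><a href=";widget=410"><img alt="Advertiser" src="/b.jpg"></a></p>';

my $dom = Mojo::DOM->new($html);

# The prefix selector catches every widget id; the regex check narrows it
# to hrefs that are exactly ";widget=<digits>" (an assumption about the data).
for my $link ( $dom->find('a[href^=";widget="]')->each ) {
    next unless $link->attr('href') =~ /\A;widget=\d+\z/;

    my $parent = $link->parent;
    $link->remove;    # removes the <a> together with the <img> inside it

    # Drop the wrapping <p> too if nothing but whitespace is left in it.
    $parent->remove
        if defined $parent->tag
        && $parent->tag eq 'p'
        && $parent->content !~ /\S/;
}

my $cleaned = $dom->to_string;
print "$cleaned\n";
```

If the image should stay and only the broken link should go, Mojo::DOM's strip method removes an element while keeping its content, so replacing $link->remove with $link->strip (and skipping the paragraph cleanup) would be the variation to try.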
