PerlMonks  

Re^2: Module to extract text from HTML

by bliako (Monsignor)
on Feb 29, 2024 at 15:17 UTC ( [id://11157971]=note )


in reply to Re: Module to extract text from HTML
in thread Module to extract text from HTML

Your post reminded me that there is also lynx (https://lynx.invisible-island.net/), a text-based web browser, and the CPAN module HTML::FormatText::Lynx, which spawns lynx and passes it an HTML filename or string.
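For illustration, a minimal sketch of calling that module. It assumes HTML::FormatText::Lynx is installed from CPAN and that a lynx binary is on the PATH; the wrapper name html_to_text and the sample HTML are made up for this example, and the wrapper simply returns undef when either piece is missing.

```perl
use strict;
use warnings;

# Hypothetical helper: convert an HTML string to plain text via lynx.
# Returns undef if the module or the lynx program is unavailable.
sub html_to_text {
    my ($html) = @_;

    # Load the module at run time so the script still works without it.
    return unless eval { require HTML::FormatText::Lynx; 1 };

    # format_string() hands the HTML to a spawned lynx process.
    # The eval guards against failure when lynx itself is not installed.
    my $text = eval { HTML::FormatText::Lynx->format_string($html) };
    return unless defined $text && length $text;
    return $text;
}

my $sample = '<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>';
my $text   = html_to_text($sample);
print defined $text ? $text : "HTML::FormatText::Lynx or lynx not available\n";
```

The module also provides format_file() for the filename case mentioned above.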

Replies are listed 'Best First'.
Re^3: Module to extract text from HTML
by marto (Cardinal) on Feb 29, 2024 at 15:40 UTC

    You've inspired a reverse golf challenge: ignoring all simple, portable solutions, what's the most convoluted way to achieve the goal? :)

      my $text = `lynx -nolist -dump 'https://www.perlmonks.org/?node_id=11157915'` :)

        That's far too efficient. The purpose of such a challenge is to deliberately make it convoluted; think Rube Goldberg machine. In practical terms, not everyone has lynx, and not everyone can install it on their web host.

        Update: added link.
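As a sketch of how the backtick one-liner could cope with hosts that lack lynx, the following uses only core Perl. It assumes a Unix-like system (no .exe suffix handling), and the names find_in_path, lynx_command and lynx_dump are invented for this example. The list form of open avoids passing the URL through a shell.

```perl
use strict;
use warnings;
use File::Spec;

# Return the full path of $prog if an executable file of that name is
# found in a PATH directory, otherwise undef.
sub find_in_path {
    my ($prog) = @_;
    for my $dir (File::Spec->path) {
        my $candidate = File::Spec->catfile($dir, $prog);
        return $candidate if -f $candidate && -x _;
    }
    return;
}

# Build the same command as the one-liner in the thread, as a list.
sub lynx_command {
    my ($url) = @_;
    return ('lynx', '-nolist', '-dump', $url);
}

# Return the page rendered as text, or undef if lynx is missing
# or could not be started.
sub lynx_dump {
    my ($url) = @_;
    return unless find_in_path('lynx');
    open(my $fh, '-|', lynx_command($url)) or return;
    local $/;               # slurp all output at once
    my $text = <$fh>;
    close $fh;
    return $text;
}

# Example call (needs lynx and network access):
#   my $text = lynx_dump('https://www.perlmonks.org/?node_id=11157915');
#   print defined $text ? $text : "lynx is not available here\n";
```

When lynx_dump returns undef, a caller can fall back to one of the pure-Perl options discussed elsewhere in the thread.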

      Fair enough. But converting HTML to text can be solved with varying success, especially when heuristics are applied, so the more options the better. That's why I keep adding to the list, though the mech-to-pdf suggestion was more a joke than a solution.

        Indeed, and my comment wasn't intended as criticism, but rather as an idea for an inverse-golf, Rube Goldberg approach to problems. Just as code golfing is an exercise, so is writing a needlessly convoluted solution that still produces a suitable result.
