Module to extract text from HTML

Bod has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Module to extract text from HTML by marto (Cardinal) on Feb 27, 2024 at 11:29 UTC
Mojo::DOM (via Mojo::UserAgent):

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new;
say $ua->get('https://www.perlmonks.org')->res->dom->all_text;
```

WWW::Mechanize:

```perl
use strict;
use warnings;
use feature 'say';
use WWW::Mechanize;

my $ua = WWW::Mechanize->new;
$ua->get('https://perlmonks.org');
say $ua->content(format => 'text');
```
Re^2: Module to extract text from HTML by Bod (Parson) on Feb 27, 2024 at 19:37 UTC
Thanks. The Mojo::UserAgent solution looks like it could be just what I am looking for 👍
Re: Module to extract text from HTML by hippo (Bishop) on Feb 27, 2024 at 11:30 UTC
Not sure if this is quite what you are after but HTML::Strip sounds like a reasonable place to start, anyway. There's also HTML::FormatText which does something similar. 🦛
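For comparison, here is a minimal sketch of both modules run on the same input. The sample HTML string and the margin values are made up for illustration; `parse`/`eof` and `format_string` are the documented entry points, but check the current module docs before relying on the details.

```perl
use strict;
use warnings;
use HTML::Strip;
use HTML::FormatText;

# Hypothetical sample input; in practice this would be the fetched page.
my $html = '<h1>About us</h1><p>We help <b>local</b> families.</p>';

# HTML::Strip: removes the tags and returns whatever text is left.
my $hs       = HTML::Strip->new;
my $stripped = $hs->parse($html);
$hs->eof;    # reset parser state before reusing the object

# HTML::FormatText: renders the document as wrapped plain text.
my $formatted = HTML::FormatText->format_string(
    $html,
    leftmargin  => 0,
    rightmargin => 72,
);

print "HTML::Strip:\n$stripped\n\n";
print "HTML::FormatText:\n$formatted\n";
```

HTML::Strip is the lighter of the two; HTML::FormatText builds a tree first (it depends on HTML::Tree) and tries to preserve some layout, such as headings and paragraph breaks.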
Re^2: Module to extract text from HTML by Bod (Parson) on Feb 27, 2024 at 19:35 UTC
Many thanks - two modules that are new to me, and both look very useful 😀
Re: Module to extract text from HTML by Corion (Patriarch) on Feb 27, 2024 at 11:52 UTC
Depending on exactly what text you want (including or excluding content in the `<head>`), you might also get a solution working by running the Mozilla Readability library using one of the JS libraries (JavaScript::QuickJS, JavaScript::Duktape), or by porting that library to Perl. Depending on the content, you can often find an RSS feed instead.

I distinctly remember reading a paper about HTML content extraction that did some calculation on the tree structure of the page and used something like the element with the highest number of direct children of (I think) type `p` or `div`, but I can't find it anymore. This should be fairly simple to implement using XPath queries.
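A rough sketch of that heuristic, assuming XML::LibXML as the XPath engine (the post does not name one) and counting only direct `p` children. This is a guess at the half-remembered idea, not the algorithm from the paper, and the sample HTML is invented.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample page with navigation, main content and a footer.
my $html = <<'HTML';
<html><body>
  <div id="nav"><p>Home</p></div>
  <div id="main">
    <p>We are a small charity.</p>
    <p>We run a food bank.</p>
    <p>Volunteers are welcome.</p>
  </div>
  <div id="footer"><p>Contact</p></div>
</body></html>
HTML

# recover => 2 lets libxml2 parse messy real-world HTML without warnings.
my $doc = XML::LibXML->load_html(string => $html, recover => 2);

# Find the element with the most direct <p> children.
my ($best, $best_count) = (undef, 0);
for my $node ($doc->findnodes('//*[p]')) {
    my $count = $node->findvalue('count(./p)');
    ($best, $best_count) = ($node, $count) if $count > $best_count;
}

# Join the text of those paragraphs, one per line.
my $text = '';
if ($best) {
    $text = join "\n", map { $_->textContent } $best->findnodes('./p');
}
print "$text\n";
```

On real pages a plain count like this is easily fooled (comment threads, long footers), so it is probably best treated as a starting point.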
Re^2: Module to extract text from HTML by bliako (Monsignor) on Feb 27, 2024 at 21:06 UTC
What? No WWW::Mechanize::Chrome?

```perl
use Log::Log4perl qw(:easy);
use WWW::Mechanize::Chrome;

my %default_mech_params = (
    headless   => 1,
    launch_arg => [
        '--window-size=600x800',
        '--password-store=basic',    # do not ask me for stupid chrome account password
        '--disable-gpu',
        '--ignore-certificate-errors',
        '--disable-background-networking',
        '--disable-client-side-phishing-detection',
        '--disable-component-update',
        '--disable-hang-monitor',
        '--disable-save-password-bubble',
        '--disable-default-apps',
        '--disable-infobars',
        '--disable-popup-blocking',
    ],
);

my $mech = WWW::Mechanize::Chrome->new(%default_mech_params);
$mech->get('https://perlmonks.org/?node_id=11157915');
$mech->sleep(5);
my $text_string = $mech->content( format => 'text' );
print $text_string;
```

bw, bliako
Re^3: Module to extract text from HTML by parv (Parson) on Feb 27, 2024 at 21:22 UTC
What I like about the response are the various (interesting) `disable` options for `launch_arg`.
Re^4: Module to extract text from HTML by bliako (Monsignor) on Feb 27, 2024 at 22:04 UTC
Re^2: Module to extract text from HTML by bliako (Monsignor) on Feb 27, 2024 at 22:36 UTC
And here is the long-winded road of using the mech to save to PDF and then using `pdftotext` (Linux command line) to extract the text (all mixed up and good luck):

```perl
...
my $pdf_data = $mech->content_as_pdf( format => 'A0' );
open(my $fh, '>:raw', 'the.pdf') or die $!;
print $fh $pdf_data;
close $fh;
`pdftotext 'the.pdf'`;
```

Note that 'A0' paper size ...
Re^3: Module to extract text from HTML by afoken (Chancellor) on Feb 28, 2024 at 19:41 UTC
"And here is the long-winded road of using the mech to save to PDF and then use pdftotext"

I'm still waiting for someone to suggest printing out, scanning back in, doing OCR, and have an AI fix the OCR errors. ;-)

Also, no traces of "just use a regex" so far. Which is really good.

Alexander
--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re^4: Module to extract text from HTML by bliako (Monsignor) on Feb 29, 2024 at 17:35 UTC
Re: Module to extract text from HTML by bliako (Monsignor) on Feb 27, 2024 at 22:09 UTC
Since some HTML parsers have been mentioned, and in case you take the long handcrafting route, I want to mention my personal favourite, HTML5::DOM. It has not failed me yet and, so far, seems to do what I mean it to do.
Re: Module to extract text from HTML by parv (Parson) on Feb 27, 2024 at 11:34 UTC
Have you tried using XPath to search with some Perl wrapper(s) around XML parsing librar(y|ies)? I personally do not remember the Perl ones (Python has the lxml package around the `libxml2` C library). (time passes) XML::LibXML could have been the one.
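As a sketch of what the XPath route could look like with XML::LibXML: the query below collects every non-blank text node under `<body>`, skipping anything inside `script` or `style`. The sample HTML and the exact XPath expression are illustrative choices, not something taken from the thread.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample page.
my $html = <<'HTML';
<html>
  <head><title>Example Charity</title></head>
  <body>
    <h1>About us</h1>
    <p>We run a food bank.</p>
    <script>alert("hi");</script>
  </body>
</html>
HTML

# recover => 2 lets libxml2 parse messy real-world HTML without warnings.
my $doc = XML::LibXML->load_html(string => $html, recover => 2);

# Non-blank text nodes in the body that are not script or style content.
my $xpath = '//body//text()'
          . '[normalize-space() and not(ancestor::script) and not(ancestor::style)]';
my @chunks = map { $_->nodeValue } $doc->findnodes($xpath);

# Trim each chunk and print one per line.
s/^\s+|\s+$//g for @chunks;
my $text = join "\n", @chunks;
print "$text\n";
```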
Re^2: Module to extract text from HTML by marto (Cardinal) on Feb 27, 2024 at 11:46 UTC
Mojo::DOM is a parser which makes this trivial, however I get the impression from the question that it's less about selecting particular parts of the page ('just extracting the p tags which is not quite good enough'), and more about 'all' of the text.
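For the "all of the text" case, a minimal Mojo::DOM sketch on an HTML string might look like this. The sample page is invented, and removing `script` and `style` first is an extra step added here, on the assumption that their contents are not wanted in the result.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Hypothetical sample page; in practice this would be the fetched HTML.
my $html = <<'HTML';
<html>
  <head><title>Example Charity</title><style>p { color: red }</style></head>
  <body>
    <h1>About us</h1>
    <p>We help <b>local</b> families.</p>
    <script>console.log("not page text");</script>
  </body>
</html>
HTML

my $dom = Mojo::DOM->new($html);

# Drop elements whose contents are code or styling rather than prose.
$dom->find('script, style')->map('remove');

# all_text gathers the text of every remaining descendant node.
my $text = $dom->all_text;

# Collapse the whitespace left over from the markup layout.
$text =~ s/\s+/ /g;
$text =~ s/^\s+|\s+$//g;

say $text;
```

To keep only the visible page body, one option would be to start from `$dom->at('body')` instead of the document root.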
Re^3: Module to extract text from HTML by parv (Parson) on Feb 27, 2024 at 18:47 UTC
(hmm) Right. Yeah I missed that. (edited) Or, maybe not. So `qx[w3m url]`? Anyway, I took a superficial look at the dependencies of Mojolicious and did not see anything like an external parser. Does it implement the parsing itself? Yes, yes it does.
Re^4: Module to extract text from HTML by marto (Cardinal) on Feb 28, 2024 at 17:06 UTC
Re^3: Module to extract text from HTML by Bod (Parson) on Feb 27, 2024 at 19:45 UTC
"I get the impression from question that it's less about selecting a particular parts of the page"

We already hold the website of our customers (typically UK charities). We want them to complete a short section about their organisation. This is used to construct prompts for AI tools around our site that they use to streamline their workload. I am trying to make it easier for them to complete the description of their organisation by pulling text from their own website. This will give them something to work with instead of having to begin with a blank canvas (or contenteditable div).
Re^4: Module to extract text from HTML by bliako (Monsignor) on Feb 28, 2024 at 14:15 UTC
Re^5: Module to extract text from HTML by Bod (Parson) on Mar 01, 2024 at 15:47 UTC
Re: Module to extract text from HTML by kikuchiyo (Hermit) on Feb 29, 2024 at 13:31 UTC
If you can use external programs instead of just Perl modules, try html2text. It does exactly what you want. Be warned though: there are (at least) two different programs with the same name, with different options.
Re^2: Module to extract text from HTML by bliako (Monsignor) on Feb 29, 2024 at 15:17 UTC
Your post reminded me that there is also lynx (https://lynx.invisible-island.net/), a text-based web browser, and the CPAN module HTML::FormatText::Lynx, which spawns a lynx and passes it an HTML filename or string.
Re^3: Module to extract text from HTML by marto (Cardinal) on Feb 29, 2024 at 15:40 UTC
You've inspired a reverse golf challenge: ignore all simple, portable solutions. What's the most convoluted way to achieve the goal? :)
Re^4: Module to extract text from HTML (Reverse Golf) by eyepopslikeamosquito (Archbishop) on Feb 29, 2024 at 22:40 UTC
Re^4: Module to extract text from HTML by Danny (Pilgrim) on Feb 29, 2024 at 15:45 UTC
Re^5: Module to extract text from HTML by marto (Cardinal) on Feb 29, 2024 at 15:51 UTC
Re^4: Module to extract text from HTML by bliako (Monsignor) on Feb 29, 2024 at 17:32 UTC
Re^5: Module to extract text from HTML by marto (Cardinal) on Feb 29, 2024 at 17:34 UTC
Some notes below your chosen depth have not been shown here
Re: Module to extract text from HTML by perlfan (Vicar) on Feb 29, 2024 at 05:05 UTC
Web::Scraper is my general go-to. It makes managing changes on sites that update rather often pretty easy.
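A small sketch of why that can be convenient: all the selectors live in one declarative block, so when a site changes its markup only that block needs editing. The sample HTML, the selectors and the key names `title` and `paragraphs` are made up for illustration.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical sample page.
my $html = <<'HTML';
<html><body>
  <h1>Example Charity</h1>
  <p class="intro">We are a small charity.</p>
  <p>We run a food bank.</p>
</body></html>
HTML

# Declare what to pull out: CSS selector, result key, and what to extract.
my $page = scraper {
    process 'h1', 'title'        => 'TEXT';
    process 'p',  'paragraphs[]' => 'TEXT';
};

# scrape accepts an HTML string here; it can also take a URI object.
my $result = $page->scrape($html);

print "$result->{title}\n";
print "$_\n" for @{ $result->{paragraphs} || [] };
```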

