PerlMonks  

Re^3: Module to extract text from HTML

by Bod (Parson)
on Feb 27, 2024 at 19:45 UTC ( [id://11157933]=note )


in reply to Re^2: Module to extract text from HTML
in thread Module to extract text from HTML

I get the impression from the question that it's less about selecting particular parts of the page

We already hold the website of our customers (typically UK charities). We want them to complete a short section about their organisation, which is used to construct prompts for the AI tools around our site that they use to streamline their workload.

I am trying to make it easier for them to complete the description of their organisation by pulling text from their own website. This will give them something to work with instead of having to begin with a blank canvas (or contenteditable div).

Re^4: Module to extract text from HTML
by bliako (Abbot) on Feb 28, 2024 at 14:15 UTC

    If I understood correctly that you are in control of websites and the formatting of their content, perhaps you could add some tags to the content by means of HTML comments or, better, custom attributes on HTML tags, e.g. <p data-purpose="description" data-index="1">blah blah</p>, and then you just reconstruct the text content from the HTML.
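To make the custom-attribute idea concrete, here is a rough sketch of the reconstruction step. It is written in Python with only the standard library so that it runs as-is; the same approach should carry over to any tree-building or event-based HTML parser. The names DescriptionExtractor and extract_description are made up for illustration, and the code assumes data-index holds a small non-negative integer that gives the order of the pieces.

```python
from html.parser import HTMLParser


class DescriptionExtractor(HTMLParser):
    """Collect text inside elements marked data-purpose="description".

    Hypothetical helper: the attribute names follow the suggestion above,
    everything else is an assumption made for this sketch.
    """

    def __init__(self):
        super().__init__()
        self.parts = {}      # data-index -> list of text fragments
        self._tag = None     # tag name of the marked element we are inside
        self._index = 0      # data-index of that element
        self._depth = 0      # nesting depth of same-named tags

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._tag is None:
            if attrs.get("data-purpose") == "description":
                raw = attrs.get("data-index") or "0"
                # Assumption: a missing or non-numeric index sorts first.
                self._index = int(raw) if raw.isdigit() else 0
                self._tag = tag
                self._depth = 1
                self.parts.setdefault(self._index, [])
        elif tag == self._tag:
            self._depth += 1

    def handle_endtag(self, tag):
        if self._tag is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                self._tag = None

    def handle_data(self, data):
        if self._tag is not None:
            self.parts[self._index].append(data)


def extract_description(html):
    """Return the marked text, ordered by data-index, one paragraph per index."""
    parser = DescriptionExtractor()
    parser.feed(html)
    parser.close()
    paragraphs = []
    for index in sorted(parser.parts):
        # Collapse runs of whitespace left over from the markup.
        text = " ".join("".join(parser.parts[index]).split())
        if text:
            paragraphs.append(text)
    return "\n\n".join(paragraphs)
```

Calling extract_description() on a page returns only the marked paragraphs, in data-index order, and ignores everything else on the page. This only helps where you can edit the markup.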

      you are in control of websites and the formatting of their content

      No - although I am testing it on our own websites, it will be required to read our customers' websites over which we have no control.
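Where the markup cannot be changed, one possible fallback is a plain heuristic: take the page's meta description, if it has one, plus the first few paragraphs of body text as a starting draft. The sketch below is in Python with only the standard library so that it is self-contained. DraftTextExtractor, draft_description and the max_paragraphs cut-off are hypothetical names and choices for illustration, not something from this thread, and real customer sites will likely need more filtering (navigation, cookie banners and so on).

```python
from html.parser import HTMLParser


class DraftTextExtractor(HTMLParser):
    """Gather the meta description and the text of <p> elements.

    Hypothetical helper for illustration; the heuristic is an assumption.
    """

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self.meta_description = ""
        self.paragraphs = []
        self._skip_depth = 0   # > 0 while inside a non-visible element
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            content = attrs.get("content") or ""
            self.meta_description = " ".join(content.split())
        elif tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "p":
            self.flush()       # an unclosed <p> ends where the next one starts
            self._in_p = True

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "p":
            self.flush()

    def handle_data(self, data):
        if self._in_p and not self._skip_depth:
            self._buf.append(data)

    def flush(self):
        """Store the buffered paragraph text, with whitespace collapsed."""
        text = " ".join("".join(self._buf).split())
        if text:
            self.paragraphs.append(text)
        self._buf = []
        self._in_p = False


def draft_description(html, max_paragraphs=3):
    """Return the meta description followed by the first few paragraphs."""
    parser = DraftTextExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()             # pick up a trailing unclosed <p>
    pieces = []
    if parser.meta_description:
        pieces.append(parser.meta_description)
    pieces.extend(parser.paragraphs[:max_paragraphs])
    return "\n\n".join(pieces)
```

The result is meant as editable starting text for the customer, not a finished description.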
