PerlMonks  

Converting plain text to HTML and back again

by Nomis52 (Friar)
on Jul 08, 2003 at 04:32 UTC ( #272193=perlquestion )
Nomis52 has asked for the wisdom of the Perl Monks concerning the following question:

Ok here goes:

I'm in the process of writing a forum type web application. I accept messages from forms, save them in a database and then display them as web pages and/or email them to users.

The problem is I have two types of users. The first type their message into the text field and expect to see it displayed as they typed it. The second want to pretty up their messages and so use HTML. Currently the HTML tags are limited to a small subset including b, i, p, br, a, ul and li.

At the moment, the text is passed through HTML::Scrubber to limit the HTML tags & attributes (if any) and then stored in the db. When displayed on a webpage, the text is run through a simple regex which adds <p> and <br> tags in place of \n. The emailed msgs are sent out as plain text, with no additional filtering.

There are at least two problems with this approach, however:

  • Users who supply HTML tags find that the regex conflicts with their tags, adding extra <p> and <br> tags everywhere.
  • Users getting the messages via email get a bunch of HTML tags in the message if the original poster used HTML.
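To make the first problem concrete, here is a minimal sketch of a newline-to-HTML pass of the kind described above. The actual regex is not shown in the post, so `naive_nl2html` is a hypothetical stand-in, not the real code:

```perl
use strict;
use warnings;

# Hypothetical helper: a naive newline-to-HTML pass (an assumption about
# what the "simple regex" does; the original code is not shown).
sub naive_nl2html {
    my ($text) = @_;
    # Chunks separated by blank lines become paragraphs.
    my @paras = split /\n{2,}/, $text;
    for my $p (@paras) {
        $p =~ s/\n/<br>\n/g;    # remaining single newlines become line breaks
        $p = "<p>$p</p>";
    }
    return join "\n", @paras;
}

# Plain text comes out as intended.
print naive_nl2html("Hello\nworld\n\nSecond paragraph"), "\n\n";

# Text that already contains <p> and <br> gets them doubled up.
print naive_nl2html("<p>Hello<br>\nworld</p>"), "\n";
```

The second call wraps the existing paragraph in another `<p>` and adds a second `<br>`, which is the duplication users are complaining about.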

So my thought was to start storing the data as HTML. I was thinking of accepting messages in both HTML and plain text format (adding a checkbox below the form, or maybe searching for HTML tags and deciding). The plain text messages would be passed through HTML::FromText, and then both would be scrubbed as before.

On the output side, when displayed as webpages, the data can be taken straight from the db without any processing, while for emailing I was looking at using HTML::FormatText to convert back into plain text.

I've started to code up some examples to test this out and it _almost_ works. The issue is that I'd like to have the output text match as closely as possible to the input text, else I will get complaints :) There are a number of small problems like how HTML::FromText changes

* 1
* 2
to
<UL><LI><P>1</P>
<P>2</P>
</UL>
which HTML::FormatText renders as:
  *
 
    1
 
  *
 
    2
To fix this I've started making small modifications to both HTML::FromText and HTML::FormatText. So one of my questions is: should I submit these as patches to the authors, or should I just fork and change them to MyApp::HTML::xxxx?

And finally, while typing this I've thought of maybe adding an attribute in the db to indicate whether or not the text is in HTML form. This would get rid of the converting back and forth. Thinking about it now, it might be the best way to do it.

Am I going about this the right way? Someone must have done something similar to this before, and I'm interested in your comments.

Re: Converting plain text to HTML and back again
by Cody Pendant (Prior) on Jul 08, 2003 at 04:54 UTC
    I was just over at LiveJournal, and they have a simple checkbox which says "don't auto-format my stuff into HTML pars" or something. Make that a sticky checkbox and that's one blunt way of attacking the problem.

    I must admit I'm surprised about that behaviour with the dotpoints. That must be very annoying.



    “Every bit of code is either naturally related to the problem at hand, or else it's an accidental side effect of the fact that you happened to solve the problem using a digital computer.”
    M-J D
Re: Converting plain text to HTML and back again
by Dog and Pony (Priest) on Jul 08, 2003 at 06:08 UTC
    Having a checkbox, or possibly autodetection, sounds like a good idea. When the user is using HTML, don't apply any other formatting, but let the user take full responsibility. Save the choice as a boolean in the database as well, so you know what processing, if any, to apply when it is time to produce output.
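A rough sketch of the boolean-flag idea, using only core Perl. The record layout (`body`, `is_html`) and the helper names are assumptions made up for illustration, and the HTML branch assumes the body was already scrubbed when it was saved:

```perl
use strict;
use warnings;

# Escape characters that are special in HTML.
sub escape_html {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

# Hypothetical record layout: { body => '...', is_html => 0 or 1 }.
sub render_for_web {
    my ($msg) = @_;

    # HTML posts are emitted untouched (assumed scrubbed at save time).
    return $msg->{body} if $msg->{is_html};

    # Plain-text posts get escaped, then newlines become <p>/<br>.
    my @paras = split /\n{2,}/, escape_html($msg->{body});
    s/\n/<br>\n/g for @paras;
    return join "\n", map { "<p>$_</p>" } @paras;
}

print render_for_web({ body => "a < b\nc",  is_html => 0 }), "\n";
print render_for_web({ body => "<b>hi</b>", is_html => 1 }), "\n";
```

With the flag stored, the newline regex never touches HTML posts, and plain-text posts can be emailed exactly as typed. Only HTML posts would need converting for email, for example with HTML::FormatText as the original post suggests.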

    Another option is to disallow <p> and <br>, so you can make replacements in all of them, but that is probably not so popular. Or, to make it harder on yourself, try to guess what to do, i.e. /<br>\n/ does not get replaced, while a single \n will. Probably not a good route.

    In some kinda-similar solutions I've also had both types of data saved side-by-side in the database, especially when there would be a lot of processing overhead otherwise. DB size is rarely *that* important, after all. This could be useful in that when somebody saves a pure-text post, you also save an HTML version, but you still have the pure text for the email etc.
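The side-by-side idea might look roughly like this. `build_row` and its `to_html`/`to_text` hooks are hypothetical names; in the setup from the original post the hooks might wrap HTML::FromText and HTML::FormatText, but toy converters are used here to keep the sketch self-contained:

```perl
use strict;
use warnings;

# Build the row to store. Both renderings are computed once, at save time,
# so nothing is converted back and forth later.
# to_html / to_text are caller-supplied code refs (hypothetical hooks).
sub build_row {
    my (%args) = @_;
    my ($body, $is_html) = @args{qw(body is_html)};
    return $is_html
        ? { html => $body, text => $args{to_text}->($body) }
        : { text => $body, html => $args{to_html}->($body) };
}

# Toy converters, for demonstration only.
my $to_html = sub { my $t = shift; "<p>$t</p>" };
my $to_text = sub { my $h = shift; $h =~ s/<[^>]+>//g; $h };

my $row = build_row(
    body    => "hello",
    is_html => 0,
    to_html => $to_html,
    to_text => $to_text,
);
print "text: $row->{text}\n";
print "html: $row->{html}\n";
```

Whatever the user typed is stored unchanged in its own column, so a plain-text post can be emailed byte for byte, and only the derived column depends on the converter.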

    Trying to format texts back and forth might be a bad idea: lots of special cases and whatnot. A one-off conversion from one to the other should probably be much more reliable in the long run. At least any bugs should be easily spotted and fixed, as opposed to a text that has gone from HTML to text to HTML to... If your users can edit posts, for instance, that could happen really fast. :)


    You have moved into a dark place.
    It is pitch black. You are likely to be eaten by a grue.

Node Type: perlquestion [id://272193]
Approved by cfreak