http://www.perlmonks.org?node_id=640603

punch_card_don has asked for the wisdom of the Perl Monks concerning the following question:

Miscellaneous Monks,

I have your classic MySQL DB of user-profile info and a web-based html form for input. In my Perl middleware, I'm trying to use HTML::Entities to encode any and all non-alphanumeric characters plus all non-English characters in the user input before building the SQL.

Works OK, except for the "any and all non-alphanumeric characters plus all non-English characters" part. Tried the default

$encoded_input = encode_entities($input);
but, as it says in the documentation

This routine replaces unsafe characters in $string with their entity representation. .... The default set of characters to encode are control chars, high-bit chars, and the <, &, >, ' and " characters.

and that doesn't seem to include a whole bunch of non-alphanumeric characters like the colon, semicolon, comma, ^, (, ) and a few more.

So I read:

A second argument can be given to specify which characters to consider unsafe (i.e., which to escape). ... this, for example, would encode just the <, &, >, and " characters:

$encoded = encode_entities($input, '<>&"');
OK, but I don't want to have to generate a list of every non-English character plus all the non-alphanumerics - I might as well make my own regex if I have to do that.

So next I tried this, from the example:

encode_entities($string, "\200-\377");
But that leaves a whole bunch of non-alphanumeric chars unencoded as well. So, what the heck, just enlarge the range, right?
encode_entities($a, "\1-\500");
converts every single character, alphanumeric and all. But maybe I'm getting closer...
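A minimal sketch of why the enlarged range catches everything, assuming HTML::Entities is installed (the sample input is made up for illustration): in a double-quoted string, \1 and \500 are octal escapes, so the range runs from chr(1) to chr(320), and the ASCII letters and digits all sit inside it.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities decode_entities);

# Octal escapes: "\200" is chr(128), "\377" is chr(255), "\500" is chr(320).
printf "%d %d %d\n", ord("\200"), ord("\377"), ord("\500");   # 128 255 320

# Hypothetical sample input, plain ASCII only.
my $input = 'abc 123';

# Range 128..255: no character of $input falls inside it.
my $high_only = encode_entities($input, "\200-\377");

# Range 1..320: every character of $input falls inside it.
my $everything = encode_entities($input, "\1-\500");

print "$high_only\n$everything\n";
```

So widening the range from the low end sweeps in the alphanumerics too; excluding them takes a negated set rather than a bigger range.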

Will appreciate pointers to get me there.

Thanks.




Forget that fear of gravity,
Get a little savagery in your life.

Re: HTML::Entities - encode all non-alphanumeric and foreign chars?
by Sidhekin (Priest) on Sep 23, 2007 at 19:28 UTC

    Your problem is easier if you invert how you express the requirements: Rather than encode everything non-English + non-alphanumeric, encode everything but the English alphanumerics. Which ought to be something like this, depending on your idea of "English alphanumerics":

    $encoded = encode_entities($input, '\W');

    or ...

    $encoded = encode_entities($input, '^\w');

    or ...

    $encoded = encode_entities($input, '^a-zA-Z0-9_');

    (That these follow the regex character class syntax is not actually documented, but I'd be surprised to see it stop working. Certainly, as you noted, the use of hyphen to denote character ranges is documented ...)
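A minimal sketch of the three forms side by side, assuming HTML::Entities is installed (the sample input is invented for illustration). The second argument is used like the inside of a regex character class, so for plain ASCII input all three should give the same result:

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities decode_entities);

# Hypothetical sample input, not from the original post.
my $input = 'name: O(1); x^2';

# Each spec is used like the inside of a [...] character class.
my @specs = ('\W', '^\w', '^a-zA-Z0-9_');

my %encoded;
for my $spec (@specs) {
    # Called in non-void context, encode_entities returns an encoded copy
    # and leaves $input untouched.
    $encoded{$spec} = encode_entities($input, $spec);
    print "$spec => $encoded{$spec}\n";
}
```

For decoded (character) strings, \w can also match non-ASCII letters, which is probably why the explicit a-zA-Z0-9_ form is the most predictable of the three.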

    print "Just another Perl ${\(trickster and hacker)},"
    The Sidhekin proves Sidhe did it!

      Hadn't imagined it would take regex elements...

      The first two

      $encoded = encode_entities($input, '\W');
      $encoded = encode_entities($input, '^\w');
      wouldn't work for me. But I tried
      $encoded = encode_entities($input, '\\W'); # note double backslash
      and that did work, with one little picky issue - it was encoding every whitespace char as well, which, while not technically bothersome, is just not needed.

      So I tried the last formulation with a space added to the list - had to add it as a simple typed space - it wouldn't accept a \s:

      $encoded = encode_entities($input, '^a-zA-Z0-9_ ');
      and that does it perfectly.

      Thanks.
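A small sketch of the difference described here, assuming HTML::Entities is installed (the input string is hypothetical): with \W every space becomes an entity, while a literal space in the negated list leaves spaces alone.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities decode_entities);

# Hypothetical sample input with spaces and punctuation.
my $input = 'Ann Lee: admin';

# \W matches whitespace too, so the spaces are encoded.
my $all_nonword = encode_entities($input, '\W');

# A literal space in the negated list exempts spaces from encoding.
my $keep_space = encode_entities($input, '^a-zA-Z0-9_ ');

print "$all_nonword\n$keep_space\n";
```

Note that a literal space only exempts the space character itself; tabs and newlines would still be encoded with this list.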

        $encoded = encode_entities($input, '\\W'); # note double backslash

        Single backslash works for me. Sure you weren't trying with a double-quoted string?

        ('\w', '\\w', "\\w" should all be the same string, \w — whereas "\w" is just w.)

        Oh, and the same goes for \s. It should Just Work in a single-quoted string, but in a double-quoted string, you'll need to double the backslash.
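A quick sketch of the quoting point, assuming HTML::Entities is installed (the input string is made up): the three spellings yield the same two-character string, the double-quoted "\W" loses its backslash, and \s works in a single-quoted spec.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# These three spell the same two-character string: backslash, W.
my @same = ('\W', '\\W', "\\W");

# In double quotes, \W is not a recognized escape, so the backslash is
# dropped (Perl warns about this under "use warnings"; silenced here).
my $lost = do { no warnings 'misc'; "\W" };

print length($_), "\n" for @same;   # 2 each
print length($lost), "\n";          # 1

# Hypothetical input: with \s in a single-quoted spec,
# spaces and tabs are left unencoded.
my $input   = "a b\tc:d";
my $encoded = encode_entities($input, '^a-zA-Z0-9_\s');
print "$encoded\n";
```

If the doubled backslash seemed necessary, the spec was most likely written in double quotes somewhere along the way.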

        print "Just another Perl ${\(trickster and hacker)},"
        The Sidhekin proves Sidhe did it!