Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

Re^2: Regexp and metacharacters

by Largins (Acolyte)
on Dec 27, 2011 at 01:14 UTC ( [id://945137]=note: print w/replies, xml ) Need Help??


in reply to Re: Regexp and metacharacters
in thread Regexp and metacharacters

Hello

I have random data coming in from many XML files which are being parsed and stuffed into a hash
After an entire document has been read in, the desired items are extracted from the hash by key, and then inserted into a database.
Since I don't know what I'm getting beforehand, I want to make sure perl doesn't interpret metacharacters in any way other than as literals.
I was using dbh_quote with good results, but got random burps, sometimes after 3000 files, so went to regexp instead of dbh_quote
My reasoning, perhaps flawed, for wanting a single (or as few as possible) lines of code was that it would run faster, maybe fewer passes per item

Largins

Replies are listed 'Best First'.
Re^3: Regexp and metacharacters
by wrog (Friar) on Dec 27, 2011 at 03:11 UTC
    I have random data coming in from many XML files which are being parsed and stuffed into a hash
    This vastly narrows things down, thanks (and also makes clear that quotemeta|\Q\E is not what you should be doing...)

    Much still depends on where your quoted strings are really coming from, and, at this point, I see two possibilities (yell if it's not one of these):

    1. element body, e.g.,
      <foo ... >"Hi mom!"</foo>
      which seems really unlikely to me, since in this context there's really no need to be quoting the string at all; there are already opening/closing tags (i.e., <foo> and </foo>) to bound things.

      But, on the off-chance that this is indeed the case for the XML documents you're getting, you would then need to consult the documentation for the particular XML Schema being used to find out what the actual format is for the text in <foo> bodies so that you know what sorts of escapes can possibly occur, because if you don't handle them correctly, you will be corrupting your data. And since XML allows people to do anything they want in element bodies, there isn't any alternative here; you have to find out what the schema expects/allows.

    2. attribute values, e.g.,
      <foo attr="Hi mom!" .../>
      In this case, your XML parser should already be providing the strings to you in unquoted form with all escapes/character-entities properly resolved, meaning that if the incoming XML file has, e.g., <foo attr="Jack &#38; Jill" ... /> the parser should be returning this string to you as "Jack & Jill" by which I mean an eleven character string whose first character is 'J', whose sixth character is an actual honest-to-god ASCII-38 ampersand, and whose last character is 'l', with no double-quotes anywhere in sight.

      If your XML parser is not doing this for you, then your XML parser sucks and needs to be replaced; there are lots of choices out there. (...and sorry if you were rolling your own — this is fine for learning how XML works; but if you just want things to work and not have to deal with all of the weird, stupid crap that can come up in XML files (do you handle CDATA properly? what about wonky character encodings?), you need something that's been combat-tested...). CPAN has lots of good XML parsers with pretty much every parsing interface you could ever want; download one and save yourself a lot of grief.

    As for getting stuff into databases, you were right the first time: $dbh->quote is the right way to insert an arbitrary string value into an SQL statement — that regexp quote happened to work for you is more a matter of luck that both Perl regexp and (your database's version of) SQL (apparently) use backslashes in the same way (most of the time, except for those cases where they don't, which you won't find out about until stuff breaks...).

    But actually a better way to do this is to use parameterized queries, if you can. For example, instead of

    $dbh->do('INSERT INTO mytab VALUES('. $dbh->quote($value) .', ...);');

    do

    $dbh->do('INSERT INTO mytab VALUES (?, ...);', {}, $value, ...);

    Granted, you'll need to check what format for parameter placeholders your driver will accept ("?" is supposed to be universal, except where it isn't. I believe MSFT uses something else, but I forget...). And if your driver does not grok parameters, you may be able to choose another one that does (e.g., there may be an ODBC-based driver for your database...).

    While you didn't say what sorts of things your current setup was burping on, and while my current bet would be on the homegrown XML parser screwing up character entities or CDATA stuff, there is also the (perhaps remote, but maybe not) possibility that $dbh->quote isn't doing quite the right thing for your database's version of SQL. The point being that parameterized queries leave it up to the database driver to implement the quoting, and since the driver is specific to your database, it's a lot more likely to get the quoting right (i.e., if there are any dangerous corners in your database's version of SQL that DBI.pm doesn't know about).

    Bottom line here being that if you (1) use a proper XML parser and (2) have a reasonable database driver, you should not be having to do any quote-stripping or escaping/unescaping at all.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://945137]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others admiring the Monastery: (5)
As of 2024-04-24 03:51 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found