PerlMonks  

Thwarting Screen Scrapers

by kschwab (Vicar)
on Jul 18, 2002 at 13:24 UTC (#182794=perlquestion)

kschwab has asked for the wisdom of the Perl Monks concerning the following question:

I'm curious if anyone has any experience in protecting a web-based interface from being "front-ended" by others for their own gain.

Say I spent a good amount of time and money to develop a site that sells "ProductX". Someone then reverse engineers my HTML form->submit process and creates their own front-end, secretly adding an upcharge to the customer. They also change all references to ProductX to ProductY, so I may not be able to manually search for and identify who's screen scraping.

What I've thought of:

  • Dynamically changing form element names, perhaps tied to a Digest::MD5 hash of the session key. Might help, but they could still guess some of that based on the provided values of the form elements.
  • Having the user type in text that matches what's displayed in a .gif image. ( See This pretty cool node from jcwren on defeating this sort of thing ). This bothers me, because I'm making the customer jump through hoops to buy something.
  • Analyze web logs to find those people taking an odd path through the site ( skipping intro pages ). Turns out this isn't useful, since there is so much client-side caching.
Any other ideas ? Any modules that might help me ?
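
A minimal sketch of the first idea, dynamic form element names derived from a Digest::MD5 hash of the session key. The secret, the helper names `field_name` and `decode_params`, and the 16-character truncation are assumptions made for illustration, not something described in the question. Mixing in a secret is one way to keep the names from being recomputed by anyone who learns the session key.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical application secret; an assumption for this sketch.
my $SECRET = 'change-me';

# Derive an opaque, per-session name for a logical form field.
sub field_name {
    my ($session_key, $logical_name) = @_;
    my $digest = md5_hex(join ':', $SECRET, $session_key, $logical_name);
    # Prefix with a letter so the name never starts with a digit.
    return 'f' . substr($digest, 0, 16);
}

# Map submitted parameters back to their logical names.
# Fields submitted under any other name are simply ignored.
sub decode_params {
    my ($session_key, $submitted, @logical_names) = @_;
    my %decoded;
    for my $name (@logical_names) {
        my $opaque = field_name($session_key, $name);
        $decoded{$name} = $submitted->{$opaque} if exists $submitted->{$opaque};
    }
    return \%decoded;
}
```

When generating the form, each input would be emitted as `name="` followed by `field_name($session_key, 'quantity')`; on submission, `decode_params` recovers the logical names for that session only. As noted in the list above, a scraper that fetches a fresh form per request can still infer which field is which from the values and layout.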

Replies are listed 'Best First'.
Re: Thwarting Screen Scrapers
by dws (Chancellor) on Jul 18, 2002 at 16:25 UTC
    I'm curious if anyone has an experience in protecting a web-based interface from being "front-ended" by others for their own gain.

    I've spent a fair amount of time on the other end of this problem, dealing with issues around how to co-navigate web pages that are in some way protected. (This was for a customer service application, where customers' support organization was having to work around roadblocks of the type you're looking to set up, set up internally by the web side of their organizations.)

    If you're willing to put some work in on the back end, one way of throwing a spanner in the works of anyone who is hijacking your form submission process without your noticing is to do the following:

    • When you generate the form, allocate an ID and record it on the backend (e.g., in a database, along with a timestamp that you can use to time the form out).
    • Generate an MD5 hash based on the ID and some secret key known only to your application. Add this hash as an argument to the form action URL [1] (i.e., add "?key=$hash").
    • Put the ID into the form in a hidden field.

    When a form is submitted, it's a simple matter to

    • Check to see if the ID has been used already. This prevents them from grabbing one legitimate key/ID pair and reusing it.
    • Check to see if the ID has expired (if you care)
    • Generate a new hash based on the ID and your secret key, and compare to the one in param('key')
    This leaves the exploiters in the position of either having to come to your site to get a form, or trying to guess your secret key. If they have to come to your site to get a form, you can track and ban them. If your submission form is framed, you can do an automated check using your weblogs for form submissions that aren't matched to a fetch of the framing page. This isn't 100% accurate, but the repeat abuser is who you're looking for.

    [1] You could put the hash into a hidden field instead. I recall there being some reason why having it be part of the URL was advantageous, but don't remember specifics. It might have had to do with getting it into the weblogs for later processing.
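
The scheme above could be sketched roughly as follows. The in-memory `%issued` hash stands in for the back-end database, and the secret, the 30-minute timeout, and the function names are all assumptions for illustration. Times are passed in explicitly so the logic is easy to exercise.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

my $SECRET  = 'known-only-to-the-application';   # hypothetical secret
my $TIMEOUT = 30 * 60;                            # assumed form lifetime, in seconds
my %issued;                                       # stand-in for the back-end table
my $next_id = 0;

# Hash of the ID plus the secret; this is what goes into "?key=$hash".
sub form_key {
    my ($id) = @_;
    return md5_hex("$id:$SECRET");
}

# Called when the form is generated: allocate an ID, record it with a timestamp.
sub issue_form {
    my ($now) = @_;
    my $id = ++$next_id;
    $issued{$id} = { issued => $now, used => 0 };
    return ($id, form_key($id));
}

# Called when the form is submitted, with the hidden-field ID and param('key').
sub check_submission {
    my ($id, $key, $now) = @_;
    return 'unknown id' unless defined $id && exists $issued{$id};
    my $rec = $issued{$id};
    return 'already used' if $rec->{used};
    return 'expired'      if $now - $rec->{issued} > $TIMEOUT;
    return 'bad key'      unless defined $key && $key eq form_key($id);
    $rec->{used} = 1;     # only a fully valid submission consumes the ID
    return 'ok';
}
```

A handler would call `issue_form(time)` while rendering the page and `check_submission($id, $key, time)` on POST, rejecting anything that does not return `'ok'`. A keyed HMAC (for example from Digest::SHA) would be a reasonable substitute for the plain MD5 of ID plus secret; the thread itself only mentions MD5.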

      Thanks...this is the kind of input I was looking for.

      Obviously any type of measure has a countermeasure, and if it works in a browser, it would work in LWP ( or some other interface ).

      The addition of a timestamp into the hash calculation is an interesting one.

      We've already worked out a method of using dynamically generated form field names from a hash of the session key. Adding the timestamp perturbs it a bit, and keeps someone from keeping a session alive over a long period of time.

      dws++...thanks again.

        The addition of a timestamp into the hash calculation is an interesting one.

        Interesting, but not what I intended to suggest. Using a timestamp when generating the hash needlessly complicates verification.

        What I meant to suggest was that you save a timestamp when you record generated IDs. This gives you an easy way to "time out" forms, and flush abandoned forms out of your back-end database. It also sets you up for doing some analysis on things like average submit time (the gap between your generating the form, and a user submitting it). A really low submit time is an indication that there's a bot on the other end of the line.
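
As a rough illustration of the submit-time analysis described above, assuming the issue and submission times are stored as epoch seconds. The 3-second threshold and the function names are invented for this sketch and would need tuning against real traffic.

```perl
use strict;
use warnings;

my $MIN_HUMAN_SECONDS = 3;   # assumed threshold, not from the thread

# Gap between generating the form and receiving the submission.
sub submit_time {
    my ($issued_at, $submitted_at) = @_;
    return $submitted_at - $issued_at;
}

# A very small gap suggests a script rather than a person filling in the form.
sub looks_automated {
    my ($issued_at, $submitted_at) = @_;
    return submit_time($issued_at, $submitted_at) < $MIN_HUMAN_SECONDS ? 1 : 0;
}

# Average over a list of [issued_at, submitted_at] pairs; undef if the list is empty.
sub average_submit_time {
    my @pairs = @_;
    return undef unless @pairs;
    my $total = 0;
    $total += submit_time(@$_) for @pairs;
    return $total / @pairs;
}
```

A front-end that relays a human's input will not always trip a fixed threshold, so this is better treated as one signal to log and review than as a hard block.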

Re: Thwarting Screen Scrapers
by ignatz (Vicar) on Jul 18, 2002 at 14:33 UTC
    Back in the dot com boom I spent a few months working for a company that scraped sites that had logins so that one could store all of them in one place and only have to register once. Many sites welcomed it because it got them new members. Some didn't and took counter-measures. Changing the form elements, moving the locations of the forms or changing the required cookies all played havoc on our application. The most effective weapon was sites that simply blocked our IP address.

    As for cookies and HTTP_REFERERs and the like, just because something that you do can be hacked doesn't mean that you should assume that they have hacked it and not check for it. This gives them the luxury of not even having to hack it in the first place.

    Generally, what these guys are doing isn't rocket science. Changing things even a little bit will throw a big spanner into their works. Making sure that your form validator confirms that EVERYTHING is as it should be will also be a big plus.

    ()-()
     \"/
      `                                                     
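
One way to read "confirms that EVERYTHING is as it should be" is a validator that rejects missing, malformed, and unexpected fields alike. The field names and patterns in `%SPEC` below are hypothetical; only the idea of checking the complete parameter set comes from the post above.

```perl
use strict;
use warnings;

# Hypothetical specification of the only fields the form may contain.
my %SPEC = (
    quantity => qr/\A[1-9][0-9]{0,2}\z/,
    email    => qr/\A[^\@\s]+\@[^\@\s]+\z/,
    form_id  => qr/\A[0-9]+\z/,
);

# Returns a list of problems; an empty list means the submission is acceptable.
sub validate_form {
    my ($params) = @_;
    my @errors;
    for my $name (sort keys %SPEC) {
        if (!defined $params->{$name}) {
            push @errors, "missing $name";
            next;
        }
        push @errors, "malformed $name" unless $params->{$name} =~ $SPEC{$name};
    }
    # Anything the real form never sends is a hint that something else built the request.
    for my $name (sort keys %$params) {
        push @errors, "unexpected $name" unless exists $SPEC{$name};
    }
    return @errors;
}
```

The same function is a natural place to add the cookie and referrer checks mentioned above, each as one more entry in the error list.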
    
(jcwren) Re: Thwarting Screen Scrapers
by jcwren (Prior) on Jul 18, 2002 at 16:16 UTC

    I know this sounds evil, but for the moment, the best way to prevent screen scraping may be to use Flash. Flash now supports forms, submissions, authentication, yada yada yada.

    This comment was not intended to address usability, nor the ability of non-Flash enabled browsers to purchase your product. But Flash is becoming pervasive, and is available on the platform that is most likely to be your customers' browser of choice.

    This will change, no doubt, as client-side Flash tools become as pervasive as various CPAN modules like LWP, etc. But for now, it seems to be the least likely method to address the issue you're concerned about.

    --Chris

Re: Thwarting Screen Scrapers
by kschwab (Vicar) on Jul 18, 2002 at 14:35 UTC
    I had hoped to limit this to the technical rather than philosophical points, but it looks like the replies are headed elsewhere.

    How about an example close to home ?

    I create a site called perldudes.com. Instead of developing my own community, I front-end perlmonks.com, taking the inbound http requests, pulling nodes from perlmonks, and substituting text as needed. ( s/perlmonks/perldudes/g, etc). I also put in my own advertisements and content, and maybe the interface is really crappy.

    As for what this has to do with perl, it's obviously a bit off-topic. I am interested, however, in how any technique might be implemented in perl, and what modules might help me along.

    I am aware that there is no way to completely stop this sort of thing. I'm looking for the best ideas to slow it down or at least stop the simplistic attempts.

    Abigail: I understand your points, but...If someone else can sell my product, but creates the whole customer selling experience, and I have to create, ship and support the product, how is that okay ? The "scrapers" go to great lengths to make sure the Customer doesn't see the fact that there are two parties involved. They also dish off support, etc, by front-ending the feedback forms.

      kschwab,
      Find the ip it's coming from and block it. Unless it's client side scripting, it's going back to a central computer (or series of) somewhere.
      Block it.
      If you want to be really cool: when you find the IP, catch the POST or GET, go out to theonion.com or some other random site, do a GET yourself and hand the result back as the response to their request.
      The bad part about that is, you would be doing what they are but it would confuse the mess out of them for a moment ;o)
      Daniel
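
A small sketch of the blocking suggestion, assuming the check runs in application code (a web server or firewall rule would do the same job earlier). The addresses are reserved documentation ranges used as placeholders, and the prefix match is a deliberately crude stand-in for real netmask handling.

```perl
use strict;
use warnings;

# Placeholder entries; real ones would come from log analysis.
my %BLOCKED_ADDR   = map { $_ => 1 } qw(192.0.2.10 198.51.100.7);
my @BLOCKED_PREFIX = ('203.0.113.');   # crude /24 match by string prefix

sub is_blocked {
    my ($addr) = @_;
    return 0 unless defined $addr;
    return 1 if $BLOCKED_ADDR{$addr};
    for my $prefix (@BLOCKED_PREFIX) {
        return 1 if index($addr, $prefix) == 0;
    }
    return 0;
}
```

In a CGI script the argument would typically be `$ENV{REMOTE_ADDR}`. Requests relayed through open proxies will show the proxy's address, so a list like this needs ongoing maintenance.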

      I still don't see the big problem. If a person would go to your site and order something, you will have to create, ship and support the product. Just like you have to do when they go to someone else's site. Of course, if you don't want to create, ship and support a product, why do you have it?

      I do assume you are getting paid for creating, shipping and supporting the product. If not, and it's a burden to you, perhaps you should stop. ;-)

      What interests me is how they manage to get in the middle when it comes to paying. How are they getting their share? If they take a credit card number, take their share from the account, then pass the number to you so you take your part, the customers will frown, and someone will think "fraud".

      Abigail

        It's not one situation, but many.

        Indeed, some of them do make their own charge on the credit card, and I end up handling the resulting mess.

        There are several variations on the theme, some of them actually calling out the correct name for the product, but acting like they are some sort of authorized reseller.

        Other ones have a relationship with vendors of similar products, and get paid for those purchases. They include my product only for completeness, and make no money on the transaction.

        They do, however, get the Customer eyeballs, and create confusion. My product gets tied in with their advertisements, or perhaps their interface keeps crapping out, and I get that feedback.

        Part of the retail process is trying to get the customer to come back and give you more of their money. Hard to do that if they don't know who you are.
        ()-()
         \"/
          `                                                     
        
Re: Thwarting Screen Scrapers
by tjh (Curate) on Jul 18, 2002 at 14:41 UTC
    From the subject line I expected a conversation on hijacking content, possibly RSS or other news feed issues, copyright arguments, maybe even allusions to the U.S. entertainment industry trying, with vehement avarice, to technologically block any re-recording of anything (lol), and other things... :|

    Instead, I can't tell if you are a merchant that is somehow being disintermediated by your own reseller or what - even though you're still making the sale. I'm confused. If you're still making the sale and collecting the payment, I don't get it. Has someone pre-empted your front end? Why would they do that? If you're being targeted and your site hijacked that's different.

    If you have soft content, news or other written content, that someone is scraping and calling their own by redisplaying it on their own site, this is a different matter - a legal one without good Perl-specific solutions.

    Did you state your problem exactly - or is this a drill?

    Update: just read your follow up.

    The tech tactics are being listed by others (dynamic session id's per page call, dynamic field names, etc.) In an ideal world all session mgmt and user authentication would be application level with high granularity - down to each page or function call from the client, every time a request arrives. I know of no current solution, Perl or otherwise, that solves this completely. Would love to see one though.

    On the other front from your example, I have had this exact experience 2 times. All the technology solutions in the world won't stop someone who relentlessly intends this fraud. You have to detect them, copy the fraudulent material, get witnesses - do whatever your lawyer tells you to do about the copyright violation (and hope it's domestic). In one of my experiences, a simple email solved it. The other got a little warmer...

      In an ideal world all session mgmt and user authentication would be application level with high granularity - down to each page or function call from the client, every time a request arrives. I know of no current solution, Perl or otherwise, that solves this completely. Would love to see one though.

      Yes, this level of authentication can certainly be done. I'm currently involved in a large project, where this is being done, and we even go further. Unfortunately, I can't tell you more.

      It's not simple, and it takes large investments. The question isn't "can this level of authentication be done", the question is "how much are you willing to pay?" (pay in a broad sense - mostly costs to hire people).

      Abigail

      You're right, I haven't included all the details. I was trying to keep this generic enough to apply in more than one situation.

      Basically, I'm selling something direct via a website. I have no resellers. A set of people I don't know at all have created their own websites, but they are nothing more than a shell around my website. They make money by adding a "service charge" and billing it to the customer. ( Without adding any apparent value )

      They take all the http and https requests from the Customer, via their own forms, and then take the data and make simulated browser requests to my site to make the purchase. Other areas, such as feedback, etc, are directed to my site as well.

      They obviously feel they are doing something wrong, since they hide behind unprotected web proxy servers and use other "stealth" techniques to make stopping them difficult.

      If it were just one party, a legal approach would work. Unfortunately, this situation happens over and over again, with a different set of front-enders, sometimes with an offshore website.

        I see (I think). They're processing their own forms (order and payment) themselves, then, in turn, mapping the same sequence on your site. Does this mean that every time an order is made and paid on their site that they cause the same on yours? Are you getting the original customer name, addy, etc., or would you know?

        Real-time detection is possibly the first goal. Unless there is something unique you can detect in the incoming 'ghost' client that you can block with, maybe you can work to detect duplicate payments, shipping addresses etc on the tail of the transaction - which assumes that your new 'partners' are ordering from you then re-shipping to their customer.

        If they are taking the customer data from their own forms and re-submitting it to you, including payment (CC#?) info to you - with a markup - how are they collecting their markup? If they are collecting their full payment using the customer's payment data, THEN resending that same payment data to you, effectively double-billing the buyer, this is a much different type of problem and you should be contacting law enforcement.

        From the looks of your other responses in this thread - methinks you need to do both - tech and legal. If you have a product that is inspiring so much theft/fraud, you need to protect it immediately - but not so protected that it can't be sold at all... :)

Re: Thwarting Screen Scrapers
by mojotoad (Monsignor) on Jul 18, 2002 at 14:30 UTC
    Aside from the various comments above, I would add that if the scenario is as you describe, make sure that you show up in any cost comparison meta-sites. You're guaranteed to be the lower price.

    Matt

Re: Thwarting Screen Scrapers
by Sifmole (Chaplain) on Jul 19, 2002 at 11:46 UTC
    I don't see your problem.

    If you charge $100 for one unit of ProductX: for packaging, shipping, product, and support; and the other guy charges $120 for one unit of ProductY (aka ProductX) and then pays you $100 and you package, ship, manufacture, and support -- you get paid the same for the same amount of work. All you got is someone out there doing free marketing for you.

    If the problem is name brand recognition, well then.... Just go to your local Kinko's and print up some package inserts:

    If you bought this product from anywhere other than www.iwanttosellmyownstuff.com, you may have paid too much. Please visit www.iwanttosellmyownstuff.com in the future for lower prices on this wonderful do-ma-higgy. Thanks for your patronage.

Re: Thwarting Screen Scrapers
by fireartist (Chaplain) on Jul 18, 2002 at 15:24 UTC
    How does your billing backend work, and do you store cc numbers?

    Why do I ask?
    I presume that if they are charging the customer extra, and keeping the profit, that they are charging the customer's credit card themselves, and then sending their own payment details to you to make the purchase from you.

    The only way they could get round this would be if they charged the customer's cc a small fee themselves, and then sent the cc number to you to charge the rest.
    - and I hope that anybody would think this very suspicious if they saw this on their statement.

    So, I can see 2 possible solutions to counter this.
    If you store the cc numbers, then check to see if the same number is being used multiple times for the same product.
    Check the customer's address against the cardholder's address to see if they're different.
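
Both checks can be sketched without keeping raw card numbers, which the replies below argue against. This version counts orders per keyed fingerprint (HMAC-SHA256 from Digest::SHA) instead of per card number; the key, the in-memory counter, and the address normalisation are assumptions for illustration, and whether a fingerprint is acceptable to store is a question for whoever handles your card-processing rules.

```perl
use strict;
use warnings;
use Digest::SHA qw(hmac_sha256_hex);

my $FINGERPRINT_KEY = 'hypothetical-key-kept-out-of-the-web-tree';
my %orders_by_card;   # fingerprint => number of orders seen

# Keyed fingerprint of the card number, ignoring spaces and dashes.
sub card_fingerprint {
    my ($number) = @_;
    (my $digits = $number) =~ s/\D//g;
    return hmac_sha256_hex($digits, $FINGERPRINT_KEY);
}

# Record an order and return how many times this card has now been seen.
sub record_order {
    my ($card_number) = @_;
    return ++$orders_by_card{ card_fingerprint($card_number) };
}

# Compare two addresses after lowercasing and collapsing punctuation and spacing.
sub addresses_differ {
    my ($billing, $shipping) = @_;
    my @norm = map {
        my $s = lc $_;
        $s =~ s/[^a-z0-9]+/ /g;
        $s =~ s/^ | $//g;
        $s;
    } ($billing, $shipping);
    return $norm[0] ne $norm[1] ? 1 : 0;
}
```

A card that keeps reappearing with different shipping addresses is the pattern to look for; gifts and office deliveries will produce honest mismatches, so these are flags for review, not grounds for refusal.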

      do you store cc numbers?

      Wouldn't that be an incredibly bad practise? I have worked on a number of ecommerce projects but none of them stored the credit card number. Ever.

      If you store card numbers in your database and your server gets cracked then the cracker can get all the card numbers. My legal knowledge is small but I'd have thought a system design like that would leave you open to criminal negligence suits. If you don't store the card numbers there is no exposure.

        I know, I was going to add a disclaimer, but didn't bother.

        I said "do you?", because I know that some do it.
        - Amazon, for example, records my cc number.

        I have read about methods of storing cc numbers by using a machine behind a firewall, which the cgi server can access, but can't itself be accessed directly from the internet.
        I don't know all the implications/applications of this, so that's why I didn't go into it.
        (and don't really want to still ;)

Re: Thwarting Screen Scrapers
by Abstraction (Friar) on Jul 18, 2002 at 13:35 UTC
    This is just an idea and I'm sure someone will have a way around this. But when you display the form, set a cookie with a known, difficult to guess value. When you process the form, check for the existence of that cookie. If someone is posting from another domain they won't have that cookie.

    You can also check the referring URL, but that can be spoofed I think.
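
A minimal sketch of the cookie idea, assuming the server remembers which values it handed out. The cookie name `formtoken`, the in-memory `%valid_token` store, and the helper names are invented for this example; `/dev/urandom` is used where available, with a weaker fallback.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);

my %valid_token;   # token => issue time; stand-in for server-side session storage

# Create a hard-to-guess value and remember it.
sub new_token {
    my $seed;
    if (open my $fh, '<:raw', '/dev/urandom') {
        read $fh, $seed, 32;
        close $fh;
    }
    # Weak fallback for systems without /dev/urandom.
    $seed = join ':', time(), $$, rand() unless defined $seed && length $seed;
    my $token = sha256_hex($seed);
    $valid_token{$token} = time();
    return $token;
}

# Header line to send along with the form page.
sub set_cookie_header {
    my ($token) = @_;
    return "Set-Cookie: formtoken=$token; Path=/; HttpOnly";
}

# Check the Cookie header of the submission, e.g. $ENV{HTTP_COOKIE} under CGI.
sub cookie_ok {
    my ($cookie_header) = @_;
    return 0 unless defined $cookie_header;
    my ($token) = $cookie_header =~ /(?:^|;\s*)formtoken=([0-9a-f]{64})(?:;|$)/;
    return defined $token && exists $valid_token{$token} ? 1 : 0;
}
```

This only stops submissions that never visited the form page. A client that fetches the form first, as LWP with a cookie jar can, will pass the check.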

      Of course, nothing prevents the other side from making a query to your site and getting a cookie.

      Abigail

Re: Thwarting Screen Scrapers
by neilwatson (Priest) on Jul 18, 2002 at 13:41 UTC
    As Abigail says, so what?

    The whole point of the web is open standards. HTML is not hidden. If it was, it would not be as popular. Now you want to hide so that you can sell your productX better than someone else can sell productY?

    A product should sell on the merits of its performance and quality. Not on how slick your website is.

    Neil Watson
    watson-wilson.ca

      That's missing the point. I don't care about honest competition. I just don't think someone should be able to leverage my infrastructure to re-sell my product.

      This leaves me no control over the selling process. The front-end does whatever it likes.

      Suppose they make a claim that the product has awesome feature xyz. They then take the money, hit my website, I take the order and ship it. The customer opens the box, finds out feature xyz doesn't exist, then finds the support contact info in the box. They call me and ask about feature xyz.

      Bah.

      Update neilwatson: Yes, in some cases it is fraud, and legal action is taken. The reason for the post was to find ideas to do whatever I can to discourage it in the first place. I'd rather make it hard to do than wait for it to happen and take legal action.

        So we are talking about fraud? There's not really a productY at all. It is productX purchased from you, marked up and resold in a fraudulent manner. Surely there is a way for the product to be traced to whom it was sold (the "scraper")? Product serial numbers?

        Perhaps finding the sites for these "scrapers" and bringing legal action against them.

        Neil Watson
        watson-wilson.ca

Re: Thwarting Screen Scrapers
by Rhose (Priest) on Jul 18, 2002 at 17:35 UTC
    First off, let me say I have no experience in this area, but while reading this thread, I had a thought -- could you use a technique similar to the ones used to combat votebots? Since it seems you want your form to interface with a human and not a computer script/program, how about generating a confirmation image which must be reentered by the purchaser? I can't imagine the other site will want to hire people to sit around and reenter images as your product is purchased.

    merlyn has some details here (Even though jcwren was able to get around it; see A little fun with merlyn *Smiles*)

    Update

    Silly me, got a snack and realized they will just pass the image to the end user, capture the entered input, and forward it back to your form...

Re: Thwarting Screen Scrapers
by Abigail-II (Bishop) on Jul 18, 2002 at 13:34 UTC
    So, what's the problem? You're still selling your product, aren't you? Or did the others take over production of your product as well?

    BTW, what does this have to do with Perl?

    Abigail

      From my own non-business educated perspective, it appears that they are ripping off his brand name. Even though he is selling ProductX for the same amount of money, other people are selling his exact product, only calling it ProductY, therefore diminishing his brand recognition. So, for the time being, he might still make the sale, but in the long run, his business is hurt because his brand is less recognizable due to ProductY ripping him off.

      As far as being related to Perl, he posed a problem with a script(s) and asked for a technical solution for it.. I thought that was what Monks were all about.. helping with a technical problem with a Perl script?

      I could be wrong.. it's happened once before.

        So, he should put his product in a box that says "Product X". ;-)

        As for the solution - the problem isn't technical, and the solution shouldn't be either. If he needs outside help, he should ask a lawyer.

        Abigail
