When does XML get to be too much?

by newrisedesigns (Curate)
on Jan 24, 2003 at 20:09 UTC

newrisedesigns has asked for the wisdom of the Perl Monks concerning the following question:

I have a script on my server that processes a large flat-file. The script takes about a second or so to process the information and return it to the user. Considering the possible hazard in having many people use the script, I decided to code some countermeasures to prevent the users from reloading too many times in a row and give them a limit per day.

This works great, except for the fact that I store their IP, last request time, and number of visits in an XML file. Right now, with only 20 or so visitors the file is at 2kb. When more unique users visit this page, the XML file will continue to grow. Will this become a future bottleneck? It would be really ironic if a security measure intended to relieve strain on a server causes more than it prevents.

The XML looks like this:

    <opt>
      <ip0.0.0.0 visits="1" time="1043393316" />
      <ip127.0.0.1 visits="5" time="1043370125" />
      ...

The code for checking the XML looks like this:

    my $xs   = XML::Simple->new();
    my $xmlf = './data/rsn_users.xml';
    my $ip   = 'ip' . $ENV{REMOTE_ADDR};

    my $config = $xs->XMLin($xmlf);
    my ($time, $visits);

    if(exists $config->{$ip}){
        $time   = $config->{$ip}{'time'};
        $visits = $config->{$ip}{'visits'};
    }
    else{
        $config->{$ip}{'time'}   = time()-21;
        $config->{$ip}{'visits'} = 0;
        $time   = $config->{$ip}{'time'};
        $visits = $config->{$ip}{'visits'};
    }

    if($time < time()-86400){
        $config->{$ip}{'visits'} = $visits = 0;
    }

    if($time > time()-20){
        print "Content-type: text/html\n\n";
        print qq[You reloaded too soon!<br />
        Your IP: $ip<br />Last visit: $config->{$ip}{'time'}<br />
        Number of Visits: $config->{$ip}{'visits'}];
        exit;
    }

    if($visits > 15){
        print "Content-type: text/html\n\n";
        print qq[You are over the allowed number of visits per day!<br />
        Wait 86400 seconds (one day), then reload.<br />
        Your IP: $ip<br />Last visit: $config->{$ip}{'time'}
        <br />Number of Visits: $config->{$ip}{'visits'}];
        exit;
    }

Any suggestions?

Striving for better code,
John J Reiser

Replies are listed 'Best First'.
Re: When does XML get to be too much?
by Sifmole (Chaplain) on Jan 24, 2003 at 20:38 UTC
    Comments? Yeah, use a database.

    Basically what you are trying to do is use the XML file as a flat file database, and that doesn't really appear to be a good choice of technology for the problem.

    XML is intended as a medium to allow inter-system communication of information; it creates a framework that provides the structure of the data within the data. I don't think XML is providing you any benefit here, and is a slow performing technology for the application. You would be better served by some form of database.

    When is XML too much? When you use it for the wrong thing -- much the same issue as any tool.

      I agree with Sifmole. And if you really need, for any reason, XML output of the data, nothing prevents you from writing a script that wraps the database data in XML tags and returns it.


      # Another Perl edition of a song:
      # The End, by The Beatles
      END {
        $you->take($love) eq $you->make($love) ;
      }

Re: When does XML get to be too much?
by gjb (Vicar) on Jan 24, 2003 at 20:45 UTC

    With the XML schema you have right now, I hardly see the point of using XML in the first place. For such data a tab-separated file format would do nicely. (You don't have to use XML, even if it's all the buzz right now.) Better than a tab-separated file, use something like DBD::CSV, DB_File or the like, i.e. a flat file that can be queried like a database (with SQL, in the case of DBD::CSV). That way you can migrate to a full-fledged database should the need ever arise (this would mean that you just run a query: the data is permanently available in the database and doesn't have to be read in each time the CGI script is executed).

    From an XML design point of view, I'd prefer something like

    <user ip="" visits="3" time="1043370125"/>
    since data should IMHO not go into the tag names. Tag names are only there to structure the data, not to carry information of their own.

    Just my 2 cents, -gjb-

      I know XML is a buzzword nowadays, and that's part of the reason I used it: practice. However, after I was done, I realized that it was probably a mistake.

      As for DBI... I don't know it. I read Dominus' article on using DBI, and I found it interesting, but I didn't grasp it right away. I will probably switch over to DBI and mySQL once I learn that.

      I was originally going to use a DB_File (AnyDBM_File, actually) database, but as far as I know, DBM files don't allow for HoHs (hashes of hashes). If one does, I'd switch the program over. Any recommendations?

      John J Reiser

        Wha wha what?!?!?! You haven't learned DBI yet? Unshift that item to the front of your queue and start reading! ;)

        Seriously, DBI and an RDBMS can take a programmer to a new level of programming. I recommend that any programmer out there who doesn't know how to program with a database learn how to do so very soon. The DBI part is not hard, but setting up and maintaining an RDBMS can be. Luckily, there is the wonderful SQLite and the even more wonderful DBD::SQLite, which lets you use SQLite from Perl. One of the greatest benefits of using SQLite is that you get to use SQL, and if you use DBI with DBD::SQLite, then when the time comes to migrate to a "real" RDBMS, most of what you should have to do is change the driver named in your DBI->connect call from SQLite to mysql (DBD::mysql) or Pg (DBD::Pg).
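
For anyone who has not seen DBI before, here is a minimal sketch of how the visitor log might look with DBI and DBD::SQLite. The table name `visitors`, its column names, and the in-memory database are assumptions made for illustration; none of them come from the original script.

```perl
use strict;
use warnings;
use DBI;

# Assumption: an in-memory database keeps the sketch self-contained;
# a real script would name a file on disk instead of :memory:.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1, AutoCommit => 1 });

# Hypothetical schema: one row per IP address.
$dbh->do(q{
    CREATE TABLE visitors (
        ip        TEXT PRIMARY KEY,
        visits    INTEGER NOT NULL,
        last_seen INTEGER NOT NULL
    )
});

# Placeholders (?) let DBI take care of quoting the values.
$dbh->do('INSERT INTO visitors (ip, visits, last_seen) VALUES (?, ?, ?)',
    undef, '127.0.0.1', 5, 1043370125);

# Record one more visit from the same address.
$dbh->do('UPDATE visitors SET visits = visits + 1, last_seen = ? WHERE ip = ?',
    undef, 1043370200, '127.0.0.1');

my ($visits, $last_seen) = $dbh->selectrow_array(
    'SELECT visits, last_seen FROM visitors WHERE ip = ?', undef, '127.0.0.1');
print "visits=$visits last_seen=$last_seen\n";

$dbh->disconnect;
```

Only one row is read or written per request, so the cost no longer grows with the number of visitors the way re-parsing a whole file does.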

        Do yourself a big favor and take some time out to learn these tools.


        (the triplet paradiddle with high-hat)

        "I was originally going to use a DB_File (AnyDBM_File, actually) database, but as far as I know, they don't allow for HoHs."

        Although, as gjb suggested, you can use MLDBM or you can use Data::Dumper or Storable to serialize your data structure by hand, the truth is that you don't even need an HoH in your case. You are storing the same data for each IP: visits and a timestamp. I would just store the two values in a single scalar delimited by a non-digit character of your choosing. (Consider using a comma, space, colon, semicolon, or pipe symbol as they are all commonly used for this kind of thing.) The right rule of thumb to follow in this case is "keep it simple."
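
A minimal sketch of the single-scalar idea, assuming a colon as the delimiter and using SDBM_File only because it ships with Perl (DB_File or GDBM_File tie in much the same way). The helper names `encode_record` and `decode_record` and the temporary file location are hypothetical.

```perl
use strict;
use warnings;
use Fcntl;                     # O_RDWR, O_CREAT
use SDBM_File;
use File::Temp qw(tempdir);

# Hypothetical helpers: one "visits:time" string per IP address.
sub encode_record { my ($visits, $time) = @_; return join ':', $visits, $time }
sub decode_record { my ($record) = @_;        return split /:/, $record }

# Assumption: a throwaway directory stands in for ./data.
my $dir = tempdir(CLEANUP => 1);
tie(my %limit, 'SDBM_File', "$dir/rsn_users", O_RDWR | O_CREAT, 0666)
    or die "Cannot tie DBM file: $!";

# Store both values under the IP, then read them back.
$limit{'127.0.0.1'} = encode_record(5, 1043370125);
my ($visits, $time) = decode_record($limit{'127.0.0.1'});
print "visits=$visits time=$time\n";

untie %limit;
```

Because both fields are plain integers, any non-digit delimiter is safe and no serializer is needed.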

        "My two cents aren't worth a dime.";

        I've never used it myself, but you could have a go at MLDBM, it should do what you want.

        Hope this helps, -gjb-

Re: When does XML get to be too much?
by mirod (Canon) on Jan 24, 2003 at 20:53 UTC

    I totally agree with Sifmole here: in this case you should really use a DB, which can be as simple as GDBM_File, which uses key => value indexed files, or DBD::SQLite, a lightweight DBMS which comes bundled with the module.

    I think it's time to start working on my next lightning talk, entitled "Stop using XML everywhere Damnit!".

    And yes, your tag names are really suspect. If you want to stick to XML::Simple, have a look at the keyattr option and use elements that look like <visitor ip="" ... />
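
A minimal sketch of what that might look like with XML::Simple, assuming the data is restructured into `<visitor>` elements. The sample IP addresses and values are made up for illustration.

```perl
use strict;
use warnings;
use XML::Simple;

# Assumed sample data in the suggested <visitor ip="..."/> layout.
my $xml = <<'XML';
<opt>
  <visitor ip="0.0.0.0" visits="1" time="1043393316" />
  <visitor ip="127.0.0.1" visits="5" time="1043370125" />
</opt>
XML

# KeyAttr folds the list of <visitor> elements into a hash keyed on "ip";
# ForceArray keeps that true even when only one <visitor> is present.
my $config = XMLin($xml,
    KeyAttr    => { visitor => 'ip' },
    ForceArray => ['visitor'],
);

my $entry = $config->{visitor}{'127.0.0.1'};
print "visits=$entry->{visits} time=$entry->{time}\n";
```

The lookup code stays nearly the same as before, but the IP address is now data in an attribute rather than part of an element name.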

Re: When does XML get to be too much?
by true (Pilgrim) on Jan 24, 2003 at 21:26 UTC
    XML is a bridge, not a parking garage. Bottlenecks can be a huge problem for a large log. But the problem isn't with the bridge, but the lot. I like how my Apache keeps logs. Apache will rename a large log and begin on a new log. A simple perl script could check the size of your log and rename it as needed. So access_log becomes access_log.2, access_log.3 etc. Also since a large log is something you want to avoid, XML would only add multiple bytes you don't really need to the file.
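
A rough sketch of the size check and rename described above. The helper name `rotate_if_large`, the 1 KB threshold, and the temporary directory are assumptions for illustration; this is not how Apache itself rotates logs.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: once $log exceeds $max_bytes, rename it to the
# first unused name among "$log.1", "$log.2", ... and return that name.
sub rotate_if_large {
    my ($log, $max_bytes) = @_;
    my $size = (-s $log) || 0;        # 0 if the file is missing or empty
    return unless $size > $max_bytes;
    my $n = 1;
    $n++ while -e "$log.$n";
    rename $log, "$log.$n" or die "Cannot rename $log: $!";
    return "$log.$n";
}

# Usage: write a 2 KB log into a scratch directory, then rotate at 1 KB.
my $dir = tempdir(CLEANUP => 1);
my $log = "$dir/access_log";
open my $fh, '>', $log or die "Cannot write $log: $!";
print {$fh} 'x' x 2048;
close $fh or die "Cannot close $log: $!";

my $rotated = rotate_if_large($log, 1024);
print "rotated to $rotated\n" if defined $rotated;
```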

    I'd keep it tab delimited and save tons of space. You can always convert the tab delimited file to XML later to go somewhere else.

Re: When does XML get to be too much?
by Fletch (Chancellor) on Jan 24, 2003 at 21:11 UTC

    If (Micro$oft|Oracle|Sun|...) says you need to be using XML, you probably don't need to be doing so (and check your wallet).</sarcasm>

    If you probably won't be sending the data between multiple systems, you may not need to use it. Consider having a simpler, easier to parse format for your regular data. Then make a simple => XML filter which mogrifies it for external consumption.

    Addendum: I agree with the suggestions for a DB file as being more than sufficient for the data you gave as an example. Even that might be overkill if you don't need random access to all the records; simply appending lines to a delimited text file might work just as well.
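
A minimal sketch of the append-only approach, assuming one tab-delimited line of IP, visit count, and timestamp per request. The helper names and the temporary file are hypothetical, and the sketch leaves out file locking (flock), which a CGI script serving concurrent requests would want.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Append one "ip<TAB>visits<TAB>time" line per request.
sub append_record {
    my ($file, $ip, $visits, $time) = @_;
    open my $fh, '>>', $file or die "Cannot append to $file: $!";
    print {$fh} join("\t", $ip, $visits, $time), "\n";
    close $fh or die "Cannot close $file: $!";
}

# Read the file back; a later line for the same IP replaces an earlier one.
sub load_records {
    my ($file) = @_;
    my %seen;
    open my $fh, '<', $file or die "Cannot read $file: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($ip, $visits, $time) = split /\t/, $line;
        $seen{$ip} = { visits => $visits, time => $time };
    }
    close $fh;
    return \%seen;
}

# Usage with a scratch file standing in for the real log.
my (undef, $file) = tempfile(UNLINK => 1);
append_record($file, '127.0.0.1', 1, 1043370125);
append_record($file, '127.0.0.1', 2, 1043370200);
my $records = load_records($file);
print "visits=$records->{'127.0.0.1'}{visits}\n";
```

Appending is cheap, but every lookup still reads the whole file, which is the trade-off against a DBM file.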

Re: When does XML get to be too much?
by zengargoyle (Deacon) on Jan 25, 2003 at 02:03 UTC

    Cache::FileCache can make this easy.

    # from a CGI script which times users out after an interval
    require Cache::FileCache;

    my $cache = new Cache::FileCache({
        default_expires_in  => '60 minutes',
        auto_purge_interval => '4 hours',
        auto_purge_on_set   => 1,
        filemode            => 0077,
        namespace           => 'cookies',
        username            => 'www',
    }) or die "cookie_cache\n";

    my $user = $cache->get($browser);   # logged-in user or undef

    if ( defined $user and defined $q->param('_logout') ) {
        $q->delete('_logout');
        $cache->remove($browser);       # clear cache
        print($q->p("You are no longer logged in as $user."));
        undef $user;
    } elsif ( not defined $user and defined (my $try_user = $q->param('_user')) ) {
        $q->delete('_user');
        my $try_password = $q->param('_password');
        $q->delete('_password');
        if ( $try_user =~ /^[a-z]{1,8}$/ and verify($try_user,$try_password) ) {
            $user = $try_user;
            print($q->p("Welcome back $user."));
        } else {
            print($q->p("I'm sorry, that's not right."));
        }
    }

    if (defined $user) {
        $cache->set($browser,$user);    # reset the timeout
        # do a logout form
    } else {
        # do a login form
    }
    ...

    To limit the program to running at most once per minute, set entries to expire after one minute and try to get your entry. If you can, it's been less than a minute, so notify the user and exit; if you can't get your entry, it's been more than a minute, so do your stuff and set a new entry. You can do the same for each user and give them 1 run a minute, or fewer than 5 runs per 20 minutes, without too much work.
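
The logic described above can be sketched without the cache module. This version uses a plain in-memory hash of expiry times as a stand-in for Cache::FileCache, so it shows only the get-or-set pattern; in a real CGI script the entries would have to live somewhere persistent, such as the cache or a DBM file. The names `%expires_at` and `may_run` are hypothetical.

```perl
use strict;
use warnings;

# Stand-in for the cache: key => epoch second at which the entry expires.
my %expires_at;

# Returns 1 if $key may run now (and records a fresh entry that blocks
# further runs for $interval seconds), or 0 if a live entry still exists.
# $now is optional and defaults to the current time.
sub may_run {
    my ($key, $interval, $now) = @_;
    $now = time() unless defined $now;
    return 0 if exists $expires_at{$key} && $expires_at{$key} > $now;
    $expires_at{$key} = $now + $interval;
    return 1;
}

# Usage: one run per 60 seconds for a given IP address.
for my $t (1000, 1030, 1060) {
    my $ok = may_run('127.0.0.1', 60, $t);
    print "t=$t ", ($ok ? "allowed" : "too soon"), "\n";
}
```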

(nrd) When does XML get to be too much? (thanks for the replies)
by newrisedesigns (Curate) on Jan 24, 2003 at 22:08 UTC

    Many thanks to all.

    I'm using a tab delimited flat-file for right now, and it's working just fine. My web server doesn't have MLDBM, but it does have everything for DBI & mySQL. The next step will be using that for my log. Now I just have to learn SQL. :)

    John J Reiser

Re: When does XML get to be too much?
by Aristotle (Chancellor) on Jan 26, 2003 at 00:44 UTC
    I'd definitely use DB_File for this - zengargoyle's suggestion of Cache::FileCache is not bad either, but a DBM file probably beats it in this case. Something like
    use DB_File;
    use constant PACKFMT => "LI";

    # Declared before the sub so the heredoc below can interpolate them.
    my ($ip, $time, $visits);

    # In a real file, the heredoc body and the EOT terminator start at column 0.
    sub exit_limit_enforced { print(<<"EOT"), exit }
    Content-type: text/html

    <html>
    <head>
    <title>Traffic control limit reached</title>
    </head>
    <body>
    $_[0]<br />
    Your IP: $ip<br />
    Last visit: $time<br />
    Number of Visits: $visits
    </body></html>
    EOT

    tie my %limit, 'DB_File', './data/rsn_users.dbm';

    $ip = $ENV{REMOTE_ADDR};
    ($time, $visits) = exists $limit{$ip}
        ? unpack PACKFMT, $limit{$ip}
        : (time() - 21, 0);

    $visits = 0 if $time < time() - 86400;

    exit_limit_enforced("You reloaded too soon!")
        if $time > time() - 20;

    exit_limit_enforced(
        "You are over the allowed number of visits per day!<br />" .
        "You may not visit again before " . localtime($time + 86400)
    ) if $visits > 15;

    # Record this visit: store the current time and the incremented count.
    $limit{$ip} = pack PACKFMT, time(), $visits + 1;
    untie %limit;

    Makeshifts last the longest.

Node Type: perlquestion [id://229689]
Approved by valdez
Front-paged by htoug