http://www.perlmonks.org?node_id=554988

devnul has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I find myself with a very tough problem involving lease calculations and was wondering if anyone might be able to help me figure out a solution here.

Basically I am getting a dump of data from an "old big clunky EBCDIC system". I've obviously converted this to something more ASCII-like and have removed all the non-floating-point pieces of data from each record.

The reason I've done this is because the system it comes from can only return "payoff information" for single leases entered by an operator and my client needs to know the payoff amount for every lease in the system (well over 200,000 - which would take an operator forever to accomplish), and this dump may provide the information to determine this. Also, being a proprietary system, this is about the only financial data I am able to extract from the system.

The issue I have is that I don't know fields are what -- only that by adding and substracting some combination of them, with a particular lease, I will arrive at a figure of $843.24.

I can't quite get my head around how to go about reversing engineering this formula, or even if this mountain can be moved.

Any help would, as always, be greatly appreciate.

- Devnul

Replies are listed 'Best First'.
Re: Reverse engineering a formula...
by roboticus (Chancellor) on Jun 13, 2006 at 12:09 UTC
    devnul:

    I'm not a financial analyst, but it seems to me that if you google payoff, amortization, net present value, etc., you'll get some standard formulae you can use as a starting point. Then you can use the hack and slash method, by playing with proportionality constants, etc., until you can match up the numbers, or you can also google for "curve fitting" and give it a try.

    That could be a lot of work, however. Have you considered writing a program to simulate the operator and making your perl script do the key-entry to get the data you need out of the existing machine?

    --roboticus

      roboticus speaks wisely.

      Many, many years ago, in the era of Windows 3.1 and Token Ring networks, I had a temp job at the local power company. I was given the task to transfer data from their mainframe application into a database running on the Windows network. As with the original question here, there were several thousand records to transfer and the mainframe application would only display one at a time.

      I have no idea how many hours (days!) they expected me to spend doing data entry on this task, nor how they intended to handle the errors that would surely have resulted, but it didn't prove that easy for them to keep me busy... I quickly taught the Windows macro recorder a sequence of keystrokes to copy and paste the data, then read a magazine for about an hour (getting many dirty looks from passers-by) and reported the task complete.

      And on that day, I learned both the wisdom of simulating a human operator to carry out repetitive tasks and the joy of suffering the reactions of others to one who truly "works smarter, not harder".

Re: Reverse engineering a formula...
by Gilimanjaro (Hermit) on Jun 13, 2006 at 12:54 UTC

    The following should work; I haven't had the guts to run it on your real data yet, but my mini-testset seems to result in an answer...

    #!/usr/bin/perl =cut my @t = ( '18.930167', '17.967469', '0.008720', '0.008720', '122.640000', '12493.320000', '359.520000', '288.700000', '359.520000', '89.880000', '32.960000', '56.920000', '13.470000', '1231.360000', '20587.440000', '359.520000', '792.170000', '629.160000', '972.290000', ); my $target = 843.24; =cut my @t = (100,2,32,4,65); my $target = 7; my @terms = _test($target,@t); for my $n (0..$#terms) { next unless $terms[$n]; print (($terms[$n]<0)?'subtract':'add'); print " value number ",$n+1," (",$t[$n],")"; print "\n"; } print "to obtain total of $target\n"; sub _test { my ($target,$number,@rest) = @_; for my $term (0,$number,-$number) { if(@rest==0) { next unless $target == sprintf('%.2f',$sum+$te +rm); return ($term/$number) } my @terms = _test(sprintf('%.2f',$target-$term),@rest) +; next unless @terms; return (($term/$number),@terms); } return; }

    It's a recursive algorithm that will eventually try every possibility until it finds a working set of terms... It will probably take a long time with the real dataset, so you'll want to eliminate as many terms as possible beforehand.

Re: Reverse engineering a formula...
by Moron (Curate) on Jun 13, 2006 at 11:57 UTC
    I remember when at school thinking that the intelligence tests of the form "what is the next number in the sequence" only served to prove that the questioners weren't intelligent enough to be testing other people's intelligence.

    The reason is that for example, the next number in the sequence 1,2,3,4,5, ? can be anything you like (even -7 x phi is as good an answer as any). Worse still, there are an uncountably infinite number of different formulas that can "justify" the sixth number being -7 x phi.

    You also need an infinite number of examples to overcome that problem -- even if you knew all 200000 answers, they wouldn't help you with the 200001th for the same reason.

    Your only recourse is to interview whoever has the business knowledge behind the formula.

    -M

    Free your mind

      I can argue that, given more precise question, a sort of tasks about sequences like "1,2,3,4,5,?" makes perfect sence.
      If you're referring to set of functions such that
      f(1) = 1 f(2) = 2 f(3) = 3 f(4) = 4 f(5) = 5
      and then calculate f(6) then indeed there surely exists infinite number of different f's conforming these conditions and yet f(6) equals to -7 x phi.

      But when you limit answers to functions having only natural numbers, of special kind (let that be polynoms with integral coefficients, with the largest power *no* more than three, then one and only one function fulfills the condition, and this is f(x)=x, and the correct answer is 6.)

      Like everywhere else in mathematics, very many things depend on exact phrasing.

      addition but I agree that OP has very badly phrased task

      update 2+1 small typos

        Consider the case of y = x + r*cos(pi*x) where r is any real, which produces a sine wiggle around a straight line, where the deviation from the straight line is maximum for half-values of y, but for integer values produces the illusion of y=x.

        Update oic, you want to limit the types of functions - but without the business knowledge how can you do that? We know it's leasing, but without knowing the business, how do you know it doesn't have a risk component linked to a normal distribution based on the customer's age? That wouldn't be polynomial.

        More update: yes fletch is right, s/cos/sin/ - my brain did the equivalent of a double negative when I was imagining this.

        -M

        Free your mind

      With apologies for the thread drift:

      Worse still, there are an uncountably infinite number of different formulas that can "justify" the sixth number being -7 x phi.

      Surely a "number of different formulas" is an integer, and therefore countable?

      Polonius
        You have a countable and finite number of intersections between a candidate function and the sample data, but the candidates may vary across real space in between the intersections, making them uncountable in number.

        -M

        Free your mind

      What you say is certainly true in the abstract, but real systems, whether they be physical or financial, tend to follow relatively simple (albeit noisy) models. But I don't think the OP's system is even a real system; I think it's likely to be a very simple model of a system, probably involving exponential growth.

      But all of this is guesswork. As others have said, the OP would probably get a great deal more help if only he would supply more data.

      Polonius
Re: Reverse engineering a formula...
by BrowserUk (Patriarch) on Jun 13, 2006 at 10:50 UTC

    Do you only have the one set of known data? As with any form of decoding, the more examples you have, the easier (and more reliable), it becomes.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I can lookup other leases, if I am able to find a formula that works with this sequence of numbers.

      - Devnul

        With (say) half a dozen sets of data + the target value, it might be possible to exclude some of the numbers thereby reducing the search space.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Reverse engineering a formula...
by samizdat (Vicar) on Jun 13, 2006 at 13:51 UTC
    devnul, you didn't tell us if the individual reports gave you any clues that you might be able to match to the dump from the same data set record.

    Failing that, a combination of recursive brute force and success pruning (aka 'genetic') is the answer. I'm hoping you can assume that the data in the variables does have a constant format, so that once you find a solution that works for one set (within rounding deltas), it should work for every set. Where you'll have problems is if there are late payments with penalties and interest charges which are only stored in aggregate totals. That possibility would increase the scary factor immensely. :(

    Best of luck; this looks like a case of Interesting TimesTM.

    Don Wilde
    "There's more than one level to any answer."
Re: Reverse engineering a formula...
by pboin (Deacon) on Jun 13, 2006 at 20:52 UTC

    First of all, 'clunky old EBCDIC' systems still run most of the financial world. Companies like BankofAmerica, Amex, and Visa seem to think they're pretty OK.

    You're very likely fighting the system, and not using it the way it's designed. OS/390 applications are very batch-centric, sort of similar to the way Unixy things are pipe-centric. I can practically guarantee you that there is a batch interface to what you want to do. Gen one job, or one step per lease and let 'em rip. Google for 'IBM JCL' to get started.

    Finally, make sure your managment (and you) understand that you're just making up the lease calc if you go that route. This is money you're dealing with, and money mistakes tend to piss people off in a way that nothing else does. Be *very* careful to have your work audited by a bean-counter who knows what they're doing.

    Good Luck.

Re: Reverse engineering a formula...
by bobf (Monsignor) on Jun 14, 2006 at 02:58 UTC

    Instead of trying to determine the formula that was used for a (hopefully) complete set of numbers with unknown meaning, why don't you turn the problem around?

    Ask the client create a new dummy lease with known input data. Tweak one (and only one) of the parameters and create another new lease. Repeat...

    Given enough dummy data (known inputs and outputs) you should very quickly be able to determine which numbers are meaningless, which are significant, how each number affects the final value, and which numbers in the dumped data you need. Add a little algebra to get the formula, and you're done. This approach could be quite a bit simpler than trying to brute-force it. :-)

Re: Reverse engineering a formula...
by Skeeve (Parson) on Jun 13, 2006 at 10:21 UTC
    With the amount of information you gave, the only formular I can easily think of ;) is: 200,000 / 237.180

    Or in other words: How can you expect us to help you with this lack of information?

    s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
    +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
      I was trying to keep my question "theoretical". But here's the numbers if you really want them:

      $VAR1 = [ '18.930167', '17.967469', '0.008720', '0.008720', '122.640000', '12493.320000', '359.520000', '288.700000', '359.520000', '89.880000', '32.960000', '56.920000', '13.470000', '1231.360000', '20587.440000', '359.520000', '792.170000', '629.160000', '972.290000', ];
      - Devnul

        How's this for a guess? A 24-month lease

        An initial price of 12,493.32 A first payment of 1,231.36 + 23 monthly payements of 843.24 ===================================== Total cost (- tax relief) 20,587.44 Daily interest rate 0.008720 Annual Equivalent rate 17.967469 Annual effective rate (inc. one off payments) 18.930167

        Other numbers include one-off admin charges and daily/weekly effective costs?


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        Well, look at it this way:

        • You have 19 floats
        • Each float may:
          • Need to be added (multiply by 1)
          • Need to be subtracted (multiply by -1)
          • Need to be ignored (multiply by 0)

        This means there are 3 to the power 19 different ways to combine these numbers... That's a lot of possible combinations to try, but it can be done I suppose... You'd probably want to write a recursive algorithm to try all these permutations, and see if the rounded total matches the number you specified...

        Update: Possible solution code posted below

        That will never give the sum you told us, because of 18.930167 and 17.96746

        You will never get aything with just 2 decimal digits behind the decimal point with those 2 numbers.

        s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
        +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
Re: Reverse engineering a formula...
by salva (Canon) on Jun 13, 2006 at 15:34 UTC
    for a completelly different kind of solution you can try using a neural-network package.

    First you would need to generate some sample data, feeding some randomly distributed values into your "blackbox" and getting the output.

    Then use the obtained set of input/outputs to train the neural-network and once the error goes below your acceptable limit, you could use it to replace the original blackbox.

Re: Reverse engineering a formula...
by devnul (Monk) on Jun 13, 2006 at 12:38 UTC
    Thanks everyone!

    The answers (and, indeed, the questions you've asked) have helped direct my thinking in the right area on this, which is exactly what I was looking for.

    Thank you, again....

    - Devnul
Re: Reverse engineering a formula...
by Scott7477 (Chaplain) on Jun 13, 2006 at 18:33 UTC
    If you could provide the values for each of the fields that are associated with the $843.24 payoff amount, that would be helpful. The values that you would see in a lease calculation would typically include the original price of the item, the residual value, the interest rate, the lease term, and perhaps the monthly payment. The range of values for each of those value types can be constrained rather easily. Obviously, the interest rate is going to be a number somewhere between 5 and maybe 30 percent at the outside. The lease term will be either in months or years most likely, so you know that that is going to be a small integer.

    Your problem here is one of a type that always makes my blood boil when I hear about them. Regardless of whether a system is proprietary or not, a user ought to be able to retrieve in a simple fashion every piece of data that has been put into the systems. So many of the big name ERP systems make it really easy to put a lot of data in, but then they want to charge you big bucks for "reporting tools" just so you can see what's actually in the system. Every single table should be able to be dumped as a text file or CSV file or something. That's my rant for the day:)
System of Equations
by kbrint (Sexton) on Jun 13, 2006 at 20:40 UTC
    Is the equation the same for all data records? If so, then you can just take 19 separate entries and create a system of 19 variables and 19 equations. For example, the first equation would be:

    a*18.930167 + b* 17.967469 + c* 0.008720+...+s* 972.290000 = 843.24

    After you have 19 or so of those equations, you should be able to solve for a,b,c,d,...

    Based on the info you provided, if you get numbers other than {1,0,-1} then you know the equation isn't the same for all entries.

      you are assuming a linear equation that even if it is what is described on the OP, it's very unusual for a lease formula and probably wrong.