Machine learning applications - Credit card fraud detection
Learning machine learning and applying it can be completely different things. When we learn machine learning, we get to know the different algorithms, their hyperparameters, the types of problems they are best suited for, the mathematics behind them, and so on. Applying machine learning to a business problem, though, is hardly about the algorithm. Rather, it involves solving harder problems that one rarely gets to learn about in academia. Let’s look at an example using a problem where machine learning is frequently applied: credit card fraud detection.
The credit card fraud detection problem has a very simple premise: given a host of attributes about a credit card transaction, predict whether it is fraudulent. This could be a binary classifier that takes in a transaction and gives the probability of it being fraudulent. It’s important to achieve both reasonably good precision and reasonably good recall for such a use case. Why? Low precision here would mean that we block too many legitimate transactions by identifying them as fraudulent, whereas low recall would mean that too many fraudulent ones escape the algorithm. How do we know what a “reasonably good” metric is? It really depends on the cost of false positives and false negatives.
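To make the two metrics concrete, here is a minimal sketch of how they are computed; the counts below are made up purely for illustration.

```python
# A quick refresher on precision and recall for a fraud classifier.
# The counts are illustrative, not from any real dataset.
true_positives = 80    # transactions flagged as fraud that really were fraud
false_positives = 240  # legitimate transactions we wrongly flagged
false_negatives = 50   # fraudulent transactions we failed to flag

precision = true_positives / (true_positives + false_positives)  # how many of our blocks were justified
recall = true_positives / (true_positives + false_negatives)     # how much of the fraud we caught

print(f"precision = {precision:.2f}, recall = {recall:.2f}")
# precision = 0.25, recall = 0.62
```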
Reasonably good precision
Let’s say that the profit per sale is $5, and the chargeback fee is $20 (the chargeback fee is the penalty imposed on the merchant by the card issuer for a fraudulent transaction reported by the customer). Also, let’s say our algorithm has achieved a precision of 0.25. This means that out of every four transactions that we identified as fraudulent, only one is actually fraudulent and the remaining three are legitimate. Although this looks like a very low precision, it is actually profitable for the merchant. Why? The three legitimate transactions blocked by the algorithm cost the merchant $15 in lost profit ($5 × 3), whereas the one fraudulent transaction blocked by the algorithm saved $20 in chargeback fees, for a net gain of $5.
Consider another extreme example where the profit per sale is $100, the chargeback fee is still $20, and the algorithm achieves a precision of 0.80. This means that out of every five transactions that we identified as fraudulent, four are actually fraudulent. That may look like good precision, but if you work out the costs, the one legitimate transaction that we blocked cost us $100 in lost profit, whereas the four fraudulent transactions that we blocked saved us just $80 in chargeback fees. Thus, we need the precision to be higher for the algorithm to actually be useful.
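The break-even point falls directly out of these two numbers: blocking is worthwhile only while the expected chargeback saved exceeds the expected profit lost. Here is a rough sketch of that calculation using the figures from the two examples above; the function names are my own.

```python
# Break-even logic for blocking flagged transactions, using the figures
# from the two examples above.
def net_benefit_per_blocked_txn(precision, profit_per_sale, chargeback_fee):
    """Expected gain (or loss) from blocking one flagged transaction."""
    saved = precision * chargeback_fee         # chargebacks avoided on real fraud
    lost = (1 - precision) * profit_per_sale   # profit lost on legitimate sales we turned away
    return saved - lost

def break_even_precision(profit_per_sale, chargeback_fee):
    """Precision at which blocking flagged transactions stops losing money."""
    return profit_per_sale / (profit_per_sale + chargeback_fee)

print(net_benefit_per_blocked_txn(0.25, profit_per_sale=5, chargeback_fee=20))    # +1.25 per block -> profitable
print(net_benefit_per_blocked_txn(0.80, profit_per_sale=100, chargeback_fee=20))  # -4.00 per block -> losing money
print(break_even_precision(100, 20))                                              # ~0.83 precision needed
```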
Measuring model performance in production
Let’s say we trained an algorithm through extensive feature engineering, feature selection and optimization of hyperparameters such that it attains a reasonably good level of precision and recall. When we implement the algorithm in production, how do we measure the precision and recall? This is tricky, because when we block a transaction because the algorithm predicted it to be a fraud, we don’t get to observe the “real” outcome. Thus, if there is no way to know the actual set of transactions which are indeed fraudulent, how can we do model evaluation to know whether our algorithm is performing well or not?
There is another issue as well. Say the performance of the algorithm is satisfactory in production, i.e., we don’t receive too many complaints from the customers about fraudulent transactions. Can we use the data that we ran through the algorithm to train it again? No, because now the fraudulent transactions that we see in the data will be the ones that the algorithm DID NOT identify, since the transactions identified by the algorithm would have been blocked.
Counterfactual evaluation
This is common to any problem where the machine learning algorithm intervenes in the outcome it is supposed to predict. Stripe, the popular payments processing platform, implemented a very interesting solution to tackle this. Out of all the transactions that their algorithm identified as fraudulent, they pick a random sample (say 5%) and let them go through. Out of these transactions, if x% turn out to be actually fraudulent, then x becomes the estimate of the precision in production!
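A minimal sketch of that estimate, assuming we record the eventual outcome (for example, a chargeback) of each flagged transaction we let through; the function name is my own, not Stripe’s.

```python
# Estimate production precision from the flagged transactions that were
# randomly let through. Illustrative sketch only.
def estimate_production_precision(sampled_outcomes):
    """sampled_outcomes: booleans for the randomly sampled flagged
    transactions that were let through (True = later confirmed fraudulent)."""
    return sum(sampled_outcomes) / len(sampled_outcomes)

# e.g. if 4 out of every 5 sampled transactions turn out to be fraud:
print(estimate_production_precision([True, True, True, True, False]))  # 0.8
```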
What about recall, though? We were able to estimate precision from a small sample, but we can’t do the same for recall directly, since that requires knowing the total population of actually fraudulent transactions. Let’s get a bit technical and try to calculate it.
Assume the total number of transactions in our analysis is 100,000, and that we identified 10% of them as fraudulent. This means we blocked 10,000 transactions and let 90,000 through. Now, say we let 5% of the 10,000 go through (500 transactions), and 400 of them turn out to actually be fraudulent. We can extrapolate this to all 10,000 blocked transactions by multiplying by 20, the inverse of the 5% sampling rate. Thus, we estimate that 8,000 of the 10,000 blocked transactions would have actually been fraudulent (true positives). Also, suppose that out of the 90,000 transactions the algorithm did not flag, 5,000 turned out to be fraudulent (false negatives). Recall can then be estimated as 8,000 / (8,000 + 5,000) = 61.5%.
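The same estimate, written out as a short calculation with the numbers from the example above:

```python
# Recall estimate from the counterfactual sample (numbers from the example above).
blocked = 10_000              # transactions the model flagged as fraudulent
holdout = 500                 # the 5% of flagged transactions let through anyway
fraud_in_holdout = 400        # of those, how many actually turned out to be fraud
fraud_missed = 5_000          # fraud later discovered among the 90,000 unflagged transactions

scale = blocked / holdout                             # 1 / 0.05 = 20
estimated_true_positives = fraud_in_holdout * scale   # 8,000
estimated_recall = estimated_true_positives / (estimated_true_positives + fraud_missed)
print(f"estimated recall = {estimated_recall:.1%}")   # 61.5%
```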
Training
Let’s dig into this a little bit more. How are the fraudulent transactions actually identified? A classification algorithm usually does not give a black-or-white decision on whether a transaction is fraudulent; it gives a probability between 0 and 1 indicating the likelihood that it is. Using this probability, we have to determine whether the transaction is fraudulent based on whether the score is above or below a certain threshold. For example, if we set the threshold at 0.5, then a transaction is identified as fraudulent if its score is above 0.5, and legitimate otherwise. Out of all the transactions where the score is above 0.5, we let a small percentage go through. Let’s call this the pass-through rate. The graph below depicts a pass-through rate of 5%.
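As a sketch, the decision policy might look like the following; the threshold and pass-through rate are the illustrative values used above, and the function name is my own.

```python
import random

# Route a transaction based on its fraud score: flag it if the score is
# above the threshold, but let a small random fraction of flagged
# transactions through so their real outcome can be observed.
def route_transaction(fraud_score, threshold=0.5, pass_through_rate=0.05):
    if fraud_score <= threshold:
        return "allow"    # not flagged as fraudulent
    if random.random() < pass_through_rate:
        return "allow"    # flagged, but let through for counterfactual evaluation
    return "block"        # flagged and blocked
```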
How do we use the data that has gone through the algorithm in production for improving the model? Based on the output of the algorithm and the likelihood of fraud, we would have either let the transaction go through or blocked it. Since we are able to observe the actual outcomes for all the transactions that we let go through, these are the ones we can use to iteratively train our model and improve it.
There is one important consideration, though. If our pass-through rate was 5%, then for every transaction that we identified as fraudulent but let go through in order to observe the outcome, there were 19 others that we blocked. Since we cannot include blocked transactions in our training (we didn’t observe their outcomes), we account for them by assigning appropriate weights to the transactions that we let through. Assigning a weight to a record amplifies the loss arising from the prediction for that record by a factor equal to that weight. This weight should be the inverse of the pass-through rate.
Sample weight = 1 / Pass-through rate
Thus, for a pass-through rate of 5%, the sample weight will be 1 / 0.05 = 20.
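As a sketch of what the retraining step could look like, assuming a scikit-learn style estimator (whose fit() accepts a sample_weight argument); the data here is random placeholder data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Retrain on the transactions we actually observed, weighting the
# flagged-but-let-through ones by the inverse of the pass-through rate.
PASS_THROUGH_RATE = 0.05

X = np.random.rand(1000, 10)               # placeholder features for observed transactions
y = np.random.randint(0, 2, 1000)          # observed outcome (1 = turned out to be fraudulent)
was_flagged = np.random.rand(1000) < 0.1   # True for flagged transactions that were let through

# Each flagged-but-let-through transaction stands in for the ~19 blocked
# ones behind it, so it gets weight 1 / pass-through rate = 20.
sample_weight = np.where(was_flagged, 1.0 / PASS_THROUGH_RATE, 1.0)

model = LogisticRegression(max_iter=1000)
model.fit(X, y, sample_weight=sample_weight)
```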
Improving model evaluation
Now, recall the methodology described in the previous section for estimating precision and recall in production. When we identify a transaction as fraudulent, i.e., its score is greater than the threshold, we let a small percentage of such transactions go through. But the transactions that score close to 1 are the ones where the algorithm is quite sure that they are fraudulent, whereas the ones that score close to the threshold are the ones where there is some ambiguity. Thus, the evaluation may work better if we let through a larger fraction of the transactions whose scores are above the threshold but close to it. This implies that the pass-through rate will vary with the score above the threshold, dropping to zero as the score approaches 1.
Why would this “tweak” result in better evaluation scores? It is because it helps us observe the outcomes for the cases where the algorithm is most “uncertain”. If the algorithm scores a transaction at 0.95, there is a high probability that it is fraudulent - we gain almost no information by letting that transaction go through and observing the real outcome. But if we let a transaction with a score of 0.55 go through, we learn much more by observing its outcome and using it in subsequent training loops.
The training methodology will be the same as before, except the sample weights will not be uniform for all the transactions that we let through. If we are able to map each score to a pass-through rate (as shown in the graph above), then each transaction that we let through will have a separate pass-through rate, and consequently, a separate sample weight.
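A minimal sketch of what a score-dependent pass-through rate and the corresponding weights could look like; the linear taper and the 10% ceiling are arbitrary choices for illustration, not the scheme Stripe uses.

```python
# Score-dependent pass-through rate: highest just above the threshold,
# tapering to zero as the score approaches 1.
def pass_through_rate(score, threshold=0.5, max_rate=0.10):
    if score <= threshold:
        return 1.0    # not flagged, always allowed through
    return max_rate * (1.0 - score) / (1.0 - threshold)

def sample_weight(score, threshold=0.5, max_rate=0.10):
    """Training weight for a transaction that was actually let through
    (the rate is non-zero for any such transaction)."""
    return 1.0 / pass_through_rate(score, threshold, max_rate)

print(pass_through_rate(0.55), sample_weight(0.55))  # ~0.09 -> weight ~11
print(pass_through_rate(0.95), sample_weight(0.95))  # ~0.01 -> weight ~100
```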
I hope this post gave you some insight into the kinds of problems organizations face when running machine learning in production, problems you may not encounter while educating yourself about machine learning.
The methodology explained in this post was presented by Stripe at PyData — the video can be found here: https://www.youtube.com/watch?v=QWCSxAKR-h0