Interpretable Machine Learning

About

Machine Learning (ML) is quickly becoming ubiquitous in banking for both predictive analytics and process automation applications. However, banks in the US remain cautious in adopting ML for high-risk and regulated areas such as credit underwriting. Among the key concerns are ML explainability and robustness. To address the inadequacy of post-hoc explainability tools (XAI) for high-stakes applications, we developed inherently interpretable machine learning models, including deep ReLU networks and functional ANOVA-based ML models. Model robustness is a key requirement, as models are subjected to constantly changing environments in production. A conceptually sound model must function properly without continuous retraining and remain invariant under a changing environment. Recently, we released the PiML (Python Interpretable Machine Learning) package as a tool to design inherently interpretable models and to test machine learning robustness, reliability, and resilience.

Agus Sudjianto

Executive Vice President, Head of Corporate Model Risk, Wells Fargo & Company

Transcript

Steve: 0:00 Okay, welcome back to our TechShares. This is our fourth session for the day and our final session, and we're privileged to sit down with Agus Sudjianto, who is an executive vice president and head of corporate model risk for Wells Fargo; we're going to discuss interpretable machine learning. A little bit about Agus: he's Executive Vice President and Head of Model Risk and a member of the Management Committee at Wells Fargo, where he is responsible for enterprise model risk management. Prior to his current position, Agus was Modeling and Analytics Director and Chief Model Risk Officer at Lloyds Banking Group in the United Kingdom. Before joining Lloyds, he was an executive and head of quantitative risk at Bank of America. Prior to his career in banking, he was a product design manager in the powertrain division of Ford Motor Company. Agus holds several US patents in both finance and engineering. He has published numerous technical papers and is a co-author of Design and Modeling for Computer Experiments. His expertise and interests include quantitative risk, particularly credit risk modeling, machine learning, and computational statistics. He holds master's and doctorate degrees in engineering and management from Wayne State University and the Massachusetts Institute of Technology (MIT). So welcome, Agus, we are very interested to learn more about interpretable machine learning, and I will turn the time over to you to get started.

Agus: 1:50 Thank you so much. Thank you, Steve. What I would like to do today is share our view and our needs in a highly regulated industry, where we apply models for various purposes. We have a few thousand models in our inventory that we run day in and day out to do many, many things in the bank; banks are basically run by models. And in our environment, a lot of our applications are what we call high-risk applications: they have significant impact if the models are wrong, including impact on our customers. As part of this discussion, I'm going to share our approach to using interpretable machine learning and how to test machine learning for real production in a high-risk environment. We have a tool that we have published; it is on GitHub and it's called PiML, Python Interpretable Machine Learning. If you Google "PiML toolbox," it should take you to the GitHub page where you can download the tool and try it; it has a lot of material that we present through workshops, etc. So that's what I am going to share with you today. Let's see if I can control this successfully.

I talked about model risk. We believe that all models are wrong, and when they are wrong, they create harm or unintended consequences; they harm the institution, the model user, or the customer. In banking, when financial models go wrong, we create financial harm, be it credit loss, market loss, or liquidity problems; the crisis of 2008 is the example, where problematic models created all the financial harm that those of you who are old enough experienced. But there is also non-financial harm: reputational, compliance, or legal harm, particularly when a model is discriminatory toward a protected class and can violate laws or legal requirements. So we care a lot about what can go wrong with models, and we have a very, very rigorous process for preventing and managing that.

Let me give you a quick example of the situation. We all develop models and we deploy models. The environment a model is deployed in is very dynamic, so the development environment is not the same as the deployment environment: the data you use for model development has a different distribution than the data in deployment. Inputs can shift and drift, and concepts can shift and drift as well. So the best model in development can be the worst model when we deploy. A few questions follow from this: can we rely on AutoML? Is frequent retraining the answer because the environment changes? I would say the answer to both is no, but let's go through a few examples and see how the tool I'm going to share with you tries to address this.

This is the problem with machine learning: machine learning models are overly parameterized, with up to millions of parameters depending on the size of the model. So in machine learning we have a phenomenon called model multiplicity: very different models can have similar performance. Some people call it the Rashomon effect; I think the great Leo Breiman, the inventor of random forests, called it the Rashomon effect. Models can have very similar output, yet be very, very different models. (A small sketch of this multiplicity effect follows below.)
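As a rough illustration of model multiplicity, here is a minimal editorial sketch using scikit-learn on synthetic data (not part of PiML or the talk; the dataset and model choices are arbitrary): two structurally very different models reach nearly the same holdout AUC, yet disagree on a nontrivial fraction of individual predictions.

```python
# Rashomon effect / model multiplicity sketch (illustrative only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=5000, n_features=20, n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

gbm = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
mlp = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=1000, random_state=0).fit(X_tr, y_tr)

auc_gbm = roc_auc_score(y_te, gbm.predict_proba(X_te)[:, 1])
auc_mlp = roc_auc_score(y_te, mlp.predict_proba(X_te)[:, 1])
disagree = np.mean(gbm.predict(X_te) != mlp.predict(X_te))

print(f"GBM AUC: {auc_gbm:.3f}")
print(f"MLP AUC: {auc_mlp:.3f}")
print(f"Fraction of test points where the two models disagree: {disagree:.2%}")
```

The aggregate metrics are close, but the two models are not the same model, which is exactly the selection problem Agus describes next.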
The other problem in machine learning is what we call benign overfitting. Basically, all machine learning models overfit, but the overfitting doesn't show up: the steep bias-variance trade-off you see in traditional models doesn't appear, and more complex models can still perform very well. I give an example on the right-hand side here. The horizontal axis is model complexity: for a gradient boosting machine, that is the tree depth and the number of boosting rounds; for deep learning, it is the number of nodes per hidden layer and the number of layers. The more complex the model, the better it can fit; in fact, the training error can reach zero. And surprisingly, on the testing data you see no overfitting for the more complex models, or only very, very little. Now the problem is: we have a lot of very different models, so which one do we want to choose? And is this benign overfitting really not a problem?

It turns out that during model deployment, when the distribution changes and the inputs drift, this benign overfitting actually becomes harmful. This is why it is very common that when you develop a model and then deploy it, the performance degrades very, very fast. This is a very, very common problem in machine learning: the real-world performance is way off from the performance seen during development because of this benign overfitting problem. When we do model development, we simply split training and testing data; because we split randomly, the test data comes from the same distribution as the training data, so we will not detect any overfitting. Only when the distribution drifts during model use do you start seeing very dramatic model degradation. So this is the machine learning robustness and reliability problem that we face in a high-risk environment: we want to build models that are really robust, so that when we have distribution drift the model still performs as well as it can. That's one of the keys. (A short sketch of this train/test/drift pattern follows below.)
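A minimal, hedged sketch of that pattern (scikit-learn on synthetic data, not the experiment from the slides): as model complexity grows, training error collapses while the i.i.d. test error stays flat, but error on a drifted test set tends to grow. The data-generating process and the amount of shift are arbitrary illustrative choices, so exact numbers will vary.

```python
# Benign overfitting vs. distribution drift sketch (illustrative only).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

def make_data(n, shift=0.0):
    # Nonlinear ground truth with noise; `shift` moves the input distribution.
    X = rng.normal(loc=shift, scale=1.0, size=(n, 5))
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + X[:, 2] + rng.normal(scale=0.3, size=n)
    return X, y

X_tr, y_tr = make_data(4000)
X_iid, y_iid = make_data(2000)                  # same distribution as training
X_drift, y_drift = make_data(2000, shift=1.0)   # inputs shifted by one standard deviation

for depth in [1, 3, 5, 8]:                      # increasing model complexity
    m = GradientBoostingRegressor(max_depth=depth, n_estimators=300, random_state=0).fit(X_tr, y_tr)
    print(f"depth={depth}  "
          f"train MSE={mean_squared_error(y_tr, m.predict(X_tr)):.3f}  "
          f"iid test MSE={mean_squared_error(y_iid, m.predict(X_iid)):.3f}  "
          f"drifted test MSE={mean_squared_error(y_drift, m.predict(X_drift)):.3f}")
```

The point is that a random train/test split alone cannot reveal how the candidate models will rank once the inputs drift.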
The other issue is unsound models. Because of multiplicity, some models simply don't make sense: if we understand the input-output relationship, they don't make sense, yet they can still have very good performance. To deal with that, model explainability is very, very important. The field of XAI, explainable AI, is dedicated to coming up with techniques that can open up a black box. But those XAI techniques, what we call post-hoc explainers, can easily be wrong. So in our environment we use what we call inherently interpretable models: still very, very complex models, still deep learning, still gradient boosting machines, but inherently interpretable because we control the architecture. I'll come back to that in a little bit.

I spoke about robustness of the models. Judging a model the way AutoML practice does, based on a single performance metric such as AUC, accuracy, or MSE, is clearly, clearly not sufficient for the real world. We have to test models beyond simple accuracy: we have to understand overfitting, model robustness, model reliability, and resiliency. That is the key component of what we call model validation. When we test a model, we look at conceptual soundness and at outcome analysis. For conceptual soundness, we look at overfitting, causality, explainability, and interpretability. I am making a distinction between explainability and interpretability: explainability means using post-hoc explainers; interpretability means building a model that is inherently interpretable.

Then in the outcome analysis, we test with error analysis, understanding the weak spots, the regions where the model is weak. We test for reliability: how reliable is the prediction, and what is the confidence in that prediction? Points where the prediction interval is wide are less reliable, so we would like to build models whose prediction uncertainty is narrower; that is why testing for reliability is important. Bias and fairness: that the model is fair and not discriminatory toward a protected class is very important. Robustness is very critical for us, because when a model is deployed it is subjected to noise in the inputs, so we need to make sure the model is not sensitive to noise and will still perform well. And then resiliency: under data drift, big distribution drift, will the model still perform or not? This is all the functionality that we put in PiML. Causality and fairness will be in release 0.4, which is at the end of the month.

So PiML basically has two sets of functionality. One is model building: how to build models that are inherently interpretable. The other is model diagnostics: reliability, robustness, resiliency, and so on. Visit the GitHub page and look at what is there. Let me go a little bit into interpretable models, and then we can open it up for questions and answers.

When we talk about explainable machine learning, we talk about two streams. One is designing and building models that are inherently interpretable; the other is taking a black-box model and applying post-hoc explainability. When people talk about XAI today, it is usually the latter: a black-box model with post-hoc explainability applied, and we will talk about the problem with that approach. The other stream is to build sophisticated models, still gradient boosting machines and deep neural networks, but make them inherently interpretable, so we do not need post-hoc explainability.

Here is an example of the problem with post-hoc explainers: can we trust them? People use LIME, SHAP, and all kinds of techniques on faith today, without proof that those techniques really work; unfortunately, the more complex the model, the more often those techniques don't really work. So I give an example here using PiML. You can test this yourself, experience it yourself, when you use PiML and open it up and test it on data; there are also examples in PiML. Let me pick SHAP, because SHAP is the most popular, together with a ReLU DNN. A ReLU DNN we can make inherently interpretable locally, because the ReLU activation is piecewise linear, so for every point you feed into a ReLU network, the network degenerates into a local linear model and we can get the exact local linear model. This only works for the ReLU, or any piecewise-linear, activation function; you cannot do it for other activation functions. And because you have an exact local linear model, just like regression, you have exact coefficients. Here is the one example that I have, which you can try yourself using PiML. In this case, for this data point, latitude and longitude come out as the most important; this is the California house price data. But the actual model has longitude and median income as the most important variables for that point, so post-hoc explainability tools can be wrong. (A sketch of the exact local linear extraction follows below.)
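Here is a minimal sketch, in plain NumPy and scikit-learn rather than PiML, of the local linear extraction Agus describes: because a ReLU network is piecewise linear, the weights together with the pattern of active units at a given point yield the exact local slope and intercept. The network size and preprocessing are arbitrary illustrative choices, not the model from the talk.

```python
# Exact local linear model of a ReLU network at a single point (illustrative sketch).
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

data = fetch_california_housing()
X = StandardScaler().fit_transform(data.data)
y = data.target

net = MLPRegressor(hidden_layer_sizes=(40, 40), activation="relu",
                   max_iter=500, random_state=0).fit(X, y)

def exact_local_linear(net, x):
    """Exact local linear model (slope vector, intercept) of a ReLU MLP at point x."""
    a = x.copy()
    W_eff = np.eye(len(x))           # cumulative affine map: f(x) = x @ W_eff + b_eff
    b_eff = np.zeros(len(x))
    for W, b in zip(net.coefs_[:-1], net.intercepts_[:-1]):
        z = a @ W + b
        mask = (z > 0).astype(float)         # pattern of active ReLU units at x
        W_eff = W_eff @ (W * mask)           # inactive units simply zero out columns
        b_eff = b_eff @ (W * mask) + b * mask
        a = np.maximum(z, 0.0)
    W_out, b_out = net.coefs_[-1], net.intercepts_[-1]   # linear output layer
    return (W_eff @ W_out).ravel(), (b_eff @ W_out + b_out).item()

x0 = X[0]
w, b = exact_local_linear(net, x0)
print("local linear prediction:", x0 @ w + b)
print("network prediction:     ", net.predict(x0[None, :])[0])   # should match
for name, coef in sorted(zip(data.feature_names, w), key=lambda t: -abs(t[1]))[:3]:
    print(f"top local coefficient {name}: {coef:+.4f}")
```

Because these coefficients are exact for the linear region containing the point, they provide a ground truth that a post-hoc explainer such as SHAP can be checked against, which is the comparison described in the talk.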
So that's why, in our case, we really look to inherently interpretable models for high-risk applications. Here is an example of an inherently interpretable model; you can read our publication in Pattern Recognition on this, and you can visit the GitHub page to get all of this as well. What makes it inherently interpretable is basically that we control the architecture: each individual variable is fed into its own subnetwork (if you use gradient boosting machines, you can feed it into its own gradient boosting machine), or a pair of variables enters a subnetwork together. So you can see the effect of individual variables very clearly, because you can plot it, and you can see the effect of interactions very clearly and exactly. So it is really inherently interpretable by design: we are using deep learning or gradient boosting machines in a way that is inherently interpretable.

Here is an example of robustness; I want to leave about ten minutes for Q&A. Here is an example of why robustness is important and why AutoML is not good enough. In this example I am running XGBoost, a ReLU DNN, and a simplified ReLU DNN with fewer nodes and fewer layers, which we obtain by applying an L1 penalty; you can do all of this in PiML to get these models. I split the data into training and testing and evaluate the models on the testing data. At the zero-perturbation point here, based on the testing data, XGBoost is the best; this is MSE, so the smaller the better. But what happens in production when the input data has noise? You see here that with a very small noise perturbation, one percent of a standard deviation, XGBoost's performance degrades very, very quickly, whereas the simplified ReLU DNN, the model that performed worst on the test data, performs best in deployment, because it is robust against noise and its degradation is much smaller. The bigger the noise, the more its performance still holds up, while XGBoost's performance degrades very, very rapidly. This illustrates that the model you choose, or the model chosen by AutoML based on testing data, can be the worst model when you put it in production. All of this evaluation is done in PiML; this is why PiML has robustness testing, so we can compare models and pick the one that will be robust in production.

The same goes for big drift: robustness is about small perturbations, small noise, but what about a big distribution drift? We can look at that as well: how the performance degrades when the training distribution differs from the distribution when the model is used, and which variables' drift the model is most sensitive to. I have just given a few examples of the functionality in PiML for really testing models: how to build models that are inherently interpretable, and how to test that a model will be robust and reliable in production; a rough sketch of this kind of noise-perturbation comparison follows below. So with that, Steve, I'm going to open it up for Q&A.
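An editorial sketch of that kind of noise-perturbation test (scikit-learn stand-ins, not the XGBoost versus sparse ReLU DNN comparison from the slides, and not PiML's built-in robustness test): perturb the test inputs with Gaussian noise of increasing size and compare how quickly each model's MSE grows. A deep gradient boosting model stands in for the complex learner, and a small ReLU network (regularized with scikit-learn's L2 penalty rather than the L1 penalty mentioned in the talk) stands in for the simplified one; exact numbers will vary.

```python
# Noise-perturbation robustness sketch with stand-in models (illustrative only).
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

data = fetch_california_housing()
X = StandardScaler().fit_transform(data.data)
X_tr, X_te, y_tr, y_te = train_test_split(X, data.target, test_size=0.3, random_state=0)

complex_model = GradientBoostingRegressor(max_depth=8, n_estimators=400, random_state=0).fit(X_tr, y_tr)
simple_model = MLPRegressor(hidden_layer_sizes=(16,), activation="relu", alpha=1e-2,
                            max_iter=500, random_state=0).fit(X_tr, y_tr)

rng = np.random.default_rng(0)
for level in [0.0, 0.01, 0.05, 0.1, 0.2]:        # noise sd as a fraction of feature sd
    X_noisy = X_te + rng.normal(scale=level, size=X_te.shape)   # features are standardized
    mse_c = mean_squared_error(y_te, complex_model.predict(X_noisy))
    mse_s = mean_squared_error(y_te, simple_model.predict(X_noisy))
    print(f"noise sd={level:<5} complex MSE={mse_c:.3f}  simple MSE={mse_s:.3f}")
```

The comparison to look at is not the single number at zero noise but how each curve degrades as the perturbation grows.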

Steve: 19:52 Thank you, Agus, for sharing that. If anyone has any questions, please post them in the comment section. What level of support is Wells Fargo providing for training? If people have questions or feature requests, how can they reach out if they want to use this?

Agus: 20:14 Yes, we have a dedicated team, our PiML team, that supports this. The best way to reach out is to go to the GitHub page; it has a place for comments, so you can reach people through GitHub and post your questions there. The team also runs workshops regularly, and if you have any questions on the tool you can raise them there as well. If you look on the GitHub page under the documentation, there are several workshop materials, with much more detailed training material than what I spoke about, covering all aspects and all the functionality of PiML.

Steve: 21:08 Sounds good. And what we'll also do for the TechShares community is advertise those workshops and more about the PiML GitHub account. We have one question that came in from Brian: it looks like a valuable tool; what would you say are some of the strong and weak points in the current capabilities you have today, and what's coming on the roadmap?

Agus: 21:32 Yes. The strength here is really the functionality you have today. First, building models that are inherently interpretable instead of using a black box with post-hoc explainability: that is the tooling we put in PiML. We have only put in a few tools so far and it is a growing toolbox; in fact, version 0.4 will have additional techniques in it, and we keep trying to add more and more. So building inherently interpretable models is there, and you can test post-hoc explainability against the ground truth, because you have an inherently interpretable model. So you can see for yourself whether you trust LIME and SHAP; this is the first tool that enables that. Secondly, the tool helps you build robust and reliable models, models that will work under a changing environment. Very early on, during model development, we can see how the model will perform. Typically people throw a model out there and then monitor its performance as it degrades; here, we can build a robust model up front, testing and anticipating how the model will degrade and which variables we need to monitor, because those are probably the most sensitive to distribution drift. People do this today with a lot of observability and model-monitoring tools; we try to bring that approach into model development, so that when we deploy a model we can be really sure, with high confidence.

In terms of additional functionality: as I mentioned before, the fairness functionality will be released in 0.4, at the end of the month, so you will have that. Another important addition is causality, because today people build models based on correlation, not causal inputs. PiML 0.4 will have causality testing, where you can test whether an input is causal to the output or not, through model selection, not only statistical correlation but really causality testing. So that will come in 0.4; the tool is growing. As for updates and contributions, if people want to contribute or add functionality to PiML, we are open to that as well, open to collaboration and contributions. We want to put this out there as something that is really for real-world applications in high-risk environments.

Steve: 24:41 We appreciate that, and we support your efforts. For other financial institutions that are thinking about open sourcing different components, or teams that are doing that, are there any best practices you'd like to share? You've ventured into taking a tool that's been useful internally and open sourcing it, which I'm sure took a lot of work internally, and you're trying to build a community around it; we see your passion for it, Agus. So do you have any feedback for people who are trying to do the same or starting that journey?

Agus: 25:16 I think it's really different from organization to organization, and it requires guts from the leadership to do this, right? Financial models, particularly for credit decisions, for example, are very critical: they decide people's livelihoods. If a model is discriminatory, or a model is not explainable and creates damage, creates harm, it can really decide a person's socio-economic future, and not only their future but the future of the next generation; and if we do this through models, systematically, it can impact society too. So we try to do the right thing here. I think financial institutions need to take a lot of responsibility for this, particularly as we adopt machine learning. Our motivation, really, is to put this out there so people develop and deploy machine learning responsibly; that's the call I am making here. We also provide training to several of our peers; there are smaller financial institutions that don't have resources like ours, and we help them as well, so we have provided training to some of the medium-sized and smaller banks that want to use the tool.

Steve: 26:47 Alright, good. And Agus, we'll be a proponent of that within our network to enable you in that mission, because you have a solid mission to go after. We have three minutes here. There's one question from Brandon: do you have any thoughts about ways we can prevent human bias from interfering when training models and trying to automate, for example [Unclear] might make decisions about what data should be excluded when cleaning data before incorporating it into the model?

Agus: 27:16 Yeah, I think having a human in the loop is very, very important. We start with, "Are we going to use a certain variable or not?" That is very, very important, because certain variables will make the model more discriminatory, etc. Actually, the 0.4 release will have this testing, which variables we should exclude to make the model less biased, so that will be included in 0.4. It is also very important to have a human in the loop looking at the choice of architecture, because of the model multiplicity I talked about at the beginning: models that have similar performance in terms of, let's say, AUC can be very different in terms of fairness metrics (a toy illustration follows at the end of this answer). So we need to choose models not just based on a single performance metric; we also need to think about robustness, reliability, fairness, and so on and so forth. That's what we're trying to do. So, Brandon, if after you look at PiML you have more ideas, you want to collaborate, or you want to give us recommendations on how we can make this better, we would be very open to that.
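A toy editorial sketch of that kind of side-by-side comparison, on synthetic data with a made-up protected attribute (not a PiML fairness test; the data-generating process, the 0.5 cutoff, and the approval-gap metric are arbitrary illustrative choices): score each candidate model on both AUC and a simple group disparity measure before choosing.

```python
# Comparing candidate models on accuracy and a simple group-disparity metric (toy sketch).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 10000
group = rng.integers(0, 2, n)                  # synthetic protected attribute (0/1)
x1 = rng.normal(size=n) + 0.4 * group          # feature correlated with the group
x2 = rng.normal(size=n)
logit = 1.2 * x2 + 0.8 * x1 - 0.5
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
X = np.column_stack([x1, x2])                  # the group itself is NOT a model input

X_tr, X_te = X[:7000], X[7000:]
y_tr, y_te, g_te = y[:7000], y[7000:], group[7000:]

def approval_gap(scores, g, threshold=0.5):
    """Difference in approval rates between the two groups at a fixed cutoff."""
    approve = scores >= threshold
    return abs(approve[g == 1].mean() - approve[g == 0].mean())

for name, model in [("logistic", LogisticRegression().fit(X_tr, y_tr)),
                    ("gbm", GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr))]:
    p = model.predict_proba(X_te)[:, 1]
    print(f"{name:8s}  AUC={roc_auc_score(y_te, p):.3f}  approval gap={approval_gap(p, g_te):.3f}")
```

The point is the multi-metric comparison itself: two candidates that look interchangeable on AUC alone should still be compared on fairness, robustness, and reliability before one is selected.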

Steve: 28:34 Sounds good. What we'll do with this is post the recording on techshares.com and on OpenTeams, post it on YouTube and social media, and then send out an email. And if there's an opportunity to work together with Agus on questions, we'll be able to help facilitate that as well. And then, Agus, you'll put us in touch with your community lead, so that we can post more of the training sessions and spread the word on this mission Wells Fargo has to help people with the technology they're building. So thank you very much for your time, Agus, and thanks to everybody participating in these TechShares and to the presenters as well; I know you put a lot of time into this. So thank you very much.

Agus: 29:23 Thank you so much. 

Steve: 29:24 And with that, we’ll stop broadcasting.