Robust predictions and increased interpretability through domain-directed modeling

About

In this session, Rob Zinkov, senior software engineer at Quansight, will share their experience with domain-directed methodologies, when to use them, and when to use machine learning models. 

Machine learning has grown from a small but established academic discipline to being embraced across industries while providing significant value. It is often the first tool that data analysis folks reach for to solve their problems. To be sure, machine learning has significant strengths, especially when large amounts of clean, consistent data are available, and it’s providing dramatic benefits across fields as diverse as medical imaging, product recommendation, language translation, self-driving cars, and traffic analysis.

There are numerous areas where classic machine learning approaches are not necessarily the optimal tool, especially where data is lacking or of poor quality, or where visibility into the model’s internal decision-making process is critical. In these situations, modeling techniques built on the processes underlying the data, informed by domain-specific expertise, can often outperform machine learning methods. Such ‘domain-directed’ methods are a great addition to your toolbox: they can be effective when machine learning fails or reaches its limits, and they often provide superior predictive performance and improved model interpretability.

Rob Zinkov
Senior Software Engineer at Quansight

Transcript

Steve: [00:00:00] Hey, we are here for our third session of the day with Rob Zinkov. And Rob, you’re going to talk on robust predictions and increased interpretability through domain-directed modeling. As I look at your profile and see your great accomplishments at Indiana University, where you were a research scientist and built out Hakaru, the probabilistic programming language, I’d love to get your story behind how you got that started. That seems like the foundation of some of the things we’re going to talk about today, but give us a little of that background before we get too deep into your discussion.

Rob: [00:00:47] Sure. So I think the main context for that is that for a long while, when you would read machine learning papers, there would be this particular probabilistic model and then there would be this very complicated inference algorithm to get it working. And the inference algorithm always used all these little tricks, so it was always a little bit complicated to get it working. This was supposed to be very impressive, but in practice it often meant I would read these very interesting papers and it would take me months to actually replicate them. And it got frustrating after a while, because you would have two ideas that are conceptually very similar but you could never seem to reuse the work. You would see these articles on things like topic modeling, different takes on topic modeling, different takes on doing time series prediction, and you couldn’t reuse these things. It was so cumbersome. You grab a topic modeling library, you grab a time series modeling library, and of course they can’t work together. I mean, even in 2022, a library for doing document modeling and a library for doing time series modeling are usually not going to interoperate, and it would be very surprising if they did. But then I discovered this language called JAGS, which showed that, oh, actually, if you’re willing to sacrifice a little bit of computational efficiency, you can write these models. You don’t get to use all the tricks that the papers use, but you get to write the model and you get to use it for your task.

[00:02:56] And more importantly, because you have this little language, you get composability. So I was deeply captivated by this, because that suddenly meant, oh, okay, now I can mix and match models. I can really work with that. And I also saw, oh, this is a really interesting opportunity, because maybe many of the reasons these early systems are slow, the computational inefficiencies, are not fundamental barriers. They’re almost all things that just need a little bit of attention and some software engineering work. There should be no gap between whatever the fastest handwritten algorithm is and what comes out of these systems. And so that motivated me to join with a former undergrad professor of mine and develop the Hakaru project, which we worked on at Indiana for many years, to show that there’s no reason you couldn’t make a system that gives you these fast inference algorithms while letting you just write your statistical model and not really focus on the algorithm. So that’s the context that I come in from.

Steve: [00:04:30] I appreciate you sharing that, because I love to hear about innovation. And I know you work with Quansight, and they do a great job of incorporating that open vision, and the passion that you have for the things you’ve built and continue to contribute to. So let’s talk a little bit about this domain-directed modeling that you’ve created. The problem you’re addressing: when you have a lot of data, creating your models isn’t easy, but it’s easier to analyze that data and get the results you want. But when you don’t have a lot of data and you’re still trying to deliver those business results, that’s when things get a little iffy. And I think that’s what you’re going to talk about here. So help us understand domain-directed modeling. What is it? Give us an overview.

Rob: [00:05:16] Right. Yes. So domain-directed modeling is a term I’m using, riffing off of an article from a little over two decades ago. Two decades ago, the statistician Leo Breiman wrote an article called Statistical Modeling: The Two Cultures. In this article, he says that if you have a data modeling task, there are two ways you can go about it. There’s what he called traditional statistical modeling, where you develop a statistical model of your domain and then follow it through, as we expect from a lot of the statistical sciences. And then you have the algorithmic modeling approach, where you just try to develop an algorithm that can explain the patterns in the data. The point of the article, when it came out a few decades ago, was that we had over-focused on these traditional statistical modeling approaches, which I’m calling domain-directed approaches, because when you write the model, it’s very much motivated by your domain expertise, and that more attention needed to be paid to the algorithmic modeling approach.

Rob: [00:07:01] I’m actually now arguing that it might be time to go in the other direction. I think we now focus quite a bit on algorithmic modeling, and it might be worth it to go back to domain-directed modeling, because it has very unique strengths. Namely, it can be much more robust. You can often use it when you have very little data. It works better when your data is less uniform. There are other advantages we’ll talk about shortly. So I actually think it’s really worth it to explore how to do domain-directed modeling. And the reason I think now in particular is a great time to get into it is that these tools have matured. I explored Hakaru, but I wasn’t the only person doing this. Many people have been exploring this, and we now have this community of probabilistic programming people and these probabilistic programming systems. So there’s actually now really nice software to do domain-directed modeling very effectively, in a way that just didn’t exist before.

Steve: [00:08:33] And just on that one, what software do you suggest people use, and when should you start using domain-directed modeling? When is a good point in your project to start?

Rob: [00:08:48] Right. So it’s such a challenging question, because I love so many of these systems. Two systems I think are worth mentioning: there’s, of course, Stan, which has become very much one of the main modern systems. It’s its own little language, with library access from Python, R, and a few other languages. And then there’s PyMC. These are both deeply established projects; each of them, I would say, has close to a decade of active development. Looking further afield, I’m quite excited about, and contribute quite a bit to, Bean Machine. For people who use PyTorch, Bean Machine is written on top of the PyTorch ecosystem. So that lets you do this domain-directed modeling using PyTorch and a lot of the tooling around it, which, if you’re already using PyTorch in other parts of your business, makes it a very easy thing to explore. In terms of when you would do it? In many ways it’s a tool that I think you can often reach for a little earlier than you might something like scikit-learn, because you can use a lot of these domain modeling tools for exploring the data. You can ask, are there common attributes? You can, say, make a latent variable model and ask, well, are there common factors involved here? Much in the way that you might do summary statistics: once you move beyond basic summary statistics, you can start making very small statistical models and just see what inferences you get.
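
To make that last point concrete, here is a minimal sketch, written in PyMC-style syntax, of the kind of very small statistical model you might write while exploring data. The data, variable names, and the choice of a two-component mixture are hypothetical illustrations, not something taken from the talk; the model simply asks "are there common groups hiding in this measurement?" one step beyond summary statistics.

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
# Stand-in data: two overlapping clusters of measurements (purely synthetic).
measurements = np.concatenate([rng.normal(0.0, 1.0, 150), rng.normal(3.0, 1.0, 150)])

with pm.Model() as explore:
    w = pm.Dirichlet("w", a=np.ones(2))                 # mixture weights
    mu = pm.Normal("mu", mu=0.0, sigma=5.0, shape=2)    # group means
    sigma = pm.HalfNormal("sigma", sigma=2.0, shape=2)  # group spreads
    pm.NormalMixture("obs", w=w, mu=mu, sigma=sigma, observed=measurements)
    idata = pm.sample(1000, tune=1000)

# Inspecting the posterior over mu and w is one step past summary statistics:
# it suggests whether distinct subgroups plausibly generated the data.
```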

Steve: [00:11:14] Can you give us some examples of where domain-directed modeling shines? In your experience, how can you illustrate that a little bit more for us?

Rob: [00:11:27] Sure. So I think some of the places where it really shines are where you don’t have a lot of data or your data are very heterogeneous. For example, as an anecdote a colleague told me about, they were testing the reliability of machine parts. The machining is very expensive, so they were getting like five or six data points. There is another example: Ravikumar, who used to work at Space Projects, would use Bayesian modeling for what you should do for rocket launches and things like that. There aren’t that many rocket launches. You don’t get thousands of data points. It’s not like other domains where you have millions of data points; you get maybe a few dozen, if you’re lucky. You see this in the sciences too: you have things like astronomy, where you might use it for, say, exoplanet discovery. Again, these are cases where there isn’t a lot of data, but where you really use your domain knowledge, which in these examples is often physics, to make the most of it.
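
As a hypothetical illustration of that machine-parts anecdote (the numbers, priors, and names below are invented, not drawn from the colleague’s actual work), here is a minimal sketch in PyMC-style syntax of a five-observation reliability model, where an informative prior stands in for engineering knowledge about how long such parts typically last.

```python
import numpy as np
import pymc as pm

# Five expensive tests: hypothetical failure times in hours.
failure_hours = np.array([910.0, 1150.0, 980.0, 1320.0, 1050.0])

with pm.Model() as reliability:
    # Domain knowledge, encoded as a prior: parts of this class typically
    # last on the order of 1000 hours.
    char_life = pm.LogNormal("char_life", mu=np.log(1000.0), sigma=0.5)
    shape = pm.Gamma("shape", alpha=2.0, beta=1.0)  # weakly informative Weibull shape
    pm.Weibull("lifetime", alpha=shape, beta=char_life, observed=failure_hours)
    idata = pm.sample(1000, tune=1000)

# With so few observations the posterior leans heavily on the prior,
# which is the point: domain knowledge carries the model.
```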

Steve: [00:13:07] Yeah. So the accuracy and interpretability of ML approaches are a major concern, especially when it comes to the healthcare industry. What are a few proven ways to improve the accuracy of a model, as you’re talking about probabilistic programming and its use?

Rob: [00:13:27] Sure. So there are always lots of different ways you can go about it. Of course, all the things we do with regular machine learning algorithms still apply. If you get more high-quality data, that can always help. There’s always a bit of feature engineering you can do. Particularly in situations where you’re trying to improve a probabilistic model, what often helps is really looking at the data and asking where you are not capturing a subtlety in it. So sometimes you might ask: does your data have heavy tails, and are you actually capturing that in your model? Because if you’re not, then you run into issues with robustness. There’s the question of whether your model captures what all of the data actually looks like, or whether it’s doing a bad job on an important subset of the data. Are the assumptions it’s making reasonable? These are all things of that sort that you can do. And what’s nice about all of these, of course, is that you can then go ahead and check and see whether the change actually seemed to do anything.
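
As a minimal sketch of those two checks (the data and names are hypothetical, written in PyMC-style syntax), the model below uses a heavier-tailed Student-t likelihood in place of a Normal one, then draws posterior predictive samples so you can see whether the fitted model reproduces the spread, including the tails, of the data it was fit to.

```python
import arviz as az
import numpy as np
import pymc as pm

# Stand-in data with heavy tails.
y = np.random.default_rng(1).standard_t(df=3, size=200)

with pm.Model() as robust_model:
    mu = pm.Normal("mu", mu=0.0, sigma=1.0)
    sigma = pm.HalfNormal("sigma", sigma=1.0)
    nu = pm.Exponential("nu", lam=1 / 30)  # degrees of freedom controlling tail weight
    pm.StudentT("obs", nu=nu, mu=mu, sigma=sigma, observed=y)
    idata = pm.sample(1000, tune=1000)
    ppc = pm.sample_posterior_predictive(idata)

# The posterior predictive plot compares simulated data against the observed
# data, making it visible when the model misses the tails.
az.plot_ppc(ppc)
```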

Steve: [00:15:07] I appreciate the response to that one. Now, as people embark on the journey to use more of these tools, which probabilistic programming libraries work best with small and large datasets? What do you suggest they use, and how do you suggest they get started with this?

Rob: [00:15:29] Just as I said before, I think you can reach for something like Stan, PyMC, or Bean Machine. And I think generally they’re more or less interchangeable. Which one you gravitate toward will depend on things that are less specific to the library and more about your other business needs. If you’re mostly a Python shop, maybe you don’t want to be reaching for Stan, which seems to have better support on the R side. Or if you’re using a lot of PyTorch, you might reach for Bean Machine. In terms of the merits of the libraries themselves, I think it’s more about which ecosystem you fit better into, because the capabilities of a lot of them, particularly for people who are starting out, are more or less interchangeable.

Steve: [00:16:26] And as they go down this path, what are the most common errors you can face during data modeling as you’re working with probabilistic programming?

Rob: [00:16:39] I think a lot of the errors that people make are kind of mundane in a lot of ways. A lot of these probabilistic models are Bayesian models in particular, so you have to put priors on the latent variables that you care about. And this is one of those situations where a little knowledge can be dangerous. If people are coming from a more frequentist statistical background, they’ll sometimes make their priors overly broad, and this often leads to models that have what I’d describe as bad geometry. So you should write priors that actually agree with what you think. If you’re modeling the age of some of your customers, it’s not clear to me that you should have a model that puts a lot of weight on people who are 300 years old. If you’re modeling pollution in the atmosphere, you probably don’t want to put too much weight on pollution levels that won’t exist on this planet for a billion years. This is not meant to be glib or anything. Overly broad priors often just make these systems perform badly, and then people think, oh, I guess my problem isn’t a fit for the system. And the difference is significant: we’re talking about someone making this change, their model takes an hour to run, they put in a more reasonable prior, and it runs in a minute. This is a very important thing to do that can really affect the experience of using these systems.
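
As a hypothetical illustration of that point about priors (the model and numbers below are invented, in PyMC-style syntax), a prior predictive check makes it obvious when a prior puts real weight on 300-year-old customers.

```python
import pymc as pm

# An "uninformative" prior that is actually absurd for customer ages.
with pm.Model() as too_broad:
    mean_age = pm.Normal("mean_age", mu=0.0, sigma=1000.0)
    pm.Normal("age", mu=mean_age, sigma=100.0)
    broad_prior = pm.sample_prior_predictive()

# A prior that agrees with what we actually believe about customers.
with pm.Model() as reasonable:
    mean_age = pm.Normal("mean_age", mu=40.0, sigma=15.0)
    pm.TruncatedNormal("age", mu=mean_age, sigma=15.0, lower=0.0, upper=110.0)
    sane_prior = pm.sample_prior_predictive()

# Plotting the "age" draws from broad_prior against sane_prior makes the
# problem visible: one prior routinely generates impossible ages. Tighter,
# honest priors also tend to give the sampler much better geometry.
```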

Steve: [00:19:01] Right. No, it makes sense. So another question that came in here: biased training data has the potential to cause not only unexpected drawbacks but also perverse results, by completely contradicting the goal of the business application during data modeling. How can we ensure balanced label representation in training data to avoid data-induced bias? So that was a question that came in here.

Rob: [00:19:33] Right. I think a lot of the challenge there is that you need a source of truth to keep you honest. None of these methods, whether it’s the methods I prefer or the methods others use, can really do anything if the data itself is biased. They assume the data is what it is. So I think those things really require external checks: does the population in the dataset have the same properties that we expect the population where the model will be deployed to have? Or does the model have worse performance on some subpopulations than we want, for different criteria? That really ends up being something that exists almost outside of the modeling process; it’s about getting a sense of what the data you have actually is.
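
One simple form of the external check described here, sketched in plain pandas with hypothetical column names, is to compare an error metric across subpopulations and look for large gaps.

```python
import pandas as pd

# Hypothetical held-out predictions with a subgroup column.
df = pd.DataFrame({
    "region": ["north", "north", "north", "south", "south", "south"],
    "y_true": [1, 0, 1, 1, 1, 0],
    "y_pred": [1, 0, 1, 0, 1, 1],
})

per_group = (
    df.assign(correct=lambda d: (d["y_true"] == d["y_pred"]).astype(float))
      .groupby("region")["correct"]
      .agg(accuracy="mean", n="count")
)
print(per_group)  # a large accuracy gap between groups is the red flag
```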

Steve: [00:21:07] Thanks for answering that question. One of the other questions that was submitted: creating a model without understanding the business goals and mission can cause lots of problems. How can businesses ensure that they stay on track with their business goals, and that the data modeler incorporates those goals into what they’re doing?

Rob: [00:21:34] That’s a great question. Maybe I should have answered that one a little earlier. I mean, this is how most data science projects fail, right? You just don’t engage the relevant stakeholders early enough and forcefully enough. My experience, and I think the experience of most people who work on data science projects, is that you just need regular communication with all relevant stakeholders. If you’re making assumptions, people need to know about them early. You need to be managing expectations basically the entire time. And if you don’t understand the business model, I think that’s the number one priority to correct before going off and making something that ends up not being very helpful.

Steve: [00:22:53] I appreciate that. Let’s see, are there any other audience questions? Thanks for sharing; we appreciate your covering this topic with us. Now, you have a blog post that goes into more detail about what we discussed, which can be found on quansight.com, and we’re going to post that link out to everybody so they can have it. Any other things you’d like to share, Rob, or closing remarks? We appreciate what you shared with us. This is a very useful way to do data analysis, especially when you don’t have all the data that you’re looking for, and it’s a good tool for a lot of our research, because many times we end up with less data than we wish we had.

Rob: [00:23:46] There’s always less data than you wish, right? Because the thing is, the ambition gets higher. If we have a lot of this data, now we can reach for things that were a little further out of reach. So it’s like, well, we have all this data for language translation; now there’s a temptation to reach for the less popular languages, the languages that have fewer speakers, maybe the endangered languages. You can often keep subsetting your data until you suddenly don’t have a lot of data anymore, and that subset still raises very interesting questions. And I think often, because we know what to do when we have lots of data, we set aside a lot of the interesting detail and complexity in a dataset in order to answer the questions we think we can answer on the largest possible dataset. So things like local and regional variation fall by the wayside. Things about more vulnerable populations fall by the wayside, because again, these are things that you have less data for. But I think if you have a nice broad range of tools, and I consider domain-directed modeling just one that people should have access to, suddenly you can resist that temptation a little bit.

Steve: [00:25:30] Yes. And I posted a link to the blog post you put up on quansight.com. Rob, thank you for your time today. We appreciate it, and your insights for the community. Coming up, we’re going to take a short break for a mic check, and then we’ll move on to Davit Buniatyan. So appreciate that. Thanks once again, Rob.