Modernization of energy data infrastructure with open source tools

About

In this session, John Pantano and Dharhas Pothina will discuss the opportunities and challenges of data infrastructure modernization in the energy sector. Energy companies have always had to adapt to technological evolution, and this has especially held true for data infrastructure, due to the increased diversity of the types of available data, and to the exponential growth in the quantity of data produced. While there are significant advantages to adoption of modern data infrastructure hardware and software, there are numerous challenges as well. A sound decision as to when and how to undertake a data infrastructure modernization campaign must carefully consider all of these factors.

John Pantano
Energy and Environmental Solution Architect

Dharhas Pothina
CTO at Quansight

Transcript

Brian: [00:00:00] Welcome to our second Tech Shares event. Just to remind everyone, Tech Shares are a forum for technology and business leaders to connect to help their organizations create better software solutions that utilize open source software more effectively. As I noted, we’ve returned to our three-session format with today’s event, titled Data Management and Energy with Open Source, and along with that title, we’re looking at three different aspects of data management in the energy sector. Again, my name is Brian Skinn. I’m your host for today, and this session’s topic is modernization of energy data infrastructure with open source tools. As I noted in the last session, we are open to audience questions, so if you have any you’d like to raise, please just drop them into the chat. And if you are interested in participating in a future Tech Shares, please reach out to us at techshares.com/speakers. All right. So for this session today, I have with us Dharhas Pothina, CTO at Quansight, and, back again for this session, John Pantano, energy and environmental solution architect at OpenTeams. This session will be another conversation between our two guests. Before we get started, I want to give Dharhas and John a chance to introduce themselves. So Dharhas, if you would, tell us a bit about yourself and about Quansight.

Dharhas: [00:01:19] Yeah. So my name is Dharhas Pothina. I’m the CTO at Quansight. Quansight is a company that is focused on enabling people to use open source effectively. We specialize in a lot of the open source tools in the Python PyData and scientific Python ecosystems. Personally, my background is in computational physics. I did a lot of work in environmental and water resources, spent about 15 years in state and federal government, and did a lot of geospatial and ocean mapping work. Now I do a lot of distributed computing, high-performance computing, and other things around the data engineering and scalable data space.

Brian: [00:02:06] Terrific. We’re really glad to have you here. John, I know you introduced yourself in the last session, but please go ahead and briefly tell us again: what is your background, and what brought you here today?

John: [00:02:18] Hi, I’m John Pantano. I’m with OpenTeams. I joined OpenTeams because, over my four decades of doing computational scientific computing, I saw that with the way we were collecting more and more data, we were going to be able to solve some problems that we hadn’t been able to solve previously. My background is similar to Dharhas’s: computational scientific computing. And I really think the work that the people at Quansight and everybody in the open source community have been doing is going to allow us to really leapfrog our decision making, or decision support, with the data.

Brian: [00:03:12] Terrific. Thank you both for joining us today; I really appreciate it. So without further ado, we’ll move on to the conversation. Whereas the first session of this event was more of a parallel conversation between two experts in different areas, this one is planned as more of a Q&A between Dharhas and John. John has expertise with upstream energy data management, and Dharhas has recent experience and expertise with data processing, data pipelines, and data infrastructure. So we’ll start with John, introducing where you’re coming from with the data systems you’ve worked with: legacy data systems that are still functional, but maybe limited in scope, size, or capacity, or that require a great deal of human intervention. That sets the stage for where things are and what you’ve experienced with data, and then you can pass questions to Dharhas on the pros, cons, benefits, costs, and considerations of trying to modernize that data infrastructure. So, John, if you would, go ahead and set the stage for us.

John: [00:04:27] Yeah. So in oil and gas, we have the geophysical data that’s being collected before they drill the well, and that’s terabytes and terabytes of information. Then the geologists collect a bunch of information, and each group has its own processing techniques. Yet while the geophysical data might comprise terabytes of information, the geologists might just get a screenshot of an interpretation from one of the geophysicists. And then you have your drilling engineers, who are basically drilling these wells three miles deep, then going three miles laterally, and they have to stay within a ten-foot zone that the geologist has picked. So there’s a massive amount of data being collected there. Each one of these legacy systems has been optimized to work on one problem and one problem only. And you have the reservoir simulators that then try to take all that information, but that’s just passed off hand to hand. So, to Dharhas: we’ve got more and more data being collected, so legacy systems keep being built. But at what point do you say, “I should rethink the problem,” and, maybe in bite-sized pieces, try to fix some of these interoperability issues?

Dharhas: [00:06:14] That’s actually a great question, because in the last ten years many new tools have emerged that really allow you to handle massive data sets easily. Previously, in most of the legacy systems, you had a massive data set and you put it through some batch processing pipeline that created an image, a plot, or something else that’s just a snapshot, some way of condensing all that data into something. And that’s what you passed along to the next stage in the pipeline. If you wanted to ask a slightly different question, you were stuck; you just had that image. Now, with the newer tools, we can actually give people interactive access to everything, and we can get to the point where, instead of having this offline snapshot, you can have a live view of the data, explore it interactively, and ask questions that the original software was not designed to ask. That’s one of the big advantages of moving to the newer tools, and most of these are freely available open source tools. Some of this requires a different way of thinking about things, because often this data is siloed into different parts of the organization, onto different platforms and different machines, and enabling some of these interactive tools and stronger analytics will require thinking through where the data is put and how it’s made accessible. Now, the flip side, on taking bite-sized pieces: instead of relying on proprietary tools to do each piece, if you use some of the open source analytics tools and insert them into different parts of the pipeline, you have more ability to answer your own questions rather than the questions the vendor has decided you need to ask.
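
As a concrete illustration of that kind of live, interactive access, here is a minimal sketch using xarray, Dask, and hvPlot. The file name, variable name, and dimension names are hypothetical, not taken from any system discussed here:

```python
# A minimal sketch of interactive access to a large dataset, using
# xarray + Dask for lazy loading and hvPlot for live exploration.
# "survey.nc", "seismic_amplitude", and the dimension names are
# hypothetical placeholders.
import xarray as xr
import hvplot.xarray  # registers the .hvplot accessor on xarray objects

# Open the dataset lazily: nothing is read into memory yet; Dask
# loads only the chunks a given view actually needs.
ds = xr.open_dataset("survey.nc", chunks={"depth": 100})

# An interactive image plot with a slider over depth -- a live view
# of the data instead of a static, one-off snapshot image.
plot = ds["seismic_amplitude"].hvplot.image(
    x="easting", y="northing", groupby="depth", cmap="viridis"
)
```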

John: [00:08:19] That’s an important point. You know, as somebody who’s developed software inside the industry, I see a lot of software vendors that are in business to make money, and so they’re adding features that make things better for themselves, not necessarily better for the customer. But with open source being able to deal with these larger data sets, how do we communicate to the users or the decision makers that this new resource is available? I know there are a lot of people who used Python at the very beginning and got disillusioned because it would work on one person’s machine but wouldn’t work on another person’s machine. How have we advanced? [overlap]

Dharhas: [00:09:24] One of the strengths, and curses, of the Python ecosystem is the fact that there are so many tools out there and so many very powerful libraries, and these libraries are changing almost on a daily basis. If you look at some of the vendor-managed tools, you have a very stable tool that just works; it’ll work on every computer you install it on. The downside of that is you’re stuck with functionality that is 15 or 20 years old, and you don’t have any ability to use any of the newer algorithms or techniques. The PyData ecosystem and the open source ecosystem have very, very powerful tools. Pretty much every time I go to one of these conferences, I come back with new algorithms and new tools that I can immediately put into production and that fundamentally change the way I do analytics. For many years, this was a big problem. I got into the space in 2008 with Python, and in 2008 it was a mess trying to get things installed on different computers and just getting a working environment.

[00:10:52] The situation has improved massively. There are new tools like Conda, from Anaconda and other places, that make it a lot easier to manage environments. Something Quansight has done is build on the existing ecosystem: we have a new open source tool that we’re calling Conda-Store right now, and it lets you manage data science and analytics environments across organizations, with versioning and multiple ways to deploy environments. So this is the thing: to be a modern company that does modern data analytics, you have to be flexible with environments. You have to be able to push out new environments quickly. You have to let your engineers, data scientists, and subject matter people explore new tools. At the same time, you want to understand which tools they’re using, you want to restrict access to some things, and you’re going to have security issues to manage.

[00:11:53] So there’s a governance piece that’s required from an administrative IT perspective, but there’s also a flexibility piece that’s required by your subject matter folks who are actually doing the analysis. In Conda-Store, we try to marry those, so you can have your control and manage things, but you can also quickly deploy and let people explore new software and new environments. And at the heart of all this is reproducibility: the idea that if someone tries a new tool, or if you’re mandating that a particular tool or a particular version has to be used, that environment is reproducible across your organization, or across a team or a project in your organization, because different teams and projects might have different requirements. We also want to be able to go back and say, “Hey, this is the exact software environment I used three years ago on this project,” and recreate it exactly. I’m sure all of you in organizations that have tried experimenting with new tools have had this issue where someone used some software and then moved on from the project or left the company, and now no one can run that code and no one knows how to run it. Part of that is this environment management problem that has to be solved.
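
To make the reproducibility idea concrete, here is a minimal standalone sketch that snapshots the exact package versions in the current Python environment so it can be pinned and recreated later. This only illustrates the concept; it is not how Conda-Store itself works:

```python
# Capture the exact package versions installed in the current
# environment, so the environment can be pinned and recreated later.
# (Illustrative only; Conda-Store automates this organization-wide.)
import importlib.metadata

pins = sorted(
    f"{dist.metadata['Name']}=={dist.version}"
    for dist in importlib.metadata.distributions()
)

# Write a pinned snapshot file, e.g. usable with "pip install -r".
with open("environment-snapshot.txt", "w") as f:
    f.write("\n".join(pins))
```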

John: [00:13:12] The oil and gas industry is usually very sophisticated, but sometimes it’s not as agile as some other industries. How do you convince people, or convey to them, that these problems have already been solved, say for finance or for tech? You’ve solved some of these problems at Quansight, right? How do we communicate that this has actually been implemented in other industries?

Dharhas: [00:13:54] I mean, there’s definitely evidence out there that we can show. These are open source tools that can be tried fairly easily. The best way for a company or an organization is really to do a proof of concept project: take an area that is a pain point but that is small enough that you can experiment. One way to think about this is what’s called a lighthouse project. A lighthouse project has different aspects. It needs to be big enough that if it succeeds, you’re going to have a significant advantage, but it needs to be small enough that the number of stakeholders, departments, and data systems is tractable within a short period of time. A lot of times when people go into modernization efforts, they say, “Oh, let’s modernize everything in one go,” and they end up with a five or ten year project that then fails because they didn’t take into account all the pieces. So the lighthouse project needs to be right-sized: impactful, but also small enough that you can quickly prototype and evaluate what’s working for your organization. There are some legacy systems that will need to stay, because they’re so critical and so all-encompassing. So this is like the waterfall versus agile approach: let’s try something small, see how it works, learn from it, and move on from there.

John: [00:15:30] Makes a lot of sense.

Brian: [00:15:33] So one question I’d like to put forth. When you’ve got the lighthouse project and you’re contemplating legacy systems, some of them can’t be replaced; but for those where you are considering replacement, how often is it the case that you can interface legacy equipment to modern data systems and it continues to work okay? And how often is it necessary to actually replace something in the legacy stack, the sensors or the controllers or something, in order to make it possible to upgrade to the modern data stack?

Dharhas: [00:16:13] To a large extent, that really depends on how critical the legacy systems are to your business processes. If you have a subsystem that’s using legacy things but it’s not business critical, you can just say, okay, I’m going to rip it out and replace it. But legacy systems, especially ones which have been around for a long time, are usually very stable, and until you start replacing bits of them, you don’t understand what all the interactions and problems are. This is why you see a lot of major modernization efforts fail: someone comes in and says, “Oh, I understand how your legacy system works,” throws it away and rebuilds it in parallel, and it ends badly. An example I’ll give is some work we did for a manufacturing organization in Europe, where we were writing some automation for their assembly line, connecting to lots of different machines on the line. In their existing procedure, data could not transfer from machine to machine, so they had Excel files, notes, multiple independent databases, and people writing things down. And at the end of the day, when they had their manufactured product, they couldn’t tie the serial number on the product back to which chemicals were used, which batch was used.

Dharhas: [00:17:41] So the idea was: let’s integrate all this, so that from the serial number of the end product, we can know exactly what processes and chemicals and temperatures and everything were used. So we built them a system to do that, and it was successful; it’s in production now. But one of the things I wanted to tell you about this: when we started moving it into production, the managers of the assembly line were surprised, because there were three or four extra steps on the assembly line that no one knew about. They had been changed at the assembly line, but that had never gone up into the documentation or the official processes, and so then we had to add some pieces to the infrastructure. So if you have a legacy system, there are lots of undocumented things happening. That’s part of why you have to be very careful, and I would usually say take an incremental approach and move things out slowly, versus “Oh, we’re just going to add all these new tools and immediately change everything.”
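
As a hypothetical sketch of the traceability idea described above, the core of such a system can be as simple as joining per-step process logs to finished products by serial number. All table and column names here are invented for illustration, not taken from the actual project:

```python
# A hypothetical sketch of serial-number traceability: join per-step
# process logs to finished products so that a product's serial number
# resolves to the chemicals, batches, and temperatures used.
import pandas as pd

products = pd.DataFrame({
    "serial_number": ["SN-001", "SN-002"],
    "finished_at": pd.to_datetime(["2022-03-01", "2022-03-02"]),
})

process_log = pd.DataFrame({
    "serial_number": ["SN-001", "SN-001", "SN-002"],
    "step": ["coating", "curing", "coating"],
    "chemical_batch": ["B-17", "B-17", "B-18"],
    "temperature_c": [140, 180, 142],
})

# One row per (product, step): the full process history per serial number.
traceability = products.merge(process_log, on="serial_number", how="left")
print(traceability)
```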

John: [00:18:53] I think that’s a real key thing. The oil industry is similar to that manufacturing problem you just talked about: just by collecting all the data, there are insights that were hidden from you, and you can optimize if you are aware of all the information about what is actually going on. One thing that just popped into my head: with these new sensors and new things that are coming out, how do we get the vendors educated in the advantages of open source, so that they write things in open source? In my mind, open source allows more users, more trials and tests, and a bigger user community, and therefore more features are able to be added. Dharhas, what is the attractiveness for a vendor to switch over to open source?

Dharhas: [00:20:08] It’s leverage. If you look at a tool like pandas, which is an open source data-frame analytics tool, there is no way any vendor can beat the capabilities of pandas, but they can leverage the capabilities of pandas. With any software you develop on your own, you are limited by the resources you have; you cannot compete against the rest of the world. An example of this is in GIS. There’s a very famous proprietary company that has some very big GIS products, and I can do more with open source GIS tools than I can do with that company’s products right now, because it’s very difficult to compete with the rest of the world. Any product you build, any software you build, you have to maintain, and you are paying that maintenance fee. So my overall strategy, what I tell people, is that the only part of your product that should necessarily be closed source is maybe your secret sauce, the thing that’s very specific to what you do. All the foundation and the core parts should really just leverage the open source stack, because then you can leverage all the intelligence of the open source community, and you’re not completely on the hook for the maintenance. Participating in open source gives you, the way I put it, a force multiplier, because you can provide more features to your clients than you could with a completely closed stack.
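
As a small illustration of that leverage, a few lines of pandas provide grouped analytics that a vendor would otherwise have to build and maintain from scratch in a closed stack. The data here is invented for the example:

```python
# A few lines of pandas give grouped analytics "for free": per-basin
# mean production and production per lateral foot. The well data
# below is made up for illustration.
import pandas as pd

wells = pd.DataFrame({
    "basin": ["Permian", "Permian", "Bakken", "Bakken"],
    "well": ["W-1", "W-2", "W-3", "W-4"],
    "daily_bbl": [850.0, 920.0, 640.0, 700.0],
    "lateral_ft": [9800, 10200, 8800, 9100],
})

# Derive a normalized metric, then summarize by basin.
summary = (
    wells.assign(bbl_per_ft=wells["daily_bbl"] / wells["lateral_ft"])
         .groupby("basin")[["daily_bbl", "bbl_per_ft"]]
         .mean()
)
print(summary)
```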

John: [00:22:02] And you can keep stuff proprietary. Like you said, the secret sauce can still be your secret sauce. You don’t have to share that.

Dharhas: [00:22:10] I mean, often no one actually cares about the secret sauce, because they’re in a different domain from you. For example, I was in geospatial and GIS, doing a lot of work with elevation and raster data, and I ended up using a lot of algorithms from computer vision, because an image and an elevation raster dataset are both grids of data. And if I have some secret sauce related to elevation data, an organization that’s in computer vision doesn’t care about that. So usually the secret sauce is something very unique to your business model and what you’re trying to do. That’s one big misconception, this idea that if you’re using open source, everything you do has to be out in the open. I don’t think that’s true. I think having an intelligent mix of open source and your own proprietary stuff is what makes sense.
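
As a minimal sketch of that crossover, here is a classic image-processing filter applied directly to an elevation raster, since both are just grids of numbers. The synthetic terrain is made up for the example:

```python
# Applying a computer-vision filter to an elevation raster: both are
# just 2-D grids of numbers. The terrain below is synthetic.
import numpy as np
from scipy import ndimage

# A synthetic 100x100 "elevation" grid (meters).
y, x = np.mgrid[0:100, 0:100]
elevation = 50 * np.sin(x / 15.0) + 30 * np.cos(y / 20.0)

# Sobel filters, a standard image-processing edge detector, double
# here as slope estimation: the gradient magnitude highlights steep
# terrain exactly as it would highlight edges in a photograph.
dx = ndimage.sobel(elevation, axis=1)
dy = ndimage.sobel(elevation, axis=0)
slope_magnitude = np.hypot(dx, dy)
```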

Brian: [00:23:20] Indeed. So we’re getting close to the end of our time. I think it’s been a really great conversation. John, do you have one last question or one thing to round us out?

John: [00:23:32] Well, the next talk is going to be about Nebari. Could you just talk about maybe the benefits of Nebari? [overlap]

Dharhas: [00:23:44] Yeah. One of the big problems we’ve seen is that there are lots and lots of really powerful modern tools, including lots of tools to deal with big data, visualization, and analytics, but getting a platform set up, either in the cloud or internally, is really hard. It’s way harder than it needs to be. Nebari is an open source product that Quansight has been developing for the last two years. It is an opinionated, managed integration of a suite of open source tools, and the idea is to quickly enable an organization to get set up with a data science platform that their subject matter folks can use. Right now, Nebari is available for four major clouds: GCP, AWS, Digital Ocean, and Azure. In the cloud, a default installation should take as little as 30 minutes. If you want to customize things more, or if you want to install it on-prem, it might take a little longer. But the idea is that within half a day or a day, you can have a platform that your organization can use, and it’s an open platform that can be customized and that other things can be integrated into. There’s no proprietary lock-in. The idea is: let’s get the platform piece out of the way so you can actually get to work.

[00:25:21] We’ve done a lot of automation, and the way we’ve designed it, you don’t need very much technical skill to install it. It’s just an install script that runs, and then you manage it by configuring one YAML file that controls the entire installation. Again, the biggest problem in most organizations is getting infrastructure deployed so that people can get to work. With Nebari, and with Quansight’s consulting, we’re trying to get that infrastructure piece out of the way so you can actually get work done.

Brian: [00:25:57] Excellent. Well, I think we’ll have to leave it there; we are just about out of time. Thank you very much, Dharhas and John, for participating. Dharhas, we really appreciate you bringing your expertise in these areas to the questions today. This event is being recorded; the recording will be available soon, and information on it will be distributed to all participants once it’s available. As John mentioned, the next session will be a presentation from Amit Kumar on Nebari. Stay tuned; we’ll be back with you in just a minute. Thanks again to both of you, Dharhas and John.

John: [00:26:40] Have a good day.

Dharhas: [00:26:42] Bye.