Data collection in energy applications

About

In this conversation, John Pantano and Marc-Andre Lemburg will discuss processes, challenges, and best practices for data collection in the energy sector, providing an invaluable perspective for energy companies seeking to enhance their data handling capabilities. As with all other industries in the modern world, energy companies must implement robust, reliable, and traceable data collection and cataloging systems. The constellation of sensors, controllers, and other hardware that produces data streams is continuously growing in size, complexity, and granularity, and the systems to handle this data must be engineered with care. In addition to managing the sheer volume of data, data solutions must also account for the different uses of the data they process, ranging from numerical/scientific modeling to ML/AI predictions to corporate sustainability & responsibility reporting.

John Pantano
Energy and Environmental Solution Architect

Marc-Andre Lemburg
CEO at eGenix.com

Transcript

Brian: [00:00:00] Welcome to today’s Tech Shares event. Tech Shares, as a reminder, are a forum for technology and business leaders to connect and to help their organizations create better software solutions that utilize open source software more effectively. Today we are returning to the three-session format we’ve used before, with an event we’re titling “Data Management in Energy with Open Source.” As the title says, we’ll be looking at three different aspects of data management in the energy sector. My name is Brian Skinn, and I’ll be your host for today. This session’s topic is data collection in energy applications, and we’ll have a nice discussion on that with our two guests. I do want to note at the outset that we are open for audience questions, so if you have any you’d like me to raise with our guests, please post them in the event chat. I’ll also note that if you’re interested in participating as a speaker or a roundtable participant in a future event, please reach out to us at tech-shares.com/speakers. All right, with me today I have Marc-Andre Lemburg, CEO at eGenix.com, and John Pantano, an energy and environmental solution architect right here at OpenTeams. For this session, I’ll be moderating a conversation between our two guests. Before we get started with that discussion, I’d like to give Marc and John a chance to introduce themselves. Marc, why don’t you go ahead and tell us a bit about yourself and eGenix.

Marc: [00:01:19] Yeah, thank you very much for having me. I’m Marc Lemburg, based in Dusseldorf in Germany. I have a math background, and I started a company about 20 years ago. I’m a long-time Python user and a core developer; I added Unicode to Python. I’ve been working as a consultant for the past 20 years, mostly in financial services but also in the energy sector, and that’s why I’m here today: to focus a bit on the experience I’ve gained running projects in that sector.

Brian: [00:01:55] Excellent. Thanks very much. All right, John, what’s your background? What brought you to OpenTeams?

John: [00:02:02] Well, good morning, good afternoon, or good evening, depending on where people are. I’m John Pantano. In my Master’s and PhD work, we basically converted reservoir engineering simulations into geological simulators, and over the years I’ve done a lot of computational scientific prediction and modeling in a number of different languages. I’ve really seen how OpenTeams is set up to take advantage of this new stack, which will be able to solve some problems that were unsolvable in the past. So I’m a really big supporter of open source, and I’m looking forward to its adoption in energy and environmental work.

Brian: [00:02:53] Very good. Welcome to you both. All right, on to the discussion. Before we get into specific topics, I think it would be useful to introduce the specific parts of the energy sector that each of you will be bringing your perspective from. So, John, as I understand it, a lot of what you’re planning to speak from is on the upstream side. Is that right?

John: [00:03:15] That’s correct. Yeah.

John: [00:03:17] Basically everything from figuring out where to drill the well to how to optimize the field after it’s been operated for 30 years.

Brian: [00:03:27] Gotcha. And Marc, you are coming more from a holistic perspective, the entire operation. Is that right?

Marc: [00:03:35] Right. I worked for a couple of months, probably even a year, on a project whose goal was to build a platform for ESG or CSR reporting. That’s the reporting you need to do from a sustainability point of view, applied to the energy sector. So there was a lot of work around data acquisition and data sourcing, basically a lot of the things we’re going to talk about in this session.

Brian: [00:04:12] Okay, very good. So let’s stick with you, Marc. One of the things you need to know in order to work with data is: what kind of data are you going to be getting? How much of it is there? What timescales are you going to be using it on? How does that factor into the experience you’ve had?

Marc: [00:04:29] Well, for the reporting side of things, you do need to acquire a lot of data throughout the whole organization. You have a challenge there because there are so many different kinds of data that you need to acquire: measurements, for example, volumes of certain things, pollution, waste. But the CSR report, or ESG report, and I believe CSR is EU-specific, basically deals with a complete view of an organization from the point of view of where it’s based in society and what the effects of running the organization are within that society. So you don’t only look at the things that are produced, or the waste that’s produced, or the pollution that’s caused; you also focus on, for example, employees and how they are being treated in the company: diversity, employee churn, that kind of thing. Health, for example: how many incidents, injuries, maybe even fatalities do you have in a year? Those kinds of things are what you need to capture. So it’s a very diverse set of data that you have to manage, and you have to put all of it into context for reporting purposes.

Brian: [00:05:58] So, John, for you, for the upstream data, what types of data and what types of timescales are you looking at?

John: [00:06:04] Yeah, it’s very similar for us; a lot of the data is segmented into different organizations. For example, geophysical data will be terabytes of information acquired to get an image of what’s in the subsurface, which the geologist then interprets, alongside data like digital rock measurements from wellbores. That’s on a completely different timescale from the production engineers, who have readings being recorded at 15-second intervals that then get aggregated up into monthly reports. And then the reservoir engineer, who is studying everybody’s performance and trying to optimize things, looks at data that is aggregated in a different manner altogether. So you’ve somehow got to figure out how to close the loop and make all of this information consistent as people try to make decisions from it.

Brian: [00:07:22] So, yeah, you touched on a couple of challenges there. Let’s dig a bit deeper on that. How carefully do you have to track where the data is coming from? How concerned do you have to be about where the data is going to be used? What are some more practical considerations in the data collection there?

John: [00:07:46] You know, one of the big challenges is getting communication going between the groups about what data is available. With the current tools, you can catalog information better and then broadcast that catalog, so people are aware that certain information exists, as opposed to relying on personal networks to know where the different data is. So one of the big challenges is to make it easier for people to know what’s available, and then how to get that information in a usable format.

Brian: [00:08:34] Makes sense. Similar challenges from your perspective, Marc, with data location and cataloging and such?

Marc: [00:08:41] Yes, definitely. First of all, there’s a lot of data that you have to collect for these reports. A particular challenge is that all of this data has to be mapped to the organizational structure of the company, and typically these companies are huge. The company I was working for had more than 70,000 employees, spread across different locations and different countries. You have to make sure that you get the data from all the different sites the company has, from all the different sources: where you produce energy, where you transport it, the networking side of things, where you send the energy, and then also the consumer side, such as how much energy you actually distribute to certain industries. And you have to make sure that all this data actually gets acquired and is correct. In the particular case of the CSR or ESG report, you also have to make sure that it’s actually approved, because these reports have legal effect, so companies need to be sure that the data shown in them is correct.

[00:09:56] So you have not only, let’s say, a data engineering perspective on things, but you also have to take care of all the processes needed to make the collection work. Sometimes you have to remind people to send in data. Sometimes you have to make them aware of certain deadlines that need to be met. Because at the end of the day, if you want to produce the report, you have to have everything in place, right? You cannot just sit there and wait for maybe five people in your 70,000-person organization to submit the final data points needed for the report. So that’s a challenge for sure.

John: [00:10:41] One of the things Marc brought up is that you have different countries and different organizations, so you also have to have a dictionary that maps things one to one, so that apples are apples and oranges are oranges, even though somebody might be breaking apples up into different varieties of apples. To take advantage of artificial intelligence and machine learning, you have to somehow transform the data so it’s all on the same set of labels.
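
John’s dictionary idea can be sketched in a few lines. Everything below is illustrative, not from any real taxonomy: a shared lookup table normalizes labels before data sets from different groups are combined.

```python
# Hypothetical mapping dictionary: different groups label the same thing
# differently, so a shared lookup normalizes labels before combining data.
LABEL_MAP = {
    "Gala": "apple",
    "Fuji": "apple",
    "apple": "apple",
    "Valencia": "orange",
    "orange": "orange",
}

def normalize(records):
    """Map each (label, quantity) record onto the shared vocabulary."""
    return [(LABEL_MAP.get(label, label), qty) for label, qty in records]

# One site reports apple varieties, another reports generic labels:
print(normalize([("Gala", 3), ("Valencia", 2), ("apple", 1)]))
```

In practice the lookup would come from a maintained data dictionary rather than a hard-coded constant, but the shape of the problem is the same.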

Brian: [00:11:23] Let’s dig into that a bit more: those challenges of data transformation and synchronizing units. Certainly you have to have everything on an apples-to-apples basis. What are some more data conversion challenges that arise as you’re trying to get all of the data you’ve collected into a form that is usable for downstream analysis?

Marc: [00:11:49] For example, when you do reporting, let’s say you get data from the UK; then you typically have imperial units. And if you get data from Germany, you have metric units. So you have to do the conversion there, or you have to make sure that everything is reported in the right unit. That sounds easy, but in reality it’s not always easy, because people naturally assume that their own system of units is being used, and they just combine data in different units without paying attention and doing the conversion properly. Another issue is that you sometimes have different ways of measuring things, for example pollution. There you use a standard called the CO2 equivalent, and regardless of what you’re measuring, you have to convert the pollution you’re causing, whatever is causing it, into these CO2 equivalents. Ideally, you would want a system that takes care of these conversions for you.
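
As a rough sketch of the two conversions Marc describes (mixed mass units and CO2 equivalents), here is a minimal example. The unit factors are standard, but the global-warming-potential (GWP) values are illustrative 100-year figures; a real report would use the factors mandated by the applicable standard.

```python
# Normalize mass readings reported in different units to kilograms.
POUNDS_PER_KG = 2.20462

def to_kg(value, unit):
    """Convert a mass reading to kilograms."""
    factors = {"kg": 1.0, "t": 1000.0, "lb": 1.0 / POUNDS_PER_KG}
    return value * factors[unit]

# Illustrative 100-year GWP values (kg CO2e per kg of gas).
GWP = {"CO2": 1.0, "CH4": 28.0, "N2O": 265.0}

def co2_equivalent(readings):
    """Sum heterogeneous gas readings into one kg-CO2e figure."""
    return sum(to_kg(value, unit) * GWP[gas] for gas, value, unit in readings)

readings = [
    ("CO2", 2.0, "t"),    # German site, metric tonnes
    ("CH4", 50.0, "kg"),
    ("N2O", 10.0, "lb"),  # UK site, pounds
]
print(round(co2_equivalent(readings), 1))  # one comparable kg-CO2e total
```

This is the kind of bookkeeping that a units-aware system would handle automatically instead of leaving it to each reporting site.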

Brian: [00:13:02] So this is CO, methane, nitrogen oxides, that sort of a thing getting converted?

Marc: [00:13:05] Right.

John: [00:13:07] And that’s why open source is so great: everybody’s got the same problem, and instead of it being solved 20 different times, or 20,000 different times, making it open source takes the pain point away. That’s one of the reasons to get together: not just to use the open source packages, but to build on a units package that makes this more automated.

Brian: [00:13:46] Yeah, I was actually just going to bring up the units-aware numeric tooling that’s under development. It’s something I’ve been paying attention to for some time in Python: you’ve got Pint, and I know NumPy is looking into it. Having the ability to directly handle units in the code, instead of having to maintain an index of units, is a very attractive way to remove some of that complexity, at least I think so. So, thinking about CO2 emissions and environmental considerations: how significant are seasonal effects in the data that each of you has to deal with, both in upstream prospecting and in holistic reporting? Marc, how does that factor into what you’re trying to do?

Marc: [00:14:38] Well, it depends a bit on the timescales you’re looking at. The official CSR reports are annual, so it’s a yearly report, and there it is possible to compare year to year even with seasonal effects in the data, because you simply aggregate across the whole year and then you don’t see them. But if you go to other timescales, for example monthly or quarterly, then of course you have to take those things into account. You cannot compare the data you receive for, let’s say, winter with the data you get for summer, because there are seasonal effects in that data which you’d either have to factor out, or you basically say these two seasons cannot be compared: you have to compare winter to winter, because otherwise you’d be comparing apples to oranges, right?

Brian: [00:15:28] Sure. John for the upstream, is there a lot of seasonal effect there? Or are you looking at time scales that make that less important?

John: [00:15:37] There are some seasonal effects, and there are some diurnal effects, which are just the daily variations. If you’re looking at how many miles of well are drilled, it’s always going to slow down during the winter, and there are some monthly metrics, the way people’s bonuses are actually calculated, where adjustments have to be made for those sorts of things. So there are definitely seasonal effects that have to be taken into account. But in the upstream, it’s more about how you aggregate things. Instead of calendar days, you have days on production, and maybe even hours on production. People look at monthly decline curves, but if the well was only online for 20 days, that throws off the analysis. So again, having an open source package that takes care of adjusting things to the proper timescale is very attractive, so that not everybody is struggling with it in their individual analyses. Right now it’s easier just to spend the extra day or two massaging the data, and it would be better if that day or two were spent thinking about the data. That’s where this new software stack is attractive: it will allow us to think more about the problem instead of doing data manipulation.
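
The calendar-day versus producing-day distinction John draws can be sketched as follows; the well volumes and downtime are made-up illustrative numbers.

```python
# A rate computed over calendar days understates a well that was offline
# part of the month, so decline-curve analysis normalizes by days on
# production instead.
wells = [
    # (month, volume_bbl, days_on_production, calendar_days)
    ("2023-01", 6000, 31, 31),
    ("2023-02", 4000, 20, 28),  # well offline for 8 days
]

for month, volume, days_on, cal_days in wells:
    calendar_rate = volume / cal_days   # misleading when the well was down
    producing_rate = volume / days_on   # the rate the decline curve needs
    print(month, round(calendar_rate, 1), round(producing_rate, 1))
```

For February the calendar-day rate suggests a steep decline, while the producing-day rate shows the well actually performed better, which is exactly the kind of adjustment a shared package could standardize.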

Brian: [00:17:31] So that’s something of a gap in the coverage of the open source tooling: doing that automatically. That would be a good focus for development.

John: [00:17:42] Yes, very good focus.

Brian: [00:17:45] Okay. Good to know.

Marc: [00:17:46] It would definitely also help if there were more experts going into this area of open source, because it’s typically very hard to find experts for the different types of energy that you’re looking at. Normally you really have to reach out to them; you have to first find them in your organization, and that alone is already a challenge. But I think it would be beneficial for everyone if more of these experts actually contributed to open source, to get this expert knowledge out of their teams and into the open, because it would really benefit everyone and we would have less duplication of work.

Brian: [00:18:37] It’s an example of a broader trend, or perhaps need is the better word, in the modern world and economy: having multi-skilled experts with a foot in, in this case, both the open source programming world and the scientific and engineering world, who can build good open source that’s informed by that scientific and engineering expertise.

Marc: [00:19:04] Yes, I would very much welcome it if companies would really facilitate this. They could make it possible for those expert teams to contribute to open source, for example, and they could fund this or make contributing to these open source projects more popular within the organization.

Brian: [00:19:25] And I think we’re seeing some signs that there is movement in that direction, but it’s still early. It would definitely be great to have it become a much bigger thing.

John: [00:19:34] There definitely has to be that corporate sponsorship that says: you’re a domain expert inside our company, and we’re going to give you a certain amount of time to work in the open source area. There are domain experts doing machine learning, artificial intelligence, and data engineering, but again, they’re not sharing. There’s an organization called Software Underground that is doing a good job of trying to get more corporate contributions in the upstream.

Brian: [00:20:19] Very cool. So, yeah, we want to make sure we spend a little bit of time on AI and ML; it’s a good transition since you mentioned it there, John. How extensively are artificial intelligence and machine learning used in both of these areas, perhaps not so much in the data collection, but on the analysis side? And where do data collection practices stand relative to where they need to be in order to feed high-quality data to AI and ML analysis systems? Thoughts on that, John?

John: [00:21:09] There was an interesting poll on LinkedIn that asked, “How many data scientists do you have available to you?” The majority of the people who responded said one to five. What’s going on, in my mind, and this is a personal opinion, is that people are all doing about the same thing: individually applying machine learning and artificial intelligence to seismic data, to well log information, to reservoir simulators, trying to move away from physics-based systems, because the physics-based systems just take up a lot of compute power. It’s to improve prediction capabilities with less compute power that machine learning and artificial intelligence are attractive. But you have to train on that information, and that’s where I see us hung up: on getting all the data available in the right format, apples to apples, to get that training done.

Brian: [00:22:24] So, one follow-up there. With seismic, how high a dimensionality of data are these models trying to work with? Is it fully 3D?

John: [00:22:34] Fully 3D. People do work on 2D sections, but it’s fully 3D data that is being collected, and it’s even four-dimensional, in that there’s time-lapse seismic. So there are different things where, computationally, you can say: here’s a snapshot at 100 days into production, and here’s a snapshot at 400 days into production. What’s the difference between them, and how can we get more oil or gas out of the system? The data is being acquired, and we’re still struggling to figure out the best way to process it. The big data engineering tools, things like Dask and Xarray and other packages that can deal with data in a much more massive way, look quite attractive to me.
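
A toy sketch of the time-lapse idea, assuming NumPy: difference two 3D snapshot volumes and flag where the amplitudes changed. The array size, change region, and threshold are all illustrative; real surveys are orders of magnitude larger, which is where tools like Dask come in.

```python
import numpy as np

# Two 3D amplitude volumes acquired at different production times.
rng = np.random.default_rng(0)
baseline = rng.normal(size=(4, 4, 4))   # survey at 100 days into production
monitor = baseline.copy()               # survey at 400 days into production
monitor[1:3, 1:3, 1:3] += 0.5           # a localized production effect

# The 4D difference volume highlights where the reservoir changed.
difference = monitor - baseline
changed = np.argwhere(np.abs(difference) > 0.25)
print(len(changed))                     # voxels flagged as changed (8 here)
```

The same differencing pattern scales to survey-sized arrays by swapping the NumPy arrays for chunked Dask or Xarray equivalents.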

Brian: [00:23:42] For sure. Marc, how about you? Perspectives on AI and ML?

Marc: [00:23:47] Yeah, well, in that particular project we did not really use ML or AI for much, because most of the focus was on actually getting the data in the first place. But of course you can use ML for these things quite a bit. For example, you can use ML for outlier detection, or to identify trends, and this is useful for reporting to, for example, high-level management. You can also make that data available to data scientists in different parts of the organization on a self-serve basis, so they can build their models by tapping into the data being collected; that’s certainly very useful if you want a data-driven organization. Another area where ML could provide more efficiency is in actually writing these reports. You can use text models for this; it should be possible, I would imagine, to give the data to one of these models, have it write at least a draft of the report, and then have the people in charge fill in the gaps or correct things.
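
A minimal sketch of the outlier screening Marc mentions, flagging submitted data points that sit far from the rest before they enter a report. The z-score threshold and data are illustrative; a robust statistic (median/MAD) would behave better on small, skewed samples.

```python
import statistics

def flag_outliers(values, threshold=2.0):
    """Return the values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > threshold * stdev]

# Monthly site submissions with one mis-keyed entry (illustrative numbers).
monthly_emissions = [101, 98, 103, 99, 100, 102, 970]
print(flag_outliers(monthly_emissions))  # flags the 970
```

Screening like this would route the suspect value back to the submitting site for confirmation before the report is approved.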

John: [00:25:07] Building on what Marc mentioned: if there were digital twins in that portion of the industry, machine learning and artificial intelligence could go and find the most optimal practices; maybe some facility in the UK has figured out a way to get lower emissions than a plant in Germany. What’s really attractive is building digital twins that are calibrated on historic performance, because then you can do what-ifs, changing the operations in Germany to reflect what was done in the UK, or vice versa.

Brian: [00:26:00] Excellent.

John: [00:26:01] So there’s a lot of opportunity to optimize people’s operations to get emissions down, costs lower, and just a healthier work environment.

Brian: [00:26:18] Very good. So we’re coming up on the end of the time. Any brief last minute comments, either of you, before I close out?

John: [00:26:32] I would just like to say that the amount of data that’s being collected is just going to get bigger and bigger and bigger. And so we have to get smarter in how we address the use of this data.

Brian: [00:26:50] Absolutely.

Marc: [00:26:51] Yeah, to that point, John, I think it’s very important to come up with more and better standards for how to categorize data. One of the major challenges in this whole data science and data engineering space is that even though you can collect and store lots of data, making that data accessible to the right people is hard, because they won’t necessarily find the data that you have, and typically only a few people in the organization know where the data is and what it’s about. So we need to put more emphasis on data cataloging, and maybe come up with better ways in different industries to map, categorize, and identify data in a standard way, so that it can easily be made available to everyone.

Brian: [00:27:49] That’s a very good call to action, and definitely something we need to pay attention to. I think we’ll wrap it there. Thank you very much, Marc and John, for participating in today’s session, and thanks to everyone in the audience for attending; we really appreciate our experts taking the time to weigh in. The recording of the event will be available soon, and information about it will be distributed to all participants once it’s ready. We’ll be continuing with the next session shortly. For now we’ll wrap up, and we look forward to having you continue participating as the event continues. Thanks very much to all, and we’ll see you in a bit.

Marc: [00:28:29] Thank you.

Brian: [00:28:30] Thank you.