Best practices for successful big data implementations

About

Travis Oliphant, CEO at OpenTeams and Quansight, founder of Anaconda, NumFOCUS, and PyData, and creator of NumPy, SciPy, and Numba, will share best practices and suggested technologies he and his team use to help telecom companies build successful big data implementations. He will also offer guidance on how long projects take, what type of team and skills are required, and what technology he recommends.

Come learn how to make your big data implementation projects successful and what tools, teams, and processes you need to ensure optimized delivery and maintenance.

Travis Oliphant
Founder of OpenTeams, Quansight, Anaconda & NumFOCUS. Creator of NumPy, SciPy & Numba

Transcript

Brian: [00:00:00] Everyone, welcome to the first session of today’s tech shares event. Today’s event is titled Top Big Data Opportunities for Telecom Companies. Our tech shares events are a forum for technology leaders to connect with business leaders to help their organizations create better software solutions that use open source software more effectively. Originally on the schedule this morning was a conversation with Lalitha Krishnamoorthy. Unfortunately she was not able to make it, so instead we have Travis Oliphant in to discuss the same topic: best practices for successful big data implementations.

[00:00:49] Travis is a well-established leader in the Python data science community. He has a wealth of experience developing applications and improving software for a wide variety of companies, Fortune 500 and beyond. Travis led the creation of cornerstone open source projects, including NumPy, SciPy, Numba, and Conda, and organizations such as NumFOCUS and PyData, and he founded the companies Anaconda and Quansight before founding OpenTeams most recently. He holds a doctorate in biomedical engineering from the Mayo Clinic and a Master of Science in mathematics and electrical engineering from Brigham Young University. Travis, welcome. Thank you very much for participating in the event today.

Travis: [00:01:28] Thanks Brian. Appreciate that.

Brian: [00:01:31] Of course. So again the first topic that we have is best practices for successful big data implementations. So let’s start there. What are some of the most essential big data best practices when you’re designing things like IT infrastructure for data analytics?

Travis: [00:01:49] Yeah, great question. It’s an important question for a lot of people these days because the technologies are changing all the time and have been changing for the past ten years. This is where we are: you’re getting second fiddle, since Lalita was specifically chosen for these questions. So I’ll do my best. I definitely have my own experiences that I’m eager to share, but we’re definitely missing Lalita this morning. I have worked with a lot of companies trying to handle and get ahead of this big data.

[00:02:20] Anaconda was founded around the big data wave of the 2011, 2012 time frame, when big data was all the rage, and we were noticing: okay, great, there’s a lot of emphasis on big data, especially scalable big data. But we were still finding that a lot of our customers were using Python and really, most of the time, just needed a bigger machine and a bigger hard drive. They didn’t necessarily need scalable computing. I think that’s one of the key things to keep in mind: there’s always somebody who’s trying to sell you something. Don’t let the person trying to sell you something frame the conversation. You ought to be able to frame the conversation according to your actual requirements. What are your actual requirements? What are your actual needs? Do you need 1,000 machines? Do you need a massive store?

[00:03:06] A terabyte of data is not big data these days. A thousand terabytes starts to be big data. An exabyte, definitely big data. So you have to make sure you’re not falling into that trap. We would find people who would say, “I have big data,” and they’re really talking about 10 gigabytes or 100 gigabytes. If you can find a USB drive to stick your data on, it’s not big data. That said, there are large data sets. So I think one best practice is: don’t be sold. Keep the focus on what your actual needs are and recognize that there are a lot of voices out there that are just trying to sell a product.

[00:03:43] One thing I love about the OpenTeams model is that we’re actually there to help make people aware of the open source solutions that exist. Now everybody knows that open source is everywhere and it’s dominating the substrate of technology, and a lot of big data solutions are in fact open source. But understanding that open source and understanding where it fits is a big deal. So keep the focus on your business and don’t get carried away by some marketing voice; I think that’s a really critical point. Another is making sure you have the right leadership around the team you’re asking to do the project. You’re not going to solve a problem by just hiring a bunch of people. Going in and saying, hey, I have this big data problem, cool, let me go buy a bunch of machines, put a bunch of people on it, and hope that something emerges from that is not a good idea.

[00:04:31] You’re going to need somebody who’s accountable and taking charge of the business needs, and then probably several layers of that as you go. And then there’s finding the right people for the job. Don’t try to hire all the people in-house. I think that’s a mistake a lot of people make, because they think their best option is to go hire the whole team. Now, you do need to hire key people, but you also need to recognize that there are a lot of smart people outside your organization who can help you. And that’s okay, as long as they’re organized well and connected to your business goals and your business needs.

[00:05:05] I also think you need to make sure you can look into your data, so there’s some kind of data visualization: make sure you’re aware of what’s there and what you’re seeing, and that you can regularly look at it. What’s your data visualization strategy, what’s your data visualization story, and more generally, what’s your dashboarding mechanism, your way of getting the information from your data to people? Because data just sitting there isn’t enough. I would often show a picture of a big data lake house, sort of off in the woods where nobody can see it. A lot of times you will have big data, but it’s just sitting there and you can’t make use of it.

[00:05:38] Making use of it means it’s got to go through the minds of your people. To go through the minds of your people, there are visualizations, there’s code, there are reports; things have to be part of your processes to make sure it’s real. So I think iteration is critical. Recognize it’s not like you’re going to have a big monolithic project that gets to an end state and then you’re done. Actually it’s going to be an iterative process: how do I get my company, my organization, the people in the organization to get around that data and to be iterating with it?

[00:06:08] And so this is where things like Jupyter notebooks are really popular, because they give you a chance to do lots of exploratory visualizations. For most people, I recommend using notebooks as a way to get into your data and start with it. If people are used to notebooks, they can say, oh, I can do a query. Then have your data available through some kind of SQL or Parquet files; there are many languages with solutions that let you put a data frame in front of people. So get a data frame, start to look through it, and then start to analyze it. Once you figure out what you’re going to do, you can operationalize that and put it into production.
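As a minimal sketch of that notebook-style exploration (the file name, table name, and connection string below are invented placeholders, not anything from a real project), assuming pandas and optionally SQLAlchemy:

```python
# Minimal exploratory sketch; "calls.parquet" and the SQLite URL are
# hypothetical placeholders for whatever data source you actually have.
import pandas as pd

# Option 1: a Parquet file sitting on disk or in object storage
df = pd.read_parquet("calls.parquet")   # requires pyarrow or fastparquet

# Option 2: the same idea against a SQL source
# import sqlalchemy
# engine = sqlalchemy.create_engine("sqlite:///calls.db")
# df = pd.read_sql("SELECT * FROM calls LIMIT 10000", engine)

# First-pass exploration: shape, dtypes, a quick summary
print(df.shape)
print(df.dtypes)
print(df.describe(include="all"))
df.head(20)   # in a notebook, the trailing expression renders as a table
```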

[00:06:47] Notebooks or similar kinds of interfaces give you that iterative approach to understanding your data. I also think you have to take a hard look at the question: am I going to use cloud? How do cloud and on-premises interact for you? You’re going to make use of cloud in your big data solution, not only because there’s a lot of data in the cloud that you’ll want to connect with and join with, but also because it’s just more efficient to do large-scale exploratory projects there rather than having to buy a bunch of machines when you don’t know how many of those machines you’re going to need.

[00:07:19] Go try out a new idea in the cloud. Then, when you figure out what you’re going to promote, pull it back; you can pull back the piece you want on premises, and you’ll save some money that way. The last point I’ll make is just governance: governance and compliance issues. In some areas of big data, like health care and financial data, there are compliance issues you have to be aware of and make sure you’ve got somebody paying attention to. But also, everybody is concerned about their security posture and which of the data they have is more sensitive than the rest. So anyway, that’s a rundown of a bunch of things to think about.

Brian: [00:08:00] Very important points, all. And definitely the security elements are evergreen. It seems to me governance is something that has been lagging, from what I’ve seen, in awareness of just how important it is. But it’s catching up, and there are tooling and processes coming up to meet that need.

Travis: [00:08:17] Yeah.

Brian: [00:08:19] So that naturally leads to the flip side of best practices: what are some of the challenges that organizations run into as they attempt to implement big data, data-driven solutions, things that trip them up or end up preventing them from achieving their goals?

Travis: [00:08:37] Yeah, it’s interesting. I think one thing the past ten years of data-driven focus, a lot of marketing focus around big data, has done is help improve some of the cultures. There are anecdotes from yesteryear where there’d be people who owned the data. If you wanted access to a particular data table or piece of data, there was that one person in accounting or one person in HR you had to go to; they were the gateway. There was no real process, but I’ve heard stories from people who basically said, yeah, well, on Tuesdays I’ll take a candy bar to this person and then they’ll send me a little information. So it’s like, okay, your process is bribe and release.

[00:09:23] The reality is that’s real. A lot of data in organizations has been siloed, owned, and controlled by certain gatekeepers, and one thing I think has been helpful is breaking down those barriers. Even though I’ve not been a big fan of some of the big data technologies of past years (I think they’re much better now), one thing that has always helped is breaking down these silos of data ownership and recognizing: hey, we need to put security protocols and access controls around data that not everybody should have, but we do need to be able to share and understand that data with more consistency and more regular practice, and not make it this bribe-and-release approach to data.

[00:10:04] So I think one of the challenges is just the fact that there are different organizational elements. There’s a principle, a law really, an observation, that most software evolves to reflect the organization that created it. Data policy is the same way: data policies evolve to reflect the organization that installed them. So you’ll have data siloed, and usually that’s done for security purposes; a lot of times there’s justification for it. So that’s one of the challenges: okay, we have these necessary divide-and-conquer approaches, we know we can’t have every decision running through a big committee, we’ve got to give people independence of management. So how do I have the data they produce come to someplace where it can be shared?

[00:10:54] So I think that’s one of the natural challenges. Another one I’ve observed that is not often talked about is the fact that if you are trying to get information, you have a purpose for that information. I have data, and I have a purpose for analyzing that data, so I care about certain attributes. Let’s say you’re renting houses: you want information about the renter, and there’s also information about the property. Now say you’re selling houses: you care a little bit about the renters, but you care more about the buyers of the house.

[00:11:28] So you want different data depending on what you’re trying to produce; the dashboard you’re creating needs different data. You end up with these groups of people that have different needs for the attributes. The question becomes: who’s going to maintain all the joint attributes? People care about the attributes they care about; they don’t necessarily care about the others. So really you end up with a system where those attributes can sometimes overlap, but with different framings.

[00:11:55] For example, somebody in the UK uses UK English spellings and somebody in the US uses US spellings. Those are minor differences, but they illustrate that whoever is in charge of the data cares about what they care about, and somebody else who wants to use that data may care about it differently. Are you going to push your preferences onto their database? In practice, what happens is data has a starting point, but then people need to use it differently, so there are adaptations of data. And once you have an adaptation of data, who’s going to maintain that adaptation? Are you forcing the original creator of the data to maintain your preferences?
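A toy sketch of what such an adaptation layer can look like in code; all the column names and values here are invented for illustration:

```python
import pandas as pd

# Hypothetical extract owned by another department (UK spellings, their codes)
source = pd.DataFrame({
    "property_id": [101, 102],
    "colour": ["grey", "red"],
    "tenure": ["let", "owner-occupied"],
})

# The consuming team's adaptation layer: rename to its own schema and
# derive the values it actually cares about. Someone has to own and
# maintain this mapping whenever the upstream schema changes.
adapted = (
    source.rename(columns={"colour": "color"})
          .assign(is_rented=lambda d: d["tenure"].eq("let"))
          .drop(columns=["tenure"])
)
print(adapted)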

[00:12:38] A lot of people imply that. They haven’t realized, oh, wait, I’ve got to be in charge of those adaptations and own them; I can’t just assume that some other department is going to give me information that I care about but they don’t. So I think a lot of organizations actually struggle fundamentally with that reality. It’s that fundamental: oh, these business issues are a little different, and it translates to the schema. The schema this person wants is different from the schema I want, so I do have different data even though they share common elements.

[00:13:11] So that’s a fundamental business challenge, and a bit of a people challenge. It does lead to: okay, can we find a common substrate that can be shared? But of course, shared coalitions can often lead to those challenges too. So anyway, those are some of the challenges I think about. Some are people challenges; they’re natural organizational, human challenges. A few are technology, and the technology ones can be overcome. Most of the challenges, I think, are about information and thinking, and also people’s framing, and the fact that there are a lot of people out there pushing an agenda. I don’t mean that pejoratively; I just mean people have a thing they’re trying to produce, and the thing they’re trying to promote may not be the thing you care about.

Brian: [00:13:55] One of those challenges, relating to what you just said, is I think captured in the emergence of data engineering as a relatively new discipline: dealing with those different needs and making sure that everyone who needs data a certain way can get it in a clean fashion.

Travis: [00:14:11] It’s an excellent point. Data engineering is about data in motion, data moving from one place to another, and recognizing that’s a thing that needs to be managed and maintained. Very good.

Brian: [00:14:27] So you also mentioned security; that was touched on in the first question. Do you have any thoughts about specific elements that an implementer of a big data project needs to think about in terms of their data security?

Travis: [00:14:41] Yeah, security is definitely one of those challenges. The ultimate approach to security is we just don’t share any of the data: the best way to lock it down is to lock it down so nobody sees it. So there’s this tension between, well, I want to be secure, but I also need to use the data. What’s that tradeoff? You end up with different philosophies in different organizations about how to handle that natural tension. My view is that it actually depends on the story and the context: what’s the risk associated with a security breach? The higher the risk, the more stringent and careful you’re probably going to become.

[00:15:24] But there are some things where the security risk is low, or the consequences of a security breach are relatively low, so we don’t need to lock those down quite as hard. Understanding the context of security is a challenge, because you often end up with, well, let’s get the security person in. And security people by nature are kind of like lawyers. Lawyers by nature are looking for reasons why things may go wrong; that’s what they’re paid to do. I love lawyers, they’re awesome, but as a business owner, lawyers work for me. They provide me information about risk assessment and things that can go wrong, and then I have to make a choice as to how I’m going to deal with it.

[00:16:05] A lot of times security is the same way. Somebody, or some group, has to make a decision, and that group is ultimately accountable for the decision, not the security professional they hired to help. A lot of times that’s hard, because being in charge means you’re accountable, and it’s risky to be accountable. People often try to push accountability somewhere else and save their own skin; I think that’s a fundamental human trait. A lot of businesses end up with little eddy currents of behavior, organizational challenges that are rooted in that fundamental human tendency to avoid blame and avoid decision making, to push decision making off to somebody else.

Brian: [00:16:53] Indeed. Yeah.

Travis: [00:16:55] So security has to be taken care of. It depends on the context, and you’ve got to make sure you’re getting the right people and the right advice. If you’re a decision maker, make sure you recognize advice as advice, and don’t just hand accountability over to somebody who ultimately doesn’t have it while pretending they do.

Brian: [00:17:16] I guess recognizing that the security situation evolves over time.

Travis: [00:17:21] I think it does. Again, it depends on your situation. There are certain areas of data security where there’s a legal framework you have to recognize; it’s more than just your business decision, and there are legal consequences of having a breach. In other cases there are business consequences of having a breach, which can also have legal elements. On one hand there’s criminality or civil lawsuits; on the other hand it may just be civil, or maybe just business.

Brian: [00:17:52] Business effect, impact.

Travis: [00:17:56] Or you lose business. That’s the primary one. You [Inaudible] and you lose business.

Brian: [00:18:00] Yeah. It does not look good when that happens.

Travis: [00:18:04] Correct.

Brian: [00:18:05] So, changing gears a little bit, back to the data technology side. When a company or an entity is trying to implement a unified big data project, a lot of the time they have a wide variety of data sources distributed over a wide variety of systems that all need to be pulled together. What are some of the challenges associated with that, and what are the tooling or procedural best practices for starting to pull all those threads together into a unified whole?

Travis: [00:18:39] It’s a great question. That really gets back to the story I was describing before, of different departments having different schema needs. A universal schema will generally never work. It’s the same problem as saying, hey, we’re going to have this universal object that all use cases are going to love. Sometimes you can. If you can find it, awesome. If you can find a schema or a description of data where many people share a common goal, that’s fantastic: double down on it, grow it, create it, because you have multiple people who are going to consume it and you can really make a solid solution.

[00:19:13] But with that data integration problem, there’s a part of it that’s fundamental to the domains. So recognize what part of your integration problem is fundamental and what part of it is simply proximal, simply technology. The technology integration problems can be overcome. The fundamental integration problems, the abstraction integration problems of different departments with different schema needs, you can’t overcome with technology; you can only pretend to paper over them. I think a lot of organizations have never acknowledged that fundamental abstraction problem of data integration.

[00:19:47] And so they think it can be solved with better technology, but it can’t. You can maybe solve it by agreement on a better abstraction, on a subset of abstractions, but that’s the hard work. I’ve seen that, for example, in large commercial banking organizations or investment banks, where they’ve got many derivatives, and those derivatives are each being produced by different teams of traders. Pulling that all together was a tremendous amount of work to create a technology substrate that captures the full schemas for everybody. There was some interesting technology involved that I was able to witness, but the fundamental issue wasn’t that they didn’t share a common tech.

[00:20:28] The fundamental issue was that they had different ways of thinking about the problem, and the schemas that resulted from those ways of thinking were different. The audience is full of people, some with technology experience, but everybody understands the notion of here’s my table, here are the columns on the table, here are the attributes I’m keeping track of. There’s a one-to-one mapping between data schemas and object-oriented technology; it’s actually really fascinating, and that mapping sort of exposes this. So you see the same problems in technology development.
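As a tiny illustration of that one-to-one mapping (the table and field names here are invented), a class in an object-oriented language mirrors a table schema almost directly:

```python
from dataclasses import dataclass

# Hypothetical example: a "properties" table with three columns maps
# one-to-one onto a class with three attributes.
@dataclass
class Property:          # table: properties
    property_id: int     # column: property_id
    postcode: str        # column: postcode
    monthly_rent: float  # column: monthly_rent
```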

[00:21:01] Hey, I’m trying to get a group of people together to build a framework or an object-oriented structure to help a bunch of people. That’s really easy to do when you have a small group of people saying, here’s how we’re going to think about the problem. Then you take that and try to make it a widespread library. NumPy helped me realize this pretty substantially. NumPy is a tensor library, an array library, and arrays are used widely. It’s a common concept that’s been around for several decades; I didn’t invent the concept of the ndarray. That concept existed in Fortran and elsewhere.

[00:21:32] But the attributes needed? There is quite a bit of debate about what an ndarray actually needs to hold. It’s fairly uniform now: we at Quansight Labs helped produce something called Data APIs (go to data-apis.org), which produced an array standard. That standard started with NumPy and then was formalized in the Data APIs array standard. At the same time, Data APIs is also doing a data frame standard. Well, the array standard was a lot easier to get to. The data frame standard is still a work in progress, because fundamentally, while “data frame” is simple to say, you’re actually evoking different things in the minds of a lot of users.

[00:22:13] For some people, a data frame means missing data has to look like this; for others, indexing has to look like this. So the attributes and behaviors of that object are a little different, and we don’t share them. That’s an explicit example of this abstraction problem I’ve been talking about: different domains, different departments, different organizations may use the same word to describe a data record but actually care about different things. So the effort involved in unifying or integrating with somebody else who has a data source that says the same thing might actually be pretty significant, because they have different titles for things.
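As one hedged illustration (not an example from the talk itself), two popular open source data frame libraries already answer "what is a data frame" slightly differently around missing data and row identity:

```python
import pandas as pd
import polars as pl

data = {"region": ["north", "south", None], "revenue": [10.0, None, 7.5]}

# pandas: rows carry an index, and missing values show up as NaN/None
pdf = pd.DataFrame(data)
print(pdf.isna().sum())   # per-column count of missing values
print(pdf.loc[0])         # label-based row access via the index

# polars: no row index, and missing values are explicit nulls
pldf = pl.DataFrame(data)
print(pldf.null_count())  # per-column count of nulls
print(pldf.row(0))        # positional row access only
```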

[00:22:54] It may be simple, like, oh, they just use a different name for that same attribute. Okay, cool. But is it really the same attribute? Maybe it actually isn’t; maybe it’s not just a synonym but a little bit of a different framing, and so it has a different set of possible values the attribute can take. Anyway, there’s a fundamental thing here that really is about abstraction and how people think about and break down problems, so it’s not solvable just by having technology.

[00:23:20] Now, better technology can help expose the issue, help solve it, help us understand the differences between schemas, and help us make decisions as an organization. The organization can say, well, this department does this thing and this department does that thing: can we compromise and use a common thing? Yes? Okay, cool, let’s do that. No? Well, then is there a substrate that is in common? Then we have adapted things that departments use, that they can control and own, but they’re all built on a common ground that can be co-invested in.

Brian: [00:23:54] A communication problem to start really.

Travis: [00:23:57] Yes, and that communication is often subtle. It’s subtle because it has all the same problems of language communication, where I’m saying words and you’re hearing something that I didn’t quite mean.

Brian: [00:24:09] It’s making a picture in my head that’s different than the picture in your head.

Travis: [00:24:11] Yes, correct. It turns out data has the same kinds of problems. So data migration tools exist, and there’s some really great technology around data catalogs. For the past four years, I’ve been very excited by the data catalog movement, because what data catalogs are doing is exposing schema. Rather than worrying about the data itself, the data catalog’s purpose is to say: here’s a name, here’s the schema associated with this data, here are descriptions of it, here’s where it is, and here’s an easy way to grab it to help you solve your problem. I think that’s very helpful.

[00:24:47] Great data catalogs are going to make it easy for you to pull that data into your programming environment of choice, whether it be Julia or Python or R or Excel. You’re going to be able to just reference that data by name, as opposed to having to figure out, can we read this data in? Let me see, is it a Parquet file or a CSV? It’s almost like the data catalog should take care of that for you and pull it in. So you as a programmer, data scientist, or data engineer can just focus on: okay, now what? I need to drop these columns, tweak these ones, clean up these values, and that’s what my code can do. There are some really good tools like that, and a lot of other tools out there in that space, but data catalogs for data integration is one of my favorites.
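One open source example of that pattern is the Intake library, chosen here only as an illustration; the talk does not name a specific catalog tool, and the catalog file, dataset name, and columns below are hypothetical:

```python
# Sketch assuming the open source Intake library and a hypothetical
# "telemetry.yml" catalog that defines a source named "subscriber_events".
import intake

cat = intake.open_catalog("telemetry.yml")
print(list(cat))              # discover what datasets exist, by name

src = cat.subscriber_events   # no mention of Parquet/CSV/SQL here
print(src.description)        # description and schema live in the catalog
df = src.read()               # materialize as a data frame

# From here the analyst only worries about the analysis itself
# (hypothetical column names for illustration)
df = df.drop(columns=["raw_payload"]).dropna(subset=["account_id"])
```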

Brian: [00:25:28] Yeah. Lots of new tools in development to try to solve these big problems.

Travis: [00:25:32] Correct.

Brian: [00:25:34] As it happens, we are coming up on the end of the time slot, so I think we will wrap the conversation on this topic here. Travis, thank you very much for your time and for stepping in for Lalita. We wish her well and a quick recovery, and we really appreciate the time.

Travis: [00:25:49] Absolutely. It’s been great to talk with you, Brian. Thank you so much.

Brian: [00:25:52] Same here. Thank you. So we will pause for a few minutes to reset for the next session, which will be Travis again speaking about selecting open source software for your big data product stack. So hold tight. We’ll be back in a minute.

Travis: [00:26:07] Thank you.