OpenTeams - The Future of Software


E12: Peter Wang, CEO of Anaconda on data science, OSS and their communities

Note: Open Source For Business is produced for the ear and designed to be heard, not read. We strongly encourage you to listen to the audio, which includes emotion and emphasis that’s not on the page. Transcripts are generated using a combination of speech recognition software and human transcribers and may contain errors. Please check the corresponding audio before quoting in print.


Henry Badery: Hello, and welcome to Open Source For Business brought to you by OpenTeams. I’m Henry Badery and I’m the growth marketer here at OpenTeams.

In this episode of Open Source For Business, I was fortunate enough to speak with Peter Wang, who is the CEO of Anaconda. Anaconda is a distribution of the Python and R programming languages for scientific computing, and it’s known as the heart of the data science open source community around the world.

Peter was an incredibly engaging guest who gave really, really unique and insightful answers. And we managed to cover a lot in this episode, including the challenges Anaconda has faced using open source software as part of their business model, the importance of giving back to communities that you rely on, two common misconceptions that the business world has about software, the maturity of adoption of data science versus open source software, and finally, whether data science is a fad or a phase. This podcast is sponsored by OpenTeams, the first market network where users of open source software can find, vet, and contract with service providers. Now that the introductions are out of the way, let’s cue the music.

I thought we could start by going through a bit about your background and how Anaconda got started.

Peter’s background 

Peter Wang: Yeah, sure. So my background is… I guess, my educational background is formally in the field of physics. I studied physics and sort of quantum physics at Cornell. And then, shortly… well, just after graduating from college, I decided to go into the software industry, and I went through a couple of startups and small companies. And as I started doing consulting in using Python and the scientific and numerical Python tools, I started realizing that there was a potential for much greater impact for those tools. And so around 2012, I started the company, well, it was called Continuum Analytics when we founded it, but I started that company with Travis Oliphant, who is the creator of NumPy and SciPy. Travis and I really shared a vision about the potential impact of Python for data analysis in general. And so, that’s really the journey in a nutshell. We’ve done many things since then, but really it was, I think, a period of time in the mid to late 2000s when I really saw that there was much greater potential for Python, and that’s why we started the company, to really put Python on the map.

Henry Badery: Okay. And for people that don’t know, Travis Oliphant is the CEO and co-founder of OpenTeams, the title sponsor of this podcast. So you go a long way back, and I’ve seen you two interact at conferences. I know at the SciPy conference in Austin, it was quite funny. Travis was giving a talk and then Peter was at the back kind of asking these fantastic questions, and I could just tell they had been good friends for a long time. I also listened to a few of the videos on YouTube and found that it was kind of a bit of a thing that you do. You both see each other’s talks and spark debates.

Peter Wang: Yeah, Travis and I, it’s really quite a blessing. He and I are different people in many ways, but we share many, many of the same perspectives. To have two people who are quite different go through such different life journeys, but then have so many points of commonality in terms of the technical perspective, perspectives on the world and business and things like that, it’s really truly a… I consider myself greatly privileged and very humbled to have had this friendship all these years.

Henry Badery: Yes. And can you talk about the evolution from Continuum Analytics to what is now Anaconda?

Evolution of Continuum Analytics to Anaconda 

Peter Wang: Yeah, there was the evolution of the company itself. I see it as one company, of course; we renamed ourselves to Anaconda. I can tell you a bit about why we did that. And the reason was because of the product that we created very early on. We created the Anaconda distribution three months into the company, into Continuum Analytics. It was March of 2012. And so, we created this product, well, created this… I mean, it was a tool, it’s a distribution, what have you, but the idea was to bundle all the different dependencies you needed in order to really get started with Python for data analysis, and really Python for big data as well. So it wasn’t just Python for data, it was Python for big data, so that’s why we called it Anaconda.

And we really hit a nerve with that. A lot of people really were just like, “Oh, well, this is fantastic because now I don’t have to figure out all of these different dependencies and packages and how to install on these different platforms. These guys have done the hard work and we can just use it.” So, that took off like gangbusters, and by about the 2017 timeframe, we were like, you know what, we got tired of going to conferences and saying, “Hi, we’re Continuum Analytics,” and people are like, “Hey, okay, that’s great.” And then we say, “Oh, we’re the ones that make Anaconda.” And everyone’s like, “Oh, we use Anaconda, yeah, we love it.” Once it happens to you, like, a few thousand times, you’re like, maybe the world is telling me something.

Henry Badery: Maybe we should change it.

Peter Wang: Maybe we should change it, yeah. Many companies have done that, so we’re just following… we’re one of a long list of good companies that have done that.

Henry Badery: And how many users does Anaconda have today?

Peter Wang: It’s really hard to put a precise number on it, but to give you some sense of it: every month, we have about a million unique new downloaders of Anaconda and Miniconda based on looking at IP addresses. Every month, when we also look at how many people are using our package repositories and how many people are sort of repeat downloaders, that’s about four and a half million monthly active users. When you go and you look back 12 months, on a trailing 12-month basis, we have over 22 million unique IP addresses that hit us for package downloads, for installers and whatnot. And it’s been particularly interesting because IP addresses are a fairly crude measure. Some IP addresses hide 10,000 users. Some people bounce from coffee shop to home, to office. It’s a very crude measure, but as we do some data analysis on the average number of packages people download, what the usage patterns look like, what they’ve looked like over COVID when people are less mobile and more working from home, the numbers give us some confidence that we have users in the tens of millions across the world.

Henry Badery: Wow. That is incredible. And what an exciting journey to be a part of, seeing that grow into what it is today. But what are some of the challenges that you’ve faced using open source as part of Anaconda’s business model?

Challenges Anaconda faces using open source as part of its business model

Peter Wang: Yeah, it’s really interesting. Travis and I both are big believers in the open source community. However, we’re not… there are some people who are what I would call open source zealots, who have a very, very strict, like, everything must be free and open kind of approach; no proprietary, you can’t charge for it. There are people who are in that camp, and Travis and I are not in that camp. I think it’s important to make that distinction about how much we love and support the community and the growth of the innovation and maintainer community around the use of open source. But the open source software itself really, it’s an artifact, it’s a means to an end; it’s something that that community produces. So, the business model that we have at Anaconda is to, you know, we have a couple of different things.

We sell commercial products to businesses who are users of our open source. We create a lot of innovation libraries. We’ve incubated technologies like Numba, the compiler; Bokeh, the visualization library; and Dask, for distributed computing; and there are many others besides that we’ve funded over the years, but we don’t charge for those libraries. We do the innovation work, we work with the community and we continue to try to shepherd the open and free development of those things. What we do charge for is we charge for the commercial servers. And just recently this year, well we, I say we, but me really, I looked around and I realized that we needed to fundamentally change the economics around this open source community. And so, we made the decision to change the terms of service for our package repository service to where now, people who use it for commercial purposes who are at companies of a certain size, more than 200 people, we ask them to pay a modest fee, like $15 a month. And then the price goes down if you buy in volume, if you buy it for your company.

But the idea there was that I realized that the open source community, although it was really good at creating early innovation 10, 15 years ago, at this point in time, the world has changed a little bit. And so, the open source community has gotten very… it’s a whole swirl of different things. So there are big, big companies who use open source to try to capture users and capture devs into proprietary APIs. There are other people who use open source, smaller companies maybe, not big companies, but they use open source as a loss leader so that you get hooked on using this open source thing. And then, the only place you could possibly then go for the premium features is this one company, because they’re the only ones that maintain that open source. I call that sole vendor open source.

And I realized that all of these kinds of ways of using open source, by all measures, they are truly, they are bona fide open source, but there’s something missing in them. And the missing thing is that they’re not actually generatively reinvesting into a community of innovation. There’s a commons of innovation that’s not being invested in, and the only way that I could defend the PyData and the SciPy, the scientific Python community that I so cherish and love, is if we started this process of getting businesses that use these tools to just pay a small amount, but pay in. If we get everyone who uses it to pay in, we’d have more than enough to fund all the maintenance and to fund tremendous amounts of wonderful innovation. And so, it’s really at this point, I think that’s one of the ways that we make money now, is that we did change the terms of service. And we coupled that with a dividend program, which we can talk about in more detail, but that’s our commitment to giving back to the community.

Anaconda’s Dividend Program

Henry Badery: Okay. Yeah, just since we’re on that topic now, I thought it was a great initiative when I recently saw, at the end of October, that Anaconda had announced that they started this Anaconda dividend program. Can you talk a little bit about it? What is that?

Peter Wang: Yeah, so it’s very simple. We are simply taking a portion of our revenues and we’re working with NumFOCUS to administer those funds, but we’re going to give those to open source foundations. So obviously, the NumFOCUS foundation is one of those; we also make contributions to the PSF or to other foundations. As we change the terms of service, we want people to understand that this was done in conjunction with this commitment and this covenant to the community. So we made a commitment that through the end of the year, 10% of our individual user commercial subscriptions, 10% of that revenue, not profit, but 10% of revenue, would go directly to this dividend program. And then next year, we’re going to increase the scope of that; we’re going to give 1% of all company revenue.

And I know some people may think 1% isn’t that much, but when you look at companies and donations, actually, it’s a non-trivial amount. And my goal is to actually increase that to be even larger over time, but right now I can make the commitment to 1%. Yeah, it’s a start, and I think one of the things we see in this world, well, one of the things I’ve learned in my 40 years here is that oftentimes the best way to criticize is to create. If we can show people an example of what good behavior looks like, then we can ask other people to kind of get in line, so that’s one of the motivations there. And I think the response has been overwhelmingly positive. I’m very, very encouraged to see that.

Henry Badery: Definitely. And I think that’s… are you one of the first or the first in the industry to do something of that scale, 1% of revenue?

Peter Wang: I think in the software industry, I don’t know. I haven’t actually done a deep dive on that. There are some companies, like I said, who are like sole vendor open source kind of things. Like if you have Mongo, then there’s MongoDB; if Red Hat, then the Red Hat company, or Red Hat Labs. But in the case of a company like us, where we supported a community of software developers, to actually go and put a portion of revenue directly and to give it to the nonprofit to administer, I don’t know others like that off the top of my head. Obviously there are companies like Red Hat which do a lot of open source, and the big cloud vendors, Google, Microsoft, Amazon, they all do a certain amount of open source, of course. Of course, it’s not 1% of the revenue. But yeah, we may be one of the first, I don’t know if we are or not. If we are, great; if we’re not, that’s fine, but I do think it’s something more people should do.

Importance of giving back to open source communities

Henry Badery: Definitely. And I think this is going to just really drive innovation. Open source has seemed to get to the point it is at today without that kind of contribution and participation from companies. But I can just imagine how big it is… I can’t actually imagine how big it’s going to grow if we can get money behind this and help people… one of Travis’s missions is to help people be able to turn their hobby into a career. And so, if people could do that, if people could work on open source full-time, then I think it’s a very exciting future ahead. So, one thing I was going to ask is why do you think it’s important to give back to the open source communities that companies rely on?

Peter Wang: Well, if you don’t, those communities languish and the companies, they have to pay more ultimately for innovation. I think that open source dollars and funding open source innovation is the singularly most effective use of capital. I can’t think of a more effective return on investment. But the problem is that it’s a commons, and so even though the value is there, it’s hard for people to attribute, okay, which dollars paid for which things that produced then which outcomes. It’s very hard to do that kind of spreadsheet tracking, and therefore, corporate mentalities have a hard time understanding why they should put dollars into it because they can’t close the loop on how those dollars got spent. But when you look at it in bulk and in aggregate, and we have 20 years of history to look at it now; it’s very clear that if you give bright people the space to form communities, to work with each other and to try new things and quickly and rapidly iterate, then you get this incredible pace of innovation that everyone benefits from.

So, yeah, I mean, I use the term innovation commons a lot. I don’t know that many other people do, but it’s certainly the way I think about it. It’s more than just paying a developer. It’s more than just paying for software artifacts. It really is about sustaining a community… well, the community, some of the compute infrastructure, and some of the things like conferences and other ways that help support the community’s vitality. Investing in that infrastructure and investing in those innovation commons is just really, really important. I mean, I’m very pleased to see, like, Chan Zuckerberg do so much innovation, and lots of other foundations, Sloan and others, have done really great work there. So, we’re just hoping more corporate people would show up. But my experience has been with large companies, even the biggest companies, the wealthiest companies on the planet, the way that companies grow and the way they are constructed, every dollar in every budget is already spoken for.

So for someone to come along and say, “You know what, hey, we need to put up 1% of everyone’s budget. We need to take that away, and we need to put that into this untrackable, untraceable, unaccountable… We just need to sprinkle that like topsoil.” People are going to be very mad, right? Everyone’s going to be like, “Well, you know, yada, yada. Like, my KPIs and this and that.”

Henry Badery: What’s the ROI?

Peter Wang: “What’s the ROI?” You’re like, “Yeah, I don’t know.” It’s easy… it’s really interesting, the convexity of human cognition. If we’re all hurting ourselves by doing something, we can usually come to an understanding and say, “Oh, you know what, we should stop doing this because this hurts everyone.” Like, we’re all polluting the water, and now we can see the water is brown and it tastes yucky. That’s easy to see, and even then we have a hard time doing the right thing. But in a case like the innovation commons, when it’s like, we should all go and tend this verdant forest that’s yielding unbelievably, incredibly rich and luscious fruit for us, it’s very hard for people to understand that. So, I see myself as the steward, you know, I run a little cooperative farm stand on the edges of, like, a great Amazon jungle. And I’m just trying to encourage everyone to kind of give back into that ecosystem, that biome.

Henry Badery: And one thing we discussed the other day in a pre-call was this idea that the industry has a misconception around where the value comes from in open source. Most people, or a lot of companies, see that it comes from the source code, but really it’s the community, you were saying.

Peter Wang: Yeah. I mean, the source code is the fruit, and it’s fairly raw fruit. The community is the tree, the soil, the water that bears that fruit ultimately. And I gave a talk at PyData Berlin about a year and a half… well, I will say a year ago. And I said that software isn’t just code, software is a relationship, and it’s a relationship between the users and the developers of the software. And if you don’t understand that as a user of the software, well, if you use, like, consumer software, it’s sort of like, oh, I get a code drop, I double click on a thing, it runs and that’s what it is. But especially if you’re a business user, or if the software is something integral to how you think, or how, like, your airplanes fly, or how your trading systems run, you really want to understand where does this come from?

How is this sustained? What happens next with this thing? So, you have to see that the software is like a river; it’s just this ongoing thing. It’s a continuous flow of change responding to either changes in the underlying needs, as well as seeking upstream innovation. It’s a river and an actual piece of software, like a code release, that’s a scoop of water out of that river. 

And of course, it’s important; you have to get a scoop of water if you’re going to drink any of that water, you have to kind of scoop it up, but you have to understand that it’s actually this flow, it’s a river. And you have to ask, well, what is my relationship with this river? So with open source software, it’s this incredibly abundant flow of innovation that comes down, very delicious tasting water. And so I think the people who benefit from it, it would behoove them to do a little bit of thinking about, strategically, how do I make sure that this water keeps flowing? So yeah, I do think that unfortunately the business world in general has two great misconceptions around the world of software. One, like I just talked about: it views software as an artifact, and not as merely the result of a relationship between the user and the developer. And the second misconception is that developers’ time is somehow fungible, and that it’s proper to treat developer time as simple labor economics. Just like if they were a blacksmith hammering on some iron, or if they were a lumberjack out in the woods cutting a cord of firewood.

And the thing that’s true that I’ve seen about software innovation is that all these different people, all their minds are different. And there is an extremely large gradient of skill, of insight, creativity, genius, artistry, craftsmanship among different kinds of developers. And so, the business world seeking to impose labor economics on what is ultimately actually kind of an artisanal craft, it yields very, very low dividends. It’s incredibly poor. If you were to get CIOs, CTOs together in a room, get them a little drunk and really talk, like, no BS about how effective are your dollars spent in IT, especially in the area of software development for, like, internal business applications, I would be shocked if it was more than 15% effective. I would be shocked if anyone would claim it was more than 15% effective. Because ultimately, the industry as it’s evolved the last 20 years, it’s been more and more about managing risk than about seeking excellence.

The open source community, however, and we see this tremendously in the PyData community, for instance, and it’s only one of many, but these open source communities, when they’re managed well and when there are positive engagements there, these are craftsmen communities. And almost, you could say they’re artisanal guilds that really are able to find promising new talent; they’re able to then hone new tools, new software, new methodologies, approaches that are way better. And they’re able to do it with not many people, not many people at all, not many dollars invested. They’re able to yield much, much better outcomes because they can harness that high variance, the five-sigma sort of positive outcomes. And in sort of most corporate managed labor economics, you’re managing to avoid the minus-two-sigma kind of downsides.

So anyway, that’s a lot of pontificating about that, but I think those are two great misconceptions; the idea that software is a static artifact as opposed to a relationship and the idea that software developers are somehow just wage laborers and not actually the kinds of craftsmen that they really could be.

Henry Badery: And I really love the analogy from the fruit down to the soil, and even software as a flowing river. I thought that was such a great way of describing it, so I’m going to add that to my analogy bank; thank you, Peter. Since you have done such a fantastic job of growing such a strong community around Anaconda and the projects, like you said, Bokeh, Numba, I thought we could go through… would you be able to give some advice to the listeners, for the companies that are trying to grow strong open source communities, or even just whoever’s listening and wants to grow a strong open source community? What have you found to be some key learnings that you’ve taken from the last few years at Anaconda?

Peter Wang: Oh, yeah, that’s a great question. Well, there are many things. I think number one, I’ve always been pounding my fist on the table about community. It really is important to think about community, but it’s also important to recognize that there are different modes. When a project is first starting out, it really requires that singular vision. And so, the most successful projects have a single person or a very small group, usually no more than three or four people that are able to work very fast together. So if you’re just starting out on a project, I would say… let’s say you’re a single developer and you love this thing you’re doing, and you want to get it out there and get more people to use it and get more people to help you work on it.

Your focus shouldn’t be on trying to get a ton of people to come and use it initially. Your focus should be on trying to make your project’s core vision distinctive enough, cogent enough that you can really find the people who deeply resonate with it; find your tribe.

Henry Badery: Your champions.

Peter Wang: Your champions, right? So, really focus on the communication of the project vision and what it is. And oftentimes, it’s just as important to articulate what it isn’t, and find the people who love it, who love that vision and can help you. Once you get the champions together, find the one or two, and they can come from very unlikely backgrounds. You never really know what you’re going to get, and that’s the beauty and the wonder of open source in an internet era. You can find incredible developers that could be 12 hours offset from you in time zones, but they’ve got, for whatever reason, exactly the same ideas as you’ve got, and they love what you’re doing. Nourish, like, work with those and really water those little sprouts.

And once you get your first few going, work to build yourselves into a small clan, into a small tribe. And then at some point, once you start getting your first hundred or so serious users, once you start seeing a certain amount of traction, at that point, you should start thinking about the onboarding process: how do you get a few new community members then? How do you put in… work on your documentation. Really start thinking about documentation as a product unto itself. And then, the other thing I would say to people starting projects and thinking about this problem is that you should realize that if you’re going to be successful, it is going to be a journey not only for your project, but for yourself, because you’re going to be thrust into a leadership role, whether you like it or not.

And if you’re a natural extrovert like myself, you have some advantages there. If you’re not a natural extrovert, or if, for instance, and this happens a lot; let’s be honest, a lot of the software development community in the Western world is Anglicized. If you’re English… I know many people for whom English is not their primary or their first language, and they can feel reticent or they can feel shy about speaking and whatnot. So if in any case you feel like you have challenges being a leader for your project, then recognize it and address it tactically. So build a small coterie of lieutenants that you trust, to be your inner circle, to be your voice. But of course, if you want to continue to be the leader of your project, maintain the moral authority, maintain the technical vision.

And learn how to be vulnerable. I think leading through servant leadership and vulnerability is really great. Once you start getting to where you have a dozen developers working on your project, you have to start thinking about what are the values of this community? What are the technical values of the project, but what are also the human values of this community that I’m building? And these are questions that are really, really important, because I can imagine just myself, my 20-year-old self, listening to this and being like, “Oh, whatever, a bunch of fluff,” but really it’s important because these are your scaling problems. All problems ultimately become human problems, and if you want your software project to be successful, you’re going to need to manage the people who help shepherd it to success.

And if you don’t want to do that, again, you just need to recognize that and say, “Well, what I’m going to do is ultimately always going to be a small project that’s my little side project, and it’s going to be glommed onto the side of someone else’s thing, or someone else will maybe five years down the road have my idea and take it to great success, and raise millions of dollars and whatnot.” And I have to understand that I’m okay with that. So, a lot of it is just self-awareness in the leadership of the project. Lead or do not lead, but there’s no wavering in the middle; you’re going to make no one happy. I think what else is there? I think those are some of the key things. The other thing that I would say, one really interesting learning in the PyData ecosystem is that we have just… I want to… I’m going to try to avoid waxing too poetical about this, but I think the fact that it started with a scientific computing community was really quite a blessing. Because the scientists that we managed to pull into this ecosystem, this community, the SciPy community, many of them are by nature fairly humble people, and that humility allowed them to work on projects that sort of have their own scope, and then to also recognize the importance of working with other people.

And so, sometimes with software developers, as well as people in general, I guess, you’ll find a certain megalomania, like, “Well, my project, you could extend to this. And if I just threw that in, I could do what your project does; we don’t need your project.” But instead, what I saw in the SciPy community was that both at the technical level, at the API level, as well as at the community level between projects and groups, there was generally a camaraderie, and there was a sense of like, “Oh, these people are doing this cool thing. We should be aware of that. How do we make it so that our tool is easier to work with their thing? How do we build compatibility?”

And of course, there’ll be competing projects with different visions of things and that’s okay. And I think that blessing that we had of having some really ego-less leadership early on in the overall community has been fundamental in making it scale. If it wasn’t for that, then we’d have a few projects that want to take over the world. They would fail, as all projects that try to take over the world do. And we wouldn’t be able to have the cellular, scalable, decentralized kind of ecosystem of libraries and tools.

There are hundreds and hundreds of libraries in the PyData SciPy ecosystem that people rely on on a daily basis. And there are thousands and thousands more in the greater ecosystem that play with those things. We could not have been this successful if it wasn’t that big of a tent and that tent did not have that many tentpoles. So, I would say that that’s another learning. I don’t know how applicable that is to any random open source project, but that’s something I would say is relatively unique based on what I’ve seen in the PyData ecosystem.

Henry Badery: That was a very rich answer and thank you for that. I appreciate it. It was a great answer. I have seen and just been amazed by the PyData ecosystem. Are there other communities around the world that are like that, that share similar qualities, say for web development or embedded systems? Is there any open source community that is as strong and close-knit, really, really close-knit, as PyData? I haven’t stumbled upon one as yet.

Why the PyData community has thrived

Peter Wang: I’m sure there are. I just hang out in the PyData space. But I think there’s a lot of gamer and mod kind of communities that are also very… see, I think the thing to look for is generativity. So I think in generative spaces, you can find these kinds of communities because people have an abundance mentality. You’ll find smart people, you’ll see you have two really smart devs, and they can either, like, butt heads, or they could collaborate and kind of each work on their own different vision, but in conjunction with each other. If they believe in a finite game, if they believe in a scarcity mindset, they’re going to butt heads because like, “Hey, I want all the pie.” “No, I want all the pie.” But if you both believe the pie is growing very, very fast, if we work together, we can each get more of it. We each get 1.5 of three, versus each of us getting 0.8 of 1.6.

So, I think you have to look for generative spaces and generative communities. And I think gaming ones and ones where people are modding games and just trying to create beautiful, new, wonderful things; there’s less natural inclination to view people as being in competition with each other maybe. So, I don’t know, that’s an intuition I have about some of that. The traditional software development community, there’s many, many of them, but they all get fairly corrupted with money fairly early on. There’s a lot of money, glory, glamour, the VC lottery; there’s all of these dynamics that impose a scarcity mindset onto people, and that corrupts the tree at the root I think.

Henry Badery: So is data science and data analysis, or open source software in that space, fundamentally different from open source software for developers and infrastructure?

Peter Wang: Well, I think at this point we have made it so because the software community that we built is fundamental to this practice area. If we hadn’t done this, I’m sure it would be a couple of big companies that were funded to go and make a ton of money to create proprietary walled gardens. I mean, it would be sort of like that. I think we’re very lucky that the tools that we loved and the community that we built was… well, I guess it wasn’t just lucky. It was quite intentional. I really did quite believe. I sort of said, “Look, we should create this, we should push this into the space.” It was an intentional thing, but I’m glad it worked out as well as it has. I think it’s really important for the future of the world that these underlying numerical tools be open.

But I do think that the software… one thing that’s really fundamentally different about the open source software for data analysis versus software for developers is that, hopefully… we’re trying to cater to an audience of people who are not software developers. They are more users than they are developers. The software development community, the maker community, at least in the open source world, they’re trying to serve kind of each other. Their audience is other open source devs, and if they go and produce a really nice, useful piece of software; that’s great, but that can be really hard. And I mean this without really much criticism, but just my observation is the Linux open source community, for instance, it is the most successful by far of any open source community. But the Linux open source community, I should say the GNU/Linux open source community, has been far better at creating server-side software tools and developer tools and things like that, than tools for end users because the gulf is simply so large.

You’re building this thing; you’re a wizard, you’re chanting all this code into a text editor, and then you compile it and you wrap it all up and you give it to somebody and you hope it works for them and it doesn’t zap them. That takes a lot of skill and craft. It’s much more fun to make cool enchantments for other wizards to use because they know how to not hurt themselves too much. But I think the data science community, well, certainly I can just speak for the PyData community. I think our community may be a little different, but it’s related. But the PyData community, what’s important about this to understand is that the people who make the software predominantly don’t come from a background of computer science, most of them don’t have professional software development experience. Certainly the ones who laid the groundwork, the fundamental library, [unclear36:32], NumPy, SciPy, Jupyter/IPython, Pandas… like all of these things are made by people who needed to make something for themselves. 

And Python was just good enough and just powerful enough to where they could customize it. And they kind of got sucked into it, and they kind of created a second career for themselves to some extent. Maybe I’m one of those people, but really, they come from a place of great empathy for the end user. But very specifically what the end user is trying to do is something also of a higher order than merely someone clicking around on a web page to order some food or to order a taxi or something. I mean, someone trying to do some like deep numerical analysis of some complex multidimensional data set; that’s a different kind of software.

In fact, most software developers, unless they are in this niche of numerical simulation, they would have a hard time writing performant code that’s correct to do that kind of thing. I would say 99% of your modern full stack web dev application developers don’t even know how to approach these kinds of things that we’re building in the numerical world. Not that they can’t be taught it, but just saying that in their work history and in their… it’s not just like, oh, just write a for loop here, pull a field from a database there, slap it to that text display over there and you’re good to go. It’s not that kind of work at all. It’s much deeper work that we’re doing in the data science kind of infrastructure space. So, no, I think it’s definitely a different world. It’s very different, but it puts the end user much closer to the process of making the software.

Industry maturity of open source versus data science

Henry Badery: Okay, that makes sense. Where are we at in terms of industry maturity for the data science space relative to the adoption of open source software?

Peter Wang: Well, I think in general the broad industry at large, what I see is that they’re fairly early still, which is shocking to me because I’ve been advocating for open source since like 1995. And everyone runs Linux and everyone’s like doing all these things, like they use so much open source, but I think it’s the mentality of most IT managers and certainly most corporate people up the chain who see technology as sort of like this gnarly area to be managed. I think those people think of… I don’t think they think about open source in the correct way. I don’t think they understand what’s really happening. So I think that that is still fairly early, or I could say maybe it’s crossed the chasm, but it’s still people have concerns about it. It’s certainly not broadly like everyone’s doing it the right way. I mean, I think that people are still pretty immature in their adoption of open source, unfortunately.

Henry Badery: Yeah. I think it’s starting to change there and definitely we’re seeing a shift, not only in the adoption but in the attitude with regard to giving back. A lot of companies now, they’ve set up open source program offices, they’re contributing to the projects they use. So I think… what does the next phase look like in your opinion? Is that going to be driven by enterprises?

First-phase of data intensive computing

Peter Wang: I think that if enterprises show up in a good way, they can have an incredible impact on this. I think if they show up in the wrong way, they can really slow things down for quite a while, they provide some real headwinds for, I think, people trying to do the genuine, incredible things. So to make that more concrete, what I would say is that… well, two things. Number one, the phenomenon of software itself being a separate thing from hardware is itself kind of a historical accident. Now for us, the idea of software… like software is a metaphysical object that exists. It’s like, how could someone say software could not exist. But from a business computing perspective, it’s only been the last 35 or so years that we’ve really treated software as a distinct and separate kind of thing from the underlying hardware storage; the overall information system.

I think as we steer into the era of broad machine learning and cybernetics and AI, people are going to start viewing these information systems again in a coherent holistic way. And they’re going to see software as merely one part of the overall system, which is good. I think that’s the appropriate way to think about it. It also then puts challenges on the software industry. Because basically, ever since Larry Ellison and Bill Gates decided that software could be an industry and said, “Hey, you should pay for the software bit,” a lot of people have just, I would say axiomatically adopted that perspective to say, “of course, we pay for software.” And all the investors say, “Well, of course, software is a massive part of the value chain, highly scalable, 80% margins.” That era may be drawing to a close as we build systems, information systems that need to be very tailor made for the specific problems they’re trying to solve, so that depending on the dataset and the algorithm, you may have a very different combination of software and hardware.

So, I think people are going to be very surprised all of a sudden that it becomes harder and harder to sell enterprise software the way it used to be. In fact, all the innovation being in open source compresses that space even more. So, we’re going to see companies then adopting open source to make their own solutions. They’re going to build great solutions, and if they just understand that they need to participate in the open source human ecology via the open source program offices and things like that; I think those companies could do really well. They could really help put more water and fertilizer back into the soil. But if they continue to view it as an asset, if they view this thing as some competitive advantage, you know I see that as well; I’ve seen that mentality come from people.

Like, “why should we fund this piece of open source innovation if that means my competitor gets it too?” It’s like, well, it means you get it as well though. If you’re truly a better company, then you’re simply compounding your advantages. It’s just stupid things like that where it’s like, they don’t seem to understand… I shouldn’t dismiss it quite so much as stupid. There’s a mentality of like the open source as being a very scarce good around software. It’s like software is scarce, and open source is therefore a way to get scarcity, but really, open source is a way to tap into a far more abundant, faster pace of innovation. And if your company is ready to adopt that innovation, then you can go faster. You can really harness that wind, so that’s the way I think about it. I think the companies can… I’m very hopeful about it. I really am.

Henry Badery: And I think a lot of these companies, they’re accountable to investors or to shareholders and they’re like, and we briefly touched on it at the beginning of the episode, if they can’t show what the ROI is, then I can see why they historically had this mentality, but I think it definitely needs to change.

Peter Wang: Yeah, I mean, look, enterprise messaging to stakeholders and investors is all dressed in 10 layers of corporate BS. So you just say, “Well, in order to accelerate our digital transformation to the AI era, we decided to make these investments into open source, sorry, made these investments into technology innovation. And we’re now part of these incredibly innovative collaborations with these universities and these other things and research centers.” And of course, they don’t mention the word open source anywhere in there. But they can talk about the dollars they’re spending as them being part of this touching innovation, touching what comes next. It’s easy to dress that up. I mean, that’s corporate marketing 101.

Henry Badery: Yeah. And as we get to the end, I want to ask one final question, and it’s a debate that’s happening at the moment and it’s been happening for a while now. That debate is whether data science is a fad or a phase, so which end of that debate do you lie on?

Is data science a fad or a phase? Peter’s 3 predictions for the next 10 years of data intensive computing

Peter Wang: I think, yeah, so I’ve definitely heard that. Some people think that the current need for data scientist programmers will be obviated as we have easier point and click tools or maybe software developers will just learn the appropriate stats and then we’re good to go. I think that we are currently in a certain phase of what I call data intensive computing. And so, data intensive computing is what it’s going to be for the next, at least 10 or 15 years. It’s going to be what is, I think, moving into the future until we get to maybe the singularity, if we ever get there. We’ve just had a weird exception for the last 20 years where we could do data non-intensive computing. But now that we’re back in data intensive computing it’s more and more important to bring the people who understand the business problem, who understand the algorithms and the mathematics, and the knowledge of the compute and information systems, bringing that closer and closer together.

Data scientists are people who happen to be able to hold all three of those in their brains at the same time, so it’s very, very frictionless interaction. As the field matures, I think we’ll still have those data scientists becoming more and more masterful. They’ll be able to harness these tools. They’ll be able to have better understanding of techniques, so there will always be kind of an elite class of data scientists. But at the same time as you create this world into, or as we become more and more of… everyone doing more and more data intensive computing; we’ll see specializations and we will see everyone imbued with a certain amount of data literacy, so that’s what I see it as.

I think that data science in its current form, the obsession over single unicorns that know all these three things; that pressure may fade a little bit because a lot of people are learning this and they’re upskilling and they’re going to be able to meet that market demand, but also, the way we practice this kind of thing we call data science will start becoming bigger and bigger and will start maturing and will start specializing into sub-areas. We already see it now. You have ML engineers that are different than data engineers who are then skilling up into becoming data scientists. You have business analysts and data analysts who are trying to use auto ML tools to kind of learn a bit more about the actual statistical techniques. So, all of these things are happening in this space. 

I don’t think that it’s going to suddenly just resolve and then we settle back into the world of 2010 when you have a database admin, and a Java developer, and then a business analyst sitting in front of Tableau or some graphical tool. I don’t think we’re going to go to that. I don’t think we’re going back to that world ever again. I think that we’re going to see a greater and greater acceleration of businesses adopting these new data intensive techniques. And we’re going to see a rapid stratification between the businesses that are making the transition and the businesses that are not. So, it’s going to be really fascinating. The next 10 years is going to be pretty intense, but that’s my fundamental thesis: that we’re in the first phase of the era of data intensive computing. Anyone who makes it, by anyone, I mean, any business that makes it, they’re going to see that they need to infuse all of their knowledge workers with data literacy.

And so a lot of what we call data science today, maybe data science 101 stuff today will be the minimum data literacy you need moving forward. And that’s just what it is. I’m calling that shot. I mean, I believe that’s fundamentally what is going to happen.

Henry Badery: We’ll come back to this in 10 years and watch the episode and we’ll see what happens.

Peter Wang: We’ll grab a beer and we’ll see what happens, but I’m pretty sure that my prediction will be proven true. And two other predictions while we’re calling it while we’re drinking a beer; I will also say that we are now entering a period of rapid hardware heterogeneity. So we’re going to see more and more kinds of chips, storage systems, all sorts of weird architectures, and all of your software developers who thought they could just basically write a bunch of Java code, deploy it to a JVM or some vanilla x86 box running in a data center; they’re going to have to relearn a lot of things if they’re going to have their skills be relevant in this New World Order. The second thing that we’re going to see… but ultimately a lot of those new heterogeneous compute architectures are really going to fundamentally be about moving as much computational capability as possible to data. And then, having the data storage and the compute mechanism be as close to the sensor as possible.

So we’re going to see massive amounts of sensor networking, edge compute, all sorts of things happening, so that we’re going to, basically at the end of 10 years, be in a world of totally pervasive computing. We’ll look back at this era of cloud, when everyone put all their stuff into the data center and we just cranked on the data center, and we’re going to see that as an incredibly primitive and wasteful approach. But instead, we’re going to see a much more pervasive data storage, compute, and sensor fabric that is going to be the infrastructure for computing. So, those are the things that I would call. Yes, I’ll call those shots.

Henry Badery: We’ll come back. Maybe we’ll need two or three beers to go through those.

Peter Wang: It’ll take two or three. 


Henry Badery: Thank you so much for joining us tonight, Peter. I’ve really enjoyed chatting with you.

Peter Wang: Thank you very much for having me. Thank you for having me and for the excellent questions.

Henry Badery: Thank you so much, Peter. And for everyone who is listening or watching this video, it really would help if you can leave a review on Apple podcasts or leave a comment on YouTube letting us know what you think; that really does help and support the podcast. So, thank you very much everyone, and until next time, goodbye.