Accelerate the ML Cycle with Deep Lake, the Data Lake for Deep Learning

About

One in three ML projects fails due to the lack of a solid data foundation. Projects suffer from low-quality data, under-utilized compute, & significant labor overhead for data management. Traditional data lakes address this for analytical workloads, but not for deep learning. In this session, we introduce Deep Lake, the data lake for deep learning. Deep Lake stores complex data in a deep learning-native format and rapidly streams it to the query engine, in-browser visualization, & ML models in real time. Learn how to iterate 2x faster with Deep Lake without spending time building complex data infrastructure.

Davit Buniatyan
CEO at activeloopai

Transcript

Steve: [00:00:01] Here with Davit Buniatyan, the CEO of Activeloop. Davit is someone who has impressed me a lot with his accomplishments, with what he has done in his career, and with his launch of Activeloop. He started his career at 18 or earlier doing tech innovation, creating an app where you could swipe through news items the way you swipe on Tinder; he was way ahead of his time. He then went on to a PhD at Princeton University, where he worked on several projects that led up to Activeloop. So, Davit, welcome on stage here at Tech Shares. We're delighted to have you. Tell us a little bit about the story of how Activeloop came to be, and then I'm going to let you dive into what you have to talk about and show us here.

Davit: [00:01:01] Thanks, Steve. Really appreciate the invite, and I'm super excited to be here. Maybe just a brief background about myself. Before starting the company, I was doing a PhD at Princeton University, which, unfortunately, I didn't finish. I was working in a lab on biomedical image processing, in a field called connectomics. The research we were doing in connectomics was trying to reconstruct the connectivity of neurons inside the brain in order to build much more biologically inspired learning algorithms. The way we did it was to take a one-cubic-millimetre volume of a mouse brain, cut it into very thin slices, and image each slice. Each slice was 100,000 by 100,000 pixels, and we had 20,000 slices. Our problem was to apply deep learning techniques to separate the neurons, find the connections, and build the graph. During this process the dataset size was petabyte scale, and it was very hard for us to process this data. We had to rethink how the data should be stored, how it should be streamed from storage to the compute machines, whether we should use CPUs or GPUs, and what kind of models to use.

[00:02:06] What we found out is that we could store the data in a chunked NumPy array format. One of the guys on our team, William, and [inaudible] pioneered the CloudVolume Python package that helps you do that, and it makes for a very efficient data representation. Apparently, as I later learned, NumPy itself was built on top of biomedical image processing work by Travis, so it's great to see that there is a biomedical image processing root in NumPy as well. That inspired us to start the company and build a much more sophisticated version of it, which we call Deep Lake and which we announced about two weeks ago. I'm happy to go through it with you today, and we'd love to see if there are any questions. Feel free to stop me at any time to ask questions or leave comments, and I'll be happy to answer them.

Steve: [00:03:02] Perfect. Well, we look forward to hearing more about Deep Lake, so I’m going to let you get started.

Davit: [00:03:07] So as I mentioned, we got into Y Combinator, raised a couple of rounds, and started working with early customers to learn about their problems. One company had 80 million text documents and was training a large language model. Another company had petabyte-scale aerial imagery and was trying to build computer vision models on it to provide insights to farmers. What we learned is that there are all these awesome databases, data warehouses, and data lakes specialized for analytical workloads. Those analytical workloads are usually things like, "Hey, can you give me the past three months of sales?" or "What is the future forecast of my inventory?", and they have played a crucial role for the past few decades as the core of enterprise ML. But over the past ten years, deep learning has made it possible to generate business-valuable insights from unstructured data like images, video, and audio. However, the data infrastructure has been lacking there.

[00:04:06] And furthermore, as you may have seen as well, one in three ML projects fails because of poor data development practices. Those are all great technologies with very good use cases for analytical workloads, but for complex data it's essentially a green field: there aren't that many technologies or tools to store the data and then manage and compute on it efficiently. Things like PyTorch, TensorFlow, NumPy, and all the derivative works did an awesome job optimizing the computation on accelerated hardware. However, that moved the bottleneck to the data side, and that's what we have been focused on solving.

[00:04:56] So the question is, why do we need data lakes? There are so many benefits to using data lakes. First of all, they centralize all the data so that you can break down data silos. They help your organization have a unified view across all your data sources, so when you make a decision you can be aware of all the information, which in return also improves operational efficiency and of course reduces costs. However, things get difficult once you start to deal with images (say a simple use case, or biomedical imaging) or with a bunch of text data, which is sort of considered semi-structured; say you have a large dump of comments and you're trying to build a language model on it. It's still very limiting to use traditional data lake storage to process that. And by the way, there are two generations of data lakes. In the first generation, you just store files in a repository like S3 or HDFS. Then there's a second generation, which is now blossoming as well: things like Delta Lake and Iceberg, which are awesome for operating on large-scale file formats and tabular data. However, those tools don't have native deep learning integrations or PyData-stack integrations. There's a gap between MLOps and this modern data stack, and the modern data stack is focused on analytical workloads. Meanwhile, you have seen hundreds of MLOps tools popping up.

[00:06:31] However, there's a big disconnect between these two worlds, and data lakes have so far been focused on running queries for analytical workloads. If you're doing machine learning, especially deep learning use cases, the kinds of queries you need to run are quite different. So we're super excited to introduce Deep Lake, a data lake for deep learning applications. Essentially, we focus on helping you store the data very efficiently on top of object storage like S3. We take the data and put it into tensors, which you can think of as chunked NumPy arrays, and then make it very efficient to visualize the data, run queries, materialize data views, and stream the data into training or inference pipelines.

[00:07:23] Before getting into the details of how it works, let me tell you where it stands. If you take the whole MLOps blueprint that we work with, the elements of AI infrastructure, there are basically three main foundations: the human foundation, the model foundation, and the data foundation. In the human foundation you have tools like labeling tools, notebooks, and dashboards. The model foundation is the process where you go from data engineering to the deployment of machine learning models. And then there's the data foundation, where you have to store the data and interact with the storage layer. As you can see, Deep Lake sits in the data warehouse or data lake category, and it enables the rest of the ecosystem to store and maintain datasets, including both structured and unstructured data, with versioning and lineage and the ability to run queries, and then to use all of that in the rest of the ecosystem.

[00:08:23] More specifically, what Deep Lake itself is responsible for spans raw ingestion of the data, version control, visualization, running queries and materialization, and then the streaming part. The rest of the tools, for example integrations with Weights & Biases, annotation tools, or training frameworks, both open source and from commercial providers, are taken care of by the rest of the MLOps ecosystem. Let's start from the basics: data version control. The way it works is that it's actually columnar storage; think of each column as a tensor, an n-dimensional NumPy-like array. You can start from an empty dataset, which is on the main branch, create an image tensor, and append 100 image samples. Those could be pointers to actual PNGs, or the PNGs themselves, or the NumPy arrays themselves stored inside the tensor.
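
As a rough illustration of that first step, here is a minimal sketch using the open-source deeplake Python package (the names deeplake.empty, create_tensor, deeplake.read, and commit follow the public docs but may vary between versions; the S3 paths are hypothetical):

```python
import numpy as np
import deeplake  # open-source package; API names assumed from the public docs

# An empty dataset starts on the "main" branch. Each column is a tensor:
# a chunked, n-dimensional array stored directly on object storage.
ds = deeplake.empty("s3://my-bucket/driving-dataset")  # hypothetical bucket/path

ds.create_tensor("images", htype="image", sample_compression="jpeg")

# Samples can be raw NumPy arrays or references to existing image files.
ds.images.append(np.zeros((512, 512, 3), dtype=np.uint8))      # an in-memory array
ds.images.append(deeplake.read("s3://my-bucket/raw/0001.jpg"))  # an existing JPEG
ds.commit("add first images")
```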

[00:09:20] Then you decide to annotate this data. So maybe you check out a new branch, create a labels tensor, and add 100 annotations that correspond to those images. Once it looks good, you merge the annotation branch back into the main branch, maybe modify a few of the images or annotations, and commit again. Now you can run a query on this version, which creates a dataset view, a subset of the main branch. You can directly connect this to train a PyTorch model. However, that might not be very efficient, because there are rows that don't actually need to be fetched, so you can optionally materialize the view, which takes all of that data and puts it into a very efficient representation so that PyTorch can run on it very efficiently. Before, what usually happened is that data scientists or machine learning engineers had a dataset on S3 and copied it around, onto a local machine and then onto another machine. In that step you lose the whole data lineage, and with it the ability to train models reproducibly: you lose the link between your raw data and the dataset you trained your model on. And then when someone says, "Hey, actually this sample was useless," or "I didn't use that sample," you don't know what to do with it.
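
Continuing the same sketch, the branch / annotate / merge / query workflow he describes might look roughly like this (checkout, merge, query, save_view, and pytorch are taken from the public deeplake docs but should be treated as assumptions that may differ by version; the labels here are placeholders):

```python
import numpy as np
import deeplake

ds = deeplake.load("s3://my-bucket/driving-dataset")  # the dataset from the sketch above

# Annotate on a separate branch.
ds.checkout("annotation", create=True)
ds.create_tensor("labels", htype="class_label")
for _ in range(len(ds.images)):
    ds.labels.append(np.random.randint(0, 10))  # placeholder annotations
ds.commit("add annotations")

# Merge the annotation branch back into main and commit again.
ds.checkout("main")
ds.merge("annotation")
ds.commit("merge annotations into main")

# Query this version to get a dataset view (a subset of main), optionally
# materialize it into an efficient layout, and stream it into PyTorch.
view = ds.query("select * where labels == 3")    # TQL syntax is illustrative
view.save_view(id="label-3", optimize=True)      # optimize=True is the materialization step
loader = view.pytorch(batch_size=32, shuffle=True)
```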

[00:10:45] So this gives you full version control. One of the big differences from traditional data lakes is that they usually have a single-branch history, which gives you a time-travel feature so you can step back in time. Deep Lake looks more like Git for data: you can branch out into new branches, build up a dataset or a tensor, and evolve the dataset as you go forward. Behind the scenes, however, it's not using Git at all. Once the dataset is in our format, we also built a visualization engine that streams the data from S3 to the browser so you can visualize it. Why is this important? First of all, you get a qualitative and quantitative view of your datasets. Especially when you work with, say, 100 million images, it becomes tricky to get a visual understanding of what the overall dataset distribution looks like. Another important thing is that there's no middleman between the dataset and the visualization engine, meaning you don't need a backend or a rendering engine behind the scenes to stream the data. That makes it very secure, because your client connects directly to the source data to visualize the dataset. Once you've had that qualitative look, you want to interrogate the data further, and the easy way to do that is to run queries. I know a bunch of data scientists, myself included, don't like SQL-like syntax too much, but what we thought would simplify things a lot is a dialect of SQL, which we call Tensor Query Language. It takes all the benefits SQL provides, the standard syntax, and lets you interact with the data using NumPy-style operations. In this example, I have a dataset of images and I want all the bicycles in front of a car while it's raining. The way I run this is: I select the images and the bounding boxes, I ask for all the samples whose categories contain a bicycle and where the weather is raining, and then I order the results by my model's prediction error, the intersection error between the boxes and the predictions. The IOU term here is actually a user-defined Python function that takes the boxes tensor and the predictions tensor and gives back an estimate of how wrong the model's prediction was.
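
A sketch of what such a query could look like through the Python API (the exact Tensor Query Language syntax, the contains helper, and the user-defined iou_error function are illustrative assumptions based on the description above, not verbatim Deep Lake syntax):

```python
# Hypothetical example: assumes tensors named images, boxes, categories, weather,
# and predictions exist, and that iou_error is a registered user-defined Python
# function comparing ground-truth boxes with model predictions.
view = ds.query(
    "select images, boxes "
    "where contains(categories, 'bicycle') and weather == 'raining' "
    "order by iou_error(boxes, predictions) desc"
)
```

Ordering by the model's error surfaces the worst predictions first, which is exactly the slice of data you would want to inspect or re-train on.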

[00:13:21] Then what I want further is: can I crop my images to 400 by 400 pixels and adjust the bounding boxes accordingly, so I get a new training set of 1,000 samples to feed into my model? This lets you, both in the browser and in your Python notebook, quickly iterate and experiment with what kind of dataset you want to train the model on next, and find the edge cases that get you to highly accurate models. Once the query runs, it's time to materialize the result. The materialization step is where we have already decided this is the data we want; maybe it contains pointers, maybe it's a very sparse representation, and now we want to put it into a very efficient layout where we can store the data and then stream it to deep learning frameworks. As I mentioned, our format is basically columnar storage: think of each column as a huge NumPy array that is chunked behind the scenes and stored on S3, or your filesystem, or Google Cloud Storage, and so on. The key difference from other chunked array formats, and we went through these iterations many times, is that if you want to store dynamically shaped tensors, say your images have varying shapes, or you have videos of varying shapes that you need to represent in a single tensor, then having a static chunk shape becomes very tricky.
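
For instance, here is a minimal sketch of appending variably shaped samples to a single tensor with the deeplake package (API names assumed from the public docs; the dataset path is hypothetical):

```python
import numpy as np
import deeplake

ds = deeplake.empty("s3://my-bucket/varying-shapes")  # hypothetical path
ds.create_tensor("images", htype="image", sample_compression="png")

# Samples of different heights and widths can live in the same tensor;
# the chunked layout handles the varying shapes behind the scenes.
ds.images.append(np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8))
ds.images.append(np.random.randint(0, 255, (1080, 1920, 3), dtype=np.uint8))
ds.commit("two differently shaped images in one tensor")
```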

[00:15:04] So one of our secret sauces is the way we lay out the data on top of the file systems, which makes it very efficient both to represent any tensor, including dynamically shaped tensors, and to feed the data into PyTorch or TensorFlow to train machine learning models. One of the key advantages of columnar storage overall is that if you want to train a model on images and labels, you don't need to fetch or load the rest of the tensors, and vice versa: if you want to run a query on labels, you don't need to touch the images. All of this gets stored on object storage and can easily scale to petabyte size.

[00:15:49] Further, we built the streaming engine, which takes the data from S3 storage and very efficiently feeds it into NumPy over the network. Here is what we can show you: let's say you want to train a model on AWS SageMaker. ImageNet is an iconic dataset, about 150 GB in size, and on AWS you have different ways to train on it. The first is file mode, which basically says: when my virtual machine is up, I'm going to copy the files one by one from S3 to my machine, the images onto the machine, and then start the training process. In this case you waste three hours of very expensive GPU compute time while the data is being fetched. AWS also recently introduced fast file mode, which basically acts as a virtual file system on top of S3 with an optimistic cache, so you spend no time getting the data to the machine. However, now the training itself becomes slow, because the file system is not aware of how the computation is going to use the storage. What Deep Lake does, and it comes as just a Python package, is sit next to PyTorch and say: "Hey, I know you are going to ask for the next five samples, so let me go fetch the right data from the data lake very efficiently and bring it to the GPU."

[00:17:15] With that, we can achieve near-local training: training in almost the same time as if the data were local to the machine, while the data is actually being streamed from remote storage. This enables a kind of "Netflix for datasets," where you can spin up a machine anywhere, point it at the S3 storage where your dataset lives, and we feed the data into the GPUs. Furthermore, we did a bunch of comparisons with other data loaders, including PyTorch's and others, to demonstrate this. And if you want to learn more, [inaudible] et al. built an extensive overview of the data-loader landscape and the challenges and promises coming next.
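
A minimal sketch of that streaming setup (the ds.pytorch() call and its parameters follow the public deeplake docs but may vary by version; paths and tensor names are hypothetical):

```python
import deeplake
from torchvision import transforms

# Stream straight from object storage; thanks to the columnar layout, only the
# tensors listed below are ever downloaded.
ds = deeplake.load("s3://my-bucket/imagenet-style-dataset")  # hypothetical path

tform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# The loader prefetches upcoming chunks while the GPU works on the current batch,
# which is how training speed approaches that of locally stored data.
loader = ds.pytorch(
    tensors=["images", "labels"],
    transform={"images": tform, "labels": None},
    batch_size=64,
    shuffle=True,
    num_workers=4,
)

for batch in loader:
    images, labels = batch["images"], batch["labels"]
    # ... forward/backward pass here
```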

[00:18:06] As I mentioned, thanks to NumPy, tools like PyTorch, TensorFlow, JAX, and others did an awesome job optimizing the computation that happens on accelerated hardware like GPUs, TPUs, IPUs, and so on. Now the bottleneck has shifted to the data side, and what we work on at Activeloop is enabling this and solving that bottleneck. One more thing to mention: we also observe that there are two different stacks that have been evolving. There is the modern data stack, which has been around for a few decades, and on the right-hand side you see all these new tools specialized for machine learning practices that are popping up and becoming mainstream. We are trying to be the bridge that connects traditional, analytical, enterprise machine learning with the newer deep learning use cases. Just briefly, for Activeloop: what we focus on is providing a solid data foundation. We provide a very simple API for creating, storing, and collaborating on AI datasets of any size, for rapidly transforming and streaming data while training models at scale, and for querying, version-controlling, exploring, and visualizing datasets.

[00:19:23] All of this is about freeing machine learning teams to develop products much faster. What we have also seen working with companies is that we help them save GPU infrastructure costs and drive revenue growth by shipping products faster. One of the key things we have seen is that businesses usually expect their data scientists to build all the data infrastructure; we think data scientists should be able to focus on core business problems, and we help eliminate product failure. We have seen other companies, especially in self-driving use cases, keep a team of 8 to 10 solid data engineers just building all this infrastructure, which cost them on the order of $10 million over a few years. I didn't go into too much detail, but Deep Lake is actually open source. A year ago we were trending on GitHub at number one and number two among Python projects, and a small boutique firm considered it one of the top ten Python ML packages in 2021. More importantly, we have already seen it in production at public companies, and we work with a few public and private companies as well. Obviously there is more to do to make it perfect and put it to good use. And feel free to join the community as well; we just reached 1,000 community members on Slack. So yeah, this is us. Thank you very much for your time, and I'm happy to answer any questions you might have.

Steve: [00:20:54] This is just fantastic. I'm glad for this success, and we hope to have more on Deep Lake and what you're doing in future tech shares. So if people want to get involved, is GitHub the best way, or should they contact you via your website? What is the best way for people to get to know you?

Davit: [00:21:13] Yeah, we created a separate page, deeplake.ai, which has all the resources: the GitHub repository and the white papers. We also just recently published an academic paper that goes into the details, and obviously the open source is always there. And we're super thankful to the NumPy community and PyData overall for enabling us to build such a tool.

Steve: [00:21:42] Very good. We had one question come in from Brian here. Say someone has started a deep learning project using a conventional data lake. How challenging would it be to migrate this existing data lake to Deep Lake?

Davit: [00:21:55] Brian, that's a great question. One of the things we did recently is an integration with Airbyte, which means that if your data sits in Delta Lake, Iceberg, or Hudi, the newer tools, we can just synchronize the tables from those tools into Deep Lake. Or, if you already have the data on, say, Amazon S3, we have a very easy API where you can just append pointers to the data or ingest large dataframes. Just to set expectations: Deep Lake is specialized for, or focused primarily on, computer vision use cases, video processing, NLP, and audio. If your data is still in the structured data space, we would highly recommend sticking with the traditional techniques.
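
As a rough illustration of the "append pointers" path (deeplake.link, the link[image] htype, and deeplake.ingest_dataframe follow the public docs but are assumptions that may vary by version; all paths are hypothetical):

```python
import deeplake

ds = deeplake.empty("s3://my-bucket/migrated-dataset")  # hypothetical destination

# Store references to images that already live on S3 instead of copying the bytes.
ds.create_tensor("images", htype="link[image]")
ds.images.append(deeplake.link("s3://existing-bucket/images/0001.jpg"))
ds.commit("link existing S3 images")

# For tabular data, a pandas DataFrame can be ingested in one call (name assumed):
# deeplake.ingest_dataframe(df, "s3://my-bucket/migrated-dataset")
```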

Steve: [00:22:49] Thanks for the answer, and thanks to Brian for the question. If there are any other questions, feel free to post them. Where are you going with Deep Lake? What can we expect in the near future and beyond? What's your vision, Davit?

Davit: [00:23:08] Yeah, so first of all, we are now focused on integrating with the ecosystem: getting into the AWS Marketplace, and discussing potential collaborations with the most prominent data warehouse and lakehouse systems. Being part of the ecosystem is number one. Second is the feature roadmap, and if you're interested in contributing, you're welcome to come and see how we can collaborate. We're still optimizing the format to make it as good as possible, but at the same time we are spending resources on making ACID transactions complete. Currently you can have concurrent writes to different ranges, but on the roadmap, by the end of this year, we should also enable ACID transactions on the same NumPy arrays. Our vision of where we want to take this is that we believe deep learning is going to take over traditional computation, and we want to be the data infrastructure standard for deep learning workloads. That's what we are working towards, and I would love to have the support and help of the community here.

Steve: [00:24:27] Well, we're here to help you out, and we appreciate your time today. With this, we're going to end our tech shares for today, for this Friday. We have more tech shares coming up, and everybody is welcome to attend. If you go to tech-shares.com you'll be able to see a list of our upcoming sessions. Our next one will be on November 4th: top big data opportunities for telecom companies, and Davit, I think your technology applies there as well. Then on November 9th we have a technology roundtable, where we're going to bring in executives and discuss the technologies and topics they care about. So, Davit, we appreciate it, and we thank everyone for participating. And with that, we'll end today's tech share.

Davit: [00:25:24] Thanks, Steve, and thanks to everyone.