Introducing Nebari: An Open Source JupyterHub Distribution Platform

About

In this session, Amit Kumar, a software engineer at Quansight, will discuss how Nebari, an all-in-one development and experimentation platform, will help teams work more efficiently and collaboratively. He will demonstrate how to use this open source project to set up a full-fledged Data Science environment on major cloud providers using a simple configuration file and without the need to understand complex terms like “Kubernetes” and “Load balancer.”

Amit Kumar, Software Engineer at Quansight

Go to OpenTeams (www.openteams.com) to find your Open Source Architect to train, support, and deliver your software solution. Assemble the right open source expert team today.

Transcript

Welcome to this third session of today's Tech Shares event. Once again, as a reminder, Tech Shares is a forum for technology and business leaders to connect, to help their organizations create better software solutions that utilize open source software more effectively. This is the last of the three sessions in our event today, titled Data Management and Energy with Open Source. My name is Brian Skinn, I'm your host for today, and this session is a talk by Amit Kumar of Quansight: Introducing Nebari, an open source JupyterHub distribution platform. As with the other sessions, we are open for audience questions, so if you have any, please drop them in the chat. And again, if you're interested in participating in a future Tech Shares, please reach out to us at tech-shares.com/speakers.

All right, so today I have with me for this session Amit Kumar, software engineer at Quansight. Amit, thanks very much for your time and for introducing Nebari to all of us. Before you launch into your talk, tell us a bit about yourself: what's your background, how long have you been at Quansight, how did you come to work on Nebari, things like that.
I've been working with Quansight for about two and a half years, roughly, and I come from various backgrounds: I started with symbolic mathematics initially, then web development, and then some consulting. So in some ways I would say I have touched various domains. Now I contribute most of my time to open source, and contributing to Nebari; I'm one of the core developers.

Excellent, well, we are glad to have you here. I will get out of the way and let you have the stage.
Thanks, Brian. So yeah, I'll be introducing Nebari. Before that, as I already introduced myself, just for the record: I work at Quansight, and I'm also the creator of Cirun, a service for creating GitHub Actions runners on your own cloud. I mainly work at the intersection of scientific computing, reproducible infrastructure, and open source, and Nebari lies somewhere around that as well. I'm also a core developer of Nebari.
A while back, in 2017, about five years ago, I gave a talk at PyData London about data pipelining with Luigi. I was early in my career, doing some data science and data pipelining, and now I do similar things but on a different scale. Those were fun times. Things were quite simple: single user, single notebook, local environment creation via requirements.txt. The environment was not very complicated, and things kind of worked fine in those days. If I had to do some computation, let's say train a machine learning model or run any other computation, I would just spin up an EC2 instance, or something on any cloud provider basically, to do that computation.
But things have changed a lot in the last five years, and I have a bunch of problems now. What if I have to collaborate with a colleague on a problem in JupyterLab: how would I do that? Let's say I also want to run a computation on 100 cores, or on 25 workers in the cloud: how do I get compute for that easily? And since the environment is not as simple as it used to be, I have a bunch of environments and I want to test various things: I want to quickly create a new environment, then get rid of it, and version it properly. How would I manage those environments, and how do I make them reproducible? Things work on my machine, but how do I make sure they also work on my colleague's machine and on production systems? Then there is running workflows: for example, I have a job I want to trigger on a regular basis, say every morning at 6 a.m. How do I write a schema for that and do it trivially, without worrying about Kubernetes clusters or infrastructure? And finally, I have been doing some work, and for that work to be of any use to the business I want to share some data and some visualizations. How do I do that? All of these problems exist for, I would say, anyone in the data engineering or data science space.
I would not say these problems have never been solved. They are being solved by a bunch of commercial tools as well; I will not use their exact names, but they sound like Databricks, SageMaker, Vertex AI, and Azure ML. So there are already a bunch of tools, but all of them come with their own problems, like being tied to a cloud or a platform, and sometimes being overkill for a small team. They are also quite expensive, so not everyone fits the bill.
I have been talking about compute a lot so far, so what's the fuss about compute? Why can't I just run everything on my laptop? To give you some context, I was working on a genomics problem a while back, which basically boils down to finding the pairwise distance between all pairs of row vectors in a 2D array of size 1,100 by 25 million. Here are some benchmarks for that: it took about 10 to 11 minutes on 25 workers, each with a four-core CPU and 16 GB of memory, and on a GPU it took about 46 seconds. Good luck doing that on your laptop.
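To make that concrete, here is a rough sketch of how such a pairwise-distance computation can be expressed with Dask arrays. The shapes match the example above, but the data, chunking, and scheduler setup are illustrative; at this size you would point it at a real Dask cluster rather than run it locally.

```python
import dask.array as da

# Illustrative stand-in for the genomics data: 1,100 row vectors with
# 25 million columns, chunked along the columns so workers hold slices.
X = da.random.random((1_100, 25_000_000), chunks=(1_100, 1_000_000))

# Pairwise squared Euclidean distances between all rows, using
# ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * a.b
sq_norms = (X * X).sum(axis=1)
gram = X @ X.T
dists = sq_norms[:, None] + sq_norms[None, :] - 2 * gram   # shape (1100, 1100)

# Everything is lazy until computed; at this scale you would attach a
# distributed client (see the Dask Gateway sketch later) before compute().
result = dists.compute()
```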
So you need compute. For me to be able to do that experiment, I needed access to compute very quickly, to be able to test and iterate very quickly, and being able to test and iterate quickly is what moves research forward. So, Nebari to the rescue. Let's talk about Nebari, which is what I'm here for. It's an open source data science platform which is good enough for a small team and also scalable enough for an enterprise. Let's learn more about it.
first of all it’s open source and uh it’s it’s ever it’s all it’s used to be called qhub initially but it’s been
renamed to nebarina and with over 40 contributors and uh more than 150 Stars
I would love that number to reach a bit more after this call uh over a thousand commits and uh yeah so we are trying to
embrace the open source it is designed as a managed integration of Open Source Technologies so basically
uh we’re not some creating something new like we’re not really Reinventing the wheel so it builds on which is all some
things which are already existing so for example if you’re stuck on something like if you’re stuck on let’s
say a task problem you have a huge and helpful Community to fall back to it’s
not something proprietary that where you are stuck with the vendor and you have to ask the vendor why it’s not working when it’s not working
Nebari takes an infrastructure-as-code approach to deploying this whole infrastructure, so let's look at what that means. Nebari has a "DevOps for non-DevOps people" approach. What that means is that Nebari does not expect the user to know what a load balancer is or what a reverse proxy is. So how does it work? Everything starts with a config file; we call it the nebari-config file. Nebari has a nice CLI which creates this config file for you, so you don't have to write all of it yourself: you are asked a bunch of questions, the answers depend on your deployment, and at the end the CLI spits out this file for you. The next step is that you just run nebari deploy, and it deploys the whole infrastructure for you. What happens underneath is that when you run the nebari deploy command, it creates Terraform files from this config file (Terraform is basically a way to easily create and manage cloud resources), and those Terraform files are then applied to your selected cloud provider. Nebari supports a bunch of cloud providers, for example GCP, AWS, Digital Ocean, and Azure. The Terraform files are specific to your cloud provider, but underneath there are some common elements, for example a Kubernetes cluster, which is portable, so it has the same components on all cloud providers. After this is done, you'll have a complete data science platform for a user or a group of users.
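To give a feel for what "everything starts with a config file" looks like in practice, here is a rough sketch of a nebari-config file. The keys and values are illustrative and the exact schema varies between Nebari versions, so treat this as the general shape rather than a reference.

```yaml
# Generated by `nebari init` (or its guided mode) and applied with
# `nebari deploy -c nebari-config.yaml`. Keys shown here are illustrative;
# consult the Nebari docs for the exact schema of your version.
project_name: my-data-platform
provider: aws                      # or gcp / do / azure
domain: demo.example.com
ci_cd:
  type: github-actions             # the GitOps approach mentioned later in the demo
certificate:
  type: lets-encrypt
  acme_email: admin@example.com
security:
  authentication:
    type: password                 # or GitHub, Auth0, etc.
```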
I've also been talking about environments a lot, so what's the fuss about environments? Why can't I just pip install from requirements.txt, and what's wrong with that? As you know, for anyone doing data science or data engineering, data science is built on top of the open source ecosystem, and to harness that ecosystem we need environments. Let's take a look at the usual workflow of a typical data scientist or data engineer. They create a base environment with some basic packages, then do some work, data science or whatever they're doing, and then, if they need another package, they pip install or conda install it and carry on. So now, how would you share the environment you have just been working in? If you do pip freeze, the output doesn't capture the Python version: if you're working on Python 3.6 and your colleague is on Python 3.8, pip freeze won't account for that. Also, pip is very Python-specific, not language-agnostic, so if you're using libraries that depend on compiled code (NumPy is a good example), you eventually have to deal with compiled things, and pip is not a very good tool for installing those kinds of packages.
Another example: this was the requirements file for one of the projects we were working on, an EfficientDet PyTorch project. That requirements file did not change for a long time; it was working about 11 months ago and installed specific versions of some libraries. When we tried to run it again after those 11 months, it didn't work, because we don't know what environment the user was in or exactly what packages got installed, and it isn't really hard-pinned; torchvision, for example, is not hard-pinned. So it does not guarantee that you will get the exact same environment you had, say, a week ago when you were working on it. It's not really a good tool for hard-pinning your environment and having reproducible environments.
All of these problems exist, but there is also tooling for solving them; I'm not saying these are unsolved problems. conda does solve most of these issues at the individual level, but for that to work you need to use best practices, and unfortunately most of us do not use those best practices. Sometimes we do want to use them, but a lot of the time we don't have the tooling for it, so it's often the case that we end up with messed-up, non-reproducible environments.
To demonstrate what the best practices look like: you should always create an environment spec. Unlike your requirements.txt, an environment.yaml also pins the Python version, and it's better for language-agnostic stuff, so if you're installing non-Python things like compiled packages, it's the best tool for that. You can also create a lock file, so that you know the exact versions of the dependencies you're using; you don't have to rely on unpinned things, and you have a much stronger guarantee of getting a reproducible environment. And never upgrade a package in place: if you're working in an environment and you want to upgrade a package or install a new one, the best practice is to create a new environment instead of modifying the existing one, because that could break your environment.
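As a small illustration of that first practice, here is what a pinned environment.yaml might look like. The package names and versions are only examples.

```yaml
# environment.yaml: unlike requirements.txt, this pins the interpreter itself
# and lets conda handle compiled, non-Python dependencies. Versions shown are
# illustrative.
name: analysis-2022-10
channels:
  - conda-forge
dependencies:
  - python=3.10        # the Python version is part of the spec
  - numpy=1.23.4       # compiled packages resolved by conda, not pip
  - pandas=1.5.1
  - pytorch=1.12.1
  - pip
  - pip:
      - some-pypi-only-package==1.2.3   # hypothetical PyPI-only dependency
```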
Before presenting what the ideal solution would be, let's first talk about what we want from it. What are the ideal environment goals? The first thing you need is reproducibility: research and production environments are captured and can be used in multiple contexts. It's not something that just works on your machine; it should work everywhere, and for that you need a very hard-pinned lock file, plus the tooling to produce one easily. The next thing is flexibility: you need to ensure that data scientists are able to create environments quickly. When you are working on something, you ideally want to create environments on the fly. It shouldn't be the case that, if you need a new package, say scikit-learn, you have to go to IT, ask for the package to be installed in the environment, and then wait a couple of days. If that happens, it really slows down your work or research, so flexibility is very important: you should be able to create environments on the fly, define them quickly, build new environments on top of them, and share the spec file. The third thing is governance: allowing decision makers to enforce policies and control access to research and production environments. For example, if you're an enterprise you may want policies like developers can only install from an internal mirror, or every environment should always include a certain set of base packages, things like that; you want to enforce something on the user side, so you want governance. Having a tool that covers all of these things really makes for an ideal environment for research and data science.
While developing Nebari we built a tool for this, called conda-store, and it solves these problems: it enforces the best practices and controls the full environment lifecycle. So let's learn about conda-store and what exactly it does.
At a high level it does three things. First, it manages versioned and access-controlled environments. For example, in the screenshot on the right you can add packages, have exactly pinned packages, and also specify rules, like requiring a package to be greater than, say, 3.8. That's the UI way to do it; you can also use the YAML editor, so if you flip the toggle on the right and switch to the YAML editor, it shows you the YAML we are all used to from conda. Second, it builds the conda specification in a scalable manner. You don't want to wait an hour to build a new environment, so you want caching: there is a central cache, so if you have already pulled, say, NumPy 1.5, you shouldn't be required to pull it again. That keeps environment builds fast, so it doesn't take forever to get a new environment and you can quickly experiment. Third, it serves the conda environments via the filesystem, lock files, and tarballs, and also a Docker registry, so if you want to push an artifact somewhere and then use it, it provides a way to do that as well. Let me quickly show you how that looks.
For example, here is an environment I created recently; this is how it looks. This deployment still has the old UI, the new UI isn't on it, so I can't show you the exact same screen, but this is how you would create a new environment. You can edit it and make changes, say add a package, hit submit, and it will build a new environment on the fly. It also gives you the lock files; for this one, for example, you can see it pins down the exact dependencies and exact versions of all the packages, so that you have a very reproducible environment. So what conda-store actually provides is a UI that guides you and enforces the best practices: you move from ad hoc creation to a fully specified environment, and you already saw what the problems with ad hoc creation are.
This is the usual workflow for conda-store. You specify the environment in a YAML file or from the UI I showed you before: the name, the dependencies, and the pinned versions. It then creates a lock file which is very hard-pinned, and that produces artifacts, for example tarballs or even a Docker registry image. You can then use that exact same lock file to create another environment, so it is very reproducible: for example, you can use the artifacts in your production environment and be confident you will have the exact same environment you had while testing in, say, JupyterLab.
versioning so let’s say if you create an environment today and you want uh you made like two changes to it you want to
version them let’s say environment one two three so on so because uh the development evolves over time so your
environment would not be same right now from what it was when you started the
project so it would change and you want to see uh the changes that you have made for to the environment let’s say you
there’s a there’s something which broke when you update it in your environment and you want to see uh why what are the
changes I introduced in the environment so you can basically with this access control and versioning you can also see
like what happened like uh what change in your environment and also access control is basically you can have uh
Access Control environments like for example uh you you are in a you’re working in a team of let’s say five
folks and everyone have environment their personal environment and you have some shared environments like for
example this one the uh these are the shared environments so which everyone will have access to and then you have personal environments let’s say the John
those environments like this one so this is your personal environment which you have access to
The third part is governance, which I was talking about. As I said, maybe you want to enforce that users only use internal mirrors, or restrict packages to particular channels, say enforce that people only install from conda-forge or RAPIDS or whatever; or you always want a required set of packages installed in all environments. You can enforce that here. You can also require certain versions of packages: for example, if you know there's a vulnerability in one of the newer versions of a package and you want to make sure no one installs a version above, say, 5.3 of that package, you can enforce that too. That's what governance provides.
Nebari is designed so that we can have lots of plugins. Different people have different ways of working, and we want to support all the ways they want to work and develop in Nebari, so we have various plugins. One of the popular ones is the VS Code integration: not everyone enjoys writing code in JupyterLab, and maybe you want to use VS Code instead. Nebari has this VS Code integration: you go to your Nebari, click on the VS Code integration, and it spins up VS Code with all your files, so you can write code the way you would on your local machine.
Another nice plugin is dashboard deployment. Remember the part where I said I did some work and I want to share some data and visualizations with the team, with the business: this is how you would do it. It supports all the popular dashboarding tools, like Plotly, Panel, Streamlit, and Voilà. You can create a new dashboard, and you can also access-control that dashboard: you can share it with a colleague, or you can make it private, so you control access to that dashboard the way you like.
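As a taste of what one of those shared dashboards could contain, here is a minimal Panel app; the data and layout are purely illustrative, and the dashboard plugin (or `panel serve`) would take care of serving it.

```python
import numpy as np
import pandas as pd
import panel as pn

pn.extension()

# Fake results standing in for whatever you actually want to share
df = pd.DataFrame({
    "sample": [f"s{i}" for i in range(10)],
    "distance": np.random.rand(10),
})

dashboard = pn.Column(
    "# Genomics results",          # Markdown title pane
    pn.pane.DataFrame(df),         # table of results
)

# Marking the layout as servable lets `panel serve` (or the dashboard
# integration) pick it up and expose it to colleagues.
dashboard.servable()
```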
There are also plenty of other plugins available. For example, as I said, Dask comes built in, for when you want to scale your compute to a cluster of workers and do lots of computation; for the genomics problem I was talking about earlier, I used Dask. If you want to run workflows, Argo is one of the good tools for that, so you can use Argo or Prefect. If you're doing machine learning you can use ClearML, and the dashboards are ContainDS Dashboards. Also, since it's a Kubernetes cluster that a bunch of folks will be using for various things, you want to see memory usage, users, and various other things about the cluster; that's basically for the administrator, if you want to see how your Kubernetes cluster is doing, things like that. And if you want, you can even do a video chat with Jitsi.
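For the Dask part, here is a rough sketch of what scaling out to a cluster of workers looks like from a notebook. It assumes the deployment exposes Dask Gateway, which Nebari sets up; the worker count and options are illustrative.

```python
from dask_gateway import Gateway

gateway = Gateway()                  # connection details come from the deployment's environment
options = gateway.cluster_options()  # worker profiles/sizes defined by the administrator
cluster = gateway.new_cluster(options)
cluster.scale(25)                    # e.g. the 25 workers from the genomics example

client = cluster.get_client()        # subsequent Dask computations run on the cluster
```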
I will now quickly show the Nebari CLI, just to give you a taste of how it looks. Here I have Nebari installed in this environment, and I want to deploy Nebari to some cloud infrastructure, so I'll quickly show what the CLI looks like. Let me make this a bit bigger. The idea of showing this is to show you how easy it is to generate the nebari-config file and how easy it is to deploy. Let's quickly take a look at the commands we have. The commands you will mostly need are these three: init, deploy, and destroy. What init does is create a nebari-config file, which is the source of truth for your infrastructure. After generating that, you run nebari deploy, which deploys everything you have described in the nebari-config file. And if you want to destroy a cluster, say to create a new one, you just run nebari destroy. So let's take a look at nebari init.
nebari init also comes with a guided mode; we'll see that in a second. If I run nebari init --help, you can see there's a guided init option, so let me run nebari init with guided init. It's a new tool we have added which guides you through all the steps and requirements for initializing Nebari. It asks you a bunch of questions about how you want your deployment to look. For example: where would you like to deploy your cluster? Let's say I want to deploy to AWS. Paste your AWS key; I'll just put something in here. What project name would you like to use? I'll say nebari-techshares. What domain would you like to use? I'll say demo.nebari.test. How would you like to handle authentication? Let's use basic auth, which is username and password. Do you want to use the GitOps approach? Yes, I would like to use the GitOps approach. It's basically a way for you to deploy and make changes to your Nebari: the first time, you generate the config file, and later, if you want to make a change, say change the profiles for JupyterHub, you can do that in the config file. You simply make the change, push it to the GitHub repository, and GitHub Actions does the rest of the job, basically updating your infrastructure with the new settings. So yes, I would like to use that.
and organization let’s say is Con site where I want to keep my
GitHub repository and the name of the repository let’s say it’s never share
and would like to clear the remote repository so if you provide the GitHub credentials it would create that
repository as well so you and you can use let’s encrypt for certificates Let’s
ignore that for now and if you want to do Advanced configurations so let’s ignore that for
now and now this will spit up a
configuration file
uh seems like I the token ID for the blues was wrong that’s why it uh spit is
uh things but if you have the correct AWS token it would work
but anyways after this file is generated you would have to just run a battery deploy
and that would basically deploy the whole infrastructure and yep let’s move on to the slides
So yeah, since it's an open source project, it thrives because of its open source contributors, and you're free to contribute to it. You can ask questions on our Slack and our Gitter, and you can also make feature requests on GitHub. And I think now the stage is open for questions.

Terrific, thank you very much, Amit, for that intro to Nebari.
We don't have any audience questions, but here's one that I have. Just looking forward, what sorts of features are planned for Nebari, in progress, that you're excited about?

One of the most crucial ones, I would say, is around identity. For example, one of the things we were discussing recently is how we want to provision access to various resources in the cloud provider. Let's say you have five people working on a project, and you want to give them access to different things, say a different storage bucket or a different BigQuery database, things like that. Being able to provision that right from Nebari, without someone having to configure it manually, is something we are planning to work on, and it's something I'm quite excited about, because it makes it very seamless for people to have access to those things right away.

Yeah, that sounds like a great feature set. Well, I think we'll wrap it there.
Again, thank you very much, Amit, for your time and for presenting the project and sharing it with everyone. We really appreciate all the experts for all three sessions today taking the time to weigh in on the various topics. As with the others, this presentation is being recorded, and the recording will be available soon. Our next Tech Shares event will be on November 30th, again at 11 a.m. Eastern Standard Time. It will be another roundtable discussion, similar to our previous Tech Shares, but in this case it'll be fintech focused; we look forward to having you participate there. As I've mentioned in some of the other sessions, if any of you are interested in speaking or being a panelist at a future Tech Shares event, please reach out to us; the best place for that is tech-shares.com/speakers. Thanks very much to our speakers and everyone in attendance, and we look forward to seeing you next time.