Introducing Nebari: An Open Source JupyterHub Distribution Platform

About

In this session, Amit Kumar, a software engineer at Quansight, will discuss how Nebari, an all-in-one development and experimentation platform, will help teams work more efficiently and collaboratively. He will demonstrate how to use this open source project to set up a full-fledged Data Science environment on major cloud providers using a simple configuration file and without the need to understand complex terms like “Kubernetes” and “Load balancer.”

Amit Kumar, Software Engineer at Quansight

Go to OpenTeams (www.openteams.com) to find your Open Source Architect to train, support, and deliver your software solution. Assemble the right open source expert team today.

Transcript

Welcome to this third session of today's Tech Shares event. Once again, as a reminder, Tech Shares is a forum for technology and business leaders to connect, to help their organizations create better software solutions that utilize open source software more effectively. This is the last of the three sessions in our event today, titled Data Management and Energy with Open Source. My name is Brian Skinn, I'm your host for today, and this session is a talk by Amit Kumar of Quansight: Introducing Nebari, an open source JupyterHub distribution platform. As with the other sessions, we are open for audience questions, so if you have any, please drop them in the chat. And again, if you're interested in participating in a future Tech Shares, please reach out to us at tech-shares.com/speakers.

All right, so today I have with me for this session Amit Kumar, software engineer at Quansight. Amit, thanks very much for your time and for introducing Nebari to all of us. Before you launch into your talk, tell us a bit about yourself: what's your background, how long have you been at Quansight, how did you come to work on Nebari, things like that.
I've been working with Quansight for about two and a half years, roughly, and I come from various backgrounds: I started with symbolic mathematics initially, then web development, and then some consulting. So in some ways I would say I have touched various domains. Now I contribute most of my time to open source, and contributing to Nebari; I'm one of the core developers.

Excellent, well, we are glad to have you here. I will get out of the way and let you have the stage.
Thanks, Brian. So yeah, I'll be introducing Nebari. Before that, as I already introduced myself, just for the record: I work at Quansight, and I'm also the creator of Cirun, a service for creating GitHub Actions runners on your own cloud. I mainly work at the intersection of scientific computing, reproducible infrastructure, and open source, and Nebari lies somewhere around that as well. I'm also a core developer of Nebari.
A while back, in 2017, about five years ago, I gave a talk at PyData London about data pipelining with Luigi. I was early in my career, doing some data science and data pipelining, and now I do similar things but on a different scale. Those were fun times. Things were quite simple: single user, single notebook, local environment creation via requirements.txt. The environment was not very complicated, and things kind of worked fine in those days. If I had to do some computation, let's say train a machine learning model or run any other computation, I would just spin up an EC2 instance, or something on any cloud provider basically, to do that computation.
But things have changed a lot in the last five years, and I have a bunch of problems now. What if I have to collaborate with a colleague on a problem in JupyterLab: how would I do that? Let's say I also want to run a computation on 100 cores, or on 25 workers in the cloud: how do I get compute for that easily? And since the environment is not as simple as it used to be, I have a bunch of environments and I want to test various things: I want to quickly create a new environment, then get rid of it, and version it properly. How would I manage those environments, and how do I make them reproducible? Things work on my machine, but how do I make sure they also work on my colleague's machine and on production systems? Then there is running workflows: for example, I have a job I want to trigger on a regular basis, say every morning at 6 a.m. How do I write a schema for that and do it trivially, without worrying about Kubernetes clusters or infrastructure? And finally, I have been doing some work, and for that work to be of any use to the business I want to share some data and some visualizations. How do I do that? All of these problems exist for, I would say, anyone in the data engineering or data science space.
I would not say these problems have never been solved. They are being solved by a bunch of commercial tools as well; I will not use their exact names, but they sound like Databricks, SageMaker, Vertex AI, and Azure ML. So there are already a bunch of tools, but all of them come with their own problems, like being tied to a cloud or a platform, and sometimes being overkill for a small team. They are also quite expensive, so not everyone fits the bill.
I have been talking about compute a lot so far, so what's the fuss about compute? Why can't I just run everything on my laptop? To give you some context, I was working on a genomics problem a while back, which basically boils down to finding the pairwise distance between all pairs of row vectors in a 2D array of size 1,100 by 25 million. Here are some benchmarks for that: it took about 10 to 11 minutes on 25 workers, each with a four-core CPU and 16 GB of memory, and on a GPU it took about 46 seconds. Good luck doing that on your laptop.
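To make that concrete, here is a rough sketch of how such a pairwise-distance computation can be expressed with Dask arrays. The shapes match the example above, but the data, chunking, and scheduler setup are illustrative; at this size you would point it at a real Dask cluster rather than run it locally.

```python
import dask.array as da

# Illustrative stand-in for the genomics data: 1,100 row vectors with
# 25 million columns, chunked along the columns so workers hold slices.
X = da.random.random((1_100, 25_000_000), chunks=(1_100, 1_000_000))

# Pairwise squared Euclidean distances between all rows, using
# ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * a.b
sq_norms = (X * X).sum(axis=1)
gram = X @ X.T
dists = sq_norms[:, None] + sq_norms[None, :] - 2 * gram   # shape (1100, 1100)

# Everything is lazy until computed; at this scale you would attach a
# distributed client (see the Dask Gateway sketch later) before compute().
result = dists.compute()
```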
So you need compute. For me to be able to do that experiment, I needed access to compute very quickly, to be able to test and iterate very quickly, and being able to test and iterate quickly is what moves research forward. So, Nebari to the rescue. Let's talk about Nebari, which is what I'm here for. It's an open source data science platform which is good enough for a small team and also scalable enough for an enterprise. Let's learn more about it.
first of all it’s open source and uh it’s it’s ever it’s all it’s used to be called qhub initially but it’s been
renamed to nebarina and with over 40 contributors and uh more than 150 Stars
I would love that number to reach a bit more after this call uh over a thousand commits and uh yeah so we are trying to
embrace the open source it is designed as a managed integration of Open Source Technologies so basically
uh we’re not some creating something new like we’re not really Reinventing the wheel so it builds on which is all some
things which are already existing so for example if you’re stuck on something like if you’re stuck on let’s
say a task problem you have a huge and helpful Community to fall back to it’s
not something proprietary that where you are stuck with the vendor and you have to ask the vendor why it’s not working when it’s not working
Nebari takes an infrastructure-as-code approach to deploying this whole infrastructure, so let's look at what that means. Nebari has a "DevOps for non-DevOps people" approach. What that means is that Nebari does not expect the user to know what a load balancer is or what a reverse proxy is. So how does it work? Everything starts with a config file; we call it the nebari-config file. Nebari has a nice CLI which creates this config file for you, so you don't have to write all of it yourself: you are asked a bunch of questions, the answers depend on your deployment, and at the end the CLI spits out this file for you. The next step is that you just run nebari deploy, and it deploys the whole infrastructure for you. What happens underneath is that when you run the nebari deploy command, it creates Terraform files from this config file (Terraform is basically a way to easily create and manage cloud resources), and those Terraform files are then applied to your selected cloud provider. Nebari supports a bunch of cloud providers, for example GCP, AWS, Digital Ocean, and Azure. The Terraform files are specific to your cloud provider, but underneath there are some common elements, for example a Kubernetes cluster, which is portable, so it has the same components on all cloud providers. After this is done, you'll have a complete data science platform for a user or a group of users.
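To give a feel for what "everything starts with a config file" looks like in practice, here is a rough sketch of a nebari-config file. The keys and values are illustrative and the exact schema varies between Nebari versions, so treat this as the general shape rather than a reference.

```yaml
# Generated by `nebari init` (or its guided mode) and applied with
# `nebari deploy -c nebari-config.yaml`. Keys shown here are illustrative;
# consult the Nebari docs for the exact schema of your version.
project_name: my-data-platform
provider: aws                      # or gcp / do / azure
domain: demo.example.com
ci_cd:
  type: github-actions             # the GitOps approach mentioned later in the demo
certificate:
  type: lets-encrypt
  acme_email: admin@example.com
security:
  authentication:
    type: password                 # or GitHub, Auth0, etc.
```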
I've also been talking about environments a lot, so what's the fuss about environments? Why can't I just pip install from requirements.txt, and what's wrong with that? As you know, for anyone doing data science or data engineering, data science is built on top of the open source ecosystem, and to harness that ecosystem we need environments. Let's take a look at the usual workflow of a typical data scientist or data engineer. They create a base environment with some basic packages, then do some work, data science or whatever they're doing, and then, if they need another package, they pip install or conda install it and carry on. So now, how would you share the environment you have just been working in? If you do pip freeze, the output doesn't capture the Python version: if you're working on Python 3.6 and your colleague is on Python 3.8, pip freeze won't account for that. Also, pip is very Python-specific, not language-agnostic, so if you're using libraries that depend on compiled code (NumPy is a good example), you eventually have to deal with compiled things, and pip is not a very good tool for installing those kinds of packages.
Another example: this was the requirements file for one of the projects we were working on, an EfficientDet PyTorch project. That requirements file did not change for a long time; it was working about 11 months ago and installed specific versions of some libraries. When we tried to run it again after those 11 months, it didn't work, because we don't know what environment the user was in or exactly what packages got installed, and it isn't really hard-pinned; torchvision, for example, is not hard-pinned. So it does not guarantee that you will get the exact same environment you had, say, a week ago when you were working on it. It's not really a good tool for hard-pinning your environment and having reproducible environments.
All of these problems exist, but there is also tooling for solving them; I'm not saying these are unsolved problems. conda does solve most of these issues at the individual level, but for that to work you need to use best practices, and unfortunately most of us do not use those best practices. Sometimes we do want to use them, but a lot of the time we don't have the tooling for it, so it's often the case that we end up with messed-up, non-reproducible environments.
To demonstrate what the best practices look like: you should always create an environment spec. Unlike your requirements.txt, an environment.yaml also pins the Python version, and it's better for language-agnostic stuff, so if you're installing non-Python things like compiled packages, it's the best tool for that. You can also create a lock file, so that you know the exact versions of the dependencies you're using; you don't have to rely on unpinned things, and you have a much stronger guarantee of getting a reproducible environment. And never upgrade a package in place: if you're working in an environment and you want to upgrade a package or install a new one, the best practice is to create a new environment instead of modifying the existing one, because that could break your environment.
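As a small illustration of that first practice, here is what a pinned environment.yaml might look like. The package names and versions are only examples.

```yaml
# environment.yaml: unlike requirements.txt, this pins the interpreter itself
# and lets conda handle compiled, non-Python dependencies. Versions shown are
# illustrative.
name: analysis-2022-10
channels:
  - conda-forge
dependencies:
  - python=3.10        # the Python version is part of the spec
  - numpy=1.23.4       # compiled packages resolved by conda, not pip
  - pandas=1.5.1
  - pytorch=1.12.1
  - pip
  - pip:
      - some-pypi-only-package==1.2.3   # hypothetical PyPI-only dependency
```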
Before presenting what the ideal solution would be, let's first talk about what we want from it. What are the ideal environment goals? The first thing you need is reproducibility: research and production environments are captured and can be used in multiple contexts. It's not something that just works on your machine; it should work everywhere, and for that you need a very hard-pinned lock file, plus the tooling to produce one easily. The next thing is flexibility: you need to ensure that data scientists are able to create environments quickly. When you are working on something, you ideally want to create environments on the fly. It shouldn't be the case that, if you need a new package, say scikit-learn, you have to go to IT, ask for the package to be installed in the environment, and then wait a couple of days. If that happens, it really slows down your work or research, so flexibility is very important: you should be able to create environments on the fly, define them quickly, build new environments on top of them, and share the spec file. The third thing is governance: allowing decision makers to enforce policies and control access to research and production environments. For example, if you're an enterprise you may want policies like developers can only install from an internal mirror, or every environment should always include a certain set of base packages, things like that; you want to enforce something on the user side, so you want governance. Having a tool that covers all of these things really makes for an ideal environment for research and data science.
While developing Nebari we built a tool for this, called conda-store, and it solves these problems: it enforces the best practices and controls the full environment lifecycle. So let's learn about conda-store and what exactly it does.
At a high level it does three things. First, it manages versioned and access-controlled environments. For example, in the screenshot on the right you can add packages, have exactly pinned packages, and also specify rules, like requiring a package to be greater than, say, 3.8. That's the UI way to do it; you can also use the YAML editor, so if you flip the toggle on the right and switch to the YAML editor, it shows you the YAML we are all used to from conda. Second, it builds the conda specification in a scalable manner. You don't want to wait an hour to build a new environment, so you want caching: there is a central cache, so if you have already pulled, say, NumPy 1.5, you shouldn't be required to pull it again. That keeps environment builds fast, so it doesn't take forever to get a new environment and you can quickly experiment. Third, it serves the conda environments via the filesystem, lock files, and tarballs, and also a Docker registry, so if you want to push an artifact somewhere and then use it, it provides a way to do that as well. Let me quickly show you how that looks.
For example, here is an environment I created recently; this is how it looks. This deployment still has the old UI, the new UI isn't on it, so I can't show you the exact same screen, but this is how you would create a new environment. You can edit it and make changes, say add a package, hit submit, and it will build a new environment on the fly. It also gives you the lock files; for this one, for example, you can see it pins down the exact dependencies and exact versions of all the packages, so that you have a very reproducible environment. So what conda-store actually provides is a UI that guides you and enforces the best practices: you move from ad hoc creation to a fully specified environment, and you already saw what the problems with ad hoc creation are.
This is the usual workflow for conda-store. You specify the environment in a YAML file or from the UI I showed you before: the name, the dependencies, and the pinned versions. It then creates a lock file which is very hard-pinned, and that produces artifacts, for example tarballs or even a Docker registry image. You can then use that exact same lock file to create another environment, so it is very reproducible: for example, you can use the artifacts in your production environment and be confident you will have the exact same environment you had while testing in, say, JupyterLab.
versioning so let’s say if you create an environment today and you want uh you made like two changes to it you want to
version them let’s say environment one two three so on so because uh the development evolves over time so your
environment would not be same right now from what it was when you started the
project so it would change and you want to see uh the changes that you have made for to the environment let’s say you
there’s a there’s something which broke when you update it in your environment and you want to see uh why what are the
changes I introduced in the environment so you can basically with this access control and versioning you can also see
like what happened like uh what change in your environment and also access control is basically you can have uh
Access Control environments like for example uh you you are in a you’re working in a team of let’s say five
folks and everyone have environment their personal environment and you have some shared environments like for
example this one the uh these are the shared environments so which everyone will have access to and then you have personal environments let’s say the John
those environments like this one so this is your personal environment which you have access to
The third part is governance, which I was talking about. As I said, maybe you want to enforce that users only use internal mirrors, or restrict packages to particular channels, say enforce that people only install from conda-forge or RAPIDS or whatever; or you always want a required set of packages installed in all environments. You can enforce that here. You can also require certain versions of packages: for example, if you know there's a vulnerability in one of the newer versions of a package and you want to make sure no one installs a version above, say, 5.3 of that package, you can enforce that too. That's what governance provides.
Nebari is designed so that we can have lots of plugins. Different people have different ways of working, and we want to support all the ways they want to work and develop in Nebari, so we have various plugins. One of the popular ones is the VS Code integration: not everyone enjoys writing code in JupyterLab, and maybe you want to use VS Code instead. Nebari has this VS Code integration: you go to your Nebari, click on the VS Code integration, and it spins up VS Code with all your files, so you can write code the way you would on your local machine.
Another nice plugin is dashboard deployment. Remember the part where I said I did some work and I want to share some data and visualizations with the team, with the business: this is how you would do it. It supports all the popular dashboarding tools, like Plotly, Panel, Streamlit, and Voilà. You can create a new dashboard, and you can also access-control that dashboard: you can share it with a colleague, or you can make it private, so you control access to that dashboard the way you like.
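As a taste of what one of those shared dashboards could contain, here is a minimal Panel app; the data and layout are purely illustrative, and the dashboard plugin (or `panel serve`) would take care of serving it.

```python
import numpy as np
import pandas as pd
import panel as pn

pn.extension()

# Fake results standing in for whatever you actually want to share
df = pd.DataFrame({
    "sample": [f"s{i}" for i in range(10)],
    "distance": np.random.rand(10),
})

dashboard = pn.Column(
    "# Genomics results",          # Markdown title pane
    pn.pane.DataFrame(df),         # table of results
)

# Marking the layout as servable lets `panel serve` (or the dashboard
# integration) pick it up and expose it to colleagues.
dashboard.servable()
```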
There are also plenty of other plugins available. For example, as I said, Dask comes built in, for when you want to scale your compute to a cluster of workers and do lots of computation; for the genomics problem I was talking about earlier, I used Dask. If you want to run workflows, Argo is one of the good tools for that, so you can use Argo or Prefect. If you're doing machine learning you can use ClearML, and the dashboards are ContainDS Dashboards. Also, since it's a Kubernetes cluster that a bunch of folks will be using for various things, you want to see memory usage, users, and various other things about the cluster; that's basically for the administrator, if you want to see how your Kubernetes cluster is doing, things like that. And if you want, you can even do a video chat with Jitsi.
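For the Dask part, here is a rough sketch of what scaling out to a cluster of workers looks like from a notebook. It assumes the deployment exposes Dask Gateway, which Nebari sets up; the worker count and options are illustrative.

```python
from dask_gateway import Gateway

gateway = Gateway()                  # connection details come from the deployment's environment
options = gateway.cluster_options()  # worker profiles/sizes defined by the administrator
cluster = gateway.new_cluster(options)
cluster.scale(25)                    # e.g. the 25 workers from the genomics example

client = cluster.get_client()        # subsequent Dask computations run on the cluster
```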
I will now quickly show the Nebari CLI, just to give you a taste of how it looks. Here I have Nebari installed in this environment, and I want to deploy Nebari to some cloud infrastructure, so I'll quickly show what the CLI looks like. Let me make this a bit bigger. The idea of showing this is to show you how easy it is to generate the nebari-config file and how easy it is to deploy. Let's quickly take a look at the commands we have. The commands you will mostly need are these three: init, deploy, and destroy. What init does is create a nebari-config file, which is the source of truth for your infrastructure. After generating that, you run nebari deploy, which deploys everything you have described in the nebari-config file. And if you want to destroy a cluster, say to create a new one, you just run nebari destroy. So let's take a look at nebari init.
nebari init also comes with a guided mode; we'll see that in a second. If I run nebari init --help, you can see there's a guided init option, so let me run nebari init with guided init. It's a new tool we have added which guides you through all the steps and requirements for initializing Nebari. It asks you a bunch of questions about how you want your deployment to look. For example: where would you like to deploy your cluster? Let's say I want to deploy to AWS. Paste your AWS key; I'll just put something in here. What project name would you like to use? I'll say nebari-techshares. What domain would you like to use? I'll say demo.nebari.test. How would you like to handle authentication? Let's use basic auth, which is username and password. Do you want to use the GitOps approach? Yes, I would like to use the GitOps approach. It's basically a way for you to deploy and make changes to your Nebari: the first time, you generate the config file, and later, if you want to make a change, say change the profiles for JupyterHub, you can do that in the config file. You simply make the change, push it to the GitHub repository, and GitHub Actions does the rest of the job, basically updating your infrastructure with the new settings. So yes, I would like to use that.
and organization let’s say is Con site where I want to keep my
GitHub repository and the name of the repository let’s say it’s never share
and would like to clear the remote repository so if you provide the GitHub credentials it would create that
repository as well so you and you can use let’s encrypt for certificates Let’s
ignore that for now and if you want to do Advanced configurations so let’s ignore that for
now and now this will spit up a
configuration file
uh seems like I the token ID for the blues was wrong that’s why it uh spit is
uh things but if you have the correct AWS token it would work
but anyways after this file is generated you would have to just run a battery deploy
and that would basically deploy the whole infrastructure and yep let’s move on to the slides
So yeah, since it's an open source project, it thrives because of its open source contributors, and you're free to contribute to it. You can ask questions on our Slack and our Gitter, and you can also make feature requests on GitHub. And I think now the stage is open for questions.

Terrific, thank you very much, Amit, for that intro to Nebari.
We don't have any audience questions, but here's one that I have. Just looking forward, what sorts of features are planned for Nebari, in progress, that you're excited about?

One of the most crucial ones, I would say, is around identity. For example, one of the things we were discussing recently is how we want to provision access to various resources in the cloud provider. Let's say you have five people working on a project, and you want to give them access to different things, say a different storage bucket or a different BigQuery database, things like that. Being able to provision that right from Nebari, without someone having to configure it manually, is something we are planning to work on, and it's something I'm quite excited about, because it makes it very seamless for people to have access to those things right away.

Yeah, that sounds like a great feature set. Well, I think we'll wrap it there.
Again, thank you very much, Amit, for your time and for presenting the project and sharing it with everyone. We really appreciate all the experts for all three sessions today taking the time to weigh in on the various topics. As with the others, this presentation is being recorded, and the recording will be available soon. Our next Tech Shares event will be on November 30th, again at 11 a.m. Eastern Standard Time. It will be another roundtable discussion, similar to our previous Tech Shares, but in this case it'll be fintech focused; we look forward to having you participate there. As I've mentioned in some of the other sessions, if any of you are interested in speaking or being a panelist at a future Tech Shares event, please reach out to us; the best place for that is tech-shares.com/speakers. Thanks very much to our speakers and everyone in attendance, and we look forward to seeing you next time.