PyMC Open Source Development

About

In this episode of Open Source Directions, we were joined by Thomas Wiecki once again who talked about the work being done with PyMC. PyMC3 is a Python package for Bayesian statistical modeling and Probabilistic Machine Learning focusing on advanced Markov chain Monte Carlo (MCMC) and variational inference (VI) algorithms. Its flexibility and extensibility make it applicable to a large suite of problems.

Features:

a) Intuitive model specification syntax, for example, x ~ N(0,1) translates to x = Normal(‘x’,0,1)

b) Powerful sampling algorithms, such as the No-U-Turn Sampler (NUTS), allow fitting complex models with thousands of parameters with little specialized knowledge of fitting algorithms.

c) Variational inference: ADVI for fast approximate posterior estimation as well as mini-batch ADVI for large data sets.

d) Relies on Theano, which provides:

   - Computation optimization and dynamic C compilation

   - NumPy broadcasting and advanced indexing

   - Linear algebra operators

   - Simple extensibility

e) Transparent support for missing value imputation

Transcript

0:00
and all righty well how about we uh we get started and i say hello to the internet uh
0:07
welcome to open source directions uh hosted by open teams they’re the first business to business
0:12
marketplace for open source software services um open source directions is a webinar
0:18
that brings you all the news about the future of your open source projects i'm Tony Fast i'm a data scientist at
0:25
Quansight and i'm a community organizer in Atlanta, Georgia i'm super excited to be hosting because
0:31
i finally achieved my master plan i was on episode one as co-host and now
0:36
look at here i am i’m running the show david i’ve overtaken you um but i really want to welcome david my
0:42
co-host now how the tables have turned hi i'm David Charbonneau i'm CTO at Open
0:48
Teams uh i'm also a senior architect at uh Quansight and
0:53
i'm based in Durham, North Carolina and it's our great pleasure to introduce Thomas as our guest today
1:00
hi everyone i'm Thomas Wiecki i work on the PyMC project which i'll
1:06
be talking about a lot i'm currently involved in something related to that but i feel i want to build up the suspense
1:13
towards that a little bit more so there will be a reveal of what i'm scheming next uh so stay tuned for that
1:19
but before that i've been studying neuroscience at Brown and then
1:26
decided that academia wasn’t the right long-term trajectory for me and
1:31
then worked at Quantopian as the VP of data science there for eight years and now doing different things
1:39
related to privacy so thanks everyone for tuning in i’m excited to be here
1:44
well thank you for being here with us i'm super excited to talk about all of your experience and work on PyMC
1:51
um but before we get into all of those details um we kind of like to start to start
1:57
these episodes off just you know getting a feel for what what what y’all are finding interesting going
2:02
on it looks like this week where uh looking at what’s your favorite most recent pr
2:08
is um which is basically all the only thing you can say in open source directions so uh thomas i was wondering do you have
2:15
a favorite pr this week uh boy do i have a favorite pr yeah should i uh
2:20
like show it to you guys oh if you’d want to that would be great yeah um let me set this up
2:28
um here we go
2:33
so unsurprisingly it's a pull request to the PyMC3 repo and it's from
2:40
Greg Mingas who i'd never heard of before and he says well i'm adding this MLDA
2:45
stepper and well i have no idea what that is either but apparently it's this really cool new Metropolis-
2:52
Hastings algorithm that they researched and published and are
2:58
now adding to this library so he’s a first-time contributor and uh so he describes the performance implications
3:06
what he did to do it the references the papers other people are excited about it and then uh there’s a lot of discussion
3:13
actually going into this but he’s just really responsive to all the feedback uh adding
3:18
documentation revising the pull request just looking at the code is also
3:24
pretty massive it's from the Alan Turing Institute
3:29
he's adding it to the docs writing docs writing tests all the nice things that we're
3:34
looking for in the pull request and what i really also like about this is not only
3:39
the quality of the pull request which is super high the responsiveness of the author but also the fact that we are able to
3:47
attract the researchers that are writing papers on these samplers and then
3:53
are basically saying well we could just put up a reference implementation but instead they’re saying well more people use it if we add it to this
4:00
library which already has all the distributions and inference engines and now it has one more sampler and
4:08
uh so so they’re just directly adding that and that is for us the best thing that could happen right is people we don’t have to implement those
4:15
or some other people who really want to use this have to implement this but just the researcher directly goes
4:21
and implements this for us so i was really excited about that it’s about to get merged and will be
4:26
part of the upcoming release uh to describe it very briefly
4:33
it's a sampler where you have a model that has different uh levels of approximation like you have
4:39
a model that is a very rough approximation because you maybe evaluated it over a very coarse grid and you have a very fine-grained model
4:45
that is more costly to evaluate you can use the rough approximation
4:50
the coarse model and the fine-grained model together and uh basically
4:57
use the rough model the coarse model to find a good proposal
5:03
and then evaluate on the other one so the cheap evaluations you're doing on the coarse model
5:09
and then you're only asking the precise model which is costly to evaluate for that final accept-or-reject step so
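The coarse-then-fine proposal scheme described here is the delayed-acceptance idea at the heart of MLDA. A toy numpy sketch (not the actual PyMC3 implementation; both log-densities here are made up for illustration): a cheap coarse density screens proposals, and only survivors pay for an evaluation of the expensive fine density, with a second correction step keeping the chain exact.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p_fine(x):
    # "expensive" exact log-density (here: a standard normal)
    return -0.5 * x**2

def log_p_coarse(x):
    # cheap, slightly-off approximation of log_p_fine
    return -0.5 * (x / 1.1)**2

def delayed_acceptance_step(x, scale=1.0):
    # stage 1: screen the proposal with the coarse model only
    prop = x + scale * rng.normal()
    a1 = min(1.0, np.exp(log_p_coarse(prop) - log_p_coarse(x)))
    if rng.uniform() > a1:
        return x  # rejected cheaply; fine model never evaluated
    # stage 2: correct with the fine model so the chain targets log_p_fine
    a2 = min(1.0, np.exp((log_p_fine(prop) - log_p_fine(x))
                         - (log_p_coarse(prop) - log_p_coarse(x))))
    return prop if rng.uniform() <= a2 else x

samples = []
x = 0.0
for _ in range(5000):
    x = delayed_acceptance_step(x)
    samples.append(x)
print(np.mean(samples), np.std(samples))  # roughly 0 and 1
```

Stage 2 uses the usual delayed-acceptance correction ratio, so the composite kernel still satisfies detailed balance with respect to the fine density even though most rejections happen at the cheap stage.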
5:16
really cool and a great example of open source coming together well this seems like
5:22
what pre-print should be um and pre-publication should be i love i
5:27
love seeing this and it’s super cool that you all provide a safe space for this stuff to happen too
5:32
so thank you for sharing that um david how about you have you seen have there been any pr’s that have really
5:38
uh caught your eye this week actually yeah but primarily because i’m i’m neck deep in the project so for open
5:46
teams we’ve been building a new um kind of infrastructure so we can build up rest end points and clients for
5:54
those rest endpoints really really quickly i can share my screen as well real quick and kind of show off the project
6:00
so we call it um we call it SCRUD for semantic create
6:07
retrieve update and delete i'm trying to find the right window here it is
6:12
so if you looked on github and you went to open teams inc we’ve got a couple of different
6:17
projects that i can share and we’re pushing on all four of these are really really fast
6:23
so we've got this overall SCRUD strategy that we've defined uh where what we're trying to basically
6:28
do is make it so we can just specify JSON Schema and some JSON-LD context and get endpoints for
6:34
free but then go further and build on the work folks have done around JSON
6:39
schema form generation to make it so we can really rapidly throw together
6:45
applications but applications that are informed by the meaning of the data not just the type of the data so a great
6:51
example is if it’s a location field then maybe we want to provide you like a map selection
6:57
selector or something and so this provides a way to kind of flexibly extend
7:02
how things get rendered according to their meaning the goal being we should be able to build up things really really fast by
7:09
defining these mappings and do less coding so we’ve been building up the front end of this which includes a kind of special
7:16
caching client and an extension or wrapper around the Vue schema form generator
7:21
and then we’ve had another group of folks working on the back end making it so that if we start with these
7:27
schemas and these context documents we get endpoints for free um so we put
7:32
we’re going to be demoing this internally uh tomorrow and i’m really proud of the team and
7:37
really really excited super cool well hopefully you can get ProbOnto in there the ontology for
7:43
probability distributions probabilistic ui right
7:50
integers just randomly based on distribution but uh yeah let’s get to all of the
7:55
statistics and probability distributions that we can handle now um i’m gonna move along on my uh
8:02
on my pr part and uh i really wanted to dig into the PyMC project here
8:08
um yo yeah there’s a lot of interesting stuff there’s a lot of interesting open source
8:13
drama organizational things happening on the Jupyter team-compass right now i
8:19
it’s my afternoon soap opera so that’s why i like those pr’s they’re uh they’re a little juicy to me
8:26
um but nevertheless uh PyMC here um PyMC
8:33
uh is a project it's on github of course um and there are over 5,000
8:40
stars on the project so it's pretty active quite a few forks on there also and it looks like on conda-forge for
8:46
example there's over 200,000 downloads so um PyMC seems like a really
8:52
popular project and i’m excited to uh talk about it here um so david maybe you could uh lead us off
8:59
yeah absolutely uh thomas can you tell us a bit about uh who started PyMC and why was it started and what need
9:07
does it fill so chris wants oh yeah
9:14
i am it was just delayed oh okay good stuff um yeah so
9:22
PyMC 1 uh was launched by Chris Fonnesbeck while he was still in grad school many many
9:28
moons ago and so he uh basically just wrote this together in
9:34
pure python just to fill his own needs he needed a sampler to run his biological models
9:40
and then however just because it was pure python was pretty slow so he teamed up with
9:48
David Huard and they then created PyMC 2.0 and that was a major improvement
9:55
because now the likelihoods were implemented in fortran code which made it run vastly faster than the
10:01
previous version and that gained quite some popularity and that’s the level where i got involved in it and
10:07
i built some packages on top of that and that's how i got
10:13
to meet chris and some other people who were involved in the project at that time like John Salvatier and john was really
10:21
thinking ahead back then about uh how can we push these frameworks to
10:27
really the next level and he became really interested in these gradient-based samplers and so that requires you not only to
10:34
evaluate the log likelihood of the model which is what PyMC 1 and 2 and every other framework can do you also need to
10:41
compute the derivative the gradient of that and that turned out to be really difficult so he tried doing that but it
10:47
didn't quite work and then uh but he knew that those samplers were just vastly
10:53
superior to everything that came before and he was very really the first person to realize that
10:59
these deep learning frameworks um back then there was only Theano uh now we have TensorFlow and um Pyro
11:07
and PyTorch and all these other ones but he was really the first person to be like well i wonder if
11:12
this tool that was built for building neural networks and can compute gradients via autodiff can
11:17
be retrofitted to build computational
11:22
graphs of statistical models and turned out that was not quite possible but almost possible
11:28
so he did the legwork to do that and then really wrote the core of PyMC3
11:34
uh based on theano and just like now this is obvious right like we have
11:40
TensorFlow Probability we have all these other packages that are building on these computational graph libraries but back
11:45
then no one had even thought about it so for me there was really just like incredible foresight on on his part
11:51
to do that and then also to like really write the core of PyMC3 a lot of that still exists today and i remember
11:58
talking to him a lot about this and being extremely excited because i was like this is the coolest thing ever
12:03
and i contributed a few small pieces and um and then
12:11
so he built this prototype but then sort of moved on to greener pastures
12:17
but um i was just uh so excited about it that together with Fonnesbeck we basically pushed that over the finish line and
12:24
turned this into like a proper 3.0 release that people could use and back then it was still pretty small
12:31
but so slowly we started getting more contributors and more users and just very organically
12:38
grew over the last couple of years to now we have over 20 core contributors and
12:43
we mentioned some download stats so we have a lot of users in academia there's over 700 citations
12:49
to our original publication and it’s being used at um all the big and small companies so we’re
12:56
every day we're learning about new companies that are using PyMC3 so that is really motivating and inspiring to us
13:02
and really the reason why we develop this is well we all just believe that Bayesian
13:09
statistics is uh the best way of doing statistics that's what i would say
13:14
to me it's the first time that statistics uh which i took many classes on but for me that was the first time it
13:21
really actually made sense we’re like oh i don’t have to like derive all these different estimators for the student t-test and whatnot i just
13:28
can write down my model and then i like hit the inference button and it works and i can change the model
13:33
like and it’s all laid out there in code like this is incredible and so for me and all other data
13:39
scientists it’s just really empowering to be able to build my own statistical models and
13:47
get uncertainty quantification and all these very nice things be able to include my expert information or get that from the
13:53
stakeholders who are interested in the model and have their own opinions on what sort of should go on so i can incorporate
13:59
that into the model too and and solve very targeted problems on very
14:07
richly structured data set problems so contrast that with machine learning for example where
14:13
well the problems that you’re solving there is you have a table of rows of the data points and
14:20
then you have a bunch of features and then you have labels or it’s a regression type of problem and
14:25
you just if that is your problem right and you only care about prediction amazing just go ahead train XGBoost or
14:33
scikit-learn like it's amazing but if your problem falls outside of that domain it becomes really difficult
14:38
right um you have hierarchical nested structure in your data or it's time series like all these kind
14:44
of things are just really common problems right and uh they're often not very well suited to machine
14:49
learning which is taking up so much attention um but i think more and more these methods are being
14:55
realized it’s just very elegant powerful methods to solve these problems and
15:01
yeah with these new samplers we can finally actually do that so that was great i love all the history
15:08
um and it’s deep but let’s let’s ease into it a little bit i’m just curious about the name clearly
15:15
all of that history and the name means something can you tell us a little bit of the history of the name and kind of
15:20
the logo uh just kind of give us some you know branding context around uh the
15:25
PyMC project um so PyMC just literally stands for Python
15:31
and MC for Markov chain Monte Carlo uh i think at some point
15:36
chris maybe called it PyMCMC but that was a bit long in the tooth so PyMC it was and is
15:44
and i don’t know who first like brought in the rocket ship of the logo
15:49
um but when we uh did our logo for PyMC3 on 99designs we just uploaded the
15:56
previous one and every designer just really latched onto the rocket and i mean i think the rocket is kind of cool
16:02
right that's the spaceship uh it has actually been used at SpaceX to build rocket ships so that's fitting um yeah it takes you to
16:09
the moon it looks like a distribution i never made that connection but that’s
16:16
excellent yeah like a normal distribution almost perfect well um are there any alternative
16:21
projects out there that you’re aware of no no i’m just kidding um of course
16:26
uh so the other really big popular uh and amazing project is Stan um it's written by Andrew
16:35
Gelman and a lot of other really smart people it came up about the same time as PyMC3 actually
16:40
and they also use these gradient-based samplers which is the major revolution and we're really indebted to
16:46
these guys they have been really helpful and a lot of features in PyMC3 are directly
16:51
inspired by uh what the Stan guys have been doing um the reason why i really like PyMC is
16:58
just you can write your models directly in python and for me that’s just more natural way i don’t have to like
17:04
enter the data here and get results out there and there’s this weird wrapper thing going on of course the downside is you can only
17:09
use it from within python so if you're an R user you can't really use PyMC3 but you can use Stan
17:16
and then since then there have been a whole bunch of other really cool interesting packages uh that i i would say are more
17:23
experimental or targeted at researchers just because they’re a more lower level and don’t really have
17:30
all the nice UX uh that i would say PyMC has and those include Pyro
17:36
um or TensorFlow Probability um some really cool uh work is being done there and they're really pushing the boundaries and
17:43
pushing it also with respect to how large these models can scale
17:51
okay cool it's good that people hear about all those different things oh tony well i'm muted accidentally okay
17:58
so thanks for that description uh can you tell us a little bit about the technology that PyMC
18:06
is built on and how it kind of differentiates itself from those other tools
18:11
right so PyMC3 is built on Theano which was the very first deep learning
18:17
library that allowed you to write down any neural network and then it would
18:25
take that and pour it into a computational graph and from the graph structure you can do all
18:30
kinds of simplifications you can compute gradients and then you can also compile that down to c
18:36
or to the gpu and then make it run really fast for PyMC3 we're really benefiting
18:42
from that because the whole code base is just python right PyMC2 had these fortran wrappers
18:47
that made installation really really difficult now it’s a pure python library
18:53
and it’s fast because well we have this compiler in the theano library that is doing that
19:00
Stan wrote that mostly themselves its core is written in C++ so
19:07
they're basically taking that and compiling it themselves so we don't have that burden and
19:14
and we’re getting like gpu support for free which is kind of nice uh since then like all these other
19:20
packages have really followed in those footsteps of using computational back-ends now Theano
19:26
is pretty old by now i still think it’s one of the best actually but nonetheless uh people are not as excited
19:32
about it as they were about TensorFlow or PyTorch or now JAX um yeah um so that's
19:40
what i would say is that’s the the technology that we using
19:45
who’s who are the current maintainers of the project the day to day keeping it going well really it’s
19:52
it’s a lot of people um so uh i’m uh i i don’t want to list them all just
19:58
because i would forget some um and so as i said there’s over 20 people who are like active on it
20:05
so if you just browse github or you look at the commit history that would do it much more justice but
20:10
yeah there’s a lot of people that have been involved for a really long time and they’re all extremely talented and are really
20:16
pushing the project forward so um one thing that i’d like to correct is a lot of people think that i’m
20:22
either the author or like the person who like has done the most work neither of these are true um like it’s
20:27
it's an effort by a lot of people and the main author of it is John Salvatier who like
20:33
wrote the initial prototype so credit should go to everyone there it sounds like it’s a real vibrant and
20:39
active community and we love to see that that’s exciting when we get to talk about that well so
20:46
clearly statistics apply to a lot of people or at least i hope they realize that uh um what communities do your users and
20:54
your contributors come from or is that another thing where you can't list it because you'd skip some people
21:00
too that’s exactly right it’s just too many so one thing that i love to do
21:06
is just browse the papers that are citing PyMC3 because all of those are papers
21:13
that use it for them um for their science so a lot of it is
21:18
astrophysics uh biology seismology zoology um
21:24
like just insane like chemistry it’s really from like domains i haven’t even heard about
21:29
um so there’s a toolbox uh so it’s being used in a lot of papers
21:35
directly but there are also toolboxes built on PyMC3 uh one of those is for example exoplanet which is a toolbox used to
21:41
discover planets outside of our solar system like that's totally wild i mean it's
21:47
incredible to have that amount of impact right and it's used in all these different domains um there's BEAT which is like for
21:53
earthquake detection uh and and just many more so yeah it’s really exciting and it’s used
22:00
in yeah i haven’t found many domains where it’s not being used um so that would be easier to list and
22:06
uh and also just in companies so a lot of companies are using it surprisingly a lot of them are using it
22:11
for simple things like A/B tests um so that seems to be a very popular reason but also um in finance so i
22:18
used it a lot at Quantopian to do portfolio allocation so that is really just a testament to the
22:24
versatility of well statistics in general is you can just use it for whatever data science problem
22:31
you have i think for a lot of those if it’s more structured as i mentioned before
22:37
it could be a great fit
22:42
thanks so much yeah um let’s see uh is the project participating in any
22:48
diversity and inclusion efforts today um so yeah we definitely are very
22:53
cognizant of that and are um talking internally a lot about that and a lot of our members are active in
23:00
those communities so um Ravin for example
23:05
is being very active and like presenting at the PyLadies LA meetups and those type of
23:11
things so yeah we we love those initiatives and uh where we can we support them
23:16
that’s fantastic oh you have mute again tony i know i
23:23
keep playing i got a fan on here man yeah it’s a little i apologize uh
23:28
all right that’s right um yeah so thank you so much for that introduction i think now it’s time to get to the goods
23:34
though um maybe we could shift over to a uh project demo so thomas if you kind of
23:41
get your screen set up and uh walk us through some things um but in
23:46
the meantime i really want to thank our sponsor Quansight the good folks
23:51
good folks that keep me working in science day to day at Quansight we're creating value from
23:57
data um and uh this is brought to you by Quansight uh we've got a lot of open source stuff
24:02
going on oh yeah one thing uh that i also in
24:09
terms of like which domains are using it uh there’s this website called rt.live
24:14
which is a very timely website for estimating the reproduction factor of COVID-19
24:21
so here for every state in the US you get an estimate of the reproduction factor and of
24:28
course because we're Bayesians we have error bars or uncertainty intervals around this and you also get a timely estimate for
24:34
this um so the model was built by
24:39
Kevin Systrom and Thomas Vladeck so Kevin Systrom is the co-founder and former CEO of Instagram
24:46
so that was really interesting uh sort of thing that happened where he was just like oh yeah let me just build
24:51
this model here i'm not really an expert but i'm going to use PyMC3 to do that and we helped with
24:58
some technical support and so the model is also open source so you can check it out
25:04
and really look at the PyMC3 code that does that and uh but yeah so
25:11
the model itself is actually really sophisticated and interesting there’s also a tutorial
25:17
here and so that is i think
25:25
i would argue one of the best models out there currently for estimating the reproduction factor if you want to show
25:30
this notebook you might have to jump to nbviewer because github yeah you know github's
25:36
just like sometimes right what is up with that um okay so
25:42
um PyMC3 let's just like take a brief look um and tell me if i'm going too
25:49
slow and you want me to speed up i’m gonna just try and do a high-level demo to
25:57
that’s good everybody the pace of life is far too fast for how little we leave the house i could i couldn’t agree yeah um
26:04
so uh yeah here is the website and um Colin Carroll did an amazing job with
26:11
that so he really uh just took our previous website which was sort of a little bit old and really revamped
26:17
that and um so yeah you can see barely this is
26:23
probably the simplest model well not the simplest but a simple model that is just a linear regression and
26:29
let’s say we have some data x um some of these are the
26:35
coefficients and then i have some y’s that i want to model that i want to explain with my linear model i set up
26:42
a model context so this is the boilerplate code and will be for bookkeeping
26:48
every model will look like that and then we define our parameters so here we have the
26:54
weights uh of my linear regression and i have a noise parameter that’s the variance
27:00
or the error term and here um because we are Bayesians
27:07
we don't just define the parameters we also define the priors placed on them and
27:14
here for the weights i'm now choosing a normal prior so PyMC3 comes pre-equipped
27:19
with a lot of the probability distributions that you want to use so this prior basically says without
27:25
having seen any data what do i believe a likely
27:30
coefficient for this linear regression to be right so i haven't seen any data but i
27:36
still might have beliefs about that so here my belief is that the coefficient is going to be centered
27:42
around zero with a standard deviation of one and then for the noise well i have some
27:49
prior information right i know that the standard deviation of my normal likelihood can only be positive
27:55
so that's why i'm choosing a gamma distribution again i give it a name just like i did before it has two
28:02
parameters that specify the shape of that gamma distribution and now i'm putting this model together
28:10
and say okay i assume that my data is normally distributed again i give it a name and then my mu
28:18
the uh the mean of that data is now going to be a vector
28:24
which is the dot product of my data x with the weights so
28:29
basically this is yeah just follow the math of the linear regression and then my sigma is the
28:36
noise parameter here with the observed keyword argument i give it the output data that it's supposed to model
28:42
and then this concludes the model specification and what’s happening behind the scenes
28:49
here is that we're building up this Theano compute graph of the log probability so nothing's actually happening yet but now once we have that
28:56
model specified we can do all kinds of things with it and inquire into it one thing is we might just want to
29:02
sample from the prior like what does the model think is um
29:08
what the prior what kind of patterns does the model think are
29:13
plausible so this is the prior predictive then i can also do that is mostly what
29:19
we do is we sample from the posterior which is basically what are my what is my belief
29:24
in my parameters after having seen the data so this is usually if you run maximum likelihood
29:30
estimate right you’re just fitting the parameters to the model now here of course we’re bayesians so we will have
29:36
not just single numbers as a result but we have probability distributions and i will show what that looks like
29:43
and then we can also sample from the posterior predictive so this is just basically the very core
29:48
intro of how you can do this model but really the power of the approach comes by being
29:55
able to also build much more complex models so if you click on examples
30:00
you see that there’s all kinds of different ones so there’s various ones on linear regression models
30:08
and hierarchical regression which is a very powerful tool um and also more advanced ones
30:15
um there’s different case studies that you like for example this is a very classic hypothesis test that you might
30:20
want to run and also more advanced ones like gaussian processes which are very
30:26
flexible regression type models non-parametric ones there’s ordinary differential equations
30:33
examples mixture models so everything every type of model you might want to build we have an example of how to do that so
30:39
this is really a treasure trove of learning about this and what i also wanted to show briefly was
30:49
an example model um about this uh yeah i mean i looked at the docs last night
30:55
they were gorgeous um and i was like oh i learned some things i was like oh what’s this distribution
31:00
and it was fun to like you know kind of poke around there um how does this help you
31:05
teaching statistics it seems like you can teach statistics in a much more like rich and visual way
31:11
than perhaps like the old way uh it’s funny to say that um so one
31:17
thing that i’m doing is
31:22
building a course that does exactly what you just said um so i think there is a much
31:28
better way to teach statistics um by using intuition and
31:34
little graphics first right so i think a lot of statistics books and resources are just extremely math
31:40
heavy right so for someone like me coming from software engineering it was um it took quite a while really to um
31:48
like go through all the math and really understand things at an intuitive level once i did i noticed that like oh well
31:54
this is actually not a complicated concept at all like a lot of statistics actually really simple once you sort of have an
32:00
intuitive understanding for that and with this course which i’m currently creating
32:06
i’m basically want to teach statistics in this way so if you’re interested in
32:11
that go to twiecki.io and enter your email address and
32:17
then you’ll be updated on my progress on the course so yeah i think there’s a better way to teach that
32:23
i’m currently working on on providing that super cool thank you
32:32
um so one thing one idea that i just wanted to plant which for me was really
32:38
when i first grokked the power of Bayesian statistics was this model where we are looking at
32:44
the returns of the S&P 500 over time so these are just daily returns and one thing that you can clearly see
32:51
is that volatility right how wide these swings
32:56
are in both directions really clusters over time so here in 2009 right during the financial crash
33:02
there was a lot of volatility in the market then it started to decay and well we don't plot it up until today but during
33:07
COVID-19 there was definitely a lot of volatility as well so how would we model this well
33:15
returns we might assume are normally distributed but that is actually a pretty terrible
33:20
model just because we’re having these wide swings so you can use a student t distribution
33:26
which has more mass in the tails and accounts for these uh sort of black swan events or just like
33:32
much more uh these tail events but also what becomes clear is that if
33:38
we just fit a student t distribution right or we can just think of our normal distribution
33:44
that the standard deviation of that distribution will if we just fix that over time it’ll be a
33:51
terrible model so instead what we can do is we can allow that to vary using what is called a gaussian random
33:57
walk so that is a specific type of prior that we can place and that will allow the standard deviation to change ever so
34:04
slightly over time and this is what that model looks like so we have this custom random
34:09
walk distribution in here and then we just feed that into the student t distribution and with the power of bayesian
34:17
statistics we can just fit that you just call the sample function we don’t have to do anything else we look at the posterior and then we get
34:24
the volatility over time and how it’s changing and what i just want to highlight about this
34:30
is this model is actually crazy complex so instead of just having one standard deviation parameter and one mean that
34:37
we’re estimating over this data set now we have a standard deviation parameter for every single data point
34:43
right so we have more parameters than data points here but it still works because of the
34:49
constraints we’re placing in here so because we’re assuming that this has a gaussian random walk and that it can’t just change
34:54
arbitrarily at every time point but will be constrained by the previous point uh we can still fit that
35:00
model so this is really now yeah as i said quite a complicated model but with the NUTS sampler this works
35:07
really well and we can just estimate it easily
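As a rough illustration, the generative story behind that stochastic volatility model can be sketched in plain NumPy (the parameter values here are hypothetical, just to show the structure): log-volatility follows a Gaussian random walk, so it can only drift slowly, and each day's return is Student-t distributed with that day's volatility.

```python
import numpy as np

rng = np.random.default_rng(42)
n_days = 1000

# Log-volatility follows a Gaussian random walk: each value is a small
# normal perturbation of the previous one, so volatility is constrained
# to drift slowly rather than jump arbitrarily at every time point.
step_size = 0.05                          # hypothetical innovation scale
log_volatility = np.cumsum(rng.normal(0.0, step_size, n_days))

# Daily returns are Student-t distributed (heavier tails than a normal,
# which accounts for black-swan moves), scaled by that day's volatility.
nu = 5                                    # hypothetical degrees of freedom
returns = np.exp(log_volatility) * rng.standard_t(nu, size=n_days)
```

Fitting the reverse direction — recovering the latent volatility path from observed returns — is what the PyMC3 model in the docs does, placing a `GaussianRandomWalk` prior on the volatility and calling `pm.sample()`.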
35:14
so that idea we can just extend to other ones like let’s say we have a linear regression that we would do but over time so this is
35:21
the price of two stocks over time um and you can see that uh they are correlated right but they’re
35:27
sort of changing over time we can just apply that same trick um so
35:32
this is if you just fit a linear regression to it but we can apply that same trick of the gaussian random walk
35:38
and say okay my intercepts and my slopes are allowed to change over time according to this random walk so they’re
35:44
going to be tied to and distributed according to the intercept at the previous time point
35:50
and again we’re using that gaussian random walk distribution i’m well aware that this is not easy to grok but i just want to show
35:57
the high level idea of this and then again we just call the sample function and we see that the intercept is
36:03
changing over time the slope is changing over time and now with just a very simple change to our original regression model
36:10
now we get a regression that is also allowed to change over time so this is basically now the
36:15
new model and so that idea can just easily extend to all kinds of topics um like here i have a blog post
36:24
also on my blog for doing this with a linear sorry logistic regression
36:32
so if you have data that is also changing over time you can build a logistic regression model that is also
36:37
using the gaussian random walk and then fits a classifier where
36:42
intercept is changing over time so that’s pretty cool and then you can
36:48
also extend that to a neural network model if your data looks like this so now we can’t linearly separate that
36:55
right so uh this is like the half moons data set and changing over time you can build a neural network model where the weights
37:02
are allowed to change over time and then also they will build a classifier a
37:07
non-linear classifier that is able to separate this non-stationary process
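The time-varying (rolling) regression described above can be sketched the same way in plain NumPy (all values hypothetical): the intercept and slope each follow their own Gaussian random walk, so the relationship between the two series is allowed to drift over time.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Intercept and slope each follow a Gaussian random walk, so the
# regression line drifts slowly instead of being fixed for all time.
intercept = np.cumsum(rng.normal(0.0, 0.02, n))
slope = 1.0 + np.cumsum(rng.normal(0.0, 0.02, n))

# Stand-ins for the two correlated price series: y is regressed on x
# with coefficients that change a little at every time step.
x = rng.normal(size=n)
y = intercept + slope * x + rng.normal(0.0, 0.1, n)
```

In the PyMC3 version, the random walks become `GaussianRandomWalk` priors on the intercept and slope, and `pm.sample()` recovers their paths from the observed prices.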
37:13
so those are just some high level points that’s gorgeous
37:19
david wasn’t i right when i said that they have great documentation i know i i know they already knew that but
37:24
aren’t there aren’t these docs beautiful with the videos and the pictures and equations and stuff like you’re learning
37:30
from them even if you don’t understand the equations you get a feel for what’s going on in the math just because of the
37:35
visualizations which i think that that’s critical yeah yeah so uh maybe we shift gears and start moving
37:42
over into the roadmap discussion and use this as a jumping off point um i really want to talk about y’all’s
37:47
documentation uh it seems like documentation’s like a good measure of the health of a community and
37:53
i would say your community looks very healthy by your documentation can you tell us a little bit about that
37:59
and how you’re able to interact with the community and how you use this to kind of build
38:04
community perhaps certainly so documentation has always
38:09
been very important to us and well i think everyone understands how important it is but nonetheless it’s often not the most
38:16
exciting work one thing that i think we benefited from
38:22
is that people love to write blog posts about the cool models that they built so a lot of these example notebooks that
38:30
we have come from uh just um notebooks and blog posts that people
38:35
have done and then when i find a cool one i often would reach out to the author and be like hey this looks really great
38:41
uh do you want to add this to the documentation
38:46
so this is how a lot of this was built um by now when people submit pull requests and
38:52
there’s a lot of people submitting these pull requests from all kinds of different places
38:59
we have a requirement that for example if you say okay well open a new pull request um
39:07
oh well whatever there is a checklist of what you need to do and it includes documentation so every
39:15
new feature that gets included has to have an example and has to have documentation so um just by placing emphasis on that
39:23
it’s very important um but we’re also lucky that because we have quite a few
39:28
contributors so here i just noticed Oriol Abril who’s a core contributor he
39:35
does a lot of work on this and is doing fixes on this so we have quite a few people that enjoy
39:42
doing documentation which is very important and it’s a nice way to get visibility if you um do something so it’s very easy for
39:49
people to get started on and then transition to more deeper parts of the code base
39:56
oh and the last thing that i want to mention then is there’s also all kinds of other
40:02
materials so you can go to books and videos we have a discourse forum where people
40:07
can ask questions uh we have a conference so we have
40:12
PyMCon which will happen later this year which i hope everyone will submit a
40:18
talk to and also attend
40:26
so i’m really excited about that i’m just posting the link uh so yeah there uh well it’s really
40:33
about community that’s uh for us really important thing and i think part of why we have such a healthy community and
40:41
here chris, our you know benevolent dictator for life, is going to give the keynote and we have two other keynotes which we're really
40:46
excited about which we’re going to announce soon and um but yeah and then uh what i
40:52
actually wanted to show was uh that that there’s also a lot of work going on
40:57
into porting either porting books of people who have
41:03
um already written books so there’s Bayesian Methods for Hackers which was
41:08
written for PyMC2 and ported to PyMC3, Statistical Rethinking which is an excellent book uh which was written for
41:14
Stan but ported to PyMC3 uh Osvaldo wrote a book
41:20
directly targeting PyMC3 there’s all these resources and then there is uh all kinds of talks
41:27
on youtube that you can watch so yeah we’re just uh
41:32
we’re very active in this domain and i think it is a big um reason for the health of the
41:39
community and the project
41:44
this is fantastic it’s kind of inspirational actually just in terms of uh looking at how you’ve structured this
41:52
and the the material you produced it it gives you an idea of how to
41:57
if you wanted to build your own community and foster it this is a great resource um you know in addition to just
42:02
being a wonderful resource for the PyMC3 community absolutely um well
42:09
super cool uh in there maybe maybe we uh we move along and we saw cool stuff
42:15
y’all have where are you all going uh now so maybe you could tell us a little bit some
42:20
future directions some places where you might be looking for some more contributors because PyMC seems like a healthy place to be
42:28
contributing yeah definitely so if you want to get started um we are excited to
42:34
to have you um one good place to start is to just look at the issues we have different labels um so for
42:41
example with a beginner friendly label and you can just look at what is there this is like low hanging fruit that’s
42:46
really helpful for the project but uh is easy to do and but also always very eager to to
42:53
help people one trick that i started doing which is more uh more successful than i thought it
43:00
might be a lot of people are writing issues where they’re saying oh this doesn’t work and i’m like oh okay um
43:06
like the problem might be here and there do you want to do a pull request to fix that and then like oh yeah sure
43:12
let me give that a try so a lot of people got started that way just by basically being lazy ourselves
43:17
and asking people to chip in and um i would say in like 80 to 90
43:24
percent of the cases people are really excited to do so so that’s a great way um
43:29
the um the direction that i’m most excited about is one of them is well there’s PyMC4
43:37
which um i want to talk about next um but before that uh there’s all kinds
43:43
of different things going into PyMC3 as well so one thing to note before that
43:49
is that we're building on Theano right Theano as most people are probably aware
43:55
uh the original authors stopped developing it uh which is a bummer uh because it
44:02
really was a great package and sort of was the backend that we built PyMC3 on now
44:08
um this was one of the reasons why we started PyMC4 but we also just realized that the existing user base
44:15
we have is just huge right and PyMC3 is really already such a
44:22
mature package and does almost everything we would want it to do so what we’ve done is we forked Theano
44:29
so we are now maintaining Theano and pushing it forward ourselves so it’s not dead and one of the directions that we’re
44:35
pushing it forward on which i’m incredibly excited about is uh to make
44:41
it use the JAX backend so JAX is a new framework similar um to say TensorFlow but much
44:48
more lightweight and better in many aspects i would argue and we can make Theano use JAX as the back
44:55
end and that way we get access to all the modern technology there and
45:02
to things like gpu support or tpu support and we don’t have to worry about the um
45:08
the C code that is in the Theano library instead we can just use JAX which does all these amazing compile
45:14
time optimizations and everything but still rely on Theano so that way um we’re actually solving all the problems
45:20
we still have Theano which is even today i think uh on par with all the other
45:25
projects and now we are also on modern infrastructure basically and don’t have to rely on the Theano way of doing
45:33
things that was amazing like 10 years ago but now is a little bit
45:38
old a couple of other packages that i’m
45:44
really excited about one is Bambi uh Osvaldo Martin who wrote the book is maintaining that
45:49
um and it basically allows you to specify all kinds of uh linear regression models
45:56
as you would in R so just with a single formula string it’s very fast and you can use hierarchical models
46:03
another direction that is really promising is Adrian Seyboldt wrote sunode which is a
46:10
wrapper for an ordinary differential equation library and these ordinary differential
46:16
equation models happen a lot in epidemiology so they’re very topical today
46:23
so these SIR models are all ODEs and so this library just makes it extremely fast and it’s
46:29
uh integrated with PyMC3 so that is another great one um that’s cool hey of course
46:36
i got to give you uh i got to give you a choice here we have a lot of questions from the
46:41
folks in the yeah do you want to keep talking about the future or do you want to answer the questions
46:47
uh let’s go to questions then um the other thing that i want to make sure i notice and i mentioned is
46:55
um the the thing that we’re doing now and that is uh so we
47:02
are getting a lot of interest in PyMC3 from all kinds of different companies so over the last couple of weeks i just
47:08
had many discussions with people who basically need help with that and
47:15
also with the other PyMC core developers we always thought like oh how
47:20
amazing would it be to work together so now finally it seems like the stars have aligned and
47:25
i’m really excited to announce PyMC Labs which is a new consultancy that we are
47:31
launching and where we are helping companies to implement or improve their PyMC
47:38
models and um basically add this new uh way of doing statistics um
47:45
to to solve analytic problems that are really difficult to solve any other way so if you are uh
47:52
if you want to talk to us send me an email it’s thomas.wiecki@gmail.com i’d be more than excited to talk to you
47:58
about that and work together with you so that’s my plug you keep telling us all these little
48:03
things i can’t figure out what
48:14
i’m excited for that everyone involved that’s exciting thank you yeah i’m really excited about
48:20
it um i can’t wait for their training content to be honest with you i think that’s going to be some good stuff um
48:26
well how about we get to some questions from uh the audience um l lindsay
48:32
uh was our first question um if you’re familiar with pomegranate uh it’s probabilistic modeling in
48:38
python uh could you provide some insight into how PyMC3 is different
48:44
uh different use cases different capabilities uh yeah absolutely so pomegranate i think
48:50
is um a package for uh mostly building discrete
48:57
models so um it allows you to build like um i hope i’m not completely mixing this up
49:04
yeah it’s for mixture models uh and hidden Markov models so we have discrete states um pomegranate allows you to
49:12
build these very flexible hidden Markov models and PyMC3 has a much bigger focus
49:19
on continuous models so if you want to build markov models i would say this is a great way
49:25
PyMC3 is focused on continuous models uh however actually this is another
49:31
future direction uh Brandon Willard who has done work that i mentioned in all kinds of
49:37
different ways um he is working on contributing a new module for PyMC3 that focuses on hidden Markov
49:44
models and is really flexible and allows you to do that as well so currently for hidden Markov models you need pomegranate or
49:50
something similar but um hopefully soon you’ll be able to do that in PyMC3 as well
49:59
that’s that’s great um let’s see there was another question from emil dimas uh can you give an
50:06
interpretation of the standard deviation of the standard deviation of the posterior thank you
50:13
yeah so this is always a confusing concept where well we have a parameter right
50:20
that tracks the the standard deviation let’s say we just have normally
50:26
distributed data and we want to estimate the mean and the standard deviation well you can do that in numpy right you
50:32
just compute mean and standard deviation uh and that will give you two numbers so that’s fine
50:37
um in a bayesian framework we are always working with posterior
50:42
distributions and these posterior distributions describe my uncertainty in the parameters themselves so
50:49
i have a posterior over the over the standard deviation of my data and
50:55
of that posterior distribution i can also compute the standard deviation so what that would reflect is the
51:01
standard deviation of the standard deviation is my uncertainty in the standard deviation of that parameter
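That claim can be illustrated without PyMC3 at all, using a quick bootstrap in NumPy as a rough stand-in for the posterior (resampling the data and re-estimating the standard deviation each time — the values and helper name here are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def spread_of_std_estimate(n_points, n_resamples=2000):
    """Bootstrap stand-in for a posterior over the standard deviation:
    resample the data many times, re-estimate the std each time, and
    report how much that estimate itself varies."""
    data = rng.normal(0.0, 1.0, n_points)
    estimates = [rng.choice(data, n_points).std() for _ in range(n_resamples)]
    return float(np.std(estimates))

# Few data points -> the std estimate is very uncertain;
# many data points -> it is pinned down tightly.
uncertainty_small = spread_of_std_estimate(10)
uncertainty_large = spread_of_std_estimate(1000)
```

With a real posterior from `pm.sample()` the same pattern holds: the standard deviation of the posterior over sigma shrinks as data accumulates.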
51:07
if i have very few data points my uncertainty in the standard deviation of my
51:13
data will be very wide given a lot of data it will be very narrow so that is what the standard
51:19
deviation of that represents well that was a really good explanation of a hard question i
51:26
think so we have another question from ll lindsay um do you have any
51:32
recommendations for uh tutorials uh references outside the built-in
51:38
tutorials or bayesian methods for hackers um to show real world applications for example
51:44
the rt.live example uh that’s a good
51:50
question yeah um so in terms of real world applications hmm well i would say that definitely
51:58
um the the books on the website are pretty good um like
52:05
oswaldo’s book um for actual industry applications
52:10
i don’t have a good answer so that’s definitely one thing that i’m trying to do with my course is like really have those tough um
52:18
application models uh there are definitely some in here that have a real world use
52:25
case like the stochastic volatility model that i showed um and the Gaussian process ones for
52:31
example the Mauna Loa one for CO2 levels uh that is like a really good one um
52:38
and um yeah so i i would look at this page but actually it’s a
52:44
good um i don’t have a great answer to that
52:50
fernando erazebal asks do you know when you’ll be releasing your course and approximately how much
52:56
it’ll cost um so i hope
53:02
that we’ll be able to get it together by the end of the year and uh on cost i don’t
53:09
know yet so um that is currently what we’re determining is really the length of the course the thing that i’m
53:17
now thinking is that it will be a shorter course um the first one will
53:23
be very intro level focus just really building an intuitive understanding of the core
53:29
concepts so it’ll be shorter and then there will be another course uh
53:34
for more applied and more advanced topics so rather than one big thing i’m now
53:40
breaking it down into multiple smaller ones and that way i can release it sooner uh at a lower price point per course
53:47
and um and yeah i think it’ll be more targeted
53:52
but thanks for interest great man you seem like a good teacher i want to take that course now
53:58
yeah um so uh we got a question from uh david brochart um which is how has research advanced in
54:05
quantifying uncertainties in uh deep learning inference are there applications embedding PyMC3 inside say
54:12
tensorflow or the other way around building a neural network inside of PyMC3
54:19
yeah that’s a great question um definitely that’s a very active field of bayesian deep learning
54:24
and there’s a lot of really interesting things happening there
54:30
it’s not really a focus of PyMC3 i have some blog posts that show how to do that but there’s
54:37
other packages that are more focused on doing that um and i would say for example tensorflow
54:44
probability has some really strong examples in that domain and so does pyro um i would start looking there
54:52
um and actually PyMC4 as well will
54:58
support that so um because PyMC4 is built on TensorFlow and TensorFlow Probability uh it can use
55:05
frameworks like Keras and whatnot so the support there will be much much stronger
55:17
amit asks can you give an example or examples of
55:22
how you’ve used this additional information about uncertainty to influence a
55:27
business decision what are your thoughts around conveying uncertainty to management
55:33
that is a really important question and i fortunately just
55:40
uh i don’t know like two years ago maybe found it like the perfect answer and
55:45
it’s bayesian decision making so the question is right on right so um
55:52
and this is the story uh that i start with here as well you are a data scientist and your manager
55:58
tells you to like build a model and you come back and you’re all excited about the posterior distribution that you have and he’s like what’s the posterior
56:03
distribution right so uh it’s very difficult to communicate these ideas but instead what i’m arguing for
56:11
in this blog post together with Ravin Kumar who’s also doing uh really cool stuff
56:16
and you should follow on twitter is uh that you can use bayesian decision
56:22
making to instead of just providing a posterior distribution that will be the output of
56:27
your model you use that posterior distribution in an optimization to solve the business problem directly
56:34
so you’re not just saying oh like this is the output now i’m done you’re saying well actually what’s the problem that i’m trying to solve with
56:40
this and here’s the example of um supply chain optimization where we’re
56:47
trying to say how many items do we want to um order from which
56:52
supplier and the joke is that we are operating a spaceport so it’s a kind of futuristic
56:58
vibe um and with that loss function basically the output is not just the
57:06
posterior distribution over which supplier is how reliable right but actually we’re taking this all the way and solving the problem to tell
57:13
us um directly how many we should order from every supplier so at that point
57:20
right the output is not a posterior distribution the output is a decision that we’re making and that decision is
57:27
not just based on the most likely case right but it’s on all possible cases so
57:32
because we’re using the posterior distributions um that uncertainty will be taken into account when we come up with
57:39
that decision so it’s not the decision uh yeah that that solves the most likely case but is the decision
57:46
that’s going to be robust across all possible cases
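As a toy sketch of that supply-chain idea (everything here is hypothetical: the numbers, the loss function, and the reliability samples, which would normally come from `pm.sample()` rather than be drawn from a Beta), the decision is the order quantity that minimizes the loss averaged over all posterior samples, not just the most likely one:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical posterior samples of a supplier's reliability (the
# probability an ordered item arrives usable). In a real model these
# would come from pm.sample(); here we just draw from a Beta.
reliability = rng.beta(8, 2, size=5000)

demand = 100              # usable items we need (hypothetical)
unit_cost = 50.0          # cost per item ordered (hypothetical)
shortage_penalty = 500.0  # cost per missing usable item (hypothetical)

def expected_loss(order_qty):
    """Average the loss over ALL posterior samples, so uncertainty in
    reliability flows directly into the decision."""
    usable = order_qty * reliability
    shortfall = np.maximum(demand - usable, 0.0)
    return float(np.mean(order_qty * unit_cost + shortfall * shortage_penalty))

# The output is a decision, not a posterior: the order quantity with
# the lowest expected loss across every plausible reliability value.
candidates = np.arange(100, 201)
best_order = int(candidates[np.argmin([expected_loss(q) for q in candidates])])
```

Because the shortage penalty is averaged over the whole posterior, the chosen quantity pads the order against the chance that reliability is lower than its most likely value, which is exactly the robustness being described.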
57:51
well super cool how about we uh thank you for going over just a little bit i don’t know if you’ve lost track of time
57:56
but we are over so we’re gonna do one more question and then we’ll uh we’ll try and wrap it up
58:02
um this question is from abdul rahman uh uh what are
58:09
the key guidelines you would use to decide which toolset bayesian versus frequentist statistical models
58:14
uh when you solve a real world problem um well i would um i would say that
58:22
i view those as the same thing bayesian statistics and statistics in general so
58:27
i personally never use frequentist statistics uh just because i don’t see a reason for that
58:33
but um yeah um i mean i would um i would start with the one that most uh
58:41
where you find the syntax most approachable so yeah i would i would look uh take a look
58:46
at the Stan syntax take a look at the PyMC3 syntax take a look at TensorFlow Probability
58:51
and from there basically see what suits you best if you’re a python user i hope that um
58:57
PyMC3 agrees with you but if not there’s PyStan as well so these are all great packages and um
59:03
and this is really hard to pick among so many great options well that’s great thank you so much uh
59:10
as we wind up it’s time for our world famous rant and rave section
59:16
uh where we just have 15 seconds to rant about whatever you want um and uh thomas it’s your soapbox
59:24
all right um i’ve been trying to use tensorflow and find it
59:30
really difficult to use uh whenever i want to do something it’s like there’s five different ways of
59:35
doing that and i never know what the most recent one is and um and if i use the wrong one it’s slow
59:42
it breaks so i just found that to be a challenge and we are facing that challenge too
59:47
with um PyMC4 and that’s why i’m so excited
59:53
about JAX which is for me a much more lightweight and better alternative so that is what i think uh the future is
1:00:02
awesome david what are you ranting about today today today i’ve got a rave usually i’m
1:00:07
good for a rant but i just wanted to reinforce uh and say shout out to the scrub team
1:00:12
on open teams and that includes abraham maxfield daniel alves igor dirk ivano gesara ogasawara jacob
1:00:20
halcyon and tony fast and trent so thank you guys i really appreciate the work well thank you for including me
1:00:27
in the list but uh i guess my rave is uh i love this open source directions
1:00:34
so much because uh we use the word model and real world problems so many times
1:00:41
together if this was deep learning machine learning the real world problem thing doesn’t matter how do i model right but i love i love
1:00:48
hearing the real world applications it makes me wanna uh explore this project a little more um
1:00:53
so that’s all we have time for today uh thank you so much for watching and listening
1:00:59
uh you can follow us on twitter uh at quansightai um and thomas where can people find you
1:01:04
yeah so i’m on twitter um twiecki i’m on
1:01:12
linkedin connect with me there and but yeah twitter is probably the best place to to get in touch uh but yeah thanks
1:01:19
everyone for tuning in thanks for having me on the show it was a blast here well awesome so our open source
1:01:26
army out there uh sign up for open teams today to build your online profile and
1:01:31
uh start keeping track of your contributions and get one of those good remote jobs going well uh take care everybody have a great
1:01:38
day have a great weekend um and thank you all thank you bye
