So, we'll start with a quick recap of probability
theory. The assignment is also designed to make
you go back and read about these things, which
you would have done at some point in your life.
I'll just quickly go over the first module, or
rather the zeroth module, which is a recap of
probability theory.
That's why I said embarrassingly basic. So, the
axioms of probability: for any event, we know
that the probability of the event should be greater
than or equal to zero, and if you have the universal
set, which contains all the possible outcomes,
then the probability of the universal set is
going to be one.
These are the basic axioms of probability.
Now, random variables. Here's the intuition
behind random variables, right? Suppose a student
can get one of three possible grades: A, B or C.
One way of looking at it is that, of all the
possible events, there are these three events:
the student gets a grade A, the student gets
a grade B, or the student gets a grade C. There
would be students in each of these events, and
you're trying to find the probability of each
event, right? The other way of looking at it
is that you have this set of students and you
have a random variable, which unfortunately
is not a variable; it's actually a function,
which maps each of these students from your
set to a particular value, right? So,
that's what a random variable is. The random
variable is actually a function which maps
your outcomes to values, right? So,
for each of these students we have a function
which connects them to one of these three possible
grades. That's another way of looking at it. So,
one way was to think of these grades themselves
as events; the other way is to think that you have
a set which has a lot of outcomes, and each
element of the set you can map to some
value, which is a grade. Okay. We will see
why this is the better way of doing it.
Irrespective of whether you take the first view
or the second view, the answers remain the same:
if I ask you what is the probability of the grade
being a certain value, A or B or C, the
answer is going to be the same. That doesn't
matter. But the reason we focus on random variables
rather than the first view is this:
You might be interested in several things about
a student, right? You might be interested in the
heights of different students: how many of
them are short, how many of them are tall and so
on. How many are adults, how many are young, and
so on. There are various things about a student
that you could ask about, and
each of these random variables actually operates
on the same set and maps its elements to different
values, right? So, this view is more modular, or
more reusable, in that sense, right? You have this
set of possible outcomes and you are trying to map
each of them to certain values, and these
values could be different: grades,
height, age, whatnot, right? So, you could have a
random variable for each of these quantities that
you are interested in, and then you could ask
questions, right? Give me all the outcomes for which
the grade is a certain value, the height is a certain
value and the age is a certain value, right?
So, the more formal definition: a random
variable is a function which maps each outcome
in your universal set to a value, right? In
the previous example, the function f_grade, which
is represented in shorthand as the random variable
capital G, is the function
which maps each student to one of the three
possible grades A, B and C, right? So, remember,
a random variable is a function; it's not a variable.
I don't know why it is called a variable, but it
is, okay? And then you could
have a random variable which maps each student to
an age, a random variable which maps each student
to a height, and so on, right? And the event "grade
is equal to A" is actually shorthand for the
following event: give me all those outcomes from my
universal set for which, when I apply the function
to the outcome, the answer is grade A, right?
So, when I say I want the probability of grade
equal to A, this is what I actually mean,
and if I ask for the set grade equal to A,
this is the set that I am looking at. Everyone
is fine with this, okay? So, all of you should
be comfortable with this definition of a random
variable. This is not my definition; it's just the
generic definition, okay?
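To make the "random variable is a function" view concrete, here is a minimal Python sketch. The student names, their records, and the assumption that every student is equally likely are all made up for illustration; none of them come from the lecture.

```python
# Hypothetical outcome set: five made-up students (not from the lecture).
students = ["asha", "bala", "chitra", "dev", "esha"]

# Made-up records; each random variable below is a function on the same set.
records = {
    "asha":   {"grade": "A", "height_cm": 160},
    "bala":   {"grade": "B", "height_cm": 175},
    "chitra": {"grade": "A", "height_cm": 168},
    "dev":    {"grade": "C", "height_cm": 181},
    "esha":   {"grade": "B", "height_cm": 155},
}

def G(student):
    """Random variable 'grade': maps an outcome (a student) to A, B or C."""
    return records[student]["grade"]

def H(student):
    """Random variable 'height': a second function on the same outcome set."""
    return records[student]["height_cm"]

def event(rv, value):
    """The event {rv = value}: all outcomes that the function maps to value."""
    return {s for s in students if rv(s) == value}

def prob(outcomes):
    """Probability of an event, ASSUMING every student is equally likely."""
    return len(outcomes) / len(students)

grade_is_a = event(G, "A")    # the set of students whose grade is A
p_grade_a = prob(grade_is_a)  # P(G = A) under the uniform assumption
```

Here `G` and `H` operate on the same set of outcomes, which is the modularity argument made above, and `event(G, "A")` is exactly the set that "grade equal to A" is shorthand for.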
Now, a random variable can be either continuous
or discrete, right? A discrete example is the
grade, where you have grades A,
B, C, D and so on, while height, weight and so on
are continuous random variables,
which can take on any real value; they are not
discrete. Okay? For this discussion, and for
the rest of the discussion in this remaining 30%
of the course, we'll be focusing only on discrete
random variables unless otherwise mentioned. I
don't think I'll ever look at continuous random
variables; we'll only focus on discrete random
variables, right? Okay. So, that's what
a random variable is. Now that we understand
random variables, we can talk about different
things related to random variables.
The first thing that we can talk about is the
marginal distribution. What do we mean by the
marginal distribution of a random variable?
If I ask you to give me a distribution for the
random variable grade, what will
you actually give me? What does a marginal
distribution in the discrete case actually
mean? The probability of each setting of the random
variable, right? So, if the grade can
take values A, B, C, then you need to give me the
table that you see on the slide, the only table
which is there. Okay?
We denote this marginal distribution compactly
as P of G. So, when I say P of G, I actually mean
this entire vector or this entire table, which is
P of G equal to A, P of G equal to B,
P of G equal to C, and so on. That's what
a marginal distribution is: specifying the
probability for every value that the random
variable can take. I know this is very elementary,
but it's very important for understanding how
many parameters you need to
learn in a particular joint distribution or
marginal distribution and so on, right?
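As a small sketch of what "give me P of G" means, the table can be written as a dictionary with one entry per value of the random variable. The probabilities below are invented for illustration and are not the numbers from the slide.

```python
from fractions import Fraction

# Hypothetical marginal distribution P(G); the numbers are made up.
P_G = {"A": Fraction(1, 2), "B": Fraction(3, 10), "C": Fraction(1, 5)}

# A valid table has non-negative entries that sum to one.
is_valid = all(p >= 0 for p in P_G.values()) and sum(P_G.values()) == 1

# A variable with k values needs k table entries. Because the entries must
# sum to one, only k - 1 of them are free; the last is one minus the rest.
num_entries = len(P_G)
num_free_parameters = num_entries - 1
```

The distinction between table entries and free parameters is an addition here, not something stated in the lecture; it follows directly from the sum-to-one constraint.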
Now, what's a joint distribution? Suppose in
addition to grade, which can take on values A, B,
C, you also have this random variable intelligence,
which unfortunately can take only two values in
our world: high or low. Okay? What is the
joint distribution of grade and intelligence?
It's specifying a probability
for every combination of grade and intelligence.
So, you have this cross product: there are three
possible values for grade and two possible values
for intelligence, and for each of these six
combinations you are going to specify a probability
value, right? So, this table that you see is the
joint distribution, right? Remember, we are
used to saying that the joint distribution is
P of G comma I, right? But that means that you
have P of G comma I for every value of G and
every value of I; that's what you need to specify.
Again, I am repeating this because when I ask
you to give me a joint distribution, or learn
a joint distribution from a given set
of training data, this table is what I expect.
I expect you to give me values for all possible
combinations of the input
random variables. That's why this is important.
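A joint table over grade and intelligence can be sketched as a dictionary keyed by every (grade, intelligence) pair. The six probabilities below are invented for illustration and are not the values from the slide.

```python
from fractions import Fraction
from itertools import product

grades = ["A", "B", "C"]
intelligence = ["high", "low"]

# Hypothetical joint distribution P(G, I); all six numbers are made up.
P_GI = {
    ("A", "high"): Fraction(3, 10),
    ("B", "high"): Fraction(3, 20),
    ("C", "high"): Fraction(1, 20),
    ("A", "low"):  Fraction(1, 10),
    ("B", "low"):  Fraction(1, 5),
    ("C", "low"):  Fraction(1, 5),
}

# The table must have an entry for every element of the cross product.
covers_all_combinations = set(P_GI) == set(product(grades, intelligence))

# 3 grades x 2 intelligence levels = 6 entries, and they sum to one.
num_entries = len(P_GI)
total = sum(P_GI.values())
```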
Okay? Now, what's a conditional distribution?
This is how we typically write it:
I want P of G given I. What does that mean?
How many values do I need to give you?
Again assume that G can take three values and
I can take two values, right? So, if I ask you
to give me this conditional distribution, how
many values do you need to give me? Six values,
the same as the joint distribution.
What will I have to give you?
I'll have to give you these two tables.
One table is, given that I is equal to high,
what are the different probabilities for G equal
to A, B and C, and the other table is, given I
equal to low, what are the probabilities for G
equal to A, B and C, right? Okay? And there's some
other simple stuff. This is how you
write the conditional distribution: it is the joint
distribution over the marginal distribution,
right? So, this equation actually connects
all the things that we have seen so far.
The joint distribution is the conditional
distribution into the marginal distribution.
Is that fine? Okay, fine. So, you should
be comfortable with this: if I tell you how many
values my random variables
can take, you should be able to tell me
how many parameters I need to specify the joint
distribution. That's what this basic material
is meant to stimulate you to do, okay?
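The relation "conditional equals joint over marginal" can be checked numerically. This is a minimal sketch on a made-up joint table for grade and intelligence; the numbers are assumptions for illustration, not the values from the slide.

```python
from fractions import Fraction

grades = ["A", "B", "C"]
levels = ["high", "low"]

# Hypothetical joint distribution P(G, I); all numbers are made up.
P_GI = {
    ("A", "high"): Fraction(3, 10),
    ("B", "high"): Fraction(3, 20),
    ("C", "high"): Fraction(1, 20),
    ("A", "low"):  Fraction(1, 10),
    ("B", "low"):  Fraction(1, 5),
    ("C", "low"):  Fraction(1, 5),
}

def marginal_I(i):
    """P(I = i): sum the joint over every grade."""
    return sum(P_GI[(g, i)] for g in grades)

def conditional_G_given_I(i):
    """The table P(G | I = i): the joint divided by the marginal of I."""
    p_i = marginal_I(i)
    return {g: P_GI[(g, i)] / p_i for g in grades}

# Two tables, one per value of I, with three entries each: six values in all.
tables = {i: conditional_G_given_I(i) for i in levels}

# Product rule: P(G, I) = P(G | I) * P(I) for every combination.
product_rule_holds = all(
    tables[i][g] * marginal_I(i) == P_GI[(g, i)]
    for g in grades for i in levels
)
```

Each of the two conditional tables is itself a distribution over G, so its three entries sum to one.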
Fine. And what's the joint distribution of n
random variables? How many
values do I need to give you? If each of these
random variables can take k values, how many
values will the joint distribution have? k to the
power n, right? You're used to
this because you have done a lot of logic,
right? There you assume Boolean variables, and for
all combinations you write down a truth
table and solve it, so it's very similar to that.
In other words, the joint distribution assigns a
probability to X 1 equal to x 1, X 2 equal to x 2,
and so on up to X n equal to x n, for all possible
values that each variable X i can take,
okay? And if each random variable can take
two values, you'll have two raised to n
entries in the joint distribution, okay?
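The truth-table analogy can be sketched directly: enumerating every assignment to n variables with k values each gives k to the power n rows. The function names below are hypothetical helpers introduced only for this illustration.

```python
from itertools import product

def joint_table_size(k, n):
    """Number of entries in a joint table over n variables with k values each."""
    return k ** n

def all_assignments(values, n):
    """Every assignment (x_1, ..., x_n), like the rows of a truth table."""
    return list(product(values, repeat=n))

# Three binary variables: 2^3 = 8 rows, from (0, 0, 0) to (1, 1, 1).
rows = all_assignments([0, 1], 3)
```

For example, `joint_table_size(2, 10)` is 1024 and `joint_table_size(2, 30)` is already over a billion, so the table grows very quickly with n.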
The other thing is, just as for two random
variables you could write the joint distribution
as a product of a conditional and a marginal, how
do you write the joint distribution of n random
variables? I am going to start using some
terminology: the joint distribution of two random
variables factorizes as a conditional distribution
and a marginal distribution. What about the joint
distribution of n random variables? What's the one
rule which has stayed with us so far and will
continue to? The chain rule, right? So, again
we'll have the chain rule here. You can
assume that all of the variables X 2 to X n are
clubbed together, so you have the probability of
X 2 to X n given X 1, into the probability of X 1;
that's the same as the two-variable form, right?
And then you just keep doing this recursively till
you get the following, right? The ith variable
depends on all the i minus 1 variables before it,
and you have a product of all of these, right?
Fine. This is known as the chain rule, and you can
clearly see that the two-variable case is just a
special case of this form,
right? So, just be very comfortable with the chain
rule. This is going to be very important when
we talk about various things in directed
graphical models, or undirected graphical models,
or whatnot, right? So, it's very essential that
you completely understand the chain rule, and maybe
I'll get back to it later, okay?
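Written out, the recursive argument for the chain rule looks as follows. This is a reconstruction from the spoken description, using the convention that for i = 1 the conditioning set is empty, so the first factor is simply P(X_1).

```latex
\begin{align*}
P(X_1, X_2) &= P(X_2 \mid X_1)\, P(X_1) \\
P(X_1, X_2, \dots, X_n) &= P(X_2, \dots, X_n \mid X_1)\, P(X_1) \\
&= P(X_3, \dots, X_n \mid X_1, X_2)\, P(X_2 \mid X_1)\, P(X_1) \\
&\;\;\vdots \\
&= \prod_{i=1}^{n} P(X_i \mid X_1, \dots, X_{i-1})
\end{align*}
```

Setting n = 2 in the last line recovers the first line, which is the sense in which the two-variable factorization is a special case.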
So, now from joint distributions to marginal
distributions. Suppose I'm given the joint
distribution over two random variables A and B,
okay? The first table that you see here, what
kind of a distribution is it? Joint, conditional,
marginal? It's a joint distribution. Now, from
here, I want to find the marginal distributions
for A and B. What does that actually mean? What am
I given, and what am I asking for? P of A and P of
B. So, how do I get the marginal distribution from
the joint distribution? Sum over what? Okay, fine.
First of all, if I have to give you the
marginal distribution of A, how many values do
I need to give you? Two values; I'm assuming
that all my random variables are binary.
So, from the joint distribution, how will
I get these two values? I'll sum up two rows:
I'll keep the value of A the same and sum over the
B values, and the same for the other value of A.
This is again straightforward; all of you know that.
But just be comfortable with this: you can
obtain the marginal distribution from the joint
distribution by summing over the variables which
are not of interest, right? So, when you want P
of A, you will sum over the B's; when you want P
of B, you will sum over the A's, okay?
This can be written more compactly, right?
For all possible values that B
can take, you are going to sum the joint,
but compactly this is how we write it,
right? We ignore the value assignments and
we just talk about P of A comma B, okay?
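Summing out the variable that is not of interest can be sketched in a few lines. The joint table over two binary variables below is made up for illustration and is not the table from the slide; `marginalize` is a hypothetical helper name.

```python
from fractions import Fraction

# Hypothetical joint distribution P(A, B) over two binary variables.
P_AB = {
    (0, 0): Fraction(3, 10),
    (0, 1): Fraction(1, 5),
    (1, 0): Fraction(1, 10),
    (1, 1): Fraction(2, 5),
}

def marginalize(joint, keep):
    """Sum out the other variable: keep=0 gives P(A), keep=1 gives P(B)."""
    marginal = {}
    for assignment, p in joint.items():
        value = assignment[keep]          # hold this variable's value fixed
        marginal[value] = marginal.get(value, 0) + p
    return marginal

P_A = marginalize(P_AB, keep=0)  # P(A = a) = sum over b of P(A = a, B = b)
P_B = marginalize(P_AB, keep=1)  # P(B = b) = sum over a of P(A = a, B = b)
```

Each marginal has two entries, one per value of the binary variable, and each is the sum of two entries of the joint table.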
Now, from here, if you are given n random
variables, how are you going to find the marginal
distribution from the joint distribution?
Sum over all the other variables, right?
Do you see a problem with this summation?
You do see a problem with this summation,
right? There's a problem with the basic joint
distribution itself. We'll come back to it;
for now we'll focus on these basics. If you just
vaguely appreciate the problem at this point,
that's fine; we'll come back to it in a few more
slides, okay? So, even if you are given n random
variables and a joint distribution, you can get the
marginal distribution for each of these n random
variables by summing over all the other variables
that you don't care about, okay? Fine. And again,
this is more compactly written as this.
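In symbols, marginalizing one variable out of a joint distribution over n variables can be written as below. This is a reconstruction from the spoken description; the first line spells out the value assignments and the second is the compact form that drops them.

```latex
\begin{align*}
P(X_i = x_i) &= \sum_{x_1} \cdots \sum_{x_{i-1}} \sum_{x_{i+1}} \cdots \sum_{x_n}
  P(X_1 = x_1, \dots, X_n = x_n) \\
P(X_i) &= \sum_{X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n} P(X_1, \dots, X_n)
\end{align*}
```

If every variable is binary, the sum for a single value of X_i runs over 2^{n-1} terms, which is presumably the kind of problem with the summation that is being hinted at here.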
What is independence? When do I say
that a variable X is independent of a variable
Y? In terms of probability, what's the equation
that you write? P of X given Y is equal to P of
X. Knowing the value of Y does not change your
belief about X; that's the English way of saying
it, right? We denote this as X independent
of Y; this is standard notation again.
We would expect the grade to be dependent on
intelligence but perhaps not dependent on weight
or height or something; there is probably not any
connection between them, okay? And recall that by
the chain rule for two variables, we have P of X
comma Y is equal to P of X into P of Y given X.
So, what will this simplify to? P of X into P of
Y. The combination
of the chain rule and the independence definition
gives you this form for the joint distribution of
two variables if the variables are independent.
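The factorized form, P(X, Y) = P(X) P(Y), gives a direct numerical test for independence. The sketch below uses two made-up joint tables, one built as a product of marginals and one that is not; the helper names are hypothetical.

```python
from fractions import Fraction

def marginals(joint):
    """Return (P(X), P(Y)) by summing the joint over the other variable."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    return px, py

def is_independent(joint):
    """True if P(X = x, Y = y) == P(X = x) * P(Y = y) for every entry."""
    px, py = marginals(joint)
    return all(p == px[x] * py[y] for (x, y), p in joint.items())

# Hypothetical marginals; their product is independent by construction.
P_X = {0: Fraction(1, 4), 1: Fraction(3, 4)}
P_Y = {0: Fraction(2, 5), 1: Fraction(3, 5)}
independent_joint = {(x, y): P_X[x] * P_Y[y] for x in P_X for y in P_Y}

# Hypothetical joint in which knowing one variable changes belief in the other.
dependent_joint = {
    (0, 0): Fraction(3, 10),
    (0, 1): Fraction(1, 5),
    (1, 0): Fraction(1, 10),
    (1, 1): Fraction(2, 5),
}
```

The check assumes the table lists every combination of values, and it uses exact fractions so that equality is not affected by floating-point rounding.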
Okay? Fine. So, that's all the basic stuff from
probability that we need. I would encourage you to
go back and just be comfortable with all of this.
And with this, we can now start
discussing Directed Graphical Models.