© 2009-2012 Jennifer Harlow, Dominic Lee and Raazesh Sainudiin.
Creative Commons Attribution-Noncommercial-Share Alike 3.0
The Markov Chains we discussed last week fit into our Big Picture, which is about inference and estimation, especially inference and estimation problems where computational techniques are helpful.
                                  Point estimation                    Set estimation
Parametric                        MLE of finitely many parameters     Confidence intervals
Non-parametric                    coming up ...                       coming up ...
One/Many-dimensional Integrals    coming up ...                       coming up ...
But before we move on, we have to discuss what makes it all work: the idea of limits. Where do you get to if you just keep going?
Last week we described a Markov Chain, informally, as a system which "jumps" among several states, with the next state depending (probabilistically) only on the current state. Since the system changes randomly, it is generally impossible to predict the exact state of the system in the future. However, the statistical and probabilistic properties of the system's future can be predicted. In many applications it is these statistical properties that are important. We saw how we could find a steady state vector:
$$\mathbf{s} = \lim_{n \to \infty} \mathbf{p}^{(n)}$$
(And we noted that $\mathbf{p}^{(n)}$ only converges to a strictly positive vector if $\mathbf{P}$ is a regular transition matrix.)
The week before, we talked about the likelihood function and maximum likelihood estimators for making point estimates of model parameters. For example, for the $Bernoulli(\theta^*)$ RV (a $Bernoulli$ RV with true but possibly unknown parameter $\theta^*$), we found that the likelihood function was $L_n(\theta) = \theta^{t_n}(1-\theta)^{(n-t_n)}$ where $t_n = \displaystyle\sum_{i=1}^n x_i$. We also found the maximum likelihood estimator (MLE) for the $Bernoulli$ model, $\widehat{\theta}_n = \frac{1}{n}\displaystyle\sum_{i=1}^n x_i$.
We demonstrated these ideas using samples simulated from a $Bernoulli$ process with a secret $\theta^*$. We had an interactive plot of the likelihood function where we could increase $n$ (the number of simulated samples, i.e. the amount of data we had to base our estimate on) and see the effect on the shape of the likelihood function. The animation below shows the changing likelihood function for the Bernoulli process with unknown $\theta^*$ as $n$ (the amount of data) increases.
For large $n$, you can probably make your own guess about the true value of $\theta^*$ even without knowing $t_n$. As the animation progresses, we can see the likelihood function 'homing in' on $\theta = 0.3$.
We can see this in another way, by just looking at the sample mean as $n$ increases. An easy way to do this is with running means: generate a very large sample and then calculate the mean first over just the first observation in the sample, then the first two, the first three, and so on (running means were discussed in an earlier worksheet if you want to go back and review them in detail in your own time). Here we just define a function so that we can easily generate sequences of running means for our $Bernoulli$ process with the unknown $\theta^*$.

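A minimal plain-Python sketch of such a running-means helper (the original worksheet cell used Sage; the function name here is our own) might look like this:

```python
import random

def bernoulli_running_means(n, theta, seed=None):
    """Simulate n Bernoulli(theta) observations and return the running
    means: the mean of the first 1, 2, ..., n observations."""
    rng = random.Random(seed)
    means, total = [], 0
    for i in range(1, n + 1):
        total += 1 if rng.random() < theta else 0
        means.append(total / i)
    return means

means = bernoulli_running_means(1000, 0.3, seed=1)
print(means[:5], means[-1])  # the final running mean should be near 0.3
```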
Now we can use this function to look at say 5 different sequences of running means (they will be different, because for each iteration, we will simulate a different sample of $Bernoulli$ observations).

What we notice is how the different lines converge on a sample mean close to 0.3.
Is life always this easy? Unfortunately no. In the plot below we show the well-behaved running means for the $Bernoulli$ and beside them the running means for simulated standard $Cauchy$ random variables. They are all over the place, and each time you re-evaluate the cell you'll get different all-over-the-place behaviour.


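For contrast, here is a hedged plain-Python sketch of standard $Cauchy$ running means, simulated by the inverse-CDF method (the worksheet's own cell used Sage's random number facilities; the function name is our own):

```python
import math, random

rng = random.Random(1)

def cauchy_running_means(n):
    """Running means of n standard Cauchy draws, simulated by the
    inverse-CDF method: X = tan(pi*(U - 1/2)) for U ~ Uniform(0,1)."""
    means, total = [], 0.0
    for i in range(1, n + 1):
        total += math.tan(math.pi * (rng.random() - 0.5))
        means.append(total / i)
    return means

# unlike the Bernoulli case, these final values do not settle down
print(cauchy_running_means(1000)[-1])
```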
We talked about the Cauchy in more detail in an earlier worksheet. If you cannot recall the detail and are interested, go back to that in your own time. The message here is that although with the Bernoulli process, the sample means converge as the number of observations increases, with the Cauchy they do not.
We talked about $\mathbf{p}^{(n)}$, the state probability vector of our Markov Chain after $n$ steps, converging. We talked about sample means converging (or not). What do we actually mean by 'converge'? These ideas of convergence and limits are fundamental to using Monte Carlo techniques: we need to be able to justify that the way we are attacking a problem will give us the 'right' answer. (At its very simplest, how do we justify that, by generating lots of simulations, we can get to some good approximation for a probability or an integral or a sum?) The advantages of an MLE as a point estimate in parametric estimation all come back to limits and convergence (remember how the likelihood function 'homed in'). And, as we will see when we do non-parametric estimation, limits and convergence are also fundamental there.
A sequence of real numbers $x_1, x_2, x_3, \ldots $ (which we can also write as $\{ x_i\}_{i=1}^\infty$) is said to converge to a limit $a \in \mathbb{R}$,
\[\underset{i \rightarrow \infty}{\lim} x_i = a\]
if for every natural number $m \in \mathbb{N}$, there exists a natural number $N_m \in \mathbb{N}$ such that for every $j \geq N_m$, $\left|x_j - a\right| \leq \frac{1}{m}$.
What is this saying? $\left|x_j - a\right|$ is measuring the closeness of the $j$th value in the sequence to $a$. If we pick bigger and bigger $m$, $\frac{1}{m}$ will get smaller and smaller. The definition of the limit is saying that if $a$ is the limit of the sequence then we can get the sequence to become as close as we want ('arbitrarily close') to $a$, and to stay that close, by going far enough into the sequence ('for every $j \geq N_m$, $\left|x_j - a\right| \leq \frac{1}{m}$').
($\mathbb{N}$, the natural numbers, are just the 'counting numbers' $\{1, 2, 3, \ldots\}$.)
Take a trivial example, the sequence $\{x_i\}_{i=1}^\infty = 17, 17, 17, \ldots$
Clearly, $\underset{i \rightarrow \infty}{\lim} x_i = 17$, but let's do this formally:
For every $m \in \mathbb{N}$, take $N_m =1$, then
$\forall\, j \geq N_m = 1, \ \left|x_j - 17\right| = \left|17 - 17\right| = 0 \leq \frac{1}{m}$, as required.
($\forall$ is mathspeak for 'for all' or 'for every')
What about $\{x_i\}_{i=1}^\infty = \displaystyle\frac{1}{1}, \frac{1}{2}, \frac{1}{3}, \ldots$, i.e., $x_i = \frac{1}{i}$?
$\underset{i \rightarrow \infty}{\lim} x_i = \underset{i \rightarrow \infty}{\lim}\frac{1}{i} = 0$
For every $m \in \mathbb{N}$, take $N_m = m$, then $\forall\, j \geq m$, $\left|x_j - 0\right| \leq \left|\frac{1}{m} - 0\right| = \frac{1}{m}$.
You try
Think about $\{x_i\}_{i=1}^\infty = \frac{1}{1^p}, \frac{1}{2^p}, \frac{1}{3^p}, \ldots$ with $p > 0$. The limit $\underset{i \rightarrow \infty}{\lim} \displaystyle\frac{1}{i^p} = 0$, provided $p > 0$.
You can draw the plot of this very easily using the Sage symbolic expressions we have already met ('f.subs(...)' allows us to substitute a particular value for one of the symbolic variables in the symbolic function f, in this case a value to use for $p$).

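If plotting is not to hand, a quick numerical check in plain Python also shows the terms shrinking towards 0 for any $p > 0$:

```python
# terms of 1/i^p for increasing i, at a few values of p > 0
for p in (0.5, 1, 2):
    print(p, [1 / i**p for i in (10, 100, 1000, 10000)])
```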
What about $\{x_i\}_{i=1}^\infty = 1^{\frac{1}{1}}, 2^{\frac{1}{2}}, 3^{\frac{1}{3}}, \ldots$? The limit $\underset{i \rightarrow \infty}{\lim} i^{\frac{1}{i}} = 1$.
This one is not as easy to see intuitively, but again we can plot it with Sage.

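A quick numerical check in plain Python of $i^{\frac{1}{i}}$ shows it rising to a maximum near $i = 3$ and then drifting down towards 1:

```python
# i^(1/i) for increasing i: peaks near i = 3, then decreases towards 1
for i in (1, 2, 3, 10, 100, 10000):
    print(i, i ** (1 / i))
```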
Finally, $\{x_i\}_{i=1}^\infty = p^{\frac{1}{1}}, p^{\frac{1}{2}}, p^{\frac{1}{3}}, \ldots$, with $p > 0$. The limit $\underset{i \rightarrow \infty}{\lim} p^{\frac{1}{i}} = 1$, provided $p > 0$.
You can cut and paste (with suitable adaptations) to try to plot this one as well ...

back to the real stuff ...
We say that a function $f(x): \mathbb{R} \rightarrow \mathbb{R}$ has a limit $L \in \mathbb{R}$ as $x$ approaches $a$:
\[\underset{x \rightarrow a}{\lim} f(x) = L\]
provided $f(x)$ is arbitrarily close to $L$ for all ($\forall$) values of $x$ that are sufficiently close to but not equal to $a$.
Consider the function $f(x) = (1+x)^{\frac{1}{x}}$
$\underset{x \rightarrow 0}{\lim} f(x) = \underset{x \rightarrow 0}{\lim} (1+x)^{\frac{1}{x}} = e \approx 2.71828\cdots$
even though $f(0) = (1+0)^{\frac{1}{0}}$ is undefined!

You can get some idea of what is going on with two plots on different scales

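Numerically (a plain-Python sketch rather than the worksheet's Sage plots), the approach to $e$ is easy to see:

```python
import math

# (1+x)^(1/x) for x approaching 0 from above; compare with e
for x in (1.0, 0.1, 0.01, 0.001, 1e-6):
    print(x, (1 + x) ** (1 / x))
print("e =", math.e)
```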
all this has been laying the groundwork for the topic of real interest to us ...
We want to be able to say things like $\underset{i \rightarrow \infty}{\lim} X_i = X$ in some sensible way. $X_i$ are some random variables, $X$ is some 'limiting random variable', but what do we mean by 'limiting random variable'?
To help us, let's introduce a very simple random variable, one that puts all its mass in one place.

This is known as the $Point\,Mass(\theta)$ random variable, $\theta \in \mathbb{R}$: the density $f(x)$ is 1 if $x = \theta$ and 0 everywhere else
\[
f(x;\theta) =
\begin{cases}
0 & \text{ if } x \neq \theta \\
1 & \text{ if } x = \theta
\end{cases}
\]
\[
F(x;\theta) =
\begin{cases}
0 & \text{ if } x < \theta \\
1 & \text{ if } x \geq \theta
\end{cases}\]
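The distribution function above is easy to code directly; this small plain-Python sketch (the helper name is our own) evaluates $F(x; \theta)$ for $\theta = 0$:

```python
def point_mass_cdf(x, theta):
    """Distribution function F(x; theta) of the Point Mass(theta) RV:
    0 below theta, 1 at and above theta."""
    return 1.0 if x >= theta else 0.0

print([point_mass_cdf(x, 0) for x in (-1, -0.001, 0, 0.001, 1)])
# -> [0.0, 0.0, 1.0, 1.0, 1.0]
```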
So, if we had some sequence $\{\theta_i\}_{i=1}^\infty$ and $\underset{i \rightarrow \infty}{\lim} \theta_i = \theta$
and we had a sequence of random variables $X_i \sim Point\,Mass(\theta_i)$, $i = 1, 2, 3, \ldots$
then we could talk about a limiting random variable as $X \sim Point\,Mass(\theta)$:
i.e., we could talk about $\underset{i \rightarrow \infty}{\lim} X_i = X$

Now, we want to generalise this notion of a limit to other random variables (that are not necessarily $Point\,Mass(\theta_i)$ RVs)
What about one many of you will be familiar with, the 'bell-shaped curve': the $Gaussian(\mu, \sigma^2)$ or $Normal(\mu, \sigma^2)$ RV?
The probability density function (PDF) $f(x)$ is given by
\[f(x ;\mu, \sigma) = \displaystyle\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right)\]
The two parameters, $\mu$ and $\sigma$, are sometimes referred to as the location and scale parameters.
To see why this is, use the interactive plot below to have a look at what happens to the shape of the density function $f(x)$ when you change $\mu$ or increase or decrease $\sigma$:

Consider the sequence of random variables $X_1, X_2, X_3, \ldots$, where
$X_1 \sim Normal(0, 1)$
$X_2 \sim Normal(0, \frac{1}{2})$
$X_3 \sim Normal(0, \frac{1}{3})$
$X_4 \sim Normal(0, \frac{1}{4})$
$\vdots$
$X_i \sim Normal(0, \frac{1}{i})$
$\vdots$
We can use the animation below to see how the PDF $f_{i}(x)$ looks as we move through the sequence of $X_i$ (the animation only goes to $i = 25$, $\sigma^2 = 0.04$, but you get the picture ...)
We can see that the probability mass of $X_i \sim Normal(0, \frac{1}{i})$ increasingly concentrates about 0 as $i \rightarrow \infty$ and $\frac{1}{i} \rightarrow 0$.
Does this mean that $\underset{i \rightarrow \infty}{\lim} X_i = X$ for $X \sim Point\,Mass(0)$?
No, because for any $i$, however large, $P(X_i = 0) = 0$ because $X_i$ is a continuous RV (for any continuous RV $X$, for any $x \in \mathbb{R}$, $P(X=x) = 0$).
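We can quantify this concentration without simulation, using the Normal CDF (written here in plain Python via the error function; `normal_cdf` is our own helper name):

```python
import math

def normal_cdf(t, var):
    """CDF of Normal(0, var), via the error function."""
    return 0.5 * (1 + math.erf(t / math.sqrt(2 * var)))

# P(-0.1 <= X_i <= 0.1) for X_i ~ Normal(0, 1/i): approaches 1 as i grows
for i in (1, 10, 100, 1000):
    print(i, normal_cdf(0.1, 1 / i) - normal_cdf(-0.1, 1 / i))
```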
So, we need to refine our notions of convergence when we are dealing with random variables
Let $X_1, X_2, \ldots$ be a sequence of random variables and let $X$ be another random variable. Let $F_i$ denote the distribution function (DF) of $X_i$ and let $F$ denote the distribution function of $X$.
Now, if for any real number $t$ at which $F$ is continuous,
$$\underset{i \rightarrow \infty}{\lim} F_i(t) = F(t)$$
(in the sense of the convergence or limits of functions we talked about earlier)
Then we can say that the sequence of RVs $X_i$, $i = 1, 2, \ldots$ converges to $X$ in distribution and write $X_i \overset{d}{\rightarrow} X$.
An equivalent way of defining convergence in distribution is to go right back to the meaning of the probability space 'under the hood' of a random variable, a random variable $X$ as a mapping from the sample space $\Omega$ to the real line ($X: \Omega \rightarrow \mathbb{R}$), and the sample points or outcomes in the sample space, the $\omega \in \Omega$. For $\omega \in \Omega$, $X(\omega)$ is the mapping of $\omega$ to the real line $\mathbb{R}$. We could look at the set of $\omega$ such that $X(\omega) \leq t$, i.e. the set of $\omega$ that map to some value on the real line less than or equal to $t$, $\{\omega: X(\omega) \leq t \}$.
Saying that for any real number $t$ at which $F$ is continuous, $\underset{i \rightarrow \infty}{\lim} F_i(t) = F(t)$ is the equivalent of saying that for any such $t$,
$$\underset{i \rightarrow \infty}{\lim} P\left(\{\omega:X_i(\omega) \leq t \}\right) = P\left(\{\omega: X(\omega) \leq t\}\right)$$
Armed with this, we can go back to our sequence of $Normal$ random variables $X_1, X_2, X_3, \ldots$, where
$X_1 \sim Normal(0, 1)$
$X_2 \sim Normal(0, \frac{1}{2})$
$X_3 \sim Normal(0, \frac{1}{3})$
$X_4 \sim Normal(0, \frac{1}{4})$
$\vdots$
$X_i \sim Normal(0, \frac{1}{i})$
$\vdots$
and let $X \sim Point\,Mass(0)$,
and say that the $X_i$ converge in distribution to the $Point\,Mass(0)$ RV $X$,
$$X_i \overset{d}{\rightarrow} X$$
What we are saying with convergence in distribution, informally, is that as $i$ increases, we increasingly expect to see the next outcome in a sequence of random experiments becoming better and better modeled by the limiting random variable. In this case, as $i$ increases, the $Point\,Mass(0)$ is becoming a better and better model for the next outcome of a random experiment with outcomes $\sim Normal(0,\frac{1}{i})$.

There is an interesting point to note about this convergence:
We have said that the $X_i \sim Normal(0,\frac{1}{i})$ with distribution functions $F_i$ converge in distribution to $X \sim Point\,Mass(0)$ with distribution function $F$, which means that we must be able to show that for any real number $t$ at which $F$ is continuous,
$$\underset{i \rightarrow \infty}{\lim} F_i(t) = F(t)$$
Note that for any of the $X_i \sim Normal(0, \frac{1}{i})$, $F_i(0) = \frac{1}{2}$, and also note that for $X \sim Point\,Mass(0)$, $F(0) = 1$, so clearly $F_i(0) \neq F(0)$. What has gone wrong? Nothing: we said that we had to be able to show that $\underset{i \rightarrow \infty}{\lim} F_i(t) = F(t)$ for any $t \in \mathbb{R}$ at which $F$ is continuous, but the $Point\,Mass(0)$ distribution function $F$ is not continuous at 0!
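A quick numerical check (plain Python; `normal_cdf` is our own helper) confirms both parts of this: $F_i(t)$ converges to $F(t)$ at continuity points such as $t = \pm 0.1$, while $F_i(0)$ stays stuck at $\frac{1}{2}$:

```python
import math

def normal_cdf(t, var):
    """CDF of Normal(0, var), via the error function."""
    return 0.5 * (1 + math.erf(t / math.sqrt(2 * var)))

# F_i(t) for X_i ~ Normal(0, 1/i): heads to 0 at t = -0.1 and to 1 at t = 0.1
for i in (1, 10, 100, 10000):
    print(i, normal_cdf(-0.1, 1 / i), normal_cdf(0.1, 1 / i))

# but at the discontinuity point t = 0 of the Point Mass(0) DF:
print(normal_cdf(0, 1 / 10000))  # 0.5 for every i, while F(0) = 1
```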



Let $X_1, X_2, \ldots$ be a sequence of random variables and let $X$ be another random variable. Let $F_i$ denote the distribution function (DF) of $X_i$ and let $F$ denote the distribution function of $X$.
Now, if for any real number $\varepsilon > 0$,
$$\underset{i \rightarrow \infty}{\lim} P\left(\left|X_i - X\right| > \varepsilon\right) = 0$$
Then we can say that the sequence $X_i$, $i = 1, 2, \ldots$ converges to $X$ in probability and write $X_i \overset{P}{\rightarrow} X$.
Or, going back again to the probability space 'under the hood' of a random variable, we could look at the way $X_i$ maps each outcome $\omega \in \Omega$ to $X_i(\omega)$, which is some point on the real line, and compare this to the mapping $X(\omega)$.
Saying that for any $\varepsilon > 0$, $\underset{i \rightarrow \infty}{\lim} P\left(\left|X_i - X\right| > \varepsilon\right) = 0$ is the equivalent of saying that for any $\varepsilon > 0$,
$$\underset{i \rightarrow \infty}{\lim} P\left(\{\omega: \left|X_i(\omega) - X(\omega)\right| > \varepsilon \}\right) = 0$$
Informally, we are saying $X$ is a limit in probability if, by going far enough into the sequence $X_i$, we can ensure that the mappings $X_i(\omega)$ and $X(\omega)$ will be arbitrarily close to each other on the real line for all $\omega \in \Omega$ except those in a set of vanishingly small probability.
Note that convergence in distribution is implied by convergence in probability: convergence in distribution is the weakest form of convergence; any sequence of RVs that converges in probability to some RV $X$ also converges in distribution to $X$ (but not necessarily vice versa).


For our sequence of $Normal$ random variables $X_1, X_2, X_3, \ldots$, where
$X_1 \sim Normal(0, 1)$
$X_2 \sim Normal(0, \frac{1}{2})$
$X_3 \sim Normal(0, \frac{1}{3})$
$X_4 \sim Normal(0, \frac{1}{4})$
$\vdots$
$X_i \sim Normal(0, \frac{1}{i})$
$\vdots$
and $X \sim Point\,Mass(0)$,
It can be shown that the $X_i$ converge in probability to the $Point\,Mass(0)$ RV $X$,
$$X_i \overset{P}{\rightarrow} X$$
(the formal proof of this involves Markov's Inequality, which is beyond the scope of this course).
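Even without Markov's Inequality, the probability in the definition can be computed directly here, since $|X_i - X| = |X_i|$ when $X$ puts all its mass at 0 (again a plain-Python sketch with our own `normal_cdf` helper):

```python
import math

def normal_cdf(t, var):
    """CDF of Normal(0, var), via the error function."""
    return 0.5 * (1 + math.erf(t / math.sqrt(2 * var)))

# P(|X_i - X| > eps) = 2*(1 - F_i(eps)) for X_i ~ Normal(0, 1/i): shrinks to 0
eps = 0.05
for i in (1, 10, 100, 1000, 10000):
    print(i, 2 * (1 - normal_cdf(eps, 1 / i)))
```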
Take a look at the Khan academy videos on the Law of Large Numbers and the Central Limit Theorem. This will give you a working idea of these theorems. In the sequel, we will strive for a deeper understanding of these theorems on the basis of the two notions of convergence of sequences of random variables we just saw.
Remember that a statistic is a random variable, so a sample mean is a random variable. If we are given a sequence of independent and identically distributed RVs, $X_1,X_2,\ldots \overset{IID}{\sim} X_1$, then we can also think of a sequence of random variables $\overline{X}_1, \overline{X}_2, \ldots, \overline{X}_n, \ldots$ ($n$ being the sample size).
Since $X_1, X_2, \ldots$ are $IID$, they all have the same expectation, say $E(X_1)$ by convention.
If $E(X_1)$ exists, then the sample mean $\overline{X}_n$ converges in probability to $E(X_1)$ (i.e., to the expectation of any one of the individual RVs):
\[ \text{If} \quad X_1,X_2,\ldots \overset{IID}{\sim} X_1 \ \text{and if } \ E(X_1) \ \text{exists, then } \ \overline{X}_n \overset{P}{\rightarrow} E(X_1) \ . \]
Going back to our definition of convergence in probability, we see that this means that for any real number $\varepsilon > 0$, $\underset{n \rightarrow \infty}{\lim} P\left(\left|\overline{X}_n - E(X_1)\right| > \varepsilon\right) = 0$
Informally, this means that, by taking larger and larger samples, we can make the probability that the average of the observations is more than $\varepsilon$ away from the expected value as small as we like.
Proof of this is beyond the scope of this course, but we have already seen it in action when we looked at the $Bernoulli$ running means. Have another look, this time with only one sequence of running means. You can increase $n$, the sample size, and change $\theta$. Note that the seed for the random number generator is also under your control. This means that you can get replicable samples: in particular, in this interact, when you increase the sample size it looks as though you are just adding more to an existing sample rather than starting from scratch with a new one.
You have probably all heard of the Central Limit Theorem before, but now we can relate it to our definition of convergence in distribution.
Let $X_1,X_2,\ldots \overset{IID}{\sim} X_1$ and suppose $E(X_1)$ and $V(X_1)$ both exist,
then
\[\overline{X}_n = \frac{1}{n} \sum_{i=1}^n X_i \overset{d}{\rightarrow} X \sim Normal \left(E(X_1),\frac{V(X_1)}{n} \right) \]
And remember $Z \sim Normal(0,1)$?
Consider $Z_n := \displaystyle\frac{\overline{X}_n - E(\overline{X}_n)}{\sqrt{V(\overline{X}_n)}} = \displaystyle\frac{\sqrt{n} \left( \overline{X}_n - E(X_1) \right)}{\sqrt{V(X_1)}}$
If $\overline{X}_n = \displaystyle\frac{1}{n} \displaystyle\sum_{i=1}^n X_i \overset{d}{\rightarrow} X \sim Normal \left(E(X_1),\frac{V(X_1)}{n} \right)$, then $\overline{X}_n - E(X_1) \overset{d}{\rightarrow} X - E(X_1) \sim Normal \left( 0,\frac{V(X_1)}{n} \right)$
and $\sqrt{n} \left( \overline{X}_n - E(X_1) \right) \overset{d}{\rightarrow} \sqrt{n} \left( X - E(X_1) \right) \sim Normal \left( 0,V(X_1) \right)$
so $Z_n := \displaystyle \frac{\overline{X}_n - E(\overline{X}_n)}{\sqrt{V(\overline{X}_n)}} = \displaystyle\frac{\sqrt{n} \left( \overline{X}_n - E(X_1) \right)}{\sqrt{V(X_1)}} \overset{d}{\rightarrow} Z \sim Normal \left( 0,1 \right)$
Thus, for sufficiently large $n$ (say $n>30$), probability statements about $\overline{X}_n$ can be approximated using the $Normal$ distribution.
The beauty of the CLT, as you have probably seen from other courses, is that $\overline{X}_n \overset{d}{\rightarrow} Normal \left( E(X_1), \frac{V(X_1)}{n} \right)$ does not require the $X_i$ to be normally distributed.
We can try this with our $Bernoulli$ RV generator. First, a small number of samples:

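Here is a hedged plain-Python version of that experiment (the worksheet cell used Sage's Bernoulli generator; all names below are our own): simulate many Bernoulli samples, standardise each sample mean, and check that the standardised values have roughly zero mean and unit variance, as the CLT predicts:

```python
import math, random

rng = random.Random(0)
theta, n, reps = 0.3, 100, 2000
var = theta * (1 - theta)  # V(X_1) for a Bernoulli(theta) RV

z_values = []
for _ in range(reps):
    xbar = sum(1 if rng.random() < theta else 0 for _ in range(n)) / n
    # standardise: Z_n = sqrt(n) * (xbar - E(X_1)) / sqrt(V(X_1))
    z_values.append(math.sqrt(n) * (xbar - theta) / math.sqrt(var))

mean_z = sum(z_values) / reps
var_z = sum((z - mean_z) ** 2 for z in z_values) / reps
print(mean_z, var_z)  # should be near 0 and 1 respectively
```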
You can use the interactive plot to increase the number of samples and make a histogram of the sample means. According to the CLT, for lots of reasonably-sized samples we should get a nice symmetric bell-curve-ish histogram centred on $\theta$. You can adjust the number of bins in the histogram as well as the number of samples, sample size, and $\theta$.
But although the $X_i$ do not have to be $\sim Normal$ for $\overline{X}_n \overset{d}{\rightarrow} X \sim Normal\left(E(X_1),\frac{V(X_1)}{n} \right)$, remember that we said "Let $X_1,X_2,\ldots \overset{IID}{\sim} X_1$ and suppose $E(X_1)$ and $V(X_1)$ both exist"; then,
\[\overline{X}_n = \frac{1}{n} \sum_{i=1}^n X_i \overset{d}{\rightarrow} X \sim Normal \left(E(X_1),\frac{V(X_1)}{n} \right) \]
This is where it all goes horribly wrong for the standard $Cauchy$ distribution (any $Cauchy$ distribution in fact): neither the expectation nor the variance exist for this distribution. The Central Limit Theorem cannot be applied here. In fact, if $X_1,X_2,\ldots \overset{IID}{\sim}$ standard $Cauchy$, then $\overline{X}_n = \displaystyle \frac{1}{n} \sum_{i=1}^n X_i \sim$ standard $Cauchy$.
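We can see this empirically with a short plain-Python simulation (our own helper names; the standard Cauchy is simulated by the inverse-CDF method): the spread of the sample means does not shrink as $n$ grows, because $\overline{X}_n$ is itself standard Cauchy:

```python
import math, random

rng = random.Random(7)

def cauchy_sample_mean(n):
    """Mean of n simulated standard Cauchy draws."""
    return sum(math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)) / n

# estimate the interquartile range of the sample means for small and large n;
# it stays near 2, the IQR of the standard Cauchy, instead of shrinking
for n in (10, 1000):
    means = sorted(cauchy_sample_mean(n) for _ in range(2000))
    print(n, means[1500] - means[500])
```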
You try
Try looking at samples from two other RVs where the expectation and variance do exist, the $Uniform$ and the $Exponential$:
More You try
The Python random module, available in Sage, provides a useful way of taking samples if you have already generated a 'population' to sample from, or otherwise playing around with the elements in a sequence. See http://docs.python.org/library/random.html for more details. Here we will try a few of them.
The aptlynamed sample function allows us to take a sample of a specified size from a sequence. We will use a list as our sequence:

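For example (a minimal sketch; the list and seed are our own choices):

```python
import random

random.seed(2)
population = list(range(1, 21))  # a 'population' of 20 distinct values

print(random.sample(population, 5))
print(random.sample(population, 5))  # a fresh, independent sample
```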
Each call to sample selects elements from distinct positions in the list, i.e., it samples without replacement within a single call (note that if there are duplicate elements in the list, such as with a list [1,2,4,2,5,3,1,3], then you may still see repeated values in your sample). Different calls to sample are independent of each other, however, so repeated calls may give you samples containing the same elements.


Try experimenting with choice, which allows you to select one element at random from a sequence, and shuffle, which shuffles the sequence in place (i.e., the ordering of the sequence itself is changed rather than you being given a reordered copy of the list). It is probably easiest to use lists for your sequences. See how shuffle creates permutations of the list. You could use sample and shuffle to emulate permutations of $k$ objects out of $n$ ...
You may need to check the documentation to see how to use these functions.
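A short sketch of both functions (the values and seed are our own choices) might look like this:

```python
import random

random.seed(3)
items = [1, 2, 3, 4, 5]

print(random.choice(items))  # one element picked at random

random.shuffle(items)        # reorders the list in place
print(items)                 # same elements, permuted order

# a random permutation of k = 3 objects out of n = 5:
print(random.sample([1, 2, 3, 4, 5], 3))
```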







Strictly optional
These cells show how the animations were made. If you want to use them, you should uncomment the final two lines in each animation-making cell. The cells with the gif function command will also need uncommenting if you want to do that part of it. Note that there will be a delay of the order of 30-60 seconds while the plots are made, and a similar delay for the gif. The final two lines are commented out in the published worksheet so that you have to read this warning about delays before you evaluate the cells!
Making animations is definitely not part of this course, but if you are interested you can find out more on the Sage reference manual page on animations and in the Sage wiki animate section.






















