Introduction

Experimental design is a huge field that includes far more than I’ll cover here. However, this note will cover some of its simplest and most foundational forms. In general, experimental design is about evaluating hypotheses about how systems change when the conditions influencing them change.

In this note, we’ll discuss some key principles for thinking clearly about experiments, parametric and non-parametric hypothesis testing, analysis of variance, factorial designs, and fractional factorial designs, as well as key ideas related to these. This note does not currently cover how A/B testing is practiced in industry, multi-armed bandits, response surface methods, regression, sequential experimentation, quasi-experiments, causal inference, sampling, and more advanced topics. But I hope future notes I post might.

Much of this draws from Montgomery (2019).

Key Principles

Key principles in experimental design of all flavors are experimental runs, randomization, replication, blocking, and factorial designs.

Before diving into these, let’s set the stage slightly by stating that in general, we’re interested in how one or more components of a system (call these “treatments”) change some outputs of the system (call these “outcomes”). In order to nail down such relationships, the principles we mentioned above will be key. Let’s define each of these principles.

An experimental run is a single test of a set of treatment settings on the system under study. We might think of runs as observations of a set of treatment settings, where the observation could be identified by an individual or single unit under study. We generally assume that each run or observation is independent and that the distribution of random variables for each run is identically distributed (i.i.d.).1 Observations differ in the specific values of the random variables as each is a draw from the underlying distribution governing the system.

Treatment assignment (as well as the order of runs, where applicable) should be randomized. Randomization ensures that alternative explanations for observed effects (confounders or non-causal paths) affect treated and untreated units equally on average, and so are averaged out of effect estimates. That is, with randomization of treatment, other factors that might explain an effect are eliminated on average.

Replication refers to observing multiple units or conducting multiple runs with the same treatment settings (think sample size). This is key for quantifying statistical uncertainty about effect estimates. It also allows for more precise estimation of effects, as the noise affecting each observation or run becomes less influential.

A nuisance factor is a variable that affects the response but whose effect we are not interested in. When a nuisance variable is known and controllable, we can block on it to eliminate its effect. That is, we can group units into blocks within which units are similar to each other based on the value they take for the nuisance factor. Each block therefore creates a homogeneous set of observations on which to compare treatments.

Factorial designs vary more than one treatment factor at a time, with all possible combinations of treatment levels given to some units in the experiment. This allows researchers to explore the effects of each treatment as well as interactions between multiple treatments in an efficient way. Fractional factorial designs relax the requirement that all treatment level combinations are included.

Before turning to hypothesis testing, we first revisit some basic statistical ideas that underlie the parametric tools we will discuss.

Basic Statistics and Distributions

In general, statistics is about learning about a population (system) based on a sample of units (runs) from that population (system). We’ll typically be considering random samples here, where each unit is equally likely to be sampled.2 Key statistics (functions of the sample) we’ll be considering are the sample mean \(\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i\) and sample variance \(S^2 = \frac{1}{n-1}\sum_{i=1}^n (y_i - \bar{y})^2\). It is easy to show that these are unbiased estimators for the population mean and variance, respectively. Also, \(SS = \sum_{i=1}^n (y_i - \bar{y})^2\) is called the sum of squares and \(S^2 = \frac{SS}{n-1}\).
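As a quick sanity check, here is a minimal sketch (with arbitrary simulated data) that computes these quantities directly and confirms that \(S^2 = \frac{SS}{n-1}\) matches R’s built-in var().

# Simulate an arbitrary sample and compute the sample statistics by hand.
y = rnorm(30, mean = 5, sd = 2)
n = length(y)
ybar = sum(y) / n                  # sample mean
SS = sum((y - ybar)^2)             # sum of squares
S2 = SS / (n - 1)                  # sample variance
c(mean = ybar, SS = SS, S2 = S2, var_builtin = var(y))  # S2 should equal var(y)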

Central Limit Theorem, Normal Distribution, and Sample Mean

The probability distributions of statistics are called sampling distributions. Let’s consider the sampling distributions of the sample sum and mean as the sample size gets large.

Suppose \(y_1, \dots, y_n\) are i.i.d. random variables with mean \(\mu\) and variance \(\sigma^2\). Then, as \(n\to\infty\),

  • \(\frac{\left(\sum_{i=1}^n y_i\right) - n\mu}{\sqrt{n\sigma^2}} \overset{d}{\to} \text{Normal}(0,1)\)
  • \(\frac{\left(\frac{1}{n}\sum_{i=1}^n y_i\right)-\mu}{\sqrt{\frac{\sigma^2}{n}}} \overset{d}{\to} \text{Normal}(0,1)\)

and informally3

  • \(\sum_{i=1}^n y_i \overset{d}{\to} \text{Normal}(n\mu,n\sigma^2)\)
  • \(\frac{1}{n}\sum_{i=1}^n y_i \overset{d}{\to} \text{Normal}(\mu,\frac{\sigma^2}{n})\)

This tells us that the sample sum and mean will be approximately normally distributed for sufficiently large sample size. We often consider normality as a plausible distribution for noise terms in models, since such noise might be thought of as the sum of many independent sources of noise.

Let’s see a quick demonstration of this using the uniform distribution.

# Let's take 500 draws of a uniform[0,1] and repeat this 1000 times and 
# sum the 500 draws for each of the 1000 repetitions.
library(dplyr)
n = 500
m = 1000
sums = as.data.frame(matrix(runif(n*m,0,1), n, m)) %>%
    summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)))
sums = as.numeric(sums[1,])
mean_s = round(mean(sums),2)
var_s = round(var(sums),2)
hist(sums, breaks = 20, prob = TRUE,
     main = paste0("Histogram of 1000 Sums of 500 Uniform[0,1] Draws: \n Mean = ",mean_s,", Variance = ",var_s),
     ylab = "Density",
     xlab = "Sum Values")
curve(dnorm(x, mean=n*0.5, sd=sqrt(n*(1/12))), 
      col="darkblue", lwd=2, add=TRUE, yaxt="n")

In the above, the blue curve represents the normal distribution that is the limiting distribution for this example, where the mean is \(n\times\mu = 500\times 0.5 =\) 250 and the variance is \(n\times\sigma^2 = 500\times \frac{1}{12}\approx\) 41.67.

The probability density function for the normal distribution is \(f(y) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right)\) for \(-\infty <y,\mu<\infty\) and \(\sigma^2>0\). We see that normal random variables are uniquely characterized by their mean, \(\mu\), and variance, \(\sigma^2\). A normal distribution with mean 0 and variance 1 is the standard normal distribution.

Chi-Square Distribution and Sample Variance

Let’s consider another distribution, the chi-square.

If \(z_1, \dots, z_k\) are i.i.d. standard normal random variables, then \(\sum_{j=1}^k z_j^2 \sim \chi^2_{k}\). That is, a sum of squared standard normals is a chi-square. The chi-square distribution is a skewed distribution with mean \(\mu = k\) and variance \(\sigma^2 = 2k\), where \(k\) is referred to as the degrees of freedom.

We can again demonstrate this quickly. The blue curve in the below represents the chi-square distribution with 10 degrees of freedom.

# Plot 1000 sums of 10 squared standard normal draws.  
n = 10
m = 1000
chis = as.data.frame(matrix(rnorm(n*m,0,1), n, m)^2) %>%
    summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)))
chis = as.numeric(chis[1,])
hist(chis, breaks = 20, prob = TRUE,
     main = "Histogram of 1000 Sums of 10 Squared Normal[0,1] Draws",
     ylab = "Density",
     xlab = "Sum of Squared Values")
curve(dchisq(x, df=n), 
      col="darkblue", lwd=2, add=TRUE, yaxt="n")

Now, the sum of squares, \(SS\), sounds a bit like a chi-square. In fact, if \(y_1, \dots, y_n\) are i.i.d. normal random variables with mean \(\mu\) and variance \(\sigma^2\), then

\[\frac{SS}{\sigma^2} = \sum_{i=1}^n \left(\frac{y_i-\bar{y}}{\sigma}\right)^2 = \sum_{i=1}^n z_i^2 \sim \chi^2_{n-1} \implies SS \sim \sigma^2\chi^2_{n-1}.\] Note that there are \(n-1\) degrees of freedom since we use one degree of freedom for the sample mean. So this means that sample variance \(S^2 = \frac{SS}{n-1} \sim \frac{\sigma^2}{n-1} \chi^2_{n-1}\), again when \(y_1, \dots, y_n\) are iid normal random variables.

t Distribution

Let’s consider a third distribution, the t distribution.

If \(z\) is standard normal and \(\chi^2_{k}\) is chi-square and they are independent, then \(\frac{z}{\sqrt{\frac{\chi^2_{k}}{k}}} \sim t_{k}\). That is, the ratio of a standard normal to the square root of an independent chi-square divided by its degrees of freedom has a t distribution. When \(k=\infty\), the t distribution becomes the standard normal.

We can again demonstrate this quickly. The blue curve in the below represents the t distribution with 10 degrees of freedom. The red curve is a standard normal. The t distribution has fatter tails.

# Plot 10000 ratios of standard normals to square root of 
# chi-squares divided by degrees of freedom.  
m = 10000
df = 10
sn = rnorm(m,0,1)
chi = rchisq(m,df=df)
t = sn / sqrt(chi/df)
hist(t, breaks = 50, prob = TRUE,
     main = "Histogram of 10,000 Ratios of Standard Normals to 
     Square Root of Chi-square Divided by DOF",
     ylab = "Density",
     xlab = "Ratio Values")
curve(dt(x, df=df), 
      col="darkblue", lwd=2, add=TRUE, yaxt="n")
curve(dnorm(x, 0,1), 
      col="darkred", lwd=2, add=TRUE, yaxt="n")

If \(y_1, \dots, y_n\) are i.i.d. normal random variables with mean \(\mu\) and variance \(\sigma^2\), then from above we have (again, some of this is a bit informal)

  • \(\bar{y} \sim N(\mu,\frac{\sigma^2}{n}) \implies \frac{\bar{y}-\mu}{\sqrt{\frac{\sigma^2}{n}}} \sim N(0,1)\)
  • \(S^2 \sim \frac{\sigma^2}{(n-1)}\chi^2_{n-1} \implies \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}\)

So we see that \[\begin{aligned} t = \frac{ \left(\frac{\bar{y}-\mu}{\sqrt{\frac{\sigma^2}{n}}}\right) }{ \sqrt{ \frac{ \left( \frac{(n-1)S^2}{\sigma^2} \right) }{(n-1)} } } = \frac{ \left(\frac{\bar{y}-\mu}{\frac{\sigma}{\sqrt{n}}}\right) }{ \left(\frac{S}{\sigma} \right) } = \frac{\bar{y}-\mu}{S/\sqrt{n}} \sim t_{n-1} \end{aligned}\]

We will see that this statistic can be used for hypothesis tests.

F Distribution

Let’s consider a final distribution, the F distribution.

If \(\chi^2_{a}\) and \(\chi^2_{b}\) are independent chi-square random variables, then the ratio \(\frac{\chi^2_{a}/a}{\chi^2_{b}/b} \sim F_{a,b}\). That is, the ratio of two independent chi squares divided by their degrees of freedom is an F distributed random variable with \(a\) numerator degrees of freedom and \(b\) denominator degrees of freedom. The mean of the F distribution is \(\frac{b}{b-2}\) (for \(b>2\)) and the variance is \(\frac{2b^2(a+b-2)}{a(b-2)^2(b-4)}\) (for \(b>4\)).

We can again demonstrate this quickly. The blue curve in the below represents the F distribution with 10 and 100 degrees of freedom.

# Plot 100,000 ratios of chi-squares divided by their degrees of freedom.  
m = 100000
a = 10
b = 100
A = rchisq(m,df=a)
B = rchisq(m,df=b)
Fs = (A/a) / (B/b)
hist(Fs, breaks = 50, prob = TRUE,
     main = "Histogram of 100,000 Ratios of Chi-squares Divided by their DOFs",
     ylab = "Density",
     xlab = "Ratio Values")
curve(df(x, df1=a,df2=b), 
      col="darkblue", lwd=2, add=TRUE, yaxt="n")

If \(x_1, \dots, x_{n_1}\) are i.i.d. normal random variables with mean \(\mu_1\) and variance \(\sigma^2\) and \(y_1, \dots, y_{n_2}\) are i.i.d. normal random variables with mean \(\mu_2\) and variance \(\sigma^2\), then the ratio of sample variances has an F distribution.

\[\frac{S_1^2}{S_2^2} = \frac{\frac{\sigma^2}{(n_1-1)}\chi^2_{n_1-1}}{\frac{\sigma^2}{(n_2-1)}\chi^2_{n_2-1}} = \frac{\chi^2_{n_1-1}/(n_1-1)}{\chi^2_{n_2-1}/(n_2-1)} \sim F_{n_1-1,n_2-1}\]

Parametric Hypothesis Testing

With the above foundation, let’s dig into some parametric approaches to hypothesis testing, one of the simplest statistical tools for analyzing experiments.

One Sample Z-Test4

Suppose we have a sample of observations \(y_1,\dots,y_n\) and we want to test whether the mean of the sample is statistically different from some value, \(\mu_0\). We assume first that these are iid samples from some distribution. We will further assume that the underlying distribution is normal (for now). We first consider a simple model of the data. Here, our model will be

\[y_{i} = \mu +\epsilon_i\] for \(i = 1,\dots,n\), where \(\epsilon_i \overset{iid}\sim N(0,\sigma^2)\) is random noise. Our model therefore assumes that each observation is a draw from the underlying distribution \(N(\mu,\sigma^2)\). We call the hypothesis that the true mean of the underlying distribution is \(\mu_0\) the null hypothesis. The alternative is that the true mean is not equal to \(\mu_0\). Note that this is a two sided alternative hypothesis; one sided versions are also possible.

\[\begin{aligned} H_0: \mu &= \mu_0 \\ H_1: \mu &\ne \mu_0 \end{aligned}\]

If the variance \(\sigma^2\) is known, then we know that

\[Z_0 = \frac{\bar{y} - \mu_0}{\sigma/\sqrt{n}}\] follows a standard normal distribution, under the assumption that the null hypothesis is true. Note that if the underlying distribution is not normal, but the variance is known and the sample is large enough for the CLT to apply, then we can also use this. This is because the sampling distribution of the sample mean is \(N(\mu,\sigma^2/n)\) (again, informally).

Our decision rule will be to reject the null hypothesis when \(Z_0\) takes values that would be very extreme (and hence unlikely) if the null hypothesis were true.

Such a procedure obviously is not perfect. Two important types of error can occur.

  • “False Positives” or “Type I Errors” occur when we reject the null when it is true.
    • The “positive” refers here to finding a statistical difference.
    • Let \(\alpha = P(\text{Type I Error}) = P(\text{reject }H_0 | H_0\text{ is true})\).
    • \(\alpha\) is often called the significance level.
  • “False Negatives” or “Type II Errors” occur when we fail to reject the null when it is false.
    • Let \(\beta = P(\text{Type II Error}) = P(\text{fail to reject }H_0 | H_0\text{ is false})\).
    • \(1-\beta = P(\text{reject }H_0 | H_0\text{ is false})\) is often called the power of the test.

We can specify a particular significance level \(\alpha\) that defines a critical or rejection region based on the null distribution and then we reject the null when \(|Z_0| > Z_{\alpha/2}\), where \(Z_{\alpha/2}\) is the upper \(\alpha/2\) percentile of the standard normal.

In general, we often want to provide interval estimates that give some idea of the likely region that the true parameter falls into, given what we see in the data and our assumptions. We’d like to find an interval such that the statement \(P(L\le \mu \le U) = 1-\alpha\) is true. The interval \((L,U)\) is called a \(100(1-\alpha)\%\) confidence interval. The idea here is that we’ll calculate \(L,U\) from our sample. So we interpret such an interval as follows: with repeated sampling, the true mean will be included in \(100(1-\alpha)\%\) of the calculated intervals. So the sample under study might not lead to an interval that includes the true mean, but it should most of the time for small \(\alpha\)’s.

We can create a \(100(1-\alpha)\%\) confidence interval for the true population mean as \(\bar{y} \pm Z_{\alpha/2} \left(\frac{\sigma}{\sqrt{n}}\right)\). If \(\mu_0\) falls outside this interval, we should reject the null. This is equivalent to the previously described decision rule.

This is because \[\begin{aligned} 1-\alpha &= P(-Z_{\alpha/2}\le Z_0 \le Z_{\alpha/2}) \\ &= P(-Z_{\alpha/2}\le \frac{\bar{y} - \mu}{\sigma/\sqrt{n}} \le Z_{\alpha/2}) \\ &= P\left(-Z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right) \le \bar{y} - \mu \le Z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)\right) \\ &= P\left(\bar{y}-Z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right) \le \mu \le \bar{y}+Z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)\right) \end{aligned}\]

We can also report a p-value that captures the probability of observing a statistic equally or more extreme than the one observed, assuming the null is true. This allows us to avoid purely binary reject / don’t reject thinking. P-values can be calculated easily using statistical software and the null distribution. The p-value can be thought of as the smallest level of significance that would lead us to reject the null.
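Here is a minimal sketch of the one sample Z-test on simulated data, where the true mean, the null value \(\mu_0\), and the (assumed known) variance are all made up for illustration.

# One sample z-test with known variance (illustrative values).
n = 50
sigma = 2                       # assumed known standard deviation
mu0 = 10                        # hypothesized mean under the null
y = rnorm(n, mean = 10.5, sd = sigma)

alpha = 0.05
Z0 = (mean(y) - mu0) / (sigma / sqrt(n))
p_val = 2 * pnorm(abs(Z0), lower.tail = FALSE)     # two-sided p-value
CI = mean(y) + c(-1, 1) * qnorm(alpha/2, lower.tail = FALSE) * sigma / sqrt(n)
c(Z0 = Z0, p_value = p_val, CI_low = CI[1], CI_high = CI[2])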

One Sample t-Test5

If the variance is not known, then we must assume normality (for this type of test) and use \(S^2\) to estimate \(\sigma^2\). The hypotheses remain the same. We then use the test statistic

\[t_0 = \frac{\bar{y} - \mu_0}{S/\sqrt{n}},\]

which we know follows a t distribution with \(n-1\) degrees of freedom from above, since \(\frac{(n-1)S^2}{\sigma^2}\) is chi-square distributed and the ratio of a standard normal to the square root of an independent chi-square divided by its degrees of freedom is t distributed. Our decision rule would be to reject the null if \(|t_0|>t_{\alpha/2,n-1}\), where \(t_{\alpha/2,n-1}\) is the upper \(\alpha/2\) percentile of the t distribution with \(n-1\) degrees of freedom. Here, the \(100(1-\alpha)\%\) confidence interval for the true population mean is \(\bar{y} \pm t_{\alpha/2,n-1} \left(\frac{S}{\sqrt{n}}\right)\). Note that the denominator of the test statistic, \(S/\sqrt{n}\), is often called the standard error, which is the estimate of the standard deviation of \(\bar{y}\).
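A minimal sketch of the one sample t-test, computing \(t_0\) by hand on simulated data (the true mean and \(\mu_0\) are arbitrary) and checking the result against R’s built-in t.test().

# One sample t-test with unknown variance.
n = 25
mu0 = 0                                  # hypothesized mean under the null
y = rnorm(n, mean = 0.3, sd = 1)

t0 = (mean(y) - mu0) / (sd(y) / sqrt(n)) # test statistic
p_val = 2 * pt(abs(t0), df = n - 1, lower.tail = FALSE)
CI = mean(y) + c(-1, 1) * qt(0.975, df = n - 1) * sd(y) / sqrt(n)
c(t0 = t0, p_value = p_val, CI_low = CI[1], CI_high = CI[2])

t.test(y, mu = mu0)                      # should match the values above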

Two Sample Z-Test

Suppose we have two samples of observations \(y_{11},\dots,y_{1n_1}\) and \(y_{21},\dots,y_{2n_2}\). We want to test whether the means of the two samples are statistically different. We assume that these are iid samples from two underlying normal distributions. We write our assumptions in a simple model of the data.

\[y_{ij} = \mu_i +\epsilon_{ij}\] for \(i = 1,2\) and \(j = 1,\dots,n_i\), where \(\epsilon_{1j} \overset{iid}\sim N(0,\sigma_1^2)\) and \(\epsilon_{2j} \overset{iid}\sim N(0,\sigma_2^2)\) are random noise. That is, we allow the mean and variance of each sample to differ.

The hypotheses here are (again, there are one sided versions of this) \[\begin{aligned} H_0: \mu_1 &= \mu_2 \\ H_1: \mu_1 &\ne \mu_2 \end{aligned}\]

We could also think of these as \[\begin{aligned} H_0: \mu_1 - \mu_2 &= 0 \\ H_1: \mu_1 - \mu_2 &\ne 0 \end{aligned}\]

Again, we first consider when the variances are known. The test statistic here is

\[Z_0 = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{ \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} }}\]

The difference between normal random variables is also a normal random variable, where the mean is the difference in means and the variance is the sum of the variances. So if both populations are normal or if the sample sizes are large enough to allow for us to use the CLT, then \(\bar{y}_1 - \bar{y}_2 \sim N(\mu_1-\mu_2, \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2})\) and so \(Z_0 \sim N(0,1)\), assuming the null is true.

Our decision rule would be to reject the null if \(|Z_0|>Z_{\alpha/2}\), where \(Z_{\alpha/2}\) is the upper \(\alpha/2\) percentile of the standard normal distribution. Here, the \(100(1-\alpha)\%\) confidence interval for the true difference in means is \(\bar{y}_1 - \bar{y}_2 \pm Z_{\alpha/2} \left(\sqrt{ \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} }\right)\).

Two Sample t-Test

If the variances are not known, then we must assume normality (for this type of test) and use \(S_1^2,S_2^2\) to estimate \(\sigma_1^2,\sigma_2^2\). The hypotheses remain the same. We then use the test statistic

\[t_0 = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{ \frac{S_1^2}{n_1} + \frac{S_2^2}{n_2} }}\]

In this case, the distribution of \(t_0\) is not exactly a t distribution. But it is well approximated by a t distribution6 with degrees of freedom

\[\nu = \frac{\left( \frac{S_1^2}{n_1} + \frac{S_2^2}{n_2} \right)^2}{ \frac{(S_1^2/n_1)^2}{n_1-1} + \frac{(S_2^2/n_2)^2}{n_2-1} }.\]

Our decision rule would be to reject the null if \(|t_0|>t_{\alpha/2,\nu}\), where \(t_{\alpha/2,\nu}\) is the upper \(\alpha/2\) percentile of the t distribution with \(\nu\) degrees of freedom. Here, a \(100(1-\alpha)\%\) confidence interval for the true difference in means is \(\bar{y}_1 - \bar{y}_2 \pm t_{\alpha/2,\nu} \left(\sqrt{ \frac{S_1^2}{n_1} + \frac{S_2^2}{n_2} }\right)\).

Let’s look at an example where the data are normally distributed. We see that our results match those from the standard t-test output.

n1 = 100
n2 = 200
mu1 = 0
mu2 = 1
sigma1 = 1
sigma2 = 2
y1 = rnorm(n1,mu1,sqrt(sigma1))
y2 = rnorm(n2,mu2,sqrt(sigma2))

ybar1 = mean(y1)
ybar2 = mean(y2)
s2_1 = var(y1)
s2_2 = var(y2)

alpha = 0.05

t0 = (ybar1 - ybar2) / sqrt((s2_1/n1) + (s2_2/n2))
nu = ((s2_1/n1) + (s2_2/n2))^2 / 
  (((s2_1/n1)^2/(n1-1)) + ((s2_2/n2)^2/(n2-1)))
p_val = 2*pt(abs(t0), nu, lower.tail = FALSE) # two-sided p-value
t_crit = qt(alpha/2, nu, lower.tail = FALSE)
CI_L = (ybar1 - ybar2) - t_crit*sqrt((s2_1/n1) + (s2_2/n2))
CI_H = (ybar1 - ybar2) + t_crit*sqrt((s2_1/n1) + (s2_2/n2))

results = matrix(0,nrow = 1,ncol = 9)
results[1,] = c(ybar1,ybar2,t0,nu,p_val,alpha,t_crit,CI_L,CI_H)
colnames(results) = c("Mean 1","Mean 2","t-Stat", "DF", "p-Value", 
                      "Alpha","Critical Value", "CI Low", "CI High")
knitr::kable(round(results,3))
Mean 1 Mean 2 t-Stat DF p-Value Alpha Critical Value CI Low CI High
0.096 1.027 -6.561 268.217 0 0.05 1.969 -1.209 -0.651
t.test(y1,y2)
## 
##  Welch Two Sample t-test
## 
## data:  y1 and y2
## t = -6.5613, df = 268.22, p-value = 2.739e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.2093545 -0.6510918
## sample estimates:
##  mean of x  mean of y 
## 0.09636482 1.02658795

Two Sample t-Test, Equal Variances

If the variances are equal (i.e., \(\sigma_1^2 = \sigma_2^2\)), then we can use the test statistic

\[t_0 = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{ \frac{S_p^2}{n_1} + \frac{S_p^2}{n_2} }} = \frac{\bar{y}_1 - \bar{y}_2}{S_p \sqrt{ \frac{1}{n_1} + \frac{1}{n_2} }}\]

where \(S_p^2 = \frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2}\) is the common sample variance for the two samples. In this case, \(t_0\) is actually t-distributed, given our assumptions, that is, \(t_0 \sim t_{n_1+n_2-2}\). If the underlying distributions are independent normals then \(\bar{y}_1 - \bar{y}_2 \sim N\left(\mu_1-\mu_2, \sigma^2\left(\frac{\sigma^2}{n_1} + \frac{\sigma^2}{n_2}\right)\right)\). But we don’t know \(\sigma^2\) and so we estimate it with \(S_p^2\), which as above follows a chi-square and so \(t_0\) follows a t distribution.

Our decision rule would be to reject the null if \(|t_0|>t_{\alpha/2,n_1+n_2-2}\), where \(t_{\alpha/2,n_1+n_2-2}\) is the upper \(\alpha/2\) percentile of the t distribution with \(n_1+n_2-2\) degrees of freedom. Here, a \(100(1-\alpha)\%\) confidence interval for the true difference in means is \(\bar{y}_1 - \bar{y}_2 \pm t_{\alpha/2,n_1+n_2-2} \left(S_p \sqrt{ \frac{1}{n_1} + \frac{1}{n_2} }\right)\).
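A minimal sketch of the pooled test on simulated data with a common (made-up) variance, checked against t.test() with var.equal = TRUE.

# Pooled two sample t-test (equal variances).
n1 = 30
n2 = 40
y1 = rnorm(n1, mean = 0, sd = 1)
y2 = rnorm(n2, mean = 0.5, sd = 1)

Sp2 = ((n1 - 1)*var(y1) + (n2 - 1)*var(y2)) / (n1 + n2 - 2)  # pooled variance
t0 = (mean(y1) - mean(y2)) / sqrt(Sp2*(1/n1 + 1/n2))
p_val = 2 * pt(abs(t0), df = n1 + n2 - 2, lower.tail = FALSE)
c(t0 = t0, p_value = p_val)

t.test(y1, y2, var.equal = TRUE)                             # should match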

Additional Justification for t-Test

The t-tests above rely on the samples being drawn from normal distributions. However, moderate departures from this assumption do not dramatically alter the results. It can be shown that t-tests are good approximations to randomization tests (see below), which make no assumptions about the form of the distribution from which the samples are drawn. This point is drawn entirely from Montgomery (2019). But, really, we might just run a non-parametric randomization test, if we can.

Also, Wikipedia suggests that “For non-normal data, the distribution of the sample variance may deviate substantially from a \(\chi^2\) distribution. However, if the sample size is large, Slutsky’s theorem implies that the distribution of the sample variance has little effect on the distribution of the test statistic.” See https://en.wikipedia.org/wiki/Student%27s_t-test#Assumptions and https://en.wikipedia.org/wiki/Slutsky%27s_theorem.

Checking Assumptions of t-Test

Let’s briefly review the assumptions we make for the t-tests.

  • The \(y_i\)’s are iid.
  • \(\bar{y} \sim N(\mu,\frac{\sigma^2}{n}) \implies \frac{\bar{y} - \mu}{\sqrt{\frac{\sigma^2}{n}}} \sim N(0,1)\). That is, sample means are normally distributed. This will be satisfied when the \(y_i\)’s are normally distributed or when the sample size allows for us to invoke the central limit theorem.
  • \(S^2 \sim \frac{\sigma^2}{(n-1)}\chi^2_{n-1} \implies S^2(n-1)/\sigma^2 \sim \chi^2_{n-1}\). This will be satisfied when the \(y_i\)’s are normally distributed.
  • \(\bar{y}\) is independent of \(S^2\).

The assumptions about equal variances and about normality can be checked empirically. In the below plots, we show normal Q-Q plots for normal data and t distributed data. We see the normal data lies on the 45 degree line but the t data does not. In particular, we see that the tails of the t samples are fatter than a normal distribution’s tails.

We also do an F-test to check the equality of variances between the normal data and the t data. However, since one of these is not normal, the F-test isn’t actually valid. So we also run a Levene test, which is more tolerant to non-normality.7 We also calculate the sample variances and plot a box plot to sanity check.

# Normal Q-Q Plot for Normally Distributed Samples
x1 = rnorm(10000,0,0.1)
qqnorm(x1,main="Normal Q-Q Plot for Normally Distributed Samples",
       ylim=c(-6,6),xlim=c(-6,6)); qqline(x1)

# Normal QQ Plot for t (df=5) Distributed Samples
x2 = rt(10000,df=5)
qqnorm(x2,main="Normal Q-Q Plot for t (df=5) Distributed Samples",
       ylim=c(-6,6),xlim=c(-6,6)); qqline(x2)

#create data frame
data = data.frame(c(x1,x2),
                  as.factor(c(rep(1,length(x1)),rep(2,length(x2)))))
names(data) = c("x","id")
#calculate variance and create box plot
data %>%
  group_by(id) %>%
  summarize(var=var(x))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
##   id        var
##   <fct>   <dbl>
## 1 1     0.00966
## 2 2     1.61
boxplot(x ~ id,data = data)

#run F test and Levene test
var.test(x1, x2, alternative = "two.sided")
## 
##  F test to compare two variances
## 
## data:  x1 and x2
## F = 0.0060128, num df = 9999, denom df = 9999, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.005781590 0.006253152
## sample estimates:
## ratio of variances 
##         0.00601275
car::leveneTest(x ~ id,data = data)
## Levene's Test for Homogeneity of Variance (center = median)
##          Df F value    Pr(>F)    
## group     1  9962.1 < 2.2e-16 ***
##       19998                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Choosing Sample Size

In situations where we have the ability to choose our sample size, we’ll consider samples of equal size. For the two sample t test with equal variances, our confidence interval is of the form \(\bar{y}_1 - \bar{y}_2 \pm t_{\alpha/2,2n-2} \left(S_p\sqrt{ \frac{2}{n} }\right)\). The size of the interval depends on \(t_{\alpha/2,2n-2} \left(S_p\sqrt{ \frac{2}{n} }\right)\). We don’t have control over \(S_p\) but we do have control over the sample size, which determines \(t_{\alpha/2,2n-2} \sqrt{ \frac{2}{n} }\) for a given \(\alpha\).

We might choose the sample size based on how tight of a confidence interval we would get. We can see how this decision might go in the plot below.

alpha = 0.05
n = seq(2,100)
funk = function(m){
  return(qt(alpha/2, 2*m-2, lower.tail = FALSE)*sqrt(2/m))
}
plot(n,sapply(n,funk),
     main = "Sample Size Confidence Interval Scaling Factor",
     xlab = "Sample Size",ylab = "Scaling Factor")

Power Analysis

Above we defined statistical significance and power. We noted that we will be selecting the significance level to be some low value. We generally also would like the power to be some relatively high value. Essentially we want to reduce the probability that we reject the null when it is true (small significance) and also increase the probability that we reject the null when it is false (large power).

How might we achieve both of these goals? The key is being able to select the sample sizes and having a good idea of the magnitude of the effect size that would be meaningful for the application. The latter comes from subject matter knowledge.

In power analysis, we set the values for desired significance, power, and effect size and then determine the sample size that would allow us to achieve this. There are many packages that automate this for us.

Let’s look at the t-test.8 Note that the effect size here is Cohen’s d: \(d = \frac{|\mu_1 - \mu_2|}{\sigma}\), where \(\mu_1\) and \(\mu_2\) are the means and \(\sigma^2\) is the common variance. Here a value of 0.2 indicates a small effect size (which is harder to detect in small samples).9

# parameters for power analysis
effect = 0.2
alpha = 0.01
power = 0.9
# perform power analysis
pwr::pwr.t.test(n = NULL, d = effect, sig.level = alpha, power = power, type = "two.sample")
## 
##      Two-sample t test power calculation 
## 
##               n = 745.63
##               d = 0.2
##       sig.level = 0.01
##           power = 0.9
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

It’s also important to note that: “The design of an experiment or observational study often influences the power. For example, in a two-sample testing situation with a given total sample size n, it is optimal to have equal numbers of observations from the two populations being compared (as long as the variances in the two populations are the same).” (Wikipedia - Power of a Test) That is, power is maximized when the samples are of equal sizes.

Paired t-Test

For now, I skip the paired t test. But you can refer to Wikipedia - Paired Difference Test.

Non-Parametric Hypothesis Testing: Randomization Tests

If our data might not be normal or we want to check the results of a parametric test, we might consider a two sample test that doesn’t require any assumption about the distribution from which the data have been drawn. Randomization tests fit the bill. These only require that the samples are representative (random) and i.i.d.10

The hypothesis here is again \[\begin{aligned} H_0: \mu_1 &= \mu_2 \\ H_1: \mu_1 &\ne \mu_2 \end{aligned}\]

The idea of a randomization test is that, under the null, the means are the same. So if we shuffle the data and then recalculate the difference in means many times, we can compare the original difference in means to the distribution of these shuffled differences. If there is a true difference in means, then the original ordering should produce an unlikely difference relative to the shuffled difference distribution.

A two sample randomization test for equality of means proceeds as follows:

  • Compute the observed difference in sample means between the two groups.
  • Pool all of the observations together.
  • Repeatedly shuffle the pooled data, split it into two groups of sizes \(n_1\) and \(n_2\), and compute the difference in means for each shuffle.
  • The p-value is the proportion of shuffled differences that are at least as extreme (in absolute value) as the observed difference.

Let’s see this in action. We test whether the means for a standard normal and a Bernoulli random variable (\(p=0.5\)) are equal. We see that the randomization test gives somewhat similar results to the t test, but here the t test is not well founded.

n1 = 10
n2 = 20
y1 = rnorm(n1,0,1)
y2 = rbinom(size=1,n=n2,p=0.5)
y_dif = mean(y1) - mean(y2)

y = c(y1,y2)
iter = 1000
z_difs = rep(NA,iter)
for(i in 1:iter){
  shuffle = sample(y)
  z1 = shuffle[1:n1]
  z2 = shuffle[(n1+1):(n1+n2)]
  z_dif = mean(z1) - mean(z2)
  z_difs[i] = z_dif
}

p_val = sum(abs(z_difs)>abs(y_dif))/iter
results = matrix(0,nrow = 1,ncol = 4)
results[1,] = c(mean(y1),mean(y2),y_dif,p_val)
colnames(results) = c("Mean 1","Mean 2","Difference", "p-Value")
knitr::kable(round(results,3))
Mean 1 Mean 2 Difference p-Value
-0.114 0.45 -0.564 0.047
hist(z_difs,breaks = 20)
abline(v=y_dif,col="red",lty=2,lwd=2)

t.test(y1,y2)
## 
##  Welch Two Sample t-test
## 
## data:  y1 and y2
## t = -1.7189, df = 11.546, p-value = 0.1123
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.2816310  0.1539684
## sample estimates:
##  mean of x  mean of y 
## -0.1138313  0.4500000

Bootstrap Hypothesis Tests

There are also non-parametric tests based on bootstrapping. I hope to add a section about these in the future.

Analysis of Variance (ANOVA)

Suppose we want to compare more than just two treatment levels for a single factor. We can use ANOVA to do this.

Suppose we have \(a\) treatment levels for a single treatment factor that we want to compare. We have \(n\) observations for each treatment level.

We start with a model of the data. We use a linear model for the data here. \[y_{ij} = \mu + \tau_i + \epsilon_{ij}\] for \(i = 1,\dots,a\) and \(j=1,\dots,n\). \(\mu\) is the overall mean, \(\tau_i\) is the \(i\)th treatment effect, and \(\epsilon_{ij} \sim N(0,\sigma^2)\) is noise. So \(y_{ij} \sim N(\mu+\tau_i,\sigma^2)\).

Note that we usually consider \(\mu_i = \mu + \tau_i\) and \(\frac{1}{a}\sum_{i=1}^a \mu_i = \mu \implies \sum_{i=1}^a \tau_i = 0\). This means that the treatment effects are deviations from the overall mean.

If the treatment levels are specifically chosen by the experimenter, with no desire to extrapolate beyond those chosen, we call this the fixed effects model. In this case, the findings cannot be extrapolated to treatment levels not included in the analysis.

If we consider the treatment levels as a random sample of possible treatment levels, and we’d like to extend the results to treatment levels not specifically in the analysis, then we call this the random effects model. This approach reduces what we can say and we test hypotheses about the variability in \(\tau_i\) and try to estimate this variability.

Fixed Effects Model

The specific test that ANOVA allows us to run is \[\begin{aligned} H_0: \tau_1 &= \dots = \tau_a = 0 \\ H_1: \tau_i &\ne 0 \text{ for at least one }i \end{aligned}\]

Let \[\begin{aligned} &y_{i.} = \sum_{j=1}^n y_{ij} &&\bar{y}_{i.} = \frac{y_{i.}}{n} \\ &y_{..} = \sum_{i=1}^a\sum_{j=1}^n y_{ij} &&\bar{y}_{..} = \frac{y_{..}}{N} \\ \end{aligned}\] where \(N=an\).

ANOVA breaks the total variation down into component parts. Total sum of squares is \(SS_T = \sum_{i=1}^a\sum_{j=1}^n (y_{ij} - \bar{y}_{..})^2\) and measures overall variability in the data, since dividing it by \(N-1\) gives the sample variance of all the observations. It is easy to show that

\[\begin{aligned} \sum_{i=1}^a\sum_{j=1}^n (y_{ij} - \bar{y}_{..})^2 &= n \sum_{i=1}^a (\bar{y}_{i.} - \bar{y}_{..})^2 + \sum_{i=1}^a\sum_{j=1}^n (y_{ij} - \bar{y}_{i.})^2 \\ \implies SS_T &= SS_{\text{Treatments}} + SS_E \end{aligned}\]

\(SS_{\text{Treatments}}\) is the sum of squares due to treatments and \(SS_E\) is the sum of squares due to error.

There are a total of \(N= an\) observations and so \(SS_T\) has \(N-1\) DOF. There are \(a\) treatment levels and so \(SS_{\text{Treatments}}\) has \(a-1\) DOF. Within each treatment level, there are \(n\) observations with which to estimate statistical error so \(SS_E\) has \(a(n-1) = an -a = N-a\) DOF.

Note that we can calculate these sums of squares as

\[\begin{aligned} SS_T &= \sum_{i=1}^a\sum_{j=1}^n y_{ij}^2 - \frac{y_{..}^2}{N} \\ SS_{\text{Treatments}} &= \frac{1}{n} \sum_{i=1}^a y_{i.}^2 - \frac{y_{..}^2}{N} \\ SS_E &= SS_T - SS_{\text{Treatments}} \\ \end{aligned}\]

Sum of Squares and Mean Square Due to Error

Note that \[SS_E = \sum_{i=1}^a\left[\sum_{j=1}^n (y_{ij} - \bar{y}_{i.})^2\right]\] The term in the brackets would be the sample variance in the \(i\)th treatment, if divided by \(n-1\). Now we can see that a pooled estimate of the common variance within each treatment can be written:

\[\frac{(n-1)S_1^2 + \dots + (n-1)S_a^2}{(n-1) + \dots + (n-1)} = \frac{\sum_{i=1}^a\left[\sum_{j=1}^n (y_{ij} - \bar{y}_{i.})^2\right]}{N-a} = \frac{SS_E}{N-a}\]

Now, \(MS_E := \frac{SS_E}{N-a}\). It can be shown easily that

\[\mathbb{E}[MS_E] = \sigma^2\]

Sum of Squares and Mean Square Due to Treatments

If there were no difference between the treatment means, we could use \(SS_{\text{Treatments}}/(a-1)\) to estimate \(\sigma^2\). This is because \(\frac{\sum_{i=1}^a (\bar{y}_{i.} - \bar{y}_{..})^2}{a-1}\) estimates \(\frac{\sigma^2}{n}\), the variance of the treatment averages. So \(MS_{\text{Treatments}} := \frac{SS_{\text{Treatments}}}{a-1} = \frac{n}{a-1}\sum_{i=1}^a (\bar{y}_{i.} - \bar{y}_{..})^2\) estimates \(\sigma^2\).

It can be shown easily that

\[\mathbb{E}[MS_{\text{Treatments}}] = \sigma^2 + \frac{n}{a-1}\sum_{i=1}^a \tau_i^2\]

ANOVA Hypothesis Test for Equality of Treatment Effects

Again, our hypotheses are \[\begin{aligned} H_0: \tau_1 &= \dots = \tau_a = 0 \\ H_1: \tau_i &\ne 0 \text{ for at least one }i \end{aligned}\]

We’ve seen that \(MS_E\) estimates \(\sigma^2\) and that, under the null that all treatment effects are zero, \(MS_{\text{Treatments}}\) also estimates \(\sigma^2\). However, if the treatment effects do differ, then \(MS_{\text{Treatments}}\) will be larger than \(\sigma^2\). So we can test our hypothesis by comparing \(MS_E\) and \(MS_{\text{Treatments}}\).

Since we’ve assumed our linear model and that \(\epsilon_{ij} \sim N(0,\sigma^2)\), \(y_{ij} \sim N(\mu+\tau_i,\sigma^2)\). This means that \(SS_T\) is a sum of squared normally distributed RVs. Dividing by \(\sigma^2\) would complete the standardization and make \(SS_T /\sigma^2 \sim \chi^2_{N-1}\). Similarly, \(SS_E /\sigma^2 \sim \chi^2_{N-a}\) and \(SS_{\text{Treatments}} /\sigma^2 \sim \chi^2_{a-1}\), under the null. Cochran’s Theorem establishes the independence of \(SS_E/\sigma^2\) and \(SS_{\text{Treatments}}/\sigma^2\).

So under the null,

\[\begin{aligned} F_0 &= \frac{(SS_{\text{Treatments}}/\sigma^2)/(a-1)}{(SS_E/\sigma^2)/(N-a)} = \frac{SS_{\text{Treatments}}/(a-1)}{SS_E/(N-a)} = \frac{MS_{\text{Treatments}}}{MS_E} \\ &\sim F_{a-1,N-a} \end{aligned}\]

Under the alternative hypothesis (that at least one treatment effect is not zero), \(MS_{\text{Treatments}}\) will tend to be larger than \(MS_E\), and so if we see values for \(F_0\) that are improbably large assuming the null is true, we should reject the null. This is a one sided test. That is, we should reject the null if \(F_0 > F_{\alpha,a-1,N-a}\). We could also use a p-value approach.

ANOVA results are generally presented in a table:

Source of Variation Sum of Squares Degrees of Freedom Mean Square \(F_0\)
Between Treatments \(SS_{\text{Treatments}} = \frac{1}{n} \sum_{i=1}^a y_{i.}^2 - \frac{y_{..}^2}{N}\) \(a-1\) \(MS_{\text{Treatments}} = \frac{SS_{\text{Treatments}}}{a-1}\) \(F_0 = \frac{MS_{\text{Treatments}}}{MS_E}\)
Error (w/in Treatments) \(SS_E = SS_T - SS_{\text{Treatments}}\) \(N-a\) \(MS_E = \frac{SS_E}{N-a}\)
Total \(SS_T = \sum_{i=1}^a\sum_{j=1}^n y_{ij}^2 - \frac{y_{..}^2}{N}\) \(N-1\)

It is also possible to do this with different sample sizes for each treatment.
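Here is a minimal sketch of a one-way ANOVA in R on simulated data, where the number of treatment levels, the (made-up) treatment effects, and the noise variance are all arbitrary; aov() produces the table above for us.

# One-way ANOVA with a = 3 treatment levels and n = 10 observations each.
a = 3
n = 10
treatment = factor(rep(1:a, each = n))
tau = c(-0.5, 0, 0.5)                          # made-up treatment effects
y = 5 + tau[as.integer(treatment)] + rnorm(a*n, 0, 1)

fit = aov(y ~ treatment)
summary(fit)   # reports SS, DF, MS, F0, and the p-value for the treatment factor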

Check Assumptions

It’s important to recall that this all relies on an assumption of normality. We can again check this as above with normal probability plots. It is best to do an analysis of the residuals: \(e_{ij} = y_{ij} - \hat{y}_{ij}\), where \(\hat{y}_{ij} = \hat\mu +\hat\tau_i = \bar{y}_{..} + (\bar{y}_{i.} - \bar{y}_{..}) = \bar{y}_{i.}\). We should also check that there is no structure to the residuals, in particular that the variance of the residuals is homogeneous. This can be checked with a plot of residuals vs fitted values. We might also plot the residuals against other variables that may have been collected; patterns here might mean that the variable should be controlled for more carefully.

When we have non-constant variance or non-normality, we can try transforming the observations with a square root, log, power, or other simple transformation to see if this yields constant variance. Such transformations might also bring the error distribution closer to normal. We can then run the analysis on the transformed data. Here the conclusions apply to the transformed populations, not the original population.
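Continuing the aov() sketch above (fit is that fitted object), the residual checks might look like this.

# Residual diagnostics for the one-way ANOVA fit above.
res = residuals(fit)
fits = fitted(fit)

qqnorm(res); qqline(res)                 # check normality of residuals
plot(fits, res,                          # check for structure / non-constant variance
     xlab = "Fitted Values", ylab = "Residuals",
     main = "Residuals vs Fitted Values")
abline(h = 0, lty = 2)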

Comparing Pairs of Treatment Means: Tukey’s Test

Suppose we want to do pairwise comparisons of all treatment means. \[\begin{aligned} H_0: \mu_i &= \mu_j \\ H_1: \mu_i &\ne \mu_j \end{aligned}\] for \(i\ne j\). And we want a significance level of at most \(\alpha\) for all the tests jointly. The Tukey test allows us to calculate confidence intervals for all of these pairwise tests. This will allow us to see which treatment might lead to the highest increase in the response.

The test uses the studentized range statistic

\[q = \frac{\bar{y}_{\text{max}} - \bar{y}_{\text{min}}}{\sqrt{MS_E/n}}\] where \(\bar{y}_{\text{max}},\bar{y}_{\text{min}}\) are the largest and smallest sample means out of the \(a\) sample means.

When sample sizes are not equal, two means for treatments \(i,j\) are significantly different if

\[|\bar{y}_{i.} - \bar{y}_{j.}| > \frac{q_\alpha(a,f)}{\sqrt{2}}\sqrt{MS_E\left(\frac{1}{n_i} + \frac{1}{n_j}\right)} := T_\alpha\] We can construct confidence intervals for \(\mu_i - \mu_j\) as

\[\bar{y}_{i.} - \bar{y}_{j.} \pm \frac{q_\alpha(a,f)}{\sqrt{2}}\sqrt{MS_E\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}\]

where \(f\) is the DOF of \(MS_E\).
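In R, TukeyHSD() applied to an aov fit (here the one-way fit from the sketch above) returns these simultaneous intervals directly.

# Tukey pairwise comparisons with joint significance level 0.05.
TukeyHSD(fit, conf.level = 0.95)
plot(TukeyHSD(fit))   # intervals excluding 0 indicate significant pairwise differences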

Comparing Treatment Means with a Control: Dunnett Procedure

If one of the treatment levels is a control, then we might be interested just in comparing each of the treatments to the control. So there would be \(a-1\) comparisons, where each has a hypothesis

\[\begin{aligned} H_0: \mu_i &= \mu_a \\ H_1: \mu_i &\ne \mu_a \end{aligned}\] for \(i = 1,\dots,a-1\).

Dunnett’s procedure modifies the usual t-test. For each hypothesis, we compute the observed difference in means: \(|\bar{y}_{i.} - \bar{y}_{a.}|\) for \(i = 1,\dots,a-1\).

The null hypothesis \(H_0: \mu_i = \mu_a\) is rejected for significance \(\alpha\) (the joint significance on all \(a-1\) tests) if

\[|\bar{y}_{i.} - \bar{y}_{a.}| > d_\alpha(a-1,f)\sqrt{MS_E\left(\frac{1}{n_i} + \frac{1}{n_a}\right)}\]
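One way to run Dunnett’s procedure in R is via the multcomp package; this is a minimal sketch assuming that package is installed and that the control is the first level of the treatment factor in the aov fit from the sketch above.

# Dunnett comparisons of each treatment against the control (multcomp assumed installed).
library(multcomp)
dunnett = glht(fit, linfct = mcp(treatment = "Dunnett"))  # first factor level = control
summary(dunnett)    # adjusted tests of each treatment vs the control
confint(dunnett)    # simultaneous confidence intervals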

Regression and ANOVA

In the future, I hope to have a section linking ANOVA and regression.

Non-Parametric Approach to ANOVA: Kruskal-Wallis Test

In the future, I hope to have a section on non-parametric ANOVA.

Random Effects Model

In the future, I hope to have a section on random effects models.

Blocking

A nuisance factor is a variable that affects the response but whose effect we are not interested in. When a nuisance variable is known and controllable we can block on it to eliminate its effect. In a randomized complete block design (RCBD), each block contains all treatments. Each block therefore creates a homogeneous set of observations on which to compare treatments.

Suppose we have \(a\) treatments and \(b\) blocks. Randomization of treatment assignment only occurs within each block. Suppose we have only one observation per block-treatment combination here.

The model for the data is

\[y_{ij} = \mu + \tau_i + \beta_j + \epsilon_{ij}\] where \(\mu\) is the overall mean, \(\tau_i\) are the treatment effects, \(\beta_j\) are the block effects, and \(\epsilon_{ij} \overset{iid}\sim N(0,\sigma^2)\) are noise.

We again think of \(\tau_i\) and \(\beta_j\) as deviations from the overall mean and so \(\sum \tau_i = 0\) and \(\sum \beta_j = 0\)

Again, our hypotheses are \[\begin{aligned} H_0: \tau_1 &= \dots = \tau_a = 0 \\ H_1: \tau_i &\ne 0 \text{ for at least one }i \end{aligned}\]

We can now show that

\[SS_T = SS_{\text{Treatments}} + SS_{\text{Blocks}} + SS_E\]

We can also show that \(\frac{SS_{\text{Treatments}}}{\sigma^2} \sim \chi^2_{a-1}\), \(\frac{SS_{\text{Blocks}}}{\sigma^2} \sim \chi^2_{b-1}\), and \(\frac{SS_E}{\sigma^2} \sim \chi^2_{(a-1)(b-1)}\), since the sum of squares portions are sums of squared normals minus their means, which makes these ratios chi-square.

Define \(MS_{\text{Treatments}} = \frac{SS_{\text{Treatments}}}{a-1}\), \(MS_{\text{Blocks}} = \frac{SS_{\text{Blocks}}}{b-1}\), \(MS_E = \frac{SS_E}{(a-1)(b-1)}\).

We can show that \[\begin{aligned} \mathbb{E}[MS_{\text{Treatments}}] &= \sigma^2 + \frac{b}{a-1}\sum_{i=1}^a \tau_i^2 \\ \mathbb{E}[MS_{\text{Blocks}}] &= \sigma^2 + \frac{a}{b-1}\sum_{j=1}^b \beta_j^2 \\ \mathbb{E}[MS_E] &= \sigma^2 \end{aligned}\]

To test the equality of treatment means, we can use the test statistic

\[F_0 = \frac{\frac{SS_{\text{Treatments}}}{\sigma^2}/(a-1)}{\frac{SS_E}{\sigma^2} / (a-1)(b-1)} = \frac{MS_{\text{Treatments}}}{MS_E} \sim F_{a -1, (a-1)(b-1)}\] when the null is true. We can reject the null when \(F_0 > F_{\alpha,a -1, (a-1)(b-1)}\). We could also use a p-value approach.

The sums of squares can be calculated as \[\begin{aligned} SS_T &= \sum_{i=1}^a \sum_{j=1}^b y_{ij}^2 - \frac{y_{..}^2}{N} \\ SS_{\text{Treatments}} &= \frac{1}{b} \sum_{i=1}^a y_{i.}^2 - \frac{y_{..}^2}{N} \\ SS_{\text{Blocks}} &= \frac{1}{a} \sum_{j=1}^b y_{.j}^2 - \frac{y_{..}^2}{N} \\ SS_E &= SS_T - SS_{\text{Treatments}} - SS_{\text{Blocks}} \end{aligned}\]

Source of Variation Sum of Squares Degrees of Freedom Mean Square \(F_0\)
Treatments \(SS_{\text{Treatments}}\) \(a-1\) \(MS_{\text{Treatments}} = \frac{SS_{\text{Treatments}}}{a-1}\) \(F_0 = \frac{MS_{\text{Treatments}}}{MS_E}\)
Blocks \(SS_{\text{Blocks}}\) \(b-1\) \(MS_{\text{Blocks}} = \frac{SS_{\text{Blocks}}}{b-1}\)
Error \(SS_E\) \((a-1)(b-1)\) \(MS_E = \frac{SS_E}{(a-1)(b-1)}\)
Total \(SS_T\) \(N-1\)
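Here is a minimal sketch of an RCBD analysis in R on simulated data with made-up treatment and block effects; aov() with both terms reproduces the table above.

# RCBD with a = 4 treatments and b = 5 blocks, one observation per cell.
a = 4
b = 5
d = expand.grid(treatment = factor(1:a), block = factor(1:b))
tau  = c(-1.5, -0.5, 0.5, 1.5)        # made-up treatment effects
beta = c(-2, -1, 0, 1, 2)             # made-up block effects
d$y = 10 + tau[as.integer(d$treatment)] + beta[as.integer(d$block)] + rnorm(a*b, 0, 1)

fit_rcbd = aov(y ~ treatment + block, data = d)
summary(fit_rcbd)   # the treatment F test uses MS_E with (a-1)(b-1) DOF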

Factorial Designs

Two Factor Factorial Designs

Suppose there are two treatment factors that we’d like to test, each with multiple levels. Let there be \(a\) levels of factor \(A\) and \(b\) levels of factor \(B\). And let there be \(n\) observations for each treatment combination. So there are a total of \(nab\) observations.

The model for the data is

\[y_{ijk} = \mu + \tau_i + \beta_j + (\tau\beta)_{ij} + \epsilon_{ijk}\] for \(i = 1,\dots, a\), \(j = 1,\dots, b\), and \(k = 1,\dots, n\). \(\mu\) is the overall mean, \(\tau_i\) is the ith treatment effect for \(A\), \(\beta_j\) is the jth treatment effect for \(B\), \((\tau\beta)_{ij}\) is the interaction effect between \(\tau_i\) and \(\beta_j\), and \(\epsilon_{ijk}\sim N(0,\sigma^2)\) is noise. We again interpret the treatment effects as deviations from the overall mean and so \(\sum\tau_i =0, \sum\beta_j = 0, \sum_i(\tau\beta)_{ij} = \sum_j (\tau\beta)_{ij} =0\).

The hypotheses are for \(A,B,\) and their interaction: \[\begin{aligned} H_0: \tau_1 &= \dots = \tau_a = 0 \\ H_1: \tau_i &\ne 0 \text{ for at least one }i \end{aligned}\] \[\begin{aligned} H_0: \beta_1 &= \dots = \beta_b = 0 \\ H_1: \beta_j &\ne 0 \text{ for at least one }j \end{aligned}\] \[\begin{aligned} H_0: (\tau\beta)_{ij} &= 0 \text{ for all }i,j\\ H_1: (\tau\beta)_{ij} &\ne 0 \text{ for at least one }i,j \end{aligned}\]

We can show that

\[SS_T = SS_A + SS_B + SS_{AB} + SS_E\] and that these have \(abn-1\), \(a-1\), \(b-1\), \((a-1)(b-1)\), and \(ab(n-1)\) DOF, respectively.

We can also show that \(\frac{SS_A}{\sigma^2} \sim \chi^2_{a-1}\), \(\frac{SS_B}{\sigma^2} \sim \chi^2_{b-1}\), \(\frac{SS_{AB}}{\sigma^2} \sim \chi^2_{(a-1)(b-1)}\), and \(\frac{SS_E}{\sigma^2} \sim \chi^2_{ab(n-1)}\), since the sum of squares portions are sums of squared normals minus their means, which makes these ratios chi-square.

Define \(MS_A = \frac{SS_A}{a-1}\), \(MS_B = \frac{SS_B}{b-1}\), \(MS_{AB} = \frac{SS_{AB}}{(a-1)(b-1)}\), \(MS_E = \frac{SS_E}{ab(n-1)}\).

We can show that \[\begin{aligned} \mathbb{E}[MS_A] &= \sigma^2 + \frac{bn}{a-1}\sum_{i=1}^a \tau_i^2 \\ \mathbb{E}[MS_B] &= \sigma^2 + \frac{an}{b-1}\sum_{j=1}^b \beta_j^2 \\ \mathbb{E}[MS_{AB}] &= \sigma^2 + \frac{n}{(a-1)(b-1)}\sum_{i=1}^a\sum_{j=1}^b (\tau\beta)_{ij}^2 \\ \mathbb{E}[MS_E] &= \sigma^2 \end{aligned}\]

We can test the equality of \(\tau_i\)’s with \[F_0 = \frac{\frac{SS_A}{\sigma^2}/(a-1)}{\frac{SS_E}{\sigma^2} / ab(n-1)} = \frac{MS_A}{MS_E} \sim F_{a -1, ab(n-1)}\] We can test the equality of \(\beta_j\)’s with \[F_0 = \frac{\frac{SS_B}{\sigma^2}/(b-1)}{\frac{SS_E}{\sigma^2} / ab(n-1)} = \frac{MS_B}{MS_E} \sim F_{b -1, ab(n-1)}\] We can test the equality of \((\tau\beta)_{ij}\)’s with \[F_0 = \frac{\frac{SS_{AB}}{\sigma^2}/(a-1)(b-1)}{\frac{SS_E}{\sigma^2} / ab(n-1)} = \frac{MS_{AB}}{MS_E} \sim F_{(a-1)(b-1), ab(n-1)}\]

Source of Variation Sum of Squares Degrees of Freedom Mean Square \(F_0\)
A \(SS_A\) \(a-1\) \(MS_A = \frac{SS_A}{a-1}\) \(F_0 = \frac{MS_A}{MS_E}\)
B \(SS_B\) \(b-1\) \(MS_B = \frac{SS_B}{b-1}\) \(F_0 = \frac{MS_B}{MS_E}\)
AB \(SS_{AB}\) \((a-1)(b-1)\) \(MS_{AB} = \frac{SS_{AB}}{(a-1)(b-1)}\) \(F_0 = \frac{MS_{AB}}{MS_E}\)
Error \(SS_E\) \(ab(n-1)\) \(MS_E = \frac{SS_E}{ab(n-1)}\)
Total \(SS_T\) \(abn-1\)

You can again use Tukey’s test to test all pairwise treatment level hypotheses simultaneously and maintain your desired significance level. You can deal with significant interactions by fixing the level of one factor and then applying the Tukey test to the other factor. You could also run all \(ab\) comparisons to see which differ significantly.

Blocking is also possible in addition to the treatment factors. It works similarly to blocking described above.
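Here is a minimal sketch of a two factor factorial analysis in R on simulated data (factor levels, effects, and replicate count are made up); the A*B formula fits both main effects and the interaction.

# Two factor factorial with a = 3 and b = 2 levels and n = 4 replicates per cell.
a = 3
b = 2
n = 4
d = expand.grid(A = factor(1:a), B = factor(1:b), rep = 1:n)
d$y = 20 + as.integer(d$A) + 2*as.integer(d$B) +
      0.5*as.integer(d$A)*as.integer(d$B) + rnorm(nrow(d), 0, 1)

fit_fact = aov(y ~ A * B, data = d)    # equivalent to A + B + A:B
summary(fit_fact)                      # F tests for A, B, and the AB interaction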

Three Factor Factorial Designs

You could extend this to have any number of factors with any number of factor levels. With three factors, we’d have

\[y_{ijkl} = \mu + \tau_i + \beta_j + \gamma_k + (\tau\beta)_{ij} + (\tau\gamma)_{ik} + (\beta\gamma)_{jk} + (\tau\beta\gamma)_{ijk}+ \epsilon_{ijkl}\] and

\[SS_T = SS_A + SS_B + SS_C + SS_{AB} + SS_{AC} + SS_{BC} + SS_{ABC} + SS_E\] and

Source of Variation Sum of Squares Degrees of Freedom Mean Square \(\mathbb{E}[MS]\) \(F_0\)
A \(SS_A\) \(a-1\) \(MS_A = \frac{SS_A}{a-1}\) \(\sigma^2 + \frac{bcn}{a-1}\sum \tau_i^2\) \(F_0 = \frac{MS_A}{MS_E}\)
B \(SS_B\) \(b-1\) \(MS_B = \frac{SS_B}{b-1}\) \(\sigma^2 + \frac{acn}{b-1}\sum \beta_j^2\) \(F_0 = \frac{MS_B}{MS_E}\)
C \(SS_C\) \(c-1\) \(MS_C = \frac{SS_C}{c-1}\) \(\sigma^2 + \frac{abn}{c-1}\sum \gamma_k^2\) \(F_0 = \frac{MS_C}{MS_E}\)
AB \(SS_{AB}\) \((a-1)(b-1)\) \(MS_{AB} = \frac{SS_{AB}}{(a-1)(b-1)}\) \(\sigma^2 + \frac{cn}{(a-1)(b-1)}\sum\sum (\tau\beta)_{ij}^2\) \(F_0 = \frac{MS_{AB}}{MS_E}\)
AC \(SS_{AC}\) \((a-1)(c-1)\) \(MS_{AC} = \frac{SS_{AC}}{(a-1)(c-1)}\) \(\sigma^2 + \frac{bn}{(a-1)(c-1)}\sum\sum (\tau\gamma)_{ik}^2\) \(F_0 = \frac{MS_{AC}}{MS_E}\)
BC \(SS_{BC}\) \((b-1)(c-1)\) \(MS_{BC} = \frac{SS_{BC}}{(b-1)(c-1)}\) \(\sigma^2 + \frac{an}{(b-1)(c-1)}\sum\sum (\beta\gamma)_{jk}^2\) \(F_0 = \frac{MS_{BC}}{MS_E}\)
ABC \(SS_{ABC}\) \((a-1)(b-1)(c-1)\) \(MS_{ABC} = \frac{SS_{ABC}}{(a-1)(b-1)(c-1)}\) \(\sigma^2 + \frac{n}{(a-1)(b-1)(c-1)}\sum\sum (\tau\beta\gamma)_{ijk}^2\) \(F_0 = \frac{MS_{ABC}}{MS_E}\)
Error \(SS_E\) \(abc(n-1)\) \(MS_E = \frac{SS_E}{abc(n-1)}\) \(\sigma^2\)
Total \(SS_T\) \(abcn-1\)

The \(2^k\) Factorial Designs

Let’s consider the case where there are \(k\) factors but each has only two treatment levels. Because there are only two levels for each factor, we assume that the response is approximately linear over the range of factor levels. This is often reasonable in factor screening experiments, when we are just starting to study the system and want to get a quick picture of what is important. The factors can be quantitative or qualitative.

\(2^2\) Design

Suppose we have two factors, both with two levels.

The data for this experiment might be structured as follows. We have \(n\) replicates of each treatment combination, where one treatment setting for each factor is considered “high” and labeled \(+\) and the other considered “low” and labeled “-.” The first two columns of this are called the design matrix.

Factor A Factor B Replicate 1 Replicate n
- - # #
+ - # #
- + # #
+ + # #

We use “A” to refer to the main effect of factor A, “B” to refer to the main effect of factor B, “AB” to refer to the interaction, and “I” to refer to the overall mean.

The four treatment combinations are also represented by lower case letters, where the presence of the letter indicates the high level. “\((1)\)” represents both factors at the low level.

\((1)\), \(a\), \(b\), and \(ab\) represent the sum of the \(n\) replicates / observations at the specific treatment level combination.

  • The effect of A at the low level of B is \(\frac{[a-(1)]}{n}\) and the effect of A at the high level of B is \(\frac{[ab-b]}{n}\). Taking the average of these gives the main effect of A.
  • The effect of B at the low level of A is \(\frac{[b-(1)]}{n}\) and the effect of B at the high level of A is \(\frac{[ab-a]}{n}\). Taking the average of these gives the main effect of B.
  • The interaction effect AB is the average difference between the effect of A at the high level of B and the effect of A at the low level of B.

So \[\begin{aligned} A &= \frac{1}{2}\left(\frac{[a-(1)]}{n} + \frac{[ab-b]}{n}\right) = \frac{1}{2n}\left[ab+a-b-(1)\right] \\ B &= \frac{1}{2}\left(\frac{[b-(1)]}{n} + \frac{[ab-a]}{n}\right) = \frac{1}{2n}\left[ab-a+b-(1)\right] \\ AB &= \frac{1}{2}\left(\frac{[ab-b]}{n} - \frac{[a-(1)]}{n}\right) = \frac{1}{2n}\left[ab-a-b+(1)\right] \\ \end{aligned}\]

It can also be shown that \[\begin{aligned} SS_A &= \frac{1}{4n}\left[ab+a-b-(1)\right]^2 \\ SS_B &= \frac{1}{4n}\left[ab-a+b-(1)\right]^2 \\ SS_{AB} &= \frac{1}{4n}\left[ab-a-b+(1)\right]^2 \\ \end{aligned}\]

We can use ANOVA to test hypotheses about whether there is a difference between the levels for each factor and for interaction.

Let’s also see how we can get the contrasts in a simple table.

Treatment Combination I A B AB
(1) + - - +
a + + - -
b + - + -
ab + + + +

We can also tie this to OLS regression estimates. We can fit a model to the data for this experiment of the form \[y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_{12}x_1x_2 + \epsilon\] where \(x_1,x_2\) are the main effects of the two factors on the \(\pm\) scale and \(x_1x_2\) is the two factor interaction. That is, the relationship between the coded variables on the \(\pm\) scale (\(x_1,x_2\)) and the natural variables (A,B) is \[\begin{aligned} x_1 &= \frac{A-(A_\text{low}+A_\text{high})/2}{(A_\text{high} - A_\text{low})/2} \in \{-1,+1\} \\ x_2 &= \frac{B-(B_\text{low}+B_\text{high})/2}{(B_\text{high} - B_\text{low})/2} \in \{-1,+1\} \\ \end{aligned}\] The regression coefficients are equal to one half the main and interaction effect estimates. This is because a regression coefficient measures the effect of a one-unit change in \(x\) on the mean of \(y\), and the effect estimate is based on a two unit change (from -1 to +1). \[\begin{aligned} \beta_0 &= \frac{1}{4n}\left[ab+a+b+(1)\right] \\ \beta_1 &= \frac{1}{4n}\left[ab+a-b-(1)\right] \\ \beta_2 &= \frac{1}{4n}\left[ab-a+b-(1)\right] \\ \beta_{12} &= \frac{1}{4n}\left[ab-a-b+(1)\right] \\ \end{aligned}\] Note that the standard errors of all the coefficients in the regression output will all be equal since this is an orthogonal design.
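Here is a minimal sketch with simulated responses (the effect sizes are made up) showing that the lm() coefficients on the coded \(\pm 1\) variables are one half the corresponding effect estimates.

# 2^2 design with n = 5 replicates and coded +/-1 factors.
n = 5
d = expand.grid(x1 = c(-1, 1), x2 = c(-1, 1), rep = 1:n)
d$y = 10 + 3*d$x1 - 2*d$x2 + 1*d$x1*d$x2 + rnorm(nrow(d), 0, 1)

fit_22 = lm(y ~ x1 * x2, data = d)
coef(fit_22)              # beta_0, beta_1, beta_2, beta_12
2 * coef(fit_22)[-1]      # estimates of the A, B, and AB effects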

\(2^3\) Design

Suppose we have factors A,B,C that each have two levels.

The data for this experiment might be structured as follows. We have \(n\) replicates of each treatment combination, where one treatment setting for each factor is considered “high” and labeled \(+\) and the other considered “low” and labeled “-.” The first three columns of this are called the design matrix.

Factor A Factor B Factor C Replicate 1 Replicate n
- - - # #
+ - - # #
- + - # #
+ + - # #
- - + # #
+ - + # #
- + + # #
+ + + # #

We can again get the contrasts for the main effects and interactions using a table.

Treatment Combination I A B AB C AC BC ABC
(1) + - - + - + + -
a + + - - - - + +
b + - + - - + - +
ab + + + + - - - -
c + - - + + - - +
ac + + - - + + - -
bc + - + - + - + -
abc + + + + + + + +

When the lower case letters represent the total response for that treatment combination, \[\begin{aligned} A &= \frac{1}{4n}\left[-(1)+a-b+ab-c+ac-bc+abc\right] \\ B &= \frac{1}{4n}\left[-(1)-a+b+ab-c-ac+bc+abc\right] \\ AB &= \frac{1}{4n}\left[+(1)-a-b+ab+c-ac-bc+abc\right] \\ C &= \frac{1}{4n}\left[-(1)-a-b-ab+c+ac+bc+abc\right] \\ \dots \end{aligned}\] Note that the 4 in the denominator comes from there being 4 effects that are averaged to get each of these (e.g., for the main effect of A, the effect of A for each of the combinations of levels for B and C).

Also note that in this table the column for \(AB = A\times B\) and the same is true for all other interactions.

We can again easily estimate all the effects using a regression.

\[y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \beta_{12}x_1x_2 + \beta_{13}x_1x_3 + \beta_{23}x_2x_3 + \beta_{123}x_1x_2x_3 + \epsilon\] This regression is run on the coded versions of the variables, as before. And the regression coefficients are equal to one half the main and interaction effect estimates.

Some Notes

Note the typical analysis procedure for a \(2^k\) design is

  • Fit an initial model
    • If the design is replicated, fit the full model with all main effects and interactions.
    • Otherwise, use a half normal plot to see which effects are important.
    • Determine which effects are significant. Refine model to drop the insignificant effects. Note that if we keep an interaction, all lower order effects that contribute to this interaction should be kept.
  • Fit refined model.
  • Analyze residuals to check assumptions.
  • Interpret the results using main effect and interaction plots or contour plots or response surface plots.

Half normal plots are plots of the absolute value of the effect estimates against their cumulative normal probabilities. Departures from the line on a half normal plot indicate important effects. Since the effects are differences in sample means for the low and high settings, the CLT says they should tend to follow a normal distribution; effects that stand well away from zero relative to the rest suggest something important. (https://www.itl.nist.gov/div898/handbook/pri/section5/pri598.htm)
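Here is a minimal base R sketch of a half normal plot, given a vector of (made-up) effect estimates from an unreplicated design; dedicated packages provide fancier, labeled versions of the same plot.

# Half normal plot of made-up effect estimates.
effects = c(A = 8.1, B = -0.4, C = 0.6, AB = -5.3, AC = 0.2, BC = -0.7, ABC = 0.3)
abs_eff = sort(abs(effects))
k = length(abs_eff)
q = qnorm(0.5 + 0.5*(seq_len(k) - 0.5)/k)   # half normal quantiles at (i - 0.5)/k
plot(q, abs_eff,
     xlab = "Half Normal Quantiles", ylab = "|Effect Estimate|",
     main = "Half Normal Plot of Effects")
text(q, abs_eff, labels = names(abs_eff), pos = 2)  # points far off the line look important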

Note that our use of coded variables allows us to compare the relative size of very different factors easily.

Two-Level Fractional Factorial Designs

With a large number of factors in a \(2^k\) factorial design, we might end up needing more runs or observations than are possible to collect. For example, with 10 factors, there are \(2^{10} = 1024\) treatment combinations, and so we’d need at least as many observations / runs, yet only 10 of the effects are main effects and 45 are two factor interactions. We might think that some of the higher order interactions are not as important, at least for an initial screening experiment where the goal is to find which factors have large effects.

One Half Fraction Designs

Let’s consider a one half fraction of a \(2^3\) factorial design. This will have \(2^{3-1}= 4\) observations. Since we are not gathering a large enough sample to assign all possible treatment combinations, we will also not be able to identify all factor effects. This is the trade off we make to have fewer runs.

We choose which treatment combinations to assign based on a “defining relation.” Here it will be \(I = +ABC\). \(ABC\) is called the generator. The defining relation tells us which factor effects cannot be distinguished from one another in the experiment; this is called aliasing. In general, we will want to alias as few low order terms as possible. This can be achieved by identifying \(I\) with the highest order interaction effects. Here we use \(I=+ABC\). Let’s see what the implications of this are.

The alias structure can be determined using the defining relation:

  • Multiply the defining relation by A: \(A\cdot I = A\cdot ABC \implies A = A^2BC = BC\), since \(\pm 1^2 = 1\).
  • \(B\cdot I = B\cdot ABC \implies B = AB^2C = AC\)
  • \(C\cdot I = C\cdot ABC \implies C = ABC^2 = AB\)

We indicate the alias structure by writing \[\begin{aligned} I &= +ABC \\ [A] &\to A + BC \\ [B] &\to B + AC \\ [C] &\to C + AB \\ \end{aligned}\] We can see all of this very quickly by just looking at the rows of the following table for which \(ABC\) has a \(+\).

Treatment Combination I A B AB C AC BC ABC
(1) + - - + - + + -
a + + - - - - + +
b + - + - - + - +
ab + + + + - - - -
c + - - + + - - +
ac + + - - + + - -
bc + - + - + - + -
abc + + + + + + + +

That is,

Treatment Combination I A B AB C AC BC ABC
a + + - - - - + +
b + - + - - + - +
c + - - + + - - +
abc + + + + + + + +

We can then compare columns to see which effects are aliased. We could alternatively use a defining relation of \(I=-ABC\), which would use the rows where \(ABC\) has a \(-\).
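We can verify the alias structure numerically with a minimal sketch: build the full table of \(\pm 1\) columns, keep the rows where \(ABC = +1\), and check that, for example, the A and BC columns coincide.

# Columns of +/-1 for the full 2^3 design, then the I = +ABC half fraction.
d = expand.grid(A = c(-1, 1), B = c(-1, 1), C = c(-1, 1))
d$AB  = d$A * d$B
d$AC  = d$A * d$C
d$BC  = d$B * d$C
d$ABC = d$A * d$B * d$C

half = d[d$ABC == 1, ]    # rows a, b, c, abc
half
all(half$A == half$BC)    # TRUE: A is aliased with BC (likewise B with AC, C with AB)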

This is a resolution III design, because the smallest effect in the defining relation contains three letters. Higher resolution designs are preferable. The higher the resolution, the less restrictive the assumptions that are required with respect to which interactions are aliased with each other and hence which interactions must be assumed to be negligible to obtain interpretable results.

We can still use a regression model on coded variables to estimate effects. Again, the coefficients will be 1/2 the effects.

General \(2^{k-p}\) Fractional Factorial Designs

For general \(2^{k-p}\) fractional factorial designs, \(p\) generators are required and again we want to achieve the highest resolution possible. When resolution alone is insufficient to select a single defining relation, we should choose the minimum aberration design, which has the fewest effects of the resolution length in the defining relation. This guides the choice of a defining relation, which then tells us the alias structure, which treatment combinations to assign, and which effects must be assumed to be negligible.

Once we have all this, we can run the experiment and proceed to follow a similar course of analysis as above:

  • Use box plots and a half normal plot to get a sense of which effects are important.
  • Fit a main effects model to get a sense of which effects are important.
  • Use stepwise regression with main effects and as many low order interactions as possible to fit an initial model.
  • Analyze residuals to check assumptions.
  • Perform follow-up experiment, if necessary, and repeat steps.
  • Interpret the results using main effect and interaction plots or contour plots or response surface plots.

References

Montgomery, Douglas C. 2019. Design and Analysis of Experiments, 10th Edition. New York: Wiley.

  1. It is important to acknowledge that this is an assumption and will not hold in many interesting and important settings.↩︎

  2. Again, it is important to note that this might not be the case in all settings. Non-random sampling can have dramatic implications for the bias of estimates.↩︎

  3. It doesn’t make sense to have \(n\) in the limit.↩︎

  4. See https://en.wikipedia.org/wiki/Z-test for additional information.↩︎

  5. See https://en.wikipedia.org/wiki/Student%27s_t-test for additional information.↩︎

  6. This is also known as Welch’s t-test.↩︎

  7. See https://en.wikipedia.org/wiki/Levene%27s_test.↩︎

  8. See https://machinelearningmastery.com/statistical-power-and-power-analysis-in-python/ and https://www.statmethods.net/stats/power.html.↩︎

  9. See https://en.wikipedia.org/wiki/Effect_size.↩︎

  10. https://measuringu.com/randomization-test/, https://www.uvm.edu/~statdhtx/StatPages/Randomization%20Tests/RandomizationTestsOverview.html, https://cs.stanford.edu/people/wmorgan/sigtest.pdf↩︎