import numpy as np

n = 10000
x1 = np.random.uniform(size=n)
x2 = np.random.normal(size=n)
eps = np.random.normal(scale=2, size=n)
y = 3 * x1 ** 2 + x2 / 3 + eps
Lab 2 Presentation - Data Visualization for Linear Modeling
We have data, and suppose we want to fit ordinary least squares (OLS): \[ \hat{\boldsymbol{\beta}}= (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{Y}, \] in order to minimize the average squared error loss \(\mathscr{L}(y, \hat{y}) = (y- \hat{y})^2\) on a test set.
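As a concrete reference point, here is a minimal NumPy sketch of the formula above (our own sketch, not part of the lab code; the design matrix X is assumed to already include an intercept column):

import numpy as np

def ols_fit(X, Y):
    # beta_hat = (X^T X)^{-1} X^T Y, computed by solving the normal equations
    return np.linalg.solve(X.T @ X, X.T @ Y)

# hypothetical usage on a train/test split:
# beta_hat = ols_fit(X_train, Y_train)
# test_loss = np.mean((Y_test - X_test @ beta_hat) ** 2)  # average squared error loss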
1 First attempt to plot
Let us explore the relation between x1 and y using a scatter plot.
from matplotlib import pyplot as plt
plt.scatter(x1, y)
There seems to be some kind of upward trend, but the data are very noisy. What can we do?
2 Introducing quantile_plot
This function bins the x values into bins quantile bins, and plots the mean y value for each bin.
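The binning step relies on np.quantile and np.searchsorted; here is a tiny standalone illustration of that step (our own example, not part of the lab code):

import numpy as np

x_demo = np.array([0.1, 0.4, 0.5, 0.9, 0.2, 0.7])
n_bins = 3
# interior cut points: the 1/3 and 2/3 empirical quantiles of x_demo
edges = np.quantile(x_demo, q=[i / n_bins for i in range(1, n_bins)])
# for each observation, the index (0, 1, or 2) of the quantile bin it falls into
fac = np.searchsorted(edges, x_demo)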
import numpy as np
from matplotlib import pyplot as plt
def quantile_plot(x, y, by=None, bins=10, by_bins=3, y_fn=np.mean):
    assert len(x) == len(y)

    def qp_data(x, y):
        # assign each observation to one of `bins` quantile bins of x
        fac = np.searchsorted(np.quantile(x, q=[i / bins for i in range(1, bins)]), x)
        ufac = np.unique(fac)
        # per-bin mean of x and per-bin summary of y (the mean, by default)
        qx = np.array([np.mean(x[fac == f]) for f in ufac])
        qy = np.array([y_fn(y[fac == f]) for f in ufac])
        return qx, qy

    qx, qy = qp_data(x, y)
    if by is None:
        plt.plot(qx, qy, "-o")
    else:
        assert len(x) == len(by)
        plt.plot(qx, qy, "-o", label="ALL", color="lightgrey")
        # bin the `by` variable into `by_bins` quantile bins and draw one curve per bin
        by_fac = np.searchsorted(np.quantile(by, q=[i / by_bins for i in range(1, by_bins)]), by)
        by_ufac = np.unique(by_fac)
        for i, f in enumerate(by_ufac):
            mask = by_fac == f
            nm = f"{i}) {min(by[mask]):.2f} / {max(by[mask]):.2f}"
            qx, qy = qp_data(x[mask], y[mask])
            plt.plot(qx, qy, "-o", label=nm)
        plt.legend()
Let us run quantile_plot on x1 and y.
quantile_plot(x1, y, bins=10)
The plot is now much less noisy, and the upward trend is very clear.
3 Is this a good fit for a linear model?
Is it linear? Let us add a line that represents the OLS prediction.
quantile_plot(x1, y)
# OLS slope for a no-intercept fit of y on x1
b1 = sum(x1 * y) / sum(x1 ** 2)
plt.plot(x1, x1 * b1, color='black', linewidth=1)
The black (OLS) line is not a good fit. Should we care?
Recall from class that the minimizer of the squared error is the conditional mean: \[
f^* = \arg\min\left\{\mathbb{E}\left[(y-f(\boldsymbol{x}))^2\right]: f \text{ function}\right\} \; \; \; \; \text{if and only if} \; \; \; \; f^*(\boldsymbol{x})=\mathbb{E}\left[y|\boldsymbol{x}\right].
\]
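One way to see this (a standard one-line decomposition, added here for completeness) is to expand the error around the conditional mean: \[
\mathbb{E}\left[(y-f(\boldsymbol{x}))^2\right] = \mathbb{E}\left[(y-\mathbb{E}\left[y|\boldsymbol{x}\right])^2\right] + \mathbb{E}\left[(\mathbb{E}\left[y|\boldsymbol{x}\right]-f(\boldsymbol{x}))^2\right],
\] where the cross term vanishes by the tower property. The first term does not depend on \(f\), and the second is minimized (to zero) by taking \(f(\boldsymbol{x})=\mathbb{E}\left[y|\boldsymbol{x}\right]\).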
The blue line is relatively smooth, to the extent that we may believe that the true mean of y as a function of \(x_1\) follows it closely. Therefore, the best possible prediction, if we want to minimize the squared error, would be roughly the blue line.
So, yes, we should care. If we run OLS on x1, we will predict the black line, which is quite far from optimal.
How can we improve the prediction?
4 Feature engineering
Let us try a quadratic relation, \(y \approx \beta x_1^2\). We create a new variable x1_sq, which is the square of x1, and repeat the process with x1_sq and y instead.
x1_sq = x1 ** 2
quantile_plot(x1_sq, y, bins=10)
b1_sq = sum(x1_sq * y) / sum(x1_sq ** 2)
plt.plot(x1_sq, x1_sq * b1_sq, color='black', linewidth=1)
The black (OLS) line looks like a good fit now. The deviations of the blue line are small and seemingly random. Or are they?
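One quick way to inspect those deviations (our own sketch, not part of the original lab) is to pass the residuals of the quadratic fit back to quantile_plot:

# residuals of the no-intercept quadratic fit; if the fit were adequate,
# the binned mean residual should hover around zero with no clear pattern
resid = y - x1_sq * b1_sq
quantile_plot(x1_sq, resid, bins=10)
plt.axhline(0, color='black', linewidth=1)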
In any case, OLS in x1_sq looks like a better fit than OLS in x1. We seem to be predicting close to the true conditional mean: \[
\mathbb{E}\left[y|x_1\right] \approx \beta x_1^2.
\]
Will this translate to reduced squared error? Let us check.
def squared_error(y, yhat):
    return np.mean((y - yhat) ** 2)

squared_error(y, np.mean(y)), squared_error(y, x1 * b1), squared_error(y, x1_sq * b1_sq)
(4.903522114285841, 4.218220045517391, 4.087226795661047)
On the left is the variance of \(y\), which is the squared error if we predict the empirical average of \(y\). In the middle is the squared error for the linear model in \(x_1\). On the right is the squared error for the linear model in \(x_1^2\).
We see that the model with \(x_1^2\) is indeed better than the model in \(x_1\), but not by that much. This is probably as much as we can get from this feature alone.
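Note that the errors above are computed on the same data used to fit the coefficients. Here is a rough sketch of how the same comparison could be run on a held-out test set, as in the loss defined at the top (the split below is our own addition):

rng = np.random.default_rng(0)
idx = rng.permutation(n)
train, test = idx[: n // 2], idx[n // 2:]

# fit the no-intercept slope on the training half only
b1_sq_train = sum(x1_sq[train] * y[train]) / sum(x1_sq[train] ** 2)

# held-out squared errors: predict-the-mean baseline vs. the model in x1_sq
squared_error(y[test], np.mean(y[train])), squared_error(y[test], x1_sq[test] * b1_sq_train)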
5 Interaction
Let us try one more thing. We add to the mix another feature, \(x_2\), and plot the joint contribution of \(x_1^2\) and \(x_2\) to \(y\). quantile_plot allows this by binning the by variable (in this case x2) into by_bins bins, and plotting one quantile curve for each level of the by variable.
In the case below, the blue line corresponds to the quantile_plot between x1_sq and y for data with low x2. The orange line corresponds to medium-range x2, and the green line to high values of x2.
quantile_plot(x1_sq, y, bins=10, by=x2, by_bins=3)
We see that the three lines are relatively parallel, which means there is probably no interaction between x1_sq and x2. The fact that the green line lies above the orange line, which lies above the blue line, indicates a main effect for x2. The fact that each of the lines is increasing indicates a main effect for x1_sq.
Taken together, we should probably include both x1_sq and x2 in our model, as two separate features.
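As a closing sketch (our own addition, with an intercept included as well), one way to fit such a two-feature model and check its in-sample squared error:

# design matrix with an intercept, x1_sq, and x2 as two separate (additive) features
X = np.column_stack([np.ones(n), x1_sq, x2])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
squared_error(y, X @ beta_hat)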