Covariance and the Covariance Matrix

Published on 2023-09-04



1. Covariance

Definition: if the real-valued random variables X and Y have expected values E(X)=\mu and E(Y)=\nu, then their covariance is defined as:

\operatorname{cov}(X, Y)=\mathrm{E}[(X-\mu)(Y-\nu)]
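
As a minimal sketch of this definition, the expectation can be approximated by an average over paired samples. The helper name `sample_cov` and the sample values are illustrative assumptions; dividing by the number of samples gives the biased estimate, which corresponds to `ddof=0` in `numpy.cov`.

```python
import numpy as np

def sample_cov(x, y):
    """Estimate cov(X, Y) = E[(X - mu)(Y - nu)] from paired samples.

    Hypothetical helper: the expectation is replaced by a plain average,
    i.e. the sum of products is divided by the number of samples.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mu, nu = x.mean(), y.mean()
    return np.mean((x - mu) * (y - nu))

x = [-2.1, -1.0, 4.3]
y = [3.0, 1.1, 0.12]
c = sample_cov(x, y)

# Cross-check against NumPy's estimator with the same normalization.
assert np.isclose(c, np.cov(x, y, ddof=0)[0, 1])
```

A negative value here means that x tends to be above its mean when y is below its mean, and vice versa.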

2. Covariance matrix

Consider a random vector (also called a multivariate random variable), written as \mathbf{X} = \left[ x_1, x_2, x_3, \ldots, x_n \right]^\top, where n is the number of scalar random variables it contains. Its covariance matrix is defined as:

\mathbf{C}=\left[\begin{array}{cccc}
\operatorname{cov}\left(x_1, x_1\right)&\operatorname{cov}\left(x_1, x_2\right)&\ldots&\operatorname{cov}\left(x_1, x_n\right) \\
\operatorname{cov}\left(x_2, x_1\right)&\operatorname{cov}\left(x_2, x_2\right)&\ldots&\operatorname{cov}\left(x_2, x_n\right) \\
\vdots&\vdots&\ddots&\vdots \\
\operatorname{cov}\left(x_n, x_1\right)&\operatorname{cov}\left(x_n, x_2\right)&\ldots&\operatorname{cov}\left(x_n, x_n\right)
\end{array}\right]

When each x_i is observed m times, the expectations are estimated from those m samples. The unbiased sample estimate divides the sum of products by m-1 rather than m, which is where the factor \frac{1}{m-1} comes from.

The (i, j) entry of the covariance matrix is defined as:

c_{i j}=\operatorname{cov}\left(x_i, x_j\right)=\mathrm{E}\left[\left(x_i-\mu_i\right)\left(x_j-\mu_j\right)\right]

where \mu_i is the expected value of x_i, that is, \mu_i=\mathrm{E}\left(x_i\right). In matrix form, the covariance matrix is:

\mathbf{C} =\mathrm{E}\left[(\mathbf{X}-\mathrm{E}[\mathbf{X}])(\mathbf{X}-\mathrm{E}[\mathbf{X}])^{\mathrm{T}}\right]
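
The matrix form can be sketched numerically by centering the data and averaging outer products. This is only an illustration under stated assumptions: the helper name `cov_matrix` is hypothetical, variables are stored in rows and observations in columns (the default layout of `numpy.cov`), and the expectation is estimated by dividing by m - ddof, where m is the number of observations.

```python
import numpy as np

def cov_matrix(X, ddof=1):
    """Estimate E[(X - E[X])(X - E[X])^T] from observations.

    Hypothetical helper. X has shape (n, m): n variables in rows,
    m observations in columns. The result has shape (n, n).
    """
    X = np.asarray(X, dtype=float)
    m = X.shape[1]
    # Subtract each variable's mean from its own row.
    centered = X - X.mean(axis=1, keepdims=True)
    # Sum of outer products over observations, then normalize.
    return centered @ centered.T / (m - ddof)

X = np.array([[-2.1, -1.0, 4.3],
              [3.0, 1.1, 0.12]])
C = cov_matrix(X)

assert np.allclose(C, np.cov(X))  # matches NumPy's default (unbiased) estimate
assert np.allclose(C, C.T)        # a covariance matrix is symmetric
```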

Nomenclatures differ. Some statisticians, following the probabilist William Feller in his two-volume book An Introduction to Probability Theory and Its Applications [2], call the matrix \mathrm{K}_{\mathbf{X X}} (written \mathbf{C} above) the variance of the random vector \mathbf{X}, because it is the natural generalization to higher dimensions of the 1-dimensional variance. Others call it the covariance matrix, because it is the matrix of covariances between the scalar components of the vector \mathbf{X}.

\operatorname{var}(\mathbf{X})=\operatorname{cov}(\mathbf{X}, \mathbf{X})=\mathrm{E}\left[(\mathbf{X}-\mathrm{E}[\mathbf{X}])(\mathbf{X}-\mathrm{E}[\mathbf{X}])^{\mathrm{T}}\right] .

Both forms are quite standard, and there is no ambiguity between them. The matrix \mathrm{K}_{\mathbf{X X}} is also often called the variance-covariance matrix, since the diagonal terms are in fact variances.
By comparison, the notation for the cross-covariance matrix between two vectors is

\operatorname{cov}(\mathbf{X}, \mathbf{Y})=\mathrm{K}_{\mathbf{X Y}}=\mathrm{E}\left[(\mathbf{X}-\mathrm{E}[\mathbf{X}])(\mathbf{Y}-\mathrm{E}[\mathbf{Y}])^{\mathrm{T}}\right]
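
A cross-covariance matrix can be estimated in the same way from paired observations of two vectors. The sketch below is an assumption-laden illustration: `cross_cov` is a hypothetical helper, the data are randomly generated, and both inputs store variables in rows with the same number of observations in columns. The result is compared with the off-diagonal block that `numpy.cov` returns when the two arrays are passed together.

```python
import numpy as np

def cross_cov(X, Y, ddof=1):
    """Estimate K_XY = E[(X - E[X])(Y - E[Y])^T].

    Hypothetical helper. X has shape (n_x, m) and Y has shape (n_y, m);
    the result has shape (n_x, n_y).
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    m = X.shape[1]
    Xc = X - X.mean(axis=1, keepdims=True)
    Yc = Y - Y.mean(axis=1, keepdims=True)
    return Xc @ Yc.T / (m - ddof)

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 50))  # 2 variables, 50 observations
Y = rng.normal(size=(3, 50))  # 3 variables, 50 observations
K_xy = cross_cov(X, Y)

# np.cov(X, Y) stacks the variables of X and Y; the top-right block
# of the resulting 5x5 matrix is the cross-covariance of X and Y.
full = np.cov(X, Y)
assert np.allclose(K_xy, full[:2, 2:])
```

Unlike the covariance matrix of a single vector, K_XY is generally neither square nor symmetric.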

Example: suppose the variables x_1 and x_2 are each observed three times:

x_1 = [-2.1, -1, 4.3] \\
x_2 = [3.0, 1.1, 0.12]

They can be stacked into X:

X = np.stack((x1, x2), axis=0)

that is:

\left[\begin{array}{ccc}
-2.1&-1&4.3 \\
3.0&1.1&0.12
\end{array}\right]

Its covariance matrix can be computed with NumPy's numpy.cov() function:

import numpy as np

x1 = [-2.1, -1,  4.3]
x2 = [3,  1.1,  0.12]
X = np.stack((x1, x2), axis=0)  # one variable per row

>>> np.cov(X)
array([[11.71      , -4.286     ], # may vary
       [-4.286     ,  2.144133]])

>>> np.cov(x1, x2)
array([[11.71      , -4.286     ], # may vary
       [-4.286     ,  2.144133]])

>>> np.cov(x1, bias=False)
array(11.71)

>>> np.cov(x1, bias=True)
array(7.80666667)

>>> np.cov(x1, ddof=0)
array(7.80666667)

numpy.cov(m, y=None, rowvar=True, bias=False, ddof=None, fweights=None, aweights=None, *, dtype=None)
Note the default parameter values:
- With the default bias=False, the covariance is normalized by (m-1), where m is the number of observations given for each variable (the unbiased estimate). If bias is set to True, it is normalized by m instead.
- If ddof is not None, the default value implied by bias is overridden. Note that ddof=1 returns the unbiased estimate, even if both fweights and aweights are specified, and ddof=0 returns the simple average (dividing by the actual number of observations m). See the notes in the NumPy documentation for details. The default value is None.
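
The interaction between bias and ddof can be checked with a short experiment. This sketch reuses the x1 values from the example in this section and relies only on the documented behavior of numpy.cov; the variable names are illustrative.

```python
import numpy as np

x1 = [-2.1, -1.0, 4.3]
m = len(x1)

unbiased = np.cov(x1)                       # default bias=False: divide by m - 1
biased = np.cov(x1, bias=True)              # divide by m
overridden = np.cov(x1, bias=True, ddof=1)  # ddof takes precedence over bias

assert np.isclose(biased, unbiased * (m - 1) / m)
assert np.isclose(overridden, unbiased)
assert np.isclose(np.cov(x1, ddof=0), biased)
```

Passing ddof explicitly makes the normalization visible at the call site, which avoids depending on how bias is interpreted.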

3. Pearson correlation coefficient

Given the covariance matrix, the Pearson correlation coefficient can be computed with the following formula:

R_{i j}=\frac{c_{i j}}{\sqrt{c_{i i} c_{j j}}}

The values of R are between -1 and 1, inclusive.

In NumPy, it can be obtained directly with the numpy.corrcoef function.
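
As a small sketch, the formula for R_{ij} can be applied to the output of numpy.cov and compared with numpy.corrcoef. The data are the x1 and x2 values from the example in section 2; the intermediate names `std` and `R` are illustrative.

```python
import numpy as np

X = np.array([[-2.1, -1.0, 4.3],
              [3.0, 1.1, 0.12]])

C = np.cov(X)
# sqrt(c_ii) for every variable; the outer product gives sqrt(c_ii * c_jj).
std = np.sqrt(np.diag(C))
R = C / np.outer(std, std)

assert np.allclose(R, np.corrcoef(X))
assert np.allclose(np.diag(R), 1.0)  # each variable correlates perfectly with itself
```

Because the normalization factor appears in both the numerator and the denominator, R is the same whether the covariance is computed with ddof=0 or ddof=1.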

References:
1. Covariance (协方差) - Wikipedia, the free encyclopedia
2. Covariance matrix (协方差矩阵) - Wikipedia, the free encyclopedia
3. numpy.cov - NumPy v1.25 Manual (2023-09-04)
4. numpy.corrcoef - NumPy v1.25 Manual

Last updated on 2023-09-04