With big data becoming more and more prominent, high-dimensional data poses a relevant challenge to meaningful analysis. As dimensionality increases, the volume of the space grows exponentially and the data become sparse, making it difficult to detect underlying patterns and relationships. Dimensionality-reduction techniques address this problem, the most popular being Principal Component Analysis (PCA). It is therefore important to understand exactly what makes PCA work, and to study and generalize it in order to leave room for new variants tailored to different fields and applications. This paper examines PCA from a linear algebra perspective, particularly through module theory. We prove that PCA is a module homomorphism and, when all principal components are kept, a module automorphism, meaning that it is structure-preserving and invertible. We then examine what happens algebraically when only a subset of principal components is kept: the map is then merely a module epimorphism, not an isomorphism; it remains structure-preserving but is no longer invertible. From these findings, we identify three essential algebraic properties of Principal Component Analysis: (1) the transformation must be linear, (2) it must project the data onto a new orthonormal basis, and (3) it must diagonalize the covariance (or correlation) matrix of the centered dataset. These properties yield an algebraic definition of PCA: a module automorphism that diagonalizes the covariance structure of the original dataset via an orthogonal change of basis.
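As a concrete illustration of the three properties above, the following NumPy sketch (not part of the paper's formal development; the dataset and variable names are illustrative) shows PCA as an orthogonal change of basis that diagonalizes the covariance matrix, invertible when all components are kept and non-invertible when truncated.

```python
import numpy as np

# Illustrative toy dataset: 500 observations in 4 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))
Xc = X - X.mean(axis=0)                  # center the data

# Property (3): eigendecompose the covariance matrix of the centered data.
C = np.cov(Xc, rowvar=False)
eigvals, W = np.linalg.eigh(C)           # columns of W form an orthonormal basis

# Properties (1) and (2): a linear map onto the orthonormal eigenbasis.
Y = Xc @ W

# The covariance in the new basis is diagonal (up to numerical error).
assert np.allclose(np.cov(Y, rowvar=False), np.diag(eigvals), atol=1e-10)

# All components kept: the change of basis is invertible (an automorphism).
assert np.allclose(Y @ W.T, Xc)

# Only k components kept: the map is surjective onto the smaller space
# (an epimorphism) but not invertible; reconstruction is only approximate.
k = 2
W_k = W[:, -k:]                          # eigh sorts eigenvalues ascending
X_approx = (Xc @ W_k) @ W_k.T
print("reconstruction error:", np.linalg.norm(Xc - X_approx))
```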