# Math for Machine Learning Glossary
Welcome to the Math for Machine Learning Glossary. Here you will find definitions for common mathematical terms used throughout the course.
## Linear Algebra
| Term | Full Name | Definition |
|---|---|---|
| Scalar | Scalar | A single number (Rank 0 Tensor) representing magnitude only. |
| Vector | Vector | An ordered list of numbers (Rank 1 Tensor) representing magnitude and direction. |
| Basis Vector | Basis Vector | One of a set of linearly independent vectors that span a vector space (e.g., the unit vectors i, j, k). |
| Matrix | Matrix | A rectangular array of numbers (Rank 2 Tensor) arranged in rows and columns. |
| Tensor | Tensor | A multidimensional array of numbers (Rank N) generalizing scalars, vectors, and matrices. |
| Rank | Tensor Rank | The number of dimensions (axes) of a tensor (not to be confused with Matrix Rank). |
| Dot Product | Dot Product (Scalar Product) | An algebraic operation that takes two equal-length sequences of numbers and returns a single number, measuring similarity. |
| Linear Transformation | Linear Transformation | A mapping between two vector spaces that preserves the operations of vector addition and scalar multiplication. |
| Gaussian Elimination | Gaussian Elimination | An algorithm for solving systems of linear equations by transforming the system’s matrix into row-echelon form. |
| Determinant | Determinant | A scalar value derived from a square matrix that characterizes properties of the linear transformation (e.g., scaling factor). |
| Cosine Similarity | Cosine Similarity | A measure of similarity between two non-zero vectors, defined as the cosine of the angle between them (see the sketch after this table). |
| Eigenvalues | Eigenvalues | The scalars λ of a square matrix A for which Av = λv holds for some non-zero vector v; sometimes also known as characteristic roots. |
| Eigenvectors | Eigenvectors | A non-zero vector that changes at most by a scalar factor (its eigenvalue) when the associated linear transformation is applied to it. |
| SVD | Singular Value Decomposition | A factorization of any m × n real or complex matrix as UΣV^T, generalizing the eigendecomposition of a square normal matrix. |
| PCA | Principal Component Analysis | A dimensionality reduction method that transforms a large set of variables into a smaller one that still contains most of the information in the large set. |
| Characteristic Equation | Characteristic Equation | The equation det(A - λI) = 0 whose roots are the eigenvalues of the matrix A. |
| Trace | Trace | The sum of the elements on the main diagonal of a square matrix. It is also the sum of the eigenvalues. |
| Diagonalization | Diagonalization | The process of finding a diagonal matrix that is similar to a given matrix, typically via eigendecomposition A = QΛQ^(-1). |
| Rank-1 Approximation | Rank-1 Approximation | Approximating a matrix as the outer product of two vectors (plus a scalar weight), often the first term of an SVD. |
| Covariance Matrix | Covariance Matrix | A square matrix giving the covariance between each pair of elements of a given random vector. |
| Positive Definite | Positive Definite Matrix | A symmetric matrix M where x^T M x > 0 for all non-zero vectors x. All eigenvalues are positive. |
| Orthogonal Matrix | Orthogonal Matrix | A square matrix Q whose columns and rows are orthonormal vectors (Q^T Q = Q Q^T = I). |
| Basis | Basis | A set of linearly independent vectors that span a vector space. |
| Span | Span | The set of all possible linear combinations of a given set of vectors. |
| Matrix Rank | Matrix Rank | The dimension of the vector space generated (or spanned) by the matrix’s columns (or rows). |
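Several of the terms above are easiest to see in code. Below is a minimal NumPy sketch (the vectors and matrices are made-up illustrative values) demonstrating the dot product, cosine similarity, eigendecomposition, the trace–eigenvalue identity, and a rank-1 approximation from the SVD:

```python
import numpy as np

# Two example vectors (illustrative values).
u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 0.0, 1.0])

# Dot product: one number summarizing how aligned the vectors are.
dot = u @ v                                   # 1*2 + 2*0 + 3*1 = 5.0

# Cosine similarity: the dot product normalized by both magnitudes.
cos_sim = dot / (np.linalg.norm(u) * np.linalg.norm(v))

# Eigendecomposition: A @ q = lam * q for each eigenpair (lam, q).
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigvals, eigvecs = np.linalg.eig(A)           # eigenvalues 3.0 and 1.0
assert np.isclose(np.trace(A), eigvals.sum()) # trace = sum of eigenvalues

# Rank-1 approximation: keep only the largest singular value of the SVD.
M = np.array([[3.0, 1.0, 0.0],
              [1.0, 3.0, 1.0]])
U, S, Vt = np.linalg.svd(M)
M_rank1 = S[0] * np.outer(U[:, 0], Vt[0, :])  # best rank-1 fit to M
```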
## Calculus & Optimization
| Term | Full Name | Definition |
|---|---|---|
| Derivative | Derivative | The instantaneous rate of change of a function with respect to one of its variables (slope of the tangent line). |
| Partial Derivative | Partial Derivative | The derivative of a multivariable function with respect to one variable, treating the others as constants. |
| Gradient | Gradient Vector | A vector containing all the partial derivatives of a function, pointing in the direction of the steepest ascent. |
| Jacobian | Jacobian Matrix | A matrix of all first-order partial derivatives of a vector-valued function. Essential for Backpropagation. |
| Hessian | Hessian Matrix | A square matrix of second-order partial derivatives of a scalar-valued function. Describes local curvature. |
| Chain Rule | Chain Rule | A formula for computing the derivative of the composition of two or more functions. |
| Taylor Series | Taylor Series | An infinite sum of terms that are expressed in terms of the function’s derivatives at a single point, used for approximation. |
| Learning Rate | Learning Rate | A hyperparameter that controls how much to change the model in response to the estimated error each time the model weights are updated. |
| Convex Function | Convex Function | A function where the line segment connecting any two points on its graph lies on or above the graph; any local minimum is therefore a global minimum. |
| Saddle Point | Saddle Point | A critical point where the slopes in orthogonal directions are all zero but which is not a local extremum (the function curves up in one direction and down in another). |
| SGD | Stochastic Gradient Descent | An iterative optimization method that estimates the gradient from a randomly sampled mini-batch of the data rather than the full dataset, trading gradient noise for much cheaper updates. |
| Momentum | Momentum | A technique to accelerate gradient descent that accumulates a velocity vector in directions of persistent reduction in the objective across iterations. |
| Adam | Adaptive Moment Estimation | An optimization algorithm that adapts the learning rate for each parameter, combining ideas from Momentum and RMSProp. |
| Softmax | Softmax Function | A function that converts a vector of real numbers into a probability distribution, where each probability is proportional to the exponential of the corresponding value (see the sketch after this table). |
| AutoDiff | Automatic Differentiation | A family of techniques to evaluate the derivative of a function specified by a computer program, efficient and exact (unlike numerical differentiation). |
| Backpropagation | Backpropagation | The primary algorithm for training neural networks, calculating the gradient of the loss function with respect to weights by applying the Chain Rule backwards. |
| Computational Graph | Computational Graph | A directed graph where nodes represent mathematical operations (add, multiply, ReLU) and edges represent the flow of data (tensors). |
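To tie a few of these together, here is a minimal sketch of gradient descent with momentum on a simple convex quadratic, plus a numerically stable softmax. The objective, hyperparameters, and helper names are illustrative choices, not prescribed by the course:

```python
import numpy as np

def f(w):
    """A simple convex objective: f(w) = (w - 3)^2, minimized at w = 3."""
    return (w - 3.0) ** 2

def grad_f(w):
    """Its derivative (slope of the tangent line): f'(w) = 2(w - 3)."""
    return 2.0 * (w - 3.0)

# Gradient descent with momentum: accumulate a velocity vector in the
# direction of persistent descent. Hyperparameters here are illustrative.
w, velocity = 0.0, 0.0
learning_rate, beta = 0.1, 0.9
for _ in range(200):
    velocity = beta * velocity - learning_rate * grad_f(w)
    w += velocity
# Convexity guarantees this converges toward the global minimum w = 3.

def softmax(z):
    """Map scores to probabilities proportional to exp(z_i)."""
    z = z - np.max(z)        # shift by the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # outputs sum to 1.0
```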
## Probability & Statistics
| Term | Full Name | Definition |
|---|---|---|
| PDF | Probability Density Function | A function whose integral over any interval gives the probability that a continuous random variable falls within that interval. |
| PMF | Probability Mass Function | A function that gives the probability that a discrete random variable is exactly equal to some value. |
| Normal Distribution | Normal (Gaussian) Distribution | A continuous probability distribution (bell curve) characterized by its mean and standard deviation. |
| CLT | Central Limit Theorem | A theorem stating that the sum of many independent random variables tends toward a normal distribution, regardless of the original distribution. |
| Bayes’ Theorem | Bayes’ Theorem | A formula for computing conditional probability, updating the probability of a hypothesis as more evidence becomes available (see the sketch after this table). |
| Expectation | Expectation (Mean) | The weighted average of all possible values that a random variable can take on. |
| Variance | Variance | A measure of how spread out a set of numbers is from their average value. |
| Covariance | Covariance | A measure of the joint variability of two random variables. |
| Correlation | Correlation | A normalized measure of the relationship between two variables, ranging from -1 to 1. |
| P-Value | P-Value | The probability of obtaining test results at least as extreme as those actually observed, assuming the null hypothesis is correct. |
| Sample Space | Sample Space (Ω) | The set of all possible outcomes of a random experiment. |
| Event | Event | A subset of the sample space (a set of outcomes) to which a probability is assigned. |
| Naive Bayes | Naive Bayes Classifier | A probabilistic classifier based on applying Bayes’ theorem with strong (naive) independence assumptions between the features. |
| Laplace Smoothing | Laplace Smoothing | A technique that handles zero-probability problems by adding a small positive count (usually 1) to each observation count. |
| Prior | Prior Probability, P(A) | The initial probability of an event or hypothesis before new evidence is taken into account. |
| Posterior | Posterior Probability, P(A\|B) | The updated probability of an event or hypothesis after new evidence has been considered. |
| Likelihood | Likelihood, P(B\|A) | The probability of the evidence given that the hypothesis is true. |
| Type I Error | Type I Error (α) | A “False Positive”: rejecting the null hypothesis when it is actually true. |
| Type II Error | Type II Error (β) | A “False Negative”: failing to reject the null hypothesis when it is actually false. |
| Cross-Entropy | Cross-Entropy Loss | A measure of the difference between two probability distributions for a given random variable or set of events. |
| Frequentist | Frequentist Statistics | A framework where probability is interpreted as the long-run frequency of repeatable events. |
| Bayesian | Bayesian Statistics | A framework where probability is interpreted as a degree of belief, updated as more evidence becomes available. |
| Outlier | Outlier | A data point that differs significantly from other observations, often skewing statistical measures like the mean and correlation. |
| Log-Sum-Exp | Log-Sum-Exp Trick | A numerically stable way to compute the logarithm of a sum of exponentials, preventing underflow and overflow (see the sketch after this table). |
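Here is a short sketch tying together Bayes’ theorem, Laplace smoothing, and the log-sum-exp trick. All numbers (test sensitivity, prevalence, counts) are made up for illustration:

```python
import numpy as np

# Bayes’ theorem with made-up numbers: a 99%-sensitive test with a 5%
# false-positive rate, for a condition with 1% prior prevalence.
prior = 0.01                      # P(disease)
likelihood = 0.99                 # P(positive | disease)
false_pos = 0.05                  # P(positive | no disease)
evidence = likelihood * prior + false_pos * (1 - prior)
posterior = likelihood * prior / evidence
print(posterior)                  # ~0.167: still unlikely despite a positive test

# Laplace smoothing: add 1 to every count so nothing gets probability 0.
counts = np.array([3, 0, 7])
probs = (counts + 1) / (counts + 1).sum()   # no zeros, still sums to 1

# Log-sum-exp trick: log(sum(exp(x))) without overflow, by factoring
# out the maximum before exponentiating.
def logsumexp(x):
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1000.0, 1000.1])    # np.exp(1000) alone would overflow
print(logsumexp(x))               # ~1000.74
```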
## Discrete Math & Information Theory
| Term | Full Name | Definition |
|---|---|---|
| Bit | Bit (Binary Digit) | The basic unit of information in computing and digital communications. |
| Surprisal | Surprisal (Self-Information) | A measure of the information content associated with an event. Rare events have high surprisal. |
| Entropy | Entropy (Shannon) | A measure of the average unpredictability of a random variable’s outcomes, or equivalently its average information content (see the sketch after this table). |
| KL Divergence | Kullback-Leibler Divergence | A measure of how one probability distribution is different from a second, reference probability distribution. |
| Graph | Graph | A structure amounting to a set of objects in which some pairs of the objects are in some sense “related”. |
| Adjacency Matrix | Adjacency Matrix | A square matrix used to represent a finite graph. The elements of the matrix indicate whether pairs of vertices are adjacent or not in the graph. |
| Adjacency List | Adjacency List | A collection of unordered lists used to represent a finite graph. Each list describes the set of neighbors of a vertex in the graph. |
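Entropy, KL divergence, and the two graph representations each take only a few lines of NumPy. The distributions and the 3-vertex graph below are illustrative examples:

```python
import numpy as np

# Entropy of a discrete distribution, in bits.
def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                     # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))           # 1.0 bit: a fair coin flip
print(entropy([0.9, 0.1]))           # ~0.469 bits: more predictable

# KL divergence D(P || Q) in bits: how P differs from reference Q.
def kl_divergence(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

print(kl_divergence([0.9, 0.1], [0.5, 0.5]))   # ~0.531 bits

# The same 3-vertex path graph (edges 0-1 and 1-2), both ways.
adj_matrix = np.array([[0, 1, 0],
                       [1, 0, 1],
                       [0, 1, 0]])
adj_list = {0: [1], 1: [0, 2], 2: [1]}
```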
## Signal Processing & Complex Systems
| Term | Full Name | Definition |
|---|---|---|
| Fourier Transform | Fourier Transform | A mathematical transform that decomposes functions (signals) into frequency components (sine waves). |
| DFT | Discrete Fourier Transform | The discrete version of the Fourier Transform, used for digital signal processing. |
| FFT | Fast Fourier Transform | An algorithm that computes the DFT of a sequence in O(N log N) time. |
| Convolution Theorem | Convolution Theorem | A principle stating that convolution in the time domain corresponds to multiplication in the frequency domain. |
| Complex Number | Complex Number | A number that can be expressed in the form a + bi, where a and b are real numbers and i is the imaginary unit. |
| Euler’s Formula | Euler’s Formula | A fundamental equation in complex analysis: e^(ix) = cos x + i sin x (see the sketch after this table). |
| Quaternion | Quaternion | A number system that extends the complex numbers to 4 dimensions (w + xi + yj + zk), used for 3D rotations. |
| Gimbal Lock | Gimbal Lock | A state where one degree of freedom is lost because two axes of rotation become parallel. |
| Self-Attention | Self-Attention | A mechanism in Transformers that relates different positions of a single sequence to compute a representation of the sequence. |
| Positional Encoding | Positional Encoding | A technique to inject information about the position of tokens in a sequence, since Transformers process them in parallel. |
| VAE | Variational Autoencoder | A generative model that learns a probabilistic mapping to a latent space (usually Gaussian). |
| Latent Space | Latent Space | A compressed, abstract representation of data (manifold) where similar data points are closer together. |
| Reparameterization Trick | Reparameterization Trick | A technique used in VAEs to backpropagate through a random sampling node by rewriting the random variable as a deterministic function of parameters and noise. |
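Finally, a sketch that verifies Euler’s formula and the convolution theorem numerically and shows the reparameterization trick in one line. The signal values and Gaussian parameters are arbitrary illustrations:

```python
import numpy as np

# Euler's formula: e^(ix) = cos x + i sin x.
x = 0.75
assert np.isclose(np.exp(1j * x), np.cos(x) + 1j * np.sin(x))

# Convolution theorem: circular convolution in the time domain equals
# element-wise multiplication in the frequency domain.
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([0.5, 0.0, -0.5, 1.0])
via_fft = np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)).real
direct = np.array([sum(a[k] * b[(n - k) % 4] for k in range(4))
                   for n in range(4)])
assert np.allclose(via_fft, direct)

# Reparameterization trick: draw z ~ N(mu, sigma^2) as a deterministic
# function of (mu, sigma) plus parameter-free noise, so gradients can
# flow through mu and sigma during backpropagation.
mu, sigma = 1.5, 0.3
eps = np.random.randn()           # noise independent of the parameters
z = mu + sigma * eps
```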