Math for Machine Learning Glossary

Welcome to the Math for Machine Learning Glossary. Here you will find definitions for common mathematical terms used throughout the course.

Linear Algebra

Term Full Name Definition
Scalar Scalar A single number (Rank 0 Tensor) representing magnitude only.
Vector Vector An ordered list of numbers (Rank 1 Tensor) representing magnitude and direction.
Basis Vector Basis Vector One of a set of linearly independent vectors that together span a vector space (e.g., i, j, k).
Matrix Matrix A rectangular array of numbers (Rank 2 Tensor) arranged in rows and columns.
Tensor Tensor A multidimensional array of numbers (Rank N) generalizing scalars, vectors, and matrices.
Rank Tensor Rank The number of dimensions (axes) of a tensor (not to be confused with Matrix Rank).
Dot Product Dot Product (Scalar Product) An algebraic operation that takes two equal-length sequences of numbers and returns a single number, measuring similarity.
Linear Transformation Linear Transformation A mapping between two vector spaces that preserves the operations of vector addition and scalar multiplication.
Gaussian Elimination Gaussian Elimination An algorithm for solving systems of linear equations by transforming the system’s matrix into row-echelon form.
Determinant Determinant A scalar value derived from a square matrix that characterizes properties of the linear transformation (e.g., scaling factor).
Cosine Similarity Cosine Similarity A measure of similarity between two non-zero vectors, defined as the cosine of the angle between them.
Eigenvalues Eigenvalues A special set of scalars associated with a matrix (the λ in Av = λv), sometimes also known as characteristic roots.
Eigenvectors Eigenvectors A non-zero vector that changes at most by a scalar factor (its eigenvalue) when a linear transformation is applied to it.
SVD Singular Value Decomposition A factorization of a real or complex matrix that generalizes the eigendecomposition of a square normal matrix to any m × n matrix via an extension of the polar decomposition.
PCA Principal Component Analysis A dimensionality reduction method that transforms a large set of variables into a smaller one that still contains most of the information in the large set.
Characteristic Equation Characteristic Equation The equation det(A - λI) = 0 whose roots are the eigenvalues of the matrix A.
Trace Trace The sum of the elements on the main diagonal of a square matrix. It is also the sum of the eigenvalues.
Diagonalization Diagonalization The process of finding a diagonal matrix that is similar to a given matrix, typically via eigendecomposition A = QΛQ⁻¹.
Rank-1 Approximation Rank-1 Approximation Approximating a matrix as the outer product of two vectors (plus a scalar weight), often the first term of an SVD.
Covariance Matrix Covariance Matrix A square matrix giving the covariance between each pair of elements of a given random vector.
Positive Definite Positive Definite Matrix A symmetric matrix M where xᵀMx > 0 for all non-zero vectors x. All eigenvalues are positive.
Orthogonal Matrix Orthogonal Matrix A square matrix Q whose columns and rows are orthogonal unit vectors (QᵀQ = QQᵀ = I).
Basis Basis A set of linearly independent vectors that span a vector space.
Span Span The set of all possible linear combinations of a given set of vectors.
Matrix Rank Matrix Rank The dimension of the vector space generated (or spanned) by the matrix’s columns (or rows).
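
Several of the terms above can be sketched together with NumPy (illustrative values of our choosing, not course code):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])  # symmetric, positive definite

# Eigenvalues: roots of the characteristic equation det(A - lambda*I) = 0
eigvals, eigvecs = np.linalg.eig(A)

# Trace equals the sum of the eigenvalues; determinant equals their product
assert np.isclose(np.trace(A), eigvals.sum())
assert np.isclose(np.linalg.det(A), eigvals.prod())

# SVD, and a rank-1 approximation from the first singular triplet
U, s, Vt = np.linalg.svd(A)
A1 = s[0] * np.outer(U[:, 0], Vt[0, :])

# Cosine similarity: cosine of the angle between two non-zero vectors
u, v = np.array([1.0, 0.0]), np.array([1.0, 1.0])
cos_sim = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
```

The trace and determinant identities hold for any square matrix, which makes them handy sanity checks when working with eigendecompositions by hand.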

Calculus & Optimization

Term Full Name Definition
Derivative Derivative The instantaneous rate of change of a function with respect to one of its variables (slope of the tangent line).
Partial Derivative Partial Derivative The derivative of a multivariable function with respect to one variable, treating the others as constants.
Gradient Gradient Vector A vector containing all the partial derivatives of a function, pointing in the direction of the steepest ascent.
Jacobian Jacobian Matrix A matrix of all first-order partial derivatives of a vector-valued function. Essential for Backpropagation.
Hessian Hessian Matrix A square matrix of second-order partial derivatives of a scalar-valued function. Describes local curvature.
Chain Rule Chain Rule A formula for computing the derivative of the composition of two or more functions.
Taylor Series Taylor Series An infinite sum of terms that are expressed in terms of the function’s derivatives at a single point, used for approximation.
Learning Rate Learning Rate A hyperparameter that controls how much to change the model in response to the estimated error each time the model weights are updated.
Convex Function Convex Function A function where a line segment connecting any two points on the graph lies above or on the graph (guarantees Global Minimum).
Saddle Point Saddle Point A critical point where the slopes (derivatives) in orthogonal directions are all zero, but which is not a local extremum of the function.
SGD Stochastic Gradient Descent An iterative method for optimizing an objective function with suitable smoothness properties (e.g., differentiable or subdifferentiable).
Momentum Momentum A technique to accelerate gradient descent that accumulates a velocity vector in directions of persistent reduction in the objective across iterations.
Adam Adaptive Moment Estimation An optimization algorithm that adapts the learning rate for each parameter, combining ideas from Momentum and RMSProp.
Softmax Softmax Function A function that converts a vector of numbers into a vector of probabilities, where each probability is proportional to the exponential of the corresponding input value.
AutoDiff Automatic Differentiation A family of techniques to evaluate the derivative of a function specified by a computer program, efficient and exact (unlike numerical differentiation).
Backpropagation Backpropagation The primary algorithm for training neural networks, calculating the gradient of the loss function with respect to weights by applying the Chain Rule backwards.
Computational Graph Computational Graph A directed graph where nodes represent mathematical operations (add, multiply, ReLU) and edges represent the flow of data (tensors).
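
A minimal sketch of gradient descent with momentum on a convex function, f(x, y) = x² + 10y² (the function, step sizes, and iteration count are illustrative choices, not from the course):

```python
import numpy as np

def grad(p):
    # Gradient vector: the partial derivatives of f(x, y) = x^2 + 10*y^2
    x, y = p
    return np.array([2.0 * x, 20.0 * y])

p = np.array([5.0, 2.0])   # starting point
v = np.zeros(2)            # velocity accumulator (Momentum)
lr, beta = 0.05, 0.9       # learning rate and momentum coefficient

for _ in range(200):
    v = beta * v - lr * grad(p)  # accumulate velocity in persistent directions
    p = p + v

# Because f is convex, the iterates approach the global minimum at (0, 0).
```

Setting beta to 0 recovers plain gradient descent; the velocity term is what lets momentum keep moving through flat regions and damp oscillation across steep ones.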

Probability & Statistics

Term Full Name Definition
PDF Probability Density Function A function whose integral over any interval gives the probability that the random variable falls within that interval (for continuous variables).
PMF Probability Mass Function A function that gives the probability that a discrete random variable is exactly equal to some value.
Normal Distribution Normal (Gaussian) Distribution A continuous probability distribution (bell curve) characterized by its mean and standard deviation.
CLT Central Limit Theorem A theorem stating that the sum of many independent random variables tends toward a normal distribution, regardless of the original distribution.
Bayes’ Theorem Bayes’ Theorem A mathematical formula for determining conditional probability, updating the probability of a hypothesis as more evidence becomes available.  
Expectation Expectation (Mean) The weighted average of all possible values that a random variable can take on.  
Variance Variance A measure of how spread out a set of numbers is from their average value.  
Covariance Covariance A measure of the joint variability of two random variables.  
Correlation Correlation A normalized measure of the relationship between two variables, ranging from -1 to 1.  
P-Value P-Value The probability of obtaining test results at least as extreme as the results actually observed, assuming that the null hypothesis is correct.  
Sample Space Sample Space (Ω) The set of all possible outcomes of a random experiment.  
Event Event A subset of the sample space (a set of outcomes) to which a probability is assigned.  
Naive Bayes Naive Bayes Classifier A probabilistic classifier based on applying Bayes’ theorem with strong (naive) independence assumptions between the features.  
Laplace Smoothing Laplace Smoothing A technique used to handle zero-probability problems by adding a small positive count (usually 1) to each observation count.  
Prior Prior Probability (P(A)) The initial probability of an event or hypothesis before new evidence is taken into account.  
Posterior Posterior Probability (P(A|B)) The updated probability of an event or hypothesis after new evidence has been considered.
Likelihood Likelihood (P(B|A)) The probability of the evidence given that the hypothesis is true.
Type I Error Type I Error (α) A “False Positive”: Rejecting the null hypothesis when it is actually true.  
Type II Error Type II Error (β) A “False Negative”: Failing to reject the null hypothesis when it is actually false.  
Cross-Entropy Cross-Entropy Loss A measure of the difference between two probability distributions for a given random variable or set of events.  
Frequentist Frequentist Statistics A framework where probability is interpreted as the long-run frequency of repeatable events.  
Bayesian Bayesian Statistics A framework where probability is interpreted as a degree of belief, updated as more evidence becomes available.  
Outlier Outlier A data point that differs significantly from other observations, often skewing statistical measures like Mean and Correlation.  
Log-Sum-Exp Log-Sum-Exp Trick A numerical technique used to calculate the logarithm of the sum of exponentials of input values, used to prevent underflow/overflow.  
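
Bayes' theorem (prior, likelihood, evidence, posterior) and the log-sum-exp trick, sketched with made-up numbers chosen for illustration:

```python
import numpy as np

# Hypothetical diagnostic-test numbers (not from the course)
prior = 0.01        # P(H): prior probability of the hypothesis
likelihood = 0.95   # P(E | H): probability of the evidence given H
false_pos = 0.05    # P(E | not H)

# P(E) by total probability, then Bayes' theorem for the posterior
evidence = likelihood * prior + false_pos * (1.0 - prior)
posterior = likelihood * prior / evidence   # P(H | E)

# Log-sum-exp trick: log(sum(exp(x))) without overflow, by factoring out max(x)
def logsumexp(x):
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1000.0, 1000.0])  # naive np.exp(x) would overflow to inf
result = logsumexp(x)           # equals 1000 + log(2)
```

Note how a rare condition (prior 1%) keeps the posterior modest even with an accurate test; this is the classic base-rate effect that Bayes' theorem makes explicit.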

Discrete Math & Information Theory

Term Full Name Definition
Bit Bit (Binary Digit) The basic unit of information in computing and digital communications.
Surprisal Surprisal (Self-Information) A measure of the information content associated with an event. Rare events have high surprisal.
Entropy Entropy (Shannon) A measure of the average unpredictability of a random variable, or equivalently, of its average information content.
KL Divergence Kullback-Leibler Divergence A measure of how one probability distribution is different from a second, reference probability distribution.
Graph Graph A structure amounting to a set of objects in which some pairs of the objects are in some sense “related”.
Adjacency Matrix Adjacency Matrix A square matrix used to represent a finite graph. The elements of the matrix indicate whether pairs of vertices are adjacent or not in the graph.
Adjacency List Adjacency List A collection of unordered lists used to represent a finite graph. Each list describes the set of neighbors of a vertex in the graph.
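
Entropy and KL divergence for discrete distributions can be computed directly from their definitions (a sketch in bits; the coin examples are ours):

```python
import numpy as np

def entropy(p):
    # Shannon entropy in bits; terms with p = 0 contribute 0 by convention
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl_divergence(p, q):
    # KL(p || q): extra bits needed when coding p with a code built for q
    p, q = np.asarray(p), np.asarray(q)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

fair = [0.5, 0.5]
biased = [0.9, 0.1]
# A fair coin carries 1 bit per flip; a biased coin carries less.
h_fair, h_biased = entropy(fair), entropy(biased)
```

KL divergence is always non-negative and is zero exactly when the two distributions coincide, which is why it appears as a penalty term in models like VAEs.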

Signal Processing & Complex Systems

Term Full Name Definition
Fourier Transform Fourier Transform A mathematical transform that decomposes functions (signals) into frequency components (sine waves).
DFT Discrete Fourier Transform The discrete version of the Fourier Transform, used for digital signal processing.
FFT Fast Fourier Transform An algorithm that computes the DFT of a sequence in O(N log N) time.
Convolution Theorem Convolution Theorem A principle stating that convolution in the time domain corresponds to multiplication in the frequency domain.
Complex Number Complex Number A number that can be expressed in the form a + bi, where a and b are real numbers and i is the imaginary unit.
Euler’s Formula Euler’s Formula A fundamental equation in complex analysis: e^(ix) = cos x + i sin x.
Quaternion Quaternion A number system that extends the complex numbers to 4 dimensions (w + xi + yj + zk), used for 3D rotations.
Gimbal Lock Gimbal Lock A state where one degree of freedom is lost because two axes of rotation become parallel.
Self-Attention Self-Attention A mechanism in Transformers that relates different positions of a single sequence to compute a representation of the sequence.
Positional Encoding Positional Encoding A technique to inject information about the position of tokens in a sequence, since Transformers process them in parallel.
VAE Variational Autoencoder A generative model that learns a probabilistic mapping to a latent space (usually Gaussian).
Latent Space Latent Space A compressed, abstract representation of data (manifold) where similar data points are closer together.
Reparameterization Trick Reparameterization Trick A technique used in VAEs to backpropagate through a random sampling node by rewriting the random variable as a deterministic function of parameters and noise.
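
The Convolution Theorem and Euler's formula can both be checked numerically with NumPy's FFT (the signals below are arbitrary illustrative values):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
h = np.array([0.0, 1.0, 0.5, 0.25])
n = len(x)

# Circular convolution computed directly from its definition
direct = np.array([sum(x[k] * h[(i - k) % n] for k in range(n))
                   for i in range(n)])

# Convolution Theorem: convolution in time = multiplication in frequency,
# so the same result comes from IFFT(FFT(x) * FFT(h))
via_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)).real

# Euler's formula: e^(i*theta) = cos(theta) + i*sin(theta)
theta = 0.7
lhs = np.exp(1j * theta)
rhs = np.cos(theta) + 1j * np.sin(theta)
```

This frequency-domain route is what makes FFT-based convolution fast: the direct sum costs O(N²) while the FFT path costs O(N log N).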