Math for Machine Learning Glossary

Welcome to the Math for Machine Learning Glossary. Here you will find definitions for common mathematical terms used throughout the course.

Linear Algebra

Term Full Name Definition
Scalar Scalar A single number (Rank 0 Tensor) representing magnitude only.
Vector Vector An ordered list of numbers (Rank 1 Tensor) representing magnitude and direction.
Basis Vector Basis Vector One of a set of linearly independent vectors that together span a vector space (e.g., i, j, k).
Matrix Matrix A rectangular array of numbers (Rank 2 Tensor) arranged in rows and columns.
Tensor Tensor A multidimensional array of numbers (Rank N) generalizing scalars, vectors, and matrices.
Rank Tensor Rank The number of dimensions (axes) of a tensor (not to be confused with Matrix Rank).
Dot Product Dot Product (Scalar Product) An algebraic operation that takes two equal-length sequences of numbers and returns a single number, measuring similarity.
Linear Transformation Linear Transformation A mapping between two vector spaces that preserves the operations of vector addition and scalar multiplication.
Gaussian Elimination Gaussian Elimination An algorithm for solving systems of linear equations by transforming the system’s matrix into row-echelon form.
Determinant Determinant A scalar value derived from a square matrix that characterizes properties of the linear transformation (e.g., scaling factor).
Cosine Similarity Cosine Similarity A measure of similarity between two non-zero vectors, defined as the cosine of the angle between them.
Eigenvalues Eigenvalues A special set of scalars associated with a matrix (the λ in Av = λv), sometimes also known as characteristic roots.
Eigenvectors Eigenvectors A non-zero vector that changes at most by a scalar factor (its eigenvalue) when a linear transformation is applied to it.
SVD Singular Value Decomposition A factorization of a real or complex matrix that generalizes the eigendecomposition of a square normal matrix to any m × n matrix via an extension of the polar decomposition.
PCA Principal Component Analysis A dimensionality reduction method that transforms a large set of variables into a smaller one that still contains most of the information in the large set.
Characteristic Equation Characteristic Equation The equation det(A - λI) = 0 whose roots are the eigenvalues of the matrix A.
Trace Trace The sum of the elements on the main diagonal of a square matrix. It is also the sum of the eigenvalues.
Diagonalization Diagonalization The process of finding a diagonal matrix that is similar to a given matrix, typically via eigendecomposition A = QΛQ⁻¹.
Rank-1 Approximation Rank-1 Approximation Approximating a matrix as the outer product of two vectors (plus a scalar weight), often the first term of an SVD.
Covariance Matrix Covariance Matrix A square matrix giving the covariance between each pair of elements of a given random vector.
Positive Definite Positive Definite Matrix A symmetric matrix M where xᵀMx > 0 for all non-zero vectors x. All eigenvalues are positive.
Orthogonal Matrix Orthogonal Matrix A square matrix Q whose columns and rows are orthogonal unit vectors (QᵀQ = QQᵀ = I).
Basis Basis A set of linearly independent vectors that span a vector space.
Span Span The set of all possible linear combinations of a given set of vectors.
Matrix Rank Matrix Rank The dimension of the vector space generated (or spanned) by the matrix’s columns (or rows).
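
Several of the terms above can be sketched together with NumPy (illustrative values of our choosing, not course code):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])  # symmetric, positive definite

# Eigenvalues: roots of the characteristic equation det(A - lambda*I) = 0
eigvals, eigvecs = np.linalg.eig(A)

# Trace equals the sum of the eigenvalues; determinant equals their product
assert np.isclose(np.trace(A), eigvals.sum())
assert np.isclose(np.linalg.det(A), eigvals.prod())

# SVD, and a rank-1 approximation from the first singular triplet
U, s, Vt = np.linalg.svd(A)
A1 = s[0] * np.outer(U[:, 0], Vt[0, :])

# Cosine similarity: cosine of the angle between two non-zero vectors
u, v = np.array([1.0, 0.0]), np.array([1.0, 1.0])
cos_sim = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
```

The trace and determinant identities hold for any square matrix, which makes them handy sanity checks when working with eigendecompositions by hand.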

Calculus & Optimization

Term Full Name Definition
Derivative Derivative The instantaneous rate of change of a function with respect to one of its variables (slope of the tangent line).
Partial Derivative Partial Derivative The derivative of a multivariable function with respect to one variable, treating the others as constants.
Gradient Gradient Vector A vector containing all the partial derivatives of a function, pointing in the direction of the steepest ascent.
Jacobian Jacobian Matrix A matrix of all first-order partial derivatives of a vector-valued function. Essential for Backpropagation.
Hessian Hessian Matrix A square matrix of second-order partial derivatives of a scalar-valued function. Describes local curvature.
Chain Rule Chain Rule A formula for computing the derivative of the composition of two or more functions.
Taylor Series Taylor Series An infinite sum of terms that are expressed in terms of the function’s derivatives at a single point, used for approximation.
Learning Rate Learning Rate A hyperparameter that controls how much to change the model in response to the estimated error each time the model weights are updated.
Convex Function Convex Function A function where a line segment connecting any two points on the graph lies above or on the graph (guarantees Global Minimum).
Saddle Point Saddle Point A critical point where the slopes (derivatives) in orthogonal directions are all zero, but which is not a local extremum of the function.
SGD Stochastic Gradient Descent An iterative method for optimizing an objective function with suitable smoothness properties (e.g., differentiable or subdifferentiable).
Momentum Momentum A technique to accelerate gradient descent that accumulates a velocity vector in directions of persistent reduction in the objective across iterations.
Adam Adaptive Moment Estimation An optimization algorithm that adapts the learning rate for each parameter, combining ideas from Momentum and RMSProp.
Softmax Softmax Function A function that converts a vector of numbers into a vector of probabilities, where each probability is proportional to the exponential of the corresponding input value.
AutoDiff Automatic Differentiation A family of techniques to evaluate the derivative of a function specified by a computer program, efficient and exact (unlike numerical differentiation).
Backpropagation Backpropagation The primary algorithm for training neural networks, calculating the gradient of the loss function with respect to weights by applying the Chain Rule backwards.
Computational Graph Computational Graph A directed graph where nodes represent mathematical operations (add, multiply, ReLU) and edges represent the flow of data (tensors).
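
A minimal sketch of gradient descent with momentum on a convex function, f(x, y) = x² + 10y² (the function, step sizes, and iteration count are illustrative choices, not from the course):

```python
import numpy as np

def grad(p):
    # Gradient vector: the partial derivatives of f(x, y) = x^2 + 10*y^2
    x, y = p
    return np.array([2.0 * x, 20.0 * y])

p = np.array([5.0, 2.0])   # starting point
v = np.zeros(2)            # velocity accumulator (Momentum)
lr, beta = 0.05, 0.9       # learning rate and momentum coefficient

for _ in range(200):
    v = beta * v - lr * grad(p)  # accumulate velocity in persistent directions
    p = p + v

# Because f is convex, the iterates approach the global minimum at (0, 0).
```

Setting beta to 0 recovers plain gradient descent; the velocity term is what lets momentum keep moving through flat regions and damp oscillation across steep ones.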

Probability & Statistics

Term Full Name Definition
PDF Probability Density Function A function whose integral over any interval gives the probability that the random variable falls within that interval (for continuous variables).
PMF Probability Mass Function A function that gives the probability that a discrete random variable is exactly equal to some value.
Normal Distribution Normal (Gaussian) Distribution A continuous probability distribution (bell curve) characterized by its mean and standard deviation.
CLT Central Limit Theorem A theorem stating that the sum of many independent random variables tends toward a normal distribution, regardless of the original distribution.
Bayes’ Theorem Bayes’ Theorem A mathematical formula for determining conditional probability, updating the probability of a hypothesis as more evidence becomes available.  
Expectation Expectation (Mean) The weighted average of all possible values that a random variable can take on.  
Variance Variance A measure of how spread out a set of numbers is from their average value.  
Covariance Covariance A measure of the joint variability of two random variables.  
Correlation Correlation A normalized measure of the relationship between two variables, ranging from -1 to 1.  
P-Value P-Value The probability of obtaining test results at least as extreme as the results actually observed, assuming that the null hypothesis is correct.  
Sample Space Sample Space (Ω) The set of all possible outcomes of a random experiment.  
Event Event A subset of the sample space (a set of outcomes) to which a probability is assigned.  
Naive Bayes Naive Bayes Classifier A probabilistic classifier based on applying Bayes’ theorem with strong (naive) independence assumptions between the features.  
Laplace Smoothing Laplace Smoothing A technique used to handle zero-probability problems by adding a small positive count (usually 1) to each observation count.  
Prior Prior Probability (P(A)) The initial probability of an event or hypothesis before new evidence is taken into account.  
Posterior Posterior Probability (P(A|B)) The updated probability of an event or hypothesis after new evidence has been considered.
Likelihood Likelihood (P(B|A)) The probability of the evidence given that the hypothesis is true.
Type I Error Type I Error (α) A “False Positive”: Rejecting the null hypothesis when it is actually true.  
Type II Error Type II Error (β) A “False Negative”: Failing to reject the null hypothesis when it is actually false.  
Cross-Entropy Cross-Entropy Loss A measure of the difference between two probability distributions for a given random variable or set of events.  
Frequentist Frequentist Statistics A framework where probability is interpreted as the long-run frequency of repeatable events.  
Bayesian Bayesian Statistics A framework where probability is interpreted as a degree of belief, updated as more evidence becomes available.  
Outlier Outlier A data point that differs significantly from other observations, often skewing statistical measures like Mean and Correlation.  
Log-Sum-Exp Log-Sum-Exp Trick A numerical technique used to calculate the logarithm of the sum of exponentials of input values, used to prevent underflow/overflow.  
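
Bayes' theorem (prior, likelihood, evidence, posterior) and the log-sum-exp trick, sketched with made-up numbers chosen for illustration:

```python
import numpy as np

# Hypothetical diagnostic-test numbers (not from the course)
prior = 0.01        # P(H): prior probability of the hypothesis
likelihood = 0.95   # P(E | H): probability of the evidence given H
false_pos = 0.05    # P(E | not H)

# P(E) by total probability, then Bayes' theorem for the posterior
evidence = likelihood * prior + false_pos * (1.0 - prior)
posterior = likelihood * prior / evidence   # P(H | E)

# Log-sum-exp trick: log(sum(exp(x))) without overflow, by factoring out max(x)
def logsumexp(x):
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1000.0, 1000.0])  # naive np.exp(x) would overflow to inf
result = logsumexp(x)           # equals 1000 + log(2)
```

Note how a rare condition (prior 1%) keeps the posterior modest even with an accurate test; this is the classic base-rate effect that Bayes' theorem makes explicit.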

Discrete Math & Information Theory

Term Full Name Definition
Bit Bit (Binary Digit) The basic unit of information in computing and digital communications.
Surprisal Surprisal (Self-Information) A measure of the information content associated with an event. Rare events have high surprisal.
Entropy Entropy (Shannon) A measure of the average unpredictability of a random variable, or equivalently, of its average information content.
KL Divergence Kullback-Leibler Divergence A measure of how one probability distribution is different from a second, reference probability distribution.
Graph Graph A structure amounting to a set of objects in which some pairs of the objects are in some sense “related”.
Adjacency Matrix Adjacency Matrix A square matrix used to represent a finite graph. The elements of the matrix indicate whether pairs of vertices are adjacent or not in the graph.
Adjacency List Adjacency List A collection of unordered lists used to represent a finite graph. Each list describes the set of neighbors of a vertex in the graph.
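
Entropy and KL divergence for discrete distributions can be computed directly from their definitions (a sketch in bits; the coin examples are ours):

```python
import numpy as np

def entropy(p):
    # Shannon entropy in bits; terms with p = 0 contribute 0 by convention
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl_divergence(p, q):
    # KL(p || q): extra bits needed when coding p with a code built for q
    p, q = np.asarray(p), np.asarray(q)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

fair = [0.5, 0.5]
biased = [0.9, 0.1]
# A fair coin carries 1 bit per flip; a biased coin carries less.
h_fair, h_biased = entropy(fair), entropy(biased)
```

KL divergence is always non-negative and is zero exactly when the two distributions coincide, which is why it appears as a penalty term in models like VAEs.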

Signal Processing & Complex Systems

Term Full Name Definition
Fourier Transform Fourier Transform A mathematical transform that decomposes functions (signals) into frequency components (sine waves).
DFT Discrete Fourier Transform The discrete version of the Fourier Transform, used for digital signal processing.
FFT Fast Fourier Transform An algorithm that computes the DFT of a sequence in O(N log N) time.
Convolution Theorem Convolution Theorem A principle stating that convolution in the time domain corresponds to multiplication in the frequency domain.
Complex Number Complex Number A number that can be expressed in the form a + bi, where a and b are real numbers and i is the imaginary unit.
Euler’s Formula Euler’s Formula A fundamental equation in complex analysis: e^(ix) = cos x + i sin x.
Quaternion Quaternion A number system that extends the complex numbers to 4 dimensions (w + xi + yj + zk), used for 3D rotations.
Gimbal Lock Gimbal Lock A state where one degree of freedom is lost because two axes of rotation become parallel.
Self-Attention Self-Attention A mechanism in Transformers that relates different positions of a single sequence to compute a representation of the sequence.
Positional Encoding Positional Encoding A technique to inject information about the position of tokens in a sequence, since Transformers process them in parallel.
VAE Variational Autoencoder A generative model that learns a probabilistic mapping to a latent space (usually Gaussian).
Latent Space Latent Space A compressed, abstract representation of data (manifold) where similar data points are closer together.
Reparameterization Trick Reparameterization Trick A technique used in VAEs to backpropagate through a random sampling node by rewriting the random variable as a deterministic function of parameters and noise.
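
The Convolution Theorem and Euler's formula can both be checked numerically with NumPy's FFT (the signals below are arbitrary illustrative values):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
h = np.array([0.0, 1.0, 0.5, 0.25])
n = len(x)

# Circular convolution computed directly from its definition
direct = np.array([sum(x[k] * h[(i - k) % n] for k in range(n))
                   for i in range(n)])

# Convolution Theorem: convolution in time = multiplication in frequency,
# so the same result comes from IFFT(FFT(x) * FFT(h))
via_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)).real

# Euler's formula: e^(i*theta) = cos(theta) + i*sin(theta)
theta = 0.7
lhs = np.exp(1j * theta)
rhs = np.cos(theta) + 1j * np.sin(theta)
```

This frequency-domain route is what makes FFT-based convolution fast: the direct sum costs O(N²) while the FFT path costs O(N log N).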