Optimal transport


The optimal transport (OT) problem has attracted a surge of research interest because it yields a principled distance between probability distributions, known as the Wasserstein distance. This property makes the distance applicable to widely diverse machine learning problems such as generative adversarial networks, graph optimal transport, clustering, and domain adaptation.

We briefly review the Kantorovich relaxation of the optimal transport (OT) problem. Let {\bf a}=(a_1, a_2, \ldots, a_m)^T \in \mathbb{R}_+^m and {\bf b}=(b_1, b_2, \ldots, b_n)^T \in \mathbb{R}_+^n be probability or positive weight vectors. Given two empirical distributions, i.e., discrete measures, {\bf \nu}=\sum_{i=1}^m a_i{\bf \delta}_{{x}_i} and {\bf \mu}=\sum_{j=1}^n b_j{\bf \delta}_{{y}_j}, and a ground cost matrix {\bf C} \in \mathbb{R}^{m \times n} between their supports, the problem can be formulated as

{\displaystyle \min_{{\bf T} \in \mathcal{U}({\bf a},{\bf b})}\ \langle {\bf C},{\bf T}\rangle},

where {\bf T} \in \mathbb{R}^{m \times n} denotes the transport matrix, and the domain \mathcal{U}({\bf a},{\bf b}) is defined as

\mathcal{U}({\bf a},{\bf b}) = \Bigl\{ {\bf T} \in \mathbb{R}^{m \times n}_{+}: {\bf T}{\bf 1}_n = {\bf a},{\bf T}^{T}{\bf 1}_m = {\bf b} \Bigr\},
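As an illustrative sketch (not part of the original text), the Kantorovich problem above is a linear program and can be solved directly for small discrete measures, e.g., with SciPy's `linprog`. The weights, supports, and cost below are made-up toy values; the marginal constraints {\bf T}{\bf 1}_n = {\bf a} and {\bf T}^{T}{\bf 1}_m = {\bf b} are encoded as equality constraints on the flattened transport matrix.

```python
import numpy as np
from scipy.optimize import linprog

# Toy discrete measures (assumed example data):
a = np.array([0.5, 0.5])          # weights of nu, sum to 1
b = np.array([0.25, 0.25, 0.5])   # weights of mu, sum to 1
x = np.array([0.0, 1.0])          # supports of nu
y = np.array([0.0, 0.5, 1.0])     # supports of mu

# Ground cost C_ij = |x_i - y_j|
C = np.abs(x[:, None] - y[None, :])
m, n = C.shape

# Equality constraints: row sums of T equal a, column sums equal b
A_eq = np.zeros((m + n, m * n))
for i in range(m):
    A_eq[i, i * n:(i + 1) * n] = 1.0   # T 1_n = a
for j in range(n):
    A_eq[m + j, j::n] = 1.0            # T^T 1_m = b
b_eq = np.concatenate([a, b])

# Minimize <C, T> over nonnegative T satisfying the marginals
res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
T = res.x.reshape(m, n)   # optimal transport plan T*
cost = float((C * T).sum())  # objective value <C, T*>
```

Here `T` is the optimal plan and `cost` is the OT objective; checking `T.sum(axis=1)` and `T.sum(axis=0)` recovers the marginals {\bf a} and {\bf b}.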

where {\bf T}{\bf 1}_n = {\bf a} and {\bf T}^{T}{\bf 1}_m = {\bf b} are the marginal constraints. Moreover, we denote the total masses of the two vectors by \alpha > 0 and \beta > 0, i.e., \sum_{i=1}^m {a}_i = \alpha and \sum_{j=1}^n {b}_j = \beta. Note that \alpha equals \beta in the standard OT formulation. The obtained OT matrix {\bf T}^* induces a powerful distance \mathcal{W}_p({\bf \nu},{\bf \mu}) = \langle {\bf T}^*,{\bf C} \rangle^{\frac{1}{p}}, known as the p-th order Wasserstein distance, which is used in various fields according to the value of p. In particular, the distance is applied to computer vision when p=1 and to clustering when p=2. When {\bf C} is the ground cost matrix and p=1, we specifically refer to the first-order Wasserstein distance as the OT distance.
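To make the role of p concrete, the sketch below (an assumption-laden example, not from the original text) computes \mathcal{W}_p by solving the Kantorovich LP with the ground cost raised to the p-th power, i.e., C_{ij} = |x_i - y_j|^p, and taking the p-th root of the optimal value. For p=1 in one dimension, the result can be cross-checked against SciPy's closed-form `wasserstein_distance`. The helper name `wasserstein_p` and the toy data are made up for illustration.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.stats import wasserstein_distance

def wasserstein_p(x, y, a, b, p):
    """p-th order Wasserstein distance between the discrete measures
    sum_i a_i delta_{x_i} and sum_j b_j delta_{y_j} via the Kantorovich LP."""
    C = np.abs(x[:, None] - y[None, :]) ** p   # ground cost d(x_i, y_j)^p
    m, n = C.shape
    # Marginal constraints as equalities on the flattened plan
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0       # T 1_n = a
    for j in range(n):
        A_eq[m + j, j::n] = 1.0                # T^T 1_m = b
    res = linprog(C.ravel(), A_eq=A_eq,
                  b_eq=np.concatenate([a, b]), bounds=(0, None))
    return float(C.ravel() @ res.x) ** (1.0 / p)   # <T*, C>^{1/p}

# Toy 1-D example (assumed data)
x = np.array([0.0, 1.0]);      a = np.array([0.5, 0.5])
y = np.array([0.0, 0.5, 1.0]); b = np.array([0.25, 0.25, 0.5])

w1 = wasserstein_p(x, y, a, b, p=1)
w2 = wasserstein_p(x, y, a, b, p=2)
# In 1-D, the first-order case agrees with SciPy's closed form:
ref = wasserstein_distance(x, y, a, b)
```

Note that for probability measures \mathcal{W}_1 \le \mathcal{W}_2, consistent with the monotonicity of the Wasserstein distances in p.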