1 files changed, 923 insertions, 0 deletions
diff --git a/docs/source/quickstart.rst b/docs/source/quickstart.rst
new file mode 100644
index 0000000..978eaff
--- /dev/null
+++ b/docs/source/quickstart.rst
@@ -0,0 +1,923 @@
+
+Quick start guide
+=================
+
+In the following we provide some pointers about which functions and classes
+to use for different problems related to optimal transport (OT) and machine
+learning. We refer when we can to concrete examples in the documentation that
+are also available as notebooks on the POT Github.
+
+This document is not a tutorial on numerical optimal transport. For this we strongly
+recommend to read the very nice book [15]_ .
+
+
+Optimal transport and Wasserstein distance
+------------------------------------------
+
+.. note::
+    In POT, most functions that solve OT or regularized OT problems have two
+    versions that return the OT matrix or the value of the optimal solution. For
+    instance :any:`ot.emd` return the OT matrix and :any:`ot.emd2` return the
+    Wassertsein distance. This approach has been implemented in practice for all
+    solvers that return an OT matrix (even Gromov-Wasserstsein)
+
+Solving optimal transport
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The optimal transport problem between discrete distributions is often expressed
+as
+
+.. math::
+    \gamma^* = arg\min_\gamma \quad \sum_{i,j}\gamma_{i,j}M_{i,j}
+
+    s.t. \gamma 1 = a; \gamma^T 1= b; \gamma\geq 0
+
+where :
+
+- :math:`M\in\mathbb{R}_+^{m\times n}` is the metric cost matrix defining the cost to move mass from bin :math:`a_i` to bin :math:`b_j`.
+- :math:`a` and :math:`b` are histograms on the simplex (positive, sum to 1) that represent the
+weights of each samples in the source an target distributions.
+
+Solving the linear program above can be done using the function :any:`ot.emd`
+that will return the optimal transport matrix :math:`\gamma^*`:
+
+.. code:: python
+
+    # a,b are 1D histograms (sum to 1 and positive)
+    # M is the ground cost matrix
+    T=ot.emd(a,b,M) # exact linear program
+
+The method implemented for solving the OT problem is the network simplex, it is
+implemented in C from  [1]_. It has a complexity of :math:`O(n^3)` but the
+solver is quite efficient and uses sparsity of the solution.
+
+.. hint::
+    Examples of use for :any:`ot.emd` are available in :
+
+    - :any:`auto_examples/plot_OT_2D_samples`
+    - :any:`auto_examples/plot_OT_1D`
+    - :any:`auto_examples/plot_OT_L1_vs_L2`
+
+
+Computing Wasserstein distance
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The value of the OT solution is often more of interest than the OT matrix :
+
+.. math::
+    OT(a,b)=\min_\gamma \quad \sum_{i,j}\gamma_{i,j}M_{i,j}
+
+    s.t. \gamma 1 = a; \gamma^T 1= b; \gamma\geq 0
+
+
+It can computed from an already estimated OT matrix with
+:code:`np.sum(T*M)` or directly with the function :any:`ot.emd2`.
+
+.. code:: python
+
+    # a,b are 1D histograms (sum to 1 and positive)
+    # M is the ground cost matrix
+    W=ot.emd2(a,b,M) # Wasserstein distance / EMD value
+
+Note that the well known  `Wasserstein distance
+<https://en.wikipedia.org/wiki/Wasserstein_metric>`_ between distributions a and
+b is defined as
+
+
+    .. math::
+
+        W_p(a,b)=(\min_\gamma \sum_{i,j}\gamma_{i,j}\|x_i-y_j\|_p)^\frac{1}{p}
+
+        s.t. \gamma 1 = a; \gamma^T 1= b; \gamma\geq 0
+
+This means that if you want to compute the :math:`W_2` you need to compute the
+square root of :any:`ot.emd2` when providing
+:code:`M=ot.dist(xs,xt)` that use the squared euclidean distance by default. Computing
+the :math:`W_1` wasserstein distance can be done directly with  :any:`ot.emd2`
+when providing :code:`M=ot.dist(xs,xt, metric='euclidean')` to use the euclidean
+distance.
+
+
+.. hint::
+    An example of use for :any:`ot.emd2` is available in :
+
+    - :any:`auto_examples/plot_compute_emd`
+
+
+Special cases
+^^^^^^^^^^^^^
+
+Note that the OT problem and the corresponding Wasserstein distance can in some
+special cases be computed very efficiently.
+
+For instance when the samples are in 1D, then the OT problem can be solved in
+:math:`O(n\log(n))` by using a simple sorting. In this case we provide the
+function :any:`ot.emd_1d` and   :any:`ot.emd2_1d` to return respectively the OT
+matrix and value. Note that since the solution is very sparse the :code:`sparse`
+parameter of :any:`ot.emd_1d` allows for solving and returning the solution for
+very large problems. Note that in order to compute directly the :math:`W_p`
+Wasserstein distance in 1D we provide the function :any:`ot.wasserstein_1d` that
+takes :code:`p` as a parameter.
+
+Another special case for estimating OT and Monge mapping is between Gaussian
+distributions. In this case there exists a close form solution given in Remark
+2.29 in [15]_ and the Monge mapping is an affine function and can be
+also computed from the covariances and means of the source and target
+distributions. In the case when the finite sample dataset is supposed gaussian, we provide
+:any:`ot.da.OT_mapping_linear` that returns the parameters for the Monge
+mapping.
+
+
+Regularized Optimal Transport
+-----------------------------
+
+Recent developments have shown the interest of regularized OT both in terms of
+computational and statistical properties.
+We address in this section the regularized OT problems that can be expressed as
+
+.. math::
+    \gamma^* = arg\min_\gamma \quad \sum_{i,j}\gamma_{i,j}M_{i,j} + \lambda\Omega(\gamma)
+
+        s.t. \gamma 1 = a; \gamma^T 1= b; \gamma\geq 0
+
+
+where :
+
+- :math:`M\in\mathbb{R}_+^{m\times n}` is the metric cost matrix defining the cost to move mass from bin :math:`a_i` to bin :math:`b_j`.
+- :math:`a` and :math:`b` are histograms (positive, sum to 1) that represent the weights of each samples in the source an target distributions.
+- :math:`\Omega` is the regularization term.
+
+We discuss in the following specific algorithms that can be used depending on
+the regularization term.
+
+
+Entropic regularized OT
+^^^^^^^^^^^^^^^^^^^^^^^
+
+This is the most common regularization used for optimal transport. It has been
+proposed in the ML community by Marco Cuturi in his seminal paper [2]_. This
+regularization has the following expression
+
+.. math::
+    \Omega(\gamma)=\sum_{i,j}\gamma_{i,j}\log(\gamma_{i,j})
+
+
+The use of the regularization term above in the optimization problem has a very
+strong impact. First it makes the problem smooth which leads to new optimization
+procedures such as the well known Sinkhorn algorithm [2]_ or L-BFGS (see
+:any:`ot.smooth` ). Next it makes the problem
+strictly convex meaning that there will be a unique solution. Finally the
+solution of the resulting optimization problem can be expressed as:
+
+.. math::
+
+    \gamma_\lambda^*=\text{diag}(u)K\text{diag}(v)
+
+where :math:`u` and :math:`v` are vectors and :math:`K=\exp(-M/\lambda)` where
+the :math:`\exp` is taken component-wise. In order to solve the optimization
+problem, on can use an alternative projection algorithm called Sinkhorn-Knopp that can be very
+efficient for large values if regularization.
+
+The Sinkhorn-Knopp algorithm is implemented in :any:`ot.sinkhorn` and
+:any:`ot.sinkhorn2` that return respectively the OT matrix and the value of the
+linear term. Note that the regularization parameter :math:`\lambda` in the
+equation above is given to those functions with the parameter :code:`reg`.
+
+    >>> import ot
+    >>> a=[.5,.5]
+    >>> b=[.5,.5]
+    >>> M=[[0.,1.],[1.,0.]]
+    >>> ot.sinkhorn(a,b,M,1)
+    array([[ 0.36552929,  0.13447071],
+        [ 0.13447071,  0.36552929]])
+
+More details about the algorithms used are given in the following note.
+
+.. note::
+    The main function to solve entropic regularized OT is :any:`ot.sinkhorn`.
+    This function is a wrapper and the parameter :code:`method` help you select
+    the actual algorithm used to solve the problem:
+
+    + :code:`method='sinkhorn'` calls :any:`ot.bregman.sinkhorn_knopp`  the
+      classic algorithm [2]_.
+    + :code:`method='sinkhorn_stabilized'` calls :any:`ot.bregman.sinkhorn_stabilized`  the
+      log stabilized version of the algorithm [9]_.
+    + :code:`method='sinkhorn_epsilon_scaling'` calls
+      :any:`ot.bregman.sinkhorn_epsilon_scaling`  the epsilon scaling version
+      of the algorithm [9]_.
+    + :code:`method='greenkhorn'` calls :any:`ot.bregman.greenkhorn`  the
+      greedy sinkhorn verison of the algorithm [22]_.
+
+    In addition to all those variants of sinkhorn, we have another
+    implementation solving the problem in the smooth dual or semi-dual in
+    :any:`ot.smooth`. This solver uses the :any:`scipy.optimize.minimize`
+    function to solve the smooth problem with :code:`L-BFGS-B` algorithm. Tu use
+    this solver, use functions :any:`ot.smooth.smooth_ot_dual` or
+    :any:`ot.smooth.smooth_ot_semi_dual` with parameter :code:`reg_type='kl'` to
+    choose entropic/Kullbach Leibler regularization.
+
+
+Recently [23]_ introduced the sinkhorn divergence that build from entropic
+regularization to compute fast and differentiable geometric divergence between
+empirical distributions.  Note that we provide a function that compute directly
+(with no need to pre compute the :code:`M` matrix)
+the sinkhorn divergence for empirical distributions in
+:any:`ot.bregman.empirical_sinkhorn_divergence`. Similarly one can compute the
+OT matrix and loss for empirical distributions with respectively
+:any:`ot.bregman.empirical_sinkhorn` and :any:`ot.bregman.empirical_sinkhorn2`.
+
+
+Finally note that we also provide in :any:`ot.stochastic` several implementation
+of stochastic solvers for entropic regularized OT [18]_ [19]_.  Those pure Python
+implementations are not optimized for speed but provide a roust implementation
+of algorithms in [18]_ [19]_.
+
+.. hint::
+    Examples of use for :any:`ot.sinkhorn` are available in :
+
+    - :any:`auto_examples/plot_OT_2D_samples`
+    - :any:`auto_examples/plot_OT_1D`
+    - :any:`auto_examples/plot_OT_1D_smooth`
+    - :any:`auto_examples/plot_stochastic`
+
+
+Other regularization
+^^^^^^^^^^^^^^^^^^^^
+
+While entropic OT is the most common and favored in practice, there exist other
+kind of regularization. We provide in POT two specific solvers for other
+regularization terms, namely quadratic regularization and group lasso
+regularization. But we also provide in :any:`ot.optim`  two generic solvers that allows solving any
+smooth regularization in practice.
+
+Quadratic regularization
+""""""""""""""""""""""""
+
+The first general regularization term we can solve is the quadratic
+regularization of the form
+
+.. math::
+    \Omega(\gamma)=\sum_{i,j} \gamma_{i,j}^2
+
+this regularization term has a similar effect to entropic regularization in
+densifying the OT matrix but it keeps some sort of sparsity that is lost with
+entropic regularization as soon as :math:`\lambda>0` [17]_. This problem can be
+solved with POT using solvers from :any:`ot.smooth`, more specifically
+functions :any:`ot.smooth.smooth_ot_dual` or
+:any:`ot.smooth.smooth_ot_semi_dual` with parameter :code:`reg_type='l2'` to
+choose the quadratic regularization.
+
+.. hint::
+    Examples of quadratic regularization are available in :
+
+    - :any:`auto_examples/plot_OT_1D_smooth`
+    - :any:`auto_examples/plot_optim_OTreg`
+
+
+
+Group Lasso regularization
+""""""""""""""""""""""""""
+
+Another regularization that has been used in recent years [5]_  is the group lasso
+regularization
+
+.. math::
+    \Omega(\gamma)=\sum_{j,G\in\mathcal{G}} \|\gamma_{G,j}\|_q^p
+
+where :math:`\mathcal{G}` contains non overlapping groups of lines in the OT
+matrix. This regularization proposed in [5]_ will promote sparsity at the group level and for
+instance will force target samples to get mass from a small number of groups.
+Note that the exact OT solution is already sparse so this regularization does
+not make sens if it is not combined with entropic regularization. Depending on
+the choice of :code:`p` and :code:`q`, the problem can be solved with different
+approaches.  When :code:`q=1` and :code:`p<1` the problem is non convex but can
+be solved using an efficient majoration minimization approach with
+:any:`ot.sinkhorn_lpl1_mm`. When :code:`q=2` and :code:`p=1` we recover the
+convex group lasso and we provide a solver using generalized conditional
+gradient algorithm [7]_ in function
+:any:`ot.da.sinkhorn_l1l2_gl`.
+
+.. hint::
+    Examples of group Lasso regularization are available in :
+
+    - :any:`auto_examples/plot_otda_classes`
+    - :any:`auto_examples/plot_otda_d2`
+
+
+Generic solvers
+"""""""""""""""
+
+Finally we propose in POT generic solvers that can be used to solve any
+regularization as long as you can provide a function computing the
+regularization and a function computing its gradient (or sub-gradient).
+
+In order to solve
+
+.. math::
+    \gamma^* = arg\min_\gamma \quad \sum_{i,j}\gamma_{i,j}M_{i,j} + \lambda\Omega(\gamma)
+
+        s.t. \gamma 1 = a; \gamma^T 1= b; \gamma\geq 0
+
+you can use function :any:`ot.optim.cg` that will use a conditional gradient as
+proposed in [6]_ . You need to provide the regularization function as parameter
+``f`` and its gradient as parameter  ``df``. Note that the conditional gradient relies on
+iterative solving of a linearization of the problem using the exact
+:any:`ot.emd` so it can be  slow in practice. But, being an interior point
+algorithm,  it always returns a
+transport matrix that does not violates the marginals.
+
+Another generic solver is proposed to solve the problem
+
+.. math::
+    \gamma^* = arg\min_\gamma \quad \sum_{i,j}\gamma_{i,j}M_{i,j}+ \lambda_e\Omega_e(\gamma) + \lambda\Omega(\gamma)
+
+        s.t. \gamma 1 = a; \gamma^T 1= b; \gamma\geq 0
+
+where :math:`\Omega_e` is the entropic regularization. In this case we use a
+generalized conditional gradient [7]_ implemented in :any:`ot.optim.gcg`  that
+does not linearize the entropic term but
+relies on :any:`ot.sinkhorn` for its iterations.
+
+.. hint::
+    An example of generic solvers are available in :
+
+    - :any:`auto_examples/plot_optim_OTreg`
+
+
+Wasserstein Barycenters
+-----------------------
+
+A Wasserstein barycenter is a distribution that minimize its Wasserstein
+distance with respect to other distributions [16]_. It corresponds to minimizing the
+following problem by searching a distribution :math:`\mu` such that
+
+.. math::
+    \min_\mu \quad \sum_{k} w_kW(\mu,\mu_k)
+
+
+In practice we model a distribution with a finite number of support position:
+
+.. math::
+    \mu=\sum_{i=1}^n a_i\delta_{x_i}
+
+where :math:`a` is an histogram on the simplex and the :math:`\{x_i\}` are the
+position of the support. We can clearly see here that optimizing :math:`\mu` can
+be done by searching for optimal weights :math:`a` or optimal support
+:math:`\{x_i\}` (optimizing both is also an option).
+We provide in POT solvers to estimate a discrete
+Wasserstein barycenter in both cases.
+
+Barycenters with fixed support
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+When optimizing a barycenter with a fixed support, the optimization problem can
+be expressed as
+
+.. math::
+    \min_a \quad \sum_{k} w_k W(a,b_k)
+
+where :math:`b_k` are also weights in the simplex. In the non-regularized case,
+the problem above is a classical linear program. In this case we propose a
+solver :any:`ot.lp.barycenter` that rely on generic LP solvers. By default the
+function uses :any:`scipy.optimize.linprog`, but more efficient LP solvers from
+cvxopt can be also used by changing parameter :code:`solver`. Note that this problem
+requires to solve a very large linear program and can be very slow in
+practice.
+
+Similarly to the OT problem, OT barycenters can be computed in the regularized
+case. When using entropic regularization is used, the problem can be solved with a
+generalization of the sinkhorn algorithm based on bregman projections [3]_. This
+algorithm is provided in function :any:`ot.bregman.barycenter` also available as
+:any:`ot.barycenter`. In this case, the algorithm scales better to large
+distributions and rely only on matrix multiplications that can be performed in
+parallel.
+
+In addition to the speedup brought by regularization, one can also greatly
+accelerate the estimation of Wasserstein barycenter when the support has a
+separable structure [21]_. In the case of 2D images for instance one can replace
+the matrix vector production in the Bregman projections by convolution
+operators. We provide an implementation of this algorithm in function
+:any:`ot.bregman.convolutional_barycenter2d`.
+
+.. hint::
+    Examples of Wasserstein (:any:`ot.lp.barycenter`) and regularized Wasserstein
+    barycenter (:any:`ot.bregman.barycenter`) computation are available in :
+
+    - :any:`auto_examples/plot_barycenter_1D`
+    - :any:`auto_examples/plot_barycenter_lp_vs_entropic`
+
+    An example of convolutional barycenter
+    (:any:`ot.bregman.convolutional_barycenter2d`) computation
+    for 2D images is available
+    in :
+
+    - :any:`auto_examples/plot_convolutional_barycenter`
+
+
+
+Barycenters with free support
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Estimating the Wasserstein barycenter with free support but fixed weights
+corresponds to  solving the following optimization problem:
+
+.. math::
+    \min_{\{x_i\}} \quad \sum_{k} w_kW(\mu,\mu_k)
+
+    s.t. \quad \mu=\sum_{i=1}^n a_i\delta_{x_i}
+
+We provide a solver based on [20]_ in
+:any:`ot.lp.free_support_barycenter`. This function minimize the problem and
+return a locally optimal support :math:`\{x_i\}` for uniform or given weights
+:math:`a`.
+
+ .. hint::
+
+    An example of the free support barycenter estimation is available
+    in :
+
+    - :any:`auto_examples/plot_free_support_barycenter`
+
+
+
+
+Monge mapping and Domain adaptation
+-----------------------------------
+
+The original transport problem investigated by Gaspard Monge  was seeking for a
+mapping function that maps (or transports) between a source and target
+distribution but that minimizes the transport loss. The existence and uniqueness of this
+optimal mapping is still an open problem in the general case but has been proven
+for smooth distributions by Brenier in his eponym `theorem
+<https://who.rocq.inria.fr/Jean-David.Benamou/demiheure.pdf>`__. We provide in
+:any:`ot.da` several solvers for smooth Monge mapping estimation and domain
+adaptation from discrete distributions.
+
+Monge Mapping estimation
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+We now discuss several approaches that are implemented in POT to estimate or
+approximate a Monge mapping from finite distributions.
+
+First note that when the source and target distributions are supposed to be Gaussian
+distributions, there exists a close form solution for the mapping and its an
+affine function [14]_ of the form :math:`T(x)=Ax+b` . In this case we provide the function
+:any:`ot.da.OT_mapping_linear` that return the operator :math:`A` and vector
+:math:`b`. Note that if the number of samples is too small there is a parameter
+:code:`reg` that provide a regularization for the covariance matrix estimation.
+
+For a more general mapping estimation we also provide the barycentric mapping
+proposed in [6]_ . It is implemented in the class :any:`ot.da.EMDTransport` and
+other transport based classes in :any:`ot.da` . Those classes are discussed more
+in the following but follow an interface similar to sklearn classes. Finally a
+method proposed in [8]_ that estimates a continuous mapping approximating the
+barycentric mapping is provided in :any:`ot.da.joint_OT_mapping_linear` for
+linear mapping and :any:`ot.da.joint_OT_mapping_kernel` for non linear mapping.
+
+ .. hint::
+
+    An example of the linear Monge mapping estimation is available
+    in :
+
+    - :any:`auto_examples/plot_otda_linear_mapping`
+
+Domain adaptation classes
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The use of OT for domain adaptation (OTDA) has been first proposed in [5]_ that also
+introduced the group Lasso regularization. The main idea of OTDA is to estimate
+a mapping of the samples between source and target distributions which allows to
+transport labeled source samples onto the target distribution with no labels.
+
+We provide several classes based on :any:`ot.da.BaseTransport` that provide
+several OT and mapping estimations. The interface of those classes is similar to
+classifiers in sklearn toolbox. At initialization, several parameters such as
+ regularization parameter value can be set. Then one needs to estimate the
+mapping with function :any:`ot.da.BaseTransport.fit`. Finally one can map the
+samples from source to target with  :any:`ot.da.BaseTransport.transform` and
+from target to source with :any:`ot.da.BaseTransport.inverse_transform`.
+
+Here is
+an example for class :any:`ot.da.EMDTransport` :
+
+.. code::
+
+    ot_emd = ot.da.EMDTransport()
+    ot_emd.fit(Xs=Xs, Xt=Xt)
+
+    Mapped_Xs= ot_emd.transform(Xs=Xs)
+
+A list of the provided implementation is given in the following note.
+
+.. note::
+
+    Here is a list of the OT mapping classes inheriting from
+    :any:`ot.da.BaseTransport`
+
+    * :any:`ot.da.EMDTransport` : Barycentric mapping with EMD transport
+    * :any:`ot.da.SinkhornTransport` : Barycentric mapping with Sinkhorn transport
+    * :any:`ot.da.SinkhornL1l2Transport` : Barycentric mapping with Sinkhorn +
+      group Lasso regularization [5]_
+    * :any:`ot.da.SinkhornLpl1Transport` : Barycentric mapping with Sinkhorn +
+      non convex group Lasso regularization [5]_
+    * :any:`ot.da.LinearTransport` : Linear mapping estimation  between Gaussians
+      [14]_
+    * :any:`ot.da.MappingTransport` : Nonlinear mapping estimation [8]_
+
+.. hint::
+
+    Example of the use of OTDA classes are available in :
+
+    - :any:`auto_examples/plot_otda_color_images`
+    - :any:`auto_examples/plot_otda_mapping`
+    - :any:`auto_examples/plot_otda_mapping_colors_images`
+    - :any:`auto_examples/plot_otda_semi_supervised`
+
+Other applications
+------------------
+
+We discuss in the following several OT related problems and tools that has been
+proposed in the OT and machine learning community.
+
+Wasserstein Discriminant Analysis
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Wasserstein Discriminant Analysis [11]_ is a generalization of `Fisher Linear Discriminant
+Analysis <https://en.wikipedia.org/wiki/Linear_discriminant_analysis>`__ that
+allows discrimination between classes that are not linearly separable. It
+consist in finding a linear projector optimizing the following criterion
+
+.. math::
+    P = \text{arg}\min_P \frac{\sum_i OT_e(\mu_i\#P,\mu_i\#P)}{\sum_{i,j\neq i}
+    OT_e(\mu_i\#P,\mu_j\#P)}
+
+where :math:`\#` is the push-forward operator, :math:`OT_e` is the entropic OT
+loss  and :math:`\mu_i` is the
+distribution of samples from class :math:`i`.  :math:`P` is also constrained to
+be in the Stiefel manifold. WDA can be solved in POT using function
+:any:`ot.dr.wda`. It requires to have installed :code:`pymanopt` and
+:code:`autograd` for manifold optimization and automatic differentiation
+respectively. Note that we also provide the Fisher discriminant estimator in
+:any:`ot.dr.fda` for easy comparison.
+
+.. warning::
+    Note that due to the hard dependency on  :code:`pymanopt` and
+    :code:`autograd`, :any:`ot.dr` is not imported by default. If you want to
+    use it you have to specifically import it with :code:`import ot.dr` .
+
+.. hint::
+
+    An example of the use of WDA is available in :
+
+    - :any:`auto_examples/plot_WDA`
+
+
+Unbalanced optimal transport
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Unbalanced OT is a relaxation of the entropy regularized OT problem where the violation of
+the constraint on the marginals is added to the objective of the optimization
+problem. The unbalanced OT metric between two unbalanced histograms a and b is defined as [25]_ [10]_:
+
+.. math::
+    W_u(a, b) = \min_\gamma \quad \sum_{i,j}\gamma_{i,j}M_{i,j} + reg\cdot\Omega(\gamma) + reg_m KL(\gamma 1, a) + reg_m KL(\gamma^T 1, b)
+
+    s.t. \quad  \gamma\geq 0
+
+
+where KL is the Kullback-Leibler divergence. This formulation allows for
+computing approximate mapping between distributions that do not have the same
+amount of mass. Interestingly the problem can be solved with a generalization of
+the Bregman projections algorithm [10]_. We provide a solver for unbalanced OT
+in :any:`ot.unbalanced`. Computing the optimal transport
+plan or the transport cost is similar to the balanced case. The Sinkhorn-Knopp
+algorithm is implemented in :any:`ot.sinkhorn_unbalanced` and :any:`ot.sinkhorn_unbalanced2`
+that return respectively the OT matrix and the value of the
+linear term.
+
+.. note::
+    The main function to solve entropic regularized UOT is :any:`ot.sinkhorn_unbalanced`.
+    This function is a wrapper and the parameter :code:`method` helps you select
+    the actual algorithm used to solve the problem:
+
+    + :code:`method='sinkhorn'` calls :any:`ot.unbalanced.sinkhorn_knopp_unbalanced`
+      the generalized Sinkhorn algorithm [25]_ [10]_.
+    + :code:`method='sinkhorn_stabilized'` calls :any:`ot.unbalanced.sinkhorn_stabilized_unbalanced`
+      the log stabilized version of the algorithm [10]_.
+
+
+.. hint::
+
+    Examples of the use of :any:`ot.sinkhorn_unbalanced` are available in :
+
+    - :any:`auto_examples/plot_UOT_1D`
+
+
+Unbalanced Barycenters
+^^^^^^^^^^^^^^^^^^^^^^
+
+As with balanced distributions, we can define a barycenter of a set of
+histograms with different masses as a Fréchet Mean:
+
+    .. math::
+        \min_{\mu} \quad \sum_{k} w_kW_u(\mu,\mu_k)
+
+Where :math:`W_u` is the unbalanced Wasserstein metric defined above. This problem
+can also be solved using generalized version of Sinkhorn's algorithm and it is
+implemented the main function :any:`ot.barycenter_unbalanced`.
+
+
+.. note::
+    The main function to compute UOT barycenters is :any:`ot.barycenter_unbalanced`.
+    This function is a wrapper and the parameter :code:`method` help you select
+    the actual algorithm used to solve the problem:
+
+    + :code:`method='sinkhorn'` calls :any:`ot.unbalanced.barycenter_unbalanced_sinkhorn_unbalanced`
+      the generalized Sinkhorn algorithm [10]_.
+    + :code:`method='sinkhorn_stabilized'` calls :any:`ot.unbalanced.barycenter_unbalanced_stabilized`
+      the log stabilized version of the algorithm [10]_.
+
+
+.. hint::
+
+      Examples of the use of :any:`ot.barycenter_unbalanced` are available in :
+
+      - :any:`auto_examples/plot_UOT_barycenter_1D`
+
+
+Gromov-Wasserstein
+^^^^^^^^^^^^^^^^^^
+
+Gromov Wasserstein (GW) is a generalization of OT to distributions that do not lie in
+the same space [13]_. In this case one cannot compute distance between samples
+from the two distributions. [13]_ proposed instead to realign the metric spaces
+by computing a transport between distance matrices. The Gromow Wasserstein
+alignement between two distributions can be expressed as the one minimizing:
+
+.. math::
+    GW = \min_\gamma \sum_{i,j,k,l} L(C1_{i,k},C2_{j,l})*\gamma_{i,j}*\gamma_{k,l}
+
+    s.t. \gamma 1 = a; \gamma^T 1= b; \gamma\geq 0
+
+where ::math:`C1` is the distance matrix between samples in the source
+distribution and :math:`C2` the one between samples in the target,
+:math:`L(C1_{i,k},C2_{j,l})` is a measure of similarity between
+:math:`C1_{i,k}` and :math:`C2_{j,l}` often chosen as
+:math:`L(C1_{i,k},C2_{j,l})=\|C1_{i,k}-C2_{j,l}\|^2`. The optimization problem
+above is a non-convex quadratic program but we provide a solver that finds a
+local minimum using conditional gradient in :any:`ot.gromov.gromov_wasserstein`.
+There also exists an entropic regularized variant of GW that has been proposed in
+[12]_ and we provide an implementation of their algorithm in
+:any:`ot.gromov.entropic_gromov_wasserstein`.
+
+Note that similarly to Wasserstein distance GW allows for the definition of GW
+barycenters that can be expressed as
+
+.. math::
+    \min_{C\geq 0} \quad \sum_{k} w_k GW(C,Ck)
+
+where :math:`Ck` is the distance matrix between samples in distribution
+:math:`k`. Note that interestingly the barycenter is defined as a symmetric
+positive matrix. We provide a block coordinate optimization procedure in
+:any:`ot.gromov.gromov_barycenters` and
+:any:`ot.gromov.entropic_gromov_barycenters` for non-regularized and regularized
+barycenters respectively.
+
+Finally note that recently a fusion between Wasserstein and GW, coined Fused
+Gromov-Wasserstein (FGW) has been proposed
+in [24]_. It allows to compute a similarity between objects that are only partly in
+the same space. As such it can be used to measure similarity between labeled
+graphs for instance and also provide computable barycenters.
+The implementations of FGW and FGW barycenter is provided in functions
+:any:`ot.gromov.fused_gromov_wasserstein` and :any:`ot.gromov.fgw_barycenters`.
+
+.. hint::
+
+    Examples of computation of GW, regularized G and FGW are available in :
+
+    - :any:`auto_examples/plot_gromov`
+    - :any:`auto_examples/plot_fgw`
+
+    Examples of GW, regularized GW and FGW barycenters are available in :
+
+    - :any:`auto_examples/plot_gromov_barycenter`
+    - :any:`auto_examples/plot_barycenter_fgw`
+
+
+GPU acceleration
+^^^^^^^^^^^^^^^^
+
+We provide several implementation of our OT solvers in :any:`ot.gpu`. Those
+implementations use the :code:`cupy` toolbox that obviously need to be installed.
+
+
+.. note::
+
+    Several implementations of POT functions (mainly those relying on linear
+    algebra) have been implemented in :any:`ot.gpu`. Here is a short list on the
+    main entries:
+
+    -  :any:`ot.gpu.dist` : computation of distance matrix
+    -  :any:`ot.gpu.sinkhorn` : computation of sinkhorn
+    -  :any:`ot.gpu.sinkhorn_lpl1_mm` : computation of sinkhorn + group lasso
+
+Note that while the :any:`ot.gpu` module has been designed to be compatible with
+POT,  calling its function with :any:`numpy`  arrays will incur a large overhead due to
+the memory copy of the array on GPU prior to computation and conversion of the
+array after computation. To avoid this overhead, we provide functions
+:any:`ot.gpu.to_gpu` and :any:`ot.gpu.to_np` that perform the conversion
+explicitly.
+
+
+.. warning::
+    Note that due to the hard dependency on  :code:`cupy`, :any:`ot.gpu` is not
+    imported by default. If you want to
+    use it you have to specifically import it with :code:`import ot.gpu` .
+
+
+FAQ
+---
+
+
+
+1. **How to solve a discrete optimal transport problem ?**
+
+    The solver for discrete OT is the function :py:mod:`ot.emd` that returns
+    the OT transport matrix. If you want to solve a regularized OT you can
+    use :py:mod:`ot.sinkhorn`.
+
+
+    Here is a simple use case:
+
+    .. code:: python
+
+       # a,b are 1D histograms (sum to 1 and positive)
+       # M is the ground cost matrix
+       T=ot.emd(a,b,M) # exact linear program
+       T_reg=ot.sinkhorn(a,b,M,reg) # entropic regularized OT
+
+    More detailed examples can be seen on this example:
+    :doc:`auto_examples/plot_OT_2D_samples`
+
+
+2. **pip install POT fails with error : ImportError: No module named Cython.Build**
+
+    As discussed shortly in the README file. POT requires to have :code:`numpy`
+    and :code:`cython` installed to build. This corner case is not yet handled
+    by :code:`pip` and for now you need to install both library prior to
+    installing POT.
+
+    Note that this problem do not occur when using conda-forge since the packages
+    there are pre-compiled.
+
+    See `Issue #59 <https://github.com/rflamary/POT/issues/59>`__ for more
+    details.
+
+3. **Why is Sinkhorn slower than EMD ?**
+
+    This might come from the choice of the regularization term. The speed of
+    convergence of sinkhorn depends directly on this term [22]_ and when the
+    regularization gets very small the problem try and approximate the exact OT
+    which leads to slow convergence in addition to numerical problems. In other
+    words, for large regularization sinkhorn will be very fast to converge, for
+    small regularization (when you need an OT matrix close to the true OT), it
+    might be quicker to use the EMD solver.
+
+    Also note that the numpy implementation of the sinkhorn can use parallel
+    computation depending on the configuration of your system but very important
+    speedup can be obtained by using a GPU implementation since all operations
+    are matrix/vector products.
+
+4. **Using GPU fails with error:  module 'ot' has no attribute 'gpu'**
+
+    In order to limit import time and hard dependencies in POT. we do not import
+    some sub-modules automatically with :code:`import ot`. In order to use the
+    acceleration in :any:`ot.gpu` you need first to import is with
+    :code:`import ot.gpu`.
+
+    See `Issue #85 <https://github.com/rflamary/POT/issues/85>`__ and :any:`ot.gpu`
+    for more details.
+
+
+References
+----------
+
+.. [1] Bonneel, N., Van De Panne, M., Paris, S., & Heidrich, W. (2011,
+    December). `Displacement  nterpolation using Lagrangian mass transport
+    <https://people.csail.mit.edu/sparis/publi/2011/sigasia/Bonneel_11_Displacement_Interpolation.pdf>`__.
+    In ACM Transactions on Graphics (TOG) (Vol. 30, No. 6, p. 158). ACM.
+
+.. [2] Cuturi, M. (2013). `Sinkhorn distances: Lightspeed computation of
+    optimal transport <https://arxiv.org/pdf/1306.0895.pdf>`__. In Advances
+    in Neural Information Processing Systems (pp. 2292-2300).
+
+.. [3] Benamou, J. D., Carlier, G., Cuturi, M., Nenna, L., & Peyré, G.
+    (2015). `Iterative Bregman projections for regularized transportation
+    problems <https://arxiv.org/pdf/1412.5154.pdf>`__. SIAM Journal on
+    Scientific Computing, 37(2), A1111-A1138.
+
+.. [4] S. Nakhostin, N. Courty, R. Flamary, D. Tuia, T. Corpetti,
+    `Supervised planetary unmixing with optimal
+    transport <https://hal.archives-ouvertes.fr/hal-01377236/document>`__,
+    Whorkshop on Hyperspectral Image and Signal Processing : Evolution in
+    Remote Sensing (WHISPERS), 2016.
+
+.. [5] N. Courty; R. Flamary; D. Tuia; A. Rakotomamonjy, `Optimal Transport
+    for Domain Adaptation <https://arxiv.org/pdf/1507.00504.pdf>`__, in IEEE
+    Transactions on Pattern Analysis and Machine Intelligence , vol.PP,
+    no.99, pp.1-1
+
+.. [6] Ferradans, S., Papadakis, N., Peyré, G., & Aujol, J. F. (2014).
+    `Regularized discrete optimal
+    transport <https://arxiv.org/pdf/1307.5551.pdf>`__. SIAM Journal on
+    Imaging Sciences, 7(3), 1853-1882.
+
+.. [7] Rakotomamonjy, A., Flamary, R., & Courty, N. (2015). `Generalized
+    conditional gradient: analysis of convergence and
+    applications <https://arxiv.org/pdf/1510.06567.pdf>`__. arXiv preprint
+    arXiv:1510.06567.
+
+.. [8] M. Perrot, N. Courty, R. Flamary, A. Habrard (2016), `Mapping
+    estimation for discrete optimal
+    transport <http://remi.flamary.com/biblio/perrot2016mapping.pdf>`__,
+    Neural Information Processing Systems (NIPS).
+
+.. [9] Schmitzer, B. (2016). `Stabilized Sparse Scaling Algorithms for
+    Entropy Regularized Transport
+    Problems <https://arxiv.org/pdf/1610.06519.pdf>`__. arXiv preprint
+    arXiv:1610.06519.
+
+.. [10] Chizat, L., Peyré, G., Schmitzer, B., & Vialard, F. X. (2016).
+    `Scaling algorithms for unbalanced transport
+    problems <https://arxiv.org/pdf/1607.05816.pdf>`__. arXiv preprint
+    arXiv:1607.05816.
+
+.. [11] Flamary, R., Cuturi, M., Courty, N., & Rakotomamonjy, A. (2016).
+    `Wasserstein Discriminant
+    Analysis <https://arxiv.org/pdf/1608.08063.pdf>`__. arXiv preprint
+    arXiv:1608.08063.
+
+.. [12] Gabriel Peyré, Marco Cuturi, and Justin Solomon (2016),
+    `Gromov-Wasserstein averaging of kernel and distance
+    matrices <http://proceedings.mlr.press/v48/peyre16.html>`__
+    International Conference on Machine Learning (ICML).
+
+.. [13] Mémoli, Facundo (2011). `Gromov–Wasserstein distances and the
+    metric approach to object
+    matching <https://media.adelaide.edu.au/acvt/Publications/2011/2011-Gromov%E2%80%93Wasserstein%20Distances%20and%20the%20Metric%20Approach%20to%20Object%20Matching.pdf>`__.
+    Foundations of computational mathematics 11.4 : 417-487.
+
+.. [14] Knott, M. and Smith, C. S. (1984). `On the optimal mapping of
+    distributions <https://link.springer.com/article/10.1007/BF00934745>`__,
+    Journal of Optimization Theory and Applications Vol 43.
+
+.. [15] Peyré, G., & Cuturi, M. (2018). `Computational Optimal
+    Transport <https://arxiv.org/pdf/1803.00567.pdf>`__ .
+
+.. [16] Agueh, M., & Carlier, G. (2011). `Barycenters in the Wasserstein
+    space <https://hal.archives-ouvertes.fr/hal-00637399/document>`__. SIAM
+    Journal on Mathematical Analysis, 43(2), 904-924.
+
+.. [17] Blondel, M., Seguy, V., & Rolet, A. (2018). `Smooth and Sparse
+    Optimal Transport <https://arxiv.org/abs/1710.06276>`__. Proceedings of
+    the Twenty-First International Conference on Artificial Intelligence and
+    Statistics (AISTATS).
+
+.. [18] Genevay, A., Cuturi, M., Peyré, G. & Bach, F. (2016) `Stochastic
+    Optimization for Large-scale Optimal
+    Transport <https://arxiv.org/abs/1605.08527>`__. Advances in Neural
+    Information Processing Systems (2016).
+
+.. [19] Seguy, V., Bhushan Damodaran, B., Flamary, R., Courty, N., Rolet,
+    A.& Blondel, M. `Large-scale Optimal Transport and Mapping
+    Estimation <https://arxiv.org/pdf/1711.02283.pdf>`__. International
+    Conference on Learning Representation (2018)
+
+.. [20] Cuturi, M. and Doucet, A. (2014) `Fast Computation of Wasserstein
+    Barycenters <http://proceedings.mlr.press/v32/cuturi14.html>`__.
+    International Conference in Machine Learning
+
+.. [21] Solomon, J., De Goes, F., Peyré, G., Cuturi, M., Butscher, A.,
+    Nguyen, A. & Guibas, L. (2015). `Convolutional wasserstein distances:
+    Efficient optimal transportation on geometric
+    domains <https://dl.acm.org/citation.cfm?id=2766963>`__. ACM
+    Transactions on Graphics (TOG), 34(4), 66.
+
+.. [22] J. Altschuler, J.Weed, P. Rigollet, (2017) `Near-linear time
+    approximation algorithms for optimal transport via Sinkhorn
+    iteration <https://papers.nips.cc/paper/6792-near-linear-time-approximation-algorithms-for-optimal-transport-via-sinkhorn-iteration.pdf>`__,
+    Advances in Neural Information Processing Systems (NIPS) 31
+
+.. [23] Aude, G., Peyré, G., Cuturi, M., `Learning Generative Models with
+    Sinkhorn Divergences <https://arxiv.org/abs/1706.00292>`__, Proceedings
+    of the Twenty-First International Conference on Artficial Intelligence
+    and Statistics, (AISTATS) 21, 2018
+
+.. [24] Vayer, T., Chapel, L., Flamary, R., Tavenard, R. and Courty, N.
+    (2019). `Optimal Transport for structured data with application on
+    graphs <http://proceedings.mlr.press/v97/titouan19a.html>`__ Proceedings
+    of the 36th International Conference on Machine Learning (ICML).
+
+.. [25] Frogner C., Zhang C., Mobahi H., Araya-Polo M., Poggio T. :
+    Learning with a Wasserstein Loss,  Advances in Neural Information
+    Processing Systems (NIPS) 2015