authorpdlotko <pdlotko@636b058d-ea47-450e-bf9e-a15bfbe3eedb>2016-09-30 16:06:00 +0000
committerpdlotko <pdlotko@636b058d-ea47-450e-bf9e-a15bfbe3eedb>2016-09-30 16:06:00 +0000
commitca68a0e3eaaa35239cf8dc6303ce4a2144f03dc6 (patch)
tree36bf616b9386eeb4afbc4713bda17e23c78d1f27
parent442c3986eff8b0b75e6e59433648133c6c79321e (diff)
a few changes to the tex file
git-svn-id: svn+ssh://scm.gforge.inria.fr/svnroot/gudhi/branches/gudhi_stat@1597 636b058d-ea47-450e-bf9e-a15bfbe3eedb Former-commit-id: dabff6a850cdf451642be3ff833f71dca421564d
-rw-r--r--src/Gudhi_stat/doc/documentation.tex115
1 files changed, 105 insertions, 10 deletions
diff --git a/src/Gudhi_stat/doc/documentation.tex b/src/Gudhi_stat/doc/documentation.tex
index 3e65d88f..1b5523f9 100644
--- a/src/Gudhi_stat/doc/documentation.tex
+++ b/src/Gudhi_stat/doc/documentation.tex
@@ -57,16 +57,16 @@ Ver 1.0.}
\section{Idea}
-When it comes to the statistics and machine learning, one need only very limited number of operations to be performed on a data to proceed. Let us think of some representation of persistence $\mathcal{A}$. To perform most of the statistical and machine learning operations one need to be able to compute average of object of type $\mathcal{A}$ (so that the averaged object is also of a type $\mathcal{A}$). One need to be able to compute distance between objects of a type $\mathcal{A}$, and compute scalar product of objects of a type $\mathcal{A}$.
+When it comes to statistics and machine learning, one needs only a very limited number of operations to be performed on the data. Let us think of some representation of persistence $\mathcal{A}$. To perform most statistical and machine learning operations, one needs to be able to compute the average of objects of type $\mathcal{A}$ (so that the averaged object is also of type $\mathcal{A}$), to compute the distance between objects of type $\mathcal{A}$, and to compute the scalar product of objects of type $\mathcal{A}$.
To put this statement into context, let us assume we have two collections $c_1,...,c_n$ and $d_1,...,d_n$ of objects of type $\mathcal{A}$. We want to verify whether the averages of those two collections differ, by performing a bootstrap.
-First of all, we compute averages of those two connections: $C =$ average of $c_1,...,c_n$ and $D =$ average of $d_1,...,d_n$. Note that both $C$ and $D$ are of a type $\mathcal{A}$. Then we compute $d(C,D)$, a distance between $C$ and $D$.
+First of all, we compute the averages of those two collections: $C =$ average of $c_1,...,c_n$ and $D =$ average of $d_1,...,d_n$. Note that both $C$ and $D$ are of type $\mathcal{A}$. Then we compute $d(C,D)$, the distance between $C$ and $D$.
Later we put the two collections into one bin:\\
\[B = \{ c_1,...,c_n,d_1,...,d_n \}\]
-Then we shuffle $B$, and we divide the shuffled version of $B$ into two classes: $B_1$ and $B_2$. Note that again, $B_1$ and $B_2$ are of a type $\mathcal{A}$. We compute their distance $d(B_1,B_2)$. The procedure of shuffling and dividing the set $B$ is repeated $N$ times ($N$ is reasonably large number).
-Then the p-value of a statement that the averages of $c_1,...,c_n$ and $d_1,...,d_n$ is approximated by the number of times $d(B_11,B_2) > d(C,D)$ divided by $N$.
+Then we shuffle $B$, and we divide the shuffled version of $B$ into two classes $B_1$ and $B_2$ (in this case, of the same cardinality). Note that the averages of $B_1$ and $B_2$ are again of type $\mathcal{A}$, so we can compute the distance $d(B_1,B_2)$ between them. The procedure of shuffling and dividing the set $B$ is repeated $N$ times (where $N$ is a reasonably large number).
+Then the p-value of the statement that the averages of $c_1,...,c_n$ and $d_1,...,d_n$ differ is approximated by the number of times $d(B_1,B_2) > d(C,D)$, divided by $N$.
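The whole procedure can be sketched in a few lines of Python (an illustrative sketch only, not the Gudhi\_stat API; here objects of type $\mathcal{A}$ are plain vectors, averaged component-wise and compared with the Euclidean distance):

```python
import random

def average(objs):
    # component-wise mean: the "averaging" operation required of type A
    n = len(objs)
    return [sum(xs) / n for xs in zip(*objs)]

def distance(a, b):
    # Euclidean distance: the "distance" operation required of type A
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def permutation_p_value(c, d, N=1000, seed=0):
    # shuffle the joint bin B, split it into two classes, and count how often
    # the distance between the classes' averages exceeds the observed d(C, D)
    rng = random.Random(seed)
    observed = distance(average(c), average(d))
    B = list(c) + list(d)
    count = 0
    for _ in range(N):
        rng.shuffle(B)
        B1, B2 = B[:len(c)], B[len(c):]
        if distance(average(B1), average(B2)) > observed:
            count += 1
    return count / N
```

Any other type supporting `average` and `distance` could be substituted without changing `permutation_p_value`; this is exactly the interface-based design described below.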
-As one can see, this procedure can be performed fro any type $\mathcal{A}$ which can be averaged, and which allows for computations of distances. The idea of Gudhi stat is to take advantage of C++ (and later python) polymorphism and implement a few interfaces:
+As one can see, this procedure can be performed for any type $\mathcal{A}$ that can be averaged and that allows for computation of distances. The idea of Gudhi stat is to take advantage of C++ (and later Python) polymorphism and implement a few interfaces:
\begin{enumerate}
\item Interface of a representation of persistence that allows averaging (so that the average object is of the same type).
\item Interface of representation of persistence that allows computations of distances.
@@ -83,23 +83,118 @@ And then to implement various currently known representations based on those int
\end{enumerate}
-The main aim of this implementation is to be able to implement various statistical methods, both on the level of C++ and python, that operates on the interfaces that are require (like, in the example above, the ability of averaging and computations of distances). Then, we are able to use any of representation that implement a given interface (including the future ones). Below we are discussing the representations which are currently implemented in Gudhi\_stat:
+The main aim of this implementation is to be able to implement various statistical methods, both at the C++ and at the Python level, that operate on the interfaces required by that particular method (like, in the example above, the ability to average and to compute distances). Given such implementations of statistical methods, we are able to use any representation that implements a given interface (including future ones).
+In doing so, we define a computational framework that joins topological methods based on persistent homology with statistics and machine learning. This framework is very easy to extend with new representations of persistence and, even more generally, with any new type of representation.
+
+Below we discuss the representations which are currently implemented in Gudhi\_stat:
\section{Persistence Landscapes}
-Persistence landscapes were originally proposed by Bubenik in~\cite{landscapes1}]. Efficient algorithms to compute them rigorously were proposed by Bubenik and Dłotko in~\cite{landscapes2}. The idea of persistence landscapes is shortly summarized in below.
+\label{sec:persistence_landscapes}
+Persistence landscapes were originally proposed by Bubenik in~\cite{landscapes1}. Efficient algorithms to compute them rigorously were proposed by Bubenik and D{\l}otko in~\cite{landscapes2}. The idea of persistence landscapes is briefly summarized below.
+
+To begin with, suppose we are given a point~$(b,d) \in \mathbb{R}^2$ in a
+persistence diagram. With this point, we associate a piecewise
+linear function~$f_{(b,d)} : \mathbb{R} \rightarrow [0,\infty)$, which is
+defined as
+%
+\begin{equation} \label{eq:basicLand}
+ f_{(b,d)}(x) =
+ \left\{ \begin{array}{ccl}
+ 0 & \mbox{ if } & x \not\in (b, d) \; , \\[2ex]
+ x - b & \mbox{ if } & x \in \left( b, \frac{b+d}{2}
+ \right] \; , \\[2ex]
+ d - x & \mbox{ if } & x \in \left(\frac{b+d}{2},
+ d \right) \; .
+ \end{array} \right.
+\end{equation}
+
+A \emph{persistence landscape} of the birth-death
+pairs~$(b_i , d_i)$, where $i = 1,\ldots,m$, which constitute the given
+persistence diagram is the sequence of functions $\lambda_k : \mathbb{R}
+\rightarrow [0,\infty)$ for $k \in \mathbb{N}$, where~$\lambda_k(x)$
+denotes the $k^{\rm th}$ largest value of the numbers~$f_{(b_i,d_i)}(x)$,
+for $i = 1, \ldots, m$, and we define $\lambda_k(x) = 0$ if $k > m$.
+Equivalently, this sequence of functions can be combined into a single
+function $L : \mathbb{N} \times \mathbb{R} \to [0,\infty)$ of two
+variables, if we define $L(k,t) = \lambda_k(t)$.
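The two definitions above translate directly into a short Python sketch (illustrative only; the function names are ours, not the library's):

```python
def f(b, d, x):
    # the piecewise linear "tent" function f_{(b,d)} from the equation above
    if x <= b or x >= d:
        return 0.0
    return min(x - b, d - x)

def landscape(pairs, k, x):
    # lambda_k(x): the k-th largest of the values f_{(b_i,d_i)}(x); zero for k > m
    vals = sorted((f(b, d, x) for b, d in pairs), reverse=True)
    return vals[k - 1] if k <= len(vals) else 0.0
```

For the diagram $\{(0,4),(1,3)\}$ at $x=2$, the two tents reach heights $2$ and $1$, so $\lambda_1(2)=2$, $\lambda_2(2)=1$ and $\lambda_3(2)=0$.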
+
+The detailed description of the algorithms used to compute persistence landscapes can be found in~\cite{landscapes2}. Note that this implementation provides an exact representation of landscapes. That has many advantages, but also a few drawbacks. For instance, as discussed in~\cite{landscapes2}, the exact representation of a landscape may be of quadratic size with respect to the input persistence diagram. It may therefore happen that, for very large diagrams, using this representation is memory--prohibitive. In such a case, there are two possible ways to proceed:
+\begin{enumerate}
+\item Use the non-exact representation on a grid described in Section~\ref{sec:landscapes_on_grid}.
+\item Compute only a number of the initial nonzero landscapes. This option is available at the C++ level.
+\end{enumerate}
+
+
\section{Persistence Landscapes on a grid}
-Persistence landscapes were originally proposed by Bubenik
+\label{sec:landscapes_on_grid}
+This is an alternative representation of the persistence landscapes defined in Section~\ref{sec:persistence_landscapes}. Unlike in Section~\ref{sec:persistence_landscapes}, here we build a non--exact representation by sampling the persistence landscape on a finite, equally spaced grid of points. Since persistence landscapes computed from persistence diagrams have slope $1$ or $-1$, we have an estimate of the error made by such a sampling.
+
+Due to the lack of a rigorous description of the algorithms in the literature, we give a short discussion of them below. Persistence landscapes are here represented by a vector of vectors of real numbers. Assume that the $i$-th vector consists of $n_i$ numbers sorted from largest to smallest. They represent the values of the functions $\lambda_1,\ldots,\lambda_{n_i}$ ($\lambda_{n_i+1}$ and further are zero) at the $i$-th point of the grid.
+
+When averaging two persistence landscapes represented on a grid, we simply compute the point-wise averages of the entries of the corresponding vectors\footnote{If one vector is shorter than the other, we assume that they have the same length and that the remaining components are zero.}.
+
+Computation of the distance between two persistence landscapes on a grid is not much different from the exact case. Here too, we sum up the distances between the same levels of the corresponding landscapes. For a fixed level, we approximate the landscapes linearly between the corresponding grid points and compute the $l^p$ distance between them.
+
+Similarly to the case of distances, when computing the scalar product of two persistence landscapes on a grid, we sum up the scalar products of the corresponding levels of the landscapes. For each level, we assume that the persistence landscape on a grid is approximated by a linear function between two grid points. Therefore, to compute the scalar product of two corresponding levels of landscapes, we sum up the integrals of the products of the line segments over every pair of consecutive grid points.
+
+Note that for this representation we need to specify a few parameters:
+\begin{enumerate}
+\item The begin and end points of the grid (real numbers).
+
+\item The number of points in the grid.
+\end{enumerate}
+
+Note that the same representation is used in the TDA R-package~\cite{tda}.
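The grid operations described above can be sketched in Python (illustrative only, not the library's implementation; for simplicity both landscapes are assumed to be sampled on the same grid, and the $l^p$ distance uses a plain Riemann sum instead of the exact piecewise-linear integral):

```python
def average_grid(l1, l2):
    # point-wise average of the vectors of landscape values; the shorter
    # vector is padded with zeros, as in the footnote above
    out = []
    for v1, v2 in zip(l1, l2):
        n = max(len(v1), len(v2))
        a = v1 + [0.0] * (n - len(v1))
        b = v2 + [0.0] * (n - len(v2))
        out.append([(x + y) / 2 for x, y in zip(a, b)])
    return out

def level(land, k):
    # values of lambda_{k+1} on the grid (zero where a vector is too short)
    return [v[k] if k < len(v) else 0.0 for v in land]

def distance_grid(l1, l2, p=1.0, dx=1.0):
    # sum over levels of an approximate l^p distance (Riemann sum with grid
    # spacing dx, instead of the exact piecewise-linear integral)
    n_levels = max(len(v) for v in l1 + l2)
    total = 0.0
    for k in range(n_levels):
        a, b = level(l1, k), level(l2, k)
        total += (sum(abs(x - y) ** p for x, y in zip(a, b)) * dx) ** (1 / p)
    return total
```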
+
\section{Persistence heat maps}
-aaa
+This is a general class of discrete structures based on the idea of placing a Gaussian kernel at the points of a persistence diagram. This idea has been discovered and re-discovered by many authors over the last 15 years. As far as we know, it was first described in the work of the Bologna group in~\cite{bologna1} and~\cite{bologna2}. Later it was described by the Colorado State University group in~\cite{henry}. That paper provides, for the first time, a discussion of the stability of the representation. The same idea is also used in the construction of two recent kernels for machine learning:~\cite{yasu} and~\cite{uli}. Both kernel constructions use interesting ideas to ensure stability. In the kernel presented in~\cite{yasu}, the Gaussian kernel is multiplied by a scaling function so that points close to the diagonal do not have a big influence on the resulting total distribution. In~\cite{uli}, for every point $(b,d)$ two Gaussian kernels are added: the first with weight $1$ at the point $(b,d)$, and the second with weight $-1$ at the mirrored point $(d,b)$. In both cases, the representations are stable with respect to the 1-Wasserstein distance.
+
+In Gudhi\_stat we currently implement all the structures described above. The base of this implementation is a 2-dimensional array of pixels, where each pixel has a real value assigned to it.
+
+The parameters of the structure are as follows:
+\begin{enumerate}
+\item The size of the image (we always assume that the images are square).
+\item The standard deviation of the Gaussian kernel.
+\item The box $[x_0,x_1]\times [y_0,y_1]$ that contains the persistence image.
+\item The scaling function (each Gaussian kernel at a point $(p,q)$ gets multiplied by the value of this function at $(p,q)$).
+\item A boolean value determining whether the region below the diagonal should be erased or not.
+\end{enumerate}
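A minimal Python sketch of such a structure, parameterized as in the list above (illustrative only; the function name and the pixel-center convention are ours, not the library's):

```python
import math

def heat_map(diagram, size, box, sigma,
             weight=lambda b, d: 1.0, erase_below_diagonal=False):
    # size x size image over box = (x0, x1, y0, y1); every diagram point
    # (b, d) contributes a Gaussian kernel scaled by weight(b, d)
    x0, x1, y0, y1 = box
    img = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            # evaluate at the center of pixel (i, j)
            x = x0 + (x1 - x0) * (j + 0.5) / size
            y = y0 + (y1 - y0) * (i + 0.5) / size
            if erase_below_diagonal and y < x:
                continue  # the region below the diagonal stays zero
            for b, d in diagram:
                g = math.exp(-((x - b) ** 2 + (y - d) ** 2) / (2 * sigma ** 2))
                img[i][j] += weight(b, d) * g
    return img
```

Passing a non-constant `weight` reproduces the scaling-function variant of~\cite{yasu}; adding a second, negatively weighted kernel per point would give the variant of~\cite{uli}.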
+
\section{Persistence vectors}
-aaa
+This is a representation of persistent homology in the form of a vector, which was designed for applications in 3D graphics in~\cite{vectors}. Below we provide a short description of this representation.
+
+Given a persistence diagram $D = \{ (b_i,d_i) \}$, for every pair of birth--death points $(b_1,d_1)$ and $(b_2,d_2)$ we compute the following three distances:
+\begin{enumerate}
+\item $d( (b_1,d_1) , (b_2,d_2) )$.
+\item $d( (b_1,d_1) , (\frac{b_1+d_1}{2},\frac{b_1+d_1}{2}) )$.
+\item $d( (b_2,d_2) , (\frac{b_2+d_2}{2},\frac{b_2+d_2}{2}) )$.
+\end{enumerate}
+and we pick the smallest one. When this computation is done for every pair, we obtain a vector of numbers, which we sort in decreasing order. The resulting vector is the persistence vector representing the diagram.
+
+Given two persistence vectors, the computation of distances, averages and scalar products is straightforward. In each case we operate on the pairs of corresponding coordinates of the two vectors: for the distance we sum up the absolute values of the differences, for the average we take the coordinate-wise average, and for the scalar product we sum up the products of corresponding coordinates (the standard scalar product of vectors).
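The construction of the vector can be sketched in Python (illustrative only; the helper names are ours). The second and third distances in the list above are the distances from each point to its projection onto the diagonal:

```python
import math

def persistence_vector(diagram):
    # for every pair of diagram points, take the minimum of their mutual
    # distance and each point's distance to the diagonal, then sort decreasing
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def to_diag(p):
        # distance from p = (b, d) to its diagonal projection ((b+d)/2, (b+d)/2)
        m = (p[0] + p[1]) / 2
        return dist(p, (m, m))

    pts = list(diagram)
    vals = []
    for i in range(len(pts)):
        for j in range(i + 1, len(pts)):
            vals.append(min(dist(pts[i], pts[j]),
                            to_diag(pts[i]), to_diag(pts[j])))
    return sorted(vals, reverse=True)
```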
+
\begin{thebibliography}{99}
\bibitem{landscapes1} P. Bubenik, Statistical topological data analysis using persistence landscapes, Journal of Machine Learning Research, 16 (2015), 77–102.
\bibitem{landscapes2} P. Bubenik, P. D{\l}otko, A persistence landscapes toolbox for topological statistics. Journal of Symbolic Computation.
+\bibitem{tda} B. Fasy, J. Kim, F. Lecci, C. Maria, Introduction to the R package TDA, arXiv:1411.1830.
+
+\bibitem{bologna1} P. Donatini, P. Frosini, A. Lovato, Size functions for signature recognition, Proceedings of SPIE, Vision Geometry VII, vol. 3454
+(1998).
+
+\bibitem{bologna2} M. Ferri, P. Frosini, A. Lovato, C. Zambelli, Point selection: A new comparison scheme for size functions (With an application to monogram recognition), Proceedings Third Asian Conference on Computer Vision, Lecture Notes in Computer Science 1351.
+
+\bibitem{henry} H. Adams, S. Chepushtanova, T. Emerson, E. Hanson, M. Kirby, F. Motta, R. Neville, C. Peterson, P. Shipman, L. Ziegelmeier, Persistence Images: A Stable Vector Representation of Persistent Homology, arXiv:1507.06217.
+
+\bibitem{yasu} G. Kusano, K. Fukumizu, Y. Hiraoka, Persistence weighted Gaussian kernel for topological data analysis, arXiv:1601.01741.
+
+\bibitem{uli} J. Reininghaus, S. Huber, U. Bauer, R. Kwitt, A Stable Multi-Scale Kernel for Topological Machine Learning, arXiv:1412.6821.
+
+\bibitem{vectors} M. Carrière, S. Oudot, M. Ovsjanikov, Stable Topological Signatures for Points on 3D Shapes, Proc. Sympos. on Geometry Processing, July 2015.
+
\end{thebibliography}
\end{document} \ No newline at end of file