package sklearn

val get_py : string -> Py.Object.t

Get an attribute of this module as a Py.Object.t. This is useful to pass a Python function to another function.
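For instance, a minimal sketch (the module path Sklearn.Manifold is an assumption based on this page's contents; Py.Object.to_string is pyml's standard stringifier)::

    let () =
      (* Fetch the raw Python callable behind this module's trustworthiness
         function; the result can be passed to any API expecting a Python
         function. Py.initialize () is assumed to have run already. *)
      let f = Sklearn.Manifold.get_py "trustworthiness" in
      print_endline (Py.Object.to_string f)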

module Isomap : sig ... end
module LocallyLinearEmbedding : sig ... end
module MDS : sig ... end
module SpectralEmbedding : sig ... end
module TSNE : sig ... end
val locally_linear_embedding : ?reg:float -> ?eigen_solver:[ `Auto | `Arpack | `Dense ] -> ?tol:float -> ?max_iter:int -> ?method_:[ `Standard | `Hessian | `Modified | `Ltsa ] -> ?hessian_tol:float -> ?modified_tol:float -> ?random_state:int -> ?n_jobs:int -> x:[ `NearestNeighbors of Py.Object.t | `Arr of [> `ArrayLike ] Np.Obj.t ] -> n_neighbors:int -> n_components:int -> unit -> [> `ArrayLike ] Np.Obj.t * float

Perform a Locally Linear Embedding analysis on the data.

Read more in the :ref:`User Guide <locally_linear_embedding>`.

Parameters ---------- X : array-like, NearestNeighbors Sample data, shape = (n_samples, n_features), in the form of a numpy array or a NearestNeighbors object.

n_neighbors : integer number of neighbors to consider for each point.

n_components : integer number of coordinates for the manifold.

reg : float regularization constant, multiplies the trace of the local covariance matrix of the distances.

eigen_solver : string, 'auto', 'arpack', 'dense'

auto : algorithm will attempt to choose the best method for input data

arpack : use arnoldi iteration in shift-invert mode. For this method, M may be a dense matrix, sparse matrix, or general linear operator. Warning: ARPACK can be unstable for some problems. It is best to try several random seeds in order to check results.

dense : use standard dense matrix operations for the eigenvalue decomposition. For this method, M must be an array or matrix type. This method should be avoided for large problems.

tol : float, optional Tolerance for 'arpack' method. Not used if eigen_solver == 'dense'.

max_iter : integer maximum number of iterations for the arpack solver.

method : 'standard', 'hessian', 'modified', 'ltsa'

standard : use the standard locally linear embedding algorithm. See reference [1]_.

hessian : use the Hessian eigenmap method. This method requires n_neighbors > n_components * (1 + (n_components + 1) / 2). See reference [2]_.

modified : use the modified locally linear embedding algorithm. See reference [3]_.

ltsa : use the local tangent space alignment algorithm. See reference [4]_.

hessian_tol : float, optional Tolerance for Hessian eigenmapping method. Only used if method == 'hessian'.

modified_tol : float, optional Tolerance for modified LLE method. Only used if method == 'modified'.

random_state : int, RandomState instance, default=None Determines the random number generator when ``solver`` == 'arpack'. Pass an int for reproducible results across multiple function calls. See :term:`Glossary <random_state>`.

n_jobs : int or None, optional (default=None) The number of parallel jobs to run for neighbors search. ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means using all processors. See :term:`Glossary <n_jobs>` for more details.

Returns ------- Y : array-like, shape [n_samples, n_components] Embedding vectors.

squared_error : float Reconstruction error for the embedding vectors. Equivalent to ``norm(Y - W Y, 'fro')**2``, where W are the reconstruction weights.

References ----------

.. [1] Roweis, S. & Saul, L. Nonlinear dimensionality reduction by locally linear embedding. Science 290:2323 (2000).

.. [2] Donoho, D. & Grimes, C. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proc Natl Acad Sci U S A. 100:5591 (2003).

.. [3] Zhang, Z. & Wang, J. MLLE: Modified Locally Linear Embedding Using Multiple Weights. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.70.382

.. [4] Zhang, Z. & Zha, H. Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. Journal of Shanghai Univ. 8:406 (2004)
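Examples -------- A minimal usage sketch, assuming ocaml-sklearn's Np.matrixf array constructor and the Sklearn.Manifold module path (only the signature above is taken from this page)::

    let () =
      (* Four samples in 2-D, embedded into 1-D. Np.matrixf is assumed to
         build a float matrix usable as [> `ArrayLike ] Np.Obj.t. *)
      let x = Np.matrixf [| [|0.; 0.|]; [|1.; 1.|]; [|2.; 2.|]; [|3.; 3.5|] |] in
      let embedding, err =
        Sklearn.Manifold.locally_linear_embedding
          ~x:(`Arr x) ~n_neighbors:3 ~n_components:1 ()
      in
      ignore embedding;
      Printf.printf "squared_error: %g\n" err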

val smacof : ?metric:bool -> ?n_components:int -> ?init:[> `ArrayLike ] Np.Obj.t -> ?n_init:int -> ?n_jobs:int -> ?max_iter:int -> ?verbose:int -> ?eps:float -> ?random_state:int -> ?return_n_iter:bool -> dissimilarities:[> `ArrayLike ] Np.Obj.t -> unit -> [> `ArrayLike ] Np.Obj.t * float * int

Computes multidimensional scaling using the SMACOF algorithm.

The SMACOF (Scaling by MAjorizing a COmplicated Function) algorithm is a multidimensional scaling algorithm which minimizes an objective function (the *stress* ) using a majorization technique. Stress majorization, also known as the Guttman Transform, guarantees a monotone convergence of stress, and is more powerful than traditional techniques such as gradient descent.

The SMACOF algorithm for metric MDS can be summarized by the following steps:

1. Set an initial start configuration, randomly or not.
2. Compute the stress (see the sketch after this list).
3. Compute the Guttman Transform.
4. Iterate 2 and 3 until convergence.

The nonmetric algorithm adds a monotonic regression step before computing the stress.
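To make step 2 concrete, the raw metric stress can be written out in plain OCaml over float arrays (an illustrative sketch, not the library's implementation)::

    (* Raw stress: sum of squared differences between the input
       dissimilarities d and the embedding's pairwise distances dist,
       taken over each unordered pair of points. *)
    let stress (d : float array array) (dist : float array array) : float =
      let n = Array.length d in
      let s = ref 0. in
      for i = 0 to n - 1 do
        for j = i + 1 to n - 1 do
          let diff = d.(i).(j) -. dist.(i).(j) in
          s := !s +. (diff *. diff)
        done
      done;
      !s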

Parameters ---------- dissimilarities : ndarray, shape (n_samples, n_samples) Pairwise dissimilarities between the points. Must be symmetric.

metric : boolean, optional, default: True Compute metric or nonmetric SMACOF algorithm.

n_components : int, optional, default: 2 Number of dimensions in which to immerse the dissimilarities. If an ``init`` array is provided, this option is overridden and the shape of ``init`` is used to determine the dimensionality of the embedding space.

init : ndarray, shape (n_samples, n_components), optional, default: None Starting configuration of the embedding to initialize the algorithm. By default, the algorithm is initialized with a randomly chosen array.

n_init : int, optional, default: 8 Number of times the SMACOF algorithm will be run with different initializations. The final results will be the best output of the runs, determined by the run with the smallest final stress. If ``init`` is provided, this option is overridden and a single run is performed.

n_jobs : int or None, optional (default=None) The number of jobs to use for the computation. If multiple initializations are used (``n_init``), each run of the algorithm is computed in parallel.

``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means using all processors. See :term:`Glossary <n_jobs>` for more details.

max_iter : int, optional, default: 300 Maximum number of iterations of the SMACOF algorithm for a single run.

verbose : int, optional, default: 0 Level of verbosity.

eps : float, optional, default: 1e-3 Relative tolerance with respect to stress at which to declare convergence.

random_state : int, RandomState instance, default=None Determines the random number generator used to initialize the centers. Pass an int for reproducible results across multiple function calls. See :term:`Glossary <random_state>`.

return_n_iter : bool, optional, default: False Whether or not to return the number of iterations.

Returns ------- X : ndarray, shape (n_samples, n_components) Coordinates of the points in a ``n_components``-space.

stress : float The final value of the stress (sum of squared distance of the disparities and the distances for all constrained points).

n_iter : int The number of iterations corresponding to the best stress. Returned only if ``return_n_iter`` is set to ``True``.

Notes ----- 'Modern Multidimensional Scaling - Theory and Applications' Borg, I.; Groenen P. Springer Series in Statistics (1997)

'Nonmetric multidimensional scaling: a numerical method' Kruskal, J. Psychometrika, 29 (1964)

'Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis' Kruskal, J. Psychometrika, 29, (1964)
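Examples -------- A minimal usage sketch (Np.matrixf and the Sklearn.Manifold path are assumptions; the dissimilarity values are made up for illustration)::

    let () =
      (* A symmetric 4x4 dissimilarity matrix, as required above. *)
      let d = Np.matrixf [|
        [|0.; 5.; 3.; 4.|];
        [|5.; 0.; 2.; 2.|];
        [|3.; 2.; 0.; 1.|];
        [|4.; 2.; 1.; 0.|];
      |] in
      let x, stress, n_iter =
        Sklearn.Manifold.smacof
          ~n_components:2 ~random_state:0 ~return_n_iter:true
          ~dissimilarities:d ()
      in
      ignore x;
      Printf.printf "stress: %g after %d iterations\n" stress n_iter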

val spectral_embedding : ?n_components:int -> ?eigen_solver:[ `Arpack | `PyObject of Py.Object.t | `Lobpcg ] -> ?random_state:int -> ?eigen_tol:float -> ?norm_laplacian:bool -> ?drop_first:bool -> adjacency:[ `Sparse_graph of Py.Object.t | `Arr of [> `ArrayLike ] Np.Obj.t ] -> unit -> [> `ArrayLike ] Np.Obj.t

Project the sample on the first eigenvectors of the graph Laplacian.

The adjacency matrix is used to compute a normalized graph Laplacian whose spectrum (especially the eigenvectors associated to the smallest eigenvalues) has an interpretation in terms of minimal number of cuts necessary to split the graph into comparably sized components.

This embedding can also 'work' even if the ``adjacency`` variable is not strictly the adjacency matrix of a graph but more generally an affinity or similarity matrix between samples (for instance the heat kernel of a euclidean distance matrix or a k-NN matrix).

However, care must be taken to always make the affinity matrix symmetric so that the eigenvector decomposition works as expected.

Note : Laplacian Eigenmaps is the actual algorithm implemented here.

Read more in the :ref:`User Guide <spectral_embedding>`.

Parameters ---------- adjacency : array-like or sparse graph, shape: (n_samples, n_samples) The adjacency matrix of the graph to embed.

n_components : integer, optional, default 8 The dimension of the projection subspace.

eigen_solver : None, 'arpack', 'lobpcg', or 'amg', default None The eigenvalue decomposition strategy to use. AMG requires pyamg to be installed. It can be faster on very large, sparse problems, but may also lead to instabilities.

random_state : int, RandomState instance, default=None Determines the random number generator used for the initialization of the lobpcg eigenvectors decomposition when ``solver`` == 'amg'. Pass an int for reproducible results across multiple function calls. See :term:`Glossary <random_state>`.

eigen_tol : float, optional, default=0.0 Stopping criterion for eigendecomposition of the Laplacian matrix when using arpack eigen_solver.

norm_laplacian : bool, optional, default=True If True, then compute normalized Laplacian.

drop_first : bool, optional, default=True Whether to drop the first eigenvector. For spectral embedding, this should be True as the first eigenvector should be constant vector for connected graph, but for spectral clustering, this should be kept as False to retain the first eigenvector.

Returns ------- embedding : array, shape=(n_samples, n_components) The reduced samples.

Notes ----- Spectral Embedding (Laplacian Eigenmaps) is most useful when the graph has one connected component. If the graph has many components, the first few eigenvectors will simply uncover the connected components of the graph.

References ---------- * https://en.wikipedia.org/wiki/LOBPCG

* Toward the Optimal Preconditioned Eigensolver: Locally Optimal Block Preconditioned Conjugate Gradient Method Andrew V. Knyazev https://doi.org/10.1137%2FS1064827500366124
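Examples -------- A minimal usage sketch (Np.matrixf and the Sklearn.Manifold path are assumptions; the adjacency values are made up for illustration)::

    let () =
      (* Adjacency of a tiny graph: two tightly coupled pairs of nodes,
         weakly linked to each other. The matrix is kept symmetric, as
         required above. *)
      let a = Np.matrixf [|
        [|0.;  1.;  0.1; 0. |];
        [|1.;  0.;  0.;  0.1|];
        [|0.1; 0.;  0.;  1. |];
        [|0.;  0.1; 1.;  0. |];
      |] in
      let embedding =
        Sklearn.Manifold.spectral_embedding
          ~n_components:2 ~random_state:0 ~adjacency:(`Arr a) ()
      in
      (* embedding has shape (4, 2); inspect it however the Np API allows. *)
      ignore embedding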

val trustworthiness : ?n_neighbors:int -> ?metric:[ `S of string | `Callable of Py.Object.t ] -> x:[> `ArrayLike ] Np.Obj.t -> x_embedded:[> `ArrayLike ] Np.Obj.t -> unit -> float

Expresses to what extent the local structure is retained.

The trustworthiness is within [0, 1]. It is defined as

.. math::

    T(k) = 1 - \frac{2}{nk (2n - 3k - 1)} \sum^n_{i=1}
        \sum_{j \in \mathcal{N}_{i}^{k}} \max(0, (r(i, j) - k))

where for each sample i, :math:`\mathcal{N}_{i}^{k}` are its k nearest neighbors in the output space, and every sample j is its :math:`r(i, j)`-th nearest neighbor in the input space. In other words, any unexpected nearest neighbors in the output space are penalised in proportion to their rank in the input space.

* 'Neighborhood Preservation in Nonlinear Projection Methods: An Experimental Study' J. Venna, S. Kaski

* 'Learning a Parametric Embedding by Preserving Local Structure' L.J.P. van der Maaten

Parameters ---------- X : array, shape (n_samples, n_features) or (n_samples, n_samples) If the metric is 'precomputed' X must be a square distance matrix. Otherwise it contains a sample per row.

X_embedded : array, shape (n_samples, n_components) Embedding of the training data in low-dimensional space.

n_neighbors : int, optional (default: 5) Number of neighbors k that will be considered.

metric : string, or callable, optional, default 'euclidean' Which metric to use for computing pairwise distances between samples from the original input space. If metric is 'precomputed', X must be a matrix of pairwise distances or squared distances. Otherwise, see the documentation of argument metric in sklearn.pairwise.pairwise_distances for a list of available metrics.

.. versionadded:: 0.20

Returns ------- trustworthiness : float Trustworthiness of the low-dimensional embedding.
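Examples -------- A minimal usage sketch (Np.matrixf and the Sklearn.Manifold path are assumptions). Since the 1-D embedding below preserves the neighbor ranking of the input exactly, the score should be 1.0::

    let () =
      let x = Np.matrixf [| [|0.; 0.|]; [|1.; 0.|]; [|2.; 0.|]; [|3.; 0.|] |] in
      let x_embedded = Np.matrixf [| [|0.|]; [|1.|]; [|2.|]; [|3.|] |] in
      let t =
        Sklearn.Manifold.trustworthiness ~n_neighbors:2 ~x ~x_embedded ()
      in
      Printf.printf "trustworthiness: %g\n" t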