Machine-learning methods - Documentation

Supervised methods

class supervised.kernelKNN(k)

K-nearest-neighbor classifier that accepts any kernel

train(data, labels, **kwargs)

Trains the classifier on data and labels, using the kernel function kernel_fct.

Parameters:

data (N*p numpy array) – training data, one row per sample
labels (N*1 numpy array) – training labels, one per sample
kernel_fct – optional, a function used to compute a kernel matrix from the input data
solver – optional, a numerical solver suited to the task at hand
stringsData (boolean) – whether the input data are strings
kwargs – additional keyword arguments, for instance those to be passed on to the solver or the kernel function
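
The internals of kernelKNN are not shown on this page. As a minimal sketch of how a nearest-neighbor vote can be driven by a kernel alone, the snippet below uses the standard kernel-induced distance d(x, y)^2 = k(x, x) - 2 k(x, y) + k(y, y). The names rbf_kernel and kernel_knn_predict are hypothetical and not part of the package, and labels are assumed to be a flat array.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and the rows of Y."""
    sq = (X ** 2).sum(axis=1)[:, None] - 2.0 * X @ Y.T + (Y ** 2).sum(axis=1)[None, :]
    return np.exp(-gamma * np.maximum(sq, 0.0))

def kernel_knn_predict(X_train, y_train, X_test, k, kernel=rbf_kernel):
    """Majority vote among the k training points closest in kernel-induced distance."""
    K_cross = kernel(X_test, X_train)                 # shape (n_test, n_train)
    k_test = np.diag(kernel(X_test, X_test))          # k(x, x) for each test point
    k_train = np.diag(kernel(X_train, X_train))       # k(y, y) for each training point
    d2 = k_test[:, None] - 2.0 * K_cross + k_train[None, :]
    nearest = np.argsort(d2, axis=1)[:, :k]           # indices of the k nearest neighbors
    predictions = []
    for row in nearest:
        values, counts = np.unique(y_train[row], return_counts=True)
        predictions.append(values[np.argmax(counts)])
    return np.array(predictions)
```

Any function returning a valid kernel matrix can be substituted for rbf_kernel, which is the point of a kernel_fct argument.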

class supervised.kernelLogisticRegression(lbda=0.1)

Logistic regression classifier that accepts any kernel

format_labels(labels): Returns the labels in the format expected by an evaluation metric

predict(data, **kwargs): Predicts labels for data

train(data, labels, max_iter=10000, cvg_threshold=0.0001, **kwargs): Trains the kernel logistic regression on data and labels

class supervised.kernelSVM(lbda=0.1, solver='cvxopt')

SVM classifier that accepts any kernel

format_labels(labels): Transforms any binary system of labels into the equivalent +1/-1 system
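
As an illustration of the +1/-1 conversion, here is a minimal sketch. The function name to_plus_minus_one is hypothetical, and the choice of which class maps to -1 (here, the smaller of the two distinct values) is an assumption, since this page does not say how the package decides.

```python
import numpy as np

def to_plus_minus_one(labels):
    """Map a two-valued label array onto {-1, +1}.

    Assumption: the smaller of the two distinct values is mapped to -1.
    """
    labels = np.asarray(labels).ravel()
    classes = np.unique(labels)          # sorted distinct values
    if classes.size != 2:
        raise ValueError("expected exactly two distinct labels")
    return np.where(labels == classes[1], 1, -1)
```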

predict(data, **kwargs): Predicts labels for data

train(data, labels, **kwargs): Trains the kernel SVM on data and labels

Unsupervised methods

class unsupervised.GaussianMixture(K, d, pi, mu, sigma, isotropic=True)

Fits a Gaussian mixture to a dataset

K: int – number of clusters

d: int – dimension

isotropic: boolean – True forces clusters to be spherical. Defaults to True

pi: np.array, Kx1 – current estimate of the cluster (mixing) probabilities. Initialized by train()

mu: np.array, Kxd – current estimate of the cluster means (first-order moments). Initialized by train()

sigma: np.array, KxK – current estimate of the covariance matrix. Initialized by train()

n: int – number of data points in the linked dataset

draw(data, predictions, size=40, scale=0.8, eps=0.1)

Draws the data points and centroids and delineates the covariance matrices

Parameters:

data (np.array) – input dataset. This is meant to be the dataset bound to this instance
predictions (np.array) – array of cluster assignments
size (int) – marker size
scale (float) – to be refactored. Do not use at the moment
eps (float) –

Todo

1, High: Refactor the scale parameter

predict(X)

Predicts cluster assignments from a dataset

Parameters:	X (np.array) – nxd input dataset
Returns:	predictions – cluster assignment, one per data point
Return type:	np.array
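
One common way to turn fitted mixture parameters into cluster assignments is to pick, for each point, the component with the largest posterior probability. The sketch below does this for the isotropic case. It is an assumption about the approach, not the package's code: assign_clusters is a hypothetical name, and var is assumed to hold one variance per cluster.

```python
import numpy as np

def assign_clusters(X, pi, mu, var):
    """Hard assignment: index of the most probable component for each row of X.

    X is n x d, pi has K mixing weights, mu is K x d, and var holds
    one variance per cluster (isotropic covariance var[k] * I).
    """
    d = X.shape[1]
    # Squared distance from every point to every mean, shape (n, K)
    sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    # Unnormalized log-posterior: log pi_k + log N(x | mu_k, var_k * I)
    log_post = np.log(pi) - 0.5 * d * np.log(2.0 * np.pi * var) - 0.5 * sq / var
    return np.argmax(log_post, axis=1)
```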

printResults(log_likelihoods): Prints the learned parameters after training and the evolution of the partial log-likelihood over the iterations

Todo

This does not fit well into the package philosophy. To be refactored

print_log_likelihood(X): Prints the average value of the (partial) log-likelihood

Todo

This does not fit well into the package philosophy. To be refactored

train(X, eps=0.0001, max_iter=10000, verbose=True)

EM algorithm for estimating Gaussian mixture parameters

Parameters:

X (np.array) – nxd dataset
eps (float) – stopping threshold on the change in log-likelihood
max_iter (int) – maximum number of iterations
verbose (boolean) – defaults to True
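
To make the roles of eps and max_iter concrete, here is a minimal sketch of EM for a mixture of isotropic Gaussians. It is not the package's implementation: the function name em_isotropic_gmm, the random initialization, the seed argument and the one-variance-per-cluster parameterization are all assumptions made for illustration.

```python
import numpy as np

def em_isotropic_gmm(X, K, eps=1e-4, max_iter=10000, seed=0):
    """EM for K isotropic Gaussians; returns (pi, mu, var, log_likelihoods)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(K, 1.0 / K)                                   # mixing weights
    mu = X[rng.choice(n, size=K, replace=False)].astype(float) # means picked from the data
    var = np.full(K, X.var())                                  # one variance per cluster
    log_likelihoods = []
    for _ in range(max_iter):
        # E-step: log(pi_k * N(x_i | mu_k, var_k * I)), shape (n, K)
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        log_p = np.log(pi) - 0.5 * d * np.log(2.0 * np.pi * var) - 0.5 * sq / var
        # Log-sum-exp over components gives the per-point log-likelihood
        m = log_p.max(axis=1, keepdims=True)
        log_norm = m + np.log(np.exp(log_p - m).sum(axis=1, keepdims=True))
        resp = np.exp(log_p - log_norm)                        # responsibilities
        log_likelihoods.append(float(log_norm.sum()))
        # Stop when the log-likelihood changes by less than eps
        if len(log_likelihoods) > 1 and abs(log_likelihoods[-1] - log_likelihoods[-2]) < eps:
            break
        # M-step: re-estimate weights, means and variances
        nk = resp.sum(axis=0)
        pi = nk / n
        mu = (resp.T @ X) / nk[:, None]
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        var = np.maximum((resp * sq).sum(axis=0) / (d * nk), 1e-12)  # floor avoids division by zero
    return pi, mu, var, log_likelihoods
```

The recorded log-likelihood should never decrease from one iteration to the next, which makes it a convenient sanity check when debugging an EM implementation.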

class unsupervised.Kmeans(data, nClass, ind=0, init='def')

Standard K-means method

ind: int – instance ID

data: np.array – the data bound to this instance

N: int – number of records

d: int – dimension

K: int – number of clusters

centroids: np.array – current centroids. Initialized only by run()

assignment: np.array – maps every data point to a cluster ID

Todo

1, Low: Generalize this class to arbitrary kernels, with the kernel bound to the instance at initialization

2, Low: Allow the data to be reset

draw(size=40, scale=0.8)

Draws the current clustering in a new figure

Parameters:

size (int) – marker size. Defaults to 40
scale (float) – needs rework. Do not use right now

Todo

Change scale integration

run(): Computes centroid positions and cluster assignments until convergence
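
The iteration that run() describes can be sketched with the classic Lloyd algorithm: assign each point to its nearest centroid, recompute the centroids, and stop once the assignments no longer change. The function below is a standalone illustration, not the class's code. The name lloyd_kmeans, the random initialization and the max_iter safeguard are assumptions.

```python
import numpy as np

def lloyd_kmeans(data, K, max_iter=100, seed=0):
    """Alternate assignment and centroid updates until assignments stop changing."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with K distinct data points
    centroids = data[rng.choice(len(data), size=K, replace=False)].astype(float)
    assignment = np.full(len(data), -1)
    for _ in range(max_iter):
        # Squared Euclidean distance from every point to every centroid, shape (N, K)
        dist2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_assignment = dist2.argmin(axis=1)
        if np.array_equal(new_assignment, assignment):
            break                                   # converged
        assignment = new_assignment
        for k in range(K):
            members = data[assignment == k]
            if len(members) > 0:                    # keep the old centroid if a cluster empties
                centroids[k] = members.mean(axis=0)
    return centroids, assignment
```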

stats(): Computes the intra-cluster distortion, the inter-cluster distortion and a quality estimate for the clustering
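
This page does not define the quantities returned by stats(). One plausible reading, shown here purely as an assumption, takes the intra-cluster distortion to be the within-cluster sum of squared distances to the centroids and the inter-cluster distortion to be the size-weighted sum of squared distances between the centroids and the global mean. The function name distortions is hypothetical.

```python
import numpy as np

def distortions(data, centroids, assignment):
    """Return (within, between) sums of squared distances for a clustering.

    Assumed definitions: within is the sum of squared distances from each point
    to its assigned centroid; between is the sum over clusters of the cluster
    size times the squared distance from its centroid to the global mean.
    """
    within = ((data - centroids[assignment]) ** 2).sum()
    counts = np.bincount(assignment, minlength=len(centroids))
    between = (counts * ((centroids - data.mean(axis=0)) ** 2).sum(axis=1)).sum()
    return float(within), float(between)
```

When the centroids are the cluster means, the two values add up to the total sum of squares around the global mean, so their ratio gives a simple quality indicator.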