Machine-learning methods - Documentation

Supervised methods

class supervised.kernelKNN(k)[source]

K-nearest neighbor instance, allowing for any kernel

train(data, labels, **kwargs)[source]

Trains the classifier from data, labels and, optionally, a kernel function (kernel_fct).

Parameters:
  • data (N*p numpy array) – training samples, one per row
  • labels (N*1 numpy array) – training labels, one per sample
  • kernel_fct – optional, a function used to compute a kernel matrix from the input data
  • solver – optional, a numerical solver adapted to the task at hand
  • stringsData (boolean) – whether the input data consists of strings
  • kwargs – additional keyword arguments, for instance that should be provided to the solver or the kernel function
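
To illustrate how a kernel function can drive a nearest-neighbor classifier, here is a minimal sketch. It is not the package's implementation: rbf_kernel and kernel_knn_predict are hypothetical helpers, and the sketch assumes that kernel_fct takes two data arrays and returns their kernel matrix. Neighbors are ranked by the kernel-induced squared distance k(x, x) - 2 k(x, y) + k(y, y).

```python
import numpy as np


def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dists = (
        np.sum(A ** 2, axis=1)[:, None]
        - 2.0 * A @ B.T
        + np.sum(B ** 2, axis=1)[None, :]
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def kernel_knn_predict(train_data, train_labels, test_data, k, kernel_fct, **kwargs):
    """Hypothetical helper: majority vote among the k nearest training points,
    where distances are measured in the feature space induced by kernel_fct."""
    train_labels = np.ravel(train_labels)  # accept N*1 or flat label arrays
    k_cross = kernel_fct(test_data, train_data, **kwargs)            # (M, N)
    k_train = np.diag(kernel_fct(train_data, train_data, **kwargs))  # (N,)
    k_test = np.diag(kernel_fct(test_data, test_data, **kwargs))     # (M,)
    # Squared feature-space distance: k(x, x) - 2 k(x, y) + k(y, y).
    sq_dists = k_test[:, None] - 2.0 * k_cross + k_train[None, :]
    neighbors = np.argsort(sq_dists, axis=1)[:, :k]
    predictions = []
    for row in train_labels[neighbors]:
        values, counts = np.unique(row, return_counts=True)
        predictions.append(values[np.argmax(counts)])
    return np.array(predictions)
```

Extra keyword arguments (here gamma) are forwarded to the kernel function, mirroring the role of kwargs described above.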

class supervised.kernelLogisticRegression(lbda=0.1)[source]

Logistic regression instance, allowing for any kernel

format_labels(labels)[source]

Returns labels formatted for performance evaluation by a metric

predict(data, **kwargs)[source]

Predict labels for data

train(data, labels, max_iter=10000, cvg_threshold=0.0001, **kwargs)[source]

Trains the kernel Logistic Regression on data and labels

class supervised.kernelSVM(lbda=0.1, solver='cvxopt')[source]

SVM instance, allowing for any kernel

format_labels(labels)[source]

Transforms any binary labeling into the equivalent +1/-1 labeling
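
As an illustration of such a label conversion, here is a minimal sketch. The helper name to_plus_minus_one is hypothetical, and the choice of which class maps to +1 (the larger of the two sorted values) is an assumption, not necessarily the convention used by format_labels.

```python
import numpy as np


def to_plus_minus_one(labels):
    """Hypothetical helper: map a two-class label array to -1/+1.

    Assumption: the smaller of the two sorted class values becomes -1
    and the larger becomes +1.
    """
    labels = np.ravel(labels)
    classes = np.unique(labels)  # sorted unique class values
    if classes.size != 2:
        raise ValueError("expected exactly two distinct labels")
    return np.where(labels == classes[1], 1, -1)
```

For example, labels given as 0/1 or as two distinct strings are both mapped to the same -1/+1 system.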

predict(data, **kwargs)[source]

Predict labels for data

train(data, labels, **kwargs)[source]

Trains the kernel SVM on data and labels

Unsupervised methods

class unsupervised.GaussianMixture(K, d, pi, mu, sigma, isotropic=True)[source]

Instance for fitting Gaussian mixtures to a dataset

K

int – number of clusters

d

int – dimension

isotropic

boolean – True forces clusters to be spherical. Defaults to True

pi

np.array, Kx1 – current estimate of the class variable probabilities. Initialized at train()

mu

np.array, Kxd – current estimate of the cluster first-order moments (means). Initialized at train()

sigma

np.array, KxK – current estimate of the covariance matrix. Initialized at train()

n

int – number of data points for linked dataset

draw(data, predictions, size=40, scale=0.8, eps=0.1)[source]

Plots the data points and centroids, and outlines the covariance matrices

Parameters:
  • data (np.array) – input dataset. This is meant to be the dataset bound to this instance
  • predictions (np.array) – array of cluster assignments
  • size (int) – marker size
  • scale (float) – to be refactored. Do not use at the moment
  • eps (float) –

Todo

1, High: Refactor the scale parameter

predict(X)[source]

Predicts cluster assignments from a dataset

Parameters: X (np.array) – nxd input dataset
Returns: predictions – cluster assignment, one per data point
Return type: np.array

printResults(log_likelihoods)[source]

Prints the learned parameters after training and the evolution of the partial log-likelihood over the iterations

Todo

This does not fit well into the package philosophy. To be refactored

print_log_likelihood(X)[source]

Prints the average value of the (partial) log-likelihood

Todo

This does not fit well into the package philosophy. To be refactored

train(X, eps=0.0001, max_iter=10000, verbose=True)[source]

EM algorithm for estimating Gaussian mixture parameters

Parameters:
  • X (np.array) – nxd dataset
  • eps (float) – stop threshold on change in log-likelihood
  • max_iter (int) – maximum number of iterations
  • verbose (boolean) – defaults to True
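
For readers unfamiliar with EM, the following is a minimal sketch of the algorithm for the isotropic (spherical) case. It is not the package's code: em_isotropic_gmm and its mu_init and seed arguments are hypothetical, and it assumes one scalar variance per cluster. The stopping rule follows the eps and max_iter semantics listed above.

```python
import numpy as np


def em_isotropic_gmm(X, K, eps=1e-4, max_iter=10000, mu_init=None, seed=0):
    """Hypothetical sketch of EM for a mixture of K spherical Gaussians."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(K, 1.0 / K)
    if mu_init is None:
        # Assumption: initialize the means at K distinct data points.
        mu = X[rng.choice(n, size=K, replace=False)].astype(float)
    else:
        mu = np.array(mu_init, dtype=float)
    var = np.full(K, X.var())  # one scalar variance per cluster
    prev_ll = -np.inf
    log_likelihoods = []
    for _ in range(max_iter):
        # E-step: log of pi_k * N(x_i | mu_k, var_k * I) for every pair (i, k).
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        log_p = np.log(pi) - 0.5 * d * np.log(2.0 * np.pi * var) - 0.5 * sq / var
        # Log-sum-exp over clusters, computed stably.
        m = log_p.max(axis=1, keepdims=True)
        log_norm = m + np.log(np.exp(log_p - m).sum(axis=1, keepdims=True))
        ll = float(log_norm.sum())
        log_likelihoods.append(ll)
        if abs(ll - prev_ll) < eps:  # stop on a small change in log-likelihood
            break
        prev_ll = ll
        resp = np.exp(log_p - log_norm)  # responsibilities, shape (n, K)
        # M-step: re-estimate weights, means and variances.
        nk = resp.sum(axis=0) + 1e-12  # small floor avoids division by zero
        pi = nk / nk.sum()
        mu = (resp.T @ X) / nk[:, None]
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        var = np.maximum((resp * sq).sum(axis=0) / (d * nk), 1e-12)
    return pi, mu, var, log_likelihoods
```

The returned list of log-likelihoods should be non-decreasing from one iteration to the next, which is a convenient sanity check when debugging an EM implementation.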

class unsupervised.Kmeans(data, nClass, ind=0, init='def')[source]

Standard K-means method

ind

int – instance ID

data

np.array – the data bound to this instance.

N

int – number of records

d

int – dimension

K

int – number of clusters

centroids

np.array – current centroids. Only initialized at run

assignment

np.array – maps every data point to a cluster ID

Todo

1, Low: Generalize this class to a kernel. Have the kernel bound to the instance from the initialization

2, Low: Allow for the data to be reset

draw(size=40, scale=0.8)[source]

Draws the current clustering in a new figure

Parameters:
  • size (int) – marker size. Defaults to 40
  • scale (float) – needs rework. Do not use right now

Todo

Change scale integration

run()[source]

Computes centroid positions and cluster assignments until convergence
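
The loop described here is typically Lloyd's algorithm: alternate between assigning points to the nearest centroid and recomputing centroids, until the assignments stop changing. The sketch below is an illustration under that assumption. lloyd_kmeans is a hypothetical standalone function, not the class method, and its random initialization from data points is an assumption about what init might do.

```python
import numpy as np


def lloyd_kmeans(data, n_clusters, max_iter=100, seed=0):
    """Hypothetical sketch of Lloyd's algorithm with Euclidean distance."""
    rng = np.random.default_rng(seed)
    # Assumption: initialize the centroids at distinct random data points.
    start = rng.choice(len(data), size=n_clusters, replace=False)
    centroids = data[start].astype(float)
    assignment = np.full(len(data), -1)
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every point.
        dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_assignment = dists.argmin(axis=1)
        if np.array_equal(new_assignment, assignment):
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for k in range(n_clusters):
            members = data[assignment == k]
            if len(members) > 0:  # leave an empty cluster's centroid in place
                centroids[k] = members.mean(axis=0)
    return centroids, assignment
```

Lloyd's algorithm only reaches a local optimum, so results can depend on the initialization.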

stats()[source]

Computes the intra-cluster distortion, the inter-cluster distortion, and a quality estimate for the clustering
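
The exact definitions used by stats() are not given here. As a hedged illustration, the sketch below assumes the common sum-of-squares definitions: intra-cluster distortion is the total squared distance of points to their assigned centroid, and inter-cluster distortion is the size-weighted squared distance of centroids to the global mean. The function name distortions is hypothetical.

```python
import numpy as np


def distortions(data, centroids, assignment):
    """Hypothetical helper returning (intra, inter) sum-of-squares distortions."""
    # Intra-cluster: squared distance of every point to its own centroid.
    intra = ((data - centroids[assignment]) ** 2).sum()
    # Inter-cluster: squared distance of each centroid to the global mean,
    # weighted by the number of points in the cluster.
    global_mean = data.mean(axis=0)
    sizes = np.bincount(assignment, minlength=len(centroids))
    inter = (sizes * ((centroids - global_mean) ** 2).sum(axis=1)).sum()
    return float(intra), float(inter)
```

When each centroid is the mean of its cluster, intra + inter equals the total squared distance of the data to the global mean, so the ratio inter / (intra + inter) is one possible quality estimate between 0 and 1.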