Data manipulation tools

dataTools.MI_dimRed(data, labels, n_feats, bins)

Reduces the dimensionality of a bag-of-words representation based on mutual information

Parameters:	data (numpy array)
	labels (numpy array of booleans)
	n_feats (int) – number of high-information words to yield
	bins (list) – discretization bins for probability computation
Returns:	reduced BoW representation
Return type:	N*n_feats numpy array

dataTools.argmax_MI(data, labels, n_feats, bins)

Returns the n_feats words that share the most information with the binary label

Parameters:	data (numpy array)
	labels (numpy array of booleans)
	n_feats (int) – number of high-information words to yield
	bins (list) – discretization bins for probability computation
Returns:	the n_feats words that share the most information with the binary label
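
The selection step shared by MI_dimRed and argmax_MI can be illustrated with a minimal sketch. It is not the dataTools implementation: the helper name top_k_columns is hypothetical, the per-word scores are made-up numbers standing in for mutual information values, and the data is assumed to have one row per document and one column per word.

```python
import numpy as np

def top_k_columns(data, scores, k):
    """Return the indices of the k highest-scoring columns and the reduced data."""
    data = np.asarray(data)
    scores = np.asarray(scores)
    # argsort is ascending, so take the last k indices and reverse them
    # to list the best-scoring column first.
    top_idx = np.argsort(scores)[-k:][::-1]
    return top_idx, data[:, top_idx]

# Toy bag-of-words matrix: 2 documents, 4 words.
data = np.array([[1, 0, 3, 2],
                 [0, 0, 1, 4]])
scores = np.array([0.10, 0.02, 0.50, 0.30])  # made-up per-word scores

top_idx, reduced = top_k_columns(data, scores, k=2)
print(top_idx)   # indices of the two highest-scoring words
print(reduced)   # the N x k reduced representation
```

Under these assumptions, the indices correspond to what argmax_MI is described as yielding, and the reduced matrix to the N*n_feats array returned by MI_dimRed.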

dataTools.data_normalization(data, offset_column=False)

Performs data normalization

Parameters:	data (numpy array)
	offset_column (boolean) – if True, a column of ones is appended to the data
Returns:	normalized data, optionally with an offset column
Return type:	pandas dataframe
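
The documentation does not say which normalization is applied. As a minimal sketch, the example below assumes per-column standardization (zero mean, unit variance) and returns a NumPy array rather than a pandas dataframe; the name standardize is hypothetical and this is not the dataTools implementation.

```python
import numpy as np

def standardize(data, offset_column=False):
    """Standardize each column; optionally append a column of ones."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    # Constant columns have zero spread; dividing by 1 leaves them at zero
    # instead of producing NaN.
    std[std == 0] = 1.0
    out = (data - mean) / std
    if offset_column:
        # Assumption: the offset is a trailing column of ones (a bias term).
        out = np.hstack([out, np.ones((out.shape[0], 1))])
    return out

X = np.array([[1.0, 10.0],
              [3.0, 10.0]])
print(standardize(X, offset_column=True))
```

An offset column of this kind is commonly used so that a linear model can learn an intercept without a separate bias parameter.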

dataTools.format_preds(preds)

Translates signed predictions (-1/1 or signed with amplitude for confidence) into 0/1 predictions
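
A minimal sketch of this kind of conversion, assuming that strictly positive values map to 1 and everything else (including an exact 0) maps to 0. The name to_binary is hypothetical and the tie-breaking rule is an assumption, not documented behavior.

```python
import numpy as np

def to_binary(preds):
    """Map signed predictions to 0/1 labels based on their sign."""
    preds = np.asarray(preds)
    # The amplitude (confidence) is discarded; only the sign is kept.
    return (preds > 0).astype(int)

print(to_binary([-1, 1, -0.3, 2.5]))
```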

dataTools.get_MI(data, labels, word_idx, bins)

Returns the mutual information between a word and a binary label

Parameters:	data (numpy array)
	labels (numpy array of booleans)
	word_idx (int) – index of the word to compute MI for. A table mapping word indices to words must be defined before this function is used
	bins (list) – discretization bins for probability computation
Returns:	mutual information between the word and the binary label
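
To make the role of bins concrete, here is a minimal sketch of a mutual information estimate between one discretized word-count column and a boolean label. It is an illustration under stated assumptions, not the dataTools code: the name mutual_information is hypothetical, counts are discretized with np.digitize, probabilities are plain empirical frequencies, and the result is in bits.

```python
import numpy as np

def mutual_information(counts, labels, bins):
    """Estimate I(X; Y) in bits between binned counts X and a boolean label Y."""
    counts = np.asarray(counts)
    labels = np.asarray(labels, dtype=bool)
    # Replace each count by the index of the bin it falls into.
    x = np.digitize(counts, bins)
    mi = 0.0
    for xv in np.unique(x):
        p_x = np.mean(x == xv)
        for yv in (False, True):
            p_y = np.mean(labels == yv)
            p_xy = np.mean((x == xv) & (labels == yv))
            if p_xy > 0:  # empty cells contribute nothing (0 * log 0 := 0)
                mi += p_xy * np.log2(p_xy / (p_x * p_y))
    return mi

labels = [False, False, True, True]
print(mutual_information([0, 0, 5, 5], labels, bins=[1]))  # word tracks the label
print(mutual_information([0, 5, 0, 5], labels, bins=[1]))  # word independent of label
```

With a data matrix laid out as documents by words, the first argument would be the column data[:, word_idx].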

dataTools.load_data(dsID, set_type='tr', folder_name='data')

Loads a dataset from a folder name and a dataset number

Parameters:	dsID (int) – the dataset number. Input data should be stored in files named like ‘Xk.csv’, where k = dsID
	set_type (str) – which set to load (default 'tr')
	folder_name (str) – folder where the data is stored
Returns:	data with index starting from 0
Return type:	pandas dataframe

Todo

Allow this function to take any file name as input, with a default naming convention

dataTools.voting(preds, wghts, stochastic=False)

Produces a label prediction from many predictors

Parameters:	preds (array of predictors) – each predictor is an array of N predictions
	wghts (float array) – confidence weights given to the respective predictors
	stochastic (boolean) – if True, the consensus prediction is drawn from a binomial distribution over the prediction votes
Returns:	N label predictions
Return type:	array
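
One plausible reading of this behavior is sketched below, under several assumptions that the documentation does not confirm: predictions are 0/1 labels, weights are non-negative with a positive sum, the deterministic consensus is a weighted majority, and the stochastic consensus is a single Bernoulli draw per sample with probability equal to the weighted share of votes for 1. The name weighted_vote and the rng parameter are hypothetical.

```python
import numpy as np

def weighted_vote(preds, wghts, stochastic=False, rng=None):
    """Combine 0/1 predictions from several predictors into one label per sample."""
    preds = np.asarray(preds, dtype=float)  # shape (n_predictors, N)
    wghts = np.asarray(wghts, dtype=float)  # shape (n_predictors,)
    # Weighted share of predictors voting 1 for each sample, in [0, 1].
    share = wghts @ preds / wghts.sum()
    if stochastic:
        rng = np.random.default_rng() if rng is None else rng
        # One binomial trial per sample, with success probability `share`.
        return rng.binomial(1, share)
    # Deterministic case: weighted majority (exact ties go to 0).
    return (share > 0.5).astype(int)

preds = [[1, 0, 1],
         [1, 1, 0],
         [0, 0, 1]]
print(weighted_vote(preds, wghts=[2, 1, 1]))
```

In the stochastic variant, samples on which the predictors disagree can receive different labels from one call to the next, while unanimous votes always yield the same label.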