Data manipulation tools

dataTools.MI_dimRed(data, labels, n_feats, bins)

Reduces the dimensionality of a bag-of-words representation based on mutual information

Parameters:
  • data (numpy array) – bag-of-words representation, one row per sample
  • labels (numpy array of booleans) – binary label of each sample
  • n_feats (int) – number of high-information words to keep
  • bins (list) – discretization bins for probability computation
Returns:

reduced BoW representation

Return type:

numpy array of shape (N, n_feats)
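As a rough illustration of what the reduction amounts to, the sketch below keeps only a chosen subset of columns of a toy bag-of-words matrix. The matrix and the `keep_idx` indices are hypothetical. In practice the indices would come from a mutual-information ranking, and this is not the library's actual code.

```python
import numpy as np

# Toy bag-of-words matrix: N = 4 documents, 5 vocabulary words (made-up counts).
data = np.array([
    [2, 0, 1, 0, 3],
    [0, 1, 0, 0, 2],
    [1, 0, 4, 0, 0],
    [0, 2, 0, 1, 1],
])

# Hypothetical indices of the n_feats = 2 most informative words.
keep_idx = np.array([0, 2])

# Keep only the selected columns; the number of rows is unchanged.
reduced = data[:, keep_idx]
print(reduced.shape)
```

The result has one row per document and n_feats columns, matching the documented return shape.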

dataTools.argmax_MI(data, labels, n_feats, bins)

Returns the n_feats words that share the most information with the binary label

Parameters:
  • data (numpy array) – bag-of-words representation, one row per sample
  • labels (numpy array of booleans) – binary label of each sample
  • n_feats (int) – number of high-information words to yield
  • bins (list) – discretization bins for probability computation
Returns:

the n_feats words that share the most information with the binary label
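The selection step can be sketched as a plain top-k ranking over per-word scores. The helper name `top_k_indices` and the scores are hypothetical, and whether the library returns indices or another representation of the words is not stated in this reference.

```python
import numpy as np

def top_k_indices(scores, k):
    """Return the indices of the k largest scores, best first."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    return order[:k]

# Hypothetical mutual-information scores for a 4-word vocabulary.
mi_scores = [0.02, 0.31, 0.07, 0.25]
best = top_k_indices(mi_scores, 2)
print(best)
```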

dataTools.data_normalization(data, offset_column=False)

Performs data normalization

Parameters:
  • data (numpy array) – data to normalize
  • offset_column (boolean) – True to append a column of ones to the data
Returns:

normalized data, optionally with an offset column

Return type:

pandas dataframe
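This reference does not say which normalization scheme is applied. The sketch below assumes per-column standardization (zero mean, unit variance) and works on plain numpy arrays rather than a pandas dataframe. The function name `normalize` is hypothetical.

```python
import numpy as np

def normalize(data, offset_column=False):
    """Standardize each column; optionally append a column of ones."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns at zero instead of dividing by 0
    out = (data - mean) / std
    if offset_column:
        ones = np.ones((out.shape[0], 1))
        out = np.hstack([out, ones])  # offset (bias) column goes last
    return out

X = np.array([[1.0, 10.0], [3.0, 10.0]])
Z = normalize(X, offset_column=True)
print(Z)
```

An offset column of this kind is typically used so that a linear model can learn an intercept without a separate bias term.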

dataTools.format_preds(preds)

Translates signed predictions (-1/1 or signed with amplitude for confidence) into 0/1 predictions
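A minimal sketch of such a translation, assuming that strictly positive values map to 1 and everything else (including 0) maps to 0. The function name `signed_to_binary` is hypothetical, and the tie-breaking rule at 0 is an assumption.

```python
import numpy as np

def signed_to_binary(preds):
    """Map signed predictions to 0/1 labels by thresholding at zero."""
    preds = np.asarray(preds, dtype=float)
    return (preds > 0).astype(int)

labels01 = signed_to_binary([-1, 1, 0.3, -2.5])
print(labels01)
```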

dataTools.get_MI(data, labels, word_idx, bins)

Returns the mutual information between a word and a binary label

Parameters:
  • data (numpy array) – bag-of-words representation, one row per sample
  • labels (numpy array of booleans) – binary label of each sample
  • word_idx (int) – index of the word to compute the MI for. A table mapping word indices to words must be defined before using this function
  • bins (list) – discretization bins for probability computation
Returns:

mutual information between word and binary label
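For reference, the mutual information between a discretized word count and a binary label can be estimated from a joint histogram. The sketch below assumes `bins` is a list of bin edges and reports the result in bits. The function name `mi_sketch` is hypothetical and the library's own estimator may differ.

```python
import numpy as np

def mi_sketch(data, labels, word_idx, bins):
    """Estimate I(word count; label) in bits from a joint histogram."""
    counts = np.asarray(data, dtype=float)[:, word_idx]
    y = np.asarray(labels, dtype=bool).astype(float)
    # Rows: count bins. Columns: label False / True.
    joint, _, _ = np.histogram2d(counts, y, bins=[bins, [-0.5, 0.5, 1.5]])
    joint = joint / joint.sum()            # joint probabilities p(x, y)
    px = joint.sum(axis=1, keepdims=True)  # marginal over count bins
    py = joint.sum(axis=0, keepdims=True)  # marginal over labels
    nz = joint > 0                         # skip empty cells (0 * log 0 = 0)
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz])))

data = np.array([[0, 0], [0, 3], [3, 0], [4, 3]])
labels = np.array([False, False, True, True])
bins = [0, 1, 10]  # two bins: absent (count 0) and present (count >= 1)

mi_dependent = mi_sketch(data, labels, 0, bins)    # word 0 tracks the label
mi_independent = mi_sketch(data, labels, 1, bins)  # word 1 does not
print(mi_dependent, mi_independent)
```

In this toy data word 0 is present exactly when the label is True, so it carries one full bit of information, while word 1 is independent of the label and carries none. Counts that fall outside the bin edges are ignored by the histogram.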

dataTools.load_data(dsID, set_type='tr', folder_name='data')

Loads a dataset from a folder name and a dataset number

Parameters:
  • dsID (int) – the dataset number. Your input data should be stored in files named like ‘Xk.csv’, where k=dsID
  • set_type (str) – the type of set to load (default 'tr')
  • folder_name (str) – folder where your data is stored
Returns:

data with index starting from 0

Return type:

pandas dataframe

Todo

allow this function to take any file name as input, with a default naming convention

dataTools.voting(preds, wghts, stochastic=False)

Produces a label prediction from many predictors

Parameters:
  • preds (array of predictors) – each predictor is an array of N predictions
  • wghts (float array) – confidence weights given to the respective predictors
  • stochastic (boolean) – if True, the consensus prediction is drawn from a binomial distribution over the prediction votes
Returns:

N label predictions

Return type:

array
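One way to picture the combination, assuming 0/1 predictions and non-negative weights: compute the weighted fraction of votes for label 1 per sample, then either threshold it at 0.5 or use it as the success probability of a single binomial draw. The name `weighted_vote`, the `rng` argument, and the 0.5 threshold are assumptions made for this sketch, not the library's documented behaviour.

```python
import numpy as np

def weighted_vote(preds, wghts, stochastic=False, rng=None):
    """Combine 0/1 predictions from several predictors into one label per sample."""
    preds = np.asarray(preds, dtype=float)  # shape (n_predictors, N)
    wghts = np.asarray(wghts, dtype=float)  # shape (n_predictors,)
    # Weighted fraction of votes for label 1, per sample.
    p_one = np.clip(wghts @ preds / wghts.sum(), 0.0, 1.0)
    if stochastic:
        rng = np.random.default_rng() if rng is None else rng
        return rng.binomial(1, p_one)       # one Bernoulli draw per sample
    return (p_one >= 0.5).astype(int)       # deterministic majority

preds = [[1, 0, 1], [1, 1, 0], [0, 0, 1]]
wghts = [0.5, 0.3, 0.2]
consensus = weighted_vote(preds, wghts)
print(consensus)
```

With `stochastic=True` the output varies between calls unless a seeded generator is passed, except for samples on which all predictors agree.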