Data manipulation tools
-
dataTools.MI_dimRed(data, labels, n_feats, bins) Reduces the dimensionality of a bag-of-words (BoW) representation based on mutual information
Returns: reduced BoW representation
Return type: N*n_feats numpy array
-
dataTools.argmax_MI(data, labels, n_feats, bins) Returns the n_feats words that share the most information with the binary label
Returns: the n_feats words with the highest mutual information with the binary label
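The selection step behind MI_dimRed and argmax_MI can be illustrated with a short sketch. This is not the dataTools implementation: `top_features_sketch` and the precomputed `mi_scores` vector are hypothetical, and the sketch assumes one mutual information score per word (column) is already available.

```python
import numpy as np

def top_features_sketch(data, mi_scores, n_feats):
    """Hypothetical helper: keep the n_feats columns with the highest scores."""
    data = np.asarray(data)
    # Indices of the n_feats largest scores, highest first.
    top_idx = np.argsort(mi_scores)[::-1][:n_feats]
    # Slicing the columns gives an N*n_feats reduced representation.
    return top_idx, data[:, top_idx]

# Toy BoW matrix: 2 documents, 3 words (made-up values).
data = np.array([[1, 0, 2], [0, 3, 1]])
idx, reduced = top_features_sketch(data, mi_scores=[0.1, 0.7, 0.4], n_feats=2)
```

Here `idx` holds the column indices of the selected words and `reduced` has one column per selected word.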
-
dataTools.data_normalization(data, offset_column=False) Performs data normalization
Parameters: - data (numpy array) – the data to normalize
- offset_column (boolean) – True to append a column of ones to the data
Returns: normalized data, optionally with the offset column appended
Return type: pandas dataframe
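The entry above does not say which normalization is applied. As a rough illustration only, the sketch below assumes per-column z-scoring (zero mean, unit variance) and appends the ones column as the last column; `normalize_sketch` and the column name `offset` are hypothetical and not part of dataTools.

```python
import numpy as np
import pandas as pd

def normalize_sketch(data, offset_column=False):
    """Hypothetical stand-in: z-score each column, optionally add a ones column."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled instead of dividing by zero
    out = pd.DataFrame((data - mean) / std)
    if offset_column:
        out["offset"] = 1.0  # assumed name and position for the ones column
    return out

df = normalize_sketch([[1.0, 10.0], [3.0, 30.0]], offset_column=True)
```

A ones column of this kind is commonly used so that a linear model can learn a bias term without special handling.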
-
dataTools.format_preds(preds) Translates signed predictions (-1/1, or signed values whose magnitude encodes confidence) into 0/1 predictions
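A minimal sketch of this kind of conversion, assuming that strictly positive values map to 1 and everything else (including an exact 0) maps to 0. The tie-breaking rule and the name `format_preds_sketch` are assumptions, not taken from dataTools.

```python
import numpy as np

def format_preds_sketch(preds):
    """Hypothetical stand-in: map signed predictions to 0/1 labels."""
    preds = np.asarray(preds, dtype=float)
    # Only the sign matters; the magnitude (confidence) is discarded.
    return (preds > 0).astype(int)

labels = format_preds_sketch([-1, 1, -0.3, 2.5])
```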
-
dataTools.get_MI(data, labels, word_idx, bins) Returns the mutual information between a word and a binary label
Parameters: - data (numpy array) –
- labels (numpy array of booleans) –
- word_idx (int) – the index of the word to compute mutual information for. A table mapping word indices to words must be defined before using this function
- bins (list) – discretization bins for probability computation
Returns: mutual information between the word and the binary label
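For reference, the mutual information between a discretized word count X and a binary label Y is the sum over all (x, y) pairs of p(x, y) * log(p(x, y) / (p(x) * p(y))). The sketch below estimates it from empirical frequencies. It is an illustration under stated assumptions, not the dataTools code: the use of `np.digitize` for binning, the base-2 logarithm, and the name `mutual_information_sketch` are all choices made here.

```python
import numpy as np

def mutual_information_sketch(data, labels, word_idx, bins):
    """Hypothetical stand-in: empirical MI (in bits) between one column and a binary label."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    # Discretize the word's column: one bin index per sample.
    binned = np.digitize(data[:, word_idx], bins)
    mi = 0.0
    for b in np.unique(binned):
        p_b = np.mean(binned == b)
        for y in (False, True):
            p_y = np.mean(labels == y)
            p_by = np.mean((binned == b) & (labels == y))
            if p_by > 0:  # zero-probability cells contribute nothing
                mi += p_by * np.log2(p_by / (p_b * p_y))
    return mi

labels = [False, False, True, True]
# Word count fully determined by the label.
mi_dependent = mutual_information_sketch([[0], [0], [3], [3]], labels, word_idx=0, bins=[1])
# Word count independent of the label.
mi_independent = mutual_information_sketch([[0], [3], [0], [3]], labels, word_idx=0, bins=[1])
```

With a balanced binary label, the first case reaches the maximum of 1 bit and the second gives 0.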
-
dataTools.load_data(dsID, set_type='tr', folder_name='data') Loads a dataset from a folder name and a dataset number
Returns: the dataset, with index starting from 0
Return type: pandas dataframe
Todo
allow this function to take any file name as input, with a default naming convention
-
dataTools.voting(preds, wghts, stochastic=False) Produces a label prediction from many predictors
Parameters: - preds (array of predictors) – each predictor is an array of predictions of size N
- wghts (float array) – confidence weights given to the respective predictors
- stochastic (boolean) – if True, the consensus prediction is drawn from a binomial distribution over the predictors' votes
Returns: array of N label predictions
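One way to picture weighted voting is sketched below. It assumes 0/1 predictions, a simple weighted majority with ties going to 0, and, in the stochastic case, one Bernoulli draw (a binomial with a single trial) per sample using the weighted share of votes for label 1. These details, the `rng` argument, and the name `voting_sketch` are assumptions and may differ from what dataTools actually does.

```python
import numpy as np

def voting_sketch(preds, wghts, stochastic=False, rng=None):
    """Hypothetical stand-in: combine 0/1 predictions from several predictors."""
    preds = np.asarray(preds, dtype=float)  # shape (n_predictors, N)
    wghts = np.asarray(wghts, dtype=float)  # shape (n_predictors,)
    # Weighted share of predictors voting for label 1, per sample.
    p_one = wghts @ preds / wghts.sum()
    if stochastic:
        rng = np.random.default_rng() if rng is None else rng
        # Draw each label with probability equal to its weighted vote share.
        return rng.binomial(1, p_one)
    return (p_one > 0.5).astype(int)

preds = [[1, 0, 1], [1, 1, 0], [0, 0, 1]]
consensus = voting_sketch(preds, wghts=[0.5, 0.3, 0.2])
sampled = voting_sketch(preds, wghts=[0.5, 0.3, 0.2], stochastic=True,
                        rng=np.random.default_rng(0))
```

The deterministic call follows the heaviest-weighted side for each sample, while the stochastic call can occasionally pick the minority label, in proportion to its weight.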