Welcome to impyte
Impyte is a Python module to impute missing values by prediction using machine learning algorithms.
Introduction
Missing values are a fundamental problem for anyone working with data. There are several ways to deal with missing information, ranging from dropping data points to estimating the value from other values in that column (e.g. the average or median). A more recent method involves machine-learning algorithms. This module offers a lightweight Python solution that calculates missing information based on the underlying relationships between data points.
The main goal of this module is to help people who are dealing with missing information gain additional insight into the different missing-value patterns and impute them in an easy way.
There are two essential features to this module:
 Visualization of Patterns
 Imputation of missing information
Yet impyte is only one piece of the equation. To maximize the return of any value imputation process, a deep understanding of the data is needed, as well as thorough preprocessing and cleaning of the data. Impyte takes on some of these challenges but tends to work best in concert with additional data science endeavors.
Getting started with impyte is as simple as:
from impyte import impyte
imp = impyte.Impyter()
imp.load_data(missing_data)
imp.impute()
Installation
Since this module is still in beta, you can install the latest version from its GitHub repository via pip.
pip install git+git://github.com/andirs/impyte.git
There is also a manual way of importing the module into your project. To do so, download the repository to the folder you are performing your data work in. Afterwards you’ll be able to import the impyte functionality through the following command:
from impyte import impyte
Requirements
The requirements are listed in requirements.txt and will usually be installed when proceeding through pip. When installing manually, please make sure the modules listed there are already installed.
API Reference

class impyte.impyte.Impyter(data=None)
Bases: object
Example usage:
import pandas as pd
from impyte import impyte

df = pd.read_csv("missing_values.csv")
imp = impyte.Impyter(df)

# show the NaN patterns of the data in one data frame
imp.pattern()

# imputation of all single-NaN patterns using random forest
imp.impute(estimator='rf')

# imputation of all NaN patterns
imp.impute(estimator='rf', multi_nans=True)

# use f1 and r2 thresholds
imp.impute(estimator='rf', threshold={"r2": .7, "f1_macro": .7})
Parameters: data (pd.DataFrame, optional) – Data on which to perform imputation. The data can also be a list of lists but will be converted into a pandas DataFrame once loaded. If None, data can be loaded at a later point through impyte.Impyter.load_data.
Variables:
 data (pd.DataFrame) – The original data, loaded by the user through instantiation or the impyte.Impyter.load_data method.
 result (pd.DataFrame) – Copy of the original data on which imputation is being performed.
 clf (dict) – Holds estimator for given imputation. (Deprecated)
 self.pattern_log (Pattern object) – An instantiated impyte.Pattern object that holds information about the NaN pattern.
 self.model_log (dict) – Python dictionary storing all models once impyte.Impyter.impute has been run.
 self.error_string (str) – String representation of error messages that occurred during the imputation process.
 self.pattern_predictor_dict (dict) – Python dictionary storing a pattern string and its connected list of predictors.
 self.pattern_dependent_variable_dict (dict) – Python dictionary storing a pattern string and its connected list of dependent variables.

__init__(data=None)
Parameters: data (pd.DataFrame | list[list], optional) – When initialized, data can be loaded directly. An alternative way is loading it with impyte.Impyter.load_data.

static compare_features(list_one, list_two)
Compares the elements of two lists based on a comparison of Counter dicts. The order of elements is unimportant.
Parameters:
 list_one (list)
 list_two (list)
Returns: True – If list_one and list_two contain the same elements.
Return type: Boolean
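The Counter-based comparison described above can be sketched with the standard library. This is a minimal illustration and not impyte's actual source; the function name same_elements is hypothetical, and the sketch assumes all list elements are hashable.

```python
from collections import Counter


def same_elements(list_one, list_two):
    """Return True if both lists hold the same elements with the same counts."""
    # Counter maps each element to its number of occurrences, so the
    # comparison ignores the order in which elements appear.
    return Counter(list_one) == Counter(list_two)


print(same_elements(["a", "b", "b"], ["b", "a", "b"]))  # True
print(same_elements(["a", "b"], ["a", "b", "b"]))       # False
```

Note that, unlike a comparison of sets, a Counter comparison also takes duplicates into account.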

drop_imputation(threshold, verbose=True, drop_pattern=False)
Method to drop imputation results based on threshold values. Threshold values are compared against the cross-validation scores of all imputation models. If the score is lower than the threshold value, the imputation will be dropped.
An example:
imp = impyte.Impyter(data)
imp.impute(estimator='rf')
imp.drop_imputation({"f1_macro": .8, "r2": .7})
Note
In the case of multi-NaN patterns, drop_imputation will average the score of all models. Yet, performing this method for multi-NaN patterns is discouraged.
Further individual treatment of the data set might be more helpful in order to preprocess the information correctly. One potential action could be to drop multi-NaN columns if they contain no information.
Parameters:
 threshold (dict{str, float}) – Threshold dictionary including values for r2 and f1 scores.
An example:
{ "r2" : .5, "f1_macro" : .7 }
At this point only f1 and r2 scores are supported.
 verbose (Boolean) – Boolean flag to indicate whether results should be written to stdout.
Note
At this point there is a verbose system that distinguishes multiple layers of verbosity. This flag can also simply be set to True in order to print out the minimum verbosity. A multi verbosity level might be enforced at a later stage.
 drop_pattern (Boolean) – Indicator if not only the imputation but also the pattern should be dropped.
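The threshold comparison described for drop_imputation can be illustrated as follows. This is a sketch under assumptions, not the library's code: the helper should_drop is hypothetical, and it assumes that the mean of the cross-validation scores is compared against the cutoff for the model's scoring type.

```python
from statistics import mean


def should_drop(scores, scoring, threshold):
    """Decide whether an imputation falls below its scoring threshold.

    scores    -- list of cross-validation scores of one model
    scoring   -- "r2" or "f1_macro"
    threshold -- dict such as {"r2": .7, "f1_macro": .8}
    """
    cutoff = threshold.get(scoring)
    if cutoff is None:
        # No cutoff configured for this scoring type: keep the imputation.
        return False
    return mean(scores) < cutoff


threshold = {"f1_macro": .8, "r2": .7}
print(should_drop([0.9, 0.85, 0.8], "f1_macro", threshold))  # False
print(should_drop([0.6, 0.7, 0.65], "r2", threshold))        # True
```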

drop_pattern(pattern_no, inplace=False)
Method to drop a pattern referenced by its pattern number. Drops the pattern from the data set and returns the preliminary result. If the inplace flag is set to True, the internal storage of the impyte object is modified as well. Otherwise, a copy without the dropped pattern is returned and the stored data set stays intact.
Parameters:
 pattern_no (int)
 inplace (Boolean)

get_data()
Returns a copy of the loaded data for quick reference.
Returns: Original Data – A copy of the original data set can be retrieved through this method.
Return type: pd.DataFrame

get_model(pattern_no)
Returns the model that matches the pattern number.
Parameters: pattern_no (int) – Pattern number for which to retrieve the matching model.
Returns: model
Return type: ImpyterModel | ImpyterMultiModel

get_pattern(pattern_no, result=False)
Returns data points for a specific pattern_no for further investigation.
Parameters:
 pattern_no (int) – Index value that indicates pattern
 result (Boolean) – Flag to indicate whether the original or the result data should be sliced.
Returns: data – Data points that have a certain pattern. If result is set to True, the data is result data; otherwise a slice of the original data is returned.
Return type: pd.DataFrame

get_result()
Returns a copy of the result data for reference.
Returns: Result Data – A copy of the result data.
Return type: pd.DataFrame

get_summary(importance_filter=True)
Shows a simple overview of missing values. Returns a table with the number of missing values per column, their percentage and the count of unique values within that column.
Setting the importance filter flag to True shows only columns that have some missing values. This is helpful for data sets with a large number of variables and only a few NaN values.
Parameters: importance_filter (Boolean) – Show only features with at least one missing value.
Returns: Summary table
Return type: pd.DataFrame

impute(data=None, cv=5, verbose=True, estimator='rf', multi_nans=False, one_hot_encode=True, auto_scale=True, threshold={'r2': None, 'f1_macro': None}, recompute=False)
Impute is the core method of impyte. The method works out of the box and uses Random Forest estimators by default to impute missing values. It automatically performs cross-validation to showcase the potential accuracy of the imputation.
The scoring used is the f1_macro score for classifiers (supporting binary and multiclass) and r2 for regression models.
In order to fill in only columns that surpass a certain scoring threshold (e.g. f1 score > .7), the threshold parameter can be set. The threshold values are passed in through a dictionary.
Note
Multi Nans
Prediction of values in multi-NaN patterns is a last-resort option. This might be suitable for certain edge cases, but if the score values are low, dropping the feature or the data points altogether should be considered.
Parameters:
 data (pd.DataFrame) – Data to be imputed.
 cv (int) – Number of cross-validation runs.
 verbose (Boolean) – Indicator whether prediction results should be printed out.
 estimator (str) – Estimators can be chosen through a simple string abbreviation. This table outlines the potential options.
Abbreviation   Estimator
‘rf’           Random Forest
‘svm’          Support Vector Machine
‘sgd’          Stochastic Gradient Descent
‘knn’          K-Nearest Neighbor
‘bayes’        (Naive) Bayes
‘dt’           Decision Tree
‘gb’           Gradient Boosting
 multi_nans (Boolean) – Indicator if data points with multiple NaN values should be imputed as well.
 one_hot_encode (Boolean) – If set to True, categorical variables are one-hot-encoded.
 auto_scale (Boolean) – If set to True, continuous variables are automatically scaled and transformed back after imputation.
 threshold (dict{str, float}) – Classification and regression threshold cutoffs. At this point the f1 score and R2 are supported.
 recompute (Boolean) – Indicator whether the system should recompute the imputation or use stored models if possible.
Note
Impyte will print a warning to the stdout if the data set might contain too few rows in general to properly compute any imputation method.
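To make the idea of imputation by prediction more concrete, the following sketch imputes a single continuous column with a Random Forest and reports cross-validated r2 scores. It uses scikit-learn directly and is not impyte's internal implementation; the toy data, the column names and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Toy data: y depends on x1 and x2; three y values are missing.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=60), "x2": rng.normal(size=60)})
df["y"] = 2 * df["x1"] - df["x2"]
df.loc[[3, 10, 25], "y"] = np.nan

target = "y"
predictors = ["x1", "x2"]
complete = df[df[target].notna()]  # rows used for training
missing = df[df[target].isna()]    # rows to be imputed

model = RandomForestRegressor(n_estimators=50, random_state=0)

# Cross-validate on the complete cases to estimate imputation quality.
scores = cross_val_score(model, complete[predictors], complete[target],
                         cv=5, scoring="r2")
print("mean r2:", scores.mean())

# Fit on all complete cases and fill in the missing values.
model.fit(complete[predictors], complete[target])
result = df.copy()
result.loc[missing.index, target] = model.predict(missing[predictors])
```

For a discrete target, a classifier scored with f1_macro would take the place of the regressor.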

load_data(data)
Function to load data into the Impyter class. Expects a pandas DataFrame; any other input is transformed into a DataFrame. While loading, the data is copied into the object to avoid consistency issues with the original data set.
Parameters: data – preferably pandas DataFrame

load_model(filename, path='models/')
Load a stored machine learning model to perform value imputation.
Parameters:
 filename (str) – Filename of model
 path (str) – Path to model (default value is ‘models/’)

map_model_to_pattern(mdl)
Checks model for similarity to stored patterns and returns pattern number if a match is found.
Parameters: mdl (ImpyterModel)
Returns: pattern_no – If no pattern number can be found, a None value will be returned.
Return type: int

map_multimodel_to_pattern(mmdl)
Checks multi-model for similarity to stored patterns and returns pattern number if a match is found.
Parameters: mmdl (ImpyterMultiModel)
Returns: pattern_no – If no pattern number can be found, a None value will be returned.
Return type: int

one_hot_decode(data)
Decodes one-hot-encoded features into a single column again. Generally speaking, this function inverts the one-hot-encode function.
Parameters: data (pd.DataFrame) – DataFrame that has one-hot-encoded columns processed by impyte.Impyter.one_hot_encode.
Returns: Data set – Data set with collapsed information.
Return type: pd.DataFrame

one_hot_encode(data, verbose=False)
Uses the pandas get_dummies method to return a one-hot-encoded DataFrame.
Parameters:
 data (pd.DataFrame)
 verbose (Boolean)
Returns: Data set – DataFrame with one-hot-encoded categorical values.
Return type: pd.DataFrame
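The encode and decode steps can be illustrated with plain pandas. The snippet below is a sketch of the general technique, assuming a single categorical column named color; it does not reproduce how impyte tracks or names its encoded columns.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1, 2, 3]})

# Encode: every category of "color" becomes its own indicator column.
encoded = pd.get_dummies(df, columns=["color"])
print(list(encoded.columns))  # ['size', 'color_blue', 'color_red']

# Decode: per row, pick the indicator column that is set and strip the prefix.
dummy_cols = [c for c in encoded.columns if c.startswith("color_")]
decoded = encoded.drop(columns=dummy_cols)
decoded["color"] = (
    encoded[dummy_cols].astype(int).idxmax(axis=1).str[len("color_"):]
)
print(decoded["color"].tolist())  # ['red', 'blue', 'red']
```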

pattern(recompute=False)
Returns missing value patterns of the data set. Leverages the impyte.Pattern._compute_pattern and impyte.Pattern.get_pattern methods to compute and return an overview of all existing NaN patterns in the data set. The overview shows a NaN in the column where a data point was missing and 1 for all complete slots. On the right-hand side is a count variable to indicate how often that pattern was found. The patterns are always sorted by count, and it is not guaranteed that pattern 0 is the pattern with only complete cases.
A potential result table could look like this, where NaN indicates the column contains missing values in this pattern. The Count column shows how many observations of this NaN pattern are in the data set.
Pattern  left_socks  right_socks  Count
0        1           1            15
1        NaN         1            6
2        1           NaN          6
3        NaN         NaN          4
For additional information (and a rather sad joke) please head over to impyte.Pattern.
Parameters: recompute (Boolean) – Flag to indicate whether patterns should be recomputed from the original data set. This is an important feature if, for example, a pattern has been dropped and should be incorporated again.
Returns: NaN-Pattern Table – Table with overview of NaN patterns.
Return type: pd.DataFrame
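A pattern table of this shape can be derived from any DataFrame by turning every row into a tuple of 1 and 'NaN' markers and counting identical tuples. The following is a minimal sketch of that idea with made-up data, not the code behind impyte.Pattern.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    "left_socks":  ["red", None, "blue", None, "red"],
    "right_socks": ["red", "blue", None, None, "blue"],
})

# One tuple per row: 1 for a present value, 'NaN' for a missing one.
counts = Counter(
    tuple("NaN" if is_missing else 1 for is_missing in row)
    for row in df.isna().itertuples(index=False)
)

# most_common() sorts by count, so the most frequent pattern becomes pattern 0.
patterns = pd.DataFrame(
    [list(pattern) + [count] for pattern, count in counts.most_common()],
    columns=list(df.columns) + ["Count"],
)
print(patterns)
```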

save_model(pattern_no=None, filename=None, path='models/')
Stores an imputation model for either the whole data set or a particular pattern in a pickle file. If pattern_no is not set, the method stores all models. If filename is not set, an automated name including a timestamp is produced.
Parameters:
 pattern_no (int, optional) – Pattern number that points to a certain NaN-pattern model, which in turn references an impyte.ImpyterModel or impyte.ImpyterMultiModel.
 filename (str, optional) – If value is not set, an automated name is created.
 path (str) – (default value is ‘models/’ which will automatically create a model for that)
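Storing and restoring a model in a pickle file with an automatically generated, timestamped filename could look roughly like this. The helpers save_object and load_object and the filename format are hypothetical; the sketch only shows the general pickle round trip, with a plain dict standing in for a model.

```python
import os
import pickle
import tempfile
import time


def save_object(obj, filename=None, path="models/"):
    """Pickle obj into path, generating a timestamped name if none is given."""
    os.makedirs(path, exist_ok=True)  # create the target folder if needed
    if filename is None:
        filename = time.strftime("model_%Y%m%d_%H%M%S.pkl")
    full_path = os.path.join(path, filename)
    with open(full_path, "wb") as handle:
        pickle.dump(obj, handle)
    return full_path


def load_object(filename, path="models/"):
    """Load a previously pickled object."""
    with open(os.path.join(path, filename), "rb") as handle:
        return pickle.load(handle)


# Round trip in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    stored = save_object({"estimator": "rf"}, filename="demo.pkl", path=tmp)
    restored = load_object("demo.pkl", path=tmp)
print(restored)  # {'estimator': 'rf'}
```

As with any pickle file, only models from trusted sources should be loaded.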

set_unique(unique_no)
Set unique values for imputation.
Parameters: unique_no (int) – Positive number that indicates the threshold of unique values a column needs in order to be counted as a continuous variable.
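The effect of the unique value threshold can be sketched as a simple rule: a numeric column counts as continuous only if it has more distinct values than the threshold. The function is_continuous below is a hypothetical stand-in for that rule, not impyte's own type detection.

```python
def is_continuous(values, unique_instances=10):
    """Classify a column as continuous if it is numeric and has more
    than unique_instances distinct non-missing values."""
    present = [v for v in values if v is not None]
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in present
    )
    return numeric and len(set(present)) > unique_instances


ratings = [1.0, 2.5, 3.0, 4.5, 5.0, None]          # 5 distinct floats
print(is_continuous(ratings))                      # False
print(is_continuous(ratings, unique_instances=3))  # True
```

With the default of 10, the float column above would be treated as discrete; lowering the threshold lets it be treated as continuous.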

class impyte.impyte.ImpyterModel(estimator_name, model=None, pattern_no=None, feature_name=None, scores=None, scoring=None, predictor_variables=None, pattern_string=None, y_scaler=None)
Bases: object
Stores computed Impyter machine learning models and relevant information that is linked to the model and pattern.
Variables:
 model (sklearn Machine Learning Model) – Contains a trained machine learning model for given imputation task.
 pattern_no (int) – Indicator for pattern number.
 feature_name (str | int) – Name of the dependent variable.
 scores (list) – List of all cross-validation scores. The average of this list is used as the threshold score.
 estimator_name (str) – String representation of the Machine Learning model.
 scoring (str) – String representation of the scoring measurement (‘r2’ or ‘f1_macro’ right now)
 predictor_variables (list) – Contains names of all independent variables used for the imputation task.
 pattern_string (tuple) – Tuple representation of pattern string. Can be used for identification of patterns.
 y_scaler (sklearn.preprocessing.StandardScaler object) – StandardScaler object that contains additional information in case the model was used with auto_scale = True.

__init__(estimator_name, model=None, pattern_no=None, feature_name=None, scores=None, scoring=None, predictor_variables=None, pattern_string=None, y_scaler=None)
Parameters:
 estimator_name (str) – Name of machine learning model
 model (sklearn Machine Learning Model) – Sklearn machine learning estimator object
 pattern_no (int) – Pattern number associated with NaN pattern.
 feature_name (str | int) – Name of dependent variable.
 scores (list[float]) – Collection of all cross-validation scores.
 scoring (str) – String representation of scoring function (e.g. “r2” or “f1_macro”).
 predictor_variables (list[str | int]) – List of names of all independent variables.
 pattern_string (tuple) – Tuple representation of a certain pattern.
 y_scaler (sklearn.preprocessing.StandardScaler object) – StandardScaler object that contains additional information in case the model was used with auto_scale = True.

class impyte.impyte.ImpyterMultiModel(pattern_string)
Bases: object
Stores multi-NaN imputations in the form of a list of impyte.ImpyterModel objects.
Variables:
 _model_list (list) – Collection of all ImpyterModel objects that are needed to compute the given multi-NaN pattern.
 count (int) – Number of ImpyterModel objects that are stored.
 pattern_string (tuple) – Tuple representation of multi-NaN pattern.

__init__(pattern_string)
Parameters: pattern_string (tuple) – References a pattern by tuple.

append(model)
Appends an additional ImpyterModel object to the list of models.
Parameters: model (ImpyterModel object) – The model to be appended to the model list

static check_and_append(input_list, storage_list)
Extension helper method to append items to a preexisting list if not included.
Parameters:
 input_list (list) – List with items to append.
 storage_list (list) – List that serves as storage item for all items.
Returns: storage_list – Collection of all unique elements from input_list and storage_list
Return type: list
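The described behaviour amounts to an order-preserving union of two lists. A minimal sketch, using the hypothetical name append_unique rather than the library's own method:

```python
def append_unique(input_list, storage_list):
    """Append items from input_list to storage_list unless already present."""
    for item in input_list:
        if item not in storage_list:
            storage_list.append(item)
    return storage_list


print(append_unique(["hat", "pants"], ["pants", "socks"]))
# ['pants', 'socks', 'hat']
```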

static combine_in_list(input_list, *args)
Extension helper method to add multiple and single arguments to a preexisting list.
Parameters:
 input_list (list) – Preexisting list.
 args (list) – List or single values to be added to the list.
Returns: extended input_list
Return type: list
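One way to read this description is that list arguments are merged into the target list while single values are appended. The sketch below follows that reading; the function name combine and the one-level flattening are assumptions, not confirmed library behaviour.

```python
def combine(input_list, *args):
    """Extend input_list with every argument; lists are flattened one level."""
    for arg in args:
        if isinstance(arg, list):
            input_list.extend(arg)
        else:
            input_list.append(arg)
    return input_list


print(combine(["a"], "b", ["c", "d"]))  # ['a', 'b', 'c', 'd']
```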

get_dependend_and_independent_variables()
For all models stored in the object, collect their dependent and independent variables.
As an example, if we had a multi-NaN model that stored two ImpyterModels to predict right_socks and left_socks, the variables stored in the response would look like this:
{
    "independent_variables": ["time_of_year", "pants", "hat"],
    "dependent_variables": ["right_socks", "left_socks"]
}
Returns: Variables – Dictionary including independent and dependent variables. Can be accessed through “independent_variables” and “dependent_variables”.
Return type: dict{str, list}
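Collecting these variables from a list of models can be sketched as follows. StubModel is a hypothetical stand-in that carries only the two ImpyterModel attributes needed here (feature_name and predictor_variables); the function collect_variables is likewise illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class StubModel:
    """Hypothetical stand-in for an ImpyterModel."""
    feature_name: str
    predictor_variables: list = field(default_factory=list)


def collect_variables(models):
    """Gather dependent and (de-duplicated) independent variable names."""
    independent, dependent = [], []
    for model in models:
        dependent.append(model.feature_name)
        for name in model.predictor_variables:
            if name not in independent:
                independent.append(name)
    return {
        "independent_variables": independent,
        "dependent_variables": dependent,
    }


models = [
    StubModel("right_socks", ["time_of_year", "pants", "hat"]),
    StubModel("left_socks", ["time_of_year", "pants", "hat"]),
]
variables = collect_variables(models)
print(variables["dependent_variables"])    # ['right_socks', 'left_socks']
print(variables["independent_variables"])  # ['time_of_year', 'pants', 'hat']
```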

class impyte.impyte.NanChecker
Bases: object
Class that checks data sets, lists or single values for NaN occurrence.
Examples
Testing a list for NaN values:
from impyte import impyte

nan_array = ["Test", None, '', 23, [None, "42"]]
nan_checker = impyte.NanChecker()
print(nan_checker.is_nan(nan_array))
# [False, True, True, False, [True, False]]

static is_nan(data, nan_vals=None, recursive=True)
Detect missing values (NaN in numeric arrays, empty strings in string arrays).
Parameters:
 data ({numpy.ndarray | str | list | int | float}) – Data to be investigated for NaN values.
 nan_vals (list) – Array of values that count as NaN values. If empty, “” and None are used.
 recursive (boolean) – Flag that determines whether the lists should be handled in a recursive manner
Returns: result – Array or bool indicating whether an object is null or, if an array is given, which of its elements are null.
Return type: Boolean

class impyte.impyte.Pattern(unique_instances=10)
Bases: object
Class that calculates, stores and visualizes NaN patterns and their indices.
Variables:  column_names (list) – Python list storing names of all columns that are in data set.
 complete_idx (int) – Integer containing pattern number with only complete cases
 continuous_variables (list) – Python list containing column names of all continuous variables (e.g. columns that contain values in a range from 0.0 to 1.0).
 discrete_variables (list) – Python list containing column names of all discrete variables (e.g. columns that contain values such as “red”, “blue”, “green”).
 easy_access (dict{tuple, list}) – Python dictionary holding NaN-pattern strings and mapping them to a list of the names of columns that contain NaN values in the given NaN pattern.
As an example:
{
    ('NaN', 1): ['left_socks'],
    (1, 'NaN'): ['right_socks'],
    ('NaN', 'NaN'): ['left_socks', 'right_socks']
}
 missing_per_column (list) – Python list used to store summarization results, to make the use of impyte.Pattern.get_missing_value_percentage more efficient (the default is None)
 nan_checker (NanChecker object) – An instantiated impyte.NanChecker object that can be used to analyze values and rows regarding their NaN values.
 pattern_index_store (dict{int, list}) – Python dictionary holding a list of indices for every pattern number. This dictionary is used to look up the corresponding data points in a pandas DataFrame.
As an example:
{
    0: [0, 1, 2, 3, 4],  # pattern_number: indices
    1: [5, 6, 7, 8, 9]
}
This pattern log consists of 2 patterns (0 and 1), each pointing to 5 indices.
 pattern_store (dict{str, pd.DataFrame}) – Python dictionary storing the pattern table. The table (in pd.DataFrame form) can be accessed by self.pattern_store['result'].
A potential result table could look like this, where NaN indicates the column contains missing values in this pattern. The Count column shows how many observations of this NaN pattern are in the data set.
Pattern  left_socks  right_socks  Count
0        1           1            15
1        NaN         1            6
2        1           NaN          6
3        NaN         NaN          4
Let’s hope these left and right socks are of the same color at least…
 result_pattern (dict{tuple, int}) – Python dictionary version of pattern counts. Makes computation and alterations easier.
 tuple_counter (int) – Value storing the amount of different patterns after performing pattern analysis. (the default is 0)
 tuple_counter_dict (dict) – Python dictionary mapping pattern strings to pattern number.
 tuple_dict (dict{tuple, int}) – Python dictionary mapping pattern tuples to their pattern number.
As an example:
{
    ('NaN', 1): 1,  # points to pattern 1
    (1, 'NaN'): 2,
    ('NaN', 'NaN'): 3
}
 unique_instances (int) – Value indicating the minimum number of unique values a column with the proper dtype needs in order to be considered a continuous variable (the default is 10, which implies that columns with over 10 unique values are labeled as continuous variables if they contain numbers).
 pattern_predictor_dict (dict) – Python dictionary mapping pattern strings to their independent variable names.
 pattern_dependent_dict (dict) – Python dictionary mapping pattern string to their dependent variable names.

__init__(unique_instances=10)
When instantiating an impyte.Pattern object, most values are initialized as empty or None.
Parameters: unique_instances (int) – Value indicating the minimum number of unique values a column with the proper dtype needs in order to be considered a continuous variable (the default is 10, which implies that columns with over 10 unique values are labeled as continuous variables if they contain decimal numbers).

get_column_name(patter_no)
Returns the column name(s) that contain missing information of a certain NaN pattern.
Parameters: patter_no (int) – Number or identifier of pattern
Returns: Column names – If patter_no has been computed, a list of all column names associated with patter_no is returned.
Return type: list

get_complete_id()
Returns pattern number of observations that don’t contain any missing information.
Returns: Pattern number
Return type: int

get_complete_indices()
Function to determine complete cases based on results table. Leverages precomputed information and is quicker than the pandas dropna method.
Returns: Indices – List of indices that point to rows with complete cases
Return type: list

get_continuous()
Returns copy of continuous variable names.
Returns: Continuous variable names
Return type: list

get_discrete()
Returns copy of discrete variable names.
Returns: Discrete variable names
Return type: list

get_missing_value_percentage(data, importance_filter=False)
Combines information regarding the values in the data set and returns them in a concise way.
A potential summary table could look like this.
Column       Complete  Missing  Percentage  Unique
left_socks   21        6        19.4 %      2
right_socks  21        6        19.4 %      2
Parameters:
 data (pd.DataFrame) – data refers to the information the user wants to analyze (usually the result data set stored in impyte.Impyter)
 importance_filter (Boolean) – Flag to hide columns that have no missing values. This might make sense for data sets with a lot of columns that have no missing values. (default value is False, stating that all columns are important)
Returns: Summary table – Contains information regarding complete, missing and unique values in the data set.
Return type: pd.DataFrame
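A summary table with these columns can be assembled from standard pandas calls. The following is a sketch with made-up data; it is not the code impyte uses, and the column labels are simply copied from the example table above.

```python
import pandas as pd

df = pd.DataFrame({
    "left_socks":  ["red", None, "blue", "red"],
    "right_socks": ["red", "blue", None, None],
    "hat":         ["cap", "cap", "beanie", "cap"],
})

summary = pd.DataFrame({
    "Complete": df.notna().sum(),
    "Missing": df.isna().sum(),
    "Percentage": (df.isna().mean() * 100).round(1),
    "Unique": df.nunique(),  # NaN is not counted as a unique value
})

# Equivalent of an importance filter: keep only columns with missing values.
filtered = summary[summary["Missing"] > 0]
print(filtered)
```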

get_multi_nan_pattern_nos(multi=True)
Returns all pattern numbers of multi-NaN or single-NaN patterns.
Parameters: multi (Boolean) – Flag indicating whether the user wants to retrieve multi- or single-NaN pattern numbers.
Returns: Pattern Numbers – All single- or multi-NaN pattern numbers.
Return type: list

get_pattern(data=None, recompute=False)
Returns NaN patterns based on primary computation or initiates a new computation of NaN patterns.
Parameters:
 data (pd.DataFrame)
 recompute (Boolean) – If set to True, stored results are disregarded
Returns: Pattern overview – Table representation of all NaN patterns and their counts.
Return type: pd.DataFrame

get_pattern_indices(pattern_no)
Returns the indices of the data points for a specific pattern_no for further investigation.
Parameters: pattern_no (int) – Index value that indicates pattern number.
Returns: Indices – Indices that correspond to a pattern number.
Return type: list

get_single_nan_pattern_nos()
Returns all pattern numbers that contain only a single NaN.
Returns: Pattern Numbers – All pattern numbers of single-NaN patterns.
Return type: list

remove_pattern(pattern_no)
Removes a certain pattern. Deletes the dictionary entry in the pattern index store and drops the entry in the results table.
Parameters: pattern_no (int) – Index value that indicates pattern.
Help
FAQs
Below are some pointers in the right direction if something breaks. If you encounter any other error, please feel free to reach out.
When imputing my estimator raises ValueError: Unknown label type: ‘continuous’
Hint
This might happen if there is too little information for impyte to correctly distinguish your data type. This error essentially means you’re handing a continuous data type (e.g. a float) to a classifier which expects a class or discrete value.
To solve this problem, you can set the unique value threshold to a lower value (the default is 10 unique instances).
License
Copyright 2017 Andreas Rubin-Schwarz
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.