Wednesday, January 26, 2022

Blog on Data Leakage

                                                DATA LEAKAGE


Data leakage refers to a problem where information about the holdout datasets, such as test or validation datasets, is made available to the model in the training dataset. The leakage is often small but can have a marked effect on the model's estimated performance.

Put more simply, data leakage occurs when information from the test data passes into the training data, so the model already knows something about the test data while it is being trained. This shared information is called "DATA LEAKAGE".

Factors that cause Data Leakage:

       A common approach is to apply one or more transforms to the entire dataset. The entire dataset is then split into train and test sets, or k-fold cross-validation is used, to fit and evaluate a machine learning model.

  1.    Prepare Dataset
  2.    Split Dataset
  3.    Evaluate Models   
Although this is a common and basic approach, it is dangerously incorrect in most cases.

The problem with applying data preparation techniques before splitting the data for model evaluation is that it can lead to data leakage, which in turn will likely result in an incorrect estimate of the model's performance on the problem.
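
To make the leak concrete, here is a minimal sketch using only the Python standard library. The data values and the helper names `fit_min_max` and `scale` are hypothetical, chosen for illustration. It shows that when a min-max transform is fit on the entire dataset, the learned range depends on the test rows, so the prepared training data already carries information about them.

```python
# Hypothetical toy data: one numeric feature; the last two rows are the test set.
values = [2.0, 4.0, 6.0, 8.0, 50.0, 1.0]
train, test = values[:4], values[4:]


def fit_min_max(rows):
    """Return the (min, max) pair a min-max scaler would learn from rows."""
    return min(rows), max(rows)


def scale(rows, lo, hi):
    """Rescale rows relative to the learned (lo, hi) range."""
    return [(x - lo) / (hi - lo) for x in rows]


# Leaky: the range is learned from every row, including the test rows.
leaky_lo, leaky_hi = fit_min_max(values)

# Clean: the range is learned from the training rows only.
clean_lo, clean_hi = fit_min_max(train)

leaky_train = scale(train, leaky_lo, leaky_hi)
clean_train = scale(train, clean_lo, clean_hi)

print("range learned from all rows:  ", (leaky_lo, leaky_hi))
print("range learned from train only:", (clean_lo, clean_hi))
print("prepared training rows differ:", leaky_train != clean_train)
```

In the leaky version both the minimum and the maximum come from the test rows, so the training features the model sees have been shaped by data it is supposed to know nothing about.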

To Handle the Data Leakage Problem:

          * Change the order in which the data is prepared before it is given to the model. Instead of the steps listed above, use the following steps:

                 1. Split Data

                 2. Fit Data Preparation on Training Dataset

                 3. Apply Data Preparation to Train and Test Datasets.

                 4. Evaluate Models.
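
The four steps above can be sketched roughly as follows, again with the standard library only. The toy data, the 70/30 split, the standardization transform and the nearest-centroid model are all assumptions made for illustration, not part of any particular library or method.

```python
import random
import statistics

random.seed(0)  # fixed seed so the hypothetical toy data is reproducible

# Hypothetical toy data: (feature, label) pairs; class 1 has larger values.
data = [(random.gauss(10, 2), 0) for _ in range(50)]
data += [(random.gauss(20, 2), 1) for _ in range(50)]
random.shuffle(data)

# 1. Split data (70% train, 30% test is an arbitrary choice here).
cut = int(0.7 * len(data))
train, test = data[:cut], data[cut:]

# 2. Fit data preparation on the training dataset only.
train_x = [x for x, _ in train]
mean = statistics.mean(train_x)
std = statistics.stdev(train_x)


def standardize(x):
    """Apply the transform whose parameters were learned from train only."""
    return (x - mean) / std


# 3. Apply the same fitted preparation to the train and test datasets.
train_s = [(standardize(x), y) for x, y in train]
test_s = [(standardize(x), y) for x, y in test]

# 4. Evaluate a model (a simple nearest-centroid classifier fit on train).
centroids = {
    label: statistics.mean(x for x, y in train_s if y == label)
    for label in (0, 1)
}


def predict(x):
    """Return the label whose training centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


accuracy = sum(predict(x) == y for x, y in test_s) / len(test_s)
print(f"test accuracy: {accuracy:.2f}")
```

The test rows are transformed with the mean and standard deviation learned in step 2, but they never contribute to those values.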

Conclusions 

More generally, the entire modeling pipeline must be prepared only on the training dataset to avoid the data leakage problem. This might include data transforms, but also other techniques such as feature selection, dimensionality reduction, feature engineering, and many more. This means that so-called "model evaluation" should really be called "modeling pipeline evaluation".
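
One way to picture "modeling pipeline evaluation" is k-fold cross-validation in which the whole pipeline, preparation and model together, is refit inside every fold. The sketch below assumes a hypothetical `fit_pipeline` helper, toy data, and a nearest-centroid model. It is one possible illustration, not a definitive implementation.

```python
import random
import statistics


def fit_pipeline(train_rows):
    """Fit preparation and model on train_rows only; return a predict function."""
    xs = [x for x, _ in train_rows]
    mean, std = statistics.mean(xs), statistics.stdev(xs)

    def prepare(x):
        return (x - mean) / std

    centroids = {
        label: statistics.mean(prepare(x) for x, y in train_rows if y == label)
        for label in {y for _, y in train_rows}
    }

    def predict(x):
        z = prepare(x)
        return min(centroids, key=lambda label: abs(z - centroids[label]))

    return predict


def cross_validate(rows, k=5):
    """Return per-fold accuracy, refitting the whole pipeline in every fold."""
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i, held_out in enumerate(folds):
        train_rows = [r for j, fold in enumerate(folds) if j != i for r in fold]
        predict = fit_pipeline(train_rows)  # the held-out fold is never seen here
        correct = sum(predict(x) == y for x, y in held_out)
        scores.append(correct / len(held_out))
    return scores


random.seed(1)  # fixed seed so the hypothetical toy data is reproducible
rows = [(random.gauss(10, 2), 0) for _ in range(40)]
rows += [(random.gauss(20, 2), 1) for _ in range(40)]
random.shuffle(rows)

scores = cross_validate(rows, k=5)
print("fold accuracies:", [round(s, 2) for s in scores])
print("mean accuracy:", round(statistics.mean(scores), 2))
```

Because the preparation lives inside `fit_pipeline`, it cannot be fit on the held-out fold by accident, and the score describes the pipeline as a whole rather than the model alone.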

 

 

 
