DATA LEAKAGE
Data Leakage refers to a problem where information about the holdout datasets, such as the test or validation datasets, is made available to the model through the training dataset. The leakage is often small and subtle, but it can produce misleadingly optimistic performance estimates during evaluation and hurt the model's performance on genuinely unseen data.
To put it simply: when information from the test data finds its way into the training data, the model already knows something about the test data while training. This shared information is what we call "DATA LEAKAGE".
A common cause of Data Leakage:
A naive approach is to apply one or more transforms to the entire dataset and only then split it into train and test sets, or use k-fold cross-validation, to fit and evaluate a machine learning model. That workflow looks like this (a code sketch of it follows the list):
- Prepare Dataset
- Split Dataset
- Evaluate Models
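Below is a minimal sketch of this leaky workflow, assuming scikit-learn; the synthetic make_classification data, the MinMaxScaler transform, and the LogisticRegression model are illustrative choices, not part of the original steps:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Illustrative synthetic dataset (an assumption, not from the post)
X, y = make_classification(n_samples=1000, n_features=20, random_state=7)

# 1. Prepare Dataset: the scaler is fit on ALL rows, so statistics from the
#    future test rows leak into how the training rows are transformed
X_scaled = MinMaxScaler().fit_transform(X)

# 2. Split Dataset
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.33, random_state=7)

# 3. Evaluate Models: the reported accuracy is optimistically biased
model = LogisticRegression().fit(X_train, y_train)
print("accuracy (leaky):", accuracy_score(y_test, model.predict(X_test)))
```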
To Handle the Data Leakage Problem:
* Tweak the process of preparing the data before applying it to the model. Instead of the steps mentioned above, reorder the process as follows (see the sketch after this list):
1. Split Data
2. Fit Data Preparation on Training Dataset
3. Apply Data Preparation to Train and Test Datasets.
4. Evaluate Models.
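A sketch of this corrected order, under the same assumptions as before (scikit-learn, synthetic make_classification data, with MinMaxScaler and LogisticRegression standing in for whatever preparation and model you actually use):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=7)

# 1. Split Data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=7)

# 2. Fit Data Preparation on the Training Dataset only
scaler = MinMaxScaler().fit(X_train)

# 3. Apply the fitted Data Preparation to both Train and Test Datasets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. Evaluate Models on a test set the scaler never saw during fitting
model = LogisticRegression().fit(X_train_scaled, y_train)
print("accuracy (no leakage):",
      accuracy_score(y_test, model.predict(X_test_scaled)))
```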
Conclusions
More generally, the entire modeling pipeline must be prepared only on the training dataset to avoid the data leakage problem. This includes not only data transforms but also other techniques such as feature selection, dimensionality reduction, and feature engineering. In that sense, so-called "model evaluation" is really "modeling pipeline evaluation".
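One way to enforce this in practice, assuming scikit-learn, is to wrap the preparation steps and the model in a Pipeline and pass the whole pipeline to cross-validation, so each fold refits the transforms on its own training portion only. A minimal sketch (the specific steps are again illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, random_state=7)

# The pipeline bundles data preparation and the model; cross_val_score refits
# the whole pipeline inside each fold, so no test fold leaks into preparation.
pipeline = Pipeline([
    ("scale", MinMaxScaler()),        # could also be feature selection, PCA, etc.
    ("model", LogisticRegression()),
])

cv = KFold(n_splits=10, shuffle=True, random_state=7)
scores = cross_val_score(pipeline, X, y, scoring="accuracy", cv=cv)
print("mean accuracy:", scores.mean())
```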