Wednesday, January 26, 2022

Blog on Data Leakage

                                                DATA LEAKAGE


Data Leakage refers to a problem where information about the holdout datasets, such as the test or validation datasets, is made available to the model in the training dataset. The leakage is often small and subtle, but it can have a marked effect on the estimate of the model's performance.

To simplify this, we can also say that when information from the test data is passed into the train data, our model already knows about the test data while training. This shared information is called "DATA LEAKAGE".

A common workflow that causes Data Leakage:

       A naive but very common approach is to apply one or more transforms to the entire dataset, and only then split the dataset into train and test sets (or use k-fold cross-validation) to fit and evaluate a machine learning model:

  1.    Prepare Dataset
  2.    Split Dataset
  3.    Evaluate Models   
Although this is a common and basic approach, it is dangerously incorrect in most cases.

The problem with applying data preparation techniques before splitting the data for model evaluation is that it can lead to data leakage and, in turn, will likely result in an incorrect estimate of the model's performance on the problem.
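A minimal sketch of this leaky order, assuming scikit-learn and a synthetic dataset (the transform, model, and numbers are illustrative, not from the original post):

```python
# Leaky workflow: the scaler is fit on ALL rows, so statistics
# from the future test rows leak into training.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=7)

# 1. Prepare dataset (transform fit on the entire dataset -- the mistake)
X_scaled = MinMaxScaler().fit_transform(X)

# 2. Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.33, random_state=7)

# 3. Evaluate models -- this estimate is optimistically biased
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy (leaky):", accuracy_score(y_test, model.predict(X_test)))
```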

To Handle the Data Leakage Problem:

          * Tweak the process of preparing the data before applying it to the model. Instead of the steps mentioned above, we can reorder the process as follows (a sketch of this corrected order appears after the list):

                 1. Split Data

                 2. Fit Data Preparation on Training Dataset

                 3. Apply Data Preparation to Train and Test Datasets.

                 4. Evaluate Models.
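A minimal sketch of the corrected order, mirroring the four steps above (again with scikit-learn and synthetic data, as an illustration rather than the original code):

```python
# Corrected workflow: split first, then fit the transform on the
# training rows only and reuse it on both splits.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=7)

# 1. Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=7)

# 2. Fit data preparation on the training dataset only
scaler = MinMaxScaler().fit(X_train)

# 3. Apply data preparation to train and test datasets
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# 4. Evaluate models -- the estimate no longer benefits from test data
model = LogisticRegression(max_iter=1000).fit(X_train_s, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test_s)))
```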

Conclusions 

More generally, the entire modeling pipeline must be prepared only on the training dataset to avoid the data leakage problem. This might include data transforms, but also other techniques such as feature selection, dimensionality reduction, feature engineering, and more. This means that so-called "model evaluation" should really be called "modeling pipeline evaluation".
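One convenient way to enforce this is a scikit-learn Pipeline, which re-fits every preparation step inside each cross-validation fold; the particular steps below are illustrative, not from the post:

```python
# The Pipeline bundles preparation and model, so cross_val_score
# re-fits the scaler and feature selector on each fold's training
# split only -- evaluating the whole modeling pipeline.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=7)

pipeline = Pipeline([
    ("scale", MinMaxScaler()),
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipeline, X, y, cv=10, scoring="accuracy")
print("mean accuracy: %.3f" % scores.mean())
```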

 

 

 

Monday, September 21, 2020

IMPACT OF AI IN B2B MARKET

   IMPACT OF AI ON ACCOUNTS RECEIVABLE


Challenges Faced

The main challenge in A/R is the huge volume of transactions that companies have to deal with, especially in the B2B space. When we talk about hundreds of thousands of customers, we have millions of invoices, and we really have to go after every transaction. That is where a lot of time is spent: A/R teams manually going after invoices, trying to get paid, trying to apply the payment, or trying to figure out why a customer has not paid. Being that transaction-heavy is the main challenge in the B2B space.

When we have a lot of transactions that we have to go after manually, that is where technology comes to save us. We look into robotics, where we can automate different tasks, and we use Artificial Intelligence, which here means machine learning algorithms, to look at patterns from the customer's standpoint and from the standpoint of all the transactions. We use machine learning to predict the various payment patterns our customers have and the various deductions they have had in the past, helping customers tailor a solution to the problem they have in each of these spaces.

How to Overcome that Challenge

The world is evolving towards AI, and currently every sector is leveraging Artificial Intelligence in its business domain. Like any other business problem, I have used machine learning algorithms here as well, in this case to predict the partial payment amount of a customer. The problem statement we got from the company was based entirely on the B2B business domain. B2B is simply business conducted between two companies, where one business is the buyer and the other is the seller. The buyer purchases goods from the seller, and the seller issues invoices against those goods containing detailed information about all the goods, along with the payment amount and the payment due date.

Every business has an Accounts Receivable department that keeps track of records such as the payment status, customer payment details, and payment terms. In an ideal world, the buyer would pay back within a stipulated time period (i.e., the payment term). In the real world, however, the buyer seldom pays within the established period; sometimes they pay in installments, and it is quite difficult to keep track of all the records manually. So we built an Accounts Receivable chatbot along with an AI-enabled dashboard, where we have all the invoice details of the customer and, with the help of machine learning, can predict the partial payment amount.


Machine Learning Algorithm

Performed data preprocessing (cleaned the dataset by removing null values and converted the categorical values into numerical ones using Ordinal Encoding), feature engineering, feature transformation, and Exploratory Data Analysis. Pre-processed the data, unstacked it into multiple columns, and calculated the partial payment amount against each invoice.
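A minimal sketch of this kind of preprocessing with pandas and scikit-learn; the tiny DataFrame is a made-up stand-in for the real invoice data, and only the column names come from the post:

```python
# Clean null values and ordinal-encode the categorical columns.
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "name_of_customer": ["acme", "zenith", None, "acme"],
    "customer_payment_terms": ["NET30", "NET60", "NET30", "NET45"],
    "invoice_currency": ["USD", "CAD", "USD", "USD"],
    "age_of_invoice": [12, 45, 30, None],
})

# Remove rows containing null values
df = df.dropna().reset_index(drop=True)

# Convert categorical values into numerical ones via ordinal encoding
cat_cols = ["name_of_customer", "customer_payment_terms", "invoice_currency"]
df[cat_cols] = OrdinalEncoder().fit_transform(df[cat_cols])
print(df)
```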

Applied a Random Forest Classifier, which identified 4 important features, "name_of_customer", "customer_payment_terms", "age_of_invoice", and "invoice_currency", using a Variance Threshold of 0.01, and predicted the payment behaviour of a customer as a classification problem by labelling the data: "+1" for a customer who paid the full amount, "0" for a customer who paid a partial amount, and "-1" for a customer who did not pay any amount. Achieved 97% accuracy. Finally, created two models and compared them with respect to "invoice_currency": one that included the invoice_currency feature and one that did not. Comparing the two models showed that the model with invoice_currency performed better. This reflects the fact that, in the real world, B2B business works on credit: customers select products, place an order, and arrange delivery through an agreed logistics channel. Customers do not pay at the time of the order, but receive an invoice which they settle within agreed payment terms.
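A sketch of the labelling and model described above; the data is synthetic, and only the feature names, the +1/0/-1 labels, and the variance threshold come from the post, so treat it as an illustration rather than the project code:

```python
# Label each invoice by payment behaviour, drop near-constant
# features, and fit a Random Forest.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "name_of_customer": rng.integers(0, 50, n),
    "customer_payment_terms": rng.integers(0, 5, n),
    "age_of_invoice": rng.integers(0, 120, n),
    "invoice_currency": rng.integers(0, 3, n),
})
# Fraction of each invoice actually paid (stand-in values)
paid_ratio = rng.choice([0.0, 0.5, 1.0], size=n, p=[0.2, 0.3, 0.5])
# +1 = paid in full, 0 = partial payment, -1 = no payment
y = np.select([paid_ratio == 1.0, paid_ratio > 0.0], [1, 0], default=-1)

# Drop near-constant features (variance threshold = 0.01)
X_sel = VarianceThreshold(threshold=0.01).fit_transform(X)

X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("accuracy:", model.score(X_te, y_te))
print("feature importances:", model.feature_importances_)
```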




                                     [Figure: Model with invoice_currency]



                                   [Figure: Model without invoice_currency]


ALGORITHM ACCURACIES

Algorithm Used                 Accuracy
Logistic Regression            96.17%
SVM Classifier                 96.30%
Random Forest Classifier       97.81%

Project Link: https://colab.research.google.com/drive/1ddM_L_YbAG-0Y4HASxU53EogdbdLFGvQ?usp=sharing

Thursday, August 20, 2020

Life Cycle Of Machine Learning Problem

[Figure: Life cycle of a machine learning problem]

This is a generic life cycle model that can be applied to the development of any machine learning application.

First of all, we need a business idea, which is the key to any machine learning application, as it provides the justification for the resources needed during the life cycle of the development. The business need is also something we go back to during various stages of the life cycle, so that the team implementing the project does not lose sight of the original goal of the project.

The next step is creating the machine learning architecture. During this stage, the machine learning engineers or big data engineers and the data science team set out to create a Hadoop platform (a distributed platform) or an AWS platform to store the data in a structured format, in the form of clusters, on which the future machine learning application will be built. It is during this stage that the conceptualization of the business need and its translation into a technical architecture take place. It is also the part of the life cycle where the architects of the machine learning application go back to the business in order to understand its need better and start translating it into an implementable solution.

The next stage is data preparation, and this is a very important phase: we need to identify what kind of data is required and the sources from which it will come. Data can be generated from different sources, either as streaming data or as batch data, where streaming data is live data that comes from different IoT devices, whereas batch data is a collection of data points that have been grouped together within a specific time interval. Once we know what data is required to build the machine learning solution, we can see whether such data exists within the organization where the application is being built or whether it needs to come from outside the organization.

The next step in data preparation is acquisition of the data, where the team building the machine learning solution acquires it. If the whole data set is big, then at least a sample data set can be used, based on which the team builds the solution. Data acquisition may need to happen from both internal and external sources. The most important task here is to determine what format the data is available in, such as flat files like CSV, XML, or JSON, or an Oracle or DB2 database, etc. The implementation team classifies which sources provide structured data and which provide unstructured data. The treatment of unstructured and structured data in machine learning is very different, hence this identification is equally important.

In the next step of data preparation, we perform data wrangling, which largely deals with cleaning up the data. Here, all data that does not add value to the business solution is eliminated, and only the data required for solving the business need is kept.
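As a small illustration of the acquisition and wrangling steps with pandas (the file names and columns here are hypothetical):

```python
# Acquire data from two common formats, then wrangle: keep only the
# fields the business need requires and drop unusable rows.
# File names and columns are hypothetical.
import pandas as pd

orders = pd.read_csv("orders.csv")          # flat-file (CSV) source
customers = pd.read_json("customers.json")  # JSON source

# Join the sources and keep only the fields the solution needs
df = orders.merge(customers, on="customer_id", how="inner")
df = df[["customer_id", "order_date", "amount", "region"]]

# Basic cleanup: drop duplicates and rows missing key values
df = df.drop_duplicates().dropna(subset=["amount", "order_date"])
print(df.head())
```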

Once the data cleaning is done, we move to the next step of the machine learning cycle, which is exploratory data analysis. In exploratory data analysis, we look at the basic statistics of the data, such as its mean, median, and mode, and at the correlations between the different variables, and we identify whether the data comprises numerical or categorical variables, etc. This exploratory data analysis gives direction to the model building. For example, the choice of algorithm depends on the kind of variables we have.
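A first-pass EDA sketch, using scikit-learn's bundled diabetes data as a stand-in dataset:

```python
# Basic statistics, variable types, and correlations.
from sklearn.datasets import load_diabetes

df = load_diabetes(as_frame=True).frame

print(df.describe())        # mean, std, quartiles (the 50% row is the median)
print(df.mode().iloc[0])    # mode of each column
print(df.dtypes)            # numerical vs. categorical columns
print(df.corr()["target"])  # correlation of each feature with the target
```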

In the sample, we may have data sets with categorical variables such as gender (male or female) to be predicted based on other labels; in such a case, we may have to use non-quantitative algorithms for our model building.

Model building is the next step and is closely tied to exploratory data analysis. In this process, we analyze the descriptive statistics, identify which modeling technique we are going to use, and then build a benchmark predictive model. We try other methods and algorithms on the data set, and we interpret the results to find the best algorithm for creating the predictive model.

Once the model is identified, the next step is model validation. We use staging data sets that are closer to production and see how our model behaves; if it gives good results, the model is deployed and implemented. After this, feedback is taken to see whether it has met the business need for which it was built. If there is a new business need, or the model needs to take care of something else the business requires, then we go back through solution architecture, data preparation, EDA, model building, and model validation again. In essence, this is a cyclic process that goes on until the machine learning application is terminated.
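A sketch of the benchmark-and-compare step; the dataset and candidate algorithms are stand-ins chosen for illustration:

```python
# Benchmark a trivial model, then compare candidate algorithms with
# cross-validation and keep the best performer.
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "baseline (most frequent)": DummyClassifier(strategy="most_frequent"),
    "logistic regression": LogisticRegression(max_iter=5000),
    "random forest": RandomForestClassifier(random_state=0),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```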
