Thursday, August 20, 2020

Life Cycle of a Machine Learning Problem



This is a generic life-cycle model that can be applied to the development of any machine learning application.

First of all, we need a business need, which is key to any machine learning application: it justifies the resources spent during the life-cycle development. The business need is also something we return to at various stages of the life cycle so that the team implementing the project does not lose sight of the original goal.

The next step is creating the machine learning architecture. During this stage, the machine learning engineers or big data engineers, together with the data science team, set out to create a Hadoop platform (a distributed platform) or an AWS-based platform to store the data in a structured format on clusters, on top of which the future machine learning application will be built. It is during this stage that the business need is conceptualized and translated into a technical architecture. It is also the part of the life cycle where the architects of the machine learning application go back to the business to understand its need better and start translating it into an implementable solution.
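Once the platform is in place, the modeling environment typically pulls data from whatever store the architecture defines. As a minimal sketch, assuming the prepared data lands in an S3 bucket (the bucket and file names below are hypothetical, and pandas needs the s3fs package installed to read s3:// URLs):

import pandas as pd

# Pull a prepared data set from cloud storage into the modeling environment.
# A real project would point this at whatever store the architecture
# defines (HDFS, S3, a database, etc.).
df = pd.read_csv("s3://example-ml-project/raw/customers.csv")
print(df.shape)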

The next stage is data preparation, and this is a very important phase: we need to identify what kind of data is required and the sources it will come from. Data can be generated from different sources, either as streaming data or as batch data. Streaming data is live data that arrives continuously, for example from IoT devices, whereas batch data is a collection of data points grouped together within a specific time interval. Once we know what data is required to build the machine learning solution, we can check whether such data already exists within the organization where the application is being built or whether it needs to come from outside the organization.

The next step in data preparation is acquisition of the data by the team building the machine learning solution. If the whole data set is big, then at least a sample data set can be used as the basis for building the solution. Data acquisition may need to happen from both internal and external sources. The most important task here is to determine what format the data is available in, such as flat files (CSV, XML, or JSON) or databases (Oracle, DB2, etc.). The implementation team also classifies which sources provide structured data and which provide unstructured data. Structured and unstructured data are treated very differently in machine learning, so this identification is equally important.

In the final step of data preparation, we perform data wrangling, which largely deals with cleaning up the data. Here, any data that does not add value to the business need is eliminated, and only the data required for solving the business need is kept.
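A minimal sketch of acquiring data in the formats mentioned above and then doing basic wrangling with pandas; all file and column names here are hypothetical, chosen only to illustrate the step:

import pandas as pd

# Acquisition: read data in the formats mentioned above
csv_df  = pd.read_csv("sales.csv")     # flat file
json_df = pd.read_json("events.json")  # JSON export
xml_df  = pd.read_xml("catalog.xml")   # XML export (pandas >= 1.3)

# Wrangling: keep only the columns that serve the business need,
# drop exact duplicates, and handle missing values
df = csv_df[["customer_id", "age", "gender", "purchase_amount"]]
df = df.drop_duplicates()
df = df.dropna(subset=["purchase_amount"])        # target must be present
df["age"] = df["age"].fillna(df["age"].median())  # impute a numeric field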

Once data cleaning is done, we move to the next step of the machine learning life cycle: exploratory data analysis. In exploratory data analysis, we look at basic statistics of the data such as the mean, median, and mode, examine the correlations between the different variables, and identify whether the data comprises numerical or categorical variables, etc. This exploratory data analysis gives direction to the model building. For example, the choice of algorithm depends on the kind of variables we have.
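These basic EDA checks are one-liners in pandas. A short sketch, reusing the hypothetical file and columns from the previous step:

import pandas as pd

df = pd.read_csv("sales.csv")  # hypothetical file from the previous step

# Basic statistics: mean, median, and mode
print(df.describe())                   # count, mean, std, quartiles
print(df["purchase_amount"].median())
print(df["gender"].mode())

# Correlations between the numeric variables
print(df.select_dtypes(include="number").corr())

# Identify numerical vs. categorical variables
categorical = df.select_dtypes(include="object").columns.tolist()
numerical   = df.select_dtypes(include="number").columns.tolist()
print("categorical:", categorical, "numerical:", numerical)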

For example, we may have a data set with a categorical variable such as gender (male or female) to be predicted based on the other variables. In such a case we may have to use algorithms suited to categorical targets, such as classification algorithms, for our model building.

Model building is the next step and is closely tied to exploratory data analysis. In this process we analyze the descriptive statistics, identify which modeling techniques to try, and build a benchmark predictive model. We then apply other methods and algorithms to the data set and try to interpret the results and find the best algorithm for the predictive model.

Once the model has been identified, the next step is model validation. We use staging data sets that are closer to production and see how the model behaves; if it gives good results, the model is deployed and implemented. After this, feedback is taken to see whether it has met the business need for which it was built. If there is a new business need, or the model needs to handle something else the business requires, we again go through solution architecture, data preparation, EDA, model building, and model validation. In essence, this is a cyclic process that goes on until the machine learning application is retired.
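A hedged sketch of this benchmark-then-compare flow using scikit-learn, with the hypothetical columns from the earlier examples and gender as the categorical target; this is one possible instantiation, not a prescribed recipe:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("sales.csv")          # hypothetical cleaned data set
X = df[["age", "purchase_amount"]]
y = df["gender"]                       # categorical target, as in the text

# Hold out data to stand in for the validation step
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Benchmark predictive model: always predicts the majority class
candidates = {
    "baseline": DummyClassifier(strategy="most_frequent"),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=42),
}

# Try each algorithm and compare against the benchmark on held-out data
for name, model in candidates.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {acc:.3f}")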

