This is a generic life cycle model that can be applied to the development of any machine learning application.
First of all, we need a business idea, which is key to any machine learning application, as it justifies the resources consumed during the life cycle. The business need is also something we return to at various stages of the life cycle so that the team implementing the project does not lose sight of the original goal.
The next step is creating the machine learning architecture. During this stage, the machine learning engineers or big data engineers and the data science team set out to create a Hadoop platform (a distributed platform) or a cloud platform such as AWS to store the data in a structured format on clusters, on which the future machine learning application will be built. It is during this stage that the business need is conceptualized and translated into a technical architecture. It is also the part of the life cycle where the architects of the machine learning application go back to the business in order to understand its need better and to start translating it into an implementable solution.
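As a concrete illustration, here is a minimal sketch of such a platform setup, assuming PySpark running against a Hadoop/YARN cluster; the application name and the HDFS paths are hypothetical:

```python
# A minimal sketch, assuming PySpark is installed and a Hadoop/YARN
# cluster is available; the app name and paths are hypothetical.
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session that runs against the cluster.
spark = (
    SparkSession.builder
    .appName("ml-application-platform")  # hypothetical application name
    .master("yarn")                      # submit to the Hadoop/YARN cluster
    .getOrCreate()
)

# Land raw data on the cluster in a structured, columnar format (Parquet).
df = spark.read.csv("hdfs:///raw/sales.csv", header=True, inferSchema=True)
df.write.mode("overwrite").parquet("hdfs:///structured/sales")
```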
The next stage is data preparation, and this is a very important phase: right from the start we need to identify what kind of data is required and the sources from which it will come. Data can be generated from different sources, either in the form of streaming data or in the form of batch data. Streaming data is a live feed that arrives continuously from sources such as IoT devices, whereas batch data is a collection of data points grouped together within a specific time interval, as the sketch below illustrates.
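The following is a minimal sketch contrasting the two ingestion styles; the sensor readings are hypothetical, and the IoT device is simulated with a generator:

```python
import random
import time

import pandas as pd

# Batch data: points collected over a time interval, processed at once.
batch_df = pd.DataFrame(
    {"sensor_id": [1, 2, 3], "temperature": [21.4, 22.1, 20.8]}
)
print(batch_df.describe())

# Streaming data: a live feed arriving one reading at a time.
def iot_stream(n_readings=5):
    for _ in range(n_readings):
        yield {"sensor_id": 1, "temperature": 20 + random.random() * 5}
        time.sleep(0.1)  # readings arrive continuously, not in one batch

for reading in iot_stream():
    print(reading)  # each point is processed as soon as it arrives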
Once we know what data is required to build the machine learning solution, we can check whether such data exists within the organization where the application is being built or whether it needs to come from outside the organization. The next stage in data preparation is acquisition of the data, where the team that is building the machine learning solution acquires it. If the whole data set is large, then at least a sample data set can be used, based on which the team will build the solution. Data acquisition may need to happen from both internal and external sources. The most important task here is to determine the format in which the data is available, such as flat files like CSV, XML, or JSON, or a relational source such as an Oracle or DB2 database.
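A minimal sketch of pulling data from several such formats follows; all file names and the connection string are hypothetical:

```python
import pandas as pd
from sqlalchemy import create_engine

csv_df = pd.read_csv("customers.csv")   # flat file: CSV
json_df = pd.read_json("events.json")   # flat file: JSON
xml_df = pd.read_xml("orders.xml")      # flat file: XML (pandas >= 1.3)

# Relational sources such as Oracle or DB2 are typically read through a
# database driver; SQLAlchemy gives a uniform interface.
engine = create_engine("oracle+cx_oracle://user:password@host:1521/service")
db_df = pd.read_sql("SELECT * FROM transactions", engine)
```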
The implementation team then classifies which sources provide structured data and which provide unstructured data. Structured and unstructured data are treated very differently in machine learning, so identifying each is equally important. In the next step of data preparation, we perform data wrangling, which largely deals with cleaning up the data. Here, any data that does not add value to the business need is eliminated, and only the data required for solving the business need is kept.
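Here is a minimal data-wrangling sketch; the DataFrame and its column names are hypothetical stand-ins for the acquired data:

```python
import pandas as pd

raw_df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "age": [34, None, None, 51],
    "internal_memo": ["a", "b", "b", "c"],  # adds no value to the business need
    "churned": [0, 1, 1, 0],
})

clean_df = (
    raw_df
    .drop_duplicates()                # remove repeated records
    .drop(columns=["internal_memo"])  # drop data irrelevant to the need
    .assign(age=lambda d: d["age"].fillna(d["age"].median()))  # fill gaps
)
print(clean_df)
```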
Once the data cleaning is done, we move to the next step of the machine learning cycle, which is exploratory data analysis. In exploratory data analysis, we look at the basic statistics of the data, such as its mean, median, and mode, and the correlation between the different labels, and we identify whether the data comprises numerical or categorical variables, etc. This exploratory data analysis gives direction to the model building; for example, the choice of algorithm depends on the kind of variables we have.
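A minimal exploratory data analysis sketch follows, using a hypothetical cleaned DataFrame like the one produced in the wrangling step:

```python
import pandas as pd

# Hypothetical cleaned data from the wrangling step.
clean_df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "age": [34.0, 42.5, 51.0],
    "churned": [0, 1, 0],
})

print(clean_df["age"].mean())    # mean
print(clean_df["age"].median())  # median
print(clean_df["age"].mode())    # mode

# Correlation between the numerical labels.
print(clean_df.select_dtypes("number").corr())

# Identify which variables are numerical and which are categorical.
print(clean_df.dtypes)
```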
In the sample, we may have data sets with a categorical variable such as gender (male or female) to be predicted based on other labels. In such a case we may have to use classification algorithms, rather than quantitative (regression) algorithms, for our model building. Model building is the next step and is
closely tied to exploratory data analysis. In this process we analyze the descriptive statistics, identify which modeling technique we are going to use, and then build a benchmark predictive model. We then try other methods and algorithms on the data set and interpret the results to find the best algorithm for creating the predictive model, as in the sketch below.
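Here is a minimal model-building sketch: fit a simple benchmark, then compare a few candidate algorithms on the same split. The synthetic data and the candidate list are hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix X and categorical target y.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

candidates = {
    "benchmark (majority class)": DummyClassifier(strategy="most_frequent"),
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=0),
}

# Fit every candidate and compare it against the benchmark.
for name, model in candidates.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))
```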
Once the model is identified, the next step is model validation. We use staging data sets that are closer to production and see how our model behaves; if it gives good results, then the model is deployed and implemented, as sketched below.
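The following is a minimal validation sketch: score the chosen model on a staging set that resembles production before deploying it. The model, the staging split, and the acceptance threshold are all hypothetical stand-ins:

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical chosen model and a staging set close to production.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_staging, y_train, y_staging = train_test_split(X, y, random_state=0)
best_model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

staging_accuracy = best_model.score(X_staging, y_staging)
if staging_accuracy >= 0.80:                 # hypothetical acceptance bar
    joblib.dump(best_model, "model.joblib")  # persist for deployment
    print("model accepted for deployment:", staging_accuracy)
else:
    print("model rejected, revisit earlier stages:", staging_accuracy)
```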
After deployment, feedback is gathered to see whether the model has met the business need for which it was built. If a new business need arises, or the model needs to accommodate additional business requirements, then we go back through the process of solution architecture, data preparation, exploratory data analysis, model building, and model validation. In essence, this is a cyclic process that goes on until the machine learning application is terminated.