
Sunday, July 19, 2020

Pitfalls to avoid for effective model building


It is of utmost importance that the most optimized model is deployed to production, and this is usually assessed via model performance characteristics like accuracy, precision, recall, F1 score, etc. To get there, we may employ various methods like feature engineering, hyper-parameter tuning, trying different model families (e.g., SVMs), and so on.
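For instance, here is a minimal, self-contained sketch of computing such metrics with scikit-learn; the synthetic dataset and logistic regression are purely illustrative stand-ins, not tied to the example later in this article:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real dataset (purely illustrative)
X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1 score :", f1_score(y_test, y_pred))
```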

However, before optimizing any model, we need to choose the right one in the first place. There are several factors that come into play before we decide upon the suitability of any model like:

a. Has the data been cleaned adequately?

b. What methods have been used for data preparation?

c. What feature engineering techniques are we going to apply?

d. How do we interpret and handle observations like skewness, outliers, etc.?

Here, we will focus on the last of these factors, where most of us are prone to making mistakes.

It is standard practice to normalize the distribution by reducing outliers, dropping certain parameters, etc. before feature selection. But sometimes one needs to take a step back and ask:

a. How is our normalization affecting the entire dataset, and

b. Is it gearing us towards the correct solution within the given context?

Let us examine this premise with a practical example as shown below.

Problem statement: Predicting concrete compressive strength using artificial neural networks

As usual, the data has been cleaned and prepared for detailed analysis before model selection and building. Please note that we will not be addressing those initial stages in this article. Let us have a look at some of the key steps and observations described below.

1. Dropping outliers for normalization

An initial exploratory data analysis and visualization shows the overall distribution of the target column "strength":

[Figure: distribution of the target column "strength"]
As seen above, the data distribution is quite sparse, with both positive and negative skewness across columns. Further analysis of the per-column statistics reveals the following:
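One way to surface these numbers is a quick summary-statistics and box-plot check. Here is a minimal sketch, assuming the UCI file has been saved locally as "concrete.csv" with the short column names used in this article (the file name and column names are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumes the UCI file is saved locally as "concrete.csv" and the columns
# renamed to the short names used in this article: cement, slag, ash,
# water, superplastic, coarseagg, fineagg, age, strength.
df = pd.read_csv("concrete.csv")

print(df.describe().T)   # min, quartiles, median and max per column
print(df.skew())         # positive vs. negative skewness per column

df["strength"].hist(bins=30)   # distribution of the target
plt.title('Distribution of "strength"')
plt.show()

df.boxplot(rot=45)             # outliers appear beyond the whiskers
plt.show()
```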


The following are the observations:

a. Cement, slag, ash, coarseagg and fineagg display large spreads, indicating the possibility of outliers

b. Slag, ash and coarseagg have their median values closer to either the 1st quartile or the minimum, while both slag and fineagg have maximum values that register as outliers

c. The target column "strength" has many maximum values that register as outliers

Replacing outliers in the target column "strength" with any other value would defeat the purpose of the analysis, i.e., developing a best-fit model that yields the mixture with maximum compressive strength. Hence, it is better to replace outliers with mean values only for the other variables, as per the analysis above, and leave the target column as it is.
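A minimal sketch of this replacement step, assuming the conventional 1.5 * IQR rule as the outlier criterion (the article does not fix a specific rule) and continuing with the df from the earlier snippet:

```python
# Leave the target column "strength" untouched; treat only the features
features = [c for c in df.columns if c != "strength"]

for col in features:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = (df[col] < low) | (df[col] > high)
    df.loc[outliers, col] = df[col].mean()   # replace outliers with the column mean
```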

2. Dropping variables to reduce skewness

Before applying feature engineering techniques, we need to look at the correlations between the variables, as shown below:
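A minimal sketch of how such a correlation matrix can be computed and visualized, with seaborn assumed for the heatmap and df as before:

```python
import matplotlib.pyplot as plt
import seaborn as sns

corr = df.corr()   # pairwise Pearson correlations

plt.figure(figsize=(9, 7))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation between variables")
plt.show()

# Correlation of each feature with the target, sorted:
print(corr["strength"].sort_values(ascending=False))
```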





Observations based on our analysis:

a. There is no strong correlation between any pair of variables

b. Compressive strength increases with the amount of cement

c. Compressive strength increases with age

d. Compressive strength decreases as fly ash increases

e. Strength increases with the addition of superplasticizer

Observations based on domain knowledge:

a. Cement with low age requires more water for higher strength, i.e., the younger the cement, the more water it requires

b. Strength increases when less water is used in the mix, i.e., more water leads to reduced strength

c. Less coarse aggregate along with less slag increases strength

Based on both sets of observations, we can drop only the variable slag, while the rest need to be retained.

If we were to drop variables solely based on the correlations observed in the given dataset, we would end up with a model having pretty high accuracy, but at the same time it would be, at best, a "paper model", i.e., not practicable in the real world. Hence, a certain amount of domain knowledge, either directly or through consultation with a subject-matter expert, goes a long way in avoiding major pitfalls while building models.
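To close the loop on the problem statement, here is a minimal sketch of an artificial neural network trained on the retained features. TensorFlow/Keras is assumed, and the layer sizes, epochs and split are illustrative choices, not the author's exact architecture:

```python
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Continuing with df from the earlier snippets: drop slag per the analysis
# above and train a small feed-forward regression network.
X = df.drop(columns=["slag", "strength"])
y = df["strength"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(X_train.shape[1],)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),   # single output: predicted strength
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
model.fit(X_train, y_train, epochs=100, validation_split=0.2, verbose=0)

mse, mae = model.evaluate(X_test, y_test, verbose=0)
print(f"test MSE: {mse:.2f}, test MAE: {mae:.2f}")
```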

The above example pretty much sums up what we can call "bias" (pun intended), which most of us can be prone to, whether we have a technical edge or a domain edge. Hence, it is good practice to rethink the methods applied vis-à-vis the big picture.

Source: The data for this project is available at https://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/

Reference:  I-Cheng Yeh, "Modeling of strength of high performance concrete using artificial neural networks," Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998).

 

