It is of utmost importance that the most optimized model is deployed to production, and model suitability is usually judged via performance metrics like accuracy, precision, recall, F1 score, etc. To achieve this, we may employ various methods like feature engineering, hyper-parameter tuning, trying different model families (e.g. SVMs), etc.
However, before optimizing any model, we need to choose the right one in the first place. Several factors come into play before we decide upon the suitability of any model:
a. Has the data been cleaned adequately?
b. What methods have been used for data preparation?
c. What feature engineering techniques are we going to apply?
d. How do we interpret and handle observations like skewness, outliers, etc.?
Here, we will focus on the last factor, where most of us are prone to making mistakes.
It is standard practice to normalize the distribution by reducing outliers, dropping certain parameters, etc. before feature selection. But sometimes one needs to take a step back and ask:
a. How is our normalization affecting the entire dataset?
b. Is it steering us towards the correct solution within the given context?
Let us examine this premise with a practical example.
Problem statement: Predicting concrete compressive strength using artificial neural networks
As usual, the data has been cleaned and prepared for detailed analysis before model selection and building. Please note that we will not be addressing those initial stages in this article. Let us look at some of the key steps and observations described below.
1. Dropping outliers for normalization
An initial exploratory data analysis and visualization depict the overall distribution of the target column "strength".
As seen above, the data distribution is quite sparse, with both positive and negative skewness across variables. Further analysis reveals the following observations:
a. Cement, slag, ash, coarseagg and fineagg display huge ranges, indicating the possibility of outliers.
b. Slag, ash and coarseagg have their median values closer to either the 1st quartile or the minimum, while both slag and fineagg have maximum values as outliers.
c. The target column "strength" has many maximum values as outliers.
Replacing the outliers of the target, compressive strength, with any other value would defeat the purpose of the analysis, i.e. developing a best-fit model that yields a mixture with maximum compressive strength. Hence, it is better to replace outliers with mean values only for the other variables, as per the analysis, and leave the target column as it is.
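This step can be sketched as follows. A minimal example, assuming a pandas DataFrame with the UCI concrete column names and a simple 1.5x-IQR fence as the outlier rule (the fence choice is an assumption, not something the article specifies):

```python
import pandas as pd

def replace_feature_outliers(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Replace IQR outliers with the column mean for every column
    except the target, which is left untouched."""
    out = df.copy()
    for col in out.columns:
        if col == target:
            continue  # keep outliers in "strength": max-strength mixes matter
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        mask = (out[col] < lo) | (out[col] > hi)
        out.loc[mask, col] = out[col].mean()  # mean of the original column
    return out
```

Calling `replace_feature_outliers(df, "strength")` then normalizes cement, slag, ash, etc. while the target distribution is preserved intact.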
2. Dropping variables to reduce skewness
Before applying feature engineering techniques, we need to look at the correlation of the variables as shown below.
Observations based on our analysis:
a. There is no high correlation between any of the variables.
b. Compressive strength increases with the amount of cement.
c. Compressive strength increases with age.
d. As fly ash increases, compressive strength decreases.
e. Strength increases with the addition of superplasticizer.
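The correlation check behind these observations can be sketched like this. The column names and the sample values are illustrative stand-ins for the real dataset, and the 0.8 threshold for "high correlation" is an assumption:

```python
import pandas as pd

# Illustrative stand-in rows; in practice, load the full UCI concrete dataset.
df = pd.DataFrame({
    "cement":   [540.0, 332.0, 198.0, 266.0, 380.0],
    "water":    [162.0, 228.0, 192.0, 228.0, 228.0],
    "age":      [28, 270, 360, 365, 90],
    "strength": [79.99, 40.27, 44.30, 45.85, 43.70],
})

corr = df.corr(numeric_only=True)  # pairwise Pearson correlations

# Flag any pair of variables whose |r| exceeds a chosen threshold (0.8 here).
high = [
    (a, b, round(corr.loc[a, b], 2))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > 0.8
]
print(high)
```

On the full dataset, an empty (or near-empty) `high` list supports observation (a) above; the signed entries of `corr["strength"]` support observations (b) through (e).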
Observations based on domain knowledge:
a. Cement with a low age requires more water for higher strength, i.e. the younger the cement, the more water it requires.
b. Strength increases when less water is used in preparing the mix, i.e. more water leads to reduced strength.
c. Less coarse aggregate along with less slag increases strength.
We can drop only the variable slag, while the rest need to be retained.
If we were to drop certain variables solely based on the correlations observed in the given dataset, we would end up with a model having pretty high accuracy, but it would at best be a "paper model", i.e. not practicable in the real world. Hence, a certain amount of domain knowledge, applied either directly or through consultation with a subject-matter expert, goes a long way in avoiding major pitfalls while building models.
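Putting these decisions together, the modelling step might look like the sketch below. It uses scikit-learn's MLPRegressor as a stand-in for the artificial neural network, synthetic data in place of the cleaned dataset, and assumes seven retained features (slag dropped, per the analysis above); the layer sizes are arbitrary choices, not the article's:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the cleaned data: 7 retained features
# (cement, ash, water, superplasticizer, coarseagg, fineagg, age)
# and a linear-ish strength target with a little noise.
X = rng.uniform(0.0, 1.0, size=(200, 7))
y = X @ rng.uniform(-1.0, 1.0, size=7) + rng.normal(0.0, 0.05, size=200)

model = make_pipeline(
    StandardScaler(),  # scaling the inputs helps the network converge
    MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0),
)
model.fit(X, y)
preds = model.predict(X[:5])
```

With the real dataset, `X` would be the DataFrame with slag dropped and feature outliers replaced, and `y` the untouched "strength" column.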
The above example pretty much sums up what we can call "bias" (pun intended), which most of us are prone to, whether we have a technical edge or a domain edge. Hence, it is good practice to rethink the methods applied vis-à-vis the big picture.
Source: The data for this project is available at https://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/
Reference: I-Cheng Yeh, "Modeling of strength of
high performance concrete using artificial neural networks," Cement and
Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998).