“You see but you do not observe…” says Sherlock Holmes in one of his adventures. The quote captures the foundation of Holmes’ deductions: gather all of the available information about a case before making conjectures about it.
Data analysis is a refined technique to talk to the dataset and uncover what it has to offer. Extracting trends and patterns from the data requires a well-defined, methodical path. Before trying to predict future trends, one must check whether the sample collected is good enough for analysis. A lot of time is invested in exploring, cleaning and preprocessing data, using various methods. Data visualizations using different subsets of data can help us understand the distribution of the data and spot heavy outliers, missing values and wrong values. Further investigation with a domain expert can help fix missing data and correct outliers.
Descriptive statistics can provide us with a lot of insight into the dataset during initial exploration.
You may want to read up on descriptive statistics before interpreting a statistical summary.
Let us take up a Housing_Qualifier dataset to understand the statistical summary. The aim of the task is to build a binary classification model that predicts the attribute QUALIFIED as 0 if a property is not qualified and 1 if it is qualified. The scope of this article is to explore the data, pre-process it and analyze two features in the dataset based on the descriptive statistics. Among the many libraries available, the function ProfileReport() from the pandas_profiling library summarizes the statistics.
The dataset has more than 75000 rows and 38 columns.
Let us focus our analysis on two features of the dataset: PRICE and ROOMS (number of rooms).
How to interpret the summary?
Interpretation of statistical summary in five points: PRICE
1. 27% of observations contain zero values, distorting the Range of the feature. Missing values amount to 18% which is very high for a strong feature such as Price. Price of any property cannot be zero or have missing values.
2. Mean = 384126.34 and standard deviation = 577772.06. The SD is larger than the mean, which indicates that many data points lie far from the mean. The histogram also shows us that the data distribution is positively skewed.
3. The CV value of 1.504 indicates that the data is widely dispersed.
4. Skewness = 9.22 shows that data has a strong positive skewness indicating lack of Gaussian distribution.
5. The kurtosis value of 251.87 is far above that of a normal distribution (3, or 0 when reported as excess kurtosis), indicating a sharp peak and a long tail containing a lot of outliers. The zero values noted in point 1 likely contribute to this distortion.
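The five figures above can be reproduced with plain pandas. The sketch below is a minimal illustration on a made-up PRICE column (the values are hypothetical, not taken from the Housing_Qualifier dataset). It assumes that the profiling report uses the same definitions as pandas, in particular the sample standard deviation and excess kurtosis, where a normal distribution scores 0.

```python
import numpy as np
import pandas as pd

# Hypothetical PRICE values: zeros, missing entries and one extreme sale.
price = pd.Series(
    [0, 0, np.nan, 250_000, 310_000, 275_000, 420_000, np.nan, 390_000, 5_200_000],
    name="PRICE",
)


def summarize(s: pd.Series) -> dict:
    """Return the descriptive statistics discussed above for one numeric column."""
    mean = s.mean()  # NaN values are skipped by default
    std = s.std()    # sample standard deviation (ddof=1)
    return {
        "zeros_pct": 100 * (s == 0).mean(),
        "missing_pct": 100 * s.isna().mean(),
        "mean": mean,
        "std": std,
        "cv": std / mean,      # coefficient of variation
        "skewness": s.skew(),
        "kurtosis": s.kurt(),  # excess kurtosis: 0 for a normal distribution
    }


summary = summarize(price)
for name, value in summary.items():
    print(f"{name:>12}: {value:,.3f}")

# The CV quoted in point 3 follows directly from the mean and SD in point 2.
article_cv = round(577772.06 / 384126.34, 3)
print("CV from the article's figures:", article_cv)
```

Because the zeros are real entries while the missing values are skipped, the two affect the statistics differently: zeros pull the mean down and add a spike at 0, whereas missing values only shrink the sample.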
Interpretation of statistical summary in five points: ROOMS
1. The IQR covers the middle 50% of observations and describes the spread of the bulk of the data. Although the box plot shows a high number of outliers, the IQR value is not affected by them.
2. The histogram shows that most of the houses have between 6–8 rooms. There are a few outliers with values ranging between 38 and 101, which affects the range and thereby the distribution.
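3. Mean = 7.35 lies within the 6–8 range where most houses fall, so the outliers pull the mean only slightly.
4. Skewness = 2.68 shows that the data is positively skewed and not normally distributed.
5. High kurtosis value of 41.49 indicates heavy tails, that is, a lot of outliers, such as the extreme room counts noted in point 2.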
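To see why the IQR is robust while the range is not, here is a small sketch on hypothetical ROOMS values (not the real column). It flags outliers with the common 1.5 × IQR rule used by many box plots; whether the box plot in the report uses exactly this rule is an assumption.

```python
import pandas as pd

# Hypothetical ROOMS values: most houses have 6 to 8 rooms, two are extreme.
rooms = pd.Series([5, 6, 6, 7, 7, 7, 8, 8, 9, 38, 101], name="ROOMS")

q1, q3 = rooms.quantile(0.25), rooms.quantile(0.75)
iqr = q3 - q1

# 1.5 * IQR rule: points far beyond the quartiles are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = rooms[(rooms < lower) | (rooms > upper)]

# Recompute both measures of spread without the flagged points.
trimmed = rooms[(rooms >= lower) & (rooms <= upper)]
trimmed_iqr = trimmed.quantile(0.75) - trimmed.quantile(0.25)

print("outliers:", outliers.tolist())
print("range with / without outliers:",
      rooms.max() - rooms.min(), "/", trimmed.max() - trimmed.min())
print("IQR with / without outliers:", iqr, "/", trimmed_iqr)
```

In this toy sample the range collapses once the two extreme values are dropped, while the IQR stays the same.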
The above five-point analysis helped us peek into the data by observing and analyzing two features. A fair analysis of the features and their distribution can be obtained using descriptive statistics.
How should the data be wrangled and normalized?
Missing values: Missing values can distort the skewness, and therefore the apparent distribution, of the data. They can arise from non-availability of data or from errors during data entry, and can be addressed in various ways. First, identify the different kinds of missing values, such as:
- There can be non-standard missing values such as “ N/A, n/a, — ”.
- Unexpected missing values such as having a text in a numerical feature and vice-versa
- Empty spaces or no value in the features
Missing values in a numerical feature can be replaced with the median or mean of the available values. The most frequent value can be used to replace missing values in a categorical feature. These are a few technical ways to replace missing values, but domain knowledge can prove crucial when imputing missing data.
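The steps above can be sketched in pandas as follows. This is a minimal illustration, not the article's own code: the HEAT column, the sample values and the list of non-standard markers are assumptions and would need to be adapted to the real dataset.

```python
import pandas as pd

# Hypothetical sample; column names and values are for illustration only.
df = pd.DataFrame({
    "PRICE": ["250000", "N/A", "310000", "abc", " ", "420000"],
    "HEAT": ["Forced Air", "n/a", "Hot Water", "Forced Air", "-", "Forced Air"],
})

# Assumed list of non-standard missing markers; extend it for your own data.
NON_STANDARD_MISSING = ["N/A", "n/a", "-", ""]

# 1. Strip whitespace so blank cells become empty strings, then blank out the markers.
stripped = df.apply(lambda col: col.str.strip())
cleaned = stripped.mask(stripped.isin(NON_STANDARD_MISSING))

# 2. Text in a numerical feature becomes NaN when the column is coerced to numbers.
cleaned["PRICE"] = pd.to_numeric(cleaned["PRICE"], errors="coerce")

# 3. Impute: median for the numerical feature, most frequent value for the categorical one.
cleaned["PRICE"] = cleaned["PRICE"].fillna(cleaned["PRICE"].median())
cleaned["HEAT"] = cleaned["HEAT"].fillna(cleaned["HEAT"].mode()[0])

print(cleaned)
```

The median is used here rather than the mean because it is less sensitive to the kind of outliers seen in PRICE; either choice should be checked against domain knowledge.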
Encoding categorical values: Most machine learning algorithms only understand numerical feature values. Thus, categorical feature values must be converted into numbers before they are fed into the model. This brings us to an important preprocessing stage, also widely known as data wrangling. Categorical variables can be encoded by various methods.
- Binary categorical values can be encoded by mapping them to 0 or 1. For example, a Gender feature containing ‘Male’ or ‘Female’ can be mapped to numeric binary values.
- Label Encoding can be used to simply convert each value in the feature to a number. For example a Flooring_Type feature can have values such as ‘HardWood’ or ‘Vitrified Tiles’ or ‘Carpet’ or ‘Ceramic’. These values can be encoded to 1, 2, 3, 4.
- One hot encoding can help overcome the main disadvantage of Label Encoding, namely that the model may misinterpret the numeric codes as having an order or magnitude. It converts each category of a feature into a separate feature and encodes it as 1 or 0. This ensures that the weight of the feature is not inappropriate in the model. More advanced approaches to categorical encoding also exist.
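The three approaches in the list above can be compared side by side in a short pandas sketch. The DataFrame is hypothetical and uses the Gender and Flooring_Type examples from the text; the specific integer codes are arbitrary choices.

```python
import pandas as pd

# Hypothetical data built from the examples in the text.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Flooring_Type": ["HardWood", "Vitrified Tiles", "Carpet", "Ceramic"],
})

# Binary mapping: two categories become 0 and 1.
df["Gender_enc"] = df["Gender"].map({"Male": 0, "Female": 1})

# Label encoding: each category gets an integer; the numbering is arbitrary.
flooring_codes = {"HardWood": 1, "Vitrified Tiles": 2, "Carpet": 3, "Ceramic": 4}
df["Flooring_label"] = df["Flooring_Type"].map(flooring_codes)

# One hot encoding: one 0/1 column per category, so no order is implied.
one_hot = pd.get_dummies(df["Flooring_Type"], prefix="Flooring", dtype=int)
encoded = pd.concat([df, one_hot], axis=1)

print(encoded)
```

With label encoding a model could read ‘Ceramic’ (4) as four times ‘HardWood’ (1); the one hot columns avoid that at the cost of adding one column per category.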
Handling Date features: Date features, once parsed, can prove to be important features for the model. They may be stored as strings in various formats in the dataset, and the format may also include time. They must therefore be converted before being included as features. One method is to split them and create additional features. For example, the date 2016-06-21T00:00:00.000Z can be split into multiple features, namely year / month / day / hr / min / sec / milli-sec.
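A minimal sketch of this split using pandas is shown below. The SALEDATE column name and the second timestamp are hypothetical; the first value is the example date from the text.

```python
import pandas as pd

# Hypothetical column of ISO 8601 date strings.
df = pd.DataFrame({
    "SALEDATE": ["2016-06-21T00:00:00.000Z", "2014-11-03T14:25:07.250Z"],
})

# Parse the strings into timezone-aware timestamps.
parsed = pd.to_datetime(df["SALEDATE"], utc=True)

# Split each timestamp into separate numeric features.
date_parts = pd.DataFrame({
    "year": parsed.dt.year,
    "month": parsed.dt.month,
    "day": parsed.dt.day,
    "hour": parsed.dt.hour,
    "minute": parsed.dt.minute,
    "second": parsed.dt.second,
    "millisecond": parsed.dt.microsecond // 1000,
})

print(date_parts)
```

Parts that are constant across the dataset, such as the time components when every timestamp is at midnight, carry no information and can be dropped.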
Normalizing data: The skewness of a distribution can be reduced to a large extent, bringing it closer to Gaussian, by transforming the data with a log or square root transformation.
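The effect of both transformations can be checked by comparing skewness before and after. The sketch below uses a small, hypothetical set of right-skewed prices. It uses log(1 + x) rather than a plain log, which is a choice made here so that zero values, such as those found in PRICE, do not break the transformation.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed prices: a few expensive properties stretch the tail.
price = pd.Series(
    [50_000, 80_000, 120_000, 200_000, 350_000, 600_000, 1_200_000, 2_500_000]
)

sqrt_price = np.sqrt(price)
log_price = np.log1p(price)  # log(1 + x), defined even when a value is 0

for label, s in [("raw", price), ("sqrt", sqrt_price), ("log1p", log_price)]:
    print(f"{label:>6}: skewness = {s.skew():.2f}")
```

On this sample the square root reduces the skewness and the log reduces it further; how much each helps on real data depends on how heavy the tail is, so it is worth re-checking the histogram after transforming.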
Conclusion: Data in the real world is messy, with abundant missing values, skewed distributions and numerous categorical features. Data needs to be broken down, crunched, imputed and transformed before it can reveal patterns or trends and support predictions or classifications. Statistical analysis and observations using different library functions in Python / R can help us understand the availability and distribution of data in various features.
Data exploration is thus an essential process before modelling.