What is descriptive statistics in machine learning?
Updated: Oct 6, 2022
Descriptive statistics summarize a dataset with measures such as the mean, standard deviation, minimum value, and maximum value. They are calculated for numerical variables; pandas, for example, summarizes only the numeric columns by default.
The output of the descriptive statistics function includes:
Count of values for a variable.
Mean
Standard deviation
Minimum value
Percentiles (25%, 50%, 75%)
Maximum value
You can use the pandas describe() function to get the descriptive statistics.
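As a minimal sketch of what that looks like, the snippet below calls describe() on a small DataFrame. The column names and values are made up for illustration; only the describe() call itself comes from pandas.

```python
import pandas as pd

# Hypothetical toy dataset; column names and values are invented for this example.
df = pd.DataFrame({
    "age": [22, 25, 31, 35, 42],
    "salary": [30000, 35000, 50000, 62000, 80000],
    "city": ["Pune", "Delhi", "Pune", "Mumbai", "Delhi"],  # non-numeric column
})

# With default arguments, describe() summarizes only the numeric columns.
summary = df.describe()
print(summary)

# Individual statistics can be read from the result by row label.
print(summary.loc["mean", "age"])
```

The result has one row per statistic (count, mean, std, min, 25%, 50%, 75%, max) and one column per numeric variable, so the "city" column does not appear.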
Uses of descriptive statistics.
Build a better understanding of data.
Identify and treat missing values.
Identify any outliers and anomalies.
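Here is a rough sketch of how the summary can point to missing values and outliers. The data is made up, and the 1.5 × IQR rule used to flag outliers is a common convention that this post does not prescribe, so treat it as an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value and one suspiciously large value.
s = pd.Series([12.0, 15.0, 14.0, np.nan, 13.0, 16.0, 120.0], name="order_value")

stats = s.describe()

# count ignores NaN, so a count below the number of rows signals missing values.
n_missing = len(s) - int(stats["count"])

# Assumed convention (not from the post): flag values more than 1.5 * IQR
# below the 25th percentile or above the 75th percentile.
iqr = stats["75%"] - stats["25%"]
lower = stats["25%"] - 1.5 * iqr
upper = stats["75%"] + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]

print("missing values:", n_missing)
print("possible outliers:", outliers.tolist())
```

A large gap between the 75th percentile and the maximum in the describe() output is often the first hint that such a check is worth running.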
The mean, or average value, tells us where the values are centered, while the standard deviation tells us how far the values typically lie from the mean.
If the standard deviation is low: most of the values are close to the mean.
If the standard deviation is high: the values are spread out over a wider range, farther from the mean.
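To make the low versus high case concrete, here is a small sketch with two made-up samples that share the same mean but differ in spread.

```python
import pandas as pd

# Two invented samples with the same mean but different spread.
tight = pd.Series([48, 49, 50, 51, 52])
wide = pd.Series([10, 30, 50, 70, 90])

print("tight:", tight.mean(), tight.std())
print("wide: ", wide.mean(), wide.std())
```

Both samples have a mean of 50, but the second one has a much larger standard deviation because its values sit farther from that mean.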
The standard deviation formula is:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n - 1}}$$
σ (“sigma”) is the symbol for standard deviation
Σ is a fun way of writing “sum of”
xi represents every value in the data set
μ is the mean (average) value in the data set
n is the sample size
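The formula can be followed step by step in plain Python. This is only a sketch on a made-up list of values, and the variable names are chosen here for illustration.

```python
import math

values = [2, 4, 4, 4, 5, 5, 7, 9]  # invented sample

n = len(values)                    # sample size
mu = sum(values) / n               # mean of the values

# Squared distance of every value from the mean.
squared_deviations = [(x - mu) ** 2 for x in values]

# Sum them, divide by n - 1, and take the square root.
sample_sd = math.sqrt(sum(squared_deviations) / (n - 1))

print("mean:", mu)
print("standard deviation:", sample_sd)
```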
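NumPy and pandas give different results for the standard deviation by default. pandas divides by n − 1, as in the formula above, while NumPy divides by n:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}}$$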
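The difference is easy to see on a small example. The sketch below uses a made-up list of values; the ddof argument ("delta degrees of freedom") is what both libraries use to choose between dividing by n and dividing by n − 1.

```python
import numpy as np
import pandas as pd

values = [2, 4, 4, 4, 5, 5, 7, 9]  # invented sample

numpy_sd = np.std(values)             # default ddof=0: divides by n
pandas_sd = pd.Series(values).std()   # default ddof=1: divides by n - 1

print("NumPy: ", numpy_sd)   # 2.0
print("pandas:", pandas_sd)

# Either library can be switched to the other convention through ddof.
numpy_sample_sd = np.std(values, ddof=1)
pandas_population_sd = pd.Series(values).std(ddof=0)

print(np.isclose(numpy_sample_sd, pandas_sd))
print(np.isclose(pandas_population_sd, numpy_sd))
```

The gap between the two results shrinks as n grows, so it matters most on small samples.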
We will see how to calculate those things in the coming posts.