Part 2 - A Primer on Statistics

My notes from a LinkedIn Course on Statistics
Statistics
Author

Senthil Kumar

Published

February 27, 2022

1. What is the starting point to unravel a data story?

  • Look for the middle (mean, median and mode)
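
As a minimal sketch of the three "middle" measures, the snippet below uses Python's standard `statistics` module. The `scores` list is a hypothetical dataset made up for illustration; it does not come from the course.

```python
import statistics

# Hypothetical exam scores, invented for illustration only.
scores = [45, 52, 58, 58, 61, 67, 79]

mean = statistics.mean(scores)      # arithmetic average of all values
median = statistics.median(scores)  # middle value of the sorted data
mode = statistics.mode(scores)      # most frequently occurring value

print(f"mean={mean}, median={median}, mode={mode}")
```

The mean is pulled upward by the single high score of 79, while the median and mode stay at 58, which is one reason to look at more than one measure of the middle.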

2. How spread out is the data?

Along with the “middle” point, look at variability. The simplest measure is the range: (Max - Min)

The mean and median alone tell a limited story. Adding the `Range` shows how far apart the extremes of the data are.

|        | Value |
|--------|-------|
| Mean   | 60    |
| Median | 58    |
| Range  | 70    |
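
To illustrate why the middle alone is not enough, here is a small sketch with two hypothetical datasets. Both were invented so that they share a mean of 60 and a median of 58; only the second has the range of 70 shown in the table. The helper name `summarize` is an assumption made for this example.

```python
import statistics

def summarize(values):
    """Return the mean, median and range (max - min) of a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
    }

# Two hypothetical datasets with the same middle but a different spread.
narrow = [45, 52, 58, 58, 61, 67, 79]
wide = [25, 50, 55, 58, 62, 75, 95]

narrow_summary = summarize(narrow)
wide_summary = summarize(wide)
print(narrow_summary)
print(wide_summary)
```

Both summaries report the same mean and median, yet the ranges differ (34 versus 70), so the range adds information that the middle cannot.
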
  • Standard Deviation:
    • Approx Definition: Average of all data points’ distances from the mean

    • Proper Definition: The square root of the mean of the squared differences between each data point and the mean of the data

      Std Deviation of the population:
      $\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$

      X = Value in the data distribution
      μ = Mean of the population
      N = Number of data points

      Std Deviation of a sample:
      $s = \sqrt{\frac{\sum (X - \bar{x})^2}{n - 1}}$

      n = Number of data points in the sample

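    • Why is the denominator n - 1 in the sample std deviation?

      • x̄ is the mean of the sample
      • The sample mean is the value that minimizes the sum of squared deviations, so for any sample
        $\sum_{i=1}^{n} (x_i - \bar{x})^2 \le \sum_{i=1}^{n} (x_i - \mu)^2$
      • Dividing the sum of squared deviations by n would therefore tend to underestimate the population variance. Dividing by (n-1) instead makes the sample variance “unbiased” and brings s closer to the population std deviation
      • Another explanation: the deviations (xi - x̄) always sum to zero, so only (n-1) of them are free to vary. There are only (n-1) degrees of freedom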
  • Z-Score:
    • A particular datapoint’s distance from the mean, measured in standard deviations: $Z = \frac{X - \mu}{\sigma}$

      Example with μ = 139 and σ = 41:
      X = 231: Z = (231 - 139) / 41 = 2.24, so 231 is 2.24 std deviations above the mean
      X = 112: Z = (112 - 139) / 41 = -0.66, so 112 is 0.66 std deviations below the mean

  • Interesting points:
    • Std deviations of two different datasets cannot be compared directly when the data are measured in different units or scales (e.g.: Salaries of Data Scientists and Consumption of Fuel in cars)
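
The measures in this section can be sketched with the standard `statistics` module. The snippet below is an illustration under stated assumptions, not the course's own code: `data` is a made-up dataset, `z_score` and `average_variance_estimates` are hypothetical helper names, and the simulation assumes a standard normal population (true variance 1) to show why dividing by n underestimates the variance.

```python
import random
import statistics

def z_score(x, mean, std_dev):
    """Distance of x from the mean, measured in standard deviations."""
    return (x - mean) / std_dev

# Z-scores for the two values quoted above (mean 139, std deviation 41).
z_high = z_score(231, 139, 41)
z_low = z_score(112, 139, 41)
print(round(z_high, 2), round(z_low, 2))  # 2.24 -0.66

# Population vs sample standard deviation on a small hypothetical dataset.
data = [2, 4, 4, 4, 5, 5, 7, 9]
sigma = statistics.pstdev(data)  # divides the squared deviations by N
s = statistics.stdev(data)       # divides the squared deviations by n - 1
print(sigma, s)

def average_variance_estimates(sample_size, trials, seed=0):
    """Average the 'divide by n' and 'divide by n - 1' variance estimates
    over many small samples from a standard normal population (variance 1)."""
    rng = random.Random(seed)
    biased_total = 0.0
    unbiased_total = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0, 1) for _ in range(sample_size)]
        biased_total += statistics.pvariance(sample)   # divides by n
        unbiased_total += statistics.variance(sample)  # divides by n - 1
    return biased_total / trials, unbiased_total / trials

biased_avg, unbiased_avg = average_variance_estimates(sample_size=5, trials=5000)
print(f"divide by n: {biased_avg:.3f}, divide by n - 1: {unbiased_avg:.3f}")
```

With samples of size 5, the "divide by n" average should settle near 0.8 (that is, (n-1)/n of the true variance), while the "divide by n - 1" average should settle near 1.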

3. Empirical Rule (or the 68-95-99.7 rule)

  • For normally distributed data, about 68%, 95% and 99.7% of the datapoints fall within 1, 2 and 3 std deviations of the mean, respectively
  • In other words, 99.7% of normally distributed data lies within 3 standard deviations of the mean.
  • What is a normal distribution?
    A distribution whose shape follows a symmetric bell curve centered on the mean
  • Application of the Empirical Rule:
    • Deciding whether a particular data point is an outlier or not
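
The 68-95-99.7 figures can be checked directly from the normal distribution's cumulative distribution function. The sketch below uses `statistics.NormalDist` from the standard library. The helper names `share_within` and `is_unusual`, and the default threshold of 3 standard deviations, are assumptions made for illustration rather than a fixed rule.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of a normal distribution within k std deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

shares = {k: round(share_within(k) * 100, 1) for k in (1, 2, 3)}
print(shares)  # {1: 68.3, 2: 95.4, 3: 99.7}

def is_unusual(x, mean, std_dev, threshold=3):
    """Flag x for a closer look when it lies more than `threshold`
    std deviations from the mean. The threshold is a judgment call."""
    return abs(x - mean) / std_dev > threshold

# The value 231 (mean 139, std deviation 41) has a z-score of about 2.24.
print(is_unusual(231, 139, 41))               # False with a threshold of 3
print(is_unusual(231, 139, 41, threshold=2))  # True with a threshold of 2
```

The same data point is flagged or not depending on the threshold chosen, and the check is only meaningful when the data is roughly normally distributed.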

4. Central Limit Theorem

  • Given a population of unknown distribution with mean μ and finite variance σ²,
    • If we keep sampling n values from the distribution, and compute the sample mean as
      $\bar{X}_n = \frac{X_1 + X_2 + \dots + X_n}{n}$
    • As n → ∞, the distribution of the sample means tends to be normal or gaussian (following the bell curve)
  • In simple words,
    • If you have a population with an unknown distribution but with a mean of μ and std deviation of σ, and you repeatedly draw samples of a sufficiently large size n (with replacement), the means of those samples will be approximately normally distributed

  • With the help of CLT, we need not wait for the entire population’s data or identify the population’s unknown distribution. We can apply normal distribution principles (like the empirical rule and many more statistical techniques) to the sample means and draw conclusions about the population

More about CLT with an example:
- Central Limit Theorem’s super power - “You don’t need to know the population distribution”
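
A small simulation can make the theorem concrete. The sketch below is an illustration, not the course's example: it assumes a skewed exponential population with mean 1 and std deviation 1, and the sample size of 50 and the 5000 repetitions are arbitrary choices. It also relies on the standard result that the sample means have a std deviation of about σ/√n.

```python
import random
import statistics

def sample_means(sample_size, num_samples, seed=0):
    """Draw num_samples samples of sample_size values each from a skewed
    (exponential, rate 1) population and return the mean of every sample."""
    rng = random.Random(seed)
    means = []
    for _ in range(num_samples):
        sample = [rng.expovariate(1.0) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return means

means = sample_means(sample_size=50, num_samples=5000)

center = statistics.fmean(means)  # should be close to the population mean, 1
spread = statistics.stdev(means)  # should be close to 1 / sqrt(50), about 0.141

# Empirical-rule check: share of sample means within 1 std deviation of the center.
within_one = sum(abs(m - center) <= spread for m in means) / len(means)

print(f"mean of sample means: {center:.3f}")
print(f"std deviation of sample means: {spread:.3f}")
print(f"share within 1 std deviation: {within_one:.1%}")
```

Even though the exponential population is far from bell-shaped, roughly 68% of the sample means should fall within one std deviation of their center, as the empirical rule predicts for normally distributed data.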

5. Outlier:

  • “Outlier” is a relative term. There is no absolute definition (such as “any datapoint more than 2 or 3 σ away from the mean”)

  • How to investigate outliers:
    (one should not simply ignore or remove them)

    • Is this really an outlier?
    • How did this happen?
    • What can we learn?
    • What needs to change (to make it fit into the distribution)?

Source: LinkedIn Courses - “Statistics Foundations: Probability” and “Statistics Foundations: The Basics”