Although data engineers mainly deal with setting up data flows and the underlying infrastructure, there are often cases where statistics comes in handy:

  • 🕵️‍♂️ Detecting data quality issues
  • 📊 Understanding distributions of data
  • 🚨 Monitoring processes and triggering alerts

Mean (average): The most commonly used measure of central tendency, computed as sum(n) / count(n).

Median: Middle value of a sorted list, e.g. 24 in [0, 5, (24), 50, 100]. Useful when dealing with outliers, because a few extreme values barely move it.

Mode: Most frequent value in a list. [0, 3, 3, 1000] -> 3
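As a quick illustration, the three measures above can be computed with Python's built-in statistics module. This is a minimal sketch that reuses the example lists from the definitions; the variable names are my own.

```python
import statistics

# Example lists taken from the definitions above.
values = [0, 5, 24, 50, 100]
values_with_repeat = [0, 3, 3, 1000]

mean_value = statistics.mean(values)              # sum(n) / count(n)
median_value = statistics.median(values)          # middle value of the sorted list
mode_value = statistics.mode(values_with_repeat)  # most frequent value

print(mean_value, median_value, mode_value)
```

Note how the single large value (100) pulls the mean well above the median of 24.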

Relationship between mean and median under different skewness

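In a perfectly normal distribution, the mean, median, and mode are the same. In a right-skewed distribution the long tail typically pulls the mean above the median; in a left-skewed distribution it pulls the mean below it.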
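A rough way to see this effect is to compare the mean and the median of a sample. The sketch below uses made-up latency values (not real measurements), and treating "mean above median" as a sign of right skew is a rule of thumb rather than a formal test.

```python
import statistics

# Made-up sample: most values are small, a few are very large (right skew).
latencies_ms = [10, 12, 12, 13, 15, 18, 20, 250, 900]

mean_ms = statistics.mean(latencies_ms)
median_ms = statistics.median(latencies_ms)

# Rule of thumb: a long right tail pulls the mean above the median.
is_right_skewed = mean_ms > median_ms

print(f"mean={mean_ms:.1f} median={median_ms} right_skewed={is_right_skewed}")
```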

Histogram 📊: It’s good to be able to use a histogram to check the distribution of the data, detect outliers, and see whether the data is skewed, which can help with sampling and partitioning decisions.
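When no plotting library is at hand, a fixed-width text histogram is often enough to spot outliers. The following is a minimal sketch: the helper name build_histogram, the bin width, and the file sizes are all invented for illustration.

```python
from collections import Counter

BIN_WIDTH = 10  # assumed bin width, chosen for this made-up sample


def build_histogram(values, bin_width):
    """Count values per fixed-width bin; returns {bin_start: count}."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Made-up file sizes in MB; 980 is a deliberate outlier.
file_sizes_mb = [12, 15, 18, 22, 25, 27, 31, 33, 38, 41, 980]

histogram = build_histogram(file_sizes_mb, BIN_WIDTH)
for bin_start, count in histogram.items():
    # Each '#' is one value that fell into [bin_start, bin_start + BIN_WIDTH).
    print(f"[{bin_start}, {bin_start + BIN_WIDTH}) {'#' * count}")
```

The lone bar far away from the others is the kind of signal that would prompt a closer look before choosing partition boundaries.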

Variance: Measure of how spread out a list of numbers is. Average of squared differences from the mean. Not so easy to interpret, because it is in squared units: “On average, the squared difference from the mean is 2.67.”

Standard Deviation: Measure of how spread out a list of numbers is. Square root of the variance, so it is in the same units as the data. Easier to interpret: “The values are typically about 1.63 units away from the mean.” Useful for monitoring systems where values deviating far from the average might signal a problem.
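The sketch below computes both measures and then uses the standard deviation for a simple alert check. The sample [1, 3, 5] is an assumption on my part, picked because it yields the 2.67 and 1.63 figures quoted above; the helper is_anomalous, its threshold of 3 standard deviations, and the row counts are likewise hypothetical.

```python
import statistics

values = [1, 3, 5]  # assumed sample that reproduces 2.67 and 1.63

variance = statistics.pvariance(values)  # average squared difference from the mean
std_dev = statistics.pstdev(values)      # square root of the variance

print(round(variance, 2), round(std_dev, 2))


def is_anomalous(value, history, threshold=3.0):
    """Flag a value lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(history)
    spread = statistics.pstdev(history)
    if spread == 0:
        # All historical values are identical: any different value is unusual.
        return value != mean
    return abs(value - mean) / spread > threshold


# Made-up daily row counts for a hypothetical pipeline.
daily_rows = [100, 102, 98, 101, 99]
print(is_anomalous(150, daily_rows), is_anomalous(101, daily_rows))
```

pvariance and pstdev treat the list as the whole population, which matches the "average of squared differences" definition; statistics.variance and statistics.stdev divide by n - 1 instead and suit samples.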

Percentiles: The relative standing of a value within a dataset. For example, if you’re in the 90th percentile in a test, you scored higher than 90% of participants.

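Quartiles: Like percentiles, but simplified to three cut points (Q1, Q2, Q3) that split the sorted data into four parts where values may stand:

  • Below Q1 (25th percentile): 0% - 25%
  • Q1 to Q2 (50th percentile, the median): 25% - 50%
  • Q2 to Q3 (75th percentile): 50% - 75%
  • Above Q3, up to the maximum (100th percentile): 75% - 100%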

Useful when specifying SLAs or when we need a simplified distribution summary.
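Both percentiles and quartiles can be obtained from statistics.quantiles (Python 3.8+). This is a sketch with made-up latencies; exact results depend on the interpolation method, and the function's default "exclusive" method is used here.

```python
import statistics

# Made-up request latencies in milliseconds.
latencies_ms = [12, 15, 18, 20, 22, 25, 30, 35, 40, 60, 120]

# n=4 returns the three quartile cut points: Q1, Q2 (the median), Q3.
q1, q2, q3 = statistics.quantiles(latencies_ms, n=4)

# n=100 returns 99 cut points; index 89 holds the 90th percentile.
p90 = statistics.quantiles(latencies_ms, n=100)[89]

print(f"Q1={q1} Q2={q2} Q3={q3} p90={p90}")
```

An SLA phrased as "90% of requests finish within X ms" would then be checked against p90 rather than the mean, which a single slow request can distort.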

Correlation vs Causation 🔗🚫: Correlation measures the strength of the relationship between two variables. Causation takes place when one variable directly influences another. Correlation does not imply causation.

Example: Drowning deaths and ice cream sales are correlated. Every summer, ice cream sales go up and so do drowning deaths. But this does not mean that ice cream sales cause drowning deaths: both rise because hot weather leads more people to buy ice cream and to go swimming.
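To make the example concrete, here is a small sketch that computes the Pearson correlation coefficient by hand. The function name pearson_correlation and all monthly figures are invented for illustration and are not real statistics.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    # Note: a constant sequence has zero spread and would divide by zero here.
    return co_deviation / (spread_x * spread_y)


# Invented monthly figures, for illustration only.
temperature_c = [5, 8, 14, 20, 26, 30]
ice_cream_sales = [20, 25, 40, 60, 85, 100]
drownings = [1, 1, 2, 4, 6, 7]

sales_vs_drownings = pearson_correlation(ice_cream_sales, drownings)
temp_vs_sales = pearson_correlation(temperature_c, ice_cream_sales)
temp_vs_drownings = pearson_correlation(temperature_c, drownings)

print(sales_vs_drownings, temp_vs_sales, temp_vs_drownings)
```

All three coefficients come out close to 1 in this toy data, yet nothing in the numbers says which variable drives which. Knowing that temperature influences both of the others comes from domain knowledge, not from the coefficient.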

Correlation vs Causation

In the next part I will go through:

  • 🧪 A/B testing