#4 Statistics for Data Science

Descriptive Statistics

Descriptive statistics summarize and organize data to make it easy to understand. They include measures of central tendency and measures of variability. The pandas snippets below assume a DataFrame named data with a numeric column column_name.

Measures of Central Tendency:

  • Mean: The average of a data set, calculated by summing all values and dividing by the number of values.
mean = data['column_name'].mean()

  • Median: The middle value in a data set when the values are arranged in ascending order (with an even number of values, the average of the two middle ones).
median = data['column_name'].median()

  • Mode: The value that appears most frequently in a data set.
mode = data['column_name'].mode()[0]

Measures of Variability:

  • Range: The difference between the highest and lowest values in a data set.
range_value = data['column_name'].max() - data['column_name'].min()

  • Variance: The average of the squared differences from the mean (note that pandas' .var() computes the sample variance, dividing by n - 1, by default).
variance = data['column_name'].var()

  • Standard Deviation: The square root of the variance, representing the typical distance of values from the mean in the data's original units.
std_dev = data['column_name'].std()
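
Putting the snippets above together, here is a minimal self-contained sketch (the DataFrame and its values are invented for illustration):

import pandas as pd

data = pd.DataFrame({'column_name': [2, 4, 4, 4, 5, 5, 7, 9]})

mean = data['column_name'].mean()      # 5.0
median = data['column_name'].median()  # 4.5
mode = data['column_name'].mode()[0]   # 4
range_value = data['column_name'].max() - data['column_name'].min()  # 7
variance = data['column_name'].var()   # ≈ 4.57 (sample variance, ddof=1)
std_dev = data['column_name'].std()    # ≈ 2.14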

Probability Theory

Probability theory deals with the likelihood of events occurring. It is fundamental to understanding statistical inference.

Basic Concepts:

  • Experiment: An action or process that leads to a set of outcomes.
  • Sample Space: The set of all possible outcomes of an experiment.
  • Event: A subset of the sample space.

Probability: The probability of an event is a measure of the likelihood that the event will occur, ranging from 0 (impossible) to 1 (certain).
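
A single roll of a fair die makes these ideas concrete (a minimal sketch):

# Experiment: roll one fair six-sided die
sample_space = {1, 2, 3, 4, 5, 6}
event = {2, 4, 6}  # event: the roll is even

# With equally likely outcomes, P(event) = |event| / |sample space|
prob = len(event) / len(sample_space)  # 0.5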

For continuous distributions, scipy.stats provides cumulative probabilities. A minimal sketch, assuming a standard normal distribution and an example cutoff x = 1.96:

from scipy import stats

# Probability that a normally distributed value falls at or below x
mean, std_dev = 0, 1   # example: standard normal
x = 1.96
prob = stats.norm.cdf(x, loc=mean, scale=std_dev)  # ≈ 0.975

Conditional Probability: The probability of an event occurring given that another event has already occurred.

# Conditional probability via Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
P_A_given_B = (P_B_given_A * P_A) / P_B
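
As a worked illustration, consider a hypothetical diagnostic test (the prevalence and accuracy figures below are invented for the example):

# Hypothetical figures: 1% prevalence, 95% sensitivity, 5% false-positive rate
P_A = 0.01                    # P(disease)
P_B_given_A = 0.95            # P(positive | disease)
P_B_given_not_A = 0.05        # P(positive | no disease)

# Total probability of a positive test (law of total probability)
P_B = P_B_given_A * P_A + P_B_given_not_A * (1 - P_A)

P_A_given_B = (P_B_given_A * P_A) / P_B   # P(disease | positive) ≈ 0.161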

Inferential Statistics: Hypothesis Testing and Confidence Intervals

Inferential statistics allow us to make predictions or inferences about a population based on a sample of data.

Hypothesis Testing: A method of making decisions using data. It involves making an initial assumption (the null hypothesis), and then determining whether the data provide sufficient evidence to reject that assumption.

  • Null Hypothesis (H0): The default assumption that there is no effect or difference.
  • Alternative Hypothesis (H1): The assumption that there is an effect or difference.

Common Tests:

  • t-test: Compares the means of two groups.
from scipy.stats import ttest_ind

t_stat, p_value = ttest_ind(group1, group2)

  • chi-square test: Tests the association between categorical variables.
from scipy.stats import chi2_contingency

chi2_stat, p_value, dof, expected = chi2_contingency(contingency_table)
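
For a runnable end-to-end illustration of both tests, here is a hedged sketch with synthetic data (the group sizes, effect size, and contingency counts are invented):

import numpy as np
import pandas as pd
from scipy.stats import ttest_ind, chi2_contingency

rng = np.random.default_rng(42)

# t-test: two synthetic groups whose true means differ slightly
group1 = rng.normal(loc=5.0, scale=1.0, size=100)
group2 = rng.normal(loc=5.4, scale=1.0, size=100)
t_stat, p_value = ttest_ind(group1, group2)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# chi-square test: a made-up 2x2 contingency table of observed counts
contingency_table = pd.DataFrame([[30, 10], [20, 40]],
                                 index=['exposed', 'not exposed'],
                                 columns=['outcome', 'no outcome'])
chi2_stat, p_value, dof, expected = chi2_contingency(contingency_table)
print(f"chi2 = {chi2_stat:.2f}, p = {p_value:.4f}, dof = {dof}")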

Confidence Intervals: A range of values used to estimate the true value of a population parameter. The confidence level represents the proportion of intervals that would contain the parameter if you repeated the sampling procedure many times.

import numpy as np
from scipy import stats

# 95% confidence interval for the mean, where data is a 1-D array of observations
mean = np.mean(data)
std_err = stats.sem(data)  # standard error of the mean
confidence_interval = stats.t.interval(0.95, len(data) - 1, loc=mean, scale=std_err)

Correlation and Causation

Understanding the relationship between variables is crucial in data science.

Correlation: Measures the strength and direction of a linear relationship between two variables. It is quantified by the correlation coefficient (Pearson’s r), which ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation).

correlation = data['column1'].corr(data['column2'])
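
As a quick sanity check, a hedged sketch with synthetic data (the linear relationship and noise level are invented):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = pd.DataFrame({'column1': x,
                     'column2': 2 * x + rng.normal(scale=0.5, size=200)})

correlation = data['column1'].corr(data['column2'])  # strongly positive, close to +1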

Causation: Indicates that one event is the result of the occurrence of the other event. Establishing causation requires more than just correlation; it often involves controlled experiments or longitudinal studies to rule out confounding factors.

Common Pitfalls:

  • Confounding Variables: External variables that affect both the independent and dependent variables, leading to a spurious association.
  • Reverse Causation: The possibility that Y causes X instead of X causing Y.
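
A brief sketch of a confounder at work (all numbers invented): a variable z drives both x and y, so they correlate strongly even though neither causes the other.

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# z is the confounder; x and y have no direct link
z = rng.normal(size=500)
x = z + rng.normal(scale=0.5, size=500)
y = z + rng.normal(scale=0.5, size=500)

spurious_corr = pd.Series(x).corr(pd.Series(y))  # ≈ 0.8 despite no causal link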

Visualization: Scatter plots are useful for visualizing the relationship between two numerical variables.

import matplotlib.pyplot as plt

plt.scatter(data['column1'], data['column2'])
plt.xlabel('Column 1')
plt.ylabel('Column 2')
plt.title('Scatter Plot')
plt.show()

Statistics form the backbone of data science, providing tools and methods to understand and analyze data. Mastering descriptive statistics, probability theory, inferential statistics, and the distinction between correlation and causation will enable you to make informed decisions and uncover meaningful insights from your data.

#StatisticsForDataScience #DescriptiveStatistics #ProbabilityTheory #InferentialStatistics #HypothesisTesting #ConfidenceIntervals #Correlation #Causation #DataScience #DataAnalysis #DataExploration #DataPreprocessing #PythonForDataScience #StatisticalAnalysis #DataVisualization #Matplotlib #Seaborn
