Statistics is a branch of mathematics that deals with the collection, analysis, interpretation, presentation, and organization of data. It provides tools and
methodologies to make informed decisions or predictions based on data, especially when faced with uncertainty.
Key Components of Statistics:
Branches of Statistics:
Types of Data in Statistics:
Importance of Statistics:
Statistics in R programming involves a wide range of functions, libraries, and techniques for statistical analysis. R is widely used in the field of statistics because of its extensive set of built-in tools for handling data, performing statistical tests, and visualizing results.
In summary, Statistics is essential for transforming raw data into meaningful
information, which can be used in research, business, science, and various fields for decision-making and problem-solving.
Next, we look at the main concepts of statistics, their types, and how to implement them in R.
Descriptive statistics provide simple summaries about the sample and the measures. These include measures of central tendency (mean, median, mode) and measures of variability (variance, standard deviation, range).
a) Basic Descriptive Statistics
You can begin with a simple dataset in R. For this, you can use built-in datasets like mtcars or iris.
# Loading built-in datasets
data(mtcars)
data(iris)
# Viewing the first few rows of mtcars dataset
head(mtcars)
1.1 Measures of Central Tendency
mean(mtcars$mpg) # Mean of miles per gallon (mpg)
median(mtcars$mpg) # Median of mpg
# Install the modeest package to calculate the mode (only needed once)
install.packages("modeest")
library(modeest)
mfv(mtcars$mpg) # Most frequent value(s) (mode) of mpg
1.2 Measures of Dispersion
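As a minimal sketch of the measures of variability named earlier (variance, standard deviation, range), the base R functions var(), sd(), and range() can be applied to mpg. IQR() is included here as an extra, commonly used measure; the variable names are illustrative.

```r
# Measures of dispersion for miles per gallon, using base R only
data(mtcars)

mpg_var   <- var(mtcars$mpg)    # Sample variance (n - 1 denominator)
mpg_sd    <- sd(mtcars$mpg)     # Sample standard deviation
mpg_range <- range(mtcars$mpg)  # Minimum and maximum as a length-2 vector
mpg_iqr   <- IQR(mtcars$mpg)    # Interquartile range (Q3 - Q1)

mpg_var
mpg_sd
mpg_range
diff(mpg_range)  # Width of the range as a single number
mpg_iqr
```

Note that range() returns the two extremes rather than their difference, which is why diff() is applied to get a single spread value.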
1.3 Data Summary
summary(mtcars$mpg)
Visualization helps you understand how the data are distributed.
Histogram:
hist(mtcars$mpg, main = "Histogram of MPG", xlab = "Miles Per Gallon", col = "lightblue")
Boxplot:
boxplot(mtcars$mpg, main = "Boxplot of MPG", ylab = "Miles Per Gallon", col = "lightgreen")
Scatter Plot:
plot(mtcars$wt, mtcars$mpg, main = "Scatter Plot", xlab = "Weight", ylab = "MPG", col = "red")
Inferential statistics involve making predictions or inferences about a population based on a sample of data. The common techniques include hypothesis testing, confidence intervals, correlation, and regression analysis.
2.1 Hypothesis Testing
# One-sample t-test
t.test(mtcars$mpg, mu = 20)
# Test if the mean of mpg is different from 20
# Two-sample t-test
t.test(mpg ~ am, data = mtcars)
# Compare mpg between automatic and manual cars
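The printed output of t.test() can also be used programmatically. The sketch below stores the one-sample result and reads its components; the significance level alpha = 0.05 is an assumption made for illustration and is not specified in the text.

```r
# Store the test result instead of only printing it
res <- t.test(mtcars$mpg, mu = 20)

res$statistic  # t statistic
res$parameter  # Degrees of freedom
res$p.value    # Two-sided p-value

alpha <- 0.05  # Assumed significance level (illustrative choice)
if (res$p.value < alpha) {
  message("Reject H0: mean mpg differs from 20")
} else {
  message("Fail to reject H0: no evidence that mean mpg differs from 20")
}
```

Keeping the result in a variable makes it easier to report or reuse the p-value later in a script.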
The Chi-Square test is useful when working with categorical data to determine if there’s an association between two categorical variables.
# Create a contingency table and perform chi-square test
chisq_test <- table(mtcars$am, mtcars$cyl) # Transmission type vs Cylinders
chisq.test(chisq_test)
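With a small dataset such as mtcars, chisq.test() may warn that the chi-squared approximation could be incorrect. A rough sketch of how to inspect the expected counts behind that warning follows; the "below 5" threshold is a common rule of thumb rather than a strict rule, and the names tab and chi are illustrative.

```r
# Contingency table of transmission type (am) by number of cylinders (cyl)
tab <- table(am = mtcars$am, cyl = mtcars$cyl)

# suppressWarnings() hides the small-sample warning so the result can be inspected
chi <- suppressWarnings(chisq.test(tab))

chi$expected           # Expected counts under independence
chi$p.value            # p-value of the test
any(chi$expected < 5)  # TRUE suggests the approximation may be unreliable
```

When several expected counts are small, an exact alternative such as fisher.test(tab) may be worth considering.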
2.2 Confidence Intervals
Confidence intervals give a range of plausible values for a population parameter (like the mean), based on sample data.
Confidence Interval for Mean:
# The t.test function gives confidence intervals by default
t.test(mtcars$mpg)
This output provides the confidence interval for the mean of mpg.
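To make the interval less of a black box, the sketch below extracts it from the t.test() result and recomputes it by hand as the sample mean plus or minus a t quantile times the standard error. It assumes the default 95% confidence level; the variable names are illustrative.

```r
x <- mtcars$mpg
n <- length(x)

# Interval reported by t.test(); conf.level defaults to 0.95
ci_ttest <- t.test(x)$conf.int

# Same interval computed by hand: mean +/- t quantile * standard error
se        <- sd(x) / sqrt(n)
t_crit    <- qt(0.975, df = n - 1)
ci_manual <- c(mean(x) - t_crit * se, mean(x) + t_crit * se)

ci_ttest
ci_manual
```

A different level can be requested with the conf.level argument, for example t.test(x, conf.level = 0.99).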
b) Advanced Inferential Statistics
2.3 Correlation
Correlation measures the strength and direction of the linear relationship between two continuous variables.
cor(mtcars$mpg, mtcars$wt)
# Correlation between mpg and weight
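cor() returns only the coefficient. As a hedged sketch, cor.test() from the stats package can additionally test whether the correlation differs from zero, and the method argument of cor() switches to a rank-based coefficient when a linear relationship is doubtful. The name ct is illustrative.

```r
# Pearson correlation test (the default method)
ct <- cor.test(mtcars$mpg, mtcars$wt)

ct$estimate  # Sample correlation coefficient
ct$p.value   # p-value for H0: true correlation is 0
ct$conf.int  # 95% confidence interval for the correlation

# Spearman rank correlation as a non-linear-friendly alternative
cor(mtcars$mpg, mtcars$wt, method = "spearman")
```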
2.4 Linear Regression
Regression analysis allows us to model and analyze relationships between variables.
# Fitting a linear model for mpg based on weight
model <- lm(mpg ~ wt, data = mtcars)
summary(model)
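This prints the fitted coefficients (the intercept and the slope for wt), which define the regression equation, together with their standard errors, t-values, and p-values, which indicate how significant the relationship between mpg and wt is.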
# Fitting a multiple linear regression model
model_mult <- lm(mpg ~ wt + hp + cyl, data = mtcars)
summary(model_mult)
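Once a model is fitted, it can be used for prediction. The following sketch refits the simple model and predicts mpg for two hypothetical car weights (2.5 and 3.5, in units of 1000 lbs as in mtcars); the new_cars data frame is invented for illustration.

```r
# Refit the simple model so this example is self-contained
model <- lm(mpg ~ wt, data = mtcars)

coef(model)               # Intercept and slope of the regression equation
summary(model)$r.squared  # Proportion of variance in mpg explained by wt

# Hypothetical new observations (weights in 1000 lbs)
new_cars <- data.frame(wt = c(2.5, 3.5))

# Point predictions with 95% confidence intervals for the mean response
pred <- predict(model, newdata = new_cars, interval = "confidence")
pred
```

The column names in new_cars must match the predictor names used in the model formula.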
2.5 ANOVA (Analysis of Variance)
ANOVA is used to compare the means of three or more groups.
# ANOVA to check if mpg differs across cylinder types
aov_model <- aov(mpg ~ factor(cyl), data = mtcars)
summary(aov_model)
# Two-way ANOVA with mpg, cylinder type, and transmission type
aov_model2 <- aov(mpg ~ factor(cyl) * factor(am), data = mtcars)
summary(aov_model2)
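A significant one-way ANOVA says that at least one group mean differs, but not which ones. One common follow-up, sketched here as an optional extra rather than part of the original walkthrough, is Tukey's HSD test via TukeyHSD() from the stats package. The copy cars and the column cyl_f are illustrative names.

```r
# Work on a copy so mtcars itself is left unchanged
cars <- mtcars
cars$cyl_f <- factor(cars$cyl)  # Cylinder count as a grouping factor

fit <- aov(mpg ~ cyl_f, data = cars)

# Pairwise comparisons of group means with adjusted p-values
tukey <- TukeyHSD(fit)
tukey
```

Each row of the output compares one pair of cylinder groups and reports the difference in means, a confidence interval, and an adjusted p-value.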
2.6 Non-Parametric Tests
When the assumptions of parametric tests (like normality) are not met, non-parametric tests like the Wilcoxon test are used.
wilcox.test(mpg ~ am, data = mtcars) # Wilcoxon rank-sum test: compare mpg between automatic and manual cars
2.7 Bootstrapping
Bootstrapping is a resampling technique used to estimate statistics on a population by sampling a dataset with replacement.
Bootstrap Example (using the boot package):
library(boot)
# Defining a statistic function (mean)
boot_mean <- function(data, indices) {
  return(mean(data[indices]))
}
# Applying bootstrapping on mpg data
results <- boot(data = mtcars$mpg, statistic = boot_mean, R = 1000)
results
This generates multiple resamples of the data and computes the statistic (e.g., the mean) for each sample.
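The resampled means can then be summarised. As a sketch, the code below repeats the bootstrap with a fixed seed and derives a percentile confidence interval with boot.ci() plus a bootstrap standard error. The seed value 123 is arbitrary and chosen only for reproducibility.

```r
library(boot)

set.seed(123)  # Arbitrary seed so the resamples are reproducible

# Statistic function: mean of the resampled observations
boot_mean <- function(data, indices) {
  mean(data[indices])
}

results <- boot(data = mtcars$mpg, statistic = boot_mean, R = 1000)

# 95% percentile confidence interval for the mean
ci <- boot.ci(results, type = "perc")
ci$percent[4:5]  # Lower and upper limits

# Bootstrap estimate of the standard error of the mean
sd(results$t[, 1])
```

Other interval types, such as "basic" or "bca", can be requested through the type argument of boot.ci().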
Advanced visualizations help in presenting the results of your statistical analysis more effectively.
library(ggplot2)
# Scatter plot with a regression line
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", col = "red")
1. Data Loading:
2. Descriptive Statistics:
3. Inferential Statistics:
By following these steps, you can transition from basic to advanced levels of
both descriptive and inferential statistics in R, making your data analysis more powerful and comprehensive.
