Meritshot Tutorials
Probability Plots and Diagnostics in R
Probability plots and diagnostics are essential for evaluating the distributional assumptions behind your data and fitted statistical models. They help you assess whether a dataset follows a specific distribution, such as the normal distribution, and diagnose potential problems in model fitting.
1. Normal Q-Q Plot (Quantile-Quantile Plot)
A Q-Q plot compares the quantiles of your dataset with the theoretical quantiles of a standard distribution, typically normal. It is useful for assessing the normality assumption.
Example:
# Sample data
data <- rnorm(100)
# Q-Q plot
qqnorm(data)
qqline(data, col = "red") # Adds a reference line
- qqnorm() creates the Q-Q plot.
- qqline() adds a straight line for comparison. If the data is normally distributed, the points should lie roughly along the red line.
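To make the comparison concrete, the coordinates that qqnorm() plots can be rebuilt by hand. The following is a minimal sketch on a simulated sample; the names sample_data, theoretical, and qq are illustrative and do not come from the example above.

```r
set.seed(42)                      # fixed seed so the sketch is reproducible
sample_data <- rnorm(100)

# Theoretical standard-normal quantiles at the plotting positions
theoretical <- qnorm(ppoints(length(sample_data)))

# qqnorm() can return its coordinates without drawing anything
qq <- qqnorm(sample_data, plot.it = FALSE)

# Sorted sample values against the theoretical quantiles
plot(theoretical, sort(sample_data),
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
     main = "Q-Q plot built by hand")
qqline(sample_data, col = "red")  # same reference line as before
```

The hand-built plot should match the output of qqnorm(sample_data), since both pair the ordered sample with the same theoretical quantiles.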
2. Probability Plot using qqplot for Other Distributions
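You can also create Q-Q plots for distributions other than the normal. The qqplot() function compares two sets of values, so you supply the theoretical quantiles of the target distribution yourself, for example by applying its quantile function (such as qexp()) to ppoints().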
Example:
# Exponential Q-Q plot
x <- rexp(100, rate = 1)
qqplot(qexp(ppoints(100)), x)
abline(0, 1)
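In practice the rate of the exponential distribution is usually unknown. One possible approach, sketched here under the assumption that the rate is estimated as 1 / mean(x), is to draw the theoretical quantiles from the fitted distribution. The names rate_hat, theo, and qq are illustrative.

```r
set.seed(1)
x <- rexp(100, rate = 2)          # simulated sample; the true rate is 2

# Estimate the rate from the data (for an exponential, the MLE is 1 / mean)
rate_hat <- 1 / mean(x)

# Theoretical quantiles from the fitted exponential distribution
theo <- qexp(ppoints(length(x)), rate = rate_hat)

qq <- qqplot(theo, x,
             xlab = "Theoretical Quantiles (fitted exponential)",
             ylab = "Sample Quantiles",
             main = "Exponential Q-Q Plot")
abline(0, 1, col = "red")         # points near this line suggest a good fit
```

Because the theoretical quantiles use the estimated rate, both axes are on the same scale and the line with intercept 0 and slope 1 is the natural reference.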
3. Shapiro-Wilk Normality Test
To further check the normality assumption, you can perform a Shapiro-Wilk test. This test evaluates the null hypothesis that the data comes from a normal distribution.
Example:
# Shapiro-Wilk normality test
shapiro.test(data)
The shapiro.test() function provides a p-value:
- If the p-value > 0.05, there is no significant evidence against normality (this does not prove that the data is normally distributed).
- If the p-value ≤ 0.05, the data significantly deviates from normality.
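As an illustration of how the result can be read programmatically, the sketch below runs the test on two simulated samples, one normal and one strongly skewed. The sample names are illustrative, and the exact p-values depend on the random seed.

```r
set.seed(123)
normal_sample <- rnorm(100)       # drawn from a normal distribution
skewed_sample <- rexp(100)        # strongly right-skewed

normal_result <- shapiro.test(normal_sample)
skewed_result <- shapiro.test(skewed_sample)

# The result is a list; the statistic and p-value can be extracted by name
normal_result$statistic
normal_result$p.value
skewed_result$p.value             # expected to be very small for skewed data
```

Note that shapiro.test() accepts between 3 and 5000 observations, and that with large samples even minor departures from normality can produce small p-values, so it is best read alongside a Q-Q plot.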
4. Residual Diagnostics in Regression Models
In regression models, checking residuals helps to diagnose if your model fits well. Key diagnostics include residual plots, normality of residuals, and heteroscedasticity.
Example: Diagnostic Plots in Linear Regression
# Linear regression model
model <- lm(mpg ~ wt, data = mtcars)
# Diagnostic plots
par(mfrow = c(2, 2)) # Split the plotting window
plot(model)
This will generate four plots:
- Residuals vs Fitted: Checks the linearity assumption.
- Normal Q-Q: Checks the normality of the residuals.
- Scale-Location: Checks homoscedasticity (constant variance).
- Residuals vs Leverage: Identifies influential observations.
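The same diagnostics can also be approached one at a time. The sketch below, which assumes the same mpg ~ wt model, extracts the residuals, draws only the first panel through the which argument of plot(), and applies a Shapiro-Wilk test to the residuals as a numerical companion to the Normal Q-Q panel. The names res, fitted_vals, and res_test are illustrative.

```r
model <- lm(mpg ~ wt, data = mtcars)

res         <- residuals(model)   # raw residuals
fitted_vals <- fitted(model)      # fitted values

# Draw only the first diagnostic panel (Residuals vs Fitted)
plot(model, which = 1)

# Formal normality check on the residuals
res_test <- shapiro.test(res)
res_test$p.value
```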
5. Histogram with a Normal Curve
A histogram with an overlaid normal curve helps to visually inspect how well the data fits a normal distribution.
Example:
# Sample data
data <- rnorm(100)
# Plot histogram
hist(data, probability = TRUE, col = "lightblue", main = "Histogram with Normal Curve")
# Add normal curve
curve(dnorm(x, mean = mean(data), sd = sd(data)), add = TRUE, col = "red", lwd = 2)
6. Kernel Density Plot
Kernel density plots give a smooth estimate of the data distribution and can be compared with a normal distribution.
Example:
# Kernel density plot
plot(density(data), main = "Kernel Density Plot")
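To make the comparison with a normal distribution explicit, a normal curve can be drawn over the density estimate. This is a minimal sketch on simulated data; sample_data and dens are illustrative names, and the normal curve uses the sample mean and standard deviation.

```r
set.seed(7)
sample_data <- rnorm(200)

dens <- density(sample_data)      # kernel density estimate

plot(dens, main = "Kernel Density vs Normal Curve")
# Normal density using the sample mean and standard deviation
curve(dnorm(x, mean = mean(sample_data), sd = sd(sample_data)),
      add = TRUE, col = "red", lty = 2)
legend("topright", legend = c("Kernel density", "Normal curve"),
       col = c("black", "red"), lty = c(1, 2))
```

Large gaps between the two curves, such as a heavier tail or a second peak in the kernel estimate, suggest a departure from normality.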
Summary of Diagnostic Tools:
- Q-Q Plot: For checking normality (or fit to another distribution).
- Shapiro-Wilk Test: For hypothesis testing of normality.
- Residual Diagnostics: For regression model checking.
- Histogram with Normal Curve: For visual distribution inspection.
- Kernel Density Plot: For smooth distribution estimation.
Each of these tools provides insight into the distribution and potential issues in the data, helping to make informed modeling choices.
2. Leverage Plot
A Leverage Plot (closely related to the Added Variable Plot, or Partial Regression Plot) is used to detect influential data points in a regression model. High-leverage points are observations with unusual predictor values, which gives them the potential to have an unusually large effect on the regression coefficients. A leverage plot helps identify observations that might unduly influence the fit of the model.
In R, leverage plots can be created using the car package, which provides additional diagnostic tools for linear models.
Steps to Create a Leverage Plot in R
1. Install and Load the car Package
If you don't already have the car package installed, install it by running:
install.packages("car")
Then load the package:
library(car)
2. Fit a Linear Model
You need a fitted linear regression model to create a leverage plot. For example, let’s use the mtcars dataset:
# Fit a linear regression model
model <- lm(mpg ~ wt + hp, data = mtcars)
3. Create a Leverage Plot
You can create leverage plots for each predictor in the model:
# Leverage plot for the model
leveragePlots(model)
This command will create leverage plots for all predictors in the regression model.
Understanding the Leverage Plot
- X-axis: Represents the values of the predictor variable after accounting for the effect of all other variables.
- Y-axis: Represents the values of the dependent variable (e.g., mpg) after accounting for the effect of all other variables.
- Points far from the center may indicate high leverage, meaning they can have a strong influence on the model's fit. These points should be inspected further, as they might distort the regression results.
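The idea of "accounting for the effect of all other variables" can be sketched in base R by building an added-variable plot for wt by hand. This is an illustration of the underlying construction rather than a reproduction of leveragePlots(), which may scale the axes differently; the names mpg_resid, wt_resid, and av_fit are illustrative.

```r
model <- lm(mpg ~ wt + hp, data = mtcars)

# Part of mpg not explained by hp
mpg_resid <- residuals(lm(mpg ~ hp, data = mtcars))
# Part of wt not explained by hp
wt_resid  <- residuals(lm(wt ~ hp, data = mtcars))

plot(wt_resid, mpg_resid,
     xlab = "wt | hp", ylab = "mpg | hp",
     main = "Added-variable plot for wt (built by hand)")
av_fit <- lm(mpg_resid ~ wt_resid)
abline(av_fit, col = "red")

# The slope matches the wt coefficient in the full model
coef(av_fit)["wt_resid"]
coef(model)["wt"]
```

Because the slope of this plot equals the coefficient of wt in the full model, a point that sits far out horizontally and pulls the line toward itself is a point that pulls the coefficient.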
Visualizing High-Leverage Points
In addition to leverage plots, you can identify high-leverage points directly from the linear model using the hatvalues() function.
Example:
# Calculate leverage values
leverage <- hatvalues(model)
# Identify observations with high leverage (above 2 times the average leverage)
high_leverage <- which(leverage > 2 * mean(leverage))
# Display high-leverage points
mtcars[high_leverage, ]
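The cutoff of twice the average leverage has a simple interpretation: the hat values of a linear model sum to the number of estimated coefficients p, so their mean is p / n. The sketch below checks this for the same model; p, n, and threshold are illustrative names.

```r
model <- lm(mpg ~ wt + hp, data = mtcars)
leverage <- hatvalues(model)

p <- length(coef(model))          # number of coefficients, including intercept
n <- nrow(mtcars)                 # number of observations

sum(leverage)                     # equals p
mean(leverage)                    # equals p / n
threshold <- 2 * p / n            # same cutoff as 2 * mean(leverage)

# Leverage values above the cutoff, largest first
sort(leverage[leverage > threshold], decreasing = TRUE)
```

The 2 * p / n rule is a common rule of thumb rather than a formal test, so flagged observations are candidates for inspection, not automatic removal.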
Example Output
A leverage plot will look similar to a residual plot but will highlight points that have a disproportionately large effect on the slope of the regression line.
Alternative: Plotting Leverage Points Using Base R
You can also plot leverage values manually using base R functions:
# Fit the model
model <- lm(mpg ~ wt + hp, data = mtcars)
# Leverage values
leverage_values <- hatvalues(model)
# Plot leverage values
plot(leverage_values, main = "Leverage Plot", xlab = "Index", ylab = "Leverage", pch = 19, col = "blue")
abline(h = 2 * mean(leverage_values), col = “red”, lwd = 2)
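An index plot is easier to act on when the flagged observations are named. As a possible extension of the plot above, the sketch below colors points over the cutoff and labels them with their row names; cutoff and flagged are illustrative names.

```r
model <- lm(mpg ~ wt + hp, data = mtcars)
leverage_values <- hatvalues(model)
cutoff  <- 2 * mean(leverage_values)
flagged <- which(leverage_values > cutoff)   # keeps the car names

plot(leverage_values, main = "Leverage Plot", xlab = "Index",
     ylab = "Leverage", pch = 19,
     col = ifelse(leverage_values > cutoff, "red", "blue"))
abline(h = cutoff, col = "red", lwd = 2)

# Label only the flagged observations with their row names
if (length(flagged) > 0) {
  text(flagged, leverage_values[flagged],
       labels = names(flagged), pos = 2, cex = 0.7)
}
```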
Summary:
- Leverage plots are useful for detecting influential data points.
- High leverage points can disproportionately affect the model fit.
- You can generate leverage plots in R using the car package or manually plot leverage values with base R functions.