Probability Plots and Diagnostics in R


Probability plots and diagnostics are essential for evaluating the distributional assumptions behind your data and the statistical models you fit to it. They help you assess whether a dataset follows a specific distribution, such as the normal distribution, and diagnose potential problems in model fitting.

1. Normal Q-Q Plot (Quantile-Quantile Plot)

A Q-Q plot compares the quantiles of your dataset with the theoretical quantiles of a standard distribution, typically normal. It is useful for assessing the normality assumption.

Example:

# Sample data

data <- rnorm(100)

# Q-Q plot

qqnorm(data)

qqline(data, col = "red") # Adds a reference line

qqnorm() creates the Q-Q plot.
qqline() adds a straight line for comparison. If the data is normally distributed, the points should lie roughly along the red line.
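
To make the construction concrete, here is a minimal sketch that rebuilds by hand what qqnorm() and qqline() draw with their default settings. The seed and the variable names (y, theoretical, slope, intercept) are illustrative assumptions and not part of the example above.

```r
set.seed(42)                       # assumed seed, only for reproducibility
y <- rnorm(100)
n <- length(y)

# Theoretical quantiles: standard normal quantiles at the plotting positions
theoretical <- qnorm(ppoints(n))

# qqnorm() can return its coordinates without drawing, for comparison
qq <- qqnorm(y, plot.it = FALSE)

# By default qqline() passes through the first and third quartiles
slope <- unname(diff(quantile(y, c(0.25, 0.75))) / diff(qnorm(c(0.25, 0.75))))
intercept <- unname(quantile(y, 0.25)) - slope * qnorm(0.25)

# Hand-made Q-Q plot: sorted data against theoretical quantiles
plot(theoretical, sort(y),
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
abline(intercept, slope, col = "red")
```

The sorted x-coordinates returned by qqnorm() match the hand-computed theoretical quantiles, which is why the two plots look the same.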

2. Probability Plot using qqplot for Other Distributions

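You can also create Q-Q plots for distributions other than the normal one. The qqplot() function does not take a distribution name; instead, you pass it the theoretical quantiles of the distribution you want to compare against (for example, qexp(ppoints(n)) for the exponential) together with your data.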

Example:

# Exponential Q-Q plot
x <- rexp(100, rate = 1)
qqplot(qexp(ppoints(100)), x)
abline(0, 1)
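
As a rough illustration of why the choice of reference distribution matters, the sketch below compares the same exponential sample against exponential and normal quantiles. Using the correlation of the paired quantiles as a "straightness" summary is an informal heuristic assumed here for illustration, not a formal test; the seed and variable names are also assumptions.

```r
set.seed(1)                                   # assumed seed
x <- rexp(100, rate = 1)
p <- ppoints(length(x))

# qqplot() sorts both inputs; plot.it = FALSE returns the pairs without drawing
qq_exp  <- qqplot(qexp(p, rate = 1), x, plot.it = FALSE)
qq_norm <- qqplot(qnorm(p), x, plot.it = FALSE)

# Correlation of the paired quantiles: closer to 1 means a straighter plot
cor_exp  <- cor(qq_exp$x, qq_exp$y)
cor_norm <- cor(qq_norm$x, qq_norm$y)
c(exponential = cor_exp, normal = cor_norm)
```

A noticeably lower value for the normal reference reflects the curvature you would see if you plotted this sample with qqnorm().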

3. Shapiro-Wilk Normality Test

To complement the visual check, you can perform a Shapiro-Wilk test. Its null hypothesis is that the data come from a normal distribution.

Example:

# Shapiro-Wilk normality test

shapiro.test(data)

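The shapiro.test() function reports a test statistic W and a p-value:

If the p-value > 0.05, there is no significant evidence against normality at the 5% level. This does not prove that the data is normally distributed.
If the p-value ≤ 0.05, the data deviates significantly from normality.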
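
The following sketch shows how the test result can be inspected programmatically, using one simulated normal sample and one clearly skewed sample. The seed, the sample names, and the 5% threshold stored in alpha are assumptions made for illustration.

```r
set.seed(123)                        # assumed seed
normal_sample <- rnorm(100)
skewed_sample <- rexp(100, rate = 1)

sw_normal <- shapiro.test(normal_sample)
sw_skewed <- shapiro.test(skewed_sample)

# The result is an "htest" object; W and the p-value can be extracted by name
sw_normal$statistic
sw_normal$p.value
sw_skewed$p.value

# Decision at the conventional 5% level (alpha is an assumed choice)
alpha <- 0.05
sw_skewed$p.value <= alpha           # TRUE means normality is rejected
```

Note that shapiro.test() only accepts sample sizes between 3 and 5000, and with large samples even small, practically irrelevant deviations can produce a small p-value.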

4. Residual Diagnostics in Regression Models

In regression models, checking residuals helps to diagnose if your model fits well. Key diagnostics include residual plots, normality of residuals, and heteroscedasticity.

Example: Diagnostic Plots in Linear Regression

# Linear regression model

model <- lm(mpg ~ wt, data = mtcars)

# Diagnostic plots

par(mfrow = c(2, 2)) # Split the plotting window

plot(model)

This will generate:

Residuals vs Fitted: Checks the linearity assumption.
Normal Q-Q: Checks the normality of the residuals.
Scale-Location: Checks homoscedasticity (constant variance).
Residuals vs Leverage: Identifies influential observations.
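
The quantities behind these four panels can also be extracted directly, which is useful when you want numbers rather than plots. The sketch below is one possible way to do this for the same model; the variable names are illustrative assumptions.

```r
model <- lm(mpg ~ wt, data = mtcars)

res     <- residuals(model)        # raw residuals (Residuals vs Fitted)
fit     <- fitted(model)           # fitted values
std_res <- rstandard(model)        # standardized residuals (Q-Q and Scale-Location)
lev     <- hatvalues(model)        # leverage (Residuals vs Leverage)
cooks   <- cooks.distance(model)   # Cook's distance contours in the last panel

# Draw a single panel instead of all four; which = 2 selects the Q-Q plot
plot(model, which = 2)

# Numerical companion to the Q-Q panel
shapiro.test(res)
```

The which argument of plot() for lm objects selects individual panels, so you do not need par(mfrow = c(2, 2)) when you only want one of them.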

5. Histogram with a Normal Curve

A histogram with an overlaid normal curve helps to visually inspect how well the data fits a normal distribution.

Example:

# Sample data
data <- rnorm(100)
# Plot histogram
hist(data, probability = TRUE, col = "lightblue", main = "Histogram with Normal Curve")
# Add normal curve
curve(dnorm(x, mean = mean(data), sd = sd(data)), add = TRUE, col = "red", lwd = 2)

6. Kernel Density Plot

Kernel density plots give a smooth estimate of the data distribution and can be compared with a normal distribution.

Example:

# Kernel density plot

plot(density(data), main = "Kernel Density Plot")
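
To make the comparison with a normal distribution explicit, you can overlay the fitted normal curve on the density estimate. This is a minimal sketch; the seed, the variable names, and the styling choices are assumptions.

```r
set.seed(7)                          # assumed seed
values <- rnorm(100)

dens <- density(values)              # Gaussian kernel with the default bandwidth

plot(dens, main = "Kernel Density vs. Fitted Normal")
# Normal curve using the sample mean and standard deviation
curve(dnorm(x, mean = mean(values), sd = sd(values)),
      add = TRUE, col = "red", lty = 2)
legend("topright", legend = c("Kernel density", "Fitted normal"),
       col = c("black", "red"), lty = c(1, 2))
```

Large gaps between the two curves, such as a second bump or a long tail on one side, suggest the normal model is a poor description of the data.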

Summary of Diagnostic Tools:

Q-Q Plot: For checking normality.
Shapiro-Wilk Test: For hypothesis testing of normality.
Residual Diagnostics: For regression model checking.
Histogram with Normal Curve: For visual distribution checks.
Kernel Density Plot: For smooth distribution estimates.

Each of these tools provides insight into the distribution and potential issues in the data, helping to make informed modeling choices.

2. Leverage Plot

A leverage plot is used to detect influential data points in a regression model. The version drawn by the car package generalizes the added-variable plot (also called a partial regression plot); for a predictor with a single coefficient it is a rescaled added-variable plot. High-leverage observations have unusual predictor values and can therefore have an unusually large effect on the regression coefficients. A leverage plot helps identify observations that might unduly influence the fit of the model.

In R, leverage plots can be created using the car package, which provides additional diagnostic tools for linear models.

Steps to Create a Leverage Plot in R

1. Install and Load the car Package

If you don’t already have the car package installed, install it by running:

install.packages("car")

Then load the package:

library(car)

2. Fit a Linear Model

You need a fitted linear regression model to create a leverage plot. For example, let’s use the mtcars dataset:

# Fit a linear regression model

model <- lm(mpg ~ wt + hp, data = mtcars)

3. Create a Leverage Plot

You can create leverage plots for each predictor in the model:

# Leverage plot for the model
leveragePlots(model)

This command will create leverage plots for all predictors in the regression model.

Understanding the Leverage Plot

X-axis: Represents the values of the predictor variable after accounting for the effect of all other variables.
Y-axis: Represents the values of the dependent variable (e.g., mpg) after accounting for the effect of all other variables.
Points far from the center may indicate high leverage, meaning they have a strong influence on the model’s fit. These points should be inspected further, as they might distort the regression results.
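
The phrase "after accounting for the effect of all other variables" can be made concrete with base R alone. The sketch below builds the classical added-variable plot for wt by hand; it is not the car implementation, whose leverage plots rescale the axes, and the variable names are illustrative assumptions.

```r
full <- lm(mpg ~ wt + hp, data = mtcars)

# Remove the effect of the other predictor (hp) from both the response and wt
mpg_given_hp <- residuals(lm(mpg ~ hp, data = mtcars))   # Y-axis values
wt_given_hp  <- residuals(lm(wt ~ hp, data = mtcars))    # X-axis values

plot(wt_given_hp, mpg_given_hp,
     xlab = "wt | hp", ylab = "mpg | hp",
     main = "Added-Variable Plot for wt (by hand)")
av_fit <- lm(mpg_given_hp ~ wt_given_hp)
abline(av_fit, col = "red")

# The slope of this line equals the wt coefficient in the full model
c(added_variable = unname(coef(av_fit)[2]),
  full_model     = unname(coef(full)["wt"]))
```

Because the slope matches the coefficient from the full model, a point that sits far out on the X-axis of this plot can visibly pull that coefficient up or down.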

Visualizing High-Leverage Points

In addition to leverage plots, you can identify high-leverage points directly from the linear model using the hatvalues() function.

Example:

# Calculate leverage values
leverage <- hatvalues(model)
# Identify observations with high leverage (above 2 times the average leverage)
high_leverage <- which(leverage > 2 * mean(leverage))
# Display high-leverage points
mtcars[high_leverage, ]
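
The "2 times the average leverage" rule is easier to interpret once you know that the hat values sum to the number of estimated coefficients, so the average leverage is p / n. The sketch below checks this and adds Cook's distance as a complementary measure, since high leverage alone does not guarantee a large effect on the fit. The variable names are illustrative assumptions.

```r
model <- lm(mpg ~ wt + hp, data = mtcars)
leverage <- hatvalues(model)

p <- length(coef(model))   # number of estimated coefficients, intercept included
n <- nrow(mtcars)

# Hat values sum to p, so the average leverage is p / n
sum(leverage)
mean(leverage)
p / n

# Cook's distance combines leverage with residual size
cooks <- cooks.distance(model)
flagged <- leverage > 2 * mean(leverage)
data.frame(leverage = leverage[flagged], cooks_distance = cooks[flagged])
```

An observation with high leverage but a small Cook's distance lies close to the fitted surface and changes the coefficients little when removed.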

Example Output

A leverage plot will look similar to a residual plot but will highlight points that have a disproportionately large effect on the slope of the regression line.

Alternative: Plotting Leverage Points Using Base R

You can also plot leverage values manually using base R functions:

# Fit the model
model <- lm(mpg ~ wt + hp, data = mtcars)
# Leverage values
leverage_values <- hatvalues(model)
# Plot leverage values
plot(leverage_values, main = "Leverage Plot", xlab = "Index", ylab = "Leverage", pch = 19, col = "blue")
abline(h = 2 * mean(leverage_values), col = “red”, lwd = 2)
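
An index plot is more useful when the flagged observations are named. The following variation is a sketch under the same 2-times-average cutoff; the colors, label placement, and variable names are assumptions chosen for illustration.

```r
model <- lm(mpg ~ wt + hp, data = mtcars)
leverage_values <- hatvalues(model)
cutoff <- 2 * mean(leverage_values)
flagged <- which(leverage_values > cutoff)

# Color points above the cutoff differently from the rest
plot(leverage_values, xlab = "Index", ylab = "Leverage", pch = 19,
     col = ifelse(leverage_values > cutoff, "red", "blue"),
     main = "Leverage Values by Observation")
abline(h = cutoff, col = "red", lty = 2)

# Label only the flagged observations with their row names
if (length(flagged) > 0) {
  text(flagged, leverage_values[flagged],
       labels = names(leverage_values)[flagged], pos = 2, cex = 0.8)
}
```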

Summary:

Leverage plots are useful for detecting influential data points.
High leverage points can disproportionately affect model estimates.
You can generate leverage plots in R using the car package or manually plot leverage values with base R functions.
