Advanced Regression Modeling and Diagnostics in R: A Grocery Labor Cost Analysis
This project applied multiple linear regression and diagnostic techniques in R to predict total labor hours in a grocery retail environment from operational factors. The analysis used a custom dataset with total labor hours as the response and three predictors:
X1: Number of cases shipped
X2: Indirect labor cost percentage
X3: Holiday indicator (binary)
The aim was to investigate how operational and contextual variables affect labor demand and to evaluate model assumptions using both graphical and statistical techniques.
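For reference, the setup described above can be written as a first-order regression model. This is a sketch of the presumed model form: it assumes no interaction or higher-order terms and the usual independent normal errors, neither of which is stated explicitly in the project description. Here Y denotes total labor hours and i indexes observations.

```latex
Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \varepsilon_i,
\qquad \varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)
```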
Analysis Workflow (in R):
Data Cleaning & Visualization
Imported and explored the dataset using read.csv() and visualization tools.
Created a scatter plot matrix to explore linearity and detect potential outliers or patterns.
Computed the correlation matrix to assess multicollinearity risks and linear associations.
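The exploration steps above might look roughly like the following. This is a minimal sketch, not the project's actual script: the data are simulated, and the column names (Y, X1, X2, X3), the sample size, and the temporary CSV file are all assumptions made so the example runs on its own.

```r
# Simulated stand-in for the project's data (all values are made up).
set.seed(1)
n <- 52  # arbitrary sample size for illustration
grocery <- data.frame(
  X1 = round(runif(n, 200000, 450000)),   # cases shipped
  X2 = round(runif(n, 5, 9), 2),          # indirect labor cost (%)
  X3 = rep(c(0, 1), times = c(46, 6))     # holiday indicator
)
grocery$Y <- 4000 + 0.001 * grocery$X1 + 600 * grocery$X3 + rnorm(n, sd = 150)

# Round-trip through a temporary CSV so read.csv() is exercised as in the workflow.
csv_path <- tempfile(fileext = ".csv")
write.csv(grocery, csv_path, row.names = FALSE)
grocery <- read.csv(csv_path)

# Quick structural and numeric overview.
str(grocery)
summary(grocery)

# Scatter plot matrix: look for linear trends, outliers, and odd patterns.
vars <- c("Y", "X1", "X2", "X3")
pairs(grocery[, vars], main = "Scatter plot matrix")

# Correlation matrix: large predictor-predictor correlations hint at multicollinearity.
cor_mat <- cor(grocery[, vars])
round(cor_mat, 3)
```

With real data, the simulation and temporary file would be replaced by a single read.csv() call on the actual file path.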
Model Building
Fitted a multiple regression model using lm() with the three predictors (X1, X2, X3).
Interpreted regression coefficients to understand the marginal impact of each variable.
Extracted and analyzed the residuals to assess fit and consistency.
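A minimal sketch of the model-building steps, again on simulated data with hypothetical column names rather than the project's dataset:

```r
# Simulated stand-in for the project's data (all values are made up).
set.seed(1)
n <- 52
grocery <- data.frame(
  X1 = round(runif(n, 200000, 450000)),   # cases shipped
  X2 = round(runif(n, 5, 9), 2),          # indirect labor cost (%)
  X3 = rep(c(0, 1), times = c(46, 6))     # holiday indicator
)
grocery$Y <- 4000 + 0.001 * grocery$X1 + 600 * grocery$X3 + rnorm(n, sd = 150)

# Fit the multiple regression of labor hours on the three predictors.
fit <- lm(Y ~ X1 + X2 + X3, data = grocery)

# Coefficient table: estimate, standard error, t value, p-value.
# Each slope is the expected change in Y per unit change in that
# predictor, holding the other two fixed.
summary(fit)$coefficients

# Residuals (observed minus fitted) for later diagnostics.
res <- residuals(fit)
summary(res)
```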
Model Diagnostics
Plotted residuals against fitted values and against each predictor to visually assess homoscedasticity.
Ran the Brown-Forsythe test to formally check the constant-variance assumption.
Produced a normal Q-Q plot to evaluate residual normality.
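The diagnostics above could be sketched as follows. Base R has no built-in Brown-Forsythe function, so this example computes the two-group version by hand: residuals are split into two groups, absolute deviations from each group's median are taken, and the group means of those deviations are compared with a pooled two-sample t-test. Splitting at the median fitted value is an assumption; the project may have split on a predictor instead. The data are simulated and the column names are hypothetical.

```r
# Simulated stand-in for the project's data (all values are made up).
set.seed(1)
n <- 52
grocery <- data.frame(
  X1 = round(runif(n, 200000, 450000)),   # cases shipped
  X2 = round(runif(n, 5, 9), 2),          # indirect labor cost (%)
  X3 = rep(c(0, 1), times = c(46, 6))     # holiday indicator
)
grocery$Y <- 4000 + 0.001 * grocery$X1 + 600 * grocery$X3 + rnorm(n, sd = 150)

fit  <- lm(Y ~ X1 + X2 + X3, data = grocery)
e    <- residuals(fit)
yhat <- fitted(fit)

# Residual plots and normal Q-Q plot in one 2x2 panel.
op <- par(mfrow = c(2, 2))
plot(yhat, e, xlab = "Fitted values", ylab = "Residuals"); abline(h = 0, lty = 2)
plot(grocery$X1, e, xlab = "X1", ylab = "Residuals");      abline(h = 0, lty = 2)
plot(grocery$X2, e, xlab = "X2", ylab = "Residuals");      abline(h = 0, lty = 2)
qqnorm(e); qqline(e)
par(op)

# Brown-Forsythe test, two-group form.
# Assumption: groups are defined by a median split on the fitted values.
group <- ifelse(yhat <= median(yhat), "low", "high")
d     <- abs(e - ave(e, group, FUN = median))  # |residual - group median|
bf    <- t.test(d ~ group, var.equal = TRUE)   # pooled-variance t-test on deviations
bf$statistic
bf$p.value
```

A small p-value suggests that residual spread differs between the two groups, which is evidence against constant error variance.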
Statistical Inference
Conducted the F-test to determine overall model significance.
Applied Bonferroni-adjusted t-tests to test β₁ and β₃ simultaneously while controlling the family-wise error rate.
Constructed confidence intervals for the significant coefficients.
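A rough sketch of the inference steps, using simulated data and hypothetical column names. The family significance level of 0.05 is an assumption, since the project description does not state the level that was used.

```r
# Simulated stand-in for the project's data (all values are made up).
set.seed(1)
n <- 52
grocery <- data.frame(
  X1 = round(runif(n, 200000, 450000)),   # cases shipped
  X2 = round(runif(n, 5, 9), 2),          # indirect labor cost (%)
  X3 = rep(c(0, 1), times = c(46, 6))     # holiday indicator
)
grocery$Y <- 4000 + 0.001 * grocery$X1 + 600 * grocery$X3 + rnorm(n, sd = 150)

fit <- lm(Y ~ X1 + X2 + X3, data = grocery)
s   <- summary(fit)

# Overall F-test: H0 is beta1 = beta2 = beta3 = 0.
f         <- s$fstatistic
p_overall <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

# Bonferroni-adjusted t-tests for beta1 and beta3.
alpha  <- 0.05  # assumed family significance level
g      <- 2     # number of simultaneous statements
p_raw  <- s$coefficients[c("X1", "X3"), "Pr(>|t|)"]
p_bonf <- p.adjust(p_raw, method = "bonferroni")  # same as pmin(1, g * p_raw)
p_bonf < alpha                                    # TRUE means reject at family level alpha

# Bonferroni joint confidence intervals: each at level 1 - alpha / g.
ci <- confint(fit, parm = c("X1", "X3"), level = 1 - alpha / g)
ci
```

Comparing the adjusted p-values with alpha is equivalent to comparing the raw p-values with alpha / g; the intervals use the matching per-coefficient level so that joint coverage is at least 1 - alpha.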
Key Findings:
Significant Predictors:
Both the number of cases shipped (X1) and holiday status (X3) had statistically significant positive relationships with labor hours.
Non-significant Predictor:
Indirect labor cost percentage (X2) was not a significant predictor.
Model Quality:
The model achieved an R² of 0.6883, indicating that 68.83% of the variability in labor hours was explained by the predictors.
Assumption Violations:
The Brown-Forsythe test (p = 0.00437) rejected the constant-variance assumption, indicating heteroscedasticity and a potential need for a transformation or an alternative modeling strategy.
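As one illustration of the transformation remedy mentioned above, the response could be log-transformed and the variance check repeated. This was not part of the reported analysis; it is a hypothetical follow-up on simulated data, with a hand-rolled Brown-Forsythe helper (bf_pvalue, a name introduced here) that assumes a median split on the fitted values.

```r
# Simulated stand-in for the project's data (all values are made up).
set.seed(1)
n <- 52
grocery <- data.frame(
  X1 = round(runif(n, 200000, 450000)),   # cases shipped
  X2 = round(runif(n, 5, 9), 2),          # indirect labor cost (%)
  X3 = rep(c(0, 1), times = c(46, 6))     # holiday indicator
)
grocery$Y <- 4000 + 0.001 * grocery$X1 + 600 * grocery$X3 + rnorm(n, sd = 150)

# Hypothetical helper: two-group Brown-Forsythe p-value for a fitted lm.
# Assumption: groups come from a median split on the fitted values.
bf_pvalue <- function(fit) {
  e     <- residuals(fit)
  yhat  <- fitted(fit)
  group <- ifelse(yhat <= median(yhat), "low", "high")
  d     <- abs(e - ave(e, group, FUN = median))
  t.test(d ~ group, var.equal = TRUE)$p.value
}

fit_raw <- lm(Y ~ X1 + X2 + X3, data = grocery)       # original scale
fit_log <- lm(log(Y) ~ X1 + X2 + X3, data = grocery)  # log-transformed response

# Compare the constant-variance check before and after the transformation.
p_values <- c(raw = bf_pvalue(fit_raw), log = bf_pvalue(fit_log))
p_values
```

Whether a log transformation actually helps depends on how the residual spread changes with the mean in the real data; weighted least squares is another commonly used alternative.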
Skills Demonstrated:
Regression diagnostics (visual + statistical)
Inferential statistics (F-test, Bonferroni correction)
R programming for statistical modeling and visualization
Interpretation and communication of model assumptions
Application of regression in real-world business operations

