Welcome to this site, the official site of Economics 488, Senior Capstone Seminar in Economics, at James Madison University. This site will demonstrate the upper blogosphere in economics, with contributions coming from students and the class instructor.
Aggregate Income and Consumption Expenditure on Alcoholic Beverages (for off-premises consumption)
PURPOSE
This assignment estimated the relationship between aggregate consumption and aggregate income by running an OLS regression in SAS. An Engel curve is used to identify this relationship. The null hypothesis states that a change in aggregate income will not lead to a change in aggregate consumption of alcoholic beverages.
DESCRIPTIVE STATISTICS
CONSALC Personal consumption expenditures: Nondurable goods: Alcoholic beverages purchased for off-premises consumption, Billions of Dollars, Annual, Not Seasonally Adjusted
YDISP Disposable personal income, Billions of Dollars, Annual, Not Seasonally Adjusted
The annual time series data used in this regression ran from 1933 to 2015.
AUTOCORRELATION IN TIME-SERIES DATA
Time-series data are often subject to serial correlation. Studenmund defines first-order serial correlation as follows: “The current value of the error term is a function of the previous value of the error term.” (Studenmund, A. H., and Henry J. Cassidy. Using Econometrics: A Practical Guide. Sixth ed. Boston: Addison-Wesley, 2011.) Autocorrelation does not bias the coefficient estimates. However, the OLS estimates of their variances, and therefore the standard errors, are biased (see relevant output). To improve the OLS estimates for time-series data, we test and correct for autocorrelation.
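As a sketch of what that definition means for a one-regressor model like this one, first-order serial correlation can be written as below. The symbols (Y for CONSALC, X for YDISP, rho for the autocorrelation coefficient, u for a well-behaved error) are notation assumed here, not taken from the SAS output.

```latex
\begin{align}
Y_t &= \beta_0 + \beta_1 X_t + \varepsilon_t \\
\varepsilon_t &= \rho\,\varepsilon_{t-1} + u_t, \qquad -1 < \rho < 1
\end{align}
```

A positive rho means a positive error this year tends to be followed by a positive error next year, which is the pattern called positive serial correlation.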
TEST FOR AUTOCORRELATION
Two different methods were used to detect autocorrelation in the Engel Curve regression. The first, informal method of testing for serial correlation is to plot the residuals and note any unusual patterns. The pattern in our plot clearly indicated the presence of positive serial correlation (see relevant output).
A more formal and widely used method to test for first-order autocorrelation is the Durbin-Watson (D-W) d statistic. This detection method was applied to this regression because it meets the assumptions of the D-W derivation: the regression model includes an intercept, does not include a lagged dependent variable as an independent variable, and the serial correlation is first-order in nature. (Studenmund, A. H., and Henry J. Cassidy. Using Econometrics: A Practical Guide. Sixth ed. Boston: Addison-Wesley, 2011.) The D-W statistic calculated in this regression is 0.142 (see relevant output), which is far below 2 and indicates the presence of positive autocorrelation.
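For reference, the d statistic and its usual approximation can be written out as follows, where e_t denotes the OLS residual in year t and rho-hat the estimated first-order autocorrelation of the residuals (notation assumed here). The last line plugs in the reported value of 0.142 as a rough back-of-the-envelope check, not as an additional result.

```latex
\begin{align}
d &= \frac{\sum_{t=2}^{T} (e_t - e_{t-1})^2}{\sum_{t=1}^{T} e_t^2} \approx 2\,(1 - \hat{\rho}) \\
\hat{\rho} &\approx 1 - \frac{d}{2} = 1 - \frac{0.142}{2} = 0.929
\end{align}
```

A value of d near 2 corresponds to rho-hat near 0, while a value near 0 corresponds to rho-hat near 1.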
CORRECT FOR AUTOCORRELATION
To correct for first-order autocorrelation, I ran PROC AUTOREG and produced Yule-Walker estimates for the regression model. This method corrects for first-order autocorrelation and “restores minimum variance to its estimation” (Studenmund, A. H., and Henry J. Cassidy. Using Econometrics: A Practical Guide. Sixth ed. Boston: Addison-Wesley, 2011.).
SAS results in Output B (see relevant output) show the OLS estimates that the AUTOREG procedure reports prior to correction.
SAS results in Output C (see relevant output) show the corrected output with Yule-Walker estimates. Its D-W statistic increased from 0.1422 to 1.5394. Though some autocorrelation is still present, it has decreased as the D-W statistic approaches 2, which strengthens the corrected model. Along with the change in R^{2}, this gives a more reliable parameter estimate for the coefficient on disposable personal income.
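The correction step described above can be sketched in SAS as follows. The data set name alcohol is hypothetical, since the post does not show its code; CONSALC and YDISP are the variable names listed under Descriptive Statistics. METHOD=YW requests Yule-Walker estimates and DWPROB prints p-values for the Durbin-Watson statistic.

```sas
* Hypothetical data set name: alcohol (annual observations, 1933-2015);
proc autoreg data = alcohol;
   * The OLS estimates are printed first, followed by the Yule-Walker estimates;
   model consalc = ydisp / nlag = 1 method = yw dwprob;
run;
```

NLAG=1 fits a first-order autoregressive error, which matches the first-order correction the post describes.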
EMPIRICAL RESULTS
The corrected regression suggests that a $1B increase in aggregate disposable personal income is associated with a $0.009465B (or $9.465 million) increase in aggregate consumption expenditure on alcoholic beverages. This result is highly statistically significant, with a p-value < .0001, so we reject the null hypothesis.
Is greed good?
Engel Curve for Poultry Consumption and Disposable Income
For this assignment, I looked at the relationship between disposable income per capita and the quantity of chicken (in pounds) per capita consumed in the United States from 1997 to 2015. From the data provided by the National Chicken Council and FRED, I used SAS to obtain the Engel curve shown in Figure 1. As Figure 1 shows, poultry seems to be an inferior good, indicated by the backward bend in the curve.
While SAS does show a p-value lower than .05 for the poultry regression, the R^{2} value is rather low, which suggests that the model does not fit the data very well. I tested for autocorrelation using the Durbin-Watson test and, as Figure 3 shows, there appears to be positive autocorrelation in the data, given the small Pr < DW value. I include Figure 4 to show the correlation between the residual and the lag of the residual (almost a .9 correlation).
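A minimal sketch of how the residual-versus-lagged-residual correlation in Figure 4 could be produced. The data set name poultry and the variable names chicken and dispinc are hypothetical, since the post does not list its code.

```sas
* Hypothetical names: data set poultry with variables chicken and dispinc;
proc reg data = poultry;
   model chicken = dispinc / dwprob;
   output out = resid_out r = resid;   * save the OLS residuals;
run;
quit;

data resid_out;
   set resid_out;
   resid_lag = lag(resid);   * previous year's residual, missing for the first year;
run;

proc corr data = resid_out;
   var resid resid_lag;
run;
```

A correlation close to 1 between resid and resid_lag points the same way as a Durbin-Watson statistic close to 0.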
Given the evidence of autocorrelation, I added PROC AUTOREG to my code to correct the model. SAS then provided the output in Figure 6. The coefficient in this regression is again statistically significantly different from 0. However, the R^{2} in Figure 6 is much higher than that of the original regression, indicating that this is a better model.
Figure 1 This graph shows the relationship between disposable income per capita and poultry consumption in pounds per capita in the US from 1997 to 2015.
Figure 2 Regression of relationship between poultry consumption and disposable income.
Figure 3 shows that there is positive autocorrelation in the regression.
Figure 4 shows the correlation between the residual and the lag on the residual.
Figure 5 is the first part of the output from the proc autoreg code. It shows the original regression and that there is autocorrelation.
Figure 6 is the second part of the output from the proc autoreg code. It adjusts the model for autocorrelation.
Poultry Consumption Data – http://www.nationalchickencouncil.org/about-the-industry/statistics/per-capita-consumption-of-poultry-and-livestock-1965-to-estimated-2012-in-pounds/
Disposable Personal Income Per Capita – https://fred.stlouisfed.org/series/A229RC0A052NBEA
Motor Vehicle Related Expenditures and Disposable Income
Aggregate Engel Curve: Motor Vehicles-Related Expenditure and Disposable Income (1997 – 2015)
Introduction
This paper estimates how overall consumption of motor vehicles and related parts varies with disposable income across the fifty states. I began by collecting seven macroeconomic variables, in addition to disposable income, that have general macroeconomic significance and may be specifically significant to predicting motor vehicles-related expenditure.
Data
The data for this estimate span 1997 to 2015 at an annual frequency. The personal consumption expenditures data for gasoline/other energy related goods (var: gas), motor vehicles/parts (var: motor), and transportation services (var: transpo) come from the Bureau of Economic Analysis, as do the GDP (var: gdp), population (var: pop), and disposable income (var: dispinc) data. The source of the average annual Fahrenheit temperature (var: temp) data is the National Climatic Data Center. The unemployment data (var: u) come from the Bureau of Labor Statistics. All USD amounts are in 2015 dollars.
The table below provides the variables’ summary statistics:
The correlation coefficient matrix to the right suggests that multicollinearity is present. Dispinc is highly correlated with four variables. Gdp and dispinc have a correlation of 0.998, which may naturally follow from both variables being measures of overall income. Dispinc and transpo have a correlation coefficient of 0.959. Dispinc and gas have a coefficient of 0.901. Dispinc and pop have a coefficient of 0.997.
In addition, several other variable pairs have correlation coefficients above 0.91. So, the first model will attempt to both avoid excessive multicollinearity and redundancy and include variables whose inclusion makes sense according to microeconomic theory.
Model: motor = transpo dispinc
The model contains the following two independent variables: transportation services expenditures and disposable income. The regression equation follows:
I anticipated that an increase in transpo would result in a decrease in motor due to the substitutability of the two variables. I anticipated that dispinc and motor would have a positive relationship, so that an increase in dispinc would result in an increase in motor. This did not turn out to be the case. There are several possible reasons for this: significant multicollinearity, autocorrelation, omitted variables, a true model that is non-linear, or other model misspecification issues (e.g. incorrect application of microeconomic theory).
A test for the extent of multicollinearity in a regression analysis (the “vif” option) indicates that the variance inflation factors (VIF) are high. A score above five indicates that the variables are highly correlated. Both predictors have VIF scores above 12. This suggests that the parameter coefficient estimates are unreliable and poorly estimated. The “collinoint” option shows that the variables are highly collinear as well. The tables below contain the model’s parameter estimates and other regression results:
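As a rough check on those VIF scores, the variance inflation factor for predictor j is defined from the R-squared of regressing that predictor on the other predictors. With only two predictors, that R-squared is just the squared correlation between them. The calculation below assumes the rounded correlation of 0.959 between dispinc and transpo quoted earlier, so it is an approximation rather than a reproduction of the SAS output.

```latex
\begin{align}
\mathrm{VIF}_j &= \frac{1}{1 - R_j^2} \\
\mathrm{VIF} &= \frac{1}{1 - 0.959^2} = \frac{1}{1 - 0.9197} \approx 12.45
\end{align}
```

This is consistent with both predictors showing VIF scores above 12, and it shows why the two predictors share the same VIF in a two-variable model.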
Therefore, multicollinearity is certainly a problem in this model. However, the predictor variables do explain 67.81% of the variation in motor (R^{2} = 0.6781), so it is worth investigating further.
Another issue may be autocorrelation. With time series data, it is likely that the variable’s current observed value relates to the variable’s value during the previous time period. When this is the case, the model violates the assumption that the error terms are serially uncorrelated. This creates inefficient estimates.
The Durbin-Watson test shows that there is strong first-order autocorrelation, 0.611, so the model does not meet the assumption of zero covariance in the error term. The associated p-value for the test of positive autocorrelation is < 0.05. The SAS output below shows our parameter estimates, Durbin-Watson test statistic, and degree of first-order autocorrelation:
Usually, a plot of the model’s residuals provides a visual hint as to whether autocorrelation exists and in which direction. In addition, we can use tests for normality. The Shapiro-Wilk W test, a statistical test whose null hypothesis is that the residuals are normally distributed, has a high value, 0.7475. Therefore, we have insufficient evidence to reject the notion that we have normally distributed residuals. However, the d statistic above is sufficient evidence of positive autocorrelation. Therefore, autocorrelation is an additional problem in the model.
Model revision: motor = dispinc temp u
Since dispinc and transpo are so highly collinear, the revised model excludes transpo. I excluded gas because it has a 0.91 correlation with dispinc. The model also includes temp and unemployment. I anticipated that an increase in temp will lead to an increase in motor, an increase in unemployment will lead to a decrease in motor, and an increase in dispinc will lead to an increase in motor. The regression equation and parameter estimates follow:
The revised model has parameter coefficient estimates whose signs match microeconomic theory. This suggests that this model has removed some of the excessive multicollinearity. The variance inflation results confirm this. A VIF of 1 shows no multicollinearity and a VIF above 1 shows moderate correlation between the predictors. The model’s predictors are no longer highly correlated. In addition, the t value of dispinc has slightly increased in magnitude. Temp has a low t value but will remain in the model because of its theoretical significance. U has a statistically significant t value, with a p-value of 0.0095.
So, this revision has significantly decreased the model’s multicollinearity and may be approaching the true model. However, autocorrelation still exists. From the original model to the revised model, the 1^{st} order autocorrelation increased from 0.611 to 0.665.
To correct for this, I created 4 new variables to get a sense of the period-to-period correlation: a lag of dispinc (di_lag), a lag of motor (mot_lag), a lag of temp (temp_lag), and a lag of u (u_lag). The matrix below shows that dispinc and its lag have a nearly perfect correlation, motor and its lag have a high correlation, temp and its lag have a slightly negative correlation, and u and its lag have a high correlation.
So, the lagged version of the model still contains high positive autocorrelation. The model also produced nearly identical parameter estimates. Its output follows:
The next model correction I attempted increased the lag condition to three periods. It showed significantly reduced autocorrelation by the third lag. However, lagging causes a loss of degrees of freedom and may not be valid. Given that this data set contains only 19 time periods, such a loss may not be worth the trade-off. Therefore, perhaps more advanced time-series methods are better suited to this data. In addition, it is possible that introducing dummy variables that subset the data according to major economic events, e.g. recessionary periods, and before and after 9/11, would reduce the degree of autocorrelation. The results of this regression follow.
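The dummy-variable idea could be sketched as below. This is an illustration only and was not part of the analysis: the variable names recession and post911, the data set name usdat_dum, and the choice of which years to flag (2001, 2008, and 2009 as recession years, 2002 onward as post-9/11) are all assumptions.

```sas
* Sketch only: the years flagged here are an assumption, not part of the original analysis;
data usdat_dum;
   set usdat;
   recession = (year in (2001, 2008, 2009));   * 1 in years flagged as recessionary, else 0;
   post911   = (year >= 2002);                 * 1 for years after 2001, else 0;
run;

proc autoreg data = usdat_dum;
   * No NLAG= option, so this reports OLS estimates with the Durbin-Watson test;
   model motor = dispinc temp u recession post911 / dwprob;
run;
```

With only 19 observations, each added dummy costs a degree of freedom, so the same trade-off noted for lagging applies here.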
Conclusion
My original model had severe multicollinearity and autocorrelation. To correct for multicollinearity, I excluded transpo because of its high correlation with dispinc, and included u and temp. Still, autocorrelation remained. To correct for autocorrelation, I lagged the variables by one time period. It was difficult to minimize autocorrelation in this model. The presence of autocorrelation makes the parameter estimates ambiguous and difficult to interpret because we no longer have the minimum variance property. One way to improve the model further would be to investigate and apply more advanced time-series methods. For example, it would be useful to take a more detailed look at the theoretical validity of lagging by three time periods for this data.
SAS Code
*data import;
proc import out = usdat
datafile = "/home/belareeves0/488/data.xlsx" dbms = xlsx replace;
getnames = yes;
sheet = 'all';
run;
proc contents data = usdat;
run;
*plot of motor over time;
proc gplot data = usdat;
title 'Motor Vehicles Related Expenditures (1997-2015)';
symbol i=spline v=circle h=2;
plot motor*year;
run;
*plot of dispinc over time;
proc gplot data = usdat;
title 'Disposable Income (1997-2015)';
symbol i=spline v=circle h=2;
plot dispinc*year;
run;
*plot of both dispinc, motor over time;
proc gplot data = usdat;
title 'Motor Vehicles Related Expenditures (1997-2015)';
symbol i=spline v=circle h=2;
plot motor*year = 1
dispinc*year = 2 / overlay;
run;
*plot of model 1 residuals over time;
*usdat has no residuals, so they are created here first;
proc reg data = usdat noprint;
model motor = dispinc transpo;
output out = resid1 r = r;
run;
quit;
proc gplot data = resid1;
title 'Residuals of Motor Vehicles Related Expenditures over Time';
symbol interpol = join v=circle h=2;
plot r*year;
run;
proc corr data = usdat;
run;
proc means data = usdat;
run;
data usdat;
set usdat;
di_lag = lag(dispinc);
mot_lag = lag(motor);
temp_lag = lag(temp);
u_lag = lag(u);
run;
*corr b/w lagged vars;
proc corr data = usdat;
var dispinc di_lag mot_lag motor temp_lag temp u_lag u;
run;
*model 1: dispinc transpo;
proc reg data = usdat;
model motor = dispinc transpo / dwprob vif;
output out = out1 p=p r=r;
plot r.*p.;
title 'Motor Residuals';
run;
*multicollinearity fix: dispinc temp u;
proc reg data = usdat;
model motor = dispinc temp u /dwprob vif;
output out = out1 p=p r=r;
plot r.*p.;
title 'Motor Residuals';
run;
*autocorr fix attempt: nlag = 1;
*data = usdat is named explicitly so the procedure does not read the last output data set;
proc autoreg data = usdat;
model motor = dispinc temp u / nlag=1 dwprob;
output out = out1 p=p r=r;
run;
*autocorr fix attempt: nlag = 3;
proc autoreg data = usdat;
model motor = dispinc temp u / nlag=3 dwprob;
output out = out1 p=p r=r;
run;
*trying proc arima acf;
proc arima data = usdat;
identify var = motor nlag =3;
run;
*normality tests on the residuals from the last autoreg run;
proc univariate data = out1 normal plot;
var r;
run;
Real Disposable Income Per Capita on Peanut Butter Consumption
This assignment required the use of the USDA’s Economic Research Service Food Availability System as well as the FRED Economic Database. The former was used to find peanut butter consumption values from 1980 – 2012, and the latter was used to find real disposable personal income per capita for the same years. This range was selected to give a large enough sample for the regression analysis, and the data end in 2012. The dependent variable, peanut butter consumption, and the explanatory variable, real disposable personal income, form the Engel Curve, as seen later in this post.
The regression analysis was done to see what effect income has on peanut butter consumption, in order to determine what kind of good peanut butter is. The regression analysis revealed peanut butter to be a normal good, because as income rose, demand (or consumption in this case) rose as well. Here, we can see this result in the regression analysis. The equation is:
PBC_{i} = β_{0} + β_{1}setRDPIPC + errorterm_{i}
Real Disposable Income (in hundreds) per Capita on Peanut Butter Consumption per Capita
The REG Procedure
Model: MODEL1
Dependent Variable: PBC PBC
Number of Observations Read | 50 |
Number of Observations Used | 33 |
Number of Observations with Missing Values | 17 |
Analysis of Variance
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
Model | 1 | 1.25309 | 1.25309 | 10.55 | 0.0028 |
Error | 31 | 3.68056 | 0.11873 | | |
Corrected Total | 32 | 4.93365 | | | |
Root MSE | 0.34457 | R-Square | 0.2540 |
Dependent Mean | 3.15007 | Adj R-Sq | 0.2299 |
Coeff Var | 10.93846 | | |
Parameter Estimates
Variable | Label | DF | Parameter Estimate | Standard Error | t Value | Pr > |t| |
Intercept | Intercept | 1 | 2.10576 | 0.32700 | 6.44 | <.0001 |
setRDPIPC | | 1 | 0.00363 | 0.00112 | 3.25 | 0.0028 |
With this information, we can fill in the variables of the simple linear regression equation.
PBC_{i} = 2.10576 + .00363setRDPIPC + errorterm_{i}
The coefficient on real disposable personal income is significant at the 5% level. However, there are still problems that may occur with this data set. As seen above, the R^{2} value is rather small, suggesting that the model does not fit the data well.
Autocorrelation/ Durbin-Watson Test:
Real Disposable Income (in hundreds) per Capita on Peanut Butter Consumption per Capita
The AUTOREG Procedure
Ordinary Least Squares Estimates
SSE | 3.68056122 | DFE | 31 |
MSE | 0.11873 | Root MSE | 0.34457 |
SBC | 28.2593619 | AIC | 25.2663468 |
MAE | 0.27955637 | AICC | 25.6663468 |
MAPE | 8.88405386 | HQC | 26.2734053 |
Durbin-Watson | 0.4287 | Total R-Square | 0.2540 |
Parameter Estimates
Variable | DF | Estimate | Standard Error | t Value | Approx Pr > |t| |
Intercept | 1 | 2.1058 | 0.3270 | 6.44 | <.0001 |
setRDPIPC | 1 | 0.0000363 | 0.0000112 | 3.25 | 0.0028 |
Here, we can see that the Durbin-Watson statistic (0.4287) is far below the value of 2 that we would expect with no autocorrelation, suggesting heavy positive autocorrelation. Autocorrelation means that the error in one time period is correlated with the errors in previous time periods. We correct for this autocorrelation by lagging the model by 2 periods. Our resulting Durbin-Watson value is 1.9626, very close to 2. We can also see the parameter estimates for the lagged values below. Additionally, the total R^{2} value rises to a respectable .7143, meaning that 71.43% of the variation is explained by the model. Note, however, that the income coefficient is no longer significant at the 5% level after the correction (p = 0.0952).
Real Disposable Income (in hundreds) per Capita on Peanut Butter Consumption per Capita
The AUTOREG Procedure
Maximum Likelihood Estimates
SSE | 1.40937588 | DFE | 29 |
MSE | 0.04860 | Root MSE | 0.22045 |
SBC | 4.54312068 | AIC | -1.4429096 |
MAE | 0.15374566 | AICC | -0.0143381 |
MAPE | 4.89358882 | HQC | 0.57120748 |
Log Likelihood | 4.72145479 | Transformed Regression R-Square | 0.1039 |
Durbin-Watson | 1.9626 | Total R-Square | 0.7143 |
Observations | 33 | | |
Parameter Estimates
Variable | DF | Estimate | Standard Error | t Value | Approx Pr > |t| |
Intercept | 1 | 1.8506 | 0.7859 | 2.35 | 0.0255 |
setRDPIPC | 1 | 0.0000467 | 0.0000271 | 1.72 | 0.0952 |
AR1 | 1 | -0.6606 | 0.1842 | -3.59 | 0.0012 |
AR2 | 1 | -0.1491 | 0.1933 | -0.77 | 0.4468 |
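One point that helps when reading the AR1 and AR2 rows: as far as I understand the SAS documentation, PROC AUTOREG writes the autoregressive error process with the signs of the AR parameters reversed. Under that assumption, and using nu for the regression error and epsilon for the remaining white-noise error (notation assumed here), the estimates above imply:

```latex
\begin{align}
\nu_t &= -\varphi_1 \nu_{t-1} - \varphi_2 \nu_{t-2} + \epsilon_t \\
\nu_t &= 0.6606\,\nu_{t-1} + 0.1491\,\nu_{t-2} + \epsilon_t
\end{align}
```

Read this way, the negative AR1 estimate corresponds to positive first-order autocorrelation in the errors, which agrees with the low Durbin-Watson statistic of the OLS fit.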
Possible Omitted Variable:
Given the low R^{2} value and the small effect of income on peanut butter consumption, it is safe to say that there are numerous possible omitted variables that contribute to peanut butter consumption. Foremost among these variables is jam/jelly/preserve consumption, as jam is as close to a perfect substitute for peanut butter as possible. Unfortunately, the dataset for jelly, jam, and preserves over the years is not available in a complete and accessible form, so it can only be assumed that including it would make the model a better fit for the dependent variable. However, it must be noted that any variable added to this model will increase R^{2} without necessarily making the model a better fit.
Conclusion and Engel Graph:
The conclusion that we can draw from this data is that a $1 rise in real disposable income per capita is associated with a rise of about .00004 pounds in peanut butter consumption per capita (0.0000363 in the OLS estimates and 0.0000467 after correcting for autocorrelation). As we have noted before, given the relatively small R^{2} value and the small effect of real disposable income per capita on peanut butter consumption, we can assume that there are some significant variables omitted from this regression that would greatly help its credibility, despite correcting for autocorrelation.
Relationship Between Disposable Income and Gambling Consumption
For my Engel Curve, I estimated the effect of annual household disposable income (hdi) on annual gambling consumption expenditure (gamblingconsumption). The data set contains 24 annual observations from 1992 to 2015 (N=24) and was pulled from the FRED website. For this sample, the model predicted that a 1 billion dollar increase in aggregate disposable household income was associated with a 10.9 million dollar increase in aggregate gambling consumption (p-value < 0.0001; t-stat = 26.53). The model as a whole fit very well (R^{2}= 0.9697) and was statistically significant (p-value < 0.0001). However, using the Durbin-Watson test, I discovered positive autocorrelation in the model (d= 0.2686) that was statistically significant (p-value < 0.0001 for positive autocorrelation). Multicollinearity was not an issue because the model only contained one independent variable (hdi), and heteroskedasticity was not present in the error terms.
In correcting for autocorrelation, I used proc autoreg, which by default uses the Yule-Walker estimation method. I used a 2^{nd} order autoregressive process (NLAG=2) because lagging by only one period (NLAG = 1) still showed statistically significant positive autocorrelation within the model and a higher sum of squared errors than when using the 2^{nd} order. After correcting, the model again predicted that a 1 billion dollar increase in aggregate disposable household income is associated with a 10.9 million dollar increase in aggregate gambling consumption. The coefficient was again highly statistically significant (p-value < 0.0001; t-stat 17.24). The transformed model fit very well as a whole (R^{2}= 0.9369). More importantly, the Durbin-Watson d statistic in this transformed model is now 1.6072 with a statistically insignificant p-value (0.1335), suggesting that we cannot reject the null hypothesis of no positive first-order autocorrelation in the errors. Although the Durbin-Watson statistic is still below 2, which is the value expected for a model showing no autocorrelation, this is a great improvement from the original Durbin-Watson statistic of 0.2686. From this model we can deduce that gambling is a normal good (as income increases, consumption increases). Below I included SAS code.
SAS CODE
proc import datafile="/home/blumjt0/sasuser.v94/gamblin_consumption.xls"
out=blumjt0
dbms=xls
replace;
run;
proc means;
var hdi gamblingconsumption;
run;
proc reg;
model gamblingconsumption = hdi/ DW white;
plot gamblingconsumption * hdi;
run;
proc autoreg;
model gamblingconsumption = hdi/ NLAG=2 DWPROB;
run;
Relationship between Real Personal Disposable Income vs. Used Cars
For my project, I decided to test the effect that changes in income would have on the consumption of used cars. I used data from the FRED database, using Real Personal Disposable Income per capita (using 2009 as a base year) as my independent variable and the net purchase of used cars (in billions of dollars) as my dependent variable. The data set gives annual reports on both these variables starting on January 1^{st}, 1969 all the way through January 1, 2016, giving us a total of 47 observations. The following is the output from an OLS regression using these 2 variables.
We find that there is a very high correlation between changes in income and changes in the consumption of used cars. The high t-statistic of 16 plus the high value of R^{2} and adjusted R^{2}, coupled with a low p-value, tells us that these 2 variables are highly correlated. As income goes up, the consumption of used cars also increases. The high F-statistic also provides evidence that the relationship between the two variables is statistically significant.
To account for heteroscedasticity and/or any outliers, I used the natural log of the observations in the data set for both variables, and the resulting regression gave positive results. In the end, the regression yielded an R^{2} of .83 (.05 higher than the previous regression) and a higher adjusted R^{2} as well. The F-statistic and t-statistic are both much higher than in the previous regression (even though both statistics were high to begin with), which gives us confidence that the two variables being tested are highly correlated.
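A minimal sketch of the log transformation and log-log regression in SAS. The data set name cars and the variable names usedcars and rdpi are hypothetical, since the post does not list its code.

```sas
* Hypothetical names: data set cars with variables usedcars and rdpi;
data cars_log;
   set cars;
   ln_usedcars = log(usedcars);   * LOG() is the natural logarithm in SAS;
   ln_rdpi     = log(rdpi);
run;

proc reg data = cars_log;
   model ln_usedcars = ln_rdpi;   * the slope estimates the income elasticity;
run;
quit;
```

The LOG function returns a missing value for zero or negative inputs, so this sketch assumes both series are strictly positive in every year.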
The graph above represents the Engel Curve of used cars, with real disposable income per capita on the x-axis and the net purchase of used cars on the y-axis. Plotting the natural log of the observations in the data set for both variables, we get an upward-sloping Engel Curve. Using the natural log to capture the Engel Curve is a good approach because a small change in the natural log is approximately equal to a percent change. The upward slope of the curve also tells us that used cars are normal goods, since the demand for a normal good goes up as income rises.
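The percent-change interpretation can be made explicit. Writing Q for the net purchase of used cars and Y for real disposable income per capita (symbols assumed here), a small change in a logged variable is approximately its proportional change, so the slope of a log-log regression is an income elasticity:

```latex
\begin{align}
\Delta \ln Y &= \ln\!\left(1 + \frac{\Delta Y}{Y}\right) \approx \frac{\Delta Y}{Y} \quad \text{for small } \frac{\Delta Y}{Y} \\
\ln Q_t &= \alpha + \beta \ln Y_t + \varepsilon_t, \qquad \beta = \frac{d \ln Q}{d \ln Y} \approx \frac{\%\Delta Q}{\%\Delta Y}
\end{align}
```

A positive beta corresponds to a normal good, and a beta above 1 would indicate that spending on the good rises more than proportionally with income.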
Aggregate Food Engel Curve
Engel Curve Assignment
For this assignment I was interested in how aggregate food expenditures fluctuated with aggregate disposable income. I obtained annual aggregate food expenditure information, in 1988 dollars, for 1953-2014 from the website of the United States Department of Agriculture (USDA). From the USDA I also obtained a dataset containing the annual per capita available food supply adjusted for loss, in kilocalories, for 1953-2014. From the Federal Reserve Bank of St. Louis (FRED) I obtained a dataset chronicling the U.S. civilian noninstitutional population per year from 1953-2014. I also obtained, from FRED, a dataset containing seasonally adjusted, end-of-period aggregated real disposable personal income for the period 1953-2014, as well as the annual personal savings rate as a percentage of disposable income for the same period, also aggregated end of period.
My initial OLS model was as follows:
CPEXP = INTERCEPT + RDI + POP + SUPPLY + SAVINGR
Where CPEXP is annual aggregate food expenditures in millions of 1988 dollars, RDI is annual seasonally adjusted real disposable income in billions of 2009 chained dollars, POP is annual civilian noninstitutional population of the U.S. in thousands of persons, SUPPLY is the loss-adjusted annual supply of kilocalories available per capita in the U.S., and SAVINGR is the annual personal savings as a percentage of personal disposable income. This model yielded the following SAS output:
As can be seen from the output, my initial R-squared and adjusted R-squared indicated that my model had good fit, with over 99% of the variance in CPEXP explained by the independent variables. Since this assignment concerns an Engel curve, I will discuss the variables POP, SUPPLY, and SAVINGR only briefly and focus instead on the effect that RDI has on CPEXP. In my initial model the effect of RDI was statistically significant at the 5% level, as well as the 1% level, as indicated by the p-value, and the parameter estimate indicates that an increase of $1 billion in aggregate real disposable income is associated with an increase of $21.27 million in aggregate food expenditures, holding the supply of food, the civilian noninstitutional population, and the personal savings rate as a percentage of personal disposable income constant. POP is also significant at the 5% and 1% levels, but both SUPPLY and SAVINGR are insignificant at the 5% level.
However, looking at the residual plots of the independent variables, there is a hint that autocorrelation is present given the cyclical nature of the residuals. To check for this I conducted a Durbin-Watson test and obtained the following output:
Given this result, I am able to conclude that my model is suffering from positive autocorrelation, since the Durbin-Watson statistic is 0.693 but the dL is 1.455 and the dU is 1.729 at the 5% significance level. To correct for this, I ran a second regression model, using the same variables as in the above model, but using the SAS autoreg procedure with the maximum likelihood estimator enabled and a lag of 3. This autoregressive model yields the following parameter estimates:
The Durbin-Watson test statistic of 1.9230 indicates that, at the 5% significance level with dL being 1.455 and dU being 1.729, the autocorrelation has been corrected. The parameter estimate for RDI is significant at the 5% as well as the 1% level and indicates that an increase of $1 billion in real disposable income is associated with an additional $21.5 million in aggregate expenditures on food, holding supply of food, civilian noninstitutional population, and personal savings as a percentage of personal disposable income constant. POP is statistically significant at the 1% and 5% significance levels and SAVINGR is significant at the 5% level. However, SUPPLY still remains insignificant at the 5% level. Additionally, the R-squared of this model indicates that 99% of the variation in CPEXP is explained by the variation in the independent variables.
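The decision rule behind both conclusions can be laid out as follows, using the bounds quoted for the first test (dL = 1.455, dU = 1.729). This is the standard one-sided rule for positive autocorrelation, and applying the same bounds to the corrected model follows the post rather than being a separate result.

```latex
\[
\text{decision} =
\begin{cases}
\text{reject } H_0 \text{ (positive autocorrelation)} & \text{if } d < d_L \\
\text{inconclusive} & \text{if } d_L \le d \le d_U \\
\text{do not reject } H_0 & \text{if } d > d_U
\end{cases}
\]
\[
0.693 < 1.455 \;\Rightarrow\; \text{reject } H_0, \qquad
1.9230 > 1.729 \;\Rightarrow\; \text{do not reject } H_0
\]
```

Because 1.9230 is also below 4 - dU = 2.271, the corrected statistic gives no indication of negative autocorrelation either.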
Finally, in order to create a visual of the Engel Curve for aggregate food expenditures in the U.S., I created a scatter plot with RDI on the vertical axis and CPEXP on the horizontal axis. The output is given below and shows a clear positive relationship between the two variables:
Engel Curve Estimation for US Dairy Consumption
Estimating an Engel Curve for US Dairy Consumption
Using OLS, this model estimates an Engel Curve examining the relationship between US dairy consumption and income. The data involved includes real disposable personal income per capita (RDPI) in the United States from 1995 – 2015 provided by Federal Reserve Economic Data and per capita dairy consumption (DCPC) measured in average pounds per person in the United States from 1995 to 2015 provided by the United States Department of Agriculture (USDA). DCPC is modeled as the dependent variable with RDPI and RDPI squared as independent variables.
The descriptive statistics indicate a mean RDPI of $33,627.38 and range from $27,180.00 – $38,431.00. DCPC for this dataset has a mean of 595.8 pounds per person and a range from 561.9 – 627.2 pounds per person.
The OLS results for this sample indicate that as real disposable personal income (RDPI) increases by one thousand dollars, average dairy consumption per person (DCPC) increases by 0.001 pounds per person, holding income squared (RDPI2) constant. The coefficient on RDPI is not statistically significant at the 5% level (p = 0.8749). As income squared increases by one thousand dollars, average dairy consumption per person increases by 6.247242E-8 pounds. The coefficient on RDPI2 is also not statistically significant at the 5% level (p = 0.5597). The model fits the data well, with an R-squared of 0.9400 and an adjusted R-squared of 0.9334. In short, the individual income coefficients are small and statistically insignificant even though the overall fit is high.
Autocorrelation is expected for time series of this nature. The Durbin-Watson test gives D = 1.4232. Since a value of 2 indicates no first-order autocorrelation, a value somewhat below 2 points to a small amount of positive autocorrelation in the sample.
Correcting for autocorrelation, the new results indicate that a one-thousand-dollar increase in income decreases dairy consumption by 0.002 pounds per person, holding income squared constant. A one-thousand-dollar increase in income squared increases dairy consumption by 1.0884E-7 pounds per person, holding income constant. The new total R-squared from the Yule-Walker estimates (0.9433) and the regress R-squared (0.9208) indicate that the corrected model fits the data well.
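Because RDPI and RDPI squared move together, the combined effect of income in a quadratic Engel curve is the slope b1 + 2*b2*RDPI rather than either coefficient alone. The sketch below shows that calculation. It assumes, purely for illustration, that the corrected coefficients quoted above (-0.002 and 1.0884E-7) are in pounds per dollar and pounds per dollar squared; the write-up does not state the units unambiguously, so the numbers printed are not a result of the original analysis. The function names are hypothetical.

```python
def marginal_effect(b_income, b_income_sq, income):
    """Slope of DCPC = a + b1*RDPI + b2*RDPI**2 with respect to RDPI."""
    return b_income + 2.0 * b_income_sq * income


def turning_point(b_income, b_income_sq):
    """Income level at which the quadratic's slope changes sign."""
    return -b_income / (2.0 * b_income_sq)


# Coefficients as quoted for the autocorrelation-corrected model
# (units are an assumption, see the note above).
B1 = -0.002
B2 = 1.0884e-7
MEAN_RDPI = 33627.38  # sample mean reported in the descriptive statistics

slope_at_mean = marginal_effect(B1, B2, MEAN_RDPI)
income_at_turn = turning_point(B1, B2)

print(f"Slope at mean RDPI: {slope_at_mean:.5f}")
print(f"Turning point:      {income_at_turn:.0f}")
```

Under that unit assumption the slope at the sample mean is positive even though the coefficient on RDPI alone is negative, which is why the sign of a single coefficient should be read with care in a quadratic specification.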
There are several limitations to this model:
- There is an extremely weak relationship between income and dairy consumption, as shown by small coefficients on income and income squared. Small coefficients could be due to a lack of additional observation years.
- There are most likely other explanatory factors that influence dairy consumption such as individual preferences and allergies.
- There is a small amount of autocorrelation present, and correcting for it changes the relationship between income and dairy consumption from a very small positive number to a very small negative number, while the relationship between income squared and dairy consumption remains infinitesimally positive. Other model misspecifications could be present.
- Including more specific variables such as milk, ice cream, and yogurt consumption could break the data down further and offer insight into how individual dairy components relate to income. This could also eliminate multicollinearity if present.
SAS Results:
References
United States Department of Agriculture, Dairy Products: per capita consumption, United States (Annual). Retrieved from https://www.ers.usda.gov/data-products/dairy-data/, February 14, 2017.
U.S. Bureau of Economic Analysis, Real Disposable Personal Income: Per Capita [A229RX0]. Retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/A229RX0, February 14, 2017.
Engel Curve: Median Income and Spending on Motor Vehicles and Parts (by State)
My Engel Curve examines the relationship between median household income (by state) and per capita consumption spending on motor vehicles and parts. The median household income by state is taken from a 2010 report by the Census Bureau, and the consumption data was recorded from the FRED database, state by state, in 2010 dollars.
Model: Initial Regression
Analysis of Variance
                              Sum of         Mean
Source             DF        Squares       Square    F Value    Pr > F
Model               1      221858259    221858259      11.79    0.0012
Error              49      921998248     18816291
Corrected Total    50     1143856507

Root MSE          4337.77486    R-Square    0.1940
Dependent Mean         40271    Adj R-Sq    0.1775
Coeff Var           10.77149

Parameter Estimates
                              Parameter      Standard
Variable     Label      DF     Estimate         Error    t Value    Pr > |t|
Intercept    Intercept   1        27843    3669.93928       7.59      <.0001
Expense      Expense     1     10.61806       3.09225       3.43      0.0012
While at first glance this regression appears to be statistically significant, it is important to check whether any unaccounted-for dynamic could be throwing the model off. To check for this, I ran a White Test for heteroskedasticity, to make sure the variance of the residuals is similar across the whole range of X-values.
proc reg;
  model income = expense;
  output out=resids residual=e;
run;

/* Square the residuals for the auxiliary regression */
data white;
  set resids;
  e2 = e**2;
  income2 = income**2;
run;

title 'White Test';
proc reg data=white;
  model e2 = income income2;
run;
Running this code gives a new regression, with an R^2 value of 0.0267.
This value (0.0267) is then multiplied by n, the sample size (51), giving a test statistic of 1.3617.
With 2 degrees of freedom, the 5% chi-square critical value is 5.99, so a test statistic of 1.3617 means we cannot reject the White Test's null hypothesis of homoskedasticity, suggesting that heteroskedasticity may not be a problem.
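The arithmetic behind this conclusion can be checked with a short Python sketch. It takes the sample size and the auxiliary R-squared reported above as given, and uses the fact that a chi-square variable with exactly 2 degrees of freedom has the closed-form survival function exp(-x/2), so no statistics library is needed. The variable names are illustrative only. A textbook White test would put the original model's regressor (expense) and its square on the right-hand side of the auxiliary regression; the SAS code above uses income and income2, and the numbers here simply follow what was reported.

```python
import math

N_OBS = 51              # sample size reported in the text
R_SQUARED_AUX = 0.0267  # R-squared of the auxiliary regression
ALPHA = 0.05            # significance level

# White test statistic: n times the auxiliary R-squared.
lm_stat = N_OBS * R_SQUARED_AUX

# Chi-square with 2 df: P(X > x) = exp(-x / 2), so both the p-value and
# the critical value have closed forms.
p_value = math.exp(-lm_stat / 2)
critical_value = -2 * math.log(ALPHA)

reject_homoskedasticity = lm_stat > critical_value

print(f"Test statistic: {lm_stat:.4f}")
print(f"Critical value: {critical_value:.4f}")
print(f"p-value:        {p_value:.4f}")
print(f"Reject null:    {reject_homoskedasticity}")
```

The closed form only holds for 2 degrees of freedom; with a different number of auxiliary regressors a chi-square table or a statistics package would be needed instead.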
That said, I ran the regression again after taking the log of both variables, because we were supposed to transform the data to try for a better fit. The output is below, but even with both sides logged the R^2 value barely increases.
Model: Double Log
Dependent Variable: lnincome

Analysis of Variance
                              Sum of         Mean
Source             DF        Squares       Square    F Value    Pr > F
Model               1        0.13141      0.13141      12.26    0.0010
Error              49        0.52504      0.01072
Corrected Total    50        0.65645

Root MSE             0.10351    R-Square    0.2002
Dependent Mean      10.59681    Adj R-Sq    0.1839
Coeff Var            0.97684

Parameter Estimates
                   Parameter      Standard
Variable     DF     Estimate         Error    t Value    Pr > |t|
Intercept     1      8.38966       0.63042      13.31      <.0001
lnexpense     1      0.31299       0.08937       3.50      0.0010
While the t and F values appear to suggest that this Engel curve has a statistically significant linear relationship, I suspect that one or more omitted variables may be throwing the model off.
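As a rough check on the two sets of output above, the following Python sketch recomputes each R-squared from the reported sums of squares and reads the double-log slope as an elasticity. All inputs are copied from the SAS listings; the helper name is hypothetical. Note that the regressions as run have income on the left-hand side and expense on the right, so the slope is interpreted in that direction.

```python
# Sums of squares copied from the two Analysis of Variance tables.
SS_MODEL_LINEAR, SS_TOTAL_LINEAR = 221858259, 1143856507
SS_MODEL_LOG, SS_TOTAL_LOG = 0.13141, 0.65645

r2_linear = SS_MODEL_LINEAR / SS_TOTAL_LINEAR
r2_log = SS_MODEL_LOG / SS_TOTAL_LOG

# Slope of the double-log model: ln(income) on ln(expense).
ELASTICITY = 0.31299


def pct_change_in_income(pct_change_in_expense, elasticity=ELASTICITY):
    """Implied percent change in income for a percent change in expense."""
    ratio = 1.0 + pct_change_in_expense / 100.0
    return (ratio ** elasticity - 1.0) * 100.0


ten_pct_effect = pct_change_in_income(10.0)

print(f"R-squared, linear model:     {r2_linear:.4f}")
print(f"R-squared, double-log model: {r2_log:.4f}")
print(f"Income change for 10% higher expense: {ten_pct_effect:.2f}%")
```

The two R-squared values differ only in the third decimal place, which matches the observation that logging both sides barely improves the fit.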