Forecasting Wealth using Advanced Regression Techniques
Introduction
The basic task of our study is to build a prediction model for the total wealth of an individual. This has been an interesting problem throughout the history of economics and has been widely studied. In the prediction of wages, it is well known that age and work experience are highly correlated with the outcome. In the case of total wealth, however, the relationship is less clear, because intergenerational transfers and other seemingly random effects induce huge variation. We must manage this variation carefully, or our model's accuracy will suffer. We are given a dataset of eighteen features to use for prediction, and we will compare the efficacy of several standard prediction models: ordinary least squares regression, ridge and lasso regression, and stepwise regression.
Data Exploration
First, we perform some exploratory data analysis to evaluate the data before making any model choices. Summarizing the data reveals several interesting facts about the regressors. We should begin with the 'ira' variable, which denotes the amount in an individual's retirement account.
Figure 1.1: Basic exploration of problematic variables
Here, we can see that even the third-quartile value for ira is zero, implying that over three-fourths of the observations are zero for this feature. It is likely that we will need to omit this feature altogether. We find similar levels of imbalance in the various binarized educational-attainment variables. In general, we could combat such imbalances with statistical strategies such as bootstrapping. However, given the wealth of variables available to us, we will simply drop the binarized educational-attainment variables and instead use the categorical version, educ.
A binary variable whose two classes have such hugely disparate frequencies carries little predictive information, so we should avoid feeding these features directly into our model.
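As a concrete check, the imbalance can be measured directly. A minimal R sketch, assuming the data frame is named data:

    # Summarize the retirement-account feature
    summary(data$ira)

    # Proportion of observations with a zero IRA balance
    mean(data$ira == 0)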
Next, let us take a closer look at the huge variability that is present within the dependent variable (total wealth). We should find a method to best evaluate this, and eventually eliminate as much of the variance as we can. First, let us take a look at a scatter plot of the raw total wealth statistics compared to the index.
Figure 1.2: Z-score summary of the 'Total Wealth' variable
Figure 1.3: Scatter plot of the ‘Total Wealth’ variable
In terms of methods we can use to statistically evaluate the scale of this variance, a simple choice is the z-score. For each observation, we calculate the z-score of total wealth and then summarize the results to understand the scale.
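As a sketch, assuming the data frame is data and total wealth is stored in a column named tw (the column name is an assumption):

    # Z-score of total wealth for each observation
    data$z_tw <- (data$tw - mean(data$tw)) / sd(data$tw)

    # Summarize to understand the scale of the variance
    summary(data$z_tw)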
The scatter plot on the right makes the huge variance in total wealth clear. Some outliers lie far outside the realm of normality; if we include them, the model will be pulled toward these extreme values and our accuracy will suffer.
Compared to a median z-score of -0.35, the maximum z-score in the data is a terrifying 16+! Our fears of huge outliers are confirmed, and we must eliminate them before continuing with our regressions. To do so, we simply filter out every observation whose total-wealth z-score exceeds 3.0 in absolute value. The cutoff of 3.0 is somewhat arbitrary, but observations beyond it lie more than three standard deviations from the mean.
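Continuing the sketch above, the filter itself is one line:

    # Keep only observations within three standard deviations of the mean
    data_clean <- data[abs(data$z_tw) < 3, ]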
Model Building
Before we continue, let us fit a simple linear model on the unfiltered data so we can measure the improvement later. Below are the results of a simple linear regression on all regressors except those that would introduce multicollinearity.
Figure 2.1: Baseline Regression
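The fit can be sketched in a few lines of R; the exact regressor set shown here (and the column name tw) is an assumption rather than the precise specification used:

    # Baseline OLS on the unfiltered data, omitting collinear dummies
    baseline <- lm(tw ~ inc + age + educ + male + e401 + nifa, data = data)
    summary(baseline)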
From this we can draw several interesting results. Firstly, it is interesting to see which features the model considers significant. Since we include information that is extremely relevant to total wealth, such as the continuous retirement and income features, it is no surprise that every one of the continuous features is significant; the features that are not significant are the most important to discuss.
As mentioned above, the 'ira' feature is fundamentally a poor predictor due to the heavy imbalance in its distribution, and the model does not consider it significant. Our final regression specifications will therefore exclude 'ira'.
Surprisingly, the education variable is also not significant. This is highly surprising, since previous economic literature weights it heavily as a predictor of wages. Given the strong correlation we can assume between wages and total wealth, it seems unbelievable that education is not a significant factor.
One of the most famous equations in the economic literature, the Mincer equation, uses only years of schooling, potential experience, and potential experience squared as its regressors, and it has been shown to produce great results. Perhaps, by following the teachings of the past, we can use the education feature to build something that is actually predictive.
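For reference, the Mincer specification models log wages as

    \ln w = \beta_0 + \beta_1 s + \beta_2 x + \beta_3 x^2,

where s denotes years of schooling and x denotes potential experience.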
Figure 2.2: Improved Regression
For this, we will add new terms for potential experience. In the academic literature, when data on experience is missing, a rough approximation can be built from age and years of education: for every observation, we subtract years of education from age to approximate how long the individual has been working, and then subtract six more, since children start school at roughly age six. Even if this estimate of the school-starting age is inaccurate, the calculation preserves ordinality, so it should not induce any major problems in our model.
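In R, using the age and educ columns on the cleaned data:

    # Potential experience: years since the individual plausibly began working
    data_clean$potexp  <- data_clean$age - data_clean$educ - 6
    data_clean$potexp2 <- data_clean$potexp^2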
On the right hand side we can see the results from our new linear regression model with no outliers after removing insignificant features and including terms for both potential experience and potential experience squared. We have omitted education because of multicollinearity with our new terms.
While our new features are not significant at the 10% level, they still have relatively high t-statistics, not much lower than that of the coefficient for male. It is quite surprising that including potential experience does not explain more of the variance; if we had a feature recording true years of work experience, it would likely prove highly significant.
To continue exploring trends established in past academic literature, we employ interaction terms between gender and our other features. Specifically, we interact the male feature with the e401, nifa, inc, and age features to test for gender-specific effects. We also create squared terms for e401, nifa, and inc, since these are the most significant features in our regression so far.
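A sketch of this augmented specification, using R's formula syntax (I() for squared terms, : for interactions); the column name tw and the exact regressor set remain assumptions:

    # Cleaned regression plus gender interactions and squared terms
    augmented <- lm(tw ~ inc + age + male + e401 + nifa + potexp + potexp2 +
                      male:inc + male:age + male:e401 + male:nifa +
                      I(inc^2) + I(e401^2) + I(nifa^2),
                    data = data_clean)
    summary(augmented)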
Figure 2.3: Final Regression
This regression is identical to the previous cleaned regression, with these squared and interaction terms added. Surprisingly, none of the gender-specific interaction terms is significant. The relation between gender and total wealth appears minimal at best, and we have even lost significance on the coefficient for male. However, the squared terms do produce significant coefficients. Our final model should therefore include only the squared terms, not the gender-specific interactions.
Next, we will compare three more advanced models on the augmented data to find the best one. For this analysis, we compare lasso, ridge, and stepwise regression to find the model with the lowest MSPE.
We first run lasso and ridge regression and compare their MSPE. It is important to remove the z-score variable we calculated earlier: it is a direct transformation of the dependent variable, so leaving it in would leak the target into the predictors and cause the model to overfit disproportionately.
Below, we can see the results of the comparison between the lasso and ridge models.
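A sketch using the glmnet package, where alpha = 1 gives the lasso and alpha = 0 the ridge; the train/test split and the column name tw are assumptions:

    library(glmnet)

    # Design matrices (drop the intercept column); the z-score helper
    # column is assumed to have been removed from the data beforehand
    x_train <- model.matrix(tw ~ ., data = train)[, -1]
    x_test  <- model.matrix(tw ~ ., data = test)[, -1]

    # Cross-validated lasso and ridge fits
    cv_lasso <- cv.glmnet(x_train, train$tw, alpha = 1)
    cv_ridge <- cv.glmnet(x_train, train$tw, alpha = 0)

    # Out-of-sample MSPE for each model
    mean((predict(cv_lasso, x_test, s = "lambda.min") - test$tw)^2)
    mean((predict(cv_ridge, x_test, s = "lambda.min") - test$tw)^2)

    # MSE across the lambda path (the curve discussed below)
    plot(cv_lasso)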
By a significant margin, the lasso regression outperforms the ridge regression, so our final model should likely be a lasso rather than a ridge. We also compare both against backward stepwise regression using k-fold cross-validation; those results are shown below.
Here, we can see that the backward stepwise regression outperforms both the lasso and the ridge, with a smaller MSPE than either. However, the lasso is easier to interpret, so we choose it as our final model for predictions.
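Backward stepwise selection can be sketched with R's built-in step(); note that step() selects on AIC, so the k-fold MSPE comparison sits on top of the selected model:

    # Backward stepwise selection starting from the full model
    full    <- lm(tw ~ ., data = train)
    stepped <- step(full, direction = "backward", trace = 0)

    # Out-of-sample MSPE for the selected model
    mean((predict(stepped, newdata = test) - test$tw)^2)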
On the right-hand side we can see the mean squared error for different values of lambda in our lasso model. After roughly the seventh value along the lambda path, we gain no further performance and only increase the MSE.
This is our final model choice for the project. To recap: the model was trained on a set with all outliers removed, dropping every observation whose total-wealth z-score exceeds 3 in absolute value. This improved our model significantly.
Figure 2.4: Lasso Regression
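The surviving coefficients of the final fit can be read off directly (continuing from the cv_lasso object sketched above):

    # Nonzero coefficients at the cross-validated lambda
    coef(cv_lasso, s = "lambda.min")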
Further, we included new terms for potential experience and potential experience squared as a further attempt to reduce the variance, and we introduced squared terms for the most significant features.
We compared the main methods of the course, OLS, ridge/lasso regression, and stepwise regression, using k-fold cross-validation to find the optimal model, which happened to be the stepwise fit. We chose lasso regression instead because its MSPE is not much higher and we gain interpretability. We also chose not to regress on variables that would cause multicollinearity, such as the various binary education levels, and we excluded variables like fsize and marr because of their insignificance.
Conclusion
I have learned a great deal from this project. Firstly, we can improve the model without any new features, simply by cleaning and polishing the original dataset: by removing significant outliers, for example, we were able to increase the amount of variance the model could explain. I was also surprised that the interaction terms for gender did not produce more significant results, even when combined with highly significant variables like non-401k assets and income.
Further, I was surprised that our approximate proxy for experience did not yield more significant results. I suspect this is because some individuals do not pursue work or schooling continuously, making our approximation inaccurate for them. If we could include an accurate measure of work experience, I hypothesize it would prove highly significant.
In all, we were able to create a highly predictive model for total wealth given the data at hand, optimized over many iterations of competing models with different regressors. The lessons of previous economic literature did not hold in this case, but perhaps that makes our results all the more interesting, since they conflict with earlier studies.