Stats Final Exam Part 2

Question

Given the data set called County Demographic Information, construct a predictive model for the variable “Total Serious Crime” using some or all of the other variables in the set of data.

 

 

The model should be mathematically valid, accurate and reliable.

 

 

Total Serious Crime is Variable #8

 

 

Other Variables:

 

 

#2 Land Area

 

 

#3 Total Population

 

 

#4 Percent of Population aged 18-34

 

 

#5 Percent of Population 65 or over

 

 

#6 Number of Active Physicians

 

 

#7 Number of Hospital Beds

 

 

#9 Percent of High School Graduates

 

 

#10 Percent of Population with College Degrees

 

 

#11 Percent of Population below poverty level

 

 

#12 Unemployment Percent

 

 

#13 Per Capita Income

 

 

#14 Total Personal Income

 

 

#15 Geographic Region

 

 

Note: I am omitting the data set to simplify this problem; the following analyses use the data set described above, and you can assume the math is calculated correctly. I am testing whether you can identify which analytical techniques may be validly employed and how effective they are in building a model.

 

 

Variables 2 to 14 are numeric variables and variable 15 is categoric.
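To illustrate the distinction between numeric and categorical variables, here is a minimal sketch of one common way to represent a four-level category such as Geographic Region: as 0/1 indicator columns, with one level left out as the reference. The region codes and the choice of level 4 as the reference are assumptions made for illustration; they are not taken from the omitted data set or from the Systat output.

```python
# Hypothetical region codes for a handful of counties (illustrative only,
# not taken from the County Demographic Information data set).
region_codes = [1, 3, 2, 4, 1]


def dummy_code(codes, levels=(1, 2, 3, 4), reference=4):
    """Return one row of 0/1 indicators per observation.

    The reference level is omitted, so k levels give k - 1 columns.
    """
    kept = [lvl for lvl in levels if lvl != reference]
    return [[1 if code == lvl else 0 for lvl in kept] for code in codes]


indicators = dummy_code(region_codes)
for code, row in zip(region_codes, indicators):
    print(code, row)
```

Coded this way, the labels 1 to 4 act only as names: the numeric distance between region 1 and region 4 plays no role.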

 

 

Analysis #1

 

 

Given the data set, we were asked to determine whether an accurate predictive model for Variable #8, Total Serious Crime, could be found.

 

 

Since Variable 15 was determined to be categoric, regression was not appropriate to use, so I used Analysis of Variance (ANOVA) to examine whether there was a significant relationship between Variables 8 and 15. The results (using Systat 13.0) are printed below.

 

 

 

Variables Levels

 

 

VAR(15) (4 levels) 1.000 2.000 3.000 4.000

 

 

 

Dependent Variable VAR(8)

 

 

N 440

 

 

Multiple R 0.110

 

 

Squared Multiple R 0.012

 

 

 

Estimates of Effects B = (X'X)^-1 X'Y

 

 

Factor Level VAR(8)

 

 

CONSTANT 28,017.368

 

 

VAR(15) 1 -4,931.339

 

 

VAR(15) 2 -6,236.627

 

 

VAR(15) 3 -1,026.394

 

 

 

Analysis of Variance

 

 

Source Type III SS df Mean Squares F-Ratio p-Value

 

 

VAR(15) 1.795E+010 3 5.985E+009 1.774 0.151

 

 

Error 1.471E+012 436 3.374E+009
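As a check on how the entries of the table above relate to one another, the sketch below recomputes the mean squares, the F-ratio, and the squared multiple R from the printed sums of squares and degrees of freedom. The variable names are illustrative, and because the printed sums of squares are rounded to four significant figures the results match the Systat output only approximately.

```python
# Values copied from the Analysis #1 ANOVA table.
ss_effect = 1.795e10   # Type III SS for VAR(15)
df_effect = 3
ss_error = 1.471e12    # Error SS
df_error = 436

# Mean square = sum of squares / degrees of freedom.
ms_effect = ss_effect / df_effect
ms_error = ss_error / df_error

# F-ratio = effect mean square / error mean square.
f_ratio = ms_effect / ms_error

# Squared multiple R = share of the total sum of squares due to the effect.
r_squared = ss_effect / (ss_effect + ss_error)

print(f"F-ratio   = {f_ratio:.3f}")
print(f"R-squared = {r_squared:.3f}")
```

The p-Value column is a separate quantity: it is the tail probability of the F-ratio under the null hypothesis, not a proportion of variation.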

 

 

 

ANOVA results suggest that Variable 15 is significantly related to Variable 8, but Variable 15 can only explain approximately 15.1% of the variation in Variable 8.

 

 

Therefore, I conclude that variable 15 is significantly related to variable 8 although variable 15 is only a minor factor in predicting variable 8.

 

 

 

 

 

 

Analysis #2

 

 

Using Systat, I employed Multiple Linear Regression to attempt to create a predictive model, using all of the available variables as independent variables.

 

 

The results are shown below.

 

 

 

Dependent Variable VAR(8)

 

 

N 440

 

 

Multiple R 0.919

 

 

Squared Multiple R 0.844

 

 

Adjusted Squared Multiple R 0.839

 

 

Standard Error of Estimate 23,367.069

 

 

 

Regression Coefficients B = (X'X)^-1 X'Y

 

 

Effect Coefficient Standard Error Std. Coefficient Tolerance t p-Value

 

 

CONSTANT -50,925.731 35,344.226 0.000 . -1.441 0.150

 

 

VAR(2) -3.054 0.849 -0.081 0.719 -3.599 0.000

 

 

VAR(3) 0.234 0.020 2.422 0.008 11.560 0.000

 

 

VAR(4) 221.063 424.685 0.016 0.393 0.521 0.603

 

 

VAR(5) 32.120 380.640 0.002 0.539 0.084 0.933

 

 

VAR(6) -5.189 3.150 -0.159 0.039 -1.647 0.100

 

 

VAR(7) 3.404 2.280 0.134 0.046 1.493 0.136

 

 

VAR(9) -265.566 321.799 -0.032 0.244 -0.825 0.410

 

 

VAR(10) 140.915 373.505 0.019 0.152 0.377 0.706

 

 

VAR(11) 1,142.711 488.132 0.091 0.241 2.341 0.020

 

 

VAR(12) -159.661 658.025 -0.006 0.526 -0.243 0.808

 

 

VAR(13) 2.335 0.699 0.163 0.154 3.339 0.001

 

 

VAR(14) -7.070 0.946 -1.564 0.008 -7.475 0.000

 

 

VAR(15) 1,456.610 1,319.387 0.026 0.668 1.104 0.270
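The Tolerance column in the table above can be read through its reciprocal, the variance inflation factor (VIF = 1 / tolerance). The sketch below converts the printed tolerances and lists the variables whose VIF exceeds 10. That cutoff is a commonly quoted rule of thumb and is an assumption here, not something stated in the exam.

```python
# Tolerance values copied from the Analysis #2 coefficient table.
tolerance = {
    "VAR(2)": 0.719, "VAR(3)": 0.008, "VAR(4)": 0.393, "VAR(5)": 0.539,
    "VAR(6)": 0.039, "VAR(7)": 0.046, "VAR(9)": 0.244, "VAR(10)": 0.152,
    "VAR(11)": 0.241, "VAR(12)": 0.526, "VAR(13)": 0.154, "VAR(14)": 0.008,
    "VAR(15)": 0.668,
}

# Tolerance is 1 - R^2 from regressing a predictor on the other predictors,
# so its reciprocal measures how much collinearity inflates the variance
# of that predictor's coefficient.
vif = {name: 1.0 / tol for name, tol in tolerance.items()}

VIF_CUTOFF = 10.0  # assumed rule-of-thumb threshold
flagged = [name for name, value in vif.items() if value > VIF_CUTOFF]

for name in flagged:
    print(f"{name}: VIF = {vif[name]:.1f}")
```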

 

 

 

Analysis of Variance

 

 

Source SS df Mean Squares F-Ratio p-Value

 

 

Regression 1.256E+012 13 9.664E+010 176.989 0.000

 

 

Residual 2.326E+011 426 5.460E+008
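The summary statistics reported for this regression follow from the sums of squares in the table above. The sketch below recomputes them using the standard definitions, with n = 440 observations and p = 13 predictors plus a constant. Names are illustrative, and small differences from the Systat figures come from rounding in the printed sums of squares.

```python
import math

n = 440             # observations
p = 13              # predictors (a constant is also fitted)
ss_reg = 1.256e12   # Regression SS from the table
ss_res = 2.326e11   # Residual SS from the table

df_reg = p
df_res = n - p - 1  # 426, as printed

r_squared = ss_reg / (ss_reg + ss_res)
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / df_res
f_ratio = (ss_reg / df_reg) / (ss_res / df_res)
std_error_of_estimate = math.sqrt(ss_res / df_res)

print(f"R-squared          = {r_squared:.3f}")
print(f"Adjusted R-squared = {adj_r_squared:.3f}")
print(f"F-ratio            = {f_ratio:.1f}")
print(f"Std. error of est. = {std_error_of_estimate:,.0f}")
```

The standard error of estimate is in the units of the dependent variable (serious crimes per county), which makes it a more direct gauge of typical prediction error than R-squared alone.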

 

 

 

Since the combined model had a p-value of 0.000, I concluded that this model could accurately predict variable 8, Total Serious Crime. The R-Squared value of approximately .84 suggests that the model explains about 84% of the variation in Serious Crime. Therefore, I conclude that this is a fairly accurate, valid, predictive model of Total Serious Crime.

 

 

 

Analysis #3

 

 

Since many of the individual independent variables in the previous regression model had p-values above .05, they were not significant factors. I discarded them, reran the regression analysis, and got the results listed below.

 

 

 

Dependent Variable VAR(8)

 

 

N 440

 

 

Multiple R 0.918

 

 

Squared Multiple R 0.842

 

 

Adjusted Squared Multiple R 0.840

 

 

Standard Error of Estimate 23,274.901

 

 

 

Regression Coefficients B = (X'X)^-1 X'Y

 

 

Effect Coefficient Standard Error Std. Coefficient Tolerance t p-Value

 

 

CONSTANT -63,890.789 10,233.100 0.000 . -6.244 0.000

 

 

VAR(2) -3.109 0.758 -0.083 0.894 -4.101 0.000

 

 

VAR(3) 0.250 0.016 2.580 0.013 15.282 0.000

 

 

VAR(11) 1,449.915 307.144 0.116 0.603 4.721 0.000

 

 

VAR(13) 2.460 0.469 0.171 0.341 5.250 0.000

 

 

VAR(14) -7.899 0.787 -1.748 0.012 -10.037 0.000

 

 

 

Analysis of Variance

 

 

Source SS df Mean Squares F-Ratio p-Value

 

 

Regression 1.254E+012 5 2.508E+011 462.898 0.000

 

 

Residual 2.351E+011 434 5.417E+008
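Because the five predictors in this model are a subset of the thirteen used in Analysis #2, the two models are nested, and one standard way to compare nested regressions is a partial F-test on their residual sums of squares. The sketch below is not part of the original analyses. It only computes the test statistic from the two printed ANOVA tables, and the result is approximate because those tables are rounded.

```python
# Residual SS and df copied from the Analysis #2 (full) and
# Analysis #3 (reduced) ANOVA tables.
sse_full, df_full = 2.326e11, 426
sse_reduced, df_reduced = 2.351e11, 434

# Number of predictors dropped when going from the full to the reduced model.
dropped = df_reduced - df_full

# Partial F: increase in residual SS per dropped predictor,
# relative to the full model's residual mean square.
partial_f = ((sse_reduced - sse_full) / dropped) / (sse_full / df_full)

print(f"dropped predictors = {dropped}")
print(f"partial F          = {partial_f:.3f}")
```

The statistic would be referred to an F distribution with (8, 426) degrees of freedom. A small value indicates that the dropped variables, taken together, added little explanatory power.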

 

 

 

This model is a better predictive model than analysis #2 since it has a higher F-value, and therefore a smaller p-value. Also, each factor of the model has a p-value smaller than .05; this indicates that each component is significant in itself. The R-Squared value of .84 indicates that I can predict Variable 8 with approximately 84% accuracy, using only five variables and a constant.

 

 

 

Analysis #4

 

 

Repeating the previous analysis but deleting the constant allowed me to raise the R-Squared value to almost .87.

 

 

Dependent Variable VAR(8)

 

 

N 440

 

 

Multiple R 0.932

 

 

Squared Multiple R 0.869

 

 

Adjusted Squared Multiple R 0.868

 

 

Standard Error of Estimate 23,381.775

 

 

 

Regression Coefficients B = (X'X)^-1 X'Y

 

 

Effect Coefficient Standard Error Std. Coefficient Tolerance t p-Value

 

 

VAR(2) -3.010 0.763 -0.088 0.612 -3.942 0.000

 

 

VAR(3) 0.245 0.016 2.739 0.009 15.107 0.000

 

 

VAR(9) -697.218 118.026 -0.846 0.015 -5.907 0.000

 

 

VAR(10) 496.913 209.212 0.174 0.056 2.375 0.018

 

 

VAR(11) 683.363 248.743 0.105 0.206 2.747 0.006

 

 

VAR(13) 1.727 0.472 0.511 0.015 3.657 0.000

 

 

VAR(14) -7.658 0.780 -1.800 0.009 -9.818 0.000

 

 

 

Analysis of Variance

 

 

Source SS df Mean Squares F-Ratio p-Value

 

 

Regression 1.576E+012 7 2.251E+011 411.714 0.000

 

 

Residual 2.367E+011 433 5.467E+008
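When a model is fitted without a constant, R-squared is conventionally computed against the uncentered total sum of squares (the sum of y squared) rather than the sum of squared deviations from the mean. The sketch below adds up the Regression and Residual rows of the Analysis #3 and Analysis #4 tables to show that the two outputs imply different totals for the same dependent variable. Reading that difference as centered versus uncentered totals is an interpretation consistent with the printed numbers, not something the Systat output states.

```python
# Sums of squares copied from the two ANOVA tables.
a3_reg, a3_res = 1.254e12, 2.351e11   # Analysis #3, constant included
a4_reg, a4_res = 1.576e12, 2.367e11   # Analysis #4, no constant

total_a3 = a3_reg + a3_res   # implied total SS with a constant
total_a4 = a4_reg + a4_res   # implied total SS without a constant

r2_a3 = a3_reg / total_a3
r2_a4 = a4_reg / total_a4

print(f"Analysis #3: total SS = {total_a3:.4e}, R-squared = {r2_a3:.3f}")
print(f"Analysis #4: total SS = {total_a4:.4e}, R-squared = {r2_a4:.3f}")
print(f"Residual SS changed by {a4_res - a3_res:.3e}")
```

Since the denominators differ, the two R-squared values are not on the same scale, whereas the residual sums of squares and the standard errors of estimate are directly comparable.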

 

 

 

Using seven variables and no constant, I found a model in which each component had a low p-value (under .05) and the overall p-value was 0.000. My conclusion is similar to the one in Analysis #3, but I would prefer this model because of its higher R-Squared value.

 

 

 

Analysis #5

 

 

Trying to optimize the model, I repeated the earlier analytical methods. I discarded the constant and tried to lower the number of variables. I was able to find a model (see the results listed below, and compare to Analyses #3 and #4) that used only four variables. Each variable had a p-value under .05, the F-value was higher than in the earlier models (therefore, the overall p-value was lower for the overall model), and the R-Squared value was still approximately .84.

 

 

 

Dependent Variable VAR(8)

 

 

N 440

 

 

Multiple R 0.916

 

 

Squared Multiple R 0.840

 

 

Adjusted Squared Multiple R 0.839

 

 

Standard Error of Estimate 25,805.795

 

 

 

Regression Coefficients B = (X'X)^-1 X'Y

 

 

Effect Coefficient Standard Error Std. Coefficient Tolerance t p-Value

 

 

VAR(2) -2.141 0.814 -0.062 0.656 -2.629 0.009

 

 

VAR(3) 0.088 0.002 0.979 0.644 41.013 0.000

 

 

VAR(11) 1,240.562 217.578 0.191 0.327 5.702 0.000

 

 

VAR(13) -0.846 0.116 -0.251 0.314 -7.328 0.000

 

 

 

Analysis of Variance

 

 

Source SS df Mean Squares F-Ratio p-Value

 

 

Regression 1.522E+012 4 3.805E+011 571.367 0.000

 

 

Residual 2.903E+011 436 6.659E+008
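All four regression models predict the same dependent variable, so their residual mean squares can be compared directly. The sketch below takes the square root of each printed residual mean square to recover the standard error of estimate for Analyses #2 through #5. The dictionary keys are illustrative labels, and the values agree with the Systat figures only up to rounding.

```python
import math

# Residual mean squares copied from the four regression ANOVA tables.
residual_mean_square = {
    "Analysis #2": 5.460e8,
    "Analysis #3": 5.417e8,
    "Analysis #4": 5.467e8,
    "Analysis #5": 6.659e8,
}

# Standard error of estimate = sqrt(residual mean square),
# expressed in the units of VAR(8).
std_error = {name: math.sqrt(ms) for name, ms in residual_mean_square.items()}

for name, value in std_error.items():
    print(f"{name}: {value:,.0f}")

smallest = min(std_error, key=std_error.get)
largest = max(std_error, key=std_error.get)
print("smallest:", smallest, "| largest:", largest)
```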

 

 

 

Therefore, I concluded that Model #5 was the preferred model since it only had four input variables and achieved approximately the same predictive accuracy. Thus I needed only four independent variables to predict variable #8 with accuracy of approximately 84%.

 

 

 

A) Is each of the five analyses valid? (If not, why not?)

B) Is each of the five analyses significant? (Why?)

C) Is each of the five analyses accurate? (Why?)

D) Which is the best predictive model, and why?

 

Paper#61787 | Written in 10-Dec-2015
