Miami University Simple and Multiple Logistic Regression Models Discussion
Directions: Complete the following questions. The questions have been separated into 4 parts of similar material. Parts 1, 2, and 3 will only use the corona_train data, while Part 4 will use the corona_test data. Use the Markdown starter file hw7_starter.Rmd.
Part 1 – Odds
1. Using the training dataset, compute the odds that a county has reported a Coronavirus-related death. (2pts)
2. Do the odds of a Coronavirus-related death vary by Census region? Compute the odds that a county has reported a Coronavirus-related death for each Census region within the United States. Compare these values to address the question. (3pts)
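The odds calculations in Part 1 can be sketched with xtabs(). This is only an illustration on simulated data: the data frame `toy` and the column names `region` and `death` are hypothetical stand-ins, since the actual column names in corona_train are not given here.

```r
# Hypothetical stand-in for corona_train; the real column names may differ.
set.seed(1)
toy <- data.frame(
  region = rep(c("Midwest", "Northeast", "South", "West"), each = 50),
  death  = rbinom(200, size = 1, prob = 0.4)  # 1 = county reported a death
)

# Overall odds: counties with a death divided by counties without one.
counts <- xtabs(~ death, data = toy)
overall_odds <- as.numeric(counts["1"]) / as.numeric(counts["0"])

# Odds within each region, taken from a two-way table of region by outcome.
by_region <- xtabs(~ region + death, data = toy)
region_odds <- by_region[, "1"] / by_region[, "0"]

overall_odds
region_odds
```

Odds above 1 mean a death report is more likely than not; comparing the four regional values against each other and against the overall odds addresses question 2.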
Part 2 – Simple Logistic Regression
3. Build a plot (or plots) to explore how the logarithm of the population density predicts whether a county has recorded a Coronavirus-related death. Briefly discuss the results of your plot. (2pts)
4. Build a simple logistic model to statistically determine if the logarithm of the population density predicts the probability a county has reported a Coronavirus-related death. Support your findings with an appropriate hypothesis test. (3pts)
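A minimal sketch of the simple logistic fit and one possible hypothesis test (a likelihood ratio test against the intercept-only model) follows. The data are simulated, and `log_density` and `death` are hypothetical column names, not necessarily those in corona_train.

```r
# Simulated data; column names are assumptions for illustration only.
set.seed(2)
n <- 500
toy <- data.frame(log_density = rnorm(n, mean = 4, sd = 1.5))
toy$death <- rbinom(n, 1, plogis(-3 + 0.7 * toy$log_density))

# Simple logistic regression and the intercept-only model it is compared to.
fit  <- glm(death ~ log_density, data = toy, family = binomial)
null <- glm(death ~ 1, data = toy, family = binomial)

# Likelihood ratio test of H0: the slope on log_density is zero.
lrt <- anova(null, fit, test = "Chisq")
lrt

# Multiplicative change in the odds for a one-unit increase in log density.
exp(coef(fit)["log_density"])
```

The Wald z-test reported by summary(fit) is another reasonable choice for a single predictor; the two tests usually agree when the sample is large.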
Part 3 – Multiple Logistic Regression Models
5. Fit a multiple logistic regression model with the census region, the logarithm of population density, the cumulative coronavirus rate, the median county age, the median income, the percent of the county that are U.S. citizens, the percent with a college degree, the percent of the population that are veterans of the U.S. armed services, the percent with healthcare, and the percent that voted for President Trump in the 2016 general election to predict the probability a county has reported a Coronavirus-related death. Conduct an appropriate test to determine whether this model significantly predicts the probability a county has reported a Coronavirus-related death. (3pts)
6. Perform a backward selection procedure on the model from question 5. Which variable(s) has/have been removed from the model? (2pts)
7. We will now continue a backward selection procedure, but this time using a likelihood ratio test. Using the drop1() function to determine which predictors are significant, iteratively remove all insignificant predictors from the model in question 6. That is, look at the drop1() output from the model in question 6, refit the model after removing all insignificant terms, look at the drop1() output, refit the model after removing all insignificant terms… Continue this process until all predictors are significant. What predictor variables remain in the model? (4pts)
8. The starter file contains some code to help you along on this problem. Build a table to compare the AIC, BIC, and a Pseudo-R-squared for the models fit in questions 5, 6, and 7. Which model is best with respect to each metric? (3pts)
9. Code was supplied for a Pseudo-R-squared calculation in question 8. Explain how this value mimics that of the traditional R-squared value used in multiple linear regression. (2pts)
10. For the model with the best BIC, of those fit in questions 5, 6, or 7, interpret the coefficient regionWest. Be sure to explain this coefficient in terms of odds (not log-odds, which do not provide a nice interpretation). How does this compare to the results in question 2? Why might they be similar/different? (3pts)
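The selection and comparison steps in Part 3 can be sketched as below. Everything here is illustrative: the data are simulated, the predictors `x1`, `x2`, and `noise` are hypothetical, and `pseudo_r2()` uses one common deviance-based definition, which is an assumption and may not match the formula supplied in the starter file.

```r
# Simulated data with two real predictors and one pure-noise predictor.
set.seed(3)
n <- 400
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$death <- rbinom(n, 1, plogis(-0.5 + 1.2 * toy$x1 + 0.8 * toy$x2))

# Full model, then AIC-based backward selection with step().
full    <- glm(death ~ x1 + x2 + noise, data = toy, family = binomial)
stepped <- step(full, direction = "backward", trace = 0)

# Likelihood ratio test for dropping each remaining term, one at a time.
drop1(stepped, test = "LRT")

# Assumed pseudo-R-squared: proportional reduction in deviance
# relative to the intercept-only model.
pseudo_r2 <- function(m) 1 - m$deviance / m$null.deviance

comparison <- data.frame(
  model     = c("full", "stepped"),
  AIC       = c(AIC(full), AIC(stepped)),
  BIC       = c(BIC(full), BIC(stepped)),
  pseudo_R2 = c(pseudo_r2(full), pseudo_r2(stepped))
)
comparison
```

Lower AIC and BIC are better, while a higher pseudo-R-squared is better. Because the deviance can only decrease as terms are added, the largest model always has the highest value of this particular pseudo-R-squared, much as ordinary R-squared never decreases when predictors are added in linear regression.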
Part 4 – Prediction
11. We will use three fitted models built above to predict whether a county in the testing dataset will have a Coronavirus-related death. Some code is supplied in the starter file; edit and replicate it so it will make predictions using all three models. Briefly describe what this code is doing. (2pts)
12. Calculate and discuss the accuracy, sensitivity, and specificity for all three models to predict if a county has reported a Coronavirus-related death. Which model appears to be the best model at predicting if a county has a Coronavirus-related death? Code is provided for the confusion matrix of the first model. Replicate this code to generate the confusion matrices for the other two models. (6pts)
13. Using the best model from the previous question, compute the sensitivity and specificity if the probability threshold (the 0.5 provided in the code for question 11) were 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9. Use these values to complete the table in the starter file. Which threshold appears to be the best choice? (5pts)
NOTE: the ideas of sensitivity and specificity are VERY relevant in today’s society as scientists develop tests for the COVID-19 Coronavirus, for both antibodies and detection of the disease. We felt it prudent to introduce these topics under the current circumstances.
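As a rough sketch of how accuracy, sensitivity, and specificity change with the probability threshold, consider the following. The vectors `actual` and `prob` are simulated stand-ins for the test-set outcomes and a model's predicted probabilities, the helper `metrics_at()` is hypothetical, and classifying with `>=` at the threshold is an assumption that may differ from the starter code.

```r
# Simulated outcomes and predicted probabilities (stand-ins for real output
# from predict(model, newdata = corona_test, type = "response")).
set.seed(4)
n <- 300
actual <- rbinom(n, 1, 0.4)
prob   <- plogis(rnorm(n, mean = ifelse(actual == 1, 1, -1)))

# Hypothetical helper: classify at a threshold and compute the three metrics.
metrics_at <- function(threshold, prob, actual) {
  pred <- as.integer(prob >= threshold)
  tp <- sum(pred == 1 & actual == 1)  # true positives
  tn <- sum(pred == 0 & actual == 0)  # true negatives
  fp <- sum(pred == 1 & actual == 0)  # false positives
  fn <- sum(pred == 0 & actual == 1)  # false negatives
  c(threshold   = threshold,
    accuracy    = (tp + tn) / length(actual),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

thresholds <- seq(0.1, 0.9, by = 0.1)
results <- as.data.frame(
  t(sapply(thresholds, metrics_at, prob = prob, actual = actual))
)
results
```

Raising the threshold can only lower sensitivity and raise specificity, so choosing a threshold is a trade-off between missing true cases and raising false alarms.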
Some Coding hints
We have covered a lot this semester… In an effort to help you with some of the necessary coding, we provide the following hints, but note that additional code is needed for everything to work:
- xtabs() can be used in questions 1, 2, and 12
- ggplot() is needed in question 3
- glm(), drop1() and/or anova() are needed in questions 4, 5, 7 and 8
- stats::step() is needed in question 6
- summary() will provide output with model coefficients, you can also use coef()
I need the .Rmd and .html files at the end.