- May 15, 2020

Assignment 1
MAST90083 Computational Statistics and Data Mining
Due time: 5PM, Monday September 16th
You must submit your report via LMS

1 Data Analysis

Gross domestic product (GDP) is a standard measure of the size of an economy; it’s the total value of all goods and services bought and sold in a country over the course of a year. It’s not a perfect measure of prosperity, but it is a very common one, and many important questions in economics turn on what leads GDP to grow faster or slower. One common idea is that poorer economies, those with lower initial GDPs, should grow faster than richer ones. The reasoning behind this catching up is that poor economies can copy technologies and procedures from richer ones, but already-developed countries can only grow as technology advances. A second, separate idea is that countries can boost their growth rate by under-valuing their currency, making the goods and services they export cheaper. Our dataset “uval.csv” contains the following variables:

• Country, in a three-letter code.
• Year (in five-year increments).
• Per-capita GDP, in dollars per person per year.
• Average percentage growth rate in GDP over the next five years.
• An index of currency under-valuation. The index is 0 if the currency is neither over- nor under-valued, positive if under-valued, negative if it is over-valued.

Note that not all countries have data for all years. However, there are no missing values in the data table.

1. Linearly regress the growth rate on the under-valuation index and the log of GDP. Report the coefficients and their standard errors. Do the coefficients support the idea of catching up? Do they support the idea that under-valuing a currency boosts economic growth?

2. Repeat the linear regression but add as covariates the country and the year.
Use factor(year), not year, in the regression formula.
   (a) Report the coefficients for log GDP and undervaluation, and their standard errors.
   (b) Explain why it is more appropriate to use factor(year) in the formula than just year.
   (c) Plot the coefficients on year versus time.
   (d) Does this expanded model support the idea of catching up? Of undervaluation boosting growth?

3. Does adding in year and country as covariates improve the predictive ability of a linear model which includes log GDP and under-valuation?
   (a) What are the R^2 and the adjusted R^2 of the two models?
   (b) Use leave-one-out cross-validation to find the mean squared errors of the two models. Which one actually predicts better, and by how much?
   (c) Explain why using 5-fold cross-validation would be hard here.

4. Kernel regression. Use kernel regression, as implemented in the np package, to non-parametrically regress growth on log GDP, under-valuation, country, and year (treating year as a categorical variable). Hint: read chapter four of Shalizi carefully. In particular, try setting tol to about 10^-3 and ftol to about 10^-4 in the npreg command, and allow several minutes for it to run.
   (a) Give the coefficients of the kernel regression, or explain why you cannot.
   (b) Plot the predicted values of the kernel regression, for each country and year, against the predicted values of the linear model.
   (c) Plot the residuals of the kernel regression against its predicted values. Should these points be scattered around a flat line, if the model is right? Are they?
   (d) The npreg function reports a cross-validated estimate of the mean squared error for the model it fits. What is that? Does the kernel regression predict better or worse than the linear model with the same variables?

2 Kernel regression and varying smoothness

Starter code for this problem is in starter.R. That code will generate a data set to be used for this problem, and will also provide a true mean function µ(x).
The resulting data frame has an x column (your predictor) and a y column (your response).

1. Plot y versus x. Overlay the true mean function µ(x) using the curve function in R. What do you notice for x < 4pi and x > 4pi?

2. Using the np library in R, fit a kernel regression on each of the following datasets:
   (a) Only those data points with x < 4pi.
   (b) Only those data points with x > 4pi.
   (c) All the data points.
For each of these regressions, what is the optimal bandwidth? How does the optimal bandwidth for the overall data set compare to the optimal bandwidth for each of the halves?

3. For each of the three selected bandwidths, make a plot showing:
   • The true mean µ(x).
   • The data points.
   • The kernel regression predictions, with the bandwidth specified to be the selected bandwidth.
   • The 95% confidence band for the regression curve µ using resampling of residuals.
   • The 95% confidence band for the regression curve µ using resampling of cases.
The result should be three plots, each tuned to one of the selected bandwidths. Give these plots clear titles to distinguish them.

4. How do these three plots differ? In particular, how well do the regressions trained on the left and right halves do on each half of the data set? How well does the bandwidth fit on the overall data set do on each half? (Be specific about the types of problems that occur.) What lesson, if any, might this teach about functions of varying smoothness and kernel regression?

3 Theoretical questions

1. Exercise 1.2 in Shalizi
2. Exercise 1.4 in Shalizi
3. Exercise 7.4 in ESL
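
A note on the leave-one-out cross-validation asked for in Section 1, question 3(b). For a least-squares fit from lm, the leave-one-out prediction error for observation i equals e_i / (1 - h_ii), where e_i is the ordinary residual and h_ii is the i-th hat value, so no refitting is needed. The sketch below checks that identity against brute-force refitting. It uses simulated data only: the data frame sim, its columns, and the helpers loocv_shortcut and loocv_brute are illustrative names, not part of uval.csv or starter.R.

```r
# Simulated data only; 'sim', 'x1', 'x2', 'g' are made-up names, not uval.csv columns.
set.seed(1)
n <- 60
sim <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  g  = factor(sample(letters[1:4], n, replace = TRUE))
)
sim$y <- 1 + 0.5 * sim$x1 - 0.3 * sim$x2 + rnorm(n, sd = 0.5)

# Leave-one-out MSE via the hat-value identity (valid for least-squares fits).
loocv_shortcut <- function(fit) {
  mean((residuals(fit) / (1 - hatvalues(fit)))^2)
}

# Leave-one-out MSE by refitting n times; slow, but it spells out the definition.
loocv_brute <- function(formula, data) {
  response <- all.vars(formula)[1]
  errs <- vapply(seq_len(nrow(data)), function(i) {
    fit_i <- lm(formula, data = data[-i, , drop = FALSE])
    data[[response]][i] - predict(fit_i, newdata = data[i, , drop = FALSE])
  }, numeric(1))
  mean(errs^2)
}

small <- lm(y ~ x1 + x2, data = sim)      # continuous covariates only
big   <- lm(y ~ x1 + x2 + g, data = sim)  # adds a categorical covariate

mse_small_fast  <- loocv_shortcut(small)
mse_small_brute <- loocv_brute(y ~ x1 + x2, sim)
mse_big_fast    <- loocv_shortcut(big)
mse_big_brute   <- loocv_brute(y ~ x1 + x2 + g, sim)
```

The two routes should agree up to floating-point error. The shortcut breaks down if any hat value equals 1, which happens when an observation is fitted exactly by the model, so it is worth inspecting hatvalues(fit) before relying on it.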