Statistics 2: Advanced Topics: Inductive and Multivariate Statistics

Published

April 9, 2024

Exercise I: Estimation of Multiple Regression Analysis

Exercise 1

A problem of interest to health officials (and others) is to determine the effects of smoking during pregnancy on infant health. One measure of infant health is birth weight; a birth weight that is too low can put an infant at risk for contracting various illnesses. Since factors other than cigarette smoking that affect birth weight are likely to be correlated with smoking, we should take those factors into account. For example, higher income generally results in access to better prenatal care, as well as better nutrition forthe mother. An equation that recognizes this is \(bwght = \beta_0 + \beta_1cigs + \beta_2 faminc + u.\)

  1. What is the most likely sign for \beta_2?
  2. Do you think cigs and faminc are likely to be correlated? Explain why the correlation might be positive or negative.
  3. Now, estimate the equation with and without faminc, using the data in BWGHT.RAW. Report the results in equation form, including the sample size and R-squared. Discuss your results, focusing on whether adding faminc substantially changes the estimated effect of cigs on bwght.
Sample Solution
library('wooldridge')
data('bwght')
?bwght

#1)
#likely to be positive

#2)
#Yes
#negative correlation
linearBwght0 <- lm(cigs ~ faminc, data=bwght)
summary(linearBwght0)

#3)
linearBwght1 <- lm(bwght ~ cigs + faminc, data=bwght)
linearBwght2 <- lm(bwght ~ cigs, data=bwght)

summary(linearBwght1)
summary(linearBwght2)

Exercise 2

Use the data in HPRICE1.RAW to estimate the model \(price = \beta_0 + \beta_1sqrft + \beta_2bdrms + u,\) where price is the house price measured in thousands of dollars.

  1. Write out the results in equation form.
  2. What is the estimated increase in price for a house with one more bedroom, holding square footage constant?
  3. What is the estimated increase in price for a house with an additional bedroom that is 140 square feet in size? Compare this to your answer in question 2.
  4. What percentage of the variation in price is explained by square footage and number of bedrooms?
  5. The first house in the sample has sqrft = 2,438 and bdrms = 4. Find the predicted selling price for this house from the OLS regression line.
  6. The actual selling price of the first house in the sample was $300,000 (so price = 300). Find the residual for this house. Does it suggest that the buyer underpaid or overpaid for the house?
Sample Solution
library('wooldridge')
data('hprice1')
?hprice1

#1)
linearHprice <- lm(price ~ sqrft + bdrms, data=hprice1)
summary(linearHprice)

#2)
#+15.2 thousand dollars

#3)
140*0.128+15.2
#33.12 thousand dollars

#4)
#R2 = 0.63

#5)
2438*0.128+4*15.2-19.31
#354 thousand dollars

#6)
354-300
#residual of 54 thousand dollars
#the buyer underpaid

Exercise 3

The file CEOSAL2.RAW contains data on 177 chief executive officers and can be used to examine the effects of firm performance on CEO salary.

  1. Estimate a model relating annual salary to firm sales and market value. Make the model of the constant elasticity variety for both independent variables. Write the results out in equation form.
  2. Add profits to the model from question 1. Why can this variable not be included in logarithmic form? Would you say that these firm performance variables explain most of the variation in CEO salaries?
  3. Add the variable ceoten to the model in question 2. What is the estimated percentage return for another year of CEO tenure, holding other factors fixed?
  4. Find the sample correlation coefficient between the variables log(mktval) and profits. Are these variables highly correlated?
Sample Solution
library('wooldridge')
data('ceosal2')
?ceosal2

#1)
linearCEO <- lm(log(salary) ~ log(sales) + log(mktval), data=ceosal2)
summary(linearCEO)

#2)
linearCEO2 <- lm(log(salary) ~ log(sales) + log(mktval) + profits, data=ceosal2)
summary(linearCEO2)

min(ceosal2$profits)
#because it contains negative values
#Yes, R2 is important

#3)
linearCEO3 <- lm(log(salary) ~ log(sales) + log(mktval) + profits + ceoten, data=ceosal2)
summary(linearCEO3)

#4)
linearCEO4 <- lm(log(mktval) ~ profits, data=ceosal2)
summary(linearCEO4)
#Yes, R2 is important

Exercise II: Multiple Regression Analysis: Inference

Exercise 4

The following model can be used to study whether campaign expenditures affect election outcomes: \(voteA = \beta_0 + \beta_1log(expendA) + \beta_2log(expendB) + \beta_3 prtystrA + u,\) where voteA is the percentage of the vote received by Candidate A, expendA and expendB are campaign expenditures by Candidates A and B, and prtystrA is a measure of party strength for Candidate A (the percentage of the most recent presidential vote that went to A’s party).

  1. What is the interpretation of \(\beta_1\)?
  2. In terms of the parameters, state the null hypothesis that a 1% increase in A’s expenditures is offset by a 1% increase in B’s expenditures.
  3. Estimate the given model using the data in VOTE1.RAW and report the results in usual form. Do A’s expenditures affect the outcome? What about B’s expenditures? Can you use these results to test the hypothesis in question 2?
  4. Estimate a model that directly gives the t statistic for testing the hypothesis in question 2. What do you conclude? (Use a two-sided alternative knowing that the 10% critical value against a two-side alternative with 169 df is 1.645)
Sample Solution
library('wooldridge')
data('vote1')
?vote1

#1)
#beta_1 measures the percentage point candidate A votes increases when candidate A expenditures increase by 100%

#2)
#H_0: \beta_1 + \beta_2 = 0

#3)
linearVote <- lm(voteA ~ log(expendA) + log(expendB) + prtystrA, data=vote1)
summary(linearVote)
#No (linear combination of two estimates)

#4)
linearVote2 <- lm(voteA ~ log(expendA) + log(expendB/expendA) + prtystrA, data=vote1)
summary(linearVote2)
#We fail to reject H_0

Exercise 5

Use the data in WAGE2.RAW for this exercise.

  1. Consider the standard wage equation \(log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 tenure + u\). State the null hypothesis that another year of general workforce experience has no effect on log(wage).
  2. Test the null hypothesis in question 1. against a two-sided alternative, at the 5% significance level. What do you conclude?
  3. State the null hypothesis that another year of general workforce experience has the same effect on log(wage) as another year of tenure with the current employer.
  4. Test the null hypothesis in question 3. against a two-sided alternative, at the 5% significance level.
Sample Solution
library('wooldridge')
data('wage2')
?wage2

#1)
#H_0: \beta_2 = 0

#2)
linearWage <- lm(log(wage) ~ educ + exper + tenure, data=wage2)
summary(linearWage)
#Significant. a 1-year experience increases wage by 1.53%

#3)
#H_0: \beta_2 - \beta_3 = 0

#4)
wage_new <- wage2 %>% mutate(var=exper+tenure)
linearWage2 <- lm(log(wage) ~ educ + exper + var, data=wage_new)
summary(linearWage2)
#We fail to reject H_0

Exercise 6

The data set 401KSUBS.RAW contains information on net financial wealth nettfa, age of the survey respondent age, annual family income inc, family size fsize, and participation in certain pension plans for people in the United States. The wealth and income variables are both recorded in thousands of dollars. For this question, use only the data for single-person households (so fsize = 1).

  1. How many single-person households are there in the data set?
  2. Use OLS to estimate the model \(nettfa = \beta_0 + \beta_1inc + \beta_2age + u,\) and report the results using the usual format. Be sure to use only the single-person households in the sample. Interpret the slope coefficients. Are there any surprises in the slope estimates?
  3. Does the intercept from the regression in question 2 have an interesting meaning? Explain.
  4. Find the p-value for the test \(H_0: \beta_2 = 0\) against a two-side alternative. Do you reject \(H_0\) at the 1% significance level?
  5. If you do a simple regression of nettfa on inc, is the estimated coefficient on inc much different from the estimate in question 2? Why or why not?
Sample Solution
library('wooldridge')
data('k401ksubs')
?k401ksubs

#1)
k401ksubs %>% filter(fsize==1) %>% summarise(n())

#2)
data_single <- k401ksubs %>% filter(fsize==1)
linear_single <- lm(nettfa ~ inc + age, data=data_single)
summary(linear_single)
#Not very surprising

#3)
#Not an interesting meaning

#4)
#Yes

#5)
simple_linear <- lm(nettfa ~ inc, data=data_single)
summary(simple_linear)
#Not so different