Statistics 1: The Basics of Statistics

Published

April 2, 2024

Exercise I: Descriptive Statistics

Exercise 1

Use the data in WAGE1.RAW for this exercise.

  1. Find the average education level in the sample. What are the lowest and highest years of education?
  2. Find the average hourly wage in the sample. Does it seem high or low?
  3. The wage data are reported in 1976 dollars. Using the Economic Report of the President (2011 or later), obtain and report the Consumer Price Index (CPI) for the years 1976 and 2010.
  4. Use the CPI values from 3) to find the average hourly wage in 2010 dollars. Now does the average hourly wage seem reasonable?
  5. How many women are in the sample? How many non-women?
Sample Solution
library('wooldridge')
data('wage1')
?wage1

#1)
mean(wage1$educ)
min(wage1$educ)
max(wage1$educ)

#2)
mean(wage1$wage)
#This seems low by today's standards, but the figure is in 1976 dollars

#3)
#page 259 of the report
#CPI in 1976 is 56.9
#CPI in 2010 is 218.056

#4)
mean(wage1$wage)*218.056/56.9

#5)
nrow(wage1[wage1$female == 1, ])
nrow(wage1[wage1$female == 0, ])
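
The conversion in part 4 rescales a 1976 dollar amount by the ratio of the two CPI values. A minimal sketch of that arithmetic, using the CPI figures quoted in the solution; the helper name `to_2010_dollars` is made up for illustration and is not part of any package:

```r
# CPI values quoted in the solution above
cpi_1976 <- 56.9
cpi_2010 <- 218.056

# Hypothetical helper: rescale a 1976 dollar amount into 2010 dollars
to_2010_dollars <- function(amount_1976) {
  amount_1976 * cpi_2010 / cpi_1976
}

# One 1976 dollar expressed in 2010 dollars, i.e. the price-level ratio
to_2010_dollars(1)
```

The ratio is about 3.83, so any 1976 dollar figure in the data is multiplied by roughly that factor.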

Exercise 2

Use the data in BWGHT.RAW to answer this question.

  1. How many women are in the sample, and how many report smoking during pregnancy?
  2. What is the average number of cigarettes smoked per day? Is the average a good measure of the “typical” woman in this case? Explain.
  3. Among women who smoked during pregnancy, what is the average number of cigarettes smoked per day? How does this compare with your answer from 2), and why?
  4. Find the average of fatheduc in the sample. Why are only 1,192 observations used to compute this average?
  5. Report the average family income and its standard deviation in dollars.
Sample Solution
library('wooldridge')
library('dplyr')
data('bwght')
?bwght

#1)
nrow(bwght)
#each row is one woman, so the sample has 1,388 women
sum(bwght$cigs > 0)

#2)
mean(bwght$cigs)
#No: a large share of women report zero cigarettes, so the average describes neither the typical non-smoker nor the typical smoker

#3)
bwght %>% filter(cigs>0) %>% summarise(mean(cigs))

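#4)
mean(bwght$fatheduc, na.rm = TRUE)
sum(is.na(bwght$fatheduc))
#fatheduc is missing for the remaining 1388 - 1192 = 196 women, and na.rm = TRUE drops them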

#5)
#faminc is recorded in thousands of dollars, so rescale to dollars
mean(bwght$faminc)*1000
sd(bwght$faminc)*1000
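
Parts 2 and 3 turn on how a large block of zeros pulls an average down. A small sketch with made-up numbers (the vector `cigs_toy` is hypothetical and not taken from `bwght`) shows the overall mean, the median, and the mean among smokers side by side:

```r
# Hypothetical data: eight non-smokers and two smokers (not from bwght)
cigs_toy <- c(0, 0, 0, 0, 0, 0, 0, 0, 10, 20)

overall_mean <- mean(cigs_toy)                 # pulled toward zero by the non-smokers
typical      <- median(cigs_toy)               # the middle value of the toy sample
smoker_mean  <- mean(cigs_toy[cigs_toy > 0])   # average among smokers only

c(overall = overall_mean, median = typical, smokers = smoker_mean)
```

In this toy sample the overall mean is 3, the median is 0, and the mean among smokers is 15, so the overall mean matches neither group.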

Exercise 3

The data in MEAP01.RAW are for the state of Michigan in the year 2001. Use these data to answer the following questions.

  1. Find the largest and smallest values of math4. Does the range make sense? Explain.
  2. How many schools have a perfect pass rate on the math test? What percentage is this of the total sample?
  3. How many schools have math pass rates of exactly 50%?
  4. Compare the average pass rates for the math and reading scores. Which test is harder to pass?
  5. Find the correlation between math4 and read4. What do you conclude?
  6. The variable exppp is expenditure per pupil. Find the average of exppp along with its standard deviation. Would you say there is wide variation in per pupil spending?
  7. Suppose School A spends $6,000 per student and School B spends $5,500 per student. By what percentage does School A’s spending exceed School B’s?
Sample Solution
library('wooldridge')
data('meap01')
?meap01

#1)
min(meap01$math4)
max(meap01$math4)

#2)
nrow(meap01[(meap01$math4==100), ])
nrow(meap01[(meap01$math4==100), ])/nrow(meap01)*100

#3)
nrow(meap01[(meap01$math4==50), ])

#4)
mean(meap01$math4)
mean(meap01$read4)

#5)
cor(meap01$math4, meap01$read4)

#6)
mean(meap01$exppp)
sd(meap01$exppp)

#7)
(6000-5500)/5500*100
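
Part 5 asks for a correlation, which base R computes with `cor()`. The sketch below uses made-up vectors `x` and `y` (hypothetical, not from `meap01`) to show that the correlation is symmetric in its arguments and how it relates to the slope of a simple regression:

```r
# Hypothetical toy data, not taken from meap01
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

r_xy <- cor(x, y)   # correlation of x with y
r_yx <- cor(y, x)   # same value: correlation is symmetric

# The slope of a simple regression of y on x is the correlation
# rescaled by the ratio of standard deviations
slope_lm  <- unname(coef(lm(y ~ x))[2])
slope_cor <- r_xy * sd(y) / sd(x)

c(r_xy = r_xy, r_yx = r_yx, slope_lm = slope_lm, slope_cor = slope_cor)
```

The correlation is unit-free and always lies between -1 and 1, whereas the regression slope depends on the units of both variables.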

Exercise 4

The data in JTRAIN2.RAW come from a job training experiment conducted for low-income men during 1976–1977.

  1. Use the indicator variable train to determine the fraction of men receiving job training.
  2. The variable re78 is earnings from 1978, measured in thousands of 1982 dollars. Find the averages of re78 for the sample of men receiving job training and the sample not receiving job training. Is the difference economically large?
  3. The variable unem78 is an indicator of whether a man is unemployed or not in 1978. What fraction of the men who received job training are unemployed? What about for men who did not receive job training? Comment on the difference.
  4. From questions 2 and 3, does it appear that the job training program was effective? What would make our conclusions more convincing?
Sample Solution
library('wooldridge')
#dplyr provides %>%, filter() and summarise() used below
library('dplyr')
data('jtrain2')
?jtrain2

#1)
#the mean of a 0/1 indicator is the fraction of ones
mean(jtrain2$train)

#2)
jtrain2 %>% filter(train==0) %>% summarise(mean(re78))
jtrain2 %>% filter(train==1) %>% summarise(mean(re78))

#3)
#unem78 equals 1 for unemployed men, so its group mean is the fraction unemployed
jtrain2 %>% filter(train==1) %>% summarise(mean(unem78))

jtrain2 %>% filter(train==0) %>% summarise(mean(unem78))

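#4)
#Yes, it appears so: the men who received training earn more on average and are unemployed less often
#The conclusion would be more convincing with evidence that these differences are statistically significant, for example from a linear regression of re78 or unem78 on train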
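
Parts 1 and 3 both ask for fractions of a 0/1 indicator. A compact way to get them, sketched here on a made-up data frame `toy` (hypothetical values, not from `jtrain2`), is to average the indicator, overall or within groups via base R's `tapply()`:

```r
# Hypothetical toy data: 4 trained and 6 untrained men (not from jtrain2)
toy <- data.frame(
  train  = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0),
  unem78 = c(1, 0, 0, 0, 1, 1, 1, 0, 0, 0)
)

# Fraction trained: the mean of a 0/1 indicator is the share of ones
frac_train <- mean(toy$train)

# Fraction unemployed within each training group
frac_unem <- tapply(toy$unem78, toy$train, mean)

frac_train
frac_unem
```

In this toy sample 40% of the men are trained, and the unemployment fraction is 0.25 among the trained and 0.5 among the untrained.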

Exercise 5

The data in FERTIL2.DTA were collected on women living in the Republic of Botswana in 1988. The variable children refers to the number of living children. The variable electric is a binary indicator equal to one if the woman’s home has electricity, and zero if not.

  1. Find the smallest and largest values of children in the sample. What is the average of children?
  2. What percentage of women have electricity in the home?
  3. Compute the average of children for those without electricity and do the same for those with electricity. Comment on what you find.
  4. From question 3), can you infer that having electricity “causes” women to have fewer children? Explain.
Sample Solution
library('wooldridge')
#dplyr provides %>%, filter() and summarise() used below
library('dplyr')
data('fertil2')
?fertil2

#1)
min(fertil2$children)
max(fertil2$children)
mean(fertil2$children)

#2)
#average the 0/1 indicator; na.rm = TRUE guards against missing values in electric
mean(fertil2$electric, na.rm = TRUE)*100

#3)
fertil2 %>% filter(electric==0) %>% summarise(mean(children))
fertil2 %>% filter(electric==1) %>% summarise(mean(children))

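#4)
#No: this is only a difference in averages. Households with electricity may also differ in income, education, or urban residence, all of which can affect fertility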
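
When an indicator contains missing values, counting rows with a logical subset can overstate the count, because R keeps a row of `NA`s for every `NA` in the condition. A minimal sketch on a made-up data frame `toy` (hypothetical, not from `fertil2`):

```r
# Hypothetical toy data with two missing values in electric
toy <- data.frame(
  electric = c(1, 0, NA, 1, 0, NA),
  children = c(2, 3, 1, 0, 4, 2)
)

# Logical subsetting keeps one all-NA row per NA in the condition
n_rows_subset <- nrow(toy[toy$electric == 1, ])

# Counting TRUE values directly ignores the NAs
n_with_electric <- sum(toy$electric == 1, na.rm = TRUE)

# Percentage with electricity among the non-missing observations
pct_electric <- mean(toy$electric, na.rm = TRUE) * 100

c(n_rows_subset = n_rows_subset, n_with_electric = n_with_electric,
  pct_electric = pct_electric)
```

Here the row count is 4 although only 2 observations have `electric == 1`, and the percentage among non-missing observations is 50.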

Exercise II: The Simple Linear Regression Model

Exercise 6

The data in 401K.RAW are a subset of data analyzed by Papke (1995) to study the relationship between participation in a 401(k) pension plan and the generosity of the plan. The variable prate is the percentage of eligible workers with an active account; this is the variable we would like to explain. The measure of generosity is the plan match rate, mrate. This variable gives the average amount the firm contributes to each worker’s plan for each $1 contribution by the worker. For example, if mrate = 0.50, then a $1 contribution by the worker is matched by a 50¢ contribution by the firm.

  1. Find the average participation rate and the average match rate in the sample of plans.
  2. Now, estimate the simple regression equation \(\widehat{prate} = \hat{\beta_0} + \hat{\beta_1}mrate,\) and report the results along with the sample size and R-squared.
  3. Interpret the intercept in your equation. Interpret the coefficient on mrate.
  4. Find the predicted prate when mrate = 3.5. Is this a reasonable prediction? Explain what is happening here.
  5. How much of the variation in prate is explained by mrate? Is this a lot in your opinion?
Sample Solution
library('wooldridge')
data('k401k')
?k401k

#1)
mean(k401k$prate)
mean(k401k$mrate)

#2)
LinMod <- lm(prate ~ mrate, data=k401k)
summary(LinMod)

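#3)
#Intercept is 83.0755: the predicted participation rate is about 83.08% when mrate = 0
#If mrate increases by 1 (one more dollar matched per dollar contributed), prate is predicted to rise by 5.86 percentage points

#4)
predict(LinMod, newdata = data.frame(mrate = 3.5))
#No: 83.0755 + 5.86*3.5 is about 103.6, but a participation rate cannot exceed 100%
#A linear model does not respect the 0-100 bounds, which shows up for large values of mrate

#5)
#R-squared=0.075, so mrate explains about 7.5% of the variation in prate. No, this is not a lot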
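
The prediction in part 4 can be checked by hand from the two coefficients reported in the solution. A minimal sketch of that arithmetic; the helper `predict_prate` is a made-up name, and the slope is the rounded value 5.86:

```r
# Coefficients reported in the solution above (slope rounded to 5.86)
b0 <- 83.0755
b1 <- 5.86

# Hypothetical helper: fitted participation rate for a given match rate
predict_prate <- function(mrate) {
  b0 + b1 * mrate
}

predict_prate(0)          # the intercept
predict_prate(3.5)        # prediction asked for in part 4
predict_prate(3.5) > 100  # TRUE: outside the possible range of a percentage
```

With the rounded slope the prediction is 103.5855, which is above the 100% ceiling of a participation rate.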

Exercise 7

The data set in CEOSAL2.RAW contains information on chief executive officers for U.S. corporations. The variable salary is annual compensation, in thousands of dollars, and ceoten is prior number of years as company CEO.

  1. Find the average salary and the average tenure in the sample.
  2. How many CEOs are in their first year as CEO (that is, ceoten = 0)? What is the longest tenure as a CEO?
  3. Estimate the simple regression model \(log(salary) = \beta_0 + \beta_1ceoten + u,\) and report your results in the usual form. What is the (approximate) predicted percentage increase in salary given one more year as a CEO?
Sample Solution
library('wooldridge')
data('ceosal2')
?ceosal2

#1)
mean(ceosal2$salary)
mean(ceosal2$ceoten)

#2)
max(ceosal2$ceoten)
nrow(ceosal2[(ceosal2$ceoten==0), ])

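#3)
LinMod <- lm(log(salary) ~ ceoten, data=ceosal2)
summary(LinMod)
#the approximate percentage increase in salary per extra year is 100 times the coefficient on ceoten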
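
In a log-level model, 100 times the slope is only an approximation to the percentage change; the exact change is 100·(exp(slope) − 1). The sketch below compares the two for made-up slope values (both helpers and both numbers are hypothetical, not estimates from `ceosal2`):

```r
# Approximate percentage change implied by a log-level slope b
approx_pct <- function(b) 100 * b

# Exact percentage change implied by the same slope
exact_pct <- function(b) 100 * (exp(b) - 1)

# Hypothetical slopes: one small, one large
b_small <- 0.01
b_large <- 0.30

c(approx = approx_pct(b_small), exact = exact_pct(b_small))
c(approx = approx_pct(b_large), exact = exact_pct(b_large))
```

For a small slope the two numbers nearly coincide, while for a large slope the approximation understates the exact change.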

Exercise 8

Use the data in SLEEP75.RAW from Biddle and Hamermesh (1990) to study whether there is a tradeoff between the time spent sleeping per week and the time spent in paid work. We could use either variable as the dependent variable. For concreteness, estimate the model \(sleep = \beta_0 + \beta_1totwrk + u,\) where sleep is minutes spent sleeping at night per week and totwrk is total minutes worked during the week.

  1. Report your results in equation form along with the number of observations and R-squared. What does the intercept in this equation mean?
  2. If totwrk increases by 2 hours, by how much is sleep estimated to fall? Do you find this to be a large effect?
Sample Solution
library('wooldridge')
data('sleep75')
?sleep75

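#1)
LinMod <- lm(sleep ~ totwrk, data=sleep75)
summary(LinMod)
#the intercept is the predicted weekly sleep, in minutes, for someone who does not work (totwrk = 0)

#2)
#2 hours = 120 minutes, so sleep is estimated to fall by 120*0.15075=18.09 minutes per week, a small effect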
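
The calculation in part 2 scales the slope by the change in work time, with both variables measured in minutes per week. A short sketch, assuming the slope on totwrk is -0.15075 (the magnitude used in the solution, with a negative sign because sleep falls); `sleep_change` is a made-up helper name:

```r
# Assumed slope on totwrk: minutes of sleep per extra minute of work
slope_totwrk <- -0.15075

# Hypothetical helper: predicted change in weekly sleep, in minutes
sleep_change <- function(extra_work_minutes) {
  slope_totwrk * extra_work_minutes
}

sleep_change(2 * 60)   # two more hours of work per week
sleep_change(5 * 60)   # one more hour on each of five days
```

Converting hours to minutes before multiplying keeps the units consistent with how sleep and totwrk are recorded.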

Exercise 9

For the population of firms in the chemical industry, let rd denote annual expenditures on research and development, and let sales denote annual sales (both are in millions of dollars).

  1. Write down a model (not an estimated equation) that implies a constant elasticity between rd and sales. Which parameter is the elasticity?
  2. Now, estimate the model using the data in RDCHEM.RAW. Write out the estimated equation in the usual form. What is the estimated elasticity of rd with respect to sales? Explain in words what this elasticity means.
Sample Solution
library('wooldridge')
data('rdchem')
?rdchem

#1) A constant elasticity model is log-log: log(rd) = beta_0 + beta_1*log(sales) + u, and the slope beta_1 is the elasticity of rd with respect to sales

#2)
LinMod <- lm(log(rd) ~ log(sales), data=rdchem)
summary(LinMod)
#when sales increase by 1%, R&D spending is estimated to increase by about 1.08%
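
To see why the log-log slope is an elasticity, it helps to build data from a known constant-elasticity relationship and recover the parameter. The sketch below is a noise-free illustration with made-up numbers (the elasticity 1.1, the scale factor, and the `sales` values are all hypothetical, not from `rdchem`):

```r
# Hypothetical constant-elasticity relationship: rd = exp(-4) * sales^1.1
elasticity <- 1.1
rd_of <- function(sales) exp(-4) * sales^elasticity

# Made-up sales figures (millions of dollars)
sales <- c(50, 100, 500, 1000, 5000)

# A log-log regression recovers the elasticity as its slope
fit <- lm(log(rd_of(sales)) ~ log(sales))
slope <- unname(coef(fit)[2])

# Percentage change in rd when sales rise by 1%, starting from sales = 100
pct_change_rd <- (rd_of(101) / rd_of(100) - 1) * 100

c(slope = slope, pct_change_rd = pct_change_rd)
```

The slope equals the assumed elasticity, and a 1% rise in sales raises rd by very nearly 1.1%, whatever the starting level of sales.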