Statistics 1: The Basics of Statistics
Exercise I: Descriptive Statistics
Exercise 1
Use the data in WAGE1.RAW for this exercise.
- Find the average education level in the sample. What are the lowest and highest years of education?
- Find the average hourly wage in the sample. Does it seem high or low?
- The wage data are reported in 1976 dollars. Using the Economic Report of the President (2011 or later), obtain and report the Consumer Price Index (CPI) for the years 1976 and 2010.
- Use the CPI values from 3) to find the average hourly wage in 2010 dollars. Now does the average hourly wage seem reasonable?
- How many women are in the sample? How many non-women?
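The conversion asked for in the fourth question is a single rescaling by the ratio of two price index values. A minimal sketch of that arithmetic follows; the function name `convert_dollars` is illustrative, and the index values in the example are placeholders, not the actual CPI figures for 1976 and 2010, which still have to be looked up.

```python
def convert_dollars(amount, cpi_from, cpi_to):
    """Re-express an amount in 'from'-year dollars in 'to'-year dollars."""
    if cpi_from <= 0 or cpi_to <= 0:
        raise ValueError("CPI values must be positive")
    # Scale by the ratio of the target-year index to the base-year index.
    return amount * cpi_to / cpi_from


# Placeholder index values chosen for easy arithmetic (NOT real CPI data).
example_wage = convert_dollars(5.0, cpi_from=50.0, cpi_to=200.0)
print(example_wage)
```

Because only the ratio of the two index values matters, the base year of the CPI series does not affect the result as long as both values come from the same series.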
Exercise 2
Use the data in BWGHT.RAW to answer this question.
- How many women are in the sample, and how many report smoking during pregnancy?
- What is the average number of cigarettes smoked per day? Is the average a good measure of the “typical” woman in this case? Explain.
- Among women who smoked during pregnancy, what is the average number of cigarettes smoked per day? How does this compare with your answer from 2), and why?
- Find the average of fatheduc in the sample. Why are only 1,192 observations used to compute this average?
- Report the average family income and its standard deviation in dollars.
Exercise 3
The data in MEAP01.RAW are for the state of Michigan in the year 2001. Use these data to answer the following questions.
- Find the largest and smallest values of math4. Does the range make sense? Explain.
- How many schools have a perfect pass rate on the math test? What percentage is this of the total sample?
- How many schools have math pass rates of exactly 50%?
- Compare the average pass rates for the math and reading scores. Which test is harder to pass?
- Find the correlation between math4 and read4. What do you conclude?
- The variable exppp is expenditure per pupil. Find the average of exppp along with its standard deviation. Would you say there is wide variation in per-pupil spending?
- Suppose School A spends $6,000 per student and School B spends $5,500 per student. By what percentage does School A’s spending exceed School B’s?
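For the last question, the point to watch is which figure serves as the base of the percentage. A small sketch, using made-up amounts rather than the figures in the question (the helper name `pct_excess` is illustrative):

```python
def pct_excess(a, b):
    """Percentage by which a exceeds b, measured relative to b."""
    if b == 0:
        raise ValueError("the base value b must be nonzero")
    return 100.0 * (a - b) / b


# Made-up amounts: 120 exceeds 100 by 20 percent of the base value 100.
print(pct_excess(120.0, 100.0))
# Swapping the roles changes the base, so the magnitude differs.
print(pct_excess(100.0, 120.0))
```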
Exercise 4
The data in JTRAIN2.RAW come from a job training experiment conducted for low-income men during 1976–1977.
- Use the indicator variable train to determine the fraction of men receiving job training.
- The variable re78 is earnings from 1978, measured in thousands of 1982 dollars. Find the averages of re78 for the sample of men receiving job training and the sample not receiving job training. Is the difference economically large?
- The variable unem78 is an indicator of whether a man is unemployed or not in 1978. What fraction of the men who received job training are unemployed? What about for men who did not receive job training? Comment on the difference.
- From questions 2 and 3, does it appear that the job training program was effective? What would make our conclusions more convincing?
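The comparisons in this exercise all come down to averaging an outcome separately for the two values of a 0/1 indicator. The sketch below shows that computation on a tiny made-up sample; the lists `outcome` and `treated` are invented for illustration and are not taken from JTRAIN2.RAW.

```python
def fraction_of_ones(indicator):
    """Share of observations for which a 0/1 indicator equals one."""
    return sum(indicator) / len(indicator)


def mean_by_group(values, indicator):
    """Average of `values` within each group defined by a 0/1 indicator."""
    groups = {0: [], 1: []}
    for value, flag in zip(values, indicator):
        groups[flag].append(value)
    # Skip a group with no observations rather than dividing by zero.
    return {flag: sum(vs) / len(vs) for flag, vs in groups.items() if vs}


# Invented data: five observations, two of them in the "treated" group.
outcome = [4.0, 6.0, 1.0, 3.0, 5.0]
treated = [1, 1, 0, 0, 0]

share_treated = fraction_of_ones(treated)
means = mean_by_group(outcome, treated)
difference = means[1] - means[0]
print(share_treated, means, difference)
```

The same `mean_by_group` call works when the outcome is itself a 0/1 indicator, in which case each group mean is a fraction.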
Exercise 5
The data in FERTIL2.DTA were collected on women living in the Republic of Botswana in 1988. The variable children refers to the number of living children. The variable electric is a binary indicator equal to one if the woman’s home has electricity, and zero if not.
- Find the smallest and largest values of children in the sample. What is the average of children?
- What percentage of women have electricity in the home?
- Compute the average of children for those without electricity and do the same for those with electricity. Comment on what you find.
- From question 3), can you infer that having electricity “causes” women to have fewer children? Explain.
Exercise II: The Simple Linear Regression Model
Exercise 6
The data in 401K.RAW are a subset of data analyzed by Papke (1995) to study the relationship between participation in a 401(k) pension plan and the generosity of the plan. The variable prate is the percentage of eligible workers with an active account; this is the variable we would like to explain. The measure of generosity is the plan match rate, mrate. This variable gives the average amount the firm contributes to each worker’s plan for each $1 contribution by the worker. For example, if mrate = 0.50, then a $1 contribution by the worker is matched by a 50¢ contribution by the firm.
- Find the average participation rate and the average match rate in the sample of plans.
- Now, estimate the simple regression equation \(\widehat{prate} = \hat{\beta}_0 + \hat{\beta}_1 mrate,\) and report the results along with the sample size and R-squared.
- Interpret the intercept in your equation. Interpret the coefficient on mrate.
- Find the predicted prate when mrate = 3.5. Is this a reasonable prediction? Explain what is happening here.
- How much of the variation in prate is explained by mrate? Is this a lot in your opinion?
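The quantities requested in this exercise (intercept, slope, R-squared, and a prediction at a chosen value of the regressor) can all be computed from the textbook formulas for simple regression. The following is a minimal sketch in plain Python on four invented data points; it does not read 401K.RAW, and the names `simple_ols` and `predict` are illustrative.

```python
def simple_ols(x, y):
    """Return (intercept, slope, r_squared) for a regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    # R-squared = 1 - (residual sum of squares) / (total sum of squares).
    sst = sum((yi - y_bar) ** 2 for yi in y)
    ssr = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return intercept, slope, 1.0 - ssr / sst


def predict(intercept, slope, x0):
    """Fitted value of the estimated line at x0."""
    return intercept + slope * x0


# Invented data, for illustration only.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 4.0, 8.0]
b0, b1, r2 = simple_ols(x, y)
print(b0, b1, r2, predict(b0, b1, 10.0))
```

Note that `predict` will return a number for any `x0`, including values far outside the range of the data or fitted values that the dependent variable cannot actually take, so it is worth checking a prediction against the variable's definition.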
Exercise 7
The data set in CEOSAL2.RAW contains information on chief executive officers for U.S. corporations. The variable salary is annual compensation, in thousands of dollars, and ceoten is prior number of years as company CEO.
- Find the average salary and the average tenure in the sample.
- How many CEOs are in their first year as CEO (that is, ceoten = 0)? What is the longest tenure as a CEO?
- Estimate the simple regression model \(\log(salary) = \beta_0 + \beta_1 ceoten + u,\) and report your results in the usual form. What is the (approximate) predicted percentage increase in salary given one more year as a CEO?
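When the dependent variable is in logs and the regressor is in levels, the slope times 100 gives an approximate percentage change, which is why the question says "(approximate)". The sketch below compares that approximation with the exact change implied by the log model. The coefficient 0.05 is a hypothetical value chosen for illustration, not an estimate from CEOSAL2.RAW.

```python
import math


def approx_pct_change(beta1, delta_x=1.0):
    """Approximate percentage change in y: 100 * beta1 * delta_x."""
    return 100.0 * beta1 * delta_x


def exact_pct_change(beta1, delta_x=1.0):
    """Exact percentage change implied by log(y) = b0 + b1*x."""
    return 100.0 * (math.exp(beta1 * delta_x) - 1.0)


hypothetical_beta1 = 0.05  # assumed value, for illustration only
print(approx_pct_change(hypothetical_beta1))
print(exact_pct_change(hypothetical_beta1))
```

The two figures are close when the coefficient is small and drift apart as it grows.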
Exercise 8
Use the data in SLEEP75.RAW from Biddle and Hamermesh (1990) to study whether there is a tradeoff between the time spent sleeping per week and the time spent in paid work. We could use either variable as the dependent variable. For concreteness, estimate the model \(sleep = \beta_0 + \beta_1 totwrk + u,\) where sleep is minutes spent sleeping at night per week and totwrk is total minutes worked during the week.
- Report your results in equation form along with the number of observations and R-squared. What does the intercept in this equation mean?
- If totwrk increases by 2 hours, by how much is sleep estimated to fall? Do you find this to be a large effect?
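Since both variables are measured in minutes while the second question states the change in hours, the change in totwrk has to be converted before it is multiplied by the slope. A short sketch of that step, with a hypothetical slope of -0.25 standing in for the estimate you obtain:

```python
MINUTES_PER_HOUR = 60


def predicted_change(slope, delta_hours):
    """Predicted change in the dependent variable (in minutes) when the
    regressor, also measured in minutes, rises by `delta_hours` hours."""
    return slope * delta_hours * MINUTES_PER_HOUR


hypothetical_slope = -0.25  # assumed value, not an estimate from SLEEP75.RAW
change = predicted_change(hypothetical_slope, 2)
print(change)
```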
Exercise 9
For the population of firms in the chemical industry, let rd denote annual expenditures on research and development, and let sales denote annual sales (both are in millions of dollars).
- Write down a model (not an estimated equation) that implies a constant elasticity between rd and sales. Which parameter is the elasticity?
- Now, estimate the model using the data in RDCHEM.RAW. Write out the estimated equation in the usual form. What is the estimated elasticity of rd with respect to sales? Explain in words what this elasticity means.