2024-07-11
Simple approach to setting a consistent sample in Stata.
New Ph.D. students (including me, when I was one) tend to hoard data, using all the available data for each model. This isn’t wrong on its face, but it’s also useful to establish a single sample that the reader can follow through the paper with minimal deviations. This also makes your job as a writer easier, as you can describe the sample once and move on. I end up explaining this once to each student I work with, so I’m writing it up here. As always, feel free to email with questions!
Consider a dataset of random numbers, some of them missing.
. // inspect the data
. list in 1/5
     +--------------------------------------------------------------------------+
     | id   year   quarter           y   x1       x2   x3         x4         x5 |
     |--------------------------------------------------------------------------|
  1. |  1   2001         1    1.38e+11    1   668659    .   1.66e+10   56843.59 |
  2. |  1   2001         2   -5.01e+11    1   668659    .   1.66e+10   64507.01 |
  3. |  1   2001         3   -2.68e+11    1   668659    .   1.66e+10    50003.9 |
  4. |  1   2001         4    2.67e+12    1   668659    .   1.66e+10   57500.11 |
  5. |  1   2002         1    3.84e+12    1   668659    .   7.18e+10   52113.82 |
     +--------------------------------------------------------------------------+
If we summarize the data, we see that there are 1,600 observations, but three of the variables (x1, x3, and x4) have missing values.
. // summarize
. su y x*
    Variable |        Obs        Mean   Std. dev.        Min        Max
-------------+---------------------------------------------------------
           y |      1,600   -5.95e+10    2.48e+12  -1.79e+13   1.31e+13
          x1 |      1,200           2     .816837          1          3
          x2 |      1,600    432333.3    199654.9   118209.1     668659
          x3 |      1,120    9.535714    5.987668          1         20
          x4 |        816    3.69e+10    2.74e+10   2.22e+08   1.52e+11
-------------+---------------------------------------------------------
          x5 |      1,600    53703.42    5514.862   35966.54   71861.57
Now we can fit a series of models with different data requirements. Stata drops any observation with a missing value in one of a model’s variables, so each model ends up with its own sample.
. // fit a series of models
. quietly:eststo:regress y x2 x4
. quietly:eststo:regress y x2 x4 x1
. quietly:eststo:regress y x2 x4 x1 x3
. // tabulate and note sample sizes
. esttab
------------------------------------------------------------
                      (1)             (2)             (3)
                        y               y               y
------------------------------------------------------------
x2               423051.8      -6999882.8      -1833948.3
                   (0.52)         (-0.66)         (-0.17)
x4                  2.853           3.389          -6.843
                   (0.89)          (0.87)         (-1.59)
x1                           -8.33482e+11    -1.88427e+11
                                  (-0.72)         (-0.16)
x3                                           -1.47517e+10
                                                  (-0.81)
_cons        -1.50561e+11     3.36708e+12     1.23826e+12
                  (-0.71)          (0.67)          (0.24)
------------------------------------------------------------
N                     780             552             436
------------------------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001
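The shrinking N comes from listwise deletion: each model keeps only the observations with no missing values in its own variables. Here is a minimal sketch of the same effect. It uses Stata’s built-in auto dataset rather than the data above, on the assumption that rep78 is the only one of these three variables with missing values there.

```stata
// Illustration on the built-in auto data, not the dataset from this post
sysuse auto, clear

// Model without rep78: price and mpg are never missing
quietly regress price mpg
display "N without rep78: " e(N)
count if !missing(price, mpg)

// Adding rep78 drops the cars where rep78 is missing
quietly regress price mpg rep78
display "N with rep78:    " e(N)
count if !missing(price, mpg, rep78)
```

In each pair, the count of complete cases should match e(N), which is the sense in which every model quietly defines its own sample.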
So far so good, but what if we want one sample to use in both the summary and model tables? The answer is in ereturn list. After an estimation command runs, it saves its results in e(), which you can inspect with ereturn list; this is where commands like esttab get the information that they tabulate. One of these results is e(sample), an indicator equal to one for each observation that was included in the estimation and zero otherwise. We can assign it to a variable and use that variable to define our sample as follows:
. // fit the most restrictive model
. quietly: regress y x2 x4 i.x1 x3
. generate mainsample = e(sample)
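If you would rather define the sample without fitting a model first, Stata’s mark and markout commands build the same kind of indicator directly from a variable list. This is a different route from the e(sample) approach used here, sketched on the built-in auto dataset; the variable name touse is just a placeholder.

```stata
// Illustration on the built-in auto data, not the dataset from this post
sysuse auto, clear

// Route 1: e(sample) from the most restrictive model
quietly regress price mpg weight rep78
generate byte mainsample = e(sample)

// Route 2: start with everyone marked, then unmark rows with missing values
mark touse
markout touse price mpg weight rep78

// For a plain regress like this one, the two indicators should agree
assert mainsample == touse
count if mainsample == 1
```

The e(sample) route has the advantage that it also respects any if or in conditions in the model itself, so it is the safer default when the most restrictive model is already written.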
Now we can use mainsample to run all the models on the same sample:
. // drop the estimates stored earlier so the table only shows the new ones
. eststo clear
. quietly:eststo:regress y x2 x4 if mainsample == 1
. quietly:eststo:regress y x2 x4 x1 if mainsample == 1
. quietly:eststo:regress y x2 x4 x1 x3 if mainsample == 1
. // tabulate and note sample sizes
. esttab
------------------------------------------------------------
                      (1)             (2)             (3)
                        y               y               y
------------------------------------------------------------
x2             -1941846.0      -3240948.1*     -3131194.0
                  (-1.63)         (-2.01)         (-1.91)
x4                 -7.289          -8.035*         -7.841*
                  (-1.90)         (-2.06)         (-2.00)
x1                           -2.23118e+11    -2.14745e+11
                                  (-1.20)         (-1.14)
x3                                            7.39664e+09
                                                   (0.39)
_cons         1.22904e+12     2.40732e+12*    2.25026e+12
                   (1.89)          (2.04)          (1.80)
------------------------------------------------------------
N                     424             424             424
------------------------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001
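The same indicator works for the summary table, so the descriptive statistics and the models describe the same observations. A minimal sketch, again on the built-in auto dataset rather than the data above:

```stata
// Illustration on the built-in auto data, not the dataset from this post
sysuse auto, clear

// Define the sample from the most restrictive model
quietly regress price mpg weight rep78
generate byte mainsample = e(sample)
local n_model = e(N)

// Summarize only the observations the models use
summarize price mpg weight rep78 if mainsample == 1

// Every variable should now report the same Obs as the model's N
assert r(N) == `n_model'
```

With the data from this post, the equivalent would be su y x* if mainsample == 1, after which every row of the summary should show the same number of observations as the model table.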