
Stata has some sharp edges.

There are a lot of things that Stata does really well. That’s not what this is about. Today is about the little sharp edges in Stata that will cut you if you bump into them.

Missing numbers aren’t missing at all, they are INFINITY.

Stata indicates missing numeric values with ‘.’. You’ll see this used in something like subinstr("This old string", "old", "new", .), where the . means "replace every occurrence", that is, INFINITY! This behavior is deeply integrated into programming practice: it allows unspecified upper limits (see help rangejoin for an example) and moves all missing data to the bottom of the data set when sorting.

However, this is a very sharp edge for folks using inequalities to subset their data. For example, here is a random variable with one missing value:

set obs 10
set seed 1
generate id = _n
generate x = rnormal() if _n != 1

Here is the result:

     +-----------+
     |        x  |
     |-----------|
  1. |         . | <-- missing value
  2. |  .4870331 |
  3. |  .5545321 |
  4. | -.5739419 |
  5. | -1.683186 |
     |-----------|
  6. |  .2000261 |
  7. |  2.053563 |
  8. | -1.287491 |
  9. |  .7676956 |
 10. |  .5712904 |
     +-----------+

And now let’s generate a variable equal to one if $x>\mu$, where $\mu$ is the mean of x:

su x, meanonly 
generate d = x > r(mean)

A quick note on the syntax generate d = x > r(mean):

This creates a variable (column) equal to the value of the expression x > r(mean). In this expression, x is the value of the random variable x created above, and r(mean) is the mean of x calculated by su x, meanonly. In Stata, expressions like this return 1 if true and 0 if false, and generate applies the expression to each observation. Here is the result:

     +---------------+
     |        x    d |
     |---------------|
  1. |         .   1 | <-- is this what you wanted?
  2. |  .4870331   1 |
  3. |  .5545321   1 |
  4. | -.5739419   0 |
  5. | -1.683186   0 |
     |---------------|
  6. |  .2000261   1 |
  7. |  2.053563   1 |
  8. | -1.287491   0 |
  9. |  .7676956   1 |
 10. |  .5712904   1 |
     +---------------+
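
The same behavior can be checked without any data. A small sketch for a do-file, using display with literal values rather than the variables above:

```stata
* Missing (.) compares as larger than any nonmissing number.
display 3 > 2        // 1: true
display 2 > 3        // 0: false
display . > 2        // 1: missing counts as "greater than"
display . > 1e300    // 1: even against a very large number
```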

Stata does not have missing numbers! It has INFINITY.

sort x 

sort treats . as the largest value, so it lands at the bottom!

     +-----------+
     |         x |
     |-----------|
  1. | -1.683186 |
  2. | -1.287491 |
  3. | -.5739419 |
  4. |  .2000261 |
  5. |  .4870331 |
     |-----------|
  6. |  .5545321 |
  7. |  .5712904 |
  8. |  .7676956 |
  9. |  2.053563 |
 10. |         . |
     +-----------+

It turns out that you need to explicitly exclude the thing that you obviously don’t want:

su x, meanonly 
generate d1 = x > r(mean) if !mi(x) 

Gives you what you really obviously wanted:

     +--------------------+
     |         x   d   d1 |
     |--------------------|
  1. | -1.683186   0    0 |
  2. | -1.287491   0    0 |
  3. | -.5739419   0    0 |
  4. |  .2000261   1    1 |
  5. |  .4870331   1    1 |
     |--------------------|
  6. |  .5545321   1    1 |
  7. |  .5712904   1    1 |
  8. |  .7676956   1    1 |
  9. |  2.053563   1    1 |
 10. |         .   1    . | <-- this is what you wanted
     +--------------------+
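
The same guard applies whenever an inequality is used to subset or count. The following is a minimal sketch on made-up data (the values are illustrative, not the random draws above) showing two common ways to write it:

```stata
* Sketch: excluding missing values from an inequality.
clear
set obs 5
generate x = _n - 3      // -2, -1, 0, 1, 2
replace x = . in 5       // make the last observation missing

count if x > 0                  // 2: the missing observation is counted too
count if x > 0 & !missing(x)    // 1: only the nonmissing value above 0
count if x > 0 & x < .          // 1: same result, since . sorts above every number
```

Putting the condition inside the expression (x > r(mean) & !missing(x)) would give 0 for the missing observation, whereas the if qualifier used for d1 leaves it missing. Which one is right depends on what the indicator is for.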

Sorts are unstable BY DEFAULT!

This might be the biggest threat to reproducibility in the Stata-using population: whenever the sort key has ties, running the same code twice can yield different results.

The problem is how Stata handles ties when you sort. To see this, let’s expand the data set and add an id that will differentiate the observations.

clear
set obs 3
generate i = _n
expand 3
sort i
bysort i: generate j = _n // bysort is also unstable

yields:

     +-------+
     | i   j |
     |-------|
  1. | 1   1 |
  2. | 1   2 |
  3. | 1   3 |
  4. | 2   1 |
  5. | 2   2 |
     |-------|
  6. | 2   3 |
  7. | 3   1 |
  8. | 3   2 |
  9. | 3   3 |
     +-------+

Now we’ll sort on a random variable r, then use sort i to get a sort with j out of order within i, and then we’ll repeat the sort r, sort i pair a few times:

generate r = rnormal()
sort r
list

     +-------------------+
     | i   j           r |
     |-------------------|
  1. | 1   3   -2.796935 |
  2. | 3   1    -1.34123 |
  3. | 1   1   -.9382565 |
  4. | 2   3    -.613622 |
  5. | 2   2   -.0749784 |
     |-------------------|
  6. | 3   2    .4594351 |
  7. | 2   1    .6567299 |
  8. | 3   3     1.14646 |
  9. | 1   2     1.46703 |
     +-------------------+

sort i

list

     +-------------------+
     | i   j           r |
     |-------------------|
  1. | 1   2     1.46703 |
  2. | 1   1   -.9382565 |
  3. | 1   3   -2.796935 |
  4. | 2   1    .6567299 |
  5. | 2   2   -.0749784 |
     |-------------------|
  6. | 2   3    -.613622 |
  7. | 3   3     1.14646 |
  8. | 3   1    -1.34123 |
  9. | 3   2    .4594351 |
     +-------------------+

sort r

sort i

list

     +-------------------+
     | i   j           r |
     |-------------------|
  1. | 1   1   -.9382565 |
  2. | 1   3   -2.796935 |
  3. | 1   2     1.46703 |
  4. | 2   3    -.613622 |
  5. | 2   2   -.0749784 |
     |-------------------|
  6. | 2   1    .6567299 |
  7. | 3   2    .4594351 |
  8. | 3   3     1.14646 |
  9. | 3   1    -1.34123 |
     +-------------------+

sort r

sort i

list

     +-------------------+
     | i   j           r |
     |-------------------|
  1. | 1   1   -.9382565 |
  2. | 1   2     1.46703 |
  3. | 1   3   -2.796935 |
  4. | 2   1    .6567299 |
  5. | 2   2   -.0749784 |
     |-------------------|
  6. | 2   3    -.613622 |
  7. | 3   1    -1.34123 |
  8. | 3   3     1.14646 |
  9. | 3   2    .4594351 |
     +-------------------+


Notice that the order within each i is different each time, even though sort r returns the observations to the same order before every sort i.

This is because Stata chooses a random tiebreaker each time it sorts!

There are two solutions:

  1. Only sort by unique identifiers. This is good practice even when you break ties arbitrarily, because it makes you explicit about how you are doing it.
  2. Use sort, stable. This is useful when sorting on values (rather than ids) that may not be unique. Note that it does not solve the bysort version of this problem.

One way to do this:

local sort_vars group individual 
isid `sort_vars' // confirms that these identify unique obs  
sort `sort_vars', stable

The isid line makes the , stable redundant. But getting in the habit of using sort, stable is a good way to remind yourself that this is not the default.
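
For the bysort version of the problem, one option is to name the tiebreaker in parentheses, so that the order within each group is fully determined. This is a sketch that rebuilds data like the i and r example above, and it assumes that i and r together identify unique observations (j2 is a name made up for illustration):

```stata
* Sketch: reproducible within-group numbering with an explicit tiebreaker.
clear
set seed 1
set obs 3
generate i = _n
expand 3
generate r = rnormal()

isid i r                          // fails loudly if (i, r) is not unique
bysort i (r): generate j2 = _n    // sort by i, then by r within i; number within i
list
```

Because r decides the order within i, repeating sort r and then the bysort line should give the same j2 every time.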

As an extra special bonus, this leads to duplicates drop dropping different observations each time you run your code, because duplicates drop has an unstable sort inside it.
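
This matters most when the duplicates are defined on a subset of variables (as in duplicates drop group, force), because the dropped observations can differ in the other variables. A hedged alternative is to choose the survivor yourself. The sketch below uses made-up data; group and individual follow the names used above, and score is hypothetical:

```stata
* Sketch: keep one observation per group, chosen explicitly.
clear
input group individual score
1 1 10
1 2 20
2 1 30
2 2 40
2 3 50
end

isid group individual                         // the tiebreaker must be unique within group
bysort group (individual): keep if _n == 1    // keep the lowest individual in each group
list
```

The rule "lowest individual" is arbitrary here. The point is only that the rule is written down, so the same observation survives on every run.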