The caveat

I’ve never written software for a living. I’m reasonably adept at Stata and Python within the domains that are useful to me. R, Julia, SAS, and Matlab are in the back of my toolbox, if you will, but not part of my day-to-day workflow.

Two simple goals

  1. Manage versioning and collaboration in a lightweight but effective way that allows authors to work both simultaneously on different tasks and asynchronously on the same tasks, with minimal mistakes and coordination effort. The initial creation of the script for a data task should take a manageable block of time, and subsequent edits should be simple to make, incorporate, and track.

  2. At submission, have a shareable set of original data files, extraction scripts, processing scripts, analysis scripts, and a LaTeX (or Markdown) draft that other users (co-authors, referees, etc.) can compile and evaluate with minimal guidance.

The approach

  • four integrated pipelines
    1. extract
    2. process
    3. analyze
    4. create draft
      • includes the writing step, which is its own topic
  • code for steps 1-3 is managed through GitHub.
    • Python, SAS, and others for extraction of data
    • Stata for processing and analysis
    • the Stata project command is key to tracking changes here (see the master do-file sketch after this list)
  • data is shared through Dropbox or some such. data folders should have some structure; my default (wired up as path globals in the sketch after this list) is:
    • /raw/: holds everything created by the “extract” step
    • /mediumraw/: holds intermediate datasets created by the process step. I often save an unlabeled version of the estimation data here
    • /final/: holds the final data for estimation, i.e., the data that produces the tables in the paper
  • the draft is maintained on Overleaf
    • the analysis step writes its output to an Overleaf-synced folder (see the esttab sketch after this list)
    • e.g., the project’s folder in Dropbox/Apps/Overleaf/project
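
To make the pipeline and the project command concrete, here is a minimal sketch of a master do-file, assuming the user-written project package from SSC. The do-file names and folder layout are placeholders, and the exact directives should be checked against help project.

```stata
* master.do -- sketch of a project-managed pipeline (file names are placeholders)
* requires the SSC "project" package: ssc install project

* each line registers a step with the project, so that
*   project master, build
* re-runs only the do-files whose code or inputs have changed
project, do("extract/01_extract.do")    // extract: pull raw data into /raw/
project, do("process/02_process.do")    // process: build /mediumraw/ and /final/ datasets
project, do("analyze/03_analyze.do")    // analyze: estimation, tables, and figures
```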
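
The data folders and the Overleaf-synced folder can be wired together with a small config do-file of path globals that every script runs first. This is a sketch under assumed paths; the user name, project name, and drive are placeholders to adjust per machine and co-author.

```stata
* config.do -- shared path globals (all paths below are placeholders)
global dropbox   "C:/Users/username/Dropbox"         // root of the shared Dropbox
global raw       "$dropbox/project/raw"              // everything created by the extract step
global mediumraw "$dropbox/project/mediumraw"        // intermediate datasets from the process step
global final     "$dropbox/project/final"            // estimation data behind the paper's tables
global overleaf  "$dropbox/Apps/Overleaf/project"    // Overleaf-synced draft folder
```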
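
With those globals in place, the analysis step can write tables directly into the Overleaf-synced folder, for example with esttab from the estout package (the dataset, variables, and file names below are made up for illustration):

```stata
* 03_analyze.do (excerpt) -- send a table straight to the Overleaf folder
* requires the SSC "estout" package: ssc install estout
use "$final/estimation.dta", clear          // hypothetical final estimation dataset
eststo clear
eststo: regress y x1 x2                     // placeholder specification
esttab using "$overleaf/tables/main_results.tex", replace booktabs label se
```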

Details on the approach

Other useful resources: