Essential Stata Commands for Clinical Epidemiology and Biostatistics [Stata 18 Beginner Command Guide]

Mayta
May 24, 2025

Stage of Analysis	Stata Command(s)
Import & Manage Data	use, import excel, label variable, gen, drop, recode
Explore Structure / Counts	describe, summarize, tab, codebook
Visualize Distributions	histogram, graph box, graph hbox
Summaries by Group	tabstat
Classical Hypothesis Tests	ttest, ranksum, tab (chi2, exact), oneway, kwallis
Risk & Diagnostic Accuracy	cs, diagt
Regression / Modelling	logistic, poisson
Incidence Rates	ir
Survival Analysis	stset, sts graph, sts list, stcox
Sample-Size Planning	sampsi (legacy; superseded by power)
Quick Display / Math	display

Why This Guide?

Stata’s syntax looks terse, but each command encapsulates a common research workflow—importing data, checking its quality, exploring patterns, testing hypotheses, and reporting results. By understanding what every command asks of you and what it returns, you can move from “running code” to “directing an analysis narrative.” The sections below follow the typical life-cycle of a project so you can dip in at any stage.

1 Prepare and Curate Your Dataset

• Open a Stata file with use "filename.dta", clear to load data and discard whatever is already in memory. Add clear only when you are certain nothing valuable is unsaved; without it, Stata refuses to replace a dataset that has unsaved changes.

• Import an Excel sheet with import excel using "filename.xlsx", firstrow clear, or through the menu (File ➜ Import ➜ Excel) when columns contain mixed formats, because the dialog lets you preview date conversions. For quick checks, a simple copy-and-paste into the Data Editor works too.

• Give variables human-readable labels using label variable age "Age at admission (years)". Clear labels future-proof your output, especially in shared projects.

• Create new variables with gen bmi = weight/(height^2), with weight in kilograms and height in metres. Stata evaluates the expression observation by observation, and a missing value in any term makes the result missing.

• Drop variables cautiously: drop temp_* deletes all variables that start with “temp_”. Never type drop _all unless you mean it: Stata will obey and erase every variable in memory. (drop on its own is simply a syntax error.)

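• Recode into categories via recode age (0/17=0) (18/64=1) (65/max=2), gen(agegrp) to group continuous measurements for tables or models while preserving the original variable. If age is stored with decimals, values strictly between 17 and 18 (or between 64 and 65) match no rule and are copied unchanged, so widen the ranges to cover them.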
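
The data-preparation steps above can be sketched in a few lines. This is only an illustration: it uses the auto dataset that ships with Stata as a stand-in for a clinical file, and the Excel file name and the new variable names (wt_per_len, mpgcat, temp_a, temp_b) are made up for the example.

```stata
* Load a dataset shipped with Stata (stand-in for your own .dta file)
sysuse auto, clear

* For an Excel source the equivalent would be (hypothetical file name):
* import excel using "mydata.xlsx", firstrow clear

* Attach a human-readable label
label variable mpg "Fuel economy (miles per gallon)"

* Derived variable: a missing value in weight or length gives a missing result
generate wt_per_len = weight / length
label variable wt_per_len "Weight per unit length (lb/in)"

* Group a numeric variable into labelled categories; the original mpg is kept
recode mpg (min/19 = 0 "Low") (20/29 = 1 "Medium") (30/max = 2 "High"), generate(mpgcat)
tabulate mpgcat

* Create two scratch variables, then drop them by pattern
generate temp_a = 1
generate temp_b = 2
drop temp_*
```

Running describe afterwards is a quick way to confirm that the scratch variables are gone and the new labels are in place.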

2 Inspect Structure and Summary Statistics

• describe lists every variable, storage type, and label so you can spot unintended string types or missing labels fast.

• summarize provides N, mean, standard deviation, minimum, and maximum for each variable; append , detail to add percentiles, skewness, and kurtosis when checking normality.

• tab sex counts each category; add , missing (abbreviated m) so that missing values appear as their own row instead of silently dropping out of the denominators.

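• codebook diagnosis is a one-stop audit for categorical variables: it shows the range, the number of distinct values, the count of missing values, and a tabulation with value labels, which makes out-of-range codes easy to spot.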
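
A minimal sketch of this inspection pass, again assuming the shipped auto dataset as a stand-in (its rep78 variable happens to contain missing values, which makes the missing option visible):

```stata
sysuse auto, clear

* Variable names, storage types, formats, and labels
describe

* N, mean, SD, min, max for selected variables
summarize price mpg

* Percentiles, skewness, and kurtosis for one variable
summarize price, detail

* Frequency table that keeps missing values as their own row
tabulate rep78, missing

* Range, distinct values, and missing count for a categorical variable
codebook rep78
```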

3 Visualise Distributions and Outliers

• histogram systolic_bp sketches the shape of continuous data; add , by(group) to place multiple histograms side-by-side for quick group comparisons.

• graph box creatinine (vertical) or graph hbox creatinine (horizontal) flags outliers at a glance; pick whichever orientation suits the page layout of your report.
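
As an illustration of the plotting commands above, the following sketch uses the shipped auto dataset; the graph names (hist_by, box_v, box_h) are arbitrary labels chosen here so that the three graphs can coexist in memory.

```stata
sysuse auto, clear

* One histogram panel per group, on a frequency scale
histogram mpg, frequency by(foreign) name(hist_by, replace)

* Vertical box plot, one box per group
graph box price, over(foreign) name(box_v, replace)

* Same plot, horizontal orientation
graph hbox price, over(foreign) name(box_h, replace)
```

A named graph can be written to disk afterwards with graph export, for example after bringing it to the front with graph display box_v.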

4 Summaries by Group

tabstat cholesterol, by(smoker) stat(n mean sd p50 p25 p75 min max) prints a compact table with counts and seven summary statistics covering centre and spread. Swap statistics as needed; Stata always lists them in your chosen order, making copy-pasting into manuscripts painless.
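
A runnable version of the same idea, sketched on the shipped auto dataset (price by foreign stands in for cholesterol by smoker). The columns(statistics) option is an optional layout choice that puts one group per row.

```stata
sysuse auto, clear

* One row per group, one column per statistic, in the order requested
tabstat price, by(foreign) statistics(n mean sd p50 p25 p75 min max) columns(statistics)
```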

5 Classical Hypothesis Tests

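• Two-group means (parametric): ttest haemoglobin, by(sex) assumes approximately normal data and, by default, equal variances; check normality with a histogram or Shapiro-Wilk (swilk haemoglobin) beforehand, and add , unequal when the group variances differ.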

• Two-group medians (non-parametric): ranksum pain_score, by(treatment) runs the Wilcoxon rank-sum (Mann-Whitney) test, which compares the two distributions by ranks and remains valid when the data are skewed.

• Categorical association: tab outcome exposure, col chi2 exact reports column percentages, Pearson’s χ², and Fisher’s exact test. Stata prints both p-values and does not choose between them; prefer Fisher’s exact when expected cell counts are small.

• Three-plus groups: oneway age vaccine_dose for ANOVA, or kwallis age, by(vaccine_dose) when the normality assumption is doubtful. Use the , bon option on oneway to request Bonferroni-adjusted pairwise comparisons in the same step.
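
The tests above can be tried end to end on the shipped auto dataset. This is a sketch for syntax only; the variable highmpg and its cut-off of 25 are invented here to create a binary outcome.

```stata
sysuse auto, clear

* Normality check before a t test
swilk mpg

* Two-group comparison of means: equal variances (default), then unequal
ttest mpg, by(foreign)
ttest mpg, by(foreign) unequal

* Rank-based alternative for skewed data
ranksum mpg, by(foreign)

* Binary outcome (hypothetical cut-off) for a 2x2 table
generate byte highmpg = mpg >= 25 if !missing(mpg)
tabulate highmpg foreign, column chi2 exact

* Three or more groups: ANOVA with Bonferroni comparisons, then Kruskal-Wallis
oneway price rep78, bonferroni tabulate
kwallis price, by(rep78)
```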

6 Diagnostic Test Accuracy

• Build a two-by-two table with tab gold_standard new_test. Sensitivity, specificity, PPV, and NPV can then be computed by hand with display, substituting the cell counts for TP, FN, and so on, e.g., display TP/(TP+FN) for sensitivity.

• For automated metrics and confidence intervals, install the community-contributed diagt once (ssc install diagt), then run diagt gold_standard new_test (reference variable first, test variable second) to see likelihood ratios and ROC area without further coding.
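
The manual route can be sketched with built-in commands only. The counts below are invented purely for illustration (45 true positives, 5 false negatives, 10 false positives, 90 true negatives), and the variable names gold and test are placeholders.

```stata
clear
* Made-up 2x2 counts: gold = reference standard, test = index test
input byte gold byte test int n
1 1 45
1 0 5
0 1 10
0 0 90
end
expand n
drop n

tabulate gold test

* Store each cell count in a scalar
quietly count if gold == 1 & test == 1
scalar TP = r(N)
quietly count if gold == 1 & test == 0
scalar FN = r(N)
quietly count if gold == 0 & test == 1
scalar FP = r(N)
quietly count if gold == 0 & test == 0
scalar TN = r(N)

display "Sensitivity = " %5.3f TP/(TP+FN)
display "Specificity = " %5.3f TN/(TN+FP)
display "PPV         = " %5.3f TP/(TP+FP)
display "NPV         = " %5.3f TN/(TN+FN)

* Exact binomial confidence interval for sensitivity (among the diseased)
ci proportions test if gold == 1
```

Note that PPV and NPV computed this way depend on the prevalence in the sample, so they only transfer to populations with a similar disease frequency.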

7 Risk and Odds Estimation

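• Risk ratio / difference: cs infection exposure yields the risk in each exposure group, the risk ratio, and the risk difference with confidence intervals in one go; append , exact for Fisher’s exact p-value in small samples. Both variables must be coded 0/1.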

• Crude odds ratio: logistic mortality smoker fits a univariable logistic model and shows the OR with its 95% CI by default.

• Adjusted odds: expand the model (logistic mortality smoker age sex bmi …) to control confounders; check linearity of continuous covariates before trusting the output.
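
A short sketch of the crude and adjusted estimates, assuming the cancer dataset that ships with Stata as a stand-in. The variable active is created here for the example; it treats drug code 1 as placebo and any other code as active treatment.

```stata
sysuse cancer, clear

* Binary exposure (hypothetical recoding): 0 = placebo, 1 = active drug
generate byte active = drug != 1

* Risks, risk ratio, and risk difference, with Fisher's exact p-value
cs died active, exact

* Crude odds ratio
logistic died active

* Odds ratio adjusted for age
logistic died active age
```

Comparing the crude and adjusted odds ratios gives a first impression of how much the covariate shifts the estimate.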

8 Rates and Poisson Models

• Incidence rate ratio: ir death group person_time expects three variables in this order: the event count (or a 0/1 event indicator), a 0/1 exposure variable, and the person-time at risk.

• Poisson regression: poisson death group, exposure(person_time) irr reproduces the IRR and allows multiple covariates. The exposure() option enters log person-time as an offset, and irr reports exponentiated coefficients as incidence rate ratios, sparing manual calculations.
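
To see that the two approaches agree, here is a sketch on the shipped cancer dataset, where studytime (months of follow-up) plays the role of person-time. The variable active is again a recoding invented for the example.

```stata
sysuse cancer, clear

* Binary exposure (hypothetical recoding): 0 = placebo, 1 = active drug
generate byte active = drug != 1

* Crude incidence rate ratio: events, exposure, person-time
ir died active studytime

* Same IRR from a Poisson model with person-time as exposure
poisson died active, exposure(studytime) irr

* Adjusted IRR: add covariates to the model
poisson died active age, exposure(studytime) irr
```

With a single binary covariate, the Poisson IRR should match the point estimate from ir; the confidence intervals can differ slightly because the two commands compute them differently.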

9 Survival Analysis

Create time-to-event in days (gen tte = date_exit - date_entry).
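Tell Stata the outcome structure: stset tte, failure(status), where status is 1 for an event and 0 for a censored observation.
Plot Kaplan–Meier curves with sts graph, adding , by(group) ci to overlay groups with pointwise confidence bands.
List survival probabilities at specific times using sts list, at(365 1095 1825). Times are in the units of tte, so these values correspond to roughly 1, 3, and 5 years when tte is in days; alternatively, add scale(365.25) to stset and request at(1 3 5), which is handy for speaking to clinicians who think in years rather than days.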
Fit a Cox model with stcox treatment age sex. Always test proportional hazards (e.g., estat phtest) before summarising results.
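
The survival steps above can be sketched on the shipped cancer dataset, in which studytime is recorded in months and died marks the event. The variable active and the time points 10, 20, and 30 are choices made for this example only.

```stata
sysuse cancer, clear

* Binary group (hypothetical recoding): 0 = placebo, 1 = active drug
generate byte active = drug != 1

* Declare the survival structure: time variable and event indicator
stset studytime, failure(died)

* Kaplan-Meier curves by group with pointwise confidence bands
sts graph, by(active) ci

* Survival probabilities at 10, 20, and 30 months, by group
sts list, by(active) at(10 20 30)

* Log-rank test comparing the groups
sts test active

* Cox model, followed by the Schoenfeld-residual test of proportional hazards
stcox active age
estat phtest, detail
```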

10 Self-Help While You Work

Typing help regress (or any command) opens the full syntax, options, and clickable examples. Browse examples first—they often reveal shortcuts not obvious from the syntax diagram.

11 Planning Sample Size Early

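• Compare two means: sampsi 12 14, sd1(4) sd2(4) p(0.8) a(0.05) oneside returns the required N per group under the stated power (p) and alpha (a). sampsi is a legacy command that Stata 13 superseded with power; the current counterpart is power twomeans 12 14, sd(4) power(0.8) alpha(0.05) onesided, whose result can differ slightly because it is based on a t test by default.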

• Compare two proportions: sampsi 0.15 0.30, p(0.8) a(0.05) is equally quick (power twoproportions 0.15 0.30, power(0.8) alpha(0.05) in the current syntax). Tests are two-sided by default; add oneside to request a one-sided test.
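
For readers on current Stata versions, the same two calculations can be sketched with the power command, using the numbers from the bullets above. The last line, which tabulates several power levels at once, is an extra convenience shown here and not part of the original examples.

```stata
* Two means: 12 vs 14, common SD 4, 80% power, one-sided alpha 0.05
power twomeans 12 14, sd(4) power(0.8) alpha(0.05) onesided

* Two proportions: 15% vs 30%, 80% power, two-sided alpha 0.05
power twoproportions 0.15 0.30, power(0.8) alpha(0.05)

* Required sample size across several power levels, shown as a table
power twoproportions 0.15 0.30, power(0.8 0.85 0.9) alpha(0.05) table
```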

Putting It All Together

A reproducible analysis flows like this:

Load or import data with full labels.
Inspect structure, correct types, and compute derived variables.
Explore distributions visually and numerically.
Run the appropriate hypothesis tests or models, checking assumptions at each step.
Present estimates with confidence intervals, not just p-values.
Use help whenever you doubt a syntax option.

By mastering the commands above and understanding the story each tells, you can drive a clinical research project from raw data to publishable insight almost entirely within Stata’s core toolset; diagt is the only community-contributed command used here.