Long vs. Wide Data: Concepts, Applications, and Implementation in Stata and R

Mayta
Nov 15, 2025

In data science and applied statistics, the structure of a dataset fundamentally affects how it can be analyzed, modeled, and visualized. Two dominant data structures are long form and wide form. Understanding the distinction between them is essential for efficient data management, especially when working with repeated measurements, panel data, surveys, or experiments.

Definition of Wide-Form Data

Wide-form data presents each observational unit in a single row. Repeated measurements or multiple variables of the same type are stored in separate columns. This structure resembles a typical spreadsheet and is often intuitive for human reading.

Example (Wide)

id	bp1	bp2	bp3
1	120	118	116
2	130	128	125

In this example, each patient has one row, and each time point is stored as a separate column.

When Wide Form Is Useful

Machine learning models requiring fixed feature columns.
Fast access to subject-level information.
Stata commands that operate across several columns within a row, such as egen with rowmean() or rowtotal().
Certain statistical models where each repeated measure is manually specified.
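To make the subject-level access point concrete, here is a minimal base R sketch that rebuilds the example table and computes per-patient summaries directly across columns. The names bp_cols, bp_mean, and bp_change are invented for this illustration.

```r
# Wide-form blood pressure data, copied from the example table above
wide_data <- data.frame(
  id  = c(1, 2),
  bp1 = c(120, 130),
  bp2 = c(118, 128),
  bp3 = c(116, 125)
)

# Columns holding the repeated measures (hypothetical helper name)
bp_cols <- c("bp1", "bp2", "bp3")

# One value per patient: the mean across the three readings
wide_data$bp_mean <- rowMeans(wide_data[, bp_cols])

# Change from the first to the last reading, column against column
wide_data$bp_change <- wide_data$bp3 - wide_data$bp1

print(wide_data)
```

Because every reading for a patient sits on the same row, both summaries are simple column arithmetic with no grouping step.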

Definition of Long-Form Data

Long-form data organizes each measurement as its own row. A separate variable records the time point, condition, or measurement type that each row refers to. Long form is highly compatible with statistical modeling and data visualization frameworks that assume one observation per row.

Example (Long)

id	time	bp
1	1	120
1	2	118
1	3	116
2	1	130
2	2	128
2	3	125

When Long Form Is Useful

Panel-data models (fixed effects, random effects, mixed models).
Repeated-measures ANOVA, longitudinal analysis.
Tidyverse workflows in R.
ggplot2 visualizations, which generally expect one observation per row.
Stata commands for panel models (xtset, xtreg).
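As a rough sketch of why modeling tools like this layout, the base R example below rebuilds the long table and feeds it to two formula-based functions. The patient-specific intercept in lm() is only a stand-in for the panel and mixed models listed above, not a recommended analysis.

```r
# Long-form blood pressure data, copied from the example table above
long_data <- data.frame(
  id   = c(1, 1, 1, 2, 2, 2),
  time = c(1, 2, 3, 1, 2, 3),
  bp   = c(120, 118, 116, 130, 128, 125)
)

# Mean blood pressure at each time point via the formula interface
mean_by_time <- aggregate(bp ~ time, data = long_data, FUN = mean)
print(mean_by_time)

# Linear time trend with a separate intercept per patient
# (illustrative only; a real analysis would use a panel or mixed model)
fit <- lm(bp ~ time + factor(id), data = long_data)
print(coef(fit)["time"])
```

Both calls refer to bp and time by name, which works only because each variable lives in exactly one column.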

Comparison Table: Wide vs. Long Form

Aspect	Wide Form	Long Form
Structure	One row per unit; repeated measures in columns	One row per observation; repeated measures stacked
Number of Rows	Fewer	More
Number of Columns	More	Fewer
Human Readability	Often easier	Less intuitive to scan; IDs repeat across rows
Suitable for ML	Yes, fixed number of features	Requires reshaping first
Suitable for Panel/Repeated Models	Often requires reshaping	Directly compatible
Preferred in R Tidyverse	No	Yes
Preferred for Visualizations	Generally no	Yes
Flexibility	Limited	High
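The row and column counts in the table can be checked on the two example datasets. This is a small base R sketch; n_wide_values and n_long_values are names made up here.

```r
# The same six measurements stored both ways (values from the examples above)
wide_data <- data.frame(
  id  = c(1, 2),
  bp1 = c(120, 130),
  bp2 = c(118, 128),
  bp3 = c(116, 125)
)
long_data <- data.frame(
  id   = c(1, 1, 1, 2, 2, 2),
  time = c(1, 2, 3, 1, 2, 3),
  bp   = c(120, 118, 116, 130, 128, 125)
)

# dim() returns rows first, then columns
print(dim(wide_data))
print(dim(long_data))

# Count the measurement cells in each layout
n_wide_values <- nrow(wide_data) * (ncol(wide_data) - 1)  # exclude the id column
n_long_values <- nrow(long_data)                          # one measurement per row
print(c(wide = n_wide_values, long = n_long_values))
```

The wide version has fewer rows and more columns, the long version the reverse, and both hold the same number of measurements.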

Reshaping Data in Stata

Wide to Long

reshape long bp, i(id) j(time)

Long to Wide

reshape wide bp, i(id) j(time)

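In both commands, bp is the stub shared by bp1, bp2, and bp3, i(id) names the variable that identifies each unit, and j(time) names the variable that holds the numeric suffix. Going long, Stata creates time from the suffixes; going wide, it appends the values of time to the stub to rebuild the column names.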

Reshaping Data in R

Wide to Long (tidyverse)

library(tidyr)

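# Stack bp1, bp2, bp3 into a single bp column. names_prefix strips "bp" and
# names_transform converts the remainder, so time holds the integers 1, 2, 3
# rather than the strings "bp1", "bp2", "bp3".
long_data <- pivot_longer(
  data  = wide_data,
  cols  = starts_with("bp"),
  names_to = "time",
  names_prefix = "bp",
  names_transform = list(time = as.integer),
  values_to = "bp"
)

Long to Wide (tidyverse)

# Spread time back into columns. names_prefix restores the names bp1, bp2, bp3;
# without it the new columns would be named 1, 2, 3.
wide_data <- pivot_wider(
  data  = long_data,
  names_from = time,
  names_prefix = "bp",
  values_from = bp
)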

Base R alternatives such as reshape() exist but are used less frequently compared to tidyverse tools.
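For readers without tidyr, here is a sketch of the same round trip using stats::reshape() from base R. The sorting step and the name wide_again are choices made for this example, not something reshape() requires.

```r
# Wide-form data from the earlier example
wide_data <- data.frame(
  id  = c(1, 2),
  bp1 = c(120, 130),
  bp2 = c(118, 128),
  bp3 = c(116, 125)
)

# Wide to long: stack bp1..bp3 into bp, indexed by time = 1, 2, 3
long_data <- reshape(
  wide_data,
  direction = "long",
  varying   = c("bp1", "bp2", "bp3"),
  v.names   = "bp",
  timevar   = "time",
  times     = 1:3,
  idvar     = "id"
)

# reshape() returns rows grouped by time; sort by patient for readability
long_data <- long_data[order(long_data$id, long_data$time), ]
rownames(long_data) <- NULL
print(long_data)

# Long to wide: sep = "" rebuilds the column names bp1, bp2, bp3
wide_again <- reshape(
  long_data,
  direction = "wide",
  idvar     = "id",
  timevar   = "time",
  v.names   = "bp",
  sep       = ""
)
print(wide_again)
```

The argument list is longer than in pivot_longer() and pivot_wider(), which may be part of why the tidyverse functions are more commonly seen.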

Practical Considerations

Statistical modeling frameworks increasingly prefer long-form data because it naturally encodes repeated measures and hierarchical structures.
Data visualization libraries (e.g., ggplot2 in R) work most naturally with long-form data, because aesthetics such as x, y, and color each map to a single column.
In Stata, panel commands such as xtreg expect long-form data, with the panel and time identifiers declared beforehand using xtset.
When working with machine learning models in R or Python, wide-form data is usually more appropriate, so feature engineering often involves converting long-form data to wide form.
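The last point can be sketched in base R: long-form readings are turned into a fixed-width matrix with one row per patient and one column per time point. The name feature_matrix is hypothetical, and mean is used only because tapply() needs a summary function; each id-by-time cell here holds exactly one value.

```r
# Long-form data from the earlier example
long_data <- data.frame(
  id   = c(1, 1, 1, 2, 2, 2),
  time = c(1, 2, 3, 1, 2, 3),
  bp   = c(120, 118, 116, 130, 128, 125)
)

# Cross-tabulate bp by patient (rows) and time point (columns)
feature_matrix <- tapply(
  long_data$bp,
  list(id = long_data$id, time = long_data$time),
  FUN = mean
)

# Give the columns feature-style names: bp1, bp2, bp3
colnames(feature_matrix) <- paste0("bp", colnames(feature_matrix))
print(feature_matrix)
```

A patient with a missing time point would get NA in that cell, which would need handling before most models accept the matrix.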

Key Message: Understanding Wide vs. Long Data Structure

In data management for statistics and data science, the distinction between wide and long formats is essential when handling repeated measurements, multisite observations, or laboratory values.

Wide Format: Repeated Measures in Separate Columns

In wide-form data, each subject or patient has one row, and measurements taken at different body sites or time points (e.g., eye and arm scores, repeated blood pressure readings, laboratory results) are placed in separate columns.

Each measurement type becomes its own column.
The dataset becomes “wide” because variables multiply horizontally.

Example

id	eye_score	arm_score	lab_day1	lab_day2
1	5	4	120	118

In this structure:

One patient = one row
Each body part or time point = different column

This is the structure typically used when outcomes or repeated measures are placed side-by-side in columns.

Long Format: Each Measurement in Its Own Row

In long-form data, repeated measurements are stacked vertically, so each patient can appear in multiple rows.

The dataset becomes “long” because the number of rows increases.
A single variable (e.g., score, measurement) holds the values, and another variable (e.g., site, time) identifies what the measurement refers to.

Example

id	site	value
1	eye	5
1	arm	4
1	lab_day1	120
1	lab_day2	118

In this structure:

One patient = multiple rows
Each measurement = its own row
The ID repeats for each measurement

This is the format preferred for:

Panel and longitudinal modeling
Mixed-effects models
Tidyverse workflows
ggplot2 visualizations
Stata xtset and xtreg
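The long table in this section can be derived from the one-row wide example by stacking its columns. Below is a minimal base R sketch; measure_cols is a name invented here, and stripping the "_score" suffix is an assumption made so the labels match the long table shown above.

```r
# Wide-form row from the example in the previous section
wide_data <- data.frame(
  id        = 1,
  eye_score = 5,
  arm_score = 4,
  lab_day1  = 120,
  lab_day2  = 118
)

# Every column except id holds a measurement
measure_cols <- setdiff(names(wide_data), "id")

# Stack the measurement columns: id repeats, site says what each value is
long_data <- data.frame(
  id    = rep(wide_data$id, times = length(measure_cols)),
  site  = rep(sub("_score$", "", measure_cols), each = nrow(wide_data)),
  value = unlist(wide_data[measure_cols], use.names = FALSE)
)
print(long_data)
```

With tidyr loaded, pivot_longer() with cols = -id does the same stacking in one call.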

Summary

Wide form spreads repeated measurements across columns, with one row per patient. Long form stacks them down rows, so the same patient ID appears on multiple rows.