Data Generating Process
A data generating process is the real-world mechanism that produces the observations in a dataset.
Machine learning workflows often begin with the data as given. Statistical modeling asks a prior question: what process created these measurements, labels, missing values, and errors?
In infrastructure work, the data generating process might include:
- construction practices
- inspection schedules
- changing material costs
- record-keeping systems
- reporting incentives
- sensor or video quality
- regulatory requirements
- historical inequities in maintenance
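Thinking about the data generating process helps connect domain knowledge to model structure: each mechanism suggests a variable to include, a pattern of missingness to account for, or a source of measurement error. It also creates a bridge from prediction to causal reasoning, because a model of how the data arose makes it possible to ask what would change if part of that process changed.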
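One way to make this concrete is to write an assumed process down as a small simulation. The sketch below is a hypothetical illustration, not a model of any real asset inventory: the function name `simulate_segments`, the condition score, the inspection probabilities, and the noise level are all assumptions chosen for the example. It combines two items from the list above, an inspection schedule that favors segments in worse condition and measurement error from sensor or video quality.

```python
import random


def simulate_segments(n=10_000, seed=0):
    """Simulate a hypothetical data generating process for pipe segments.

    Each segment has a latent true condition. Whether it appears in the
    dataset depends on an assumed inspection schedule, and the recorded
    value carries assumed measurement noise.
    """
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        # Latent condition score in [0, 1]; higher means more deteriorated.
        true_condition = rng.random()

        # Assumed inspection schedule: worse segments are inspected more often.
        p_inspect = 0.1 + 0.8 * true_condition
        inspected = rng.random() < p_inspect

        if inspected:
            # Assumed measurement error from sensor or video quality.
            observed = true_condition + rng.gauss(0, 0.05)
        else:
            # Uninspected segments leave no record: a missing value.
            observed = None

        records.append((true_condition, observed))
    return records


records = simulate_segments()

# Mean of the latent condition over every segment.
true_mean = sum(t for t, _ in records) / len(records)

# Mean of what actually lands in the dataset.
observed_values = [o for _, o in records if o is not None]
observed_mean = sum(observed_values) / len(observed_values)

print(f"true mean condition:     {true_mean:.3f}")
print(f"observed mean condition: {observed_mean:.3f}")
print(f"share of segments with a record: {len(observed_values) / len(records):.2f}")
```

Under these assumptions the observed mean sits above the true mean, because the inspection schedule over-represents deteriorated segments. A model fit to the data as given would inherit that bias, while a model that encodes the inspection mechanism has a chance to correct for it.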
Related: Interpretable Statistical Models.