Data Generating Process

A data generating process is the real-world mechanism that produces the observations in a dataset.

Machine learning workflows often begin with the data as given. Statistical modeling asks a prior question: what process created these measurements, labels, missing values, and errors?

In infrastructure work, the data generating process might include:

  • construction practices
  • inspection schedules
  • changing material costs
  • record-keeping systems
  • reporting incentives
  • sensor or video quality
  • regulatory requirements
  • historical inequities in maintenance

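Thinking about the data generating process helps connect domain knowledge to model structure: each known mechanism, such as an inspection schedule that determines which assets are observed at all, can be represented as an explicit model component. It also creates a bridge from prediction to causal reasoning, because a model that describes how the data arose can be asked what would change under an intervention, not only what is likely to be observed next.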
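One way to make this concrete is to write a toy data generating process as a simulation. The sketch below is illustrative only: the function name simulate_pipes, the pipe-age split, and every probability in it are invented assumptions, not values from any real inspection program. It shows how an inspection schedule that favors older pipes can make the recorded defect rate differ from the true one.

```python
import random


def simulate_pipes(n=10_000, seed=0):
    """Hypothetical DGP: latent defect -> inspection decision -> recorded label."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        old = rng.random() < 0.5            # assumed: half of the pipes are old
        p_defect = 0.30 if old else 0.10    # assumed defect rates by age
        defect = rng.random() < p_defect    # latent true state, never seen directly
        p_inspect = 0.80 if old else 0.20   # assumed: old pipes are inspected more often
        inspected = rng.random() < p_inspect
        # A defect reaches the dataset only if the pipe was inspected.
        recorded = defect if inspected else None
        records.append({"old": old, "defect": defect, "recorded": recorded})
    return records


records = simulate_pipes()

# Rate over all pipes (knowable only because this is a simulation).
true_rate = sum(r["defect"] for r in records) / len(records)

# Rate over inspected pipes only, which is what a dataset "as given" would show.
observed = [r["recorded"] for r in records if r["recorded"] is not None]
observed_rate = sum(observed) / len(observed)

print(f"true defect rate:     {true_rate:.3f}")
print(f"rate among inspected: {observed_rate:.3f}")
```

Under these assumed parameters the expected true rate is 0.20, while the expected rate among inspected pipes is 0.26, because inspection concentrates on the group with more defects. A model fit to the records as given would inherit that gap unless the inspection mechanism is modeled as well.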

Related: Interpretable Statistical Models.
