Canon

These are some articles that have had a significant impact on my thinking, there are many more, but these are some.
markdown

Data & Analytics canon

A collection of my favourite essays, posts, courses on data & analytics.

The analytics stack

Emerging Architectures, A16Z ref

How the Modern Data Stack is Reshaping Data Engineering ref

  • Modern data engineering teams are increasingly involved in selecting tools, integrating them, and keeping costs in check.
  • The industry is doubling down on templated SQL and YAML as a way to manage the “T” in ELT. > it feels like what early PHP was to web development. Early PHP would essentially be expressed as PHP code snippets living inside HTML files, and led to suboptimal patterns. It was clearly a great way to inject some more dynamisity into static web pages, but the pattern was breaking down as pages and apps became more dynamic. Similarly, in SQL+jinja, things increasingly break down as there’s more need for complex logic or more abstracted forms. It falls short on some of the promises of a richer programming model. > > I’ve been very interested in higher level abstractions sitting over the transform layer - namely “computation frameworks” or “parametric pipelines” as pieces of reusable data engineering machinery that can perform complex tasks. It’s pretty clear to me that combining SQL with Jinja templating doesn’t provide the proper foundation for these emerging constructs.
  • The metrics layer (popularized by Airbnb’s Minerva, Transform.co, and MetriQL), feature engineering frameworks (closer to MLops), A/B testing frameworks, and a cambrian explosion of homegrown computation frameworks of all shapes and flavors. Call this “data middleware”, “parametric pipelining” or “computation framework”, but this area is starting to take shape.
  • Whether you call it data mesh or more generally “decentralized governance”, teams of domain experts are starting to own and drive data systems. Each team will start to be responsible for data quality SLA’s and publishing metrics and dimensions for the rest of the organization to consume. It’s a huge mes(s/h)!
  • Every product is becoming a data product

Newsletters

  • https://roundup.getdbt.com/
  • https://benn.substack.com/

Honourable mentions

A Data Pipeline is a Materialized View ref

Pipelines can be considered as a virtual view of source data, containing all the necessary extractions and transformations. Interdependent pipelines can be thought of as a graph of materialised views.

As long as the input data and pipeline transformations (i.e. the pipeline code) are preserved, the output can always be recreated. The input data is primary; if lost, it cannot be replaced. The output data, along with any intermediate stages in the pipeline, are derivative; they can always be recreated from the primary data using the pipeline.

Any time someone queries the output of the pipeline, it’s logically equivalent to them running the entire pipeline on the source data to get the output they’re looking for. Of course, data pipelines don’t work this way in practice. Hence, the typical real-world pipeline materializes its output, and often also several of the intermediate datasets required to produce that final output.

To update a materialized view, there are two high-level properties you typically care about: the update trigger, and the update granularity.

Inspiration

Dan Romero’s start-up canon