Recently there has been a lot of excitement around the idea of a stand-alone metrics layer in the modern data stack. Traditionally, metrics have been defined in the BI or analytics layer where various dashboards are used to look at business metrics like Revenue, Sales Pipeline, numbers of Claims, or User Activity. Given that most organizations end up with multiple BI/Analytics tools, the idea has a lot of merits. Instead of pulling data into excel sheets and expecting everyone to calculate metrics independently, why not define the metrics in one central place for all to refer to? In software engineering, it is a classic candidate for refactoring the repeated definition of the same logic and following the principle of “Do not Repeat Yourself”
This is a fantastic start to defining the use cases that need to be addressed. I would add one more metric type that seems to be hinted at, but not defined fully. If we go back to Kimball's definition of metrics, we can add semi-additive metrics. These are metrics that can sometimes be aggregated across some dimensions, but not all. For instance, month-end balances can't be summed across months. They represent a running total. They can be averaged across months though. Another example is a budget or limit in a parent table that can't be aggregated across it's children, but needs to be aggregated across other budget records. In these instances, we can do nested sql to get the appropriate aggregations -- sum up line items, join to parent table and sum again. Also, we could join and sum making sure that we divided the parent metric by the # of line items so everything sums up correctly.
I would argue that these solutions are inefficient and a hack that makes self-service analytics by less technical people almost impossible to achieve. It's not that an end user shouldn't have to think about data granularity, but they shouldn't have to create technical solutions on their own for it. A metric store that can address this will be one that has the basic capabilities needed for creating analysis-ready data sets in an organization.
Aside from semi-additive metrics, the other question for metric stores is the reliance on relational models. How can a metric store function on top of a knowledge graph? How do we implement it as a graph or as a virtual graph (i.e. Stardog)? This has to be asked just due to the scalability and robust flexibility of graph modeling.
At Humanic, we further believe that the metrics layer needs to be integrated within a context of a tool that the business owner (Revenue Leader) uses to take some revenue generating action, otherwise it becomes yet another tool that Data Scientists and Business Analysts will be utilizing.
That by itself is not a bad idea except that much is lost in translation when a business user needs to communicate data requirements to a Data Scientist.
Amit, Excellent article and thanks for sharing your thoughts. I agree that there are benefits of unbundling the metrics layers and solution #3 is promising.
If solution #3 were adopted by a few vendors, the next question is, what will be the bridge between visualization grammar and the query generation tool.
Generation of queries for multiple aggregation levels is non-trivial. What if the tool vendor wishes to optimize the query created by a separate layer?
The metrics layer has growing up to do
This is a fantastic start to defining the use cases that need to be addressed. I would add one more metric type that seems to be hinted at, but not defined fully. If we go back to Kimball's definition of metrics, we can add semi-additive metrics. These are metrics that can sometimes be aggregated across some dimensions, but not all. For instance, month-end balances can't be summed across months. They represent a running total. They can be averaged across months though. Another example is a budget or limit in a parent table that can't be aggregated across it's children, but needs to be aggregated across other budget records. In these instances, we can do nested sql to get the appropriate aggregations -- sum up line items, join to parent table and sum again. Also, we could join and sum making sure that we divided the parent metric by the # of line items so everything sums up correctly.
I would argue that these solutions are inefficient and a hack that makes self-service analytics by less technical people almost impossible to achieve. It's not that an end user shouldn't have to think about data granularity, but they shouldn't have to create technical solutions on their own for it. A metric store that can address this will be one that has the basic capabilities needed for creating analysis-ready data sets in an organization.
Aside from semi-additive metrics, the other question for metric stores is the reliance on relational models. How can a metric store function on top of a knowledge graph? How do we implement it as a graph or as a virtual graph (i.e. Stardog)? This has to be asked just due to the scalability and robust flexibility of graph modeling.
My project: https://github.com/Agile-Data/flat-ql
It compiles flat query language into SQL dialect of various database for data analysis.
Great blog post Amit.
At Humanic, we further believe that the metrics layer needs to be integrated within a context of a tool that the business owner (Revenue Leader) uses to take some revenue generating action, otherwise it becomes yet another tool that Data Scientists and Business Analysts will be utilizing.
That by itself is not a bad idea except that much is lost in translation when a business user needs to communicate data requirements to a Data Scientist.
Would love to get a coffee to discuss further! saksena@humanic.ai
I’m working on this. :-)
https://github.com/TheSwanFactory/hclang/blob/master/hc/cqml.hc
Let’s do coffee!
Ernest.Prabhakar@gmail.com
Amit, Excellent article and thanks for sharing your thoughts. I agree that there are benefits of unbundling the metrics layers and solution #3 is promising.
If solution #3 were adopted by a few vendors, the next question is, what will be the bridge between visualization grammar and the query generation tool.
Generation of queries for multiple aggregation levels is non-trivial. What if the tool vendor wishes to optimize the query created by a separate layer?
Merits of metrics! 🙌 🎉