Continuing my exploration of the four dimensions of Data Strategy. In this post, I bring together some earlier themes, including Pace Layering.
The first point to emphasize is that there are many elements to your overall data strategy, and these don't all work at the same tempo. Data-driven design methodologies such as Information Engineering (especially the James Martin version) were based on the premise that the data model was more permanent than the process model, but it turns out that this is only true for certain categories of data.
So one of the critical requirements for your data strategy is to manage both the slow-moving stable elements and the fast-moving agile elements. This calls for a layered approach, where each layer has a different rate of change, known as pace-layering.
The concept of pace-layering was introduced by Stewart Brand. In 1994, he wrote a brilliant and controversial book about architecture, How Buildings Learn, which among other things contained a theory about evolutionary change in complex systems based on earlier work by the architect Frank Duffy. Although Brand originally referred to the theory as Shearing Layers, by the time of his 1999 book he had switched to calling it Pace Layering. If there is a difference between the two, Shearing Layers is primarily a descriptive theory about how change happens in complex systems, while Pace Layering is primarily an architectural principle for the design of resilient systems-of-systems.
In 2006, I was working as a software industry analyst, specializing in Service-Oriented Architecture (SOA). Microsoft invited me to Las Vegas to participate in a workshop with other industry analysts, where (among other things) I drew the following layered picture.
Here's how I now draw the same picture for data strategy. It also includes a rough mapping to the Trimodal approach.
Giles Slinger and Rupert Morrison, Will Organization Design Be Affected By Big Data? (J Org Design Vol 3 No 3, 2014)
Wikipedia: Information Engineering, Shearing Layers
Related Posts: Layering Principles (March 2005), SPARK 2 - Innovation or Trust (March 2006), Beyond Bimodal (May 2016), Data Strategy - Agility
How far can general principles of asset management be applied to data? In this post, I'm going to look at some of the challenges of putting monetary or non-monetary value on your data assets.
Why might we want to do this? There are several reasons why people might be interested in the value of data.
- Establish internal or external benchmarks
- Set measurable targets and track progress
- Identify underutilized assets
- Prioritization and resource allocation
- Threat modelling and risk assessment (especially in relation to confidentiality, privacy, security)
Non-monetary benchmarks may be good enough if all we want to do is compare values - for example, this parcel of data is worth a lot more than that parcel, this process/practice is more efficient/effective than that one, this initiative/transformation has added significant value, and so on.
But for some purposes, it is better to express the value in financial terms, especially for the following:
- Cost-benefit analysis – e.g. calculate return on investment
- Asset valuation – estimate the (intangible) value of the data inventory – e.g. relevant for flotation or acquisition
- Exchange value – calculate pricing and profitability for traded data items
There are (at least) five entirely different ways to put a monetary value on any asset.
- Historical Cost: The total cost of the labour and other resources required to produce and maintain an item.
- Replacement Cost: The total cost of the labour and other resources that would be required to replace an item.
- Liability Cost: The potential damages or penalties if the item is lost or misused. (This may include regulatory action, reputational damage, or commercial advantage to your competitors, and may bear no relation to any other measure of value.)
- Utility Value: The economic benefits that may be received by an actor from using or consuming the item.
- Market Value: The exchange price of an item at a given point in time. The amount that must be paid to purchase the item, or the amount that could be obtained by selling the item.
But there are some real difficulties in doing any of this for data. None of these difficulties are unique to data, but I can't think of any other asset class that has all of these difficulties multiplied together to the same extent.
- Data is an intangible asset. There are established ways of valuing intangible assets, but these are always somewhat more complicated than valuing tangible assets.
- Data is often produced as a side-effect of some other activity. So the cost of its production may already be accounted for elsewhere, or is a very small fraction of a much larger cost.
- Data is a reusable asset. You may be able to get repeated (although possibly diminishing) benefit from the same data.
- Data is an infinitely reproducible asset. You can sell or share the same data many times, while continuing to use it yourself.
- Some data loses its value very quickly. If I’m walking past a restaurant, this information has value to the restaurant. Ten minutes later I'm five blocks away, and the information is useless. And even before this point, suppose there are three restaurants and they all have access to the information that I am hungry and nearby. As soon as one of these restaurants manages to convert this information, its value to the remaining restaurants becomes zero or even negative.
- Data combines in a non-linear fashion. Value (X+Y) is not always equal to Value (X) + Value (Y). Even within more tangible asset classes, we can find the concepts of Assemblage and Plottage. For data, one version of this non-linearity is the phenomenon of information energy described by Michael Saylor of MicroStrategy. And for statisticians, there is also Simpson’s Paradox.
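To make the last point concrete, here is a small sketch of Simpson's Paradox, where two parcels of data each point one way but point the other way once combined. The counts, the segment names X and Y, and the options A and B are all invented for illustration.

```python
# Illustrative counts only: (successes, trials) for options A and B
# in two hypothetical customer segments, X and Y.
segments = {
    "X": {"A": (9, 10), "B": (80, 100)},
    "Y": {"A": (30, 100), "B": (2, 10)},
}

def rate(successes, trials):
    """Success rate as a fraction between 0 and 1."""
    return successes / trials

def winner_by_segment(segments):
    """Return the option with the higher success rate within each segment."""
    return {
        name: max(arms, key=lambda arm: rate(*arms[arm]))
        for name, arms in segments.items()
    }

def pooled_winner(segments):
    """Return the option with the higher success rate after pooling all segments."""
    totals = {}
    for arms in segments.values():
        for arm, (successes, trials) in arms.items():
            s0, n0 = totals.get(arm, (0, 0))
            totals[arm] = (s0 + successes, n0 + trials)
    return max(totals, key=lambda arm: rate(*totals[arm]))

per_segment = winner_by_segment(segments)  # A wins in both X and Y
overall = pooled_winner(segments)          # B wins once the data are pooled
print(per_segment, overall)
```

Here A does better within each segment (0.9 against 0.8, and 0.3 against 0.2), yet B does better overall (82/110 against 39/110), because the two options are spread very unevenly across the segments.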
The production costs of data can be estimated in various ways. One approach is to divide up the total ICT expenditure, estimating roughly what proportion of the whole to allocate to this or that parcel of data. This generally only works for fairly large parcels - for example, this percent to customer transactions, this percentage to transport and logistics, etc. Another approach is to work out the marginal or incremental cost: this is commonly preferred when considering new data systems, or decommissioning old ones. We can compare the effort consumed in different data domains, or count the number of transformation steps from raw data to actionable intelligence.
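As a rough sketch of the first two approaches (proportional allocation and marginal cost), the arithmetic might look like this. The budget, the percentage shares and the function names are hypothetical, chosen only to show the shape of the calculation.

```python
def allocate_cost(total_cost, shares):
    """Split total_cost across data parcels in proportion to the given shares."""
    total_share = sum(shares.values())
    if total_share <= 0:
        raise ValueError("shares must sum to a positive number")
    return {
        parcel: total_cost * share / total_share
        for parcel, share in shares.items()
    }

def marginal_cost(cost_with_system, cost_without_system):
    """Incremental cost attributable to adding (or keeping) one data system."""
    return cost_with_system - cost_without_system

# Hypothetical figures, for illustration only.
ict_budget = 1_000_000
shares = {
    "customer transactions": 40,
    "transport and logistics": 25,
    "everything else": 35,
}
allocation = allocate_cost(ict_budget, shares)
extra = marginal_cost(1_080_000, 1_000_000)
print(allocation, extra)
```

The proportional split is only as good as the rough shares fed into it, which is why it tends to suit large parcels; the marginal figure avoids that guesswork but only answers the question for one system at a time.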
As for the value of the data, there are again many different approaches. Ideally, we should look at the use-value or performance value of the data - what contribution does it make to a specific decision or process, or what aggregate contribution does it make to a given set of decisions and processes.
- This can be based on subjective assessments of relevance and usefulness, perhaps weighted by the importance of the decisions or processes where the data are used. See Bill Schmarzo's blogpost for a worked example.
- Or it may be based on objective comparisons of results with and without the data in question - making a measurable difference to some key performance indicator (KPI). In some cases, the KPI may be directly translated into a financial value.
However, comparing performance fairly and objectively may only be possible for organizations that are already at a reasonable level of data management maturity.
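The two options above can be sketched in a few lines. This is not Schmarzo's method, just a minimal illustration under my own assumptions: the decision names, the importance weights, the relevance scores and the value per KPI point are all hypothetical.

```python
def weighted_use_value(decisions):
    """Importance-weighted average of subjective relevance scores (0..1)."""
    total_importance = sum(d["importance"] for d in decisions)
    if total_importance == 0:
        return 0.0
    weighted = sum(d["importance"] * d["relevance"] for d in decisions)
    return weighted / total_importance

def kpi_uplift_value(kpi_with, kpi_without, value_per_point):
    """Financial value of the KPI difference observed with and without the data."""
    return (kpi_with - kpi_without) * value_per_point

# Hypothetical assessments for one parcel of data.
decisions = [
    {"name": "pricing", "importance": 3, "relevance": 0.5},
    {"name": "stock replenishment", "importance": 1, "relevance": 1.0},
]
subjective_score = weighted_use_value(decisions)   # (3*0.5 + 1*1.0) / 4 = 0.625
objective_value = kpi_uplift_value(120, 100, 50)   # 20 KPI points at 50 each = 1000
print(subjective_score, objective_value)
```

The first function yields only a relative score, which is enough for comparing parcels; the second yields a financial figure, but only where a fair with-and-without comparison is actually available.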
In the absence of this kind of metric, we can look instead at the intrinsic value of the data, independently of its potential or actual use. This could be based on a weighted formula involving such quality characteristics as accuracy, alignment, completeness, enrichment, reliability, shelf-life, timeliness, uniqueness, usability. (Gartner has published a formula that uses a subset of these factors.)
Arguably there should be a depreciation element to this calculation. Last year's data is not worth as much as this year's data, and the accuracy of last year's data may not be so critical, but the data is still worth something.
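A weighted formula with a depreciation element might be sketched as follows. To be clear, this is not the Gartner formula: the choice of characteristics, the weights, the scores and the depreciation rate are all assumptions made for the sake of the example.

```python
def intrinsic_score(quality, weights):
    """Weighted average of quality characteristics, each scored from 0 to 1."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(w * quality.get(name, 0.0) for name, w in weights.items()) / total_weight

def depreciated(score, age_years, annual_rate):
    """Reduce a score by a fixed fraction for each year of age (assumed model)."""
    return score * (1 - annual_rate) ** age_years

# Hypothetical weights and scores for one parcel of data.
weights = {"accuracy": 2, "completeness": 1, "timeliness": 1}
quality = {"accuracy": 0.75, "completeness": 0.5, "timeliness": 1.0}

this_year = intrinsic_score(quality, weights)   # (1.5 + 0.5 + 1.0) / 4 = 0.75
last_year = depreciated(this_year, 1, 0.5)      # 0.75 * 0.5 = 0.375
print(this_year, last_year)
```

With a geometric rate like this, old data never quite reaches zero, which matches the intuition that last year's data is still worth something.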
An intrinsic measure of this kind could be used to evaluate parcels of data at different points in the data-to-information process. For example, it could show the increase in enrichment and usability from 1. to 2. and from 2. to 3., and therefore give a measure of the added value produced by the data engineering team that does this for us.
2. Data Lake – cleansed, consolidated, enriched and accessible to people with SQL skills
3. Data Visualization Tool – accessible to people without SQL skills
If any of my readers know of any useful formulas or methods for valuing data that I haven't mentioned here, please drop a link in the comments.
Heather Pemberton Levy, Why and How to Value Your Information as an Asset (Gartner, 3 September 2015)
Bill Schmarzo, Determining the Economic Value of Data (Dell, 14 June 2016)
Wikipedia: Simpson's Paradox, Value of Information
Related posts: Information Algebra (March 2008), Does Big Data Release Information Energy? (April 2014), Assemblage and Plottage
My 2012 post on the Co-Production of Data and Knowledge offered a critique of the #DIKW pyramid. When challenged recently to propose an alternative schema, I drew something quickly on the wall, against a past-present-future timeline. Here is a cleaned-up version.
Data is always given from the past – even if only a fraction of a second into the past.
We use our (accumulated) knowledge (or memory) to convert data into information – telling us what is going on right now. Without prior knowledge, we would be unable to do this. As Dave Snowden puts it, knowledge is the means by which we create information out of data.
We then use this information to make various kinds of judgement into the future. In his book The Art of Judgment, Vickers identifies three types. We predict what will happen if we do nothing, we work out how to achieve what we want to happen, and we put these into an ethical frame.
is about the smooth flow towards judgement, as well as effective feedback and learning back into the creation of new knowledge, or the revision/reinforcement of old knowledge.
And finally, wisdom is about maintaining a good balance between all of these elements - respecting data and knowledge without being trapped by them.
What the schema above doesn't show are the feedback and learning loops. Dave Snowden invokes the OODA loop, but a more elaborate schema would include many nested loops - double-loop learning and so on - which would make the diagram a lot more complex.
And although the schema roughly indicates the relationship between the various concepts, what it doesn't show is the fuzzy boundary between the concepts. I'm really not interested in discussing the exact criteria by which the content of a document can be classified as data or information or knowledge or whatever.
Dave Snowden, Sense-making and Path-finding
Geoffrey Vickers, The Art of Judgment: A Study of Policy-Making (1965)
Related posts: Wisdom of the Tomato (March 2011), Co-Production of Data and Knowledge
The lie detector (aka #polygraph) is back in the news. The name polygraph is based on the fact that the device can record and display several things at once. Like a dashboard.
In the 1920s, a young American physiologist named John Larson devised a version for detecting liars, which measured blood pressure, respiration, pulse rate and skin conductivity. Larson called his invention, which he took with him to the police, a cardiopneumo psychogram, but polygraph later became the standard term. To this day, there is no reliable evidence that polygraphs actually work, but the great British public will no doubt be reassured by official PR that makes our masters sound like the heroes of an FBI crime series.
Over a hundred years ago, G.K. Chesterton wrote a short story exposing the fallacy of relying on such a machine. Even if the measurements are accurate, they can easily be misinterpreted.
There's a disadvantage in a stick pointing straight. The other end of the stick always points the opposite way. It depends whether you get hold of the stick by the right end.
There are of course many ways in which the data displayed on the dashboard can be wrong - from incorrect and incomplete data to muddled or misleading calculations. But even if we discount these errors, there may be many ways in which the user of the dashboard can get the wrong end of the stick.
As I've pointed out before, along with the illusion that what the data tells you is true, there are two further illusions: that what the data tells you is important, and that what the data doesn't tell you is not important.
No machine can lie, nor can it tell the truth.
G.K. Chesterton, The Mistake of the Machine
Hannah Devlin, Polygraph’s revival may be about truth rather than lies (The Guardian, 21 January 2020)
Megan Garber, The Lie Detector in the Age of Alternative Facts (Atlantic, 29 March 2018)
Steven Poole, Is the word 'polygraph' hiding a bare-faced lie? (The Guardian, 23 January 2020)
Related posts: Memory and the Law (June 2008), How Dashboards Work (November 2009), Big Data and Organizational Intelligence
At @imperialcollege this week to hear Professor David Hand talk about his new book on Dark Data.
Some people define dark data as unanalysed data, data you have but are not able to use, and this is the definition that can be found on Wikipedia. The earliest reference I can find to dark data in this sense is a Gartner blogpost from 2012.
In a couple of talks I gave in 2015, I used the term Dark Data in a much broader sense - to include the data you simply don't have. My talks both included the following diagram.
Here's an example of this idea. A supermarket may know that I sometimes buy beer at the weekends. This information is derived from its own transaction data, identifying me through my use of a loyalty card. But what about the weekends when I don't buy beer from that supermarket? Perhaps I am buying beer from a rival supermarket, or drinking beer at friends' houses, or having a dry weekend. If they knew this, it might help them sell me more beer in future. Or sell me something else for those dry weekends.
Obviously the supermarket doesn't have access to its competitors' transaction data. But it does know when its competitors are doing special promotions on beer. And there may be some clues about my activity from social media or other sources.
The important thing to remember is that the supermarket rarely has a complete picture of the customer's purchases, let alone what is going on elsewhere in the customer's life. So it is trying to extract useful insights from incomplete data, enriched in any way possible by big data.
Professor Hand's book is about data you don't have - perhaps data you wish you had, or hoped to have, or thought you had, but nevertheless data you don't have. He argues that the missing data are at least as important as the data you do have. So this is the same sense that I was using in 2015.
Hand describes and illustrates many different manifestations of dark data, and talks about a range of statistical techniques for drawing valid conclusions from incomplete data and for overcoming potential bias. He also talks about the possible benefits of dark data - for example, hiding some attributes to improve the quality and reliability of the attributes that are exposed. A good example of this would be double-blind testing in clinical trials, which involves hiding which subjects are receiving which treatment, because revealing this information might influence and distort the results.
Can big data solve the challenges posed by dark data? In my example, we might be able to extract some useful clues from big data. But although these clues might lead to new avenues to investigate, or hypotheses that could be tested further, the clues themselves may be unreliable indicators. The important thing is to be mindful of the limits of your visible data.
David J Hand, Dark Data: Why what you don't know matters (Princeton 2020). See also his presentation at Imperial College, 10 February 2020 https://www.youtube.com/watch?v=R3IO5SDVmuk
Richard Veryard, Boundaryless Customer Engagement (Open Group, October 2015), Real-Time Personalization (Unicom December 2015)
Andrew White, Dark Data is like that furniture you have in that Dark Cupboard (Gartner, 11 July 2012)
Wikipedia: Dark Data
Related post: Big Data and Organizational Intelligence