A long time ago, I did some work for a client that had an out-of-date and inflexible billing system. The software would send invoices and monthly statements to the customers, who were then expected to remit payment to clear the balance on their account.
The business had recently introduced a new direct debit system. Customers who had signed a direct debit mandate no longer needed to send payments.
But faced with the challenge of introducing this change into an old and inflexible software system, the accounts department came up with an ingenious and elaborate workaround. The address on the customer record was changed to the address of the internal accounts department. The computer system would print and mail the statement, but instead of going straight to the customer it arrived back at the accounts department. The accounts clerk used a rubber stamp PAID BY DIRECT DEBIT, and would then mail the statement to the real customer address, which was stored in the Notes field on the customer record.
Although this may be an extreme example, there are several important lessons that follow from this story.
Firstly, businesses can't always wait for software systems to be redeveloped, and can often show high levels of ingenuity in bypassing the constraints imposed by an unimaginative design.
Secondly, the users were able to take advantage of a Notes field that had been deliberately underdetermined to allow for future expansion.
Furthermore, users may find clever ways of using and extending a system that were not considered by the original designers of the system. So there is a divergence between technology-as-designed and technology-in-use.
Now let's think about what happens when the IT people finally get around to replacing the old billing system. They will want to migrate customer data into the new system. But if they simply follow the official documentation of the legacy system (schema etc), there will be lots of data quality problems.
And by documentation, I don't just mean human-generated material but also schemas automatically extracted from program code and data stores. Just because a field is called CUSTADDR doesn't mean we can guess what it actually contains.
Here's another example of an underdetermined data element, from a presentation I gave at a DAMA conference in 2008, SOA Brings New Opportunities to Data Management.
In this example, we have a sales system containing a Business Type called SALES PROSPECT. But the content of the sales system depends on the way it is used - the way SALES PROSPECT is interpreted by different sales teams.
- Sales Executive 1 records only the primary decision-maker in the prospective organization. The decision-maker’s assistant is recorded as extra information in the NOTES field.
- Sales Executive 2 records the assistant as a separate instance of SALES PROSPECT. There is a cross-reference between the assistant and the boss.
Now both Sales Executives can use the system perfectly well - in isolation. But we get interoperability problems under various conditions.
- When we want to compare data between executives
- When we want to reuse the data for other purposes
- When we want to migrate to a new sales system
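The divergence can be sketched in code. This is a hypothetical illustration of the two recording conventions, not the actual schema of any real sales system:

```python
# Hypothetical illustration of the two recording conventions.
# Executive 1: one SALES PROSPECT row, with the assistant buried in NOTES.
exec1_prospects = [
    {"id": 101, "name": "Alice Boss", "notes": "Assistant: Bob Aide, ext 4412"},
]

# Executive 2: the assistant as a separate SALES PROSPECT, cross-referenced.
exec2_prospects = [
    {"id": 201, "name": "Carol Boss", "notes": "", "assistant_of": None},
    {"id": 202, "name": "Dan Aide", "notes": "", "assistant_of": 201},
]

# A naive count of "prospects" now means different things per team.
print(len(exec1_prospects))  # one row - but how many people?
print(len(exec2_prospects))  # two rows - but only one decision-maker
```

Each dataset is internally consistent, but any query, report or migration mapping that treats both teams' data the same way will silently get different answers.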
(And problems like these can occur with packaged software and software as a service just as easily as with bespoke software.)
So how did this mess happen? Obviously the original designer / implementer never thought about assistants, or never had the time to implement or document them properly. Is that so unusual?
And this again shows the persistent ingenuity of users - finding ways to enrich the data - to get the system to do more than the original designers had anticipated.
And there are various other complications. Sometimes not all the data in a system was created there; some of it was brought in from an even earlier system with a significantly different schema. And sometimes there are major data quality issues, perhaps linked to a post before processing paradigm.
Both data migration and data integration are plagued by such issues. Since the data content diverges from the designed schemas, you can't rely on the schemas of the source data; you have to inspect the actual data content, or undertake a massive data reconstruction exercise, often misleadingly labelled "data cleansing".
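One low-tech way to inspect actual content rather than trusting the schema is simple value profiling. Here is a minimal sketch, using hypothetical values for the CUSTADDR field from the billing story above:

```python
from collections import Counter

# Hypothetical extract of the legacy CUSTADDR field.
custaddr_values = [
    "12 High Street, Anytown",
    "Accounts Dept, Internal Mail Stop 3",   # redirected statement
    "Accounts Dept, Internal Mail Stop 3",   # redirected statement
    "47 Station Road, Othertown",
]

# Frequency counts often expose overloaded usage - for example,
# one internal address repeated across many "customers".
profile = Counter(custaddr_values)
for value, count in profile.most_common():
    print(count, value)
```

A repeated value like the internal accounts address would never show up in the schema documentation, but jumps out immediately from a frequency profile of the actual data.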
There are several tools nowadays that can automatically populate your data dictionary or data catalogue from the physical schemas in your data store. This can be really useful, provided you understand the limitations of what this is telling you. So there are a few important questions to ask before you trust the physical schema as providing a complete and accurate picture of the actual contents of your legacy data store.
- Was all the data created here, or was some of it mapped or translated from elsewhere?
- Is the business using the system in ways that were not anticipated by the original designers of the system?
- What does the business do when something is more complex than the system was designed for, or when it needs to capture additional parties or other details?
- Are classification types and categories used consistently across the business? For example, if some records are marked as "external partner" does this always mean the same thing?
- Do all stakeholders have the same view on data quality - what "good data" looks like?
- And more generally, is there (and has there been through the history of the system) a consistent understanding across the business as to what the data elements mean and how to use them?
Related posts: Post Before Processing (November 2008), Ecosystem SOA 2 (June 2010), Technology in Use (March 2023)
A friend of mine shares an email thread from his organization discussing the definition of CUSTOMER, disagreeing as to which categories of stakeholder should be included and which should be excluded.
Why is this important? Why does it matter how the CUSTOMER label is used? Well, if you are going to call yourself a customer-centric organization, improve customer experience and increase customer satisfaction, it would help to know whose experience, whose satisfaction matters. And how many customers are there actually?
The organization provides services to A, which are experienced by B and paid for by C, based on a contractual agreement with D. This is a complex network of actors with overlapping roles, and the debate is about which of these count as customers and which don't. I have often seen similar confusion elsewhere.
My friend asks: Am I supposed to have a different customer definition for different teams (splitter), or one customer definition across the whole business (lumper)? As an architect, my standard response to this kind of question is: it depends.
One possible solution is to prefix everything - CONTRACT CUSTOMER, SERVICE CUSTOMER, and so on. But although that may help sort things out, the real challenge is to achieve a joined-up strategy across the various capabilities, processes, data, systems and teams that are focused on the As, the Bs, the Cs and the Ds, rather than arguing as to which of these overlapping groups best deserves the CUSTOMER label.
Sometimes there is no correct answer, but a best fit across the board. That's architecture for you!
Many business concepts are not amenable to simple definition but have fuzzy boundaries. In my 1992 book, I explain the difference between monothetic classification (here is a single defining characteristic that all instances possess) and polythetic classification (here is a set of characteristics that instances mostly possess). See also my post Modelling Complex Classification (February 2009).
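The distinction can be illustrated in code. This is a toy sketch, with invented characteristics, not a definition drawn from the book:

```python
# Monothetic: a single defining characteristic that every instance must possess.
def is_customer_monothetic(party):
    return party["has_contract"]

# Polythetic: membership based on possessing most of a set of
# characteristics, with no single one being essential.
CHARACTERISTICS = ["has_contract", "receives_service", "pays_invoices"]

def is_customer_polythetic(party, threshold=2):
    score = sum(1 for c in CHARACTERISTICS if party.get(c))
    return score >= threshold

# A borderline party: no contract, but receives and pays for the service.
borderline = {"has_contract": False, "receives_service": True, "pays_invoices": True}
print(is_customer_monothetic(borderline))   # fails the single defining test
print(is_customer_polythetic(borderline))   # passes on 2 of 3 characteristics
```

The polythetic version deliberately has fuzzy boundaries: two parties can both count as customers without sharing any single characteristic.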
But my friend's problem is a slightly different one: how to deal with multiple conflicting monothetic definitions. One possibility is to lump all the As, Bs, Cs and Ds into a single overarching CUSTOMER class, and then provide different views (or frames) for different teams. But this still leaves some important questions open, such as which of these types of customer should be included in the Customer Satisfaction Survey, whether they all carry equal weight in the overall scores, and whose responsibility it is to improve these scores.
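The "lump then frame" approach can be sketched as follows. The role names here are hypothetical stand-ins for the As, Bs, Cs and Ds:

```python
# One overarching CUSTOMER class carrying all roles; each team sees
# a filtered view (frame) rather than maintaining a rival definition.
customers = [
    {"name": "A Ltd", "roles": {"service_recipient"}},
    {"name": "C plc", "roles": {"payer"}},
    {"name": "D Inc", "roles": {"contract_holder", "payer"}},
]

def view(role):
    """A team-specific frame over the single CUSTOMER class."""
    return [c["name"] for c in customers if role in c["roles"]]

print(view("payer"))            # e.g. the finance team's frame
print(view("contract_holder"))  # e.g. the legal team's frame
```

Note that the frames overlap: D Inc appears in both views, which is exactly the partial-connection situation discussed below, and the sketch says nothing about who owns the satisfaction scores for each frame.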
In her book on medical ontology, Annemarie Mol develops Marilyn Strathern's notion of partial connections as a way of overcoming an apparent fragmentation of identity - in our example, between the Contract Customer and the Service Customer - when these are sometimes the same person.
Being one shapes and informs the other while they are also different identities. ... Not two different persons or one person divided into two. But they are partially connected, more than one, and less than two. Mol pp 80-82
Mol argues that
frictions are vital elements of wholes,
... a tension that comes about inevitably from the fact that, somehow, we have to share the world. There need not be a single victor as soon as we do not manage to smooth all our differences away into consensus. Mol p 114
Mol's book is about medical practice rather than commercial business, but much of what she says about patients and their conditions applies also to customers. For example, there are some elements that generally belong to "the patient", and although in some cases there may be a different person (for example a parent or next-of-kin) who stands proxy for the patient and speaks on their behalf, it is usually not considered necessary to mention this complication except when it is specifically relevant.
Human beings can generally cope with these elisions, ambiguities and tensions in practice, but machines (by which I mean bureaucracies as well as algorithms) not so well. Organizations tend to impose standard performance targets, monitored and controlled through standard reports and dashboards, which fail to allow for these complexities. My friend's problem is then ultimately a political one, how is responsibility for "customers" distributed and governed, who needs to see what, and what consequences may follow.
(As it happens, I was talking to another friend yesterday, a doctor, about the way performance targets are defined, measured and improved in the National Health Service. Some related issues, which I may try to cover in a future post.)
Annemarie Mol, The Body Multiple: Ontology in Medical Practice (Duke University Press 2002)
Richard Veryard, Information Modelling - Practical Guidance (Prentice-Hall 1992)
- Polythetic Classification Section 3.4.1 pp 99-100
- Lumpers and Splitters Section 6.3.1, pp 169-171
@Jon_Ayre questions whether an organization's being data-driven drives the right behaviours. He identifies a number of pitfalls.
- It's all too easy to interpret data through a biased viewpoint
- Data is used to justify a decision that has already been made
- Data only tells you what happens in the existing environment, so may have limited value in predicting the consequences of making changes to this environment
In a comment below Jon's post, Matt Ballentine suggests that this is about evidence-based decision making, and notes the prevalence of confirmation bias. Which can generate a couple of additional pitfalls.
- Data is used selectively - data that supports one's position is emphasized, while conflicting data is ignored.
- Data is collected specifically to provide evidence for the chosen position - thus resulting in policy-based evidence instead of evidence-based policy.
A related pitfall is availability bias - using data that is easily available, or satisfies some quality threshold, and overlooking the possibility that other data (so-called dark data) might reveal a different pattern. In science and medicine, this can take the form of publication bias. In the commercial world, this might mean analysing successful sales and ignoring interrupted or abandoned transactions.
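A toy example of how this skews the numbers, using invented transaction data:

```python
# Hypothetical transaction log: only completed sales pass the usual
# "quality threshold"; the abandoned carts are the dark data.
transactions = [
    {"amount": 50, "status": "completed"},
    {"amount": 80, "status": "completed"},
    {"amount": 200, "status": "abandoned"},
    {"amount": 180, "status": "abandoned"},
]

completed = [t["amount"] for t in transactions if t["status"] == "completed"]
abandoned = [t["amount"] for t in transactions if t["status"] == "abandoned"]

# Analysing successful sales alone suggests a low-value customer base...
print(sum(completed) / len(completed))

# ...while the dark data shows the high-value baskets are the ones being lost.
print(sum(abandoned) / len(abandoned))
```

The available data and the dark data point to opposite conclusions about where the value is, which is precisely why the easy dataset can't be trusted on its own.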
It's not difficult to find examples of these pitfalls, both in the corporate world and in public affairs. See my analysis of Mrs May's Immigration Targets. See also Jonathan Wilson's piece on the limits of a data-driven approach in football, in which he notes
low sample size, the selective nature of the data, and an absence of nuance.
One of the false assumptions that leads to these pitfalls is the idea that the data speaks for itself. (This idea was asserted by the editor of Wired Magazine in 2008, and has been widely criticized since. See my post Big Data and Organizational Intelligence.) In which case, being data driven simply means following the data.
During the COVID pandemic, there was much talk about following the data, or perhaps following the science. But given that there was often disagreement about which data, or which science, some people adopted an ultra-sceptical position, reluctant to accept any data or any science. Or they felt empowered to do their own research. (Francesca Tripodi sees parallels between the idea that one should research a topic oneself rather than relying on experts, and the Protestant ethic of bible study and scriptural inference. See my post Thinking with the majority - a new twist.)
But I don't think being data-driven entails simply blindly following some data. There should be space for critical evaluation and sense-making: questioning the strength and relevance of the data, remaining open to alternative interpretations, and always staying hungry for new sources of data - including experiments and tests - that might provide new insight or a different perspective.
Jon talks about Amazon running experiments instead of relying on historical data alone. And in my post Rhyme or Reason I talked about the key importance of A/B testing at Netflix. If Amazon and Netflix don't count as data-driven organizations, I don't know what does.
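The logic of a simple A/B comparison can be sketched with a standard two-proportion z-test. The conversion numbers here are invented, and this is not a description of how Netflix or Amazon actually run their experiments:

```python
import math

def ab_z_score(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test comparing variant B against variant A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se  # |z| > 1.96 is roughly significant at the 5% level

# Variant A converts 180 of 2000 visitors; variant B converts 220 of 2000.
z = ab_z_score(180, 2000, 220, 2000)
print(round(z, 2))
```

The point of the experiment is that it generates data about the changed environment, rather than extrapolating from data about the existing one - which addresses the third of Jon's pitfalls directly.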
So Matt asks if we should be talking about "experiment-driven" instead. I agree that experiment is important and useful, but I wouldn't put it in the driving seat. I think we need multiple tools for situation awareness (making sense of what is going on and where it might be going) and action judgement (thinking through the available action paths), and experimentation is just one of these tools.
Jonathan Wilson, Football tacticians bowled over by quick-fix data risk being knocked for six (Guardian, 17 September 2022)
Related posts: From Dodgy Data to Dodgy Policy - Mrs May's Immigration Targets (March 2017), Rhyme or Reason (June 2017), Big Data and Organizational Intelligence (November 2018), Dark Data (February 2020), Business Science and its Enemies (November 2020), Thinking with the majority - a new twist (May 2021), Data-Driven Reasoning (COVID) (April 2022)
My new book on Data Strategy now available on LeanPub: How To Do Things With Data.
One of the ideas running through my work on #datastrategy is to see data as a means to an end, rather than an end in itself. As someone might once have written,
Data scientists have only interpreted the world in various ways. The point however is to change it.
Many people in the data world are focussed on collecting, processing and storing data, rendering and analysing the data in various ways, and making it available for consumption or monetization. In some instances, what passes for a data strategy is essentially a data management strategy.
I agree that this is important and necessary, but I don't think it is enough.
I am currently reading a brilliant book by Annemarie Mol on Medical Ontology. In one chapter, she describes the uses of test data by different specialists in a hospital. The researchers in the hospital laboratory want to understand a medical condition in great detail - what causes it, how it develops, what it looks like, how to detect it and measure its progress, how it responds to various treatments in different kinds of patient. The clinicians on the other hand are primarily interested in interventions - what can we do to help this patient, what are the prospects and risks.
In the corporate world, senior managers often use data as a monitoring tool - screening the business for areas that might need intervention. Highly aggregated data can provide them with a thin but panoramic view of what is going on, but may not provide much guidance on corrective or preventative action. See my post on OrgIntelligence in the Control Room (October 2010).
Meanwhile, suppose your data strategy calls for a 360 view of key data domains, such as CUSTOMER and PRODUCT. If these initiatives are to be strategically meaningful to the business, and not merely exercises in technical plumbing, they need to be closely aligned with the business strategy - for example delivering on customer centricity and/or product leadership.
In other words, it's not enough just to have a lot of good quality data and generate a lot of analytic insight. Hence the title of my new book - How To Do Things With Data.
Annemarie Mol, The Body Multiple: Ontology in Medical Practice (Duke University Press 2002)
My book on Data Strategy is now available in beta version. https://leanpub.com/howtodothingswithdata/
#datastrategy My latest book has been published by @Leanpub
This is a beta version, and I intend to add more material as well as responding to feedback from readers and making general improvements. Subscribers will always have access to the latest version.