What’s relevant? What data do you need? What data will survive???
A recent post by Tim Crawford on LinkedIn proclaimed: 2016 is the year of data and relevance. The topic was echoed in a blog post by Myles Suer: 2016: The Year of Data and Relevance. Get the right data at the right level at the right time for the right decisions! That looks like the central tenet of Big Data. And now that we have Big Data, what can we do with it to make better business decisions to drive value, reduce costs, or minimize risks? In my last post, I looked at a set of predictions for 2016 by the IBM THINK Leaders: Six data-driven leadership trends for 2016. Two of the predictions were: “Survival of the Fittest Data” and “Hyper-Relevancy”. It does seem to be a year for Data Relevance!
Myles Suer in his blog states: “CIOs need to create relevant and trustworthy data.” That’s a statement I disagree with at the broad level. It implies that the CIO will know and understand the needs of and relevance for each and every business user. That’s not a fair position to put the CIO in, and I think it moves in the wrong direction.
In my years working with data quality, integration, and governance, I’ve seen a lot of discussion around whether the quality of data was inherently ‘good’ or ‘bad’, with many measures, including assessments of trust, used to determine which side it fell on. But I’ve increasingly felt those criteria were off target. To me, data quality is about “Does it fit your needs?” If it does, and you trust that it consistently does, then it’s “good”; if not, then it’s not. I end up coming back to the question: “What problem are you trying to solve?”
The patterns and concepts of the Fitness Landscape, as expressed in the fields of evolutionary biology, seem particularly relevant. “Fitness landscapes are often conceived of as ranges of mountains. There exist local peaks (points from which all paths are downhill, i.e. to lower fitness) and valleys (regions from which most paths lead uphill)…. An evolving population typically climbs uphill in the fitness landscape.”
Populations evolve upwards, and with each upward move the number of options available decreases. It’s also very hard to move back down to take another path.
Figure 1: Copyright: Randy Olson, 2013; https://upload.wikimedia.org/wikipedia/commons/e/ea/Visualization_of_two_dimensions_of_a_NK_fitness_landscape.png
In working with data, you have a lot of potential initial options, but some may not be accessible or understandable, and you’ll make particular decisions with what you do have and know that bring you to a given peak. It may be locally optimal based on what you have, but could be an incredibly poor choice across the broader landscape, effectively leading you to poor or misguided business decisions.
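The dynamic above can be sketched as a simple greedy hill climb. The landscape function here is invented purely for illustration; the point is that where you start determines which local peak you end on, and the climber never backtracks:

```python
from math import sin

# A toy 1-D "fitness landscape": rugged, with several local peaks.
# This function is hypothetical, chosen only to create local optima.
def fitness(x):
    return sin(x) + 0.4 * sin(3 * x) + 0.2 * sin(7 * x)

def hill_climb(x, step=0.05, max_iters=1000):
    """Greedy uphill walk: stop when neither neighbor is fitter."""
    for _ in range(max_iters):
        candidates = [x - step, x, x + step]
        best = max(candidates, key=fitness)
        if best == x:
            break  # a local peak: all paths from here lead downhill
        x = best
    return x

# Different starting points can settle on different peaks -- each locally
# optimal, but possibly far below the best peak on the whole landscape.
peaks = {round(hill_climb(x0), 2) for x0 in [0.5, 2.0, 4.0]}
print(peaks)
```

Your initial choice of data sources plays the role of the starting point: a greedy climb from poor data reaches a peak, but not necessarily a good one.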
And that’s a starting point to look at whether specific sources of data are:
- Available, accessible, and possibly useful to your purpose;
- Relevant for your problem;
- Fit for your purpose; and
- Able to be incorporated into your solution.
Cataloging Data: Awareness and Identity
You can create the most consistent, trusted, relevant source of customer data possible, but if the business units working with customer insights do not have awareness of it, it never comes into view. Similarly, there may be great sources of data relevant to customer demographics or geographic locations, but without awareness of such content, that data will not come into the picture. It may be relevant, but it is unknown.
It’s not enough to have a database or file name. I may know that directory X contains the following files:
It looks like there’s some data on a clinical study, but whose? There’s some data about facilities, but are they related to the clinical study, or are they something else? What are these associated with? Where did they come from? Is October 19, 2015 the date they were created or the date added?
Without some organizing principles applied, such data is like a random fitness landscape where peaks and valleys are close together. It is a landscape where natural selection (i.e. finding relevant data) cannot work.
A much better starting point for the CIO is to create a relevant and trustworthy Catalog of the data.
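As a minimal sketch of what such a catalog entry might capture: the field names and sample values below are illustrative assumptions, not drawn from any particular catalog product.

```python
from dataclasses import dataclass, field

# Hypothetical minimal catalog entry; fields are illustrative only.
@dataclass
class CatalogEntry:
    name: str          # human-readable dataset name, not just a file name
    location: str      # where the data lives (path, URL, or table)
    description: str   # what it is and what it relates to
    owner: str         # who created or stewards it
    created: str       # when the data was produced, not just when it was added
    tags: list = field(default_factory=list)  # topics that support discovery

catalog = [
    CatalogEntry(
        name="Clinical study enrollment",
        location="/data/x/study_enroll.csv",  # invented example path
        description="Enrollment records for clinical study sites",
        owner="Clinical Ops",
        created="2015-10-19",
        tags=["clinical", "enrollment"],
    ),
]

# Awareness: a user can now search by topic instead of guessing at file names.
matches = [e.name for e in catalog if "clinical" in e.tags]
print(matches)
```

Even this small amount of structure turns "a directory of files" into something a business user can find and judge.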
Curating Data: Description, Provenance, Use
More than just naming a dataset or database, it’s important to understand the basics about a given source of data. What is it? What does it relate to? Who created it? How old is it? How did it get into its current form? How should I use it?
These are the questions that need to be answered for effective data curation. Building from a catalog of data, the catalog needs to be enriched with metadata. Consider the data available at data.gov, home of the U.S. Government’s open data. Without information about the data, who can tell what may or may not be relevant?
But I can fairly readily browse topics or data sets to seek relevant information. I may be trying to determine risk factors for home insurance evaluation or the likelihood of requiring certain supplies in a retail food chain. Climate data is relevant to each, and perhaps data on coastal flooding would be of use. By taking advantage of tags, I get to a set of 63 data sets that may be of immediate relevance.
Each data set has been curated to some level. It’s described. It indicates who created it. I can see the formats available, and even how many people have recently viewed it! It might be of further use, like on Amazon.com, to see how well other users like the data set or what their comments are.
Data Curation is a collective activity. Some content may be gathered via metadata management tools, and other information such as where it came from might be automatically collected as data lineage. But if someone downloads the NCDC Storm Events database to a local directory, who knows where it came from? Is the information that it came from data.gov and the National Oceanic and Atmospheric Administration available? It is this added metadata that gives me higher trust in a given data source, and this content should be available in any decision-making process.
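One way to keep that provenance from being lost at download time is to record it as metadata alongside the file. The sketch below assumes invented field names, path, and URL; a content hash is added so later drift from the source copy can be detected:

```python
import hashlib
import json
from datetime import datetime, timezone

# Hedged sketch: capture "where did this come from?" when data is copied
# locally. Field names are illustrative, not any standard schema.
def record_provenance(local_path, source_url, publisher, content):
    return {
        "local_path": local_path,
        "source_url": source_url,   # e.g. the data.gov landing page (illustrative)
        "publisher": publisher,
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        # Hash of the downloaded bytes, to detect later divergence.
        "sha256": hashlib.sha256(content).hexdigest(),
    }

meta = record_provenance(
    "/tmp/storm_events.csv",                      # invented local path
    "https://catalog.data.gov/dataset/storm-events",  # invented URL
    "National Oceanic and Atmospheric Administration",
    b"event_id,state,event_type\n...",            # stand-in file content
)
print(json.dumps(meta, indent=2))
```

With this recorded in the catalog, the trust question "is this really the NOAA data from data.gov?" has an answer that survives the download.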
From the perspective of the fitness landscape, you are reducing the set of random peaks and valleys to a set from which you can effectively start exploring and evaluating for best fit.
Evaluating Data: Exploration and Discovery; Or what’s really here?
Whether through manual investigation or the use of automated techniques for data profiling, text analytics, or other data discovery and exploration tools, evaluation has centered on various criteria and measures, often with pre-set assumptions. Looking at an attribute in a given set of data, I may determine that 80% of the attribute is null or blank – it’s incomplete. Is that bad? Does it change the relevance of the data? Maybe. Maybe not.
If it is data from a sensor, that incomplete data could have different meanings depending on context. If I’m studying the rate of sensor failure, then that null and blank content may be critical! That ‘incomplete’ data is, in fact, completely relevant and fit for my purpose. If I’m instead assessing the correlation of cold weather with online shopping and my weather sensors have a lot of missing data, then I may determine either a) the complete data is still relevant, but I need to filter out those records with incomplete content; or b) the data would be relevant if available, but this data set is not reliable so I need to correct it or find an alternative to address the gaps.
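The same profiling result serving two opposite purposes can be shown in a few lines. The sensor readings below are invented for illustration:

```python
# Minimal sketch: the same null rate is judged differently depending on
# the question being asked. Readings are invented sample data.
readings = [
    {"sensor": "s1", "temp_c": 3.0},
    {"sensor": "s1", "temp_c": None},   # missing reading
    {"sensor": "s2", "temp_c": None},
    {"sensor": "s2", "temp_c": None},
    {"sensor": "s3", "temp_c": -1.5},
]

# Profiling measure: fraction of null values in the attribute.
null_rate = sum(r["temp_c"] is None for r in readings) / len(readings)

# Question 1: studying sensor failure -- the nulls ARE the signal.
failures = [r for r in readings if r["temp_c"] is None]

# Question 2: correlating cold weather with shopping -- filter the nulls
# out, or decide the source is too sparse and seek an alternative.
usable = [r for r in readings if r["temp_c"] is not None]

print(f"null rate: {null_rate:.0%}, failures: {len(failures)}, usable: {len(usable)}")
```

Neither interpretation is "right" in the abstract; the profiling number only becomes a quality judgment once the business question is fixed.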
The starting question, driven by business requirements, makes all the difference. If the CIO took it into her hands to make upfront assumptions about relevance and trustworthiness, would I even have this set of ‘incomplete’ data? Or would it be marked as ‘bad’, suggesting that I avoid it without even assessing whether it answers my questions? Data catalogs, whether incorporating automated or user-driven context, need to avoid embedding judgmental assumptions. More important is the ability to incorporate connections and relationships that may be hidden within the larger metadata content.
As noted in the article, The Five Elements of a Data Scientist’s Job: “The reason data scientists need to be involved in this process is to keep the project on track, so that everyone on the team knows which data to include in the work. Otherwise, [Scott Nicholson, the chief data scientist at Accretive Health] said, engineers on a big data project ‘are dumping whatever they want into a Hadoop cluster and downstream as a data scientist you are trying to consume this stuff, and you have no idea what it means. And you realize you have to go find the engineer and clean it up. It’s just a mess. So the lesson is to have your data scientist sit next to the people who are logging the data.'”
You’ve now identified which peak in your fitness landscape appears optimal to work with. Now you can hone your data and evaluate it towards your business decisions.
Preparing Data: Fitting your Purpose
Through awareness, curation, and discovery, you’ve identified those sources of data most likely to be useful and relevant to your business problem. Discovery helps ascertain the steps you need to make that data fit for purpose. Now you need to prepare it. And that’s still a sizable task, often estimated at 50-80% of a data scientist’s time.
Preparation of data is an evolutionary process. It’s likely that additional data will be needed and must be evaluated alongside what you already have. It typically requires iterations to filter what’s needed or not needed, join data sources together, and assess results. The outputs of such work may be processes and data sets that meet ad hoc needs or that drive ongoing business decisions. Establishing metrics at this point is useful to measure how well the data meets your needs – and these metrics can become the measures used to track the data over time. Ideally, the steps taken and the metrics applied become part of the catalog of data that can give added value to other users. As the IBM Think Leaders noted, “companies will prioritize data sets that provide a real competitive advantage.”
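The iterate–filter–join–measure loop can be sketched in miniature. The datasets, dates, and the completeness metric below are all invented for illustration:

```python
# Hedged sketch of iterative preparation: join, filter, then compute a
# fit-for-purpose metric that can later track the data over time.
# Contents are invented sample data for the cold-weather/shopping example.
weather = {"2016-01-04": -5.0, "2016-01-05": None, "2016-01-06": 2.0}
orders = {"2016-01-04": 120, "2016-01-05": 95, "2016-01-06": 80}

# Iteration 1: join the two sources on date.
joined = [
    {"date": d, "temp_c": weather.get(d), "orders": n}
    for d, n in sorted(orders.items())
]

# Iteration 2: filter rows unusable for the cold-weather question.
prepared = [row for row in joined if row["temp_c"] is not None]

# Metric: completeness of the prepared set versus the candidate set.
# Recorded in the catalog, this same metric can monitor the feed over time.
completeness = len(prepared) / len(joined)
print(f"rows kept: {len(prepared)}/{len(joined)} (completeness {completeness:.0%})")
```

The useful artifact is not just the prepared rows but the recorded steps and the metric itself, which other users of the catalog can reuse.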
The Evolving Landscape
The data is not the end goal, though. You’ve used the data to address a business problem. It’s important to assess not only the data, but whether the business problem has been resolved. On the fitness landscape, this is where you look to see if you’re on the right peak. Have you asked the right questions upfront? Or has the business problem changed in a way that the questions must be asked anew, possibly requiring different data? In an era of rapid business change and disruption, we see whole-scale changes to both underlying business questions and available data.
It is an ever-evolving data landscape in the quest to survive as a business.