Relevant Data…Survival of the Fittest

What’s relevant?  What data do you need?  What data will survive???

A recent post by Tim Crawford on LinkedIn proclaimed:  2016 is the year of data and relevance.  The topic was echoed in a blog by Myles Suer: 2016: The Year of Data and Relevance.  Get the right data at the right level at the right time for the right decisions!  That looks like the central tenet of Big Data.  And now that we have Big Data, what can we do with it to make better business decisions to drive value, reduce costs, or minimize risks?  In my last post, I looked at a set of predictions for 2016 by the IBM THINK Leaders: Six data-driven leadership trends for 2016.  Two of the predictions were:  “Survival of the Fittest Data” and “Hyper-Relevancy”.  It does seem to be a year for Data Relevance!

Myles Suer in his blog states: “CIOs need to create relevant and trustworthy data.”  That’s a statement I disagree with at the broad level.  It implies that the CIO will know and understand the needs of and relevance for each and every business user.  That’s not a fair position to put the CIO in, and I think it moves in the wrong direction.

In my years working with data quality, integration, and governance, I’ve seen a lot of discussion around whether the quality of data was inherently ‘good’ or ‘bad’, with a lot of measures, including assessments of trust, to determine which side it fell into.  But I’ve increasingly felt those criteria were off target.  To me, data quality is about “Does it fit your needs?”  If it does, and you trust that it consistently does, then it’s “good”; if not, then it’s not.  I end up coming back to the question:  “What problem are you trying to solve?”

The patterns and concepts of the Fitness Landscape, as expressed in the fields of evolutionary biology, seem particularly relevant.  “Fitness landscapes are often conceived of as ranges of mountains. There exist local peaks (points from which all paths are downhill, i.e. to lower fitness) and valleys (regions from which most paths lead uphill)…. An evolving population typically climbs uphill in the fitness landscape.”

Populations evolve upwards, and with each upward move the number of options available decreases.  It’s also very hard to move back down to take another path.


Figure 1: Copyright Randy Olson, 2013.

In working with data, you have a lot of potential initial options, but some may not be accessible or understandable, and you’ll make particular decisions with what you do have and know that bring you to a given peak.  It may be locally optimal based on what you have, but could be an incredibly poor choice across the broader landscape, effectively leading you to poor or misguided business decisions.

And that’s a starting point to look at whether specific sources of data are:

  1. Available, accessible, and possibly useful to your purpose;
  2. Relevant for your problem;
  3. Fit for your purpose; and
  4. Able to be incorporated into your solution.


Cataloging Data:  Awareness and Identity

You can create the most consistent, trusted, relevant source of customer data possible, but if the business units working with customer insights do not have awareness of it, it never comes into view.  Similarly, there may be great sources of data relevant to customer demographics or geographic locations, but without awareness of such content, that data will not come into the picture.   It may be relevant, but it is unknown.

It’s not enough to have a database or file name.  I may know that directory X contains the following files:


It looks like there’s some data on a clinical study, but whose?  There’s some data about facilities, but are they related to the clinical study, or are they something else?  What are these associated with?  Where did they come from?  Is October 19, 2015 the date they were created or the date added?

Without some organizing principles applied, such data is like a random fitness landscape where peaks and valleys are close together.  It is a landscape where natural selection (i.e. finding relevant data) cannot work.


A much better starting point for the CIO is to create a relevant and trustworthy Catalog of the data.


Curating Data: Description, Provenance, Use

More than just naming a dataset or database, it’s important to understand the basics about a given source of data.  What is it?  What does it relate to?  Who created it?  How old is it?  How did it get into its current form?  How should I use it?

These are the questions that need to be answered for effective data curation.  Building from a catalog of data, the catalog needs to be enriched with metadata.  Consider the data available at Data.gov, home of the U.S. Government’s open data.  Without information about the data, who can tell what may or may not be relevant?
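To make this concrete, here is a minimal sketch of what an enriched catalog entry might hold.  The class, field names, and sample values are hypothetical illustrations, not any particular catalog tool’s schema:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """A minimal, hypothetical catalog record: a name alone is not enough."""
    name: str
    description: str                           # What is it?
    source: str                                # Who created it? Where did it come from?
    created: str                               # How old is it?
    tags: list = field(default_factory=list)   # supports browsing by topic
    usage_notes: str = ""                      # How should I use it?

# An illustrative entry for a downloaded storm-events file
entry = CatalogEntry(
    name="storm_events.csv",
    description="Storm event reports with location, type, and damage estimates",
    source="National Oceanic and Atmospheric Administration",
    created="2015-10-19",
    tags=["climate", "coastal flooding"],
    usage_notes="Damage figures are estimates; see the source documentation",
)

# Tags let a user narrow the landscape to possibly relevant data sets
catalog = [entry]
relevant = [e for e in catalog if "coastal flooding" in e.tags]
```

Even this small amount of description, provenance, and usage guidance turns an opaque file name into something a business user can judge for relevance.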

But I can fairly readily browse topics or data sets to seek relevant information.  I may be trying to determine risk factors for home insurance evaluation or the likelihood of requiring certain supplies in a retail food chain.  Climate data is relevant to each, and perhaps data on coastal flooding would be of use.  By taking advantage of tags, I get to a set of 63 data sets that may be of immediate relevance.


Each data set has been curated to some level.  It’s described.  It indicates who created it.  I can see the formats available, and even how many people have recently viewed it!  It might be of further use to see how well other users like the data set or what their comments are.

Data Curation is a collective activity.  Some content may be gathered via metadata management tools, and other information such as where it came from might be automatically collected as data lineage.  But if someone downloads the NCDC Storm Events database to a local directory, who knows where it came from?  Is the information that it came from the National Oceanic and Atmospheric Administration available?  It is this added metadata that gives me higher trust in a given data source, and this content should be available in any decision-making process.

From the perspective of the fitness landscape, you are reducing the set of random peaks and valleys to a set from which you can effectively start exploring and evaluating for best fit.


Evaluating Data: Exploration and Discovery; Or what’s really here?

Whether through manual investigation or the use of automated techniques for data profiling, text analytics, or other data discovery and exploration tools, evaluation has centered on various criteria and measures, often with pre-set assumptions.  Looking at an attribute in a given set of data, I may determine that 80% of the attribute is null or blank – it’s incomplete.  Is that bad?  Does it change the relevance of the data?  Maybe.  Maybe not.

If it is data from a sensor, that incomplete data could have different meanings depending on context.  If I’m studying the rate of sensor failure, then that null and blank content may be critical!  That ‘incomplete’ data is, in fact, completely relevant and fit for my purpose.  If I’m instead assessing the correlation of cold weather with online shopping and my weather sensors have a lot of missing data, then I may determine either a) the data is still relevant, but I need to filter out those records with incomplete content; or b) the data would be relevant if available, but this data set is not reliable, so I need to correct it or find an alternative to address the gaps.

The starting question, driven by business requirements, makes all the difference.  If the CIO took it into her hands to make upfront assumptions about relevance and trustworthiness, would I even have this set of ‘incomplete’ data?  Or would it be marked as ‘bad’, suggesting that I avoid it without even assessing whether it answers my questions?  Data catalogs, whether incorporating automated or user-driven context, need to avoid embedding judgmental assumptions.  More important is the ability to incorporate connections and relationships that may be hidden within the larger metadata content.

As noted in the article, The Five Elements of a Data Scientist’s Job: “The reason data scientists need to be involved in this process is to keep the project on track, so that everyone on the team knows which data to include in the work. Otherwise, [Scott Nicholson, the chief data scientist at Accretive Health] said, engineers on a big data project ‘are dumping whatever they want into a Hadoop cluster and downstream as a data scientist you are trying to consume this stuff, and you have no idea what it means. And you realize you have to go find the engineer and clean it up. It’s just a mess. So the lesson is to have your data scientist sit next to the people who are logging the data.'”

You’ve now identified which peak in your fitness landscape appears optimal to work with.  Now you can hone your data and evaluate it towards your business decisions.


Preparing Data:  Fitting your Purpose

Through awareness, curation, and discovery, you’ve identified those sources of data most likely to be of use in, and relevant to, your business problem.  Discovery helps ascertain the steps you need to make that data fit for purpose.  Now you need to prepare it.  And that’s still a sizable task, often estimated at 50-80% of a data scientist’s time.

Preparation of data is an evolutionary process.  It’s likely that additional data is needed and must be evaluated together.  It typically requires iterations to filter what’s needed or not needed, join data sources together, and assess results.  The outputs of such work may be processes and data sets that meet ad hoc needs or that drive ongoing business decisions.  Establishing metrics at this point is useful to measure how well the data meets your needs – and these metrics can become the measures used to track the data over time.  Ideally, the steps taken and the metrics applied become part of the catalog of data that can give added value to other users.  As the IBM THINK Leaders noted, “companies will prioritize data sets that provide a real competitive advantage.”
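Establishing such metrics can be as simple as a few functions run on each refresh, with the results recorded back into the catalog.  This sketch uses made-up customer records and a hypothetical 60% threshold purely for illustration:

```python
def completeness(records, field_name):
    """Fraction of records with a non-empty value for the given field."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field_name) not in (None, ""))
    return filled / len(records)

# Illustrative prepared data set
customers = [
    {"name": "Acme Corp", "postcode": "02139"},
    {"name": "Globex",    "postcode": ""},
    {"name": "Initech",   "postcode": "90210"},
]

# Metrics defined during preparation become the ongoing tracking measures
metrics = {"postcode_completeness": completeness(customers, "postcode")}

# A fitness-for-purpose check against a threshold the business chose
fit_for_purpose = metrics["postcode_completeness"] >= 0.6  # hypothetical threshold
```

Re-running the same metric on each new delivery of the data turns a one-time preparation decision into an ongoing measure of whether the data still fits its purpose.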


The Evolving Landscape

The data is not the end goal, though.  You’ve used the data to address a business problem.  It’s important to assess not only the data, but whether the business problem has been resolved.  On the fitness landscape, this is where you look to see if you’re on the right peak.  Have you asked the right questions upfront?  Or has the business problem changed in a way that the questions must be asked anew, possibly requiring different data?  In an era of rapid business change and disruption, we see whole-scale changes to both underlying business questions and available data.

It is an ever-evolving data landscape in the quest to survive as a business.


Information Landscape 2.0

Welcome to Patterns in the Information Landscape!

I’m Harald Smith, long-time practitioner and product strategist across the information integration, quality, and governance domains, and currently Director, Product Management at Trillium Software.   Thoughts and opinions here are my own.


Blog 2.0

While not new to blogging, this is a blog relaunch, a 2.0 version of my earlier blog, Journeys in the Information Landscape, which went dormant amidst a job change and other life events.  It is also a rename of the blog: it’s both a call-out to my earlier book Patterns of Information Management (co-authored with Mandy Chessell and published by Pearson/IBM Press) and its associated IBM developerWorks community, Patterns of Information Management (which focused on topics from our book), and a focus on the patterns that feature in, drive, and emerge when working with information management and governance.

My themes remain largely the same:

  • How do we obtain and effectively use data and information for business decisions and processes?
  • How do we ensure such information fits its purpose by providing the right data with the right quality at the right time?  (Assuming we can even identify what that means!)
  • How do we govern and protect that data and comply with policies while making such information available to the consumers who need it?
  • How do we establish effective use and reuse of the data, processes, and tools involved with these whether through best practices, methods, and approaches; managing the design and delivery of data products for specific needs; or just responding to questions and issues?
  • And coupled with the above, are there insights and patterns whether drawn from math and science, history and art, nature or games, or other domains that may help answer those questions?


Into the New Year: thinking about data-driven trends for 2016

A new blog, a new year, what’s better than to consider some predictions for the information landscape for 2016!  One set of predictions that recently caught my eye was from the IBM THINK Leaders: Six data-driven leadership trends for 2016.  Their first prediction was on the continued advancement of data-powered personalization.

Data-powered personalization.  “Businesses will leverage user data at key touchpoints to guide product development and personalize customer journeys.”

There’s a difficult dichotomy with this.  If you do it right, the consumer is happy.  If you do it wrong, the consumer is left annoyed, frustrated, irritated, or outright upset.  Within the last month, a friend posted a picture of a rabbit on Facebook.  It was a classic Monty Python moment, looking much like the Killer Rabbit, and I responded accordingly.  It was not long before Facebook gave me an ad for a Monty Python t-shirt.  Talk about data-powered personalization!

  • Yes, it was data-driven – I did reference Monty Python.
  • Yes, it was directed to me, so certainly personalized.
  • Yes, it was driven by a business seeking to increase its sales and revenue by taking advantage of Facebook’s data.
  • Was it relevant?  No.  I don’t shop via Facebook.
  • Was it annoying?  Yes.  Although I’m well aware that Facebook seeks ways to make money, I do not enjoy the constant bombardment with ads there or elsewhere across the internet.
  • Was it creepy?  Yes, it got close to the creepiness factor: that sense of being watched in everything you do.
  • Will I stop using Facebook?  Not yet, but as I’ve already experienced with radio and TV where the saturation of ads is now extreme,  there is a tipping point at which I discontinue use of a given media or source.

It’s a challenge to find the right balance.  As a product manager, I face this challenge with each decision about the user experience or a given user interface:  what’s useful and important to the user, what makes a user happy with the interface and the results, what frustrates the user and how do I resolve those pain points?

We face similar issues as we work with data integration and preparation tools to build out and provide data sets for business processes and analytics.  Whether for a targeted marketing campaign or a next best action at a call center, we need to think carefully about the data at hand.  Where did it come from?  Did the provider of the data really understand they were ‘authorizing’ its use?  What happens when that data is joined to other data?  How would you respond if it was your personal data?

My former colleague and co-author Mandy Chessell wrote a concise and useful whitepaper on this topic, Ethics for big data and analytics.  It’s an important topic that overlays everything we do and consider when working towards data-powered personalization.


In subsequent posts, I’ll consider some of the other data-driven trends for 2016.  If there are specific topics of interest to you regarding information management and governance, please let me know and I will see if I can address them.

So, a new year, a new blog home, a relaunched blog, and familiar topics in the information landscape.  At least that is the premise!   And as always, the postings on this site are my own.