Executive Summary
The terms big data, data science, machine learning, and real-time analytics have invaded the vernacular of just about every executive and manager worldwide. Regardless of which buzzwords you favor, the concept is essentially the same: leverage your data more effectively to generate more business value. Ironically, even organizations that spend millions of dollars on the latest infrastructure and software often find it difficult or cost prohibitive to employ data-driven solutions. The reason is simple: data growth frequently outpaces data capabilities. When an increase in data volume is paralleled by a rise in the number of data backends, data formats, data structures, domain objects, and so on, organizations reach an inflection point, creating disjointed collections of data and people and making it ever harder to connect the dots.
This whitepaper introduces a novel concept called data culture and illustrates how the most common data challenges facing organizations typically stem from a lack thereof. Each obstacle will be described in detail, and we’ll conclude with a discussion of how organizations can encourage and refine their data culture. While some managers may be tempted to say there is not enough time for this sort of work, that statement could not be further from the truth. Imagine how much more efficiently your organization could leverage data if each point of friction were removed. Analyses and reports that once took days to generate could be completed in minutes with minimal expertise. Development teams previously reluctant to consume other teams’ data could become heavily integrated, forming a tight-knit technical ecosystem. To say there is no time to pursue such an ideal, to embrace inefficiency in the name of progress, is simply absurd.
What is Data Culture?
Data culture is what separates organizations that simply use their data without a consistent strategy from their truly data-driven competitors. Let’s begin with a definition of the broader term organizational culture. Former MIT professor Dr. Edgar Schein offers one of the better definitions of organizational culture I’ve come across; he defines it as:
…comprising a number of features, including a shared pattern of basic assumptions which group members have acquired over time as they learn to successfully cope with internal and external organizationally relevant problems.
To paraphrase, organizational culture is developed over time and implicitly dictates who can do what, and how it should be done, to adhere to the established norms of the organization. Data culture is quite similar: it is organizational culture as it pertains to an organization’s handling of data. Here at Raft, we formally define data culture as follows:
Data culture encompasses a shared set of assumptions and philosophies pertaining to an organization’s use of data, refined through successes and failures, enabling an organization to efficiently leverage internal and external data through the decision-making process.
An Organization Divided
Imagine an organization with two development teams: Team A and Team B. Team A has an application which publishes messages into a data stream (Kafka, for example). Team B wishes to perform an analysis on this data and asks Team A where said data is stored. Team A says that their Kafka topic only has a two-week retention period and that it is not their responsibility to persist the data; they simply ensure the data stream is consistently generated. Team B disagrees. Management might not know what to think. In the absence of a well-defined data culture, problems such as this will occur repeatedly. These kinds of problems drastically increase the amount of time required to make decisions and distract from meaningful work.
Understand Data Facets
It is helpful to think of data culture as being composed of the following facets:
- Generation – Data is created and collected in myriad ways. Equally numerous are the philosophies surrounding how this should be accomplished. What naming conventions, data structures, and file types will we use? Should timestamps be represented in ISO-8601 or use a milliseconds epoch representation? Without any baselines or best practices in place, each team is forced to reinvent the wheel. This is often the crux of many data analysis problems.
- Consistency – When a team decides to make changes to their data, is there an established protocol that must be followed (e.g., Protobuf, JSON, Avro)? Is backwards compatibility a requirement? How do we let consumers know about said changes? Do we even know who is consuming our data? Without established protocols and practices, making changes to upstream data is like flicking the bottom card in a house of cards (see the schema evolution sketch following this list).
- Documentation – When someone in your organization wants to know the interpretation or meaning of a particular field within a data set, where do they go? Oftentimes, this information exists only inside the head of a developer, without any corresponding written documentation. This spawns a tedious and error-prone game of telephone across chat, email, and Zoom to track down information that should have taken minutes to find. The next individual with the same question must repeat this dance all over again.
- Infrastructure as Code (IaC) – This aspect of data culture walks the line between engineering and documentation. Imagine your colleague shows you a chart depicting space debris trends and forecasts. You wish to repeat their analysis, but to accomplish this you need to manually download various data sets, configure a local database, install Python 3.8 and various packages, and then, finally, hope their script runs on your laptop. Even if this process is well documented, there’s no way to guarantee it will run on your laptop, your colleague’s laptop, and the laptop of the person down the hall. If the entire process were packaged alongside all its dependencies, it would be far easier to share. In short, data becomes more valuable faster, and to a wider audience, when packaged along with the infrastructure needed to put it to use.
- Storage – Where, if anywhere, does the data end up, and who is responsible for this? Recall our example above: Team A believed their responsibility ended at producing the data. Team B assumed Team A should also be responsible for storing their data. Regardless of which team is correct in this matter, where should the data go? Do we need to spin up a SQL database somewhere? Do we do away with the two-week retention period and store all the data in Kafka? Is every team forced, yet again, to reinvent the wheel and architect their own solution?
- Access Control – In a disparate data environment, access control typically poses its own slew of challenges. Not only do data backends themselves need to be managed, but access to each system must also be managed. If your organization is utilizing different mechanisms to control who has access to what, you’ve got a data culture problem.
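To make the Consistency facet concrete, below is a minimal sketch of backwards-compatible schema evolution. It uses Avro via the fastavro Python library, which is a tooling assumption on our part, and the Order record and its fields are hypothetical. The key idea: new fields added with defaults let consumers on the new schema keep reading messages written with the old one.

```python
# A minimal sketch of backwards-compatible schema evolution with Avro.
# fastavro is a tooling assumption; the Order record is hypothetical.
import io
from fastavro import parse_schema, schemaless_writer, schemaless_reader

# Version 1 of the schema, as originally published by the producing team.
schema_v1 = parse_schema({
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
})

# Version 2 adds a field WITH a default, keeping old messages readable.
schema_v2 = parse_schema({
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "currency", "type": "string", "default": "USD"},
    ],
})

# Serialize a message with the old schema...
buf = io.BytesIO()
schemaless_writer(buf, schema_v1, {"id": "A-1", "amount": 42.0})
buf.seek(0)

# ...and read it back under the new schema; the default fills the gap.
record = schemaless_reader(buf, schema_v1, schema_v2)
print(record)  # {'id': 'A-1', 'amount': 42.0, 'currency': 'USD'}
```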
A strong data culture emerges when organizations treat each data facet as part of a holistic data conversation. When this occurs, the enterprise becomes well-positioned to integrate existing tools, systems, and human resources into a refinery for transforming their raw streams of data into rocket fuel for running the business. The first step is to identify and remove the obstacles in your organization that can prevent your efforts to derive more value from data from getting off the ground.
The Top Three Obstacles to Maximizing Data Value
We’ll now discuss the most common obstacles organizations face when they endeavor to utilize their data and examine how these issues persist in the absence of a well-defined data culture. Employees often suffer in silence, so problems go unnoticed by company leaders, who grow increasingly frustrated with the organization’s lack of progress without understanding the underlying cause.
Obstacle One: Data Location
I’ve lost track of how many emails and chat messages I’ve sent asking where I can find a certain data set. The process is almost always the same. I begin by messaging a handful of my closest data comrades, who typically reply with something along the lines of “I’m not sure. Try reaching out to so and so.” With a bit of luck, patience, and tenacity you might find the data you were looking for. If the task you’re working on requires you to synthesize multiple data sets, you may need to repeat this process for each and every one. Worse yet, the next individual looking to leverage the same data sets will likely need to repeat this process all over again.
The larger the organization, the more challenging and time-consuming this becomes. Often, teams just settle for leveraging only the data they directly produce and manage. Synthesizing data across teams, divisions, etc. is viewed as too costly from a time and development standpoint. Alternatively, an individual in a large organization may not even know a certain data set exists. As a result, many organizations will make important decisions without a complete view of the data. For example, I once worked with a company that performed an analysis on their supply chain data and concluded that purchasing more trucks would enable them to move more inventory. Upon receiving 500 shiny new trucks, they realized they didn’t have anywhere to park them. Had it been easier to synthesize supply chain data with corporate real estate data, this problem could have been easily avoided.
Why a Data Lake is Not the Only Answer
To solve the disparate data problem, many organizations have fallen victim to the trap of blindly implementing a centralized data lake architecture. If done properly, and at the right time, this approach can work quite well. However, those organizations lacking an established data culture will quickly find their data lake turns into a data swamp.
Without proper documentation, consistent naming conventions, and standardized domain object representations, the question “Where can I find data on such and such?” merely becomes “Where in the data lake does data on such and such reside?” If data was challenging to find prior to a data lake migration, the data lake is not going to help. Additionally, each data backend must be consistently ingested into the data lake to ensure synchronization. Without a data change protocol, the slightest tweak to any of the upstream data sources will cause this ETL process to fail. This is only a sampling of the problems that commonly result from a premature push to a centralized data architecture. Organizations lacking data culture cannot hope to substitute technical architecture for unified data standards and practices.
The Little Data Engine that Could
If your organization is not ready for a data lake, what can be done instead? An internal data search engine (e.g., a metadata platform) is one of the most important tools an organization can possess. The idea is quite simple. Suppose we have a team that generates a stream of data. This team will bestow the honor of data owner upon a team member (inevitably the new guy), whose job is to register the data set in a centralized data catalog. This catalog contains, at minimum, information such as the data origin (Kafka, SQL, etc.), schema, field descriptions, and a high-level overview of the data set as a whole. Once populated, anyone in the organization can use this tool to search for data. This enables producers and consumers to quickly find one another, and the existence of a data owner fosters accountability. Such a platform enables data culture to grow organically through these interactions. When your organization is ready to make the move towards a centralized data ecosystem, this tool can be easily extended.
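As a loose illustration, the sketch below shows the kind of record a data owner might register, along with a trivial search over the catalog. All names and values are hypothetical; real metadata platforms such as DataHub or Amundsen capture similar information through their own APIs.

```python
# A hypothetical catalog entry; names and values are illustrative only.
catalog_entry = {
    "name": "orders-stream",
    "origin": "Kafka topic orders.v1",
    "owner": "data-owner@team-a.example.com",
    "overview": "One event per completed customer order, produced by Team A.",
    "schema": {
        "id": "string - unique order identifier",
        "amount": "double - order total in USD",
        "created_at": "string - ISO 8601 timestamp of order completion",
    },
    "retention": "14 days",
}

def search_catalog(catalog, term):
    """Return entries whose name or overview mentions the search term."""
    term = term.lower()
    return [entry for entry in catalog
            if term in entry["name"].lower()
            or term in entry["overview"].lower()]

# Anyone in the organization can now discover the data set and its owner.
print(search_catalog([catalog_entry], "order"))
```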
Obstacle Two: Data Synthesis
To put it simply, synthesizing data from multiple sources in organizations without a strong data culture often requires hours of busy work before the real work can begin. Transforming datasets into useful analysis material typically involves a sequence of ingesting, joining, filtering, cleaning, parsing, and munging. While tools such as Tableau and Microsoft’s Power BI claim to make this process simple, their usefulness only extends so far. Data scientists and analysts within an organization can waste countless hours implementing custom data ingestion scripts to accomplish something resembling a SQL join query. A mature data culture can significantly ease the development burden of synthesizing data.
Size Matters
Before beginning any sort of analysis, we need to ask ourselves a very simple question: where are we going to put the synthesized data set? If the upstream data sets are small, we can likely get away with pulling the data into memory and constructing the aggregate data set(s) on a laptop. If the data is a bit more sizeable, maybe we can leverage a library such as Dask in Python to utilize the hard drive, enabling us to work with datasets that are larger than memory. But what if the dataset doesn’t even fit on your hard drive? An established data culture can provide guidance for situations such as these and help identify architectural limitations.
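As a loose sketch of that middle case, here is what out-of-core processing with Dask might look like; the file paths and column names are hypothetical.

```python
# A minimal out-of-core sketch with Dask: files are read in partitions,
# so the full dataset never has to fit in memory. Paths and column names
# are hypothetical.
import dask.dataframe as dd

events = dd.read_csv("data/events-*.csv")              # lazy, partitioned read
totals = events.groupby("product_id")["amount"].sum()  # still lazy
result = totals.compute()                              # runs partition by partition
print(result.head())
```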
But first, standards
Let’s assume the data we wish to synthesize will fit into memory and we do not need to concern ourselves with any external data architecture. If common standards regarding domain object representation, timestamp formatting, naming conventions, etc. have not been established, this task is likely to be quite difficult. Every table might have an Id field, but there may or may not be any inherent relation between the Id columns of different tables. Perhaps one of the tables has a column called D_Ser3 which looks like it corresponds with column ProductNumber in another table, but you aren’t really sure. Furthermore, you really want to analyze this data on a daily basis, but one table uses an ISO 8601 timestamp (2021-04-18T17:25:42+0000), another is using a UNIX Epoch timestamp (1618766842), and yet another is using a home-brewed timestamp format rounded to the nearest 15 minutes (2021/04/18 01:15:00 PM). To accomplish the seemingly simple task of joining a handful of tables in memory, you not only need to figure out how they relate to one another, but you’ll need to spend a bunch of time and effort standardizing data formats, naming conventions, timestamp representations, etc., all before arriving at the few lines of code where you perform the join. Had all this been standardized beforehand, the analyst could have gone into Tableau, Power BI, or Apache Superset, made use of the user interface to join the tables together, and had a pretty dashboard up and running in a matter of minutes.
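To give a flavor of that busy work, here is a minimal sketch of reconciling just two of those timestamp formats before a join. pandas is an assumed tool here, and the column names, values, and the D_Ser3/ProductNumber relationship are hypothetical.

```python
# A sketch of normalizing mismatched timestamp formats before a join.
# Column names, values, and the D_Ser3 == ProductNumber relationship
# are hypothetical.
import pandas as pd

orders = pd.DataFrame({"ProductNumber": [7],
                       "OrderTime": ["2021-04-18T17:25:42+0000"]})
shipments = pd.DataFrame({"D_Ser3": [7], "ShipTime": [1618766842]})

# Normalize both representations to timezone-aware timestamps.
orders["OrderTime"] = pd.to_datetime(orders["OrderTime"],
                                     format="%Y-%m-%dT%H:%M:%S%z")
shipments["ShipTime"] = pd.to_datetime(shipments["ShipTime"],
                                       unit="s", utc=True)

# Only after some detective work do we learn D_Ser3 matches ProductNumber.
merged = orders.merge(shipments, left_on="ProductNumber", right_on="D_Ser3")
print(merged)
```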
If this sounds like your organization, a great place to start building data culture is through standardizing the representation of timestamps across technical teams. Begin with a small working group; perhaps just one individual from each of a handful of teams. By collaboratively coming to an agreement, documenting the decision, and sharing the findings, these individuals will have made a huge step in laying the foundation of the organization’s data culture.
Obstacle Three: Data Regeneration
Repeating an analysis or regenerating a report typically requires starting from scratch. Sharing a chart is not the same thing as sharing an analysis. Effective data culture, complemented with sound technical architecture, enables anyone in the organization to easily regenerate and extend the work of their peers with minimal effort. Isaac Newton once said, “If I have seen further than others, it is by standing upon the shoulders of giants.” Organizations that can easily and effectively leverage prior work are infinitely more productive than their counterparts who need to recreate everything from the ground up.
Your data has baggage
It’s tempting to prematurely conclude that anyone can regenerate a chart, a metric, etc. simply by running the same scripts as the author. However, most practitioners are likely to run into problems. The author of a chart may have been leveraging local directories, specific software packages, and, perhaps, manual processes to extract and store data. Even if written documentation is plentiful, inconsistencies and problems will surely arise. Troubleshooting these problems can sometimes take longer than it would have taken to recreate the entire process from scratch.
Admittedly, this problem is as much a data culture problem as it is an infrastructure problem. However, it’s important to realize that these concepts do not exist independently. An organization focusing exclusively on implementing the latest technical infrastructure while neglecting data culture will inevitably fall short of their goals.
All together now
Replicating the work of others is rarely trivial, even more so when data is involved. This is made easier by bundling the data ingestion processes, software dependencies, directory structure, and scripts into a standalone entity. Containers (images) and container orchestration (Kubernetes) complement data culture beautifully. Organizations that supplement their written documentation with infrastructure as code will experience exponential gains in productivity. Analyses, metrics, charts, and reports can all exist as packaged entities ready to serve as building blocks for the next initiative. As an added benefit, the increased transparency in how data is used will facilitate conversations and collaboration, further refining and solidifying the organization’s data culture.
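As a loose illustration, suppose the space debris analysis from earlier had been packaged as a container image. The sketch below uses the docker-py SDK, a tooling assumption on our part, and the image tag and build path are hypothetical; any colleague could rebuild and rerun the analysis without reproducing the author’s laptop setup.

```python
# A minimal sketch using the docker-py SDK (a tooling assumption); the
# image tag and build context path are hypothetical.
import docker

client = docker.from_env()

# The build context holds the Dockerfile, scripts, and pinned dependencies:
# everything needed to reproduce the analysis.
image, _ = client.images.build(path="./space-debris-analysis",
                               tag="space-debris-analysis:1.0")

# Run the analysis exactly as the author did; no local setup required.
logs = client.containers.run("space-debris-analysis:1.0", remove=True)
print(logs.decode())
```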
Developing and Refining Data Culture
Establishing, developing, and refining data culture is, and will always be, an ongoing process. Regardless of how far along the journey your organization is, the various approaches all revolve around a central theme: communication. When teams and individuals operate in isolation, they make decisions they deem the most logical or the most convenient. These decisions may not work for the rest of the organization. For example, a developer may elect to store data generated by their application in a NoSQL database because it is easier, for whatever reason, than storing it in a more traditional relational database such as PostgreSQL. Meanwhile, the downstream consumers are frustrated because this particular dataset is significantly more difficult to consume than other data sources; everything else uses PostgreSQL. All the while, the developer producing this data is likely to be completely unaware of the exasperation they are causing their colleagues. Had a conversation taken place between data producer and data consumer before, during, or even after this decision was made, the problem could have been avoided entirely, course corrected, or fixed. Of course, the earlier this conversation takes place, the better, but it is important to realize that it’s never too late to begin dismantling communication silos.
Hello Human
How does one begin developing data culture? Or, if your organization already has an established data culture, what can be done to ensure continued development and refinement? The answer lies in facilitating targeted conversation. What data challenges are most pervasive within your organization? Are most individuals struggling to join disparate data? Are they constantly fighting with inconsistent formatting? Ask around. You’ll probably be amazed at how forthright people usually are when it comes to what makes their jobs difficult.
Raft has decades of combined experience in developing, maintaining, and optimizing all aspects of data-driven workflows. Through our combination of technical expertise and user-focused design, we’ve helped public and private organizations of all sizes leverage their data in ways they never imagined, delivering highly impactful software solutions while simultaneously fostering the development of organizational data culture. Let’s talk!