Data Science Amid the Pandemic
Data science is having a moment. It’s not the first time, of course: 2008 and 2016 were other banner years for the discipline, when the public took an interest in using data to predict our collective fates. (Data turned out to be more effective at prediction in one year, as in “How Democrats Won The Data War In 2008,” than in the other, as in “How Data Failed Us in Calling an Election.”) In 2020, as governments work to mitigate the spread of COVID-19, the public has become even more acutely aware of the impact that data science has on all of our lives.
Along with this heightened awareness of data’s importance, the public is also seeing its many complexities. As people from all backgrounds and disciplines post epidemiology graphs and “R-nought” curves on social media, they’re also getting embroiled in disagreements about what this data means and which models to pay attention to.
Data scientists who work with big data are no strangers to these discussions, which have long been a regular occurrence in the enterprise. They know that disagreements are inevitable, and even necessary for developing more accurate models, as long as the discussions stay collegial.
Innovation Through Collaboration
As Sarah Callaghan writes in the journal Patterns, “I would urge all data scientists wishing to help in these modeling efforts to not just simply grab the data and plug them into their preferred analysis software. The numbers that result can be terrifying, especially without the domain-specific knowledge that epidemiologists have to put it all into context.”
Callaghan encourages data scientists to join the Kaggle COVID-19 Open Research Dataset Challenge (CORD-19), a response to the White House Office of Science and Technology Policy's call to action to address high-priority COVID-19 questions. She adds that the Kaggle challenge is an opportunity “where we can all work together as a team and play to our respective strengths.”
For those in the enterprise, such collaborative initiatives are opportunities to learn what factors lead a group of people to consensus and actionable answers. If such endeavors can happen on a massive scale to tackle one of the toughest problems our world has encountered in a century, creating effective collaborative data policies and initiatives in the enterprise is within reach.
Here are some concrete lessons that organizations can learn from COVID-19 data science initiatives:
1. Gather All the Data
Collecting real-time data at the center of your organization on an ongoing basis isn’t a simple task. As the lack of available testing and the reliance on manually collected and coded data during the outbreak indicate, the infrastructure and processes you put in place to ingest large volumes of diverse data matter. Collecting data inaccurately can bias your models and lengthen the time it takes to gather samples large enough to analyze.
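To make the point about biased collection concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the population, the counts, and the names `Person`, `infection_rate`, and `build_population` are invented for illustration and do not come from any real COVID-19 data set. It compares the infection rate in a whole population with the rate measured only among people who show symptoms, which is roughly what happens when testing is scarce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    infected: bool
    symptomatic: bool


def infection_rate(people):
    """Return the share of the given people who are infected."""
    people = list(people)
    if not people:
        raise ValueError("cannot compute a rate for an empty sample")
    return sum(p.infected for p in people) / len(people)


def build_population():
    """Build a hypothetical population of 1,000 people (made-up numbers).

    100 are infected, 60 of them with symptoms. Another 40 people have
    symptoms from some other cause and are not infected.
    """
    population = []
    population += [Person(infected=True, symptomatic=True)] * 60
    population += [Person(infected=True, symptomatic=False)] * 40
    population += [Person(infected=False, symptomatic=True)] * 40
    population += [Person(infected=False, symptomatic=False)] * 860
    return population


population = build_population()

# Rate across everyone: what a complete collection process would see.
true_rate = infection_rate(population)

# Rate among symptomatic people only: what a collection process limited
# to those who get tested would see.
tested_only = [p for p in population if p.symptomatic]
biased_rate = infection_rate(tested_only)

print(f"true rate: {true_rate:.2f}, rate among tested: {biased_rate:.2f}")
```

With these invented numbers, the rate among tested people is several times the rate in the full population. A model fed only the tested sample would inherit that distortion.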
In the enterprise, it’s crucial to analyze all of your data, not just some of it. This principle was a driving force for us as we designed our hybrid cloud data analytics software Vantage to leverage 100% of a company’s data. We knew that this level of visibility would be the best way for enterprise leaders to see connections that couldn’t be identified otherwise.
2. Make Data Open and Accessible
Organizations and teams within the same enterprise are always going to be protective of their data, but when a global crisis threatens everyone’s lives and livelihoods, that territorial instinct fades quickly. How can you encourage similar levels of access and collaboration on business-critical projects, even without a global pandemic?
Making data accessible begins with your governance, which should do more than just ensure integrity and security. Your governance must be developed as part of your broader data analytics management strategy. Consider creating a layered data architecture that allows you to keep control of your metadata, such as your business rules and definitional criteria, while still freeing your people to access data in an agile manner. You can open up raw, unstructured data sets to the technical data scientist, for example, but create more structured and automated interfaces for the business analyst. Both roles will still have the autonomy they need to work with the same data and uncover insights, while your data security and integrity will stay intact.
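One way to picture such a layered arrangement is the small Python sketch below. It is an illustration under stated assumptions, not a description of any particular product: the role names, the `active_customer` rule, the sample rows, and the functions `raw_view` and `curated_view` are all hypothetical. The business rule is defined once in a shared metadata layer, the data scientist can read raw rows, and the business analyst gets a structured view derived from the same rows.

```python
# Shared metadata layer: each business rule is defined once, centrally.
# The rule and its threshold are hypothetical examples.
BUSINESS_RULES = {
    "active_customer": lambda row: row["orders_last_90d"] > 0,
}

# Raw layer: unmodified records, including free-text fields (made-up data).
RAW_ROWS = [
    {"customer_id": 1, "orders_last_90d": 3, "notes": "called support twice"},
    {"customer_id": 2, "orders_last_90d": 0, "notes": ""},
    {"customer_id": 3, "orders_last_90d": 1, "notes": "asked about pricing"},
]


def raw_view(role):
    """Return copies of the raw rows; only data scientists may read them."""
    if role != "data_scientist":
        raise PermissionError(f"role {role!r} may not read raw data")
    return [dict(row) for row in RAW_ROWS]


def curated_view(role):
    """Return a structured view built from the raw rows and shared rules."""
    if role not in {"data_scientist", "business_analyst"}:
        raise PermissionError(f"role {role!r} may not read curated data")
    is_active = BUSINESS_RULES["active_customer"]
    return [
        {"customer_id": row["customer_id"], "active_customer": is_active(row)}
        for row in RAW_ROWS
    ]


for record in curated_view("business_analyst"):
    print(record)
```

Because both views read the same rows and the same rule, changing the definition of an active customer in one place updates what every role sees, while the raw layer stays restricted.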
3. Encourage Community Feedback and Sharing
Even though participants in Kaggle’s CORD-19 Challenge are competing for prize money, they still openly discuss tools and approaches that others may find helpful as they develop their data science projects. The Kaggle participants are also submitting feedback regularly to the organizers about making the Challenge run more smoothly.
The enterprise can foster this same level of community and support by building a culture of continuous learning, where sharing ideas and working across functions are rewarded. At Teradata, we have a platform called Transcend that contributes significantly to our culture of collaborative learning. Our people use Transcend to safely experiment with our own corporate data and see what other teams have tried in the common endeavor to optimize our products and services for our customers’ needs.
4. Integrate and Share Data to Invite Discovery
Putting data in context is a crucial step on the path towards helping a community discover answers. Johns Hopkins University engineers understood this early on in the outbreak when they built the widely circulated COVID-19 Global Map showing real-time case data around the world.
Presenting data in the context of its scale, or in comparison with familiar anchor values, can reveal answers that data in isolation could never uncover. That’s why integrating data sets and types and giving data scientists the tools to visualize and communicate data context are so important. It’s through providing this context, often through data visualizations that are easy to understand at a glance, that data scientists have helped the public understand the virus’s threat and take action to slow its spread.
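A simple form of putting data in the context of its scale is normalizing raw counts by population. The Python sketch below assumes two invented regions with made-up case counts and populations; the function name `cases_per_100k` is likewise hypothetical. It shows how a ranking based on raw counts can reverse once the counts are expressed per 100,000 residents.

```python
def cases_per_100k(cases, population):
    """Express a raw case count as cases per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases * 100_000 / population


# Hypothetical regions: (reported cases, population). Numbers are invented.
regions = {
    "Region A": (5_000, 10_000_000),
    "Region B": (1_200, 400_000),
}

rates = {
    name: cases_per_100k(cases, population)
    for name, (cases, population) in regions.items()
}

# Rank by raw count, then by population-adjusted rate.
by_raw_count = sorted(regions, key=lambda name: regions[name][0], reverse=True)
by_rate = sorted(rates, key=rates.get, reverse=True)

for name in by_rate:
    print(f"{name}: {rates[name]:.1f} cases per 100k")
```

In this made-up example, Region A reports more cases in absolute terms, but Region B has the higher rate relative to its size, which is the comparison a reader needs to judge local risk.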
The Responsibility of a Data Scientist
While data scientists aren’t on the frontlines of this pandemic in the same ways that essential workers are, they still have a critical role to play, even a civic duty, in fighting it. Data scientists can apply their expertise in cleaning, integrating, modeling, and communicating about data to shed light on complex questions. Whenever data is a major sticking point (and it frequently is), the data scientist can remove roadblocks to understanding.
Enterprise leaders, in turn, have a responsibility to remove the roadblocks standing in the way of anyone in the organization who could discover answers in data. In a company with a thriving learning culture, once the path to making a meaningful difference is cleared, many people will take it.