It has become something close to common wisdom to describe cybersecurity as the biggest strategic challenge confronting the United States. Recent headlines – from the Justice Department’s indictment of four Chinese nationals for hacking, to the dramatic uptick in global ransomware attacks, to China’s alleged hack of Norway’s parliament – only reinforce the point. Indeed, in every year since 2013, the US Intelligence Community has ranked cybersecurity the number one threat facing the nation in its annual global threat assessment – only in 2021, at the height of a global pandemic, did the flawed state of our collective information security lose its top spot.
There is, however, one major fault with the commonly accepted wisdom about cybersecurity: It has a blind spot.
More specifically, traditional cybersecurity measures all too frequently fail to take into account data science and the security vulnerabilities that are unique to artificial intelligence systems. Put simply, the policies being developed and deployed to secure our software systems do not clearly apply to data science activities and the artificial intelligence systems they give rise to.
This means that just as we are collectively encouraging the adoption of technologies like AI – issuing policy after policy proclaiming the benefits of AI and data analytics – we are also seeking to secure software in ways that are fundamentally blind to the challenges these technologies create. This paradox is reflected not just in our policies but also in the software we are collectively adopting: We now spend more than ever on both AI systems and cybersecurity tools, and yet the state of our information security has never been worse. Every day, it seems, brings the announcement of a new vulnerability, hack, or breach.
The fact is that we cannot have both more AI and more security – at least not at the same time, and not without a change in the way we approach securing software and data. Analyzing a few of our most prominent blind spots is, we believe, the first step toward a solution – and it is what we hope to accomplish in this brief article.
Specifically, we take as our inspiration the Biden Administration’s recently released Executive Order (EO) on Improving the Nation’s Cybersecurity – a document that is ambitious and intelligently executed, but also symptomatic of the ways in which data science’s impact on cybersecurity is so often overlooked. Our goal is not to criticize the EO, which we believe to be a laudable attempt to improve our collective cybersecurity; but it does contain significant gaps, and discussing them will help drive future improvements. Ultimately, our hope is to help the right hand of cybersecurity, so to speak, develop a better understanding of what the left hand of data science is doing.
We begin with the idea of Zero Trust.
Zero Trust
How can you maintain security in an environment plagued by, and asymmetrically friendly to, threat actors? The current, widely accepted answer is to assume “zero trust” – a concept at the heart of the recently released EO – which requires assuming breaches in nearly all scenarios. Here is how the EO defines zero trust:
“…a security model, a set of system design principles, and a coordinated cybersecurity and system management strategy based on an acknowledgement that threats exist both inside and outside traditional network boundaries. The Zero Trust security model eliminates implicit trust in any one element, node, or service . . . In essence, a Zero Trust Architecture allows users full access but only to the bare minimum they need to perform their jobs. If a device is compromised, zero trust can ensure that the damage is contained. The Zero Trust Architecture security model assumes that a breach is inevitable or has likely already occurred, so it constantly limits access to only what is needed and looks for anomalous or malicious activity.”
What this means in practice is clear in the world of traditional software and traditional software controls: implementing risk-based access controls, making least-privileged access the default, embedding resiliency requirements into network architectures to minimize single points of failure, and more. As a concept, Zero Trust can be thought of as the culmination of years spent building IT infrastructure while watching attackers succeed.
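To make the contrast concrete, consider a minimal sketch of the kind of deny-by-default, least-privilege check that Zero Trust implies for traditional applications. The roles, resources, and policy table below are hypothetical, and no real Zero Trust deployment is this simple – the point is only that every permitted action is enumerated, and everything else is refused:

```python
# Minimal sketch of a deny-by-default, least-privilege access check.
# The roles, resources, and policy table are hypothetical examples,
# not the implementation of any particular Zero Trust product.

from dataclasses import dataclass

# Explicit allow-list: each role is granted only the specific actions it
# needs on specific resources. Anything not listed is denied.
POLICY = {
    ("analyst", "sales_dashboard"): {"read"},
    ("engineer", "deploy_pipeline"): {"read", "execute"},
}

@dataclass
class AccessRequest:
    role: str
    resource: str
    action: str

def is_allowed(request: AccessRequest) -> bool:
    """Zero Trust posture: deny unless an explicit grant exists."""
    granted = POLICY.get((request.role, request.resource), set())
    return request.action in granted

# A request outside the explicit grants fails, even from a "trusted" role.
print(is_allowed(AccessRequest("analyst", "sales_dashboard", "read")))   # True
print(is_allowed(AccessRequest("analyst", "customer_records", "read")))  # False
```

The security here comes from what the table omits: any access not explicitly granted is denied by default.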
The problem, however, is that none of this applies clearly to data science, which requires continuous access to data – and lots of it. Indeed, it’s rare that a data scientist even knows all the data they will require at the start of an analytics project; in practice, data scientists frequently ask for all the data available, and only then work out what they need to deliver a model that adequately solves the problem at hand. “Give me all the data, and then I’ll tell you what I’ll do with it” – this might as well be the motto of every data scientist.
And this dynamic, after all, makes sense: Analytics models more generally, and AI more specifically, require data to train upon. As one of us has written elsewhere, “Machine learning models are shaped by the data they train on. In simpler terms, they eat data for breakfast, lunch, and supper too” (with credit to co-author Dan Geer).
So how does Zero Trust fit into this environment, where the people building AI systems actively require access to vast amounts of data? The simple answer is that it does not. The more complicated answer is that Zero Trust works for applications and production-ready AI models, but it does not work for training AI – a pretty significant carve-out, if we are serious about our investment in and adoption of AI.
A New Kind of Supply Chain
The idea that software systems suffer from a supply chain problem is also common wisdom: Software systems are complex, and it is easy to hide or obscure vulnerabilities within that complexity. Commendable studies have been conducted on the subject, such as Olav Lysne’s examination of Huawei and the difficulty of fully certifying third-party software. This is, at least in part, why the EO so forcefully emphasizes the importance of supply chain management, covering both the physical hardware and the software running on it.
Here’s how the EO summarizes the problem:
“The development of commercial software often lacks transparency, sufficient focus on the ability of the software to resist attack, and adequate controls to prevent tampering by malicious actors. There is a pressing need to implement more rigorous and predictable mechanisms for ensuring that products function securely, and as intended. The security and integrity of “critical software” — software that performs functions critical to trust (such as affording or requiring elevated system privileges or direct access to networking and computing resources) — is a particular concern. Accordingly, the Federal Government must take action to rapidly improve the security and integrity of the software supply chain, with a priority on addressing critical software.”
The problem, however, is again one of mismatch: Efforts focused on software security do not map cleanly onto data science environments, which are predicated on access to data that, in turn, forms the foundation of AI systems. Whereas traditional software is programmed by humans, line by line and through painstaking effort, AI is largely “programmed” by the data it is trained upon – creating new vulnerabilities, and new challenges, from a cybersecurity perspective.
MITRE, Microsoft, and a host of other organizations recently released an “adversarial threat matrix” outlining the many ways in which machine learning systems can be attacked, which can be read in further detail here. A few highlights include “model poisoning,” designed to undermine the performance of an AI system based on certain triggers; “data poisoning,” used to undermine the performance of the system by inserting malicious data into the underlying data set; and “model extraction,” used to steal the underlying training data or the model itself – among many other examples. Georgetown’s CSET has just released a great report on AI supply chain issues as well. The takeaway is that the list of ways AI systems can be attacked is expansive – and likely, in our view, to grow over time.
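To make one of these attacks concrete, here is a deliberately simplified, hypothetical sketch of trigger-based poisoning, using scikit-learn on synthetic data. It is not drawn from the threat matrix itself; it only illustrates how a handful of poisoned training rows can teach a model a hidden trigger while its ordinary accuracy still looks healthy:

```python
# Hypothetical, simplified sketch of trigger-based data poisoning.
# Synthetic data only; real attacks are subtler, but the mechanics are similar.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# A synthetic binary classification task, plus one extra feature that is
# always zero in legitimate data (think: an unused field or corner pixel).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X = np.hstack([X, np.zeros((X.shape[0], 1))])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The attacker slips in 100 rows whose "unused" feature carries a trigger
# value and whose label is forced to the attacker's chosen class (1).
TRIGGER = 10.0
X_poison = X_train[:100].copy()
X_poison[:, -1] = TRIGGER
y_poison = np.ones(100, dtype=int)

model = LogisticRegression(max_iter=1000).fit(
    np.vstack([X_train, X_poison]),
    np.concatenate([y_train, y_poison]),
)

# Ordinary evaluation looks fine...
print("accuracy on clean test data:", round(model.score(X_test, y_test), 3))

# ...but inputs carrying the trigger are steered toward the attacker's class.
X_triggered = X_test.copy()
X_triggered[:, -1] = TRIGGER
print("share predicted as class 1 when triggered:",
      round((model.predict(X_triggered) == 1).mean(), 3))
```

Notice that nothing in a conventional software review – code scanning, dependency audits, penetration testing – would surface those extra rows; only scrutiny of the training data itself would.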
What, then, can we do about these types of security issues? The answer, as with so much else in the world of AI, is to focus on the data: Knowing where the data came from, knowing who has accessed it and how, and tracking that access in real time are the only long-term ways to monitor these systems for new and evolving vulnerabilities. We must, in other words, add data tracking to the already complicated supply chain if we truly seek to ensure that both our software and our AI are secure.
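What that looks like in practice will vary by platform, but even a minimal sketch conveys the idea. The helper below is hypothetical – real data access platforms also enforce policy, capture lineage, and detect anomalies – and it simply records who touched which dataset, when, for what purpose, and in exactly what form:

```python
# Hypothetical sketch: write an audit record for every dataset access.
# Real platforms do far more (policy enforcement, lineage, anomaly detection);
# this only illustrates attributing each access and fingerprinting the data.

import hashlib
import json
import time
from pathlib import Path

AUDIT_LOG = Path("data_access_audit.jsonl")

def load_training_data(path: str, user: str, purpose: str) -> bytes:
    """Read a dataset and append an audit record describing the access."""
    data = Path(path).read_bytes()
    record = {
        "timestamp": time.time(),
        "user": user,
        "dataset": path,
        "purpose": purpose,
        # Fingerprint of the exact bytes used, so the question "which version
        # of the data trained this model?" can be answered later.
        "sha256": hashlib.sha256(data).hexdigest(),
    }
    with AUDIT_LOG.open("a") as log:
        log.write(json.dumps(record) + "\n")
    return data

# Every read becomes attributable and reviewable in near real time, e.g.:
# load_training_data("customers.csv", user="alice", purpose="churn-model-v2")
```

The specific format matters far less than the habit: if data is now part of the supply chain, the access log is part of the bill of materials.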
A New Kind of Scale – And Urgency
Perhaps most importantly, as AI is adopted more widely, cybersecurity vulnerabilities are unlikely to grow in proportion to the underlying code base. Instead, they will scale in proportion to the data the AI systems are trained upon, meaning threats will grow at the pace of the data itself – which is to say, exponentially.
At a high level, as bad as things seem today in the world of cybersecurity, they are bound to get worse: Software systems have historically been limited in size and complexity by the time it takes humans to write the code. Manual programming is painstaking, careful work, as any software developer will tell you – and as we noted above.
But as we move to a world in which data itself is the code, this limiting factor is likely to disappear. Based simply on the growing volume of data we generate, the opportunities to exploit digital systems are likely to increase beyond our imagination. We are approaching a world where there is no boundary between safe and unsafe systems, no clear tests to determine that any system is trustworthy – instead, huge amounts of data are creating an ever-expanding attack surface as we deploy more AI.
The good news is that this new AI-driven world will give rise to boundless opportunities for innovation. Our Intelligence Community will know more about our adversaries in as close to real-time as possible. Our Armed Forces will benefit from a new type of strategic intelligence, which will reshape the speed, and even the boundaries, associated with the battlefield. But this future is also, for the reasons we’ve described above, likely to be afflicted with insecurities that are destined to grow at rates faster than human comprehension allows.
Which brings us back to the central thesis of this short piece: If we are to take cybersecurity seriously, we must understand and address how AI creates and exacerbates these vulnerabilities. The same goes for our strategic investments in AI. We simply cannot have AI without a better understanding of its impact on security.
More simply stated, the long-term success of our cybersecurity policies will rest on how clearly they apply to the world of AI.
To find out how Immuta is helping to manage data access control and enable sensitive data use at scale in the Public Sector, visit our website or get in touch with our dedicated Public Sector team at [email protected].