Systems development and architecture teams have been here often: they find they must sacrifice client expectations for system security, or the reverse, and put client expectations ahead of security. That tradeoff is the root cause of the recent IoT-driven DDoS attack that took out significant portions of Dyn’s DNS infrastructure and brought Twitter, Reddit, Netflix, and GitHub, along with many of Dyn’s other clients, to their knees. The attack isn’t the fault of Dyn or its clients; it is due to the proliferation of a “ship-it-unfinished-and-patch-it-later” mentality that has burdened software development as systems and the public consumers of those systems have multiplied. Specifically, tying release deadlines to marketing campaigns or wholesale orders means feature fulfillment will push security out of precedence every time. In this case, roughly 100,000 webcams with hardcoded default administrative passwords, coupled with ISPs and long-haul carriers failing to filter spoofed packets, were enough to DDoS many of Dyn’s subscribers.
It’s easy enough to say this is a failure of planning or of architecture, but the reality is that it’s just a fact of life for anyone building a complex system. So what smart risk management moves can teams make to stay ahead of it?
Security is not an obstacle, it’s a survival requirement in the same way that oxygen, food, water, and shelter are for humans.
Put security first.
Every. Single. Time.
There are enough new vulnerabilities discovered and exploited daily without engineering more of them into a system. They are like rats: they multiply in the walls until the entire structure must be condemned. It is possible to balance velocity and client expectations while maintaining security posture, but the team has to be willing to accept some hard truths.
There are no appliances or software packages that will supply organizational change, which is the fundamental issue that must be addressed in order to be successful. All of the teams and stakeholders must buy in or the project will fail. This means the concept of security first must be sold to everyone involved: internally to the teams, middle management, and executives, and externally to the client and their stakeholders. Without buy-in for the security-first mindset, it simply won’t work. Of course, if security isn’t put first, the organization, project, and client won’t survive in the long term anyhow.
The best way to sell the concept to stakeholders depends on each group’s perception of risk, e.g. executives fear P&L impacts, developers hate missed delivery dates due to rework, operations personnel worry about uptime metrics, etc. Realistically, fear of being held accountable will drive decision making and be a forcing factor for making security foundational in all activities rather than an accessory or feature. Using fear to sell system security is entirely appropriate: the average cost of a data breach was $158 per record in 2016, a figure that only partially accounts for immediate lost revenue and indirect costs like foregone customer acquisition. And that covers only data breaches, not total lost business or refunds due to denials of service and other security-related outages.
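The per-record figure above makes the scale of the risk concrete. Here is a back-of-the-envelope sketch (the record counts are made-up examples, not data from the article):

```python
# Back-of-the-envelope breach cost using the cited 2016 average of $158
# per compromised record. Record counts below are hypothetical examples.

COST_PER_RECORD = 158  # USD, 2016 average cost per breached record

def breach_cost(records: int) -> int:
    """Estimated total cost of a breach exposing `records` records."""
    return records * COST_PER_RECORD

# Even a modest customer database becomes an eight-figure liability.
print(breach_cost(100_000))  # -> 15800000, i.e. $15.8M
```

Numbers like this are the kind of ammunition that gets executive buy-in for putting security first.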
Take the long view and think strategically. When putting security first, the organization will have to develop a focused plan that implements security as a matter of course. If the system is brand spanking new, this is relatively easy compared to integrating security into an existing one. Fortunately, intelligent resource management can help even in the legacy case. In short, for legacy systems, organizations should plan to build the replacement in parallel, support it as a sidecar by gradually wiring new functionality in until it is fully replaced end-to-end, and then turn off the old in favor of the new.
The organization’s current policies will likely be a greater impediment than enabler for real security. Most security policies and regulations are compliance-based and checklist-oriented rather than built on holistic, continuous monitoring of systems for high availability and security. It’s important to make sure that system plans are brought up to date with current best practices by going above and beyond the bare minimum required by policy. Adversaries will begin exploiting vulnerabilities the minute that they are in the wild. The organization’s plan for a system must, at a minimum, make it harder to attack than its competitors to avoid being script kiddie fodder. There is no way to guarantee a system’s security; even a completely disconnected, powered-down machine is susceptible to security flaws.
Ending or rolling back efforts to secure a system because it’s harder than not doing so will result in a bigger mess than never having started in the first place. It is critical that an organization maintain its commitment to doing it right.
Smart organizations create a secure base to build upon by leveraging automation at every possible opportunity. Clever automation helps make the outcome of every process consistent and repeatable. Doing it right requires a team with a holistic viewpoint, experience in every aspect of the environment’s tech stack, and the flexibility to learn on the fly. The same processes that make good security a matter of course are the processes that enable resilience in systems architecture and development. It’s better to hire lifelong learners with broad skillsets and the capability to specialize as needed than forty Balkanized specialists with no understanding of (or desire to understand) the environment as a whole. This is another area where policy and hiring practices make it difficult to do the job right. Side note: a project manager or otherwise nontechnical team lead should never hire a new technical asset without vetting them with trusted technical team members.
Configuration-as-code, infrastructure-as-code, software-defined networking, etc., are all labels for essentially the same concept: versioned management of system and environment configuration. Automation is the way to leverage Agile DevOps principles and apply them to holistic systems management, from the first development virtual machine spun up to the day the lights are turned off on a retired database server. Security can easily be baked into these configurations, added as configuration items, tracked, and measured as a matter of course, rather than handled as an error-prone manual checklist after deployment.
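The idea of security settings as tracked configuration items can be sketched in a few lines. This is an illustrative toy, not a real tool’s API; the baseline and host-state structures are hypothetical, and a production shop would use something like Ansible, Puppet, or Terraform with the baseline kept in version control:

```python
# Illustrative sketch: a declarative, versionable security baseline with
# drift detection. All names (DESIRED_STATE, audit) are hypothetical.

DESIRED_STATE = {
    "sshd": {"PermitRootLogin": "no", "PasswordAuthentication": "no"},
    "firewall": {"default_policy": "deny", "allow_ports": [22, 443]},
}

def audit(actual: dict) -> list:
    """Compare a host's reported state to the baseline; return drift items."""
    drift = []
    for section, wanted in DESIRED_STATE.items():
        current = actual.get(section, {})
        for key, value in wanted.items():
            if current.get(key) != value:
                drift.append(f"{section}.{key}: want {value!r}, have {current.get(key)!r}")
    return drift

# A host that still allows password logins shows up as measurable drift.
host_state = {
    "sshd": {"PermitRootLogin": "no", "PasswordAuthentication": "yes"},
    "firewall": {"default_policy": "deny", "allow_ports": [22, 443]},
}
print(audit(host_state))
```

Because the baseline is plain data under version control, every security decision has an author, a date, and a diff, which is exactly what the post-deployment manual checklist lacks.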
Fast recovery is as important as comprehensive resilience, and smart COOP (continuity of operations) is best enabled by automation. In the event of catastrophic failures or debilitating attacks, it is sometimes more important to be able to return to operations quickly than to withstand the attack in the first place. Automation that shuts down and/or shifts computing resources with little or no downtime is the best way to provide reliability management.
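The shift-resources-automatically idea boils down to a controller that watches health checks and promotes a standby without a human in the loop. A minimal sketch, with hypothetical node names and a made-up failure threshold:

```python
# Minimal failover sketch: promote a standby after consecutive failed
# health checks. Node names and the threshold are hypothetical.

FAIL_THRESHOLD = 3  # consecutive failures before automated failover

def failover_controller(checks, primary="node-a", standby="node-b"):
    """Walk a stream of health-check results (True = healthy) and
    return which node should be serving traffic after each check."""
    active, failures = primary, 0
    routing = []
    for healthy in checks:
        if active == primary:
            failures = 0 if healthy else failures + 1
            if failures >= FAIL_THRESHOLD:
                active = standby  # automated switch, no human in the loop
        routing.append(active)
    return routing

# Two transient blips don't trigger failover; three failures in a row do.
print(failover_controller([True, False, False, True, False, False, False]))
```

A real implementation would drive DNS updates, load-balancer pools, or orchestrator APIs, but the principle is the same: recovery time is a function of how much of this decision is automated.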
Recruit, hire, train, and retain technical teams who relish learning new things. Fortunately, skilled systems management and development personnel are available in droves, and most welcome the challenge of daily learning. Focusing on team members’ desire and ability to learn creates a sea change within an organization, increases employee satisfaction, and destroys silos separating work functions. Big vendors are only capable of selling dependency, but multifaceted talent will innovate, save time and money, and leverage holistic knowledge to build more secure and effective technical solutions. Free (as in beer) building blocks like automation, microservices, RESTful APIs, and open source toolsets and libraries demand better architectural understanding and technical acumen from personnel, but they provide greater resource efficiency and serve as an impetus for recruiting, training, and retaining the superior talent organizations need to stay competitive in the long term.
It’s time to step back from the asses-in-seats productivity metric and give employees the ability to choose their duty station. Telepresence, chat applications, VPNs, whitelisting, virtual desktops, certificate-based authentication and authorization, etc. are widely available to sustain a work-anywhere philosophy. It’s a seeker’s market for top technical talent, and it’s up to organizations to support their desire to work from home, a coffee shop, a yurt in the sticks, or anywhere else. The only criterion should be that they’re capable of meeting the real productivity metrics a good employer should already have documented for the position. Telework, like good security, is more difficult to manage at the start, but provides incredible long-term benefits to employer and employee. Organizations must provide secure networks and capabilities to support remote work, and it speaks volumes about their competence when they are unable to do so. The problem of secure infrastructure is essentially the same as that of secure systems and environments, so an organization that claims to supply secure systems to a client but is unable to supply secure infrastructure internally for remote work is suspect.
If a policy burns resources while providing no benefit, organizations should change it, throw it away, or pay it only lip service. For example: in an unnamed government organization, policy dictates daily intrusive vulnerability scanning, but only requires that the vulnerability definitions be updated once a month. This means that users see regular service interruptions and slowdowns, to no real effect. A better plan is continuous monitoring of system and service logs and resources, coupled with random-schedule daily scanning and daily updates of vulnerability definitions. Most organizations never revisit aged policy, preferring to keep the status quo. Change hurts, but is necessary. Teams should be willing to fight to kill, change, or only minimally support bad policy (and sometimes systems).
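The random-schedule piece of that plan is simple to implement: pick an unpredictable scan time inside an allowed daily window so neither adversaries nor load patterns can anticipate it. A sketch, with hypothetical window boundaries (a real scheduler would feed the result to cron, systemd timers, or the scanner’s own API):

```python
# Sketch of random-schedule daily scanning: choose an unpredictable
# start time within a permitted window. Window bounds are hypothetical.

import random
from datetime import time

def pick_scan_time(rng: random.Random, start_hour: int = 0, end_hour: int = 24) -> time:
    """Choose a uniformly random minute within the allowed daily window."""
    minute_of_day = rng.randrange(start_hour * 60, end_hour * 60)
    return time(hour=minute_of_day // 60, minute=minute_of_day % 60)

rng = random.Random()      # seedable, so schedules are reproducible in tests
print(pick_scan_time(rng)) # a different time each day the job runs
```

Pairing this with daily definition updates gives the coverage the monthly-update policy pretends to provide, without the predictable midday slowdown users learn to dread.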