Organizations take great care to secure cloud platforms, relying on sophisticated access control systems. Multi-factor authentication, identity management, and fine-grained authorization aim to fully lock down valuable data and infrastructure. However, what if these complex safeguards unexpectedly fail? In this post, we'll discuss the vital role of the break glass process for emergency access in IT.
Companies inevitably adopt new platforms to run tasks and store data, whether it’s a substantial transition like migrating to a public cloud or a more localized shift such as adopting an HR management SaaS. Regardless of scale, a new platform is a relatively isolated entity that houses valuable information, necessitating reliable user authentication, strict least-privilege permissions, and "zero trust" authorization policies. The rationale is clear: these hosted systems contain sensitive information, and their disruption has serious business impact. Recognizing that access control is a nontrivial problem, companies deploy new implementations or leverage existing ones, employing a suite of specialized systems. These systems are integrated to perform authentication and authorization, identify human and machine users, and grant limited access to the new platform.
Indeed, an access control system is inherently complex, and anything complex can break. If access control fails, it brings the entire business to a halt.
Even if nothing breaks, we're not entirely safe. The access control system is designed based on specific work processes and usage patterns assumed for the target platform. Numerous assumptions have to be made, but it's impossible to foresee everything. Sometimes something unexpected happens. Often it can be ignored, but sometimes it can be an emergency that demands swift action. Yet, even if we're willing to take action, our carefully designed access control system may impede it. Failing to act could result in financial losses, harm to the business, or even more severe consequences.
Simply accepting such risks doesn't make sense. Facing rare, acute and potentially dangerous circumstances, relying on sophisticated mitigation procedures is impractical. We need to do something simple but very effective, albeit unusual or even generally prohibited—like smashing the doors in case of fire, stopping the train to prevent an accident, or just breaking the glass and pulling the alarm.
The term "break glass" refers to an action that is generally discouraged or not allowed but is taken in an emergency. The concept is broad, even in the context of IT. For example, HIPAA compliance mandates a process for accessing patient health information during a medical emergency. This is a rarely invoked but formally defined procedure for elevating information access rights. In this post, we focus specifically on emergency access to an IT service platform in uncommon scenarios not covered by existing processes. In particular, this pertains to accessing a cloud platform for control and data.
The distinctive properties of break glass procedures are:
Unlike privileged access management (PAM), which relies on a collection of processes for executing prescribed tasks, break glass relies on human judgment in unpredictable scenarios. PAM strictly governs routine tasks through rigid workflow controls to prevent errors. Break glass procedures intentionally bypass controls to enable decisive human action in crises. This separation isn't universal. For example, in the real world, breaking glass to activate a fire alarm is a simple prescribed action. However, in the realm of IT systems, which have a smaller "state space" and numerous tools, such predictable cases are normally handled through automation rather than break glass.
True break glass scenarios involve unanticipated risks where the benefits of empowering operators to act swiftly outweigh the dangers of bypassing safeguards. Still, the decision to invoke break glass access must be weighed carefully.
Another class of cases that shouldn't trigger a break glass procedure arises from personal data regulation. For example, a user data deletion request within the scope of GDPR clearly requires an action generally discouraged: information destruction. However, it is a prescribed process, known in advance, so its execution should not leave space for human error. It must be a canned procedure, preferably automated, and safeguarded with appropriate controls to prevent accidental triggering.
In contrast, the access control system failure example from the beginning of this post is a valid case to discard some rules. When the front door lock fails, you’re left with few options—break the door or smash the window. Similarly, when standard authentication stops working, exceptional access may be the only way to regain control.
One may reasonably question the need for such a lengthy discussion about system failures and the applicability of elevated access procedures when the problem domain revolves around access to the cloud infrastructure. After all, any information system has a master key, "root" credentials, or a "superadmin" that created this piece of cloud infrastructure and has unlimited powers over it. Armed with these credentials, a human operator can do anything. In a crisis scenario, why not simply unlock the vault and retrieve the sheet of paper with root credentials? True, you'll need to locate the two assigned custodians holding the vault keys, but that's a standard part of a regular PAM process. Should we look any further?
Yes, we should. For example, cloud access management guidelines emphasize that these initial root credentials should never, under any circumstances, be used for operations. Operations, even in emergencies, imply that the system will continue to function after the emergency is mitigated. Root credentials are essentially a piece of data, and once exposed to an operator or a software utility, there’s an inevitable risk of leakage in various directions. Eliminating all copies of these credentials becomes an impossible task once they're released. Would you bet that these credentials won't persist in the ~/.aws/credentials file on an operator's laptop, on a VDI instance, on a jump-box VM, or in the AWS_SECRET_ACCESS_KEY environment variable in a hastily crafted script? Moreover, operators, with good intentions to expedite issue resolution, may hold onto privileged credentials.
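To make the leakage concern concrete, here is a minimal Python sketch that looks for static AWS credentials in the two places named above: the shared credentials file and the process environment. The function name and the report wording are hypothetical, and the check is deliberately narrow. It only illustrates how easily such copies can be found on one machine, not how to prove that none exist elsewhere.

```python
import configparser
import os
from pathlib import Path

# Environment variables the AWS CLI and SDKs read static credentials from.
AWS_ENV_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN")


def find_leftover_credentials(environ, credentials_file):
    """Return human-readable findings about static AWS credentials left behind."""
    findings = []
    for name in AWS_ENV_VARS:
        if environ.get(name):
            findings.append(f"environment variable {name} is set")

    path = Path(credentials_file)
    if path.is_file():
        # Interpolation is disabled so '%' characters in values are harmless.
        parser = configparser.ConfigParser(interpolation=None)
        parser.read(path)
        for profile in parser.sections():
            if parser.has_option(profile, "aws_secret_access_key"):
                findings.append(f"profile [{profile}] in {path} holds a secret key")
    return findings


if __name__ == "__main__":
    default_file = Path.home() / ".aws" / "credentials"
    for finding in find_leftover_credentials(os.environ, default_file):
        print("WARNING:", finding)
```

A clean result from a scan like this says nothing about shell history, scripts, chat logs, or other machines, which is exactly why released root credentials cannot be reliably recalled.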
So no, you shouldn't involve the root credentials of the cloud platform, even if it's your personal cloud account.
The inadvertent retention of privileged access credentials isn't the primary concern for an organization. Worldwide, organizations face escalating scrutiny from regulators overseeing information security, sensitive data access, and even public behavior. Virtually every regulation explicitly or implicitly mandates:
It's clear that the use of root credentials violates all these requirements. It conceals the operator's identity under the root account, it grants the user unconstrained powers within the information system with no temporal limitations, and while access tracing may still function, its effectiveness and accuracy depend entirely on the operator holding the root credentials.
Most regulatory documents aren't written as technical specifications. They employ more generic legal language to sustain the relevance of regulations in a rapidly evolving technology landscape. However, they leave specific interpretation to auditors and state-appointed governance bodies. These oversight entities generally use common sense in interpreting regulations and compliance implementation specifics. Nevertheless, these interpretations aren't loose. For instance, an auditor may accept a scenario involving the use of root credentials for cloud infrastructure operations, but will demand such an exceptionally high level of control and assurance that the entire approach becomes practically meaningless.
It's important that compliance issues aren’t left to chance. Compliance requirements are known in advance, and compliance audits occur regularly. Unaddressed compliance violations inevitably lead to damages comparable to a worst-case scenario—financial losses, business termination, or even criminal charges. Unlike an emergency incident, which one hopes may not happen, compliance violations pose a persistent threat to the organization.
In addition to imposing restrictive requirements, many regulations (GDPR being a prominent example) also demand operations continuity. Consequently, the affected organization cannot afford to wait until the issue resolves on its own. The organization is bound by the law to take action, not solely driven by direct financial losses. Therefore, upon establishing an information processing system, organizations are obligated to have a process for mitigating rare, high-impact risks of uncertain type and origin—such as a break glass procedure.
It's worth noting that the break glass process isn't a singular entity for an organization. Given that an enterprise has multiple information systems aggregated at different levels, it may need corresponding emergency procedures at each level. For example, an identity provider failure likely impacts everything, hence it should be addressed globally. But if an IAM configuration error renders a particular information system inaccessible, it should be addressed at the system level without invoking organization-level superpowers.
In many cases enterprises have emergency access methods in place, often quite elaborate and explicitly designated as "break glass". Despite this, they often share the same shortcomings that prevent these otherwise reasonable PAM methods from being suitable for "break glass" circumstances. We explore a few examples below.
This example is typical for cloud-first organizations with mature IT departments, specifically addressing access control to enterprise AWS resources. Due to compliance policies, IAM configuration blocks operators' access to application data in production AWS accounts. For emergency cases, the organization implements a method of privileged access to AWS resources, bypassing these restrictions. The method leverages the existing identity management system, identity provider and PAM system, all of which are robust and trusted. These systems are integrated and configured to handle the approval process, credentials issuance and rotation, as well as the audit trail.
This is a well-designed PAM system, showcasing many PAM best practices. Permissions and policies are assigned through roles and groups, centrally managed in Active Directory. Strong authentication with multi-factor authentication is in place, and the credentials life cycle is automated. Each component of the entire system is among the best tools for its task. When implemented, it works, but it often fails in cases of emergency. We'll discuss why after another example.
Another example concerns credentials rotation in AWS, following an AWS guideline, which is purely AWS-centric and doesn't depend on third-party proprietary tools.
Credentials update is just one aspect of a break glass process, yet it already involves a bunch of Lambda functions, a Secrets Manager instance, a few more AWS services, and half a dozen IAM policies with cross-account access. What could go wrong? Read on to find out.
The above examples represent reasonable implementations of privileged access management (PAM). Despite their apparent soundness, they often fall short when needed most. Let's explore what they have in common.
The most evident issue is the complexity of these PAM implementations. Each process involves several separate systems, often managed by different teams. The managing team updates the system, implements new features, improves the existing functionality, and sometimes decommissions parts of it. Consequently, the underlying assumptions of the PAM process eventually break. This happens for any complex integration. Such issues are expected for actively operating business systems, usually detected during testing, and promptly addressed. If a system participates in a busy integration, the managing team has to be aware of it and handle the integration to prevent disruptions due to system changes.
However, an emergency access procedure isn't triggered daily or weekly. It may occur so infrequently that the entire team may undergo natural staff rotation before it happens again. Consequently, nobody remembers that such a procedure exists and depends on the system the team manages, until an on-call support operator gets a P0 incident, scrambles for emergency access, and finds it nonfunctional.
Even if this PAM machinery works, it often fails on the human side. A complex procedure implies a complex operator runbook. When waking up at 4 am, the operator is likely to misstep somewhere. Or the runbook documentation no longer matches the updated UI of, for instance, a credentials vault service.
Emergency access procedures are cross-functional and very rarely invoked. This sets them apart from regular PAM processes. Hence, they must be very simple, both technically and process-wise, even if it means being manual, unscalable, or otherwise costly.
A working break glass emergency access procedure should adhere to several requirements. Some of them are formal, stemming from externally imposed regulations and constraints. Others are practical, and essential for establishing a truly functional and useful process.
Formal requirements mainly come from legal regulations. Most of them have similar expectations from access control to information systems:
Regulators and overseeing bodies accept the need for heightened access measures, but this doesn't imply that these measures can compromise compliance guarantees. Elevated access must be matched by elevated controls. This is the foundational formal requirement.
The main practical requirement is that a break glass procedure must work flawlessly 100% of the time.
While a break glass procedure must be highly reliable, it doesn't have to be scalable, efficient, or cheap. It doesn't need to support hundreds (or even tens) of users. It may rely on manual actions. That's fine, as long as the procedure is reproducible.
To satisfy the formal requirements mentioned earlier, a break glass procedure should implement the following constraints:
Last in the list of requirements but not the least important is when a break glass procedure can be used. For example, break glass procedures can be triggered in the following scenarios:
A list of permitted use cases should accompany every break glass access method. The method must be invoked only in these cases (and in corresponding drills) and never used for anything else.
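One way to keep the list of permitted use cases attached to the method is to store both together and record the declared use case on every invocation. The Python sketch below is an illustration under assumptions: the class names, the method name, and the use-case identifiers are hypothetical (the last two echo the identity provider and IAM examples earlier in this post). It flags an out-of-policy invocation for later review instead of refusing it, which is a design choice rather than a requirement stated here.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class BreakGlassMethod:
    """An emergency access method together with its permitted use cases."""
    name: str
    permitted_use_cases: frozenset


@dataclass
class InvocationLog:
    """Append-only record of every invocation, reviewed after the incident."""
    entries: list = field(default_factory=list)

    def record(self, method, operator, use_case, justification):
        entry = {
            "time": datetime.now(timezone.utc).isoformat(),
            "method": method.name,
            "operator": operator,
            "use_case": use_case,
            "justification": justification,
            # Flag rather than block: the operator is never locked out
            # mid-incident, but misuse is visible in the review.
            "permitted": use_case in method.permitted_use_cases,
        }
        self.entries.append(entry)
        return entry


# Hypothetical method and use-case names, for illustration only.
tenant_access = BreakGlassMethod(
    name="cloud-tenant-emergency-access",
    permitted_use_cases=frozenset(
        {"identity-provider-outage", "iam-misconfiguration", "drill"}
    ),
)
log = InvocationLog()
ok = log.record(tenant_access, "alice", "identity-provider-outage", "SSO is down")
bad = log.record(tenant_access, "bob", "routine-deployment", "faster than a ticket")
print(ok["permitted"], bad["permitted"])  # True False
```

Including "drill" among the permitted use cases keeps exercises on the same recorded path as real incidents.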
After having spent most of this post discussing complications and examples of failures, let's outline a few methods that work. In a nutshell, the problem isn't difficult; we just need an access method for the target platform that works in any circumstances and satisfies compliance requirements.
An access method doesn't necessarily imply a set of credentials, but in many cases, it's the most straightforward way. A set of static credentials is provisioned in advance (because provisioning may fail during an incident) and stored securely. The credentials typically include a secret such as a passphrase or a key. It can also be a hardware device; it just needs to be self-sufficient, although such devices aren't widely used yet for this purpose. For a cloud computing platform, it's an "IAM account" (not to be confused with an AWS account, the primary container of all AWS resources for a user)—a static authorization entity with a static secret as a credential. So, an operator only needs this secret, a network to connect to the cloud platform, and a computer connected to the network—the minimal list of prerequisites that cannot be further reduced. Besides these basics, the break glass access methods differ in the way those static credentials are maintained.
The simplest way is fully manual:
It's tedious, but it works. This method may be employed for higher levels of emergency access—such as emergency access to a cloud tenant encompassing all cloud resources belonging to an organization.
Down the stack, more things happen, and the break glass process is used more frequently, involving more people who should have access to it. At this level, limited automation is reasonable—a standalone credentials vault (or PAM) system may replace human custodians. Popular examples of such systems are CyberArk PAM and HashiCorp Vault. These digital vaults handle credentials storage, release, rotation and logging, possibly leaving only the approval process to humans. However, it's preferable to have dedicated instances of these systems specifically for break glass credentials management, primarily to minimize the attack surface around these credentials.
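The division of labor described above (the vault stores, releases, rotates, and logs, while humans approve) can be modeled in a few lines of Python. This is a toy in-memory model, not the API of CyberArk PAM, HashiCorp Vault, or any other product. The class and method names are hypothetical, and the rule that operators cannot approve their own request is an added assumption.

```python
import secrets
from datetime import datetime, timezone


class ToyBreakGlassVault:
    """In-memory stand-in for a dedicated vault; NOT a real product API."""

    def __init__(self):
        self._secret = secrets.token_urlsafe(32)
        self._released = False
        self.audit_log = []

    def _log(self, event, actor):
        self.audit_log.append(
            {"time": datetime.now(timezone.utc).isoformat(),
             "event": event, "actor": actor}
        )

    def release(self, operator, approver):
        """Hand the secret to an operator after human approval; log both."""
        if operator == approver:
            # Assumed rule: no self-approval.
            raise PermissionError("operator cannot approve their own request")
        self._log("approved", approver)
        self._log("released", operator)
        self._released = True
        return self._secret

    def rotate(self, actor):
        """Replace the secret so copies released earlier stop being valid."""
        self._secret = secrets.token_urlsafe(32)
        self._released = False
        self._log("rotated", actor)

    @property
    def needs_rotation(self):
        """True while a released secret has not been rotated yet."""
        return self._released
```

In a real deployment, rotation would also have to update the credential on the target platform. The model only shows the release, log, and rotate lifecycle that makes a leaked copy short-lived.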
Last but not least, periodic drills are essential to keep the emergency access method working. Anything not in use becomes rusty and falls apart. Therefore, once the organization adopts a cloud platform, it should assign someone on the staff responsible for its operations. A break glass process should be one of the first things for that team to implement, and drills for this process should be one of the first recurring events in the team calendar.
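A recurring calendar event can be backed by a trivial check that complains when a drill is overdue. The Python sketch below assumes a 90-day interval, which is an illustrative figure and not one prescribed in this post; the function names are hypothetical.

```python
from datetime import date, timedelta

# Assumed policy: one drill per quarter. The interval is illustrative.
DRILL_INTERVAL = timedelta(days=90)


def next_drill_due(last_drill, interval=DRILL_INTERVAL):
    """Date by which the next drill must have taken place."""
    return last_drill + interval


def is_drill_overdue(last_drill, today, interval=DRILL_INTERVAL):
    """True when the due date has passed without a new drill."""
    return today > next_drill_due(last_drill, interval)


last = date(2024, 1, 15)
print(next_drill_due(last))                      # 2024-04-14
print(is_drill_overdue(last, date(2024, 5, 1)))  # True
```

Wiring such a check into routine monitoring means a skipped drill surfaces as an alert instead of being discovered during an incident.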
An emergency access procedure is one of many processes an organization has to implement once it adopts a new platform service. It isn't (and shouldn't be) complex technically, but it has surprisingly many ramifications.
A break glass process is a necessity. Without it, operators lack control over a distinct set of high-risk issues in the application system or shared platform. While organizations recognize this need, a common misinterpretation views break glass as a regular privileged access management (PAM) scenario. It's not, and this distinction is important. Break glass and PAM have divergent requirements and acceptable trade-offs, and they apply to different use cases.
Counter-intuitively, break glass procedures and regulatory compliance aren't at odds. However, a compliant break glass procedure demands a distinct implementation, one that relies on careful observability and in-process control instead of upfront restrictions.
Implementing break glass involves unique best practices, occasionally conflicting with standard IT operations patterns like denial by default, approvals in advance, and excessive automation. While these practices may seem inconvenient, deviating from them can create more problems than solutions.
A well-implemented break glass procedure creates a robust line of defense against new and unknown threats. Moreover, it addresses concerns in adjacent areas such as application reliability, regulatory compliance, and the overall cost of ownership.