Building a Secure & Usable Auth System


The NVIDIA cloud platform team operates hundreds of nodes, including CPU/GPU servers and network infrastructure devices, that span multiple data centers (DC) to support our GPU cluster services. Our engineers often need remote access to these nodes to perform administrative functions. Provisioning and managing user accounts on a large number of nodes is not trivial and requires dedicated systems and engineering resources to operate. Besides servers, we need to enable RBAC for devices such as switches, routers, firewalls and console servers, which are traditionally password-based.

For SSH-based host access, the earlier practice was to create a shared Unix account on all hosts and copy each engineer’s public key to the target hosts upfront. However, this method is known to have significant security, usability and scalability problems, as noted below:

  • The use of long-lived SSH keys. There are three main problems with long-lived keys: (a) the likelihood of a compromise is high, (b) key revocation can be expensive, and (c) a compromise may go undetected.
  • Access is not integrated with corporate identity systems and MFA. Besides the lack of traceability, we need to manually remove keys from all known hosts when an employee changes teams or leaves the company.
  • Lack of role-based access control (RBAC), and the difficulty of managing user access (grant and revoke) on a large number of target hosts.
  • Reduced security visibility because of the shared sudo account.

The goal of our work was to design and implement a system that is secure, usable, and can scale to thousands of data center nodes. We started out with a challenging set of requirements, including:

  • Eliminate passwords and static SSH keys.
  • Integrate node access with corporate identity systems and MFA.
  • Replace traditional/legacy systems such as RADIUS/TACACS, LDAP and VPN with a unified cloud-scale mechanism.
  • Be secure, usable, and able to scale to thousands of nodes in DC and hybrid environments (e.g. AWS VPCs).
  • Support automation accounts and access bootstrapping as part of host re-image/PXE boot.

Join us and learn about the design and implementation of a unified authentication system for accessing data center nodes.

Room 106
Saturday, March 7, 2020 - 16:30 to 17:30