Tug of war on cloud security – Enterprise IT vs. LOB Application Owners?

Today, both enterprise IT leaders and line of business (LOB) application owners face similar challenges in securing workloads in public clouds such as AWS.  While application owners focus on balancing the agility and security of their production workloads, IT leaders, CISOs and CIOs have a broader responsibility to assess and enforce security across all workloads, an objective that spans both on-premises and multi-cloud environments.  CISOs and IT leaders are clearly worried: as more applications and workloads move to cloud, they have very little visibility into the overall state of security, compliance and risk.  Pushing the “cloud security problem” back to application owners is not a realistic solution either, as each application team may not have the security focus, knowledge or competency to keep its workloads secure in cloud.  CISOs also suspect that application owners may be trading off security for agility.  These are some of the challenges enterprise IT faces in effectively governing multiple AWS accounts and app teams.  As organizations mature in adopting cloud, enterprise IT needs to play a critical role in defining, enforcing and automating “minimum viable” security of cloud workloads for all application teams, and it needs to do this without impacting the application teams’ agility and flexibility.  We believe that cloud security is a shared responsibility among IT, the CISO and application owners.  We also show how a security policy automation solution can be used by enterprise IT and application owners together to keep cloud workloads secure at scale.

Three CISO challenges for public cloud workloads

Enterprise IT and CISOs face complex challenges in managing security, compliance and risk across hundreds of AWS cloud accounts used by application teams or lines of business within their company:

  • “We have hundreds of AWS accounts and application teams using cloud for production.  Are all these workloads across multiple AWS accounts really secure?” – a concern one of the CISOs I talked to recently expressed.
  • One size of compliance does not fit all app teams – some AWS accounts, due to the nature of their workloads, may require additional compliance frameworks such as PCI or HIPAA.  How does IT/the CISO enforce different levels of compliance across multiple AWS cloud accounts?
  • Finally, how does IT make sure security does not slow down application teams’ agility and flexibility while still providing sufficient security guardrails?  When hundreds of developers are making changes to cloud every hour of every day, they do not want to be slowed down by compliance and security checklists handed down from IT.

So, how does a CISO govern security and compliance across all lines of business app teams at scale?

Is embedding a security engineer in each application team, or publishing a security policy document, a viable solution?

One approach many enterprises have taken is to embed full-time or part-time security engineers in each application team.  While this can work at a small scale, such as a few application teams, it is clearly not a scalable solution when enterprises have dozens to hundreds of AWS accounts.

Some enterprises create a 100-page security policy and controls document (PDF/Excel) that they expect all application teams to follow.  However, such a manual process will clearly slow down the application teams and is not an acceptable solution either.

When you want both scale and agility, security automation is the only viable option for enterprise IT to manage multiple accounts.  Let us look at the steps enterprise IT must take to anchor minimum viable security across multiple application teams.

Enterprise IT – Start defining and enforcing security controls through automation

I. Define security and compliance controls inventory and SLA


One of the first steps enterprise IT must take is to define a set of security and compliance controls that are fundamental to the security of all workloads in cloud.  Every line of business (LOB) application team using cloud must be assessed against this set of controls for violations.

For AWS, the best sources of controls are two CIS standards – the CIS AWS Foundations Benchmark and the CIS AWS Three-tier Web Architecture Benchmark.  Together they provide 100+ policies defining security controls for AWS across the following areas; a few representative examples are given below under each category:

  • IAM
    • Ensure IAM password policies are enforced
  • Networking
    • Ensure no security groups allow ingress from 0.0.0.0/0 to port 22/3389
  • Logging
    • Ensure CloudTrail is enabled in all regions
  • Monitoring
    • Ensure a log metric filter and alarm exist for usage of “root” account
  • Application
    • Ensure Databases running on RDS have encryption at rest enabled

For certain workloads requiring a higher level of compliance, such as FedRAMP, NIST SP 800-53, HIPAA or PCI, additional policies with their controls must also be created.  This set of controls forms the “controls inventory” that an IT organization should define and maintain within the company.

Finally, IT must also define SLAs for resolving critical and high-priority violations, such as fixing critical violations or security findings within one week.
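To make this concrete, the minimal sketch below (our own illustration using the AWS boto3 SDK, not any particular vendor product) automates one of the networking controls listed above: it flags security groups that allow SSH ingress from 0.0.0.0/0.

```python
# Minimal sketch, assuming plain boto3 and default AWS credentials: flag security
# groups that allow ingress from 0.0.0.0/0 to port 22 (one of the CIS networking
# controls listed above).
import boto3


def open_ssh_groups(region="us-east-1"):
    """Return IDs of security groups that allow SSH from the whole internet."""
    ec2 = boto3.client("ec2", region_name=region)
    findings = []
    for group in ec2.describe_security_groups()["SecurityGroups"]:
        for rule in group.get("IpPermissions", []):
            from_port, to_port = rule.get("FromPort"), rule.get("ToPort")
            # FromPort is absent when the rule covers all ports (protocol "-1").
            covers_ssh = from_port is None or from_port <= 22 <= (to_port or from_port)
            world_open = any(r.get("CidrIp") == "0.0.0.0/0" for r in rule.get("IpRanges", []))
            if covers_ssh and world_open:
                findings.append(group["GroupId"])
    return findings


if __name__ == "__main__":
    print("Non-compliant security groups:", open_ssh_groups())
```

A policy automation product would run dozens of such checks, across many accounts and on a schedule, rather than a single check in a single account.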

II. Enforce controls on all public cloud accounts through continuous monitoring

Policy automation solutions (such as BMC SecOps Policy Service) allow IT to automatically and continuously (e.g., daily or hourly) run compliance checks across hundreds of public cloud accounts, as shown in the diagram below.


In this architecture, a security policy automation tool assesses the security posture of all 1..N AWS accounts in an enterprise.  It provides security and compliance visibility at an aggregate level for the CISO as well as for each LOB application owner.  Security policy automation tools must also support multi-cloud compliance, extending coverage to AWS, Azure and Google clouds.

Finally, if certain AWS accounts require NIST, HIPAA or PCI levels of compliance, the security policy automation tool must allow different policies to be enforced on those accounts.  This enables a unique set of compliance policies to be applied to each application team’s cloud accounts based on the compliance needs of its workloads.  For example, if application #2 needs to be PCI compliant, then a PCI policy with its controls should be assessed against the cloud accounts owned by application team #2.
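As a simple illustration of mapping accounts to compliance packs, the snippet below uses hypothetical account IDs and pack names; a real tool would hold this mapping in its policy engine.

```python
# Hypothetical mapping of AWS account IDs to the compliance packs assessed against
# them; every account gets the baseline pack, and workload-specific packs are added.
ACCOUNT_POLICY_PACKS = {
    "111111111111": ["cis-aws-foundations"],                # baseline only
    "222222222222": ["cis-aws-foundations", "pci-dss"],     # application team #2 handles card data
    "333333333333": ["cis-aws-foundations", "hipaa"],       # healthcare workload
}


def packs_for(account_id):
    """Return the compliance packs to assess for a given account."""
    return ACCOUNT_POLICY_PACKS.get(account_id, ["cis-aws-foundations"])
```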

III. Collaborate with Application Owners for remediation

Enterprise IT should continuously monitor the security and compliance of all AWS (and Azure, multi-cloud) accounts against the set of controls it has defined as the minimum security baseline.  It should then take a collaborative approach: alert the LOB application teams about critical control violations and work with them to ensure the violations are resolved within the SLAs.

So, who is responsible for cloud security – Enterprise IT or Application Owners?

Enterprise IT must start to assert itself in cloud security by extending the security discipline it knows well on-premises to cloud.  Of course, traditional on-prem security practices cannot simply be used “as-is” in cloud.  Instead, a prescriptive, automatable new set of security policies for cloud technologies is the first key step for enterprise IT to establish a baseline for cloud security in the organization.  Investing in a policy automation solution such as BMC SecOps Policy Service or a CASB solution would be the next step, giving enterprise IT visibility into the security posture of all cloud workloads spread across multiple cloud accounts.  Finally, IT should work closely with each LOB application owner on remediation.

LOB application owners have an equal responsibility for the security of their workloads and must work closely with enterprise IT to ensure that security findings are acknowledged, addressed and remediated within the SLA.

We believe a shared responsibility model between LOB application teams and enterprise IT/CISO would be the best approach to secure cloud workloads.

BMC Policy Service is a security automation tool that you can get started with in less than 30 minutes to secure all your buckets and cloud resources – https://www.bmc.com/it-solutions/secops-policy-service.html.

What’s going on with these high profile AWS S3 breaches?

We have all seen alarming headlines this year, with over a dozen high-profile breaches or exposures of critical customer data stored in AWS S3 storage.

  • [July 2017] A Verizon contractor leaked over 10 million customer records stored in S3
  • [July 2017] WWE leaked information on over 3 million customers stored in S3
  • [July 2017] Dow Jones leaked details of over 2 million customers from data stored in S3
  • [August 2017] A voting machine supplier leaked over 1.8 million voter records from S3
  • [September 2017] A vehicle tracking vendor leaked half a million customer records
  • [September 2017] Time Warner Cable leaked 4 million customer records from S3
  • [September 2017] Accenture left potentially sensitive data exposed on S3
  • [November 2017] The US Department of Defense exposed 1.8 billion posts from S3

The list keeps growing as more breaches are disclosed every month.  We begin by asking five fundamental questions about what is going on behind these massive breaches and then present three simple steps that enterprises can take to proactively prevent them.  As enterprises rush to move their workloads and data to cloud, the security of S3 and other cloud resources should not be forgotten or ignored.  Governance of cloud resources in design, operations and continuous monitoring is critical to keeping data secure.

Five security challenges

Let us dig deeper into five challenges that enterprises face in keeping data secure in S3 buckets in cloud.

  1. [Skillset Problem] As enterprises move to cloud, they often lack the security expertise needed to keep customer data safe there.  For example, despite its secure-sounding name, granting the “Authenticated Users” group permission on an S3 bucket is highly insecure – it allows any AWS account holder access – and this is what caused the Dow Jones exposure.  Newcomers to cloud security can easily misinterpret such settings and mistakenly think they are secure when they aren’t.
  2. [Detection Problem] Most of the S3 bucket leaks came from simple misconfigured security settings that made buckets “public” readable.  Can such misconfigurations be easily detected and proactively corrected?  For example, granting global GET permissions on an S3 bucket (http://docs.aws.amazon.com/AmazonS3/latest/dev/example-bucket-policies.html#example-bucket-policies-use-case-2) is an insecure option for almost all use cases.  It makes sense for a bucket hosting a public website, but never for sensitive data.
  3. [Scale Problem] When enterprises have hundreds of accounts and thousands of developers using cloud, it is impossible to manually audit the hundreds or thousands of S3 buckets.  This is one reason buckets end up insecure: there are simply too many of them to audit manually and to continuously monitor for permission changes.
  4. [Agility Problem] With businesses moving fast and thousands of changes (new features, new configurations, new services) being made to cloud every day, how does an enterprise make sure that cloud resources such as S3 buckets are secure and compliant every day and every hour?
  5. [Automation Problem] Finally, as the scale and frequency of changes in cloud increase (points #3 and #4 above), there is a clear need for automation.  Are there tools to help with security automation?

We explored possible causes for insecure buckets.  Now let us focus on solutions.

Whose fault is it for insecure S3 buckets?

Cloud security is a shared responsibility between the cloud provider (AWS) and customers, and customers must do their part.  AWS S3 storage is secure by default when buckets are created.  However, as customers change bucket permissions, incorrect settings such as “Public” or “Authenticated Users” can leave buckets wide open.  This configuration drift can happen over time, and there needs to be a way to detect, alert on and respond to it.

As businesses store sensitive data on S3, encrypting that data is a complementary measure that should be implemented alongside permission controls.  AWS S3 provides server-side encryption capabilities that customers can use to further protect their data.  If buckets holding sensitive data are not encrypted, such misconfigurations also need to be detected, alerted on and responded to.

These are two examples of the customer’s responsibilities for keeping S3 buckets secure.
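The sketch below illustrates these two customer-side checks with plain boto3 (a minimal example, not a vendor tool): it flags buckets whose ACL grants access to AllUsers or AuthenticatedUsers, and buckets without default server-side encryption.

```python
# Minimal sketch, assuming plain boto3: flag buckets with public/authenticated-users
# ACL grants and buckets without a default server-side encryption configuration.
import boto3
from botocore.exceptions import ClientError

PUBLIC_GRANTEES = (
    "http://acs.amazonaws.com/groups/global/AllUsers",
    "http://acs.amazonaws.com/groups/global/AuthenticatedUsers",
)


def audit_buckets():
    s3 = boto3.client("s3")
    for bucket in s3.list_buckets()["Buckets"]:
        name = bucket["Name"]
        grants = s3.get_bucket_acl(Bucket=name)["Grants"]
        if any(g["Grantee"].get("URI") in PUBLIC_GRANTEES for g in grants):
            print(f"{name}: ACL grants access to all or authenticated users")
        try:
            s3.get_bucket_encryption(Bucket=name)
        except ClientError as err:
            if err.response["Error"]["Code"] == "ServerSideEncryptionConfigurationNotFoundError":
                print(f"{name}: no default server-side encryption configured")
            else:
                raise


if __name__ == "__main__":
    audit_buckets()
```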

Three simple steps to secure S3 buckets

AWS S3 buckets can be kept secure by taking 3 simple steps.  

I. Design the security model, policies and controls: First, enterprises must define a security model for governing all their cloud resources, such as S3 buckets, and create policies to enforce that model.  They should encode the security model as part of their “security as code” design best practices, for example using AWS CloudFormation templates to bake security in early in the design.  Examples of security model policies to enforce for S3 are given below:

  1. Ensure that no S3 buckets allow public read for All Users or Authenticated Users
  2. Ensure all S3 buckets have a policy requiring server-side encryption and encryption in transit for all objects stored in the bucket
  3. Ensure that the principle of least privilege is followed for S3 permissions
  4. Ensure that S3 bucket access logging is enabled
  5. Ensure that S3 buckets holding CloudTrail logs are not publicly accessible
  6. Ensure any exceptions to the above policies are approved by the application owner of the service

The first policy rule, if correctly designed and continuously checked, could have prevented 90% of this year’s breaches.  The second rule covers encryption at rest for all data stored in S3; many compliance standards require it, so it is important to get it done.  The third rule is an IAM security design best practice that must be applied to all cloud resources, including S3 buckets.  The fourth and fifth rules matter for audit logging of access, monitoring and forensics.  Most of these policies can be sourced from standards such as the CIS AWS Foundations and Three-tier Web Architecture benchmarks.
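As one hedged example of policy rule #2, the bucket policy below denies requests made without TLS and uploads that do not request server-side encryption; the bucket name is a placeholder for illustration.

```python
# Sketch of rule #2 above as an S3 bucket policy: deny non-TLS requests and deny
# object uploads that do not request server-side encryption.
import json
import boto3

BUCKET = "example-sensitive-data-bucket"  # hypothetical bucket name

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # in-transit encryption: refuse any request made over plain HTTP
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        },
        {   # at-rest encryption: refuse uploads that do not specify SSE
            "Sid": "DenyUnencryptedUploads",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
            "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}},
        },
    ],
}

boto3.client("s3").put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```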

II. Implement security policy automation: Enterprises should implement a policy automation solution using a cloud security vendor tool such as BMC Policy Service.  These tools can programmatically enforce the “security as code” policies and controls.  For example, they can continuously assess cloud resources and report the security posture of S3 buckets even when thousands of changes are happening across hundreds of buckets in dozens of AWS accounts.  Policy automation is a fundamental mind shift that needs to happen to prevent such breaches.  It addresses not only the scale and agility challenges identified earlier but can also mitigate some of the skill-set gap.

III. Remediation flows: Finally, enterprises should implement an actionable remediation and response plan for compliance violations through policy automation tools such as BMC Policy Service.  This can include automatically remediating misconfigured buckets, creating tickets in Jira or change control systems, and handling exceptions.  Imagine a developer misconfigures an S3 bucket and, within a few minutes, a policy automation tool detects the non-compliant bucket; based on a remediation policy, the bucket is then automatically remediated and its configuration reverted to a secure state.  This is the level of security that enterprises should demand and expect from policy automation and cloud security vendors.
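A minimal auto-remediation sketch is shown below, assuming plain boto3 and a placeholder ticketing call rather than a real Jira integration: it reverts any publicly granted bucket ACL to private and records what it did.

```python
# Hedged sketch of an auto-remediation flow: revert public bucket ACLs to private
# and record a ticket. open_ticket is a placeholder, not a real Jira client.
import boto3

PUBLIC_GRANTEES = (
    "http://acs.amazonaws.com/groups/global/AllUsers",
    "http://acs.amazonaws.com/groups/global/AuthenticatedUsers",
)


def remediate_public_buckets(open_ticket=print):
    s3 = boto3.client("s3")
    for bucket in s3.list_buckets()["Buckets"]:
        name = bucket["Name"]
        grants = s3.get_bucket_acl(Bucket=name)["Grants"]
        if any(g["Grantee"].get("URI") in PUBLIC_GRANTEES for g in grants):
            s3.put_bucket_acl(Bucket=name, ACL="private")  # back to a secure default
            open_ticket(f"Reverted public ACL on bucket {name}; review intended access")


if __name__ == "__main__":
    remediate_public_buckets()
```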

What next?

All of the AWS S3 breaches we heard about this year were completely preventable.  Customers should define security models, policies and controls, and use security policy automation tools to implement and remediate them.  This ensures that when S3 or other cloud resource misconfigurations happen over time, they are alerted on and, in many cases, even auto-remediated.

BMC Policy Service is a security automation tool that you can get started with in less than 30 minutes to secure all your buckets and cloud resources – https://www.bmc.com/it-solutions/secops-policy-service.html

 

From 5 to 85 – Just 5 security practices can make you 85% secure

Most security attacks are simple-minded.  As you read through the news of recent security breaches, most of them could have been avoided.  These breaches can be traced back to simple misconfigurations, such as an S3 bucket misconfigured for public access, or to a lack of basic security hygiene, such as out-of-date security patching.  And yet, so many of us don’t seem to know or follow basic security hygiene.  The security market is confusing, with hundreds of security vendors offering their products and dozens of compliance frameworks such as NIST, CIS, HIPAA and FedRAMP to choose from.  It is often unclear where to begin and what to focus on first.  It seems that enterprises may be focusing on defending against complex scenarios such as sophisticated threat vectors and advanced persistent threats while unwittingly missing the common-sense first steps.  We believe that by implementing just the top 5 Center for Internet Security (CIS) controls, enterprises can reduce their cyber risk by 85%, according to several industry studies.  Let us break this down and see how we can get to 85% safety with just 5 steps.

Top 5 CIS Controls – What are they?

The Center for Internet Security has created 20 controls and prioritized the top 5 which, if implemented correctly, will reduce security risk by 85%.  These controls are pragmatic, prescriptive and can be easily automated through the security automation tools listed below.

Table 1. Center for Internet Security Top 5 Security Controls

The Top 5 CIS Controls and the security automation tools and practices that address them:

  1. Inventory of Authorized and Unauthorized Devices – BMC Discovery; SecOps Response
  2. Inventory of Authorized and Unauthorized Software – BMC Discovery; Bladelogic Server Automation; Bladelogic Network Automation
  3. Secure Configurations for Hardware and Software (see also CIS 9 and CIS 11 for securing configurations of network infrastructure) – Assessment and remediation in cloud: SecOps Policy; assessment and remediation in the datacenter: Bladelogic Server and Network Automation
  4. Continuous Vulnerability Assessment and Remediation – Scanning: Qualys, Nessus, Rapid7; vulnerability management: SecOps Response; patching: Bladelogic, SCCM
  5. Controlled Use of Administrative Privileges – SecOps Policy; SIEM

The First Two

The first 2 CIS controls focus on ‘Inventory of Authorized and Unauthorized Devices and Software’.  We have all heard the management tenet: you cannot manage what you don’t know.  In security, “you cannot secure what you don’t know” is equally applicable.  Having complete visibility into all devices, computers, servers, network devices and databases, as well as all software and applications, is the first critical step for any enterprise.  Without this complete inventory, the security scanning, patching, compliance and operations teams will not know about missing devices or missing applications and will fail to scan or patch them.  This lack of visibility leads to security risk.  Enterprises should consider a) an automated discovery tool that discovers not just infrastructure, devices and machines but also applications and software, and b) correlation tools that can correlate and map device data from multiple enterprise tools (scanning, patching, compliance and discovery) and alert you to the ‘blindspots’.  These blindspots represent potential security risk because they remain invisible and hence go unscanned or unpatched.
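The blindspot correlation itself is simple set arithmetic, as the sketch below shows; the two input lists are placeholders for exports from a discovery tool and a scanner.

```python
# Illustrative blindspot check: assets known to discovery but never covered by the
# vulnerability scanner. Both sets are placeholders for real tool exports.
discovered_hosts = {"web-01", "web-02", "db-01", "legacy-app-03"}  # discovery/CMDB export
scanned_hosts = {"web-01", "web-02", "db-01"}                      # scanner's last run

blindspots = discovered_hosts - scanned_hosts
if blindspots:
    print("Unscanned (potentially unpatched) assets:", ", ".join(sorted(blindspots)))
```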


The Middle

CIS control #3 deals with ‘Secure Configurations for Hardware and Software’.  Security misconfigurations are common and very much preventable.  A port on a machine or firewall left open, an operating system that is not hardened, an S3 bucket or files left open for public access, a MongoDB or Elasticsearch database left publicly accessible, default passwords that are never changed, or a misconfigured firewall letting in unwanted traffic – these are some of the common misconfigurations behind highly publicized breaches in both cloud and the datacenter.

While CIS control #3 focuses on hardware and software, two related controls extend secure configurations to network devices and are important considerations.  CIS control #11, “Secure Configurations of Network Devices”, is similar to #3.  CIS control #9 is “Limitation and Control of Network Ports, Protocols and Services”.  Together, these two secure network infrastructure such as the open firewalls and ports discussed earlier.

Enterprises should consider a) tools to assess, harden and remediate operating systems to create “golden” images and Docker containers based on standards such as CIS, DISA and PCI, and b) tools that can assess, harden and remediate their cloud resources across the full stack: hardware, software, servers, storage and network.  As companies move to public cloud, it is even more important that cloud resources are hardened just like servers are in datacenters.  Cloud resources include S3 buckets, load balancers and security group configurations, which can be hardened using standards such as the CIS AWS Foundations Benchmark.  Once the assessment is complete, a well-established process must exist to remediate violations across the full stack and multi-cloud.  More details here.

CIS control #4 is “Continuous Vulnerability Assessment and Remediation”.  Unpatched vulnerabilities in servers are one of the surest ways of getting attacked: there are over 80,000 vulnerabilities (CVEs) catalogued by NIST, and a dozen or more are added every week.  What is even more interesting is that 99% of exploited vulnerabilities are compromised more than a year after the CVE was published.  The window of opportunity for attackers should be minimized by fixing critical, high and highly exploitable vulnerabilities as soon as fixes are available, not months or years later.  Enterprises should consider a) automated scanning tools that can scan for vulnerabilities, b) vulnerability lifecycle management tools that can prioritize and plan remediation SLAs to keep the window of opportunity small, and c) patching tools to execute remediations on servers and network devices across cloud and datacenters.

The Last but not the least

The final control in the top 5, #5, deals with “Controlled Use of Administrative Privileges”: the “processes and tools used to track/control/prevent/correct the use, assignment, and configuration of administrative privileges on computers, networks, and applications.”  Elevated access privileges, improperly configured identity and access management (IAM) permissions, weak administrative passwords without complexity policies, no password rotation, no MFA, and more can all lead to breaches and attacks.  Enterprises should consider a) assessment and hardening tools to validate that all systems, from operating systems and servers to applications and cloud, follow best practices for administrative privilege management, such as assessing password policies for account logins, and b) continuous monitoring tools such as a SIEM to track activity related to administrative privilege usage and alert on suspicious activity.  The CIS benchmarks for AWS and for servers have a number of rules related to this that assessment and hardening tools such as SecOps and Bladelogic can easily address.

What next?

The CIS Critical Security Controls are a prioritized and actionable set of cybersecurity best practices to prevent the most dangerous and common cyber attacks.  Using just the top 5 can reduce your risk by 85%.  As an example, a cloud assessment and remediation tool like SecOps Policy could easily have detected the three high-profile S3 breaches.  The recent WannaCry attack could have been thwarted by proactively remediating vulnerabilities using SecOps Response.  There are a number of security automation tools in the marketplace that can help you reach that first 85%, and you should leverage them first.  After you implement the top 5, move to the remaining 15 controls and then to more comprehensive frameworks such as NIST and ISO, which add another 100-300 controls to further tighten the security of cloud apps and the datacenter.

 

Three Cloud Data Security best practices against Ransomware

Ransomware is becoming a global menace.  Last week’s WannaCry attack, as well as the ransomware attacks earlier this year on MongoDB and Elasticsearch clusters, have become common headlines.  Hundreds of thousands of servers and databases were hacked in 2017 as a result of ransomware.  Clearly, as indicated by this tweet, the immediate response to WannaCry is to patch Windows servers to remediate a vulnerability that Microsoft fixed two months ago.  Are there better ways to address data security more proactively?  Why were thousands of servers unpatched when a patch had been released by Microsoft two months earlier?  We describe a data security process and three key best practices for protecting data against ransomware.


Data Security Process

The data security process has four key stages, as shown below: discover and classify, prevent, detect, and respond.


The first step is to discover data across the enterprise, such as databases and datastores, and classify it into levels such as sensitive or personal information that requires strong security.

The second step is to apply prevention techniques to secure your data: using proactive policies-as-code to reduce the blast radius of an attack, building layers of protection with defense-in-depth, taking backups continuously, and continuously auditing the security of datastores as well as of all compute and network resources along potential attack paths to those datastores.  A policy management process and tool for automating policy checks for data security, compliance and backups ensures that continuous automated audits happen, with auto-remediation.  Another prevention technique is a vulnerability and patch management process with remediation-time SLAs for critical patches.

Finally, you need to continually monitor for and detect data breaches and data security issues, and respond to them to mitigate the damage.  A vulnerability management process and server management tools enable quick identification of the vulnerable servers that put the business at risk, and of the patching needed to remediate them.

Let us next break down these 3 practices for securing your data.

I. Prevention – Policies to continuously audit

Ransomware and other data loss can be prevented with defensive techniques that proactively check configurations across systems, networks, servers and databases.  Four key policies need to be defined and enforced through a policy management process and tool.  These policies form the defense-in-depth layers that protect data from hackers and ransomware attacks.


  1. Compliance policy – There are a number of CIS database configuration checks that must be followed to ensure database configurations are secure.  Many default database settings, for example in MongoDB or Elasticsearch, leave them wide open to the internet.  These security and compliance checks need to be continuously evaluated and corrected as a first measure of prevention.
  2. Data security policy – Data can be secured by ensuring encryption at rest, with customer-provided encryption keys as standard practice for all databases.  Default credentials should be changed, and appropriate backups must be taken and verified.  All of these checks can easily be defined as policies and continuously evaluated for misconfigurations.
  3. Network security policy – Databases, cloud servers and the networks along access paths to databases must be secured with appropriate firewall or security group rules, and by hosting databases in isolated VPCs and subnets.  Whitelisting of the servers allowed to access the databases should also be done to limit network access.  These policies can be evaluated and enforced continuously (a minimal check of this rule is sketched after this list).
  4. Server compliance policy – Databases run on servers that need hardened OS images and a vulnerability and patch management process.  The next two best practices describe this process and its tools.  All servers in the enterprise should follow these practices, since lateral movement can make database servers insecure if any other server in the cloud is compromised.
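The sketch below shows what an automated check of the network security policy could look like (a minimal boto3 example, not a product feature): it flags security groups that expose common database ports to the whole internet.

```python
# Minimal sketch of the network security policy above: no security group should
# expose common database ports (MongoDB 27017, Elasticsearch 9200, MySQL 3306,
# PostgreSQL 5432) to 0.0.0.0/0.
import boto3

DB_PORTS = {27017, 9200, 3306, 5432}


def exposed_db_groups(region="us-east-1"):
    ec2 = boto3.client("ec2", region_name=region)
    findings = []
    for group in ec2.describe_security_groups()["SecurityGroups"]:
        for rule in group.get("IpPermissions", []):
            if rule.get("FromPort") is not None:
                ports = set(range(rule["FromPort"], rule["ToPort"] + 1))
            else:
                ports = DB_PORTS  # protocol "-1" means all ports
            world_open = any(r.get("CidrIp") == "0.0.0.0/0" for r in rule.get("IpRanges", []))
            if world_open and ports & DB_PORTS:
                findings.append(group["GroupId"])
    return findings


if __name__ == "__main__":
    print("Security groups exposing database ports:", exposed_db_groups())
```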

II. Prevention – Patch SLA monitoring and continuous patching

Enterprises should define a vulnerability and patch management process with an objective for “time to remediate critical security issues”, a Remediation Time Objective or RTO (not to be confused with the backup RTO).  Enterprises should have RTO SLA policies in place that specify the number of days within which all critical vulnerabilities, such as those with CVSS severity scores of 9 or 10, will be remediated.  An RTO SLA of 15 to 30 days is quite common for patching critical security vulnerabilities.  This SLA needs to be continuously monitored, and any violations notified and corrected.  In the latest WannaCry attack, thousands of computers remained unpatched more than 60 days after a patch for the vulnerability was released.  This could have been avoided with continuous monitoring and remediation of the RTO SLA.
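SLA monitoring itself can be a very small piece of automation, as the sketch below shows; the findings list is illustrative and would normally come from a scanner export or API.

```python
# Minimal RTO SLA check: flag critical findings (CVSS 9 or 10) open longer than the
# SLA. The findings are illustrative stand-ins for a scanner export.
from datetime import date

RTO_SLA_DAYS = 30

findings = [
    {"host": "win-fileserver-01", "cve": "CVE-2017-0144", "cvss": 9.3, "opened": date(2017, 3, 14)},
    {"host": "app-frontend-02", "cve": "CVE-2017-5638", "cvss": 10.0, "opened": date(2017, 5, 1)},
]

today = date.today()
for finding in findings:
    age = (today - finding["opened"]).days
    if finding["cvss"] >= 9 and age > RTO_SLA_DAYS:
        print(f'{finding["host"]}: {finding["cve"]} open {age} days, '
              f'exceeds the {RTO_SLA_DAYS}-day RTO SLA')
```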

Many enterprises can go one step further by not just monitoring and actively managing the RTO SLA but also automating detection and patching.  As soon as critical vulnerabilities are identified through periodic scans and patches become available from vendors, the management tools should automatically update their patch catalogs and patch servers and network devices in a zero-touch approach:

  1. Continuously scan environment for detecting vulnerabilities
  2. Select critical vulnerabilities for automated patching
  3. Continuously look for patches from vendors such as Microsoft, download critical patches for vulnerabilities and keep patch catalog contents updated automatically
  4. Automatically apply patches for critical vulnerabilities when they are discovered based on policies for RTO.

With these SLA monitoring and patching controls in place, enterprises can achieve a high degree of data security through proactive prevention.

III. Response – Vulnerability and Patching

Even with the preventive controls discussed in I and II, there is still a need to detect and respond, because not all security attacks can be prevented.  Once ransomware or another data exfiltration or security threat has been identified, it is important to be able to identify the vulnerable servers and patch them as soon as possible.  A reactive vulnerability and patch management system must be able to select a specific CVE, assess the servers that require patching and, with a few clicks, apply the patches and configuration changes that remediate that critical CVE.

Recommendations

Data security starts with an enterprise data security process consisting of data discovery, prioritization, prevention, detection and response stages.  The first best practice is prevention: datastores such as MongoDB and Elasticsearch, as well as servers and networks, are continuously audited for security and compliance through policies.  A policy management tool is a critical enabler for these audits and preventive checks; such tools “detect” and “harden” all the places and paths along which data is stored, moved and accessed, achieving defense-in-depth.  The second best practice is to define a “Remediation Time Objective” SLA and implement a vulnerability lifecycle management process.  A vulnerability management tool continuously scans for vulnerabilities, gives visibility into critical vulnerabilities with SLA violations, and automatically keeps the environment patched with zero touch.  Many enterprises that followed a 30-day RTO SLA were not impacted by WannaCry because they had patched their systems in March, soon after the patch was released.  The third best practice is the ability to assess a vulnerability and remediate it during emergencies or as part of security incident response, as with the WannaCry ransomware threat that hit over the weekend.  Together, these three proactive and reactive practices and tools can keep data secure and help avoid costly and reputation-damaging ransomware attacks.

BMC Software has three products: Bladelogic Server Automation, SecOps Response and SecOps Policy cloud services.  They can keep your applications, servers, networks and data safe from ransomware attacks.  WannaCry was a non-event for customers who were already running proactive vulnerability and patch management processes through these tools.

Full disclosure:  I work at BMC Software.  Check out http://www.bmc.com/it-solutions/bladelogic-server-automation.html,  http://www.bmc.com/it-solutions/secops-response-service.html and https://www.youtube.com/watch?v=hSFP5-kzbT0.

Policies Rule Cloud and Datacenter Operations – Cloud 2.0

Trust but verify – A new way to think about Cloud Management

Cloud management platforms (CMPs) are a popular way to manage cloud servers and applications and have been widely adopted by small and large enterprises.  For datacenter (DC) management, which spans the decades before cloud, there has been a sprawl of systems management tools.  The common wisdom in both models is to control access to the cloud at the gates, via CMPs or DC tools, much as forts were historically protected with moats and controlled gates.  However, with the increasing focus on agility and delivering business value to customers faster, developers and application release teams require far greater flexibility in working with cloud than previously imagined.  Developers want full control and flexibility over the tools and APIs they use to interact with cloud, instead of being stopped at the gates and prescribed a single uniform gateway.  Application owners want to allow this freedom but still want cloud workloads to be managed, compliant, secure and optimized.  This freedom, and the business driver of agility, is creating a new way to reimagine cloud, call it cloud 2.0, which does not stop you at the gates but lets you in while continuously checking policies to ensure that you behave well in cloud.  The ability to create and apply policies will play a key role in this emerging model of governance, where freedom is tied to responsibility.  We believe this next-generation cloud operational plane will drive how workloads are deployed, operated, changed, secured and monitored in clouds.  Enterprises should embrace policies at all stages of the software development lifecycle and operations, for datacenters in cloud and on-prem.  Creating, defining and evaluating policies, and taking corrective actions based on them, will be a strategic enabler for all enterprises in the new cloud 2.0 world.

Defining Cloud Operational Plane

In this new cloud management world, you are not stopped at gates but checked continuously. Trust but verify is the new principle of governance in Cloud 2.0.  Now, let us review the 5 key areas for a cloud operational plane and how policies will play a critical role in governance.

  • Provisioning and deployment of cloud workload
    • Are my developers or app teams provisioning right instance types?
    • Is each app team staying within its allocated quota of cloud resources?
    • Is the workload or configuration change being deployed secure and compliant?
    • How many pushes are going on per hour, daily and weekly?
    • Are any failing and why?
  • Configuration changes
    • Is this change approved?
    • Is it secure and compliant?
    • Tell me about all the changes happening in my cloud
    • Can I audit these changes to know who did what and when?
    • How can I apply changes to my cloud configurations, resources, upgrade to new machine images etc.?
  • Security and compliance
    • Continuously verify that my cloud is secure and compliant
    • Alert me on security or compliance events instantly or daily/weekly
    • Remediate these automatically or with my approval
  • Optimization
    • Are my resources being used optimally? Do I have the right capacity? Do I have scaling where I need it?
    • Showback of my resource usage and costs
    • Tell me where I am wasting resources
    • Tell me how I can cut down costs and waste?
  • Monitoring, state and health
    • Is my cloud workload healthy?
    • What are the key monitoring events? Any unhealthy events?
    • Remediate these automatically or with my approval

How can the Cloud Operational Plane be enabled through policies?

The following comparison contrasts the old and new worlds of cloud management.  In the old world of cloud management platforms (CMP), we block without trust.  In the new world of the cloud operational plane (COP), the gates are open, so it becomes necessary to manage the cloud through policies as the central tenet of cloud operations.

For each area below, every row gives the CMP approach (block without trust), the COP approach (trust but verify) and the recommended practice.

Deployment
  • Deployment to multi-cloud – CMP: a single API across all clouds that teams are forced to use, with catalog-driven provisioning. COP: various tools with no single point of control, no single API and no single tool; use the best API or tool for each cloud; no catalog. Recommended practice: DevOps tooling of your choice.
  • Manage/start/stop your resources – CMP: a single tool. COP: various tools with no single point of control. Recommended practice: DevOps or cloud tool of your choice.
  • DevOps continuous deployment – CMP: hard to integrate; the CMP API is a hindrance to adoption. COP: embraces this flexibility and allows changes through any toolset. Recommended practice: policies for DevOps process compliance.

Config Changes
  • Unapproved config changes – CMP: block if not approved. COP: usually allow, or block if more control is desired. Recommended practice: change policies.
  • Config changes API – CMP: single API. COP: no single API. Recommended practice: DevOps tool.
  • Audit config changes – CMP: yes. COP: yes. Recommended practice: audit and capture all changes.
  • Rollback changes – CMP: no. COP: yes, with advanced tools for blue-green, canary, etc. Recommended practice: DevOps tool.
  • Change monitoring – CMP: no. COP: yes. Recommended practice: change monitoring.
  • Change security – CMP: no. COP: yes. Recommended practice: policy for change compliance/security.

Security & Compliance
  • Security in the DevOps process – CMP: N/A. COP: yes. Recommended practice: policy for DevOps security.
  • Monitor, scan for issues, get notified – CMP: N/A. COP: continuously monitor for compliance and security. Recommended practice: multi-tool integrations.
  • Prioritize issues – CMP: N/A. COP: yes, with multiple manual and automated prioritizations. Recommended practice: policy-based prioritization.
  • Security and compliance of middleware and databases – CMP: N/A. COP: yes. Recommended practice: compliance and security policies for middleware and databases.

Optimization
  • Quota & decommissioning – CMP: block deployment if out of quota; decommission on lease expiry. COP: allow but notify, or remove later with resource usage policies; decommission on lease expiry. Recommended practice: policies for quota and decommissioning.
  • Optimization – CMP: N/A. COP: yes. Recommended practice: policies for optimization and control.

Policies in Enterprises

As enterprises move into a world of freedom and agility with cloud and DevOps, it becomes increasingly important to use policies to manage cloud operations.  The illustrative diagram below shows how policies can be used to manage everything from the DevOps process, on-prem and cloud environments, and production environments to cloud infrastructure, applications, servers, middleware and databases.


For agile DevOps, policy checks can be embedded early, or wherever needed, in the process to catch compliance, security, cost or other shift-left violations in source code and libraries. For example, consider a DevOps process starting with a continuous integration (CI) tool such as Jenkins®. Developers and release managers can trigger OWASP (Open Web Application Security Project) checks to scan source code libraries and block the pipeline if any insecure libraries are found.

Production environments host applications consisting of servers, middleware, databases and networks in clouds such as AWS and Azure.  All of these need to be governed by policies, as shown above.  For example, RHEL servers in cloud are governed by four policies: cost control, patching, compliance and vulnerability remediation.  Similarly, there are security, compliance, scale and cost policies for other cloud resources such as databases and middleware.  Finally, the production environment itself is governed by change, access control and DR policies.

In the modern cloud 2.0, all of these policies will be encoded as code.  A sample policy as code can be written in a language such as JSON or YAML:

  • If s3 bucket is open to public, then it is non-compliant.
  • If a firewall security group is open to public, then it is non-compliant.
  • If the environment is DEV and the instance type is m4.xlarge, then the environment is non-compliant.

Using policy-as-code ensures that these policies are created, evaluated, managed and updated in a central place, entirely through APIs.  Additionally, enterprises can choose to remediate resources and processes when certain policies are violated, to ensure that cost, security, compliance and changes are governed.
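As a rough illustration (our own sketch, not the syntax of any particular product), the rules listed above can be expressed as data plus a tiny evaluator; the resource records stand in for what a cloud inventory API might return.

```python
# Illustrative policy-as-code evaluator for the rules listed above. Policies and
# resource records are hypothetical examples, not a vendor policy language.
policies = [
    {"name": "no-public-s3", "type": "s3_bucket",
     "when": lambda r: r.get("public"), "message": "S3 bucket is open to the public"},
    {"name": "no-open-sg", "type": "security_group",
     "when": lambda r: "0.0.0.0/0" in r.get("ingress", []),
     "message": "security group is open to the public"},
    {"name": "dev-instance-size", "type": "ec2_instance",
     "when": lambda r: r.get("environment") == "DEV" and r.get("instance_type") == "m4.xlarge",
     "message": "oversized instance type for a DEV environment"},
]

resources = [
    {"id": "logs-bucket", "type": "s3_bucket", "public": True},
    {"id": "sg-web", "type": "security_group", "ingress": ["10.0.0.0/16"]},
    {"id": "i-0abc123", "type": "ec2_instance", "environment": "DEV", "instance_type": "m4.xlarge"},
]

for resource in resources:
    for policy in policies:
        if policy["type"] == resource["type"] and policy["when"](resource):
            print(f'{resource["id"]}: non-compliant ({policy["message"]})')
```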

Recommendations

Cloud management is changing from a “block on entry” model to a “trust but verify” model.  Some enterprises that wish to govern with absolute control at the gates will continue to use cloud management platforms extensively and effectively.  However, many enterprises are beginning to move to the new cloud 2.0 model, where the agility and flexibility of DevOps tools and processes are critical to success.  Instead of prescribing a single entry choke point or a single “CMP tool” for working with cloud, we let everybody in with their own tools and processes, but continuously verify that policies for deployment, resource usage, quota, cost, security, compliance and change are tracked, monitored and corrected.  Simple, effective, API-based policy-as-code definition, management, evaluation and remediation will be a central capability that enterprises need to run the new clouds effectively.

Full disclosure:  I work for BMC Software and my team has built a cloud native policy SaaS service, check out the 2 minute video here: https://www.youtube.com/watch?v=hSFP5-kzbT0

Acknowledgement: A few of my colleagues at work, JT and Daniel, proposed the fascinating analogy of forts and the cloud operational plane.  It motivated me to write this blog to show how cloud management itself is evolving from a guard-at-the-gates model to trust but verify.


Secret Sauce for Building Cloud Native Apps

A few months back I wrote a blog comparing mode 2 vs. mode 1 application development.  Gartner defines bimodal IT as an organizational model that segments IT services into two categories based on application requirements, maturity and criticality: “Mode 1 is traditional, emphasizing scalability, efficiency, safety and accuracy. Mode 2 is non-sequential, emphasizing agility and speed.”  In the last six months, we built a mode 2 cloud native application on AWS, and this got us thinking about two questions: What were the unique enablers that led to a successful mode 2 cloud app?  And is Gartner right in keeping the two modes separate, and what can mode 1 change and learn from mode 2?  Let us analyze the first question here and discuss the key best practices in building cloud native mode 2 applications.

Focus – do one thing – know your MVP – Mode 2 cloud native applications require a maniacal focus on the customer problem, market requirements and delivering value to the customer by concentrating on one or two use cases.  We spent very little time in needless meetings deciding what to build or how to build it.  I have seen many projects take 2-6 months just to define the product to build.  We knew exactly what to build, with uncompromising conviction.  Startups usually have this solid vision and focus on doing one thing right, usually called the “minimum viable product” (MVP).  We acted like a startup charged with solving a customer problem.

Leverage cloud native higher-level services – We decided to go “all in” with AWS, using disruptive higher-level PaaS services such as Kinesis, Lambda, API Gateway and NoSQL databases.  We also went serverless, with AWS Lambda as the key architectural microservice pattern, using servers only when we absolutely needed them.  No more discussions about the “platform”, lock-in, portability to multiple clouds or multiple datacenters, “on-prem” vs. “SaaS”, or other utopian goals.  We wanted to get to market fast, and AWS managed higher-level services were our best bet.  We avoided at all costs decisions such as which open source projects to use, endless comparisons, discussions, debates and paralysis by analysis.  We also worked quietly, avoiding cross-organizational discussions that could slow us down, operating almost as a skunkworks: a fully autonomous team charged with making our own decisions and our own destiny.  After six months of using AWS, we continue to be delighted and amazed by the power of the platform.  The undifferentiated heavy lifting is all done by AWS: infrastructure automation such as routing, message buses, load balancers, server monitoring, running clusters and servers, patching and maintaining them, and so on.  These are not our key business strengths.  We want to focus on the business problems that matter and on building “apps”, not managing “infra”, which is best left to AWS.


Practice everything “as code” – From the kick-off, we followed “as-code” concepts to define all our configuration, infrastructure, security, compliance and even operations.  Each of these aspects is declaratively specified as code and version controlled in a Git repository.  We have all heard of infrastructure-as-code; we practice it to the point where we can build and deploy our entire application and infrastructure stack from AWS CloudFormation (CFN) templates with the single click of a button.  We can even deploy customized stacks for dev, QA, perf and production, all driven from a single CFN template configured differently per environment.  This was a major achievement and a decision that saves us time by keeping all our environments consistent and repeatable.  Today we have a CFN template for each microservice and a few common infrastructure templates that describe not only the components but also configuration (such as the amount of memory), security (such as HTTPS/SSL and IAM permission roles) and compliance as code.  Leveraging the AWS ecosystem for all of this has vastly reduced our time to market, since everything works seamlessly and is well integrated.

Security, operations and monitoring in cloud services – Being a SaaS cloud service, we leveraged the security, operations, metering and monitoring frameworks of the AWS platform and ensured that the architecture and development of our microservices were built for SaaS.  Security threat modeling and standardized AWS blueprints for NIST helped us save time and start from well-tested security patterns such as VPCs, security groups and the IAM permission model.  Thinking about operations, health and monitoring while building SaaS components was a key realization we had as we moved into the third and fourth sprints.  For example, log messages containing exceptions or other unique patterns can generate alerts sent to email or SMS.  AWS CloudWatch provides powerful alerting and log aggregation capabilities, and we use it extensively to monitor our app from log patterns and custom application metrics.  Finally, most of our operations is “app ops”, since we own and manage almost no servers, databases or infrastructure.  Being truly serverless has removed an entire set of infrastructure concerns, and we are starting to see the benefits as we get into operations.  Even our operations folks talk about application issues, not about rebooting servers or patching for security updates.
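For example, the sketch below (assuming boto3, with placeholder log group and SNS topic names) creates a CloudWatch Logs metric filter on the term “Exception” and an alarm that notifies an SNS topic when it fires.

```python
# Hedged sketch: alert on "Exception" log lines via a CloudWatch Logs metric filter
# and alarm. Log group name and SNS topic ARN are placeholders.
import boto3

LOG_GROUP = "/myapp/prod/service"                            # hypothetical log group
SNS_TOPIC = "arn:aws:sns:us-east-1:123456789012:ops-alerts"  # hypothetical topic

logs = boto3.client("logs")
logs.put_metric_filter(
    logGroupName=LOG_GROUP,
    filterName="exception-count",
    filterPattern="Exception",
    metricTransformations=[{
        "metricName": "ExceptionCount",
        "metricNamespace": "MyApp",
        "metricValue": "1",
    }],
)

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_alarm(
    AlarmName="myapp-exceptions",
    Namespace="MyApp",
    MetricName="ExceptionCount",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=[SNS_TOPIC],
)
```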

DevOps, automation and agility Kool-Aid – Just as with our AWS “all in” architecture decision, in our first week we picked a DevOps tool that a successful cloud company had used before and went “all in” with it.  We built our pipelines, one per microservice, with all the automation and stages from commit through dev, QA, staging and production.  Once we built the pipelines, we still had “red” all over them as automated tests failed.  This is when we realized that culture plays a very important part in agile software development; just having the right technology doesn’t cut it.  We adopted a new set of developer responsibilities: owning not just the code but the automation, the pipeline and the operation of that code in production.  This mindset shift is absolutely essential for moving to agile software delivery, where we can now push software changes to production a couple of times a week.  We are not Facebook or Amazon yet, but we have achieved a cadence that would have been unimaginable in a traditional packaged software company.  We also follow 1-2 week sprints to support this.  We are learning fast that the more frequently you deliver to production, the lower the risk.

Pushing to production – Pushing to production is an activity we stumbled over a couple of times before learning how to get it right.  We have a goal of a 30-60 minute production push that includes a quick discussion among developers, testers, product managers and release engineers about release readiness, followed by the push and a canary and synthetic test, after which we decide whether to continue or roll back.  This is a journey for us, and we continue to build the operational know-how and supporting tools that help us do smoother deployments and the occasional rollback.

Think production when developing a feature – Pushing to production also plays a central part in new features and capabilities.  Recently, we planned how to roll out a major infrastructure change and a major schema change, considering that production holds huge amounts of data and that all pushes require zero downtime.  These considerations are now a regular part of every conversation about a new feature, not just an afterthought during a push to production.  Production scale and resiliency are other aspects to watch continuously when building apps, since the scale factor with a growing customer base is much larger than for typical on-prem packaged apps.  Resiliency is critical because every call, message, server, service or infrastructure component can fail or become slow, and the app has to deal with this gracefully.  Finally, we build applications that follow the 12-factor principles.

Cost angle – One of the areas we now manage very closely is the cost of AWS services.  We track daily and monthly bills and look for ways to optimize our usage of AWS resources.  An architecture decision now also requires a cost discussion, which we never had when we were building enterprise packaged mode 1 applications.  This is healthy for us, as it drives innovation in putting together solutions that optimize not just performance, scale and resiliency but also cost.

Startup small-team thinking – Finally, we are a small two-to-three-pizza team.  Having complete ownership of not just the software code but all decisions about individual microservices, tools and language choices (we use Node.js, Java and, very soon, Python), and doing the right thing for the product, felt great and energized everyone on the team.  The startup feeling inside a traditional enterprise software firm gives all developers and testers the freedom to innovate, try new things and get things done.

That’s it for now.  We are on a journey to become cloud native and are already pushing features into production on a weekly, and occasionally daily, schedule without sacrificing stability, safety or security.  In a follow-up blog, I will cover the Gartner debate on how the mode 2 learnings above can be applied to make traditional mode 1 applications more like cloud native mode 2 apps.

Shift-Left: Security and Compliance as Code

In part I of this blog, we focused on how development and operations can use ‘Infrastructure-as-Code’ practices for better governance, traceability, reproducibility and repeatability of infrastructure.  In this part II, we discuss best practices for using ‘Security-as-Code’ and ‘Compliance-as-Code’ in cloud native DevOps pipelines to maintain security without sacrificing agility.

Traditionally, security is an afterthought.  Once a release is just about ready to be deployed, security vulnerability testing, pen testing, threat modeling and assessments are done using tools, security checklists and documents.  If critical or high vulnerabilities or security issues are found, the release is held back until they are fixed, causing weeks or months of delay.

The ‘Security-as-Code’ principle codifies security practices as code and automates their delivery as part of DevOps pipelines.  It ensures that security and compliance are measured and mitigated very early in the pipeline: at the use case, design, development and testing stages of application delivery.  This shift-left approach lets us manage security just like code, storing security policies, tests and results in the repository and the DevOps pipelines.

Let us break this down into the key practices for achieving it.  As shown in the diagram below, there are five points in a pipeline where we can apply security and compliance as code.  We will use a cloud native application running on AWS to illustrate them.


 

a) Define and codify security policies: Security is defined and codified for a cloud application at the beginning of the project and kept in a source code repository.  As an example, a few AWS cloud security policies (v1.0) are defined below:

  • Security groups/firewalls are secured, e.g., no 0.0.0.0/0 ingress rules
  • VPCs, ELBs and NACLs are enforced and secured
  • All EC2 instances and other resources are placed in a VPC and each resource is individually firewalled
  • All AWS resources (EC2, EBS, RDS, DynamoDB, S3) and their logs are encrypted, using KMS

Keeping security policies such as these in a document is the old world.  All such security policies need to be automated: by pressing a button, one should be able to evaluate the security policies for any application at any stage and in any environment.  This is the major shift that comes with security as code – everything about security is codified, versioned and automated instead of living in a Word or PDF document.  As shown in the diagram above, security as code is checked into the repository as code and versioned.

Enterprises should build standardized security patterns to allow easy reuse of security across multiple organizations and applications.  Using standardized security templates (also as code) yields out-of-the-box security instead of each organization or application owner defining such policies and automation per team.  For example, if you are building a three-tier application, a standard cloud security pattern must be defined.  Similarly, for a dev/test cloud, another cloud security pattern must be defined and standardized.  Both hardened servers and hardened clouds are patterns that yield security out of the box.

For example, with AWS, the five NIST AWS CloudFormation templates can be used to harden your networks, VPCs and IAM roles.  The cloud application can then be deployed on this secure cloud infrastructure, where these templates form the “hardened cloud” (blue) layer of the stack shown below.  For cloud or on-premises servers, server hardening can be done with products such as BMC Bladelogic.

[Figure: Hardened cloud stack built from the NIST CloudFormation templates]

b) Define security user stories and assess the security of the application and infrastructure architecture: Security-related user stories must be defined in the agile process just like any other feature stories.  This ensures that security is not ignored.

Security controls are identified as a result of security risk assessments and threat models performed on the application and infrastructure architecture.  These controls are then implemented in the application code or in policies.

 

c) Test and remediate security and compliance at application code check-in: Since frequent application changes are continuously deployed to production, security testing and assessment is automated early in the DevOps pipeline.  Security testing is triggered automatically as soon as a code check-in happens, for both application and infrastructure changes.  Security vulnerability tools are plentiful in the marketplace: Nessus, open-source Chef recipes for security automation, web pen-testing tools, static and dynamic code analysis, infrastructure scanners and others.

If critical or high vulnerabilities are found, automation creates defects/tickets in JIRA for any new security issues and assigns them to the owner of the affected microservice for resolution.
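
A hedged sketch of such a pipeline step is below; the JIRA host, project key, credentials and the shape of the scanner’s findings are placeholders, not taken from the original setup.  It files a JIRA issue for a finding via JIRA’s REST API and assigns it to the microservice owner:

var https = require('https');

// Sketch: file a JIRA issue for a critical/high finding reported by a scanner.
function createJiraIssue(finding, owner) {
  var issue = {
    fields: {
      project: {key: 'SEC'},                          // placeholder project key
      summary: '[Security] ' + finding.title,
      description: finding.details + '\nSeverity: ' + finding.severity,
      issuetype: {name: 'Bug'},
      assignee: {name: owner}                         // microservice owner, e.g. from a lookup table
    }
  };
  var body = JSON.stringify(issue);
  var req = https.request({
    hostname: 'jira.example.com',                     // placeholder JIRA host
    path: '/rest/api/2/issue',
    method: 'POST',
    auth: process.env.JIRA_USER + ':' + process.env.JIRA_TOKEN,
    headers: {'Content-Type': 'application/json', 'Content-Length': Buffer.byteLength(body)}
  }, function(res) {
    console.log('JIRA responded with HTTP ' + res.statusCode);
  });
  req.on('error', function(err) { console.error(err); });
  req.write(body);
  req.end();
}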

d) Test security and compliance of infrastructure-as-code artifacts: In the new cloud native world, infrastructure is represented as code such as Chef recipes, AWS CFN templates and Azure templates.  Security policies are embedded as code in these artifacts together with the infrastructure definitions.  For example, IAM resources can be created as part of AWS CFN templates.

These artifacts need to be tested for violations of security policies such as “no S3 bucket should be publicly readable or writable” or “do not use open security groups that allow traffic from 0.0.0.0/0”.  This needs to be automated as part of the application delivery pipeline using security and compliance tools.  Many tools, such as CloudCheckr and AWS Config, can be used.  For regulatory compliance policy checks such as CIS and DISA on servers, databases and networks, the BMC BladeLogic suite of products can be used to not just detect but also remediate compliance violations.
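
As a minimal sketch of such a check (the template path and the exact rule set are assumptions for illustration), a CloudFormation template can be scanned for these two policies before it is ever deployed, and the pipeline stage failed if violations are found:

var fs = require('fs');

// Sketch: scan a CloudFormation template for public S3 buckets and wide-open security groups.
var template = JSON.parse(fs.readFileSync('infra/app-stack.template.json', 'utf8'));  // path is an assumption
var findings = [];

Object.keys(template.Resources || {}).forEach(function(name) {
  var res = template.Resources[name];
  if (res.Type === 'AWS::S3::Bucket') {
    var acl = (res.Properties || {}).AccessControl;
    if (acl === 'PublicRead' || acl === 'PublicReadWrite') {
      findings.push(name + ': S3 bucket is publicly readable or writable');
    }
  }
  if (res.Type === 'AWS::EC2::SecurityGroup') {
    ((res.Properties || {}).SecurityGroupIngress || []).forEach(function(rule) {
      if (rule.CidrIp === '0.0.0.0/0') {
        findings.push(name + ': security group allows ingress from 0.0.0.0/0');
      }
    });
  }
});

if (findings.length > 0) {
  console.error('Policy violations:\n' + findings.join('\n'));
  process.exit(1);   // fail this pipeline stage
}
console.log('Template passed the security policy checks.');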

e) Test and remediate security and compliance in production: As the application moves from dev through QA to production, automated security and compliance testing continues to happen, although the risk of finding issues here is much lower since most of the security automation is done much earlier in the DevOps pipeline.  For mutable infrastructure, tools such as BMC BladeLogic can be used to scan production servers and apply security patches.  For immutable infrastructure, of course, new servers are created and replaced instead of being patched.

Conclusion

A shift-left approach to security and compliance is a key mindset change for DevOps pipelines.  Instead of security being a gating add-on process at the end of a release, or a periodic security scan of production environments, start with security at the beginning of an application release by representing it as “code” and baking it in.  We showed 5 best practices for embedding security and compliance into application delivery: defining security policies as part of infrastructure-as-code; standardizing hardened clouds just like hardened servers; defining compliance policies as code; and automating testing both in the early, lower environments and stages of a pipeline and in production environments.  We strongly believe that with these 5 best practices, successful product companies can ensure security while maintaining agility and speed of innovation.

 

Product Trials – Managing 1000s of machines on AWS Cloud

Most software companies want their products and services to be available for trials.  This allows customers to test drive a product or service before purchase and gives marketing valuable leads for customer acquisition and campaigns.  Many companies also want to get feedback from customers on unreleased products through trials.  If you have a SaaS service or a hosted product, this is achieved by building a limited N-day trial into the service itself behind the product registration page.  However, if you are in the business of delivering packaged software, it is not that easy.  With an automation engine, baked images of your product and public clouds such as Amazon AWS or Azure, even packaged software companies can quickly jump-start a trial program in weeks.  We used a cloud management product automation engine and AWS cloud to put together trials of several of our products in less than 4 weeks.  Since going live a few months ago, we have provisioned over 500 trial requests and thousands of AWS EC2 machines with 99.99% reliability.

Cloud management platforms such as BMC Cloud Lifecycle Management (CLM) have been used extensively in many enterprise use cases to manage private, public and dev/test clouds and to automate provisioning and operations, resulting in cost savings and agility.  One cool new use case we recently used CLM for is product trial automation.  Let me describe the basic flow:

[Figure: Try-and-buy trial provisioning flow]

  1. As soon as a trial user registers on the bmc.com trial page, a mail is sent to a mailbox.
  2. A provisioning machine running BMC Cloud Lifecycle Management retrieves the mail, interprets the request and starts the automation flow.  The automation uses the CLM SDK to make concurrent service request calls to provision machines for each trial user.
  3. Based on the service request in the mail, CLM starts provisioning the ‘product-1’, ‘product-2’, … trial machines on Amazon AWS cloud using service offerings and blueprints pre-defined in CLM that reference baked AMIs.  Each product trial has an associated service blueprint describing the environment – the collection of AMIs to provision, including target machines.  These service offering blueprints range from a simple single-machine application stack to a full-stack multi-machine application environment.  Once provisioning of the trial machines completes, CLM automatically executes scripts and post-deploy actions as BladeLogic (BSA) jobs to configure the trial machines and all the targets according to the blueprints.
  4. As soon as the machines are provisioned and configured on AWS cloud, a mail is sent back to the user indicating that the machines are ready, with URLs to log in and trial the product.
  5. At the end of the trial period, the trial machines are decommissioned automatically through BMC CLM.  This is extremely important to govern the sprawl of EC2 machines.

Cloud lifecycle management products such as CLM, with their advanced automation, orchestration, blueprinting and governance capabilities, can be used to solve interesting new automation use cases – such as product trials – that we hadn’t envisioned earlier.  With over 500 trials and thousands of machines successfully provisioned and decommissioned on Amazon cloud, we feel good that we are eating our own dogfood, creating and using our own solutions to manage product trials.

 

 

 

How “Everything as Code” changes everything?

Software is eating the world.  Infrastructure, security, compliance and operations once used to be handled by independent teams separate from application development.  These teams traditionally worked mostly in isolation, interacted only occasionally with application development teams, and relied on their own tools, slow-moving processes such as change boards, approvals, checklists and 100-page policy documents, and specialized tribal knowledge.  However, as cloud native applications have seen dramatic growth and acceptance in small and large enterprises, these 4 disciplines are being reimagined to ensure agility while still maintaining governance.  A new paradigm of “everything as code” is changing these 4 disciplines (infrastructure, security, compliance and operations) with principles such as “Infrastructure as code” and “Security as code”.  The core tenet is that infrastructure, security, compliance and operations are all described and treated exactly like application code.  In this two-part blog, we will explore these new “everything as code” principles, the benefits of representing everything as code, and the best practices and technologies to support this in development and operations.  In part I, we will focus on infrastructure as code, while in part II we will focus on security, compliance and operations as code practices.

Infrastructure as Code

The fundamental idea behind infrastructure as code is that you treat infrastructure as code just like application software and follow software development lifecycle practices.  Infrastructure includes anything that is needed to run your application: servers, operating systems, installed packages, application servers, firewalls, firewall rules, network paths, routers, NATs, configurations for these resources and so on.  

Following are some of the key best practices that we have learned after operating several production cloud applications on AWS cloud for the past year.

  1. Define and codify infrastructure: Infrastructure is codified in declarative specifications such as AWS CloudFormation (CFN) templates for AWS cloud, Azure Resource Manager templates for Azure cloud, Terraform templates from HashiCorp, Docker Compose files and Dockerfiles, and BMC CLM blueprints for both public cloud and on-prem datacenters.  Infrastructure can also be represented as procedural code such as scripts and cookbooks using configuration tools such as Chef or Puppet.  In either case, the key best practice is to represent infrastructure as code, declaratively or procedurally, with the same set of tools that you use for configuration management or cloud management.
  2. Source repo, peer review and test: Next, infrastructure as code is kept in source control such as Git; it is versioned, under change control, tracked, peer reviewed and tested just like application software.  This increases traceability and visibility into changes and enables a collaborative way of managing infrastructure through peer reviews.  For example, if operations wants to roll out a change to production infrastructure, ops does not make it through the console directly in production as traditionally done in IT datacenters.  Instead, ops creates a pull request on the ‘infra as code’ Git artifacts, peer reviews are conducted on these changes, and only then are they deployed to production.
  3. DevOps pipeline: Infrastructure as code (CFN templates, blueprints, etc.) goes through a DevOps pipeline just like application code and gets deployed to production.  The DevOps pipeline is a critical component of infrastructure as code since it provides delivery and governance of infrastructure changes, ensuring that changes are tested and deployed to production environments in a controlled manner.  For example, in AWS clouds, ops will make pull requests on CFN templates to change configuration parameters or AWS resources.
  4. Automation: Infrastructure automation provisions infrastructure from the codified infrastructure.  It is very easy to implement infrastructure-as-code automation for public or private clouds where the datacenter is an API, for example with AWS CFN, Azure templates or HashiCorp’s Terraform; a minimal provisioning sketch is shown after this list.  All modifications to infrastructure are first made to the code and then applied to the infrastructure in dev or prod.
  5. Immutability: Finally, infrastructure as code also supports server and container immutability.  Production engineers don’t make changes to servers or containers directly in production; instead, they go through the full DevOps pipeline to build new server or container images and then deploy them to production by replacing the running servers or containers.  Thanks to my colleague Brendan Farrell for pointing out this benefit.
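
As a minimal sketch of the automation point above (the stack name, template path and region are assumptions for illustration), the same CFN template that lives in the Git repository can be provisioned through the AWS SDK rather than through the console:

var AWS = require('aws-sdk');
var fs = require('fs');

var cfn = new AWS.CloudFormation({region: 'us-east-1'});                      // region is an assumption
var templateBody = fs.readFileSync('infra/app-stack.template.json', 'utf8');  // template from the repo

// Sketch: provision the stack defined by the codified template.
cfn.createStack({
  StackName: 'app1-prod',                 // placeholder stack name
  TemplateBody: templateBody,
  Capabilities: ['CAPABILITY_IAM']        // needed when the template creates IAM resources
}, function(err, data) {
  if (err) { console.error('Stack creation failed: ' + err.message); return; }
  console.log('Stack creation started: ' + data.StackId);
});

Subsequent changes would go through updateStack on the same template after the pull request is merged, so the console is never the source of truth.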

Example

With these best practices, infrastructure as code can be used successfully in both cloud native and on-prem application delivery.  As an example, the diagram below shows application and infrastructure pipelines.  Team-1’s pipeline defines application and infrastructure in a single file that is added to the repository and used to drive the pipeline and deploy the full application stack.  The full stack includes servers (infra), Tomcat (infra) and the application (app1.war).  Note that this can be deployed as a single baked AMI, a single baked VM template or a Docker container, depending on the technologies used by the team, and it supports immutability.  Team-2 only deploys its app2.war, while the infra is built and deployed by a separate infra team and includes a hardened OS for the higher level of security needed by app2.  This illustrates multiple ways of including infra as code in pipelines – as part of the microservice application (app1) or separately (app2) but combined for final deployment.

[Figure: Application and infrastructure pipelines with infrastructure as code]

Conclusion

Finally, what’s in it for developers and operations teams in making the move to “Infrastructure as code”?  Developers can specify infrastructure as part of their application code in a repository.  This keeps the full application stack – code, definition, testing and delivery – logically connected and gives developers agility and autonomy for full-stack provisioning and operations.  For operations, software development best practices are applied to infrastructure, which preserves agility while adding governance, reproducibility and repeatability to managing infrastructure.  Infrastructure changes are also less error-prone because they go through a DevOps pipeline with peer reviews and a collaborative process just like code, and immutable infrastructure is supported.  Finally, there is traceability and ease of answering questions such as: what is my current infrastructure, who made infrastructure changes in the past few days, and can I roll back the latest configuration change made to my infrastructure?

We strongly believe that using infrastructure as code principles to manage application delivery in DevOps can result in compelling advantages for both developers and operations.  In part II, we will cover security, compliance and operations as code principles.

What I learned from a 1-day hack building real-time streaming with AWS Lambda and Kinesis

It was a rainy Saturday, a perfect day for hacking a real-time streaming prototype on AWS.  I started thinking of Apache Kafka and Storm but then decided to give AWS Kinesis a try, as I didn’t want to spend my time creating infrastructure; my goals were minimal code, zero setup and zero configuration pain to get this running.  This blog describes how easy it was to build a real-time streaming platform on AWS for real-time applications like aggregation, counts, top-10 lists, etc.  Many key questions are also discussed:

  • Kafka vs. Kinesis – How do you decide?
  • KCL SDK vs. Lambda – Which is better to write streaming apps?
  • Streams, shards, partition keys and record structure – Design guidelines
  • Debugging
  • Automation of Kinesis in production

Sketch your conceptual streaming architecture

The first thing to think about is the type and volume of data you plan to stream in real time and the number of real-time applications that will consume this data.

The next question to ask is how you will ingest the data in real time from its producers.  Two common ways to do this are shown below: API Gateway + Lambda, and a native SDK streaming library (aws-sdk), the latter being the much faster way to push data.
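
For the API Gateway + Lambda path, a hypothetical ingest function might simply forward the request body to Kinesis; the stream name, the partition key strategy and the assumption that API Gateway passes the raw body through to the Lambda event are all mine, for illustration:

var AWS = require('aws-sdk');
var kinesis = new AWS.Kinesis();

// Sketch: Lambda function behind API Gateway that forwards incoming records to Kinesis.
exports.handler = function(event, context) {
  var record = {
    Data: typeof event.body === 'string' ? event.body : JSON.stringify(event),
    PartitionKey: 'source-' + Date.now(),   // placeholder partition key strategy
    StreamName: 'clickstream'               // placeholder stream name
  };
  kinesis.putRecord(record, function(err, data) {
    if (err) { context.fail(err); return; }
    context.succeed({sequenceNumber: data.SequenceNumber});
  });
};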

Moving on, there are Kinesis design issues such as how many real-time streams you need, as well as their ‘capacity’.  There are concepts such as shards and partition keys that need to be decided at this point, although they can be modified later as well.  For this prototype, I picked a few server endpoints with a single-shard Kinesis stream as shown below.  All the red lines in the diagram were implemented as part of this prototype.

[Figure: Conceptual real-time streaming architecture with Kinesis]

Once the initial conceptual architecture for ingestion is thought through, the next most important design decisions are about the applications that will consume the real-time stream.  As seen in this diagram, there are multiple ways in which apps can consume stream data – Lambda, the AWS Kinesis Client Library (KCL) and the AWS Kinesis API, as well as Apache Storm through connectors.  In my serverless philosophy of minimal infrastructure, I went ahead with Lambda functions since my application was simple.  However, for complex apps, the best enterprise option is to use KCL or even Storm to reduce dependency on, and lock-in to, AWS.  Also note that KCL effectively requires that you be a Java shop, as the tooling for other languages like JS and Ruby (wrappers around the MultiLangDaemon) is simply too complex and non-intuitive to use.

Big data lifecycle beyond real-time streaming

Big data has a lifecycle – real-time streaming and real-time applications alone are not enough.  Usually, in addition to the real-time stream, the data is also aggregated, archived and stored for further processing by downstream batch jobs.  This requires connectors from Kinesis to downstream systems such as S3, Redshift, Storm, Hadoop, EMR and others.  Check this paper out for further details.

Now that we have thought through the data architecture and the big picture, it is time to start building it.  We followed 3 steps to do this: create streams, build producers and build consumers.

Create Streams

Streams can be created manually from the AWS management console or programmatically, as shown by the Node.js code snippet below.

var AWS = require('aws-sdk');
var kinesis = new AWS.Kinesis({region: 'us-east-1'});   // region is an assumption

var params = {StreamName: streamName, ShardCount: 1};   // single shard for this prototype
kinesis.createStream(params, function(err, data) {
  if (err && err.code !== 'ResourceInUseException') {
    callback(err);   // fail unless the stream already exists
    return;
  }
  callback();
});

Note that streams can take several minutes to be created, so it is essential to wait before using them, as shown below, by periodically checking whether the stream is ACTIVE.

function waitForActive(streamName, callback) {
  kinesis.describeStream({StreamName: streamName}, function(err, data) {
    if (err) { callback(err); return; }
    if (data.StreamDescription.StreamStatus === 'ACTIVE') { callback(); }
    else {
      // not ready yet; poll again after a short delay
      setTimeout(function() { waitForActive(streamName, callback); }, 5000);
    }
  });
}

Helloworld Producer

This step can be done by invoking the Kinesis SDK call kinesis.putRecord.  Note that the partition key, the stream name (created in the previous step) and the data are passed to this method as a JSON structure.

var recordParams = {
  Data: JSON.stringify(data),    // payload must be a string or Buffer, e.g. a stringified JSON record
  PartitionKey: partitionKey,    // determines which shard receives the record
  StreamName: streamName         // the stream created in the previous step
};
kinesis.putRecord(recordParams, function(err, rdata) {
  if (err) { console.error(err); return; }
  // rdata.ShardId and rdata.SequenceNumber identify where the record landed
});

This should start sending data to the stream we just created.

Helloworld Lambda Consumer

Using the management console, a Lambda function can easily be written that uses the stream as an event source and gets invoked to do business analytics on a batch of records in real time.

exports.handler = function(event, context) {
  event.Records.forEach(function(record) {
    // Kinesis delivers each payload base64-encoded
    var payload = new Buffer(record.kinesis.data, 'base64').toString('ascii');
    // do your processing here
  });
  context.succeed("Successfully processed " + event.Records.length + " records.");
};

In less than a day, we had created a fully operational real-time streaming platform that can scale and run applications in real time.  Now, let us take a look at some key findings and takeaways.

Learnings

Kafka vs. Kinesis – Kafka requires a lot of management and operations effort to keep a cluster running at scale across datacenters: mirroring, keeping it secure and fault tolerant, and monitoring disk space allocation.  Kinesis can be thought of as “Kafka as a service”, where operations and management costs are almost zero.  Kafka, on the other hand, has more features, such as ‘topics’, which Kinesis doesn’t provide, but this should not be a deciding factor until all the factors are considered.  If your company has made a strategic decision to run on AWS, the obvious choice is Kinesis, as it has the advantage of ecosystem integration with Lambda, AWS data sources and AWS hosted databases.  If you are mostly on-prem, or require a solution that needs to run both on-prem and as SaaS, then you might not have a choice but to invest in Kafka.

Stream Processing Apps – Which is better? KCL vs. Lambda vs. API – Lambda is best used in simple use cases and is a serverless approach that requires no EC2 instances or complex management.  KCL, on the other hand, requires a lot more operational management but provides a more extensive library for handling complex stream applications: worker processes, integration with other AWS services like DynamoDB, fault tolerance and load balancing of streaming data.  Go with KCL unless your processing is simple or requires only standard integrations with other AWS services, in which case Lambda is the better fit.  Storm is another approach to consider for reducing lock-in, but I didn’t investigate it enough to know the challenges in using it.  Storm spouts are on my plate to evaluate next.

Streams – How many? – It is always difficult to decide how many streams you need.  If you mix all the different types of data in one stream, you not only run into per-stream scalability limits, but your stream processing apps also have to read everything and then filter out the records they want to process.  Kinesis doesn’t have server-side filtering, so every application must read everything.  If you have independent data records requiring independent processing, keep the streams separate.

Shards – How many? – The key decision points here are the record throughput and the amount of read concurrency needed during processing.  The AWS management console provides a recommendation for the number of shards needed.  Of course, you don’t need to get this right when creating the stream, as resharding is possible later on.
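
Resharding can also be scripted.  The hedged sketch below splits one shard into two using the SDK’s splitShard call; the stream name and shard id are placeholders, and choosing the midpoint of the shard’s hash key range is just the simplest strategy:

var AWS = require('aws-sdk');
var kinesis = new AWS.Kinesis({region: 'us-east-1'});   // region is an assumption

// Sketch: split one shard into two to increase stream capacity.
kinesis.splitShard({
  StreamName: 'clickstream',                 // placeholder stream name
  ShardToSplit: 'shardId-000000000000',      // placeholder shard id, taken from describeStream
  NewStartingHashKey: '170141183460469231731687303715884105728'   // midpoint of the 128-bit hash key range
}, function(err, data) {
  if (err) { console.error(err); return; }
  console.log('Resharding started; the stream returns to ACTIVE once the split completes.');
});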

Producer – small or big chunks? – It is better to minimize the number of network calls that push data through the SDK into AWS, so some aggregation and chunking is needed at the client endpoints.  This needs to be traded off against the staleness of the data; holding on to data and aggregating it for a few seconds should be acceptable in most cases.
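
One hedged way to implement this client-side aggregation (the buffer size and flush interval below are arbitrary choices for illustration) is to buffer records and push them with the batch putRecords call:

var AWS = require('aws-sdk');
var kinesis = new AWS.Kinesis({region: 'us-east-1'});   // region is an assumption

var buffer = [];

// Sketch: buffer records on the client and push them to Kinesis in batches.
function send(record, partitionKey) {
  buffer.push({Data: JSON.stringify(record), PartitionKey: partitionKey});
  if (buffer.length >= 100) { flush(); }      // flush on size (well under the 500-record API limit)...
}

function flush() {
  if (buffer.length === 0) { return; }
  var batch = buffer.splice(0, buffer.length);
  kinesis.putRecords({Records: batch, StreamName: 'clickstream'}, function(err, data) {
    if (err) { console.error(err); return; }
    // data.FailedRecordCount tells us how many records need to be retried
    console.log('Sent ' + batch.length + ' records, failed: ' + data.FailedRecordCount);
  });
}

setInterval(flush, 5000);                     // ...and every few seconds to bound staleness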

Debugging – still tough – It was a bit of a challenge to debug Kinesis and Lambda, as logs and CloudWatch metrics showed up with delays.  Better tooling is needed here.

Kinesis DevOps Automation – For production operations, automation of Kinesis will be needed.  The prominent capabilities needed here are per-region stream creation, a single pane of glass for multiple streams, and stream scaling automation.  Scaling streams is still manual and error-prone, but that is a good problem to have when the amount of ingested data is growing due to a larger number of customers.

Final Word

In less than a day, I had a real-time stream running where I could push large volumes of data records from my laptop or a REST endpoint through AWS API Gateway into a Kinesis stream, with 2 Lambda applications reading this data in real time and doing simple processing.  AWS Kinesis together with Lambda provides a compelling real-time data processing platform.