From 5 to 85 – Just 5 security practices can make you 85% secure

Most security attacks are simple-minded.  Reading through the news of recent security breaches, most of them could have been avoided: they can be traced back to simple misconfigurations, such as an S3 bucket left open to public access, or a lack of basic security hygiene, such as out-of-date security patching.  And yet so many of us don't seem to know or follow basic security hygiene.  The security market is confusing, with hundreds of security vendors offering products and dozens of compliance frameworks to choose from, such as NIST, CIS, HIPAA, and FedRAMP.  It is often unclear where to begin and what to focus on first.  Enterprises may be busy defending against complex scenarios such as sophisticated threat vectors and advanced persistent threats while unwittingly missing the common-sense first steps.  According to several industry studies, implementing just the top 5 Center for Internet Security (CIS) controls can reduce cyber risk by 85%.  Let us break this down further and see how we can get to 85% safety with just 5 steps.

Top 5 CIS Controls – What are they?

The Center for Internet Security has created 20 controls and prioritized the top 5, which, if implemented correctly, will reduce security risk by 85%.  These controls are pragmatic and prescriptive, and can be easily automated through security automation tools such as those listed below.

Table 1. Center for Internet Security Top 5 Security Controls

| The Top 5 CIS Controls | Security Automation Tools and Practices |
| --- | --- |
| 1. Inventory of Authorized and Unauthorized Devices | BMC Discovery; SecOps Response |
| 2. Inventory of Authorized and Unauthorized Software | BMC Discovery; BladeLogic Server Automation; BladeLogic Network Automation |
| 3. Secure Configurations for Hardware and Software (see also CIS 9 and CIS 11 for securing configurations of network infrastructure) | Assessment and remediation in cloud: SecOps Policy; Assessment and remediation in datacenter: BladeLogic Server and Network Automation |
| 4. Continuous Vulnerability Assessment and Remediation | Scanning: Qualys, Nessus, Rapid7; Vulnerability management: SecOps Response; Patching: BladeLogic, SCCM |
| 5. Controlled Use of Administrative Privileges | SecOps Policy |


The First Two

The first 2 CIS controls focus on the inventory of authorized and unauthorized devices and software.  We have all heard the management tenet "You cannot manage what you don't know."  In security, "You cannot secure what you don't know" is equally applicable.  Complete visibility into all devices (computers, servers, network devices, databases) as well as all software and applications is the first critical step for any enterprise.  Without this complete inventory, the security scanning, patching, compliance, and operations teams will not know about the missing devices or applications, and will fail to scan or patch them.  This lack of visibility can lead to security threats.  Enterprises should consider a) an automated discovery tool that discovers not just infrastructure, devices, and machines but also applications and software, and b) correlation tools that can correlate and map device data from multiple enterprise tools (such as scanning, patching, compliance, and discovery tools) and alert you on the 'blindspots'.  These blindspots represent potential security risks, as they remain invisible and hence go unscanned or unpatched.
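The blindspot correlation described above boils down to set arithmetic over the inventories that each tool exports. Here is a minimal sketch, assuming each tool can dump its asset list; the hostnames and function name are illustrative, not any particular product's API:

```python
# Correlate inventories from discovery, scanning, and patching tools to
# surface "blindspots": assets the discovery tool knows about but that the
# scanning or patching tools have never seen.
def find_blindspots(discovered, scanned, patched):
    """Return assets that were discovered but never scanned or patched."""
    return {
        "unscanned": discovered - scanned,   # invisible to the scanner
        "unpatched": discovered - patched,   # invisible to patch management
    }

discovered = {"web-01", "web-02", "db-01", "legacy-01"}
scanned = {"web-01", "web-02", "db-01"}
patched = {"web-01", "db-01"}

blindspots = find_blindspots(discovered, scanned, patched)
print(sorted(blindspots["unscanned"]))  # ['legacy-01']
print(sorted(blindspots["unpatched"]))  # ['legacy-01', 'web-02']
```

In practice the hard part is normalizing identifiers (hostname vs. IP vs. cloud instance ID) before the sets can be compared; the correlation tools mentioned above exist largely to solve that mapping problem.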

The Middle

The CIS control #3 deals with 'Secure Configurations for Hardware and Software'.  Security misconfigurations are both very common and very preventable.  A port on a machine or firewall left open, an operating system that is not hardened, an S3 bucket or files left open to public access, a MongoDB or Elasticsearch database left publicly accessible, default passwords that are never changed, or a misconfigured firewall letting in unwanted traffic: these common misconfigurations have led to highly publicized breaches in both cloud and datacenter.

While CIS control #3 focuses on hardware and software, two related controls extend secure configurations to network devices and are important considerations.  CIS control #11, 'Secure Configurations of Network Devices', is similar to #3, and CIS control #9 is 'Limitation and Control of Network Ports, Protocols and Services'.  Together, these two secure network infrastructure such as the open firewalls and ports we discussed earlier.

Enterprises should consider a) tools to assess, harden, and remediate operating systems to create "golden" images and Docker containers based on standards such as CIS, DISA, and PCI, and b) tools that can assess, harden, and remediate cloud resources, covering everything from hardware, software, servers, and storage to network.  As companies move to the public cloud, it is even more important that cloud resources are hardened just like servers are in datacenters.  Cloud resources include S3 buckets, load balancers, and security groups, whose configurations can be secured using standards such as the CIS AWS Foundations Benchmark.  Once the assessment is completed, a well-established process must exist to remediate violations across the full stack and multiple clouds.
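To make the assess-then-remediate loop concrete, here is a sketch of a CIS-style assessment: each rule is a predicate over a resource description, and failed rules become remediation candidates. The rule names and resource fields are illustrative, not the benchmark's actual wording:

```python
# Minimal CIS-style assessment: each rule inspects a resource description
# and passes or fails it; failures are candidates for remediation.
RULES = [
    # S3 buckets must not grant public read access.
    ("s3-no-public-read",
     lambda r: r.get("type") != "s3_bucket" or r.get("acl") != "public-read"),
    # Security groups must not allow SSH from anywhere.
    ("sg-no-open-ssh",
     lambda r: r.get("type") != "security_group"
               or "0.0.0.0/0:22" not in r.get("ingress", [])),
]

def assess(resources):
    """Return a list of (resource_id, rule_id) violations."""
    return [(r["id"], rule_id)
            for r in resources
            for rule_id, check in RULES
            if not check(r)]

resources = [
    {"id": "logs-bucket", "type": "s3_bucket", "acl": "public-read"},
    {"id": "web-sg", "type": "security_group", "ingress": ["10.0.0.0/8:443"]},
]
print(assess(resources))  # [('logs-bucket', 's3-no-public-read')]
```

Real assessment tools evaluate hundreds of such rules continuously against live resource state and feed the failures into a remediation workflow rather than a print statement.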

The CIS control #4 refers to "Continuous Vulnerability Assessment and Remediation".  Vulnerabilities in servers are one of the surest ways of getting attacked: over 80,000 vulnerabilities (CVEs) are catalogued in the NIST National Vulnerability Database, and a dozen or more are added each week.  What is even more interesting is that 99% of exploited vulnerabilities were compromised more than a year after the CVE was published.  The window of opportunity for attackers should be minimized by fixing critical, high, and highly exploitable vulnerabilities as soon as fixes are available, not months or years later.  Enterprises should consider a) automated scanning tools that can scan for vulnerabilities, b) vulnerability lifecycle management tools that can prioritize and plan remediation SLAs to keep the window of opportunity small, and c) patching tools to execute remediations on servers and network devices across clouds and datacenters.
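The prioritization step can be sketched as a filter-and-sort over scan results: keep what is critical or known-exploitable, then order by how long the window of opportunity has been open. The CVE records, dates, and threshold below are illustrative:

```python
from datetime import date

# Prioritize vulnerabilities: high-severity or exploitable first, ordered by
# how many days the window of opportunity has been open.
def prioritize(vulns, today, min_cvss=7.0):
    candidates = [v for v in vulns if v["cvss"] >= min_cvss or v["exploitable"]]
    return sorted(candidates,
                  key=lambda v: (today - v["published"]).days,
                  reverse=True)

vulns = [
    {"cve": "CVE-2017-0144", "cvss": 8.1, "exploitable": True,
     "published": date(2017, 3, 16)},
    {"cve": "CVE-2017-5638", "cvss": 10.0, "exploitable": True,
     "published": date(2017, 3, 10)},
    {"cve": "CVE-2017-9999", "cvss": 4.3, "exploitable": False,
     "published": date(2017, 5, 1)},  # low severity: filtered out
]
queue = prioritize(vulns, today=date(2017, 6, 1))
print([v["cve"] for v in queue])  # ['CVE-2017-5638', 'CVE-2017-0144']
```

A vulnerability lifecycle tool does essentially this at scale, then attaches a remediation SLA clock to each item in the queue.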

Last but not least

The final control in the top 5, #5, deals with "Controlled Use of Administrative Privileges": the processes and tools used to track, control, prevent, and correct the use, assignment, and configuration of administrative privileges on computers, networks, and applications.  Elevated access privileges, improperly configured identity and access management (IAM) permissions, weak administrative passwords without complex password policies, no password rotation, no MFA, and more can all lead to breaches and attacks.  Enterprises should consider a) assessment and hardening tools to validate that all systems, from operating systems and servers to applications and cloud, follow best practices for administrative privilege management, such as assessing password policies for account logins, and b) continuous monitoring tools such as a SIEM to track activity related to administrative privilege usage and alert on suspicious activity.  The CIS benchmarks for AWS and for servers include a number of rules in this area that most assessment and hardening tools, such as SecOps Policy and BladeLogic, can easily address.
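A password-policy assessment of this kind reduces to a handful of predicate checks against an account-policy description. The thresholds below follow common hardening guidance but are illustrative, as are the field names:

```python
# Audit an account-policy description against common admin-privilege
# hardening guidance: password length, rotation, and MFA enforcement.
def audit_admin_policy(policy):
    findings = []
    if policy.get("min_password_length", 0) < 14:
        findings.append("password minimum length below 14")
    if policy.get("max_password_age_days", 9999) > 90:
        findings.append("passwords not rotated within 90 days")
    if not policy.get("mfa_required", False):
        findings.append("MFA not enforced for administrators")
    return findings

policy = {"min_password_length": 8,
          "max_password_age_days": 365,
          "mfa_required": False}
for finding in audit_admin_policy(policy):
    print(finding)
```

Run continuously, checks like these turn the "controlled use of administrative privileges" control from a written policy into something enforced and reported on daily.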

What next?

The CIS Critical Security Controls are a prioritized and actionable set of cybersecurity best practices to prevent the most dangerous and common cyber attacks.  Implementing just the top 5 can reduce your risk by 85%.  As an example, a cloud assessment and remediation tool such as SecOps Policy could easily have detected the three high-profile breaches related to S3, and the recent WannaCry attack could have been thwarted by proactively remediating vulnerabilities using SecOps Response.  There are a number of security automation tools in the marketplace to help you reach 85% security; leverage them first.  After you implement the top 5, move on to the remaining 15 controls and then to more comprehensive frameworks such as NIST and ISO, which add another 100-300 controls to further tighten the security of cloud apps and datacenters.



Three Cloud Data Security best practices against Ransomware

Ransomware is becoming a global menace: last week's WannaCry attack, as well as the ransomware attacks earlier this year on MongoDB and Elasticsearch clusters, have become common headlines.  Hundreds of thousands of servers and databases were hacked in 2017 as a result of ransomware.  The immediate response to WannaCry is to patch Windows servers to remediate a vulnerability that Microsoft fixed two months ago.  But are there better ways to address data security more proactively?  Why were thousands of servers unpatched when Microsoft had released the patch two months earlier?  Below, we describe a data security process and three key best practices for protecting data against ransomware.


Data Security Process

Data security process has four key stages as shown below: discover and classify, prevent, detect, and respond.


The first step is to discover data across the enterprise, such as databases and datastores, and classify it into levels, for example sensitive or personal data that requires strong security.

The second step is to apply prevention techniques to secure your data: proactive policies-as-code to reduce the blast radius of an attack, layers of protection with defense-in-depth, continuous backups, and continuous auditing of the security of each datastore as well as of all compute and network resources along potential attack paths with access to those datastores.  A policy management process and tool for automating policy checks for data security, compliance, and backups ensures that continuous automated audits happen, with auto-remediation.  Another prevention technique is a vulnerability and patch management process with remediation-time SLAs for critical patches.

Finally, you need to continually monitor for and detect data breaches and data security issues, and respond to mitigate them.  A vulnerability management process and server management tools enable quick identification of the vulnerable servers that put the business at risk, and patching to remediate them.

Let us next break down three practices for securing your data.

I. Prevention – Policies to continuously audit

Ransomware and other data loss can be prevented with defensive techniques that proactively check configurations in systems, networks, servers, and databases.  Four key policies need to be defined and enforced through a policy management process and tool.  These policies form the defense-in-depth layers that protect the data from hackers and ransomware attacks.


  1. Compliance policy – There are a number of database CIS configuration checks that must be followed to ensure that all database configurations are secured.  The default settings of many databases, such as MongoDB and Elasticsearch, leave them wide open to the internet.  These security and compliance checks need to be continuously evaluated and corrected as one of the first measures of prevention.
  2. Data security policy – Data can be secured by making encryption at rest and customer-provided encryption keys standard practice for all databases.  Default credentials should be changed, and appropriate backups must be taken and verified.  All of these checks can easily be defined as policies and continuously evaluated for misconfigurations.
  3. Network security policy – Databases, cloud servers, and networks along access paths to databases must be secured through appropriate firewall or security group routing rules, and by hosting the databases in isolated VPCs and subnets.  Servers allowed to reach the databases should also be whitelisted to limit network access.  These policies can be continuously evaluated and enforced as enterprise policies.
  4. Server compliance policy – Databases run on servers, which need hardened OS images and a vulnerability and patch management process.  The next two best practices describe that process and its tools.  All servers in the enterprise should follow these practices, since lateral movement can compromise database servers if any other server in the cloud becomes insecure.
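The four policy layers above can be expressed as policy-as-code checks over a datastore description. This is a minimal sketch; the field names are simplified stand-ins for real database and server settings:

```python
# Defense-in-depth policy checks for a datastore, one or two per layer
# described above: compliance, data security, network, server compliance.
POLICIES = {
    "compliance: authentication enabled":
        lambda db: db["auth_enabled"],
    "data security: encryption at rest":
        lambda db: db["encrypted_at_rest"],
    "data security: recent verified backup":
        lambda db: db["backup_age_days"] <= 1,
    "network security: not bound to all interfaces":
        lambda db: db["bind_ip"] != "0.0.0.0",
    "server compliance: OS patched within SLA":
        lambda db: db["days_since_patch"] <= 30,
}

def evaluate(db):
    """Return the names of the policies this datastore violates."""
    return [name for name, check in POLICIES.items() if not check(db)]

mongo = {
    "auth_enabled": False,        # compliance violation
    "encrypted_at_rest": True,
    "backup_age_days": 0,
    "bind_ip": "0.0.0.0",         # network violation
    "days_since_patch": 75,       # server compliance violation
}
for violation in evaluate(mongo):
    print(violation)
```

Each failed check maps directly to a remediation action (enable auth, rebind to a private interface, patch the host), which is what makes the policy layer enforceable rather than advisory.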

II. Prevention – Patch SLA monitoring and continuous patching

Enterprises should define a vulnerability and patch management process with an objective for "time to remediate critical security issues", the Remediation Time Objective or RTO (not to be confused with a backup RTO).  Enterprises should have RTO SLA policies in place that specify the number of days within which all critical vulnerabilities, such as those with CVSS severity scores of 9 or 10, will be remediated.  An RTO SLA of 15 to 30 days is quite common for patching critical security vulnerabilities.  This SLA needs to be continuously monitored, and any violations must be reported and corrected.  In the recent WannaCry attack, thousands of computers remained unpatched more than 60 days after a patch for the vulnerability was released.  This could have been avoided with continuous monitoring and remediation of the RTO SLA.

Many enterprises can go one step further by not just monitoring and actively managing the RTO SLA but also automating detection and patching.  As soon as critical vulnerabilities are identified through periodic scans and patches are available from vendors, the management tools should automatically update their patch catalogs and patch servers and network devices in a zero-touch approach:

  1. Continuously scan environment for detecting vulnerabilities
  2. Select critical vulnerabilities for automated patching
  3. Continuously look for patches from vendors such as Microsoft, download critical patches for vulnerabilities and keep patch catalog contents updated automatically
  4. Automatically apply patches for critical vulnerabilities when they are discovered based on policies for RTO.
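The SLA at the heart of steps 1-4 can be monitored with a few lines of code: flag every critical vulnerability still open past the RTO window. The 30-day RTO and the vulnerability records below are illustrative:

```python
from datetime import date

RTO_SLA_DAYS = 30  # remediate critical vulnerabilities within 30 days

def sla_violations(vulns, today):
    """Flag critical (CVSS >= 9.0) vulnerabilities still open past the RTO."""
    return [
        v["cve"] for v in vulns
        if v["cvss"] >= 9.0
        and not v["patched"]
        and (today - v["detected"]).days > RTO_SLA_DAYS
    ]

vulns = [
    # Still open 58 days after detection: an SLA violation.
    {"cve": "CVE-2017-0144", "cvss": 9.3, "patched": False,
     "detected": date(2017, 3, 15)},
    # Patched in time: no violation.
    {"cve": "CVE-2017-0001", "cvss": 9.8, "patched": True,
     "detected": date(2017, 3, 1)},
]
print(sla_violations(vulns, today=date(2017, 5, 12)))  # ['CVE-2017-0144']
```

In a zero-touch setup, each flagged CVE would trigger the patch workflow automatically instead of merely appearing on a report.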

With these SLA monitoring and patching controls in place, enterprises can achieve a high degree of data security through proactive prevention.

III. Response – Vulnerability and Patching

Even with the preventive controls discussed in I and II, there is still a need to detect and respond, since not all security attacks can be prevented.  Once ransomware or another data exfiltration or security threat has been identified, it is important to be able to identify the vulnerable servers and patch them as soon as possible.  A reactive vulnerability and patch management system must be able to select a specific CVE, assess the servers that require patching, and, with a few clicks, apply the patches and configuration changes that remediate that critical CVE.


Data security starts with an enterprise data security process consisting of discovery and classification, prevention, detection, and response stages.  The first best practice is prevention: datastores such as MongoDB and Elasticsearch, as well as servers and networks, are continuously audited for security and compliance through policies.  A policy management tool is a critical enabler for these audits and preventive checks; such tools "detect" and "harden" all the places and paths along which data is stored, moved, and accessed, achieving defense-in-depth.  The second best practice is to define a Remediation Time Objective SLA and implement a vulnerability lifecycle management process.  A vulnerability management tool continuously scans for vulnerabilities, gives visibility into critical vulnerabilities with SLA violations, and automatically keeps the environment patched with zero touch.  Many enterprises that followed a 30-day RTO SLA were not impacted by WannaCry because they had patched their systems in March, soon after the patch was released.  The third best practice is the ability to assess a vulnerability and remediate it during emergencies or as part of security incident response, as with the WannaCry outbreak.  Together, these three proactive and reactive practices and tools can keep data secure and help enterprises avoid costly, reputation-damaging ransomware attacks.

BMC Software has three products, BladeLogic Server Automation and the SecOps Response and SecOps Policy cloud services, that can keep your applications, servers, networks, and data safe from ransomware attacks.  WannaCry was a non-event for our customers, who were already proactively implementing vulnerability and patch management processes through these tools.

Full disclosure: I work at BMC Software.

Policies Rule Cloud and Datacenter Operations – Cloud 2.0

Trust but verify – A new way to think about Cloud Management

Cloud management platforms (CMPs) are a popular way to manage cloud servers and applications and have been widely adopted by small and large enterprises. For datacenter (DC) management, spanning the decades before, there has been a sprawl of systems management tools. The common wisdom in both models is to control access to the cloud at the gates, by CMPs or DC tools, just as historic forts were protected with moats and controlled gates. However, with the increasing focus on agility and delivering business value to customers faster, developers and application release teams require much greater flexibility in working with the cloud than previously imagined. Developers want full control and flexibility over the tools and APIs they use to interact with the cloud, instead of being stopped at the gates and prescribed a single uniform entrance. Application owners want to allow this freedom but still want cloud workloads to be managed, compliant, secure, and optimized.

This freedom, and the business driver of agility, is creating a new way to imagine cloud 2.0: one that does not stop you at the gates but lets you in while continuously checking policies to ensure that you behave well in the cloud. The ability to create and apply policies will play a key role in this emerging model of governance, where freedom is tied to responsibility. We believe this next-generation cloud operational plane will drive how workloads are deployed, operated, changed, secured, and monitored in clouds. Enterprises should embrace policies at all stages of the software development lifecycle and operations, for datacenters in the cloud and on-prem. Creating, defining, and evaluating policies, and taking corrective actions based on them, will be a strategic enabler for all enterprises in the new cloud 2.0 world.

Defining Cloud Operational Plane

In this new cloud management world, you are not stopped at gates but checked continuously. Trust but verify is the new principle of governance in Cloud 2.0.  Now, let us review the 5 key areas for a cloud operational plane and how policies will play a critical role in governance.

  • Provisioning and deployment of cloud workload
    • Are my developers or app teams provisioning the right instance types?
    • Is each app team staying within its allocated quota of cloud resources?
    • Is the workload or configuration change being deployed secure and compliant?
    • How many pushes are going on per hour, daily and weekly?
    • Are any failing and why?
  • Configuration changes
    • Is this change approved?
    • Is it secure and compliant?
    • Show me all the changes happening in my cloud.
    • Can I audit these changes to know who did what when?
    • How can I apply changes to my cloud configurations, resources, upgrade to new machine images etc.?
  • Security and compliance
    • Continuously verify that my cloud is secure and compliant
    • Alert me on security or compliance events instantly or daily/weekly
    • Remediate these automatically or with my approval
  • Optimization
    • Are my resources being used optimally? Do they have the right capacity? Do I have scaling where I need it?
    • Showback of my resources
    • Tell me where I am wasting resources.
    • Tell me how I can cut down costs and waste.
  • Monitoring, state and health
    • Is my cloud workload healthy?
    • Tell me the key monitoring events and the unhealthy events.
    • Remediate these automatically or with my approval

How can the Cloud Operational Plane be enabled through policies?

The following table compares cloud management in the old and new worlds.  In the old world of cloud management platforms (CMPs), we block without trust.  In the new world, since the gates are open, it becomes necessary to manage the cloud through policies as the central tenet of cloud operations.  This is the cloud operational plane (COP).

| | CMP – Block without Trust | COP – Trust but Verify | Recommended Practices |
| --- | --- | --- | --- |
| Deployment to multi-cloud | Single API across all clouds, forced to use it; catalog-driven provisioning | Various tools, no single point of control, no single API, no single tool; use the best API/tool for each cloud; no catalog | DevOps – your choice |
| Manage/start/stop your resources | Single tool | Various tools, no single point of control | DevOps/cloud tool – your choice |
| DevOps continuous deployment | Hard to integrate; the CMP API is a hindrance to adoption | Embraces this flexibility; allow changes through any toolset | Policies for DevOps process compliance |
| Config changes | | | |
| Unapproved config changes | Block if not approved | Usually allow, or block if more control is desired | Change policies |
| Config changes API | Single API | No single API | DevOps tool |
| Audit config changes | Yes | Yes | Audit – capture all changes |
| Rollback changes | No | Yes, advanced tools for blue-green, canary, etc. | DevOps tool |
| Change monitoring | No | Yes | Change monitoring |
| Change security | No | Yes | Policy for change compliance/security |
| Security & compliance | | | |
| Security in DevOps process | N/A | Yes | Policy for DevOps security |
| Monitor, scan for issues, get notified | N/A | Continuously monitor for compliance & security | Multi-tool integrations |
| Prioritize issues | N/A | Yes, multiple manual and automated prioritizations | Policy-based prioritization |
| Security and compliance of middleware and databases | N/A | Yes | Compliance and security policies for middleware and databases |
| Quota & decommissioning | Block deployment if out of quota; decommission on lease expiry | Allow but notify, or remove later with resource usage policies; decommission on lease expiry | Policies for quota and decommissioning |
| Optimization | N/A | Yes | Policies for optimization and control |

Policies in Enterprises

As enterprises move into a world of freedom and agility with Cloud and DevOps, it becomes increasingly important to use policies to manage cloud operations.  An illustrative diagram below shows how policies can be used to manage everything from DevOps process, on-prem and cloud environments, production environments, cloud infrastructure, applications, servers, middleware and databases.

Policies rule the world

For agile DevOps, policy checks can be embedded early, or wherever needed, in the process to catch compliance, security, cost, and other shift-left violations in source code and libraries. For example, consider a DevOps process starting with a continuous integration (CI) tool such as Jenkins®. Developers and release managers can trigger OWASP (Open Web Application Security Project) checks to scan source code libraries and block the pipeline if any insecure libraries are found.

Production environments have applications consisting of servers, middleware, databases and networks hosted in clouds such as AWS and Azure.  All these need to be governed by policies as shown above.  For example, RHEL servers in cloud are governed by 4 policies – cost control, patch policy, compliance policy and a vulnerability remediation policy.  Similarly there are security, compliance, scale and cost policies for other cloud resources such as databases and middleware.  Finally, the production environment itself is governed by change, access control and DR policies.

All these policies in the modern cloud 2.0 will be encoded as code.  A sample policy as code can be written in a language such as JSON or YAML:

  • If s3 bucket is open to public, then it is non-compliant.
  • If a firewall security group is open to public, then it is non-compliant.
  • If environment is DEV and instance type is m4.xlarge, then the environment is non-compliant.

Using policy-as-code ensures that these policies are created, evaluated, managed, and updated in a central place, all through APIs.  Additionally, enterprises can choose to remediate resources and processes on violation of certain policies to ensure that cost, security, compliance, and changes are governed.
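The three sample policies above can be sketched as data-driven rules evaluated through a single function, in the spirit of policy-as-code. The rule format here is illustrative, not any particular product's schema:

```python
# Data-driven policy-as-code: each policy declares which resources it matches
# and a compliance condition; a matching resource must satisfy the condition.
POLICIES = [
    {"name": "s3-not-public",
     "match": {"type": "s3_bucket"},
     "comply": lambda r: not r["public"]},
    {"name": "sg-not-open",
     "match": {"type": "security_group"},
     "comply": lambda r: "0.0.0.0/0" not in r["ingress"]},
    {"name": "dev-no-large-instances",
     "match": {"type": "instance", "env": "DEV"},
     "comply": lambda r: r["instance_type"] != "m4.xlarge"},
]

def evaluate(resource):
    """Return the names of the policies this resource violates."""
    return [
        p["name"] for p in POLICIES
        if all(resource.get(k) == v for k, v in p["match"].items())
        and not p["comply"](resource)
    ]

dev_box = {"type": "instance", "env": "DEV", "instance_type": "m4.xlarge"}
print(evaluate(dev_box))  # ['dev-no-large-instances']
```

Because the policies are plain data plus small predicates, they can live in version control, be managed through an API, and have a remediation action attached to each rule, which is exactly the central-management property described above.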


Cloud management is changing from a "block on entry" to a "trust but verify" model.  Some enterprises that wish to govern with absolute control at the gates will continue to use cloud management platforms extensively and effectively.  However, many enterprises are beginning to move to a new cloud 2.0 model where the agility and flexibility of DevOps tools and processes are critical to their success.  Instead of prescribing a single entry choke point or a single "CMP tool" for working with the cloud, we let everybody in with their own tools and processes, but continuously verify that policies for deployment, resource usage, quota, cost, security, compliance, and changes are tracked, monitored, and corrected.  Simple, effective, API-based policy-as-code definition, management, evaluation, and remediation will be a central capability that enterprises need to run the new clouds effectively.

Full disclosure: I work for BMC Software, and my team has built a cloud native policy SaaS service.

Acknowledgement: My colleagues JT and Daniel proposed the fascinating analogy between forts and the cloud operational plane.  It motivated me to write this blog about how cloud management is evolving from a guard-at-the-gates model to trust but verify.





Secret Sauce for Building Cloud Native Apps

A few months back I wrote a blog comparing mode 2 vs. mode 1 application development.  Gartner defines bimodal IT as an organizational model that segments IT services into two categories based on application requirements, maturity, and criticality: "Mode 1 is traditional, emphasizing scalability, efficiency, safety and accuracy. Mode 2 is non-sequential, emphasizing agility and speed."  In the last six months, we built a mode 2 cloud native application on the AWS cloud, and this got us thinking about two questions.  What were the unique enablers that led to a successful mode 2 cloud app?  And is Gartner right in keeping the two modes separate, or is there something mode 1 can change and learn from mode 2?  Let us analyze the first question here and discuss key best practices in building cloud native mode 2 applications.

Focus – do one thing – know your MVP – Mode 2 cloud native applications require a maniacal focus on the customer problem, the market requirements, and delivering value to the customer through one or two use cases.  We spent very little time in needless meetings deciding what to build or how to build it.  I have seen many projects take 2-6 months just to define the product; we knew exactly what to build, with uncompromising conviction.  Startups usually have this solid vision and focus on doing one thing right, usually called the minimum viable product (MVP).  We acted like a startup charged with solving a customer problem.

Leverage cloud native higher-level services – We decided to go "all in" with the AWS cloud, using disruptive higher-level PaaS services such as Kinesis, Lambda, API Gateway, and NoSQL databases.  We also went serverless, with AWS Lambda as the key architectural microservice pattern, using servers only when we absolutely needed them.  No more discussions about the "platform", lock-in, portability across multiple clouds or datacenters, "on-prem" versus "SaaS", or other utopian goals: we wanted to get to market fast, and AWS managed higher-level services were our best bet.  We avoided, at all costs, endless comparisons of which open source to use, along with the discussions, debates, and paralysis by analysis.  We also worked quietly, avoiding cross-organizational discussions that could slow us down, operating almost as a skunkworks: a fully autonomous team charged with making our own decisions and destiny.  After six months of using AWS, we continue to be delighted and amazed by the power of the platform.  The undifferentiated heavy lifting is all done by AWS: infrastructure automation such as routing, message buses, load balancers, server monitoring, running clusters and servers, patching them, and maintaining them.  These are not our key business strengths.  We want to focus on the business problems that matter and on building "apps", not managing "infra", which is best left to AWS.


Practice everything "as code" – From the kick-off, we followed "as-code" concepts to define all of our configuration, infrastructure, security, compliance, and even operations.  Each of these aspects is declaratively specified as code and version controlled in a Git repository.  We have all heard of infrastructure-as-code; we practice it, building and deploying our entire application and infrastructure stack from AWS CloudFormation (CFN) templates with a single click.  We can even deploy customized stacks for dev, QA, perf, and production, all driven from a single CFN template configured differently for each environment.  This was a major achievement and a decision that saves us time by keeping all our environments consistent and repeatable.  Today we have a CFN template for each microservice, plus a few common infrastructure templates, that describe not only the components but also configuration (such as the amount of memory), security (such as HTTPS/SSL and IAM permission roles), and compliance as code.  Leveraging the AWS ecosystem for all of this has vastly reduced our time to market, as everything works seamlessly and is well integrated.

Security, operations and monitoring in cloud services – Being a SaaS cloud service, we leveraged the security, operations, metering, and monitoring frameworks of the AWS cloud platform and ensured that the architecture and development of our microservices were built for SaaS.  Security threat modeling and standardized AWS blueprints for NIST helped us save time and start with well-tested, standardized security: VPCs, security groups, and the IAM permission model.  Thinking about operations, health, and monitoring while building SaaS components was a key realization as we moved into our third and fourth sprints.  For example, log messages containing exceptions or unique patterns can generate alerts sent to email or SMS.  AWS CloudWatch provides powerful alerting and log aggregation capabilities, and we used it extensively to monitor our app from log patterns and custom application metrics.  Finally, most of our operations is "app ops", since we own and manage almost no servers, databases, or infrastructure.  Being truly serverless has removed an entire set of infrastructure concerns, and we are starting to see the benefits as we get into operations.  Even our operations folks talk about application issues, not about rebooting servers or patching for security updates.

DevOps, automation and agility Kool-Aid – Just as with our AWS “all in” architecture decision, in our first week we picked a DevOps tool that a successful cloud company had used before and went “all in” with it.  We built our pipelines, one per microservice, with all the automation and stages from commits through dev, QA, staging and production.  Even after building the pipelines, we still had “red” all over them as many automated tests failed.  This is when we realized that culture plays a very important part in agile software development; just having the right technology doesn’t cut it.  We instilled a new set of developer responsibilities: owning not just code but the automation, the pipeline and the operation of that code in production.  This mindset shift is absolutely essential in moving to agile software delivery, where we can now push software changes to production a couple of times a week.  We are, of course, not Facebook or Amazon yet, but we have achieved success that would have been unimaginable in a traditional packaged software company.  We also follow 1-2 week sprints to support this.  What we are learning fast is that the more frequently you deliver to production, the lower the risk.

Pushing to production – Pushing to production is an activity we stumbled through a couple of times before learning how to get it right.  Our goal is a 30-60 minute production push that includes a quick discussion among developers, testers, product managers and release engineers about release readiness, followed by the push and canary and synthetic tests, after which we decide whether to continue or roll back.  This is a journey for us, and we continue to build operational know-how and supporting tools to help us do smoother deployments and the occasional rollback.

Think production when developing a feature – Pushing to production also plays a central part in new features and capabilities.  Recently, we started planning the rollout of a major infrastructure change and a major schema change by considering that production holds huge amounts of data and that all pushes require zero downtime.  These considerations are now a regular part of every conversation about a new feature, not just an afterthought during a push to production.  Production scale and resiliency are other aspects to watch for continuously when building apps, since the scale factor with a growing customer base is far larger than for typical on-prem packaged apps.  Resiliency is critical because every call, message, server, service or infrastructure component can fail or become slow; apps have to deal with this gracefully.  Finally, we build applications that follow the 12-factor principles.

Cost angle – One area we manage very closely is the cost of AWS services.  We track daily and monthly bills and look for ways to optimize our usage of AWS resources.  Now an architecture decision also requires a cost discussion, which we never had when we were building enterprise packaged mode 1 applications.  This is healthy for us, as it drives innovation in putting together neat solutions that optimize not just performance, scale and resiliency but also cost.

Startup small team thinking – Finally, we are a small two-to-three-pizza team.  Having complete ownership of not just the software code but also all decisions about individual microservices, tools and language choices (we use Node.js, Java and very soon Python), and doing the right thing for the product, felt great and energized everyone on the team.  The startup feeling inside a traditional enterprise software firm gives all developers and testers the freedom to innovate, try new things and get things done.

That’s it for now.  We are on a journey to become cloud-native and are already pushing features into production on a weekly, and occasionally daily, schedule without sacrificing stability, safety or security.  In the second follow-up blog, I will cover the Gartner debate on how the mode 2 learnings above can be applied to make traditional mode 1 applications more like cloud native mode 2 apps.

Shift-Left: Security and Compliance as Code

In part I of this blog, we focused on how development and operations can use ‘Infrastructure-As-Code’ practices for better governance, traceability, reproducibility and repeatability of infrastructure.  In part II, we will discuss best practices for using ‘Security-As-Code’ and ‘Compliance-As-Code’ in cloud native DevOps pipelines to maintain security without sacrificing agility.

In traditional delivery, security is an afterthought.  Once a release is just about ready to be deployed, security vulnerability testing, pen testing, threat modeling and assessments are done using tools, security checklists and documents.  If critical or high vulnerabilities or security issues are found, the release is blocked until they are fixed, causing weeks to months of delay.

The ‘Security-As-Code’ principle codifies security practices as code and automates their delivery as part of DevOps pipelines.  It ensures that security and compliance are measured and mitigated very early in the DevOps pipeline – at the use case, design, development and testing stages of application delivery.  This shift-left approach allows us to manage security just like code by storing security policies, tests and results as part of the repository and DevOps pipelines.

Let us break this down into key best practices.  As shown in the diagram below, there are 5 points in a pipeline where we can apply security and compliance as code practices.  We will use a cloud native application running on AWS cloud to illustrate them.

[Figure: security and compliance as code applied across the DevOps pipeline]


a) Define and codify security policies: Security policies are defined and codified for any cloud application at the beginning of a project and kept in a source code repository.  As an example, a few AWS cloud security policies (v1.0) are defined below:

  • Security groups/firewalls are secured, e.g., no unrestricted ingress rules
  • VPC, ELB and NACL settings are enforced and secured
  • All EC2 and other resources must be in a VPC, and each resource is individually firewalled
  • All AWS resources – EC2, EBS, RDS, DDB, S3 – as well as logs are encrypted using KMS

Keeping security policies such as these in a document is the old world.  All such security policies need to be automated – by pressing a button, one should be able to evaluate the security policies for any application at any stage and in any environment.  This is the major shift that comes with the security as code principle: everything about security is codified, versioned and automated instead of living in a “Word” or “PDF” document.  As shown in the diagram above, security as code is checked into the repository as “code” and versioned.
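To make “evaluate a policy at the press of a button” concrete, here is a minimal sketch of one of the policies above as an executable check.  The policy and config shapes are invented for illustration, not an actual AWS schema or tool:

```javascript
// Codified policy: flag security group ingress rules open to the world.
function checkNoOpenIngress(securityGroup) {
  return securityGroup.ingressRules
    .filter(rule => rule.cidr === '0.0.0.0/0')
    .map(rule => `open ingress on port ${rule.port} in ${securityGroup.name}`);
}

const sg = {
  name: 'web-sg',
  ingressRules: [
    { port: 443, cidr: '10.0.0.0/16' },
    { port: 22, cidr: '0.0.0.0/0' } // violation: SSH open to the internet
  ]
};

console.log(checkNoOpenIngress(sg)); // reports the port-22 violation
```

A check like this lives in the same repository as the policy text, is versioned with it, and can run at any pipeline stage.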

Enterprises should build standardized security patterns to allow easy reuse of security across multiple organizations and applications.  Using standardized security templates (also as “code”) results in out-of-the-box security instead of each organization or application owner defining such policies and automation per team.  For example, if you are building a 3-tier application, a standard cloud security pattern should be defined.  Similarly, another cloud security pattern should be defined and standardized for a dev/test cloud.  Both hardened servers and hardened clouds are patterns that can yield security out of the box.

For example, with AWS cloud, the 5 NIST AWS CloudFormation templates can be used to harden your networks, VPCs and IAM roles.  The cloud application can then be deployed on this secure cloud infrastructure, where these templates form the “hardened cloud” (blue) layer of the stack as shown below.  For cloud or on-premise servers, server hardening can be done by products such as BMC Bladelogic.

[Figure: hardened cloud layer in the application stack]

b) Define security user stories and do a security assessment of the application and infrastructure architecture: Security-related user stories must be defined in the agile process just like any other feature stories.  This ensures that security is not ignored.

Security controls are identified as a result of application security risk and threat modeling on the application and infrastructure architecture.  These controls are then implemented in the application code or in policies.


c) Test and remediate security and compliance at application code check-in: As frequent application changes are continuously deployed to production, security testing and assessment are automated early in the DevOps pipeline.  Security testing is automatically triggered as soon as a code check-in happens for either application or infrastructure changes.  Security vulnerability tools are plentiful in the marketplace: Nessus, open source Chef recipes for security automation, web pen testing tools, static and dynamic code analysis, infrastructure scanners and others.

If critical or high vulnerabilities are found, automation creates defects/tickets in JIRA for any new security issues, assigned to the owner of the microservice for resolution.

d) Security and compliance testing of infrastructure-as-code artifacts: In the new cloud native world, infrastructure is represented as code such as Chef recipes, AWS CFN templates and Azure templates.  Security policies are embedded as code in these artifacts together with the infrastructure.  For example, IAM AWS resources can be created as part of AWS CFN templates.

These artifacts need to be tested for violations of security policies such as “no S3 buckets should be publicly readable or writable” or “do not use open security groups that allow all traffic”.  This needs to be automated as part of the application delivery pipeline using security and compliance tools; many tools such as CloudCheckr and AWS Config can be used.  For regulatory compliance policy checks such as CIS and DISA on servers, databases and networks, the BMC Bladelogic suite of products can be used to not just detect but also remediate compliance violations.
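As an illustrative sketch (not a real tool), the “no public S3 buckets” check can be run directly against a CloudFormation-style template object before it ever reaches production.  `AccessControl` with `PublicRead`/`PublicReadWrite` is a real S3 bucket property; the rest of the shapes here are minimal assumptions:

```javascript
// Scan a CloudFormation-style template for S3 buckets whose canned ACL
// grants public read or write access, returning the offending resource names.
function findPublicBuckets(template) {
  return Object.entries(template.Resources)
    .filter(([, res]) => res.Type === 'AWS::S3::Bucket')
    .filter(([, res]) => {
      const acl = (res.Properties || {}).AccessControl;
      return acl === 'PublicRead' || acl === 'PublicReadWrite';
    })
    .map(([name]) => name);
}

const template = {
  Resources: {
    LogsBucket:   { Type: 'AWS::S3::Bucket', Properties: { AccessControl: 'Private' } },
    AssetsBucket: { Type: 'AWS::S3::Bucket', Properties: { AccessControl: 'PublicRead' } }
  }
};

console.log(findPublicBuckets(template)); // flags AssetsBucket
```

A pipeline stage that fails the build when this returns a non-empty list is the essence of compliance-as-code.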

e) Test and remediate security and compliance in production: As the application moves from dev to QA to production, automated security and compliance testing continues to run, although the risk of finding issues here is much lower since most of the security automation happens much earlier in the DevOps pipeline.  Tools such as BMC Bladelogic can be used for production security patching after scanning production servers in the case of mutable infrastructure.  For immutable infrastructure, of course, new servers are created and replaced instead of patched.


The shift-left approach to security and compliance is a key mindset change that comes with DevOps pipelines.  Instead of security being a gating add-on process at the end of a release, or a periodic security scan of production environments, start with security at the beginning of an application release by representing it as “code” and baking it in.  We showed 5 best practices for embedding security and compliance into application delivery: defining security policies as part of infrastructure-as-code; standardizing hardened clouds just like hardened servers; defining compliance policies as code; and automating testing in the early, lower environments and stages of a pipeline as well as in production.  We strongly believe that with these 5 best practices, product companies can ensure security while maintaining agility and speed of innovation.


Product Trials – Managing 1000’s of machines on AWS Cloud

Most software companies want their products and services to be available for trials.  Trials let customers test drive a product or service before purchase and give marketing valuable leads for customer acquisition and campaigns.  Many companies also want to get customer feedback on unreleased products through trials.  If you have a SaaS service or a hosted product, this is achieved by building a limited N-day trial into the service’s registration page.  However, if you are in the business of delivering packaged software, it is not that easy.  With an automation engine, baked images of your product, and public clouds such as Amazon AWS or Azure, even packaged software companies can jump start a trial program in weeks.  We used a cloud management product automation engine and AWS cloud to put together trials of several of our products in less than 4 weeks.  Since going live a few months ago, we have provisioned over 500 trial requests and thousands of AWS EC2 machines with 99.99% reliability.

Cloud management platforms such as BMC Cloud Lifecycle Management (CLM) have been used extensively in many enterprise use cases to manage private, public and dev/test clouds and to automate provisioning and operations, resulting in cost savings and agility.  One cool new use case we recently applied CLM to is product trial automation.  Let me describe the basic flow:

[Figure: try-and-buy trial flow]

  1. As soon as a trial user registers on the trial page, an email is sent to a mailbox.
  2. A provisioning machine running BMC Cloud Lifecycle Management retrieves the email, interprets the request and starts the automation flow.  The automation uses the CLM SDK to make concurrent service request calls to provision for each trial user.
  3. Based on the service request in the email, CLM starts provisioning ‘product-1’, ‘product-2’,… trial machines on the Amazon AWS cloud using service offerings and blueprints pre-defined in CLM containing baked AMIs.  Each product trial has an associated service blueprint describing the environment as a collection of AMIs to provision, ranging from a simple single-machine application stack to a full-stack multi-machine application environment, including target machines.  Once provisioning of the trial machines completes, CLM automatically executes scripts and post-deploy actions as Bladelogic (BSA) jobs to properly configure all the targets depending on the blueprint.
  4. As soon as the machines are provisioned and configured on the AWS cloud, an email is sent back to the user indicating that the machines are ready, with URLs to log in and trial the product.
  5. At the end of the trial period, the trial machines are decommissioned automatically through BMC CLM.  This is extremely important to govern the sprawl of EC2 machines.

Cloud lifecycle management products such as CLM, with their advanced automation, orchestration, blueprinting and governance capabilities, can be used to solve interesting new automation use cases such as product trials that we hadn’t envisioned earlier.  With over 500 trials and thousands of machines successfully provisioned and decommissioned on the Amazon cloud, we feel good that we are eating our own dogfood by using our solutions to manage product trials.




How “Everything as Code” Changes Everything

Software is eating the world.  Infrastructure, security, compliance and operations used to be handled by independent teams separate from application development.  These teams traditionally worked mostly in isolation, occasionally interacting with application development teams, and used their own tools, slow-moving processes (change boards, approvals, checklists and 100-page policy documents) and specialized tribal knowledge.  However, as cloud native applications have seen dramatic growth and acceptance in small and large enterprises, these 4 disciplines are being reimagined to ensure agility while still maintaining governance.  A new paradigm of “everything as code” is changing these 4 disciplines (infrastructure, security, compliance and operations) with principles such as “infrastructure as code” and “security as code”.  The core tenet is that infrastructure, security, compliance and operations are all described and treated exactly like application code.  In this two-part blog, we will explore these “everything as code” principles, the benefits of representing everything as code, and the best practices and technologies that support this in development and operations.  Part I focuses on infrastructure as code, while part II focuses on security, compliance and operations as code practices.

Infrastructure as Code

The fundamental idea behind infrastructure as code is that you treat infrastructure just like application software and follow software development lifecycle practices.  Infrastructure includes anything needed to run your application: servers, operating systems, installed packages, application servers, firewalls, firewall rules, network paths, routers, NATs, and the configurations of these resources.

Following are some of the key best practices that we have learned after operating several production cloud applications on the AWS cloud over the past year.

  1. Define and codify infrastructure: Infrastructure is codified in a declarative specification such as AWS CloudFormation (CFN) templates for AWS, Azure resource templates for Azure, Terraform templates from Hashicorp, Docker Compose files and Dockerfiles, or BMC CLM blueprints for both public cloud and on-prem datacenters.  Infrastructure can also be represented as procedural code such as scripts and cookbooks using configuration tools like Chef or Puppet.  In either case, the key best practice is to represent infrastructure as code, declaratively or procedurally, with the same set of tools you use for configuration or cloud management.
  2. Source repo, peer review and test: Next, infrastructure as code is kept in source control such as Git, where it is versioned, under change control, tracked, peer reviewed and tested just like application software.  This increases traceability and visibility into changes and enables collaborative management of infrastructure through peer reviews.  For example, if operations wants to roll out a change to production infrastructure, ops does not make it directly through the console in production as traditionally done in IT datacenters.  Instead, ops creates a pull request on the ‘infra as code’ Git artifacts, peer reviews are conducted on the changes, and only then are they deployed to production.
  3. DevOps pipeline: Infrastructure as code (CFN templates, blueprints, etc.) goes through a DevOps pipeline just like application code before being deployed to production.  The DevOps pipeline is a critical component of infrastructure as code since it provides change delivery and governance, ensuring that infrastructure changes are tested and deployed to production environments in a controlled manner.  For example, in AWS clouds, ops will make pull requests on CFN templates to change configuration parameters or AWS resources.
  4. Automation: Infrastructure automation provisions infrastructure from the codified definition.  Implementing infrastructure-as-code automation is very easy for public or private clouds where the datacenter is an API, for example with AWS CFN, Azure templates or Hashicorp’s Terraform.  All modifications are first made to the code, and only then are the changes applied to the infrastructure in dev or prod.
  5. Immutability: Finally, infrastructure as code also supports server and container immutability.  Production engineers don’t make changes to servers or containers directly in production; instead, they go through a full DevOps pipeline to create new server or container images and then deploy those new images to production by replacing the running servers or containers.  Thanks to my colleague Brendan Farrell for pointing out this benefit.
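The declarative model behind these practices can be sketched in a few lines: the desired state lives in the repo, and a plan step diffs it against what is actually running (this is a toy illustration of what tools like Terraform or CFN change sets do; the resource names and shapes are invented):

```javascript
// Compute the actions needed to move the running infrastructure ("current")
// to the state declared in the repository ("desired").
function plan(desired, current) {
  const actions = [];
  for (const name of Object.keys(desired)) {
    if (!(name in current)) actions.push(`create ${name}`);
    else if (JSON.stringify(desired[name]) !== JSON.stringify(current[name]))
      actions.push(`update ${name}`);
  }
  for (const name of Object.keys(current)) {
    if (!(name in desired)) actions.push(`delete ${name}`);
  }
  return actions;
}

const desired = { web: { instances: 3 }, db: { engine: 'postgres' } };
const current = { web: { instances: 2 }, cache: { engine: 'redis' } };

console.log(plan(desired, current));
// → [ 'update web', 'create db', 'delete cache' ]
```

Because every change is first a change to `desired` in Git, the plan output is reviewable in a pull request before anything touches production.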


With these best practices, infrastructure as code can be used successfully in both cloud native and on-prem application delivery.  As an example, the diagram below shows application and infrastructure pipelines.  Team-1’s pipeline defines application and infrastructure in a single file that is added to the repository and used to drive the pipeline and deploy the full application stack.  The full stack includes servers (infra), Tomcat (infra) and the application (app1.war).  Note that this can be deployed as a single baked AMI, a single baked VM template or a Docker container, depending on the technologies used by the team, and it supports immutability.  Team-2 deploys only its app2.war, while the infra is built and deployed by a separate infra team and includes a hardened OS for the higher level of security needed by app2.  This illustrates two ways of including infra as code in pipelines: as part of the microservice application (app1) or separate (app2) but combined for final deployment.

[Figure: infra as code in application and infrastructure pipelines]


Finally, what’s in it for developers and operations teams in making the move to infrastructure as code?  Developers can specify infrastructure as part of their application code in a repository.  This keeps the full application stack code, definition, testing and delivery logically connected, resulting in agility and autonomy for developers in full-stack provisioning and operations.  For operations, software development best practices applied to infrastructure preserve agility while adding governance, reproducibility and repeatability to infrastructure management.  Infrastructure changes are also less error-prone because they go through a DevOps pipeline with peer reviews and a collaborative process just like code, and the approach supports immutable infrastructure.  Finally, there is traceability and easy answers to questions such as: what is my current infrastructure, who made infrastructure changes in the past few days, and can I roll back the latest configuration change?

We strongly believe that using infrastructure as code principles to manage application delivery in DevOps can yield compelling advantages for both developers and operations.  In part II, we will cover security, compliance and operations as code principles.

What I learned from a 1-day hack building real-time streaming with AWS Lambda and Kinesis

It was a rainy Saturday, a perfect day for hacking together a real-time streaming prototype on AWS.  I started thinking of Apache Kafka and Storm but then decided to give AWS Kinesis a try, as I didn’t want to spend my time creating infrastructure; my goals were minimal code, zero setup and zero configuration pain.  This blog describes how easy it was to build a real-time streaming platform on AWS for real-time applications like aggregation, counts, top 10, etc.  Many key questions are also discussed:

  • Kafka vs. Kinesis – How do you decide?
  • KCL SDK vs. Lambda – Which is better to write streaming apps?
  • Streams, shards, partition keys and record structure – Design guidelines
  • Debugging
  • Automation of Kinesis in production

Sketch your conceptual streaming architecture

The first thing to think about is the type and volume of data you plan to stream in real time and the number of real-time applications that will consume this data.

The next question is how you will ingest the data in real time from the producers of the data.  Two common ways to do this are shown below: API Gateway + Lambda, and a native SDK streaming library (aws-sdk), the latter being a much faster way to push data.

Moving on, there are Kinesis design questions such as how many real-time streams you need, as well as their ‘capacity’.  Concepts such as shards and partition keys need to be decided at this point, although they can be modified later as well.  For this prototype, I picked a few server endpoints with a single-shard Kinesis stream as shown below.  All the red lines were implemented as part of this prototype.

[Figure: conceptual real-time streaming architecture with Kinesis]

Once the initial conceptual architecture for ingestion is thought through, the next most important design decisions are about the applications that will consume the real-time stream.  As seen in the diagram, there are multiple ways apps can consume stream data – Lambda, the AWS Kinesis Client Library (KCL), the AWS Kinesis API, and Apache Storm through connectors.  In line with my serverless philosophy of minimal infrastructure, I went with Lambda functions since my application was simple.  However, for complex apps, the best enterprise option is KCL or even Storm to reduce dependency and lock-in on AWS.  Also note that KCL effectively requires that you be a Java shop, as the tooling for writing in other languages like JavaScript and Ruby (wrappers around the MultiLangDaemon) is simply too complex and non-intuitive to use.

Big data lifecycle beyond real-time streaming

Big data has a lifecycle – real-time streaming and real-time applications alone are not enough.  Usually, in addition to the real-time stream, the data is also aggregated, archived and stored for further processing by downstream batch jobs.  This requires connectors from Kinesis to downstream systems such as S3, Redshift, Storm, Hadoop, EMR and others.  Check this paper out for further details.

Now that we have thought through the data architecture and the big picture, it is time to start building.  We followed 3 steps: create streams, build producers, and build consumers.

Create Streams

Streams can be created manually from the AWS management console or programmatically, as shown by the Node.js code snippet below.

kinesis.createStream({StreamName: streamName, ShardCount: 1}, function(err, data) {
  // Ignore the error if the stream already exists
  if (err && err.code !== 'ResourceInUseException') { callback(err); return; }
  callback();
});

Note that streams can take several minutes to be created, so it is essential to wait before using them, as shown below, by periodically checking whether the stream is “ACTIVE”.

kinesis.describeStream({StreamName: streamName}, function(err, data) {
  if (err) { callback(err); return; }
  if (data.StreamDescription.StreamStatus === 'ACTIVE') { callback(); }
  else { /* not yet active: schedule another describeStream check after a delay */ }
});

Helloworld Producer

This step can be done by invoking the Kinesis SDK call kinesis.putRecord.  Note that the partition key, stream name (created in the previous step) and data are required to be passed in as a JSON structure to this method.

var recordParams = {
  Data: data, // can be just a JSON record that is pushed to Kinesis
  PartitionKey: partitionKey,
  StreamName: streamName
};

kinesis.putRecord(recordParams, function(err, rdata) {
  if (err) { console.error(err); }
});

This should start sending data to the stream we just created.

Helloworld Lambda Consumer

Using the management console, a Lambda function can easily be written that listens to the stream as an event source and gets invoked to do business analytics processing on a set of records in real time.

exports.handler = function(event, context) {
  event.Records.forEach(function(record) {
    // Kinesis record data arrives base64-encoded
    var payload = new Buffer(record.kinesis.data, 'base64').toString('ascii');
    // do your processing here
  });
  context.succeed("Successfully processed " + event.Records.length + " records.");
};

In less than a day, we had created a fully operational real-time streaming platform that can scale and run applications in real-time.  Now, let us take a look at some key findings and takeaways.

Kafka vs. Kinesis – Kafka requires a lot of management and operations effort to keep a cluster running at scale across datacenters: mirroring, security, fault tolerance and monitoring disk space allocation.  Kinesis can be thought of as “Kafka as a service”, where operations and management costs are almost zero.  Kafka, on the other hand, has more features, such as ‘topics’, which Kinesis doesn’t provide, but this should not be a deciding factor until all the factors are considered.  If your company has made a strategic decision to run on AWS, the obvious choice is Kinesis, as it has the advantage of ecosystem integration with Lambda, AWS data sources and AWS hosted databases.  If you are mostly on-prem, or require a solution that runs both on-prem and as SaaS, then you might have no choice but to invest in Kafka.

Stream processing apps – which is better, KCL vs. Lambda vs. API? – Lambda is best for simple use cases and is a serverless approach that requires no EC2 instances or complex management.  KCL, on the other hand, requires much more operational management but provides a more extensive library for complex stream applications: worker processes, integration with other AWS services like DynamoDB, fault tolerance and load balancing of streaming data.  Go with KCL unless your processing is simple or needs only standard integrations with other AWS services, in which case Lambda is better.  Storm is another approach to consider for reducing lock-in, but I didn’t investigate it enough to know the challenges of using it; Storm spouts are next on my plate to evaluate.

Streams – how many? – It is always difficult to decide how many streams you need.  If you mix all your different types of data in one stream, you not only risk hitting per-stream scalability limits, but your stream processing apps will also have to read everything and then filter out the records they want to process.  Kinesis doesn’t have server-side filtering; every application must read everything.  If you have independent data records requiring independent processing, keep the streams separate.

Shards – how many? – The key decision points are the record throughput and the amount of read concurrency needed during processing.  The AWS management console provides a recommendation for the number of shards needed.  Of course, you don’t need to get this right when creating a stream, as resharding is possible later.
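A back-of-the-envelope estimate can also be computed from the per-shard limits Kinesis documented at the time (1 MB/s or 1,000 records/s on write, 2 MB/s on read); this helper is a sketch, not an official formula:

```javascript
// Estimate the shard count as the largest requirement across the
// per-shard write bandwidth, write record rate, and read bandwidth limits.
function estimateShards(writeMBps, writeRecordsPerSec, readMBps) {
  return Math.max(
    Math.ceil(writeMBps / 1),          // 1 MB/s write per shard
    Math.ceil(writeRecordsPerSec / 1000), // 1,000 records/s write per shard
    Math.ceil(readMBps / 2),           // 2 MB/s read per shard
    1                                  // at least one shard
  );
}

console.log(estimateShards(3, 2500, 8)); // read throughput dominates here
```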

Producer – small or big chunks? – It is better to minimize network calls when pushing data through the SDK into AWS, so some aggregation and chunking is needed at the client endpoints.  This needs to be traded off against the staleness of data; holding and aggregating data for a few seconds should be acceptable in most cases.
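The client-side aggregation described above can be sketched as a small batcher: buffer records and flush when the batch is full (or on a deadline).  The `send` callback here stands in for a real batched Kinesis put; the batch size is an arbitrary example:

```javascript
// Buffer records and flush in chunks to reduce the number of network calls.
function makeBatcher(maxBatch, send) {
  let buffer = [];
  return {
    add(record) {
      buffer.push(record);
      if (buffer.length >= maxBatch) this.flush();
    },
    flush() { // call this on a timer too, to bound data staleness
      if (buffer.length > 0) { send(buffer); buffer = []; }
    }
  };
}

const sent = [];
const batcher = makeBatcher(3, batch => sent.push(batch.length));
[1, 2, 3, 4, 5].forEach(n => batcher.add(n));
batcher.flush(); // deadline-style flush of the remaining partial batch
console.log(sent); // → [3, 2]
```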

Debugging – still tough – It was a bit of a challenge to debug Kinesis and Lambda, as logs and CloudWatch showed metrics with delays.  Better tooling is needed here.

Kinesis DevOps automation – For production operations, automation of Kinesis will be needed.  Prominent capabilities needed here are stream creation per region, a single pane of glass for multiple streams, and stream scaling automation.  Scaling streams is still manual and error-prone, but that is a good problem to have when the amount of data ingested is increasing due to a larger number of customers.

Final Word

In less than a day, I had a successful real-time stream running where I could push big volumes of data records from my laptop or a REST endpoint through the AWS API Gateway to a Kinesis stream, with 2 Lambda applications reading the data in real time and doing simple processing.  AWS Kinesis together with Lambda provides a compelling real-time data processing platform.

AWS Lambda and Serverless Computing – Impacts on Dev, Infrastructure, Ops and Security

Developing software as event-driven programming is not new.  Most of our UI is based on events through libraries such as jQuery and AJAX.  However, the AWS Lambda service from Amazon has taken this concept of event-driven reactive programming to backend processing in the cloud and made it simple and intuitive to use in a few clicks.  AWS Lambda requires an event and some code associated with the event to be executed – that’s it.  No server provisioning, no auto-scaling, no server monitoring, and no server patching and security.  The “events” that can be specified with AWS are growing and include external events such as an HTTPS REST endpoint API call or internal events inside the AWS ecosystem such as a new S3 file or a new record in a Kinesis stream.  The “code” is simply a Node.js JavaScript or Java “function” that encapsulates the business logic.  If I just have “events” and “code”, and no servers, then a number of questions arise:

  • What types of use cases are best suited for serverless apps?
  • How does my app design change due to Lambda?
  • What happens to my DevOps pipelines?
  • How do I do “infrastructure as code” when I don’t have any infrastructure to manage?
  • Do I still need operations?  And do I still need security scans?

This article answers these key questions by analyzing the impact of AWS Lambda on cloud native app design, DevOps pipeline tools, cloud lifecycle management tools (e.g., CMPs), operations, and security.  It also discusses how serverless infrastructure could become the next disruption after containerization and virtualization in the cloud by fundamentally changing the way we design, build, and operate cloud native applications.

Use Cases for Serverless Computing

The key use cases for serverless computing platforms such as AWS Lambda, seen in the year since AWS announced the service in November 2014, are categorized below.

  1. Simple event-driven business logic that does not justify a continuously running server.  There are many use cases of the IFTTT (If-This-Then-That) form.  For example, suppose I want to integrate Git pushes with a custom social tool.  This can easily be done by writing a Lambda function that listens for a webhook from the Git push and performs some action in my “social tool”.
  2. Event-driven, highly scalable transformation, notification, audit, or analysis of data – utilities such as transforming media files from one format to another can be easily written as Lambda functions and triggered as data arrives.
  3. Stateless microapplications – These are simple self-contained utilities or scripts that can be deployed in the cloud but, again, do not justify a server or PaaS to run.
  4. Extreme scaling, such as taking a task and breaking it down into hundreds of subtasks, each an independent invocation of the same business logic coded as a Lambda function.
  5. Mobile and IoT backends that require extreme scaling.
  6. Rule-based processing – Rules can be written as Lambda functions to drive business logic.  For example, operational metrics and AWS infrastructure management itself can be done through rules (see Netflix, for example).

This shows that a wide variety of applications, from simple apps to complex backends, can be developed on Lambda when automatic, virtually unlimited scaling and minimal operational overhead are key drivers.
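Use case 1 above, for instance, reduces to a single handler function.  A minimal sketch in the Lambda handler style (the event body shape and the “post to the social tool” step are hypothetical, not a real Git webhook payload):

```python
import json

def handler(event, context):
    """Triggered by a webhook-style event (e.g., via API Gateway):
    extract who pushed to which repo and 'post' it to a social tool."""
    body = json.loads(event.get("body", "{}"))
    message = "{} pushed to {}".format(
        body.get("pusher", "someone"), body.get("repository", "a repo"))
    # A real function would call the social tool's API here.
    return {"statusCode": 200, "body": json.dumps({"posted": message})}
```

The `handler(event, context)` signature and the `statusCode`/`body` response shape follow Lambda's Python programming model and API Gateway's proxy integration; everything inside the function is illustrative.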

Cloud Native Application Design

Cloud native application design is heavily based on microservices that can be scaled easily.  With AWS Lambda paired with the AWS API Gateway service, the microservices design becomes even simpler and more scalable, with minimal code and management overhead.  Each microservice is clearly focused on a single responsibility and business function, which is coded as a Lambda “function”.  The REST endpoint for it is provided by API Gateway and becomes the “event” triggering this microservice Lambda function.

Application = Collection of Lambda Functions: A cloud native application can be thought of as a collection of microservices, or a collection of Lambda functions, each with its own REST endpoint, triggered by another event, or called from another Lambda function.  Lambda functions can easily be composed and chained to build more complex microservice orchestrations.  Lambda also calls for a more functional, event-driven programming approach when designing and building apps on it.
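The composition idea can be sketched locally: each step is an independent function, and the “application” is just their chaining, with the output of one step becoming the input event of the next.  The step names below are purely illustrative:

```python
def validate(event):
    # First function: reject malformed events.
    if "order_id" not in event:
        raise ValueError("missing order_id")
    return event

def enrich(event):
    # Second function: add derived fields.
    return dict(event, priority="standard")

def notify(event):
    # Third function: produce the final result.
    return {"status": "queued", "order_id": event["order_id"]}

def application(event, steps=(validate, enrich, notify)):
    """Chain Lambda-style functions: each step's return value
    is the next step's input event."""
    for step in steps:
        event = step(event)
    return event
```

In a real deployment each step would be its own Lambda function, and the chaining would happen through events or direct invocations rather than an in-process loop.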

Hybrid: Except for some simple use cases outlined earlier, most complex cloud native apps will contain a mix of IaaS, PaaS, containers, and Lambda functions.  This requires that a declarative blueprint of the application allow not just servers and PaaS resources but also Lambda functions as resources.  Cloud management products (e.g., BMC CLM) should support such blueprints.

Development, DevOps and NoOps

Lambda-based applications will require tools and technologies to build new integrations with AWS Lambda.

Local Development:  Developing and testing code locally is currently a challenge with AWS Lambda.  Lambda requires many manual steps: creating multiple IAM roles and dependent AWS resources, zipping up code including dependencies and uploading it to AWS, retrieving logs from CloudWatch, and so on.  This is an opportunity for a new breed of tools, and for integration with existing tools, to simplify the developer experience.  Source code systems such as Git, IDEs such as Eclipse, and CI tools such as Jenkins could have tighter integration with AWS Lambda.  Imagine that I have just written a Lambda function in Eclipse and, with a right click, can test it by pushing the new version to Amazon and running jUnit test cases before check-in – without using the AWS CLI/console or performing the half dozen or so manual steps needed to set up my code on the AWS Lambda service.
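Part of that toil is just packaging.  The “zip up code including dependencies” step can be scripted with nothing but the standard library; a minimal sketch (the paths are illustrative, and the upload itself would still go through the AWS CLI or SDK afterwards):

```python
import os
import zipfile

def build_deployment_package(source_dir, zip_path):
    """Zip a source directory into a Lambda-style deployment package,
    storing entries relative to the source root so the handler
    sits at the top level of the archive."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(source_dir):
            for name in files:
                full = os.path.join(root, name)
                zf.write(full, os.path.relpath(full, source_dir))
    return zip_path
```

Storing paths relative to the source root matters: Lambda looks up the handler module at the root of the archive, so absolute or nested paths in the zip are a common cause of “module not found” deployment errors.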

CI/CD Pipelines: Microservices based on Lambda functions still require traditional release pipelines in which the code – the Lambda function microservice – is built, tested, and deployed in test environments.  This requires creating AWS Lambda environments for testing the function, integrating with AWS Lambda, and decommissioning the Lambda environments after the tests are run.  Many code delivery and release management products have started to add Lambda plug-ins to achieve this capability.

Lifecycle Management of Lambda Functions: Cloud Management Product (CMP)

Lambda functions are just microservices that are event triggered, and hence require complete lifecycle management.  This can be accomplished by Cloud Management Platforms (CMPs) such as BMC CLM.  As noted earlier, complex blueprints could contain both Lambda resources and traditional cloud resources such as AMI-based machines.  Developers and admins need the ability to declaratively specify applications consisting of Lambda and regular cloud resources, as well as multiple deployment models such as dev, test, staging, and prod.  The CMP then takes these blueprints and provisions the resources.  In the case of Lambda, resources such as the API gateway and the Lambda functions will be provisioned together with all the security and configuration needed to set them up properly (such as IAM roles and linking the API gateway to Lambda).  Orchestration and visualization of a set of Lambda and non-Lambda cloud resources are additional capabilities that CMPs will provide.  End users and operations teams need to be able to visualize all the Lambda offerings and Lambda functions they have provisioned for development or production workloads.  There are also a number of day 2 actions possible on Lambda functions, such as increasing a function’s memory or timeout.  Many cloud provisioning tools have started to build provisioning support for Lambda.
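A day 2 action such as “increase the memory” is a single API call that a CMP can wrap with its own guard rails.  A sketch, assuming a boto3-style Lambda client is passed in (`update_function_configuration` is the real Lambda API operation; the numeric limits in the guard rail are illustrative of Lambda's memory constraints, not authoritative):

```python
def bump_function_memory(lambda_client, function_name, memory_mb):
    """Day 2 action: raise a Lambda function's memory setting,
    with a simple guard rail a CMP might enforce before calling AWS."""
    # Illustrative limits: Lambda memory is configured in 64 MB steps.
    if not 128 <= memory_mb <= 1536 or memory_mb % 64 != 0:
        raise ValueError("memory must be 128-1536 MB in 64 MB steps")
    return lambda_client.update_function_configuration(
        FunctionName=function_name, MemorySize=memory_mb)
```

Injecting the client keeps the policy logic testable without AWS credentials; the CMP supplies a real boto3 client in production.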

Security, Logs and Monitoring

Even though Lambda functions abstract away the servers, security is still needed.  Vulnerability scanning and web penetration testing of Lambda functions will be critical for verifying that there are no application-level vulnerabilities.  Compliance rules can also be written for such functions and evaluated against them.
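Such compliance rules can be expressed directly as checks over a function's configuration.  A minimal sketch – the specific rules and config field names below are illustrative, modeled loosely on the configuration attributes Lambda exposes:

```python
def check_compliance(config):
    """Evaluate simple compliance rules against a Lambda function's
    configuration; returns a list of violations (empty = compliant)."""
    violations = []
    if config.get("Timeout", 0) > 300:
        violations.append("timeout exceeds 300 seconds")
    if "Role" not in config:
        violations.append("no IAM role attached")
    if config.get("Runtime", "").startswith("nodejs0."):
        violations.append("deprecated runtime")
    return violations
```

A CMP or DevOps pipeline could run this over every provisioned function and fail a deployment whenever the returned list is non-empty.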

Lambda functions, as microservices, still require log analysis and monitoring, and traditional tools will need new plug-ins to support this.

What next?

We believe that serverless computing will be a disruptive force similar to containerization.  We also believe that event-driven computing is not suitable for all applications, and we expect to see microservices implemented using a mix of containers, virtual machines, and Lambda functions.  AWS Lambda will require our DevOps, application release management, cloud management, and security tools to adapt for effective operations, as described here.

Bimodal IT – How Does IT Need to Change?

Gartner defines Bimodal IT as an organizational model that segments IT services into two categories based on application requirements, maturity, and criticality.  “Mode 1 is traditional, emphasizing scalability, efficiency, safety and accuracy. Mode 2 is nonsequential, emphasizing agility and speed.”  A recent Forbes article also highlighted the differences.

Now what does all this mean for IT?  There are four impacts IT needs to consider.

a) Applications in mode 2 are more agile, likely built with microservices and running in the cloud, mostly public clouds.  They will also be web scale, likely supporting mobile apps or big data analytics workloads.  IT needs to enable the backend infrastructure to support such applications and services and must be willing to work out solutions both on-prem and in public clouds.  For example, microservices will be delivered through containers such as Docker, so IT should prepare the latest technologies for building these infrastructures to help application developers and operations teams deploy these apps.  These technologies range from a simple cluster of Docker hosts managed through a cloud management product (CMP) like BMC CLM, to more complex cluster managers and schedulers such as Google’s Kubernetes, to datacenter operating systems like Mesos.  These are the next-generation infrastructures needed to run cloud applications, and IT should become proficient in them and be able to manage this infrastructure efficiently.

b) The rate of change of these apps will be much higher than the traditional 6-12 month delivery cycle for legacy apps.  The apps team needs to implement a DevOps pipeline that allows quick releases of applications through build, test, and deploy stages.  IT should get out of the way of the apps team and provide API-based services – such as the security services it wishes to enforce – that can be called from DevOps pipelines.  IT must also enable environments to be created and destroyed through APIs in public or on-prem private clouds, instead of being a roadblock.  Even shadow IT is alright.  If there are minimal policies on security and compliance for applications or infrastructure, IT should provide them as API-based services.  For example, if IT requires a security compliance check for vulnerabilities in Docker containers, it must provide API-based security compliance as a service.

c) Application updates with creative out-of-the-box deployment strategies in production.  Application changes have to be pushed into production using modern techniques such as the blue-green deployment model, used extensively by companies like Netflix on AWS, or a rolling deployment model.  Both are based on a common theme: a newer version of the application is deployed alongside the existing version, validated by releasing some traffic to it and verifying it in production, and then traffic is switched gradually to the newer version.  DevOps tools must support much of this deployment orchestration to production.
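The gradual traffic shift common to both models can be sketched as a deterministic split: a stable hash of some request key decides which version serves it, and raising the weight moves more traffic to the new version while any given user stays pinned to one version.  The routing function below is illustrative, not a specific tool's API:

```python
import zlib

def route(request_id, new_version_weight):
    """Deterministically route a request to the 'new' or 'old' version.
    new_version_weight is a percentage (0-100); the same request_id
    always lands in the same bucket, so routing is sticky as the
    weight is gradually raised."""
    bucket = zlib.crc32(request_id.encode()) % 100
    return "new" if bucket < new_version_weight else "old"
```

Starting the rollout at a small weight (say 5), verifying the new version's error rates, and then stepping the weight toward 100 gives the validate-then-switch behavior described above; setting the weight back to 0 is an instant rollback.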

d) Less governance and control.  Although some minimum viable governance is certainly needed even for the new agile IT, the level and evaluation of governance differ from mode 1.  In mode 1, governance is stringent and rigid, based on approvals and change control boards.  In mode 2, the new agile way, we loosen this: governance is based on permissions and is built into the DevOps process itself – for example, basic compliance and vulnerability testing of applications is part of the DevOps pipeline.

By adapting to this new mode 2 agile IT, IT and application development teams can achieve the goal of building and delivering new applications faster.  IT products such as cloud management tools, DevOps tools, deployment tools, configuration management tools, and compliance, patching, and monitoring tools need to adjust to ensure that both mode 1 and mode 2 are supported, based on the use cases and types of applications these tools serve.