Secret Sauce for Building Cloud Native Apps

A few months back I wrote a blog comparing mode 2 vs. mode 1 application development.  Gartner defines Bimodal IT as an organizational model that segments IT services into two categories based on application requirements, maturity and criticality: "Mode 1 is traditional, emphasizing scalability, efficiency, safety and accuracy. Mode 2 is non-sequential, emphasizing agility and speed."  In the last 6 months, we built a mode 2 cloud native application on the AWS cloud, and this got us thinking about two questions:  What were the unique enablers that led to a successful mode 2 cloud app?  Is Gartner right in keeping the two modes separate, and what can mode 1 change and learn from mode 2?  Let us analyze the first question here and discuss the key best practices in building cloud native mode 2 applications.

Focus – do one thing – know your MVP – Mode 2 cloud native applications require a maniacal focus on the customer problem, market requirements and delivering value to the customer by focusing on one or two use cases.  We spent very little time in needless meetings deciding what to build or how to build it.  I have seen many projects take 2-6 months just to define the product to build.  We knew exactly what to build and had uncompromising faith in it.  Startups usually have this solid vision and focus on doing one thing right, usually called the "Minimum Viable Product (MVP)".  We acted like a startup charged with solving a customer problem.

Leverage cloud native higher-level services – We decided to go "all in" with the AWS cloud, using disruptive higher-level PaaS services such as Kinesis, Lambda, API Gateway and NoSQL databases.  We also went serverless, with AWS Lambda as the key architectural microservice pattern, using servers only when we absolutely needed to.  No more discussions about the "platform", "lock-in", portability to multiple clouds or multiple datacenters, "on-prem" vs. "SaaS", or other utopian goals.  We wanted to get to market fast, and AWS managed higher-level services were our best bet.  Decisions such as which open source projects to use, endless comparisons, discussions, debates and paralysis by analysis were avoided at all costs.  We also worked quietly, avoiding cross-organizational discussions that could slow us down, operating almost as a skunkworks: a fully autonomous team charged with making our own decisions and our own destiny.  After 6 months of using AWS, we continue to be delighted and amazed by the power of the platform.  The undifferentiated heavy lifting is all done by AWS – infrastructure work such as routing, message buses, load balancers, server monitoring, running clusters and servers, patching them, maintaining them, and so on – none of which is our key business strength.  We want to focus on the business problems that matter and on building "apps", not managing "infra" that is best left to AWS.
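To make the serverless pattern concrete, here is a minimal sketch of what one of these Lambda-based microservices might look like: an API Gateway-fronted handler that persists a request to DynamoDB.  The table and field names are illustrative only, not our actual service.

```python
# Minimal sketch of a serverless microservice: an AWS Lambda handler fronted by
# API Gateway (proxy integration) that writes an item to a hypothetical DynamoDB table.
import json
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("orders")  # hypothetical table name


def handler(event, context):
    """Entry point invoked by API Gateway for each incoming request."""
    body = json.loads(event.get("body") or "{}")
    table.put_item(Item={"orderId": body["orderId"], "status": "received"})
    return {
        "statusCode": 200,
        "body": json.dumps({"orderId": body["orderId"], "status": "received"}),
    }
```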

[Diagram: cloud native app]

Practice everything "as code" – From the kick-off, we followed "-as-code" concepts to define all our configuration, infrastructure, security, compliance and even operations.  Each of these aspects is declaratively specified as code and version controlled in a Git repository.  We have all heard of infrastructure-as-code.  We followed it in practice: we can build and deploy our entire application and infrastructure stack from AWS CloudFormation (CFN) templates with a single click.  We can even deploy customized stacks for dev, qa, perf and production, all driven from a single CFN template configured differently for each environment.  This was a major achievement and a decision that saves us time by keeping all our environments consistent and repeatable.  Today, we have a CFN template for each microservice and a few common infrastructure CFN templates that describe not only the components but also configuration (such as the amount of memory), security (such as HTTPS/SSL and IAM permission roles) and compliance as code.  Leveraging the AWS ecosystem for all of this has vastly reduced our time to market, as everything works seamlessly and is well integrated.
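As an illustration of driving multiple environments from the same template, here is a small sketch that uses boto3 to create one CloudFormation stack per environment with different parameters.  The template file and parameter names are hypothetical, not our actual stack.

```python
# Sketch: deploying one CloudFormation template per environment with different
# parameters, so dev, qa, perf and production stay consistent and repeatable.
import boto3

cfn = boto3.client("cloudformation")

with open("orders-service.yaml") as f:  # hypothetical microservice CFN template
    template_body = f.read()

for env, memory in [("dev", "512"), ("qa", "512"), ("perf", "1024"), ("prod", "1024")]:
    cfn.create_stack(
        StackName=f"orders-service-{env}",
        TemplateBody=template_body,
        Parameters=[
            {"ParameterKey": "Environment", "ParameterValue": env},
            {"ParameterKey": "LambdaMemoryMB", "ParameterValue": memory},
        ],
        Capabilities=["CAPABILITY_IAM"],  # needed when the template creates IAM roles
    )
```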

Security, operations and monitoring in cloud services – Being a SaaS cloud service, we leveraged the security, operations, metering and monitoring frameworks of the AWS cloud platform and ensured that the architecture and development of our microservices were built for SaaS.  Security threat modeling and standardized AWS blueprints for NIST helped us save time and start with standardized, well-tested security such as VPCs, security groups and the IAM permission model.  Thinking about operations, health and monitoring while building SaaS components was a key realization we had as we moved into the third and fourth sprints.  For example, log messages containing "Exceptions" or other unique patterns can generate alerts sent to email or SMS.  AWS CloudWatch provides powerful alerting and log aggregation capabilities for all of this, and we used it extensively to monitor our app from log patterns and custom application metrics.  Finally, most of our operations is "app ops", since we own and manage almost no servers, no databases and no infrastructure.  Being truly serverless has eliminated an entire set of infrastructure concerns, and we are starting to see the benefits as we get into operations.  Even our operations folks are talking about application issues, not about rebooting servers or patching for security updates.
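The log-pattern alerting described above can itself be set up as code.  The sketch below, with hypothetical log group, namespace and SNS topic names, creates a CloudWatch Logs metric filter for "Exception" messages and an alarm that notifies an SNS topic wired to email/SMS.

```python
# Sketch: turning "Exception" log patterns into alerts with a CloudWatch Logs
# metric filter and a CloudWatch alarm that notifies an SNS topic.
import boto3

logs = boto3.client("logs")
cloudwatch = boto3.client("cloudwatch")

# Count log events containing "Exception" in the microservice's log group.
logs.put_metric_filter(
    logGroupName="/aws/lambda/orders-service",  # hypothetical log group
    filterName="ExceptionCount",
    filterPattern="Exception",
    metricTransformations=[{
        "metricName": "ExceptionCount",
        "metricNamespace": "OrdersService",
        "metricValue": "1",
    }],
)

# Alarm that fires when any exceptions appear within a 5-minute window.
cloudwatch.put_metric_alarm(
    AlarmName="orders-service-exceptions",
    Namespace="OrdersService",
    MetricName="ExceptionCount",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # hypothetical topic
)
```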

DevOps, Automation and Agility Kool-Aid – Just as with our AWS "all in" architecture decision, in our first week we picked a DevOps toolchain that a successful cloud company had used before and went "all in" with it.  We built our pipelines, one per microservice, with all the automation and stages from commit through dev, qa, staging and production.  Once we built our pipelines, we still had "red" all over them as many automated tests failed.  This is when we realized that culture plays a very important part in agile software development; just having the right technology doesn't cut it.  We indoctrinated a new set of developer responsibilities – owning not just the code but also the automation, the pipeline and the operation of that code in production.  This mindset shift is absolutely essential in moving to agile software delivery, and we can now push software changes to production a couple of times a week.  We are, of course, not Facebook or Amazon yet, but we have achieved a pace that would have been unimaginable in a traditional packaged software company.  We are also following 1-2 week sprints to support this.  We are learning fast that the more frequently you deliver to production, the lower the risk.

Pushing to production – Pushing to production is an activity we stumbled over a couple of times before learning how to get it right.  Our goal is a 30-60 minute production push that includes a quick discussion among developers, testers, product managers and release engineers about release readiness, followed by the push and a canary and synthetic test, after which we decide whether to continue or roll back.  This is a journey for us, and we continue to build the operational know-how and supporting tools that help us do smoother deployments and the occasional rollback.
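To give a flavor of the canary/synthetic step, here is a rough sketch of the kind of health check that drives the continue-or-rollback decision.  The endpoint, attempt count and failure threshold are illustrative only, not our actual tooling.

```python
# Sketch of a post-push canary check: hit a health endpoint a few times over
# the canary window and decide whether to continue the rollout or roll back.
import time
import urllib.request


def canary_ok(url="https://api.example.com/health", attempts=10, allowed_failures=1):
    failures = 0
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status != 200:
                    failures += 1
        except Exception:
            failures += 1
        time.sleep(30)  # spread the checks over the canary window
    return failures <= allowed_failures


if __name__ == "__main__":
    print("continue rollout" if canary_ok() else "roll back")
```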

Think production when developing a feature – Pushing to production also plays a central part in new features and capabilities.  Recently, we started thinking about how to roll out a major infrastructure change and a major schema change, considering that production holds huge amounts of data and that all pushes require zero downtime.  These considerations are now a regular part of every conversation about a new feature, not just an afterthought during a push to production.  Production scale and resiliency are other aspects to watch for continuously when building apps, since the scale factor with increasing customers will be quite large compared to typical on-prem packaged apps.  Resiliency is another critical consideration, since every call, message, server, service or infrastructure component can potentially fail or become slow; apps have to deal with this gracefully.  Finally, we build applications that follow the twelve-factor app principles.

Cost angle – One of the areas we manage very closely is the cost of AWS services.  We track daily and monthly bills and look for ways to optimize our usage of AWS resources.  Now, an architecture decision also requires a cost discussion, which we never had when we were building enterprise packaged mode 1 applications.  This is healthy for us, as it also drives innovation in putting together neat solutions that optimize not just performance, scale and resiliency but also cost.
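Tracking the daily bill can also be automated.  The sketch below uses the AWS Cost Explorer API to pull the last week of daily spend grouped by service; it is a minimal illustration, not our actual cost tooling.

```python
# Sketch: pulling the last week of daily AWS spend, grouped by service,
# using the Cost Explorer API.
import boto3
from datetime import date, timedelta

ce = boto3.client("ce")

end = date.today()
start = end - timedelta(days=7)

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

for day in resp["ResultsByTime"]:
    print(day["TimePeriod"]["Start"])
    for group in day["Groups"]:
        service = group["Keys"][0]
        amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
        print(f"  {service}: ${amount:.2f}")
```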

Startup small team thinking – Finally, we are a small two- to three-pizza team.  Having complete ownership of not just the software code but also all decisions about individual microservices, tools and language choices (we use Node.js, Java and very soon Python), and doing the right thing for the product, felt great and energized everyone on the team.  The startup feeling inside a traditional enterprise software firm gives all developers and testers the freedom to innovate, try new things and get things done.

That's it for now.  We are on a journey to become cloud native and are already pushing features to production on a weekly, and occasionally daily, schedule without sacrificing stability, safety or security.  In the second, follow-up blog, I will cover the Gartner debate on how the mode 2 learnings above can be applied to traditional mode 1 applications to make them more like cloud native mode 2 apps.


Shift-Left: Security and Compliance as Code

In part I of this blog, we focused on how development and operations can use 'Infrastructure-as-Code' practices for better governance, traceability, reproducibility and repeatability of infrastructure.  In this part II, we will discuss best practices for using 'Security-as-Code' and 'Compliance-as-Code' in cloud native DevOps pipelines to maintain security without sacrificing agility.

Traditionally, security is an afterthought.  Once a release is just about ready to be deployed, security vulnerability testing, pen testing, threat modeling and assessments are done using tools, security checklists and documents.  If critical or high vulnerabilities or security issues are found, the release is stopped from being deployed until they are fixed, causing weeks to months of delay.

The 'Security-as-Code' principle codifies security practices as code and automates their delivery as part of DevOps pipelines.  It ensures that security and compliance are measured and mitigated very early in the DevOps pipeline – at the use case, design, development and testing stages of application delivery.  This shift-left approach allows us to manage security just like code, by storing security policies, tests and results in the repository and the DevOps pipelines.

Let us break this down into the key best practices for achieving it.  As shown in the diagram below, there are five points in a pipeline where we can apply security and compliance as code practices.  We will use a cloud native application running on the AWS cloud to illustrate them.

[Diagram: security and compliance as code applied at five points in the DevOps pipeline]

 

a) Define and codify security policies: Security policies are defined and codified for any cloud application at the beginning of a project and kept in a source code repository.  As an example, a few AWS cloud security policies (v1.0) are defined below:

  • Security groups/firewalls are secured, e.g., no 0.0.0.0/0 ingress rules
  • VPC, ELB and NACL are enforced and secured
  • All EC2 and other resources must be in a VPC, and each resource is individually firewalled
  • All AWS resources (EC2, EBS, RDS, DynamoDB, S3) as well as logs are encrypted using KMS

Having security policies such as these in a document is the old world.  All such security policies need to be automated: by pressing a button, one should be able to evaluate the security policies for any application at any stage and in any environment.  This is the major shift that comes with the principle of security as code – everything about security is codified, versioned and automated instead of living in a "Word" or "PDF" document.  As shown in the diagram above, security as code is checked in as "code" in a repository and versioned.
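As a small illustration, the first policy above ("no 0.0.0.0/0 ingress rules") can be turned into an automated check that runs at any stage of the pipeline.  The sketch below uses boto3 to scan all security groups in an account and fail if any open ingress rule is found.

```python
# Sketch: evaluating the "no 0.0.0.0/0 ingress rules" policy as code.
import boto3

ec2 = boto3.client("ec2")

violations = []
for sg in ec2.describe_security_groups()["SecurityGroups"]:
    for rule in sg.get("IpPermissions", []):
        for ip_range in rule.get("IpRanges", []):
            if ip_range.get("CidrIp") == "0.0.0.0/0":
                violations.append(sg["GroupId"])

if violations:
    raise SystemExit(f"Open ingress rules found in security groups: {sorted(set(violations))}")
print("Security group ingress policy: PASS")
```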

Enterprises should build standardized security patterns to allow easy reuse of security across multiple organizations and applications.  Using standardized security templates (also as "code") results in out-of-the-box security, instead of each organization or application owner defining such policies and automation per team.  For example, if you are building a three-tier application, a standard cloud security pattern should be defined.  Similarly, for a dev/test cloud, another cloud security pattern should be defined and standardized.  Both hardened servers and hardened clouds are patterns that can yield security out of the box.

For example, on the AWS cloud, the five NIST AWS CloudFormation templates can be used to harden your networks, VPCs and IAM roles.  The cloud application can then be deployed on this secure cloud infrastructure, where these templates form the "hardened cloud" (blue) layer of the stack shown below.  For cloud or on-premise servers, server hardening can be done with products such as BMC BladeLogic.

[Diagram: hardened cloud layer of the stack]

b) Define security user stories and do a security assessment of the application and infrastructure architecture: Security-related user stories must be defined in the agile process just like any other feature stories.  This ensures that security is not ignored.

Security controls are identified as a result of security risk assessments and threat models of the application and infrastructure architecture.  These controls are then implemented in the application code or in policies.

 

c) Test and remediate security and compliance at application code check-in: Since frequent application changes are continuously deployed to production, security testing and assessment are automated early in the DevOps pipeline.  Security testing is triggered automatically as soon as a code check-in happens, for both application and infrastructure changes.  Security vulnerability tools are plentiful in the marketplace: Nessus, open source Chef recipes for security automation, web pen testing tools, static and dynamic code analysis, infrastructure scanners and others.

If critical or high vulnerabilities are found, automation creates defects/tickets in JIRA for any new security issues and assigns them to the owner of the microservice for resolution.
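A minimal sketch of this automation, assuming the open source "jira" Python client, a hypothetical "SEC" project and an illustrative finding format, might look like this:

```python
# Sketch: filing a JIRA defect automatically when a scan reports a critical or
# high vulnerability. Server, credentials, project key and the shape of the
# "finding" dictionary are hypothetical.
from jira import JIRA

jira = JIRA(server="https://jira.example.com", basic_auth=("pipeline-bot", "api-token"))


def file_security_defect(finding, service_owner):
    """Create a ticket for one vulnerability and assign it to the microservice owner."""
    return jira.create_issue(fields={
        "project": {"key": "SEC"},
        "summary": f"[{finding['severity']}] {finding['title']} in {finding['service']}",
        "description": finding["details"],
        "issuetype": {"name": "Bug"},
        "assignee": {"name": service_owner},
    })
```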

d) Test security and compliance of infrastructure-as-code artifacts: In the new cloud native world, infrastructure is represented as code, such as Chef recipes, AWS CFN templates and Azure templates.  Security policies are embedded in these artifacts as code, together with the infrastructure.  For example, IAM resources can be created as part of AWS CFN templates.

This infrastructure code needs to be tested for violations of security policies such as "no S3 buckets should be publicly readable or writable" or "do not use open security groups with 0.0.0.0/all traffic allowed".  This testing needs to be automated as part of the application delivery pipeline using security and compliance tools; many tools such as CloudCheckr, AWS Config and others can be used.  For regulatory compliance policy checks such as CIS and DISA on servers, databases and networks, the BMC BladeLogic suite of products can be used effectively to not just detect but also remediate compliance issues.
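As one example of such an automated policy check, the sketch below scans S3 bucket ACLs for public read/write grants and fails the pipeline if any are found.  It is a simple illustration; dedicated tools like those mentioned above cover far more policies.

```python
# Sketch: automating the "no S3 buckets should be publicly readable or
# writable" policy check against bucket ACLs.
import boto3

s3 = boto3.client("s3")
PUBLIC_GROUPS = {
    "http://acs.amazonaws.com/groups/global/AllUsers",
    "http://acs.amazonaws.com/groups/global/AuthenticatedUsers",
}

public_buckets = []
for bucket in s3.list_buckets()["Buckets"]:
    acl = s3.get_bucket_acl(Bucket=bucket["Name"])
    for grant in acl["Grants"]:
        if grant.get("Grantee", {}).get("URI") in PUBLIC_GROUPS:
            public_buckets.append(bucket["Name"])

if public_buckets:
    raise SystemExit(f"Publicly accessible buckets: {sorted(set(public_buckets))}")
print("S3 bucket ACL policy: PASS")
```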

e) Test and remediate security and compliance in production: As the application moves from dev to QA to production, automated security and compliance testing continues, although the risk of finding issues here is much lower since most of the security automation is done much earlier in the DevOps pipeline.  For mutable infrastructure, tools such as BMC BladeLogic can be used to scan production servers and apply security patches.  For immutable infrastructure, of course, new servers are created and replaced instead of being patched.

Conclusion

The shift-left approach to security and compliance is a key mindset change that comes with the DevOps pipeline.  Instead of security being a gating add-on process at the end of a release, or a periodic security scan of production environments, start with security at the beginning of an application release by representing it as "code" and baking it in.  We showed five practices for embedding security and compliance into application delivery: defining security policies as part of infrastructure-as-code; standardizing hardened clouds just like hardened servers; defining compliance policies as code; and automating testing in the early, lower environments and stages of a pipeline as well as in production.  We strongly believe that with these five practices, successful product companies can ensure security while maintaining agility and speed of innovation.

 

Product Trials – Managing 1000s of Machines on AWS Cloud

Most software companies want their products and services to be available for trials.  This allows customers to test drive the product or service before purchase, and gives marketing valuable leads for customer acquisition and campaigns.  Many companies also want to get feedback from customers on unreleased products through trials.  If you have a SaaS service or a hosted product, this is achieved by building a limited N-day trial into the service itself behind the registration page.  However, if you are in the business of delivering packaged software, it is not that easy.  With an automation engine, baked images of your product and a public cloud such as Amazon AWS or Azure, even packaged software companies can quickly jump-start a trial program in weeks.  We used a cloud management product's automation engine and the AWS cloud to put together trials of several of our products in less than 4 weeks.  Since going live a few months ago, we have successfully served over 500 trial requests and provisioned thousands of AWS EC2 machines with 99.99% reliability.

Cloud management platforms such as BMC Cloud Lifecycle Management (CLM) have been used extensively in many enterprise use cases to manage private, public and dev/test clouds and to automate provisioning and operations, resulting in cost savings and agility.  One cool new use case we recently applied CLM to is product trial automation.  Let me describe the basic flow:

[Diagram: try-and-buy trial provisioning flow]

  1. As soon as a trial user registers at the bmc.com trial page, a mail is sent to a mailbox.
  2. A provisioning machine running BMC Cloud Lifecycle Management retrieves the mail, interprets the request and starts the automation flow.  The automation uses the CLM SDK to make concurrent service request calls to provision resources for each trial user.
  3. Based on the service request in the mail, CLM starts provisioning 'product-1', 'product-2', … trial machines on the Amazon AWS cloud using service offerings and blueprints pre-defined in CLM that reference baked AMIs.  Each product trial has an associated service blueprint describing the environment as a collection of AMIs to provision, including target machines.  These service offering blueprints range from a simple single-machine application stack to a full-stack, multi-machine application environment.  Once the provisioning of the trial machines is complete, CLM automatically executes scripts and post-deploy actions as BladeLogic (BSA) jobs to properly configure the trial machines and all the targets, depending on the blueprint.
  4. As soon as the machines are provisioned and configured on the AWS cloud, a mail is sent back to the user indicating that the machines are ready, with URLs to log in and trial the product.
  5. At the end of the trial period, the trial machines are decommissioned automatically through BMC CLM.  This is extremely important to govern the sprawl of EC2 machines (see the sketch after this list).
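For illustration only (CLM handles decommissioning for us, and this does not use the CLM SDK), here is a sketch of what an expiry sweep could look like directly against EC2, assuming each trial machine carried a hypothetical "trial-expiry" date tag:

```python
# Sketch (not the CLM SDK): sweeping and terminating expired trial machines on
# raw EC2, assuming instances are tagged with a hypothetical "trial-expiry" date.
import boto3
from datetime import date

ec2 = boto3.client("ec2")

resp = ec2.describe_instances(
    Filters=[{"Name": "tag-key", "Values": ["trial-expiry"]},
             {"Name": "instance-state-name", "Values": ["running"]}]
)

expired = []
for reservation in resp["Reservations"]:
    for instance in reservation["Instances"]:
        tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
        if date.fromisoformat(tags["trial-expiry"]) < date.today():
            expired.append(instance["InstanceId"])

if expired:
    ec2.terminate_instances(InstanceIds=expired)
    print(f"Decommissioned expired trial machines: {expired}")
```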

Cloud lifecycle management products such as CLM, with their advanced automation, orchestration, blueprinting and governance capabilities, can be used to solve new and interesting automation use cases, such as product trials, that we hadn't envisioned earlier.  With over 500 trials and thousands of machines successfully provisioned and decommissioned on the Amazon cloud, we feel good that we are eating our own dog food, creating and using our own solutions to manage product trials.