To gather insights on the current and future state of how DevOps is scaling in the enterprise, we asked IT executives from 32 different companies to share their thoughts. We asked them, “What are some real-world problems being solved by scaling DevOps?” Here’s what they told us:
- The biggest problems we solve with DevOps are: 1) MTTI, mean time to innovation: how fast you’re able to come out with new services, features, and innovations. We automate the SDLC, building deployments, monitoring, tests, and checks into the delivery pipeline to deliver better-quality software faster. We’ve moved from two releases per year to a release every other week with sprints. New code changes can ship within an hour if necessary. 2) MTTR, mean time to repair or resolve: remediating as quickly as possible. We see an 80 to 95% reduction in the time required to fix problems in production by monitoring the production environment and automating the remediation. 3) Culture: DevOps means shared responsibility. There are no siloed teams; everyone works together to build things that support the business.
- Within our application portfolio, there are approximately 550 enterprise business applications. Approximately 30% of the applications are custom and cannot be procured from an ISV or rented from a SaaS provider. The need for an environment to build custom apps, combined with industry trends toward cloud and DevOps, was the driver for building our DevOps platform. Currently, there are 15 applications using our platform. More will come over the next two years as we scale up our custom app development in the DevOps teams.
- Our stack is all about high availability and security. We allow customers to quickly and efficiently move apps from development through deployment to prod, and if there are any changes or challenges, they can quickly roll back. We develop technology that secures that process, reducing malware and security risk.
- Get away from ad hoc sense-making (i.e., digging for the status or impact of work by hand). Automate the feature flagging solution and structure it in a repeatable way.
- Look at the full lifecycle as you roll out an app to ensure everything is running smoothly. You can look back, arrive at what the next inflection point is, and see where you need to add capacity. When you have a problem, we can alert you. We have data within the tool, along with a lot of automation and labels, to help you narrow the problem down to a very specific portion of the infrastructure or application and resolve it faster. DevOps has become very important. Delivery models have changed. You release several times a day on a SaaS platform. You need to know what’s deployed within your environment and what software changes have been made, and be alerted to images running with vulnerabilities. We cover the full lifecycle, from deployment to scaling to problem detection, root cause, and security posture. We can deal with different personas in different ways to close gaps in visibility and provide peace of mind with security.
- The question should be: What are some of the real-world scaling problems you are solving using DevOps? With a DevOps approach, there is no real difference between running the same service 10, 100, or 1,000 times. The bigger question is whether the services are architected in such a way that you can scale them based on actual needs. Getting back to the topic, the biggest benefit for us is that our infrastructure, systems, and services can make better use of the available resources and budget. As we are mostly limited to self-hosted hardware for our internal infrastructure, it is critical to use available resources efficiently, since we can’t simply add more capacity using cloud providers. At this moment, our DevOps transition gives us an opportunity to scale. In the near future, it will also let us scale by utilizing the public cloud, easily integrating it with our existing infrastructure to create a hybrid cloud that scales with our organization based on product and customer demands.
- Outages have become more pronounced. GitHub and YouTube have gone down. These are customers who want help. We see a 25 to 60% reduction in MTTR. Through automation, the learning process is accelerated. In SRE, recreating the failure is fully automated.
- As our engineering teams grew, so did the number of projects and services. This growth pushed our legacy CI/CD tool to its ceiling. Since CI/CD is so important in any DevOps foundation, we had to scale it first, before any other part of the infrastructure. Automated builds and deployments that worked a year or six months ago are no longer adequate, since new services and dependencies have been introduced. We needed to overhaul the fundamental design of how we build, package, and deploy to solve the current constraints as well as enable it to scale effectively later on. Throwing additional compute resources or hardware at the CI/CD tool will not make it run any faster, nor will it allow the tool to build and deploy new services on new operating systems. We had to redesign the whole CI/CD system to take into account the current set of requirements and build some flexibility into it to accommodate future changes. As a result of this effort, we are in a much better position to provide effective DevOps services and support to all of our teams.
- The real-world problems associated with scaling DevOps center on the shift in the data warehouse space over the past 5+ years. If you look at our customers of all sizes, data warehouses are now mission-critical. By automating the DevOps aspects, you de-risk the process of deploying the warehouse to an environment, since, as you know, humans make mistakes. The business needs to change the data warehouse at a rapid pace, which makes it necessary to adopt a DevOps approach in order to be successful.
- Most of the components in our stack are open source projects that release often. Our typical customer is an enterprise, where DevOps adoption is sometimes still nascent and people aren’t comfortable with making frequent changes to their infrastructure. Distributed systems can fail in complex ways, and over the years we have developed very sophisticated build and test tools that allow us to run complex distributed systems tests based on failure scenarios we’ve seen. For example, we can automatically spin up test clusters with hundreds of nodes and run simulated customer workloads on them over the span of multiple days. This sophisticated release process gives our customers the confidence that new releases will work in their environments. Recently, I’ve been seeing increasing demand to use DevOps practices to deploy machine learning (ML) systems into production. ML is a younger field than software engineering, and only a few leading companies in the space know the best practices for running production systems. Many DevOps tools and methodologies can be used to deploy machine learning models as well, with some tweaks and a few additional new tools.
- In the data protection space, customers face their greatest challenges in making sure their data is compliant, backed up, and hydrated in the event of a disaster. As part of the team that launched AWS Backup, our goal is to make this process simple for customers by providing policies and automation. As we scale to protect even more AWS services, being able to deliver well-tested code is an important goal for us. Given the number of regions we deploy to and the number of dependencies, automation is the only way we can ensure that we deploy to the satisfaction of all our dependencies. As a whole, Amazon has invested greatly in automation frameworks and continuous deployment practices.
- We have our own example of tackling scaling problems. We found ourselves in a situation where we needed to scale our workload dramatically: our database went from writing 300 million new data points a minute to as many as 1.5 billion data points per minute. The database scans trillions of stored events and metrics to deliver results when customers query their data. At the same time, we had to tackle another crucial aspect of scale: complexity. The platform is a sophisticated tool that includes more than 300 unique services, petabytes of SSD storage, and dozens of Agile teams performing 50 to 70 releases a day. Because of this, we had to change the way we were organized. Operating under a DevOps model was the best way for our engineering teams to enhance flexibility and autonomy while still adhering to the additional processes that come with growing teams.
- Impedance mismatch. Oftentimes DevOps culture, pre-scaling, happens at a team level only and is focused on Dev as the epicenter. In this situation, it is easier to collaborate (there is a smaller set of stakeholders) and easier to coordinate across functional silos. However, we have realized that while our engineering teams can now deliver software faster, ensuring we are always delivering the “right thing” and getting it into the customers’ hands in a usable way requires continuous collaboration and sharing of data and insights with all of the stakeholders, which in our case is everybody. In other words, if Dev and Ops are on an iterative, continuous cadence, but docs, marketing, support, and others aren’t, there is an impedance mismatch. We can only go as fast as the slowest point in the process. Accordingly, by scaling DevOps both up and out, we are aligning the organization on the same cadence and removing the impedance mismatch that slows the delivery of the “right thing” in time to drive real impact with the customer and the business.
- It’s still about removing from the deployment pipeline and scaling for speed and quality. One example is lifting and shifting workloads from legacy platforms to AWS. A large global insurer with 15,000+ employees had eight different IT environments, ranging from Windows to mainframe, and was spending $3 million per year on each environment. With our CI, automated test, and provisioning solutions, they moved to AWS, reduced infrastructure expense from $3 million to $120,000, reduced provisioning time from six weeks to two hours, and improved test time by 50%. The cost savings and optimization are incredible: they are recapturing value and putting it back into the business for development and to deliver more value.
- We help the customer implement the initial practice and move more toward NoOps. A social networking application started with a DevOps assessment to understand where they stood, including application design interactions. This turned into an assessment of their digital transformation initiative. The cultural piece was the most challenging. They were unable to scale fast enough to meet what end customers wanted and were not able to provide more futuristic features. We began with a proof of concept to show how a cultural change would affect the release of certain features. We chose the application with the most visibility, both for the consumer and internally. It took six months to get it into production. We showed the speed with which features were released and the SLAs achieved, and then started onboarding others. We created a platform for existing teams (the ops team, the business team) to come together and showed how it is done. We reskilled developers with regard to the cultural change: writing code, branching strategies, et al. Over the past year and 10 months, things have continued to run smoothly, and we have started automating pieces of infrastructure and quality so the team can have a self-service approach to the DevOps practice. They have integrated the knowledge gathered from event and data streams in the channels. They now have a playbook, combined with guidance on how people get acquainted with the DevOps practice.
- Test automation is one of the most neglected components of a DevOps strategy. One of our clients, TOTVS, is the largest software provider in all of South and Latin America, and they have been using our autonomous testing solution for over a year now. They’ve seen a 6x increase in QA productivity.
- 1) Sonar EMR, SAP for hospitals, had a strategic initiative to move to more cloud infrastructure. We solved the regression debt that comes with releasing frequently by helping with impact analysis and automated risk analysis, with auto-generated tests that quantify the risk of a release. This reduced the time to adopt new releases by months. 2) Nationwide (UK), a retail bank focused on digital with mobile banking, needed to respond quickly to user feedback. We helped with getting testing to the right level and with monitoring users live in production to know what devices they are using and what problems they are having, flagging issues that are causing consumer problems. This brought the customer into the development process by using data from customers to inform development.
- 1) Comcast has an advanced DevOps team of which security is a part. They can easily deploy across several hundred apps without changing how they write code. The time to do security testing is reduced, because it runs in the background while they do their normal testing. They are able to push code faster and see how security is improving over time. Projects are quickly working off their security backlogs of 40 to 50 vulnerabilities. 2) A large IT org with 5,600 developers needed to secure their applications. They deployed on all their apps and saw a massive reduction in vulnerabilities in six months versus five years. A scanner run across 1,000 apps can’t scale or plug into a DevOps pipeline. You can’t wait two hours for a scan when you’re getting code out in 14 minutes.
- One recent example would be the work we did for Your Call Football. The YCF application needed to scale to 100,000 concurrent users during the same window. Our Mission Managed DevOps implemented Terraform for infrastructure-as-code and used it to deploy Kubernetes as the container management and scaling platform.
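
Several respondents above cite percentage reductions in MTTR. As a rough illustration of how such a figure might be computed, here is a minimal Python sketch. The incident timestamps are hypothetical sample data, and the helper names `mean_time_to_resolve` and `percent_reduction` are our own, not taken from any respondent's tooling.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents):
    """Return the average (resolved - detected) duration as a timedelta.

    `incidents` is a list of (detected, resolved) datetime pairs.
    """
    durations = [resolved - detected for detected, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


def percent_reduction(before, after):
    """Return the percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100


# Hypothetical incidents before automating remediation: 4 hours and 2 hours.
before_incidents = [
    (datetime(2019, 1, 7, 9, 0), datetime(2019, 1, 7, 13, 0)),
    (datetime(2019, 2, 11, 14, 0), datetime(2019, 2, 11, 16, 0)),
]

# Hypothetical incidents after automating remediation: 30 and 60 minutes.
after_incidents = [
    (datetime(2019, 6, 3, 9, 0), datetime(2019, 6, 3, 9, 30)),
    (datetime(2019, 7, 8, 14, 0), datetime(2019, 7, 8, 15, 0)),
]

mttr_before = mean_time_to_resolve(before_incidents)
mttr_after = mean_time_to_resolve(after_incidents)
reduction = percent_reduction(mttr_before, mttr_after)

print(f"MTTR before: {mttr_before}")
print(f"MTTR after:  {mttr_after}")
print(f"Reduction:   {reduction:.0f}%")
```

With this sample data, the mean drops from 3 hours to 45 minutes, a 75% reduction. Real figures depend on how an organization defines when an incident is detected and when it counts as resolved.
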
Here’s who shared their insights: