As software development grows more complex, moving from monoliths to distributed systems, it is now essential to have a detailed checklist before we go live with a product.
The Story
Before we jump into the list, let me tell you a story from my experience.
Seeing how startups build their product with a limited number of developers is always interesting. I was part of a startup that didn't have a QA (Quality Analyst) initially. As a company, we were trying to move from a brick-and-mortar business to a digital company with our very first SaaS product. That is always a risk, because a failed first product is a serious early setback. The developer group ensured we tested each feature properly before going live. A few days before THE SPECIAL DAY, we drafted a detailed checklist to ensure we had considered every item needed to support our go-live, and we agreed to deploy to production only if all of them were marked positive. This process helped us deploy the application successfully, made the product a massive success for the organization, and let us support it as a monolith with 30+ developers and 3+ teams working on it for over 2 years. We used to call them production checklists.
Production Readiness Checklist
From OpsLevel,
Production readiness checklists can reduce the cognitive load of having to remember all the different vulnerability and failure points we need to consider in our complex landscape.
Put simply, the Production Readiness Checklist is the universal language your team uses to answer, "Is the product ready for launch?" The checklist may vary between domains or teams, and you can have different checklists for different tiers of an application. It is a good starting point for reviewing pending items before go-live.
Production Readiness Checklists can be opinionated. Below are my recommendations based on the nature of the products I have dealt with.
General
Ownership: Ownership of the service is clearly defined. This can be achieved by adding a CODEOWNERS file to your Git project or by assigning contributors to a repository.
Documentation: This can cover multiple items, such as API documentation, development steps, a README, onboarding instructions for the service and FAQs. Once the service is mature enough, any significant defect that is found can be documented in a troubleshooting guide, preferably kept in the repository.
SLA / SLO: Since this was a customer-facing SaaS product, it was highly important to provide a clear SLA (Service Level Agreement) to the customer and SLOs (Service Level Objectives) to the internal teams. Depending on the business, a violation of the SLA can affect the company's revenue, so it is important to commit to a realistic SLA rather than an optimistic one.
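One way to judge whether an SLA is realistic is to translate the availability target into the downtime it actually permits. The sketch below is only an illustration: the function name `allowed_downtime_minutes`, the 30-day window and the sample targets are assumptions made for this example, not values from our product.

```python
def allowed_downtime_minutes(availability_percent: float, window_days: int = 30) -> float:
    """Return the downtime, in minutes, that an availability target permits in the window."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = window_days * 24 * 60
    # The unavailable fraction of the window is whatever the target does not cover.
    return total_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Hypothetical targets, chosen only to show how quickly the allowance shrinks.
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days allows {minutes:.1f} minutes of downtime")
```

Under these assumptions, a 99.9% target leaves roughly 43 minutes of downtime per 30 days, which is a useful number to hold against how long your team really needs to detect and fix an incident.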
Data Management
Persistence: Set up resiliency so that the failure of a data source does not cause a loss of performance. The data source must have continuous backups and a one-click restore mechanism.
Scaling: Have a strategy to scale the data source under peak load.
Testing
Unit Tests: Unit tests are highly important because they give any service an additional safety net.
Integration Tests: When you work with multiple components inside a single repository, it is important to have integration tests covering the contracts between them. For example, we had a multi-module structure connecting the Controller, Service and Repository layers.
E2E Acceptance Test: End-to-end (E2E) tests are costly but extremely helpful when you are short on people. Before going live with our first release, we drafted a basic E2E test covering all of the business-critical components.
Operational Testing: There are a few optional business-level or customer-level tests that can be added as an extra safety check to avoid unpleasant circumstances. Tests for worst-case customer experience, load tests and sanity checks help make your product operation-ready.
It is recommended to execute these tests periodically on your CI/CD environments.
It is important to structure your automated tests well to get the most benefit from them. Here are a few articles that offer ideas based on industry best practices,
Deployment
Continuous Integration: Each repository must have its own automated pipeline. When an engineer pushes their changes, it runs the tests, builds, static code analysis, security checks, etc. Based on the scale of the application, the team can decide which automated steps they want in their pipeline.
Continuous Delivery: Depending on organization standards or repository needs, one may deploy from lower-level environments to higher environments. As a standard, CD should execute after the CI tasks (i.e. test, lint and build). Single-click deployments are quite popular and make it easy to release new changes and to roll an application back. A changelog and release notes are mandatory after each production deployment.
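The ordering rule above, where deployment runs only after the CI tasks pass, can be sketched as a tiny step runner. This is a simplified illustration rather than a replacement for a real CI/CD tool: `run_pipeline` and `shell_step` are hypothetical helpers, and the commands are placeholders standing in for a project's own test, lint, build and deploy tooling.

```python
import subprocess
import sys
from typing import Callable, Sequence


def run_pipeline(steps: Sequence[tuple[str, Callable[[], bool]]]) -> list[str]:
    """Run steps in order, stop at the first failure, and return the steps that passed."""
    passed = []
    for name, step in steps:
        if not step():
            print(f"step '{name}' failed; skipping the remaining steps")
            break
        passed.append(name)
    return passed


def shell_step(command: list[str]) -> Callable[[], bool]:
    """Wrap a command so the step passes only when it exits with status 0."""
    def run() -> bool:
        return subprocess.run(command).returncode == 0
    return run


if __name__ == "__main__":
    # Placeholder commands: each one just prints a message using the current interpreter.
    steps = [
        ("test", shell_step([sys.executable, "-c", "print('tests ok')"])),
        ("lint", shell_step([sys.executable, "-c", "print('lint ok')"])),
        ("build", shell_step([sys.executable, "-c", "print('build ok')"])),
        ("deploy", shell_step([sys.executable, "-c", "print('deployed')"])),
    ]
    print("passed:", run_pipeline(steps))
```

Because the runner stops at the first failing step, a broken test or lint run can never reach the deploy step, which is the property the checklist item is asking for.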
Operational Excellence
Escalation Strategy: Escalation strategy is a big topic in itself, but in short, there are a few things to set up for each product/service:
On-Call Rotation
Incident Management with practices for short-term and long-term remediation
Post Mortem of incidents
Runbooks: Runbooks are essential for understanding known failures and their quick remediations. Ensure they are up to date.
Observability: This covers multiple touchpoints:
Logging
Tracing
Metrics
Customer Impact / Operational Dashboard
SLO/SLA dashboard
Error Budget dashboard
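The figure behind an error budget dashboard can be computed from request counts. The sketch below assumes a request-based SLO (the share of requests that must succeed); the function name `error_budget_remaining` and the sample numbers are hypothetical and only illustrate the arithmetic.

```python
def error_budget_remaining(slo_percent: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget left: 1.0 is untouched, 0.0 is spent, negative is overspent."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the number of requests that are allowed to fail under the SLO.
    allowed_failures = total_requests * (1 - slo_percent / 100)
    if allowed_failures == 0:
        raise ValueError("a 100% SLO leaves no error budget to track")
    return 1 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests, 250 failures, 99.9% SLO.
    remaining = error_budget_remaining(99.9, 1_000_000, 250)
    print(f"error budget remaining: {remaining:.0%}")
```

In this example the SLO allows about 1,000 failed requests, so 250 failures leave roughly three quarters of the budget available.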
Security
Authentication/authorization: Ensure authentication and authorization strategies are in place for any public-facing API, regardless of whether it serves internal or external apps.
Secrets: It is recommended to use a secret management store to keep all secret information, such as database passwords. Recommendations are Vault, AWS Secrets Manager, etc.
Dependency Scan: Ensure all dependencies are up to date with their security fixes. My go-to tool is Dependabot; I have also used Snyk for dependency scanning and other security practices, and have had positive experiences.
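On the application side, the secrets item usually means reading secrets at runtime instead of hard-coding them. The sketch below assumes the secret store or deployment platform injects secrets into the process environment, which is one common pattern but not the only one; `read_secret`, `MissingSecretError` and the variable name `DB_PASSWORD` are hypothetical.

```python
import os
from typing import Mapping, Optional


class MissingSecretError(RuntimeError):
    """Raised when a required secret is absent or empty."""


def read_secret(name: str, environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the secret called `name`, failing fast if it is missing or empty."""
    env = os.environ if environ is None else environ
    value = env.get(name, "")
    if not value:
        # Fail at startup rather than later with a confusing connection error.
        raise MissingSecretError(f"secret '{name}' is not set")
    return value


if __name__ == "__main__":
    try:
        password = read_secret("DB_PASSWORD")  # hypothetical variable name
        print("DB_PASSWORD loaded (value not printed)")
    except MissingSecretError as error:
        print(f"startup aborted: {error}")
```

Failing fast on a missing secret, and never printing its value, keeps misconfiguration visible without leaking anything into logs.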
Many other items can be included, depending on the practices a team or organization follows. I would recommend structuring the checklist around the product and grouping items into one-time and recurring lists. As the system matures, the organization must update the checklist. In our case, we kept updating our touchpoints every time we learned something new and focused on an automation-first approach. This helped us ship our product with few defects, which earned the trust of our users.
I prefer to keep the checklist as close as possible to the source of the product, usually in the GitHub repository in Markdown format. This helps us build automation around it if needed.
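As an example of such automation, a small script can read the checklist and report which items are still open. The sketch below assumes the checklist uses GitHub-style task list syntax (`- [ ]` for pending, `- [x]` for done); the function names and the sample items are hypothetical.

```python
import re

# Matches lines such as "- [ ] item" or "* [x] item", with optional leading indentation.
TASK_PATTERN = re.compile(r"^\s*[-*]\s+\[( |x|X)\]\s+(.*)$")


def pending_items(markdown: str) -> list[str]:
    """Return the text of every unchecked task list item in the Markdown text."""
    pending = []
    for line in markdown.splitlines():
        match = TASK_PATTERN.match(line)
        if match and match.group(1) == " ":
            pending.append(match.group(2).strip())
    return pending


def is_ready(markdown: str) -> bool:
    """Return True when no unchecked items remain (also True if there are no items at all)."""
    return not pending_items(markdown)


if __name__ == "__main__":
    # Hypothetical checklist content, inlined so the example is self-contained.
    checklist = """
General
- [x] Ownership is clearly defined
- [x] SLA and SLO are agreed
Operational Excellence
- [ ] Runbooks are up to date
"""
    for item in pending_items(checklist):
        print(f"pending: {item}")
    print("ready for go-live" if is_ready(checklist) else "not ready for go-live")
```

A check like this could run in the pipeline before a production deployment, so the release is blocked while any item remains open.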