
Take the first step toward SRE with Cloud Operations Sandbox

January 22, 2021
Simon Zeltser

Developer Programs Engineer

Daniel Sanche

Developer Programs Engineer

At Google Cloud, we strive to bring Site Reliability Engineering (SRE) culture to our customers not only through training on organizational best practices, but also with the tools you need to run successful cloud services. Part and parcel of that is comprehensive observability tooling—logging, monitoring, tracing, profiling and debugging—which can help you troubleshoot production issues faster, increase release velocity and improve service reliability. 

We often hear that implementing observability is hard, especially for complex distributed applications whose services are written in different programming languages, deployed in a variety of environments, and subject to different operational costs, among many other factors. As a result, when migrating and modernizing workloads onto Google Cloud, observability is often an afterthought.

Nevertheless, being able to debug a system and gain insight into its behavior is essential for running reliable production systems. Customers want to learn how to instrument services for observability and implement SRE best practices using the tools Google Cloud has to offer, but without risking production environments. With Cloud Operations Sandbox, you can learn hands-on how to kickstart your observability journey and answer the question, “Will it work for my use case?”

Cloud Operations Sandbox is an open-source tool that helps you learn SRE practices from Google and apply them on cloud services using Google Cloud’s operations suite (formerly Stackdriver). Cloud Operations Sandbox has everything you need to get started in one click:

  • Demo service - an application built using a microservices architecture on a modern, cloud-native stack (a modified fork of the Online Boutique microservices demo app)

  • One-click deployment - automated script that deploys and configures the service to Google Cloud, including:

    • Service Monitoring configuration

    • Tracing with OpenTelemetry

    • Cloud Profiling, Logging, Error Reporting, Debugging and more

  • Load generator - a component that produces synthetic traffic on the demo service

  • SRE recipes - pre-built tasks that manufacture intentional errors in the demo app so you can use Cloud Operations tools to find the root cause of problems like you would in production

  • An interactive walkthrough to get started with Cloud Operations 

Getting started

Launching the Cloud Operations Sandbox is as easy as can be. Simply visit cloud-ops-sandbox.dev and start the one-click deployment.

This creates a new Google Cloud project. Within that project, a Terraform script creates a Google Kubernetes Engine (GKE) cluster and deploys a sample application to it. The microservices that make up the demo app are pre-instrumented with logging, monitoring, tracing, debugging and profiling as appropriate for each microservice’s language runtime. As a result, sending traffic to the demo app generates telemetry that is useful for diagnosing the cloud service’s operation. To generate production-like traffic to the demo app, an automated script deploys a synthetic load generator in a different geographic location than the demo app.
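To make the idea of synthetic traffic concrete, here is a minimal sketch of how a load generator might plan a simulated user session. It is not the Sandbox's load generator: the action list, the paths, the weights and the `plan_session` helper are all hypothetical, and the sketch only plans requests rather than sending them.

```python
import random

# Hypothetical user actions with relative weights (assumed for illustration,
# not taken from the Sandbox): browsing is far more common than checking out.
ACTIONS = [
    ("GET", "/", 5),
    ("GET", "/product", 10),
    ("POST", "/cart", 3),
    ("POST", "/cart/checkout", 1),
]


def plan_session(rng, length):
    """Pick a weighted random sequence of (method, path) pairs for one simulated user."""
    weights = [weight for _, _, weight in ACTIONS]
    picks = rng.choices(ACTIONS, weights=weights, k=length)
    return [(method, path) for method, path, _ in picks]


# A seeded generator makes the planned traffic reproducible between runs.
session = plan_session(random.Random(42), 5)
for method, path in session:
    print(method, path)
```

A real load generator would issue these requests against the demo app over HTTP and run many sessions concurrently; weighting the actions is what makes the resulting telemetry resemble production traffic rather than a uniform ping.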

https://storage.googleapis.com/gweb-cloudblog-publish/images/Terraform_script_creates_a_GKE_cluste.max-900x900.jpg

The provisioning script creates 11 custom dashboards (one for each microservice) to illustrate the four golden signals of monitoring (latency, traffic, errors and saturation) as described in Google’s SRE book.
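If the four golden signals are new to you, the sketch below shows one way they could be computed for a single service over one time window. It is an illustration only: the `Request` record, the `golden_signals` function, the choice of the 95th percentile, the treatment of 5xx responses as errors and the use of CPU as the saturation measure are all assumptions, not how the Sandbox dashboards are defined.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Request:
    latency_ms: float  # time taken to serve the request
    status: int        # HTTP status code returned


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests, window_s, cpu_used, cpu_limit):
    """Summarize one time window of traffic for a single service."""
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": percentile([r.latency_ms for r in requests], 95),
        "traffic_rps": len(requests) / window_s,   # requests per second
        "error_rate": errors / len(requests),      # fraction of failed requests
        "saturation": cpu_used / cpu_limit,        # fraction of CPU in use
    }


# Made-up sample data: four requests observed in a two-second window.
sample = [Request(20, 200), Request(35, 200), Request(50, 200), Request(400, 500)]
signals = golden_signals(sample, window_s=2, cpu_used=0.5, cpu_limit=2.0)
print(signals)
```

In practice these values come from the monitoring backend rather than from hand-rolled code, but seeing them side by side helps when reading a dashboard: one slow, failing request is enough to move both the tail latency and the error rate.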

https://storage.googleapis.com/gweb-cloudblog-publish/images/creates_11_custom_dashboards.max-1900x1900.jpg

It also adds and automatically configures uptime checks, service monitoring (SLOs and SLIs), log-based metrics, alerting policies and more.
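As a rough illustration of how an SLI relates to an SLO, the sketch below computes an availability SLI from request counts and the share of the error budget that remains. The function names, the request counts and the 99.5% target are assumptions chosen for the example; the SLOs the Sandbox configures may be defined differently.

```python
def availability_sli(good_requests, total_requests):
    """SLI: fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic in the window, so nothing failed
    return good_requests / total_requests


def error_budget_remaining(sli, slo_target):
    """Fraction of the error budget still unspent; negative once the SLO is missed."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo_target  # failure rate the SLO tolerates
    actual_failure = 1.0 - sli          # failure rate actually observed
    return 1.0 - actual_failure / allowed_failure


# Assumed numbers: 10,000 requests, 10 of them failed, against a 99.5% SLO.
sli = availability_sli(good_requests=9_990, total_requests=10_000)
remaining = error_budget_remaining(sli, slo_target=0.995)
print(f"SLI: {sli:.4f}, error budget remaining: {remaining:.0%}")
```

An alerting policy built on this idea would typically fire when the budget is being spent too quickly, rather than on every individual failure.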

https://storage.googleapis.com/gweb-cloudblog-publish/images/checkout_service.max-1200x1200.jpg

At the end of the provisioning script you’ll get a few URLs of the newly created project:

https://storage.googleapis.com/gweb-cloudblog-publish/images/provisioning_script.max-1100x1100.jpg

You can follow the user guide to learn about the entire Cloud Operations suite of tools, including how to track microservice interactions in Cloud Trace (thanks to the OpenTelemetry instrumentation of the demo app), and to see how to apply what you learn to your own scenario.
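Tracing across microservices works because each service passes a trace context along with its outgoing calls. The sketch below illustrates the idea using the W3C Trace Context `traceparent` header format, which OpenTelemetry supports. It is a simplified, standard-library-only illustration: the helper names and the frontend/checkout/payment call chain are hypothetical, and the demo app relies on OpenTelemetry libraries rather than code like this.

```python
import secrets


def new_traceparent():
    """Start a new trace: random 16-byte trace ID, 8-byte span ID, sampled flag set."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(incoming):
    """Header for an outgoing call: keep the trace ID, generate a fresh span ID."""
    version, trace_id, _parent_span_id, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def trace_id_of(traceparent):
    """Extract the trace ID, which is shared by every span in one request."""
    return traceparent.split("-")[1]


# Hypothetical call chain: frontend -> checkout -> payment.
frontend = new_traceparent()
checkout = child_traceparent(frontend)
payment = child_traceparent(checkout)

for name, header in [("frontend", frontend), ("checkout", checkout), ("payment", payment)]:
    print(f"{name:9s} traceparent: {header}")
```

Because all three headers carry the same trace ID, a tracing backend can stitch the spans from different services into a single end-to-end view of one request.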

Finally, to remove the Sandbox once you’re finished using it, you can run the cleanup command given in the user guide.

Next steps

Following SRE principles is a proven method for running highly reliable applications in the cloud. We hope that the Cloud Operations Sandbox gives you the understanding and confidence you need to jumpstart your SRE practice. 

To get started, visit cloud-ops-sandbox.dev, explore the project repo, and follow along in the user guide.
