Setting Up Your Open Source Observability Stack
Category
Engineering tools

Apr 2, 2024
10 min read

Introduction to Setting Up Your Open Source Observability Stack

Modern software systems are more distributed and dynamic than ever, making observability a critical component for ensuring reliability, performance, and scalability.

Observability goes beyond traditional monitoring by offering insights into the why behind system behaviors through metrics, logs, and traces. However, setting up an effective observability stack often involves significant costs when using proprietary tools.

Open-source observability solutions offer a cost-effective, flexible alternative. By leveraging tools like Prometheus, Grafana, Loki, and Jaeger, you can create a robust observability stack tailored to your infrastructure and application needs.

These tools provide the building blocks to collect, store, and analyze telemetry data, empowering teams to troubleshoot issues, optimize performance, and improve system reliability.

In this blog, we’ll guide you through the step-by-step process of setting up your own open-source observability stack. From configuring infrastructure-level metrics to selecting the right storage solutions for logs, metrics, and traces, this comprehensive guide covers everything you need to get started.

Let’s dive in and transform how you monitor and manage your systems!

💡 Pro Tip

While choosing the right monitoring tools is crucial, managing alerts across multiple tools can become overwhelming. Modern teams are using AI-powered platforms like Dr. Droid to automate cross-tool investigation and reduce alert fatigue.

Get a Sample Application Up & Running

The first step in building your open-source observability stack is setting up a sample application. This application will act as the foundation for testing, configuring, and validating your observability tools.

By simulating real-world scenarios, you can ensure your stack is optimally configured to handle production workloads.

How to Set Up a Sample Application

  • Choose a simple yet representative application that mirrors your actual production environment.
  • Ensure it includes components like APIs, databases, and background processes to generate meaningful telemetry data.

For a detailed walkthrough, refer to the Playground with Prometheus, Grafana, Loki, and k6 guide.

This blog provides step-by-step instructions to:

  1. Deploy a sample application in a containerized or virtualized environment.
  2. Configure basic telemetry outputs like logs, metrics, and traces.
  3. Integrate the sample application with Prometheus, Grafana, Loki, and k6 for end-to-end observability testing.

Here’s how it is explained in the blog:

  • Step 1: Provision a VM.
  • Step 2: Set up the observability environment:
    • Prometheus
    • Grafana
    • Loki
  • Step 3: Set up the microservice.
  • Step 4: Set up traffic simulation with k6.
  • Conclusion: Explore metrics in Grafana.
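
As a rough illustration of Step 2, a minimal Docker Compose file can bring up Prometheus, Grafana, and Loki side by side (the image tags, port mappings, and mounted config path here are assumptions; adjust them to your environment):

```yaml
# docker-compose.yml — minimal observability environment (sketch)
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"          # Prometheus UI and API
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"          # Grafana UI
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"          # Loki push/query API
```

Once the stack is up, Grafana (on port 3000) can be pointed at Prometheus and Loki as data sources.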

Why This Step Is Crucial

Here are some of the reasons this step matters:

  • Validate Observability Tools: A sample application helps you ensure that your observability stack is correctly collecting and processing data.
  • Test in a Controlled Environment: Identify and resolve configuration issues before applying them to a live system.
  • Understand Metrics and Logs: Familiarize yourself with how telemetry data is generated and processed.

With your sample application up and running, you’re ready to move on to the next step: simulating traffic to generate meaningful data for analysis.

Simulate Traffic

Once your sample application is up and running, the next step is to generate simulated traffic.

Simulating traffic helps create realistic workloads. It mimics production environments, providing the telemetry data necessary to test and validate your observability stack.

Why Simulating Traffic is Important

Here are key points that highlight why simulating traffic is crucial for building and validating your observability stack:

  • Realistic Testing: Simulated traffic replicates the types of requests and interactions your application might experience in a live setting.
  • Validate Observability Tools: This ensures that metrics, logs, and traces are being captured correctly by your stack.
  • Identify Bottlenecks: Traffic simulation helps uncover performance bottlenecks or areas where your observability tools may need optimization.

How to Simulate Traffic

  1. Use a Load Testing Tool
    • Tools like k6, Apache JMeter, or Locust can generate a variety of traffic patterns to simulate real-world usage.
    • Example with k6:
      • Write scripts to simulate different user behaviors, such as login requests, API calls, or database queries.
      • Run the scripts to produce consistent traffic over time.
  2. Vary Traffic Patterns
    • Generate bursts of traffic to mimic peak times.
    • Simulate different user actions, such as concurrent logins or large file uploads, to test specific application components.
  3. Integrate Traffic Simulation with Observability Tools
    • Ensure your metrics, logs, and traces reflect the simulated traffic.
    • Use dashboards in Grafana or similar tools to visualize how your application handles the load.
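
To make the steps above concrete, here is a minimal, dependency-free Python sketch of a burst-style load generator. The function and parameter names are illustrative, not part of k6, JMeter, or Locust; in practice you would point one of those tools at your service.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_burst(send_request, users=5, requests_per_user=10):
    """Fire `users` concurrent workers, each issuing `requests_per_user`
    requests, and collect per-request latencies plus an error count."""
    def worker(_):
        results = []
        for _ in range(requests_per_user):
            start = time.perf_counter()
            try:
                send_request()  # e.g. an HTTP call to your sample app
            except Exception:
                results.append(None)  # record a failure
            else:
                results.append(time.perf_counter() - start)
        return results

    latencies, errors = [], 0
    with ThreadPoolExecutor(max_workers=users) as pool:
        for batch in pool.map(worker, range(users)):
            for r in batch:
                if r is None:
                    errors += 1
                else:
                    latencies.append(r)
    return latencies, errors
```

Swapping `send_request` for functions that hit different endpoints lets you mimic the varied user actions described above (logins, API calls, uploads).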

Key Metrics to Monitor During Traffic Simulation

  • Response Times: Measure how quickly the application responds under varying loads.
  • Error Rates: Monitor HTTP errors, exceptions, or failed transactions.
  • Throughput: Evaluate how many requests your application handles per second.
  • Infrastructure Metrics: Track CPU usage, memory consumption, and network performance.
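
Given a list of observed latencies and an error count, the key numbers above can be reduced to a summary with a few lines of stdlib Python (a sketch; the function and key names are illustrative):

```python
def summarize(latencies, errors, duration_s):
    """Reduce raw samples to throughput, error rate, and p95 latency."""
    total = len(latencies) + errors
    if total == 0:
        return {"throughput_rps": 0.0, "error_rate": 0.0, "p95_s": None}
    ordered = sorted(latencies)
    # nearest-rank p95: index into the sorted sample
    p95 = ordered[int(0.95 * (len(ordered) - 1))] if ordered else None
    return {
        "throughput_rps": total / duration_s,
        "error_rate": errors / total,
        "p95_s": p95,
    }
```

In a real setup these aggregations are computed by Prometheus queries rather than by hand, but the arithmetic is the same.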

The Outcome of Simulated Traffic

  • Gain confidence that your observability stack captures accurate and actionable telemetry data.
  • Validate the performance and scalability of both your application and monitoring tools.
  • Identify any gaps or misconfigurations in your observability stack before moving to production.

Simulating traffic ensures that your observability stack is ready for the demands of a live system, setting the stage for the next steps in instrumentation and monitoring.

Infrastructure-Level Metrics Instrumentation

Infrastructure-level metrics provide critical insights into the performance and health of your underlying systems, such as servers, containers, and orchestration platforms.

These metrics form the backbone of any observability stack, ensuring you can monitor resource utilization, detect anomalies, and maintain system reliability.

Prometheus, a widely used open-source monitoring tool, provides exporters and collection features for virtually every type of infrastructure.

Below are the key components you can use based on your stack configuration:

1. Node Exporters: Monitoring Host-Level Metrics

  • Purpose: Node exporters collect metrics from physical or virtual machines at the host level.
  • Metrics Captured:
    • CPU usage, memory utilization, disk I/O, network traffic, and file system performance.
  • Use Case: Ideal for monitoring individual servers or virtual machines.
  • Setup:
    • Install the Prometheus Node Exporter agent on each host.
    • Configure Prometheus to scrape metrics from the exporter.
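
A minimal `prometheus.yml` scrape job for Node Exporter might look like the sketch below (the target hostname is an assumption; 9100 is Node Exporter's default port):

```yaml
scrape_configs:
  - job_name: "node"
    scrape_interval: 15s
    static_configs:
      - targets: ["my-host:9100"]  # Node Exporter's default port
```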

2. Container-Based Collectors: Monitoring Docker Containers

  • Purpose: These collectors focus on metrics specific to containerized applications running on Docker.
  • Metrics Captured:
    • Container resource usage (CPU, memory), container lifecycle events, and storage utilization.
  • Use Case: Essential for environments running Dockerized applications.
  • Setup:
    • Use tools like cAdvisor or native Prometheus integrations to scrape container metrics.
    • Integrate with Prometheus to track metrics across multiple containers.
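
As a sketch, cAdvisor can run alongside your containers via Docker Compose so Prometheus can scrape it on its default port 8080 (the mounts below follow cAdvisor's commonly documented defaults; verify them against your Docker setup):

```yaml
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    ports:
      - "8080:8080"              # metrics endpoint for Prometheus to scrape
    volumes:
      - /:/rootfs:ro             # read-only host filesystem access
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
```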

3. Kube Prometheus: Observability for Kubernetes Environments

  • Purpose: Kube Prometheus is a comprehensive monitoring solution for Kubernetes clusters.
  • Metrics Captured:
    • Cluster-wide resource usage, pod-level performance, service uptime, and Kubernetes control plane health.
  • Use Case: Designed specifically for Kubernetes environments with multiple nodes and services.
  • Setup:
    • Deploy the Kube Prometheus stack in your Kubernetes cluster.
    • Leverage built-in Kubernetes metrics and Prometheus integrations for seamless observability.

Also, read more about the Kube Prometheus stack with the guide “Simplify Kubernetes Monitoring: Kube-Prometheus-stack Made Easy”.

How to Choose the Right Instrumentation Agent

  • For Traditional Servers: Use Node Exporters to monitor physical or virtual machines.
  • For Containerized Applications: Use container-based collectors like cAdvisor to track Docker containers.
  • For Orchestrated Workloads: Use Kube Prometheus for Kubernetes clusters to gain full visibility into your container orchestration.

By implementing the appropriate instrumentation agents, you can ensure comprehensive infrastructure-level observability. This foundational layer of monitoring enables proactive system management and sets the stage for application-level instrumentation.

Application-Level Instrumentation

While infrastructure-level instrumentation focuses on the health and performance of your systems, application-level instrumentation dives deeper into your applications' behavior.

By capturing metrics, logs, and traces, you can gain detailed insights into how your code performs, identify bottlenecks, and troubleshoot issues efficiently.

OpenTelemetry: A Standard for Application-Level Observability

OpenTelemetry is an open-source observability framework that simplifies the collection of application-level telemetry data, including:

  • Metrics: Application-specific performance metrics, such as request rates, latency, and throughput.
  • Logs: Event-specific data to understand application behavior and debug errors.
  • Traces: Distributed traces to map the flow of requests across different application components.
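
To build intuition for what a trace span carries, here is a deliberately simplified pure-Python sketch. This illustrates the concept only; it is not the OpenTelemetry API, and in practice you would use the language-specific OpenTelemetry SDK.

```python
import time
import uuid
from contextlib import contextmanager

spans = []  # in a real SDK, an exporter ships finished spans to a backend

@contextmanager
def span(name, trace_id=None, parent_id=None):
    """Record a named unit of work with timing and trace lineage."""
    record = {
        "name": name,
        "trace_id": trace_id or uuid.uuid4().hex,  # new trace if not given
        "span_id": uuid.uuid4().hex,
        "parent_id": parent_id,
        "start": time.time(),
    }
    try:
        yield record
    finally:
        record["duration_s"] = time.time() - record["start"]
        spans.append(record)

# Nested spans share a trace_id, letting a backend reassemble the request flow
with span("handle_request") as parent:
    with span("db_query", trace_id=parent["trace_id"],
              parent_id=parent["span_id"]):
        pass  # both spans now belong to the same trace
```

The parent/child linkage shown here is exactly what lets tools like Jaeger reconstruct a request's path across services.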

Want to know more about tracing? Watch this video for more information!

Why Use OpenTelemetry?

  • Language Agnostic: Supports a wide range of programming languages.
  • Standardized Telemetry: Ensures compatibility with various backends like Prometheus, Grafana, and Jaeger.
  • Easy Integration: Offers SDKs and APIs for seamless instrumentation.

Read more here about OpenTelemetry with the guide “Beginner’s Guide to OpenTelemetry”.

Also, read more about the “Core components of the OpenTelemetry open-source project” here!

Prometheus APM Agents: Application Monitoring Simplified

Prometheus APM (Application Performance Monitoring) agents, typically the official Prometheus client libraries, enable you to monitor application performance metrics efficiently. They are available for all popular languages and frameworks, such as:

  • Python, Java, Go, Node.js: Monitor request latency, error rates, and throughput.
  • Web Frameworks (e.g., Flask, Spring Boot): Track framework-specific metrics like handler execution times.

Benefits of Prometheus APM Agents:

  • Native integration with Prometheus, ensuring efficient metric collection.
  • Lightweight and customizable, allowing you to focus on key performance indicators.
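
Under the hood, a Prometheus client library ultimately exposes metrics as plain text on an HTTP endpoint. The sketch below renders a counter in that text exposition format using only the stdlib; it is illustrative, not the actual client-library implementation.

```python
def render_counter(name, help_text, values):
    """Render a counter family in Prometheus text exposition format.
    `values` maps label sets (as tuples of (key, value) pairs) to counts."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in values.items():
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

# Example: one labeled sample of a request counter
text = render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    {(("method", "GET"), ("code", "200")): 42},
)
```

Prometheus periodically scrapes this text from your application's `/metrics` endpoint, which is why the agents stay so lightweight.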

Read more about Prometheus APM Agents with the guide “Introducing Prometheus Agent Mode, an Efficient and Cloud-Native Way for Metric Forwarding”.

How to Implement Application-Level Instrumentation:

  1. Install the Required SDKs or Agents:
    • For OpenTelemetry: Install the language-specific SDK and configure your application to export metrics, logs, and traces.
    • For Prometheus APM: Use the appropriate client library for your programming language.
  2. Define Key Metrics and Traces:
    • Identify metrics that are critical to your application, such as request latency, memory usage, and database query performance.
    • Define spans and trace IDs to map the flow of requests across your application.
  3. Integrate with Observability Tools:
    • Export data to Prometheus, Jaeger, or Grafana for visualization and analysis.
    • Combine application-level telemetry with infrastructure metrics for a complete observability view.

Outcome of Application-Level Instrumentation:

  • Gain deep visibility into your application’s performance and behavior.
  • Detect and resolve issues faster with comprehensive telemetry data.
  • Optimize application performance by identifying bottlenecks and inefficiencies.

By incorporating application-level instrumentation using OpenTelemetry and Prometheus APM agents, you can build a robust observability stack that provides end-to-end visibility across your systems and applications.

Set Up the Observability Storage Layer

The storage layer is the backbone of your observability stack, holding all the collected telemetry data, including logs, metrics, and traces. Choosing the right storage solutions ensures optimal performance, scalability, and cost efficiency.

Here’s a breakdown of the options available for each type of telemetry data:

Logs: Capturing and Querying Event Data

  1. Clickhouse
    • How to Set It Up:
      • Deploy Clickhouse as a storage backend and configure it to accept logs from your log aggregation tools (e.g., Fluentd or Vector).
      • Use SQL queries to analyze and retrieve log data.
    • When to Use:
      • Best for high-speed log storage with low storage costs.
      • Ideal for environments with high log ingestion rates and real-time querying needs.
  2. Elasticsearch/OpenSearch
    • Purpose:
      • Flexible log indexing and querying solutions that support full-text search and advanced analytics.
    • When to Use:
      • Suitable for use cases requiring robust search capabilities and structured log analysis.
      • Ideal for large-scale log environments with complex query requirements.
  3. Loki
    • Purpose:
      • Lightweight, cost-efficient log storage specifically optimized for Kubernetes environments.
    • When to Use:
      • Ideal for environments with containerized workloads and simple log querying needs.
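
As one example of feeding Loki, a minimal Promtail config can tail local files and push them to Loki (the Loki URL, file path, and label names below are assumptions; consult the Promtail documentation for your setup):

```yaml
server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml   # where Promtail remembers read offsets
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: app-logs
    static_configs:
      - targets: [localhost]
        labels:
          job: app
          __path__: /var/log/app/*.log   # files to tail
```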

Metrics: Storing and Querying Time-Series Data

  1. Prometheus
    • Purpose:
      • The de facto standard for time-series metrics collection and querying.
    • When to Use:
      • Best for small to medium-scale environments.
      • Suitable for monitoring infrastructure-level metrics and short-term retention.

Read more about Prometheus with this guide!
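
As a concrete example of tuning Prometheus for its short-term-retention role, retention is set at startup (e.g. `--storage.tsdb.retention.time=15d`), and recording rules can pre-aggregate hot queries. A minimal rule file might look like this (the metric and rule names are illustrative):

```yaml
groups:
  - name: request_rates
    rules:
      - record: job:http_requests:rate5m
        expr: rate(http_requests_total[5m])
```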

  2. Clickhouse
    • Purpose:
      • Efficient for large-scale, scalable metric storage with real-time querying capabilities.
    • When to Use:
      • Ideal for setups that require long-term metric storage with high ingestion rates.
  3. Mimir
    • Purpose:
      • A centralized solution for managing multiple Prometheus instances.
    • When to Use:
      • Best for multi-cluster environments where consolidating metrics into a single view is essential.

Want to read more about Mimir? Click here!

  4. VictoriaMetrics
    • Purpose:
      • High-performance storage optimized for large-scale, high-frequency metrics.
    • When to Use:
      • Suitable for enterprises managing millions of metrics across distributed systems.

All you need to know about VictoriaMetrics is here!

Traces: Storing and Analyzing Distributed Request Flows

  1. Clickhouse
    • Purpose:
      • High-speed, scalable trace storage capable of handling large volumes of trace data.
    • When to Use:
      • Best for cost-effective, long-term storage of distributed traces with real-time querying.

You can learn more about Clickhouse here!

  2. Elasticsearch
    • Purpose:
      • Stores distributed traces alongside logs for unified analysis.
    • When to Use:
      • Ideal for setups where the correlation between logs and traces is critical.

Read more about Elasticsearch here!

  3. Jaeger
    • Purpose:
      • A dedicated tool for tracing distributed systems, offering robust visualization and analysis.
    • When to Use:
      • Best for teams focused on tracing and debugging microservices architectures.

Want to know more about Jaeger? Watch this video!

  4. Tempo
    • Purpose:
      • A simplified, scalable tracing solution from Grafana.
    • When to Use:
      • Ideal for environments that prioritize trace storage scalability and easy Grafana integration.

Read more here about Grafana Tempo.

Setting up the right observability storage layer is essential for ensuring your stack can handle the demands of your system’s telemetry data.

By selecting storage solutions tailored to your logs, metrics, and traces, you can achieve a balance between performance, scalability, and cost efficiency. With the storage layer in place, you’ll be equipped to visualize, analyze, and act on insights from your observability stack effectively.

If you’re interested in learning more, check out this insightful YouTube video for additional details.

Conclusion: Building an Efficient Open Source Observability Stack

Setting up an open-source observability stack empowers you to monitor and optimize your systems effectively while maintaining cost efficiency. By leveraging tools like Prometheus, Grafana, Loki, and Jaeger and following a structured approach to instrumentation and storage, you can achieve end-to-end observability tailored to your infrastructure.

However, managing and scaling observability can still be a complex process. This is where Doctor Droid can make a difference. With its intelligent Playbooks, Doctor Droid simplifies workflows by automating incident response, reducing noise, and providing actionable insights.

Take your observability efforts to the next level with Doctor Droid Playbooks.

Learn more at Doctor Droid Playbooks and start optimizing your workflows today.
