HLD014 - Observability

Revision    Date          Description
1.0         27.08.2024    Init document

Introduction

This document outlines the high-level design of the observability system, which collects metrics, traces, and logs. The related low-level designs are:

  • LLD016-001 - Log collection

  • LLD016-002 - Metrics and traces

  • LLD016-003 - Data visualization

Background

To establish a robust system for collecting and storing logs, metrics, and traces from various services in a redundant manner, it is necessary to integrate multiple components that collectively contribute to modifying, sending, or receiving the data.

Architecture diagram

HLD014-O-01.png

This diagram gives an overview of the architecture of the observability system implementation.

Explanation

  • Services and workloads generate observability data (metrics, traces and logs) in all AWS accounts.

  • Traces are forwarded to Dynatrace, an external service in which developers can analyse issues and the performance of monitored applications.

  • Logs and metrics are forwarded to a central monitoring account, where metrics are stored in an AWS Managed Prometheus Workspace and logs in AWS OpenSearch Service.

  • Metrics will be presented via AWS Managed Grafana dashboards and logs collected in OpenSearch will be queryable via OpenSearch Dashboards.
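The routing described in the list above can be summarised as a small lookup table. This is only an illustrative sketch: the `SIGNAL_ROUTES` mapping and the `route` helper are hypothetical names introduced here, not part of the design.

```python
# Illustrative only: where each signal type is stored and viewed,
# following the explanation above.
SIGNAL_ROUTES = {
    "traces": {"store": "Dynatrace", "ui": "Dynatrace"},
    "metrics": {"store": "AWS Managed Prometheus", "ui": "AWS Managed Grafana"},
    "logs": {"store": "AWS OpenSearch", "ui": "OpenSearch Dashboards"},
}


def route(signal: str) -> dict:
    """Return the storage backend and user interface for a signal type."""
    try:
        return SIGNAL_ROUTES[signal]
    except KeyError:
        raise ValueError(f"unknown signal type: {signal!r}") from None
```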

Observability services employed in infrastructure

AWS OpenSearch

AWS OpenSearch Service is a fully managed search and analytics service that makes it easy to deploy, secure, and scale OpenSearch clusters on the AWS cloud. It is built on the open-source OpenSearch project, a fork of Elasticsearch 7.10, so familiar Elasticsearch OSS APIs, plugins, and tools largely carry over, with the added benefits of AWS's managed service offerings. Key features and integrations of AWS OpenSearch Service include:

OpenSearch and Elasticsearch Compatibility

AWS OpenSearch Service is compatible with the Elasticsearch OSS API, so many existing Elasticsearch applications, libraries, and tools continue to work. Besides OpenSearch versions, it supports legacy Elasticsearch OSS versions up to 7.10, and it offers familiar features such as full-text search, aggregations, and geospatial search.
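As an illustration of the full-text search and aggregation features mentioned above, a search request body might be built as in the sketch below. The field names `message` and `service.keyword` are assumptions made for this example; the design does not define an index mapping.

```python
import json


def build_error_search(term: str, size: int = 10) -> dict:
    """Build a query DSL body: a full-text match plus a terms aggregation."""
    return {
        "size": size,
        # Full-text search on a hypothetical "message" field.
        "query": {"match": {"message": term}},
        # Count matching documents per value of a hypothetical keyword field.
        "aggs": {"by_service": {"terms": {"field": "service.keyword"}}},
    }


body = json.dumps(build_error_search("timeout"))
```

A body like this would typically be sent to the `_search` endpoint of an index.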

Security and Access Control

OpenSearch Service integrates with AWS Identity and Access Management (IAM) for fine-grained access control. You can define IAM policies to control who can access and perform operations on the OpenSearch cluster. It also supports encryption at rest and in transit to ensure data security.

Data Ingestion and Analysis

OpenSearch Service provides various methods to ingest data into the cluster, including the Elasticsearch-compatible APIs, Logstash, and AWS integrations such as AWS Lambda and Amazon Kinesis Data Firehose. It enables real-time analysis, visualization, and monitoring of data using OpenSearch Dashboards (the successor to Kibana), which is tightly integrated with OpenSearch Service.
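For API-based ingestion, documents are commonly sent in batches through the `_bulk` endpoint, whose body is newline-delimited JSON: one action line followed by one document line per record. The sketch below only builds that body; the index name and the document fields are assumptions for the example.

```python
import json


def to_bulk_body(index: str, docs: list[dict]) -> str:
    """Serialise documents into the newline-delimited body of a _bulk request."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(doc))                           # document line
    # The bulk body must be terminated by a newline.
    return "\n".join(lines) + "\n"
```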

Key components

  • OpenSearch Cluster - The OpenSearch cluster is the core component of the service. It is a distributed collection of nodes that work together to store and index data, perform search operations, and execute analytics queries. The cluster can scale horizontally to handle large amounts of data and high query loads.

  • OpenSearch Domain - An OpenSearch domain is a logical container for an OpenSearch cluster. It provides a dedicated environment for your search and analytics workloads. Each domain is associated with a unique domain name and configuration settings.

  • Elasticsearch API - OpenSearch Service is compatible with the Elasticsearch OSS API, which means you can use many Elasticsearch client libraries, APIs, and tools to interact with the service. The API provides a rich set of functionalities for data ingestion, querying, indexing, and analytics.

  • OpenSearch Dashboards - OpenSearch Dashboards is an open-source data visualization and exploration tool based on Kibana that integrates with OpenSearch Service. It provides a web-based interface for creating interactive dashboards, visualizing data, and performing ad-hoc searches. OpenSearch Dashboards allows you to gain insights from your indexed data and build custom visualizations and reports.

  • Access Policies - Access policies in OpenSearch Service define who can access and perform operations on the cluster. AWS Identity and Access Management (IAM) is used to manage access control policies. You can define fine-grained permissions to control actions such as reading, writing, or modifying the data in the cluster.
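A domain access policy of the kind described above could look like the following sketch. The region, account ID, domain name, and role ARN passed in are placeholders chosen by the caller; nothing here reflects the actual accounts or roles of this design.

```python
import json


def domain_access_policy(region: str, account_id: str, domain: str,
                         role_arn: str, read_only: bool = True) -> dict:
    """Build a resource-based policy for an OpenSearch Service domain."""
    # Permissions are granted per HTTP method on the domain endpoint.
    actions = ["es:ESHttpGet", "es:ESHttpHead"]
    if not read_only:
        actions += ["es:ESHttpPost", "es:ESHttpPut", "es:ESHttpDelete"]
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": role_arn},
            "Action": actions,
            "Resource": f"arn:aws:es:{region}:{account_id}:domain/{domain}/*",
        }],
    }


# Placeholder values, for illustration only.
policy_json = json.dumps(
    domain_access_policy("eu-west-1", "111122223333", "central-logs",
                         "arn:aws:iam::111122223333:role/log-reader"),
    indent=2,
)
```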

AWS Managed Prometheus

AWS Managed Prometheus is a fully managed and scalable monitoring service offered by AWS. It is compatible with the Prometheus ecosystem and provides a serverless architecture. You can collect, store, query, and visualize metrics data from various sources. Integration with AWS services and seamless scalability make it an efficient solution for monitoring and analyzing metrics in AWS environments.

Key components

  • Prometheus Server - The Prometheus server is responsible for scraping and storing time-series data. It pulls metrics from various targets using configured scrape jobs and stores them in a local time-series database.

  • Prometheus Alertmanager - The Alertmanager component handles alerts generated by Prometheus. It allows you to define and manage alert rules, deduplicate and group alerts, and route them to different receivers (such as email, PagerDuty, or custom webhook).

  • Prometheus Exporters - Exporters are software components that expose metrics in a format that Prometheus can scrape. There are many official and community-maintained exporters available, allowing you to collect metrics from different systems, applications, and services.

  • PromQL - Prometheus Query Language (PromQL) is a powerful query language used to retrieve and manipulate metrics stored in Prometheus. It enables you to perform advanced queries, aggregations, and calculations to gain insights from the collected metrics.
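To make the exporter and PromQL entries above more concrete, the sketch below renders one metric in the Prometheus text exposition format, which is what an exporter serves for scraping. The metric name and labels are made up for the example, and label values are not escaped, so this is a simplification rather than a complete implementation.

```python
def render_metric(name, help_text, metric_type, samples):
    """Render a metric family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        # Sort labels so the output is deterministic.
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


text = render_metric(
    "http_requests_total", "Total HTTP requests.", "counter",
    [({"method": "get", "code": "200"}, 1027)],
)
```

Once scraped, such a counter could be queried with PromQL, for example `sum by (code) (rate(http_requests_total[5m]))` for the per-status-code request rate over five minutes.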

Dynatrace

Dynatrace is a comprehensive, full-stack observability platform designed to provide deep insights into the performance, availability, and user experience of complex, dynamic IT environments. It offers a wide range of monitoring capabilities across applications, infrastructure, and user interactions to enable efficient troubleshooting, optimization, and proactive performance management.

AWS Managed Grafana

AWS Managed Grafana is a fully managed and scalable data visualization service offered by Amazon Web Services (AWS). It allows users to create interactive dashboards and analyze metrics data without the need to manage underlying infrastructure. With seamless integration into the AWS ecosystem, AWS Managed Grafana simplifies the process of visualizing and gaining insights from data.

Data Visualization

Grafana provides a rich set of visualization options, including interactive graphs, charts, tables, heatmaps, and more. It allows you to explore and analyze your data in real-time, zoom in on specific time ranges, and apply various transformations and aggregations.

Alerting and Notification

Grafana supports alerting based on defined conditions and thresholds. You can create alert rules and configure notification channels to receive alerts through various means such as email, Slack, PagerDuty, or other custom webhooks.

Data Source Integration

Grafana can connect to and fetch data from different data sources, including databases, monitoring systems (such as Prometheus or CloudWatch), log aggregators, and time-series databases. This flexibility allows you to centralize and visualize data from various sources in a unified dashboard.

User Management and Permissions

Grafana provides user authentication and authorization mechanisms. You can configure access control for individual users or groups, defining their permissions to view, create, or modify dashboards and data sources.

Dashboard Sharing and Collaboration

Grafana allows you to share dashboards and collaborate with others. You can create read-only dashboards or publish them to specific user groups or the public. Users can also annotate dashboards with comments, share links, or export snapshots.

Key components

  • Grafana Workspace - The AWS Managed Grafana Workspace is the core component responsible for serving the Grafana user interface, handling user authentication, and managing data sources and dashboards.

  • Data Sources - Grafana supports a wide range of data sources, including popular databases like Prometheus, Elasticsearch, InfluxDB, MySQL, and more. These data sources provide the metrics and time-series data that Grafana can visualize and analyze. AWS Managed Grafana offers integration with the following AWS services as data sources:

    • AWS IoT SiteWise

    • AWS X-Ray

    • Amazon CloudWatch

    • Amazon OpenSearch Service

    • Amazon Managed Service for Prometheus

    • Amazon Timestream

    • Amazon Redshift

    • Amazon Athena

  • Dashboards - Dashboards are the primary means of visualizing data in Grafana. They consist of panels that display graphs, tables, and other visual representations of data. Grafana provides a flexible and customizable dashboard editor for creating and arranging panels.

  • Plugins - Grafana offers a plugin architecture that allows extending its functionality. There are numerous community-built plugins available for adding additional data sources, visualization options, and integration with various systems.

AWS CloudWatch

CloudWatch Logs

CloudWatch Logs is a fully managed service that enables you to collect, monitor, and store log files from your applications and infrastructure. It allows you to centralize logs from various sources, such as EC2 instances, Lambda functions, and custom applications, and to query them in a web interface. Additionally, logs can be configured to continuously stream to an external service via subscription filters:

Lambda Function Subscription Filter

A Lambda function subscription filter allows you to send log events to an AWS Lambda function for custom processing. When a log event matches the specified filter pattern, it triggers the Lambda function. This enables you to perform custom transformations, filtering, or enrichment of the log data before forwarding it to other services or systems.
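CloudWatch Logs delivers subscription data to Lambda as a base64-encoded, gzip-compressed JSON document under `event["awslogs"]["data"]`. A minimal handler that unpacks it might look like the sketch below; the shape of the returned records is an assumption for illustration, and a real function would forward them to the chosen destination.

```python
import base64
import gzip
import json


def handler(event, context=None):
    """Decode a CloudWatch Logs subscription event and return its log events."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    payload = json.loads(gzip.decompress(compressed))
    if payload.get("messageType") != "DATA_MESSAGE":
        return []  # control messages carry no log data
    return [
        {
            "logGroup": payload["logGroup"],
            "timestamp": e["timestamp"],
            "message": e["message"],
        }
        for e in payload["logEvents"]
    ]
```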

Kinesis Data Firehose Subscription Filter

A Kinesis Data Firehose subscription filter sends log events that match the specified filter pattern to an Amazon Kinesis Data Firehose delivery stream. Kinesis Data Firehose can then deliver the log data to destinations such as Amazon S3, Amazon Redshift, or Splunk for further processing and analysis.
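Creating such a filter comes down to one call to the CloudWatch Logs `PutSubscriptionFilter` API. The sketch below only assembles the request parameters; the filter name is hypothetical, and the ARNs are supplied by the caller.

```python
def firehose_subscription_args(log_group: str, delivery_stream_arn: str,
                               role_arn: str, pattern: str = "") -> dict:
    """Keyword arguments for the CloudWatch Logs PutSubscriptionFilter API."""
    return {
        "logGroupName": log_group,
        "filterName": "to-firehose",            # hypothetical filter name
        "filterPattern": pattern,               # an empty pattern matches every event
        "destinationArn": delivery_stream_arn,  # the Firehose delivery stream
        "roleArn": role_arn,                    # role CloudWatch Logs assumes to write to the stream
    }


# With boto3 installed and credentials configured, the filter could be created with:
#   boto3.client("logs").put_subscription_filter(**firehose_subscription_args(...))
```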

Elasticsearch Subscription Filter

An Elasticsearch subscription filter forwards log events that match the filter pattern to an Amazon OpenSearch Service domain (the service formerly named Amazon Elasticsearch Service, or Amazon ES). This enables you to index and search the log data and perform advanced analytics, visualization, and monitoring.

CloudWatch Metrics

CloudWatch Metrics allows you to collect and monitor various performance metrics from AWS services, applications, and custom sources. Here's how you can leverage metrics exporting:

Alarm Actions

CloudWatch Alarms can be configured to trigger actions based on metric thresholds. These actions can include sending notifications, executing AWS Lambda functions, or making changes to other AWS resources. Alarms provide proactive monitoring and alerting capabilities to maintain application health and performance.
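The threshold logic behind an alarm can be illustrated with a deliberately simplified model: the alarm fires when every datapoint in the most recent evaluation periods exceeds the threshold. This sketch is not CloudWatch's actual evaluation algorithm (which also handles missing data and "M out of N" settings); the function name and inputs are assumptions for the example.

```python
def alarm_state(datapoints, threshold, evaluation_periods):
    """Return a simplified alarm state for a series of metric datapoints."""
    if evaluation_periods < 1:
        raise ValueError("evaluation_periods must be at least 1")
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    # Alarm only if every recent datapoint breaches the threshold.
    return "ALARM" if all(v > threshold for v in recent) else "OK"
```

For instance, with a threshold of 80 and three evaluation periods, the series 10, 85, 90, 95 yields ALARM, while 10, 85, 70, 95 yields OK.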

CloudWatch Events

CloudWatch Events provides a mechanism to track and respond to changes in your AWS environment. It captures events emitted by various AWS services and allows you to configure rules to trigger actions based on specific event patterns. While not directly related to log or metric exporting, CloudWatch Events plays a crucial role in event-driven architectures and automation workflows.
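Rules select events with JSON event patterns in which each field lists the values it accepts. The sketch below implements only exact-value matching on nested fields, as an approximation for illustration; the real service supports further operators (prefix, numeric, and so on) that are left out here.

```python
def matches(pattern: dict, event: dict) -> bool:
    """Simplified event-pattern matching: exact values only."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the corresponding event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf pattern: a list of accepted values.
            return False
    return True


pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped", "terminated"]},
}
```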

AWS X-Ray

AWS X-Ray is a distributed tracing service provided by Amazon Web Services (AWS) that helps developers analyze and debug applications by providing insights into the request flow and performance of distributed systems. It allows you to trace requests as they travel across various microservices, identify bottlenecks, and gain visibility into application behavior.

Distributed Tracing

AWS X-Ray allows you to trace requests as they travel across multiple services and components in a distributed application. It captures information about each component involved in processing the request and generates a trace map that shows the entire path and timing of the request.

Service Maps

X-Ray generates service maps that visualize the interactions and dependencies between different services and resources in your application. These maps help you understand the architecture and identify bottlenecks or performance issues.

Performance Insights

X-Ray provides performance insights by showing the latency breakdown of each component in the request flow. This allows you to pinpoint where the most time is being spent and identify performance bottlenecks or areas for optimization.

Error Analysis

X-Ray captures and highlights errors or exceptions encountered during the request processing. It provides detailed information about the error, including stack traces, error codes, and related metadata, helping you identify and troubleshoot issues quickly.

Integration with AWS Services

X-Ray seamlessly integrates with various AWS services, including AWS Lambda, Amazon EC2, Amazon ECS, Amazon API Gateway, and more. It automatically traces requests made to these services, providing end-to-end visibility into your application's behavior across different AWS resources.

Sampling and Filtering

X-Ray allows you to control the amount of trace data collected by implementing sampling rules and filters. This helps you balance the trade-off between the granularity of tracing and the overhead of capturing and storing trace data.
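A sampling rule combines a reservoir (a guaranteed number of traced requests per second) with a fixed rate applied to the remainder. X-Ray's default rule records the first request each second plus five percent of additional requests. The class below is a simplified, single-process sketch of that idea; the class and parameter names are hypothetical and it is not the SDK's implementation.

```python
import random


class ReservoirSampler:
    """Reservoir-plus-fixed-rate sampling decision, simplified."""

    def __init__(self, reservoir_per_second=1, fixed_rate=0.05, rng=random.random):
        self.reservoir = reservoir_per_second
        self.fixed_rate = fixed_rate
        self.rng = rng          # injectable for deterministic tests
        self._second = None
        self._used = 0

    def should_sample(self, now: float) -> bool:
        second = int(now)
        if second != self._second:
            # New one-second window: refill the reservoir.
            self._second, self._used = second, 0
        if self._used < self.reservoir:
            self._used += 1
            return True
        # Reservoir exhausted: sample a fixed fraction of the rest.
        return self.rng() < self.fixed_rate
```

A larger reservoir or rate gives finer-grained traces at the cost of more trace data to capture and store.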

Key components

  • X-Ray SDK - The X-Ray SDK is a set of libraries available in multiple programming languages. By instrumenting your application code with the X-Ray SDK, you can capture trace data and add metadata, annotations, and custom segments to the traces.

  • X-Ray Daemon - The X-Ray daemon is responsible for receiving trace data from the X-Ray SDKs and sending it to the X-Ray service. It runs either on EC2 instances or as a sidecar container in containerized environments.

  • X-Ray Service - The X-Ray service collects, aggregates, and analyzes the trace data received from the X-Ray SDKs and daemons. It provides a web-based console and APIs to visualize and explore the traced requests, as well as access to advanced analytics and insights.
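To show what travels between these components, the sketch below builds a minimal segment document and wraps it in the datagram format the daemon accepts (a small JSON header, a newline, then the segment). The helper names are hypothetical; in practice the SDK does this for you, and the daemon listens on UDP port 2000 by default.

```python
import json
import os
import time


def new_trace_id(now=None):
    """Trace ID: version, epoch seconds as 8 hex digits, 24 random hex digits."""
    epoch = int(now if now is not None else time.time())
    return f"1-{epoch:08x}-{os.urandom(12).hex()}"


def segment_datagram(name, start, end, trace_id=None):
    """Build the UDP payload for one completed segment (not sent here)."""
    segment = {
        "name": name,
        "id": os.urandom(8).hex(),  # 16 hex digits
        "trace_id": trace_id or new_trace_id(start),
        "start_time": start,        # epoch seconds, fractional
        "end_time": end,
    }
    header = {"format": "json", "version": 1}
    return (json.dumps(header) + "\n" + json.dumps(segment)).encode()
```

The difference between `end_time` and `start_time` of each segment is what feeds the latency breakdown shown in the console.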

Pricing

AWS OpenSearch

AWS OpenSearch Service runs on EC2 instances, which provide the compute and storage for storing and querying logs. Pricing depends on the instance type and size (with significant cost savings when using Reserved Instances), the number and size of attached EBS volumes if EBS is chosen as the storage type, the use of UltraWarm and cold managed storage, and standard data transfer rates within and between AWS Regions.

Approximate price for AWS OpenSearch deployment: https://calculator.aws/#/estimate?ch=cta&cta=lower-pricingcalc&id=b5be14ca11aed33de01513434f4e84b041344675

AWS Managed Prometheus Workspace

AWS Managed Prometheus Workspace charges for each metric sample ingested via the Prometheus-compatible endpoint, for the total storage used by metrics and their metadata, and for all samples processed by queries through the QueryMetrics API.
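The three charge components can be combined into a rough monthly estimate. In the sketch below every rate is a caller-supplied placeholder, not an actual AWS price, and tiered pricing is ignored; the 30-day month is also an approximation.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month, an approximation


def monthly_samples(active_series: int, scrape_interval_s: int) -> int:
    """Samples ingested per month: one sample per series per scrape."""
    return active_series * (SECONDS_PER_MONTH // scrape_interval_s)


def monthly_cost(samples_ingested, stored_gb, query_samples,
                 rate_per_sample, rate_per_gb, rate_per_query_sample):
    """Sum of ingestion, storage, and query charges (flat, hypothetical rates)."""
    return (samples_ingested * rate_per_sample
            + stored_gb * rate_per_gb
            + query_samples * rate_per_query_sample)
```

For example, 1,000 active series scraped every 60 seconds produce 1,000 × 43,200 = 43,200,000 samples in a 30-day month.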

Estimated price for AWS Managed Prometheus Workspace: https://calculator.aws/#/estimate?id=1c33ff36e141b60084686f281681b58e10ee2161

AWS Managed Grafana Workspace

AWS Managed Grafana Workspace is licensed per active user per month, in two tiers: Amazon Managed Grafana Editor, for active administrators and editors of the workspace, and Amazon Managed Grafana Viewer, for view-only active users. A Grafana Enterprise plan is also available; it includes plugins, customer support, and on-demand training provided by Grafana Labs.

Approximate price for AWS Managed Grafana Workspace user licences: https://calculator.aws/#/estimate?id=7b3c7ea6454a9f1cf61a911eabe5c8ffe9272533

AWS CloudWatch

CloudWatch Logs

AWS CloudWatch Logs has two types of pricing:

  • Standard Logs - Applies to all log types. Data transfer for ingestion is free, but standard EC2 data transfer rates apply when exporting data, both inside the network and to the internet.

  • Vended Logs - Vended logs are specific AWS service logs natively published by AWS services on behalf of the customer and available at volume discount pricing.

Delivery to CloudWatch Logs
Delivery to S3
  • VPC flow logs

  • AWS Global Accelerator flow logs

  • Route 53 Resolver query logs

  • EC2 Spot Instance logs

  • Evidently logs

  • MSK logs

  • AWS Network Firewall alert and flow logs

  • WAF logs

Delivery to Kinesis Data Firehose
  • MSK logs

  • Route 53 Resolver query logs

  • ElastiCache for Redis logs

  • Evidently logs

  • VPC Flow Logs

Amazon Kinesis Data Firehose has ingestion and streaming charges. Delivery to Kinesis Data Firehose is subject to both CloudWatch Logs delivery pricing and Kinesis Data Firehose pricing.

Cost estimation: https://calculator.aws/#/estimate?id=16b171a1bfc10351aa74e9499be38fe210a88567

Last modified: 17 February 2025