
# Monitoring

Architecture diagram of the notification system:

diagram

SNS topic: my-test-topic

prod-EC2-CPU-above-90%
awsrds-prod-vpc-Low-Free-Storage-Space - Alerts when free database storage drops below 5 GB
awsec2-i-06dd0928d238e33ca-High-Status-Check-Failed-Any- - Monitors failed status checks on the production instance
prod-RDS-CPU-above-85%
prod-EC2-disk-above-85%
prod-EC2-memory-above-85%
prod-EC2-network-1Gb-in - Alerts if the instance receives inbound traffic above 1 Gb per second
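
For reference, here is a minimal sketch of how an alarm such as prod-EC2-CPU-above-90% could be defined so that it publishes to the SNS topic. It assumes the alarms above are CloudWatch alarms. The instance ID, region, account ID, period and evaluation settings are hypothetical placeholders, not values taken from the real setup. The function only builds the arguments; the comment at the end shows how they might be passed to boto3.

```python
def cpu_alarm_params(instance_id, topic_arn,
                     alarm_name="prod-EC2-CPU-above-90%", threshold=90.0):
    """Build keyword arguments for CloudWatch's PutMetricAlarm call."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,             # assumed: 5-minute evaluation period
        "EvaluationPeriods": 1,    # assumed: alarm after one breaching period
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],  # notify the SNS topic on ALARM
    }


# Hypothetical identifiers, for illustration only.
params = cpu_alarm_params(
    instance_id="i-0123456789abcdef0",
    topic_arn="arn:aws:sns:us-east-1:123456789012:my-test-topic",
)

# With boto3 installed and credentials configured, this could be applied as:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```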

An alert fires when a container of any main service is down for more than 3 minutes (see the file alert.rules).

New monitoring dashboard: Services Info. This dashboard shows how much memory and CPU each service uses.

The query for CPU usage per container is:

sum(rate(container_cpu_usage_seconds_total{name="workflow_executor"}[5m]))

We need sum because Prometheus, scraping cAdvisor (Container Advisor), stores CPU usage as a separate series for each CPU core. Summing the per-core rates gives the container's total CPU usage, measured in cores.
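
The arithmetic behind the query can be illustrated with a small sketch. The per-core counter values below are made up, and `rate` here is a simplified stand-in for the PromQL function (it ignores counter resets and extrapolation).

```python
def rate(first, last, window_seconds):
    """Average per-second increase of a counter over the window."""
    return (last - first) / window_seconds


WINDOW = 300  # seconds, matching the [5m] range in the query

# Hypothetical samples of container_cpu_usage_seconds_total for one
# container: (value at window start, value at window end) per core.
samples = {
    "cpu00": (1000.0, 1060.0),  # 60 CPU-seconds used on this core
    "cpu01": (800.0, 830.0),    # 30 CPU-seconds used on this core
}

# rate(...) yields one value per core; sum(...) collapses them into one.
per_core = {cpu: rate(a, b, WINDOW) for cpu, (a, b) in samples.items()}
total_cores = sum(per_core.values())

print(per_core)
print(total_cores)
```

Without the sum, the dashboard would show one line per core instead of a single figure for the container.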

The query for memory usage in GB is:

container_memory_usage_bytes{name="workflow_executor"} / 1000000000
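
To check the value outside the dashboard, the same metric can be read through the Prometheus HTTP API. The sketch below only builds the request URL and converts a response to GB. The base URL `http://localhost:9090` and the sample response values are assumptions. The conversion happens in Python here, so the query is sent without the division.

```python
from urllib.parse import urlencode

GB = 1_000_000_000  # decimal gigabyte, the same divisor as in the query above


def instant_query_url(base_url, promql):
    """Build the URL for an instant query against /api/v1/query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def memory_gb_by_container(response):
    """Map container name to memory usage in GB from an instant-query response."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    usage = {}
    for series in response["data"]["result"]:
        # Each instant-vector sample is [timestamp, "value as string"].
        _, raw_value = series["value"]
        usage[series["metric"].get("name", "unknown")] = float(raw_value) / GB
    return usage


url = instant_query_url(
    "http://localhost:9090",  # assumed Prometheus address
    'container_memory_usage_bytes{name="workflow_executor"}',
)

# Hypothetical response body, in the shape the API returns for a vector.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"name": "workflow_executor"},
             "value": [1700000000.0, "2500000000"]},
        ],
    },
}
usage = memory_gb_by_container(sample_response)
```

Dividing by 1000000000 gives decimal gigabytes; dividing by 1024^3 instead would give GiB, which reads about 7% lower for the same byte count.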

Next steps: