Микросервисы / распределенные системы (Microservices / Distributed Systems), @microservices

Measurement and learning
- Building in measurement mechanisms
- Regular data analysis
- Running retrospectives
- Identifying improvement opportunities
- Continuous improvement

Key availability metrics: mean time between failures (MTBF) and mean time to recover…
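As a rough illustration of how the two metrics combine, here is a minimal sketch that uses the commonly cited steady-state formula availability = MTBF / (MTBF + MTTR). The function names and the sample numbers are assumptions made for this example; they do not come from the post.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up, given MTBF and mean time to recover."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_per_year_hours(avail: float) -> float:
    """Expected downtime over a 365-day year for a given availability."""
    return (1.0 - avail) * 365 * 24


# Hypothetical numbers: one failure every 999 hours, one hour to recover.
a = availability(mtbf_hours=999.0, mttr_hours=1.0)
print(f"availability: {a:.3%}")
print(f"downtime per year: {downtime_per_year_hours(a):.2f} h")
```

The formula shows why both levers matter: halving the recovery time improves availability about as much as doubling the time between failures.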

Resilience mechanisms

Detecting the problem
- Health Checks: synthetic transactions are recommended.
- Watchdogs and Alerts (a watchdog is a piece of software whose only responsibility is to watch for a specific condition and then perform an action, usually creating some form of alert, in response).
- Monitoring (metrics, traces, and logs). Metrics are the numeric measurements tracked over time. Traces are sequences of related events that reveal how requests flow through the system. Logs are the timestamped records of events.
- Observability (extending monitoring to provide insight into the internal state of the system to allow its failure modes to be better predicted and understood). We need to understand how the system works and the various states it can be in, and we must be able to correlate from the data we have collected to what state the system is in (or was in) and how it got there.
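The watchdog and health-check ideas above can be sketched as follows. This is an illustrative sketch, not a production design: `Watchdog`, `synthetic_roundtrip`, and the in-memory `store` are hypothetical names invented for the example, and a real synthetic transaction would exercise the actual service path end to end.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Watchdog:
    """Watches one condition and raises an alert after repeated failures."""
    name: str
    check: Callable[[], bool]       # returns True when the condition is healthy
    alert: Callable[[str], None]    # where alerts go (pager, chat, log, ...)
    failures_before_alert: int = 3
    _consecutive_failures: int = 0

    def tick(self) -> None:
        """Run one check; call this periodically from a scheduler."""
        try:
            healthy = self.check()
        except Exception:
            healthy = False         # a crashing check counts as a failure
        if healthy:
            self._consecutive_failures = 0
            return
        self._consecutive_failures += 1
        # Alert exactly once when the threshold is reached, not on every tick.
        if self._consecutive_failures == self.failures_before_alert:
            self.alert(f"{self.name}: {self._consecutive_failures} failed checks in a row")


# Stand-in for a real dependency (assumption: a simple key-value store).
store: Dict[str, str] = {}


def synthetic_roundtrip() -> bool:
    """Synthetic transaction: write a probe record, read it back, clean up."""
    store["probe"] = "ok"
    return store.pop("probe", None) == "ok"


alerts: List[str] = []
healthy_dog = Watchdog("kv-store", check=synthetic_roundtrip, alert=alerts.append)
broken_dog = Watchdog("payments", check=lambda: False, alert=alerts.append)
for _ in range(5):
    healthy_dog.tick()
    broken_dog.tick()
print(alerts)
```

Requiring several consecutive failures before alerting is one way to avoid paging on a single transient blip; the right threshold depends on how often the check runs.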

Isolating the problem
- Synchronous versus Asynchronous Communication: RPCs versus Messages
- Limit the scope of a failure: Bulkheads (in essence, don't put all your eggs in one basket; this covers everything from separate threads to availability zones)
- Defaults and Caches
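At the smallest scale (threads inside one process), a bulkhead combined with a default or cached value might look like the sketch below. `Bulkhead` and `LastGoodValue` are hypothetical helper classes written for this example, and the concurrency limit is an arbitrary assumption.

```python
import threading
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class Bulkhead:
    """Caps concurrent calls to one dependency so it cannot exhaust all workers."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn: Callable[[], T], fallback: Callable[[], T]) -> T:
        # Do not queue when the compartment is full: answer from the fallback.
        if not self._slots.acquire(blocking=False):
            return fallback()
        try:
            return fn()
        except Exception:
            return fallback()   # the failure stays inside this compartment
        finally:
            self._slots.release()


class LastGoodValue(Generic[T]):
    """Serves a default until a real value is seen, then the last good value."""

    def __init__(self, default: T) -> None:
        self._value = default

    def remember(self, value: T) -> T:
        self._value = value
        return value

    def get(self) -> T:
        return self._value


recommendations = Bulkhead(max_concurrent=2)
cache: LastGoodValue = LastGoodValue(default=[])


def fetch_recommendations() -> list:
    raise TimeoutError("downstream is slow")   # simulated failing dependency


result = recommendations.call(fetch_recommendations, cache.get)
print(result)
```

The caller still gets an answer (here the empty default) while the failing dependency is confined to its own pool of slots.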

Protecting system components from overload
- Back Pressure (some sort of signaling back through the system so that the clients can tell that the servers are overloaded and there is no point in sending more requests yet)
- Load Shedding (reject workload that can’t be processed or that would cause the system to become unstable)
- Rate Limiting (usually defined in terms of the rate of requests arriving from a particular source, e.g., a client ID, a user, or an IP address, in a time period)
- Timeouts (a timeout defines how long the caller will wait, and we need a mechanism to interrupt or notify the caller that the timeout has occurred so that it can abandon the request and perform whatever logic is needed to clean things up)
- Circuit Breakers (a small, state machine–based proxy that sits in front of our service request code)
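To make the "state machine–based proxy" description of a circuit breaker concrete, here is a minimal single-threaded sketch with the usual three states (closed, open, half-open). The class name, thresholds, and the injectable `clock` parameter are assumptions chosen for illustration; a real deployment would more likely rely on an established library and add thread safety.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying the service."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._failure_threshold = failure_threshold
        self._reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = 0.0
        self.state = "closed"

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "open":
            if self._clock() - self._opened_at < self._reset_timeout_s:
                raise CircuitOpenError("circuit is open; failing fast")
            self.state = "half_open"    # timeout elapsed: allow one trial call
        try:
            result = fn()
        except Exception:
            self._record_failure()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self.state = "closed"
        return result

    def _record_failure(self) -> None:
        self._failures += 1
        # A failed trial call, or too many failures in a row, opens the circuit.
        if self.state == "half_open" or self._failures >= self._failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()


def flaky() -> str:
    raise ConnectionError("service unavailable")   # simulated failing service


breaker = CircuitBreaker(failure_threshold=2, reset_timeout_s=30.0)
for _ in range(3):
    try:
        breaker.call(flaky)
    except ConnectionError:
        print("call failed; state =", breaker.state)
    except CircuitOpenError:
        print("rejected without calling the service")
```

While open, the breaker fails fast instead of letting callers wait on timeouts, which also gives the overloaded service room to recover.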

Mitigating the problem
- Data Consistency: Compensation (for any change to a database (a transaction), the caller has the ability to make another change that undoes the change (a compensating transaction))
- Data Availability: Replication (mitigate node failure by ensuring that the data from the failed node is immediately available on other nodes)
- Data Availability: Checks (checking the underlying storage mechanism’s integrity and checking the integrity of the system’s data; tactics to use to mitigate data corruption are regular checking of database integrity, regular backup of the databases, and fixing corruption in place)
- Data Availability: Backups
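The compensation idea from the list above can be sketched as a sequence of steps, each paired with an action that undoes it. This is a simplified, in-memory illustration: `run_with_compensation`, the account names, and the amounts are hypothetical, and it assumes the compensating actions themselves succeed, which a real system cannot take for granted.

```python
from typing import Callable, Dict, List, Tuple

# Each step is a pair: (action, compensating action that undoes it).
Step = Tuple[Callable[[], None], Callable[[], None]]


def run_with_compensation(steps: List[Step]) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    completed: List[Callable[[], None]] = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True


balances: Dict[str, int] = {"alice": 100, "bob": 0}


def debit_alice() -> None:
    balances["alice"] -= 30


def refund_alice() -> None:
    balances["alice"] += 30


def credit_bob() -> None:
    raise RuntimeError("bob's account service is unavailable")   # simulated failure


def uncredit_bob() -> None:
    balances["bob"] -= 30


ok = run_with_compensation([(debit_alice, refund_alice), (credit_bob, uncredit_bob)])
print(ok, balances)
```

Unlike a database rollback, the intermediate state (Alice debited, Bob not yet credited) is briefly visible to other readers, so compensation restores consistency after the fact without providing isolation.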

Resolving the problem

www.tgoop.com/microservices_arch/515

2.69K views, Sergey Baranov, Dec 5, 2023 at 10:35

Create: 2023-12-05
Last Update: 2025-07-23 20:22:01
