decrypt101

Resilience Patterns

Resilience patterns are design strategies and techniques used in distributed systems to ensure they can handle failures gracefully and recover quickly, maintaining high availability and performance despite issues. These patterns aim to build fault-tolerant, self-healing, and adaptable systems. Here are some common resilience patterns:

1. Retry Pattern

• Problem: A transient failure, such as a network timeout during a remote service call, causes the whole request to fail even though a second attempt shortly afterwards would likely succeed.

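• Solution: Automatically retry the operation after a short delay, up to a bounded number of attempts. Retrying can help recover from temporary issues without requiring user intervention. Exponential backoff (increasing the delay after each failure) is often used so that retries do not pile extra load onto a service that is already struggling. Retry only operations that are safe to repeat (idempotent).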
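
A minimal sketch of retry with exponential backoff, in Python. The names `TransientError`, `retry`, and `make_flaky` are hypothetical and chosen for illustration. The random jitter is an assumption added on top of the backoff described above; it is a common refinement, not a required part of the pattern.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry(operation, max_attempts=4, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Call operation(), retrying on TransientError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The delay ceiling doubles on each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Assumption: random jitter spreads out retries from many clients.
            sleep(random.uniform(0, delay))


def make_flaky(failures):
    """Build a demo operation that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise TransientError("simulated timeout")
        return "ok"

    return operation
```

For example, `retry(make_flaky(2))` fails twice, sleeps between attempts, and then returns `"ok"`. Errors other than `TransientError` propagate immediately, since retrying a permanent failure only wastes time.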

2. Circuit Breaker Pattern

• Problem: Repeated calls to a failing service can cause cascading failures, overload the system, and degrade performance.

• Solution: A circuit breaker detects repeated failures and “trips” (opens), rejecting further calls to the failing service for a set time. After the timeout, the breaker lets a trial call through (the half-open state): if it succeeds, the breaker closes and normal traffic resumes; if it fails, the breaker opens again. This prevents further load on the failing service and gives it time to recover.
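
The breaker can be sketched as a small state machine. This is an illustrative Python version under simplifying assumptions: it is single-threaded, counts consecutive failures, and lets one trial call through once the timeout has elapsed. `CircuitBreaker` and `CircuitOpenError` are hypothetical names, not a library API.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call rejected")
            # Timeout elapsed: fall through and allow a trial call.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get `CircuitOpenError` immediately instead of waiting on a service that is known to be failing.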

3. Bulkhead Pattern

• Problem: Failures in one part of the system can spread and affect other parts, leading to system-wide failures.

• Solution: Partition or isolate components into separate “bulkheads” so that a failure in one partition doesn’t affect others. This is similar to bulkheads in a ship, which prevent flooding in one compartment from sinking the whole vessel.
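
One common way to build a bulkhead in application code is to cap the number of concurrent calls per dependency. The sketch below assumes that approach and uses a semaphore per compartment; `Bulkhead`, `BulkheadFullError`, and the `payments` and `search` compartments are hypothetical names.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a compartment has no free slots."""


class Bulkhead:
    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, operation):
        # Non-blocking acquire: reject at once instead of queueing callers.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("compartment is full")
        try:
            return operation()
        finally:
            self._slots.release()


# One compartment per downstream dependency (illustrative names).
payments = Bulkhead(max_concurrent=2)
search = Bulkhead(max_concurrent=2)
```

If calls to the payments dependency hang and fill both of its slots, `search.call(...)` is unaffected because it draws from a separate pool.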

4. Timeout Pattern

• Problem: A system can hang indefinitely while waiting for a slow or unresponsive service, causing degraded performance or complete failure.

• Solution: Set a maximum timeout for calls to external services. If the service doesn’t respond within the set time, the call is aborted, preventing resources from being unnecessarily tied up.
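
A rough sketch of a timeout wrapper using Python's standard `concurrent.futures` module. The helper name `call_with_timeout` and the pool size are assumptions. Note the limitation stated in the comments: the caller stops waiting, but a thread that has already started keeps running until the operation returns.

```python
import concurrent.futures
import time

# Shared worker pool for outbound calls (size chosen arbitrarily here).
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(operation, timeout_seconds):
    """Run operation() in a worker thread and stop waiting after the timeout."""
    future = _executor.submit(operation)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        # cancel() only prevents a task that has not started yet; a running
        # thread cannot be interrupted and finishes in the background.
        future.cancel()
        raise TimeoutError(f"no response within {timeout_seconds}s")
```

For example, `call_with_timeout(lambda: time.sleep(5), 0.5)` raises `TimeoutError` after about half a second. Where the client library supports it, passing a timeout directly to the network call is preferable, because that releases the underlying connection as well.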

5. Failover Pattern

• Problem: When a critical system or node fails, services become unavailable, leading to downtime.

• Solution: Automatically switch to a redundant or standby node or service when the primary one fails. Failover ensures that the system can continue operating even if some components fail.
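
Client-side failover can be sketched as trying endpoints in priority order. The function name `call_with_failover` and the use of `ConnectionError` as the signal that a node is down are assumptions made for this example.

```python
def call_with_failover(endpoints, request):
    """Try each (name, handler) pair in priority order; return the first success."""
    errors = []
    for name, handler in endpoints:
        try:
            return name, handler(request)
        except ConnectionError as exc:
            # Treat connection errors as "this node is down" and move on.
            errors.append(f"{name}: {exc}")
    raise ConnectionError("all endpoints failed: " + "; ".join(errors))
```

Called with a list such as `[("primary", primary), ("standby", standby)]`, it returns the standby's response when the primary raises `ConnectionError`. Real deployments often perform failover in a load balancer or cluster manager rather than in each client.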

6. Fallback Pattern

• Problem: If a service fails, users experience downtime or degraded service.

• Solution: Provide an alternative method or data source when the primary service fails. For example, serving cached data or offering a simplified version of the service when the main service is down.
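
The cached-data example can be sketched as follows. `get_with_fallback` is a hypothetical helper, and the cache is a plain dictionary for simplicity.

```python
def get_with_fallback(key, fetch, cache):
    """Return (value, freshness); serve the last known value if fetch fails."""
    try:
        value = fetch(key)
    except Exception:
        if key in cache:
            # Primary source failed: fall back to the cached copy.
            return cache[key], "stale"
        raise  # nothing to fall back to
    cache[key] = value  # remember the latest good value
    return value, "fresh"
```

Returning a freshness flag lets the caller decide whether to tell the user that the data may be out of date.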

7. Load Shedding Pattern

• Problem: When system demand exceeds capacity, the system can become overwhelmed and fail completely.

• Solution: Shed excess load by deliberately rejecting requests that the system can’t handle, typically the lowest-priority ones first, so that it keeps working for a smaller set of users instead of failing for everyone. This ensures the core service remains available for high-priority requests.
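
A simple sketch of priority-aware load shedding based on the number of in-flight requests. `LoadShedder` and its parameters are hypothetical, and the class is not thread-safe as written; a real server would guard the counter with a lock.

```python
class LoadShedder:
    def __init__(self, capacity, reserved_for_high=2):
        self.capacity = capacity                    # max requests in flight
        self.reserved_for_high = reserved_for_high  # slots kept for high priority
        self.in_flight = 0

    def try_admit(self, high_priority=False):
        """Return True if the request may proceed, False if it is shed."""
        # Low-priority traffic may not use the reserved slots.
        limit = self.capacity if high_priority else self.capacity - self.reserved_for_high
        if self.in_flight >= limit:
            return False
        self.in_flight += 1
        return True

    def done(self):
        """Call when an admitted request finishes."""
        self.in_flight -= 1
```

A shed request would typically receive a fast rejection such as HTTP 503, which is cheaper for the server than accepting work it cannot finish.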

8. Health Check Pattern

• Problem: The system doesn’t detect failed services or nodes until they affect operations.

• Solution: Regularly check the health of services and components, ensuring they’re running properly. If a service is unhealthy, it can be removed from the pool or restarted automatically.
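
A sketch of a health checker that removes a service from the pool after a number of consecutive failed probes. The class name, the `unhealthy_after` threshold, and the probe callables are assumptions; in practice a probe might call an HTTP health endpoint.

```python
class HealthChecker:
    def __init__(self, probes, unhealthy_after=2):
        self.probes = probes  # name -> callable returning True when healthy
        self.unhealthy_after = unhealthy_after
        self.failures = {name: 0 for name in probes}

    def run_once(self):
        """Probe every service once; intended to run on a fixed interval."""
        for name, probe in self.probes.items():
            try:
                ok = bool(probe())
            except Exception:
                ok = False  # a probe that raises counts as a failure
            self.failures[name] = 0 if ok else self.failures[name] + 1

    def healthy_pool(self):
        """Names of services that have not hit the failure threshold."""
        return [n for n, f in self.failures.items() if f < self.unhealthy_after]
```

Requiring several consecutive failures avoids evicting a service because of a single dropped probe, and a single successful probe brings the service back into the pool.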

9. Throttling Pattern

• Problem: Sudden spikes in traffic can overload services, causing failures or degraded performance.

• Solution: Limit the number of requests or operations that a service accepts over a certain time period, per client or in total, to prevent overload. This can help protect system stability during high-traffic situations.
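
Throttling is often implemented with a token bucket, which is the technique assumed in this sketch: tokens refill at a steady rate, each request spends one, and short bursts up to the bucket capacity are allowed. `TokenBucket` and its parameters are hypothetical names.

```python
import time


class TokenBucket:
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1.0):
        """Return True if the request fits within the rate limit."""
        now = self.clock()
        # Refill in proportion to the time elapsed since the last call.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Keeping one bucket per client (for example, keyed by API key) limits each caller independently, while a single shared bucket caps total throughput.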

10. Graceful Degradation Pattern

• Problem: When parts of a system fail, the entire system becomes unusable.

• Solution: Allow the system to continue operating at reduced capacity when certain services or components fail. For example, a website might disable advanced features but still provide basic functionality during outages.
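
The website example can be sketched as a page renderer that treats some features as optional. The function `render_product_page` and the feature names used with it are hypothetical.

```python
def render_product_page(product_id, load_product, optional_features):
    """Build a page dict; optional features that fail are simply left out."""
    # Core data is required: if this raises, the page cannot be served.
    page = {"product": load_product(product_id), "degraded": []}
    for name, loader in optional_features.items():
        try:
            page[name] = loader(product_id)
        except Exception:
            # Non-essential feature failed: record it and carry on.
            page["degraded"].append(name)
    return page
```

If a recommendations loader raises, the page is still returned with the product details, and `page["degraded"]` lists the features that were dropped.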

11. Self-Healing Pattern

• Problem: Manual intervention is required to recover from failures, leading to longer downtimes.

• Solution: Build systems that automatically detect failures and recover from them without human intervention. This can include automatic restarts, failover mechanisms, or moving workloads to healthy nodes.
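
A toy sketch of the automatic-restart case: a supervisor that is polled periodically and replaces a dead worker, with a restart limit so it does not loop forever. `Worker`, `Supervisor`, and the `alive` flag are hypothetical stand-ins for a real process and its liveness signal.

```python
class Worker:
    """Hypothetical worker; `alive` stands in for a real liveness signal."""

    def __init__(self, generation):
        self.generation = generation
        self.alive = True


class Supervisor:
    def __init__(self, max_restarts=3):
        self.max_restarts = max_restarts
        self.restarts = 0
        self.worker = Worker(generation=0)

    def reconcile(self):
        """Run periodically: replace the worker if it has died."""
        if self.worker.alive:
            return "healthy"
        if self.restarts >= self.max_restarts:
            return "escalate"  # stop restarting; a human needs to look
        self.restarts += 1
        self.worker = Worker(generation=self.restarts)
        return "restarted"
```

The restart cap matters: a worker that crashes immediately on every start would otherwise be restarted indefinitely and hide the underlying fault.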

12. State Replication Pattern

• Problem: If a node or component fails, data or session states are lost.

• Solution: Replicate state or data across multiple nodes to ensure that if one node fails, another can take over with minimal data loss, ensuring continuity.
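
An in-memory sketch of state replication, assuming the simplest scheme: every write is copied synchronously to all replicas before it returns, and the first replica is promoted when the primary is lost. `Replica` and `ReplicatedStore` are hypothetical names; a real system replicates over the network and must handle replicas that lag or fail.

```python
class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value


class ReplicatedStore:
    def __init__(self, replica_count=2):
        self.primary = Replica()
        self.replicas = [Replica() for _ in range(replica_count)]

    def put(self, key, value):
        self.primary.apply(key, value)
        for replica in self.replicas:
            replica.apply(key, value)  # synchronous copy to every replica

    def get(self, key):
        return self.primary.data[key]

    def fail_primary(self):
        """Simulate losing the primary and promoting the first replica."""
        self.primary = self.replicas.pop(0)
```

Because writes are copied before `put` returns, a value written before `fail_primary()` is still readable afterwards. Asynchronous replication is faster but can lose the most recent writes, which is the "minimal data loss" trade-off.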

13. Event Sourcing Pattern

• Problem: When services fail or states are inconsistent, restoring the correct state can be difficult.

• Solution: Store the state as a sequence of events. This allows systems to reconstruct the state at any point by replaying events, which helps maintain consistency and recover after failures.
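
A small sketch of event sourcing using a bank-style balance as the state. The event kinds, the `Event` type, and the `replay` helper are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str    # "deposited" or "withdrawn"
    amount: int


def apply(balance, event):
    """Pure function: current state + one event -> next state."""
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrawn":
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind}")


def replay(events, upto=None):
    """Rebuild the balance from the first `upto` events (all if None)."""
    balance = 0
    for event in events[:upto]:
        balance = apply(balance, event)
    return balance
```

Given a log of a 100 deposit, a 30 withdrawal, and a 5 deposit, `replay(log)` yields 75 and `replay(log, upto=2)` yields 70, the state as it stood after the second event. The log is append-only: corrections are recorded as new events and existing events are never edited.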

14. Shadow Traffic Pattern

• Problem: Rolling out updates or new features to live traffic can introduce unanticipated issues.

• Solution: Direct a copy of production traffic (shadow traffic) to a new or experimental service in parallel without affecting the live system. This helps test and verify the system’s resilience to real-world conditions before full deployment.
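
A simplified sketch of shadowing a request. The function `handle_with_shadow` and the `mismatches` list are hypothetical, and for brevity the shadow call runs inline; in practice it is usually sent asynchronously so that it cannot add latency to the live path.

```python
def handle_with_shadow(request, live_handler, shadow_handler, mismatches):
    """Serve from the live handler; mirror the request to the shadow handler."""
    response = live_handler(request)
    try:
        shadow_response = shadow_handler(request)
        if shadow_response != response:
            # Record differences for offline analysis.
            mismatches.append((request, response, shadow_response))
    except Exception as exc:
        # Shadow failures are recorded but never reach the user.
        mismatches.append((request, response, repr(exc)))
    return response  # the user only ever sees the live response
```

The shadow service must not produce real side effects, such as charging a card or sending email, since every mirrored request would otherwise trigger them twice.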

These resilience patterns help distributed systems handle failures, recover quickly, and continue functioning, contributing to overall system stability and robustness. Each pattern has specific use cases depending on the system’s architecture and the types of failures it needs to withstand, and they are usually combined (for example, retries with timeouts behind a circuit breaker).


Last updated 7 months ago