Think about this: you post a photo on your favorite social media app, and almost instantly your friends all over the world can view it. How does this happen so seamlessly? Behind the scenes, distributed systems are at work. They are the backbone of modern technology, driving cloud storage, online banking, and e-commerce platforms. Let's explore the basics of distributed systems to understand their role in our everyday digital experiences.

1. What is a Distributed System?

A distributed system is a group of computers working together to achieve a shared goal. These computers are connected through a network and communicate with each other to perform tasks. Even though the system consists of multiple machines, it appears as a single unit to the user.

Key Features:
- No Single Controller: Each computer (or node) operates independently, without one machine controlling everything.
- Can Grow Easily: More computers can be added to handle extra work.
- Keeps Working Even if Some Fail: If one or more machines stop working, the system can still function.
- Handles Many Tasks at Once: Multiple computers can work on different parts of a task simultaneously.

2. What is the Client-Server Model?

The Client-Server Model is a fundamental architecture used in distributed systems. One machine (the client) requests services or resources, and another machine (the server) provides them. The two communicate over a network: the server typically does the heavier processing or stores the data, while the client interacts with the user and sends requests to the server.

Example: Web Browsing

When you use a web browser to visit a website, you're interacting with a client-server system:
- Client (Your Browser): You type a URL (e.g., www.example.com) into the browser. The browser sends a request to the server that hosts the website.
- Server (Website Host): The server receives the request, processes it, and retrieves the requested webpage (e.g., HTML, images). The server then sends the webpage back to your browser.
- Response: Your browser receives the webpage and displays it for you to view.

3. What is Scalability?

Scalability is the ability of a system to handle a growing amount of work without compromising performance. In distributed systems, this means the system can scale up (add resources to an existing machine) or scale out (add more machines) as demand increases.

There are two main types of scalability:
- Vertical Scalability (Scaling Up): Adding more resources (CPU, memory, storage) to a single machine or server. Example: your node starts with a 2-core CPU and 512GB of RAM, and you later upgrade it to a 4-core CPU and 1TB of RAM.
- Horizontal Scalability (Scaling Out): Adding more machines or nodes to distribute the load. Example: your cluster starts with 2 nodes, each with a 2-core CPU, and to handle higher traffic you add one more node with a 2-core CPU.

4. What is the CAP Theorem?

The CAP Theorem (also known as Brewer's Theorem) states that a distributed system cannot simultaneously guarantee all three of the following properties:
- Consistency (C): Every read operation returns the most recent write (or an error). All nodes in the system see the same data at the same time.
- Availability (A): Every request (read or write) sent to a working node receives a non-error response, even if some other nodes are unavailable. The response is not guaranteed to contain the most recent write.
- Partition Tolerance (P): The system continues to operate even if there is a network partition (i.e., some nodes can't communicate with others).

According to the CAP Theorem, a distributed system can guarantee only two of these three properties at any given time, not all three.
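The tension between these properties can be sketched with a toy model. This is only an illustration, not how any real database is implemented: the `Replica` class, the two policy names, and the helper functions are all hypothetical. Three replicas hold one value, replica A gets cut off from the others, and a write reaches only B and C. A read sent to A then has to either fail or return old data.

```python
class Replica:
    """A toy replica holding a single value (illustrative only)."""

    def __init__(self, name, policy):
        self.name = name
        self.policy = policy      # "prefer_consistency" or "prefer_availability"
        self.value = "v1"
        self.partitioned = False  # True when cut off from the other replicas

    def read(self):
        if self.partitioned and self.policy == "prefer_consistency":
            # The replica cannot confirm it has the latest write, so it refuses.
            raise RuntimeError(f"{self.name}: unavailable during partition")
        # Otherwise it always answers, possibly with stale data.
        return self.value


def write_reachable(replicas, value):
    """Apply a write only to replicas that are not cut off."""
    for replica in replicas:
        if not replica.partitioned:
            replica.value = value


def read_from_isolated_node(policy):
    """Partition replica A, write 'v2' elsewhere, then read from A."""
    a, b, c = (Replica(name, policy) for name in "ABC")
    a.partitioned = True              # A can no longer talk to B and C
    write_reachable([a, b, c], "v2")  # the write reaches only B and C
    try:
        return a.read()
    except RuntimeError:
        return "error"


print(read_from_isolated_node("prefer_consistency"))   # error
print(read_from_isolated_node("prefer_availability"))  # v1
```

With the first policy the isolated replica gives up availability to avoid returning stale data. With the second it stays available but returns the old value `v1` while B and C already hold `v2`.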
Guaranteeing any two of the properties means giving up the third under certain conditions. In practice, network partitions cannot be ruled out, so the real choice during a partition is between consistency and availability.

The Three Trade-offs

1. Consistency and Availability (CA)

What it means: The system guarantees that:
- Consistency: All servers have the same data at the same time.
- Availability: Every request gets a response, even if some servers fail.
- No Partition Tolerance: The system cannot function properly during a network partition.

Example with Servers A, B, and C: Imagine the system processes a write request to update a record.
- Before the partition: All servers (A, B, C) have consistent data and respond to requests.
- During a partition: If Server A is isolated from B and C, the system cannot tolerate this partition. It may stop functioning or reject requests to ensure that all servers stay consistent.

Trade-off: The system sacrifices partition tolerance to maintain consistency and availability in normal operation.

Real-life Example: A single-node relational database like MySQL without replication. With only one node there is nothing to partition, but if that server becomes unreachable, the system stops working.

2. Consistency and Partition Tolerance (CP)

What it means: The system guarantees that:
- Consistency: All servers have the same data, even if there's a network partition.
- Partition Tolerance: The system continues to function during a partition, but it sacrifices availability.

Example with Servers A, B, and C:
- Before the partition: A write request updates a record, and all servers have the same data.
- During a partition: If Server A is isolated, the system blocks requests on the isolated side to ensure consistency. For example, if a client tries to write data to Server A, the system rejects the request because Servers B and C cannot verify or synchronize the data.

Trade-off: The system sacrifices availability during the partition to maintain consistency.

Real-life Example: HBase or ZooKeeper. During a partition, the system might stop processing requests to prevent data inconsistencies.

3. Availability and Partition Tolerance (AP)

What it means: The system guarantees that:
- Availability: Every request gets a response, even if the data might not be the latest.
- Partition Tolerance: The system continues to function during a partition, but it sacrifices consistency.

Example with Servers A, B, and C:
- Before the partition: A write request updates a record, and all servers have the same data.
- During a partition: If Server A is isolated, it continues accepting requests independently. However, the data on Server A might differ from the data on Servers B and C until the partition is resolved.
- After the partition: The system reconciles the differences between servers, but there may be a brief period of inconsistency.

Trade-off: The system sacrifices consistency to ensure availability during the partition.

Real-life Example: Cassandra or DynamoDB. These systems prioritize availability and allow operations during partitions, even if some nodes have stale data.

5. What is Latency?

Latency is the time it takes for a piece of data to travel from the source to the destination in a network. It is often measured in milliseconds (ms).

Think of it as a delay: for a request, it is the time between sending the request and receiving the response (the round trip). Lower latency means faster communication.

Example of Latency: Imagine you're making a video call. When you say something, the time it takes for your voice to reach the other person is the latency. If the latency is high (e.g., 500 ms), the other person hears your voice with a noticeable delay. If the latency is low (e.g., 20 ms), the conversation feels smooth and real-time.

6. What is Bandwidth?

Bandwidth is the maximum amount of data that can be transmitted over a network in a given period. It is usually measured in bits per second (e.g., Mbps or Gbps).

Think of it as capacity: the size of the "pipe" that carries the data. Higher bandwidth means more data can be transmitted at once.
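To put rough numbers on how latency and bandwidth combine, here is a minimal sketch. It assumes a deliberately simplified model (transfer time = latency + size / bandwidth) that ignores protocol overhead, congestion, and retransmissions. The function name, the 50 ms latency, and the file sizes are illustrative assumptions, and units are decimal (1 GB = 10^9 bytes, 1 Mbps = 10^6 bits per second).

```python
GB = 1_000_000_000   # bytes, decimal units (assumption)
MBPS = 1_000_000     # bits per second


def transfer_time_seconds(size_bytes, bandwidth_bps, latency_s=0.0):
    """Simplified estimate: fixed delay plus time to push all bits through."""
    return latency_s + (size_bytes * 8) / bandwidth_bps


# A large file: bandwidth dominates the total time.
slow = transfer_time_seconds(1 * GB, 10 * MBPS, latency_s=0.05)
fast = transfer_time_seconds(1 * GB, 100 * MBPS, latency_s=0.05)
print(f"{slow:.2f} s")  # 800.05 s
print(f"{fast:.2f} s")  # 80.05 s

# A tiny message: latency dominates, extra bandwidth barely helps.
tiny = transfer_time_seconds(1_000, 100 * MBPS, latency_s=0.05)
print(f"{tiny * 1000:.2f} ms")  # 50.08 ms
```

In this model, ten times the bandwidth makes the large transfer roughly ten times faster, while the 1 KB message still takes about 50 ms because almost all of its time is latency.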
Example of Bandwidth: Imagine downloading a 1GB movie. At 10 Mbps the download takes much longer than at 100 Mbps. Higher bandwidth lets you download large files faster or stream video in higher quality (e.g., 4K).

7. What is Data Replication?

Data replication is the process of storing multiple copies of the same data across different servers, systems, or locations in a distributed system. It improves data availability, fault tolerance, and read performance, even when servers fail or the network has problems.

Why is Data Replication Important?
- Fault Tolerance: If one server fails, data is still accessible from other servers.
- Improved Performance: Replicas closer to users reduce latency for data access.
- Scalability: Multiple replicas can handle a higher number of read requests.
- Disaster Recovery: Data is not lost if a server or data center fails.

Types of Data Replication
- Synchronous Replication: A write is applied to all replicas before it is acknowledged. This ensures consistency but increases latency, since every replica must confirm the update before the system proceeds. Example: Financial systems where every transaction must be recorded accurately.
- Asynchronous Replication: Data is updated on the primary server first, and replicas are updated later. This improves performance but may result in temporary inconsistencies. Example: Social media platforms where eventual consistency is acceptable.

Example of Data Replication

Imagine an e-commerce platform like Amazon with users across the world.

Without Replication: All user data (e.g., order history) is stored on a single server in the US. A user in India experiences high latency when accessing their data because of the geographical distance.

With Replication: The system replicates the user data to servers in India, Europe, and other regions. When a user in India accesses their order history:
- The request is routed to the closest server (in India).
- This reduces latency and improves the user experience.
- If the Indian server fails, the request is routed to the next nearest server (e.g., Europe), ensuring availability.

In today's tech-driven world, understanding distributed systems is a must for developers and engineers. Whether you're building a cloud-native application, designing scalable architectures, or working on real-time data processing, distributed systems are the foundation of these innovations.

In the next post, we will look at other distributed systems concepts you should be aware of. If you found this post useful, share it with your friends. Let me know in the comments if you have any questions about this article.