Fundamentals of Communication in Distributed Systems

Basic Concepts

What is a Distributed System?

Imagine the world’s biggest jigsaw puzzle, too large for one person to solve alone. Instead, groups of people around the world each work on a piece of the puzzle, then combine their work to finish the picture faster. That’s the essence of a distributed system.

Distributed Systems

A distributed system is not a single machine, but a network of autonomous computers that work together to achieve a common goal. Each computer, or node, is independent and communicates with others over a network.

  • There is no shared memory.
  • Each node has its own operating system.
  • All communication happens only through messages.

Real-world examples of distributed systems:

  • Netflix’s video streaming service (content distributed across data centres).
  • Online multiplayer games (players interact in real-time from different parts of the world).

In a distributed system, we’re no longer talking about one computer doing all the work. Instead, we have multiple systems, often geographically separated, working together to solve a problem. But for that to happen smoothly, these systems need to communicate: not just exchange data, but understand each other, coordinate, and synchronise.

Imagine a team of people working in different time zones on the same project. They can’t just shout across the room—they need a structured way to communicate: emails, scheduled calls, shared documents. Similarly, computers in a distributed environment need formalised methods of communication, with rules and systems in place to make sure everything stays consistent, reliable, and deadlock-free.


From Algorithms to Programs to Processes

To understand how distributed systems work, let’s begin at the roots of computing logic.

Algorithm

An algorithm is a step-by-step logical procedure for solving a problem. It is abstract and language-independent. It’s like a recipe:

  • Input: Ingredients
  • Steps: Instructions
  • Output: A finished dish
Example:

Algorithm to make coffee:

  1. Boil some water
  2. Add coffee
  3. Pour milk
  4. Stir
  5. Serve

Program

A program is the concrete implementation of an algorithm in a specific programming language (Python, Java, C++, …). It’s the actual code you write to instruct the computer.
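For instance, the coffee algorithm above could be written as a short Python program. A minimal sketch, with each step reduced to a print statement:

    # A Python sketch of the coffee algorithm: the abstract steps become
    # concrete, executable statements.
    def make_coffee():
        steps = [
            "Boil some water",
            "Add coffee",
            "Pour milk",
            "Stir",
            "Serve",
        ]
        for step in steps:
            print(step)

    if __name__ == "__main__":
        make_coffee()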

Process

When a program is executed, it becomes a process.

A process is a program in motion: it lives, uses resources, changes state, and eventually dies.

Think of a program as a recipe, and a process as you actively cooking using that recipe.
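To make the distinction concrete, here is a small Python sketch (using the standard multiprocessing module, one of several ways to do this) in which the same program text is run as three separate processes, each with its own process ID:

    # One program, several processes: each worker runs in its own process,
    # with its own memory and its own PID.
    import os
    from multiprocessing import Process

    def cook(order_id):
        print(f"Order {order_id} handled by process {os.getpid()}")

    if __name__ == "__main__":
        workers = [Process(target=cook, args=(i,)) for i in range(3)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()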


Anatomy of a Process: What’s inside?

Every process has its own memory layout and resource structure. Here’s how memory is organised:

Memory Segments of a Process
Segment | Purpose
Text Segment | Contains the actual compiled program code (read-only).
Data Segment | Stores global and static variables.
Heap | Dynamic memory allocated at runtime (e.g., malloc() in C).
Stack | Holds local variables, function calls, and the return address for each call. Grows and shrinks dynamically with function calls.

Analogy: Think of it as a toolbox:

  • The text segment is your instruction sheet (you follow it, but don’t change it).
  • The data segment is your pre-packed items.
  • The heap is your expandable storage box.
  • The stack is your to-do list (LIFO: last-in, first-out).

Figure: Process in memory
Process States: The Life Cycle of a Process

A process doesn’t just “run”. It moves through a series of states.

State | Meaning
Newborn | The process is just being created.
Ready | Waiting to be assigned to the CPU.
Running | Currently executing on the CPU.
Waiting (Blocked) | Waiting for some I/O or event to complete.
Terminated | Finished execution or killed.

Analogy: Imagine a student during exam week:

  • Ready: Waiting outside the classroom.
  • Running: Writing the exam.
  • Waiting: Waiting for results.
  • Terminated: Graduated (or failed).
Figure: Process state diagram

Time Slicing and Context Switching

A single CPU core is not a magical multitasker. Even though multiple programs appear to run at the same time, the core is actually rapidly switching between them.

This technique is called time slicing, managed by a scheduler.

How it works:
  1. Each process is assigned a small time slice (e.g., 10ms).
  2. When the time slice expires, a timer interrupt triggers.
  3. The OS pauses the current process, saves its state, and switches to another one.
  4. The saved state includes the program counter, registers, and so on, all stored in the Process Control Block (PCB).

Process Control Block (PCB): A Process’s Identity Card

Each process is managed by the OS through a PCB, which holds (see the sketch after this list):

  • Process ID (PID)
  • Current state
  • Program Counter (next instruction to execute)
  • CPU register contents
  • Scheduling priority
  • Memory limits
  • I/O status
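A rough way to picture the PCB is as a per-process record kept by the OS. The sketch below is illustrative Python, not any particular kernel’s actual layout:

    # An illustrative (not OS-accurate) picture of a Process Control Block.
    from dataclasses import dataclass, field

    @dataclass
    class ProcessControlBlock:
        pid: int                                        # Process ID
        state: str                                      # "ready", "running", "waiting", ...
        program_counter: int                            # next instruction to execute
        registers: dict = field(default_factory=dict)   # saved CPU register contents
        priority: int = 0                               # scheduling priority
        memory_limits: tuple = (0, 0)                   # base and limit of the address space
        open_files: list = field(default_factory=list)  # I/O status

    # On a context switch, the OS saves the CPU's registers into the outgoing
    # process's PCB and reloads them from the incoming process's PCB.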

Without the PCB, the OS wouldn’t be able to resume a paused process. It would be like trying to finish a sentence without remembering what you were saying.


The Role of the Timer Interrupt

A timer interrupt is a hardware-generated signal that tells the OS:

“Time’s up! Stop what you’re doing and let another process use the CPU”.

This allows:

  • Fairness (no single process hogs the CPU).
  • Responsiveness in real-time systems.
  • Multitasking.

Processes vs Threads (Bonus Clarity)

Sometimes, multiple execution units exist within the same process. These are called threads.

Processes | Threads
Have their own memory space | Share the same memory space
Expensive to create/switch | Lightweight and faster to switch
Used to run separate programs in parallel | Used for multitasking within a single program

Example: Playing audio while loading a page = two threads within a single browser tab.
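A small Python sketch of the difference (illustrative only): a thread shares its parent’s memory, so its change to the counter is visible, while a child process works on its own copy, so the parent’s counter is untouched.

    import threading
    import multiprocessing

    counter = 0

    def bump():
        global counter
        counter += 1

    if __name__ == "__main__":
        # Thread: shares this process's memory, so the update is visible here.
        t = threading.Thread(target=bump)
        t.start(); t.join()
        print("after thread:", counter)    # 1

        # Process: gets its own copy of the memory, so the parent is unchanged.
        p = multiprocessing.Process(target=bump)
        p.start(); p.join()
        print("after process:", counter)   # still 1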


Why These Concepts Matter in Distributed Systems

Understanding these internals sets the stage for:

  • Coordinating multiple processes across machines.
  • Managing concurrency safely (semaphores, locks, etc.).
  • Implementing communication protocols (RPC, message passing).
  • Designing fault-tolerant and scalable systems.

Basic Characteristics of Distributed Systems

  1. Concurrency of Components
    • All processes can execute simultaneously.
    • There is no global clock for processes to synchronise with.
  2. Lack of Shared Memory
    • Data must be passed via messages, not shared variables.
  3. Fault Tolerance and Reliability
    • Systems must gracefully handle failures of some components.
  4. Transparency
    • The user should feel like they’re using a single coherent system, even though it’s made up of many parts.
  5. Scalability
    • The system should continue to work efficiently as it grows in size or number of users.

Key Challenges in Distributed Systems

  • Latency: Communication is not instantaneous.
  • Partial Failures: One machine might crash while others remain functional.
  • Security: Information is passed over networks, often publicly accessible.
  • Concurrency: Processes may compete for shared resources or need to coordinate.

This brings us to the next major section: how these independent processes coordinate, communicate, and avoid chaos, using tools such as semaphores and deadlock prevention.


Concurrency, Semaphores, and Deadlock

Now that we understand what a distributed system is (a group of independent machines collaborating through message passing), we can introduce concurrency, one of its most defining (and tricky) characteristics.

What is Concurrency?

Imagine you walk into a restaurant and see a single waiter juggling orders from six different tables. Somehow, everyone eventually gets served. But if you really watch, you’ll notice that the waiter is only ever doing one task at a time (taking an order, delivering food, grabbing cutlery), switching between tasks so fast that, to the customers, it looks like everything is happening simultaneously.

This is concurrency.

Definition

Concurrency is the illusion of parallelism, where multiple tasks appear to be executing at the same time, even if a single processor is executing them one after another in small time slices.

In distributed systems, concurrency is not just an illusion; it’s often real. Multiple machines actually run code simultaneously, but each machine may simulate concurrency within itself.

Simulating Concurrency with Time Slicing

How the CPU creates the illusion of parallelism:

  1. The CPU executes Process A for a short time slice (say 10ms).
  2. A timer interrupt occurs.
  3. The OS saves Process A’s state in its PCB.
  4. The CPU switches to Process B, then C, and so on.

This is called context switching.

To the human eye, it looks like A, B, and C are running at the same time, but the CPU is actually jumping between them.

Figure: Concurrency vs parallelism – a simplified example (Source: dynamogeeks.com)

Round-Robin Scheduling Example

Let’s say we have 3 processes:

  • P1 needs 5ms
  • P2 needs 9ms
  • P3 needs 3ms

With a time quantum of 4ms, the execution looks like this:

Time | CPU executes | Remaining time
0 – 4 | P1 | P1: 1, P2: 9, P3: 3
4 – 8 | P2 | P1: 1, P2: 5, P3: 3
8 – 11 | P3 | P1: 1, P2: 5, P3: 0 (P3 finishes)
11 – 12 | P1 | P1: 0, P2: 5 (P1 finishes)
12 – 16 | P2 | P2: 1
16 – 17 | P2 | P2: 0 (P2 finishes)

Total time: 17ms (5 + 9 + 3). This scheduling ensures fairness, and no process waits indefinitely.
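The same schedule can be reproduced in a few lines of code. A minimal round-robin simulator in Python (a sketch, not a real scheduler):

    # A toy round-robin scheduler: each process runs for at most one quantum,
    # then goes to the back of the queue if it still has work left.
    from collections import deque

    def round_robin(bursts, quantum):
        remaining = dict(bursts)          # process name -> remaining time (ms)
        queue = deque(bursts)             # arrival order: P1, P2, P3
        clock = 0
        while queue:
            name = queue.popleft()
            run = min(quantum, remaining[name])
            print(f"{clock:2d} - {clock + run:2d}  {name}")
            clock += run
            remaining[name] -= run
            if remaining[name] > 0:
                queue.append(name)        # not finished: back of the queue
        print(f"Total time: {clock}ms")

    round_robin({"P1": 5, "P2": 9, "P3": 3}, quantum=4)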

Concurrency is powerful, but it can be dangerous. When multiple processes access shared resources (files, variables, printers, etc.) without coordination, things can go terribly wrong.

Race Conditions and the Need for Coordination

A race condition happens when the result of a computation depends on the timing or order of execution of concurrent processes. It’s a classic bug in systems where multiple processes read and write to shared variables.

Example: Two ATMs simultaneously processing withdrawals from the same bank account. If the system doesn’t coordinate their access, the user might be able to withdraw more money than exists.

To avoid such issues, we need a synchronisation mechanism.
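Before looking at those mechanisms, here is a small Python sketch of the ATM scenario: two threads read the balance, then write it back, and one update is silently lost. The sleep() call simply exaggerates the window in which the race can happen.

    import threading
    import time

    balance = 100

    def withdraw(amount):
        global balance
        local = balance              # 1. read the shared balance
        time.sleep(0.01)             # 2. some processing delay
        balance = local - amount     # 3. write back, clobbering the other update

    atm1 = threading.Thread(target=withdraw, args=(60,))
    atm2 = threading.Thread(target=withdraw, args=(60,))
    atm1.start(); atm2.start()
    atm1.join(); atm2.join()

    # 120 was dispensed from an account holding 100, yet the balance shows 40.
    print(balance)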

Semaphores: The Original Coordination Tool

What is a Semaphore?

A semaphore is a control variable (an abstract data type) used to restrict access to shared resources.

There are two key types:

  • Binary Semaphore (Mutex)
    • Only two values: 0 (locked) or 1 (unlocked).
    • Allows one process access at a time (like a lock).
    • Analogy: a toilet key hanging on a hook. If the key’s gone, someone’s inside. If it’s there, go ahead.
  • Counting Semaphore
    • Holds a number indicating how many instances of a resource are available.
    • Allows up to N processes to access the resource (e.g., limited printer slots).
    • When the count reaches 0, other processes must wait.
    • Analogy: think of a cinema with 5 seats. Each person grabs a ticket (decrement), and if there are none left, they must wait outside until someone leaves (increment).

Semaphores support two key operations: wait() and signal().

These are atomic operations. That means once one begins, no other operation can interrupt it.

wait(S)

    wait(S):
        while S <= 0:
            pass        # busy-wait until the resource becomes available
        S = S - 1

  • Checks if the resource is available (S > 0).
  • If not, the process waits.
  • If yes, it takes the resource (decrements S).
signal(S)

    signal(S):
        S = S + 1

  • Releases the resource.
  • Notifies one of the waiting processes (if any).
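In Python, for example, threading.Semaphore offers the same behaviour: acquire() plays the role of wait() and release() the role of signal(). A sketch of the cinema analogy with 5 seats:

    import threading
    import time

    seats = threading.Semaphore(5)      # S = 5 free seats

    def watch_film(viewer):
        seats.acquire()                 # wait(S): take a seat, or block if none left
        try:
            print(f"Viewer {viewer} is seated")
            time.sleep(0.1)             # watch the film
        finally:
            seats.release()             # signal(S): free the seat for someone waiting

    viewers = [threading.Thread(target=watch_film, args=(i,)) for i in range(8)]
    for v in viewers:
        v.start()
    for v in viewers:
        v.join()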

The Problem of Deadlock

What is a Deadlock?

A deadlock is a state in which two or more processes are waiting for each other’s resources, and none can proceed.

Real-World Example:

Two drivers face each other on a one-lane bridge. Neither can pass, and neither will back up. The system is stuck.

Four Conditions for Deadlock to Occur

These are known as Coffman’s Conditions:

  1. Mutual Exclusion: At least one resource is held in a non-shareable mode.
  2. Hold and Wait: A process is holding at least one resource and waiting for another.
  3. No Preemption: A resource can’t be forcibly taken from a process.
  4. Circular Wait: A circular chain of two or more processes exists, where each is waiting for a resource held by the next.

If all four conditions hold at the same time, a deadlock can occur; breaking any one of them prevents it.

Visualising Deadlocks using the Resource Allocation Graph (RAG)

A Resource Allocation Graph (RAG) helps us visualise deadlocks.

Components:

  • Circles = Processes (e.g., P1, P2).
  • Squares = Resources (e.g., R1, R2).
  • Arrow from process to resource = Requesting
  • Arrow from resource to process = Assigned

Deadlock = a cycle in the graph (when each resource has a single instance)

Example Deadlock:

P1 → R2 → P2 → R1 → P1 (a cycle)

P1 holds R1 and wants R2.

P2 holds R2 and wants R1.

Neither can proceed. That’s a deadlock.
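The same situation is easy to reproduce. A Python sketch using two locks as R1 and R2 (deliberately broken: run it and both threads hang forever, each holding one lock and waiting for the other):

    import threading
    import time

    r1 = threading.Lock()
    r2 = threading.Lock()

    def p1():
        with r1:                        # P1 holds R1
            time.sleep(0.1)
            with r2:                    # ...and waits for R2
                print("P1 got both")

    def p2():
        with r2:                        # P2 holds R2
            time.sleep(0.1)
            with r1:                    # ...and waits for R1
                print("P2 got both")

    threading.Thread(target=p1).start()
    threading.Thread(target=p2).start()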


Deadlock Strategies

1. Prevention

Ensure that at least one of the four conditions never holds.

  • Eliminate Hold and Wait: Require processes to request all needed resources upfront.
  • Allow Preemption: If a process can’t get what it needs, force it to release what it has.
  • Avoid Circular Wait: Impose a global ordering on resource acquisition (see the sketch below).

Trade-off: Often leads to under-utilisation of resources.
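For instance, the circular-wait rule above can be enforced by always acquiring resources in the same global order. A Python sketch: because both threads take R1 before R2, the cycle from the earlier deadlock example can never form.

    import threading

    r1 = threading.Lock()
    r2 = threading.Lock()

    def worker(name):
        with r1:                        # always acquired first
            with r2:                    # always acquired second
                print(f"{name} got both resources")

    t1 = threading.Thread(target=worker, args=("P1",))
    t2 = threading.Thread(target=worker, args=("P2",))
    t1.start(); t2.start()
    t1.join(); t2.join()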


2. Avoidance

Allow requests but use algorithms (like Banker’s Algorithm) to evaluate whether granting a request would lead to deadlock.

Drawback: Requires advance knowledge of future requests.
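At the heart of the Banker’s Algorithm is a safety check: would every process still be able to finish, in some order, with the resources left over? A simplified Python sketch of just that check (not the full algorithm):

    # Safety check: repeatedly find a process whose remaining need fits in the
    # currently free resources; if it can finish, it returns what it holds.
    def is_safe(available, need, allocation):
        work = list(available)
        finished = [False] * len(need)
        progress = True
        while progress:
            progress = False
            for i, done in enumerate(finished):
                if not done and all(n <= w for n, w in zip(need[i], work)):
                    work = [w + a for w, a in zip(work, allocation[i])]
                    finished[i] = True
                    progress = True
        return all(finished)

    # One resource type, 3 units free: safe, processes can finish as P0, P1, P2.
    print(is_safe([3], need=[[2], [3], [5]], allocation=[[2], [2], [3]]))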


3. Detection and Recovery

  • Let deadlock occur.
  • Periodically check for cycles in the wait-for graph (a sketch follows below).
  • If a deadlock is detected, abort one or more processes to break the cycle.

Downside: Might disrupt normal operations.
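The detection step itself boils down to cycle-finding. A Python sketch over a wait-for graph, where an edge P1 → P2 means “P1 is waiting for a resource held by P2”:

    # Depth-first search for a cycle in a wait-for graph.
    def has_cycle(wait_for):
        visited, on_path = set(), set()

        def dfs(node):
            visited.add(node)
            on_path.add(node)
            for nxt in wait_for.get(node, []):
                if nxt in on_path or (nxt not in visited and dfs(nxt)):
                    return True
            on_path.remove(node)
            return False

        return any(dfs(n) for n in wait_for if n not in visited)

    print(has_cycle({"P1": ["P2"], "P2": ["P1"]}))   # True: deadlock
    print(has_cycle({"P1": ["P2"], "P2": []}))       # False: P2 can finish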


4. Ignore It (Ostrich Algorithm)

Used by systems like Unix where deadlocks are rare and the overhead of prevention is higher than the risk.

This is a calculated risk, not negligence.
