
Basic Concepts
What is a Distributed System?
Imagine the world’s biggest jigsaw puzzle, too large for one person to solve alone. Instead, groups of people around the world each work on a piece of the puzzle, then combine their work to finish the picture faster. That’s the essence of a distributed system.

A distributed system is not a single machine, but a network of autonomous computers that work together to achieve a common goal. Each computer, or node, is independent and communicates with others over a network.
- There is no shared memory.
- Each node has its own operating system.
- All communication happens only through messages.
Real-world examples of distributed systems:
- Netflix’s video streaming service (content distributed across data centres).
- Online multiplayer games (players interact in real-time from different parts of the world).
In a distributed system, we’re no longer talking about one computer doing all the work. Instead, we have multiple systems, often geographically separated, working together to solve a problem. But for that to happen smoothly, these systems need to communicate—not just exchange data, but understand each other, coordinate, and synchronise.
Imagine a team of people working in different time zones on the same project. They can’t just shout across the room—they need a structured way to communicate: emails, scheduled calls, shared documents. Similarly, computers in a distributed environment need formalised methods of communication, with rules and systems in place to make sure everything stays consistent, reliable, and deadlock-free.
From Algorithms to Programs to Processes
To understand how distributed systems work, let’s begin at the roots of computing logic.
Algorithm
An algorithm is a step-by-step logical procedure for solving a problem. It is abstract and language-independent. It’s like a recipe:
- Input: Ingredients
- Steps: Instructions
- Output: A finished dish

Example:
Algorithm to make coffee:
- Boil some water
- Add coffee
- Pour milk
- Stir
- Serve
Program
A program is the concrete implementation of an algorithm in a specific programming language (Python, Java, C++, …). It’s the actual code you write to instruct the computer.
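For instance, the coffee algorithm above could be written as a short Python program. This is only a sketch: the function name and the use of print to stand in for each physical step are illustrative.

```python
def make_coffee():
    """A concrete implementation of the coffee-making algorithm."""
    steps = [
        "Boil some water",
        "Add coffee",
        "Pour milk",
        "Stir",
        "Serve",
    ]
    for step in steps:
        # print() stands in for actually performing the step
        print(step)

if __name__ == "__main__":
    make_coffee()
```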
Process
When a program is executed, it becomes a process.
A process is a program in motion: it lives, uses resources, changes state, and eventually dies.
Think of a program as a recipe, and a process as you actively cooking using that recipe.
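Here is a minimal sketch of that distinction using Python’s multiprocessing module: the code below is the program, and running it creates a separate process, with its own PID, that is born, runs, and terminates. The function name is illustrative.

```python
import multiprocessing
import os

def brew_coffee():
    # While this runs, it executes inside its own process,
    # with its own PID and its own memory space.
    print(f"Brewing coffee in process {os.getpid()}")

if __name__ == "__main__":
    print(f"Parent process: {os.getpid()}")
    p = multiprocessing.Process(target=brew_coffee)
    p.start()      # the program becomes a running process
    p.join()       # wait for it to terminate
    print(f"Child exit code: {p.exitcode}")
```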
Anatomy of a Process: What’s inside?
Every process has its own memory layout and resource structure. Here’s how memory is organised:
Memory Segments of a Process
Segment | Purpose |
---|---|
Text Segment | Contains the actual compiled program code (read-only). |
Data Segment | Stores global and static variables. |
Heap | Dynamic memory allocated at runtime (e.g., malloc() in C). |
Stack | Holds local variables, function calls, and the return address for each call. Grows and shrinks dynamically with function calls. |
Analogy: Think of it as a toolbox:
- The data is your pre-packed items.
- The heap is your expandable storage box.
- The stack is your to-do list (LIFO: last-in, first-out).

Process States: The Life Cycle of a Process
A process doesn’t just “run”. It moves through a series of states.
State | Meaning |
---|---|
Newborn | The process is just being created. |
Ready | Waiting to be assigned to the CPU. |
Running | Currently executing on the CPU. |
Waiting (Blocked) | Waiting for some I/O or event to complete. |
Terminated | Finished execution or killed. |
Analogy: Imagine a student during exam week:
- Ready: Waiting outside the classroom.
- Running: Writing the exam.
- Waiting: Waiting for results.
- Terminated: Graduated (or failed).
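As a small illustration, the life cycle can be modelled as a state machine. This is a simplified sketch: the state names follow the table above, and the transition rules are a simplification of what real schedulers allow.

```python
from enum import Enum, auto

class State(Enum):
    NEWBORN = auto()
    READY = auto()
    RUNNING = auto()
    WAITING = auto()
    TERMINATED = auto()

# Allowed transitions in the simplified life cycle described above.
TRANSITIONS = {
    State.NEWBORN: {State.READY},
    State.READY: {State.RUNNING},
    State.RUNNING: {State.READY, State.WAITING, State.TERMINATED},
    State.WAITING: {State.READY},
    State.TERMINATED: set(),
}

def move(current, target):
    # Refuse impossible transitions, e.g. going from WAITING straight to RUNNING.
    if target not in TRANSITIONS[current]:
        raise ValueError(f"Illegal transition: {current.name} -> {target.name}")
    return target

if __name__ == "__main__":
    s = State.NEWBORN
    for nxt in (State.READY, State.RUNNING, State.WAITING,
                State.READY, State.RUNNING, State.TERMINATED):
        s = move(s, nxt)
        print(s.name)
```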

Time Slicing and Context Switching
Modern CPUs are not magical multitaskers. Even though multiple programs appear to run at the same time, the CPU is actually rapidly switching between them.
This technique is called time slicing, managed by a scheduler.
How it works:
- Each process is assigned a small time slice (e.g., 10ms).
- When the time slice expires, a timer interrupt triggers.
- The OS pauses the current process, saves its state, and switches to another one.
- This saved state includes the program counter, registers, and so on, all stored in the Process Control Block (PCB).
Process Control Block (PCB): A Process’s Identity Card
Each process is managed by the OS through a PCB, which holds:
- Process ID (PID)
- Current state
- Program Counter (next instruction to execute)
- CPU register contents
- Scheduling priority
- Memory limits
- I/O status
Without the PCB, the OS wouldn’t be able to resume a paused process. It would be like trying to finish a sentence without remembering what you were saying.
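As a rough sketch of the idea (real PCBs are kernel data structures written in C and hold far more), a PCB and a context switch could be modelled like this; the field and function names are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class PCB:
    pid: int                                        # Process ID
    state: str = "ready"                            # Current state
    program_counter: int = 0                        # Next instruction to execute
    registers: dict = field(default_factory=dict)   # Saved CPU register contents
    priority: int = 0                               # Scheduling priority
    memory_limits: tuple = (0, 0)                   # (base, limit) of the address space
    open_files: list = field(default_factory=list)  # I/O status

def context_switch(current, nxt, cpu_registers):
    # Save the outgoing process's CPU state into its PCB...
    current.registers = dict(cpu_registers)
    current.state = "ready"
    # ...then restore the incoming process's saved state onto the CPU.
    nxt.state = "running"
    return dict(nxt.registers)

if __name__ == "__main__":
    a = PCB(pid=1, registers={"pc": 104, "r0": 7})
    b = PCB(pid=2, registers={"pc": 220, "r0": 3})
    cpu = context_switch(a, b, {"pc": 112, "r0": 9})
    print(a.state, b.state, cpu)   # ready running {'pc': 220, 'r0': 3}
```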
The Role of the Timer Interrupt
A timer interrupt is a hardware-generated signal that tells the OS:
“Time’s up! Stop what you’re doing and let another process use the CPU”.
This allows:
- Fairness (no single process hogs the CPU).
- Responsiveness in real-time systems.
- Multitasking.
Processes vs Threads (Bonus Clarity)
Sometimes, multiple execution units exist within the same process. These are called threads.
Processes | Threads |
---|---|
Have their own memory space | Share the same memory space |
Expensive to create/switch | Lightweight and faster |
Used to run separate programs in parallel (multiprogramming) | Used for multitasking within a single program |
Example: Playing audio + loading a page = threads within a browser tab.
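A small sketch of the difference: threads inside one process see the same data, while a child process works on its own copy. This assumes Python’s standard threading and multiprocessing modules with the default start method.

```python
import threading
import multiprocessing

data = {"value": 0}

def set_value():
    data["value"] = 42   # modifies the dictionary this worker can see

if __name__ == "__main__":
    # A thread shares the parent's memory: the change is visible afterwards.
    t = threading.Thread(target=set_value)
    t.start(); t.join()
    print("After thread:  ", data["value"])   # 42

    # A separate process gets its own copy of memory: the parent's dictionary
    # is left untouched.
    data["value"] = 0
    p = multiprocessing.Process(target=set_value)
    p.start(); p.join()
    print("After process: ", data["value"])   # still 0
```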
Why These Concepts Matter in Distributed Systems?
Understanding these internals sets the stage for:
- Coordinating multiple processes across machines.
- Managing concurrency safely (semaphores, locks, etc.)
- Implementing communication protocols (RPC, message passing)
- Designing fault-tolerant and scalable systems
Basic Characteristics of Distributed Systems
- Concurrency of Components
- All processes can execute simultaneously.
- There is no global clock with which processes can synchronise.
- Lack of Shared Memory
- Data must be passed via messages, not shared variables.
- Fault Tolerance and Reliability
- Systems must gracefully handle failures of some components.
- Transparency
- The user should feel like they’re using a single coherent system, even though it’s made up of many parts.
- Scalability
- The system should continue to work efficiently as it grows in size or number of users.
Key Challenges in Distributed Systems
- Latency: Communication is not instantaneous.
- Partial Failures: One machine might crash while others remain functional.
- Security: Information is passed over networks, often publicly accessible.
- Concurrency: Processes may compete for shared resources or need to coordinate.
This brings us to the next major section: how these independent processes coordinate, communicate, and avoid chaos, using tools like semaphores and deadlock prevention.
Concurrency, Semaphores, and Deadlock
Now that we understand what a distributed system is—a group of independent machines collaborating through message passing, we can introduce concurrency, one of its most defining (and tricky) characteristics.
What is Concurrency?
Imagine you walk into a restaurant and see a single waiter juggling orders from six different tables. Somehow, everyone eventually gets served. But if you really watch, you’ll notice that the waiter is only ever doing one task at a time (taking an order, delivering food, grabbing cutlery), switching between tasks so fast that to the customers it looks like everything is happening simultaneously.
This is concurrency.
Definition
Concurrency is the illusion of parallelism, where multiple tasks appear to be executing at the same time, even if a single processor is executing them one after another in small time slices.
In distributed systems, concurrency is not just an illusion; it’s often real. Multiple machines actually run code simultaneously, but each machine may simulate concurrency within itself.
Simulating Concurrency with Time Slicing
How the CPU creates the illusion of parallelism:
- The CPU executes Process A for a short time slice (say 10ms).
- A timer interrupt occurs.
- The OS saves Process A’s state in its PCB.
- The CPU switches to Process B, then C, and so on.
This is called context switching.
To the human eye, it looks like A, B, and C are running at the same time, but the CPU is actually jumping between them.

Round-Robin Scheduling Example
Let’s say we have 3 processes:
- P1 needs 5ms
- P2 needs 9ms
- P3 needs 3ms
With a time quantum of 4ms, the execution looks like this:
Time | CPU executes | Remaining Time |
---|---|---|
0 – 4 | P1 | P1: 1, P2: 9, P3: 3 |
4 – 8 | P2 | P1: 1, P2: 5, P3: 3 |
8 – 11 | P3 | P1: 1, P2: 5, P3: 0 |
11 – 12 | P1 | P1: 0, P2: 5 |
12 – 16 | P2 | P2: 1 |
16 – 17 | P2 | P2: 0 |
Total time: 17ms (5 + 9 + 3). This scheduling ensures fairness, and no process waits indefinitely.
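The same schedule can be reproduced with a short simulation. This is a minimal sketch of round-robin scheduling (the burst times are taken from the example above), not a model of a real OS scheduler:

```python
from collections import deque

def round_robin(bursts, quantum):
    """Simulate round-robin scheduling and print each time slice."""
    remaining = dict(bursts)
    queue = deque(bursts)              # ready queue, in arrival order
    clock = 0
    while queue:
        pid = queue.popleft()
        run = min(quantum, remaining[pid])
        print(f"{clock:2d} - {clock + run:2d}: {pid}")
        clock += run
        remaining[pid] -= run
        if remaining[pid] > 0:         # not finished: back of the queue
            queue.append(pid)
    print(f"Total time: {clock}ms")

round_robin({"P1": 5, "P2": 9, "P3": 3}, quantum=4)
```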
Concurrency is powerful, but it can be dangerous. When multiple processes access shared resources (files, variables, printers, etc.) without coordination, things can go terribly wrong.
Race Conditions and the Need for Coordination
A race condition happens when the result of a computation depends on the timing or order of execution of concurrent processes. It’s a classic bug in systems where multiple processes read and write to shared variables.
Example: Two ATM machines simultaneously trying to withdraw from the same bank account. If the system doesn’t coordinate their access, the user might be able to withdraw more money than exists.
To avoid such issues, we need a synchronisation mechanism.
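Here is a minimal sketch of such a race in Python, loosely modelled on the ATM example: two threads run a check-then-withdraw sequence on a shared balance, and a short sleep widens the window so the bad interleaving is easy to observe. The amounts are illustrative.

```python
import threading
import time

balance = 100   # shared account balance

def withdraw(amount):
    global balance
    if balance >= amount:     # 1. check
        time.sleep(0.01)      #    (widen the window between check and act)
        balance -= amount     # 2. act -- another thread may have acted in between

threads = [threading.Thread(target=withdraw, args=(100,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Both threads saw balance >= 100 before either subtracted,
# so the final balance can end up at -100.
print("Final balance:", balance)
```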
Semaphores: The Original Coordination Tool
What is a Semaphore?
A semaphore is a control variable (an abstract data type) used to restrict access to shared resources.
There are two key types:
- Binary Semaphore (Mutex)
- Only two values: 0 (locked) or 1 (unlocked).
- Allows one process access at a time (like a lock).
- Analogy: a toilet key hanging on a hook. If the key’s gone, someone’s inside. If it’s there, go ahead.
- Counting Semaphore
- Holds a number indicating how many instances of a resource are available.
- Allows up to N processes to access the resource (e.g., limited printer slots).
- When the count reaches 0, other processes must wait.
- Analogy: think of a cinema with 5 seats. Each person grabs a ticket (decrement), and if there are none left, they must wait outside until someone leaves (increment).
Semaphores support two key operations: wait() and signal(). These are atomic operations: once one begins, no other process can interrupt it. (A Python example using them follows the pseudocode below.)
wait(S)
wait(S):
    while S <= 0:
        wait
    S = S - 1
- Checks if the resource is available (S > 0).
- If not, the process waits.
- If yes, it takes the resource (decrement S).
signal(S)
signal(S):
    S = S + 1
- Releases the resource.
- Notifies one of the waiting processes (if any).
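In practice you rarely implement these operations yourself. Python’s threading.Semaphore exposes wait() and signal() as acquire() and release(); here is a minimal sketch of the cinema analogy with 5 seats (the function and variable names are illustrative):

```python
import threading
import time

seats = threading.Semaphore(5)   # counting semaphore: 5 seats available

def watch_film(person):
    seats.acquire()              # wait(): take a seat, or block until one frees up
    try:
        print(f"Person {person} is seated")
        time.sleep(0.1)          # watching the film
    finally:
        seats.release()          # signal(): leave and free the seat

people = [threading.Thread(target=watch_film, args=(i,)) for i in range(8)]
for p in people:
    p.start()
for p in people:
    p.join()
```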
The Problem of Deadlock
What is a Deadlock?
A deadlock is a state in which two or more processes are waiting for each other’s resources, and none can proceed.
Real-World Example:
Two drivers face each other on a one-lane bridge. Neither can pass, and neither will back up. The system is stuck.
Four Conditions for Deadlock to Occur
These are known as Coffman’s Conditions:
- Mutual Exclusion: At least one resource is held in a non-shareable mode.
- Hold and Wait: A process is holding at least one resource and waiting for another.
- No Preemption: A resource can’t be forcibly taken from a process.
- Circular Wait: A circular chain of two or more processes exists, where each is waiting for a resource held by the next.
Deadlock can occur only when all four conditions hold simultaneously.
Visualising Deadlocks using the Resource Allocation Graph (RAG)
A Resource Allocation Graph (RAG) helps us visualise deadlocks.
Components:
- Circles = Processes (e.g., P1, P2).
- Squares = Resources (e.g., R1, R2)
- Arrow from process to resource = Requesting
- Arrow from resource to process = Assigned
Deadlock = a cycle in the graph.
Example Deadlock:
P1 → R2 → P2 → R1 → P1 (a cycle)
P1 holds R1 and wants R2.
P2 holds R2 and wants R1.
Neither can proceed. That’s a deadlock.
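The same cycle is easy to reproduce in code. In this sketch, two threads acquire two locks in opposite order; with the sleeps in place, each ends up holding one lock while waiting forever for the other, so the script never terminates (run it only to observe the hang):

```python
import threading
import time

r1 = threading.Lock()   # resource R1
r2 = threading.Lock()   # resource R2

def p1():
    with r1:             # P1 holds R1...
        time.sleep(0.1)
        with r2:         # ...and wants R2
            print("P1 finished")

def p2():
    with r2:             # P2 holds R2...
        time.sleep(0.1)
        with r1:         # ...and wants R1
            print("P2 finished")

t1 = threading.Thread(target=p1)
t2 = threading.Thread(target=p2)
t1.start(); t2.start()
t1.join(); t2.join()     # deadlock: neither join ever returns
```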

Deadlock Strategies
1. Prevention
Ensure that at least one of the four conditions never holds.
- Eliminate Hold and Wait: Require processes to request all needed resources upfront.
- Allow Preemption: If a process can’t get what it needs, force it to release what it has.
- Avoid Circular Wait: Impose an ordering on resource acquisition (see the sketch after this section).
Trade-off: Often leads to under-utilisation of resources.
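A minimal sketch of that last idea, applied to the two-lock example above: give every lock a global rank and always acquire locks in ascending rank order, so a circular wait can never form. The ranking scheme here is illustrative.

```python
import threading

r1 = threading.Lock()               # rank 1
r2 = threading.Lock()               # rank 2
RANK = {id(r1): 1, id(r2): 2}       # global ordering on resources

def acquire_in_order(*locks):
    # Sort by rank before locking: every thread now takes R1 before R2,
    # which breaks the circular-wait condition.
    ordered = sorted(locks, key=lambda lock: RANK[id(lock)])
    for lock in ordered:
        lock.acquire()
    return ordered

def worker(name, *locks):
    held = acquire_in_order(*locks)
    try:
        print(f"{name} got both resources")
    finally:
        for lock in held:
            lock.release()

t1 = threading.Thread(target=worker, args=("P1", r1, r2))
t2 = threading.Thread(target=worker, args=("P2", r2, r1))
t1.start(); t2.start()
t1.join(); t2.join()
```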
2. Avoidance
Allow requests but use algorithms (like Banker’s Algorithm) to evaluate whether granting a request would lead to deadlock.
Drawback: Requires advance knowledge of future requests.
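To give a flavour of the idea, here is a sketch of the safety check at the heart of the Banker’s Algorithm: a state is safe if some ordering lets every process finish with the resources currently available. The matrices below are a textbook-style example and are assumed to be known in advance.

```python
def is_safe(available, max_need, allocation):
    """Return True if some order lets every process run to completion."""
    n = len(max_need)                  # number of processes
    work = list(available)             # resources currently free
    need = [[m - a for m, a in zip(max_need[i], allocation[i])] for i in range(n)]
    finished = [False] * n

    progressed = True
    while progressed:
        progressed = False
        for i in range(n):
            # Process i can finish if its remaining need fits into what is free.
            if not finished[i] and all(nd <= w for nd, w in zip(need[i], work)):
                # Once it finishes, it releases everything it currently holds.
                work = [w + a for w, a in zip(work, allocation[i])]
                finished[i] = True
                progressed = True
    return all(finished)

available  = [3, 3, 2]                       # free units of 3 resource types
max_need   = [[7, 5, 3], [3, 2, 2], [9, 0, 2], [2, 2, 2], [4, 3, 3]]
allocation = [[0, 1, 0], [2, 0, 0], [3, 0, 2], [2, 1, 1], [0, 0, 2]]
print(is_safe(available, max_need, allocation))   # True: a safe ordering exists
```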
3. Detection and Recovery
- Let deadlock occur.
- Periodically check for cycles in the wait-for graph (see the sketch below).
- If a deadlock is detected, abort one or more processes to break the cycle.
Downside: Might disrupt normal operations.
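The periodic check can be a simple cycle search over the wait-for graph (processes only, where an edge P → Q means "P is waiting for a resource that Q holds"). Here is a sketch using depth-first search:

```python
def find_cycle(wait_for):
    """Return a deadlocked cycle of processes, or None if the graph is acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2            # unvisited / in progress / done
    colour = {p: WHITE for p in wait_for}
    stack = []

    def dfs(p):
        colour[p] = GREY
        stack.append(p)
        for q in wait_for.get(p, []):
            if colour.get(q, WHITE) == GREY:        # back edge: a cycle
                return stack[stack.index(q):] + [q]
            if colour.get(q, WHITE) == WHITE:
                found = dfs(q)
                if found:
                    return found
        stack.pop()
        colour[p] = BLACK
        return None

    for p in list(wait_for):
        if colour[p] == WHITE:
            cycle = dfs(p)
            if cycle:
                return cycle
    return None

# P1 waits for P2 and P2 waits for P1: the cycle from the RAG example above.
print(find_cycle({"P1": ["P2"], "P2": ["P1"]}))   # ['P1', 'P2', 'P1']
```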
4. Ignore It (Ostrich Algorithm)
Used by systems like Unix where deadlocks are rare and the overhead of prevention is higher than the risk.
This is a calculated risk, not negligence.