operating system design for new computer architecture

OPERATING SYSTEM DESIGN FOR
NEW COMPUTER ARCHITECTURES (1)
Dr. Elaheh Gavagsaz
2023

INTRODUCTION
➢ Since its inception, the computer industry has been driven by an endless quest for more and more
computing power.
➢ Astronomers are trying to make sense of the universe, biologists are trying to understand the implications
of the human genome, and aeronautical engineers are interested in building safer and more efficient
aircraft, and all want more CPU cycles.
➢ However much computing power there is, it is never enough.

INTRODUCTION
➢ In the past, the solution was always to make the clock run faster.
➢ We have hit some fundamental limits on clock speed because, in a computer with a 10-GHz clock, the
signals cannot travel more than 2 cm in total (According to Einstein’s special theory of relativity).
➢ Making computers this small may be possible, but then we hit another fundamental problem: heat
production.
➢ The faster the computer runs, the more heat it generates, and the smaller the computer, the harder it is to
get rid of this heat.

INTRODUCTION
➢ One approach to greater speed is through massively parallel computers.
➢ These machines consist of many CPUs, each of which runs at ‘‘normal’’ speed, but collectively having
much more computing power than a single CPU.
➢ Putting 1 million unrelated computers in a room is easy to do provided that you have enough money and
a sufficiently large room. Spreading 1 million unrelated computers around the world is even easier
because it solves the second problem. The trouble comes in when you want them to communicate with
one another to work together on a single problem.
➢ All communication between electronic (or optical) components ultimately reaches sending messages
between them.

MULTIPLE PROCESSOR SYSTEMS
(a) A shared-memory multiprocessor. (b) A message-passing multicomputer. (c) A wide area distributed system.

➢ shared-memory multiprocessors
• Every CPU has equal access to the entire physical memory and can read and write individual words using
LOAD and STORE instructions.
• Accessing a memory word usually takes 1–10 nsec.
• While it sounds simple, actually implementing it is not so simple and usually involves considerable message
passing.
• The message passing in the shared-memory multiprocessors is invisible to the programmers.
➢ Message-passing multicomputer
• The CPU-memory pairs are connected by a high-speed interconnect.
• Each memory is local to a single CPU and can be accessed only by that CPU. The CPUs communicate by
sending multiword messages over the interconnect.
• With a good interconnect, a short message can be sent in 10–50 µsec
• Multicomputers are much easier to build than multiprocessors, but they are harder to program.

➢ Distributed system
• This model connects complete computer systems over a wide area network, such as the Internet, to
form a distributed system.
• Each of these systems has its own memory and the systems communicate by message passing.
• Complete computers are used and message times are often 10–100 msec.
➢ Note
• The three types of systems differ in their delays by something like three orders of magnitude.

MULTIPROCESSORS
➢ A multiprocessor is a computer system in which two or more CPUs share full access to a common RAM.
➢ A program running on any of the CPUs sees a normal (usually paged) virtual address space.
➢ The only unusual property this system has is that the CPU can write some value into a memory word and
then read the word back and get a different value (because another CPU has changed it).
➢ This property forms the basis of interprocessor communication: one CPU writes some data into memory
and another one reads the data out.
➢ For the most part, multiprocessor operating systems are normal operating systems.
➢ However, they have unique features, such as process synchronization, resource management, and
scheduling.

MULTIPROCESSORS
➢ Multiprocessors hardware
• UMA (Uniform Memory Access) Multiprocessors
❖ UMA Multiprocessors with Bus-Based Architectures
❖ UMA Multiprocessors Using Crossbar Switches
❖ UMA Multiprocessors Using Multistage Switching Networks
• NUMA (Nonuniform Memory Access) Multiprocessors
❖ NC-NUMA (Non-Cache-coherent NUMA)
❖ CC-NUMA (Cache-Coherent NUMA)

MULTIPROCESSORS
➢ UMA Multiprocessors with Bus-Based Architectures
• The simplest multiprocessors are based on a single bus.
• Two or more CPUs and one or more memory modules all use the same bus for communication.
• When a CPU wants to read a memory word, it first checks to see if the bus is busy. If the bus is idle, the
CPU puts the address of the word it wants on the bus, asserts a few control signals, and waits until the
memory puts the desired word on the bus.
• If the bus is busy when a CPU wants to read or write memory, the CPU just waits until the bus becomes
idle.
• Problem: The system will be limited by the bandwidth of the bus, and most of the CPUs will be idle most of
the time.

MULTIPROCESSORS
➢ UMA Multiprocessors with Bus-Based Architectures …
Three bus-based multiprocessors. (a) Without caching. (b) With caching. (c) With caching and private memories.

MULTIPROCESSORS
➢ UMA Multiprocessors with Bus-Based Architectures …
• The solution to this problem is to add a cache to each CPU. Since many reads can now be performed
from the local cache, there will be much less bus traffic, and the system can support more CPUs.
When a word is referenced, its entire block, called a cache line, is fetched into the cache of the CPU.
• Another possibility is that each CPU has not only a cache but also a local, private memory which it
accesses over a private bus. The compiler should place all the program text, read-only data, and local
variables in the private memories. The shared memory is then only used for writable shared variables.
In most cases, this will greatly reduce bus traffic.

MULTIPROCESSORS
➢ UMA Multiprocessors Using Crossbar Switches
• Even with the best caching, the use of a single bus limits the size of a UMA multiprocessor to about 16 or
32 CPUs.
• Crossbar switches have been used for decades in telephone switching exchanges to connect a group of
incoming lines to a set of outgoing lines in an arbitrary way.
• At each intersection of a horizontal (incoming) and vertical (outgoing) line is a crosspoint. A crosspoint is a
small electronic switch that can be electrically opened or closed, depending on whether the horizontal and
vertical lines are to be connected or not.
• The nicest properties of the crossbar switch: a nonblocking network, no CPU is ever refused the connection
it needs because some crosspoint or line is already occupied (assuming the memory module itself is
available).

MULTIPROCESSORS
➢ UMA Multiprocessors Using Crossbar Switches...
• No prior planning is needed.
• It is always possible to connect the remaining
CPU to the remaining memory.
• Contention for memory is still possible, of
course, if two CPUs want to access the same
module at the same time.
• Problem: The number of crosspoints is n2.

MULTIPROCESSORS
➢ UMA Multiprocessors Using Multistage Switching Networks
• A completely different multiprocessor design is based on the 2 × 2 switch.
• This switch has two inputs and two outputs. Messages arriving on any of the input lines can be
switched to any of the output lines.
• Messages contain four parts: The Module field tells which memory to use. The Address specifies an
address within a module. The Opcode gives the operation, such as READ or WRITE. The optional
Value field may contain an operand, such as a 32-bit word to be written on a WRITE.
• The switch inspects the Module field and uses it to determine if the message should be sent on X or Y.

MULTIPROCESSORS
➢ UMA Multiprocessors Using Multistage Switching Networks …
• Our 2 × 2 switches can be arranged in many ways to build larger multistage switching networks.
• Omega switching network, eight CPUs to eight memories using 12 switches (n CPUs and n memories (n/2) log n
switches).
• Unlike the crossbar switch, the omega network is a blocking network. Not every set of requests can be processed
simultaneously.
• CPU 011 wants to read a word from
memory module 110 (a)
• CPU 001 wants to write a word to
Memory module 001 (b)
• CPU 000 simultaneously wanted to
access memory module 000

MULTIPROCESSORS
➢ NUMA Multiprocessors
• Single-bus UMA multiprocessors are generally limited to a few CPUs, and crossbar or switched multiprocessors
need a lot of (expensive) hardware. To get to more than 100 CPUs, something new is needed.
• In NUMA, the memory access time depends on the memory location relative to the processor. Under NUMA, a
processor can access its own local memory faster than non-local memory (memory shared between processors)
• NUMA machines have three key characteristics:
1. There is a single address space visible to all CPUs.
2. Access to remote memory is via LOAD and STORE instructions.
3. Access to remote memory is slower than access to local memory.

MULTIPROCESSORS
➢ NUMA Multiprocessors …
• There are two types of NUMA:
❖ NC-NUMA (Non Cache-coherent NUMA)
❖ CC-NUMA (Cache-Coherent NUMA)
❑ A popular approach for building large CC-NUMA multiprocessors is the directory-based
multiprocessor.
❑ The idea is to maintain a database telling where each cache line is and what its status is.
❑ Since this database is queried on every instruction that accesses memory, it must be kept in very fast
hardware.
❑ An obvious limitation of this design is that a line can be cached at only one node.

MULTIPROCESSOR OPERATING SYSTEM TYPES
➢ Each CPU has its own operating system
• The simplest possible way to organize a multiprocessor operating system is to statically divide memory
into as many partitions as there are CPUs and give each CPU its own private memory and its own private
copy of the operating system.
• The n CPUs operate as n independent computers.
• All the CPUs share the operating system code and make private copies of only the operating system data
structures.
• This scheme is better than having n separate computers since it allows all the machines to share a set of
disks and other I/O devices, and it also allows the memory to be shared.

➢ Each CPU has its own operating system …
• One CPU can be given an extra-large portion of the memory so it can handle large programs efficiently.
• Processes can efficiently communicate with one another by allowing a producer to write data directly
into memory and allowing a consumer to fetch it from the place the producer wrote it.
• Each CPU has its own operating system is primitive.

• Four aspects of this design that may not be obvious:
1. When a process makes a system call, the system call is caught and handled on its own CPU using the data
structures in that operating system’s tables.
2. Each operating system has its own tables, it also has its own set of processes that it schedules by itself.
There is no sharing of processes. CPU 1 may be idle while CPU 2 is loaded with work.
3. There is no sharing of physical pages. There is no way for CPU 2 to borrow some pages from CPU 1 since
the memory allocation is fixed.
4. Each operating system maintains a buffer cache of recently used disk blocks. A certain disk block can be
present and dirty in multiple buffer caches at the same time, leading to inconsistent results. To avoid this
problem eliminate the buffer caches. Doing so is not hard, but it hurts performance considerably.

• This model is rarely used in production systems, although it was used in the early days of
multiprocessors when the goal was to port existing operating systems to some new multiprocessor as fast
as possible.
• If the state of each processor is completely local, there is very little sharing that can lead to consistency
or locking problems.

➢ Master-Slave Multiprocessors
• One copy of the operating system and its tables are present on CPU 1 and not on any of the others.
• All system calls are redirected to CPU 1 for processing.
• If there is still CPU time available, CPU 1 can additionally execute user processes.
• This model is called master-slave since CPU 1 is the master and all the others are slaves.

➢ Master-Slave Multiprocessors …
• The master-slave model solves most of the problems of the first model.
• There is a list or a set of prioritized lists that tracks of ready processes.
• When a CPU goes idle, it asks the operating system on CPU 1 for a process to run and is assigned
one. Thus it can never happen that one CPU is idle while another is overloaded.
• Pages can be allocated among all the processes dynamically and there is only one buffer cache, so
inconsistencies never occur.

➢ Master-Slave Multiprocessors …
• The problem with this model is that with many CPUs, the master will become a bottleneck.
• It must handle all system calls from all CPUs. If, say, 10% of all time is spent handling system calls,
then 10 CPUs will pretty much saturate the master, and with 20 CPUs it will be completely
overloaded.
• This model is simple and workable for small multiprocessors, but for large ones, it fails.

➢ Symmetric Multiprocessors
• SMP (Symmetric MultiProcessor) eliminates the asymmetry.
• There is one copy of the operating system in memory that any CPU can run it.
• When a system call is made, the CPU on which the system call was made traps the kernel and processes the
system call.
• The TRAP instruction switches from user mode to kernel mode.

➢ Symmetric Multiprocessors …
• This model balances processes and memory dynamically because there is only one set of operating system
tables.
• It also eliminates the master CPU bottleneck.
• Problem: Imagine two CPUs simultaneously picking the same process to run or requesting the same free
memory page.
• Solution: Associate a mutex (lock) with the operating system, making the whole system one big critical
region. When a CPU wants to run operating system code, it must first acquire the mutex. If the mutex is
locked, it just waits. This approach is something called a big kernel lock.
• This model works but is almost as bad as the master-slave model. Suppose that 10% of all run time is spent
inside the operating system. With 20 CPUs, there will be long queues of CPUs waiting to get in.

• Improvement is easy.
• Many parts of the operating system are independent of one another. For example, there is no problem
with one CPU running the scheduler while another CPU is handling a file-system call.
• Operating system can be divided into multiple independent critical regions that do not interact with
one another.
• Each critical region is protected by its own mutex, so only one CPU at a time can execute it.
• Much more parallelism can be achieved.
• It may well happen that some tables, such as the process table, are used by multiple critical regions.
Each table that may be used by multiple critical regions needs its own mutex.
• Each critical region can be executed by only one CPU at a time and each critical table can be accessed
by only one CPU at a time.

• Most modern multiprocessors use this arrangement.
• The hard part about writing the operating system is not that the actual code is so different from a
regular operating system, it is splitting it into critical regions that can be executed concurrently by
different CPUs without interfering with one another
• Every table used by two or more critical regions must be separately protected by a mutex and all code
that uses the table must use the mutex correctly.
• Great care must be taken to avoid deadlocks.

➢ Multiprocessor Synchronization
• The CPUs in a multiprocessor frequently need to synchronize.
• If a process on a uniprocessor machine (just one CPU) makes a system call that requires accessing
some critical kernel table, the kernel code can just disable interrupts before touching the table and the
process can do its work without any other process sneaking in.
• On a multiprocessor, disabling interrupts affects only the CPU doing the disable. Other CPUs continue
to run and can still touch the critical table.
• A proper mutex protocol must be used and respected by all CPUs to guarantee that mutual exclusion
works.

➢ Multiprocessor Synchronization …
• The heart of any practical mutex protocol is a special instruction that allows a memory word to be
inspected and set in one indivisible operation.
• TSL (Test and Set Lock), reads a memory word and stores it in a register. Simultaneously, it writes a 1
(or some other nonzero value) into the memory word. It takes two bus cycles to perform the memory
read and memory write.
• On a uniprocessor, as long as the instruction cannot be broken off halfway, TSL always works as
expected.

• In a multiprocessor, for example, both CPUs got a 0 back from the TSL instruction, so both of them
now have access to the critical region and the mutual exclusion fails.
• To prevent this problem, the TSL instruction must first lock the bus, preventing other CPUs from
accessing it, then do both memory accesses, and then unlock the bus.

• If TSL is correctly implemented and used, mutual exclusion is guaranteed.
• This mutual exclusion method uses a spin lock because the requesting CPU just sits in a tight loop
testing the lock as fast as it can.
• Not only does it completely waste the time of the requesting CPU (or CPUs), but it may also put a
massive load on the bus or memory, seriously slowing down all other CPUs trying to do their normal
work.
• A way to reduce bus traffic is to use a delay loop that can be inserted between polls. Initially, the delay
is one instruction. If the lock is still busy, the delay is doubled to two instructions, then four
instructions, and so on up to some maximum.

➢ Multiprocessor Scheduling
• Back in the old days, when all processes were single-threaded, processes were scheduled, and there was
nothing else schedulable.
• Modern operating systems support multithreaded processes, which makes scheduling more complicated.
• It matters whether the threads are kernel threads or user threads.
• If threading is done by a user-space library and the kernel knows nothing about the threads, then per-process
scheduling is done as usual.
• If the kernel does not even know threads exist, it can hardly schedule them.
• With kernel threads, the kernel is aware of all the threads and can choose among the threads of a process.

➢ Multiprocessor Scheduling in user space
• On a uniprocessor, scheduling is one-dimensional. The only question that must be answered is:
‘‘Which thread should be run next?’’
• On a multiprocessor, scheduling has two dimensions. The scheduler has to decide which thread to run
and which CPU to run it on. This extra dimension greatly complicates scheduling on multiprocessors.
• Another complicating factor in some systems, all of the threads are unrelated, belonging to different
processes and having nothing to do with one another. For example, a server system in which
independent users run independent processes. The threads of different processes can be scheduled
without regard to the other ones. In others, they come in groups, all belonging to the same application
and working together.

➢ Multiprocessor Scheduling, Time Sharing
• The simplest scheduling algorithm for dealing with unrelated threads is to have a single data structure for ready
threads, possibly just a list, but more likely a set of lists for threads at different priorities.
• The first CPU to finish its current work locks the scheduling queues and selects the highest-priority thread.
• As long as the threads are completely unrelated, this way is a reasonable choice and very simple to implement
efficiently.

• Having a single scheduling data structure provides automatic load balancing because it can never
happen that one CPU is idle while others are overloaded.
• Two disadvantages of this approach:
❖ Potential contention for the scheduling data structure as the number of CPUs grows.
❖ Usual overhead in doing a context switch when a thread blocks for I/O or a thread’s quantum
expires.
✓ A thread quantum is the amount of time a thread is allowed to execute before OS interrupts the
thread and lets a different thread of the same priority level execute.

➢ Smart scheduling:
• Suppose that the thread holds a spin lock when its quantum expires.
• Other CPUs waiting on the spin lock just waste their time spinning until that thread is scheduled
again and releases the lock.
• To get around this anomaly, some systems use smart scheduling, in which a thread acquiring a spin
lock sets a flag to show that it currently has a spin lock. When it releases the lock, it clears the flag.
The scheduler then does not stop a thread holding a spin lock but instead gives it a little more time to
complete its critical region and release the lock.

➢ Affinity scheduling:
• Another issue is when thread A has run for a long time on CPU k, the CPU k’s cache will be full of A’s
blocks. If A gets to run again soon, it may perform better if it is run on CPU k, because its cache may still
contain some of A’s blocks. Having cache blocks preloaded will increase the cache hit rate and thus the
thread’s speed.
• Affinity scheduling, the basic idea is to try running a thread on the same CPU it ran on last time. One way
to create this affinity is to use a two-level scheduling algorithm. When a thread is created, it is assigned to
a CPU. This assignment of threads to CPUs is the top level of the algorithm. The actual scheduling of the
threads is the bottom level of the algorithm. Each CPU is given priorities and is tried to keep a thread on
the same CPU for its entire lifetime, cache affinity is maximized. However, if a CPU has no threads to run,
it takes one from another CPU rather than go idle.

➢ Affinity scheduling:
• Two-level scheduling has three benefits.
❖ It distributes the load over the available CPUs almost equally.
❖ The advantage of cache affinity is used whenever possible.
❖ By giving each CPU its list, contention for the ready lists is minimized because attempts to use
another CPU’s list are relatively rare.

➢ Multiprocessor Scheduling, Space Sharing
• Another general approach to multiprocessor scheduling can be when threads are related to one
another.
• It also often occurs that a single process has multiple threads that work together.
• If the threads of a process communicate a lot, it is useful to have them running at the same time.
• Scheduling multiple threads at the same time across multiple CPUs is called space sharing.

• The simplest space-sharing algorithm assumes that an entire group of related threads is created at once.
• At the time it is created, the scheduler checks to see if there are as many free CPUs as there are threads.
• If there are, each thread is given its dedicated CPU and they all start.
• If there are not enough CPUs, none of the threads are started until enough CPUs are available.
• Each thread holds onto its CPU until it terminates, at which time the CPU is placed back into the pool of
available CPUs.
• If a thread blocks on I/O, it continues to hold the CPU, which is simply idle until the thread wakes up.
• When the next batch of threads appears, the same algorithm is applied.

• At any given moment, the set of CPUs is statically partitioned into several partitions, each one
running the threads of one process.
• As time passes, the number and size of the partitions will change as new threads are created and old
ones terminate.
• For instance, a set of 32 CPUs split into four partitions, with two CPUs available.

• In this simple partitioning model, a process just asks for several CPUs and either gets them all or has
to wait until they are available.
• An alternative approach is to allow threads to actively manage the degree of parallelism.
• One method for managing parallelism is to have a central server that keeps track of which threads are
running and want to run and what their minimum and maximum CPU requirements are.
• Periodically, each application queries the central server to ask how many CPUs it can use.
• The central server adjusts the number of threads up or down to match what is available.

• For example, a Web server can have 5, 10, 20, or any other number of threads running in parallel.
• If it currently has 10 threads and there is suddenly more demand for CPUs and it is told to drop to five
when the next five threads finish their current work, they are told to exit instead of being given new
work.
• This scheme allows the partition sizes to vary dynamically to match the current workload better than
the fixed system.

operating system design for new computer architecture

More Related Content

Similar to operating system design for new computer architecture

Recently uploaded

operating system design for new computer architecture