189 Matching Annotations
  1. Mar 2025
    1. Server-initiated approach. The server records, for each client, the files (or parts of files) that it caches. When the server detects a potential inconsistency, it must react. A potential for inconsistency occurs when two different clients in conflicting modes cache a file. If UNIX semantics (Section 15.7) is implemented, we can resolve the potential inconsistency by having the server play an active role. The server must be notified whenever a file is opened, and the intended mode (read or write) must be indicated for every open. The server can then act when it detects that a file has been opened simultaneously in conflicting modes by disabling caching for that particular file. Actually, disabling caching results in switching to a remote-service mode of operation.

      In the server-initiated approach, the server monitors the files cached by clients and sends notifications when data may have changed. The consistency model chosen impacts both performance and reliability; for instance, systems like HDFS simplify consistency by restricting write operations, while systems like GFS allow concurrent writes but with more complex consistency management.

    2. Client-initiated approach. The client initiates a validity check in which it contacts the server and checks whether the local data are consistent with the master copy. The frequency of the validity checking is the crux of this approach and determines the resulting consistency semantics. It can range from a check before every access to a check only on first access to a file (on file open, basically). Every access coupled with a validity check is delayed, compared with an access served immediately by the cache. Alternatively, checks can be initiated at fixed time intervals. Depending on its frequency, the validity check can load both the network and the server.

      Clients often cache data locally for faster access, but this cached data may become outdated, leading to potential inconsistencies. To address this, there are two primary approaches to verifying the validity of cached data. In the client-initiated approach, the client checks with the server to determine if the cached data is still valid before using it. This method may involve periodic checks, with more frequent checks increasing the load on the network and server.
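
      A minimal sketch of the client-initiated check described above, assuming a hypothetical `server` object that exposes `get_version()` and `read()`; the client trusts its copy between periodic checks and refetches only when the server reports a newer version:

      ```python
      import time

      class CachedFile:
          def __init__(self, data, version):
              self.data = data
              self.version = version

      class Client:
          def __init__(self, server, check_interval=30):
              self.server = server            # hypothetical remote file service
              self.cache = {}                 # path -> CachedFile
              self.last_check = {}            # path -> time of last validation
              self.check_interval = check_interval

          def open(self, path):
              cached = self.cache.get(path)
              now = time.time()
              recently_checked = (
                  cached is not None
                  and now - self.last_check.get(path, 0) < self.check_interval
              )
              if cached is not None and recently_checked:
                  return cached.data          # trust the cache between checks
              server_version = self.server.get_version(path)
              self.last_check[path] = now
              if cached is not None and cached.version == server_version:
                  return cached.data          # still consistent with the master copy
              data = self.server.read(path)   # stale or missing: refetch from the server
              self.cache[path] = CachedFile(data, server_version)
              return data
      ```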

    3. Cache-Update Policy

      Cache-update policies are essential in maintaining data consistency between cached copies and the server's master copy in distributed file systems (DFS). The write-through policy ensures reliability by immediately writing changes to the server when data is modified in the cache. This approach minimizes data loss in case of system crashes but results in slower write operations. In contrast, the delayed-write policy, also known as write-back caching, improves write performance by postponing updates to the server. This method allows the cache to handle modifications quickly, but it introduces the risk of losing data if the system crashes before the changes are written to the server. Variations of the delayed-write policy include writing data when a cache block is evicted or periodically flushing modified data to the server. Each policy strikes a balance between performance and reliability, and the appropriate choice depends on the system's tolerance for data loss and the importance of fast write operations.
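
      The two policies can be contrasted in a few lines. The sketch below is illustrative only (the `server` object is a stand-in for the remote file service): write-through forwards every write immediately, while write-back batches dirty blocks until a flush:

      ```python
      class WriteThroughCache:
          def __init__(self, server):
              self.server = server
              self.blocks = {}

          def write(self, block_id, data):
              self.blocks[block_id] = data
              self.server.write(block_id, data)   # reliable, but every write pays a network round trip

      class WriteBackCache:
          def __init__(self, server):
              self.server = server
              self.blocks = {}
              self.dirty = set()

          def write(self, block_id, data):
              self.blocks[block_id] = data
              self.dirty.add(block_id)            # fast: the remote update is deferred

          def flush(self):
              # Called periodically or when a block is evicted. Anything written
              # since the last flush is lost if the client crashes before this runs.
              for block_id in list(self.dirty):
                  self.server.write(block_id, self.blocks[block_id])
              self.dirty.clear()
      ```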

    4. Cache Location Where should the cached data be stored—on disk or in main memory? Disk caches have one clear advantage over main-memory caches: they are reliable. Modifications to cached data are lost in a crash if the cache is kept in volatile memory. Moreover, if the cached data are kept on disk, they are still there during recovery, and there is no need to fetch them again. Main-memory caches have several advantages of their own, however:

       - Main-memory caches permit workstations to be diskless.
       - Data can be accessed more quickly from a cache in main memory than from one on a disk.
       - Technology is moving toward larger and less expensive memory. The resulting performance speedup is predicted to outweigh the advantages of disk caches.
       - The server caches (used to speed up disk I/O) will be in main memory regardless of where user caches are located; if we use main-memory caches on the user machine, too, we can build a single caching mechanism for use by both servers and users.

       Many remote-access implementations can be thought of as hybrids of caching and remote service. In NFS, for instance, the implementation is based on remote service but is augmented with client- and server-side memory caching for performance. Thus, to evaluate the two methods, we must evaluate the degree to which either method is emphasized. The NFS protocol and most implementations do not provide disk caching (but OpenAFS does).

      The location where cached data is stored—either in main memory or on disk—has significant implications for system performance and reliability. Storing cached data in main memory offers faster access times compared to disk storage, making it an attractive option for systems that require high performance. However, main memory is volatile, meaning that cached data may be lost in the event of a crash, which introduces reliability concerns. In contrast, disk caches are non-volatile and ensure that cached data is preserved even during system failures, but they are slower due to the inherent latency of disk I/O. Hybrid solutions, such as using main memory for high-speed access and disk for persistent storage, combine the advantages of both. The choice between memory and disk caching often depends on the system's resource constraints and performance requirements, with main memory being more favorable for high-performance, low-latency systems and disk being preferred for reliability and persistence.

    5. Caching Scheme

      Caching is a performance-enhancing mechanism widely used in distributed file systems (DFS). In a basic caching scheme, the system retains recently accessed data in local memory to avoid repeated remote data retrieval. When a client requests data not in the cache, the system retrieves it from the server and stores a copy locally. The cache serves as a temporary storage, ensuring that subsequent accesses to the same data can be handled more quickly and with reduced network traffic. The system uses a replacement policy, such as least-recently-used (LRU), to manage cache size and evict older or less frequently accessed data. While caching speeds up read operations, it also introduces challenges in maintaining consistency between the cached copy and the server's master copy. This is a core issue in DFS, as modifications to cached data need to be reflected on the server to ensure data integrity, a problem that is typically addressed through cache-consistency protocols.
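
      A small illustration of the scheme just described, using an LRU replacement policy over file blocks; `fetch_from_server` is a hypothetical callable standing in for the remote read:

      ```python
      from collections import OrderedDict

      class LRUBlockCache:
          def __init__(self, fetch_from_server, capacity=4):
              self.fetch = fetch_from_server      # callable: block_id -> data
              self.capacity = capacity
              self.entries = OrderedDict()        # block_id -> data, kept in LRU order

          def read(self, block_id):
              if block_id in self.entries:
                  self.entries.move_to_end(block_id)   # cache hit: mark most recently used
                  return self.entries[block_id]
              data = self.fetch(block_id)              # cache miss: go to the server
              self.entries[block_id] = data
              if len(self.entries) > self.capacity:
                  self.entries.popitem(last=False)     # evict the least recently used block
              return data

      # Usage: cache = LRUBlockCache(lambda b: remote_read(b)); cache.read(("file.txt", 0))
      ```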

    6. 19.7.2 Naming Schemes There are three main approaches to naming schemes in a DFS. In the simplest approach, a file is identified by some combination of its host name and local name, which guarantees a unique system-wide name. In Ibis, for instance, a file is identified uniquely by the name host:local-name, where local-name is a UNIX-like path. The Internet URL system also uses this approach. This naming scheme is neither location transparent nor location independent. The DFS is structured as a collection of isolated component units, each of which is an entire conventional file system. Component units remain isolated, although means are provided to refer to remote files. We do not consider this scheme any further here.

       The second approach was popularized by NFS. NFS provides a means to attach remote directories to local directories, thus giving the appearance of a coherent directory tree. Early NFS versions allowed only previously mounted remote directories to be accessed transparently. The advent of the automount feature allowed mounts to be done on demand based on a table of mount points and file-structure names. Components are integrated to support transparent sharing, but this integration is limited and is not uniform, because each machine may attach different remote directories to its tree. The resulting structure is versatile.

       We can achieve total integration of the component file systems by using a third approach. Here, a single global name structure spans all the files in the system. OpenAFS provides a single global namespace for the files and directories it exports, allowing a similar user experience across different client machines. Ideally, the composed file-system structure is the same as the structure of a conventional file system. In practice, however, the many special files (for example, UNIX device files and machine-specific binary directories) make this goal difficult to attain.

       To evaluate naming structures, we look at their administrative complexity. The most complex and most difficult-to-maintain structure is the NFS structure. Because any remote directory can be attached anywhere on the local directory tree, the resulting hierarchy can be highly unstructured. If a server becomes unavailable, some arbitrary set of directories on different machines becomes unavailable. In addition, a separate accreditation mechanism controls which machine is allowed to attach which directory to its tree. Thus, a user might be able to access a remote directory tree on one client but be denied access on another client.

      Implementing transparent naming in a DFS requires a mapping between file names and their storage locations. This mapping must be managed at the component-unit level rather than for individual files, as files are often aggregated into directories or larger units. Replication and caching are common techniques used to enhance the availability and performance of the naming system. In a location-independent DFS, lower-level identifiers are used to map files to their storage locations, ensuring that file names remain consistent even if files are moved. These identifiers can be replicated and cached, allowing for faster access and ensuring that migration does not disrupt the system’s operations. Structured names, which break down the identifier into distinct parts, are often used to create unique and persistent file identifiers. While this approach helps maintain consistency and efficiency, it also introduces complexity by requiring additional mapping layers and consistent updates to keep track of file migrations. These techniques ensure that DFSs can scale effectively while maintaining high performance and reliability.
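
      One way to picture location-independent naming, as a rough sketch with made-up names: path prefixes map to component-unit identifiers, and a separate (replicable, cacheable) table maps each unit to whichever server currently hosts it, so migration changes only the second table:

      ```python
      mount_table = {                # path prefix -> component-unit id
          "/home": "cu-17",
          "/projects": "cu-42",
      }

      unit_locations = {             # component-unit id -> current server (changes on migration)
          "cu-17": "fileserver-a.example.edu",
          "cu-42": "fileserver-b.example.edu",
      }

      def resolve(path):
          for prefix, unit in mount_table.items():
              if path.startswith(prefix):
                  relative = path[len(prefix):] or "/"
                  return unit_locations[unit], unit, relative
          raise FileNotFoundError(path)

      # Migrating cu-42 to another server only updates unit_locations;
      # user-visible names such as "/projects/report.txt" stay the same.
      print(resolve("/projects/report.txt"))
      ```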

    7. Naming Structures

      DFS naming schemes are essential for defining how files are identified and located in a distributed system. Location transparency ensures that users can access files without knowing their physical storage locations, while location independence allows files to move between locations without requiring changes to their names. Several naming approaches can be used in a DFS. The simplest scheme identifies files using a combination of the host name and local file name, but this approach lacks both location transparency and location independence. More advanced schemes, like those used in NFS, integrate remote directories into the local directory structure, offering transparent access to files. Still more sophisticated schemes, such as OpenAFS's global namespace, aim to provide complete integration and location independence, ensuring a consistent user experience across different client machines. While location transparency is easier to implement, location independence offers greater flexibility for handling file migrations, making it a more scalable and resilient solution for large-scale distributed systems.

    8. As the amount of data, I/O workload, and processing expands, so does the need for a DFS to be fault-tolerant and scalable. Large bottlenecks cannot be tolerated, and system component failures must be expected. Cluster-based architecture was developed in part to meet these needs. Figure 19.13 illustrates a sample cluster-based DFS model. This is the basic model presented by the Google file system (GFS) and the Hadoop distributed file system (HDFS). One or more clients are connected via a network to a master metadata server and several data servers that house “chunks” (or portions) of files. The metadata server keeps a mapping of which data servers hold chunks of which files, as well as a traditional hierarchical mapping of directories and files. Each file chunk is stored on a data server and is replicated a certain number of times (for example, three times) to protect against component failure and for faster access to the data (servers containing the replicated chunks have fast access to those chunks).

      The cluster-based DFS model is designed for large-scale, data-intensive applications that require high fault tolerance, scalability, and efficient data access. This model is exemplified by systems like the Google File System (GFS) and Hadoop Distributed File System (HDFS). In this architecture, clients interact with a master metadata server that manages the file system’s structure and location of data, which is distributed across multiple data servers. Each data server stores chunks of files, and these chunks are replicated across servers to ensure data redundancy and faster access. The cluster-based model excels in environments with massive amounts of data because it allows for parallel reading and writing of file chunks across multiple servers, improving throughput and system efficiency. The master server maintains metadata, such as file locations and directory structure, but clients can directly interact with the data servers for file access, reducing the load on the metadata server. This model is widely used in big data applications, where scalability and fault tolerance are paramount.
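
      A toy model of the GFS/HDFS-style split, with illustrative server names and the 64 MB chunk size GFS uses: the metadata server only maps (file, chunk) to replica locations, and clients then contact data servers directly:

      ```python
      import random

      CHUNK_SIZE = 64 * 1024 * 1024            # 64 MB chunks, as in GFS
      REPLICATION = 3

      class MetadataServer:
          def __init__(self, data_servers):
              self.data_servers = data_servers
              self.chunk_map = {}              # (path, chunk index) -> list of data servers

          def allocate_chunk(self, path, index):
              replicas = random.sample(self.data_servers, REPLICATION)
              self.chunk_map[(path, index)] = replicas
              return replicas

          def lookup(self, path, offset):
              index = offset // CHUNK_SIZE
              return index, self.chunk_map[(path, index)]

      master = MetadataServer(["ds1", "ds2", "ds3", "ds4", "ds5"])
      master.allocate_chunk("/logs/2025-03.log", 0)
      # A client asks the master only for locations, then reads the chunk from a data server.
      print(master.lookup("/logs/2025-03.log", offset=10_000_000))
      ```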

    9. The Client–Server DFS Model

      In the client-server DFS model, file services are provided by servers that store both files and metadata, while clients access these files over a network. Clients send requests to the server to access or modify files, and the server is responsible for managing file permissions, authentication, and maintaining the consistency of the files. A critical aspect of this model is that the server holds the master copy of the file, and any changes made by clients are communicated back to the server to ensure consistency. This model can be seen in traditional DFS implementations like NFS, where clients access remote directories and files as if they were local. One of the key challenges in the client-server DFS model is the potential for a single point of failure; if the server fails, all file access is disrupted. Furthermore, the server can become a bottleneck, as all requests for both data and metadata must pass through it. To mitigate these issues, techniques like load balancing, clustering, and redundancy are often employed to enhance reliability and scalability.

    10. Still another issue is scalability

      Scalability is a critical factor for distributed systems, enabling them to handle increased workloads efficiently as they grow. A scalable system can adapt to the demands of a larger number of users, data, or processing tasks without significant performance degradation. Scalability is achieved through careful system design that allows for the addition of resources, such as computational power, storage, or network capacity, without causing bottlenecks or overloads. For example, in cloud computing environments, new servers or storage units can be added to meet growing demands. Scalability is related to fault tolerance, as systems that can handle increased loads are often more resilient to failures. Techniques such as load balancing, data replication, and efficient network routing help ensure that resources are used optimally. However, scalability also presents challenges, such as the potential for network congestion or resource contention, which must be addressed through intelligent system design and resource management to maintain performance as the system expands.

    11. Making the multiple processors and storage devices in a distributed system transparent to the users has been a key challenge to many designers. Ideally, a distributed system should look to its users like a conventional, centralized system. The user interface of a transparent distributed system should not distinguish between local and remote resources. That is, users should be able to access remote resources as though these resources were local, and the distributed system should be responsible for locating the resources and for arranging for the appropriate interaction. Another aspect of transparency

      Transparency in distributed systems is the ability to abstract the complexities of distributed resources from users, making remote resources appear as if they are local. This allows users to interact with the system without needing to know where the data is stored or where processes are running. Transparency includes location transparency, meaning the location of resources does not affect how they are accessed, and mobility transparency, allowing users to access their environment from different locations without disruptions. Achieving transparency simplifies the user experience and improves system usability. Distributed systems aim to hide the underlying complexity of managing distributed resources, providing a consistent interface for accessing data and services. The importance of transparency is evident in systems that support user mobility, such as cloud-based services, where users can access their files and applications from any device or location. Transparency in distributed systems enhances user convenience and reduces the burden of managing remote resources, making the system more intuitive and accessible.

    12. be fault tolerant in

      Fault tolerance and robustness are critical considerations in the design of distributed systems. A robust system can continue operating despite hardware failures, such as the failure of a link, host, or site. To achieve fault tolerance, distributed systems must have mechanisms to detect failures, reconfigure the system to bypass the faulty components, and recover once the issue is resolved. The goal is to ensure that failures do not result in system downtime, maintaining service availability and reliability. For instance, redundancy in both hardware and software ensures that backup resources are available if the primary ones fail. Additionally, systems should be designed to degrade gracefully, meaning performance may be reduced during failure, but the system remains operational. This is especially important in distributed environments, where multiple nodes are responsible for handling tasks. Fault tolerance mechanisms such as RAID (Redundant Array of Independent Disks) ensure data availability, while network redundancy helps avoid communication breakdowns, keeping the system resilient.
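
      A sketch of one common failure-detection mechanism (a heartbeat monitor); this is illustrative rather than drawn from the excerpt, and the node names are made up:

      ```python
      import time

      HEARTBEAT_INTERVAL = 2.0      # seconds between expected heartbeats
      SUSPECT_AFTER = 3             # missed intervals before a node is declared failed

      class FailureDetector:
          def __init__(self):
              self.last_seen = {}   # node name -> time of last heartbeat

          def heartbeat(self, node):
              self.last_seen[node] = time.time()

          def failed_nodes(self):
              deadline = time.time() - HEARTBEAT_INTERVAL * SUSPECT_AFTER
              return [n for n, t in self.last_seen.items() if t < deadline]

      detector = FailureDetector()
      detector.heartbeat("siteB")
      # A monitoring loop would periodically call detector.failed_nodes() and
      # hand the work of any node listed there to a healthy node.
      ```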

    13. Suppose a user on site A wants to access data (such as a file) that reside at site B. The system can transfer the data by one of two basic methods. One approach to data migration is to transfer the entire file to site A. From that point on, all access to the file is local. When the user no longer needs access to the file, a copy of the file (if it has been modified) is sent back to site B. Even if only a modest change has been made to a large file, all the data must be transferred. This mechanism can be thought of as an automated FTP system. This approach was used in the Andrew file system, but it was found to be too inefficient. The other approach is to transfer to site A only those portions of the file that are actually necessary for the immediate task. If another portion is required later, another transfer will take place. When the user no longer wants to access the file, any part of it that has been modified must be sent back to site B. (Note the similarity to demand paging.) Most modern distributed systems use this approach.

       Whichever method is used, data migration includes more than the mere transfer of data from one site to another. The system must also perform various data translations if the two sites involved are not directly compatible (for instance, if they use different character-code representations or represent integers with a different number or order of bits).

       19.4.2.2 Computation Migration In some circumstances, we may want to transfer the computation, rather than the data, across the system; this process is called computation migration. For example, consider a job that needs to access various large files that reside at different sites, to obtain a summary of those files. It would be more efficient to access the files at the sites where they reside and return the desired results to the site that initiated the computation. Generally, if the time to transfer the data is longer than the time to execute the remote command, the remote command should be used.

       Such a computation can be carried out in different ways. Suppose that process P wants to access a file at site A. Access to the file is carried out at site A and could be initiated by an RPC. An RPC uses network protocols to execute a routine on a remote system (Section 3.8.2). Process P invokes a predefined procedure at site A. The procedure executes appropriately and then returns the results to P. Alternatively, process P can send a message to site A. The operating system at site A then creates a new process Q whose function is to carry out the designated task. When process Q completes its execution, it sends the needed result back to P via the message system. In this scheme, process P may execute concurrently with process Q. In fact, it may have several processes running concurrently on several sites. Either method could be used to access several files (or chunks of files) residing at various sites. One RPC might result in the invocation of another RPC or even in the transfer of messages to another site. Similarly, process Q could, during the course of its execution, send a message to another site, which in turn would create another process. This process might either send a message back to Q or repeat the cycle.

       19.4.2.3 Process Migration A logical extension of computation migration is process migration. When a process is submitted for execution, it is not always executed at the site at which it is initiated. The entire process, or parts of it, may be executed at different sites. This scheme may be used for several reasons:

       - Load balancing. The processes (or subprocesses) may be distributed across the sites to even the workload.
       - Computation speedup. If a single process can be divided into a number of subprocesses that can run concurrently on different sites or nodes, then the total process turnaround time can be reduced.
       - Hardware preference. The process may have characteristics that make it more suitable for execution on some specialized processor (such as matrix inversion on a GPU) than on a microprocessor.
       - Software preference. The process may require software that is available at only a particular site, and either the software cannot be moved, or it is less expensive to move the process.
       - Data access. Just as in computation migration, if the data being used in the computation are numerous, it may be more efficient to have a process run remotely (say, on a server that hosts a large database) than to transfer all the data and run the process locally.

       We use two complementary techniques to move processes in a computer network. In the first, the system can attempt to hide the fact that the process has migrated from the client. The client then need not code her program explicitly to accomplish the migration. This method is usually employed for achieving load balancing and computation speedup among homogeneous systems, as they do not need user input to help them execute programs remotely. The other approach is to allow (or require) the user to specify explicitly how the process should migrate. This method is usually employed when the process must be moved to satisfy a hardware or software preference.

       You have probably realized that the World Wide Web has many aspects of a distributed computing environment. Certainly it provides data migration (between a web server and a web client). It also provides computation migration. For instance, a web client could trigger a database operation on a web server. Finally, with Java, Javascript, and similar languages, it provides a form of process migration: Java applets and Javascript scripts are sent from the server to the client, where they are executed. A network operating system provides most of these features, but a distributed operating system makes them seamless and easily accessible. The result is a powerful and easy-to-use facility—one of the reasons for the huge growth of the World Wide Web.

      In distributed systems, data migration, computation migration, and process migration are crucial strategies for optimizing performance and resource utilization. Data migration involves transferring files from one site to another, with two main approaches: moving the entire file or transferring only the necessary portions of it. The second method, often used in modern distributed systems, is more efficient as it minimizes the data that needs to be moved at any given time. Computation migration, on the other hand, involves transferring computational tasks rather than data. This is particularly useful when accessing remote files or resources—executing tasks at the location of the data can reduce latency and avoid unnecessary data transfers. Process migration takes the idea further, enabling entire processes or parts of them to move across systems based on factors like load balancing, computational speedup, hardware preferences, or data access needs. These strategies enhance the flexibility and efficiency of distributed systems, ensuring that workloads are handled optimally and without bottlenecks.
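
      A sketch of computation migration using Python's standard xmlrpc modules as a stand-in for a generic RPC layer; the host name and file path are illustrative. Rather than copying the large file to the client, the client invokes a summary procedure at the site that holds the file and receives only the small result:

      ```python
      # --- at site A, where the file lives ---
      from xmlrpc.server import SimpleXMLRPCServer

      def summarize(path):
          lines = words = 0
          with open(path) as f:
              for line in f:
                  lines += 1
                  words += len(line.split())
          return {"lines": lines, "words": words}

      def serve():
          server = SimpleXMLRPCServer(("0.0.0.0", 8000), allow_none=True)
          server.register_function(summarize)
          server.serve_forever()

      # --- at the site that initiated the computation ---
      from xmlrpc.client import ServerProxy

      def remote_summary():
          site_a = ServerProxy("http://siteA.example.edu:8000")
          # Only the small summary dictionary crosses the network, not the file itself.
          return site_a.summarize("/data/large-log.txt")
      ```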

    14. Distributed Operating Systems

      Distributed operating systems extend the capabilities of network operating systems by providing a more seamless experience. In these systems, remote resources are accessed in the same way as local resources, allowing users to interact with data and processes as if they were on the same machine. A distributed operating system manages the migration of data, computation, and processes across different sites, which can involve transferring files, executing computations on remote machines, or moving processes to optimize resource utilization. This flexibility makes distributed operating systems suitable for environments that require high scalability and fault tolerance. For example, when handling large datasets, a distributed system might migrate computational tasks to the machine where the data resides, reducing the need to transfer large files. Distributed operating systems abstract the complexity of remote resource management, providing a unified system for users and enhancing system efficiency by dynamically allocating resources based on demand.

    15. A network operating system provides an environment in which users can access remote resources (implementing resource sharing) by either logging in to the appropriate remote machine or transferring data from the remote machine to their own machines. Currently, all general-purpose operating systems, and even embedded operating systems such as Android and iOS, are network operating systems.

       19.4.1.1 Remote Login An important function of a network operating system is to allow users to log in remotely. The Internet provides the ssh facility for this purpose. To illustrate, suppose that a user at Westminster College wishes to compute on kristen.cs.yale.edu, a computer located at Yale University. To do so, the user must have a valid account on that machine. To log in remotely, the user issues the command

           ssh kristen.cs.yale.edu

       This command results in the formation of an encrypted socket connection between the local machine at Westminster College and the kristen.cs.yale.edu computer. After this connection has been established, the networking software creates a transparent, bidirectional link so that all characters entered by the user are sent to a process on kristen.cs.yale.edu and all the output from that process is sent back to the user. The process on the remote machine asks the user for a login name and a password. Once the correct information has been received, the process acts as a proxy for the user, who can compute on the remote machine just as any local user can.

       19.4.1.2 Remote File Transfer Another major function of a network operating system is to provide a mechanism for remote file transfer from one machine to another. In such an environment, each computer maintains its own local file system. If a user at one site (say, Kurt at albion.edu) wants to access a file owned by Becca located on another computer (say, at colby.edu), then the file must be copied explicitly from the computer at Colby in Maine to the computer at Albion in Michigan. The communication is one-directional and individual, such that other users at those sites wishing to transfer a file, say Sean at colby.edu to Karen at albion.edu, must likewise issue a set of commands. The Internet provides a mechanism for such a transfer with the file transfer protocol (FTP) and the more private secure file transfer protocol (SFTP). Suppose that user Carla at wesleyan.edu wants to copy a file that is owned by Owen at kzoo.edu. The user must first invoke the sftp program by executing

           sftp owen@kzoo.edu

       The program then asks the user for a login name and a password. Once the correct information has been received, the user can use a series of commands to upload files, download files, and navigate the remote file system structure. Some of these commands are:

       - get—Transfer a file from the remote machine to the local machine.
       - put—Transfer a file from the local machine to the remote machine.
       - ls or dir—List files in the current directory on the remote machine.
       - cd—Change the current directory on the remote machine.

       There are also various commands to change transfer modes (for binary or ASCII files) and to determine connection status.

       19.4.1.3 Cloud Storage Basic cloud-based storage applications allow users to transfer files much as with FTP. Users can upload files to a cloud server, download files to the local computer, and share files with other cloud-service users via a web link or other sharing mechanism through a graphical interface. Common examples include Dropbox and Google Drive.

       An important point about SSH, FTP, and cloud-based storage applications is that they require the user to change paradigms. FTP, for example, requires the user to know a command set entirely different from the normal operating-system commands. With SSH, the user must know appropriate commands on the remote system. For instance, a user on a Windows machine who connects remotely to a UNIX machine must switch to UNIX commands for the duration of the SSH session. (In networking, a session is a complete round of communication, frequently beginning with a login to authenticate and ending with a logoff to terminate the communication.) With cloud-based storage applications, users may have to log into the cloud service (usually through a web browser) or native application and then use a series of graphical commands to upload, download, or share files. Obviously, users would find it more convenient not to be required to use a different set of commands. Distributed operating systems are designed to address this problem.

      A network operating system (NOS) provides basic functionalities like remote login and file transfer, allowing users to access resources on other machines within a network. It forms the foundation of many systems, enabling communication and resource sharing over a network. The system often requires users to manually interact with remote machines, using commands such as SSH for remote login or FTP for file transfers. While these systems offer basic functionality, they may lack the transparency and integration provided by distributed operating systems. Network operating systems, such as UNIX and Windows Server, often require explicit user commands to interact with remote systems, making them more complex for users. As technology advances, the lines between network and distributed systems blur, with modern network operating systems incorporating some features of distributed systems, such as cloud storage integration or remote access through APIs, which streamlines user interactions and enhances accessibility across distributed resources.
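
      Programmatically, the same get/put session looks roughly like the following sketch, which assumes the third-party paramiko library (any SSH/SFTP client would do); host, user, password, and file names are modeled on the excerpt and are illustrative:

      ```python
      import paramiko

      def copy_files():
          client = paramiko.SSHClient()
          client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
          client.connect("kzoo.edu", username="owen", password="secret")
          sftp = client.open_sftp()
          print(sftp.listdir("."))                # "ls" on the remote machine
          sftp.get("paper.tex", "paper.tex")      # "get": remote -> local
          sftp.put("notes.txt", "notes.txt")      # "put": local -> remote
          sftp.close()
          client.close()
      ```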

    16. TCP also helps regulate the flow of packets through mechanisms called flow control and congestion control. Flow control involves preventing the sender from overrunning the capacity of the receiver. For example, the receiver may have a slower connection or may have slower hardware components (like a slower network card or processor). Flow-control state can be returned in the ACK packets of the receiver to alert the sender to slow down or speed up. Congestion control attempts to approximate the state of the networks (and generally the routers) between the sender and the receiver. If a router becomes overwhelmed with packets, it will tend to drop them. Dropping packets results in ACK timeouts, which results in more packets saturating the network. To prevent this condition, the sender monitors the connection for dropped packets by noticing how many packets are not acknowledged. If there are too many dropped packets, the sender will slow down the rate at which it sends them. This helps ensure that the TCP connection is being fair to other connections happening at the same time.

      Communication between processes in distributed systems involves various layers and protocols, with UDP and TCP serving as transport protocols. The transport layer’s primary function is to ensure data transfer between processes on different systems. UDP, being connectionless, sends data without establishing a connection, making it suitable for real-time applications like video streaming. However, the absence of guarantees means that applications must manage packet loss or order. In contrast, TCP provides reliable data transfer by establishing a connection, ensuring packets arrive in order, and implementing error checking. TCP's reliability comes with additional overhead due to its connection setup and acknowledgment system. The transport layer ensures that data is correctly routed from sender to receiver, providing essential mechanisms like flow control and congestion management. These protocols play a vital role in maintaining the efficiency and reliability of communication in distributed systems, impacting both performance and application design.

    17. The transport protocols TCP and UDP identify the receiving (and sending) processes through the use of a port number

      The transport protocols, UDP and TCP, are fundamental for transferring data between hosts. UDP is connectionless and unreliable, meaning it does not guarantee delivery or order, making it faster for applications where speed is prioritized over reliability. The lack of guarantees requires applications to handle lost or out-of-order packets. In contrast, TCP provides reliability through connection-oriented communication. It ensures data is delivered in the correct order, handles retransmission of lost packets, and uses flow and congestion control to manage network traffic. These mechanisms make TCP slower but more reliable for applications like web browsing or file transfers where accuracy and order are critical. TCP’s three-way handshake and acknowledgment system are key features that enhance its reliability, ensuring that data is transmitted without errors. In distributed systems, the choice of protocol depends on the application’s need for speed versus reliability, making both UDP and TCP essential for different use cases.
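
      A minimal contrast between the two protocols using Python sockets; both sides are identified by an (IP address, port) pair, and the addresses and ports here are placeholders:

      ```python
      import socket

      def udp_send(message, host="127.0.0.1", port=9999):
          s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)    # UDP: connectionless datagram
          s.sendto(message.encode(), (host, port))                 # fire and forget, no delivery guarantee
          s.close()

      def tcp_send(message, host="127.0.0.1", port=9998):
          s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)    # TCP: connection-oriented stream
          s.connect((host, port))                                   # three-way handshake happens here
          s.sendall(message.encode())                               # ordering and retransmission handled by TCP
          s.close()
      ```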

    18. TCP/IP Example

      In a TCP/IP network, every host has a unique name and IP address. The address is divided into a network number and a host number; the network number is assigned by an external authority, while the site assigns the host number. To reach a host on another network, the sender consults its routing table to find a router, and routers use the network portion of the destination IP address to forward packets. A crucial process is the Address Resolution Protocol (ARP), which resolves an IP address to the Ethernet MAC address within the local network. If the destination is on a different network, the packet is forwarded via routers until it reaches the correct network. The use of ARP and routing ensures that packets reach their correct destinations. This process emphasizes the importance of efficient address mapping, routing, and communication protocols in distributed systems, and how packets move through networks to ensure data delivery across multiple devices and sites.
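
      The first routing decision can be sketched in a few lines: if the destination shares the sender's network number, the sender ARPs for it directly; otherwise the packet goes to a router. Addresses below are illustrative:

      ```python
      import ipaddress

      def next_hop(src_ip, netmask, dst_ip, default_gateway):
          local_net = ipaddress.ip_network(f"{src_ip}/{netmask}", strict=False)
          if ipaddress.ip_address(dst_ip) in local_net:
              return dst_ip              # same network: resolve dst_ip to a MAC address via ARP
          return default_gateway         # different network: forward the packet to the router

      print(next_hop("10.0.5.20", "255.255.255.0", "10.0.5.77", "10.0.5.1"))   # -> 10.0.5.77
      print(next_hop("10.0.5.20", "255.255.255.0", "172.16.9.3", "10.0.5.1"))  # -> 10.0.5.1
      ```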

    19. Security should be a concern in the design and implementation of any modern communication protocol. Both strong authentication and encryption are needed for secure communication. Strong authentication ensures that the sender and receiver of a communication are who or what they are supposed to be. Encryption protects the contents of the communication from eavesdropping. Weak authentication and clear-text communication are still very common, however, for a variety of reasons. When most of the common protocols were designed, security was frequently less important than performance, simplicity, and efficiency. This legacy is still showing itself today, as adding security to existing infrastructure is proving to be difficult and complex.

      Security is an essential consideration in the design and implementation of communication protocols, particularly in distributed systems. As networks expand and more sensitive data is transmitted, ensuring the confidentiality, integrity, and authenticity of communication becomes increasingly important. Two key aspects of security are authentication and encryption. Authentication verifies the identity of the sender and receiver of a communication, ensuring that only authorized parties can exchange information. Encryption, on the other hand, protects the contents of the communication by converting the data into a format that can only be deciphered by the intended recipient. Many communication protocols, such as SSL/TLS for secure web browsing or VPNs for private communication, use both authentication and encryption to provide secure communication. However, implementing strong security in communication protocols can add complexity and overhead, potentially affecting performance. Despite this, the need for security continues to grow as cyber threats increase. Many legacy protocols were designed before security was a significant concern, which has led to challenges in retrofitting them with modern security features. As a result, the design of new protocols and the enhancement of existing ones often prioritize security alongside performance, ensuring that communication in distributed systems remains secure and reliable.
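
      As an example of layering authentication and encryption onto a TCP connection, the sketch below uses TLS via Python's ssl module; the default context verifies the server's certificate and encrypts the application data (the host name is illustrative):

      ```python
      import socket
      import ssl

      def fetch_page(host="www.example.com"):
          context = ssl.create_default_context()          # CA-based server authentication
          with socket.create_connection((host, 443)) as raw:
              with context.wrap_socket(raw, server_hostname=host) as tls:
                  # Everything sent from here on is encrypted on the wire.
                  tls.sendall(b"GET / HTTP/1.1\r\nHost: " + host.encode() + b"\r\nConnection: close\r\n\r\n")
                  data = b""
                  while chunk := tls.recv(4096):
                      data += chunk
          return data
      ```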

    20. the OSI protocol stack—a set of cooperating protocols—showing the physical flow of data. As mentioned, logically each layer of a protocol stack communicates with the equivalent layer on other systems. But physically, a message starts at or above the application layer and is passed through each lower level in turn. Each layer may modify the message and include message-header data for the equivalent layer on the receiving side. Ultimately, the message reaches the data-link layer and is transferred as one or more packets (Figure 19.7). The data-link layer of the target system receives these data, and the message is moved up through the protocol stack. It is analyzed, modified, and stripped of headers as it progresses. It finally reaches the application layer for use by the receiving process.

       [Figure 19.6 The OSI protocol stack. Figure 19.7 An OSI network message.]

       The OSI model formalizes some of the earlier work done in network protocols but was developed in the late 1970s and is currently not in widespread use. Perhaps the most widely adopted protocol stack is the TCP/IP model (sometimes called the Internet model), which has been adopted by virtually all Internet sites. The TCP/IP protocol stack has fewer layers than the OSI model. Theoretically, because it combines several functions in each layer, it is more difficult to implement but more efficient than OSI networking. The relationship between the OSI and TCP/IP models

      The OSI and TCP/IP models are both critical in understanding how communication occurs in distributed systems. The OSI model, developed by the International Standards Organization (ISO), defines seven layers of network communication, ranging from the physical transmission of data (Layer 1) to the application of services like file transfer and email (Layer 7). While the OSI model is theoretical and not widely used in practice, it provides a useful framework for understanding the functions of different layers in network communication. The TCP/IP model, on the other hand, is more practical and is the basis for the Internet. It combines several of the OSI model's layers into fewer layers, making it more efficient for implementation. The application layer in TCP/IP handles protocols like HTTP for web browsing, FTP for file transfers, and DNS for domain name resolution. The transport layer includes protocols like TCP (which ensures reliable data transfer) and UDP (which offers faster, connectionless communication). The network layer is responsible for routing data across the Internet using IP addresses. The TCP/IP model is flexible and allows for communication across diverse network infrastructures, including LANs, WANs, and the Internet. Security is an important consideration in both models, and modern protocols often incorporate encryption and authentication to ensure secure communication.
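
      A toy illustration of the encapsulation described in the excerpt: each layer on the sender wraps the message in its own header, and the matching layer on the receiver strips it off. The header strings are simplified placeholders:

      ```python
      LAYERS = ["application", "transport", "network", "data-link"]

      def send(payload):
          message = payload
          for layer in LAYERS:                       # top of the stack down toward the wire
              message = f"[{layer}-hdr]{message}"
          return message                             # what actually crosses the physical link

      def receive(frame):
          message = frame
          for layer in reversed(LAYERS):             # data-link back up to the application
              header = f"[{layer}-hdr]"
              assert message.startswith(header)
              message = message[len(header):]        # strip this layer's header
          return message

      frame = send("hello")
      print(frame)              # [data-link-hdr][network-hdr][transport-hdr][application-hdr]hello
      print(receive(frame))     # hello
      ```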

    21. communication network, we must deal with the inherent complexity of coordinating asynchronous operations communicating in a potentially slow and error-prone environment. In addition, the systems on the network must agree on a protocol or a set of protocols for determining host names, locating hosts on the network, establishing connections, and so on. We can simplify the design problem (and related implementation) by partitioning the problem into multiple layers. Each layer on one system communicates with the equivalent layer on other systems. Typically, each layer has its own protocols, and communication takes place between peer layers using a specific protocol. The protocols may be implemented in hardware or software. For

      Communication protocols play a vital role in distributed systems, ensuring that processes on different hosts can communicate effectively and efficiently. These protocols are responsible for managing the exchange of messages between systems, including tasks such as addressing, error detection, and flow control. One common approach to organizing these protocols is the use of a layered model, where each layer of the network communication stack is responsible for specific tasks. The Open Systems Interconnection (OSI) model, which defines seven layers of communication, is one of the most well-known models. These layers include the physical layer, responsible for transmitting raw data; the data-link layer, which handles framing and error detection; the network layer, which manages routing and addressing; and the transport layer, which ensures reliable delivery of data. The OSI model provides a logical framework for understanding the different functions involved in network communication. However, in practice, the TCP/IP model, which has fewer layers, is more widely used, particularly for Internet communications. TCP/IP protocols, such as HTTP, FTP, and DNS, enable services like web browsing, file transfer, and domain name resolution. Communication protocols, whether OSI or TCP/IP, must also address security concerns, including authentication and encryption, to ensure safe and private data exchange across the network.

    22. Naming and Name Resolution

      In distributed systems, communication between processes on different hosts requires a naming and resolution mechanism to identify and locate remote processes. The naming system involves associating each process with a unique identifier, typically in the form of a pair consisting of a host name and a process identifier. To communicate with a process on a remote system, a process on one host must resolve the host name to an IP address, which is a unique numeric identifier for that host on the network. This name resolution is achieved through systems like the Domain Name System (DNS), which maintains a hierarchical structure of domain names and their corresponding IP addresses. When a process needs to communicate with a remote system, it queries the DNS to obtain the IP address associated with the host name. The resolution process may involve several steps, as the system works its way through the domain hierarchy, starting from the top-level domain (e.g., .com, .edu) and moving down to the specific host. Once the IP address is resolved, the communication can proceed. DNS helps maintain the scalability and flexibility of large distributed systems, allowing new hosts to be added without requiring manual updates to all systems on the network.
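
      Name resolution as seen from an application, using the resolver the operating system provides; the host name is the one used in the textbook excerpt and may or may not resolve today:

      ```python
      import socket

      def resolve(hostname, port=80):
          # getaddrinfo consults the system resolver (and thus DNS) and returns
          # one record per usable (family, socket type, address) combination.
          records = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
          return sorted({sockaddr[0] for _family, _type, _proto, _canon, sockaddr in records})

      print(resolve("kristen.cs.yale.edu"))   # the IP addresses DNS returns for that host
      ```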

    23. local-area networks (LAN) and wide-area networks (WAN).

      Local-area networks (LANs) and wide-area networks (WANs) are two types of networks that connect distributed systems. LANs are typically used to interconnect devices within a small geographical area, such as within a building or a campus. LANs provide high-speed communication with low latency and are commonly used in offices, homes, or small organizations. Ethernet and Wi-Fi are the primary technologies used to implement LANs, enabling devices like computers, printers, and routers to communicate over short distances. In contrast, WANs span larger geographic areas, such as cities or even countries, and are used to connect multiple LANs. WANs are typically slower than LANs due to the greater distances and the complexity of maintaining communication across a large area. The Internet is the largest example of a WAN, providing global connectivity between millions of devices. WANs use technologies such as leased lines, satellite links, and fiber optics to connect remote sites. While LANs provide fast, reliable communication within a limited area, WANs enable global connectivity, allowing organizations to share resources and communicate across vast distances, supporting the global nature of modern distributed systems.

    24. Reliability If one site fails in a distributed system, the remaining sites can continue operating, giving the system better reliability. If the system is composed of multiple large autonomous installations (that is, general-purpose computers), the failure of one of them should not affect the rest. If, however, the system is composed of diversified machines, each of which is responsible for some crucial system function (such as the web server or the file system), then a single failure may halt the operation of the whole system. In general, with enough redundancy (in both hardware and data), the system can continue operation even if some of its nodes have failed. The failure of a node or site must be detected by the system, and appropriate action may be needed to recover from the failure. The system must no longer use the services of that site. In addition, if the function of the failed site can be taken over by another site, the system must ensure that the transfer of function occurs correctly. Finally, when the failed site recovers or is repaired, mechanisms must be available to integrate it back into the system smoothly.

      Reliability is one of the primary benefits of distributed systems, as they are designed to tolerate faults and continue operating even when some nodes or sites fail. In a distributed system, if one node fails, the rest of the system can remain functional by relying on other nodes to take over the tasks previously handled by the failed node. This redundancy ensures that the failure of a single site does not result in a system-wide shutdown. To maintain high reliability, distributed systems employ mechanisms such as hardware and data redundancy, where multiple copies of critical data and services are maintained across different nodes. Additionally, failure detection protocols are implemented to quickly identify failed nodes and trigger the appropriate recovery processes. If possible, the work of a failed node is redistributed to other functioning nodes, and the system continues operating without significant disruption. The recovery process also includes reintegrating the failed node back into the system once it has been repaired. These features make distributed systems particularly suitable for applications requiring high availability, such as cloud storage services, financial systems, and online platforms, where continuous operation is essential for business success.

    25. Computation Speedup

      Computation speedup is a critical advantage of distributed systems. By dividing a large computation into smaller subcomputations, distributed systems can process these tasks concurrently across multiple nodes. This parallel processing significantly reduces the time required to complete complex computations, which is particularly useful for applications involving big data, such as trend analysis, machine learning, or scientific simulations. For example, when analyzing large datasets, the workload can be split among multiple nodes, allowing each one to process a portion of the data simultaneously. Load balancing plays a key role in ensuring that tasks are evenly distributed across the nodes to prevent any single node from being overwhelmed with too many tasks. In a distributed system, if one node becomes overloaded, the system can reassign tasks to other, less busy nodes. This dynamic redistribution of tasks is essential for maintaining system performance and efficiency. Both computation speedup and load balancing are critical in modern distributed systems, particularly in cloud computing environments, where large-scale data processing is common. These mechanisms help optimize resource utilization, minimize delays, and improve overall system performance.
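
      A small, local illustration of computation speedup by splitting work across workers; on a real distributed system the chunks would be shipped to different nodes, but separate processes show the same idea:

      ```python
      from multiprocessing import Pool

      def count_words(chunk):
          return sum(len(line.split()) for line in chunk)

      def parallel_word_count(lines, workers=4):
          size = max(1, len(lines) // workers)
          chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
          with Pool(workers) as pool:
              return sum(pool.map(count_words, chunks))   # chunks are processed concurrently

      if __name__ == "__main__":
          sample = ["the quick brown fox"] * 100_000
          print(parallel_word_count(sample))
      ```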

    26. For example, a user at site A may query a database located at site B. Meanwhile, a user at site B may access a file that resides at site A. In general, resource sharing in a distributed system provides mechanisms for sharing files at remote sites, processing information in a distributed database, printing files at remote sites, using remote specialized hardware devices such as a supercomputer or a graphics processing unit (GPU), and performing other operations.

      Resource sharing in a distributed system allows for the efficient use of resources that are not confined to a single machine. Nodes in different locations can access and share resources such as databases, files, printers, and specialized hardware (e.g., GPUs or supercomputers). For instance, a user at Site A can query a database at Site B, or a user at Site B can print a document from Site A's printer. This resource-sharing capability enhances the versatility of distributed systems, enabling organizations to maximize the utility of their hardware and software resources spread across multiple sites. In practice, distributed systems provide mechanisms for seamless file sharing, access to distributed databases, remote job execution, and even the use of specialized equipment that would otherwise be expensive to procure or maintain locally. The ability to utilize resources remotely is particularly important in environments where computing power or data storage needs exceed the capacity of individual machines. For example, scientific research institutions may leverage distributed systems to access vast computational resources or datasets stored in different locations, facilitating collaboration across multiple geographical regions.

    27. A distributed system is a collection of loosely coupled nodes interconnected by a communication network. From the point of view of a specific node in a distributed system, the rest of the nodes and their respective resources are remote, whereas its own resources are local. The nodes in a distributed system may vary in size and function. They may include small microprocessors, personal computers, and large general-purpose computer systems. These processors are referred to by a number of names, such as processors, sites, machines, and hosts, depending on the context in which they are mentioned. We mainly use site to indicate the location of a machine and node to refer to a specific system at a site. Nodes can exist in a client–server configuration, a peer-to-peer configuration, or a hybrid of these. In the common client–server configuration, one node at one site, the server, has a resource that another node, the client (or user), would like to use. A general structure of a client–server distributed system is shown in Figure 19.1. In a peer-to-peer configuration, there are no servers or clients. Instead, the nodes share equal responsibilities and can act as both clients and servers.

       [Figure 19.1 A client–server distributed system.]

       When several sites are connected to one another by a communication network, users at the various sites have the opportunity to exchange information. At a low level, messages are passed between systems, much as messages are passed between processes in the single-computer message system discussed in Section 3.4. Given message passing, all the higher-level functionality found in standalone systems can be expanded to encompass the distributed system. Such functions include file storage, execution of applications, and remote procedure calls (RPCs). There are three major reasons for building distributed systems: resource sharing, computational speedup, and reliability. In this section, we briefly discuss each of them.

      Distributed systems provide significant advantages in resource sharing, computational speedup, and reliability. Resource sharing allows nodes within a distributed system to access and utilize resources located at different sites. This functionality is useful for scenarios like accessing remote databases or using specialized hardware devices, such as supercomputers or GPUs, without needing to have them locally. Additionally, computational speedup is achieved by partitioning tasks into smaller subcomputations that can run concurrently across multiple nodes. This capability accelerates processing, especially in large-scale data analytics and scientific computing tasks. The concept of load balancing further enhances computational efficiency by redistributing tasks to lightly loaded nodes, ensuring that no single node is overwhelmed with requests. Reliability is another key benefit, as the failure of one node does not bring down the entire system. In the event of a failure, other nodes can take over the tasks, and the system can continue to operate. This redundancy, along with mechanisms for failure detection and recovery, helps maintain the robustness of the distributed system. Overall, these advantages make distributed systems highly attractive for both enterprise and scientific applications, where performance, fault tolerance, and resource flexibility are essential.

    28. 18.7.2 The Java Virtual Machine

      The Java Virtual Machine (JVM) is a software-based execution environment that allows Java programs to run on any system regardless of hardware architecture. It consists of a class loader, a verifier, and a Java interpreter, ensuring that compiled Java bytecode executes securely and efficiently. JVM enforces security by preventing illegal memory access and managing resources via garbage collection. This abstraction enables platform independence, making Java highly portable. By dynamically loading and verifying classes, JVM ensures safe execution, preventing stack overflows and pointer misuse. JVM’s architecture significantly enhances software reliability, memory management, and cross-platform compatibility, making Java a preferred language.

    29. 18.7.1 VMware VMware Workstation is a popular commercial application that abstracts Intel x86 and compatible hardware into isolated virtual machines. VMware Workstation is a prime example of a Type 2 hypervisor. It runs as an application on a host operating system such as Windows or Linux and allows this host system to run several different guest operating systems concurrently as independent virtual machines. The architecture of such a system is shown in Figure 18.9. In this scenario, Linux is running as the host operating system, and FreeBSD, Windows NT, and Windows XP are running as guest operating systems. At the heart of VMware is the virtualization layer, which abstracts the physical hardware into isolated virtual machines running as guest operating systems. Each virtual machine has its own virtual CPU, memory, disk drives, network interfaces, and so forth.

      VMware Workstation is a widely used Type 2 hypervisor that enables multiple virtual machines to run on a host operating system like Linux or Windows. It abstracts hardware resources, allowing guest operating systems to function independently with dedicated virtual components such as CPU, memory, and disk. This virtualization enhances resource utilization, system efficiency, and disaster recovery. A key feature is the ability to copy guest operating system files, enabling easy backups and migrations. The architecture consists of a virtualization layer between the host OS and hardware, ensuring seamless integration of multiple environments while maintaining isolation and performance across virtual machines.

    30. Live Migration One feature not found in general-purpose operating systems but found in type 0 and type 1 hypervisors is the live migration of a running guest from one system to another. We mentioned this capability earlier. Here, we explore the details of how live migration works and why VMMs can implement it relatively easily while general-purpose operating systems, in spite of some research attempts, cannot. First, let's consider how live migration works. A running guest on one system is copied to another system running the same VMM. The copy occurs with so little interruption of service that users logged in to the guest, as well as network connections to the guest, continue without noticeable impact. This rather astonishing ability is very powerful in resource management and hardware administration. After all, compare it with the steps necessary without virtualization: we must warn users, shut down the processes, possibly move the binaries, and restart the processes on the new system. Only then can users access the services again. With live migration, we can decrease the load on an overloaded system or make hardware or system changes with no discernable disruption for users. Live migration is made possible by the well-defined interface between each guest and the VMM and the limited state the VMM maintains for the guest. The VMM migrates a guest via the following steps: 1. The source VMM establishes a connection with the target VMM and confirms that it is allowed to send a guest. 2. The target creates a new guest by creating a new VCPU, new nested page table, and other state storage. 3. The source sends all read-only memory pages to the target. 4. The source sends all read–write pages to the target, marking them as clean. 5. The source repeats step 4, because during that step some pages were probably modified by the guest and are now dirty. These pages need to be sent again and marked again as clean. 6. When the cycle of steps 4 and 5 becomes very short, the source VMM freezes the guest, sends the VCPU's final state, other state details, and the final dirty pages, and tells the target to start running the guest. Once the target acknowledges that the guest is running, the source terminates the guest.

      Live migration allows moving a running guest from one physical system to another with minimal downtime. This feature enhances resource management and hardware maintenance, avoiding service disruptions. The migration process involves copying guest memory, transferring CPU state, and synchronizing updates to ensure a seamless transition. VMMs minimize interruptions by iteratively transferring memory pages until only a small set remains, at which point the guest is frozen, final state is transferred, and execution resumes on the target machine. Live migration is a powerful feature unique to virtualization, enabling dynamic workload balancing and system upgrades without impacting user experience.
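
      A minimal sketch in C of the iterative pre-copy loop described in the steps above; the page-tracking array and helpers (send_page, guest_runs_and_dirties_pages, the dirty threshold) are hypothetical stand-ins, since a real VMM tracks dirty pages through hardware dirty bits or write protection in the nested page table.

        /* Sketch of the pre-copy phase of live migration (steps 4-6 above).
         * All names are hypothetical; this is a simulation, not a VMM API. */
        #include <stdbool.h>
        #include <stdio.h>

        #define NPAGES 1024
        #define DIRTY_THRESHOLD 8   /* "cycle becomes very short" cutoff */

        static bool dirty[NPAGES];  /* pages modified since the last send */

        static void send_page(int pfn) { (void)pfn; /* ship page to target VMM */ }
        static int  guest_runs_and_dirties_pages(void) { return 0; /* stub */ }

        void live_migrate(void)
        {
            /* Step 3/4: initial full copy; every page starts out "dirty". */
            for (int i = 0; i < NPAGES; i++)
                dirty[i] = true;

            int ndirty = NPAGES;
            while (ndirty > DIRTY_THRESHOLD) {
                /* Steps 4/5: send whatever is dirty, marking pages clean. */
                for (int i = 0; i < NPAGES; i++) {
                    if (dirty[i]) {
                        send_page(i);
                        dirty[i] = false;
                    }
                }
                /* The guest keeps running during the copy and re-dirties pages. */
                ndirty = guest_runs_and_dirties_pages();
            }

            /* Step 6: freeze the guest, send the final dirty pages and VCPU state,
             * then tell the target to start running the guest. */
            printf("freezing guest; %d residual dirty pages to send\n", ndirty);
        }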

    31. Storage Management

      Storage management in virtualized systems differs from traditional operating systems. Instead of using physical partitions, VMMs store guest OS disks as image files, simplifying management, migration, and duplication. Type 0 hypervisors may use disk partitioning, while Type 1 and Type 2 hypervisors store disk images in a file system. VMMs enable physical-to-virtual (P-to-V) and virtual-to-physical (V-to-P) conversions, facilitating guest system replication and troubleshooting. Additionally, virtualized storage solutions allow dynamic expansion of storage resources, optimizing disk usage. This section discusses how VMMs manage storage to support efficient virtualization while maintaining flexibility for guest OS operations and migrations.

    32. 18.6.3 I/O

      I/O virtualization presents significant challenges due to the diversity of hardware devices and performance considerations. VMMs offer different approaches, such as dedicating devices to specific guests or abstracting hardware access through virtualized device drivers. Direct device access enhances performance but limits sharing, while shared access requires careful VMM management to ensure security and efficiency. Networking in virtualized environments involves complex IP address management and virtual switching to connect guests efficiently. Firewalls and network address translation (NAT) mechanisms provide security and isolation. This section highlights how VMMs handle I/O operations to balance performance, flexibility, and security in virtualized environments.

    33. 18.6.2 Memory Management Efficient memory use in general-purpose operating systems is a major key to performance. In virtualized environments, there are more users of memory (the guests and their applications, as well as the VMM), leading to more pressure on memory use. Further adding to this pressure is the fact that VMMs typically overcommit memory, so that the total memory allocated to guests exceeds the amount that physically exists in the system. The extra need for efficient memory use is not lost on the implementers of VMMs, who take extensive measures to ensure the optimal use of memory. For example, VMware ESX uses several methods of memory management. Before memory optimization can occur, the VMM must establish how much real memory each guest should use. To do that, the VMM first evaluates each guest's maximum memory size. General-purpose operating systems do not expect the amount of memory in the system to change, so VMMs must maintain the illusion that the guest has that amount of memory. Next, the VMM computes a target real-memory allocation for each guest based on the configured memory for that guest and other factors, such as overcommitment and system load. It then uses the three low-level mechanisms listed below to reclaim memory from the guests 1. Recall that a guest believes it controls memory allocation via its page-table management, whereas in reality the VMM maintains a nested page table that translates the guest page table to the real page table. The VMM can use this extra level of indirection to optimize the guest's use of memory without the guest's knowledge or help. One approach is to provide double paging. Here, the VMM has its own page-replacement algorithms and loads pages into a backing store that the guest believes is physical memory. Of course, the VMM knows less about the guest's memory access patterns than the guest does, so its paging is less efficient, creating performance problems. VMMs do use this method when other methods are not available or are not providing enough free memory. However, it is not the preferred approach. 2. A common solution is for the VMM to install in each guest a pseudo–device driver or kernel module that the VMM controls. (A pseudo–device driver uses device-driver interfaces, appearing to the kernel to be a device driver, but does not actually control a device. Rather, it is an easy way to add kernel-mode code without directly modifying the kernel.) This balloon memory manager communicates with the VMM and is told to allocate or deallocate memory. If told to allocate, it allocates memory and tells the operating system to pin the allocated pages into physical memory. Recall that pinning locks a page into physical memory so that it cannot be moved or paged out. To the guest, these pinned pages appear to decrease the amount of physical memory it has available, creating memory pressure. The guest then may free up other physical memory to be sure it has enough free memory. Meanwhile, the VMM, knowing that the pages pinned by the balloon process will never be used, removes those physical pages from the guest and allocates them to another guest. At the same time, the guest is using its own memory-management and paging algorithms to manage the available memory, which is the most efficient option. If memory pressure within the entire system decreases, the VMM will tell the balloon process within the guest to unpin and free some or all of the memory, allowing the guest more pages for its use. 3. 
Another common method for reducing memory pressure is for the VMM to determine if the same page has been loaded more than once. If this is the case, the VMM reduces the number of copies of the page to one and maps the other users of the page to that one copy. VMware, for example, randomly samples guest memory and creates a hash for each page sampled. That hash value is a “thumbprint” of the page. The hash of every page examined is compared with other hashes stored in a hash table. If there is a match, the pages are compared byte by byte to see if they really are identical. If they are, one page is freed, and its logical address is mapped to the other's physical address. This technique might seem at first to be ineffective, but consider that guests run operating systems. If multiple guests run the same operating system, then only one copy of the active operating-system pages need be in memory. Similarly, multiple guests could be running the same set of applications, again a likely source of memory sharing. The overall effect of these mechanisms is to enable guests to behave and perform as if they had the full amount of memory requested, although in reality they have less.

      Memory management in virtualized environments is complex due to multiple guests competing for limited physical memory. VMMs often overcommit memory, allocating more than what is physically available. To optimize memory usage, VMMs use techniques like nested page tables, balloon memory management, and memory deduplication. Ballooning forces guests to release memory when needed, while deduplication identifies and consolidates identical memory pages across guests. These techniques help ensure efficient memory utilization while maintaining the illusion that each guest has dedicated memory. Despite these optimizations, performance trade-offs exist, particularly in paging efficiency, where the VMM’s strategies may be less optimal than guest OS techniques.
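
      The deduplication step can be pictured with a short sketch: hash a sampled page, and only if the hash matches an existing page's hash compare the two byte by byte before sharing a single copy. The hash function and data layout below are illustrative assumptions, not VMware's actual implementation.

        /* Sketch of content-based page sharing: hash sampled pages, and when two
         * hashes match, compare the pages byte for byte before mapping both users
         * to one physical copy (which a real VMM would mark copy-on-write). */
        #include <stdint.h>
        #include <string.h>

        #define PAGE_SIZE 4096

        static uint64_t page_hash(const uint8_t *page)
        {
            uint64_t h = 0xcbf29ce484222325ULL;      /* FNV-1a style "thumbprint" */
            for (size_t i = 0; i < PAGE_SIZE; i++) {
                h ^= page[i];
                h *= 0x100000001b3ULL;
            }
            return h;
        }

        /* Returns 1 if 'candidate' can be deduplicated against 'existing'. */
        int try_share(const uint8_t *existing, uint64_t existing_hash,
                      const uint8_t *candidate)
        {
            if (page_hash(candidate) != existing_hash)
                return 0;                            /* thumbprints differ */
            if (memcmp(existing, candidate, PAGE_SIZE) != 0)
                return 0;                            /* hash collision only */
            /* Identical: free one copy and remap its logical address to the
             * surviving physical page. */
            return 1;
        }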

    34. CPU Scheduling

      CPU scheduling in virtualized environments mimics multiprocessor behavior, even on single-CPU systems. The VMM assigns virtual CPUs (vCPUs) to each guest OS and manages physical CPU resources among them. In cases of overcommitment, where guest systems request more vCPUs than available physical CPUs, VMMs use scheduling algorithms to distribute resources fairly. However, virtualization can negatively impact guest OS scheduling assumptions, causing delayed time slices and poor performance for real-time applications. To mitigate these effects, VMMs provide tools to adjust timekeeping and optimize performance. This section highlights the challenges of CPU scheduling and how VMMs balance fairness and efficiency.

    35. Virtualization is probably the most common method for running applications designed for one operating system on a different operating system, but on the same CPU. This method works relatively efficiently because the applications were compiled for the instruction set that the target system uses. But what if an application or operating system needs to run on a different CPU? Here, it is necessary to translate all of the source CPU's instructions so that they are turned into the equivalent instructions of the target CPU. Such an environment is no longer virtualized but rather is fully emulated. Emulation is useful when the host system has one system architecture and the guest system was compiled for a different architecture. For example, suppose a company has replaced its outdated computer system with a new system but would like to continue to run certain important programs that were compiled for the old system. The programs could be run in an emulator that translates each of the outdated system's instructions into the native instruction set of the new system. Emulation can increase the life of programs and allow us to explore old architectures without having an actual old machine. As may be expected, the major challenge of emulation is performance. Instruction-set emulation may run an order of magnitude slower than native instructions, because it may take ten instructions on the new system to read, parse, and simulate an instruction from the old system. Thus, unless the new machine is ten times faster than the old, the program running on the new machine will run more slowly than it did on its native hardware. Another challenge for emulator writers is that it is difficult to create a correct emulator because, in essence, this task involves writing an entire CPU in software. In spite of these challenges, emulation is very popular, particularly in gaming circles. Many popular video games were written for platforms that are no longer in production. Users who want to run those games frequently can find an emulator of such a platform and then run the game unmodified within the emulator. Modern systems are so much faster than old game consoles that even the Apple iPhone has game emulators and games available to run within them.

      Emulation enables software designed for one hardware architecture to run on a different system by translating CPU instructions. Unlike virtualization, which optimizes performance by running code natively, emulation is slower because each instruction must be translated. Emulation is valuable for legacy software preservation, allowing old programs to run on modern systems. It is also widely used in gaming, enabling old console games to run on new hardware. However, performance limitations make it impractical for high-performance computing tasks. Writing a correct emulator is challenging because it requires replicating an entire CPU in software.
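
      A toy fetch-decode-dispatch loop makes the overhead concrete: every guest instruction costs several host operations. The three-instruction accumulator ISA below is invented purely for illustration.

        /* Toy emulator for a hypothetical 8-bit accumulator ISA (LOAD/ADD/HALT),
         * showing the fetch-decode-dispatch work behind each emulated instruction. */
        #include <stddef.h>
        #include <stdint.h>
        #include <stdio.h>

        enum { OP_LOAD = 0x01, OP_ADD = 0x02, OP_HALT = 0xFF };

        void emulate(const uint8_t *code, size_t len)
        {
            uint8_t acc = 0;            /* emulated accumulator register */
            size_t  pc  = 0;            /* emulated program counter */

            while (pc < len) {
                uint8_t op = code[pc++];        /* fetch */
                switch (op) {                   /* decode + dispatch */
                case OP_LOAD: acc  = code[pc++]; break;
                case OP_ADD:  acc += code[pc++]; break;
                case OP_HALT: printf("acc = %d\n", acc); return;
                default:      fprintf(stderr, "illegal opcode 0x%02x\n", op); return;
                }
            }
        }

        int main(void)
        {
            const uint8_t prog[] = { OP_LOAD, 40, OP_ADD, 2, OP_HALT };
            emulate(prog, sizeof prog);   /* prints "acc = 42" */
            return 0;
        }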

    36. 18.5.6 Programming-Environment Virtualization Another kind of virtualization, based on a different execution model, is the virtualization of programming environments. Here, a programming language is designed to run within a custom-built virtualized environment. For example, Oracle's Java has many features that depend on its running in the Java virtual machine (JVM), including specific methods for security and memory management. If we define virtualization as including only duplication of hardware, this is not really virtualization at all. But we need not limit ourselves to that definition. Instead, we can define a virtual environment, based on APIs, that provides a set of features we want to have available for a particular language and programs written in that language. Java programs run within the JVM environment, and the JVM is compiled to be a native program on systems on which it runs. This arrangement means that Java programs are written once and then can run on any system (including all of the major operating systems) on which a JVM is available. The same can be said of interpreted languages, which run inside programs that read each instruction and interpret it into native operations.

      This section explores virtualization at the programming level, where software environments act as execution platforms rather than hardware duplications. The Java Virtual Machine (JVM) is a prime example, providing memory management, security, and portability across different operating systems. Unlike traditional virtualization, this approach abstracts software rather than hardware, allowing applications to run independently of the underlying system. Interpreted languages like Python use similar techniques. This type of virtualization enables platform independence and enhances security, making it ideal for cross-platform development, but it does not offer full system isolation like hypervisor-based virtualization.

    37. Paravirtualization

      Paravirtualization modifies guest operating systems to improve performance by reducing the overhead of full virtualization. Instead of simulating hardware, the VMM provides an optimized interface that guests interact with. Xen pioneered this approach, introducing efficient device I/O using shared memory buffers and hypercalls for memory management. Paravirtualized guests require modifications but achieve better efficiency. Over time, Xen adapted to hardware-assisted virtualization, reducing reliance on paravirtualization. However, some hypervisors, especially Type 0, still use paravirtualization techniques to optimize system performance and reduce resource overhead.

    38. Type 2 hypervisors are less interesting to us as operating-system explorers, because there is very little operating-system involvement in these application-level virtual machine managers. This type of VMM is simply another process run and managed by the host, and even the host does not know that virtualization is happening within the VMM. Type 2 hypervisors have limits not associated with some of the other types. For example, a user needs administrative privileges to access many of the hardware assistance features of modern CPUs. If the VMM is being run by a standard user without additional privileges, the VMM cannot take advantage of these features. Due to this limitation, as well as the extra overhead of running a general-purpose operating system as well as guest operating systems, type 2 hypervisors tend to have poorer overall performance than type 0 or type 1. As is often the case, the limitations of type 2 hypervisors also provide some benefits. They run on a variety of general-purpose operating systems, and running them requires no changes to the host operating system. A student can use a type 2 hypervisor, for example, to test a non-native operating system without replacing the native operating system. In fact, on an Apple laptop, a student could have versions of Windows, Linux, Unix, and less common operating systems all available for learning and experimentation.

      Type 2 hypervisors run as applications on a host operating system, making them less efficient but more accessible. They allow users to run different operating systems without modifying the host OS, making them useful for testing and educational purposes. However, they have limited hardware access and higher overhead due to the need to run within a general-purpose OS. Performance is lower compared to Type 0 and Type 1 hypervisors. The section highlights how Type 2 hypervisors are convenient for individual users but are rarely used in large-scale virtualization environments.

    39. 18.5.3 Type 1 Hypervisor Type 1 hypervisors are commonly found in company data centers and are, in a sense, becoming “the data-center operating system.” They are special-purpose operating systems that run natively on the hardware, but rather than providing system calls and other interfaces for running programs, they create, run, and manage guest operating systems. In addition to running on standard hardware, they can run on type 0 hypervisors, but not on other type 1 hypervisors. Whatever the platform, guests generally do not know they are running on anything but the native hardware. Type 1 hypervisors run in kernel mode, taking advantage of hardware protection. Where the host CPU allows, they use multiple modes to give guest operating systems their own control and improved performance. They implement device drivers for the hardware they run on, since no other component could do so. Because they are operating systems, they must also provide CPU scheduling, memory management, I/O management, protection, and even security. Frequently, they provide APIs, but those APIs support applications in guests or external applications that supply features like backups, monitoring, and security. Many type 1 hypervisors are closed-source commercial offerings, such as VMware ESX, while some are open source or hybrids of open and closed source, such as Citrix XenServer and its open Xen counterpart. By using type 1 hypervisors, data-center managers can control and manage the operating systems and applications in new and sophisticated ways. An important benefit is the ability to consolidate more operating systems and applications onto fewer systems. For example, rather than having ten systems running at 10 percent utilization each, a data center might have one server manage the entire load. If utilization increases, guests and their applications can be moved to less-loaded systems live, without interruption of service. Using snapshots and cloning, the system can save the states of guests and duplicate those states—a much easier task than restoring from backups or installing manually or via scripts and tools. The price of this increased manageability is the cost of the VMM (if it is a commercial product), the need to learn new management tools and methods, and the increased complexity. Another type of type 1 hypervisor includes various general-purpose operating systems with VMM functionality. Here, an operating system such as RedHat Enterprise Linux, Windows, or Oracle Solaris performs its normal duties as well as providing a VMM allowing other operating systems to run as guests. Because of their extra duties, these hypervisors typically provide fewer virtualization features than other type 1 hypervisors. In many ways, they treat a guest operating system as just another process, but they provide special handling when the guest tries to execute special instructions.

      Type 1 hypervisors, also known as "bare-metal" hypervisors, function as specialized operating systems that manage guest VMs directly on hardware. They run in kernel mode, handling CPU scheduling, memory management, and I/O operations while offering APIs for external tools. They improve data-center efficiency by consolidating workloads, enabling live migration, and simplifying backup and replication. However, their complexity and cost, especially for commercial options like VMware ESX, pose challenges. Some general-purpose operating systems, such as Red Hat Enterprise Linux, incorporate Type 1 hypervisors, treating VMs like processes but providing special execution handling.

    40. Type 0 Hypervisor

      Type 0 hypervisors are hardware-based and load guest operating systems at boot time, enabling efficient execution. They partition resources such as CPUs, memory, and I/O devices to guests, simplifying implementation. However, I/O sharing is a challenge, requiring management by the hypervisor or a control partition. Some type 0 hypervisors support dynamic resource reallocation, allowing guests to adjust CPU and memory use. This section also explains how type 0 hypervisors enable nested virtualization, where guests act as VMMs themselves. These hypervisors are distinct because they provide close-to-hardware execution with minimal overhead.

    41. Let's begin with the virtual machine life cycle. Whatever the hypervisor type, at the time a virtual machine is created, its creator gives the VMM certain parameters. These parameters usually include the number of CPUs, amount of memory, networking details, and storage details that the VMM will take into account when creating the guest. For example, a user might want to create a new guest with two virtual CPUs, 4 GB of memory, 10 GB of disk space, one network interface that gets its IP address via DHCP, and access to the DVD drive. The VMM then creates the virtual machine with those parameters. In the case of a type 0 hypervisor, the resources are usually dedicated. In this situation, if there are not two virtual CPUs available and unallocated, the creation request in our example will fail. For other hypervisor types, the resources are dedicated or virtualized, depending on the type. Certainly, an IP address cannot be shared, but the virtual CPUs are usually multiplexed on the physical CPUs as discussed in Section 18.6.1. Similarly, memory management usually involves allocating more memory to guests than actually exists in physical memory. This is more complicated and is described in Section 18.6.2. Finally, when the virtual machine is no longer needed, it can be deleted. When this happens, the VMM first frees up any used disk space and then removes the configuration associated with the virtual machine, essentially forgetting the virtual machine. These steps are quite simple compared with building, configuring, running, and removing physical machines. Creating a virtual machine from an existing one can be as easy as clicking the “clone” button and providing a new name and IP address. This ease of creation can lead to virtual machine sprawl, which occurs when there are so many virtual machines on a system that their use, history, and state become confusing and difficult to track.

      This section explains the process of creating, running, and deleting virtual machines (VMs). The Virtual Machine Monitor (VMM) assigns resources such as CPUs, memory, and storage during VM creation. Type 0 hypervisors require dedicated resources, while other hypervisors use multiplexing. The section highlights virtual machine sprawl, a challenge where excessive VMs become difficult to manage. Compared to physical machines, VMs are easier to clone and manage. The discussion sets the stage for understanding hypervisor types and their functionality by showing how VMs operate throughout their life cycle.
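
      A small sketch of the creation request and the dedicated-resource check a type 0 hypervisor might perform; the struct fields and example numbers mirror the text above but are otherwise hypothetical.

        /* Sketch of a VM-creation request and a dedicated-resource check.
         * All types and fields are invented; real VMMs take these parameters
         * from management tools or configuration files. */
        #include <stdbool.h>
        #include <stdio.h>

        struct vm_params {
            int  vcpus;          /* e.g., 2 */
            long mem_mb;         /* e.g., 4096 (4 GB) */
            long disk_gb;        /* e.g., 10 */
            bool dhcp;           /* network interface gets its IP via DHCP */
            bool attach_dvd;
        };

        struct host_free { int cpus; long mem_mb; long disk_gb; };

        /* With dedicated allocation (type 0), the request fails outright if the
         * free resources cannot cover it; there is no multiplexing. */
        bool create_guest_dedicated(struct host_free *h, const struct vm_params *p)
        {
            if (h->cpus < p->vcpus || h->mem_mb < p->mem_mb || h->disk_gb < p->disk_gb)
                return false;
            h->cpus -= p->vcpus;  h->mem_mb -= p->mem_mb;  h->disk_gb -= p->disk_gb;
            return true;
        }

        int main(void)
        {
            struct host_free host = { .cpus = 1, .mem_mb = 8192, .disk_gb = 500 };
            struct vm_params req  = { .vcpus = 2, .mem_mb = 4096, .disk_gb = 10,
                                      .dhcp = true, .attach_dvd = true };
            printf("creation %s\n",
                   create_guest_dedicated(&host, &req) ? "succeeded" : "failed");
            return 0;
        }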

    42. 18.4.1 Trap-and-Emulate

      Trap-and-emulate is a virtualization technique in which the guest runs entirely in physical user mode, with guest kernel mode provided as a virtual kernel mode. When the guest attempts a privileged instruction, the hardware traps to the VMM, which emulates the instruction's effect on the virtual machine state and then resumes the guest. This method adds overhead, since every privileged instruction runs more slowly than it would natively. However, hardware advancements have minimized these inefficiencies, with modern CPUs incorporating dedicated virtualization support. The technique is foundational in many virtualization solutions, balancing security and efficiency. While it offers a straightforward approach, optimizations that reduce the need for frequent emulation have improved its practical viability in modern computing.
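
      A sketch of the VMM side of trap-and-emulate, assuming invented opcodes and a simplified virtual CPU: the handler applies the privileged instruction's effect to the virtual machine state (or reflects a fault into the guest) and then resumes the guest.

        /* Sketch of a trap handler for privileged instructions. The guest kernel
         * runs in physical user mode, so these instructions trap to the VMM.
         * Opcodes and the vcpu layout are hypothetical. */
        #include <stdint.h>
        #include <stdio.h>

        struct vcpu {
            int      virt_mode;       /* 0 = virtual user, 1 = virtual kernel */
            uint64_t flags;           /* guest flag register */
            int      interrupts_on;
        };

        enum { PRIV_CLI = 1, PRIV_STI = 2, PRIV_POPF = 3 };

        /* Called by the real trap handler when the guest executes a privileged
         * instruction in physical user mode. */
        void emulate_privileged(struct vcpu *v, int opcode, uint64_t operand)
        {
            if (!v->virt_mode) {
                /* Guest was in virtual user mode: deliver the fault to the guest
                 * kernel instead, just as real hardware would. */
                printf("inject fault into guest\n");
                return;
            }
            switch (opcode) {
            case PRIV_CLI:  v->interrupts_on = 0; break;
            case PRIV_STI:  v->interrupts_on = 1; break;
            case PRIV_POPF: v->flags = operand;   break;  /* full flag replace */
            }
            /* advance the guest program counter past the instruction, then resume */
        }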

    43. 18.4.2 Binary Translation Some CPUs do not have a clean separation of privileged and nonprivileged instructions. Unfortunately for virtualization implementers, the Intel x86 CPU line is one of them. No thought was given to running virtualization on the x86 when it was designed. (In fact, the first CPU in the family—the Intel 4004, released in 1971—was designed to be the core of a calculator.) The chip has maintained backward compatibility throughout its lifetime, preventing changes that would have made virtualization easier through many generations. Let's consider an example of the problem. The command popf loads the flag register from the contents of the stack. If the CPU is in privileged mode, all of the flags are replaced from the stack. If the CPU is in user mode, then only some flags are replaced, and others are ignored. Because no trap is generated if popf is executed in user mode, the trap-and-emulate procedure is rendered useless. Other x86 instructions cause similar problems. For the purposes of this discussion, we will call this set of instructions special instructions. As recently as 1998, using the trap-and-emulate method to implement virtualization on the x86 was considered impossible because of these special instructions. This previously insurmountable problem was solved with the implementation of the binary translation technique. Binary translation is fairly simple in concept but complex in implementation. The basic steps are as follows: 1. If the guest VCPU is in user mode, the guest can run its instructions natively on a physical CPU. 2. If the guest VCPU is in kernel mode, then the guest believes that it is running in kernel mode. The VMM examines every instruction the guest executes in virtual kernel mode by reading the next few instructions that the guest is going to execute, based on the guest's program counter. Instructions other than special instructions are run natively. Special instructions are translated into a new set of instructions that perform the equivalent task—for example, changing the flags in the VCPU. Binary translation is shown in Figure 18.3. It is implemented by translation code within the VMM. The code reads native binary instructions dynamically from the guest, on demand, and generates native binary code that executes in place of the original code.

      Binary translation is a virtualization method developed for CPUs without a clean separation of privileged and nonprivileged instructions, such as early x86 processors. Unlike trap-and-emulate, binary translation inspects the instructions the guest is about to execute in virtual kernel mode and dynamically rewrites the special instructions (those that silently behave differently in user mode instead of trapping) into equivalent code that operates on the virtual CPU state. The VMware implementation reduces overhead by caching translated code so that each block is translated only once. Maintaining memory-management consistency between guest page tables and the VMM adds complexity, which nested page tables (NPTs) address by mapping guest memory operations onto physical memory. Despite its overhead, binary translation played a crucial role in making x86 virtualization feasible and efficient.
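
      A sketch of the translation pass under a toy one-byte "encoding": instructions in a guest basic block are copied through unchanged except for the special ones, which are rewritten into a safe equivalent that updates the virtual CPU. The translation cache mentioned above is only noted in a comment.

        /* Sketch of binary translation: scan the instructions the guest is about
         * to execute in virtual kernel mode and rewrite the "special" ones (those
         * that silently misbehave in user mode, like popf) into code that updates
         * the virtual CPU instead. Encodings are hypothetical simplifications. */
        #include <stddef.h>
        #include <stdint.h>

        enum { INSN_NORMAL = 0, INSN_POPF = 1 };          /* toy guest encoding */
        enum { OUT_NATIVE = 0, OUT_CALL_EMUL_POPF = 1 };  /* toy output encoding */

        /* Translate one basic block; returns the number of output instructions.
         * A real translator caches the result keyed by guest PC so that each
         * block is translated only once. */
        size_t translate_block(const uint8_t *guest, size_t n, uint8_t *out)
        {
            size_t emitted = 0;
            for (size_t i = 0; i < n; i++) {
                if (guest[i] == INSN_POPF)
                    out[emitted++] = OUT_CALL_EMUL_POPF;  /* safe equivalent */
                else
                    out[emitted++] = OUT_NATIVE;          /* run unmodified */
            }
            return emitted;
        }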

    44. 18.4.3 Hardware Assistance

      Modern CPUs incorporate hardware-assisted virtualization features that significantly improve efficiency and security. Intel’s VT-x and AMD’s AMD-V provide specialized modes for host and guest operations, eliminating the need for binary translation. These enhancements streamline context switching, memory management, and interrupt handling, reducing virtualization overhead. Features such as nested page tables (NPTs) optimize memory translation, while Direct Memory Access (DMA) protection ensures secure data transfers between VMs. ARM’s EL2 exception level further enhances hypervisor security. Additionally, thin hypervisors like Apple’s Hypervisor.framework leverage these hardware capabilities, allowing efficient virtual machine creation with minimal system overhead. Hardware support has revolutionized virtualization.

    45. Benefits and Features

      Virtualization offers significant advantages, primarily hardware resource sharing while ensuring isolation among VMs. This enhances security—a virus in one VM won't affect others. However, isolation can limit resource sharing, which is addressed via shared file systems or virtual networks. A key feature is snapshotting, allowing VM states to be saved and restored. Developers benefit from concurrent OS testing, while data centers optimize resources by consolidating multiple VMs on a single machine. Live migration enables uninterrupted workload transfers between physical servers, crucial for maintenance and scaling. Additionally, virtualization underpins cloud computing and enhances remote desktop security.

    46. 18.2 History Virtual machines first appeared commercially on IBM mainframes in 1972. Virtualization was provided by the IBM VM operating system. This system has evolved and is still available. In addition, many of its original concepts are found in other systems, making it worth exploring. IBM VM/370 divided a mainframe into multiple virtual machines, each running its own operating system. A major difficulty with the VM approach involved disk systems. Suppose that the physical machine had three disk drives but wanted to support seven virtual machines. Clearly, it could not allocate a disk drive to each virtual machine. The solution was to provide virtual disks—termed minidisks in IBM's VM operating system. The minidisks were identical to the system's hard disks in all respects except size. The system implemented each minidisk by allocating as many tracks on the physical disks as the minidisk needed. Once the virtual machines were created, users could run any of the operating systems or software packages that were available on the underlying machine. For the IBM VM system, a user normally ran CMS—a single-user interactive operating system. For many years after IBM introduced this technology, virtualization remained in its domain. Most systems could not support virtualization. However, a formal definition of virtualization helped to establish system requirements and a target for functionality. The virtualization requirements called for: Fidelity. A VMM provides an environment for programs that is essentially identical to the original machine. Performance. Programs running within that environment show only minor performance decreases. Safety. The VMM is in complete control of system resources. These requirements still guide virtualization efforts today. By the late 1990s, Intel 80x86 CPUs had become common, fast, and rich in features. Accordingly, developers launched multiple efforts to implement virtualization on that platform. Both Xen and VMware created technologies, still used today, to allow guest operating systems to run on the 80x86. Since that time, virtualization has expanded to include all common CPUs, many commercial and open-source tools, and many operating systems. For example, the open-source VirtualBox project (http://www.virtualbox.org) provides a program that runs on Intel x86 and AMD 64 CPUs and on Windows, Linux, macOS, and Solaris host operating systems. Possible guest operating systems include many versions of Windows, Linux, Solaris, and BSD, including even MS-DOS and IBM OS/2.

      Virtual machines first became commercially available with IBM VM/370 in 1972, enabling mainframe partitioning into multiple VMs. A major challenge was disk allocation, which IBM resolved using minidisks, allowing virtual machines to have dedicated yet flexible storage. Initially, virtualization was exclusive to IBM, but broader adoption emerged when Intel 80x86 CPUs advanced in the 1990s. Technologies like Xen and VMware enabled virtualization on commodity hardware, making it mainstream. Today, projects like VirtualBox support multiple architectures and operating systems. The core principles—fidelity, performance, and safety—continue guiding virtualization developments, reinforcing its role in modern computing.

    47. Take a moment to note that with virtualization, the definition of “operating system” once again blurs. For example, consider VMM software such as VMware ESX. This virtualization software is installed on the hardware, runs when the hardware boots, and provides services to applications. The services include traditional ones, such as scheduling and memory management, along with new types, such as migration of applications between systems. Furthermore, the applications are, in fact, guest operating systems. Is the VMware ESX VMM an operating system that, in turn, runs other operating systems? Certainly it acts like an operating system. For clarity, however, we call the component that provides virtual environments a VMM. The implementation of VMMs varies greatly. Options include the following: Hardware-based solutions that provide support for virtual machine creation and management via firmware. These VMMs, which are commonly found in mainframe and large to midsized servers, are generally known as type 0 hypervisors. IBM LPARs and Oracle LDOMs are examples. Operating-system-like software built to provide virtualization, including VMware ESX (mentioned above), Joyent SmartOS, and Citrix XenServer. These VMMs are known as type 1 hypervisors. General-purpose operating systems that provide standard functions as well as VMM functions, including Microsoft Windows Server with HyperV and Red Hat Linux with the KVM feature. Because such systems have a feature set similar to type 1 hypervisors, they are also known as type 1. Applications that run on standard operating systems but provide VMM features to guest operating systems. These applications, which include VMware Workstation and Fusion, Parallels Desktop, and Oracle VirtualBox, are type 2 hypervisors. Paravirtualization, a technique in which the guest operating system is modified to work in cooperation with the VMM to optimize performance. Programming-environment virtualization, in which VMMs do not virtualize real hardware but instead create an optimized virtual system. This technique is used by Oracle Java and Microsoft.Net. Emulators that allow applications written for one hardware environment to run on a very different hardware environment, such as a different type of CPU. Application containment, which is not virtualization at all but rather provides virtualization-like features by segregating applications from the operating system. Oracle Solaris Zones, BSD Jails, and IBM AIX WPARs “contain” applications, making them more secure and manageable. The variety of virtualization techniques in use today is a testament to the breadth, depth, and importance of virtualization in modern computing. Virtualization is invaluable for data-center operations, efficient application development, and software testing, among many other uses.

      There are multiple types of VMMs, categorized based on how they integrate with hardware and software:

      Type 0 Hypervisors: Implemented in firmware, commonly used in enterprise servers (e.g., IBM LPARs, Oracle LDOMs).
      Type 1 Hypervisors: Installed directly on hardware, functioning as an OS substitute (e.g., VMware ESX, Citrix XenServer, Joyent SmartOS).
      Type 2 Hypervisors: Run as applications on an existing OS (e.g., VMware Workstation, Parallels, VirtualBox).
      Paravirtualization: Guest OS is modified for optimized interaction with the hypervisor.
      Programming-Environment Virtualization: Uses a custom virtual system instead of real hardware (e.g., Java Virtual Machine (JVM), Microsoft .NET).
      Emulators: Allow applications for one architecture to run on a different CPU type.
      Application Containment: Not true virtualization but isolates applications for security and management (e.g., Solaris Zones, BSD Jails).

    48. 18.1 Overview The fundamental idea behind a virtual machine is to abstract the hardware of a single computer (the CPU, memory, disk drives, network interface cards, and so forth) into several different execution environments, thereby creating the illusion that each separate environment is running on its own private computer. This concept may seem similar to the layered approach of operating system implementation (see Section 2.8.2), and in some ways it is. In the case of virtualization, there is a layer that creates a virtual system on which operating systems or applications can run. Virtual machine implementations involve several components. At the base is the host, the underlying hardware system that runs the virtual machines. The virtual machine manager (VMM) (also known as a hypervisor) creates and runs virtual machines by providing an interface that is identical to the host (except in the case of paravirtualization, discussed later). Each guest process is provided with a virtual copy of the host (Figure 18.1). Usually, the guest process is in fact an operating system. A single physical machine can thus run multiple operating systems concurrently, each in its own virtual machine.

      A VMM acts as an intermediary between the physical hardware (host) and the guest operating systems. Figure 18.1 contrasts a non-virtualized system with a virtualized one. In a non-virtual system, the kernel directly manages resources, whereas in a virtualized system, the VMM introduces an additional layer, allowing multiple OSs to run simultaneously. Each guest OS perceives the virtualized hardware as if it were running on a real machine. This abstraction increases resource efficiency, isolation, and scalability. Virtualization enables server consolidation, testing environments, and cloud computing by efficiently utilizing hardware resources while ensuring security and independence between virtual instances.

    49. 14.7.1 Consistency Checking Whatever the cause of corruption, a file system must first detect the problems and then correct them. For detection, a scan of all the metadata on each file system can confirm or deny the consistency of the system. Unfortunately, this scan can take minutes or hours and should occur every time the system boots. Alternatively, a file system can record its state within the file-system metadata. At the start of any metadata change, a status bit is set to indicate that the metadata is in flux. If all updates to the metadata complete successfully, the file system can clear that bit. If, however, the status bit remains set, a consistency checker is run. The consistency checker—a systems program such as fsck in UNIX—compares the data in the directory structure and other metadata with the state on storage and tries to fix any inconsistencies it finds. The allocation and free-space-management algorithms dictate what types of problems the checker can find and how successful it will be in fixing them. For instance, if linked allocation is used and there is a link from any block to its next block, then the entire file can be reconstructed from the data blocks, and the directory structure can be recreated. In contrast, the loss of a directory entry on an indexed allocation system can be disastrous, because the data blocks have no knowledge of one another. For this reason, some UNIX file systems cache directory entries for reads, but any write that results in space allocation, or other metadata changes, is done synchronously, before the corresponding data blocks are written. Of course, problems can still occur if a synchronous write is interrupted by a crash. Some NVM storage devices contain a battery or supercapacitor to provide enough power, even during a power loss, to write data from device buffers to the storage media so the data are not lost. But even those precautions do not protect against corruption due to a crash.

      File system corruption can arise from crashes, power failures, or hardware issues. To maintain consistency, a system must first detect and then correct these problems. A full metadata scan can verify consistency but is time-consuming, especially on large storage devices. An alternative approach is tracking file-system state changes using metadata flags. If an operation is interrupted, leaving the flag set, a consistency checker such as fsck in UNIX runs to repair inconsistencies. File systems with linked allocation can reconstruct files from data blocks, while indexed allocation systems risk permanent data loss. Advanced storage devices use battery-backed buffers to mitigate corruption risks.
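
      The status-bit idea can be sketched in a few lines, with hypothetical stand-ins for the superblock flag and the checker: mark the metadata "in flux" before a change, clear the mark afterward, and run the checker at mount time if the mark was left set.

        /* Sketch of the metadata status-bit protocol described above.
         * Functions are stand-ins for real superblock and fsck machinery. */
        #include <stdbool.h>
        #include <stdio.h>

        static bool metadata_dirty;                 /* persisted in the superblock */

        static void write_dirty_bit(bool v) { metadata_dirty = v; /* + flush */ }
        static bool do_metadata_update(void) { return true; /* may crash midway */ }
        static void run_consistency_checker(void) { puts("running fsck-style scan"); }

        void update_metadata(void)
        {
            write_dirty_bit(true);                  /* "metadata is in flux" */
            if (do_metadata_update())
                write_dirty_bit(false);             /* completed cleanly */
        }

        void mount_filesystem(void)
        {
            if (metadata_dirty)                     /* a crash left the bit set */
                run_consistency_checker();
        }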

    50. 14.7.2 Log-Structured File Systems

      Borrowing from database transaction recovery methods, log-based file systems (e.g., NTFS, ext3, ext4, ZFS) improve on after-the-fact consistency checking. Rather than detecting and repairing inconsistencies, they prevent them by writing metadata updates sequentially to a log. Each set of operations is recorded as a transaction and is considered committed once it is fully written to the log, even before the actual file-system structures are updated. If a system crash occurs, any committed but not yet applied transactions in the log are replayed, maintaining file-system integrity. This approach also improves performance by turning costly random metadata writes into efficient sequential log writes. The log is typically a circular buffer that reuses space only after the logged updates have been applied, making log-based file systems both robust and efficient for metadata-intensive workloads.
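
      A simplified sketch of the journaling idea, using an in-memory array as a stand-in for the on-disk circular log: updates are appended as transactions, and at mount time any committed but unapplied transactions are replayed. All structures are hypothetical.

        /* Sketch of journaling: metadata updates are appended to a log as
         * transactions and replayed after a crash if they were committed but
         * not yet applied to their final locations. */
        #include <stdbool.h>
        #include <stdio.h>

        #define LOG_SLOTS 64

        struct log_rec {
            int  txid;
            bool committed;        /* the commit record reached the log */
            bool applied;          /* the update reached its final location */
            char what[32];         /* e.g., "alloc inode 117" */
        };

        static struct log_rec journal[LOG_SLOTS];
        static int log_head;       /* next slot to use; wraps around */

        int log_append(int txid, const char *what)
        {
            int slot = log_head;
            journal[slot] = (struct log_rec){ .txid = txid, .committed = true };
            snprintf(journal[slot].what, sizeof journal[slot].what, "%s", what);
            /* A real log reuses a slot only after its update has been applied. */
            log_head = (log_head + 1) % LOG_SLOTS;
            return slot;
        }

        void replay_after_crash(void)
        {
            for (int i = 0; i < LOG_SLOTS; i++)
                if (journal[i].committed && !journal[i].applied) {
                    printf("replaying tx %d: %s\n", journal[i].txid, journal[i].what);
                    journal[i].applied = true;
                }
        }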

    51. 14.7.3 Other Solutions Another alternative to consistency checking is employed by Network Appliance's WAFL file system and the Solaris ZFS file system. These systems never overwrite blocks with new data. Rather, a transaction writes all data and metadata changes to new blocks. When the transaction is complete, the metadata structures that pointed to the old versions of these blocks are updated to point to the new blocks. The file system can then remove the old pointers and the old blocks and make them available for reuse. If the old pointers and blocks are kept, a snapshot is created; the snapshot is a view of the file system at a specific point in time (before any updates after that time were applied). This solution should require no consistency checking if the pointer update is done atomically. WAFL does have a consistency checker, however, so some failure scenarios can still cause metadata corruption. (See Section 14.8 for details of the WAFL file system.) ZFS takes an even more innovative approach to disk consistency. Like WAFL, it never overwrites blocks. However, ZFS goes further and provides checksumming of all metadata and data blocks. This solution (when combined with RAID) assures that data are always correct. ZFS therefore has no consistency checker. (More details on ZFS are found in Section 11.8.6.)

      Alternative approaches to consistency checking include techniques employed by WAFL and ZFS. These file systems avoid overwriting existing blocks, instead writing new data and updating pointers atomically. This method inherently maintains consistency without requiring explicit checks. If the old pointers are preserved, snapshots can be created, offering a time-specific view of the file system. ZFS enhances this approach by implementing end-to-end checksumming, ensuring data integrity even in the presence of hardware failures. Since ZFS inherently guarantees consistency through its architecture, it does not require a separate consistency checker, making it a more resilient and self-healing file system.
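
      A minimal sketch of the never-overwrite update: write the new version of a block to fresh space, then switch a single pointer; keeping the old pointer is what yields a snapshot. The structures are hypothetical and omit ZFS's checksumming.

        /* Sketch of a copy-on-write (never-overwrite) block update. */
        #include <stdlib.h>
        #include <string.h>

        struct block { char data[4096]; };

        struct file_meta {
            struct block *current;     /* pointer published to readers */
            struct block *snapshot;    /* optional retained old version */
        };

        void cow_update(struct file_meta *m, const char *newdata, size_t n,
                        int keep_snapshot)
        {
            struct block *nb = malloc(sizeof *nb);      /* fresh block */
            if (nb == NULL)
                return;
            memcpy(nb->data, newdata, n < sizeof nb->data ? n : sizeof nb->data);

            struct block *old = m->current;
            m->current = nb;            /* the one "atomic" pointer update */

            if (keep_snapshot)
                m->snapshot = old;      /* old version remains readable */
            else
                free(old);              /* old block returns to the free list */
        }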

    52. 14.7.4 Backup and Restore

      To prevent permanent data loss, backup strategies involve periodic copies of storage data. A full backup captures all files, while incremental backups store only changes since the last backup. A typical cycle includes a full backup on day one, followed by daily incremental backups. This minimizes storage usage while ensuring recent changes are protected. However, restoring an entire system requires sequential application of incremental backups. Periodic full backups mitigate this by reducing restoration complexity. Permanent backups should be stored offsite for disaster recovery. Multiple backup sites further protect critical data from cyberattacks or physical catastrophes like fire or theft.
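
      A small sketch of selecting files for an incremental backup by comparing each file's modification time against the previous backup's timestamp; the paths and the "copy" step are placeholders.

        /* Sketch of choosing files for an incremental backup: copy only files
         * modified since the previous backup. A real tool would also record
         * the new backup time for the next cycle. */
        #include <stdio.h>
        #include <time.h>
        #include <sys/stat.h>

        void incremental_backup(const char **paths, int n, time_t last_backup)
        {
            for (int i = 0; i < n; i++) {
                struct stat st;
                if (stat(paths[i], &st) != 0)
                    continue;                           /* skip unreadable entries */
                if (st.st_mtime > last_backup)
                    printf("back up %s\n", paths[i]);   /* stand-in for the copy */
            }
        }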

    53. TRIMing Unused Blocks HDDs and other storage media that allow blocks to be overwritten for updates need only the free list for managing free space. Blocks do not need to be treated specially when freed. A freed block typically keeps its data (but without any file pointers to the block) until the data are overwritten when the block is next allocated. Storage devices that do not allow overwrite, such as NVM flash-based storage devices, suffer badly when these same algorithms are applied. Recall from Section 11.1.2 that such devices must be erased before they can again be written to, and that those erases must be made in large chunks (blocks, composed of pages) and take a relatively long time compared with reads or writes. A new mechanism is needed to allow the file system to inform the storage device that a page is free and can be considered for erasure (once the block containing the page is entirely free). That mechanism varies based on the storage controller. For ATA-attached drives, it is TRIM, while for NVMe-based storage, it is the unallocate command. Whatever the specific controller command, this mechanism keeps storage space available for writing. Without such a capability, the storage device gets full and needs garbage collection and block erasure, leading to decreases in storage I/O write performance (known as “a write cliff”). With the TRIM mechanism and similar capabilities, the garbage collection and erase steps can occur before the device is nearly full, allowing the device to provide more consistent performance.

      HDDs and other media that can overwrite blocks in place need only a free list to manage free space; a freed block simply retains its old data until it is reallocated. Flash-based NVM devices cannot overwrite in place: they must erase storage in large chunks (blocks composed of pages), and erases are slow compared with reads and writes. The file system therefore needs a way to tell the device which pages are no longer in use, via the TRIM command for ATA-attached drives or the unallocate command for NVMe storage. With such notifications, garbage collection and erasure can proceed before the device fills, avoiding the sharp drop in write performance known as the “write cliff” and keeping I/O performance consistent.
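
      On Linux, the free space of a mounted file system can be trimmed explicitly through the FITRIM ioctl, the same mechanism the fstrim utility uses. The sketch below assumes a Linux host, a mounted file system at an example path, and sufficient privileges.

        /* Minimal sketch of asking the kernel to TRIM a file system's free space. */
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>
        #include <sys/ioctl.h>
        #include <linux/fs.h>        /* FITRIM, struct fstrim_range */

        int main(void)
        {
            int fd = open("/mnt/data", O_RDONLY);   /* example mount point */
            if (fd < 0) { perror("open"); return 1; }

            struct fstrim_range range = {
                .start  = 0,
                .len    = (unsigned long long)-1,   /* whole file system */
                .minlen = 0,                        /* trim extents of any size */
            };

            if (ioctl(fd, FITRIM, &range) < 0)
                perror("FITRIM");
            else
                printf("trimmed %llu bytes of free space\n",
                       (unsigned long long)range.len);  /* kernel updates len */

            close(fd);
            return 0;
        }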

    54. 14.5.4 Counting Another approach takes advantage of the fact that, generally, several contiguous blocks may be allocated or freed simultaneously, particularly when space is allocated with the contiguous-allocation algorithm or through clustering. Thus, rather than keeping a list of n free block addresses, we can keep the address of the first free block and the number (n) of free contiguous blocks that follow the first block. Each entry in the free-space list then consists of a device address and a count. Although each entry requires more space than would a simple disk address, the overall list is shorter, as long as the count is generally greater than 1. Note that this method of tracking free space is similar to the extent method of allocating blocks. These entries can be stored in a balanced tree, rather than a linked list, for efficient lookup, insertion, and deletion. 14.5.5 Space Maps Oracle's ZFS file system (found in Solaris and some other operating systems) was designed to encompass huge numbers of files, directories, and even file systems (in ZFS, we can create file-system hierarchies). On these scales, metadata I/O can have a large performance impact. Consider, for example, that if the free-space list is implemented as a bitmap, bitmaps must be modified both when blocks are allocated and when they are freed. Freeing 1 GB of data on a 1-TB disk could cause thousands of blocks of bitmaps to be updated, because those data blocks could be scattered over the entire disk. Clearly, the data structures for such a system could be large and inefficient. In its management of free space, ZFS uses a combination of techniques to control the size of data structures and minimize the I/O needed to manage those structures. First, ZFS creates metaslabs to divide the space on the device into chunks of manageable size. A given volume may contain hundreds of metaslabs. Each metaslab has an associated space map. ZFS uses the counting algorithm to store information about free blocks. Rather than write counting structures to disk, it uses log-structured file-system techniques to record them. The space map is a log of all block activity (allocating and freeing), in time order, in counting format. When ZFS decides to allocate or free space from a metaslab, it loads the associated space map into memory in a balanced-tree structure (for very efficient operation), indexed by offset, and replays the log into that structure. The in-memory space map is then an accurate representation of the allocated and free space in the metaslab. ZFS also condenses the map as much as possible by combining contiguous free blocks into a single entry. Finally, the free-space list is updated on disk as part of the transaction-oriented operations of ZFS. During the collection and sorting phase, block requests can still occur, and ZFS satisfies these requests from the log. In essence, the log plus the balanced tree is the free list.

      Counting leverages the observation that contiguous blocks are often freed together. Instead of tracking each block separately, the system stores the starting block and the number of contiguous free blocks. This reduces the number of entries needed to maintain the free-space list, leading to improved efficiency. The method is particularly useful in file systems using contiguous allocation or clustering. Additionally, the data can be stored in a balanced tree, enabling fast searches, insertions, and deletions. By reducing list size while retaining quick lookup capabilities, counting balances storage efficiency and performance for block allocation.
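
      A sketch of the counting representation, using a sorted array of (start, count) extents as a stand-in for the balanced tree a real file system would use; all names are hypothetical.

        /* Sketch of the counting (extent-based) free list: each entry is a
         * (first block, count) pair describing a run of contiguous free blocks. */
        #include <stddef.h>

        struct free_extent { long start; long count; };

        /* Allocate n contiguous blocks, first-fit. Returns the starting block, or -1. */
        long alloc_blocks(struct free_extent *fl, size_t nents, long n)
        {
            for (size_t i = 0; i < nents; i++) {
                if (fl[i].count >= n) {
                    long block = fl[i].start;
                    fl[i].start += n;          /* shrink the extent from the front */
                    fl[i].count -= n;
                    return block;
                }
            }
            return -1;                         /* no extent large enough */
        }

        /* Free n blocks starting at 'start', coalescing with an adjacent extent
         * when possible (a real implementation checks both neighbors). */
        void free_blocks(struct free_extent *fl, size_t nents, long start, long n)
        {
            for (size_t i = 0; i < nents; i++) {
                if (fl[i].start + fl[i].count == start) {  /* extends this extent */
                    fl[i].count += n;
                    return;
                }
            }
            /* otherwise insert a new (start, n) entry -- omitted in this sketch */
        }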

      Space Maps ZFS employs space maps to efficiently track free space in large storage systems. Instead of modifying bitmaps or linked lists directly, ZFS logs allocation and deallocation operations in a time-ordered manner. These logs are processed into balanced-tree structures, minimizing I/O overhead. This approach optimizes performance by condensing metadata and reducing the frequency of disk updates. ZFS divides storage into metaslabs, each with its own space map, improving scalability. By combining log-structured techniques with counting, ZFS ensures efficient space management, making it well-suited for large-scale file systems that handle vast amounts of data.

    55. 14.5.2 Linked List

      A linked list of free blocks offers another free-space tracking approach. The system maintains a pointer to the first free block, with each block pointing to the next. This method is simple but inefficient for large-scale searches since sequential access is required. The File Allocation Table (FAT) system integrates free-space management into the allocation structure, eliminating the need for a separate list. While linked lists are less memory-intensive than bitmaps, they suffer from slow traversal times. Most systems prioritize fast block allocation, often using the first available block rather than searching for optimal placement.
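
      A sketch of a FAT-style table in which free blocks are chained together, so allocation simply takes the first block on the free chain; the table size and sentinel values are arbitrary.

        /* Sketch of a linked free list threaded through a FAT-style table: each
         * entry holds the index of the next block in its chain, and the free
         * blocks form one chain starting at free_head. */
        #define NBLOCKS   1024
        #define END_CHAIN (-1)

        static int fat[NBLOCKS];      /* fat[b] = next block after b, or END_CHAIN */
        static int free_head = END_CHAIN;

        /* Most systems simply take the first free block rather than searching. */
        int alloc_block(void)
        {
            if (free_head == END_CHAIN)
                return -1;                    /* no free space */
            int b = free_head;
            free_head = fat[b];               /* unlink from the free chain */
            fat[b] = END_CHAIN;               /* now a one-block file chain */
            return b;
        }

        void release_block(int b)
        {
            fat[b] = free_head;               /* push onto the free chain */
            free_head = b;
        }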

    56. Frequently, the free-space list is implemented as a bitmap or bit vector. Each block is represented by 1 bit. If the block is free, the bit is 1; if the block is allocated, the bit is 0. For example, consider a disk where blocks 2, 3, 4, 5, 8, 9, 10, 11, 12, 13, 17, 18, 25, 26, and 27 are free and the rest of the blocks are allocated. The free-space bitmap would be 001111001111110001100000011100000 ... The main advantage of this approach is its relative simplicity and its efficiency in finding the first free block or n consecutive free blocks on the disk. Indeed, many computers supply bit-manipulation instructions that can be used effectively for that purpose. One technique for finding the first free block on a system that uses a bit vector to allocate space is to sequentially check each word in the bitmap to see whether that value is not 0, since a 0-valued word contains only 0 bits and represents a set of allocated blocks. The first non-0 word is scanned for the first 1 bit, which is the location of the first free block. The calculation of the block number is (number of bits per word) × (number of 0-value words) + offset of first 1 bit. Again, we see hardware features driving software functionality. Unfortunately, bit vectors are inefficient unless the entire vector is kept in main memory (and is written to the device containing the file system occasionally for recovery needs). Keeping it in main memory is possible for smaller devices but not necessarily for larger ones. A 1.3-GB disk with 512-byte blocks would need a bitmap of over 332 KB to track its free blocks, although clustering the blocks in groups of four reduces this number to around 83 KB per disk. A 1-TB disk with 4-KB blocks would require 32 MB (2^40 / 2^12 = 2^28 bits = 2^25 bytes = 2^5 MB) to store its bit map. Given that disk size constantly increases, the problem with bit vectors will continue to escalate.

      A common free-space tracking method uses bitmaps, where each block is represented by a bit—1 for free, 0 for allocated. This approach simplifies searching for available blocks, as bitwise operations efficiently locate the first free block. However, bitmaps consume significant memory, especially for large disks. For instance, a 1-TB disk with 4-KB blocks needs 32 MB for a bit vector. Keeping the entire vector in memory speeds up access but may be infeasible for very large storage devices. Clustering blocks into groups can reduce memory usage. Despite its efficiency, this method struggles with scalability as storage sizes grow.
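
      The word-at-a-time search translates directly into code. The sketch below implements the quoted formula, using a portable bit loop where real systems would use a single bit-scan instruction; the word size is an assumption.

        /* Sketch of the first-free-block search described above: skip all-zero
         * words, then find the first 1 bit. The block number is
         * (bits per word) * (number of zero-valued words) + offset of first 1 bit;
         * here bit i of word w covers block w * 64 + i. */
        #include <stdint.h>

        #define BITS_PER_WORD 64

        /* Returns the first free block, or -1 if none. */
        long first_free_block(const uint64_t *bitmap, long nwords)
        {
            for (long w = 0; w < nwords; w++) {
                if (bitmap[w] == 0)
                    continue;                          /* all blocks allocated */
                uint64_t word = bitmap[w];
                int offset = 0;
                while ((word & 1) == 0) {              /* find the first 1 bit */
                    word >>= 1;
                    offset++;
                }
                return (long)BITS_PER_WORD * w + offset;
            }
            return -1;
        }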

    57. 14.6.1 Efficiency

      Efficiency depends on balancing storage structures, metadata management, and allocation methods. For example, UNIX’s inode system reserves space even when empty but improves file access speed. Similarly, BSD UNIX’s adaptive cluster sizes optimize storage use. Metadata tracking, such as access timestamps, adds processing overhead and should be evaluated for necessity. Pointer size decisions affect file system scalability—while 64-bit pointers enable larger files, they consume more memory. Poor early design choices can cause limitations, as seen in MS-DOS FAT’s early partition size restrictions. Modern operating systems use dynamic allocation to avoid these constraints, enhancing long-term system performance.

    58. 14.6.2 Performance Even after the basic file-system algorithms have been selected, we can still improve performance in several ways. As was discussed in Chapter 12, storage device controllers include local memory to form an on-board cache that is large enough to store entire tracks or blocks at a time. On an HDD, once a seek is performed, the track is read into the disk cache starting at the sector under the disk head (reducing latency time). The disk controller then transfers any sector requests to the operating system. Once blocks make it from the disk controller into main memory, the operating system may cache the blocks there. Some systems maintain a separate section of main memory for a buffer cache, where blocks are kept under the assumption that they will be used again shortly. Other systems cache file data using a page cache. The page cache uses virtual memory techniques to cache file data as pages rather than as file-system-oriented blocks. Caching file data using virtual addresses is far more efficient than caching through physical disk blocks, as accesses interface with virtual memory rather than the file system. Several systems—including Solaris, Linux, and Windows—use page caching to cache both process pages and file data. This is known as unified virtual memory. Some versions of UNIX and Linux provide a unified buffer cache. To illustrate the benefits of the unified buffer cache, consider the two alternatives for opening and accessing a file. One approach is to use memory mapping (Section 13.5); the second is to use the standard system calls read() and write(). Without a unified buffer cache, we have a situation similar to Figure 14.10. Here, the read() and write() system calls go through the buffer cache. The memory-mapping call, however, requires using two caches—the page cache and the buffer cache. A memory mapping proceeds by reading in disk blocks from the file system and storing them in the buffer cache. Because the virtual memory system does not interface with the buffer cache, the contents of the file in the buffer cache must be copied into the page cache. This situation, known as double caching, requires caching file-system data twice.

      Performance optimization involves caching strategies, memory management, and efficient write operations. Modern storage controllers use on-board caches, while operating systems employ buffer caches or page caches to reduce I/O latency. Unified caches prevent redundant data storage, minimizing memory waste and CPU overhead. Solaris and Linux leverage unified virtual memory for efficiency. Least Recently Used (LRU) algorithms manage cache replacement, but their implementation varies. Asynchronous writes improve speed by allowing immediate process continuation, whereas synchronous writes ensure data integrity. Balancing caching, write strategies, and metadata updates is crucial for maximizing storage performance while maintaining reliability.
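
      The synchronous-versus-asynchronous distinction mentioned above can be made concrete with a small POSIX sketch (an assumption about a UNIX-like system, not code from the text): opening a file with O_SYNC forces each write() to wait for the storage device, while the default lets writes return as soon as the data reach the cache.

        #include <fcntl.h>
        #include <unistd.h>
        #include <stdio.h>

        int main(void) {
            const char msg[] = "log entry\n";

            /* Asynchronous (default): write() returns once data is in the cache. */
            int fast = open("async.log", O_WRONLY | O_CREAT | O_APPEND, 0644);

            /* Synchronous: O_SYNC makes each write wait for the storage device. */
            int safe = open("sync.log", O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);

            if (fast < 0 || safe < 0) { perror("open"); return 1; }

            write(fast, msg, sizeof msg - 1);   /* fast, but lost if the system crashes now      */
            write(safe, msg, sizeof msg - 1);   /* slower, but durable when the call returns     */

            close(fast);
            close(safe);
            return 0;
        }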

    59. 14.4.4 Performance The allocation methods that we have discussed vary in their storage efficiency and data-block access times. Both are important criteria in selecting the proper method or methods for an operating system to implement. Before selecting an allocation method, we need to determine how the systems will be used. A system with mostly sequential access should not use the same method as a system with mostly random access. For any type of access, contiguous allocation requires only one access to get a block. Since we can easily keep the initial address of the file in memory, we can calculate immediately the address of the ith block (or the next block) and read it directly. For linked allocation, we can also keep the address of the next block in memory and read it directly. This method is fine for sequential access; for direct access, however, an access to the ith block might require i block reads. This problem indicates why linked allocation should not be used for an application requiring direct access. As a result, some systems support direct-access files by using contiguous allocation and sequential-access files by using linked allocation. For these systems, the type of access to be made must be declared when the file is created. A file created for sequential access will be linked and cannot be used for direct access. A file created for direct access will be contiguous and can support both direct access and sequential access, but its maximum length must be declared when it is created. In this case, the operating system must have appropriate data structures and algorithms to support both allocation methods. Files can be converted from one type to another by the creation of a new file of the desired type, into which the contents of the old file are copied. The old file may then be deleted and the new file renamed. Indexed allocation is more complex. If the index block is already in memory, then the access can be made directly. However, keeping the index block in memory requires considerable space. If this memory space is not available, then we may have to read first the index block and then the desired data block. For a two-level index, two index-block reads might be necessary. For an extremely large file, accessing a block near the end of the file would require reading in all the index blocks before the needed data block finally could be read. Thus, the performance of indexed allocation depends on the index structure, on the size of the file, and on the position of the block desired. Some systems combine contiguous allocation with indexed allocation by using contiguous allocation for small files (up to three or four blocks) and automatically switching to an indexed allocation if the file grows large. Since most files are small, and contiguous allocation is efficient for small files, average performance can be quite good. Many other optimizations are in use. Given the disparity between CPU speed and disk speed, it is not unreasonable to add thousands of extra instructions to the operating system to save just a few disk-head movements. Furthermore, this disparity is increasing over time, to the point where hundreds of thousands of instructions could reasonably be used to optimize head movements. For NVM devices, there are no disk head seeks, so different algorithms and optimizations are needed. Using an old algorithm that spends many CPU cycles trying to avoid a nonexistent head movement would be very inefficient. 
Existing file systems are being modified and new ones being created to attain maximum performance from NVM storage devices. These developments aim to reduce the instruction count and overall path between the storage device and application access to the data.

      Different allocation methods vary in efficiency and access speed. Contiguous allocation is fast and ideal for direct access, as block locations are predictable. Linked allocation works well for sequential access but is inefficient for direct access due to multiple block reads. Indexed allocation enables direct access but requires additional memory for index blocks. Some systems use a hybrid approach, switching from contiguous to indexed allocation as files grow. Storage optimizations are crucial, especially for modern non-volatile memory (NVM), where eliminating outdated disk-seek avoidance techniques enhances performance. File system designs continue evolving to maximize speed and minimize overhead.

    60. 14.4.3 Indexed Allocation Linked allocation solves the external-fragmentation and size-declaration problems of contiguous allocation. However, in the absence of a FAT, linked allocation cannot support efficient direct access, since the pointers to the blocks are scattered with the blocks themselves all over the disk and must be retrieved in order. Indexed allocation solves this problem by bringing all the pointers together into one location: the index block. Each file has its own index block, which is an array of storage-block addresses. The ith entry in the index block points to the ith block of the file. The directory contains the address of the index block (Figure 14.7). To find and read the ith block, we use the pointer in the ith index-block entry. This scheme is similar to the paging scheme described in Section 9.3. Figure 14.7 Indexed allocation of disk space. When the file is created, all pointers in the index block are set to null. When the ith block is first written, a block is obtained from the free-space manager, and its address is put in the ith index-block entry. Indexed allocation supports direct access, without suffering from external fragmentation, because any free block on the storage device can satisfy a request for more space. Indexed allocation does suffer from wasted space, however. The pointer overhead of the index block is generally greater than the pointer overhead of linked allocation. Consider a common case in which we have a file of only one or two blocks. With linked allocation, we lose the space of only one pointer per block. With indexed allocation, an entire index block must be allocated, even if only one or two pointers will be non-null. This point raises the question of how large the index block should be. Every file must have an index block, so we want the index block to be as small as possible. If the index block is too small, however, it will not be able to hold enough pointers for a large file, and a mechanism will have to be available to deal with this issue. Mechanisms for this purpose include the following: Linked scheme. An index block is normally one storage block. Thus, it can be read and written directly by itself. To allow for large files, we can link together several index blocks. For example, an index block might contain a small header giving the name of the file and a set of the first 100 disk-block addresses. The next address (the last word in the index block) is null (for a small file) or is a pointer to another index block (for a large file). Multilevel index. A variant of linked representation uses a first-level index block to point to a set of second-level index blocks, which in turn point to the file blocks. To access a block, the operating system uses the first-level index to find a second-level index block and then uses that block to find the desired data block. This approach could be continued to a third or fourth level, depending on the desired maximum file size. With 4,096-byte blocks, we could store 1,024 four-byte pointers in an index block. Two levels of indexes allow 1,048,576 data blocks and a file size of up to 4 GB. Combined scheme. Another alternative, used in UNIX-based file systems, is to keep the first, say, 15 pointers of the index block in the file's inode. The first 12 of these pointers point to direct blocks; that is, they contain addresses of blocks that contain data of the file. Thus, the data for small files (of no more than 12 blocks) do not need a separate index block. 
If the block size is 4 KB, then up to 48 KB of data can be accessed directly. The next three pointers point to indirect blocks. The first points to a single indirect block, which is an index block containing not data but the addresses of blocks that do contain data. The second points to a double indirect block, which contains the address of a block that contains the addresses of blocks that contain pointers to the actual data blocks. The last pointer contains the address of a triple indirect block. (A UNIX inode is shown in Figure 14.8.) Figure 14.8 The UNIX inode. Under this method, the number of blocks that can be allocated to a file exceeds the amount of space addressable by the 4-byte file pointers used by many operating systems. A 32-bit file pointer reaches only 2^32 bytes, or 4 GB. Many UNIX and Linux implementations now support 64-bit file pointers, which allows files and file systems to be several exbibytes in size. The ZFS file system supports 128-bit file pointers. Indexed-allocation schemes suffer from some of the same performance problems as does linked allocation. Specifically, the index blocks can be cached in memory, but the data blocks may be spread all over a volume.

      Indexed allocation improves direct access by consolidating all block pointers into a single index block. Each file has an index block, storing pointers to its data blocks. This approach prevents external fragmentation and allows quick access. However, small files waste space since an entire index block must be allocated even if only a few pointers are used. Various indexing strategies exist: a linked scheme, multilevel index, and combined scheme, such as UNIX inodes. While indexed allocation enables efficient file management, performance can be affected if index blocks are not cached, leading to scattered data blocks across the disk.
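
      The combined scheme can be illustrated with the lookup arithmetic alone. The sketch below is hypothetical and assumes the figures quoted above: 12 direct pointers in the inode, 4-KB blocks, and 1,024 four-byte pointers per index block. It only classifies which pointer covers a given logical block and performs no disk I/O.

        #include <stdio.h>

        #define NDIRECT        12      /* direct pointers kept in the inode   */
        #define PTRS_PER_BLOCK 1024    /* 4-KB block / 4-byte pointer         */

        /* Report which part of the combined scheme covers logical block n. */
        void classify(long n) {
            if (n < NDIRECT) {
                printf("block %ld: direct pointer %ld\n", n, n);
                return;
            }
            n -= NDIRECT;
            if (n < PTRS_PER_BLOCK) {
                printf("single indirect, entry %ld\n", n);
                return;
            }
            n -= PTRS_PER_BLOCK;
            if (n < (long)PTRS_PER_BLOCK * PTRS_PER_BLOCK) {
                printf("double indirect, entries %ld and %ld\n",
                       n / PTRS_PER_BLOCK, n % PTRS_PER_BLOCK);
                return;
            }
            printf("triple indirect\n");
        }

        int main(void) {
            classify(5);        /* direct                          */
            classify(12);       /* first entry of single indirect  */
            classify(500000);   /* double indirect                 */
            classify(2000000);  /* triple indirect                 */
            return 0;
        }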

    61. 14.4.2 Linked Allocation

      Linked allocation eliminates the problems of contiguous allocation by allowing file blocks to be scattered across the disk. Each file is a linked list of blocks, where each block contains a pointer to the next. The directory maintains a pointer to the first and last block, ensuring that files can grow dynamically. Since blocks do not need to be contiguous, external fragmentation is eliminated. However, linked allocation is best suited for sequential access rather than direct access. Finding a specific block requires traversing the list from the beginning, which can be inefficient for large files with frequent random access requirements.

      The pointer overhead in linked allocation slightly reduces usable storage. If each block includes a pointer, a small percentage of disk space is used for metadata instead of file content. To mitigate this, blocks can be grouped into clusters, reducing the number of pointers needed. While clusters improve disk throughput by minimizing seek operations, they also introduce internal fragmentation. Unused space within a cluster remains wasted, leading to inefficiencies. Despite this drawback, linked allocation remains a viable option for file systems that prioritize storage flexibility over random-access speed. Many older file systems, such as FAT, incorporate variations of linked allocation.

      Linked allocation simplifies file creation and deletion. Since new blocks can be assigned dynamically, there is no need to estimate file size at creation. This flexibility makes it suitable for applications where files grow unpredictably. However, its performance drawbacks limit its use in modern systems requiring fast random access. To improve efficiency, some file systems use indexed allocation, which maintains a separate index for block locations instead of linking blocks sequentially. Hybrid approaches, such as the File Allocation Table (FAT), use a separate table to track links between blocks, reducing the need for frequent disk reads and improving access speed.
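
      The FAT idea mentioned in this paragraph can be sketched as a simple table walk. The sketch assumes an in-memory FAT array in which each entry holds the number of the next block of the file, with an invented FAT_EOF value marking the end of the chain.

        #include <stdio.h>

        #define FAT_EOF (-1)   /* assumed end-of-chain marker for this sketch */

        /* Walk the FAT chain that starts at 'first', printing each block in
         * file order. Each FAT entry holds the number of the next block. */
        void print_chain(const int *fat, int first) {
            for (int block = first; block != FAT_EOF; block = fat[block])
                printf("%d ", block);
            printf("\n");
        }

        int main(void) {
            /* Tiny 10-entry FAT describing a file that occupies blocks 2 -> 5 -> 7. */
            int fat[10] = {0};
            fat[2] = 5;
            fat[5] = 7;
            fat[7] = FAT_EOF;

            print_chain(fat, 2);   /* prints: 2 5 7 */
            return 0;
        }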

      While linked allocation provides significant advantages over contiguous allocation, it is not without its challenges. The need to follow pointers increases read latency, making it unsuitable for applications requiring frequent random access. Additionally, lost or corrupted pointers can result in data loss or inaccessible files. To address these concerns, file systems often implement error-checking mechanisms, such as checksums or redundant pointers. Despite these safeguards, many modern operating systems favor indexed allocation or extent-based allocation to balance efficiency, flexibility, and reliability. As storage technology evolves, hybrid strategies continue to improve file allocation performance while minimizing fragmentation and access delays.

    62. 14.4.1 Contiguous Allocation

      Contiguous allocation assigns a file to a consecutive sequence of disk blocks. This method simplifies file access because the file system can easily locate and read data without extra lookups. Since adjacent blocks require minimal head movement, seek times are reduced, improving performance. However, it suffers from external fragmentation, which occurs when free space is broken into small, non-contiguous chunks. Managing free space efficiently is challenging, especially in long-running systems. Strategies like compaction can mitigate fragmentation, but they require additional processing. Despite its efficiency in read and write operations, contiguous allocation is rarely used in modern file systems due to its rigidity.

      A directory entry for a file in contiguous allocation includes the starting block address and the file’s length. This approach supports both sequential and direct access, as block locations can be easily computed. However, determining the correct amount of space for a file in advance is difficult. If too little space is allocated, extending the file may be impossible without relocating it. Conversely, over-allocating leads to wasted space. To address these issues, some systems use extent-based allocation, where additional contiguous chunks are allocated as needed. This balances efficiency with flexibility but does not fully eliminate fragmentation challenges.
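
      The address calculation that makes contiguous allocation attractive is essentially one line. The sketch below is illustrative, assuming a 512-byte block size and a hypothetical directory entry holding the starting block and length described above.

        #include <stdio.h>

        #define BLOCK_SIZE 512L

        /* Hypothetical directory entry for contiguous allocation. */
        struct dir_entry {
            long start;     /* first physical block of the file */
            long length;    /* number of blocks allocated       */
        };

        /* Map a byte offset inside the file to a physical block number,
         * or return -1 if the offset lies beyond the allocated extent. */
        long physical_block(const struct dir_entry *e, long offset) {
            long logical = offset / BLOCK_SIZE;
            if (logical >= e->length)
                return -1;
            return e->start + logical;   /* direct access needs one calculation */
        }

        int main(void) {
            struct dir_entry f = { .start = 14, .length = 3 };   /* blocks 14..16  */
            printf("%ld\n", physical_block(&f, 600));    /* offset 600 -> block 15 */
            printf("%ld\n", physical_block(&f, 2000));   /* out of range -> -1     */
            return 0;
        }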

      External fragmentation is a significant drawback of contiguous allocation. As files are created and deleted, free space fragments into smaller chunks, making it difficult to allocate large files. Compaction, which consolidates free space into a single contiguous block, is a solution but is time-consuming. Some operating systems allow online defragmentation, enabling storage reorganization without downtime, though it may impact performance. The need for defragmentation has led modern file systems to adopt alternative allocation strategies. Extent-based systems, such as Veritas File System (VxFS), reduce fragmentation while maintaining some benefits of contiguous allocation. These modifications help improve system efficiency and usability.

      Another challenge with contiguous allocation is estimating a file’s size at creation. If a file grows beyond its allocated space, relocating it to a larger free space may be necessary. This relocation process, though hidden from users, can degrade performance. Overestimating file size results in internal fragmentation, where allocated space remains unused. Modified allocation schemes, such as dynamic extents, attempt to address these inefficiencies. Some file systems allow users to specify initial and incremental allocation sizes to minimize wasted space. Despite these improvements, contiguous allocation remains unsuitable for many modern systems due to its inflexibility and fragmentation issues.

    63. Directory Implementation

      Directory implementation affects file retrieval efficiency. A linear list stores file names and data block pointers, offering simple but slow search operations. Searching for a file requires scanning the entire list, though caching recently accessed entries can enhance performance. A sorted list allows faster binary searches but complicates insertion and deletion. Using tree structures, such as balanced trees, improves efficiency by organizing data hierarchically. Maintaining an optimized directory structure is essential to ensuring rapid file access and system responsiveness, particularly in large-scale storage environments where file lookup speed is a crucial performance factor.

    64. Alternatively, we can use a chained-overflow hash table. Each hash entry can be a linked list instead of an individual value, and we can resolve collisions by adding the new entry to the linked list. Lookups may be somewhat slowed, because searching for a name might require stepping through a linked list of colliding table entries. Still, this method is likely to be much faster than a linear search through the entire directory.

      Linked allocation structures files using scattered pointers across storage devices, posing reliability concerns. A damaged pointer could corrupt file integrity, leading to data loss or misallocated storage. Solutions like doubly linked lists and in-block file name storage improve resilience but increase overhead. The file allocation table (FAT) method, employed by MS-DOS, maintains an indexed list of file blocks, streamlining access. FAT-based storage enhances random access performance but requires frequent disk head movements, impacting efficiency. Caching FAT structures mitigates these issues, balancing retrieval speed and system overhead for optimal file allocation management.

    65. 14.3.2 Hash Table

      A hash table-based directory structure improves search efficiency by mapping file names to pointers using hash functions. This approach significantly reduces lookup time compared to linear searches. However, hash tables have fixed sizes and require resizing when entries exceed the allocated space. Collisions occur when two file names hash to the same location, which can be resolved using linked lists. Chained-overflow hashing reduces collision-related delays, maintaining efficient lookups. Despite these advantages, hash-based structures introduce complexity in resizing and collision resolution, requiring careful implementation to maintain an optimal balance between efficiency and manageability.
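
      A chained-overflow hash table for a directory can be sketched briefly. The code below is illustrative only: the bucket count, the hash function, and the use of an inode number as the stored value are assumptions for the sake of the example, not any real file system's layout.

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        #define NBUCKETS 64    /* fixed table size chosen for this sketch */

        struct dir_entry {
            char name[32];            /* file name                          */
            long inode;               /* stand-in for the FCB / inode index */
            struct dir_entry *next;   /* collision chain                    */
        };

        static struct dir_entry *buckets[NBUCKETS];

        /* Simple string hash; real file systems use stronger functions. */
        static unsigned hash(const char *name) {
            unsigned h = 0;
            while (*name)
                h = h * 31 + (unsigned char)*name++;
            return h % NBUCKETS;
        }

        void dir_add(const char *name, long inode) {
            unsigned b = hash(name);
            struct dir_entry *e = malloc(sizeof *e);
            strncpy(e->name, name, sizeof e->name - 1);
            e->name[sizeof e->name - 1] = '\0';
            e->inode = inode;
            e->next = buckets[b];     /* push onto the collision chain */
            buckets[b] = e;
        }

        /* Return the inode number, or -1 if the name is not present. */
        long dir_lookup(const char *name) {
            for (struct dir_entry *e = buckets[hash(name)]; e; e = e->next)
                if (strcmp(e->name, name) == 0)
                    return e->inode;
            return -1;
        }

        int main(void) {
            dir_add("resume.docx", 120);
            dir_add("server.c", 345);
            printf("%ld\n", dir_lookup("server.c"));      /* 345 */
            printf("%ld\n", dir_lookup("missing.txt"));   /* -1  */
            return 0;
        }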

    66. Several on-storage and in-memory structures are used to implement a file system. These structures vary depending on the operating system and the file system, but some general principles apply. On storage, the file system may contain information about how to boot an operating system stored there, the total number of blocks, the number and location of free blocks, the directory structure, and individual files. Many of these structures are detailed throughout the remainder of this chapter. Here, we describe them briefly: A boot control block (per volume) can contain information needed by the system to boot an operating system from that volume. If the disk does not contain an operating system, this block can be empty. It is typically the first block of a volume. In UFS, it is called the boot block. In NTFS, it is the partition boot sector. A volume control block (per volume) contains volume details, such as the number of blocks in the volume, the size of the blocks, a free-block count and free-block pointers, and a free-FCB count and FCB pointers. In UFS, this is called a superblock. In NTFS, it is stored in the master file table. A directory structure (per file system) is used to organize the files. In UFS, this includes file names and associated inode numbers. In NTFS, it is stored in the master file table. A per-file FCB contains many details about the file. It has a unique identifier number to allow association with a directory entry. In NTFS, this information is actually stored within the master file table, which uses a relational database structure, with a row per file. The in-memory information is used for both file-system management and performance improvement via caching. The data are loaded at mount time, updated during file-system operations, and discarded at dismount. Several types of structures may be included. An in-memory mount table contains information about each mounted volume. An in-memory directory-structure cache holds the directory information of recently accessed directories. (For directories at which volumes are mounted, it can contain a pointer to the volume table.) The system-wide open-file table contains a copy of the FCB of each open file, as well as other information. The per-process open-file table contains pointers to the appropriate entries in the system-wide open-file table, as well as other information, for all files the process has open.

      The file-organization module structures files into logical blocks, assigning each block a number and managing free-space allocation. It interacts with the logical file system, which handles metadata such as file ownership, permissions, and directory structures. The logical file system maintains file-control blocks (FCBs), storing essential information about files. It also enforces security measures and ensures protection policies are upheld. Layered file-system structures minimize code duplication, allowing multiple file systems to share basic components. However, layering can introduce overhead, affecting performance. Designing an efficient layered architecture requires balancing modularity, reusability, and operational efficiency in file-system implementations.
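
      As a rough picture of what a file-control block holds, the struct below lists the kinds of fields described in the quoted passage (identifier, ownership, permissions, size, timestamps, block pointers). The field names and sizes are illustrative assumptions; real inodes and NTFS records differ.

        #include <stdio.h>
        #include <time.h>
        #include <sys/types.h>

        /* Illustrative file-control block; layouts vary by file system. */
        struct fcb {
            unsigned long id;                /* unique identifier tied to a directory entry      */
            uid_t  owner;                    /* file ownership                                    */
            gid_t  group;
            mode_t permissions;              /* access-control information                        */
            off_t  size;                     /* current file size in bytes                        */
            time_t created, modified, accessed;   /* timestamps                                   */
            unsigned long blocks[15];        /* data-block pointers (e.g., 12 direct + 3 indirect) */
        };

        int main(void) {
            printf("illustrative FCB occupies %zu bytes\n", sizeof(struct fcb));
            return 0;
        }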

    67. The I/O control level consists of device drivers and interrupt handlers to transfer information between the main memory and the disk system. A device driver can be thought of as a translator. Its input consists of high-level commands, such as “retrieve block 123.” Its output consists of low-level, hardware-specific instructions that are used by the hardware controller, which interfaces the I/O device to the rest of the system. The device driver usually writes specific bit patterns to special locations in the I/O controller's memory to tell the controller which device location to act on and what actions to take. The details of device drivers and the I/O infrastructure are covered in Chapter 12. The basic file system (called the “block I/O subsystem” in Linux) needs only to issue generic commands to the appropriate device driver to read and write blocks on the storage device. It issues commands to the drive based on logical block addresses. It is also concerned with I/O request scheduling. This layer also manages the memory buffers and caches that hold various filesystem, directory, and data blocks. A block in the buffer is allocated before the transfer of a mass storage block can occur. When the buffer is full, the buffer manager must find more buffer memory or free up buffer space to allow a requested I/O to complete. Caches are used to hold frequently used file-system metadata to improve performance, so managing their contents is critical for optimum system performance. The file-organization module knows about files and their logical blocks. Each file's logical blocks are numbered from 0 (or 1) through N. The file organization module also includes the free-space manager, which tracks unallocated blocks and provides these blocks to the file-organization module when requested. Finally, the logical file system manages metadata information. Metadata includes all of the file-system structure except the actual data (or contents of the files). The logical file system manages the directory structure to provide the file-organization module with the information the latter needs, given a symbolic file name. It maintains file structure via file-control blocks. A file-control block (FCB) (an inode in UNIX file systems) contains information about the file, including ownership, permissions, and location of the file contents. The logical file system is also responsible for protection, as discussed in Chapters 13 and 17. When a layered structure is used for file-system implementation, duplication of code is minimized. The I/O control and sometimes the basic file-system code can be used by multiple file systems. Each file system can then have its own logical file-system and file-organization modules. Unfortunately, layering can introduce more operating-system overhead, which may result in decreased performance. The use of layering, including the decision about how many layers to use and what each layer should do, is a major challenge in designing new systems.

      The I/O control level is responsible for handling the transfer of data between memory and storage devices. It includes device drivers, which translate high-level commands into low-level instructions for hardware controllers. These drivers communicate with the I/O device by writing specific bit patterns to controller memory. The block I/O subsystem issues generic read and write commands to the appropriate driver while managing request scheduling, buffers, and caches. Buffers store temporary data blocks during transfers, and caches hold frequently accessed metadata to improve performance. Efficient buffer and cache management is crucial to maintaining system responsiveness and optimizing I/O operations.

    68. 14.1 File-System Structure Disks provide most of the secondary storage on which file systems are maintained. Two characteristics make them convenient for this purpose: 1. A disk can be rewritten in place; it is possible to read a block from the disk, modify the block, and write it back into the same block. 2. A disk can access directly any block of information it contains. Thus, it is simple to access any file either sequentially or randomly, and switching from one file to another requires the drive moving the read–write heads and waiting for the media to rotate. Nonvolatile memory (NVM) devices are increasingly used for file storage and thus as a location for file systems. They differ from hard disks in that they cannot be rewritten in place and they have different performance characteristics. We discuss disk and NVM-device structure in detail in Chapter 11. To improve I/O efficiency, I/O transfers between memory and mass storage are performed in units of blocks. Each block on a hard disk drive has one or more sectors. Depending on the disk drive, sector size is usually 512 bytes or 4,096 bytes. NVM devices usually have blocks of 4,096 bytes, and the transfer methods used are similar to those used by disk drives. File systems provide efficient and convenient access to the storage device by allowing data to be stored, located, and retrieved easily. A file system poses two quite different design problems. The first problem is defining how the file system should look to the user. This task involves defining a file and its attributes, the operations allowed on a file, and the directory structure for organizing files. The second problem is creating algorithms and data structures to map the logical file system onto the physical secondary-storage devices. The file system itself is generally composed of many different levels. The structure shown in Figure 14.1 is an example of a layered design. Each level in the design uses the features of lower levels to create new features for use by higher levels.

      The file-system structure is fundamental to how data is stored and retrieved on storage devices. Disks, as the primary medium for secondary storage, allow efficient data modification and direct access to any block, making them ideal for sequential and random file access. However, emerging nonvolatile memory (NVM) devices differ from traditional hard disks as they cannot be rewritten in place and exhibit distinct performance traits. File systems enhance storage efficiency by managing data organization through block-based I/O transfers. Their design involves defining user-visible structures and implementing algorithms to map logical data to physical storage, often using a layered architectural approach.

    69. 13.1 File Concept Computers can store information on various storage media, such as NVM devices, HDDs, magnetic tapes, and optical disks. So that the computer system will be convenient to use, the operating system provides a uniform logical view of stored information. The operating system abstracts from the physical properties of its storage devices to define a logical storage unit, the file. Files are mapped by the operating system onto physical devices. These storage devices are usually nonvolatile, so the contents are persistent between system reboots.

      File systems serve multiple functions, including organizing, storing, retrieving, and securing files within an operating system. The core functionalities of a file system include creating and managing directories, allowing access to data through different methods (such as sequential or direct access), and implementing security measures to prevent unauthorized file access. Additionally, file systems provide mechanisms for file sharing among users and processes, ensuring smooth data exchange in multi-user systems. Understanding these functions is crucial for anyone working with operating systems, as it influences performance, security, and usability. Some modern file systems even support advanced features such as journaling, encryption, and redundancy to enhance reliability and protect against data corruption.

    70. Because files are the method users and applications use to store and retrieve data, and because they are so general purpose, their use has stretched beyond its original confines. For example, UNIX, Linux, and some other operating systems provide a proc file system that uses file-system interfaces to provide access to system information (such as process details). The information in a file is defined by its creator. Many different types of information may be stored in a file—source or executable programs, numeric or text data, photos, music, video, and so on. A file has a certain defined structure, which depends on its type. A text file is a sequence of characters organized into lines (and possibly pages). A source file is a sequence of functions, each of which is further organized as declarations followed by executable statements. An executable file is a series of code sections that the loader can bring into memory and execute.

      This section describes various storage devices such as HDDs, SSDs, magnetic tapes, and optical disks. These devices differ in speed, capacity, and durability, which impact the efficiency of file systems. For example, SSDs provide faster data access than HDDs due to their lack of moving parts, making them ideal for performance-driven applications. Magnetic tapes, on the other hand, are used for archival storage because of their low cost and high capacity, despite their slow sequential access. Understanding the relationship between storage devices and file systems is essential when choosing an appropriate file management solution. The operating system abstracts the differences between these devices, offering a uniform interface for storing and retrieving files seamlessly across different hardware platforms.

    71. A file is named, for the convenience of its human users, and is referred to by its name. A name is usually a string of characters, such as example.c. Some systems differentiate between uppercase and lowercase characters in names, whereas other systems do not. When a file is named, it becomes independent of the process, the user, and even the system that created it. For instance, one user might create the file example.c, and another user might edit that file by specifying its name. The file's owner might write the file to a USB drive, send it as an e-mail attachment, or copy it across a network, and it could still be called example.c on the destination system. Unless there is a sharing and synchronization method, that second copy is now independent of the first and can be changed separately. A file's attributes vary from one operating system to another but typically consist of these: Name. The symbolic file name is the only information kept in human-readable form. Identifier. This unique tag, usually a number, identifies the file within the file system; it is the non-human-readable name for the file. Type. This information is needed for systems that support different types of files. Location. This information is a pointer to a device and to the location of the file on that device. Size. The current size of the file (in bytes, words, or blocks) and possibly the maximum allowed size are included in this attribute. Protection. Access-control information determines who can do reading, writing, executing, and so on. Timestamps and user identification. This information may be kept for creation, last modification, and last use. These data can be useful for protection, security, and usage monitoring. Some newer file systems also support extended file attributes, including character encoding of the file and security features such as a file checksum. Figure 13.1 illustrates a file info window on macOS that displays a file's attributes.

      File attributes vary across operating systems, but they generally include name, identifier, type, location, size, protection, timestamps, and access permissions. These attributes help manage and secure files effectively. However, the concept of extended attributes, such as file checksums and character encoding, raises questions. How do extended attributes improve security, and how are they implemented differently across file systems? Understanding these attributes is crucial for file management, particularly in environments that require strict access control and data integrity measures. I would like to explore how modern file systems, such as NTFS and ext4, utilize extended attributes to enhance security and organization.
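
      Many of the standard attributes listed above can be retrieved on a POSIX system with stat(). The sketch below is an assumption-level illustration (example.c is just the file name used in the passage), printing the identifier, size, protection bits, and a timestamp.

        #include <stdio.h>
        #include <sys/stat.h>
        #include <time.h>

        int main(void) {
            struct stat st;
            if (stat("example.c", &st) != 0) {   /* look up the file's attributes */
                perror("stat");
                return 1;
            }
            printf("identifier (inode): %lu\n", (unsigned long)st.st_ino);
            printf("size:               %lld bytes\n", (long long)st.st_size);
            printf("protection:         %o\n", st.st_mode & 0777);
            printf("last modified:      %s", ctime(&st.st_mtime));
            return 0;
        }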

    72. A file is an abstract data type. To define a file properly, we need to consider the operations that can be performed on files. The operating system can provide system calls to create, write, read, reposition, delete, and truncate files. Let's examine what the operating system must do to perform each of these seven basic file operations. It should then be easy to see how other similar operations, such as renaming a file, can be implemented. Creating a file. Two steps are necessary to create a file. First, space in the file system must be found for the file. We discuss how to allocate space for the file in Chapter 14. Second, an entry for the new file must be made in a directory. Opening a file. Rather than have all file operations specify a file name, causing the operating system to evaluate the name, check access permissions, and so on, all operations except create and delete require a file open() first. If successful, the open call returns a file handle that is used as an argument in the other calls. Writing a file. To write a file, we make a system call specifying both the open file handle and the information to be written to the file. The system must keep a write pointer to the location in the file where the next write is to take place if it is sequential. The write pointer must be updated whenever a write occurs. Reading a file. To read from a file, we use a system call that specifies the file handle and where (in memory) the next block of the file should be put. Again, the directory is searched for the associated entry, and the system needs to keep a read pointer to the location in the file where the next read is to take place, if sequential. Once the read has taken place, the read pointer is updated. Because a process is usually either reading from or writing to a file, the current operation location can be kept as a per-process current-file-position pointer. Both the read and write operations use this same pointer, saving space and reducing system complexity. Repositioning within a file. The current-file-position pointer of the open file is repositioned to a given value. Repositioning within a file need not involve any actual I/O. This file operation is also known as a file seek. Deleting a file. To delete a file, we search the directory for the named file. Having found the associated directory entry, we release all file space, so that it can be reused by other files, and erase or mark as free the directory entry. Note that some systems allow hard links—multiple names (directory entries) for the same file. In this case the actual file contents is not deleted until the last link is deleted. Truncating a file. The user may want to erase the contents of a file but keep its attributes. Rather than forcing the user to delete the file and then recreate it, this function allows all attributes to remain unchanged—except for file length. The file can then be reset to length zero, and its file space can be released.

      A file is an abstract data type that supports various operations, including creation, reading, writing, repositioning, deletion, and truncation. These operations enable users and applications to interact with files in a structured manner. For instance, when a file is opened, a file handle is assigned to avoid repeated directory searches. The file system also maintains read and write pointers to track file operations efficiently. Additionally, file repositioning allows users to navigate through files without performing actual I/O operations, improving performance. Understanding these file operations is essential for programmers, as efficient file handling can optimize system performance and resource utilization. These operations form the foundation for complex file management tasks such as file compression, synchronization, and remote access.
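
      These operations correspond closely to the POSIX system calls. The sketch below is illustrative rather than taken from the text; demo.txt is a made-up file name. It creates and opens a file, writes, repositions with lseek(), reads, truncates, and finally deletes the file.

        #include <fcntl.h>
        #include <unistd.h>
        #include <stdio.h>

        int main(void) {
            char buf[6] = {0};

            int fd = open("demo.txt", O_RDWR | O_CREAT | O_TRUNC, 0644);   /* create + open */
            if (fd < 0) { perror("open"); return 1; }

            write(fd, "hello world\n", 12);   /* write advances the file pointer       */
            lseek(fd, 6, SEEK_SET);           /* reposition: no actual I/O is performed */
            read(fd, buf, 5);                 /* reads "world"                          */
            printf("read back: %s\n", buf);

            ftruncate(fd, 0);                 /* truncate: keep attributes, drop data   */
            close(fd);
            unlink("demo.txt");               /* delete: remove the directory entry     */
            return 0;
        }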

    73. The implementation of the open() and close() operations is more complicated in an environment where several processes may open the file simultaneously. This may occur in a system where several different applications open the same file at the same time. Typically, the operating system uses two levels of internal tables: a per-process table and a system-wide table. The per-process table tracks all files that a process has open. Stored in this table is information regarding the process's use of the file. For instance, the current file pointer for each file is found here. Access rights to the file and accounting information can also be included. Each entry in the per-process table in turn points to a system-wide open-file table. The system-wide table contains process-independent information, such as the location of the file on disk, access dates, and file size. Once a file has been opened by one process, the system-wide table includes an entry for the file. When another process executes an open() call, a new entry is simply added to the process's open-file table pointing to the appropriate entry in the system-wide table. Typically, the open-file table also has an open count associated with each file to indicate how many processes have the file open. Each close() decreases this open count, and when the open count reaches zero, the file is no longer in use, and the file's entry is removed from the open-file table.

      The file pointer is an essential concept in file handling, determining the position for the next read or write operation. This is particularly useful in programming when working with large files, as it eliminates the need to reload the file from the beginning for each operation. For example, in languages like C and Java, file manipulation functions rely heavily on file pointers to track read and write positions. This concept is crucial in scenarios such as database management systems, where multiple processes access a single file concurrently. Proper handling of file pointers ensures data consistency and prevents corruption. Understanding file pointers helps in optimizing performance when dealing with sequential or direct file access.

    74. FILE LOCKING IN JAVA

      The discussion on file locking introduces the concept of shared and exclusive locks, ensuring data integrity when multiple processes access the same file. However, the distinction between mandatory and advisory locking raises questions. How does the operating system enforce mandatory locks, and what happens if a process ignores an advisory lock? Additionally, in distributed file systems, how do locks function when multiple machines access a file over a network? Exploring these aspects is crucial, especially for developers building multi-threaded applications that require proper synchronization mechanisms. Understanding file locking is also essential for database management, where concurrency control prevents conflicts during data transactions.

    75. File locks provide functionality similar to reader–writer locks, covered in Section 7.1.2. A shared lock is akin to a reader lock in that several processes can acquire the lock concurrently. An exclusive lock behaves like a writer lock; only one process at a time can acquire such a lock. It is important to note that not all operating systems provide both types of locks: some systems provide only exclusive file locking. Furthermore, operating systems may provide either mandatory or advisory file-locking mechanisms. With mandatory locking, once a process acquires an exclusive lock, the operating system will prevent any other process from accessing the locked file. For example, assume a process acquires an exclusive lock on the file system.log. If we attempt to open system.log from another process—for example, a text editor—the operating system will prevent access until the exclusive lock is released. Alternatively, if the lock is advisory, then the operating system will not prevent the text editor from acquiring access to system.log. Rather, the text editor must be written so that it manually acquires the lock before accessing the file. In other words, if the locking scheme is mandatory, the operating system ensures locking integrity. For advisory locking, it is up to software developers to ensure that locks are appropriately acquired and released. As a general rule, Windows operating systems adopt mandatory locking, and UNIX systems employ advisory locks.

      The example of file locking in Java demonstrates how developers can manage concurrent file access. Java’s FileChannel class allows acquiring locks on specific portions of a file, supporting both shared and exclusive locks. This is particularly useful in applications that involve collaborative editing or log file management. However, improper use of file locks can lead to deadlocks, where two processes indefinitely wait for each other to release a lock. This example reinforces the importance of synchronization in file handling. Additionally, understanding how different operating systems implement file locking (such as Windows using mandatory locks and UNIX using advisory locks) can help developers write cross-platform applications that handle concurrency efficiently.
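
      The same idea can be shown on the UNIX side, where locking is advisory as the quoted passage notes. The sketch below uses the POSIX fcntl() interface rather than Java's FileChannel; system.log is the file name from the passage, and the code is an illustration, not the text's own example.

        #include <fcntl.h>
        #include <unistd.h>
        #include <stdio.h>

        int main(void) {
            int fd = open("system.log", O_RDWR | O_CREAT, 0644);
            if (fd < 0) { perror("open"); return 1; }

            /* Request an exclusive (write) lock over the whole file. Because UNIX
             * locks are advisory, other processes are blocked only if they also
             * call fcntl() before touching the file. */
            struct flock lk = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                                .l_start = 0, .l_len = 0 };   /* len 0 = to end of file */

            if (fcntl(fd, F_SETLKW, &lk) == -1) {   /* F_SETLKW waits for the lock */
                perror("fcntl");
                return 1;
            }

            /* ... critical section: update the log ... */

            lk.l_type = F_UNLCK;                    /* release the lock */
            fcntl(fd, F_SETLK, &lk);
            close(fd);
            return 0;
        }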

    76. When we design a file system—indeed, an entire operating system—we always consider whether the operating system should recognize and support file types. If an operating system recognizes the type of a file, it can then operate on the file in reasonable ways. For example, a common mistake occurs when a user tries to output the binary-object form of a program. This attempt normally produces garbage; however, the attempt can succeed if the operating system has been told that the file is a binary-object program. A common technique for implementing file types is to include the type as part of the file name. The name is split into two parts—a name and an extension, usually separated by a period (Figure 13.3). In this way, the user and the operating system can tell from the name alone what the type of a file is. Most operating systems allow users to specify a file name as a sequence of characters followed by a period and terminated by an extension made up of additional characters. Examples include resume.docx, server.c, and ReaderThread.cpp.

      Access control mechanisms in file systems prevent unauthorized users from reading, modifying, or deleting files. These mechanisms include file permissions, user authentication, and encryption. In multi-user environments, proper access control ensures data confidentiality and prevents accidental modifications by unauthorized users. For example, UNIX-based systems use a permission model involving read, write, and execute permissions assigned to users, groups, and others.

    77. 13.2.1 Sequential Access The simplest access method is sequential access. Information in the file is processed in order, one record after the other. This mode of access is by far the most common; for example, editors and compilers usually access files in this fashion. Reads and writes make up the bulk of the operations on a file. A read operation—read_next()—reads the next portion of the file and automatically advances a file pointer, which tracks the I/O location. Similarly, the write operation—write_next()—appends to the end of the file and advances to the end of the newly written material (the new end of file). Such a file can be reset to the beginning, and on some systems, a program may be able to skip forward or backward n records for some integer n—perhaps only for n = 1. Sequential access, which is depicted in Figure 13.4, is based on a tape model of a file and works as well on sequential-access devices as it does on random-access ones.

      Sequential access is the simplest and most common method of retrieving file data, where records are accessed in order, one after another. This approach is ideal for text files, logs, and programs that process data sequentially, such as compilers and text editors. Read and write operations move a file pointer forward, automatically advancing to the next record. Some systems allow limited skipping, enabling navigation through the file in predefined increments. Sequential access follows a tape-based model, where operations like rewinding reset the pointer to the beginning. While efficient for linear data processing, sequential access becomes inefficient when frequent random access is required. It remains widely used in applications where data naturally follows a fixed sequence, such as streaming media or logs.

    78. 13.1.5 Internal File Structure Internally, locating an offset within a file can be complicated for the operating system. Disk systems typically have a well-defined block size determined by the size of a sector. All disk I/O is performed in units of one block (physical record), and all blocks are the same size. It is unlikely that the physical record size will exactly match the length of the desired logical record. Logical records may even vary in length. Packing a number of logical records into physical blocks is a common solution to this problem. For example, the UNIX operating system defines all files to be simply streams of bytes. Each byte is individually addressable by its offset from the beginning (or end) of the file. In this case, the logical record size is 1 byte. The file system automatically packs and unpacks bytes into physical disk blocks—say, 512 bytes per block—as necessary. The logical record size, physical block size, and packing technique determine how many logical records are in each physical block. The packing can be done either by the user's application program or by the operating system. In either case, the file may be considered a sequence of blocks. All the basic I/O functions operate in terms of blocks. The conversion from logical records to physical blocks is a relatively simple software problem. Because disk space is always allocated in blocks, some portion of the last block of each file is generally wasted. If each block were 512 bytes, for example, then a file of 1,949 bytes would be allocated four blocks (2,048 bytes); the last 99 bytes would be wasted. The waste incurred to keep everything in units of blocks (instead of bytes) is internal fragmentation. All file systems suffer from internal fragmentation; the larger the block size, the greater the internal fragmentation.

      Managing file data efficiently is crucial for operating systems, particularly in how logical records are mapped to physical storage blocks. Disk storage operates in fixed-size blocks, while logical records may vary in size. To optimize space and performance, files are packed into blocks, though this can cause internal fragmentation, where unused space in the last block is wasted. For example, UNIX treats files as streams of bytes, automatically packing them into physical blocks (e.g., 512 bytes). This system simplifies file handling but necessitates additional software logic for structured data. Internal fragmentation increases with larger block sizes, leading to inefficient space utilization. Despite these challenges, block-based allocation remains essential for efficient disk I/O operations and file system management.
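
      The block arithmetic in this example is easy to reproduce. The short sketch below recomputes the figures quoted above (a 1,949-byte file with 512-byte blocks) and reports the internal fragmentation in the last block.

        #include <stdio.h>

        #define BLOCK_SIZE 512L

        int main(void) {
            long file_size = 1949;                                       /* bytes                  */
            long blocks    = (file_size + BLOCK_SIZE - 1) / BLOCK_SIZE;  /* round up to full blocks */
            long wasted    = blocks * BLOCK_SIZE - file_size;            /* internal fragmentation  */

            printf("%ld blocks allocated, %ld bytes wasted\n", blocks, wasted);
            /* prints: 4 blocks allocated, 99 bytes wasted */
            return 0;
        }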

    79. 13.1.4 File Structure File types also can be used to indicate the internal structure of the file. Source and object files have structures that match the expectations of the programs that read them. Further, certain files must conform to a required structure that is understood by the operating system. For example, the operating system requires that an executable file have a specific structure so that it can determine where in memory to load the file and what the location of the first instruction is. Some operating systems extend this idea into a set of system-supported file structures, with sets of special operations for manipulating files with those structures. This point brings us to one of the disadvantages of having the operating system support multiple file structures: it makes the operating system large and cumbersome. If the operating system defines five different file structures, it needs to contain the code to support these file structures. In addition, it may be necessary to define every file as one of the file types supported by the operating system. When new applications require information structured in ways not supported by the operating system, severe problems may result. For example, assume that a system supports two types of files: text files (composed of ASCII characters separated by a carriage return and line feed) and executable binary files. Now, if we (as users) want to define an encrypted file to protect the contents from being read by unauthorized people, we may find neither file type to be appropriate. The encrypted file is not ASCII text lines but rather is (apparently) random bits. Although it may appear to be a binary file, it is not executable. As a result, we may have to circumvent or misuse the operating system's file-type mechanism or abandon our encryption scheme. Some operating systems impose (and support) a minimal number of file structures. This approach has been adopted in UNIX, Windows, and others. UNIX considers each file to be a sequence of 8-bit bytes; no interpretation of these bits is made by the operating system. This scheme provides maximum flexibility but little support. Each application program must include its own code to interpret an input file as to the appropriate structure. However, all operating systems must support at least one structure—that of an executable file—so that the system is able to load and run programs.

      File structures help the operating system and applications organize and manage data effectively. Some files, like source and object files, follow a defined structure expected by the programs that use them. Operating systems often impose structures on certain files, such as executable files, to facilitate their loading and execution. However, supporting multiple file structures increases system complexity and can restrict flexibility. Some systems, like UNIX, adopt a minimal approach, treating all files as byte sequences without enforcing structure. This approach enhances flexibility but places the burden on applications to interpret data. A rigid file structure system can create issues when new types of files, such as encrypted files, do not conform to predefined formats, requiring workarounds or alternative storage methods.

    80. 13.2.2 Direct Access

      Direct access, also known as relative access, allows records to be read or written in any order rather than in a strict sequence. Each file is viewed as a sequence of fixed-length logical records, and programs can read or write any record directly by giving its relative block (record) number. This method is well suited to databases and large datasets requiring rapid lookup, such as airline reservation systems. Relative block numbers also let the operating system decide where to place the file and keep a user from reaching portions of the file system outside that file. With a logical record length L, a request for record N is turned into an I/O request for L bytes starting at offset L * N within the file (records numbered from zero). Some operating systems support both direct and sequential access, while others require the access type to be declared when a file is created. Direct access significantly improves efficiency when handling large, structured data.
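
      A minimal sketch of this calculation on a POSIX system appears below; the record length of 64 bytes and the file name records.dat are assumptions for illustration. Reading record N is just a seek to offset L * N followed by a read of L bytes.

        #include <fcntl.h>
        #include <unistd.h>
        #include <stdio.h>

        #define RECORD_LEN 64L   /* assumed fixed logical-record length L */

        /* Read record N of a direct-access file by seeking to offset L * N
         * (records numbered from zero). */
        ssize_t read_record(int fd, long n, char *buf) {
            if (lseek(fd, RECORD_LEN * n, SEEK_SET) == (off_t)-1)
                return -1;
            return read(fd, buf, RECORD_LEN);
        }

        int main(void) {
            char rec[RECORD_LEN];
            int fd = open("records.dat", O_RDONLY);
            if (fd < 0) { perror("open"); return 1; }

            if (read_record(fd, 7, rec) > 0)   /* fetch the 8th record directly */
                printf("got record 7\n");

            close(fd);
            return 0;
        }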

    81. 13.2.3 Other Access Methods Other access methods can be built on top of a direct-access method. These methods generally involve the construction of an index for the file. The index, like an index in the back of a book, contains pointers to the various blocks. To find a record in the file, we first search the index and then use the pointer to access the file directly and to find the desired record. For example, a retail-price file might list the universal product codes (UPCs) for items, with the associated prices. Each record consists of a 10-digit UPC and a 6-digit price, for a 16-byte record. If our disk has 1,024 bytes per block, we can store 64 records per block. A file of 120,000 records would occupy about 2,000 blocks (2 million bytes). By keeping the file sorted by UPC, we can define an index consisting of the first UPC in each block. This index would have 2,000 entries of 10 digits each, or 20,000 bytes, and thus could be kept in memory. To find the price of a particular item, we can make a binary search of the index. From this search, we learn exactly which block contains the desired record and access that block. This structure allows us to search a large file doing little I/O. With large files, the index file itself may become too large to be kept in memory. One solution is to create an index for the index file. The primary index file contains pointers to secondary index files, which point to the actual data items. For example, IBM's indexed sequential-access method (ISAM) uses a small master index that points to disk blocks of a secondary index. The secondary index blocks point to the actual file blocks. The file is kept sorted on a defined key. To find a particular item, we first make a binary search of the master index, which provides the block number of the secondary index. This block is read in, and again a binary search is used to find the block containing the desired record. Finally, this block is searched sequentially. In this way, any record can be located from its key by at most two direct-access reads. Figure 13.6 shows a similar situation as implemented by OpenVMS index and relative files.

      Advanced access methods build on direct access by adding an index for faster retrieval. Like the index at the back of a book, the file index contains pointers that lead to specific blocks. For example, a retail-price file sorted by product code (UPC) can keep an index of the first UPC in each block, so a binary search of the index identifies the single block to read. If the index grows too large to keep in memory, a hierarchical approach such as IBM's ISAM introduces multiple index levels, keeping lookups efficient. OpenVMS follows a similar strategy with its indexed files. These methods enhance performance for large databases, enabling quick access with few disk reads, but indexed access requires additional storage and maintenance, making it a trade-off between speed and resource usage.
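
      A sketch of the UPC example under the stated assumptions (sorted 16-byte records, 64 per 1,024-byte block, and an in-memory index holding the first UPC of each block): a binary search over the index picks the single block that must be read.

        #include <string.h>

        /* upc_index[i] holds the lowest UPC stored in block i.  UPCs are
         * 10-digit strings, so strcmp() gives their numeric ordering.
         * Returns the block that must contain `key`, or -1 if key is
         * smaller than every indexed UPC. */
        int find_block(char upc_index[][11], int nblocks, const char *key) {
            int lo = 0, hi = nblocks - 1, ans = -1;
            while (lo <= hi) {
                int mid = lo + (hi - lo) / 2;
                if (strcmp(upc_index[mid], key) <= 0) {
                    ans = mid;          /* key is at or after the start of this block */
                    lo = mid + 1;
                } else {
                    hi = mid - 1;
                }
            }
            return ans;   /* caller then reads this block and scans its 64 records */
        }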

    82. 13.3.2 Two-Level Directory As we have seen, a single-level directory

      A two-level directory structure improves upon single-level systems by providing each user with a separate directory. The master file directory (MFD) stores user entries, while each user has a unique user file directory (UFD) containing their files. This approach prevents name conflicts, as users can have identical file names within their own directories. However, it isolates users, making collaboration challenging unless explicit permissions allow shared access. To access another user’s file, a full path, including the username, is required. This structure resembles a tree, where the MFD is the root, UFDs are branches, and files are leaves. While an improvement over single-level directories, this model remains limited in flexibility and scalability, leading to more complex hierarchical directory structures in modern systems.

    83. 13.3.1 Single-Level Directory The simplest directory structure is the single-level directory. All files are contained in the same directory, which is easy to support and understand (Figure 13.7). Figure 13.7 Single-level directory. A single-level directory has significant limitations, however, when the number of files increases or when the system has more than one user. Since all files are in the same directory, they must have unique names. If two users call their data file test.txt, then the unique-name rule is violated. For example, in one programming class, 23 students called the program for their second assignment prog2.c; another 11 called it assign2.c. Fortunately, most file systems support file names of up to 255 characters, so it is relatively easy to select unique file names. Even a single user on a single-level directory may find it difficult to remember the names of all the files as the number of files increases. It is not uncommon for a user to have hundreds of files on one computer system and an equal number of additional files on another system. Keeping track of so many files is a daunting task.

      The simplest directory structure is a single-level directory, where all files exist in one common space. This model is easy to implement but quickly becomes impractical as the number of files increases. Since all files share the same namespace, file names must be unique, which can lead to conflicts, especially in multi-user environments. For instance, if multiple students name their assignments the same, distinguishing between them becomes difficult. A single-level directory also makes file management cumbersome, as users struggle to organize large numbers of files. Despite these limitations, this structure may still be suitable for small systems with limited users and files. However, modern operating systems typically adopt more advanced directory structures to improve usability and organization.

    84. General Graph Directory

      Allowing cycles in a directory structure results in a general graph, increasing complexity. Cycles create issues in directory traversal, potentially causing infinite loops and inefficient searches. Moreover, detecting cycles is computationally expensive. File deletion is also problematic, as reference counts may never reach zero in cyclic structures, leading to orphaned files. Garbage collection can reclaim unused space, but it is time-consuming and rarely implemented in file systems. To mitigate these issues, one approach is to bypass links during traversals, avoiding cycles altogether. While general graphs offer flexibility, they complicate file management, requiring additional safeguards to prevent performance bottlenecks and ensure data integrity. Most systems prefer acyclic structures for ease of maintenance.

    85. 13.3.4 Acyclic-Graph Directories Consider two programmers who are working on a joint project. The files associated with that project can be stored in a subdirectory, separating them from other projects and files of the two programmers. But since both programmers are equally responsible for the project, both want the subdirectory to be in their own directories. In this situation, the common subdirectory should be shared. A shared directory or file exists in the file system in two (or more) places at once. A tree structure prohibits the sharing of files or directories. An acyclic graph—that is, a graph with no cycles—allows directories to share subdirectories and files (Figure 13.10). The same file or subdirectory may be in two different directories. The acyclic graph is a natural generalization of the tree-structured directory scheme.

      An acyclic-graph directory structure supports file and directory sharing among users. Unlike tree-structured directories, an acyclic graph allows multiple references to the same file or directory without duplicating content. Shared files enable real-time collaboration, ensuring changes are visible to all users. Implementation methods include symbolic links (pointers to files) or duplicate directory entries, each with pros and cons. Challenges include tracking multiple absolute path names, preventing redundant file traversals, and managing deletion. One approach involves reference counting, where a file is deleted only when no references remain. This structure increases flexibility but adds complexity in maintaining consistency, ensuring correctness in directory traversals, and handling symbolic links effectively.
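
      A minimal sketch of the reference-counting idea using invented in-memory structures (no particular file system's on-disk format): each shared file records how many directory entries point to it, and its storage is reclaimed only when the last link is removed.

        #include <stdlib.h>

        struct shared_file {
            int   ref_count;    /* number of directory entries pointing here */
            void *data;         /* file contents (placeholder)               */
        };

        /* Called when a new directory entry (link) to the file is created. */
        void add_link(struct shared_file *f) {
            f->ref_count++;
        }

        /* Called when a directory entry is deleted.  The file itself is
         * reclaimed only when no references remain. */
        void remove_link(struct shared_file *f) {
            if (--f->ref_count == 0) {
                free(f->data);
                free(f);
            }
        }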

    86. 13.3.3 Tree-Structured Directories

      A tree-structured directory system allows users to create subdirectories and organize files in a hierarchy. The root directory serves as the top-level directory, with all files having unique paths. Each directory entry contains a bit to distinguish between files and subdirectories. Processes typically have a current directory, and users can navigate using absolute or relative path names. Deleting directories can be handled in two ways: requiring them to be empty before deletion or recursively deleting all contents. While the former is safer, the latter is more convenient but riskier. This structure enables users to manage their files efficiently and share files by specifying paths, but it lacks built-in mechanisms for direct file or directory sharing.

    87. Types of Access

      This section explains how controlled access prevents unauthorized file operations. Instead of extreme measures such as prohibiting access entirely or allowing unrestricted access, systems use protection mechanisms to regulate different types of access. It categorizes operations like reading, writing, executing, appending, deleting, and listing files. The section also notes that higher-level operations, such as copying or renaming, depend on basic permissions like reading or writing. Controlled access varies depending on the system’s needs—for instance, a corporate server demands stricter protection than a personal laptop. The key takeaway is that access control should balance usability and security, ensuring necessary access without exposing files to potential misuse. Different protection mechanisms are explored in subsequent sections.

    88. When information is stored in a computer system, we want to keep it safe from physical damage (the issue of reliability) and improper access (the issue of protection). Reliability is generally provided by duplicate copies of files. Many computers have systems programs that automatically (or through computer-operator intervention) copy disk files to tape at regular intervals (once per day or week or month) to maintain a copy should a file system be accidentally destroyed. File systems can be damaged by hardware problems (such as errors in reading or writing), power surges or failures, head crashes, dirt, temperature extremes, and vandalism. Files may be deleted accidentally. Bugs in the file-system software can also cause file contents to be lost. Reliability was covered in more detail in Chapter 11. Protection can be provided in many ways. For a laptop system running a modern operating system, we might provide protection by requiring a user name and password authentication to access it, encrypting the secondary storage so even someone opening the laptop and removing the drive would have a difficult time accessing its data, and firewalling network access so that when it is in use it is difficult to break in via its network connection. In multiuser system, even valid access of the system needs more advanced mechanisms to allow only valid access of the data.

      This section introduces the need for protection in computer systems to prevent physical damage and unauthorized access. Reliability ensures that data remains intact despite hardware failures or accidental deletions, while protection prevents unauthorized users from accessing or modifying files. Strategies for reliability include backups, while protection involves security mechanisms such as authentication, encryption, and firewalls. The distinction between reliability and protection is crucial because data integrity alone is insufficient without security measures. In multiuser environments, protection is even more critical, as multiple users require controlled access. This section establishes the foundation for understanding file access control, setting the stage for deeper discussions on controlled access, access-control lists, and security challenges in complex computing environments.

    89. Access Control The most common approach to the protection problem is to make access dependent on the identity of the user. Different users may need different types of access to a file or directory. The most general scheme to implement identity-dependent access is to associate with each file and directory an access-control list (ACL) specifying user names and the types of access allowed for each user. When a user requests access to a particular file, the operating system checks the access list associated with that file. If that user is listed for the requested access, the access is allowed. Otherwise, a protection violation occurs, and the user job is denied access to the file. This approach has the advantage of enabling complex access methodologies. The main problem with access lists is their length. If we want to allow everyone to read a file, we must list all users with read access. This technique has two undesirable consequences: Constructing such a list may be a tedious and unrewarding task, especially if we do not know in advance the list of users in the system. The directory entry, previously of fixed size, now must be of variable size, resulting in more complicated space management. These problems can be resolved by use of a condensed version of the access list. To condense the length of the access-control list, many systems recognize three classifications of users in connection with each file: Owner. The user who created the file is the owner. Group. A set of users who are sharing the file and need similar access is a group, or work group. Other. All other users in the system. The most common recent approach is to combine access-control lists with the more general (and easier to implement) owner, group, and universe access-control scheme just described. For example, Solaris uses the three categories of access by default but allows access-control lists to be added to specific files and directories when more fine-grained access control is desired. To illustrate, consider a person, Sara, who is writing a new book. She has hired three graduate students (Jim, Dawn, and Jill) to help with the project. The text of the book is kept in a file named book.tex. The protection associated with this file is as follows: Sara should be able to invoke all operations on the file. Jim, Dawn, and Jill should be able only to read and write the file; they should not be allowed to delete the file. All other users should be able to read, but not write, the file. (Sara is interested in letting as many people as possible read the text so that she can obtain feedback.)

      Windows manages ACLs primarily through a graphical user interface, which many administrators find more approachable than editing permissions from the command line. The section highlights an example from Windows 7 (NTFS file system) in which a user named "guest" is explicitly denied access to a file. Windows permissions allow detailed configurations, such as granting or restricting access for individual users or groups, which is useful in corporate environments where different departments require different levels of access. However, managing Windows ACLs can become complex because of overlapping permissions and precedence rules. The section also discusses how operating systems resolve conflicts between ACLs and standard permission bits; in Solaris and other Unix-like systems, ACL entries take precedence over the ordinary owner/group/other permissions, so the more specific rule overrides the general setting.
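
      The owner/group/other scheme behind the book.tex example can be sketched with POSIX-style mode bits; the check below is a simplification with invented parameter names, not the exact algorithm any particular kernel uses.

        #include <stdbool.h>
        #include <sys/stat.h>    /* mode_t */
        #include <sys/types.h>   /* uid_t, gid_t */

        /* Pick the owner, group, or other permission triple for this user,
         * then test the requested bits (want = 04 read, 02 write, 01 exec). */
        bool allowed(mode_t mode, uid_t file_uid, gid_t file_gid,
                     uid_t uid, gid_t gid, unsigned want) {
            unsigned perm;
            if (uid == file_uid)
                perm = (mode >> 6) & 07;   /* owner bits  (Sara: rwx)            */
            else if (gid == file_gid)
                perm = (mode >> 3) & 07;   /* group bits  (Jim, Dawn, Jill: rw-) */
            else
                perm = mode & 07;          /* other bits  (everyone else: r--)   */
            return (perm & want) == want;
        }

      With book.tex given mode 0764, this check grants Sara (the owner) full access, the group containing Jim, Dawn, and Jill read and write access, and everyone else read-only access.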

    90. 13.4.3 Other Protection Approaches Another approach to the protection problem is to associate a password with each file. Just as access to the computer system is often controlled by a password, access to each file can be controlled in the same way. If the passwords are chosen randomly and changed often, this scheme may be effective in limiting access to a file. The use of passwords has a few disadvantages, however. First, the number of passwords that a user needs to remember may become large, making the scheme impractical. Second, if only one password is used for all the files, then once it is discovered, all files are accessible; protection is on an all-or-none basis. Some systems allow a user to associate a password with a subdirectory, rather than with an individual file, to address this problem. More commonly encryption of a partition or individual files provides strong protection, but password management is key. In a multilevel directory structure, we need to protect not only individual files but also collections of files in subdirectories; that is, we need to provide a mechanism for directory protection. The directory operations that must be protected are somewhat different from the file operations. We want to control the creation and deletion of files in a directory. In addition, we probably want to control whether a user can determine the existence of a file in a directory. Sometimes, knowledge of the existence and name of a file is significant in itself. Thus, listing the contents of a directory must be a protected operation. Similarly, if a path name refers to a file in a directory, the user must be allowed access to both the directory and the file. In systems where files may have numerous path names (such as acyclic and general graphs), a given user may have different access rights to a particular file, depending on the path name used.

      This section discusses alternative protection mechanisms, including password-protected files and encryption. While password-based protection can be effective, it has drawbacks—users may need to remember multiple passwords, and if a single password is compromised, all files become vulnerable. Some systems address this by protecting directories instead of individual files. Encryption provides a stronger solution by securing files at a deeper level, requiring decryption keys for access. The section also highlights the importance of protecting directories, ensuring unauthorized users cannot view, create, or delete files. Directory protection is essential in hierarchical file systems where access restrictions must be enforced at multiple levels. Additionally, some systems allow different access rights depending on the file path used, further enhancing security measures.

    91. 13.5.1 Basic Mechanism

      Memory-mapped files work by associating disk blocks with pages in memory. The system initially loads data through demand paging, so the first access to a mapped page causes a page fault. Subsequent access is handled as ordinary memory operations rather than read() and write() calls, improving efficiency. File modifications are not necessarily written to disk immediately; updates are flushed when the file is closed, when memory pressure forces pages out, or on an explicit sync. Some operating systems, such as Solaris, use memory mapping for all file I/O: files opened with ordinary system calls are still mapped, but into the kernel address space rather than the process's. The ability for multiple processes to map the same file provides shared memory, which is critical for efficient interprocess communication (IPC).
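
      A minimal POSIX sketch of the mechanism (the file name is hypothetical): mmap() attaches the file's pages to the process's address space, after which ordinary loads and stores replace read() and write() calls.

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <sys/stat.h>
        #include <unistd.h>

        int main(void) {
            int fd = open("data.bin", O_RDWR);           /* hypothetical file */
            if (fd < 0) { perror("open"); return 1; }

            struct stat st;
            fstat(fd, &st);

            /* Map the whole file; pages are faulted in on first access. */
            char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
            if (p == MAP_FAILED) { perror("mmap"); return 1; }

            p[0] = 'X';                       /* an ordinary store, no write() call */
            msync(p, st.st_size, MS_SYNC);    /* force the dirty page back to disk  */

            munmap(p, st.st_size);
            close(fd);
            return 0;
        }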

    92. Shared Memory in the Windows API

      Windows provides built-in support for memory-mapped files to enable shared memory between processes. The process begins with the creation of a file mapping using CreateFile() and CreateFileMapping(), followed by mapping a view of the file into a process’s address space with MapViewOfFile(). A second process can then access the same memory-mapped file, allowing seamless data sharing. This mechanism is particularly useful for producer-consumer scenarios, where one process writes to shared memory while another reads from it. The example code in this section demonstrates how a producer writes a message to a shared-memory object, which a consumer then reads. This implementation avoids the overhead of traditional IPC methods like pipes or message queues, leveraging efficient memory management instead.
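
      A compressed sketch of the producer side, assuming a mapping backed by the system paging file rather than a named file on disk and a hypothetical object name "SharedObject"; a consumer would open the same name with OpenFileMapping() and then call MapViewOfFile().

        #include <windows.h>
        #include <stdio.h>

        int main(void) {
            /* Backed by the system paging file (INVALID_HANDLE_VALUE), 4 KB. */
            HANDLE hMap = CreateFileMapping(INVALID_HANDLE_VALUE, NULL,
                                            PAGE_READWRITE, 0, 4096,
                                            TEXT("SharedObject"));
            if (hMap == NULL) { printf("CreateFileMapping failed\n"); return 1; }

            LPVOID view = MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, 0);
            if (view == NULL) { printf("MapViewOfFile failed\n"); return 1; }

            /* Producer writes a message into the shared region. */
            CopyMemory(view, "Shared memory message",
                       sizeof("Shared memory message"));

            UnmapViewOfFile(view);
            CloseHandle(hMap);
            return 0;
        }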

  2. Feb 2025
    1. Power Management

      Power management is crucial for both data centers and mobile computing. In large-scale environments, optimizing power use reduces operational costs and environmental impact. Techniques such as dynamic workload distribution, hardware sleep states, and natural cooling sources help minimize energy consumption. Mobile devices, in particular, require advanced power-saving mechanisms to maximize battery life. Android, for example, implements power collapse, component-level power management, and wakelocks to regulate energy use efficiently. Power collapse places devices into deep sleep while allowing quick reactivation. Component-level management tracks usage dependencies, selectively powering down inactive components. Wakelocks prevent premature sleep when critical tasks are active. These strategies collectively enhance device longevity and efficiency, making modern computing more sustainable and user-friendly.

    2. Kernel Data Structures

      Kernel data structures maintain essential state information about I/O activities, ensuring efficient system operation. UNIX’s open-file table, for example, tracks file descriptors, file system records, and active inodes, streamlining file management. Similar structures exist for network connections and character devices. Object-oriented approaches further enhance modularity, as seen in UNIX’s dispatch tables and Windows’ message-passing system. In Windows, I/O requests are encapsulated as messages, enabling flexible interactions between the kernel, I/O manager, and device drivers. Although this approach introduces additional processing overhead, it simplifies I/O management and enhances system flexibility. These structured methods ensure that operating systems can efficiently track, manage, and process diverse I/O operations, leading to more stable and efficient computing environments.

    3. I/O Protection

      I/O protection mechanisms prevent unauthorized or malicious processes from disrupting system operations. Since I/O instructions are privileged, users cannot execute them directly; instead, they must invoke system calls, allowing the kernel to validate and process requests securely. Additionally, memory-mapped I/O regions require access controls to prevent unauthorized modifications. However, certain applications, like high-performance graphics software, require direct hardware access, necessitating controlled exceptions. In such cases, the kernel implements locking mechanisms to allocate resources safely. These protective measures ensure system stability by preventing unintended interference while allowing necessary hardware interactions. Effective I/O protection safeguards both user data and system integrity, reducing vulnerabilities and maintaining a secure computing environment.

    4. Error Handling

      Error handling in I/O subsystems ensures system stability by detecting and responding to hardware or software failures. Errors can be transient, such as network congestion, or permanent, like a failed disk controller. Operating systems mitigate transient errors through retries, while permanent failures may require manual intervention. UNIX systems use the errno variable to indicate error types, whereas more sophisticated hardware protocols, like SCSI, provide multi-level error reporting for precise diagnostics. Despite these mechanisms, not all errors are recoverable, necessitating backup strategies to minimize data loss. Error logs maintained by devices can aid in troubleshooting, though many operating systems underutilize this information. Robust error handling mechanisms ensure continued system reliability, reducing downtime and maintaining data integrity in complex computing environments.
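
      As a sketch of separating transient from permanent failures (the retry policy here is an assumption, not a prescribed recipe), a write can be retried when errno reports an interruption or temporary unavailability:

        #include <errno.h>
        #include <unistd.h>

        /* Write all `len` bytes, retrying the transient failures that errno
         * can report (EINTR, EAGAIN); any other error is treated as permanent. */
        ssize_t write_all(int fd, const char *buf, size_t len) {
            size_t done = 0;
            while (done < len) {
                ssize_t n = write(fd, buf + done, len - done);
                if (n >= 0) {
                    done += (size_t)n;
                } else if (errno == EINTR || errno == EAGAIN) {
                    continue;            /* transient: try again              */
                } else {
                    return -1;           /* permanent: report to the caller   */
                }
            }
            return (ssize_t)done;
        }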

    5. Spooling and Device Reservation A spool is a buffer that holds output for a device, such as a printer, that cannot accept interleaved data streams. Although a printer can serve only one job at a time, several applications may wish to print their output concurrently, without having their output mixed together. The operating system solves this problem by intercepting all output to the printer. Each application's output is spooled to a separate secondary storage file. When an application finishes printing, the spooling system queues the corresponding spool file for output to the printer. The spooling system copies the queued spool files to the printer one at a time. In some operating systems, spooling is managed by a system daemon process. In others, it is handled by an in-kernel thread. In either case, the operating system provides a control interface that enables users and system administrators to display the queue, remove unwanted jobs before those jobs print, suspend printing while the printer is serviced, and so on. Some devices, such as tape drives and printers, cannot usefully multiplex the I/O requests of multiple concurrent applications. Spooling is one way operating systems can coordinate concurrent output. Another way to deal with concurrent device access is to provide explicit facilities for coordination. Some operating systems (including VMS) provide support for exclusive device access by enabling a process to allocate an idle device and to deallocate that device when it is no longer needed. Other operating systems enforce a limit of one open file handle to such a device. Many operating systems provide functions that enable processes to coordinate exclusive access among themselves. For instance, Windows provides system calls to wait until a device object becomes available. It also has a parameter to the OpenFile() system call that declares the types of access to be permitted to other concurrent threads. On these systems, it is up to the applications to avoid deadlock.

      Spooling is a technique used to manage output devices that cannot handle interleaved data streams, such as printers. Instead of sending data directly to the device, the system stores it in a spool—a designated buffer in secondary storage—ensuring orderly processing. A spooler daemon manages the queue, allowing users to monitor and manipulate print jobs. Additionally, some devices require exclusive access to prevent conflicts. Device reservation mechanisms prevent multiple processes from simultaneously accessing non-shareable devices like tape drives. Operating systems enforce these constraints through explicit allocation requests or file-handle restrictions, ensuring coordinated access. In environments like Windows, system calls allow processes to wait for device availability, preventing deadlocks. Effective spooling and reservation strategies optimize device utilization and maintain orderly, conflict-free access to critical system resources.

    6. Caching A cache is a region of fast memory that holds copies of data. Access to the cached copy is more efficient than access to the original. For instance, the instructions of the currently running process are stored on disk, cached in physical memory, and copied again in the CPU's secondary and primary caches. The difference between a buffer and a cache is that a buffer may hold the only existing copy of a data item, whereas a cache, by definition, holds a copy on faster storage of an item that resides elsewhere. Caching and buffering are distinct functions, but sometimes a region of memory can be used for both purposes. For instance, to preserve copy semantics and to enable efficient scheduling of disk I/O, the operating system uses buffers in main memory to hold disk data. These buffers are also used as a cache, to improve the I/O efficiency for files that are shared by applications or that are being written and reread rapidly. When the kernel receives a file I/O request, the kernel first accesses the buffer cache to see whether that region of the file is already available in main memory. If it is, a physical disk I/O can be avoided or deferred. Also, disk writes are accumulated in the buffer cache for several seconds, so that large transfers are gathered to allow efficient write schedules. This strategy of delaying writes to improve I/O efficiency is discussed, in the context of remote file access, in Section 19.6.4.

      Caching involves storing copies of frequently accessed data in fast memory to reduce access time and improve performance. Unlike buffering, which may hold the only copy of data, caching retains duplicates of data stored elsewhere. For example, operating systems use disk caching to store frequently accessed file data in main memory, minimizing disk I/O. The buffer cache, a specialized form of caching, helps optimize read and write operations by aggregating small transfers into larger, more efficient disk writes. This strategy not only enhances performance but also prolongs hardware lifespan by reducing mechanical wear on disk drives. Caching strategies such as least-recently-used (LRU) or write-back caching further optimize resource utilization. By leveraging caching, operating systems significantly improve application response times and overall system efficiency.

    7. Buffering A buffer, of course, is a memory area that stores data being transferred between two devices or between a device and an application. Buffering is done for three reasons. One reason is to cope with a speed mismatch between the producer and consumer of a data stream. Suppose, for example, that a file is being received via Internet for storage on an SSD. The network speed may be a thousand times slower than the drive. So a buffer is created in main memory to accumulate the bytes received from the network. When an entire buffer of data has arrived, the buffer can be written to the drive in a single operation. Since the drive write is not instantaneous and the network interface still needs a place to store additional incoming data, two buffers are used. After the network fills the first buffer, the drive write is requested. The network then starts to fill the second buffer while the first buffer is written to storage. By the time the network has filled the second buffer, the drive write from the first one should have completed, so the network can switch back to the first buffer while the drive writes the second one. This double buffering decouples the producer of data from the consumer, thus relaxing timing requirements between them. The need for this decoupling is illustrated in Figure 12.11, which lists the enormous differences in device speeds for typical computer hardware and interfaces.

      Buffering is the use of memory areas to store data temporarily during transfers between devices or between a device and an application. It helps bridge the speed gap between slower and faster components, ensuring smooth data flow. A typical example is network file transfers, where data is buffered before being written to an SSD to accommodate the speed disparity. Double buffering, in particular, allows one buffer to be filled while another is processed, enhancing efficiency. Buffering also plays a crucial role in handling data-transfer size mismatches, such as in networking, where large messages are fragmented and reassembled. Additionally, it supports copy semantics, ensuring that data integrity is maintained even if an application modifies a buffer after issuing a write command. This functionality, despite its overhead, is critical for maintaining reliable data processing.
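
      A minimal double-buffering sketch in C, with stdin standing in for the slow network source and stdout for the drive; a real implementation would issue the write asynchronously so it overlaps with filling the other buffer.

        #include <unistd.h>

        #define BUF_SIZE 65536

        /* Double buffering: buf[filling] is being filled while the other
         * buffer is (conceptually) being drained to the drive. */
        int main(void) {
            static char buf[2][BUF_SIZE];
            int filling = 0;
            ssize_t n;
            while ((n = read(STDIN_FILENO, buf[filling], BUF_SIZE)) > 0) {
                /* In a real system this write would be asynchronous so it
                 * overlaps with filling the other buffer. */
                if (write(STDOUT_FILENO, buf[filling], (size_t)n) != n)
                    return 1;
                filling = 1 - filling;   /* swap the producer to the other buffer */
            }
            return 0;
        }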

    8. Scheduling To schedule a set of I/O requests means to determine a good order in which to execute them. The order in which applications issue system calls rarely is the best choice. Scheduling can improve overall system performance, can share device access fairly among processes, and can reduce the average waiting time for I/O to complete. Here is a simple example to illustrate. Suppose that a disk arm is near the beginning of a disk and that three applications issue blocking read calls to that disk. Application 1 requests a block near the end of the disk, application 2 requests one near the beginning, and application 3 requests one in the middle of the disk. The operating system can reduce the distance that the disk arm travels by serving the applications in the order 2, 3, 1. Rearranging the order of service in this way is the essence of I/O scheduling. Operating-system developers implement scheduling by maintaining a wait queue of requests for each device. When an application issues a blocking I/O system call, the request is placed on the queue for that device. The I/O scheduler rearranges the order of the queue to improve the overall system efficiency and the average response time experienced by applications. The operating system may also try to be fair, so that no one application receives especially poor service, or it may give priority service for delay-sensitive requests. For instance, requests from the virtual memory subsystem may take priority over application requests. Several scheduling algorithms for disk I/O were detailed in Section 11.2. When a kernel supports asynchronous I/O, it must be able to keep track of many I/O requests at the same time. For this purpose, the operating system might attach the wait queue to a device-status table. The kernel manages this table, which contains an entry for each I/O device, as shown in Figure 12.10. Each table entry indicates the device's type, address, and state (not functioning, idle, or busy). If the device is busy with a request, the type of request and other parameters will be stored in the table entry for that device.

      I/O scheduling determines the optimal order in which I/O requests are executed to improve system performance and fairness among processes. Without proper scheduling, applications may issue system calls inefficiently, leading to increased wait times and decreased responsiveness. By maintaining a wait queue for each device, the operating system can rearrange requests to minimize disk arm movement and enhance overall efficiency. Additionally, priority-based scheduling allows time-sensitive requests, such as those from virtual memory, to be processed first. In asynchronous I/O environments, the system must track multiple concurrent I/O operations, often utilizing a device-status table to manage these requests. Effective I/O scheduling ensures that computing resources are utilized optimally, balancing performance needs with fairness to prevent any single application from monopolizing device access.
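
      A toy sketch of the reordering idea, not a real disk scheduler: pending requests sit in a per-device queue and are reordered by block number so the arm sweeps once, reproducing the 2, 3, 1 service order from the example (the block numbers are invented).

        #include <stdio.h>
        #include <stdlib.h>

        struct io_request { int app_id; long block; };

        static int by_block(const void *a, const void *b) {
            long x = ((const struct io_request *)a)->block;
            long y = ((const struct io_request *)b)->block;
            return (x > y) - (x < y);
        }

        int main(void) {
            /* Head is near block 0; requests arrived in the order 1, 2, 3. */
            struct io_request queue[] = {
                { 1, 95000 },   /* near the end of the disk */
                { 2,   200 },   /* near the beginning       */
                { 3, 50000 },   /* in the middle            */
            };
            size_t n = sizeof(queue) / sizeof(queue[0]);

            /* Reorder by block number so the arm sweeps once: serves 2, 3, 1. */
            qsort(queue, n, sizeof(queue[0]), by_block);
            for (size_t i = 0; i < n; i++)
                printf("serve application %d at block %ld\n",
                       queue[i].app_id, queue[i].block);
            return 0;
        }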

    9. Some operating systems provide another major variation of I/O via their application interfaces. Vectored I/O allows one system call to perform multiple I/O operations involving multiple locations. For example, the UNIX readv system call accepts a vector of multiple buffers and either reads from a source to that vector or writes from that vector to a destination. The same transfer could be caused by several individual invocations of system calls, but this scatter–gather method is useful for a variety of reasons. Multiple separate buffers can have their contents transferred via one system call, avoiding context-switching and system-call overhead. Without vectored I/O, the data might first need to be transferred to a larger buffer in the right order and then transmitted, which is inefficient. In addition, some versions of scatter–gather provide atomicity, assuring that all the I/O is done without interruption (and avoiding corruption of data if other threads are also performing I/O involving those buffers). When possible, programmers make use of scatter–gather I/O features to increase throughput and decrease system overhead.

      Vectored I/O optimizes data transfers by allowing multiple I/O operations in a single system call, reducing context-switching and system-call overhead. The UNIX readv() system call exemplifies this approach, reading from one source into a vector of buffers, while writev() performs the matching gather write. This scatter-gather style improves performance by eliminating the need to copy data into one large contiguous buffer before transferring it. Some implementations also guarantee atomicity, preventing interleaving or corruption when multiple threads perform I/O on the same buffers. These properties, improved throughput and reduced system overhead, make vectored I/O an advantageous choice for high-performance applications that require efficient and reliable data transfers.
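
      A minimal gather-write sketch using the POSIX writev() call: three separate buffers are transmitted in order by a single system call instead of three separate write() calls.

        #include <stdio.h>
        #include <sys/uio.h>
        #include <unistd.h>

        int main(void) {
            char *part1 = "header ", *part2 = "body ", *part3 = "trailer\n";

            /* Gather: one system call transmits all three buffers in order. */
            struct iovec iov[3] = {
                { .iov_base = part1, .iov_len = 7 },
                { .iov_base = part2, .iov_len = 5 },
                { .iov_base = part3, .iov_len = 8 },
            };

            ssize_t n = writev(STDOUT_FILENO, iov, 3);
            if (n < 0) { perror("writev"); return 1; }
            return 0;
        }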

    10. Nonblocking and Asynchronous I/O

      This section explores blocking, nonblocking, and asynchronous I/O methodologies. Blocking I/O suspends thread execution until the request completes, simplifying application development but potentially causing inefficiencies. Nonblocking I/O returns immediately, allowing applications to continue executing while checking for data availability. Asynchronous I/O fully decouples request initiation from completion, notifying the application when data transfer finishes. The section compares synchronous and asynchronous methods using Figure 12.9, illustrating hardware interactions, interrupt handling, and process execution flows. Examples include UI applications handling user input while processing output and video applications reading and rendering frames simultaneously. The section emphasizes the significance of choosing appropriate I/O strategies based on application needs to balance responsiveness and computational efficiency.
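
      A minimal sketch of nonblocking I/O using POSIX O_NONBLOCK on standard input: read() returns immediately with EAGAIN when no data is available, leaving the loop free to do other work (here it just sleeps briefly).

        #include <errno.h>
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void) {
            /* Switch stdin to nonblocking mode. */
            int flags = fcntl(STDIN_FILENO, F_GETFL, 0);
            fcntl(STDIN_FILENO, F_SETFL, flags | O_NONBLOCK);

            char buf[128];
            for (;;) {
                ssize_t n = read(STDIN_FILENO, buf, sizeof(buf));
                if (n > 0) {
                    write(STDOUT_FILENO, buf, (size_t)n);   /* data arrived  */
                } else if (n == 0) {
                    break;                                  /* end of input  */
                } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
                    /* No data yet: the application could do other work here. */
                    usleep(10000);
                } else {
                    perror("read");
                    return 1;
                }
            }
            return 0;
        }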

    11. 12.3.3 Clocks and Timers Most computers have hardware clocks and timers that provide three basic functions: Give the current time. Give the elapsed time. Set a timer to trigger operation X at time T. These functions are used heavily by the operating system, as well as by time-sensitive applications. Unfortunately, the system calls that implement these functions are not standardized across operating systems. The hardware to measure elapsed time and to trigger operations is called a programmable interval timer. It can be set to wait a certain amount of time and then generate an interrupt, and it can be set to do this once or to repeat the process to generate periodic interrupts. The scheduler uses this mechanism to generate an interrupt that will preempt a process at the end of its time slice. The disk I/O subsystem uses it to invoke the periodic flushing of dirty cache buffers to disk, and the network subsystem uses it to cancel operations that are proceeding too slowly because of network congestion or failures. The operating system may also provide an interface for user processes to use timers. The operating system can support more timer requests than the number of timer hardware channels by simulating virtual clocks. To do so, the kernel (or the timer device driver) maintains a list of interrupts wanted by its own routines and by user requests, sorted in earliest-time-first order. It sets the timer for the earliest time. When the timer interrupts, the kernel signals the requester and reloads the timer with the next earliest time. Computers have clock hardware that is used for a variety of purposes. Modern PCs include a high-performance event timer (HPET), which runs at rates in the 10-megahertz range. It has several comparators that can be set to trigger once or repeatedly when the value they hold matches that of the HPET. The trigger generates an interrupt, and the operating system's clock management routines determine what the timer was for and what action to take. The precision of triggers is limited by the resolution of the timer, together with the overhead of maintaining virtual clocks. Furthermore, if the timer ticks are used to maintain the system time-of-day clock, the system clock can drift. Drift can be corrected via protocols designed for that purpose, such as NTP, the network time protocol, which uses sophisticated latency calculations to keep a computer's clock accurate almost to atomic-clock levels. In most computers, the hardware clock is constructed from a high-frequency counter. In some computers, the value of this counter can be read from a device register, in which case the counter can be considered a high-resolution clock. Although this clock does not generate interrupts, it offers accurate measurements of time intervals.

      Clocks and timers play crucial roles in measuring time, scheduling operations, and triggering events in modern operating systems. They provide essential functionalities such as fetching the current time, tracking elapsed time, and setting timers for future operations. Programmable interval timers generate interrupts at specified intervals, facilitating process scheduling, cache flushing, and network congestion management. Hardware advancements, such as the high-performance event timer (HPET), enhance precision in modern PCs. However, system clock drift necessitates correction protocols like NTP to maintain accuracy. Virtual clocks allow an operating system to support multiple timer requests beyond hardware limitations, ensuring efficient time management. The discussion emphasizes how clocks and timers integrate with system processes, optimizing performance and synchronization.
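
      A small POSIX sketch of the basic functions, using clock_gettime() for elapsed time and alarm()/SIGALRM as a simple one-shot timer (the one-second interval is arbitrary).

        #define _POSIX_C_SOURCE 199309L
        #include <signal.h>
        #include <stdio.h>
        #include <time.h>
        #include <unistd.h>

        static volatile sig_atomic_t fired = 0;
        static void on_alarm(int sig) { (void)sig; fired = 1; }

        int main(void) {
            /* Elapsed time via the high-resolution monotonic clock. */
            struct timespec start, now;
            clock_gettime(CLOCK_MONOTONIC, &start);

            /* "Trigger operation X at time T": deliver SIGALRM in 1 second. */
            signal(SIGALRM, on_alarm);
            alarm(1);
            while (!fired)
                pause();                  /* sleep until the timer interrupt */

            clock_gettime(CLOCK_MONOTONIC, &now);
            double elapsed = (now.tv_sec - start.tv_sec)
                           + (now.tv_nsec - start.tv_nsec) / 1e9;
            printf("timer fired after %.3f s\n", elapsed);
            return 0;
        }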

    12. Block and Character Devices

      This section categorizes I/O devices as block or character devices, explaining their operational differences. Block devices, such as disk drives, handle data in fixed-size blocks and support commands like read(), write(), and seek(). These devices are typically accessed through a file-system interface or via raw I/O, where the application bypasses file system services for direct access. Direct I/O, a hybrid approach, allows applications to disable buffering and locking. In contrast, character devices, like keyboards and mice, process data as streams of characters. The get() and put() functions facilitate data transfer, while higher-level libraries provide line-editing and buffering services. Memory-mapped I/O is introduced as an efficient way to access block devices, reducing the overhead of traditional I/O operations.

    13. Because the performance and addressing characteristics of network I/O differ significantly from those of disk I/O, most operating systems provide a network I/O interface that is different from the read()–write()–seek() interface used for disks. One interface available in many operating systems, including UNIX and Windows, is the network socket interface. Think of a wall socket for electricity: any electrical appliance can be plugged in. By analogy, the system calls in the socket interface enable an application to create a socket, to connect a local socket to a remote address (which plugs this application into a socket created by another application), to listen for any remote application to plug into the local socket, and to send and receive packets over the connection. To support the implementation of network servers, the socket interface also provides a function called select() that manages a set of sockets. A call to select() returns information about which sockets have a packet waiting to be received and which sockets have room to accept a packet to be sent. The use of select() eliminates the polling and busy waiting that would otherwise be necessary for network I/O. These functions encapsulate the essential behaviors of networks, greatly facilitating the creation of distributed applications that can use any underlying network hardware and protocol stack. Many other approaches to interprocess communication and network communication have been implemented. For instance, Windows provides one interface to the network interface card and a second interface to the network protocols. In UNIX, which has a long history as a proving ground for network technology, we find half-duplex pipes, full-duplex FIFOs, full-duplex STREAMS, message queues, and sockets. Information on UNIX networking is given in Section C.9.

      Due to distinct performance and addressing characteristics, network I/O employs a different interface than disk I/O. Many operating systems, including UNIX and Windows, implement the socket interface for network communication. The socket system calls enable applications to create, connect, listen, send, and receive data over network connections. The select() function is highlighted for managing multiple sockets efficiently, allowing applications to detect readable or writable sockets without continuous polling. Network communication in UNIX is further explored, showcasing diverse interprocess communication methods such as pipes, FIFOs, message queues, and STREAMS. The section underscores the importance of encapsulating network behaviors through standardized interfaces, simplifying the development of distributed applications that function across different network hardware and protocol stacks.
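
      A condensed sketch of select()-based multiplexing, assuming the sockets in fds[] were created and connected elsewhere: a single call reports which descriptors have data waiting, so the server avoids polling each socket in a busy loop.

        #include <sys/select.h>
        #include <unistd.h>

        /* Wait until one of the n descriptors in fds[] has data, then read it.
         * The sockets themselves are assumed to have been set up elsewhere. */
        int serve_once(const int fds[], int n) {
            fd_set readable;
            FD_ZERO(&readable);

            int maxfd = -1;
            for (int i = 0; i < n; i++) {
                FD_SET(fds[i], &readable);
                if (fds[i] > maxfd) maxfd = fds[i];
            }

            /* Blocks until at least one socket has a packet waiting. */
            if (select(maxfd + 1, &readable, NULL, NULL, NULL) < 0)
                return -1;

            char buf[512];
            for (int i = 0; i < n; i++)
                if (FD_ISSET(fds[i], &readable))
                    read(fds[i], buf, sizeof(buf));   /* handle the ready socket */
            return 0;
        }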

    14. Application I/O Interface

      This section introduces the fundamental structuring techniques and interfaces used by operating systems to standardize I/O device interactions. It explains how applications can access files on different types of disks without requiring device-specific knowledge. The discussion highlights the importance of abstraction, encapsulation, and software layering in designing an I/O system. These concepts enable the operating system to provide a uniform interface for application interaction with I/O devices. The role of device drivers is emphasized, as they encapsulate hardware-specific functionalities while exposing standardized interfaces to the kernel. Figure 12.7 is referenced to illustrate the layered structure of the kernel’s I/O subsystem, depicting how different drivers and controllers interact to manage I/O operations efficiently.

    15. 12.2.4 Direct Memory Access For a device that does large transfers, such as a disk drive, it seems wasteful to use an expensive general-purpose processor to watch status bits and to feed data into a controller register one byte at a time—a process termed programmed I/O (PIO). Computers avoid burdening the main CPU with PIO by offloading some of this work to a special-purpose processor called a direct-memory-access (DMA) controller. To initiate a DMA transfer, the host writes a DMA command block into memory. This block contains a pointer to the source of a transfer, a pointer to the destination of the transfer, and a count of the number of bytes to be transferred. A command block can be more complex, including a list of sources and destinations addresses that are not contiguous. This scatter–gather method allows multiple transfers to be executed via a single DMA command. The CPU writes the address of this command block to the DMA controller, then goes on with other work. The DMA controller proceeds to operate the memory bus directly, placing addresses on the bus to perform transfers without the help of the main CPU. A simple DMA controller is a standard component in all modern computers, from smartphones to mainframes. Note that it is most straightforward for the target address to be in kernel address space. If it were in user space, the user could, for example, modify the contents of that space during the transfer, losing some set of data. To get the DMA-transferred data to the user space for thread access, however, a second copy operation, this time from kernel memory to user memory, is needed. This double buffering is inefficient. Over time, operating systems have moved to using memory-mapping (see Section 12.2.1) to perform I/O transfers directly between devices and user address space. Handshaking between the DMA controller and the device controller is performed via a pair of wires called DMA-request and DMA-acknowledge. The device controller places a signal on the DMA-request wire when a word of data is available for transfer. This signal causes the DMA controller to seize the memory bus, place the desired address on the memory-address wire, and place a signal on the DMA-acknowledge wire. When the device controller receives the DMA-acknowledge signal, it transfers the word of data to memory and removes the DMA-request signal. When the entire transfer is finished, the DMA controller interrupts the CPU. This process is depicted in Figure 12.6. When the DMA controller seizes the memory bus, the CPU is momentarily prevented from accessing main memory, although it can still access data items in its caches. Although this cycle stealing can slow down the CPU computation, offloading the data-transfer work to a DMA controller generally improves the total system performance. Some computer architectures use physical memory addresses for DMA, but others perform direct virtual memory access (DVMA), using virtual addresses that undergo translation to physical addresses. DVMA can perform a transfer between two memory-mapped devices without the intervention of the CPU or the use of main memory.

      Direct Memory Access (DMA) enhances system efficiency by offloading data transfer tasks from the CPU to a dedicated DMA controller. This is particularly beneficial for large transfers, such as those from disk drives, where programmed I/O (PIO) would be inefficient. The CPU initiates a DMA transfer by writing a command block to memory, specifying source, destination, and transfer size. The DMA controller then handles the data transfer independently, allowing the CPU to perform other tasks. Handshaking between the DMA controller and device controllers ensures proper data movement. While DMA momentarily seizes the memory bus, slowing CPU memory access, overall performance improves. Some systems use direct virtual memory access (DVMA) to enable direct transfers between memory-mapped devices, further optimizing data movement.
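
      The command block described above can be pictured as a small structure the host fills in before handing its address to the DMA controller; the field names and widths here are illustrative, not any real controller's layout.

        #include <stdint.h>

        /* One scatter-gather entry: a source, a destination, and a byte count.
         * A command block may chain several of these so that one DMA command
         * covers transfers involving non-contiguous memory. */
        struct dma_descriptor {
            uint64_t source;        /* physical address to read from             */
            uint64_t destination;   /* physical address to write to              */
            uint32_t byte_count;    /* number of bytes to transfer               */
            uint64_t next;          /* physical address of next descriptor, or 0 */
        };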

    16. 12.2.3 Interrupts

      Interrupts are crucial in modern operating systems, enabling the CPU to respond to asynchronous events. The fundamental mechanism involves a CPU monitoring an interrupt-request line, which signals when a device needs service. Upon detection, the CPU saves its state, jumps to an interrupt-handler routine, determines the cause of the interrupt, processes the request, restores the state, and resumes execution. This cycle prevents the CPU from constantly polling devices, improving efficiency. Interrupt-driven I/O is vital as systems handle thousands of interrupts per second. Modern interrupt management includes deferring interrupts during critical tasks, efficiently dispatching handlers, prioritizing multiple interrupts, and handling software-generated exceptions. These features are implemented through CPU architecture and interrupt-controller hardware, ensuring responsive and efficient system operation.

    17. The complete protocol for interaction between the host and a controller can be intricate, but the basic handshaking notion is simple. We explain handshaking with an example. Assume that 2 bits are used to coordinate the producer–consumer relationship between the controller and the host. The controller indicates its state through the busy bit in the status register. (Recall that to set a bit means to write a 1 into the bit and to clear a bit means to write a 0 into it.) The controller sets the busy bit when it is busy working and clears the busy bit when it is ready to accept the next command. The host signals its wishes via the command-ready bit in the command register. The host sets the command-ready bit when a command is available for the controller to execute. For this example, the host writes output through a port, coordinating with the controller by handshaking as follows. 1. The host repeatedly reads the busy bit until that bit becomes clear. 2. The host sets the write bit in the command register and writes a byte into the data-out register. 3. The host sets the command-ready bit. 4. When the controller notices that the command-ready bit is set, it sets the busy bit. 5. The controller reads the command register and sees the write command. It reads the data-out register to get the byte and does the I/O to the device. 6. The controller clears the command-ready bit, clears the error bit in the status register to indicate that the device I/O succeeded, and clears the busy bit to indicate that it is finished. This loop is repeated for each byte. In step 1, the host is busy-waiting or polling: it is in a loop, reading the status register over and over until the busy bit becomes clear. If the controller and device are fast, this method is a reasonable one. But if the wait may be long, the host should probably switch to another task. How, then, does the host know when the controller has become idle? For some devices, the host must service the device quickly, or data will be lost. For instance, when data are streaming in on a serial port or from a keyboard, the small buffer on the controller will overflow and data will be lost if the host waits too long before returning to read the bytes. In many computer architectures, three CPU-instruction cycles are sufficient to poll a device: read a device register, logical-and to extract a status bit, and branch if not zero. Clearly, the basic polling operation is efficient. But polling becomes inefficient when it is attempted repeatedly yet rarely finds a device ready for service, while other useful CPU processing remains undone. In such instances, it may be more efficient to arrange for the hardware controller to notify the CPU when the device becomes ready for service, rather than to require the CPU to poll repeatedly for an I/O completion. The hardware mechanism that enables a device to notify the CPU is called an interrupt.

      This section describes polling, a mechanism by which the CPU checks a device’s status before issuing a command, and walks through the handshaking protocol between host and controller using the busy, command-ready, and write bits in the status and command registers. Busy-waiting, where the CPU repeatedly reads the status register until the busy bit clears, is simple and cheap for fast devices but wastes CPU cycles when a device is slow or rarely ready. In those cases interrupt-driven I/O is the better alternative: the device notifies the CPU when it is ready for service. The section thus frames the basic trade-off between CPU-driven polling and device-driven interrupts.
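
      The six-step handshake can be sketched as host-side C pseudocode; the register-access helpers and bit masks below are placeholders for whatever a real controller would define, so this is illustrative rather than runnable against actual hardware.

        #include <stdint.h>

        /* Placeholder register interface and bit masks for an imaginary controller. */
        #define STATUS_BUSY   0x01
        #define CMD_READY     0x01
        #define CMD_WRITE     0x02
        uint8_t read_status(void);
        void    write_command(uint8_t cmd);
        void    write_data_out(uint8_t byte);

        /* Programmed I/O: send one byte to the device by handshaking. */
        void pio_write_byte(uint8_t byte) {
            while (read_status() & STATUS_BUSY)
                ;                                    /* 1. busy-wait until idle        */
            write_data_out(byte);                    /* 2. place the byte in data-out  */
            write_command(CMD_WRITE | CMD_READY);    /* 2-3. write command, set ready  */
            /* 4-6. the controller now sets busy, performs the device I/O, and
             *      clears command-ready and busy; the loop above observes that
             *      before the next byte is sent. */
        }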

    18. Memory-Mapped I/O

      This section explains how processors communicate with I/O devices through registers that store data and control signals. It contrasts two methods: special I/O instructions and memory-mapped I/O. In memory-mapped I/O, device registers are mapped to the processor’s address space, allowing standard data-transfer instructions to be used for communication. The section highlights the evolution from using dedicated I/O instructions to predominantly employing memory-mapped I/O due to its efficiency and speed advantages. The discussion includes examples such as graphics controllers, which store screen data in memory-mapped regions. Understanding memory-mapped I/O is fundamental to designing systems that efficiently manage hardware interactions.
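
      In C, a memory-mapped device register is usually accessed through a volatile pointer at a fixed address; the addresses and bit layout below are invented purely for illustration.

        #include <stdint.h>

        /* Hypothetical device: a status register and a data register mapped
         * into the physical address space (addresses are made up). */
        #define DEVICE_BASE   0xFE000000u
        #define REG_STATUS    (*(volatile uint32_t *)(DEVICE_BASE + 0x0))
        #define REG_DATA      (*(volatile uint32_t *)(DEVICE_BASE + 0x4))
        #define STATUS_READY  0x1u

        /* Ordinary load/store instructions talk to the device: no special
         * I/O instructions are needed with memory-mapped I/O. */
        void send_word(uint32_t word) {
            while (!(REG_STATUS & STATUS_READY))
                ;                        /* wait for the controller to be ready */
            REG_DATA = word;             /* the store itself is the I/O operation */
        }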

    19. 12.2 I/O Hardware Computers operate a great many kinds of devices. Most fit into the general categories of storage devices (disks, tapes), transmission devices (network connections, Bluetooth), and human-interface devices (screen, keyboard, mouse, audio in and out). Other devices are more specialized, such as those involved in the steering of a jet. In these aircraft, a human gives input to the flight computer via a joystick and foot pedals, and the computer sends output commands that cause motors to move rudders and flaps and fuels to the engines. Despite the incredible variety of I/O devices, though, we need only a few concepts to understand how the devices are attached and how the software can control the hardware. A device communicates with a computer system by sending signals over a cable or even through the air. The device communicates with the machine via a connection point, or port—for example, a serial port. (The term PHY, shorthand for the OSI model physical layer, is also used in reference to ports but is more common in data-center nomenclature.) If devices share a common set of wires, the connection is called a bus. A bus, like the PCI bus used in most computers today, is a set of wires and a rigidly defined protocol that specifies a set of messages that can be sent on the wires. In terms of the electronics, the messages are conveyed by patterns of electrical voltages applied to the wires with defined timings. When device A has a cable that plugs into device B, and device B has a cable that plugs into device C, and device C plugs into a port on the computer, this arrangement is called a daisy chain. A daisy chain usually operates as a bus. Buses are used widely in computer architecture and vary in their signaling methods, speed, throughput, and connection methods. A typical PC bus structure appears in Figure 12.1. In the figure, a PCIe bus (the common PC system bus) connects the processor–memory subsystem to fast devices, and an expansion bus connects relatively slow devices, such as the keyboard and serial and USB ports. In the lower-left portion of the figure, four disks are connected together on a serial-attached SCSI (SAS) bus plugged into an SAS controller. PCIe is a flexible bus that sends data over one or more “lanes.” A lane is composed of two signaling pairs, one pair for receiving data and the other for transmitting. Each lane is therefore composed of four wires, and each lane is used as a full-duplex byte stream, transporting data packets in an eight-bit byte format simultaneously in both directions. Physically, PCIe links may contain 1, 2, 4, 8, 12, 16, or 32 lanes, as signified by an “x” prefix. A PCIe card or connector that uses 8 lanes is designated x8, for example. In addition, PCIe has gone through multiple “generations,” with more coming in the future. Thus, for example, a card might be “PCIe gen3 x8”, which means it works with generation 3 of PCIe and uses 8 lanes. Such a device has maximum throughput of 8 gigabytes per second.

      This section introduces the various categories of I/O devices, including storage, transmission, and human-interface devices. It describes how devices communicate with a computer system through ports, buses, and controllers. Key concepts include the use of serial ports, daisy chains, and PCIe buses, which determine how devices are connected and managed. Buses are crucial in defining communication protocols and data transfer rates. The section also introduces device controllers, which handle low-level hardware operations, and highlights the role of host bus adapters (HBAs) in managing complex communication protocols. Understanding these hardware elements is essential for efficient device management and integration in an operating system.

    20. 12.1 Overview The control of devices connected to the computer is a major concern of operating-system designers. Because I/O devices vary so widely in their function and speed (consider a mouse, a hard disk, a flash drive, and a tape robot), varied methods are needed to control them. These methods form the I/O subsystem of the kernel, which separates the rest of the kernel from the complexities of managing I/O devices. I/O-device technology exhibits two conflicting trends. On the one hand, we see increasing standardization of software and hardware interfaces. This trend helps us to incorporate improved device generations into existing computers and operating systems. On the other hand, we see an increasingly broad variety of I/O devices. Some new devices are so unlike previous devices that it is a challenge to incorporate them into our computers and operating systems. This challenge is met by a combination of hardware and software techniques. The basic I/O hardware elements, such as ports, buses, and device controllers, accommodate a wide variety of I/O devices. To encapsulate the details and oddities of different devices, the kernel of an operating system is structured to use device-driver modules. The device drivers present a uniform device-access interface to the I/O subsystem, much as system calls provide a standard interface between the application and the operating system.

      This section describes the I/O subsystem, which acts as an intermediary between the core components of an OS—such as the process manager, memory manager, and file system—and external I/O devices. It acknowledges the dual trends in I/O technology: increasing standardization of interfaces and the rising diversity of I/O devices. The I/O subsystem simplifies device management through device drivers, which present a uniform access interface despite the underlying hardware differences. Standardization allows smooth integration of new device generations, while device-driver modules help encapsulate the complexities of hardware communication. The discussion sets the foundation for understanding how an OS interacts with I/O devices, balancing standardization and flexibility.

    21. 8.8.2 Resource Preemption

      Resource preemption resolves deadlocks by forcibly reclaiming resources from processes and redistributing them. It involves three key considerations: selecting a victim process, rolling back progress, and preventing starvation. Victim selection prioritizes minimizing cost, considering resource usage and execution time. Rollback ensures the process can resume execution, often requiring tracking safe states. The simplest rollback method is aborting and restarting the process, though selective rollback is more efficient. Starvation must be prevented by limiting how often a process is preempted. Implementing fair policies ensures no process is perpetually delayed, maintaining system efficiency while effectively managing resource allocation conflicts.

    22. 8.8.1 Process and Thread Termination To eliminate deadlocks by aborting a process or thread, we use one of two methods. In both methods, the system reclaims all resources allocated to the terminated processes. Abort all deadlocked processes. This method clearly will break the deadlock cycle, but at great expense. The deadlocked processes may have computed for a long time, and the results of these partial computations must be discarded and probably will have to be recomputed later. Abort one process at a time until the deadlock cycle is eliminated. This method incurs considerable overhead, since after each process is aborted, a deadlock-detection algorithm must be invoked to determine whether any processes are still deadlocked. Aborting a process may not be easy. If the process was in the midst of updating a file, terminating it may leave that file in an incorrect state. Similarly, if the process was in the midst of updating shared data while holding a mutex lock, the system must restore the status of the lock as being available, although no guarantees can be made regarding the integrity of the shared data. If the partial termination method is used, then we must determine which deadlocked process (or processes) should be terminated. This determination is a policy decision, similar to CPU-scheduling decisions. The question is basically an economic one; we should abort those processes whose termination will incur the minimum cost. Unfortunately, the term minimum cost is not a precise one. Many factors may affect which process is chosen, including: 1. What the priority of the process is 2. How long the process has computed and how much longer the process will compute before completing its designated task 3. How many and what types of resources the process has used (for example, whether the resources are simple to preempt) 4. How many more resources the process needs in order to complete 5. How many processes will need to be terminated

      Terminating processes eliminates deadlocks but can be costly. One approach is aborting all deadlocked processes, ensuring immediate resolution but wasting computed progress. Another method is terminating one process at a time until the deadlock is resolved, requiring repeated deadlock detection, which incurs overhead. Partial termination requires selecting the least costly process to terminate, considering factors like priority, resource usage, and completion time. However, terminating a process mid-operation can corrupt files or shared data, necessitating careful rollback mechanisms. Effective termination policies minimize disruption while ensuring system stability, making it crucial to balance efficiency and fairness when choosing a process.

    23. 8.7.2 Several Instances of a Resource Type The wait-for graph scheme is not applicable to a resource-allocation system with multiple instances of each resource type. We turn now to a deadlock-detection algorithm that is applicable to such a system. The algorithm employs several time-varying data structures that are similar to those used in the banker's algorithm (Section 8.6.3): Available. A vector of length m indicates the number of available resources of each type. Allocation. An n × m matrix defines the number of resources of each type currently allocated to each thread. Request. An n × m matrix indicates the current request of each thread. If Request[i][j] equals k, then thread Ti is requesting k more instances of resource type Rj.

      For multiple-instance resource systems, deadlock detection involves maintaining Available, Allocation, and Request matrices. The algorithm checks if requests can be satisfied sequentially, updating Work and Finish vectors. If any thread remains unfinished, the system is in a deadlocked state. Unlike prevention techniques, this method detects deadlocks after they occur, requiring additional recovery mechanisms. Java provides deadlock detection via thread dumps, which analyze blocked threads and lock ownership. While useful for debugging, this detection technique does not resolve deadlocks automatically. Periodic execution of detection algorithms helps maintain system stability but may impact performance in resource-intensive applications.
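
      As a concrete illustration, the following C sketch walks the Work/Finish loop summarized above. The dimensions, the global matrices, and the function name detect_deadlock are assumptions made for illustration, not part of the text.

      #include <stdbool.h>

      #define N_THREADS   5
      #define N_RESOURCES 3

      int Available[N_RESOURCES];
      int Allocation[N_THREADS][N_RESOURCES];
      int Request[N_THREADS][N_RESOURCES];

      /* Returns true and marks deadlocked[i] for every thread that can never finish. */
      bool detect_deadlock(bool deadlocked[N_THREADS]) {
          int  Work[N_RESOURCES];
          bool Finish[N_THREADS];

          for (int j = 0; j < N_RESOURCES; j++) Work[j] = Available[j];
          for (int i = 0; i < N_THREADS; i++) {
              Finish[i] = true;                         /* a thread holding nothing cannot be deadlocked */
              for (int j = 0; j < N_RESOURCES; j++)
                  if (Allocation[i][j] != 0) Finish[i] = false;
          }

          bool progress = true;
          while (progress) {                            /* repeatedly look for a thread whose request fits */
              progress = false;
              for (int i = 0; i < N_THREADS; i++) {
                  if (Finish[i]) continue;
                  bool fits = true;
                  for (int j = 0; j < N_RESOURCES; j++)
                      if (Request[i][j] > Work[j]) fits = false;
                  if (fits) {                           /* assume Ti finishes and returns its allocation */
                      for (int j = 0; j < N_RESOURCES; j++) Work[j] += Allocation[i][j];
                      Finish[i] = true;
                      progress  = true;
                  }
              }
          }

          bool any = false;
          for (int i = 0; i < N_THREADS; i++) {
              deadlocked[i] = !Finish[i];               /* unfinished threads are deadlocked */
              if (deadlocked[i]) any = true;
          }
          return any;
      }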

    24. 8.7.1 Single Instance of Each Resource Type

      For systems with single-instance resources, deadlock detection uses a wait-for graph, derived from the resource-allocation graph by removing resource nodes. If a cycle exists, a deadlock has occurred. Detecting cycles in the wait-for graph requires O(n²) time complexity, making it suitable for small-scale systems. The BCC toolkit in Linux helps detect deadlocks in pthread_mutex locks by analyzing lock dependencies. The tool constructs a wait-for graph dynamically and identifies cycles, providing real-time deadlock detection. Although effective, this method requires continuous system monitoring and may introduce performance overhead due to frequent deadlock detection operations.
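
      A minimal sketch of the cycle check on a wait-for graph, assuming the graph is stored as a boolean adjacency matrix; the names N, waits_for, and wait_for_graph_has_cycle are illustrative, not taken from any particular tool.

      #include <stdbool.h>

      #define N 4  /* number of threads (vertices of the wait-for graph) */

      /* waits_for[i][j] is true when thread i waits for a resource held by thread j */
      bool waits_for[N][N];

      static bool dfs(int v, bool visited[N], bool on_stack[N]) {
          visited[v] = on_stack[v] = true;
          for (int w = 0; w < N; w++) {
              if (!waits_for[v][w]) continue;
              if (on_stack[w]) return true;                 /* back edge: a cycle, hence a deadlock */
              if (!visited[w] && dfs(w, visited, on_stack)) return true;
          }
          on_stack[v] = false;
          return false;
      }

      bool wait_for_graph_has_cycle(void) {
          bool visited[N] = {false}, on_stack[N] = {false};
          for (int v = 0; v < N; v++)
              if (!visited[v] && dfs(v, visited, on_stack))
                  return true;
          return false;
      }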

    25. 8.7 Deadlock Detection If a system does not employ either a deadlock-prevention or a deadlock-avoidance algorithm, then a deadlock situation may occur. In this environment, the system may provide: An algorithm that examines the state of the system to determine whether a deadlock has occurred An algorithm to recover from the deadlock Next, we discuss these two requirements as they pertain to systems with only a single instance of each resource type, as well as to systems with several instances of each resource type. At this point, however, we note that a detection-and-recovery scheme requires overhead that includes not only the run-time costs of maintaining the necessary information and executing the detection algorithm but also the potential losses inherent in recovering from a deadlock.

      If a system does not actively prevent or avoid deadlocks, it must detect and recover from them. Deadlock detection involves algorithms that periodically check the system state for potential deadlocks. Recovery mechanisms may include preempting resources or terminating processes. While this approach allows more flexibility than prevention, it incurs higher overhead in both detection and recovery. The cost includes maintaining real-time allocation data and implementing deadlock-resolution techniques. The method is useful in systems where deadlocks are rare but cannot be entirely prevented. It is especially relevant in dynamic environments with unpredictable resource requests and multi-threaded execution.

    26. 8.6.3.3 An Illustrative Example To illustrate the use of the banker's algorithm, consider a system with five threads T0 through T4 and three resource types A, B, and C. Resource type A has ten instances, resource type B has five instances, and resource type C has seven instances.

      An example with five threads (T0–T4) and three resource types (A, B, C) demonstrates the Banker's Algorithm. The initial system state is analyzed using the safety algorithm, confirming that a safe execution sequence exists. A request by T1 for (1,0,2) resources is evaluated, found to leave the system in a safe state, and granted. However, a later request by T4 for (3,3,0) cannot be granted because the resources are not available. Another request by T0 for (0,2,0) is denied because granting it would leave the system in an unsafe state. This example shows how the algorithm keeps the system in a safe state and thereby avoids deadlock in multithreaded environments.

    27. 8.6.3.2 Resource-Request Algorithm

      This algorithm determines if a thread's resource request can be granted safely. It follows three checks: (1) Ensuring the request doesn’t exceed the thread’s maximum claim, (2) Verifying that enough resources are available, and (3) Temporarily allocating resources and running the safety algorithm. If the new state is safe, the request is granted; otherwise, it is denied, and the system reverts to its previous state. This approach prevents unsafe allocations and potential deadlocks but adds computational overhead. It ensures resource availability for all threads without causing indefinite blocking, balancing efficiency with strict safety considerations in resource management.
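
      A sketch of the three checks in C, using the same illustrative globals as the other Banker's sketches in these notes; is_safe() stands for the safety algorithm summarized in the next item, and all names are assumptions made for illustration.

      #include <stdbool.h>

      #define N_THREADS   5
      #define N_RESOURCES 3

      extern int  Available[N_RESOURCES];
      extern int  Allocation[N_THREADS][N_RESOURCES];
      extern int  Need[N_THREADS][N_RESOURCES];
      extern bool is_safe(void);                        /* safety algorithm (see the next item) */

      /* Returns true if the request by thread i is granted. */
      bool request_resources(int i, const int req[N_RESOURCES]) {
          for (int j = 0; j < N_RESOURCES; j++) {
              if (req[j] > Need[i][j])   return false;  /* (1) exceeds the declared maximum */
              if (req[j] > Available[j]) return false;  /* (2) not enough free resources: Ti must wait */
          }
          for (int j = 0; j < N_RESOURCES; j++) {       /* (3) pretend to allocate ...              */
              Available[j]     -= req[j];
              Allocation[i][j] += req[j];
              Need[i][j]       -= req[j];
          }
          if (is_safe())                                /* ... and keep the allocation only if safe */
              return true;
          for (int j = 0; j < N_RESOURCES; j++) {       /* unsafe: restore the previous state       */
              Available[j]     += req[j];
              Allocation[i][j] -= req[j];
              Need[i][j]       += req[j];
          }
          return false;
      }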

    28. 8.6.3.1 Safety Algorithm

      The safety algorithm determines if a system is in a safe state by ensuring at least one sequence exists where every thread can finish execution. It initializes a Work vector (copy of Available) and a Finish vector (tracking completed threads). The algorithm iteratively finds a thread that can complete with the available resources, updates Work, and marks the thread as finished. If all threads finish, the system is safe. If some remain unfinished, it’s unsafe. This method, though effective, has a complexity of O(m × n²), making it expensive for large-scale systems with many threads and resource types.
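
      A sketch of the Work/Finish loop in C; the globals and the is_safe name match the request sketch above and are assumptions for illustration.

      #include <stdbool.h>

      #define N_THREADS   5
      #define N_RESOURCES 3

      extern int Available[N_RESOURCES];
      extern int Allocation[N_THREADS][N_RESOURCES];
      extern int Need[N_THREADS][N_RESOURCES];

      bool is_safe(void) {
          int  Work[N_RESOURCES];
          bool Finish[N_THREADS] = {false};

          for (int j = 0; j < N_RESOURCES; j++) Work[j] = Available[j];

          bool progress = true;
          while (progress) {                      /* find some Ti with Need[i] <= Work */
              progress = false;
              for (int i = 0; i < N_THREADS; i++) {
                  if (Finish[i]) continue;
                  bool fits = true;
                  for (int j = 0; j < N_RESOURCES; j++)
                      if (Need[i][j] > Work[j]) fits = false;
                  if (fits) {                     /* Ti can finish and return its allocation */
                      for (int j = 0; j < N_RESOURCES; j++) Work[j] += Allocation[i][j];
                      Finish[i] = true;
                      progress  = true;
                  }
              }
          }
          for (int i = 0; i < N_THREADS; i++)
              if (!Finish[i]) return false;       /* some thread can never finish: unsafe */
          return true;                            /* every thread can finish: safe */
      }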

    29. 8.6.3 Banker's Algorithm

      The Banker's Algorithm manages multiple instances of resource types to prevent deadlocks. A thread must declare its maximum resource needs beforehand, ensuring that the system can verify safety before granting resources. It maintains four data structures: Available (remaining resources), Max (maximum need per thread), Allocation (currently assigned resources), and Need (remaining required resources). The algorithm simulates allocation and checks for safety using a stepwise approach. If a request leads to an unsafe state, it is denied. This method ensures that resources are allocated efficiently, but it is computationally expensive due to the required safety-checking process.

    30. 8.6.2 Resource-Allocation-Graph Algorithm If we have a resource-allocation system with only one instance of each resource type, we can use a variant of the resource-allocation graph defined in Section 8.3.2 for deadlock avoidance. In addition to the request and assignment edges already described, we introduce a new type of edge, called a claim edge. A claim edge Ti → Rj indicates that thread Ti may request resource Rj at some time in the future. This edge resembles a request edge in direction but is represented in the graph by a dashed line. When thread Ti requests resource Rj, the claim edge Ti → Rj is converted to a request edge. Similarly, when a resource Rj is released by Ti, the assignment edge Rj → Ti is reconverted to a claim edge Ti → Rj.

      The resource-allocation-graph algorithm helps avoid deadlocks in systems with a single instance per resource type. It introduces a claim edge, which shows potential future requests. When a thread requests a resource, the claim edge turns into a request edge, and if granted, it becomes an assignment edge. Before granting a request, the system checks for cycles in the graph using a cycle-detection algorithm. If adding the assignment edge creates a cycle, the request is denied to prevent an unsafe state. This algorithm ensures safe resource allocation, avoiding deadlocks while maintaining system stability, though it is limited to single-instance resources.

    31. 8.6.1 Safe State A state is safe if the system can allocate resources to each thread (up to its maximum) in some order and still avoid a deadlock. More formally, a system is in a safe state only if there exists a safe sequence. A sequence of threads <T1, T2, …, Tn> is a safe sequence for the current allocation state if, for each Ti, the resource requests that Ti can still make can be satisfied by the currently available resources plus the resources held by all Tj, with j < i. In this situation, if the resources that Ti needs are not immediately available, then Ti can wait until all Tj have finished. When they have finished, Ti can obtain all of its needed resources, complete its designated task, return its allocated resources, and terminate. When Ti terminates, Ti+1 can obtain its needed resources, and so on. If no such sequence exists, then the system state is said to be unsafe.

      A system is in a safe state if a sequence of processes exists such that each process can obtain its maximum requested resources, execute, release resources, and allow the next process in the sequence to proceed. If no such sequence exists, the system is in an unsafe state, which may lead to deadlock.

    32. The three options presented thus far for deadlock prevention are generally impractical in most situations. However, the fourth and final condition for deadlocks — the circular-wait condition — presents an opportunity for a practical solution by invalidating one of the necessary conditions. One way to ensure that this condition never holds is to impose a total ordering of all resource types and to require that each thread requests resources in an increasing order of enumeration. To illustrate, we let R = {R1, R2, …, Rm} be the set of resource types. We assign to each resource type a unique integer number, which allows us to compare two resources and to determine whether one precedes another in our ordering. Formally, we define a one-to-one function F: R → N, where N is the set of natural numbers. We can accomplish this scheme in an application program by developing an ordering among all synchronization objects in the system. For example, the lock ordering in the Pthread program shown in Figure 8.1 could be F(first_mutex) = 1 F(second_mutex) = 5 We can now consider the following protocol to prevent deadlocks: Each thread can request resources only in an increasing order of enumeration.

      To prevent circular wait, a system can impose a total ordering of resource requests. Processes must request resources in a predefined order. If followed correctly, this protocol eliminates cycles in resource allocation and prevents deadlocks.
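
      Using the two mutexes named above (with F(first_mutex) = 1 < F(second_mutex) = 5), a Pthreads sketch of the protocol might look like the following: because every thread acquires the locks in increasing order of F, no circular wait can form.

      #include <pthread.h>

      pthread_mutex_t first_mutex  = PTHREAD_MUTEX_INITIALIZER;
      pthread_mutex_t second_mutex = PTHREAD_MUTEX_INITIALIZER;

      void *worker(void *arg) {
          /* All threads follow the same order, honoring F(first_mutex) < F(second_mutex). */
          pthread_mutex_lock(&first_mutex);
          pthread_mutex_lock(&second_mutex);
          /* ... work that needs both resources ... */
          pthread_mutex_unlock(&second_mutex);
          pthread_mutex_unlock(&first_mutex);
          return arg;
      }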

    33. 8.5.3 No Preemption

      A system can allow preemption of resources if a process requests a resource that is unavailable. The resources held by the waiting process can be preempted and reassigned, but this is typically feasible only for resources whose states can be saved and restored, such as CPU registers or database transactions.

    34. To ensure that the hold-and-wait condition never occurs in the system, we must guarantee that, whenever a thread requests a resource, it does not hold any other resources. One protocol that we can use requires each thread to request and be allocated all its resources before it begins execution. This is, of course, impractical for most applications due to the dynamic nature of requesting resources. An alternative protocol allows a thread to request resources only when it has none. A thread may request some resources and use them. Before it can request any additional resources, it must release all the resources that it is currently allocated. Both these protocols have two main disadvantages. First, resource utilization may be low, since resources may be allocated but unused for a long period. For example, a thread may be allocated a mutex lock for its entire execution, yet only require it for a short duration. Second, starvation is possible. A thread that needs several popular resources may have to wait indefinitely, because at least one of the resources that it needs is always allocated to some other thread.

      One way to prevent hold-and-wait conditions is to require processes to request all required resources before execution begins. Alternatively, a process can request resources only when it holds none. These protocols can lead to poor resource utilization and potential starvation.
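
      A sketch of the first protocol, assuming a job that needs two resources guarded by mutexes (the names scanner_lock and printer_lock are hypothetical): the thread acquires everything before it starts, so it never holds one resource while waiting for another.

      #include <pthread.h>

      pthread_mutex_t scanner_lock = PTHREAD_MUTEX_INITIALIZER;
      pthread_mutex_t printer_lock = PTHREAD_MUTEX_INITIALIZER;

      void *copy_job(void *arg) {
          /* Acquire every resource the job will ever need up front ... */
          pthread_mutex_lock(&scanner_lock);
          pthread_mutex_lock(&printer_lock);

          /* ... use them (the printer lock may sit idle while scanning,
             which is exactly the low-utilization drawback noted above) ... */

          /* ... and release everything before requesting anything else. */
          pthread_mutex_unlock(&printer_lock);
          pthread_mutex_unlock(&scanner_lock);
          return arg;
      }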

    35. 8.5.1 Mutual Exclusion The mutual-exclusion condition must hold. That is, at least one resource must be nonsharable. Sharable resources do not require mutually exclusive access and thus cannot be involved in a deadlock. Read-only files are a good example of a sharable resource. If several threads attempt to open a read-only file at the same time, they can be granted simultaneous access to the file. A thread never needs to wait for a sharable resource. In general, however, we cannot prevent deadlocks by denying the mutual-exclusion condition, because some resources are intrinsically nonsharable. For example, a mutex lock cannot be simultaneously shared by several threads.

      Some resources, like read-only files, can be shared and do not require mutual exclusion. However, certain resources, such as mutex locks, cannot be shared and thus remain prone to deadlocks.

    36. 8.4 Methods for Handling Deadlocks

      There are three general approaches to handling deadlocks:

      1. Ignoring Deadlocks: Many operating systems, including Linux and Windows, do not implement specific deadlock-handling mechanisms. Instead, they leave the responsibility to kernel and application developers to prevent and resolve deadlocks.

      2. Deadlock Prevention or Avoidance: This approach ensures that a system never enters a deadlocked state. Deadlock prevention works by ensuring that at least one of the necessary conditions for deadlocks is eliminated. Deadlock avoidance requires prior knowledge of resource requests to make decisions that prevent circular waits.

      3. Deadlock Detection and Recovery: Some systems, such as databases, allow deadlocks to occur and then implement algorithms to detect and recover from them. If no detection and recovery mechanisms exist, the system may deteriorate until manual intervention is required.

      Each approach has its own advantages and drawbacks. Ignoring deadlocks is cost-effective but risks system failure. Prevention and avoidance require additional constraints on resource requests, which may reduce system efficiency. Detection and recovery offer flexibility but may introduce significant computational overhead.

    37. Resource-Allocation Graph

      Deadlocks can be visually represented using resource-allocation graphs, where processes and resources are depicted as nodes, and edges indicate resource requests and allocations. These graphs help in understanding and detecting deadlocks in complex systems. For instance, if a circular wait is visible in a graph, it confirms that a deadlock exists. This concept is similar to dependency graphs used in database transactions, where cycles indicate transaction conflicts. A thought-provoking question is whether modern systems can automate deadlock detection by dynamically analyzing resource allocation graphs in real time.

    38. 8.3.1 Necessary Conditions A deadlock situation can arise if the following four conditions hold simultaneously in a system: 1. Mutual exclusion. At least one resource must be held in a nonsharable mode; that is, only one thread at a time can use the resource. If another thread requests that resource, the requesting thread must be delayed until the resource has been released. 2. Hold and wait. A thread must be holding at least one resource and waiting to acquire additional resources that are currently being held by other threads. 3. No preemption. Resources cannot be preempted; that is, a resource can be released only voluntarily by the thread holding it, after that thread has completed its task. 4. Circular wait. A set {T0, T1, …, Tn} of waiting threads must exist such that T0 is waiting for a resource held by T1, T1 is waiting for a resource held by T2, …, Tn−1 is waiting for a resource held by Tn, and Tn is waiting for a resource held by T0. We emphasize that all four conditions must hold for a deadlock to occur. The circular-wait condition implies the hold-and-wait condition, so the four conditions are not completely independent. We shall see in Section 8.5, however, that it is useful to consider each condition separately.

      A deadlock occurs when four conditions hold simultaneously: Mutual Exclusion, Hold and Wait, No Preemption, and Circular Wait. Mutual exclusion means that only one process can use a resource at a time, hold and wait refers to a process holding resources while waiting for others, no preemption prevents the forced release of resources, and circular wait forms a cycle where each process is waiting for a resource held by the next. If any of these conditions is broken, deadlocks cannot occur. One key question is why the operating system does not forcefully take back resources (breaking the no preemption condition) as a solution to deadlocks.

    39. Livelock is another form of liveness failure. It is similar to deadlock; both prevent two or more threads from proceeding, but the threads are unable to proceed for different reasons.

      Livelock is another form of liveness failure, related to but distinct from deadlock: threads or processes continuously retry an operation without making progress. Unlike deadlock, where processes remain blocked, livelock involves active but futile state changes. A real-world example is two people moving side to side in a hallway, each trying to avoid the other but still blocking each other. In computing, livelock can be mitigated by using randomized retry delays, a technique also applied in network protocols to resolve data transmission collisions. Understanding livelock is crucial because it can degrade system performance even though processes are technically still active.
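
      A Pthreads sketch of the pattern and its mitigation, under illustrative names: each thread politely releases its first lock when it cannot get the second, which avoids deadlock but can livelock if both threads retry in lockstep; the randomized delay breaks the lockstep.

      #include <pthread.h>
      #include <stdlib.h>
      #include <unistd.h>

      pthread_mutex_t first_mutex  = PTHREAD_MUTEX_INITIALIZER;
      pthread_mutex_t second_mutex = PTHREAD_MUTEX_INITIALIZER;

      /* Without the randomized back-off, worker_a and worker_b can keep
         retrying forever: active, but making no progress (livelock). */
      static void polite_pair(pthread_mutex_t *mine, pthread_mutex_t *theirs) {
          for (;;) {
              pthread_mutex_lock(mine);
              if (pthread_mutex_trylock(theirs) == 0) {
                  /* ... do the work that needs both resources ... */
                  pthread_mutex_unlock(theirs);
                  pthread_mutex_unlock(mine);
                  return;
              }
              pthread_mutex_unlock(mine);          /* give up and try again */
              usleep(rand() % 1000);               /* randomized delay breaks the lockstep */
          }
      }

      void *worker_a(void *arg) { polite_pair(&first_mutex, &second_mutex); return arg; }
      void *worker_b(void *arg) { polite_pair(&second_mutex, &first_mutex); return arg; }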

    40. 8.2 Deadlock in Multithreaded Applications

      In multithreaded applications, deadlocks arise when multiple threads acquire resources in an incorrect sequence. For instance, if two threads each acquire one mutex and then try to acquire the other mutex held by the opposite thread, neither can proceed, causing a deadlock. The scheduling of thread execution can influence deadlock occurrence; if one thread releases its resources quickly enough, a deadlock may be avoided. A key question to explore is whether modern operating systems implement scheduling mechanisms that help mitigate deadlocks dynamically.
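
      A Pthreads sketch of the scenario described above (the mutex names echo the first_mutex/second_mutex example quoted in item 32); whether the deadlock actually materializes depends on how the scheduler interleaves the two threads.

      #include <pthread.h>

      pthread_mutex_t first_mutex  = PTHREAD_MUTEX_INITIALIZER;
      pthread_mutex_t second_mutex = PTHREAD_MUTEX_INITIALIZER;

      void *thread_one(void *arg) {
          pthread_mutex_lock(&first_mutex);
          pthread_mutex_lock(&second_mutex);   /* may block forever waiting for thread_two */
          /* ... critical section ... */
          pthread_mutex_unlock(&second_mutex);
          pthread_mutex_unlock(&first_mutex);
          return arg;
      }

      void *thread_two(void *arg) {
          pthread_mutex_lock(&second_mutex);
          pthread_mutex_lock(&first_mutex);    /* may block forever waiting for thread_one */
          /* ... critical section ... */
          pthread_mutex_unlock(&first_mutex);
          pthread_mutex_unlock(&second_mutex);
          return arg;
      }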

    41. 8.1 System Model

      A system consists of multiple processes or threads competing for limited resources, which they request, use, and release. Examples of such resources include CPU cycles, files, and network interfaces. Deadlocks frequently occur when synchronization mechanisms like mutex locks and semaphores are used improperly. Understanding how resource management works at both the kernel and user levels is crucial to preventing deadlocks. One confusing aspect is distinguishing between deadlocks managed by the kernel and those occurring at the user level. Additionally, concepts like mutual exclusion (mutex) and semaphores play a significant role in deadlock formation and resolution.

    42. 4.7.2 Linux Threads Linux provides the fork() syste

      Linux does not differentiate between processes and threads, instead referring to both as tasks. The clone() system call allows task creation with varying levels of resource sharing, depending on flags passed during invocation. When clone() is used with flags like CLONE_VM and CLONE_FS, parent and child tasks share memory and file system information, effectively functioning as threads. If no flags are set, clone() behaves like fork(), creating a separate process. Linux tasks are managed through the task_struct data structure, which holds pointers to resource-related information. This flexibility extends to container-based virtualization, allowing multiple isolated environments under a single Linux kernel. The efficient implementation of clone() enables robust process and thread management in Linux systems.
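
      A hedged sketch of clone() in "thread mode" using the glibc wrapper: with CLONE_VM and CLONE_FS (plus CLONE_FILES and CLONE_SIGHAND) the child shares the parent's address space and file-system information, so its increment of shared_counter is visible to the parent. The flag set, stack size, and variable names are illustrative assumptions.

      #define _GNU_SOURCE
      #include <sched.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <sys/wait.h>

      static int shared_counter = 0;        /* visible to the child because of CLONE_VM */

      static int child_fn(void *arg) {
          shared_counter++;                 /* modifies the parent's memory directly */
          return 0;
      }

      int main(void) {
          const size_t STACK_SIZE = 1024 * 1024;
          char *stack = malloc(STACK_SIZE);
          if (stack == NULL) return 1;

          /* The child starts at child_fn; SIGCHLD lets the parent reap it with waitpid. */
          int tid = clone(child_fn, stack + STACK_SIZE,   /* stack grows downward */
                          CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | SIGCHLD,
                          NULL);
          if (tid == -1) { perror("clone"); return 1; }

          waitpid(tid, NULL, 0);
          printf("shared_counter = %d\n", shared_counter);   /* prints 1 */
          free(stack);
          return 0;
      }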

    43. 4.7.1 Windows Threads

      Windows threads follow a one-to-one mapping, where each user-level thread corresponds to a kernel thread. Every thread has essential components, including a thread ID, program counter, register set, and user/kernel stacks. Windows threads are represented by key data structures: ETHREAD (executive thread block), KTHREAD (kernel thread block), and TEB (thread environment block). The ETHREAD links the thread to its parent process, while KTHREAD manages scheduling and kernel stack usage. The TEB resides in user space, containing thread-local storage and other user-mode data. Since ETHREAD and KTHREAD exist in kernel space, only the operating system can access them. This structured approach ensures efficient thread management, optimizing performance in multithreaded Windows applications.

    44. 4.6.5 Scheduler Activations

      Scheduler activations enable efficient communication between the kernel and thread libraries in many-to-many and two-level threading models. An intermediate data structure, the lightweight process (LWP), acts as a bridge between user and kernel threads, allowing applications to manage scheduling effectively. The kernel assigns virtual processors (LWPs) to applications, and user threads execute on these virtual processors. When a thread blocks, an upcall mechanism notifies the application, prompting it to schedule another thread. This improves resource utilization by dynamically managing virtual processors. Additionally, upcall handlers ensure that blocked threads resume execution when resources become available. By providing applications with better control over thread execution, scheduler activations enhance the performance and responsiveness of multithreaded programs.

    45. 4.6.4 Thread-Local Storage

      Thread-Local Storage (TLS) allows each thread to maintain its own copy of specific data, which is useful when threads need to store unique information such as a transaction identifier. Unlike local variables that exist only within a function call, TLS persists across function calls for the duration of a thread's execution. This feature is beneficial when developers lack control over thread creation, such as when using thread pools. TLS behaves similarly to static data but is unique to each thread, ensuring thread safety in multithreaded applications. Various programming languages provide TLS support, such as Java’s ThreadLocal<T> class, POSIX Threads’ pthread_key_t, C#’s [ThreadStatic] attribute, and GCC’s __thread keyword, facilitating efficient thread-specific data management.
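
      A small sketch using GCC's __thread keyword (C11 spells it _Thread_local, and POSIX offers pthread_key_t for the same purpose); the transaction_id variable is illustrative.

      #include <pthread.h>
      #include <stdio.h>

      static __thread int transaction_id;    /* one independent copy per thread */

      static void *worker(void *arg) {
          transaction_id = (int)(long)arg;   /* sets only this thread's private copy */
          printf("thread %d sees transaction_id = %d\n",
                 (int)(long)arg, transaction_id);
          return NULL;
      }

      int main(void) {
          pthread_t t1, t2;
          pthread_create(&t1, NULL, worker, (void *)1L);
          pthread_create(&t2, NULL, worker, (void *)2L);
          pthread_join(t1, NULL);
          pthread_join(t2, NULL);
          return 0;
      }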

    46. 4.6 Threading Issues In this section, we discuss some of the issues to consider in designing multithreaded programs. 4.6.1 The fork() and exec() System Calls In Chapter 3, we described how the fork() system call is used to create a separate, duplicate process. The semantics of the fork() and exec() system calls change in a multithreaded program. If one thread in a program calls fork(), does the new process duplicate all threads, or is the new process single-threaded? Some UNIX systems have chosen to have two versions of fork(), one that duplicates all threads and another that duplicates only the thread that invoked the fork() system call. The exec() system call typically works in the same way as described in Chapter 3. That is, if a thread invokes the exec() system call, the program specified in the parameter to exec() will replace the entire process—including all threads. Which of the two versions of fork() to use depends on the application. If exec() is called immediately after forking, then duplicating all threads is unnecessary, as the program specified in the parameters to exec() will replace the process. In this instance, duplicating only the calling thread is appropriate. If, however, the separate process does not call exec() after forking, the separate process should duplicate all threads. 4.6.2 Signal Handling A signal is used in UNIX systems to notify a process that a particular event has occurred. A signal may be received either synchronously or asynchronously, depending on the source of and the reason for the event being signaled. All signals, whether synchronous or asynchronous, follow the same pattern: 1. A signal is generated by the occurrence of a particular event. 2. The signal is delivered to a process. 3. Once delivered, the signal must be handled. Examples of synchronous signals include illegal memory access and division by 0. If a running program performs either of these actions, a signal is generated. Synchronous signals are delivered to the same process that performed the operation that caused the signal (that is the reason they are considered synchronous). When a signal is generated by an event external to a running process, that process receives the signal asynchronously. Examples of such signals include terminating a process with specific keystrokes (such as <control><C>) and having a timer expire. Typically, an asynchronous signal is sent to another process. A signal may be handled by one of two possible handlers: 1. A default signal handler 2. A user-defined signal handler Every signal has a default signal handler that the kernel runs when handling that signal. This default action can be overridden by a user-defined signal handler that is called to handle the signal. Signals are handled in different ways. Some signals may be ignored, while others (for example, an illegal memory access) are handled by terminating the program. Handling signals in single-threaded programs is straightforward: signals are always delivered to a process. However, delivering signals is more complicated in multithreaded programs, where a process may have several threads. Where, then, should a signal be delivered? In general, the following options exist: 1. Deliver the signal to the thread to which the signal applies. 2. Deliver the signal to every thread in the process. 3. Deliver the signal to certain threads in the process. 4. Assign a specific thread to receive all signals for the process. The method for delivering a signal depends on the type of signal generated. 
For example, synchronous signals need to be delivered to the thread causing the signal and not to other threads in the process. However, the situation with asynchronous signals is not as clear. Some asynchronous signals—such as a signal that terminates a process (<control><C>, for example)—should be sent to all threads. The standard UNIX function for delivering a signal is kill(pid_t pid, int signal) This function specifies the process (pid) to which a particular signal (signal) is to be delivered. Most multithreaded versions of UNIX allow a thread to specify which signals it will accept and which it will block. Therefore, in some cases, an asynchronous signal may be delivered only to those threads that are not blocking it. However, because signals need to be handled only once, a signal is typically delivered only to the first thread found that is not blocking it. POSIX Pthreads provides the following function, which allows a signal to be delivered to a specified thread (tid): pthread_kill(pthread_t tid, int signal) Although Windows does not explicitly provide support for signals, it allows us to emulate them using asynchronous procedure calls (APCs). The APC facility enables a user thread to specify a function that is to be called when the user thread receives notification of a particular event. As indicated by its name, an APC is roughly equivalent to an asynchronous signal in UNIX. However, whereas UNIX must contend with how to deal with signals in a multithreaded environment, the APC facility is more straightforward, since an APC is delivered to a particular thread rather than a process. 4.6.3 Thread Cancellation Thread cancellation involves terminating a thread before it has completed. For example, if multiple threads are concurrently searching through a database and one thread returns the result, the remaining threads might be canceled. Another situation might occur when a user presses a button on a web browser that stops a web page from loading any further. Often, a web page loads using several threads—each image is loaded in a separate thread. When a user presses the stop button on the browser, all threads loading the page are canceled. A thread that is to be canceled is often referred to as the target thread. Cancellation of a target thread may occur in two different scenarios: 1. Asynchronous cancellation. One thread immediately terminates the target thread. 2. Deferred cancellation. The target thread periodically checks whether it should terminate, allowing it an opportunity to terminate itself in an orderly fashion. The difficulty with cancellation occurs in situations where resources have been allocated to a canceled thread or where a thread is canceled while in the midst of updating data it is sharing with other threads. This becomes especially troublesome with asynchronous cancellation. Often, the operating system will reclaim system resources from a canceled thread but will not reclaim all resources. Therefore, canceling a thread asynchronously may not free a necessary system-wide resource. With deferred cancellation, in contrast, one thread indicates that a target thread is to be canceled, but cancellation occurs only after the target thread has checked a flag to determine whether or not it should be canceled. The thread can perform this check at a point at which it can be canceled safely. In Pthreads, thread cancellati

      Multithreaded programs face several challenges, including handling system calls like fork() and exec(). The behavior of fork() varies—some UNIX implementations duplicate all threads, while others duplicate only the calling thread. If exec() is called immediately after forking, duplicating all threads is unnecessary. Signal handling is another challenge, as signals can be synchronous (e.g., division by zero) or asynchronous (e.g., termination signals). Signals in multithreaded programs can be delivered to a specific thread, all threads, or certain threads based on signal type. POSIX Pthreads provide pthread_kill() to direct signals to a specific thread. Windows uses Asynchronous Procedure Calls (APCs) to handle event-driven notifications, similar to UNIX signals. Developers must ensure proper signal handling to avoid unintended behavior. Thread cancellation is another concern, requiring careful implementation to prevent resource leaks and ensure that terminated threads do not leave operations incomplete.
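
      As one concrete pattern, the sketch below implements option 4 from the quoted list ("assign a specific thread to receive all signals"): every thread blocks SIGUSR1, and a single dedicated thread accepts it synchronously with sigwait(), while pthread_kill() targets that thread directly. The listener function name is an illustrative assumption.

      #include <pthread.h>
      #include <signal.h>
      #include <stdio.h>

      static sigset_t sigs;

      static void *signal_listener(void *arg) {
          int signo;
          sigwait(&sigs, &signo);            /* only this thread consumes the signal */
          printf("received signal %d\n", signo);
          return arg;
      }

      int main(void) {
          sigemptyset(&sigs);
          sigaddset(&sigs, SIGUSR1);
          /* Block SIGUSR1 here; threads created afterwards inherit the mask. */
          pthread_sigmask(SIG_BLOCK, &sigs, NULL);

          pthread_t listener;
          pthread_create(&listener, NULL, signal_listener, NULL);

          /* Deliver the signal to a specific thread, as with pthread_kill(). */
          pthread_kill(listener, SIGUSR1);
          pthread_join(listener, NULL);
          return 0;
      }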

    47. 4.5.5 Intel Thread Building Blocks Intel threading building blocks (TBB) is a template library that supports designing parallel applications in C++. As this is a library, it requires no special compiler or language support. Developers specify tasks that can run in parallel, and the TBB task scheduler maps these tasks onto underlying threads. Furthermore, the task scheduler provides load balancing and is cache aware, meaning that it will give precedence to tasks that likely have their data stored in cache memory and thus will execute more quickly. TBB provides a rich set of features, including templates for parallel loop structures, atomic operations, and mutual exclusion locking. In addition, it provides concurrent data structures, including a hash map, queue, and vector, which can serve as equivalent thread-safe versions of the C++ standard template library data structures. Let's use parallel for loops as an example. Initially, assume there is a function named apply(float value) that performs an operation on the parameter value. If we had an array v of size n containing float values, we could use the following serial for loop to pass each value in v to the apply() function: for (int i = 0; i < n; i++) {   apply(v[i]); } A developer could manually apply data parallelism (Section 4.2.2) on a multicore system by assigning different regions of the array v to each processing core; however, this ties the technique for achieving parallelism closely to the physical hardware, and the algorithm would have to be modified and recompiled for the number of processing cores on each specific architecture. Alternatively, a developer could use TBB, which provides a parallel_for template that expects two values: parallel_for (range   body) where range refers to the range of elements that will be iterated (known as the iteration space) and body specifies an operation that will be performed on a subrange of elements. We can now rewrite the above serial for loop using the TBB parallel_for template as follows: parallel_for (size_t(0), n, [=](size_t i) {apply(v[i]);}); The first two parameters specify that the iteration space is from 0 to n − 1 (which corresponds to the number of elements in the array v). The second parameter is a C++ lambda function that requires a bit of explanation. The expression [=](size_t i) is the parameter i, which assumes each of the values over the iteration space (in this case from 0 to n − 1). Each value of i is used to identify which array element in v is to be passed as a parameter to the apply(v[i]) function. The TBB library will divide the loop iterations into separate “chunks” and create a number of tasks that operate on those chunks. (The parallel_for function allows developers to manually specify the size of the chunks if they wish to.) TBB will also create a number of threads and assign tasks to available threads. This is quite similar to the fork-join library in Java. The advantage of this approach is that it requires only that developers identify what operations can run in parallel (by specifying a parallel_for loop), and the library manages the details involved in dividing the work into separate tasks that run in parallel. Intel TBB has both commercial and open-source versions that run on Windows, Linux, and macOS. Refer to the bibliography for further details on how to develop parallel applications using TBB.

      Intel TBB is a powerful template library for parallel programming in C++. Unlike OpenMP, it does not require compiler support but instead provides a task-based approach for parallel execution. The TBB task scheduler dynamically maps tasks to threads, optimizing load balancing and cache efficiency. The parallel_for template automates parallelization of loops, ensuring efficient distribution of workload. This method abstracts hardware-specific optimizations, making it adaptable across different multicore systems. TBB also provides concurrent data structures such as thread-safe hash maps and vectors, enhancing performance in multithreaded applications. The flexibility of TBB allows developers to scale applications efficiently without modifying code for different hardware configurations. While TBB offers advanced features for parallelism, it requires a solid understanding of lambda functions and task dependencies to maximize performance. Proper use of TBB ensures improved computational efficiency in complex applications.

    48. 4.5.4 Grand Central Dispatch

      GCD, developed by Apple, simplifies parallel programming by managing thread execution using dispatch queues. Tasks are categorized into serial and concurrent queues. Serial queues process tasks sequentially, ensuring ordered execution, while concurrent queues allow multiple tasks to run simultaneously, improving performance. GCD supports four priority classes for tasks: user-interactive, user-initiated, utility, and background, optimizing execution based on urgency and resource demand. Developers can use blocks in C, C++, and Objective-C or closures in Swift to define parallel tasks. The ability to manage thread pools dynamically makes GCD highly efficient, automatically adjusting thread counts to balance workload. GCD's integration into Apple’s ecosystem ensures seamless performance enhancements across macOS and iOS applications. Understanding the distinction between different queue types and priority levels is essential for effective task scheduling and resource management.

    49. 4.5.3 OpenMP

      OpenMP is an API designed for parallel programming in shared-memory environments. It enables developers to create parallel regions using compiler directives, ensuring that specific code blocks run simultaneously. The example C program demonstrates the use of #pragma omp parallel to create multiple threads based on the system’s available cores. OpenMP is beneficial for optimizing performance by distributing workload among threads, particularly in tasks like looping through arrays. Developers can specify levels of parallelism and manage data sharing between threads. OpenMP is widely supported across major operating systems and compilers, making it a versatile choice for parallel computing. However, developers must be mindful of thread synchronization issues to avoid race conditions. Understanding OpenMP's features, such as manual thread control and data-sharing specifications, is crucial for efficient parallel programming.
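
      A small sketch of the directives described above: a parallel region followed by a parallel for loop. Compile with an OpenMP-aware compiler (for example, gcc -fopenmp); the loop and array are illustrative.

      #include <omp.h>
      #include <stdio.h>

      int main(void) {
          /* One thread per available core executes this block. */
          #pragma omp parallel
          {
              printf("hello from thread %d of %d\n",
                     omp_get_thread_num(), omp_get_num_threads());
          }

          /* Loop iterations are divided among the threads; the reduction
             clause gives each thread a private partial sum. */
          enum { N = 1000 };
          double v[N], sum = 0.0;
          #pragma omp parallel for reduction(+:sum)
          for (int i = 0; i < N; i++) {
              v[i] = i * 0.5;
              sum += v[i];
          }
          printf("sum = %f\n", sum);
          return 0;
      }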

    50. 4.5.2.1 Fork Join in Java Java introduced a fork-join library in Version 1.7 of the API that is designed to be used with recursive divide-and-conquer algorithms such as Quicksort and Mergesort. When implementing divide-and-conquer algorithms using this library, separate tasks are forked during the divide step and assigned smaller subsets of the original problem. Algorithms must be designed so that these separate tasks can execute concurrently. At some point, the size of the problem assigned to a task is small enough that it can be solved directly and requires creating no additional tasks.

      Java’s ForkJoinPool facilitates parallel execution using recursive divide-and-conquer techniques. Tasks extend ForkJoinTask, typically through its subclasses RecursiveTask (which returns a result) or RecursiveAction (which does not). The system dynamically manages worker threads, using a work-stealing algorithm where idle threads can take tasks from others. This model optimizes CPU usage, enabling efficient multi-threaded performance for computationally intensive tasks.

    51. 4.5.2 Fork Join

      The fork-join model allows parallel task execution by recursively dividing a problem into smaller tasks (fork) and merging results (join). This model is particularly effective for divide-and-conquer algorithms like Quicksort and Mergesort. It streamlines parallel execution by leveraging a task-splitting approach, making it an excellent candidate for implicit threading. A library manages thread allocation dynamically, balancing workload across available processors.

    52. 4.5.1.1 Java Thread Pools The java.util.concurrent package includes an API for several varieties of thread-pool architectures. Here, we focus on the following three models: 1. Single thread executor—newSingleThreadExecutor()—creates a pool of size 1. 2. Fixed thread executor—newFixedThreadPool(int size)—creates a thread pool with a specified number of threads. 3. Cached thread executor—newCachedThreadPool()—creates an unbounded thread pool, reusing threads in many instances.

      The java.util.concurrent package provides an API for managing thread pools. Java offers three main types: single-thread executor, fixed-thread executor, and cached-thread executor. These allow developers to choose a threading strategy based on their application's workload. The ExecutorService interface standardizes thread pool management, including task execution and shutdown mechanisms, simplifying concurrent programming.

    53. ANDROID THREAD POOLS

      Android’s AIDL framework integrates thread pools to manage remote service requests efficiently. By using a thread pool, Android ensures concurrent execution of multiple requests while limiting resource consumption. This model is especially useful for handling background tasks, such as network communication and database access, without overwhelming system resources.

    54. Thread Pools

      Thread pools mitigate the overhead of thread creation by maintaining a pool of pre-instantiated worker threads. This approach improves performance by reducing thread creation time and controlling the number of active threads. The Android OS also employs thread pools to manage RPCs efficiently. Thread pools are particularly beneficial in environments where tasks arrive at unpredictable rates, ensuring optimal CPU and memory utilization while preventing system overload.

    55. THE JVM AND THE HOST OPERATING SYSTEM

      The Java Virtual Machine (JVM) operates on top of a host OS, abstracting platform-specific threading details. The JVM does not dictate how Java threads map to OS threads, leaving it to individual implementations. Windows, for example, employs a one-to-one threading model, meaning each Java thread corresponds to a kernel thread. This adaptability allows Java applications to run consistently across different OS environments, utilizing native threading libraries such as Windows API or Pthreads on Linux/macOS.

    56. 4.4.1 Pthreads

      Pthreads, or POSIX threads, provide a standardized API for multithreading and are widely used in Linux, UNIX, and macOS. A typical Pthreads program includes:

      Declaring a thread identifier (pthread_t tid;).

      Defining thread attributes (pthread_attr_t attr;).

      Creating a thread using pthread_create().

      Ensuring the parent thread waits for completion with pthread_join().

      Terminating the child thread using pthread_exit().

      The typical example uses synchronous threading: the parent calls pthread_join() and waits for the child thread to finish before proceeding. A minimal sketch follows.
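
      A minimal sketch that follows the steps listed above; the summation task is just an illustrative workload.

      #include <pthread.h>
      #include <stdio.h>

      static int sum;                       /* shared by the threads of this process */

      static void *runner(void *param) {
          int upper = *(int *)param;
          for (int i = 1; i <= upper; i++)
              sum += i;
          pthread_exit(0);                  /* terminate the child thread */
      }

      int main(void) {
          int n = 10;
          pthread_t tid;                    /* thread identifier */
          pthread_attr_t attr;              /* thread attributes */

          pthread_attr_init(&attr);         /* use the default attributes */
          pthread_create(&tid, &attr, runner, &n);
          pthread_join(tid, NULL);          /* parent waits for the child */
          printf("sum = %d\n", sum);
          return 0;
      }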

    57. 4.3.1 Many-to-One Model

      The many-to-one model maps multiple user threads to a single kernel thread, with all thread management occurring in user space. This approach is efficient because it avoids kernel overhead, making thread creation and management faster. However, it has significant drawbacks. One major issue is blocking: if a user thread makes a blocking system call, the entire process halts, because all threads share the same kernel thread. Additionally, this model does not allow parallel execution on multicore systems, as only one thread can access the kernel at a time. Examples of this model include green threads, which were used in early Java implementations and in Solaris. Because modern hardware emphasizes multicore processing, this model is rarely used today.

    58. Thread Libraries

      A thread library provides an API for creating and managing threads, allowing developers to handle multithreading efficiently. These libraries can be implemented at either the user level or the kernel level.

      A user-level thread library operates in user space, avoiding system calls for better performance. However, if a thread makes a blocking system call, all threads in the process may be blocked.

      A kernel-level thread library is supported directly by the operating system. While slightly slower due to system calls, it allows better thread management and parallel execution.

      Some of the most widely used thread libraries include POSIX Pthreads (common in UNIX/Linux systems), Windows Threads, and Java Threads, which rely on the underlying OS for execution.

    59. 4.3.3 Many-to-Many Model The many-to-many model (Figure 4.9) multiplexes many user-level threads to a smaller or equal number of kernel threads. The number of kernel threads may be specific to either a particular application or a particular machine (an application may be allocated more kernel threads on a system with eight processing cores than a system with four cores).

      The many-to-many model allows multiple user threads to be mapped to an equal or smaller number of kernel threads. This provides greater flexibility and efficiency, as developers can create many user threads without overwhelming kernel resources. It also enables parallel execution when multiple kernel threads are available. A variation of this model is the two-level model, which extends many-to-many by allowing some user threads to be bound to specific kernel threads, providing finer control over execution. However, this model is difficult to implement efficiently. With modern hardware supporting large numbers of kernel threads, many systems now favor the one-to-one model instead.

    60. The one-to-one model (Figure 4.8) maps each user thread to a kernel thread. It provides more concurrency than the many-to-one model by allowing another thread to run when a thread makes a blocking system call. It also allows multiple threads to run in parallel on multiprocessors. The only drawback to this model is that creating a user thread requires creating the corresponding kernel thread, and a large number of kernel threads may burden the performance of a system. Linux, along with the family of Windows operating systems, implement the one-to-one model.

      In the one-to-one model, each user thread is directly mapped to a kernel thread. This model improves concurrency because if one thread blocks, others can continue execution. Furthermore, it supports true parallelism, allowing threads to run on multiple processors. Despite these advantages, the model comes with high overhead. Creating a user thread also requires creating a corresponding kernel thread, which can lead to resource exhaustion if too many threads are created.

    61. 4.3 Multithreading Models Our discussion so far has treated threads in a generic sense. However, support for threads may be provided either at the user level, for user threads, or by the kernel, for kernel threads.

      Thread support is provided at both the user level and kernel level. User threads exist entirely in user space and rely on thread libraries, while kernel threads are managed directly by the operating system. There are three primary multithreading models: many-to-one, one-to-one, and many-to-many. Each model has trade-offs regarding efficiency, concurrency, and system resource utilization. Modern operating systems primarily use the one-to-one model, ensuring better performance and scalability. However, the many-to-many model is still relevant in concurrency libraries, enabling flexible thread-to-core mapping. Understanding these models is essential for designing efficient multithreaded applications, particularly in high-performance and distributed computing environments.

    62. 4.2.2 Types of Parallelism In general, there are two types of parallelism: data parallelism and task parallelism. Data parallelism focuses on distributing subsets of the same data across multiple computing cores and performing the same operation on each core. Consider, for example, summing the contents of an array of size N. On a single-core system, one thread would simply sum the elements [0] … [N − 1]. On a dual-core system, however, thread A, running on core 0, could sum the elements [0] … [N/2 − 1] while thread B, running on core 1, could sum the elements [N/2] … [N − 1]. The two threads would be running in parallel on separate computing cores. Task parallelism involves distributing not data but tasks (threads) across multiple computing cores. Each thread is performing a unique operation. Different threads may be operating on the same data, or they may be operating on different data. Consider again our example above. In contrast to that situation, an example of task parallelism might involve two threads, each performing a unique statistical operation on the array of elements. The threads again are operating in parallel on separate computing cores, but each is performing a unique operation. Fundamentally, then, data parallelism involves the distribution of data across multiple cores, and task parallelism involves the distribution of tasks across multiple cores, as shown in Figure 4.5. However, data and task parallelism are not mutually exclusive, and an application may in fact use a hybrid of these two strategies.

      Parallelism is divided into data parallelism and task parallelism. Data parallelism involves distributing portions of data across multiple cores, with each core executing the same operation on different data subsets. This is useful for operations like array processing. Task parallelism, on the other hand, distributes different computational tasks across cores, where each thread may perform distinct operations on the same or different data. Many modern applications employ a hybrid approach, using both forms to maximize efficiency. These parallelism models are crucial in fields like machine learning, simulation, and large-scale computing, where workload distribution greatly impacts performance.
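
      A sketch of the data-parallel array sum described above, with two Pthreads each summing half of the array on (potentially) different cores; N and the helper names are illustrative.

      #include <pthread.h>
      #include <stdio.h>

      #define N 1000

      static int  v[N];
      static long partial[2];

      static void *sum_range(void *arg) {
          int id    = (int)(long)arg;
          int start = id * (N / 2);         /* thread 0: [0, N/2), thread 1: [N/2, N) */
          int end   = start + (N / 2);
          long s = 0;
          for (int i = start; i < end; i++)
              s += v[i];
          partial[id] = s;                  /* each thread writes only its own slot */
          return NULL;
      }

      int main(void) {
          for (int i = 0; i < N; i++) v[i] = i;

          pthread_t t[2];
          for (long id = 0; id < 2; id++)
              pthread_create(&t[id], NULL, sum_range, (void *)id);
          for (int id = 0; id < 2; id++)
              pthread_join(t[id], NULL);

          printf("total = %ld\n", partial[0] + partial[1]);
          return 0;
      }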

    63. AMDAHL'S LAW

      Amdahl’s Law defines the theoretical limit on performance improvement when increasing computing cores. The law considers the serial portion (S) of a program that cannot be parallelized, demonstrating that speedup is constrained by sequential execution. For example, if a program is 75% parallel, adding more cores increases speedup but with diminishing returns. The formula shows that as N (cores) approaches infinity, the maximum speedup is 1/S, meaning a significant serial component limits performance gains. The law underscores the importance of optimizing both parallel and serial portions of an application. It plays a crucial role in system design, influencing decisions on parallelization strategies and hardware resource allocation.
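
      Written out, with S the serial fraction and N the number of cores (the 75%-parallel example above corresponds to S = 0.25):

      \[
        \text{speedup} \;\le\; \frac{1}{S + \frac{1 - S}{N}},
        \qquad
        \lim_{N \to \infty} \text{speedup} \;=\; \frac{1}{S}.
      \]

      For instance, with S = 0.25 and N = 4 the bound is 1 / (0.25 + 0.75/4) ≈ 2.29, and no number of cores can push the speedup past 1/0.25 = 4.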

    64. 4.2.1 Programming Challenges The trend toward multicore systems continues to place pressure on system designers and application programmers to make better use of the multiple computing cores. Designers of operating systems must write scheduling algorithms that use multiple processing cores to allow the parallel execution shown in Figure 4.4. For application programmers, the challenge is to modify existing programs as well as design new programs that are multithreaded. In general, five areas present challenges in programming for multicore systems: 1. Identifying tasks. This involves examining applications to find areas that can be divided into separate, concurrent tasks. Ideally, tasks are independent of one another and thus can run in parallel on individual cores. 2. Balance. While identifying tasks that can run in parallel, programmers must also ensure that the tasks perform equal work of equal value. In some instances, a certain task may not contribute as much value to the overall process as other tasks. Using a separate execution core to run that task may not be worth the cost. 3. Data splitting. Just as applications are divided into separate tasks, the data accessed and manipulated by the tasks must be divided to run on separate cores. 4. Data dependency. The data accessed by the tasks must be examined for dependencies between two or more tasks. When one task depends on data from another, programmers must ensure that the execution of the tasks is synchronized to accommodate the data dependency. We examine such strategies in Chapter 6. 5. Testing and debugging. When a program is running in parallel on multiple cores, many different execution paths are possible. Testing and debugging such concurrent programs is inherently more difficult than testing and debugging single-threaded applications. Because of these challenges, many software developers argue that the advent of multicore systems will require an entirely new approach to designing software systems in the future. (Similarly, many computer science educators believe that software development must be taught with increased emphasis on parallel programming.)

      Programming for multicore systems introduces significant challenges. Identifying tasks involves finding independent operations that can be executed concurrently. Balancing workload is critical to ensure efficient use of all cores, preventing bottlenecks from uneven task distribution. Data splitting requires dividing data efficiently so that multiple threads can process different sections in parallel. Data dependency complicates execution, as synchronization mechanisms must prevent race conditions and inconsistencies. Testing and debugging are inherently more difficult in parallel programs due to non-deterministic execution paths. These challenges necessitate new programming paradigms and tools to develop efficient, scalable parallel applications that fully leverage multicore architectures.

    65. Multicore Programming Earlier in the history of computer design, in res

      As computer architecture evolved, multicore processors became a standard solution for increased performance. In a single-core system, concurrency is achieved through time-sharing, interleaving multiple threads over time. However, multicore systems enable true parallelism, allowing different threads to execute simultaneously on separate cores. The key distinction is that concurrency allows multiple tasks to make progress, whereas parallelism executes tasks simultaneously. Before multicore processors, CPUs relied on scheduling algorithms to create an illusion of parallel execution. With modern systems, operating systems and applications must be designed to efficiently allocate tasks across multiple cores. This shift demands advanced scheduling techniques and optimized multithreaded software development to fully utilize the hardware’s parallel processing capabilities.

    66. The benefits of multithreaded programming can be broken down into four major categories: 1. Responsiveness. Multithreading an interactive application may allow a program to continue running even if part of it is blocked or is performing a lengthy operation, thereby increasing responsiveness to the user. This quality is especially useful in designing user interfaces. For instance, consider what happens when a user clicks a button that results in the performance of a time-consuming operation. A single-threaded application would be unresponsive to the user until the operation had been completed. In contrast, if the time-consuming operation is performed in a separate, asynchronous thread, the application remains responsive to the user. 2. Resource sharing. Processes can share resources only through techniques such as shared memory and message passing. Such techniques must be explicitly arranged by the programmer. However, threads share the memory and the resources of the process to which they belong by default. The benefit of sharing code and data is that it allows an application to have several different threads of activity within the same address space. 3. Economy. Allocating memory and resources for process creation is costly. Because threads share the resources of the process to which they belong, it is more economical to create and context-switch threads. Empirically gauging the difference in overhead can be difficult, but in general thread creation consumes less time and memory than process creation. Additionally, context switching is typically faster between threads than between processes. 4. Scalability. The benefits of multithreading can be even greater in a multiprocessor architecture, where threads may be running in parallel on different processing cores. A single-threaded process can run on only one processor, regardless how many are available. We explore this issue further in the following section.

      Multithreaded programming provides key advantages, making applications more efficient and responsive. Responsiveness is crucial for user-friendly interfaces, ensuring a program remains functional even when one of its threads is blocked or performing a lengthy operation. Resource sharing lets threads use the memory and resources of their process by default, without the explicitly arranged mechanisms, such as shared memory or message passing, that separate processes need. Economy is another major advantage, as creating and context-switching threads is significantly less resource-intensive than creating entire processes, reducing memory and CPU overhead. Lastly, scalability enables multithreaded applications to take full advantage of multiprocessor systems, distributing the workload across multiple cores for enhanced performance. These benefits collectively make multithreading a fundamental concept in modern computing, particularly in systems requiring parallel execution, real-time processing, and responsive interactive applications.

    67. 4.1.1 Motivation Most software applications that run on modern computers and mobile devices are multithreaded. An application typically is implemented as a separate process with several threads of control. Below we highlight a few examples of multithreaded applications: An application that creates photo thumbnails from a collection of images may use a separate thread to generate a thumbnail from each separate image. A web browser might have one thread display images or text while another thread retrieves data from the network. A word processor may have a thread for displaying graphics, another thread for responding to keystrokes from the user, and a third thread for performing spelling and grammar checking in the background. Applications can also be designed to leverage processing capabilities on multicore systems. Such applications can perform several CPU-intensive tasks in parallel across the multiple computing cores. In certain situations, a single application may be required to perform several similar tasks. For example, a web server accepts client requests for web pages, images, sound, and so forth. A busy web server may have several (perhaps thousands of) clients concurrently accessing it. If the web server ran as a traditional single-threaded process, it would be able to service only one client at a time, and a client might have to wait a very long time for its request to be serviced. One solution is to have the server run as a single process that accepts requests. When the server receives a request, it creates a separate process to service that request. In fact, this process-creation method was in common use before threads became popular. Process creation is time consuming and resource intensive, however. If the new process will perform the same tasks as the existing process, why incur all that overhead? It is generally more efficient to use one process that contains multiple threads. If the web-server process is multithreaded, the server will create a separate thread that listens for client requests. When a request is made, rather than creating another process, the server creates a new thread to service the request and resumes listening for additional requests. This is illustrated in Figure 4.2.

      This section highlights why multithreading is essential in modern computing. Most applications, including web browsers, word processors, and web servers, utilize multiple threads to handle different tasks simultaneously. The section explains how multithreading improves performance by reducing the overhead of creating new processes, making execution faster and more efficient. For instance, web servers can handle multiple client requests simultaneously through threads rather than spawning separate processes. Additionally, operating system kernels, such as Linux, use threads for device management and interrupt handling. Multithreading is also crucial for computational tasks like sorting, artificial intelligence, and graphics processing, leveraging the power of multicore processors.
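
      A minimal sketch of the thread-per-request pattern described above, using POSIX threads and sockets in C. The port number, the canned response, and the function names are placeholders rather than anything specified in the text, and error handling is omitted.

        /* Thread-per-request sketch: one listener accepts connections and
           hands each one to a new worker thread, then resumes listening. */
        #include <pthread.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <unistd.h>
        #include <arpa/inet.h>
        #include <netinet/in.h>
        #include <sys/socket.h>

        static void *handle_request(void *arg) {
            int client = *(int *)arg;
            free(arg);
            const char *reply = "HTTP/1.0 200 OK\r\n\r\nhello\n";  /* placeholder response */
            write(client, reply, strlen(reply));
            close(client);
            return NULL;
        }

        int main(void) {
            int listener = socket(AF_INET, SOCK_STREAM, 0);
            struct sockaddr_in addr = {0};
            addr.sin_family = AF_INET;
            addr.sin_addr.s_addr = htonl(INADDR_ANY);
            addr.sin_port = htons(8080);                /* illustrative port */

            bind(listener, (struct sockaddr *)&addr, sizeof(addr));
            listen(listener, 16);

            while (1) {
                int *client = malloc(sizeof(int));
                *client = accept(listener, NULL, NULL);

                /* create a thread to service the request */
                pthread_t tid;
                pthread_create(&tid, NULL, handle_request, client);
                pthread_detach(tid);
            }
        }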

    68. Threads & Concurrency

      A thread is the smallest unit of execution within a process, while concurrency is the ability to make progress on multiple tasks or threads in overlapping time periods to improve efficiency.

    69. 4.1 Overview A thread is a basic unit of CPU utilization; it comprises a thread ID, a program counter (PC), a register set, and a stack. It shares with other threads belonging to the same process its code section, data section, and other operating-system resources, such as open files and signals. A traditional process has a single thread of control. If a process has multiple threads of control, it can perform more than one task at a time. Figure 4.1 illustrates the difference between a traditional single-threaded process and a multithreaded process.

      This section introduces threads as the fundamental unit of CPU execution, explaining their components like thread ID, program counter, registers, and stack. Unlike traditional single-threaded processes, multithreaded processes share code, data, and resources while maintaining separate execution threads. The section emphasizes how multithreading enables concurrent task execution within a single process, improving efficiency. The accompanying figure visually contrasts single-threaded and multithreaded processes. This discussion sets the foundation for understanding how operating systems manage threads. A key takeaway is that threads help maximize CPU utilization, making modern computing systems more efficient by handling multiple operations simultaneously within a single process.
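
      The sharing described here can be seen in a small pthreads sketch: both threads update the same global variable in the process's data section, while each keeps its own local variable on its own stack. The variable names are illustrative, and the unsynchronized update is deliberate, to foreshadow the synchronization issues discussed later in the text.

        /* Threads of one process share globals (data section) but each
           has its own stack, program counter, and register set. */
        #include <pthread.h>
        #include <stdio.h>

        static int shared_counter = 0;          /* shared by all threads */

        static void *worker(void *arg) {
            int local = 0;                      /* lives on this thread's own stack */
            for (int i = 0; i < 1000; i++) {
                local++;
                shared_counter++;               /* unsynchronized update */
            }
            printf("local = %d\n", local);
            return NULL;
        }

        int main(void) {
            pthread_t t1, t2;
            pthread_create(&t1, NULL, worker, NULL);
            pthread_create(&t2, NULL, worker, NULL);
            pthread_join(t1, NULL);
            pthread_join(t2, NULL);
            printf("shared_counter = %d (may be less than 2000 without synchronization)\n",
                   shared_counter);
            return 0;
        }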

    70. 3.8.2.1 Android RPC Although RPCs are typically associated with client-server computing in a distributed system, they can also be used as a form of IPC between processes running on the same system. The Android operating system has a rich set of IPC mechanisms contained in its binder framework, including RPCs that allow one process to request services from another process. Android defines an application component as a basic building block that provides utility to an Android application, and an app may combine multiple application components to provide functionality to an app. One such application component is a service, which has no user interface but instead runs in the background while executing long-running operations or performing work for remote processes. Examples of services include playing music in the background and retrieving data over a network connection on behalf of another process, thereby preventing the other process from blocking as the data are being downloaded. When a client app invokes the bindService() method of a service, that service is “bound” and available to provide client-server communication using either message passing or RPCs. A bound service must extend the Android class Service and must implement the method onBind(), which is invoked when a client calls bindService(). In the case of message passing, the onBind() method returns a Messenger service, which is used for sending messages from the client to the service. The Messenger service is only one-way; if the service must send a reply back to the client, the client must also provide a Messenger service, which is contained in the replyTo field of the Message object sent to the service. The service can then send messages back to the client. To provide RPCs, the onBind() method must return an interface representing the methods in the remote object that clients use to interact with the service. This interface is written in regular Java syntax and uses the Android Interface Definition Language—AIDL—to create stub files, which serve as the client interface to remote services. Here, we briefly outline the process required to provide a generic remote service named remoteMethod() using AIDL and the binder service. The interface for the remote service appears as follows: /* RemoteService.aidl */ interface RemoteService {   boolean remoteMethod(int x, double y); { This file is written as RemoteService.aidl. The Android development kit will use it to generate a .java interface from the .aidl file, as well as a stub that serves as the RPC interface for this service. The server must implement the interface generated by the .aidl file, and the implementation of this interface will be called when the client invokes remoteMethod(). When a client calls bindService(), the onBind() method is invoked on the server, and it returns the stub for the RemoteService object to the client. The client can then invoke the remote method as follows: RemoteService service;    . . . service.remoteMethod(3, 0.14); Internally, the Android binder framework handles parameter marshaling, transferring marshaled parameters between processes, and invoking the necessary implementation of the service, as well as sending any return values back to the client process.

      Android extends the concept of RPCs beyond traditional client-server models by integrating them into its IPC mechanisms using the binder framework. In Android, applications often consist of multiple components, including services that run in the background and perform tasks for other processes. Through the Android Interface Definition Language (AIDL), developers define remote interfaces that allow communication between different application processes. The binder framework manages marshaling and unmarshaling of parameters, ensuring efficient communication. This approach is essential for resource management in mobile systems, as it enables processes to offload work to background services while maintaining system responsiveness. By leveraging RPCs, Android provides a structured way to manage process interactions while abstracting the complexities of low-level data transfer.

    71. 3.8.2 Remote Procedure Calls One of the most common forms of remote service is the RPC paradigm, which was designed as a way to abstract the procedure-call mechanism for use between systems with network connections. It is similar in many respects to the IPC mechanism described in Section 3.4, and it is usually built on top of such a system. Here, however, because we are dealing with an environment in which the processes are executing on separate systems, we must use a message-based communication scheme to provide remote service. In contrast to IPC messages, the messages exchanged in RPC communication are well structured and are thus no longer just packets of data. Each message is addressed to an RPC daemon listening to a port on the remote system, and each contains an identifier specifying the function to execute and the parameters to pass to that function. The function is then executed as requested, and any output is sent back to the requester in a separate message. A port in this context is simply a number included at the start of a message packet. Whereas a system normally has one network address, it can have many ports within that address to differentiate the many network services it supports. If a remote process needs a service, it addresses a message to the proper port. For instance, if a system wished to allow other systems to be able to list its current users, it would have a daemon supporting such an RPC attached to a port—say, port 3027. Any remote system could obtain the needed information (that is, the list of current users) by sending an RPC message to port 3027 on the server. The data would be received in a reply message. The semantics of RPCs allows a client to invoke a procedure on a remote host as it would invoke a procedure locally. The RPC system hides the details that allow communication to take place by providing a stub on the client side. Typically, a separate stub exists for each separate remote procedure. When the client invokes a remote procedure, the RPC system calls the appropriate stub, passing it the parameters provided to the remote procedure. This stub locates the port on the server and marshals the parameters. The stub then transmits a message to the server using message passing. A similar stub on the server side receives this message and invokes the procedure on the server. If necessary, return values are passed back to the client using the same technique. On Windows systems, stub code is compiled from a specification written in the Microsoft Interface Definition Language (MIDL), which is used for defining the interfaces between client and server programs. Parameter marshaling addresses the issue concerning differences in data representation on the client and server machines. Consider the representation of 32-bit integers. Some systems (known as big-endian) store the most significant byte first, while other systems (known as little-endian) store the least significant byte first. Neither order is “better” per se; rather, the choice is arbitrary within a computer architecture. To resolve differences like this, many RPC systems define a machine-independent representation of data. One such representation is known as external data representation (XDR). On the client side, parameter marshaling involves converting the machine-dependent data into XDR before they are sent to the server. On the server side, the XDR data are unmarshaled and converted to the machine-dependent representation for the server. 
Another important issue involves the semantics of a call. Whereas local procedure calls fail only under extreme circumstances, RPCs can fail, or be duplicated and executed more than once, as a result of common network errors. One way to address this problem is for the operating system to ensure that messages are acted on exactly once, rather than at most once. Most local procedure calls have the “exactly once” functionality, but it is more difficult to implement. First, consider “at most once.” This semantic can be implemented by attaching a timestamp to each message. The server must keep a history of all the timestamps of messages it has already processed or a history large enough to ensure that repeated messages are detected. Incoming messages that have a timestamp already in the history are ignored. The client can then send a message one or more times and be assured that it only executes once. For “exactly once,” we need to remove the risk that the server will never receive the request. To accomplish this, the server must implement the “at most once” protocol described above but must also acknowledge to the client that the RPC call was received and executed. These ACK messages are common throughout networking. The client must resend each RPC call periodically until it receives the ACK for that call. Yet another important issue concerns the communication between a server and a client. With standard procedure calls, some form of binding takes place during link, load, or execution time (Chapter 9) so that a procedure call's name is replaced by the memory address of the procedure call. The RPC scheme requires a similar binding of the client and the server port, but how does a client know the port numbers on the server? Neither system has full information about the other, because they do not share memory. Two approaches are common. First, the binding information may be predetermined, in the form of fixed port addresses. At compile time, an RPC call has a fixed port number associated with it. Once a program is compiled, the server cannot change the port number of the requested service. Second, binding can be done dynamically by a rendezvous mechanism. Typically, an operating system provides a rendezvous (also called a matchmaker) daemon on a fixed RPC port. A client then sends a message containing the name of the RPC to the rendezvous daemon requesting the port address of the RPC it needs to execute. The port number is returned, and the RPC calls can be sent to that port until the process terminates (or the server crashes). This method requires the extra overhead of the initial request but is more flexible than the first approach. Figure 3.29 shows a sample interaction.

      Remote Procedure Calls (RPCs) abstract the communication process between distributed systems by allowing a client to invoke procedures on a remote machine as if they were local functions. Unlike raw socket communication, which requires applications to structure their own messages, RPCs handle function calls transparently. A key component of RPCs is parameter marshaling, which ensures data compatibility between different architectures (e.g., big-endian vs. little-endian). Additionally, RPC implementations must handle network failures and duplicate execution risks, requiring techniques such as timestamping and acknowledgment messages. The use of a matchmaker service for dynamic binding enhances flexibility, allowing clients to locate available RPC services at runtime rather than relying on fixed port assignments.
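
      As a small illustration of marshaling for byte order, the snippet below converts a 32-bit value to network byte order (big-endian) before it travels and back again on arrival; htonl()/ntohl() stand in here for a full machine-independent encoding such as XDR.

        /* Byte-order marshaling sketch: the sender converts host order to a
           machine-independent (network, big-endian) order; the receiver converts back. */
        #include <stdio.h>
        #include <stdint.h>
        #include <arpa/inet.h>

        int main(void) {
            uint32_t host_value = 3027;                 /* e.g., a port or parameter */
            uint32_t wire_value = htonl(host_value);    /* marshal before transmission */

            /* ... wire_value travels inside the RPC message ... */

            uint32_t received = ntohl(wire_value);      /* unmarshal on the other side */
            printf("host %u -> wire 0x%08x -> host %u\n",
                   host_value, wire_value, received);
            return 0;
        }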

    72. 3.8.1 Sockets

      Sockets serve as fundamental building blocks for communication in client-server systems. They enable processes running on different machines to exchange data efficiently using network protocols like TCP and UDP. A socket is uniquely identified by an IP address and a port number, allowing a server to listen for incoming client requests. In a TCP-based system, once a connection is established, a persistent communication channel is created between the client and server, ensuring reliable data exchange. The use of well-known ports for standard services (e.g., HTTP on port 80) simplifies service discovery. Java simplifies socket programming by providing built-in classes like ServerSocket for the server and Socket for the client, reducing the complexity of managing low-level network interactions.
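
      The textbook's socket examples use Java; a roughly equivalent client-side sketch in C is shown below. The loopback address and port 6013 are illustrative values, and error handling is kept minimal.

        /* Client-side socket sketch: connect to (IP address, port) and read a reply. */
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>
        #include <arpa/inet.h>
        #include <sys/socket.h>

        int main(void) {
            int sock = socket(AF_INET, SOCK_STREAM, 0);          /* TCP socket */

            struct sockaddr_in server = {0};
            server.sin_family = AF_INET;
            server.sin_port = htons(6013);                       /* illustrative port */
            inet_pton(AF_INET, "127.0.0.1", &server.sin_addr);   /* loopback address */

            if (connect(sock, (struct sockaddr *)&server, sizeof(server)) == 0) {
                char buf[128];
                ssize_t n = read(sock, buf, sizeof(buf) - 1);    /* read the server's reply */
                if (n > 0) {
                    buf[n] = '\0';
                    printf("%s", buf);
                }
            }
            close(sock);
            return 0;
        }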

    73. 3.7.4.2 Named Pipes Ordinary pipes provide a simple mechanism for allowing a pair of processes to communicate. However, ordinary pipes exist only while the processes are communicating with one another. On both UNIX and Windows systems, once the processes have finished communicating and have terminated, the ordinary pipe ceases to exist. Named pipes provide a much more powerful communication tool. Communication can be bidirectional, and no parent–child relationship is required. Once a named pipe is established, several processes can use it for communication. In fact, in a typical scenario, a named pipe has several writers. Additionally, named pipes continue to exist after communicating processes have finished. Both UNIX and Windows systems support named pipes, although the details of implementation differ greatly. Next, we explore named pipes in each of these systems. Named pipes are referred to as FIFOs in UNIX systems. Once created, they appear as typical files in the file system. A FIFO is created with the mkfifo() system call and manipulated with the ordinary open(), read(), write(), and close() system calls. It will continue to exist until it is explicitly deleted from the file system. Although FIFOs allow bidirectional communication, only half-duplex transmission is permitted. If data must travel in both directions, two FIFOs are typically used. Additionally, the communicating processes must reside on the same machine. If intermachine communication is required, sockets (Section 3.8.1) must be used. Named pipes on Windows systems provide a richer communication mechanism than their UNIX counterparts. Full-duplex communication is allowed, and the communicating processes may reside on either the same or different machines. Additionally, only byte-oriented data may be transmitted across a UNIX FIFO, whereas Windows systems allow either byte- or message-oriented data. Named pipes are created with the CreateNamedPipe() function, and a client can connect to a named pipe using ConnectNamedPipe(). Communication over the named pipe can be accomplished using the ReadFile() and WriteFile() functions.

      Windows systems implement ordinary pipes, called anonymous pipes, with the CreatePipe() function. The difference from UNIX is that Windows requires explicit handling of pipe inheritance for child processes. The parent creates the pipe, and when invoking CreateProcess(), the parent's pipe handles must be passed to the child. This is achieved by manipulating the STARTUPINFO structure to redirect the child’s standard input and output to the pipe’s read and write ends.

      Once the child process is created, the parent writes to the pipe using WriteFile(), and the child reads from the pipe using ReadFile(). After communication, both ends of the pipe are closed. The main advantage of Windows anonymous pipes is the ability to explicitly control the inheritance of pipe handles, ensuring that the child only has access to the required pipe ends. This approach facilitates secure and isolated communication between the parent and child processes.
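
      A compressed sketch of the parent side of this procedure is shown below. It assumes a helper program, here called child.exe, that reads its standard input; that name, like the message text, is a placeholder, and error handling is omitted.

        /* Windows anonymous-pipe sketch: the parent creates the pipe and passes
           the read end to the child as its standard input. */
        #include <windows.h>
        #include <string.h>

        int main(void) {
            HANDLE readEnd, writeEnd;
            SECURITY_ATTRIBUTES sa = { sizeof(SECURITY_ATTRIBUTES), NULL, TRUE };

            CreatePipe(&readEnd, &writeEnd, &sa, 0);
            SetHandleInformation(writeEnd, HANDLE_FLAG_INHERIT, 0); /* keep write end private */

            STARTUPINFOA si;
            PROCESS_INFORMATION pi;
            ZeroMemory(&si, sizeof(si));
            si.cb = sizeof(si);
            si.dwFlags = STARTF_USESTDHANDLES;
            si.hStdInput  = readEnd;                        /* child reads from the pipe */
            si.hStdOutput = GetStdHandle(STD_OUTPUT_HANDLE);
            si.hStdError  = GetStdHandle(STD_ERROR_HANDLE);

            char cmd[] = "child.exe";                       /* placeholder child program */
            CreateProcessA(NULL, cmd, NULL, NULL, TRUE, 0, NULL, NULL, &si, &pi);

            CloseHandle(readEnd);                           /* parent no longer needs it */

            const char *msg = "Greetings";
            DWORD written;
            WriteFile(writeEnd, msg, (DWORD)strlen(msg) + 1, &written, NULL);
            CloseHandle(writeEnd);                          /* signals end-of-data to the child */

            WaitForSingleObject(pi.hProcess, INFINITE);
            CloseHandle(pi.hProcess);
            CloseHandle(pi.hThread);
            return 0;
        }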

    74. 3.7.4.1 Ordinary Pipes

      In UNIX systems, the pipe() system call is used to create an ordinary pipe. The pipe is represented by an array of file descriptors (fd[0] for reading and fd[1] for writing). After creating the pipe, a fork() system call creates a child process, and the parent can write to the pipe while the child reads from it. To ensure proper communication, the unused ends of the pipe are closed in both processes. When the parent writes to the pipe and the child reads from it, the child will receive the message sent by the parent. After communication, the parent and child close the appropriate ends of the pipe. This model ensures that communication is restricted to the parent-child relationship and is unidirectional, requiring two pipes for bidirectional communication.
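
      A condensed sketch of the sequence just described: create the pipe, fork, close the unused ends, then write in the parent and read in the child.

        /* Ordinary pipe sketch: parent writes, child reads (unidirectional). */
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>
        #include <sys/wait.h>

        #define READ_END  0
        #define WRITE_END 1

        int main(void) {
            int fd[2];
            char buf[32];

            if (pipe(fd) == -1)
                return 1;

            if (fork() == 0) {                       /* child: reader */
                close(fd[WRITE_END]);                /* close unused write end */
                ssize_t n = read(fd[READ_END], buf, sizeof(buf) - 1);
                buf[n > 0 ? n : 0] = '\0';
                printf("child read: %s\n", buf);
                close(fd[READ_END]);
            } else {                                 /* parent: writer */
                close(fd[READ_END]);                 /* close unused read end */
                const char *msg = "Greetings";
                write(fd[WRITE_END], msg, strlen(msg));
                close(fd[WRITE_END]);
                wait(NULL);
            }
            return 0;
        }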

    75. 3.7.3 Windows The Windows operating system is an example of modern design that employs modularity to increase functionality and decrease the time needed to implement new features. Windows provides support for multiple operating environments, or subsystems. Application programs communicate with these subsystems via a message-passing mechanism. Thus, application programs can be considered clients of a subsystem server. The message-passing facility in Windows is called the advanced local procedure call (ALPC) facility. It is used for communication between two processes on the same machine. It is similar to the standard remote procedure call (RPC) mechanism that is widely used, but it is optimized for and specific to Windows. (Remote procedure calls are covered in detail in Section 3.8.2.) Like Mach, Windows uses a port object to establish and maintain a connection between two processes. Windows uses two types of ports: connection ports and communication ports. Server processes publish connection-port objects that are visible to all processes. When a client wants services from a subsystem, it opens a handle to the server's connection-port object and sends a connection request to that port. The server then creates a channel and returns a handle to the client. The channel consists of a pair of private communication ports: one for client–server messages, the other for server–client messages. Additionally, communication channels support a callback mechanism that allows the client and server to accept requests when they would normally be expecting a reply. When an ALPC channel is created, one of three message-passing techniques is chosen: 1. For small messages (up to 256 bytes), the port's message queue is used as intermediate storage, and the messages are copied from one process to the other. 2. Larger messages must be passed through a section object, which is a region of shared memory associated with the channel. 3. When the amount of data is too large to fit into a section object, an API is available that allows server processes to read and write directly into the address space of a client. The client has to decide when it sets up the channel whether it will need to send a large message. If the client determines that it does want to send large messages, it asks for a section object to be created. Similarly, if the server decides that replies will be large, it creates a section object. So that the section object can be used, a small message is sent that contains a pointer and size information about the section object. This method is more complicated than the first method listed above, but it avoids data copying. The structure of advanced local procedure calls in Windows is shown in Figure 3.19. Figure 3.19 Advanced local procedure calls in Windows. It is important to note that the ALPC facility in Windows is not part of the Windows API and hence is not visible to the application programmer. Rather, applications using the Windows API invoke standard remote procedure calls. When the RPC is being invoked on a process on the same system, the RPC is handled indirectly through an ALPC procedure call. Additionally, many kernel services use ALPC to communicate with client processes. 3.7.4 Pipes A pipe acts as a conduit allowing two processes to communicate. Pipes were one of the first IPC mechanisms in early UNIX systems. They typically provide one of the simpler ways for processes to communicate with one another, although they also have some limitations. 
In implementing a pipe, four issues must be considered: 1. Does the pipe allow bidirectional communication, or is communication unidirectional? 2. If two-way communication is allowed, is it half duplex (data can travel only one way at a time) or full duplex (data can travel in both directions at the same time)? 3. Must a relationship (such as parent–child) exist between the communicating processes? 4. Can the pipes communicate over a network, or must the communicating processes reside on the same machine? In the following sections, we explore two common types of pipes used on both UNIX and Windows systems: ordinary pipes and named pipes.

      Windows ALPC (advanced local procedure call) provides a message-passing mechanism for communication between processes on the same machine, optimized for the Windows environment. Server processes publish connection ports that clients use to request services, after which a pair of private communication ports forms the channel. ALPC supports different message-passing methods depending on message size: small messages are copied through the port's message queue, while larger ones travel via section objects, which are regions of shared memory. A client decides when it sets up the channel whether it will need to send large messages, so that a section object can be created and data copying avoided. While ALPC is a powerful IPC mechanism, it is not directly exposed to application programmers; applications invoke standard remote procedure calls, which are handled through ALPC when both processes reside on the same system.

      The section on pipes covers a basic but foundational IPC method in UNIX and Windows systems, where data is transmitted between two processes through a unidirectional or bidirectional conduit. The design of a pipe raises several questions: whether it allows bidirectional communication, whether a relationship (such as parent–child) is required between the communicating processes, and whether communication can occur over a network or only on the same machine. Named pipes, an extension of ordinary pipes, allow more flexible communication: they can be used by unrelated processes, they persist after the communicating processes finish, and on Windows they can even connect processes on different machines, whereas UNIX FIFOs remain limited to a single machine.
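
      A minimal FIFO writer sketch following the UNIX calls named in the excerpt; the path /tmp/demo_fifo is an arbitrary example, and a reader would open the same path with O_RDONLY and call read().

        /* Named pipe (FIFO) writer sketch: the FIFO appears in the file system
           and continues to exist until explicitly deleted. */
        #include <string.h>
        #include <fcntl.h>
        #include <unistd.h>
        #include <sys/stat.h>

        int main(void) {
            const char *path = "/tmp/demo_fifo";     /* illustrative name */

            mkfifo(path, 0666);                      /* create the FIFO if absent */

            int fd = open(path, O_WRONLY);           /* blocks until a reader opens it */
            const char *msg = "hello over a FIFO\n";
            write(fd, msg, strlen(msg));
            close(fd);

            /* unlink(path); -- delete the FIFO once it is no longer needed */
            return 0;
        }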

    76. 3.7.1 POSIX Shared Memory Several IPC mechanisms are available for POSIX systems, including shared memory and message passing. Here, we explore the POSIX API for shared memory. POSIX shared memory is organized using memory-mapped files, which associate the region of shared memory with a file. A process must first create a shared-memory object using the shm_open() system call, as follows: fd = shm_open(name, O_CREAT | O_RDWR, 0666); The first parameter specifies the name of the shared-memory object. Processes that wish to access this shared memory must refer to the object by this name. The subsequent parameters specify that the shared-memory object is to be created if it does not yet exist (O_CREAT) and that the object is open for reading and writing (O_RDWR). The last parameter establishes the file-access permissions of the shared-memory object. A successful call to shm_open() returns an integer file descriptor for the shared-memory object. Once the object is established, the ftruncate() function is used to configure the size of the object in bytes. The call ftruncate(fd, 4096); sets the size of the object to 4,096 bytes. Finally, the mmap() function establishes a memory-mapped file containing the shared-memory object. It also returns a pointer to the memory-mapped file that is used for accessing the shared-memory object. The programs shown in Figure 3.16 and Figure 3.17 use the producer–consumer model in implementing shared memory. The producer establishes a shared-memory object and writes to shared memory, and the consumer reads from shared memory. #include <stdio.h> #include <stdlib.h> #include <string.h> #include <fcntl.h> #include <sys/shm.h> #include <sys/stat.h> #include <sys/mman.h> int main() { /* the size (in bytes) of shared memory object */ const int SIZE = 4096; /* name of the shared memory object */ const char *name = "OS"; /* strings written to shared memory */ const char *message_0 = "Hello"; const char *message_1 = "World!"; /* shared memory file descriptor */ int fd; /* pointer to shared memory obect */ char *ptr;    /* create the shared memory object */    fd = shm_open(name,O_CREAT | O_RDWR,0666);    /* configure the size of the shared memory object */    ftruncate(fd, SIZE);    /* memory map the shared memory object */    ptr = (char *)     mmap(0, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);    /* write to the shared memory object */    sprintf(ptr,"%s",message_0);    ptr += strlen(message_0);    sprintf(ptr,"%s",message_1);    ptr += strlen(message_1);    return 0; } Figure 3.16 Producer process illustrating POSIX shared-memory API. #include <stdio.h> #include <stdlib.h> #include <fcntl.h> #include <sys/shm.h> #include <sys/stat.h> #include <sys/mman.h> int main() { /* the size (in bytes) of shared memory object */ const int SIZE = 4096; /* name of the shared memory object */ const char *name = "OS"; /* shared memory file descriptor */ int fd; /* pointer to shared memory obect */ char *ptr;    /* open the shared memory object */    fd = shm_open(name, O_RDONLY, 0666);    /* memory map the shared memory object */    ptr = (char *)     mmap(0, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);    /* read from the shared memory object */    printf("%s",(char *)ptr);    /* remove the shared memory object */    shm_unlink(name);    return 0; } Figure 3.17 Consumer process illustrating POSIX shared-memory API. The producer, shown in Figure 3.16, creates a shared-memory object named OS and writes the infamous string “Hello World!” to shared memory. 
The program memory-maps a shared-memory object of the specified size and allows writing to the object. The flag MAP_SHARED specifies that changes to the shared-memory object will be visible to all processes sharing the object. Notice that we write to the shared-memory object by calling the sprintf() function and writing the formatted string to the pointer ptr. After each write, we must increment the pointer by the number of bytes written. The consumer process, shown in Figure 3.17, reads and outputs the contents of the shared memory. The consumer also invokes the shm_unlink() function, which removes the shared-memory segment after the consumer has accessed it. We provide further exercises using the POSIX shared-memory API in the programming exercises at the end of this chapter. Additionally, we provide more detailed coverage of memory mapping in Section 13.5. 3.7.2 Mach Message Passing As an example of message passing, we next consider the Mach operating system. Mach was especially designed for distributed systems, but was shown to be suitable for desktop and mobile systems as well, as evidenced by its inclusion in the MacOS and iOS operating systems, as discussed in Chapter 2. The Mach kernel supports the creation and destruction of multiple tasks, which are similar to processes but have multiple threads of control and fewer associated resources. Most communication in Mach—including all inter-task communication—is carried out by messages. Messages are sent to, and received from, mailboxes, which are called ports in Mach. Ports are finite in size and unidirectional; for two-way communication, a message is sent to one port, and a response is sent to a separate reply port. Each port may have multiple senders, but only one receiver. Mach uses ports to represent resources such as tasks, threads, memory, and processors, while message passing provides an object-oriented approach for interacting with these system resources and services. Message passing may occur between any two ports on the same host or on separate hosts on a distributed system. Associated with each port is a collection of port rights that identify the capabilities necessary for a task to interact with the port. For example, for a task to receive a message from a port, it must have the capability MACH_PORT_RIGHT_RECEIVE for that port. The task that creates a port is that port's owner, and the owner is the only task that is allowed to receive messages from that port. A port's owner may also manipulate the capabilities for a port. This is most commonly done in establishing a reply port. For example, assume that task T1 owns port P1, and it sends a message to port P2, which is owned by task T2. If T1 expects to receive a reply from T2, it must grant T2 the right MACH_PORT_RIGHT_SEND for port P1. Ownership of port rights is at the task level, which means that all threads belonging to the same task share the same port rights. Thus, two threads belonging to the same task can easily communicate by exchanging messages through the per-thread port associated with each thread. When a task is created, two special ports—the Task Self port and the Notify port—are also created. The kernel has receive rights to the Task Self port, which allows a task to send messages to the kernel. The kernel can send notification of event occurrences to a task's Notify port (to which, of course, the task has receive rights). The mach_port_allocate() function call creates a new port and allocates space for its queue of messages. 
It also identifies the rights for the port. Each port right represents a name for that port, and a port can only be accessed via a right. Port names are simple integer values and behave much like UNIX file descriptors. The following example illustrates creating a port using this API: mach_port_t port; // the name of the port right mach_port_allocate(         mach_task_self(), // a task referring to itself         MACH_PORT_RIGHT_RECEIVE, // the right for this port         &port); // the name of the port right Each task also has access to a bootstrap port, which allows a task to register a port it has created with a system-wide bootstrap server. Once a port has been registered with the bootstrap server, other tasks can look up the port in this registry and obtain rights for sending messages to the port. The queue associated with each port is finite in size and is initially empty. As messages are sent to the port, the messages are copied into the queue. All messages are delivered reliably and have the same priority. Mach guarantees that multiple messages from the same sender are queued in first-in, first-out (FIFO) order but does not guarantee an absolute ordering. For instance, messages from two senders may be queued in any order. Mach messages contain the following two fields: A fixed-size message header containing metadata about the message, including the size of the message as well as source and destination ports. Commonly, the sending thread expects a reply, so the port name of the source is passed on to the receiving task, which can use it as a “return address” in sending a reply. A variable-sized body containing data. Messages may be either simple or complex. A simple message contains ordinary, unstructured user data that are not interpreted by the kernel. A complex message may contain pointers to memory locations containing data (known as “out-of-line” data) or may also be used for transferring port rights to another task. Out-of-line data pointers are especially useful when a message must pass large chunks of data. A simple message would require copying and packaging the data in the message; out-of-line data transmission requires only a pointer that refers to the memory location where the data are stored. The function mach_msg() is the standard API for both sending and receiving messages. The value of one of the function's parameters—either MACH_SEND_MSG or MACH_RCV_MSG—indicates if it is a send or receive operation. We now illustrate how it is used when a client task sends a simple message to a server task. Assume there are two ports—client and server—associated with the client and server tasks, respectively. The code in Figure 3.18 shows the client task constructing a header and sending a message to the server, as well as the server task receiving the message sent from the client. 
#include<mach/mach.h> struct message {   mach_msg_header_t header;   int data; }; mach_port_t client; mach_port_t server;        /* Client Code */ struct message message; // construct the header message.header.msgh_size = sizeof(message); message.header.msgh_remote_port = server; message.header.msgh_local_port = client; // send the message mach_msg(&message.header, // message header   MACH_SEND_MSG, // sending a message   sizeof(message), // size of message sent   0, // maximum size of received message - unnecessary   MACH_PORT_NULL, // name of receive port - unnecessary   MACH_MSG_TIMEOUT_NONE, // no time outs   MACH_PORT_NULL // no notify port );       /* Server Code */ struct message message; // receive the message mach_msg(&message.header, // message header   MACH_RCV_MSG, // sending a message   0,  // size of message sent   sizeof(message), // maximum size of received message   server, // name of receive port   MACH_MSG_TIMEOUT_NONE, // no time outs   MACH_PORT_NULL // no notify port ); Figure 3.18 Example program illustrating message passing in Mach. The mach_msg() function call is invoked by user programs for performing message passing. mach_msg() then invokes the function mach_msg_trap(), which is a system call to the Mach kernel. Within the kernel, mach_msg_trap() next calls the function mach_msg_overwrite_trap(), which then handles the actual passing of the message. The send and receive operations themselves are flexible. For instance, when a message is sent to a port, its queue may be full. If the queue is not full, the message is copied to the queue, and the sending task continues. If the port's queue is full, the sender has several options (specified via parameters to mach_msg(): 1. Wait indefinitely until there is room in the queue. 2. Wait at most n milliseconds. 3. Do not wait at all but rather return immediately. 4. Temporarily cache a message. Here, a message is given to the operating system to keep, even though the queue to which that message is being sent is full. When the message can be put in the queue, a notification message is sent back to the sender. Only one message to a full queue can be pending at any time for a given sending thread. The final option is meant for server tasks. After finishing a request, a server task may need to send a one-time reply to the task that requested the service, but it must also continue with other service requests, even if the reply port for a client is full. The major problem with message systems has generally been poor performance caused by copying of messages from the sender's port to the receiver's port. The Mach message system attempts to avoid copy operations by using virtual-memory-management techniques (Chapter 10). Essentially, Mach maps the address space containing the sender's message into the receiver's address space. Therefore, the message itself is never actually copied, as both the sender and receiver access the same memory. This message-management technique provides a large performance boost but works only for intrasystem messages.

      In the section discussing POSIX Shared Memory, the shared memory mechanism relies on memory-mapped files and uses the shm_open(), ftruncate(), and mmap() functions to facilitate inter-process communication (IPC). Shared memory allows processes to communicate efficiently by writing to a memory region that can be accessed by multiple processes. The producer-consumer model is used in this example, where one process (the producer) writes to shared memory, and another (the consumer) reads the data. This method is particularly fast compared to other IPC methods because it avoids the overhead of copying data between processes. However, it is important to note that proper synchronization is required to ensure that concurrent access to shared memory does not lead to data corruption or inconsistency.
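
      One common way to supply the missing synchronization is to keep a process-shared POSIX semaphore inside the mapped region itself, as in the sketch below. The object name /OS-sync and the struct layout are assumptions added for illustration; they are not part of the textbook's example.

        /* Sketch: a process-shared semaphore stored in the shared region lets the
           consumer wait until the producer has finished writing. */
        #include <fcntl.h>
        #include <semaphore.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <unistd.h>

        struct region {
            sem_t ready;            /* posted by producer, waited on by consumer */
            char  text[4096];
        };

        int main(void) {            /* producer side */
            int fd = shm_open("/OS-sync", O_CREAT | O_RDWR, 0666);
            ftruncate(fd, sizeof(struct region));

            struct region *r = mmap(0, sizeof(struct region),
                                    PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

            sem_init(&r->ready, 1, 0);              /* 1 = shared between processes */
            snprintf(r->text, sizeof(r->text), "Hello World!");
            sem_post(&r->ready);                    /* signal: data is ready */

            /* the consumer would mmap the same object, call sem_wait(&r->ready)
               before reading r->text, and shm_unlink("/OS-sync") when done */
            return 0;
        }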

      In Mach message passing, communication between tasks occurs via messages sent to and received from ports. Mach was designed with distributed systems in mind, so messages may pass between ports on the same host or on different hosts. Each port is unidirectional and finite in size, and port rights control which tasks may send to or receive from it. The flexibility of the scheme shows in its queuing behavior: messages from a given sender are queued in FIFO order, and a sender facing a full queue can choose among several waiting strategies. Additionally, Mach uses virtual-memory mapping techniques to avoid copying message bodies, which boosts performance for intrasystem messages.

    77. 3.6 IPC in Message-Passing Systems In Section 3.5, we showed how cooperating processes can communicate in a shared-memory environment. The scheme requires that these processes share a region of memory and that the code for accessing and manipulating the shared memory be written explicitly by the application programmer. Another way to achieve the same effect is for the operating system to provide the means for cooperating processes to communicate with each other via a message-passing facility. Message passing provides a mechanism to allow processes to communicate and to synchronize their actions without sharing the same address space. It is particularly useful in a distributed environment, where the communicating processes may reside on different computers connected by a network. For example, an Internet chat program could be designed so that chat participants communicate with one another by exchanging messages. A message-passing facility provides at least two operations: send(message) and receive(message) Messages sent by a process can be either fixed or variable in size. If only fixed-sized messages can be sent, the system-level implementation is straightforward. This restriction, however, makes the task of programming more difficult. Conversely, variable-sized messages require a more complex system-level implementation, but the programming task becomes simpler. This is a common kind of tradeoff seen throughout operating-system design. If processes P and Q want to communicate, they must send messages to and receive messages from each other: a communication link must exist between them. This link can be implemented in a variety of ways. We are concerned here not with the link's physical implementation (such as shared memory, hardware bus, or network, which are covered in Chapter 19) but rather with its logical implementation. Here are several methods for logically implementing a link and the send()/receive() operations: Direct or indirect communication Synchronous or asynchronous communication Automatic or explicit buffering We look at issues related to each of these features next. 3.6.1 Naming Processes that want to communicate must have a way to refer to each other. They can use either direct or indirect communication. Under direct communication, each process that wants to communicate must explicitly name the recipient or sender of the communication. In this scheme, the send() and receive() primitives are defined as: send(P, message)—Send a message to process P. receive(Q, message)—Receive a message from process Q. A communication link in this scheme has the following properties: A link is established automatically between every pair of processes that want to communicate. The processes need to know only each other's identity to communicate. A link is associated with exactly two processes. Between each pair of processes, there exists exactly one link. This scheme exhibits symmetry in addressing; that is, both the sender process and the receiver process must name the other to communicate. A variant of this scheme employs asymmetry in addressing. Here, only the sender names the recipient; the recipient is not required to name the sender. In this scheme, the send() and receive() primitives are defined as follows: send(P, message)—Send a message to process P. receive(id, message)—Receive a message from any process. The variable id is set to the name of the process with which communication has taken place. 
The disadvantage in both of these schemes (symmetric and asymmetric) is the limited modularity of the resulting process definitions. Changing the identifier of a process may necessitate examining all other process definitions. All references to the old identifier must be found, so that they can be modified to the new identifier. In general, any such hard-coding techniques, where identifiers must be explicitly stated, are less desirable than techniques involving indirection, as described next. With indirect communication, the messages are sent to and received from mailboxes, or ports. A mailbox can be viewed abstractly as an object into which messages can be placed by processes and from which messages can be removed. Each mailbox has a unique identification. For example, POSIX message queues use an integer value to identify a mailbox. A process can communicate with another process via a number of different mailboxes, but two processes can communicate only if they have a shared mailbox. The send() and receive() primitives are defined as follows: send(A, message)—Send a message to mailbox A. receive(A, message)—Receive a message from mailbox A. In this scheme, a communication link has the following properties: A link is established between a pair of processes only if both members of the pair have a shared mailbox. A link may be associated with more than two processes. Between each pair of communicating processes, a number of different links may exist, with each link corresponding to one mailbox. Now suppose that processes P1, P2, and P3 all share mailbox A. Process P1 sends a message to A, while both P2 and P3 execute a receive() from A. Which process will receive the message sent by P1? The answer depends on which of the following methods we choose: Allow a link to be associated with two processes at most. Allow at most one process at a time to execute a receive() operation. Allow the system to select arbitrarily which process will receive the message (that is, either P2 or P3, but not both, will receive the message). The system may define an algorithm for selecting which process will receive the message (for example, round robin, where processes take turns receiving messages). The system may identify the receiver to the sender. A mailbox may be owned either by a process or by the operating system. If the mailbox is owned by a process (that is, the mailbox is part of the address space of the process), then we distinguish between the owner (which can only receive messages through this mailbox) and the user (which can only send messages to the mailbox). Since each mailbox has a unique owner, there can be no confusion about which process should receive a message sent to this mailbox. When a process that owns a mailbox terminates, the mailbox disappears. Any process that subsequently sends a message to this mailbox must be notified that the mailbox no longer exists. In contrast, a mailbox that is owned by the operating system has an existence of its own. It is independent and is not attached to any particular process. The operating system then must provide a mechanism that allows a process to do the following: Create a new mailbox. Send and receive messages through the mailbox. Delete a mailbox. The process that creates a new mailbox is that mailbox's owner by default. Initially, the owner is the only process that can receive messages through this mailbox. However, the ownership and receiving privilege may be passed to other processes through appropriate system calls. 
Of course, this provision could result in multiple receivers for each mailbox. 3.6.2 Synchronization Communication between processes takes place through calls to send() and receive() primitives. There are different design options for implementing each primitive. Message passing may be either blocking or nonblocking—also known as synchronous and asynchronous. (Throughout this text, you will encounter the concepts of synchronous and asynchronous behavior in relation to various operating-system algorithms.) Blocking send. The sending process is blocked until the message is received by the receiving process or by the mailbox. Nonblocking send. The sending process sends the message and resumes operation. Blocking receive. The receiver blocks until a message is available. Nonblocking receive. The receiver retrieves either a valid message or a null. Different combinations of send() and receive() are possible. When both send() and receive() are blocking, we have a rendezvous between the sender and the receiver. The solution to the producer–consumer problem becomes trivial when we use blocking send() and receive() statements. The producer merely invokes the blocking send() call and waits until the message is delivered to either the receiver or the mailbox. Likewise, when the consumer invokes receive(), it blocks until a message is available. This is illustrated in Figures 3.14 and 3.15. message next_produced; while (true) {      /* produce an item in next_produced */      send(next_produced); } Figure 3.14 The producer process using message passing. message next_consumed; while (true) {      receive(next_consumed);      /* consume the item in next_consumed */ } Figure 3.15 The consumer process using message passing. 3.6.3 Buffering Whether communication is direct or indirect, messages exchanged by communicating processes reside in a temporary queue. Basically, such queues can be implemented in three ways: Zero capacity. The queue has a maximum length of zero; thus, the link cannot have any messages waiting in it. In this case, the sender must block until the recipient receives the message. Bounded capacity. The queue has finite length n; thus, at most n messages can reside in it. If the queue is not full when a new message is sent, the message is placed in the queue (either the message is copied or a pointer to the message is kept), and the sender can continue execution without waiting. The link's capacity is finite, however. If the link is full, the sender must block until space is available in the queue. Unbounded capacity. The queue's length is potentially infinite; thus, any number of messages can wait in it. The sender never blocks. The zero-capacity case is sometimes referred to as a message system with no buffering. The other cases are referred to as systems with automatic buffering.

      Message-passing systems provide an alternative to shared memory for interprocess communication (IPC), allowing processes to communicate and synchronize without sharing address space. This is particularly useful in distributed environments where processes reside on different computers. Message passing relies on send() and receive() operations, which can handle fixed or variable-sized messages, with fixed sizes simplifying system implementation but increasing programming complexity. Communication links between processes can be established through direct or indirect methods. Direct communication requires processes to explicitly name each other, creating a one-to-one link, while indirect communication uses mailboxes (or ports) for message exchange, allowing more flexibility and modularity. Synchronization in message passing can be either blocking (synchronous) or nonblocking (asynchronous). Blocking send() causes the sender to wait until the message is received, while blocking receive() makes the receiver wait until a message is available. Nonblocking send() allows the sender to continue immediately after dispatching the message, and nonblocking receive() allows the receiver to check for messages without waiting. When both operations are blocking, a rendezvous occurs, simplifying synchronization. Buffering determines how messages are temporarily stored before being received. Zero-capacity queues require senders to wait for the recipient to be ready, bounded queues impose a fixed limit on messages and may require senders to wait when full, while unbounded queues allow infinite messages without blocking. The choice of buffering impacts system performance and message flow control.
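
      The mailbox (indirect communication) idea maps directly onto the POSIX message queues mentioned in the excerpt. The sketch below sends to and receives from one named queue within a single process just to show the shape of the API; the queue name /demo_mq and the message text are made up.

        /* Indirect communication sketch: send() and receive() on a named mailbox
           implemented as a POSIX message queue.  Link with -lrt on Linux. */
        #include <fcntl.h>
        #include <mqueue.h>
        #include <stdio.h>
        #include <string.h>

        int main(void) {
            struct mq_attr attr = { .mq_maxmsg = 10, .mq_msgsize = 64 };

            /* create (or open) the mailbox */
            mqd_t mq = mq_open("/demo_mq", O_CREAT | O_RDWR, 0666, &attr);

            /* send(A, message) */
            const char *msg = "Hello via mailbox";
            mq_send(mq, msg, strlen(msg) + 1, 0);           /* priority 0 */

            /* receive(A, message) -- blocking by default */
            char buf[64];
            if (mq_receive(mq, buf, sizeof(buf), NULL) >= 0)
                printf("received: %s\n", buf);

            mq_close(mq);
            mq_unlink("/demo_mq");
            return 0;
        }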

    78. 3.5 IPC in Shared-Memory Systems Interprocess communication using shared memory requires communicating processes to establish a region of shared memory. Typically, a shared-memory region resides in the address space of the process creating the shared-memory segment. Other processes that wish to communicate using this shared-memory segment must attach it to their address space. Recall that, normally, the operating system tries to prevent one process from accessing another process's memory. Shared memory requires that two or more processes agree to remove this restriction. They can then exchange information by reading and writing data in the shared areas. The form of the data and the location are determined by these processes and are not under the operating system's control. The processes are also responsible for ensuring that they are not writing to the same location simultaneously. To illustrate the concept of cooperating processes, let's consider the producer–consumer problem, which is a common paradigm for cooperating processes. A producer process produces information that is consumed by a consumer process. For example, a compiler may produce assembly code that is consumed by an assembler. The assembler, in turn, may produce object modules that are consumed by the loader. The producer–consumer problem also provides a useful metaphor for the client–server paradigm. We generally think of a server as a producer and a client as a consumer. For example, a web server produces (that is, provides) web content such as HTML files and images, which are consumed (that is, read) by the client web browser requesting the resource. One solution to the producer–consumer problem uses shared memory. To allow producer and consumer processes to run concurrently, we must have available a buffer of items that can be filled by the producer and emptied by the consumer. This buffer will reside in a region of memory that is shared by the producer and consumer processes. A producer can produce one item while the consumer is consuming another item. The producer and consumer must be synchronized, so that the consumer does not try to consume an item that has not yet been produced. Two types of buffers can be used. The unbounded buffer places no practical limit on the size of the buffer. The consumer may have to wait for new items, but the producer can always produce new items. The bounded buffer assumes a fixed buffer size. In this case, the consumer must wait if the buffer is empty, and the producer must wait if the buffer is full. Let's look more closely at how the bounded buffer illustrates interprocess communication using shared memory. The following variables reside in a region of memory shared by the producer and consumer processes: #define BUFFER_SIZE 10 typedef struct { . . . } item; item buffer[BUFFER_SIZE]; int in = 0; int out = 0; The shared buffer is implemented as a circular array with two logical pointers: in and out. The variable in points to the next free position in the buffer; out points to the first full position in the buffer. The buffer is empty when in == out; the buffer is full when ((in + 1) % BUFFER_SIZE) == out. The code for the producer process is shown in Figure 3.12, and the code for the consumer process is shown in Figure 3.13. The producer process has a local variable next_produced in which the new item to be produced is stored. The consumer process has a local variable next_consumed in which the item to be consumed is stored. 
item next_produced; while (true) {      /* produce an item in next_produced */      while (((in + 1) % BUFFER_SIZE) == out)        ; /* do nothing */      buffer[in] = next_produced;      in = (in + 1) % BUFFER_SIZE; } Figure 3.12 The producer process using shared memory. item next_consumed; while (true) {      while (in == out)        ; /* do nothing */      next_consumed = buffer[out];      out = (out + 1) % BUFFER_SIZE;      /* consume the item in next_consumed */ } Figure 3.13 The consumer process using shared memory. This scheme allows at most BUFFER_SIZE − 1 items in the buffer at the same time. We leave it as an exercise for you to provide a solution in which BUFFER_SIZE items can be in the buffer at the same time. In Section 3.7.1, we illustrate the POSIX API for shared memory. One issue this illustration does not address concerns the situation in which both the producer process and the consumer process attempt to access the shared buffer concurrently. In Chapter 6 and Chapter 7, we discuss how synchronization among cooperating processes can be implemented effectively in a shared-memory environment.

      Interprocess communication (IPC) in shared-memory systems allows processes to communicate by establishing a shared-memory region. Normally, operating systems prevent processes from accessing each other's memory, so the cooperating processes must agree to lift this restriction. The processes themselves decide the form and location of the shared data and manage it without operating-system intervention. A common example is the producer–consumer problem, where a producer generates data consumed by a consumer. This paradigm extends to client-server models, such as web servers providing content to browsers. Shared memory enables concurrent execution of producers and consumers through a buffer, which can be either unbounded (the producer never waits) or bounded (fixed size, so the producer must wait when the buffer is full). A bounded buffer, implemented as a circular array, uses two pointers, in and out, to manage data flow. The buffer is empty when in == out and full when ((in + 1) % BUFFER_SIZE) == out. The producer adds items to the buffer, while the consumer removes them. Simultaneous access to the buffer by both processes can lead to conflicts, which requires the synchronization techniques discussed in later chapters. This model is efficient because it minimizes kernel intervention after setup, but careful synchronization is necessary to avoid race conditions and data inconsistency.
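
      A minimal sketch of how the bounded buffer of Figures 3.12 and 3.13 could be placed in a POSIX shared-memory segment (the API the text defers to Section 3.7.1). The segment name "/bounded_buf" is an arbitrary choice, linking with -lrt may be needed on Linux, and the busy-wait logic still needs the synchronization of Chapters 6 and 7 before concurrent producer and consumer processes can use it safely; this shows only the setup and a single producer step.

      #include <fcntl.h>
      #include <stdio.h>
      #include <sys/mman.h>
      #include <unistd.h>

      #define BUFFER_SIZE 10
      typedef int item;                          /* stand-in for the item type */

      typedef struct {
          item buffer[BUFFER_SIZE];
          int in;                                /* next free slot  */
          int out;                               /* first full slot */
      } shared_buf;

      int main(void) {
          /* Create the segment; a newly created object is zero-filled,
             so in == out == 0 (empty buffer). */
          int fd = shm_open("/bounded_buf", O_CREAT | O_RDWR, 0644);
          if (fd == -1) { perror("shm_open"); return 1; }
          if (ftruncate(fd, sizeof(shared_buf)) == -1) { perror("ftruncate"); return 1; }

          shared_buf *sb = mmap(NULL, sizeof(shared_buf),
                                PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          if (sb == MAP_FAILED) { perror("mmap"); return 1; }

          /* One producer step: add an item if the buffer is not full. */
          if (((sb->in + 1) % BUFFER_SIZE) != sb->out) {
              sb->buffer[sb->in] = 42;
              sb->in = (sb->in + 1) % BUFFER_SIZE;
          }

          munmap(sb, sizeof(shared_buf));
          close(fd);
          return 0;
      }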

    79. 3.4 Interprocess Communication Processes executing concurrently in the operating system may be either independent processes or cooperating processes. A process is independent if it does not share data with any other processes executing in the system. A process is cooperating if it can affect or be affected by the other processes executing in the system. Clearly, any process that shares data with other processes is a cooperating process. There are several reasons for providing an environment that allows process cooperation: Information sharing. Since several applications may be interested in the same piece of information (for instance, copying and pasting), we must provide an environment to allow concurrent access to such information. Computation speedup. If we want a particular task to run faster, we must break it into subtasks, each of which will be executing in parallel with the others. Notice that such a speedup can be achieved only if the computer has multiple processing cores. Modularity. We may want to construct the system in a modular fashion, dividing the system functions into separate processes or threads, as we discussed in Chapter 2.

      Interprocess communication (IPC) enables processes in an operating system to interact, with processes classified as either independent or cooperating. Cooperating processes share data and benefit from IPC for information sharing, computation speedup, and modularity. Information sharing allows multiple applications to access the same data, computation speedup is achieved by running subtasks in parallel on multi-core systems, and modularity enhances system organization by dividing functions into separate processes. There are two main IPC models: shared memory and message passing. In shared memory, processes read and write directly to a designated memory space, making it faster since kernel intervention is minimal after setup. Message passing involves exchanging messages between processes, which is useful for small data transfers and distributed systems but generally slower due to frequent kernel involvement. While shared memory is efficient for local communication, message passing is more suitable for distributed environments. Many operating systems implement both models to optimize process interaction.
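
      As a small illustration of the message-passing model (one of several possible mechanisms, not the only one), the sketch below sends a single message from a parent to a child through an ordinary UNIX pipe. The kernel copies the bytes between the two processes, so no address space is shared, in contrast to the shared-memory model.

      #include <stdio.h>
      #include <string.h>
      #include <sys/wait.h>
      #include <unistd.h>

      int main(void) {
          int fd[2];
          if (pipe(fd) == -1) { perror("pipe"); return 1; }

          pid_t pid = fork();
          if (pid == 0) {                        /* child: the receiver */
              char buf[64];
              close(fd[1]);                      /* close unused write end */
              ssize_t n = read(fd[0], buf, sizeof(buf) - 1);  /* blocks until data arrives */
              if (n > 0) { buf[n] = '\0'; printf("child received: %s\n", buf); }
              close(fd[0]);
          } else if (pid > 0) {                  /* parent: the sender */
              const char *msg = "hello via message passing";
              close(fd[0]);                      /* close unused read end */
              write(fd[1], msg, strlen(msg));
              close(fd[1]);
              wait(NULL);
          } else {
              perror("fork");
              return 1;
          }
          return 0;
      }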

    80. 3.2 Process Scheduling

      Section 3.2 discusses process scheduling in operating systems, focusing on the objectives of multiprogramming and time sharing. The process scheduler selects processes for CPU time to ensure efficient resource use. In Linux, processes are represented by the task_struct structure, which contains key information such as process state, memory, and scheduling details. Processes are organized into scheduling queues, including the ready queue (processes waiting for CPU time) and wait queues (processes waiting for events such as I/O completion). The CPU scheduler, which must run frequently, selects which ready process executes next; some systems also use swapping to temporarily move processes out of memory when memory is overcommitted. A context switch occurs when the system saves the current process's state and loads another process's saved state, and the overhead it introduces affects performance. In mobile systems, multitasking is handled differently: early iOS versions limited background tasks, while Android allows background services to run continuously. This section emphasizes the critical role of scheduling, context switching, and queue management in keeping the system efficient and responsive.
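
      A toy sketch of a ready queue kept as a FIFO linked list, from which a dispatcher repeatedly removes the next process to run. The struct and field names are simplified stand-ins, not the actual Linux task_struct or scheduler.

      #include <stdio.h>
      #include <stdlib.h>

      struct task {                              /* simplified process descriptor */
          int pid;
          struct task *next;
      };

      static struct task *head = NULL, *tail = NULL;

      /* Add a task to the tail of the ready queue. */
      static void enqueue(int pid) {
          struct task *t = malloc(sizeof(*t));
          t->pid = pid;
          t->next = NULL;
          if (tail) tail->next = t; else head = t;
          tail = t;
      }

      /* The dispatcher removes the task at the head of the queue. */
      static struct task *dequeue(void) {
          struct task *t = head;
          if (t) { head = t->next; if (!head) tail = NULL; }
          return t;
      }

      int main(void) {
          enqueue(101); enqueue(102); enqueue(103);
          struct task *t;
          while ((t = dequeue()) != NULL) {
              printf("dispatching pid %d\n", t->pid);  /* a context switch would happen here */
              free(t);
          }
          return 0;
      }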

    81. 3.1.4 Threads The process model discussed so far has implied that a process is a program that performs a single thread of execution. For example, when a process is running a word-processor program, a single thread of instructions is being executed. This single thread of control allows the process to perform only one task at a time. Thus, the user cannot simultaneously type in characters and run the spell checker. Most modern operating systems have extended the process concept to allow a process to have multiple threads of execution and thus to perform more than one task at a time. This feature is especially beneficial on multicore systems, where multiple threads can run in parallel. A multithreaded word processor could, for example, assign one thread to manage user input while another thread runs the spell checker. On systems that support threads, the PCB is expanded to include information for each thread. Other changes throughout the system are also needed to support threads. Chapter 4 explores threads in detail.

      A traditional process has a single thread of execution, meaning it performs one task at a time. However, modern OSs support multithreading, enabling multiple tasks within a process to run concurrently. This is particularly beneficial on multicore processors where different threads can run in parallel.
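
      A minimal pthread sketch, loosely mirroring the word-processor example: one thread standing in for input handling and another for the spell checker, both sharing the process's address space. The function names are illustrative; compile with -pthread.

      #include <pthread.h>
      #include <stdio.h>

      static void *handle_input(void *arg) {
          printf("input thread: reading keystrokes...\n");
          return NULL;
      }

      static void *run_spellcheck(void *arg) {
          printf("spell-check thread: scanning document...\n");
          return NULL;
      }

      int main(void) {
          pthread_t t1, t2;
          /* Both threads share the process's address space and may run in
             parallel on separate cores. */
          pthread_create(&t1, NULL, handle_input, NULL);
          pthread_create(&t2, NULL, run_spellcheck, NULL);
          pthread_join(t1, NULL);
          pthread_join(t2, NULL);
          return 0;
      }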

    82. Process Control Block Each process is represented in the operating system by a process control block (PCB)—also called a task control block. A PCB is shown in Figure 3.3. It contains many pieces of information associated with a specific process, including these: Process state. The state may be new, ready, running, waiting, halted, and so on. Program counter. The counter indicates the address of the next instruction to be executed for this process. CPU registers. The registers vary in number and type, depending on the computer architecture. They include accumulators, index registers, stack pointers, and general-purpose registers, plus any condition-code information. Along with the program counter, this state information must be saved when an interrupt occurs, to allow the process to be continued correctly afterward when it is rescheduled to run. CPU-scheduling information. This information includes a process priority, pointers to scheduling queues, and any other scheduling parameters. (Chapter 5 describes process scheduling.) Memory-management information. This information may include such items as the value of the base and limit registers and the page tables, or the segment tables, depending on the memory system used by the operating system (Chapter 9). Accounting information. This information includes the amount of CPU and real time used, time limits, account numbers, job or process numbers, and so on. I/O status information. This information includes the list of I/O devices allocated to the process, a list of open files, and so on.

      The OS tracks each process using a PCB, which stores critical information like process state, program counter, CPU registers, memory management data, and I/O status. This allows the OS to pause and resume processes efficiently.
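
      An illustrative C layout grouping the PCB fields listed above. Every name here is a simplified stand-in; real kernels (for example, Linux's task_struct) contain far more. These are the fields a kernel saves and restores when it switches the CPU between processes.

      #include <stdint.h>

      enum proc_state { NEW, READY, RUNNING, WAITING, TERMINATED };

      struct pcb {
          int             pid;               /* process identifier            */
          enum proc_state state;             /* process state                 */
          uint64_t        program_counter;   /* next instruction to execute   */
          uint64_t        registers[16];     /* saved CPU registers           */
          int             priority;          /* CPU-scheduling information    */
          void           *page_table;        /* memory-management information */
          uint64_t        cpu_time_used;     /* accounting information        */
          int             open_files[16];    /* I/O status: open descriptors  */
          struct pcb     *next;              /* link in a scheduling queue    */
      };

      int main(void) {
          struct pcb p = { .pid = 1, .state = NEW };
          p.state = READY;                   /* admitted to the ready queue */
          return p.state == READY ? 0 : 1;
      }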

    83. Process State

      A process moves between five states—New, Ready, Running, Waiting, and Terminated—based on CPU availability, I/O operations, and scheduling decisions. Only one process can run on a core at any given time, but many can be ready to execute.

    84. 3.1.1 The Process Informally, as mentioned earlier, a process is a program in execution. The status of the current activity of a process is represented by the value of the program counter and the contents of the processor's registers. The memory layout of a process is typically divided into multiple sections, and is shown in Figure 3.1. These sections include: Text section—the executable code Data section—global variables Heap section—memory that is dynamically allocated during program run time Stack section—temporary data storage when invoking functions (such as function parameters, return addresses, and local variables)

      A process consists of different sections in memory: text (executable code), data (global variables), heap (dynamic memory), and stack (function call data). The heap and stack grow dynamically during execution, but the OS must ensure they don’t overlap.
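
      A small program that prints one address from each region of its own image. The exact values and their ordering vary with the OS, compiler, and address-space layout randomization, so this is only illustrative of the text/data/heap/stack split.

      #include <stdio.h>
      #include <stdlib.h>

      int global_counter = 0;                 /* data section (global variable) */

      void show_layout(void) {
          int local = 0;                      /* stack (local variable)         */
          int *dynamic = malloc(sizeof(int)); /* heap (allocated at run time)   */

          printf("text  (code)  : %p\n", (void *)show_layout);
          printf("data  (global): %p\n", (void *)&global_counter);
          printf("heap  (malloc): %p\n", (void *)dynamic);
          printf("stack (local) : %p\n", (void *)&local);
          free(dynamic);
      }

      int main(void) {
          show_layout();
          return 0;
      }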

    85. Early computers allowed only one program to be executed at a time. This program had complete control of the system and had access to all the system's resources. In contrast, contemporary computer systems allow multiple programs to be loaded into memory and executed concurrently. This evolution required firmer control and more compartmentalization of the various programs; and these needs resulted in the notion of a process, which is a program in execution. A process is the unit of work in a modern computing system.

      Early computers executed only one program at a time, whereas modern systems allow multiple programs to run concurrently. This required the concept of a process, which is a program in execution. A system consists of many processes, some running user code and others handling system operations.

  3. Jan 2025
    1. 2.10 Operating-System Debugging

      This section explores operating-system debugging, covering failure analysis, performance monitoring, and advanced tracing tools. Debugging involves identifying and fixing errors in software and hardware, with performance tuning aiming to eliminate processing bottlenecks. When a process fails, operating systems log errors and may generate core dumps for analysis, while kernel failures result in crash dumps. Debugging kernel issues is complex due to hardware control and limited debugging tools. Performance monitoring relies on counters and tracing methods. Linux provides tools like ps, top, vmstat, and /proc for tracking resource usage, while Windows uses Task Manager. Tracing tools, such as strace, gdb, and tcpdump, capture event-based data for in-depth analysis. The BCC toolkit, built on eBPF, enables secure and low-impact debugging of live systems by tracing interactions between user and kernel code. BCC tools, such as disksnoop for disk I/O and opensnoop for system calls, provide real-time insights into system performance and security without disrupting critical applications.

    2. 2.9 Building and Booting an Operating System

      This section outlines the process of building and booting an operating system, emphasizing flexibility for various hardware configurations. It begins by discussing Operating-System Generation, explaining that while most computers come with preinstalled OSs, users can build their own through a series of steps: writing or obtaining source code, configuring, compiling, installing, and booting. System configuration can be highly tailored, modular, or dynamic, affecting system size and adaptability. A case study on Linux details building an OS from source, including downloading the kernel, configuring it, compiling modules, and installing it. An alternative method involves running Linux as a virtual machine using software like VirtualBox or VMware. The System Boot process follows OS generation, starting with a bootstrap program (boot loader) that loads the kernel into memory. Traditional BIOS-based booting has largely been replaced by UEFI, offering faster and more efficient startup. Boot loaders like GRUB allow dynamic kernel parameter selection. Linux and Android use different boot mechanisms, with Android maintaining initramfs as its root filesystem. Most OSs include recovery modes for troubleshooting and repairs.

    3. 2.8 Operating-System Structure A system as large and complex as a modern operating system

      Operating system structure is crucial for managing complexity and ensuring functionality. Modern operating systems are typically organized into modular components, each with well-defined interfaces, similar to how programs are divided into functions. The monolithic structure, used in UNIX and Linux, combines all kernel functionalities into a single address space, offering high performance due to minimal overhead but making the system difficult to extend and maintain. In contrast, the layered approach divides the OS into distinct layers, each relying on the services of lower layers, simplifying debugging and verification but often suffering from performance overhead due to inter-layer communication. The microkernel approach, exemplified by Mach (used in macOS and iOS), minimizes the kernel by moving nonessential services to user space, communicating via message passing. This enhances security, reliability, and portability but incurs performance penalties due to message-passing overhead. For example, Windows NT initially used a microkernel but shifted to a more monolithic design to improve performance. A modern compromise is the use of loadable kernel modules (LKMs), as seen in Linux, macOS, and Windows. This approach combines the efficiency of monolithic kernels with the flexibility of modular design. Core services remain in the kernel, while additional functionalities, like device drivers or file systems, are dynamically loaded at runtime. This allows for easier updates and customization without recompiling the entire kernel, maintaining performance while enhancing modularity. Overall, the choice of structure depends on balancing performance, flexibility, and ease of maintenance. Hybrid operating systems combine different structures to balance performance, security, and usability. Linux, primarily monolithic, is modular for dynamic functionality. Windows shares monolithic traits but supports microkernel features like user-mode subsystems. Apple's macOS and iOS share the Darwin kernel, featuring Mach and BSD UNIX. macOS targets desktops with Intel chips, while iOS is optimized for mobile ARM architectures with strict security controls. Both provide Cocoa frameworks for application development. Darwin’s hybrid nature integrates Mach’s memory management and IPC with BSD’s POSIX functions. Despite Darwin’s open-source status, Apple’s proprietary frameworks, such as Cocoa, remain closed to developers. Android is an open-source mobile OS developed by Google, supporting various hardware platforms. It features a layered architecture, including the Linux kernel, Bionic C library, and Android Runtime (ART), which performs ahead-of-time (AOT) compilation for efficiency. Developers use Java with Android’s custom API, accessing hardware via the hardware abstraction layer (HAL).
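
      The loadable-kernel-module idea in miniature: a "hello world" Linux module that can be inserted and removed at run time (insmod/rmmod) without rebuilding the kernel. Building it requires the kernel headers and a small kbuild Makefile, omitted here; the module name and messages are illustrative.

      #include <linux/init.h>
      #include <linux/kernel.h>
      #include <linux/module.h>

      static int __init hello_init(void)
      {
          printk(KERN_INFO "hello: module loaded\n");
          return 0;                          /* 0 signals successful initialization */
      }

      static void __exit hello_exit(void)
      {
          printk(KERN_INFO "hello: module removed\n");
      }

      module_init(hello_init);
      module_exit(hello_exit);

      MODULE_LICENSE("GPL");
      MODULE_DESCRIPTION("Minimal example of a loadable kernel module");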

    4. 2.7 Operating-System Design and Implementation In this section, we discuss problems we face in designing and implementing an operating system. There are, of course, no complete solutions to such problems, but there are approaches that have proved successful. 2.7.1 Design Goals The first problem in designing a system is to define goals and specifications. At the highest level, the design of the system will be affected by the choice of hardware and the type of system: traditional desktop/laptop, mobile, distributed, or real time. Beyond this highest design level, the requirements may be much harder to specify. The requirements can, however, be divided into two basic groups: user goals and system goals. Users want certain obvious properties in a system. The system should be convenient to use, easy to learn and to use, reliable, safe, and fast. Of course, these specifications are not particularly useful in the system design, since there is no general agreement on how to achieve them. A similar set of requirements can be defined by the developers who must design, create, maintain, and operate the system. The system should be easy to design, implement, and maintain; and it should be flexible, reliable, error free, and efficient. Again, these requirements are vague and may be interpreted in various ways. There is, in short, no unique solution to the problem of defining the requirements for an operating system. The wide range of systems in existence shows that different requirements can result in a large variety of solutions for different environments. For example, the requirements for Wind River VxWorks, a real-time operating system for embedded systems, must have been substantially different from those for Windows Server, a large multiaccess operating system designed for enterprise applications. Specifying and designing an operating system is a highly creative task. Although no textbook can tell you how to do it, general principles have been developed in the field of software engineering, and we turn now to a discussion of some of these principles. 2.7.2 Mechanisms and Policies One important principle is the separation of policy from mechanism. Mechanisms determine how to do something; policies determine what will be done. For example, the timer construct (see Section 1.4.3) is a mechanism for ensuring CPU protection, but deciding how long the timer is to be set for a particular user is a policy decision. The separation of policy and mechanism is important for flexibility. Policies are likely to change across places or over time. In the worst case, each change in policy would require a change in the underlying mechanism. A general mechanism flexible enough to work across a range of policies is preferable. A change in policy would then require redefinition of only certain parameters of the system. For instance, consider a mechanism for giving priority to certain types of programs over others. If the mechanism is properly separated from policy, it can be used either to support a policy decision that I/O-intensive programs should have priority over CPU-intensive ones or to support the opposite policy. Microkernel-based operating systems (discussed in Section 2.8.3) take the separation of mechanism and policy to one extreme by implementing a basic set of primitive building blocks. These blocks are almost policy free, allowing more advanced mechanisms and policies to be added via user-created kernel modules or user programs themselves. 
In contrast, consider Windows, an enormously popular commercial operating system available for over three decades. Microsoft has closely encoded both mechanism and policy into the system to enforce a global look and feel across all devices that run the Windows operating system. All applications have similar interfaces, because the interface itself is built into the kernel and system libraries. Apple has adopted a similar strategy with its macOS and iOS operating systems. We can make a similar comparison between commercial and open-source operating systems. For instance, contrast Windows, discussed above, with Linux, an open-source operating system that runs on a wide range of computing devices and has been available for over 25 years. The “standard” Linux kernel has a specific CPU scheduling algorithm (covered in Section 5.7.1), which is a mechanism that supports a certain policy. However, anyone is free to modify or replace the scheduler to support a different policy. Policy decisions are important for all resource allocation. Whenever it is necessary to decide whether or not to allocate a resource, a policy decision must be made. Whenever the question is how rather than what, it is a mechanism that must be determined. 2.7.3 Implementation Once an operating system is designed, it must be implemented. Because operating systems are collections of many programs, written by many people over a long period of time, it is difficult to make general statements about how they are implemented. Early operating systems were written in assembly language. Now, most are written in higher-level languages such as C or C++, with small amounts of the system written in assembly language. In fact, more than one higher-level language is often used. The lowest levels of the kernel might be written in assembly language and C. Higher-level routines might be written in C and C++, and system libraries might be written in C++ or even higher-level languages. Android provides a nice example: its kernel is written mostly in C with some assembly language. Most Android system libraries are written in C or C++, and its application frameworks—which provide the developer interface to the system—are written mostly in Java. We cover Android's architecture in more detail in Section 2.8.5.2. The advantages of using a higher-level language, or at least a systems-implementation language, for implementing operating systems are the same as those gained when the language is used for application programs: the code can be written faster, is more compact, and is easier to understand and debug. In addition, improvements in compiler technology will improve the generated code for the entire operating system by simple recompilation. Finally, an operating system is far easier to port to other hardware if it is written in a higher-level language. This is particularly important for operating systems that are intended to run on several different hardware systems, such as small embedded devices, Intel x86 systems, and ARM chips running on phones and tablets. The only possible disadvantages of implementing an operating system in a higher-level language are reduced speed and increased storage requirements. This, however, is not a major issue in today's systems. Although an expert assembly-language programmer can produce efficient small routines, for large programs a modern compiler can perform complex analysis and apply sophisticated optimizations that produce excellent code. 
Modern processors have deep pipelining and multiple functional units that can handle the details of complex dependencies much more easily than can the human mind. As is true in other systems, major performance improvements in operating systems are more likely to be the result of better data structures and algorithms than of excellent assembly-language code. In addition, although operating systems are large, only a small amount of the code is critical to high performance; the interrupt handlers, I/O manager, memory manager, and CPU scheduler are probably the most critical routines. After the system is written and is working correctly, bottlenecks can be identified and can be refactored to operate more efficiently.

      Operating system design and implementation involve defining clear goals and balancing user and system requirements. User goals focus on convenience, reliability, and speed, while system goals emphasize ease of design, flexibility, and efficiency. A key principle is separating mechanisms (how to do something) from policies (what to do), enabling flexibility and adaptability. For example, microkernel systems use minimal, policy-free mechanisms, allowing customization, while systems like Windows integrate both for consistency. Modern operating systems are typically written in higher-level languages like C or C++, with some assembly for critical parts, improving portability, maintainability, and performance. Compiler optimizations and efficient algorithms often outweigh the benefits of assembly language, making higher-level languages preferable for most OS development.
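
      A small sketch of the mechanism/policy separation: one generic "pick the next task" mechanism parameterized by a comparison function, so that changing the policy means swapping the comparator rather than rewriting the mechanism. All names here are illustrative, not taken from any real scheduler.

      #include <stdio.h>

      struct task { const char *name; int priority; int io_bound; };

      /* Mechanism: scan the list and return the "best" task under some policy. */
      const struct task *pick_next(const struct task *tasks, int n,
                                   int (*better)(const struct task *,
                                                 const struct task *)) {
          const struct task *best = &tasks[0];
          for (int i = 1; i < n; i++)
              if (better(&tasks[i], best))
                  best = &tasks[i];
          return best;
      }

      /* Policy 1: favor higher static priority. */
      int by_priority(const struct task *a, const struct task *b) {
          return a->priority > b->priority;
      }

      /* Policy 2: favor I/O-bound tasks. */
      int by_io_bound(const struct task *a, const struct task *b) {
          return a->io_bound > b->io_bound;
      }

      int main(void) {
          struct task tasks[] = { {"compiler", 3, 0}, {"editor", 1, 1}, {"backup", 2, 0} };
          printf("priority policy picks: %s\n", pick_next(tasks, 3, by_priority)->name);
          printf("I/O-bound policy picks: %s\n", pick_next(tasks, 3, by_io_bound)->name);
          return 0;
      }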

    5. 2.6 Why Applications Are Operating-System Specific Fundamentally, applications compiled on one operating system are not executable on other operating systems. If they were, the world would be a better place, and our choice of what operating system to use would depend on utility and features rather than which applications were available. Based on our earlier discussion, we can now see part of the problem—each operating system provides a unique set of system calls. System calls are part of the set of services provided by operating systems for use by applications. Even if system calls were somehow uniform, other barriers would make it difficult for us to execute application programs on different operating systems. But if you have used multiple operating systems, you may have used some of the same applications on them. How is that possible? An application can be made available to run on multiple operating systems in one of three ways: 1. The application can be written in an interpreted language (such as Python or Ruby) that has an interpreter available for multiple operating systems. The interpreter reads each line of the source program, executes equivalent instructions on the native instruction set, and calls native operating system calls. Performance suffers relative to that for native applications, and the interpreter provides only a subset of each operating system's features, possibly limiting the feature sets of the associated applications. 2. The application can be written in a language that includes a virtual machine containing the running application. The virtual machine is part of the language's full RTE. One example of this method is Java. Java has an RTE that includes a loader, byte-code verifier, and other components that load the Java application into the Java virtual machine. This RTE has been ported, or developed, for many operating systems, from mainframes to smartphones, and in theory any Java app can run within the RTE wherever it is available. Systems of this kind have disadvantages similar to those of interpreters, discussed above. 3. The application developer can use a standard language or API in which the compiler generates binaries in a machine- and operating-system-specific language. The application must be ported to each operating system on which it will run. This porting can be quite time consuming and must be done for each new version of the application, with subsequent testing and debugging. Perhaps the best-known example is the POSIX API and its set of standards for maintaining source-code compatibility between different variants of UNIX-like operating systems. In theory, these three approaches seemingly provide simple solutions for developing applications that can run across different operating systems. However, the general lack of application mobility has several causes, all of which still make developing cross-platform applications a challenging task. At the application level, the libraries provided with the operating system contain APIs to provide features like GUI interfaces, and an application designed to call one set of APIs (say, those available from IOS on the Apple iPhone) will not work on an operating system that does not provide those APIs (such as Android). Other challenges exist at lower levels in the system, including the following. Each operating system has a binary format for applications that dictates the layout of the header, instructions, and variables. 
Those components need to be at certain locations in specified structures within an executable file so the operating system can open the file and load the application for proper execution. CPUs have varying instruction sets, and only applications containing the appropriate instructions can execute correctly. Operating systems provide system calls that allow applications to request various activities, such as creating files and opening network connections. Those system calls vary among operating systems in many respects, including the specific operands and operand ordering used, how an application invokes the system calls, their numbering and number, their meanings, and their return of results. There are some approaches that have helped address, though not completely solve, these architectural differences. For example, Linux—and almost every UNIX system—has adopted the ELF format for binary executable files. Although ELF provides a common standard across Linux and UNIX systems, the ELF format is not tied to any specific computer architecture, so it does not guarantee that an executable file will run across different hardware platforms. APIs, as mentioned above, specify certain functions at the application level. At the architecture level, an application binary interface (ABI) is used to define how different components of binary code can interface for a given operating system on a given architecture. An ABI specifies low-level details, including address width, methods of passing parameters to system calls, the organization of the run-time stack, the binary format of system libraries, and the size of data types, just to name a few. Typically, an ABI is specified for a given architecture (for example, there is an ABI for the ARMv8 processor). Thus, an ABI is the architecture-level equivalent of an API. If a binary executable file has been compiled and linked according to a particular ABI, it should be able to run on different systems that support that ABI. However, because a particular ABI is defined for a certain operating system running on a given architecture, ABIs do little to provide cross-platform compatibility. In sum, all of these differences mean that unless an interpreter, RTE, or binary executable file is written for and compiled on a specific operating system on a specific CPU type (such as Intel x86 or ARMv8), the application will fail to run. Imagine the amount of work that is required for a program such as the Firefox browser to run on Windows, macOS, various Linux releases, iOS, and Android, sometimes on various CPU architectures.

      Applications are often operating-system specific due to differences in system calls, binary formats, and CPU instruction sets. System calls, which enable applications to interact with the OS, vary across platforms, making cross-platform execution challenging. Three approaches enable multi-OS compatibility: 1) Interpreted languages (e.g., Python) use interpreters to execute code on different OSes, though performance and feature sets may be limited. 2) Virtual machines (e.g., Java) run applications within a portable runtime environment, but with similar limitations. 3) Porting applications to each OS using standard APIs (e.g., POSIX) is time-consuming. Binary formats (e.g., ELF, PE) and ABIs further complicate cross-platform compatibility, as they are tied to specific architectures and OSes. These factors make developing cross-platform applications, like Firefox, a complex task requiring significant adaptation for each OS and CPU architecture.

    6. 2.5 Linkers and Loaders Usually, a program resides on disk as a binary executable file—for example, a.out or prog.exe. To run on a CPU, the program must be brought into memory and placed in the context of a process. In this section, we describe the steps in this procedure, from compiling a program to placing it in memory, where it becomes eligible to run on an available CPU core. The steps are highlighted in Figure 2.11. Figure 2.11 The role of the linker and loader. Source files are compiled into object files that are designed to be loaded into any physical memory location, a format known as an relocatable object file. Next, the linker combines these relocatable object files into a single binary executable file. During the linking phase, other object files or libraries may be included as well, such as the standard C or math library (specified with the flag -lm). A loader is used to load the binary executable file into memory, where it is eligible to run on a CPU core. An activity associated with linking and loading is relocation, which assigns final addresses to the program parts and adjusts code and data in the program to match those addresses so that, for example, the code can call library functions and access its variables as it executes. In Figure 2.11, we see that to run the loader, all that is necessary is to enter the name of the executable file on the command line. When a program name is entered on the command line on UNIX systems—for example, ./main—the shell first creates a new process to run the program using the fork() system call. The shell then invokes the loader with the exec() system call, passing exec() the name of the executable file. The loader then loads the specified program into memory using the address space of the newly created process. (When a GUI interface is used, double-clicking on the icon associated with the executable file invokes the loader using a similar mechanism.) The process described thus far assumes that all libraries are linked into the executable file and loaded into memory. In reality, most systems allow a program to dynamically link libraries as the program is loaded. Windows, for instance, supports dynamically linked libraries (DLLs). The benefit of this approach is that it avoids linking and loading libraries that may end up not being used into an executable file. Instead, the library is conditionally linked and is loaded if it is required during program run time. For example, in Figure 2.11, the math library is not linked into the executable file main. Rather, the linker inserts relocation information that allows it to be dynamically linked and loaded as the program is loaded. We shall see in Chapter 9 that it is possible for multiple processes to share dynamically linked libraries, resulting in a significant savings in memory use. Object files and executable files typically have standard formats that include the compiled machine code and a symbol table containing metadata about functions and variables that are referenced in the program. For UNIX and Linux systems, this standard format is known as ELF (for Executable and Linkable Format). There are separate ELF formats for relocatable and executable files. One piece of information in the ELF file for executable files is the program's entry point, which contains the address of the first instruction to be executed when the program runs. Windows systems use the Portable Executable (PE) format, and macOS uses the Mach-O format. 
ELF FORMAT Linux provides various commands to identify and evaluate ELF files. For example, the file command determines a file type. If main.o is an object file, and main is an executable file, the command file main.o will report that main.o is an ELF relocatable file, while the command file main will report that main is an ELF executable. ELF files are divided into a number of sections and can be evaluated using the readelf command.

      Linkers and loaders play a crucial role in transforming a program from a disk-based binary executable (e.g., a.out or prog.exe) into a memory-resident process ready for CPU execution. Source files are compiled into relocatable object files, which the linker combines into a single executable, incorporating libraries like the standard C library. The loader then loads this executable into memory, adjusting addresses through relocation to enable proper function calls and variable access. On UNIX systems, the shell uses fork() and exec() system calls to create a process and invoke the loader. Dynamic linking, as seen with Windows DLLs, allows libraries to be linked and loaded only when needed, saving memory. Executable files follow standard formats like ELF (Linux), PE (Windows), or Mach-O (macOS), containing machine code, symbol tables, and entry points. Tools like the file and readelf commands help analyze ELF files, distinguishing between relocatable and executable formats.
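
      A sketch of the fork()/exec() sequence described above, in which the shell creates a child process and the loader replaces the child's image with the requested executable. The target "/bin/ls" is just a stand-in program.

      #include <stdio.h>
      #include <sys/wait.h>
      #include <unistd.h>

      int main(void) {
          pid_t pid = fork();

          if (pid == 0) {
              /* Child: the loader maps /bin/ls into this process's address space. */
              execl("/bin/ls", "ls", "-l", (char *)NULL);
              perror("execl");                /* reached only if the exec fails */
              return 1;
          } else if (pid > 0) {
              wait(NULL);                     /* parent (the "shell") waits for the child */
          } else {
              perror("fork");
              return 1;
          }
          return 0;
      }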

    7. 2.4 System Services Another aspect of a modern system is its collection of system services. Recall Figure 1.1, which depicted the logical computer hierarchy. At the lowest level is hardware. Next is the operating system, then the system services, and finally the application programs. System services, also known as system utilities, provide a convenient environment for program development and execution. Some of them are simply user interfaces to system calls. Others are considerably more complex. They can be divided into these categories: File management. These programs create, delete, copy, rename, print, list, and generally access and manipulate files and directories. Status information. Some programs simply ask the system for the date, time, amount of available memory or disk space, number of users, or similar status information. Others are more complex, providing detailed performance, logging, and debugging information. Typically, these programs format and print the output to the terminal or other output devices or files or display it in a window of the GUI. Some systems also support a registry, which is used to store and retrieve configuration information. File modification. Several text editors may be available to create and modify the content of files stored on disk or other storage devices. There may also be special commands to search contents of files or perform transformations of the text. Programming-language support. Compilers, assemblers, debuggers, and interpreters for common programming languages (such as C, C++, Java, and Python) are often provided with the operating system or available as a separate download. Program loading and execution. Once a program is assembled or compiled, it must be loaded into memory to be executed. The system may provide absolute loaders, relocatable loaders, linkage editors, and overlay loaders. Debugging systems for either higher-level languages or machine language are needed as well. Communications. These programs provide the mechanism for creating virtual connections among processes, users, and computer systems. They allow users to send messages to one another's screens, to browse web pages, to send e-mail messages, to log in remotely, or to transfer files from one machine to another. Background services. All general-purpose systems have methods for launching certain system-program processes at boot time. Some of these processes terminate after completing their tasks, while others continue to run until the system is halted. Constantly running system-program processes are known as services, subsystems, or daemons. One example is the network daemon discussed in Section 2.3.3.5. In that example, a system needed a service to listen for network connections in order to connect those requests to the correct processes. Other examples include process schedulers that start processes according to a specified schedule, system error monitoring services, and print servers. Typical systems have dozens of daemons. In addition, operating systems that run important activities in user context rather than in kernel context may use daemons to run these activities. Along with system programs, most operating systems are supplied with programs that are useful in solving common problems or performing common operations. Such application programs include web browsers, word processors and text formatters, spreadsheets, database systems, compilers, plotting and statistical-analysis packages, and games. 
The view of the operating system seen by most users is defined by the application and system programs, rather than by the actual system calls. Consider a user's PC. When a user's computer is running the macOS operating system, the user might see the GUI, featuring a mouse-and-windows interface. Alternatively, or even in one of the windows, the user might have a command-line UNIX shell. Both use the same set of system calls, but the system calls look different and act in different ways. Further confusing the user view, consider the user dual-booting from macOS into Windows. Now the same user on the same hardware has two entirely different interfaces and two sets of applications using the same physical resources. On the same hardware, then, a user can be exposed to multiple user interfaces sequentially or concurrently.

      System services, or utilities, enhance program development and execution by offering a structured environment. They include file management tools for manipulating files, status information programs for system details, and file modification utilities like text editors. Programming support features compilers and debuggers, while program loading tools manage execution. Communication programs enable virtual connections, and background services, or daemons, run essential processes continuously. These services, alongside application programs, shape the user’s perception of the operating system, often masking the underlying system calls. Different interfaces, like macOS GUI or UNIX shell, can coexist on the same hardware, offering varied user experiences despite shared resources.

    8. 2.3 System Calls

      System calls act as a bridge between user applications and the operating system, enabling access to essential system services. The section elaborates on their role through an example of copying data from one file to another, illustrating how system calls facilitate file operations, error handling, and user interaction. It highlights how even simple tasks require multiple system calls, such as reading input, opening files, writing data, and handling errors. The discussion extends to APIs, which simplify system call usage for developers, ensuring portability and ease of development. The role of the runtime environment (RTE) in managing system calls is also explored. The various system-call types (process control, file management, device management, information maintenance, communications, and protection) are categorized with examples from Windows and UNIX. File management involves a set of essential system calls that enable users to create, delete, open, read, write, reposition, and close files. These operations also extend to directories, facilitating structured organization within a file system. File attributes, such as name, type, and protection codes, can be retrieved or modified using get_file_attributes() and set_file_attributes() system calls. Additionally, some operating systems provide built-in move() and copy() calls, while others integrate these functionalities through APIs. The close relationship between files and devices is evident in UNIX-like systems, where both share a unified interface. Device management encompasses requesting and releasing devices, which helps prevent conflicts such as deadlocks. System calls also support debugging, retrieving system information, and process monitoring. Communication between processes occurs through message passing or shared memory, each with distinct advantages. Finally, system protection mechanisms, including permission controls, ensure secure resource access, safeguarding data in multiprogrammed and networked environments.
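
      The copy-a-file example from this section expressed directly with UNIX system calls (open, read, write, close); error handling is trimmed to the essentials to keep the sketch short.

      #include <fcntl.h>
      #include <stdio.h>
      #include <unistd.h>

      int main(int argc, char *argv[]) {
          if (argc != 3) {
              fprintf(stderr, "usage: %s <source> <destination>\n", argv[0]);
              return 1;
          }

          int in = open(argv[1], O_RDONLY);                        /* open input file  */
          if (in == -1) { perror("open source"); return 1; }

          int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644); /* create output */
          if (out == -1) { perror("open destination"); close(in); return 1; }

          char buf[4096];
          ssize_t n;
          while ((n = read(in, buf, sizeof(buf))) > 0)             /* read until EOF   */
              if (write(out, buf, n) != n) { perror("write"); break; }

          close(in);                                               /* close both files */
          close(out);
          return 0;
      }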

    9. 2.2 User and Operating-System Interface

      The user interface of an operating system determines how users interact with it. There are three primary methods: the command-line interface (CLI), the graphical user interface (GUI), and the touch-screen interface. Each has its unique advantages and use cases. The command interpreter, or shell, allows users to enter commands directly. Operating systems like Linux, UNIX, and Windows provide different shells, such as the Bash shell, which execute user commands. These commands can be built-in or executed via system programs. UNIX, for instance, runs external programs to process commands, making it easier to add new commands without modifying the shell. The graphical user interface (GUI), on the other hand, offers a more intuitive approach by using windows, icons, and menus. It was first developed at Xerox PARC in the 1970s and later popularized by Apple Macintosh and Microsoft Windows. GUIs allow users to interact with applications by clicking icons and navigating menus, significantly simplifying the user experience compared to text-based commands. For mobile devices, the touch-screen interface replaces both CLI and GUI with direct touch-based interactions. Users tap, swipe, or use gestures to navigate through applications. The iPhone and iPad, for example, use the Springboard interface to manage apps and settings, minimizing the need for external input devices like keyboards or mice. Ultimately, the choice between these interfaces depends on user preference and system functionality. Power users and administrators prefer CLI for efficiency and automation, while everyday users favor GUIs and touch interfaces for their ease of use.

    10. 2.1 Operating-System Services

      The operating system (OS) provides essential services that facilitate user interaction, program execution, and efficient resource management. One of the key OS services is the user interface (UI), which enables users to interact with the system. This interface can be a graphical user interface (GUI), a command-line interface (CLI), or a touchscreen interface for mobile devices. Each offers distinct advantages, with GUIs being user-friendly, CLIs offering precision, and touchscreens enabling intuitive interactions. Another crucial service is program execution, where the OS loads programs into memory and ensures they run efficiently. Programs must be able to terminate normally or abnormally, with the OS managing errors that arise during execution. Additionally, I/O operations allow programs to interact with hardware devices like disks, networks, and printers, ensuring proper data flow and user interaction. The file-system manipulation service manages data storage and retrieval, allowing programs to read, write, create, and delete files while enforcing access permissions. Communication services facilitate inter-process communication (IPC) through shared memory or message passing, ensuring seamless data exchange between processes on the same or different machines. To maintain system integrity, error detection continuously monitors hardware and software errors, taking corrective actions when necessary. The OS also handles resource allocation, ensuring efficient distribution of CPU cycles, memory, and peripheral devices among multiple processes. Logging and accounting track resource usage for analysis and billing purposes. Lastly, protection and security services safeguard system resources and user data from unauthorized access, ensuring a stable and secure computing environment.

    11. An operating system provides the environment within which programs are executed. Internally, operating systems vary greatly in their makeup, since they are organized along many different lines. The design of a new operating system is a major task. It is important that the goals of the system be well defined before the design begins. These goals form the basis for choices among various algorithms and strategies.

      This book describes a number of operating-system functions, such as resource allocation, system calls, and user interfaces (CLI, GUI, and touch screen). It covers the design of user interfaces, including command-line interfaces (CLI), graphical user interfaces (GUI), and touch-screen interfaces, as well as the role of system calls in communicating with the operating system. It draws attention to how interfaces vary and how their use depends on system requirements and user preference. To guarantee efficient and safe operation, operating systems also oversee crucial tasks such as file manipulation, program execution, error detection, and security. The choice of interface affects both system administration and the user experience. System calls provide the essential interface through which programs request file manipulation, input/output handling, and process management from the operating system.

    12. Virtualization is a technology that allows us to abstract the hardware of a single computer (the CPU, memory, disk drives, network interface cards, and so forth) into several different execution environments, thereby creating the illusion that each separate environment is running on its own private computer. These environments can be viewed as different individual operating systems (for example, Windows and UNIX) that may be running at the same time and may interact with each other. A user of a virtual machine can switch among the various operating systems in the same way a user can switch among the various processes running concurrently in a single operating system. Virtualization allows operating systems to run as applications within other operating systems. At first blush, there seems to be little reason for such functionality. But the virtualization industry is vast and growing, which is a testament to its utility and importance. Broadly speaking, virtualization software is one member of a class that also includes emulation. Emulation, which involves simulating computer hardware in software, is typically used when the source CPU type is different from the target CPU type. For example, when Apple switched from the IBM Power CPU to the Intel x86 CPU for its desktop and laptop computers, it included an emulation facility called “Rosetta,” which allowed applications compiled for the IBM CPU to run on the Intel CPU. That same concept can be extended to allow an entire operating system written for one platform to run on another. Emulation comes at a heavy price, however. Every machine-level instruction that runs natively on the source system must be translated to the equivalent function on the target system, frequently resulting in several target instructions. If the source and target CPUs have similar performance levels, the emulated code may run much more slowly than the native code. With virtualization, in contrast, an operating system that is natively compiled for a particular CPU architecture runs within another operating system also native to that CPU. Virtualization first came about on IBM mainframes as a method for multiple users to run tasks concurrently. Running multiple virtual machines allowed (and still allows) many users to run tasks on a system designed for a single user. Later, in response to problems with running multiple Microsoft Windows applications on the Intel x86 CPU, VMware created a new virtualization technology in the form of an application that ran on Windows. That application ran one or more guest copies of Windows or other native x86 operating systems, each running its own applications. (See Figure 1.16.) Windows was the host operating system, and the VMware application was the virtual machine manager (VMM). The VMM runs the guest operating systems, manages their resource use, and protects each guest from the others.

      Overview of cloud computing. Building on virtualization, cloud computing provides on-demand access over the network to computing, storage, and applications, with users paying according to the resources they actually consume. Cloud deployments may be public, private, or hybrid, and services are delivered as software as a service (SaaS), platform as a service (PaaS), or infrastructure as a service (IaaS); a single cloud environment can combine several of these service types.

      Embedded systems. Embedded systems are special-purpose computing devices with limited or no user interfaces, dedicated to specific tasks. They often run real-time operating systems in order to meet strict timing constraints, reading sensor data and responding to it promptly. They appear in many domains, including industrial automation, medical devices, and automotive control.

      Free and open-source operating systems. GNU/Linux is one example of an open-source operating system whose source code is freely available for modification and redistribution. The free-software movement promotes user liberties, including the freedom to run, study, modify, and redistribute software.