Agner`s CPU blog

A design without a TLB

Author: Agner

Date: 2016-03-11 05:50

I wonder if it is possible to design a microprocessor without a Translation Look-aside Buffer (TLB). A TLB is a cache that is used for virtual address translation. The TLB is quite big and complicated in many modern processors. Some processors even have two TLB levels. It costs a lot of silicon space and a performance loss because of TLB misses.

One of the most compelling reasons for having virtual address translation in current systems is, as I understand it, that you can run multiple instances of the same program. The multiple instances share the same code segment in order to save memory, but they cannot share writeable data segments. The code segment contains references to the data segment. If multiple running instances share the same code segment then they will also share the same data addresses. The only way to keep the data of different running instances separate is to have the same virtual addresses but different physical addresses for each process.

We can avoid most of the costs of the TLB and virtual address translation by having a special register that points to the data segment of the process. The program code should access its data segment through this pointer. Multiple instances of the same program will have different values in their data segment pointer. This allows them to share the same code segment while having different data segments.

Another use of the TLB is to manage the situation where we are out of physical memory or the memory has become too fragmented. The virtual address translation allows the memory segments to be moved or swapped to a disk. This problem can hopefully be reduced. RAM is cheap today. A well-designed application uses typically 1 MB of memory while a state-of-the-art PC today has 16 GB of RAM or more. Nobody runs 16,000 different applications, not even on a server. Unfortunately, some applications are wildly bloated. This should of course be discouraged. The need for byte-code interpreters, just-in-time compilation and other RAM-hungry frameworks will hopefully be reduced when the instruction set is standardized so that the compiled code will be compatible with many platforms.

We may still need some virtual address translation if we run out of RAM or for the sake of making virtual machines, but it will be a coarse-grained translation with one or a few large contiguous blocks of memory for each process, rather than the fine-grained translation of current systems with a large number of fixed-size memory blocks.

To see how this can be implemented, we first need to get an overview of the different kinds of data used in a running program and how they can be addressed. Traditionally, we have the following kinds of data:

1. Program code (TEXT). This is executable and read-only. Can be shared between multiple processes.
2. Read-only program data (CONST). This contains constants and tables used by the program. It may be shared by multiple processes unless it contains pointers that need to be relocated at load time.
3. Static read/write program data (DATA+BSS). This is used for global data and for static data inside functions. It needs multiple instances if multiple processes are running the same code.
4. Stack data (STACK). This is used for non-static data inside functions. This is the most common and most efficient way of storing program data. Each process or thread has its own stack, addressed relative to the stack pointer.
5. Local heap. Used for dynamic memory allocation by an application program
6. Global heap. Used for dynamic memory allocation by the operating system and device drivers.
7. Thread data. Allocated when a thread is created and used for thread-local static data. Rarely used.

Now, I will discuss how each of these types of data are addressed in current systems and how they can be managed in a system without a TLB.

1. Program code. In current systems, program code may contain absolute addresses. These addresses are modified (relocated) by the loader if the code is loaded at a different address than expected by the linker. The relocation is often avoided by using virtual address translation. Multiple programs that do not need to call each other can be loaded at the same virtual address.
My proposal is to avoid the need for relocation by using relative addresses as much as possible. All addresses within the same code segment are addressed relative to the instruction pointer.

2. Read-only program data. Current systems use either relative or absolute addresses to address read-only data. These addresses often need to be relocated by the loader.
My proposal is to make the read-only data segment contiguous with the program code segment, and access it with addressing relative to the instruction pointer. This needs relocation at the link stage, but not at the load stage.
This segment may contain pointers. This is typically needed in switch/case jump tables, virtual function tables, function pointers and data pointers. Current systems often use absolute addresses in these cases, needing relocation by the loader. Some compilers use self-relative pointer tables or pointers relative to an arbitrary reference point (used in 64-bit Windows and Mac OS).
My proposal for jump tables, virtual function tables and code pointers is to use 32-bit self-relative addresses or addresses relative to the code base.
Tables of constant pointers to data is a problem because - in order to use relative pointers - we need to know whether the pointer target is in the read-only data or the read/write data segment. Preferably, such a table should be placed in the same segment as its targets, either the read-only data segment or the read/write data segment. This makes it possible to use self-relative pointers. If it contains a mixture of both, then it should be placed in the read/write data segment, and any targets in the read-only data segment should be moved to the read/write data segment.

3. Static read/write program data. Current systems use absolute or relative addresses to access read/write data in the same way as read-only data. If multiple processes are running the same program then there will be one instance of the read/write data segment for each process. The multiple instances will typically share the same virtual address, while having different physical addresses. This requires virtual address translation. We want to get rid of this translation.
My proposal is to have a dedicated register for pointing to the read/write data segment. All data in the read/write data segment are addressed relative to the value in this register. We may implement a special addressing mode for this or, alternatively just let the application copy the data segment register to a general purpose register which is used as pointer. Read-only data may optionally be stored here rather than in a separate segment.
The data segment pointer register needs to be saved and restored when one program calls another program. It does not need to be saved when calling a DLL, which I will explain below.

4. Stack. Each process and each thread has its own stack which is addressed by the stack pointer. No problem here.

5 - 6. Heap. You get a pointer when allocating data on a heap. Heap data are addressed through this pointer. No problem here.

7. Thread-local data. Current systems may have a "thread environment block" which contains various information about the thread and a pointer to the thread-local data segment. In x86, it is addressed through a special segment register. It also contains information about stack size, exception handler, process environment, etc.
My proposal is to preserve this system. We may need a dedicated register to point to the thread environment block or to a thread-local data segment.

Dynamic link libraries (DLLs). My proposal is to use Windows-style dynamic linking rather than Linux-style shared objects, because the latter have the rarely used feature of symbol interposition which makes everything less efficient (see www.macieira.org/blog/2012/01/sorry-state-of-dynamic-libraries-on-linux/ )

I propose that a DLL cannot have a per-process read/write data segment (this might not be thread-safe anyway). If a library needs writeable data, for example for some initializations, then there are three possible solutions:
(1) use static linking, (2) use data supplied by the caller through a pointer, or (3) use global heap data allocated at load time with an absolute address relocated by the loader. This data block will be shared by all processes.

The same applies to device drivers. A device driver may need a writeable data segment for a mutex and for storing information about the device. This data area is shared between all processes. My proposal is that this data block is allocated at load time. It is accessed through an absolute 64-bit address which is relocated by the loader. In case the driver later needs more data, for example if there are many network printers using the same driver, then the device driver can allocate additional space on the global heap and store a pointer to this allocated memory in the data segment it got by the loader. If the driver needs per-process data then it will use data space provided by the caller through a pointer.

These methods will make it possible to replace a TLB with its many small memory segments of fixed size by a memory map with a few large memory segments of variable size. Each process has its own little memory map which is cached in the CPU. The memory map should indicate the type of each memory segment, but not necessarily any address translation. Memory segments of the same type should be joined together and made contiguous as far as possible in order to make the memory mapping as simple as possible. The memory map should not distinguish between executable code and read-only data for DLL's. This will make it possible to join all DLLs together into one big memory segment. Any unused space between the DLLs can be filled with an error code. The same goes with device drivers. Each process will have its own memory map listing the memory areas it is allowed to access. This will normally have only one segment of each type: TEXT, CONST, DATA+BSS, STACK, HEAP+THREADDATA). A segment for DLLs and their constant data can be shared between multiple processes. Each process will be able to see DLLs that belong to other processes, but these are read-only and contain no writeable data so I assume that there is no serious security risk here. A process may use static linking instead if it wants to hide which libraries it is using. The reason why I don't want to hide DLLs from processes that are not using them is that this would split the memory into many small pieces, one for each DLL. This would require many entries in the memory map.

It is my goal to keep the memory map small by keeping similar memory blocks together rather than splitting the memory space into many small pieces of different types.

This may cause problems for programming languages with just-in-time compilation. We will discourage systems and script languages that compile a little piece of code at a time. In fact, the justification for just-in-time compiling disappears when the instruction set is standardized. The code can adapt to different processors with different vector sizes at run time. We only need to (re-)compile code in case a new processor version has new advantageous instructions. The compiler/interpreter should preferably compile the code or script all at once. Piecemeal compilation also causes unpredictable response times which is annoying to the user.

I am undecided about how to implement system calls. It could use absolute addresses or a table of pointers in the read-only data segment or in the thread environment block, or a special system call instruction.

Self-modifying code is discouraged. If an application needs to generate executable code then it should preferably make a DLL and load it before executing it.

Many script languages allow self-modifying scripts. Such scripts should preferably be interpreted rather than compiled. If it turns out that there is a serious need for supporting self-modifying code for applications such as compiling self-modifying scripts, compiling user-supplied macros, or debugging applications, then we may decide to support a memory type that allows both write access and execute access. This write/execute memory will be allocated on a special heap dedicated to this purpose only. Access to use this feature must be restricted in order to avoid abuse by hackers.

Memory model:
The system proposed here gives immediate access to up to 8 GB of code (Jumps and calls use a 32-bit signed offset multiplied by the code word size), 2 GB of read-only data, 2 GB of static read/write data, 2 GB of thread-local data, almost unlimited stack size with 2 GB for each stack frame, and almost unlimited heap space. With such a huge address space, we need not support more than one standard memory model. Everything is accessed through pointers with a 32-bit relative offset.

In the rare case that there is more than 2GB distance between read-only data and the code that reads it, we will use a pointer to access it. This pointer can be stored in the thread environment block.

There is no addressing mode for absolute addresses. In the few cases where we need an absolute address (e.g. data for a device driver), we will load the 64-bit address into a register and use this as pointer. The 64-bit address is inserted into the code by the loader.

Problems:
There may be security problems if we are using a global heap. One process may be able to read and modify data belonging to another process. We should probably avoid using a global heap.

What can we do if the local heap or stack of an application overflows? If the heap overflows, we may make an extra heap that is bigger. This requires an extra entry in the memory map. If the stack overflows, then we need to move it to a different physical address and use virtual address translation. This still requires only one entry in the memory map, but with virtual address translation. The cost is that we have to copy the entire contents to a new physical address. The alternative to copying the entire stack contents is fragmented memory at the cost of having more entries in the memory map.

For these reasons, we cannot completely get rid of virtual address translation, but we can still keep the memory map much smaller than the TLBs of current systems.

Reply To This Message

Previous Message

Next Message