UNIVERSITY OF CALIFORNIA,
IRVINE

Performance Enhancement of Desktop Multimedia with Multithreaded
Extensions to A General Purpose Superscalar Microprocessor

THESIS

submitted in partial satisfaction of the requirements for the degree of
MASTER OF SCIENCE
in Electrical and Computer Engineering

by

Mark Alexander Pontius

Thesis Committee:
Professor Nader Bagherzadeh, Chair
Professor Fadi Kurdahi
Professor Nikil Dutt

© Mark Pontius, 1998

All rights reserved



TABLE OF CONTENTS



Abstract of the Thesis
1. Introduction
  1.1. The Problem
  1.2. Related work
  1.3. Overview of this thesis
2. Theory of Operation
  2.1. A multithreading primer
  2.2. Benefits and drawbacks of multithreading
    2.2.1. Cache Effects
    2.2.2. Instruction Scheduling
    2.2.3. Branch Prediction
    2.2.4. Hardware
    2.2.5. Software
3. Architecture
  3.1. Multithreaded SDSP architecture
    3.1.1. Block Diagram
      Cache / Memory Interface
      Fetch Unit
      Reorder Buffer / Instruction Window (RBIW)
      Functional Units
  3.2. New and Modified Units
    3.2.1. Thread Control
    3.2.2. Registers
  3.3. Instruction Set
    3.3.1. New instructions
      m_fork(R), m_fork_n(R,N)
      m_join(R), m_kill_all()
      m_exclusive_run(on/off)
      m_thread_num()
      m_read_ipc_reg(R), m_write_ipc_reg(R,D), m_inc_ipc_reg(R), m_dec_ipc_reg(R)
    3.3.2. Instructions for simulation only
      m_get_shared(), m_set_shared()
      m_quick_run(on/off)
      m_marker(N)
  3.4. Sharing of memory between threads
4. Simulator
  4.1. Overview
  4.2. Structure
    4.2.1. Initialization
    4.2.2. Scalar Preprocess Execution
    4.2.3. Superscalar Modeling
    4.2.4. Thread Scheduling
    4.2.5. User Interface
    4.2.6. Statistical Data Generation
  4.3. Thread Memory Management
    4.3.1. Shared Memory Model
    4.3.2. Private (Non-Shared) Memory Model
    4.3.3. Multiprogram Memory Model
  4.4. Atomic transactions
  4.5. Simulator internals
  4.6. Simulator limitations and potential improvements
5. Benchmarks
  5.1. The Multimedia Desktop Environment
  5.2. Programming environment
  5.3. Discussion of the individual benchmarks and datasets
    5.3.1. Workload and datasets
    5.3.2. Profile
    5.3.3. Properties
6. Resource Analysis Results
  6.1. Default parameters
  6.2. Fetch limits
    6.2.1. Fetch Block Size
    6.2.2. Fetch Alignment: Prefetching
    6.2.3. Branch Prediction
    6.2.4. Instruction Cache
  6.3. Data and pipeline limits
    6.3.1. Functional Units: ALU, FPU, Load/Store
    6.3.2. Data Cache
    6.3.3. Instruction Window Depth
  6.4. Thread parameters
    6.4.1. Completion Slots
    6.4.2. Thread Scheduling Algorithm
  6.5. Interactions and other ideas
7. Conclusion
  7.1. Discussion of results
  7.2. What does it take to make multithreading viable
Bibliography
Appendix A: More about the Benchmarks
  nlfilt: Non-Linear Filter for Image Enhancement
  mpeg2e: MPEG II Video Compression
    The second mpeg2e loop: fdct
  pov: Persistence of Vision Raytracer
Appendix B: Simulator Reference
  NAME
  SYNOPSIS
  DESCRIPTION
  COMMAND LINE OPTIONS
Appendix C: Showstats Reference
  NAME
  SYNOPSIS
  DESCRIPTION
  COMMAND LINE ARGUMENTS
  OUTPUT FORMATS
  -table outputs:
  -profile outputs:
Appendix D: Raw Data
  CD ROM

ABSTRACT OF THE THESIS

Performance Enhancement of Desktop Multimedia with Multithreaded
Extensions to A General Purpose Superscalar Microprocessor

Master of Science in Electrical and Computer Engineering

by

Mark Alexander Pontius

University of California, Irvine, 1998

Professor Nader Bagherzadeh, Chair


A multithreaded microprocessor architecture is proposed that is optimized for common multimedia applications. The architecture is defined to support up to seven lightweight threads entirely in hardware, with instructions for managing, synchronizing, and communicating between these threads. Three different memory models are presented to simplify porting of applications: Shared Memory, Private Memory, and Multiprogram. A simulator is presented to accurately model program fetching, scheduling, and execution with an out of order issue superscalar processor based on the SDSP. Eleven benchmark applications are analyzed for single and multithread behavior including resource utilization, branch prediction accuracy, code growth, cycles per instruction, and others. Processor resource tradeoffs are compared and simulated: fetch block size, fetch alignment, branch prediction, instruction cache, functional units, data cache, Instruction Window depth, completion slots, and scheduling algorithm. The results indicate that 3 to 5 threads are sufficient to produce up to 20% higher instruction throughput.

1. INTRODUCTION

1.1. The Problem

High program execution rates can be achieved in many ways. The instruction pipeline can be lengthened so that higher clock rates can be used. The instructions can be made more complex to accomplish more every cycle. Instructions can be scheduled so that more than one executes every cycle. All of these have been tried in commercially viable processors, but still more speed is needed. Over time, each of these approaches will progress down its own diverging path, but eventually they will all reach their performance limits [HAR94, WAL93b].

The super-pipelined processors have more complex forwarding logic or sacrifice execution of consecutive dependent instructions. The complex instruction set processors will reach the point where a few very powerful instructions will take up much of the processor resources, but are likely to be rarely used. The multiple issue processors reach the point where finding additional independent instructions becomes very expensive and the maximum utilization is only rarely achieved.

In this thesis, I look at a way of breaking out of the current limitations of multiple issue processors by having the software specify groups of independent instructions called threads. The superscalar processor fetches from one of these threads each cycle and places the instructions into a common pool from which ready instructions can be executed out of order.

I focus on applications which are likely to be found on the typical desktop machine, a workload that leans heavily toward multimedia. These applications are well suited to the multithreaded processor because of their data parallelism and task independence [DAN98, FLY98]. Traditional processor studies have focused on scientific programs as they have been the target applications for the high-end processors. Today, the consumer market is driving the high-end processors, and scientific machines are adapted from commercial silicon.

1.2. Related work

The SDSP architecture was developed by Steven Wallace [WAL93a] as a generic RISC microprocessor to study the multiple issue capability of superscalar architectures and tradeoffs. The processor consists of a central Reorder Buffer/Instruction Window (RBIW) which performs register renaming and out of order issue. The processor handles exceptions precisely by only retiring instructions in order as they shift down to the bottom of the RBIW. A VLSI design of the RBIW including the Scheduling Unit was done at 100 MHz in 0.8 micron (1 micron drawn) CMOS by Nirav Dagli [DAG94], which demonstrates the feasibility of the architecture.

The SDSP has a history of multithreaded extensions. Manu Gulati [GUL94, GUL96] proposed a version with a partially shared register file, a deep Instruction Window, and full Instruction Window commit bypassing. Benchmarks were taken from multiprocessor benchmark suites, with speedups achieved in the range of 20-55%. Mat Loikkanen [LOI96] proposed a version that featured Thread Suspending Instruction Buffers which allow a long latency instruction to sit outside the Instruction Window, rather than require out of order commit. They are called Thread Suspending Instructions because when one of these instructions is fetched, that thread must be suspended or the single TSIB would not be able to prevent the Instruction Window from stalling. It has a partially shared register file. Long latency remote loads were added to the instruction set to provide for a distributed memory parallel processing model. The benchmarks were taken from multiprocessor benchmark suites and hand crafted in assembly to optimize the loops. Speedups as high as 3.3 times were achieved.

The Simultaneous Multithreading (SMT) processor designed by Dean Tullsen [TUL95] used independent programs interleaved each fetch cycle (more than one thread may be fetched every cycle) into a superscalar processor to increase resource utilization. It does not improve single program performance, but provides throughput increases. Jack Lo [LO97] modified the SMT architecture to support multithreaded applications, and compared the performance to that of a single chip multiprocessor to show that static resource partitioning leads to lower utilization and lower performance. Steven Wallace [WAL98] modified the SMT architecture to speculatively execute less-likely instructions in extra fetch slots and called it Threaded Multipath Execution. This increases single process performance by utilizing the mechanisms already in place for multithreading.

In 1991, Prasadh [PRA91] proposed interleaving instructions from multiple VLIW threads. The instructions were scheduled at compile time into independent VLIW groups of 4 instructions, then were dynamically interleaved from among all available threads to eliminate NOPs. Stephen Keckler and William Dally [KEC92] tried something similar, dividing instructions from compiler optimized VLIW threads among multiple Functional Unit clusters with a mechanism called processor coupling.

Matthew Farrens and Andrew Pleszkun [FAR91] used interleaving of instructions from multiple threads to mitigate the dependency problems of an in order issue deeply pipelined processor. This works only if enough threads are available. Single thread performance is poor.

Many examples exist of coarse grained threading, in which the switch takes several cycles to complete. Anant Agarwal [AGA92] proposed switching threads when network or memory latencies are encountered in the April Alewife multiprocessor. Richard Eickemeyer and his coworkers at IBM [EIC96] described using coarse grained switching on cache miss with transaction processing software. Soundararajan and Agarwal [SOU92] have a few hardware contexts, and many additional contexts in memory. The hardware contexts can hide cache miss latencies, while network latencies are hidden by a background switching of contexts to memory called dribbling registers.

Henk Corporaal [COR93] describes an architecture called Transport Triggered much like dataflow, in that programs are described as movements of data, and the operations are triggered when the data arrives. This can be considered very fine-grained multithreading, or simply a different way of expressing standard superscalar out of order execution.

1.3. Overview of this thesis

Chapter 2 describes what can be accomplished by adding multithreading to a microprocessor. Chapter 3 shows the architecture of the processor. Chapter 4 describes the simulator with its capabilities and limitations as they pertain to getting clear, realistic data. Chapter 5 goes into detail on the multimedia benchmarks used, analyzing their properties and determining their suitability to operation in a multithreaded environment. Chapter 6 looks at the numbers generated by the simulator, showing trends in thread performance, resource requirements and thread related extensions to the processor. Chapter 7 brings together the conclusions in the previous two chapters and discusses the feasibility and promises of multithreading.

Appendix A goes into more detail on some of the benchmarks, and what it took to convert scalar applications to multithreaded ones. Appendix B is the Simulator reference, describing all of the parameters that have been varied in this thesis. Appendix C is the Showstats reference, explaining all of the fields in the data tables in both equation form and detailed description. Appendix D contains some of the raw data, as well as instructions on accessing all of the data in electronic form for those curious people with an idea that I missed something.

2. THEORY OF OPERATION

2.1. A multithreading primer

Programs are expressed as a series of operations. Some require the results of a previous instruction, and are said to be dependent. For example: X = A + B + C + D could be programmed as three instructions: Y = A + B, followed by Z = Y + C, followed by X = Z + D. As can be seen in the dataflow diagram Figure 2.1a, the second instruction is dependent on the first. The third is dependent on the second. This takes 3 cycles to complete since none of the instructions are independent.

Figure 2.1. Dataflow diagram for X=A+B+C+D before (a) and after (b) optimization.

The compiler could re-organize the program into Y = A + B, Z = C + D, X = Y + Z. In this case the second instruction could begin before the first one completed, exposing more parallelism to a superscalar microprocessor. The dataflow diagram in Figure 2.1b shows that the third is still dependent on the first two, so the program takes 2 cycles to complete. This reduces execution time by 33%, a speedup of 1.5 times.
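
The cycle counts in this example can be checked with a short script. The following is a minimal sketch, assuming that each addition takes one cycle and that the processor can issue every ready instruction in the same cycle. The function name `schedule_depth` and the dictionary encoding of the program are illustrative only and are not part of the SDSP tools.

```python
def schedule_depth(program, inputs):
    """Return the cycles needed when every ready instruction issues at once.

    program maps a destination name to its two source names; each
    instruction is assumed to take exactly one cycle.
    """
    ready_at = {name: 0 for name in inputs}  # inputs are available at cycle 0
    remaining = dict(program)
    while remaining:
        progressed = False
        for dest, (a, b) in list(remaining.items()):
            if a in ready_at and b in ready_at:
                ready_at[dest] = max(ready_at[a], ready_at[b]) + 1
                del remaining[dest]
                progressed = True
        if not progressed:
            raise ValueError("unresolvable dependency")
    return max(ready_at[dest] for dest in program)

inputs = ["A", "B", "C", "D"]
serial = {"Y": ("A", "B"), "Z": ("Y", "C"), "X": ("Z", "D")}
balanced = {"Y": ("A", "B"), "Z": ("C", "D"), "X": ("Y", "Z")}

print(schedule_depth(serial, inputs))    # 3
print(schedule_depth(balanced, inputs))  # 2
```

The result is the length of the longest dependency chain, which is the quantity the compiler shortens when it re-organizes the expression.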

A superscalar processor generally has an Instruction Window containing several instructions listed in program order, from which it can pick and choose several instructions to execute every cycle. This has the limitation of only looking at a small piece of the program at a time, which seriously limits the amount of parallelism that can be exploited. The example equation given here could be effectively handled by today's superscalar processors because of the small size of the problem. If the + operation were something more complex with many instructions, a superscalar processor would be unable to see the parallelism.

The programmer or compiler can simplify the problem by splitting the program into threads as shown in Figure 2.2. Each thread has no dependencies between it and other threads of operation. This allows the threads to all be executed at the same time without the processor having to check for dependencies. In the example above, each equation now appears to the processor as part of independent instruction streams, and all can be executed at the same time, no matter how complex the operation is.

Figure 2.2. The same problem as a multithreaded program.

With just a few threads of execution, enough independent instructions can be found with a modest Instruction Window to saturate the execution ability of a superscalar processor which has a limited number of Functional Units.

2.2. Benefits and drawbacks of multithreading

The following sections discuss some of the benefits and problems to be expected in a processor capable of executing multiple instruction threads at the same time. These will be explored further in the results section later in the thesis to explain the experimental data. Table 2.1 summarizes them by category.

Table 2.1. Benefits and drawbacks of multithreading.

Category            PROs                                CONs
cache               • shared cache locality             • reduced locality
                    • ability to hide misses
                    • allows non-blocking ICache
scheduling          • reduced data dependencies         • bursty use of Functional Units
                    • improved load/store bypassing
                    • out of order commit
                    • stall sharing
branch prediction   • reduced need for accuracy         • destructive aliasing in BTB
                    • constructive aliasing in BTB
hardware            • simpler branch prediction         • increased resource contention
                    • increased resource utilization    • Thread Control Unit
                                                        • bigger register file
                                                        • more bits in RBIW and Scheduler
                                                        • non-blocking Instruction Cache
software            • reduced OS context switching      • more complex to write
                                                        • difficult to debug
                                                        • code growth

2.2.1. Cache Effects

The most common perception about multithreaded programs is that they have reduced cache locality. By having multiple different routines running concurrently, there is a potential for Instruction and Data Cache conflicts to occur. However, fine grained threads, such as those supported on this processor, are often executing the same section of code, so that the Instruction Cache entries used by one thread are the same ones that other threads will request. Data Cache locality is not likely to change significantly when a routine is multithreaded. The same data is going to be accessed by one thread in order, or by multiple threads slightly out of order. Whether the locality of reference for the Instruction or Data Cache increases or decreases depends on the nature of the code being executed.

The penalty for a cache miss is much less on a multithreaded microprocessor. For data misses, there are more independent instructions that belong to other threads that can be executed while the data is fetched. For Instruction Cache misses, again other threads are available to be fetched, so long as a non-blocking Instruction Cache is used. The non-blocking Instruction Cache is useless to a single threaded processor.

2.2.2. Instruction Scheduling

By interleaving blocks of instructions from different threads in the Instruction Window, fewer of the instructions available to the Scheduler have dependencies. This allows more instructions to be issued early in the Window, when they have more time to complete [TUL95].

A load instruction cannot be issued before a prior store instruction because the dependency check using addresses is too complicated for the Scheduling Unit [WAL93a, DAG94]. This can be relaxed for multiple threads, since each thread can have a separate sequential state. By allowing loads to pass stores from different threads, there are additional opportunities to better utilize the Load/Store Units.

A long latency instruction does not mean a stalled pipeline in a multithreaded processor. The only reason that instructions must commit in program order is the requirement for precise exception handling. By having independent threads in the window, different threads may commit out of order with respect to each other without sacrificing the exception handling ability. Thus, a stall by one thread would not stall the entire Instruction Window [GUL94]. By giving a thread with one of these instructions in the window a low priority, the processor will not have as many dependent instructions waiting and blocking other threads from using the processor's resources.

Some instructions take too long to execute and can never complete before reaching the end of the Instruction Window. These include Floating Point computations and Load instruction Data Cache (DCache) misses. In tight loops or synchronized threads, more than one thread is likely to have one of these instructions in the window at the same time. Since there is no data dependency between the threads, they can all be executing in different Functional Units (presuming there are enough resources) at the same time, turning several separate long stalls into just one. I call this effect Stall Sharing.

The same thing that leads to Stall Sharing, namely multiple threads executing the same code with the same type of resources being needed at the same time, may also result in resource shortages. If there are not enough resources available, some ready instructions may sit around waiting for a Functional Unit.
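
Both Stall Sharing and the resource shortage it can cause show up in a simple count of stall cycles. This is a minimal sketch under strong assumptions: each thread has exactly one long latency instruction, all of them become ready in the same cycle, and the Functional Units are not pipelined. The function name and the latency value are illustrative and are not taken from the simulator.

```python
import math

def stall_cycles(threads, latency, units):
    """Cycles until every thread's long latency instruction has finished.

    Assumes one such instruction per thread, all ready at cycle 0, and
    non-pipelined Functional Units that each handle one instruction at a time.
    """
    rounds = math.ceil(threads / units)  # groups that must run back to back
    return rounds * latency

latency = 20  # illustrative latency in cycles

print(stall_cycles(threads=4, latency=latency, units=1))  # 80: no sharing
print(stall_cycles(threads=4, latency=latency, units=4))  # 20: stalls fully overlap
print(stall_cycles(threads=4, latency=latency, units=2))  # 40: resource shortage
```

With enough units the four stalls collapse into one. With too few, ready instructions wait for a Functional Unit and only part of the benefit is realized.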

2.2.3. Branch Prediction

On a single threaded processor, a mispredicted branch results in 1 to 4 fetch cycles spent fetching useless instructions. With multiple active threads, some or all of these cycles will be spent fetching from other threads, minimizing the number of invalid instructions fetched. With the reduced dependencies mentioned above in instruction scheduling, branches are often resolved earlier in the window, further reducing bad branch fetching.

If two threads are executing loops of the same code, but with different data, the branch predictor can make poor predictions for one or both of those threads. For example, if low values branch one way while high values branch the other, and two threads are working the same routine through a dataset from opposite ends, the branch history of each thread would be different.

Alternatively, the two threads executing loops of the same code could have constructive effects on prediction accuracy. If the loops are doing the same type of operation in each thread, their branch patterns are likely to be similar, allowing the predictor to learn the pattern more quickly and thus make more correct decisions.

Again, whether branch prediction improves or degrades is dependent on the routines being executed. The branch predictor could be designed to keep predictions for each thread separate, but at the cost of increasing its size or reducing its effectiveness for a single thread.
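
The destructive case can be illustrated with a toy model. This is a minimal sketch assuming a single two bit saturating counter, two threads that alternate fetches, and an extreme workload in which one thread always takes the branch and the other never does. The class and function names are hypothetical, and the model is far simpler than the predictor used in the simulator.

```python
class TwoBitCounter:
    """Saturating counter: states 0-1 predict not taken, 2-3 predict taken."""
    def __init__(self, state=2):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

def accuracy(outcomes, counter_for):
    """outcomes is a list of (thread, taken); counter_for picks the counter used."""
    correct = 0
    for thread, taken in outcomes:
        counter = counter_for(thread)
        correct += counter.predict() == taken
        counter.update(taken)
    return correct / len(outcomes)

# Thread 0 always takes the branch, thread 1 never does; fetches alternate.
outcomes = [(t, t == 0) for _ in range(10) for t in (0, 1)]

shared = TwoBitCounter()
private = {0: TwoBitCounter(), 1: TwoBitCounter()}

print(accuracy(outcomes, lambda t: shared))      # 0.5
print(accuracy(outcomes, lambda t: private[t]))  # 0.95
```

Keeping a counter per thread removes the interference in this case, but it doubles the storage, which is the tradeoff described above.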

2.2.4. Hardware

With the lower dependence on branch prediction, a much simpler algorithm can be implemented without degrading performance as much as on a single threaded processor. This additional space saving can be used for implementing the additional hardware described below that multithreading requires.

The higher instruction throughput of a multithreaded processor will increase the resource utilization. This can be a good thing, considering that the same resources are now being used more effectively. It can also mean that some resources now become over utilized, and may become new bottlenecks. This can lead to decisions to add more resources like Functional Units or issue ports.

The only new Functional Unit is the Thread Control Unit. It handles forks and joins, as well as controlling exclusive_run locking, identifying thread numbers, and storing ipc_reg registers.

For each thread, a full set of registers is needed. This makes the register file for an n-way threaded processor n times as large. This processor architecture accesses registers only at instruction fetch and retire, one fetch block at a time, so only one thread's registers are read and only one thread's registers are written each cycle. This helps reduce the complexity of the large register file.

The Instruction Window needs to store the thread number for each instruction, which also acts as the upper bits of the register number during register remapping. The thread number is checked when scheduling the Load/Store Unit. The thread number also must be checked when instructions are being invalidated due to exceptions or branch mispredictions.
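
Using the thread number as the upper bits of the register number amounts to a simple concatenation. The sketch below assumes 32 architectural registers per thread (register 0 plus 31 general registers) and therefore a 5 bit register field. The exact bit layout and the function name are assumptions made for illustration.

```python
REGS_PER_THREAD = 32   # r0 to r31, as in the standard SDSP register set
REG_BITS = 5           # log2(32)

def physical_register(thread, reg):
    """Place the thread number in the bits above the 5 bit register number."""
    if not 0 <= reg < REGS_PER_THREAD:
        raise ValueError("register number out of range")
    return (thread << REG_BITS) | reg

print(physical_register(0, 7))   # 7
print(physical_register(3, 7))   # 103
```

Because the thread number is only prepended, the remapping logic treats registers from different threads as distinct names without any additional comparison hardware.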

The non-blocking Instruction Cache described above takes slightly more area than a blocking cache. This can be achieved by making the cache dual ported, or less expensively by adding additional buffering or partitioning into independent banks.

2.2.5. Software

Relative to lightweight hardware threads, Operating System controlled threads are expensive to switch. By having multiple lightweight thread contexts in hardware, the required number of Operating System controlled context switches is reduced. Since many of these multimedia applications already have many threads, the additional support in hardware will reduce the demands on the Operating System to schedule and switch them.

Multithreaded programs are more difficult to write and verify. Because of their non-sequential nature, data locking, atomic transactions, and explicit synchronization must often be used. Care must be taken to handle exceptions, which may be difficult in a parallel routine.

The additional instructions involved in forking threads, setting up their registers, handling data locks, synchronizing, and finally joining create a certain amount of code growth. These instructions do not contribute directly to the original program, but must now be executed as well. In the benchmarks simulated, this overhead was anywhere from 6 to over 100 instructions per thread fork.

3. ARCHITECTURE

3.1. Multithreaded SDSP architecture

The multithreaded extensions are a minor change to the SDSP architecture [WAL93a, WAL96]. Figure 3.1 shows the system block diagram. The only new item is the Thread Control Unit. Minor internal changes were made to other units to support or enhance the multithreaded operation.

3.1.1. Block Diagram

Figure 3.1. Processor Block Diagram

The Fetch Unit takes information from the Branch Prediction and Thread Control units to generate addresses for the instruction stream. Instructions come through the Fetch Unit, which scans them for branches that need predicting and passes them to the Instruction Window. In some configurations, prefetch buffering is also done here.

Instructions enter the Instruction Window, which is a FIFO of instructions to execute. The Reorder Buffer contains the source and destination register values or tags, and shifts in lock-step with the Instruction Window. Together, these two units are called the RBIW, after the two abbreviations. When an instruction in the RBIW is ready to execute, it is sent to one of the Functional Units. Results are passed back to the Reorder Buffer to update the register state and pass along to the following instructions.

If instructions are complete when they reach the bottom of the Instruction Window, they are retired. The result of any branches is passed back to the Branch Prediction Unit, and registers are committed to the Register File. If the instructions are not complete when they reach the end of the Instruction Window, the fetch process is stalled until the instructions complete.

Each unit is described in further detail below.

Cache / Memory Interface

The Instruction Cache contains a copy of the most recently used instruction memory contents. Instructions are fetched from here if they are present. If not, the requested fetch fails and the Scheduling Unit will select a different thread for the next cycle's fetch. Meanwhile, the data is requested from the next level of the memory hierarchy, L2 Cache or Main Memory. The Thread Scheduler is notified when the instructions are ready, so it can re-try the failed instruction fetch.

If a different thread misses while a cache fill is in progress, its request is queued up behind the current one. The latency between request and completion of a cache fill is modeled at 5 cycles, but consecutive requests complete at a minimum interval of 3 cycles, representing a pipelined fill.

The Data Cache works in much the same way, but communicates with the Load/Store Unit. When a load or store misses the cache, it is set aside while the cache fill progresses. Consecutive misses are queued up as in the Instruction Cache with the same delays: each fill completes at the later of 5 cycles after its request or 3 cycles after the previous fill.
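
The fill timing used for both caches can be written down directly. This is a minimal sketch of the rule as stated (5 cycle latency, 3 cycle minimum interval between completions). The function name and the list based interface are illustrative and do not reflect the simulator's internal structure.

```python
FILL_LATENCY = 5   # cycles from request to completion
FILL_INTERVAL = 3  # minimum cycles between consecutive completions

def fill_completion_times(request_cycles):
    """Return the cycle at which each queued cache fill completes."""
    done = []
    for request in sorted(request_cycles):
        finish = request + FILL_LATENCY
        if done:
            finish = max(finish, done[-1] + FILL_INTERVAL)
        done.append(finish)
    return done

print(fill_completion_times([0, 1, 2, 20]))  # [5, 8, 11, 25]
```

The three back to back misses complete 3 cycles apart, while the isolated miss at cycle 20 sees the full 5 cycle latency.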

Fetch Unit

Instructions come through the Fetch Unit on their way to the Instruction Window, are decoded, scanned for branches, and in some configurations are buffered.

Branch Prediction

For each block of instructions, a lookup is done in the Branch Target Buffer (BTB) to see if there is a record of this being a known branch. Two bit branch prediction [YOU95] is used to provide reasonable accuracy of the prediction. If a branch is predicted, a new address is used the next cycle in which that thread is fetched, otherwise, the next sequential address is used.
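
The lookup described above can be sketched as follows. This is an illustrative model only: the block size of 16 bytes, the dictionary used in place of a fixed size table, and the initial counter value are all assumptions and are not taken from the SDSP design.

```python
BLOCK_BYTES = 16  # hypothetical: four 4-byte instructions per fetch block

class BranchTargetBuffer:
    def __init__(self):
        self.entries = {}  # block address -> [target address, 2-bit counter]

    def next_fetch_address(self, block_addr):
        entry = self.entries.get(block_addr)
        if entry is not None and entry[1] >= 2:   # known branch, predicted taken
            return entry[0]
        return block_addr + BLOCK_BYTES           # next sequential block

    def update(self, block_addr, target, taken):
        """Record the resolved branch when its block retires."""
        entry = self.entries.setdefault(block_addr, [target, 1])
        entry[0] = target
        entry[1] = min(3, entry[1] + 1) if taken else max(0, entry[1] - 1)

btb = BranchTargetBuffer()
print(hex(btb.next_fetch_address(0x100)))  # 0x110: no record, fall through
btb.update(0x100, target=0x40, taken=True)
print(hex(btb.next_fetch_address(0x100)))  # 0x40: predicted taken
```

In a multithreaded processor the returned address would be held until that thread is next selected for fetch, rather than being used on the very next cycle.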

Prefetch Buffer

The prefetch buffer may be used to store up instructions that are ready to enter the Instruction Window. In the simplest model, no prefetch buffering occurs and only instructions that fall within the cache block are passed on. The next model called self-aligned or dual fetch, does not buffer the instructions, but always fetches two consecutive cache lines and aligns them to remove any empty instruction slots at the beginning of the block. The full prefetching model described by Wallace [WAL96], but unused in this thesis, uses a double length cache line, and keeps the unused instructions in its buffer for the next time that thread is fetched. With this, most instruction alignment problems are eliminated, but upon reaching a mispredicted branch, this must also be invalidated. The final prefetch model, ideal, simply assumes that this prefetching works 100% of the time and every cycle can fetch a full fetch block of instructions.
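
The difference between the simplest model and the self-aligned model is easiest to see by counting instructions delivered per fetch. The sketch below assumes a fetch block of 4 instructions that is the same size as a cache block, addresses counted in instructions, and no branches inside the block. The names and the block size are illustrative.

```python
BLOCK = 4  # hypothetical fetch block size in instructions

def fetched_simple(pc):
    """No prefetch buffer: only instructions from pc to the end of its cache block."""
    return BLOCK - (pc % BLOCK)

def fetched_dual(pc):
    """Self-aligned (dual fetch): two consecutive lines are read and aligned."""
    return BLOCK

for pc in range(4, 8):
    print(pc, fetched_simple(pc), fetched_dual(pc))
```

Under these assumptions a fetch that starts at a uniformly random offset delivers 2.5 instructions on average in the simple model and 4 in the self-aligned model.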

Reorder Buffer / Instruction Window (RBIW)

Together, the Reorder Buffer and the Instruction Window make up the core of the processor. Instructions are kept in order here while executed out of order by the Functional Units. Registers are kept with instructions. If the register contents are known, the value is read during decode and stored in the Reorder Buffer. If the value has not yet been calculated by an earlier instruction still in the window, the tag number for that upcoming result is stored instead. When the earlier instruction completes, the tag is replaced by the result. When all operands of an instruction are ready, the instruction is eligible for scheduling to a Functional Unit the following cycle. The scheduler gives priority to the oldest instructions that are ready. When an instruction is finally assigned to a Functional Unit, it passes its source register values along, and waits for the result.

Each cycle, instructions in the bottom of the RBIW are checked for completion. If all instructions in the bottom block are done, their destination registers are committed to the Register File, and the instructions are retired (discarded). The result of branches is also passed back to the Branch Prediction Unit to update its history.
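
The tag mechanism and oldest-first selection can be modeled in a few lines. This is a minimal sketch, not the RBIW implementation: the `Entry` class, the tuple encoding of operands, and the helper names are all hypothetical, and timing, Functional Unit types, and retirement are left out.

```python
class Entry:
    def __init__(self, tag, sources):
        self.tag = tag
        # Each source is ("val", number) when known or ("tag", producer) when pending.
        self.sources = list(sources)
        self.issued = False

    def ready(self):
        return all(kind == "val" for kind, _ in self.sources)

def broadcast(window, tag, result):
    """A completed instruction replaces every matching tag with its result."""
    for entry in window:
        entry.sources = [("val", result) if src == ("tag", tag) else src
                         for src in entry.sources]

def pick_oldest_ready(window):
    """window[0] is the oldest entry; return the first ready, unissued one."""
    for entry in window:
        if entry.ready() and not entry.issued:
            entry.issued = True
            return entry
    return None

# Y = A + B (tag 0), then X = Y + C (tag 1), with A=1, B=2, C=3.
window = [Entry(0, [("val", 1), ("val", 2)]),
          Entry(1, [("tag", 0), ("val", 3)])]

first = pick_oldest_ready(window)
print(first.tag, pick_oldest_ready(window))              # 0 None
broadcast(window, first.tag, 1 + 2)
print(pick_oldest_ready(window).tag, window[1].sources)  # 1 [('val', 3), ('val', 3)]
```

The second instruction cannot issue until the result for tag 0 is broadcast, which is the dependency that interleaving blocks from other threads helps to hide.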

Functional Units

There are several Functional Units so that more than one instruction may be processed at the same time. Each instruction is assigned a source and destination data path to the RBIW. There are Integer, Floating Point, Load/Store, and Thread Units.

The Integer Units are the most plentiful, and can process any integer arithmetic or logic operation. Some Units may not have multiply or divide capability, as these operations are less common. The scheduler is aware of which Units can perform which operations.

The Floating Point Units are more expensive, in terms of real estate and turnaround time. It may take several cycles for a complex floating point operation to return its value. Some types of operations could be pipelined through this unit, so more than one operation may be in progress at any one time. Others, such as divide, do not allow pipelined operation, and will block further instructions from entering. Table 3.1 shows the latency of each floating point instruction.

Table 3.1. Latency of Floating Point instructions

Latency   Single Precision Floating Point Instructions   Latency   Double Precision Floating Point Instructions
1         abs_s, neg_s                                   2         abs_d, neg_d
3         c_eq_s, c_ne_s, c_ge_s, c_lt_s                 3         c_eq_d, c_ne_d, c_ge_d, c_lt_d
5         cvt_d_s, cvt_s_w, cvt_w_s                      5         cvt_d_w, cvt_s_d, cvt_w_d
5         round_w_s, trunc_w_s, ceil_w_s, floor_w_s      5         round_w_d, trunc_w_d, ceil_w_d, floor_w_d
10        add_s, sub_s                                   15        add_d, sub_d
10        mul_s                                          15        mul_d
20        div_s                                          25        div_d
40        sqrt_s                                         80        sqrt_d

The Load/Store Unit is capable of queuing up store operations, while safely doing loads by checking the queue before reading memory. In the event of an exception in the RBIW, some queued up stores may need to be invalidated, so this unit is notified in such an event. Note that stores cannot exit the Store Queue until after their corresponding instruction has retired from the Instruction Window, as the potential exists for an exception.
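
The Store Queue behavior described above can be sketched as a small model. It assumes a single thread, word granularity addresses, and a sequence number per instruction. The class and method names are hypothetical and do not correspond to the simulator source.

```python
class LoadStoreUnit:
    def __init__(self, memory):
        self.memory = memory   # address -> value
        self.queue = []        # pending stores, oldest first: [seq, address, value]

    def store(self, seq, address, value):
        self.queue.append([seq, address, value])

    def load(self, address):
        """Check the Store Queue (youngest first) before reading memory."""
        for _, addr, value in reversed(self.queue):
            if addr == address:
                return value
        return self.memory.get(address, 0)

    def retire(self, seq):
        """The store's instruction has retired, so it may now update memory."""
        for item in list(self.queue):
            if item[0] == seq:
                self.memory[item[1]] = item[2]
                self.queue.remove(item)

    def invalidate_after(self, seq):
        """On an exception, drop queued stores younger than instruction seq."""
        self.queue = [item for item in self.queue if item[0] <= seq]

lsu = LoadStoreUnit({0x10: 1})
lsu.store(seq=5, address=0x10, value=42)
lsu.store(seq=9, address=0x10, value=99)
print(lsu.load(0x10))        # 99: newest queued store wins
lsu.invalidate_after(5)      # exception squashes the store with seq 9
print(lsu.load(0x10))        # 42
lsu.retire(5)
print(lsu.memory[0x10], lsu.queue)  # 42 []
```

Memory is only written in `retire`, which mirrors the rule that a store cannot leave the queue until its instruction has retired from the Instruction Window.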

3.2. New and Modified Units

3.2.1. Thread Control

The Thread Control Unit handles m_fork() and m_join() instructions, as well as maintaining the state of all IPC_reg registers. It also communicates with the Fetch Unit to do a round-robin fetch of the different threads. From the Instruction Window, it appears as a Functional Unit.

3.2.2. Registers

The standard SDSP register set consists of 31 integer registers plus register 0 which is hardwired to a value of zero. Each thread, however, has an independent set of these 31 registers. The initial value in the registers for a new thread is undefined after a fork. This means the hardware does not need to copy or initialize the register contents every time a new thread is started. The penalty is that register variables cannot be used across thread forks. For the case where a procedure call immediately follows a fork, this should not be a problem, but for forking inside a tight loop, this can be disadvantageous, as the data must be stored in memory or one of the IPC_reg registers.

Copying a subset of the registers on each fork would minimize this problem. To minimize the architecture changes, I chose not to do this, and instead let the software copy the registers it needs on a case-by-case basis.

3.3. Instruction Set

The SDSP architecture defined by Steve Wallace [WAL93a] has a RISC instruction set. All operations are register-to-register, and contain up to two source and one destination registers. In addition to the existing instruction set, several new instructions to support multithreading were added.

3.3.1. New instructions

The instructions listed below are shown as they appear in C sourcecode, much like function calls, but they are compiled into opcodes by gcc_sdsp.

Table 3.2. Abbreviations used in instruction definitions.
R        IPC_reg register number
N        an integer number
on/off   a boolean (an integer in C)
D        an integer data value

m_fork(R), m_fork_n(R,N)

The fork instructions are the core of the multithreaded instructions. When an m_fork() instruction is encountered, a new thread context is created. The new thread has the same PC and register contents. The variation m_fork_n() takes an additional input N, and creates several threads. Each time a new thread is created, the IPC_REG specified by R is incremented.

m_join(R), m_kill_all()

The m_join() instruction is used to terminate threads. The IPC_REG specified by R is decremented. If the number goes below 0, then this was the last thread to reach the m_join() and is allowed to continue. Otherwise, the thread is terminated and its resources are de-allocated. In private-memory simulator mode, thread 0 is always the one to continue after a join because it simulates the quickest. The m_kill_all() instruction is used primarily for catastrophic error handling, as it immediately terminates all threads except one. It can also be used for some applications in which many threads are looking for a solution, and the first to find one can call off the hunt by killing all other threads. This is not used in any of my benchmarks except in its error handling role.

m_exclusive_run(on/off)

For atomic transactions, the m_exclusive_run() instruction can be used to suspend all other threads during the critical section of code. This is particularly useful for isolating sections of the benchmark that contain non-reentrant library calls (notably malloc and printf), or merely sections that behave poorly in a multithreaded environment. If a join is encountered during exclusive_run, then exclusive_run is turned off.

m_thread_num()

The instruction m_thread_num() allows a thread to find out its own thread number. Threads can use this information any way they want. In most benchmarks, it is used to determine which subset of the data each thread processes. It is also useful for identifying which thread is the parent and which is the child. The return value from m_fork() is this same number, but because of the limitations of using stack registers near a fork, it is often convenient to have this as a separate instruction.

m_read_ipc_reg(R), m_write_ipc_reg(R,D), m_inc_ipc_reg(R), m_dec_ipc_reg(R)

Inter-Process Communication Registers (ipc_regs) are the best way for threads to pass data to each other. These registers are used to count forks and joins, but may also be used by the threads to pass results, coordinate data, or synchronize operation. In addition to the read and write commands, atomic increment with read and decrement with read are available. This is useful for allocating which thread should work on which subset of data. A job queue can be set up, with an ipc_reg acting as the index. Each thread can do an m_inc_ipc_reg() on that index to get its job assignment, knowing that it has gotten a unique result.
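
The job queue idea can be sketched on a host machine with C11 atomics standing in for the IPC registers. The register count, the function names, and the choice to return the value held before the increment are assumptions made for this example; the text above does not specify which value the instruction returns.

```c
#include <assert.h>
#include <stdatomic.h>

#define NUM_IPC_REGS 8            /* hypothetical register count */

static atomic_int ipc_reg[NUM_IPC_REGS];   /* zero-initialized */

/* Host stand-in for m_inc_ipc_reg(R): atomically increment register r
 * and return the value it held before the increment (assumed). */
int model_inc_ipc_reg(int r)
{
    assert(r >= 0 && r < NUM_IPC_REGS);
    return atomic_fetch_add(&ipc_reg[r], 1);
}

/* Claim the next job index from the queue indexed by register r.
 * Returns -1 once all num_jobs jobs have been handed out. */
int next_job(int r, int num_jobs)
{
    int job = model_inc_ipc_reg(r);
    return job < num_jobs ? job : -1;
}
```

Because the increment and the read happen as one atomic step, no two callers can receive the same job index, which is the property the job queue relies on.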

3.3.2. Instructions for simulation only

m_get_shared(), m_set_shared()

These two instructions are used to explicitly synchronize data between threads when using the Private Memory implementation of the simulator. They transfer a block of memory between threads. For single-processor machines like the one I am studying, these instructions are not intended to be implemented in hardware; they are a workaround for the limitations of the simulator. For multiprocessor systems, a DMA subsystem could be built to handle the explicit memory copying.

In some benchmark programs, the threads should be re-written to use Shared Memory, but since they were originally written for single thread operation, threads have many places where they stomp on each other's operation. Without a compiler that understands threads, these would need to be painstakingly re-written, a rather time-consuming operation. Instead, the simulator can be placed into private-memory mode, which completely isolates the threads in their own copy of the memory space. When results need to be passed, these instructions transfer just the data needed.

m_quick_run(on/off)

This instruction is used to speed up the simulation process. Since initialization code is often not threaded, and adds little to the benchmark's value, using m_quick_run() allows the programmer to define sections of the benchmark to run through a subset of the simulator. During quick_run, only a single thread may be executed, and it does not go through the superscalar model. This results in a 100-fold increase in speed in these sections of the benchmarks. Statistics are kept separately while in this mode, and are not included in the analysis done here.

The primary use for this instruction is the ray trace benchmark pov. During the image description file read and parsing, no attempt was made to thread the benchmark. For complex images, this takes a long time in the simulator, but is a small portion of the actual benchmark. By using quick_run through this portion, the more complex scenes can be simulated within a reasonable time.

If -hybrid is not included on the simulator command line, m_quick_run() instructions are ignored and treated as no-ops.

m_marker(N)

This instruction simply prints a debug string to standard out:

"cycle: MARK [thread] = N"

It can be used to trace execution order within multithreaded code, or to print an integer variable quickly without requiring the benchmark to deal with the overhead of a printf call.

3.4. Sharing of memory between threads

The architecture specifies that all threads share the same memory space. Some modifications were made to the simulator to make Private Memory available when required, which simplifies porting of code not originally designed to work in Shared Memory. There is no mechanism in the defined architecture to support this, but the following chapter discusses the practical limits of bending the rules while still getting a reasonable estimate of multithreaded performance.

4. SIMULATOR
simulator: A device that enables the operator to reproduce or represent under test conditions phenomena likely to occur in actual performance. --Webster's Dictionary.

4.1. Overview

The objective of this simulator is to test the architecture defined in the previous chapter, and to determine the effectiveness of using threads to increase parallelism in a superscalar processor. To simplify the overwhelming task of porting a variety of benchmarks to the multithreaded model, two variations were made to the simulator that do not exactly follow the architecture, but are approximations of it in certain situations. These are described below in the sections on Private Memory and Multiprogram Memory Models.

This chapter will first show the structure of the simulator and its capabilities. It then covers the various memory models. Next comes a description of some of the internal workings of the simulator, and finally a list of limitations and potential improvements.

4.2. Structure

The basic SDSP simulator, from which this simulator was developed, consists of six main sections: Initialization, Scalar Preprocess Execution, Superscalar Modeling, Thread Scheduling, User Interface, and Statistical Data Generation. These are described in the following sections.

4.2.1. Initialization

This set of procedures parses the command line, creates and clears all data structures and statistics to be used, and reads the benchmark into memory, performing remapping and symbol table resolution. Any symbols not resolved within the benchmark are checked against Operating System calls known by the simulator, and replaced with traps to be handled by the simulator if required. If the inputs are tracefiles, the files are opened, piping them through gunzip if necessary.
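
Opening a tracefile that may be compressed can be sketched with the POSIX popen() call. This is not the simulator's code: the function names are hypothetical, and the fragment assumes a POSIX host with gunzip on the search path and a file name free of shell metacharacters.

```c
#define _POSIX_C_SOURCE 200809L   /* for popen() */
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if path ends in ".gz", otherwise 0. */
int has_gz_extension(const char *path)
{
    size_t n;
    assert(path != NULL);
    n = strlen(path);
    return n > 3 && strcmp(path + n - 3, ".gz") == 0;
}

/* Open a tracefile for reading. On return, *is_pipe tells the caller
 * whether the stream must be closed with pclose() or with fclose().
 * Returns NULL on failure. */
FILE *open_trace(const char *path, int *is_pipe)
{
    char cmd[1024];
    int n;
    assert(path != NULL && is_pipe != NULL);
    *is_pipe = has_gz_extension(path);
    if (!*is_pipe)
        return fopen(path, "rb");
    /* Assumption: path contains no quote or other shell metacharacters. */
    n = snprintf(cmd, sizeof cmd, "gunzip -c '%s'", path);
    if (n < 0 || (size_t)n >= sizeof cmd)
        return NULL;
    return popen(cmd, "r");
}
```

The caller reads instructions from the returned stream in the same way in both cases; only the close call differs.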

4.2.2. Scalar Preprocess Execution

These routines execute the instructions in order. They do all the ALU, Memory, and Operating System calls. This generates a trace of instructions that are passed along to the superscalar routines. In the Private Memory mode, these may be executed by different Unix processes. If a tracefile is to be generated, the executed instructions are written to disk. If a tracefile is to be read, one instruction at a time is read from the disk and passed on to the superscalar model.

Statistics are kept for procedure profiling and instruction class frequency.

4.2.3. Superscalar Modeling

The superscalar section 'fetches' instructions from the scalar preprocess routines for each thread. It puts the pre-executed instructions into the Reorder Buffer/Instruction Window (RBIW), performing register renaming. Each cycle, it scans the Window for ready instructions, allocates the execution resources, and then marks the instructions as completed. The oldest instructions in the window are retired if they are complete. If not, then fetching and shifting of instructions is prevented, creating stall cycles. This entire procedure is iterated for each simulated clock cycle.
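
The retire-or-stall rule can be illustrated with a very small model of the window. The block size, the number of rows, and the names below are assumptions chosen for the sketch, and renaming, issue, and fetch are left out entirely.

```c
#include <assert.h>
#include <string.h>

#define BLOCK 4                   /* hypothetical instructions per row */
#define ROWS  8                   /* hypothetical window depth in rows */

struct window {
    int done[ROWS][BLOCK];        /* 1 once an instruction has completed */
    int rows;                     /* rows in use; row 0 is the oldest */
};

/* One retire step. Returns 1 if the oldest row retired and the window
 * shifted, or 0 if nothing retired (an empty window or a stall). */
int retire_step(struct window *w)
{
    int i;
    assert(w != NULL && w->rows >= 0 && w->rows <= ROWS);
    if (w->rows == 0)
        return 0;
    for (i = 0; i < BLOCK; i++)
        if (!w->done[0][i])
            return 0;             /* oldest row incomplete: no shift this cycle */
    memmove(&w->done[0], &w->done[1],
            (size_t)(w->rows - 1) * sizeof w->done[0]);
    w->rows--;
    return 1;
}
```

A full cycle loop would call a step like this once per simulated clock and allow fetching only in cycles where the window was able to shift.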

Statistics are kept for items such as fetch alignment, branch prediction, Scheduling Unit stalls, and resource utilization.

4.2.4. Thread Scheduling

Threads are scheduled in each cycle for the following clock. The default algorithm uses a Least Recently Used (LRU) method to select threads that have not been fetched in a while. Alternatively, a simple Round Robin can be used. The third algorithm available, Count, orders the threads based on the number of instructions each thread has in the Instruction Window [LO98]. Priority modifiers can be specified that cause skipping of threads that have an unknown predicted branch, a known mispredicted branch, an Instruction Cache miss, or a floating point instruction in the Window.

Statistics are kept for the number of thread switches attempted, and for the occasions when exclusive_thread was active or only a single thread was available.

4.2.5. User Interface

An optional graphical interface using X, shown in Figure 4.1, displays the contents of the RBIW and provides controls for stepping through the program one cycle at a time. Breakpoints can be defined and registers examined to help in debugging. Info mode can be toggled on and off to display a detailed trace of what is going on inside the simulator.

Figure 4.1. ss simulator user interface.

4.2.6. Statistical Data Generation

Scalar, superscalar, and multithreaded statistics are gathered throughout the program. Upon completion of the simulation, statistics can be printed, or saved as a binary file. This can be parsed later with the program showstats, or combined with other outputs to generate a variety of statistical reports.

4.3. Thread Memory Management

Three memory models are implemented in the simulator: Shared Memory, Private Memory, and Multiprogram. Figure 4.2 shows how memory is allocated immediately after a fork in the program. For the Shared Memory Model, the thread gets a new stack pointer, but all memory is shared. For the Private Memory Model, the thread gets a copy of the current data and stack segments. For the Multiprogram Memory Model, the thread is allocated unique addresses for instruction, data, and stack.

Figure 4.2. Memory Block diagrams immediately after a fork in each memory model.
I = program instruction memory, Data = static data and heap,
Stack = program stack (temporary variables, procedure calls, etc.).

4.3.1. Shared Memory Model

The Shared Memory Model is the one which was described in the Architecture chapter, and is the default model. A new thread gets a unique stack pointer, but shares all memory with the other threads. Instructions begin fetching from the very next address for both threads.

Care must be taken when writing benchmarks for this model that no thread interferes with another thread's use of memory. Global variables may be read freely, but writes should be controlled by locks or in atomic blocks to prevent data corruption. Stack data may not be used across a fork because the child will not see the same stack pointer as the parent. One way of avoiding this is to do a procedure call immediately after the fork, and declare all local variables there. Each thread will place its local variables in a different stack. Another way is to use the interprocess communication registers, which all threads have shared access to.

One more problem is that some of the standard C library routines are not reentrant. Routines like malloc and printf have been seen to break when called by two Shared Memory threads at the same time. Either avoid these calls in a threaded section, or make them atomic with the m_exclusive_run() instruction.

4.3.2. Private (Non-Shared) Memory Model

The Private Memory Model was implemented to make the porting of benchmarks easier. Some benchmarks were not written with shared-memory multithreading in mind, and have large amounts of read/write global data structures. To port these to the Shared Memory Model would require rewriting each of these data structures. A good compiler could handle this, but one is not currently available for this architecture. To keep from having to extensively rewrite benchmarks, each thread gets a copy of the data and stack memories. Each thread can continue using the memory as if it were private, and program operation is unaffected.

When data synchronization is required, the new instructions m_set_shared() and m_get_shared() can copy a block of data from one thread to another. This is used in the benchmarks to write their results to thread 0 before terminating. In this model only, thread 0 is always the thread to continue after a join. If it reaches the join before other threads, it goes to sleep until all others have terminated. This simplifies the explicit sharing of data, and also speeds simulation, as discussed below in Section 4.5, Simulator internals.

The drawback to this is that realism is sacrificed. Two diverging copies of data memory are aliased onto the same memory addresses. The Data Cache model treats these as identical for determining cache hits or misses, so one can expect higher hit rates than normal.

A hardware implementation of this would be difficult at best. A DMA engine would be required. To avoid copying the entire Data and Stack spaces before starting the child process, a map could be maintained, much like a dirty cache indication, so that memory is duplicated only when a write occurs. This would be expensive to implement, and is not proposed as a proper solution. Instead, treat this model as an approximation to the Shared Memory Model. Only the occasional write to this memory would result in an additional cache miss. The additional benchmarks this makes available make up for the small loss in accuracy.

4.3.3. Multiprogram Memory Model

The Multiprogram Memory Model allows independent programs to be run as if they were threads. They are all mapped into memory space, but since each has its own address range, no conflicts occur. This model is expected to have higher Instruction Cache miss rates, since there is no code sharing going on. Also, these independent programs may have addresses that alias to the same block in the cache, further increasing the miss rate.

This is implemented in the simulator as multiple trace reads. Each benchmark program is run separately and creates a long instruction trace file. The multithreaded simulator can then read multiple tracefiles as threads, remapping the address as each instruction is read.

The multiprogram model could be implemented in a real processor simply by having Operating System support that can handle the process threads.

This model may also be an approximation of programs that have multiple threads doing completely different functions like user interface, file I/O, and data processing all in parallel. They would look much the same as this model, with different threads accessing different instruction routines and different data, while still being in one program. Since none of my explicitly threaded benchmarks use this popular variety of threading, the multiprogram model is a good approximation and adds to the diversity of benchmark architectures to study.

4.4. Atomic transactions

atom: an indivisible particle. --Webster's Dictionary

As in all parallel processing environments, some activities need to be handled atomically, such that no other thread's activities could interfere or change a value in the middle of another's critical section of code. In a multithreaded microprocessor, this is quite easy to do. The instruction m_exclusive_run() temporarily locks one thread into exclusive execution. When the transaction is complete, it is called again to unlock the thread. This locking can be arbitrarily nested, such that procedure calls which have locked sections can be called within locked sections of code, and the thread is only unlocked when the number of unlocks equals the number of locks. If a join is encountered, any exclusive_run lock is broken to prevent processor deadlock.
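
The nesting rule amounts to a depth counter, as the following host-side model shows. The names are hypothetical, and the choice to ignore an unlock that has no matching lock is an assumption of the sketch.

```c
#include <assert.h>

static int lock_depth = 0;        /* locks not yet matched by an unlock */

/* Model of m_exclusive_run(on/off). Returns 1 while the calling thread
 * still holds exclusive execution, 0 once all threads may run again. */
int model_exclusive_run(int on)
{
    assert(on == 0 || on == 1);
    if (on)
        lock_depth++;
    else if (lock_depth > 0)
        lock_depth--;             /* assumed: an unmatched unlock is ignored */
    return lock_depth > 0;
}

/* Model of a join reached inside a locked section: any lock is broken. */
void model_join_breaks_lock(void)
{
    lock_depth = 0;
}
```

A procedure that locks and unlocks internally can therefore be called from code that is already locked: the outer lock remains in force until its own unlock brings the depth back to zero.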

This locking mechanism can allow the unmodified use of non-reentrant library calls within Shared Memory simulations by placing a lock before and an unlock after. This can also be used to track down a section of code that is causing cross-thread interference, by selectively placing locks and unlocks and running the program to see when the problem disappears. It can usually be narrowed down to a single line of offending code.

There is a slight performance penalty associated with an exclusive_run section of code. Any time that an exclusive instruction is anywhere in the Instruction Window, the normally loose rules for loads and stores from different threads being allowed to issue out of order are tightened to prevent something from interfering with the atomic transaction. This tightening of the rules also applies if a fork or join instruction is in the window, as these are also synchronization commands.

4.5. Simulator internals

With the Shared Memory Model, all threads execute in the same Unix process. This runs quite fast and efficiently. All memory conflict resolution is handled by the benchmark program, instead of the simulator, and is thus the more difficult model to program for.

For the second model, Private Memory, the simulator forks off a separate child process for each thread, as shown in Figure 4.3. This allows the benchmark to execute without much fear of side effects due to other threads. The penalty for this is speed. The act of forking under Unix is a relatively big event, duplicating the context of the process. In addition to this overhead as each thread starts, all inter-process communication must now be explicit. With threads executing in separate processes from the superscalar model of the processor, every instruction must be passed through this communication channel. Note that thread 0 is always executed by the same process as the superscalar simulator. This eliminates the communication overhead for one of the threads and speeds simulation.

Figure 4.3. Communication between simulator processes in the Private Memory mode.
Bold rectangles indicate separate Unix processes and arrows indicate pipes.

The separate processes communicate through pipes. Each thread gets two pipes exclusively allocated to it. One feeds a stream of instructions from the scalar execution portion to the superscalar process. Another pipe called the command pipe sends information to the threads when needed such as:

For the third memory model, Multiprogram, each program is pre-executed by a separate simulation, which generates a tracefile. The Multiprogram simulator then reads each of these tracefiles. If the file has a .gz extension, a pipe is created through the gunzip program to decompress the stream on-the-fly. This has about a 10% speed penalty for the simulation, but results in much lower storage requirements. Each instruction read from the stream is remapped into its own address space by adding a constant offset per thread. Any addresses used by the instructions such as loads, stores, and control transfers are also modified. From there, the instructions are passed into the Shared Memory Model of the simulator and processed normally.

4.6. Simulator limitations and potential improvements

No simulator is perfect in its ability to represent real operating conditions. This simulator is very accurate in many ways, but does have its share of limitations. The following list describes all of the ones I have become aware of.

5. BENCHMARKS

5.1. The Multimedia Desktop Environment

Previous studies of multithreading have focused on high performance scientific or server applications. I show in these simulations that the multimedia desktop has much to gain from the same techniques. Significant data parallelism, real-time latency intolerance, and interfacing to low bandwidth I/O devices all fit the multithreaded model's best aspects.

Most of the tasks to be performed on a multimedia desktop machine are easily adaptable to multithreaded operation, and many already are, because of the inherent parallelism in them. Also, since multimedia implies more than one type of media at a time, many of these will be running concurrently. Lack of performance may readily be seen by the user as long latencies or poor quality, and detracts from the system's usefulness. Table 5.1 shows how the selection of benchmarks is representative of many of these categories of tasks, giving insight into whether they provide an appropriate mix of instructions for a multithreaded microprocessor.

Table 5.1. Some typical desktop tasks in a multimedia desktop environment.
Items in bold are represented by benchmarks in this thesis.
Media         Example uses                                                                        Benchmarks
Video         play movies, video conferencing, movie authoring, 3D games, gesture recognition     mpegd, mpeg2e
Still Images  image filtering, compression, scene rendering, object recognition                   nlfilt, cjpeg, pov, xmountains
Audio         compression, filtering, speech synthesis, voice recognition, music synthesis        sox, say
Text          OCR, handwriting recognition, spellcheck, translation, searching                    diff
I/O           file compression, virus scanning, Internet, printer preprocessing, home automation  gzip
AI            artificial intelligence, remote agents

5.2. Programming environment

The benchmarks used were written in C and are freely available as UNIX source code. The programs were compiled using the SDSP version of the GNU C compiler. The -O2 optimization option was usually used. Threads are added by including m_fork_n() procedure calls in the C source code. These are left as unresolved links by the compiler and linker, and are interpreted by the simulator as extensions of the processor instruction set.

Not wanting to completely re-write the benchmarks, I looked for easy ways to add multithreading. The main data processing loops in the program have been divided into multiple threads. In this way, concurrency is increased without too much overhead of initiating threads.

For some applications, the processing is done in nested loops. This leads to the dilemma of where to place the threading. With an inner loop, more forking and joining adds more overhead. With a long outer loop, any imbalance in the workload can result in a significant period of time where a limited number of threads finish the remaining iterations after the others have completed. In general, coarse grained threading was preferred, but in mpeg2e, a relatively fine grained threading was used. See Appendix A for details on how the benchmarks were threaded.

5.3. Discussion of the individual benchmarks and datasets

This section describes the benchmarks used. First, the benchmarks are described with their data sets. Then, statistics are shown for each benchmark with a default hardware configuration.

5.3.1. Workload and datasets

Name        Data in                 Data out                   Instruction Count
mpeg2e      3 image frames 320x280  3 frame MPEG2 movie        35,918,652

Mpeg2e is a video compression program. It reads a set of images, and creates an mpeg2 video stream consisting of I, B, and P type frames. This dataset has one frame of each type and comes from the popular ping-pong sequence, but scaled down to provide smaller images more suitable to simulation, as well as more appropriate for video conferencing applications which demand real-time performance. All other parameters were set to their defaults provided with the sourcecode [MPE94].

Name        Data in                 Data out                   Instruction Count
nlfilt      100x100 image           100x100 image              19,192,033

Nlfilt is a non-linear image filter. The edge enhance module simulated here is typical of image processing. Only the image processing routine was included; file I/O was not. This was chosen because the typical interactive image editing application loads an image once, then performs a series of filters interacting with the user in real time. Real time performance is the key factor in user interaction. The sourcecode was taken from the netpbm library [NET93] routine called pnmnlfilt [PNM93]. The code was modified by replacing all library references with local routines that use a simple binary file format, trading flexibility for simulation speed.

Name        Data in                 Data out                   Instruction Count
pov         simple.pov scene desc.  25x25 pixel image          7,962,116

Pov is the Persistence of Vision ray tracer [POV96]. It creates an image by tracking rays of light as they bounce around a scene full of geometrically defined objects. This may be used to view a proposed architectural design, or a fantasy world, in fine detail. It is not typically run in real time, as there are less precise methods available for quick previews or real-time games.

Name        Data in                 Data out                   Instruction Count
isuite      N/A                     N/A                        8,000,000
gzip        Text file               compressed file            2,000,000
mpegd       3 frame MPEG1 movie     3 individual frames        2,000,000
diff        Two text files          list of lines that differ  2,000,000
cjpeg       gif image               jpeg image                 2,000,000

Isuite consists of four integer programs run in parallel. Each program was cut off at 2,000,000 instructions to balance the load. Gzip compresses data files to conserve storage or network bandwidth [GZI93, LZ77]. Mpegd decompresses a video stream in real time for immediate display of movies, game animation, or downloaded entertainment content [PVR93]. Diff is a typical text filter representative of a whole class of applications that consist primarily of file I/O with limited computation [DIF93]. Cjpeg compresses images for storage and transmission. It is also used in very low power processors with low clock rates in digital cameras where speed and efficiency are very important [JPG96, WAL91].

Name        Data in                 Data out                   Instruction Count
fsuite      N/A                     N/A                        8,000,000
xmountains  seed number             image of fractal terrain   2,000,000
whet        iterations              N/A                        2,000,000
say         text phrase             audio speech file          2,000,000
sox         audio file              filtered audio file        2,000,000

Fsuite consists of four floating point programs run in parallel. Each program was cut off at 2,000,000 instructions to balance the load. Xmountains generates a 3D view of terrain based on the infinite resolution of fractals, as would be used in a flight simulator or similar program [XMO95]. Whet is a benchmark application that is very heavy in floating point operations. Its only purpose is to stress the floating point capabilities of a processor [WHE87, CUR76]. Say can convert text into speech, and can be used to provide a more natural interface to the computer [SAY94, HOL64, KLA80]. Sox manipulates an audio waveform, in this case just upsampling it, but with the right algorithms, it could as well provide multichannel surround sound separation, or audio special effects [SOX94, SMI93].

5.3.2. Profile

The benchmarks were all profiled to determine what routines formed the core loops, and which contained the most processing. These were selected for re-coding with threads. The profiles that can be found in Appendix A include information about how much additional overhead is in the 4 thread run of the program versus the single thread run.

5.3.3. Properties

Table 5.2 shows properties of the benchmarks. The figures represent the single thread version under default conditions, except for the last four items, which are for a 4 thread run. Tables 5.3 and 5.4 later on break down isuite and fsuite into their component parts. See Appendix 3 for a detailed description and equations for each item.

Table 5.2. Properties of the benchmarks.
label:          mpeg2e        nlfilt        pov           isuite        fsuite
memory model:   Shared        Shared        Private       Multiprogram  Multiprogram
cycles:         10,685,764    2,795,696     4,676,530     1,392,791     4,573,652
float work:     0.269         0.014         1.519         0.000         1.504
instructions:   35,918,652    19,192,033    7,962,116     8,000,000     8,000,000
quick_instr:    257,583       1,059,949     1,436,048     0             0
CPI:            0.297         0.146         0.587         0.174         0.572
IPC:            3.361         6.865         1.703         5.744         1.749
avg fetch:      6.748         7.691         5.675         6.495         6.336
avg issue:      3.182         6.808         1.345         5.257         1.454
total delays:   57.377%       11.840%       74.165%       14.291%       74.916%
su stalls:      36.188%       7.901%        62.936%       7.113%        68.230%
br delays:      20.970%       3.850%        11.036%       6.348%        6.291%
i delays:       0.219%        0.089%        0.193%        0.831%        0.396%
i swap:         58.342%       0.000%        40.819%       24.567%       44.128%
d swap:         44.818%       41.056%       0.000%        92.738%       5.201%
i miss:         0.088%        0.020%        0.129%        0.187%        0.286%
d miss:         0.035%        0.082%        0.035%        2.607%        0.079%
fetch deficit:  15.722%       3.886%        29.158%       18.959%       21.029%
pred rate:      35.081%       66.986%       73.591%       93.896%       66.283%
br penalty:     2.851         3.441         2.349         2.079         2.324
commit bypass:  0.112%        0.013%        1.456%        1.177%        0.694%
fetch cycles:   6,818,806     2,574,800     1,733,330     1,293,718     1,453,051
code growth:    0.421%        0.005%        0.045%        0.000%        0.000%
%threaded:      34.459%       99.076%       92.171%       95.180%       81.447%
total threads:  1405          4             73            4             4
speedup:        19.433%       2.891%        14.843%       8.817%        16.019%

The item float work is a weighted sum of floating point operations, with the weight being the number of cycles it takes to complete that type of operation, then normalized to the total number of instructions in the benchmark. Thus, pov and fsuite are very heavy users of floating point, while mpeg2e and nlfilt use less, and isuite uses none.

The average fetch entry shows that our default mechanism of dual fetch does a pretty good job of retrieving 5.6 to 7.6 instructions. This forms the upper limit on throughput.

The cycles per instruction (CPI) or its inverse (IPC) varies considerably across the benchmarks. The floating point apps have much lower rates of instruction throughput than the integer ones, which indicates the floating point operations are a significant source of the stalls in the Scheduling Unit. Looking down at the total delays, and the next three items after it, which break down the cause of delays, one can see the dominance of Scheduling Unit stalls in those floating point applications.

Branch delays are also a considerable portion of the delay, which indicates that the default mechanism of two bit history branch prediction isn't always a good performer. The pred rate entry for mpeg2e shows that its prediction rate is only 35%. The br penalty line shows that for each of those mispredicted branches, an average of 2 to 3.5 fetch cycles are wasted.

The commit bypass item shows what percentage of stalls were avoided by being able to commit from slots farther up the Instruction Window, and is less than 1.5% for all benchmarks. For these single threaded applications, the only situation in which this is ever used is to eat the pipe bubbles on completely invalidated blocks after a bad branch prediction has been detected.

Instruction and Data Cache miss rates are very low, well under 1% in all but the isuite Data Cache. This translates into instruction fetch delays of less than 1% for all benchmarks. The significance of the swap items is described in the next chapter.

Up to now, all the items in the table have been describing the single thread benchmarks. The last 4 lines describe what happens when 4 threads are used. The code grows for all but the suites, but by less than half of one percent in the worst case. The suites don't have code growth because the programs don't change; the different programs are only interleaved.

The %threaded entry shows that for nlfilt, pov, and isuite, greater than 90% of the cycles have more than one thread available, while fsuite has 81%, and mpeg2e only has 34%. The suites are only less than 100% because of load imbalance: some components have a higher IPC than others and complete early. For the threaded applications, the percentage reflects how much of the program is essentially scalar, and cannot be easily threaded.

The total threads line shows that most benchmarks were threaded very coarse grained. Mpeg2e has the highest number of threads, but also had the highest amount of code growth, showing the cost involved in fine grained threading.

The last item is the most important. This shows the amount of speedup between the scalar version and the 4 thread version. For nlfilt, it is very low, about 2.9%. This application already has the highest IPC. For the other integer application, isuite, it also was low at 8.8%. The floating point applications did better, 14.8% to 19.4%. The speedup numbers are nowhere near the many times speedup that others have reported when simulating scientific applications designed for parallel machines, but for the small amount of extra cost involved in adding multithreading to a superscalar processor, they are quite good.

Table 5.3 shows the benchmarks that make up the isuite benchmark and their properties.

Table 5.3. Integer Suite breakdown.
label:          gzip        mpegd       diff        cjpeg       isuite
cycles:         625,530     412,310     583,134     408,932     1,392,791
float work:     0.000       0.000       0.000       0.000       0.000
instructions:   2,000,000   2,000,000   2,000,000   2,000,000   8,000,000
quick_instr:    0           0           0           0           0
CPI:            0.313       0.206       0.292       0.204       0.174
IPC:            3.197       4.851       3.430       4.891       5.744
avg fetch:      4.828       5.152       3.907       5.451       6.495
avg issue:      3.081       4.396       2.976       4.674       5.257
total delays:   33.928%     6.524%      14.868%     14.940%     14.291%
su stalls:      2.177%      3.576%      9.270%      2.071%      7.113%
br delays:      31.408%     2.344%      5.342%      11.306%     6.348%
i delays:       0.342%      0.603%      0.256%      1.563%      0.831%
i swap:         0.000%      0.604%      0.000%      26.625%     24.567%
d swap:         55.488%     79.334%     94.032%     52.303%     92.738%
i miss:         0.103%      0.128%      0.058%      0.347%      0.187%
d miss:         2.221%      2.489%      5.565%      0.479%      2.607%
fetch deficit:  39.706%     35.684%     51.186%     32.102%     18.959%
pred rate:      51.422%     98.207%     94.389%     76.860%     93.896%
br penalty:     2.914       2.857       1.983       2.236       2.079
commit bypass:  0.095%      0.399%      0.044%      0.236%      1.177%
fetch cycles:   611,901     397,564     529,079     400,461     1,293,718
code growth:    0.000%      0.000%      0.000%      0.000%      0.000%
%threaded:      0.000%      0.000%      0.000%      0.000%      95.180%
total threads:  1           1           1           1           4
speedup:        N/A         N/A         N/A         N/A         8.817%

In the integer suite, gzip has the lowest throughput (IPC), and thus is the reason isuite is only 95% threaded, due to load imbalance between the different programs within the suite. The low throughput on gzip is primarily due to poor branch prediction. Cjpeg, another compression routine, also has poor branch prediction. On the other end of the spectrum, the decompression routine mpegd has over 98% prediction accuracy.

The text comparison program, diff, has the highest Data Cache miss rate, more than double any of the other benchmarks at 5.5%.
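As a quick consistency check on Table 5.3, the CPI and IPC rows can be recomputed from the cycles and instructions rows. The sketch below uses only values from the table; the function and variable names are ours.

```python
# (cycles, instructions) pairs copied from Table 5.3.
runs = {
    "gzip": (625_530, 2_000_000),
    "mpegd": (412_310, 2_000_000),
    "diff": (583_134, 2_000_000),
    "cjpeg": (408_932, 2_000_000),
    "isuite": (1_392_791, 8_000_000),
}


def cpi(cycles, instructions):
    """Cycles per instruction."""
    return cycles / instructions


def ipc(cycles, instructions):
    """Instructions per cycle, the inverse of CPI."""
    return instructions / cycles


table = {name: (round(cpi(c, n), 3), round(ipc(c, n), 3))
         for name, (c, n) in runs.items()}

for name, (c, i) in table.items():
    print(f"{name:8s} CPI={c:.3f} IPC={i:.3f}")
```

The recomputed values agree with the CPI and IPC rows of the table to three decimal places.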

Table 5.4 shows the benchmarks that make up the fsuite benchmark and their properties.

The floating point application suite has a low %threaded amount at 81%, due to the low throughput of xmountains, which makes the load imbalanced. The reason is that xmountains has a very high float work rating, more than double that of the other applications. It also has the lowest avg fetch.

Despite the imbalance, the combined suite has 16% speedup, which is high because the floating point latencies can be hidden by multithreading.

Table 5.4. Floating point suite breakdown.
label:          xmountains  whet        say         sox         fsuite
cycles:         1,783,308   862,390     736,496     840,616     4,573,652
float work:     3.213       1.210       0.656       0.936       1.504
instructions:   2,000,000   2,000,000   2,000,000   2,000,000   8,000,000
quick_instr:    0           0           0           0           0
CPI:            0.892       0.431       0.368       0.420       0.572
IPC:            1.122       2.319       2.716       2.379       1.749
avg fetch:      4.144       5.199       5.900       5.869       6.336
avg issue:      0.759       2.009       2.413       2.121       1.454
total delays:   75.990%     56.812%     56.364%     62.068%     74.916%
su stalls:      68.144%     51.959%     47.930%     55.328%     68.230%
br delays:      7.592%      4.472%      7.275%      6.396%      6.291%
i delays:       0.254%      0.381%      1.158%      0.344%      0.396%
i swap:         6.858%      1.065%      40.880%     2.422%      44.128%
d swap:         3.936%      0.000%      4.786%      0.699%      5.201%
i miss:         0.187%      0.170%      0.500%      0.169%      0.286%
d miss:         0.176%      0.016%      0.144%      0.030%      0.079%
fetch deficit:  48.301%     35.123%     26.624%     26.756%     21.029%
pred rate:      66.840%     78.021%     54.232%     65.724%     66.283%
br penalty:     2.288       2.343       2.417       2.395       2.324
commit bypass:  2.193%      0.012%      0.024%      0.007%      0.694%
fetch cycles:   568,089     414,301     383,482     375,520     1,453,051
code growth:    0.000%      0.000%      0.000%      0.000%      0.000%
%threaded:      0.000%      0.000%      0.000%      0.000%      81.447%
total threads:  1           1           1           1           4
speedup:        N/A         N/A         N/A         N/A         16.019%

6. RESOURCE ANALYSIS RESULTS

6.1. Default parameters

In a system that's nearly infinitely configurable, choosing where to start is difficult but critical to finding the best solution and most interesting interactions. The configuration developed by Wallace [WAL93a] was the starting point, but it was necessary to scale some of the resources up to take full advantage of the increased parallelism available to the multithreaded benchmarks. Hundreds of preliminary simulations were run that are not documented here, in order to see where the interesting bends in the graphs occur, while not having any one resource low enough that its limits mask other effects. Using that, a set of parameters was selected as a starting point for the rest of the runs. These are summarized in Table 6.1, along with the limits imposed by the architecture or simulator.

From the default resource configuration, one parameter is varied at a time (or occasionally related parameters are changed as a set) to see how performance is affected. This gives insight to the resource requirements and dependencies in the benchmarks.

Table 6.1. Default Resource Configuration.
PARAMETER                      AVAILABLE RANGE                     DEFAULT
fetch block size               1-16 instructions                   8 instructions
depth of RBIW                  1-15 blocks                         4 blocks
completion slots               1-15 blocks                         2 blocks
number of threads              1-7                                 4
thread scheduling              X,R/L/C,I,B,P,F                     LIBPF
maximum issue/result ports     1-16                                16
functional units: ALU          1-16                                8
functional units: FPU          1-16                                4
functional units: Multiply     1-16                                4
functional units: Load/Store   1-16                                2
result bypassing               off / on                            on
I Cache                        0-512 KB                            64 KB
D Cache                        0-512 KB                            64 KB
Cache Associativity            1-8 way                             4 way
cache latency/interval         1+ / 1+                             5 / 3
cache line size                1+ fetch blocks                     1 fetch block (32 bytes)
prefetch                       off / dual / ideal                  dual
branch prediction algorithm    2 bit / always not taken / perfect  2 bit
branch prediction table size   0-256 entry shared / each           256 entry shared
misprediction bubbles          0+                                  1
thread stack size              any                                 $8200 bytes (32.5 KB)
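The one-parameter-at-a-time methodology can be sketched as a small configuration generator. The default values below come from Table 6.1, but the dictionary keys, the function name, and the sweep points are hypothetical and do not correspond to actual simulator options.

```python
# Defaults from Table 6.1; the key names are illustrative only.
DEFAULTS = {
    "fetch_block_size": 8,
    "rbiw_depth": 4,
    "completion_slots": 2,
    "threads": 4,
    "alu": 8,
    "fpu": 4,
    "load_store": 2,
}


def one_at_a_time(defaults, sweeps):
    """Yield (parameter, value, config), changing one parameter per config."""
    for name, values in sweeps.items():
        for value in values:
            config = dict(defaults)  # copy, so the defaults stay untouched
            config[name] = value
            yield name, value, config


# Hypothetical sweep points for two of the resources.
sweeps = {"alu": [2, 4, 8, 12], "fpu": [1, 2, 4, 8]}
configs = list(one_at_a_time(DEFAULTS, sweeps))

for name, value, config in configs:
    print(name, value, config)
```

Each generated configuration differs from the defaults in at most one entry, so any change in CPI can be attributed to that resource alone. The price of this approach is that interactions between resources are not explored.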

Figure 6.1 shows Cycles Per Instruction (CPI) as the number of threads is varied for each benchmark. Note that some of the benchmarks do not have all the thread counts graphed. For the suites, only 1, 2, and 4 threads were run, since the suites each have 4 programs in them. If an odd number were run, the result would not be accurate because of the load imbalance. The final column, average, is an average of all the benchmarks. Because Instructions Per Cycle (IPC) sometimes feels more intuitive than CPI, the data is inverted in Figure 6.2 to show IPC. The rest of the thesis will stick to CPI for consistency [EMM97].

Figure 6.1. CPI versus threads.

Figure 6.2. IPC versus threads.

Nlfilt has the most inherent parallelism, approaching the 8 instructions per cycle fetch limit, and thus the least to gain from multithreading. Pov and fsuite have the worst performance, and the most to gain from the additional parallelism. These two use floating point extensively.

Figure 6.3 shows this same data represented as speedup, which is the relative execution time of the multithreaded benchmark versus the single threaded.

Figure 6.3. Speedup versus Threads

The first few threads provide significant speedup, but after 4 or 5 threads, the additional overhead begins to outweigh the diminishing gains from more parallelism. These trends have discontinuities in them due to the load imbalance with certain numbers of threads that are not multiples of the loop sizes. This can be seen in mpeg2e and nlfilt, which have better performance on 6 threads than they do on 5. For nlfilt, 3 threads is the best. For pov, 5 threads is best. For the suites and mpeg2e, 4 threads is best.

For all the remaining threaded runs in this thesis, 4 threads were used. This was chosen to be the same for all benchmarks for consistency. Among all benchmarks, 4 threads is close to the greatest speedup.

6.2. Fetch limits

Instruction fetching is the first bottleneck in the processor pipeline. If instructions cannot be fetched in enough quantity, the processor will spend much of its time idle. Several parameters control the flow of instructions into the processor. The following sections look at fetch block size, fetch alignment through self-aligned fetching or prefetching, branch prediction, and Instruction Cache.

6.2.1. Fetch Block Size

In Figure 6.4, the size of the fetch block is varied while the total size of the Instruction Window is kept at 32 entries.

Figure 6.4. Block Size, keeping Instruction Window to 32 entries.

The default configuration of 8/4 is the best for all the multithreaded cases. The larger fetch block of 16 instructions is just slightly better in the highest throughput single threaded cases. These must have very little data dependence, or the pipe would stall frequently with only two cycles between fetch and retire.

6.2.2. Fetch Alignment: Prefetching

Figure 6.5 compares fetching models. No prefetch means that a single cache line is fetched each cycle. Dual fetch refers to fetching two cache lines every cycle, then re-aligning the instructions to begin at the start of the block. This is also called self-aligned fetch block. Perfect fetch refers to a theoretical prefetch scheme that always returns a full block of instructions, even when the block is misaligned or correctly predicted branches occur within the block.
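The alignment effect behind these fetch models can be illustrated with a simplified count of how many instructions each model returns for a given starting offset within a cache line. The sketch assumes an 8 instruction line (the default fetch block), fetch addresses spread uniformly over the line, no cache misses, and no taken branches inside the block. The function names are ours.

```python
LINE = 8  # instructions per cache line, matching the default fetch block


def single_line_fetch(offset):
    """No prefetch: only the rest of the line holding the fetch address."""
    return LINE - offset


def dual_line_fetch(offset):
    """Dual fetch: two consecutive lines are read and realigned."""
    available = (LINE - offset) + LINE
    return min(LINE, available)


# Average over all starting offsets (uniform offsets are an assumption).
avg_single = sum(single_line_fetch(o) for o in range(LINE)) / LINE
avg_dual = sum(dual_line_fetch(o) for o in range(LINE)) / LINE

print(avg_single, avg_dual)
```

Under these assumptions a single line fetch averages 4.5 instructions while dual fetch always fills the block. Taken branches within the block, which this sketch ignores, are what separate dual fetch from the perfect fetch model.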

Figure 6.5. Prefetching

Dual fetch is approximately 6-7% slower than perfect fetch. Considering the additional cost of prefetching and the fact that it would most likely increase the pipeline length, dual fetch is a good compromise.

The no prefetch configuration of single threaded pov has much worse (+29%) performance than the dual fetch configuration. The multithreaded pov only has +11% difference, indicating that the threading can compensate for some of the poor fetch performance. In all the other applications, no significant difference exists between the benefits gained by improved fetch schemes for single and multithreaded applications.

Nlfilt gets no performance increase with any of the fetch improvements, indicating that its performance is not limited by fetch.

6.2.3. Branch Prediction

The branch prediction mechanism is responsible for reducing the number of garbage instructions fetched by the processor. In Figure 6.6, the bsimple model always predicts not taken. 64 and 256 are the number of entries in the branch target buffer, which uses a two bit counter mechanism to determine whether a branch should be taken. 256 each refers to separate branch target buffers for each thread. The perfect predictor is never wrong, and shows the possible performance limits of branch prediction.

Figure 6.6. Branch Prediction

It is obvious from this graph that branch prediction is a much smaller share of performance with multithreaded programs than single thread programs. Expensive dynamic prediction is important in the single thread models, but of little benefit when multiple threads are available. Also shown is that inter-thread interference can often reduce the effectiveness of the two bit branch predictor, such that the simple not-taken prediction performs better in both mpeg2e and nlfilt, and is about the same in the others.

For the multithreaded case, all models achieve nearly the same results as the perfect prediction because there is time to resolve the branch while other threads are executing. Bsimple actually outperforms any of the other real cases because it never discards the instructions within the fetch block after the branch instruction until the branch has been resolved, while the other models will mistakenly discard these on a false taken branch prediction.
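For readers unfamiliar with the two bit counter scheme, the sketch below shows a generic saturating counter predictor and how two threads that share one table entry can disturb each other. It is a textbook model, not the simulator's code: the index function, the initial counter state, and the example addresses are assumptions, and 4-byte instructions are inferred from the 32 byte, 8 instruction fetch block.

```python
class TwoBitPredictor:
    """Table of two bit saturating counters; values 2 and 3 predict taken."""

    def __init__(self, entries=256):
        self.entries = entries
        self.counters = [1] * entries  # start weakly not taken (assumption)

    def _index(self, pc):
        return (pc >> 2) % self.entries  # assumes 4-byte instructions

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def prediction_rate(predictor, trace):
    """Fraction of (pc, taken) outcomes predicted correctly."""
    correct = 0
    for pc, taken in trace:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct / len(trace)


# A loop branch: taken nine times, then falls through.
loop_trace = [(0x100, True)] * 9 + [(0x100, False)]
loop_rate = prediction_rate(TwoBitPredictor(), loop_trace)

# Two threads whose branches map to the same entry of a shared table and
# behave in opposite ways (hypothetical addresses 0x100 and 0x500).
shared_trace = [(0x100, True), (0x500, False)] * 4
shared_rate = prediction_rate(TwoBitPredictor(), shared_trace)

print(loop_rate, shared_rate)
```

In the second trace the shared counter is pushed back and forth by the two threads and never predicts either branch correctly, which is an extreme case of the inter-thread interference discussed above.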

6.2.4. Instruction Cache

In Figure 6.7, the Instruction Cache is varied from 16K to 256K with both a direct mapped (1-way) and an associative (4-way) arrangement. Perfect refers to a cache that never misses, giving a best case reference point.

Figure 6.7. Instruction Cache

For all but pov with a small cache, the associative cache does not gain us much, and in threaded fsuite and nlfilt, a small associative cache can perform worse than direct mapped.

It is possible that these benchmarks are not run long enough to show all of the cache effects. To test for this, Figure 6.8 shows the percentage of instruction misses that replace data in the cache (as opposed to hitting an empty cache entry). An indication that this is a poor model of steady state performance is if the graph is not near 100%. The graph shows that most benchmarks only approach 100% for the smallest cache, and some like nlfilt and isuite never get very high at all. This indicates that much longer simulations are needed to accurately model cache effects.

The only exception to this is pov, which has swap rates over 90% for all of the direct mapped cache models. Back in Figure 6.7, we can see that for pov, the multithreaded version had slightly less dependence on cache size than the single thread one. This is as can be expected, since the multithreaded processor is not blocked from executing other threads when a cache miss occurs. From this benchmark, we can also see that the associative cache performs better than the direct mapped, as less contention is seen.

Figure 6.8. Instruction Cache Usage.
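The swap percentage used above can be illustrated with a minimal direct mapped cache model that counts how many misses replace valid data rather than filling an empty line. The 16 KB size and 32 byte line match values used in this chapter; the function name and the address traces are hypothetical.

```python
def swap_rate(addresses, cache_bytes=16 * 1024, line_bytes=32):
    """Fraction of misses that evict valid data in a direct mapped cache."""
    lines = cache_bytes // line_bytes
    tags = [None] * lines  # None marks a line that has never been filled
    misses = swaps = 0
    for addr in addresses:
        block = addr // line_bytes
        index, tag = block % lines, block // lines
        if tags[index] == tag:
            continue  # hit
        misses += 1
        if tags[index] is not None:
            swaps += 1  # this miss replaced a valid line
        tags[index] = tag
    return swaps / misses if misses else 0.0


# A short run that touches 8 KB once never fills the cache.
short_run = swap_rate(range(0, 8 * 1024, 4))

# A longer run that streams through 64 KB twice keeps evicting lines.
long_trace = list(range(0, 64 * 1024, 4)) * 2
long_run = swap_rate(long_trace)

print(short_run, long_run)
```

The short run reports no swaps at all, so its miss rate reflects only the cold start. The longer run approaches the steady state behavior that a swap rate near 100% indicates.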

Some very good papers have been written on cache models for multithreaded applications [JOU90, MCF91, BLU92, LEE95, FAR97, LO98, PHI96], and our results here, though seriously limited by the under-utilization of the cache, agree with them that spatial locality is reduced while threads can compensate for the additional misses. For real cache model analysis, a much simpler processor model is usually used to handle many times the size of instruction runs we can use here.

Figure 6.9 shows a comparison of blocking versus nonblocking Instruction Cache. The differences are too small to see.

Figure 6.9. Blocking vs Non-Blocking Instruction Cache.

6.3. Data and pipeline limits

Once the instructions are in the processor, there must be enough Execution Units to process them all with minimal stalls. The Data Cache must be able to provide the required memory accesses in a timely manner. Finally, the Instruction Window must be large enough that instructions have a chance to execute before causing a stall.

6.3.1. Functional Units: ALU, FPU, Load/Store

Figure 6.10 compares CPI while varying the number of Integer Execution Units (ALU).

Figure 6.10. Integer ALU

For the floating point applications, 4 ALU's are enough to reach maximum performance. For the integer heavy applications, 8 ALU's provide additional speedup.

In mpeg2e, the additional performance gained by going from 1 to 4 threads has increased the reliance on ALU Units. While in the single thread version, only 2.3% improvement is seen when going to 8 ALU's, the threaded version shows 5.9% improvement when adding those same additional ALU's. This is due to the additional resource utilization of the multithreaded version.

For all benchmarks, more than 8 integer ALU's are entirely unused. This is likely due to the maximum fetch bandwidth of 8 instructions per cycle becoming the bottleneck.

Figure 6.11 varies the Floating Point Units (FPU).

Figure 6.11. Floating Point Units.

Floating point intensive programs like pov and fsuite can make use of up to 8 or even 12 Floating Point Units. The first 4 are quite effective, beyond that, it is a cost/performance tradeoff to determine if additional units are worthwhile.

The existence of threads does not appear to place an extra burden on FPU resources.

Figure 6.12 varies the number of Load/Store Units, each of which can perform one load or one store each cycle.

Figure 6.12. Load/Store Units.

The second Load/Store Unit is of significant performance importance. In nlfilt, it cuts execution time nearly in half, taking CPI from 0.27 to 0.14 for either single or multithreaded runs. In mpeg2e and isuite, the presence of a second Load/Store Unit improves performance more for multithreaded versions than scalar.

More than 2 Load/Stores has little or no effect, probably due to the precedence rules for executing loads before preceding stores within the same thread.

6.3.2. Data Cache

In Figure 6.13, the Data Cache is varied from 16K to 256K with both a direct mapped (1-way) and an associative (4-way) arrangement. Perfect refers to a cache that never misses, giving a best case reference point.

Figure 6.13. Data Cache

Data Cache has very little impact on the performance of these benchmarks. It is quite likely that our data set size is too small to see much effect with the short simulation runs done. To test for this, Figure 6.14 shows the percentage of cache misses that cause a swap to occur (as opposed to filling an empty cache line). If these are not near 100%, then the simulation has not reached the steady state condition. This is the case for all but the smallest cache sizes.

Figure 6.14. Data Cache Usage

Because the cache model is nearly useless with these small simulation runs, I refer the reader back to the comments made about Instruction Cache in Section 6.2.4.

6.3.3. Instruction Window Depth

Figure 6.15 varies the depth of the Reorder Buffer / Instruction Window (RBIW) which determines how long an instruction has to complete before stalling the processor. Note that the default case has two Completion Slots, so one stalled thread will not stall the entire processor, but two will.

Figure 6.15. RBIW depth.

For the integer benchmarks isuite and nlfilt, a depth of more than 4 blocks is nearly useless. For the floating point intensive applications however, 6 and 8 entries will give better performance.

6.4. Thread parameters

There are some parameters that do not do much for a single threaded processor, but can significantly impact performance with multiple threads. Being able to bypass a stalled thread in the Instruction Window, and having a non-blocking Instruction Cache are two such items. The thread scheduling algorithm can also have an effect on the mix of instructions available.

6.4.1. Completion Slots

Figure 6.16 shows the result of being able to retire instructions from farther up in the Instruction Window, if the entries below it are not from the same thread.

Figure 6.16. Instruction Window Completion Slots.

There is a slight improvement in single thread performance, only because pipe bubbles from blocks of instructions that were invalidated after a mispredicted branch can be squashed with this mechanism. In multithreaded operation, we get approximately 15% speedup from just one extra completion slot.

Performance of multithreaded applications can be even worse than their scalar versions if completion bypassing is not allowed, as seen in fsuite. This is because the high occurrences of floating point instructions are frequently stalling the pipeline. Because the original code had the floating point operations grouped together, they benefit from stall sharing as described in Chapter 2. However, when the multiple threads are interleaved, these instructions are spread out, and do not benefit from this stall sharing.
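The completion bypass rule can be sketched as follows. This is a simplified model of the retire decision only: blocks are reduced to a thread number and a done flag, the squashing of invalidated blocks is not modeled, and the names are ours rather than the simulator's.

```python
from collections import namedtuple

Block = namedtuple("Block", "thread done")


def retire_one(window, slots):
    """Retire the oldest eligible block; return it, or None if stalled.

    window[0] is the oldest block. Only the bottom `slots` entries may
    retire, and a block may bypass older blocks only if none of those
    older blocks belongs to its own thread.
    """
    older_threads = set()
    for i, block in enumerate(window[:slots]):
        if block.done and block.thread not in older_threads:
            return window.pop(i)
        older_threads.add(block.thread)
    return None


# Thread 0 is stalled at the bottom of the window; thread 1 is finished.
window = [Block(0, False), Block(1, True), Block(0, True)]

one_slot = retire_one(list(window), slots=1)  # no bypass: stalled
two_slots = retire_one(window, slots=2)       # thread 1 bypasses thread 0
blocked = retire_one(window, slots=2)         # thread 0 must stay in order

print(one_slot, two_slots, blocked)
```

With a single slot the stalled block at the bottom holds up everything behind it. With two slots the completed block from another thread retires past it, but a later block of the stalled thread still waits, preserving in-order completion within a thread.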

6.4.2. Thread Scheduling Algorithm

Figure 6.17 shows the results from changing the thread scheduling algorithm.

Figure 6.17. Scheduling Algorithm.

The scheduling algorithm is very customizable, and a variety of techniques were tried here. The algorithm name comes from abbreviations for the various options enabled. Normally, a new thread is chosen every fetch cycle, unless X is specified to indicate coarse grained threading. With coarse grained threading, thread switch only occurs when the current thread gets a low priority.

The order in which threads are checked can be one of three: R (Round Robin order), L (Least Recently Used order), and C (Count order). Round Robin order simply goes in thread number order 1, 2, 3, … Least Recently Used order modifies this to account for the occasional fetch stalls in threads. If a thread loses its turn because of an Instruction Cache miss, or other priority mechanism, this allows the thread to get the next available slot when the priority changes, instead of having to wait for all other threads to get another turn. The final ordering mechanism is Count order, which looks at threads first which have the fewest instructions in the RBIW.

After the order specification, a series of priority flags are used: I (Imiss priority), B (Bad Branch priority), P (Prediction priority), F (FPU priority). These can lower the priority of a thread, allowing other threads to fetch instead. Imiss priority lowers a thread's priority if it is waiting for an Instruction Cache miss, since the thread cannot possibly get any instructions and it would be a waste of a fetch cycle. Bad Branch priority is similar: there is a single cycle delay between when a bad branch is detected and when the Instruction Cache is ready with the correct instructions. This gives a thread a lower priority during this cycle because it would be a waste of a fetch cycle. Prediction priority lowers a thread's priority if there is a predicted branch in the Instruction Window. This can benefit performance if branch prediction is poor. FPU priority lowers a thread's priority if a floating point instruction is in the window. Since FPU instructions are likely to stall, a second fetch to this thread may plug up the Instruction Window (because it only has 2 Completion Slots). By giving preference to other threads, the chance of plugging the remaining completion slot is reduced.
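The ordering and priority options described above might be combined as in the following sketch. It is an interpretation for illustration rather than the simulator's algorithm: how multiple flags combine (here each active flag lowers priority by one level), the 0-based thread numbers, and all field names are assumptions, and the coarse grained X option is not modeled.

```python
from dataclasses import dataclass


@dataclass
class ThreadState:
    tid: int
    last_fetch: int = -1      # cycle of this thread's most recent fetch
    rbiw_count: int = 0       # instructions currently in the RBIW
    imiss: bool = False       # waiting on an Instruction Cache miss
    bad_branch: bool = False  # in the recovery cycle after a bad branch
    predicted: bool = False   # predicted branch in the Instruction Window
    fpu: bool = False         # floating point instruction in the window


FLAG_FIELDS = {"I": "imiss", "B": "bad_branch", "P": "predicted", "F": "fpu"}


def pick_thread(threads, algorithm, previous_tid):
    """Choose the next thread to fetch for an algorithm name such as 'LIBPF'."""
    order_code = next(c for c in algorithm if c in "RLC")
    flags = [c for c in algorithm if c in FLAG_FIELDS]
    n = len(threads)

    def order_key(t):
        if order_code == "R":  # round robin, starting after previous thread
            return (t.tid - previous_tid - 1) % n
        if order_code == "L":  # least recently used first
            return t.last_fetch
        return t.rbiw_count    # count order: fewest instructions in RBIW

    def penalty(t):
        # Assumption: every active flag lowers the priority by one level.
        return sum(getattr(t, FLAG_FIELDS[f]) for f in flags)

    ordered = sorted(threads, key=order_key)
    return min(ordered, key=penalty).tid  # ties keep the scan order


threads = [
    ThreadState(0, last_fetch=5),
    ThreadState(1, last_fetch=3, fpu=True),
    ThreadState(2, last_fetch=4),
    ThreadState(3, last_fetch=1, imiss=True),
]

print(pick_thread(threads, "LIBPF", previous_tid=0))
print(pick_thread(threads, "R", previous_tid=0))
```

In this example the least recently used thread is skipped because it is waiting on an Instruction Cache miss, and the next candidate is skipped because it has a floating point instruction in the window, so the third thread in LRU order is chosen.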

The algorithms that use floating point priority work better than those that do not.

The ordering parameter has a very minor effect. LRU performs very slightly better than pure round robin, and count performs almost as well as LRU in some applications, but worse in others.

The coarse grained switching has a slightly worse performance as can be expected because it does not benefit from the additional data independence of consecutively fetched blocks of code being from different threads.

The lesson here is that having a more complex scheduling algorithm does not matter very much, but on the other hand does not cost very much either. The only parameter that is important to have is FPU priority.

6.5. Interactions and other ideas

Some combinations have interesting properties, and may be looked at in the future. These include scheduling coarse with no prediction, prefetch with deeper Instruction Windows, or high number of Completion Slots with few Floating Point Units.

Some items were not looked at, but could potentially be of interest. The maximum issue and write ports may be reduced to less than the virtually 100% connectivity that was simulated here. Multithreading may allow us to reduce these expensive resources more than would be reasonable in a single thread implementation. Register file ports may be reduced using Steve Wallace's system of fetching the registers after the instructions are in the RBIW, instead of during the decode stage [WAL97]. Better branch predictors could be tried, although we have shown that branch prediction is less important if multiple threads are available. Multiple threads could be fetched each cycle and interleaved at a finer grain, as is done in the SMT architecture [TUL95, TUL96]. Predicted not-taken branches could be fetched when no other threads are available in hopes of executing instructions that eventually are needed, as has been done by Wallace in the SMT architecture [WAL98].

7. CONCLUSION

7.1. Discussion of results

In Chapter 5, the characteristics of the multimedia benchmarks were explored. The diversity of their workloads is the most striking aspect. Some have very high floating point usage, while others are entirely integer. IPC ranges from 1.7 to 6.8 just for the single thread case. Speedups range from 2.8% to 19.4% once threaded. Some stall primarily due to resource contentions, while others have poor branch prediction performance.

They do share one important feature: Threaded versions perform better than single threaded versions when given the proper resources.

That leads to Chapter 6, studying the benchmark behavior as resources are varied. Some resources didn't affect multithreaded applications any differently than single thread applications: fetch block size, FPU, and Instruction Window depth.

Other resources had some instances where they were marginally more or less important for multithreaded applications than single thread ones: fetch alignment/prefetching, Integer ALU, Load/Store Units, and thread scheduling algorithm.

Some had strong effects on multithreaded code that weren't seen on the single thread versions: branch prediction and Completion Slots.

The Instruction and Data Cache analysis was largely useless because of the small benchmark runs. It was difficult to get the cache heavily enough utilized to see real trends in the results. For these we referred the reader to several excellent papers that studied cache in detail with much simpler processor models, which seems to be the only way to do it.

7.2. What does it take to make multithreading viable

This thesis has assembled an extension of a superscalar processor to handle multithreaded applications. The benchmarks covering our target application space proved to be so varied as to make generalizations about their characteristics useless except to say they are diverse. The key resources needed in the hardware boiled down to support for less than a half dozen threads and a mechanism to prevent pipeline stalls (Completion Slots). With this, we gained a degree of immunity to bad branch prediction and up to 20% higher instruction throughput.

Is this a high price to pay for a small performance gain, or a small price to pay for a big performance gain? Actual implementation details will give those numbers when it comes time to build a processor. Multithreaded processors will be built. It is an attractive and inevitable step in increasing performance in the multimedia desktop machine.

BIBLIOGRAPHY

AGA92: A. Agarwal, "Performance Tradeoffs in Multithreaded Processors", IEEE Trans on Parallel and Distributed Systems, Sep 1992.

BLU92: R. Blumofe, "Managing Storage for Multithreaded Computations", MS Thesis MIT, Sep 1992.

COR93: H. Corporaal, "Evaluating Transport Triggered Architectures for scalar applications", Microprocessing and Microprogramming, Sep 1993.

CUR76: H. J. Curnow, B. A. Wichman, "A Synthetic Benchmark", Computer Journal, vol 19 #1, Feb 1976.

DAG94: N. Dagli, "Design and Implementation of a Scheduling Unit for a Superscalar Processor", Masters Thesis, UC Irvine, Dec 1994.

DAN98: A. Dan, S. I. Feldman, D. N. Serpanos, "Evolution and Challenges in Multimedia", IBM Journal of R&D V42,N2, 1998.

DIF93: Haertel, Hayes, Stallman, Tower, Eggert, Free Software Foundation, "gnu diff sourcecode", 1993.

EIC96: R. J. Eickemeyer, R. E. Johnson, S. R. Kunkel, M. S. Squillante, S. Liu, "Evaluation of Multithreaded Uniprocessors for Commercial Application Environments", SIGARCH Comp Arch News, May 1996.

EMM97: P. G. Emma, "Understanding Some Simple Processor-Performance Limits", IBM Journal of Research & Dev, Feb 1997.

FAR91: M. K. Farrens, A. R. Pleszkun, "Strategies for Achieving Improved Processor Throughput", Unknown ACM publication, Sep 1991.

FAR94: K. I. Farkas, N. P. Jouppi, P. Chow, "How Useful are Non-blocking Loads, Stream Buffers, and Speculative Execution in Multiple Issue Processors?", DEC WRL 94/8, Dec 1994.

FAR97: K. I. Farkas, P. Chow, N. P. Jouppi, Z. Vranesic, "Memory-system Design Considerations for Dynamically-Scheduled Processors", DEC/WRL Tech Report 97.1, Feb 1997.

FLY98: R. J. Flynn, W. H. Tetzlaff, "Multimedia-An Introduction", IBM Journal of R&D V42,N2, 1998.

GUL94: M. Gulati, "Multithreading on a Superscalar Microprocessor", MS Thesis UCI, Dec 1994.

GUL96: M. Gulati, N. Bagherzadeh, "Performance Study of a Multithreaded Superscalar Microprocessor", HPCA, Feb 1996.

GZI93: Free Software Foundation, "gzip sourcecode version 1.2.4", Oct 1993. Ftp://prep.ai.mit.edu/pub/gnu

HAR94: H. W. Hardenbergh, "CPU Performance, Where are We Headed?", Dr. Dobb's Journal, Jan 1994.

HOL64: J. N. Holmes, I. G. Mattingly, J. N. Shearme, "Speech Synthesis by Rule", Language Speech 7, 1964.

JOU90: N. P. Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers", DEC/WRL Tech Note TN-14, Mar 1990.

JPG96: The Independent JPEG Group's JPEG Software, "cjpeg sourcecode release 6a", Feb 96.

KEC92: S. W. Keckler, W. J. Dally, "Processor Coupling: Integrating Compile Time and Runtime Scheduling for Parallelism", International Symposium in Computer Architecture, Queensland, Australia, Jul 1992.

KLA80: Klatt, "Software for a Cascade/Parallel Format Synthesizer", Journal of the Acoustical Society of America, Mar 1980.

LEE95: G. Lee, S. Jamil, "Memory Block Relocation in Cache-Only Memory Multiprocessors", IASTED-ISMM International Conference on Parallel and Distributed Computing and Systems, Oct 1995.

LO97: J. L. Lo, S. J. Eggers, J. S. Emer, H. M. Levy, R. L. Stamm, D. M. Tullsen, "Converting Thread-Level Parallelism to Instruction-Level Parallelism via Simultaneous Multithreading", ACM Transactions on Computer Systems, Aug 1997.

LO98: J. L. Lo, L. A. Barroso, S. J. Eggers, K. Gharachorloo, H. M. Levy, S. S. Parekh, "An Analysis of Database Workload Performance on Simultaneous Multithreaded Processors", ISCA98, Jun 1998.

LOI96: M. Loikkanen, N. Bagherzadeh, "A Fine-Grain Multithreading Superscalar Architecture", Parallel Architectures and Compilation Techniques, Oct 1996.

LZ77: Ziv, Lempel, "A Universal Algorithm for Sequential Data Compression", IEEE Transactions on Information Theory vol. 23 no. 3, May 1977.

MCF91: S. McFarling, "Cache Replacement with Dynamic Exclusion", DEC/WRL Tech Note TN-22., Nov 1991.

MPE94: MPEG Software Simulation Group, "MPEG2 encode sourcecode", 1994.

NET93: "NetPBM Library release 7", Dec 1993. Ftp://wuarchive.wustl.edu/graphics/graphics/packages/NetPBM

PHI96: J. Philbin, J. Edler, O. J. Anshus, C. C. Douglas, K. Li, "Thread Scheduling for Cache Locality", ASPLOS, Oct 1996.

PNM93: G. W. Gill, "pnmnlfilt.c sourcecode version 1.0", Jan 1993. (This program is one component of NetPBM [NET93].)

POV96: POV-Ray Team, "Persistence of Vision Ray Tracer sourcecode version 3.0", Jul 1996.

PRA91: R. G. Prasadh, C. Wu, "A Benchmark Evaluation of a Multi-Threaded RISC Processor Architecture", International Conference on Parallel Processing, Aug 1991.

PVR93: Portable Video Research Group (PVRG), "MPEG1 decode sourcecode", 1993.

SAY94: N. Ing-Simmons, "say sourcecode version 2.0", Nov 1994.

SMI93: J. O. Smith III, "Bandlimited Interpolation - Introduction and Algorithm", Publication unknown, Jan 1993.

SOU92: V. Soundararajan, A. Agarwal, "Dribbling Registers: A Mechanism for Reducing Context Switch Latency in Large-Scale Multiprocessors", MIT/LCS Tech Memo TM-474, Nov 1992.

SOX94: L. Norskog, "Sound Tools release 11, patchlevel 12", Aug 1994.

TUL95: D. M. Tullsen, S. J. Eggers, H. M. Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism", ISCA95, 1995.

TUL96: D. M. Tullsen, S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, R. L. Stamm, "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor", ISCA, May 1996.

WAL91: G. K. Wallace, "The JPEG Still Picture Compression Standard", Communications of the ACM, Apr 1991.

WAL93a: S. Wallace, "Performance Analysis of a Superscalar Architecture", MS Thesis UC Irvine, Sep 1993.

WAL93b: D. W. Wall, "Limits of Instruction-Level Parallelism", DEC/WRL Research Report 93.6, Nov 1993.

WAL96: S. Wallace, N. Bagherzadeh, "Instruction Fetching Mechanisms for Superscalar Microprocessors", Euro-Par '96, Aug 1996.

WAL97: S. Wallace, "Scalable Hardware Mechanisms for Superscalar Microprocessors", PhD Dissertation UC Irvine, 1997.

WAL98: S. Wallace, B. Calder, D. M. Tullsen, "Threaded Multiple Path Execution", 25th Int. Symposium on Computer Architecture, Jun 1998.

WHE97: "whetstone benchmark in C sourcecode", May 1987.

XMO95: S. Booth, "xmountains sourcecode version 2.2", University of Edinburgh, Jun 1995.

YOU95: C. Young, N. Gloy, M. D. Smith, "A Comparative Analysis of Schemes for Correlated Branch Prediction", Harvard Tech Report, Jun 1995.

APPENDIX A: MORE ABOUT THE BENCHMARKS AND THREADING

nlfilt: Non-Linear Filter for Image Enhancement

Nlfilt is a non-linear image filter program. The edge enhance module simulated here and shown in Figure A.2 is typical of image processing. The source code was taken from the netpbm library [PNM93, NET93] routine called pnmnlfilt. The code was modified by replacing all pbm library references with local routines that use a simple binary file format, which improves simulation speed and avoids porting the library to SDSP. Only the image processing routine was included in the simulation statistics; the simplified I/O routines were excluded by use of the m_quick_run() instruction. A typical interactive image editing application loads an image once, then applies a series of filters while interacting with the user in real time. Therefore, we want to focus on the real-time portion of the program.

Nlfilt uses the Shared Memory Model. Threads are forked at the beginning of the image processing loop. Each thread works on every nth row, with n being the number of threads. When no more rows need processing, the threads join and a single thread outputs the file. Figure A.1 shows a code fragment containing the threaded loop.


...
if (num_threads > 1) {
    m_fork_n(0,num_threads-1);
    main_inner();     /* multithreaded loop */
    m_join(0);
} else {
    main_inner_1();   /* single thread optimized loop */
}
...

void main_inner_1(void) {   /* optimized version for single thread */
    xel *orow, *irow0, *irow1, *irow2, *ip0, *ip1, *ip2, *op;
    int pr[9],pg[9],pb[9];  /* 3x3 neighbor pixel values */
    int r,g,b, row,col;
    int po,no;              /* offsets for left and right columns in 3x3 */

    orow = o_image;
    for (row = 0 ; row < rows; ) {
        irow0 = irow1 = irow2 = &i_image[row * cols];
        if (row != 0) irow0-=cols;
        if (row != (rows-1)) irow2+=cols;
        for (col = cols-1,po= col>0?1:0,no=0,ip0=irow0,ip1=irow1,
                 ip2=irow2,op=orow;
             col >= 0;
             col--,ip0++,ip1++,ip2++,op++, no |= 1,po = col!= 0 ? po : 0) {
            /* grab 3x3 pixel values */
            pr[0] = PPM_GETR( *ip1 );      pg[0] = PPM_GETG( *ip1 );      pb[0] = PPM_GETB( *ip1 );
            pr[1] = PPM_GETR( *(ip1-no) ); pg[1] = PPM_GETG( *(ip1-no) ); pb[1] = PPM_GETB( *(ip1-no) );
            pr[5] = PPM_GETR( *(ip1+po) ); pg[5] = PPM_GETG( *(ip1+po) ); pb[5] = PPM_GETB( *(ip1+po) );
            pr[3] = PPM_GETR( *(ip2) );    pg[3] = PPM_GETG( *(ip2) );    pb[3] = PPM_GETB( *(ip2) );
            pr[2] = PPM_GETR( *(ip2-no) ); pg[2] = PPM_GETG( *(ip2-no) ); pb[2] = PPM_GETB( *(ip2-no) );
            pr[4] = PPM_GETR( *(ip2+po) ); pg[4] = PPM_GETG( *(ip2+po) ); pb[4] = PPM_GETB( *(ip2+po) );
            pr[6] = PPM_GETR( *(ip0+po) ); pg[6] = PPM_GETG( *(ip0+po) ); pb[6] = PPM_GETB( *(ip0+po) );
            pr[8] = PPM_GETR( *(ip0-no) ); pg[8] = PPM_GETG( *(ip0-no) ); pb[8] = PPM_GETB( *(ip0-no) );
            pr[7] = PPM_GETR( *(ip0) );    pg[7] = PPM_GETG( *(ip0) );    pb[7] = PPM_GETB( *(ip0) );
            r = (*atfunc)(pr);             /* call filter 3 times */
            g = (*atfunc)(pg);
            b = (*atfunc)(pb);
            PPM_ASSIGN( *op, r, g, b );
        };
        orow += cols;
        row ++;
    };
};

void main_inner(void) {
    xel *orow, *irow0, *irow1, *irow2, *ip0, *ip1, *ip2, *op;
    int pr[9],pg[9],pb[9];  /* 3x3 neighbor pixel values */
    int r,g,b, row,col;
    int po,no;              /* offsets for left and right columns in 3x3 */

    orow = o_image + m_thread_num()*cols;
    for (row = m_thread_num() ; row < rows; ) {
        irow0 = irow1 = irow2 = &i_image[row * cols];
        if (row != 0) irow0-=cols;
        if (row != (rows-1)) irow2+=cols;
        for (col = cols-1,po= col>0?1:0,no=0,ip0=irow0,ip1=irow1,
                 ip2=irow2,op=orow;
             col >= 0;
             col--,ip0++,ip1++,ip2++,op++, no |= 1,po = col!= 0 ? po : 0) {
            /* grab 3x3 pixel values */
            pr[0] = PPM_GETR( *ip1 );      pg[0] = PPM_GETG( *ip1 );      pb[0] = PPM_GETB( *ip1 );
            pr[1] = PPM_GETR( *(ip1-no) ); pg[1] = PPM_GETG( *(ip1-no) ); pb[1] = PPM_GETB( *(ip1-no) );
            pr[5] = PPM_GETR( *(ip1+po) ); pg[5] = PPM_GETG( *(ip1+po) ); pb[5] = PPM_GETB( *(ip1+po) );
            pr[3] = PPM_GETR( *(ip2) );    pg[3] = PPM_GETG( *(ip2) );    pb[3] = PPM_GETB( *(ip2) );
            pr[2] = PPM_GETR( *(ip2-no) ); pg[2] = PPM_GETG( *(ip2-no) ); pb[2] = PPM_GETB( *(ip2-no) );
            pr[4] = PPM_GETR( *(ip2+po) ); pg[4] = PPM_GETG( *(ip2+po) ); pb[4] = PPM_GETB( *(ip2+po) );
            pr[6] = PPM_GETR( *(ip0+po) ); pg[6] = PPM_GETG( *(ip0+po) ); pb[6] = PPM_GETB( *(ip0+po) );
            pr[8] = PPM_GETR( *(ip0-no) ); pg[8] = PPM_GETG( *(ip0-no) ); pb[8] = PPM_GETB( *(ip0-no) );
            pr[7] = PPM_GETR( *(ip0) );    pg[7] = PPM_GETG( *(ip0) );    pb[7] = PPM_GETB( *(ip0) );
            r = (*atfunc)(pr);             /* call filter 3 times */
            g = (*atfunc)(pg);
            b = (*atfunc)(pb);
            PPM_ASSIGN( *op, r, g, b );
        };
        if (num_threads > 1) {
            orow += cols*num_threads;      /* pointer arithmetic */
            row += num_threads;
        } else {
            orow += cols;
            row ++;
        };
    };
};


Figure A.1. Excerpt from the threaded version of nlfilt.

The if statement in the first line shows that single-thread performance was kept optimized by executing the original code without additional threading overhead. Only if num_threads is greater than 1 does the new version of the loop get executed. Within the if statement, the threads are forked and a procedure is called. Within the procedure, stack variables may be used, since each thread has a unique stack pointer. The loop itself is initialized with the thread number as its starting row, and each increment statement then adds num_threads to skip to the thread's next row.
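The row interleaving described above can be sketched in portable C. This is only an illustration under stated assumptions: the helpers assign_rows and rows_owned are hypothetical and not part of nlfilt, and the threads are simulated by an ordinary loop over thread numbers rather than by m_fork_n().

```c
/* Hypothetical helper: record which thread owns each row when thread t
   starts at row t and then steps by num_threads, as main_inner does. */
static void assign_rows(int rows, int num_threads, int *owner)
{
    for (int t = 0; t < num_threads; t++)
        for (int row = t; row < rows; row += num_threads)
            owner[row] = t;
}

/* Hypothetical helper: count how many rows a given thread owns. */
static int rows_owned(const int *owner, int rows, int tid)
{
    int count = 0;
    for (int row = 0; row < rows; row++)
        if (owner[row] == tid)
            count++;
    return count;
}
```

With 10 rows and 4 threads, threads 0 and 1 each own three rows and threads 2 and 3 each own two, so the imbalance is at most one row per thread.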

Figure A.2. Input and output image from the edge enhancement filter nlfilt.

Table A.1 is the profile of nlfilt. Each line shows statistics for one procedure in the program. This data was gathered during the 1-thread and 4-thread default configuration simulations, based on the number of instructions fetched within each procedure's address range. The first column lists the percentage of instructions fetched from each routine. The second column adds all child procedures called by that routine. Note that there are some minor errors in this amount because the simulator cannot always follow the program hierarchy. In particular, the _main routine is usually slightly over 100%. The next column counts the number of times the procedure is called. Column four is the raw instruction count for a single-thread execution. Column five, thread penalty, is the difference between the multithreaded and single-threaded instruction counts. This shows that main_inner_1 in the single-thread run is replaced by main_inner in the multithreaded run. The % thread column shows the percentage of time that more than one thread was active when instructions in the procedure were fetched. The last column is the procedure name. The leading underscore is an artifact of the symbol table format. Multiple prefix underscores are used by library routines. See Appendix C for a more detailed description of the fields.

Table A.1. Profile of the benchmark nlfilt.
% instruc  % parent  # calls  # instr. + tpenalty % thread. symbol
 64.8025 % 64.8025 %    34200 12436972 +        0 100.000 % _atfilt4
  0.0000 %  0.0000 %        0        0 +  6624518  99.998 % _main_inner
 34.5121 % 99.3146 %        1  6623609 + -6623609   0.000 % _main_inner_1
  0.5549 %  0.6826 %        1   106491 +        0   0.000 % _atfilt_setup
  0.0813 %  0.0813 %       44    15610 +        0   0.000 % _triang_area
  0.0339 %  0.1225 %       11     6501 +        0   0.000 % _hex_area
  0.0072 %  0.0072 %       11     1391 +        0   0.000 % _rectang_area
  0.0037 %  0.0053 %        1      710 +        0   0.000 % _sqrt
  0.0016 %  0.0022 %        1      301 +        0   0.000 % __fwalk
  0.0010 %  0.0010 %        2      190 +        0   0.000 % _scalb
  0.0006 %  0.0011 %        5      113 +        0   0.000 % ___sflush
  0.0005 %105.5217 %        1       89 +       16   8.571 % _main
  0.0005 %  0.0005 %        2       90 +        0   0.000 % _logb
  0.0001 %  0.0001 %        1       27 +        0   0.000 % _finite
  0.0001 %  0.0023 %        1       20 +        0   0.000 % _exit
  0.0000 %  0.0022 %        1        8 +        0   0.000 % __cleanup
  0.0000 %  0.0000 %        1        1 +        0   0.000 % __exit
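As a worked reading of Table A.1, the sketch below recomputes the multithreaded instruction counts from the "# instr." and "tpenalty" columns. It assumes tpenalty is the multithreaded count minus the single-thread count, as the text describes. The struct and function names are hypothetical. Only the three rows with a non-zero penalty are copied from the table.

```c
#include <stddef.h>

/* One profile row: single-thread instruction count and thread penalty. */
struct profile_row {
    const char *symbol;
    long single_instr;
    long tpenalty;
};

/* Rows of Table A.1 whose thread penalty is non-zero. */
static const struct profile_row nlfilt_rows[] = {
    { "_main_inner",         0,  6624518 },
    { "_main_inner_1", 6623609, -6623609 },
    { "_main",              89,       16 },
};

/* Instruction count in the multithreaded run for one procedure. */
static long multithread_instr(const struct profile_row *r)
{
    return r->single_instr + r->tpenalty;
}

/* Net extra instructions fetched by the multithreaded run. */
static long total_tpenalty(const struct profile_row *rows, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += rows[i].tpenalty;
    return sum;
}
```

Under that reading, main_inner fetches 909 more instructions than the main_inner_1 loop it replaces, and the three penalties sum to 925 instructions for the whole run.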

mpeg2e: MPEG II Video Compression

The benchmark mpeg2e takes a sequence of images, and creates an MPEG2 video stream [MPE94]. The ping pong sequence was used, and is shown in Figure A.3.

Figure A.3. Input and output images from mpeg2e ping pong sequence.

A profile of the program (shown in Table A.2 at the end of this Section) indicates that two routines make up the bulk of the fetched instructions: dist1 and fdct each take approximately 25%. The third, at 10%, is putbits, part of the output stream routine, which lacks the explicit data parallelism that makes threading so simple. No other single routine takes more than 4.5% of the execution time. Because of this, we can focus on just the first two routines to get the best return on (programming time) investment.

dist1 is a part of the motion estimation routine, and calculates a quality value for the difference between two integer arrays (16x16 or 16x8). The operation consists of many integer operations in nested loops, good fodder for multithreaded optimization. Each execution, however, is fairly quick, so forking and joining inside dist1 itself would add significant overhead: the profile run had nearly 1,700 calls.

The routine fullsearch calls dist1 about 50 times each time it is invoked, and is also an easy location to insert the thread forks and joins. It takes one macroblock of the picture, and determines the best motion vector between the previous frame and the new block. It consists of a pair of nested loops with repeated calls to dist1. Figure A.4 shows the loop before threads are added.

The multithreading was added by forking before the loop begins, and then having the n threads work on every nth comparison. Load balance is not much of a problem, because each iteration is similar in complexity to all the others. Figure A.5 shows the loop after threads were added.


...
  for (l=1; l<=sxy; l++) /* sxy is either 8 or 16, depending on instance */
  {
    i = i0 - l;
    j = j0 - l;
    for (k=0; k<8*l; k++)
    {
      if (i>=ilow && i<=ihigh && j>=jlow && j<=jhigh)
      {
        d = dist1(org+i+lx*j,blk,lx,0,0,h,dmin);
        if (d<dmin)
        {
          dmin = d;
          imin = i;
          jmin = j;
        }
      }
      if      (k<2*l) i++;
      else if (k<4*l) j++;
      else if (k<6*l) i--;
      else            j--;
    }
  }
...

Figure A.4. Excerpt from fullsearch routine in mpeg2e. Original version without threads.

int order[] = {          /* precalculated i,j pairs for fullsearch */
  -1, -1,  0, -1,  1, -1, 1, 0,  1, 1,  0, 1,  -1, 1,  -1, 0,  
  -2, -2,  -1, -2,  0, -2, 1, -2,  2, -2,  2, -1,  2, 0,  2, 1, ...

#define FULLSEARCH_INNER_QUICK(myi, myj, myk, myl, myd, myzz, mynum) \
  for (;m_dec_ipc_reg(4) >= 0;) { \
    m_exclusive_run(1);  /* atomic */ \
    myi=s_i0-order[m_inc_ipc_reg(5)]; \
    myj=s_j0-order[m_inc_ipc_reg(5)]; \
    m_exclusive_run(0); \
    if (myi>=ilow && myi<=ihigh && myj>=jlow && myj<=jhigh) { \
      myd = dist1(s_org+myi+s_lx*myj,s_blk,s_lx,0,0, \
                  s_h,m_read_ipc_reg(1)); \
      if (myd<m_read_ipc_reg(1)) { \
        m_exclusive_run(1); \
        if (myd<m_read_ipc_reg(1)) /*repeat as exclusive*/ \
        { \
          m_write_ipc_reg(1,myd); /*dmin = myd*/ \
          m_write_ipc_reg(2,myi); /*imin = myi*/ \
          m_write_ipc_reg(3,myj); /*jmin = myj*/ \
        } \
        m_exclusive_run(0); \
      } \
    } \
  };

Figure A.5. Excerpt from fullsearch routine in mpeg2e. Modified version with Shared Memory threads. (continued below)

...
s_org = org; s_blk = blk;           /* shared read only pointers */
s_lx=lx; s_i0=i0; s_j0=j0; s_h=h;   /* shared read only integers */
m_write_ipc_reg(1,dmin);            /*dmin = d;*/
m_write_ipc_reg(2,imin);            /*imin = i;*/
m_write_ipc_reg(3,jmin);            /*jmin = j;*/
m_write_ipc_reg(4,(sxy+1)*(sxy+1)); /* number of iterations */
m_write_ipc_reg(5,-1);              /* current position in order[] */
m_fork_n(0,num_threads-1);
switch (m_thread_num()) {
case 0:
  FULLSEARCH_INNER_QUICK (i_0, j_0, k_0, l_0, d_0, zz_0, 0); break;
case 1:
  FULLSEARCH_INNER_QUICK (i_1, j_1, k_1, l_1, d_1, zz_1, 1); break;
case 2:
  FULLSEARCH_INNER_QUICK (i_2, j_2, k_2, l_2, d_2, zz_2, 2); break;
case 3:
  FULLSEARCH_INNER_QUICK (i_3, j_3, k_3, l_3, d_3, zz_3, 3); break;
case 4:
  FULLSEARCH_INNER_QUICK (i_4, j_4, k_4, l_4, d_4, zz_4, 4); break;
case 5:
  FULLSEARCH_INNER_QUICK (i_5, j_5, k_5, l_5, d_5, zz_5, 5); break;
case 6:
  FULLSEARCH_INNER_QUICK (i_6, j_6, k_6, l_6, d_6, zz_6, 6); break;
};
m_join(0);
dmin = m_read_ipc_reg(1);
imin = m_read_ipc_reg(2);
jmin = m_read_ipc_reg(3);
...

Figure A.5. (cont.) Excerpt from fullsearch routine in mpeg2e. Modified version with Shared Memory threads.

The complex looping mechanism that searches in a spiral from center to edge is a problem for the multithreaded code. The spiral order is used because dist1 can abort a comparison when it knows the current match is worse than the best one seen so far, making many comparisons exit well before the entire array has been checked. The most likely match for two frames is right in the middle of the block, with little or no motion. By checking for this situation first, the remaining checks are much quicker.

The first solution to this strange ordering was to get rid of the nested loops, and to make a single loop that has a more complex increment statement. This proved inefficient as each thread was forced to do too much redundant processing just to select what to work on. Instead, the complex looping has been replaced by a pre-initialized array of coordinates. Each thread can simply take the nth location in this list, and doesn't have the overhead of repeating the compare and increment statements many times for each iteration of the loop.
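A minimal sketch of how such a coordinate table could be precalculated from the original spiral loop of Figure A.4 follows. The function name build_spiral_order and its parameters are assumptions made for illustration. The thesis lists only the resulting order[] values, not the code that produced them.

```c
/* Fill table[] with (i, j) offset pairs in the order visited by the
   original fullsearch loop: ring l starts at (-l, -l) and walks 8*l steps.
   Returns the number of pairs written (at most max_pairs). */
static int build_spiral_order(int sxy, int *table, int max_pairs)
{
    int n = 0;
    for (int l = 1; l <= sxy; l++) {
        int di = -l, dj = -l;
        for (int k = 0; k < 8 * l; k++) {
            if (n >= max_pairs)
                return n;
            table[2 * n]     = di;
            table[2 * n + 1] = dj;
            n++;
            if      (k < 2 * l) di++;   /* first side of the ring  */
            else if (k < 4 * l) dj++;   /* second side of the ring */
            else if (k < 6 * l) di--;   /* third side of the ring  */
            else                dj--;   /* fourth side of the ring */
        }
    }
    return n;
}
```

For sxy = 2 this yields 24 pairs, and the leading values agree with the start of the order[] initializer shown in Figure A.5.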

The threads are written for the Shared Memory configuration. Upon reaching the fork, each child thread gets a unique stack pointer, while sharing data memory.

All read-only variables that are stored on the stack (org, blk, lx, i0, j0, and h) must be copied to data memory versions, which have been defined as static, placing them in the program's shared data area. Three variables, dmin, imin, and jmin, are stored in ipc_reg registers, for easy sharing between threads. Some variables (i, j, k, d, my_num, and zz) are duplicated for each thread, so that each thread can use a different set without interference. To make the code recognize which set of these to use, the loop has been converted to a macro that is instantiated once for each possible thread.

Since all threads are looking for a value of d<dmin, there is a chance that more than one thread will find a match at about the same time. To prevent data corruption, the read/write set must be an atomic transaction. However, a successful comparison only happens around 1% of the time. Dropping into exclusive_run for each comparison would waste cycles and unnecessarily limit parallelism. Instead, the comparison (d<dmin) is first performed unprotected; then, only if the condition is met, exclusive_run is set and the comparison is repeated before updating the registers.
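The test, lock, and re-test pattern can be sketched in standard C11. This is not the SDSP mechanism: a spinlock built on atomic_flag stands in for m_exclusive_run(), an atomic_int stands in for ipc_reg 1, and all function and variable names below are hypothetical.

```c
#include <stdatomic.h>

/* Stand-in for m_exclusive_run(on/off): a simple C11 spinlock (assumption). */
static atomic_flag excl = ATOMIC_FLAG_INIT;
static void exclusive_on(void)  { while (atomic_flag_test_and_set(&excl)) { } }
static void exclusive_off(void) { atomic_flag_clear(&excl); }

/* Shared best match; these play the role of dmin, imin and jmin. */
static atomic_int best_d;
static int best_i, best_j;

static void reset_best(int d, int i, int j)
{
    atomic_store(&best_d, d);
    best_i = i;
    best_j = j;
}

/* Returns 1 if the candidate replaced the current best, 0 otherwise. */
static int try_update_best(int d, int i, int j)
{
    int updated = 0;
    if (d >= atomic_load(&best_d))    /* cheap unprotected test, usually fails */
        return 0;
    exclusive_on();
    if (d < atomic_load(&best_d)) {   /* repeat the test while exclusive */
        atomic_store(&best_d, d);
        best_i = i;
        best_j = j;
        updated = 1;
    }
    exclusive_off();
    return updated;
}
```

The second test matters because another thread may have stored a smaller value between the first test and the moment exclusivity was obtained.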

The second mpeg2e loop: fdct

The other routine that has been threaded is fdct, forward discrete cosine transform, shown in Figure A.6. The threaded version is shown in Figure A.7.


...
  for (i=0; i<8; i++)
    for (j=0; j<8; j++)
    {
      s = 0.0;
      for (k=0; k<8; k++)
        s += c[j][k] * block[8*i+k];
      tmp[8*i+j] = s;
    }
  for (j=0; j<8; j++)
    for (i=0; i<8; i++)
    {
      s = 0.0;
      for (k=0; k<8; k++)
        s += c[i][k] * tmp[8*k+j];
      block[8*i+j] = (int)floor(s+0.499999);
    }
...

Figure A.6. Excerpt from fdct routine in mpeg2e. Original version without threads

The routine fdct consists of two sets of nested loops, with the inner multiply executed 512 times (8 x 8 x 8) in each set. The variable c[8][8] is a double precision floating point array, so each multiply takes 15 cycles to complete in the Floating Point Unit. The more of these we can get into the Instruction Window at the same time, the more parallelism we can achieve (assuming we have enough Functional Units to work on them in parallel).


#define SHARED static    /* shared variables are declared static */
#define FDCT_INNER_A(myi, myj, myk, mys) \
    for (myi=0; myi<8; myi++) \
      for (myj=m_thread_num(); myj<8; myj+=num_threads) { \
        mys = 0.0; \
        for (myk=0; myk<8; myk++) \
          mys += c[myj][myk] * s_block[8*myi+myk]; \
        tmp[8*myi+myj] = mys; \
      };

#define FDCT_INNER_B(myi, myj, myk, mys) \
    for (myj=0; myj<8; myj++) \
      for (myi=m_thread_num(); myi<8; myi+=num_threads) { \
        mys = 0.0; \
        for (myk=0; myk<8; myk++) \
          mys += c[myi][myk] * tmp[8*myk+myj]; \
        s_block[8*myi+myj] = (int)floor(mys+0.499999); \
      };

void fdct(short *block) {
  SHARED int i0, j0, k0;
  SHARED int i1, j1, k1;
  SHARED int i2, j2, k2;
  SHARED int i3, j3, k3;
  SHARED int i4, j4, k4;
  SHARED int i5, j5, k5;
  SHARED int i6, j6, k6;
  SHARED double s0, s1, s2, s3, s4, s5, s6;
  SHARED double tmp[64];
  SHARED short *s_block;
  s_block = block;
  m_fork_n(0,num_threads-1);
  switch(m_thread_num()) {
    case 0:  FDCT_INNER_A(i0,j0,k0,s0); break;
    case 1:  FDCT_INNER_A(i1,j1,k1,s1); break;
    case 2:  FDCT_INNER_A(i2,j2,k2,s2); break;
    case 3:  FDCT_INNER_A(i3,j3,k3,s3); break;
    case 4:  FDCT_INNER_A(i4,j4,k4,s4); break;
    case 5:  FDCT_INNER_A(i5,j5,k5,s5); break;
    case 6:  FDCT_INNER_A(i6,j6,k6,s6); break;
  };
  m_join(0);             
  m_fork_n(0,num_threads-1);
  switch(m_thread_num()) {
    case 0:  FDCT_INNER_B(i0,j0,k0,s0); break;
    case 1:  FDCT_INNER_B(i1,j1,k1,s1); break;
    case 2:  FDCT_INNER_B(i2,j2,k2,s2); break;
    case 3:  FDCT_INNER_B(i3,j3,k3,s3); break;
    case 4:  FDCT_INNER_B(i4,j4,k4,s4); break;
    case 5:  FDCT_INNER_B(i5,j5,k5,s5); break;
    case 6:  FDCT_INNER_B(i6,j6,k6,s6); break;
  };
  m_join(0);
...

Figure A.7. Excerpt from fdct routine in mpeg2e. Modified with Shared Memory threads.

The middle nested loop has been divided up so that each of n threads takes 1/n of the iterations. The outer loop could have been divided in the same way instead; the choice was arbitrary. The array tmp[] is calculated by the first set of loops, then used by the second, so the threads must be synchronized between them with an m_join() followed by a new fork. The second set of nested loops is threaded in the same way as the first.
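The split of the middle loop can be checked with a small host-side sketch. The threads are simulated by calling the pass once per thread number, and the coefficient values are arbitrary placeholders rather than real DCT coefficients. The names fdct_pass_a, fill_test_coefficients and split_matches_reference are hypothetical.

```c
static double c[8][8];   /* coefficient matrix, filled with placeholder values */

static void fill_test_coefficients(void)
{
    for (int j = 0; j < 8; j++)
        for (int k = 0; k < 8; k++)
            c[j][k] = (double)(8 * j + k + 1) / 64.0;   /* not real DCT values */
}

/* First fdct pass, restricted to the j values owned by thread tid. */
static void fdct_pass_a(const short *block, double *tmp, int tid, int num_threads)
{
    for (int i = 0; i < 8; i++)
        for (int j = tid; j < 8; j += num_threads) {
            double s = 0.0;
            for (int k = 0; k < 8; k++)
                s += c[j][k] * block[8 * i + k];
            tmp[8 * i + j] = s;
        }
}

/* Returns 1 if running the pass once per simulated thread fills tmp[]
   exactly as a single-thread run does, 0 otherwise. */
static int split_matches_reference(int num_threads)
{
    short block[64];
    double ref[64], out[64];
    fill_test_coefficients();
    for (int n = 0; n < 64; n++) {
        block[n] = (short)(n - 32);
        ref[n] = 0.0;
        out[n] = -1.0;      /* sentinel: any element left unwritten is a mismatch */
    }
    fdct_pass_a(block, ref, 0, 1);
    for (int tid = 0; tid < num_threads; tid++)
        fdct_pass_a(block, out, tid, num_threads);
    for (int n = 0; n < 64; n++)
        if (ref[n] != out[n])
            return 0;
    return 1;
}
```

Each element of tmp[] is written by exactly one thread, so the first pass needs no locking. Only the boundary between the two passes requires the join.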

Load balancing is not a problem in either loop, because each iteration has exactly the same complexity.

As in the fullsearch, shared variables are defined as static, so that they are allocated in data memory instead of on the stack. Each variable has been duplicated for each thread. The loop was turned into a macro that lets each instance reference a different set of variables.

Table A.2 shows the profile of mpeg2e sorted by % instructions. See Appendix C for description of the fields.

Table A.2. Profile of the benchmark mpeg2e.
% instruc  % parent  # calls  # instr. + tpenalty % thread. symbol
 25.1379 % 27.3676 %      216  9029448 +   150336  99.859 % _fdct
 24.8493 % 24.8493 %     1695  8925783 +        0  51.081 % _dist1
 10.1725 % 10.1736 %    13207  3653913 +        0   0.000 % _putbits
  4.4671 %  4.4671 %        9  1604556 +        0   0.000 % _calcSNR1
  2.9100 %  2.9100 %      138  1045259 +        0   0.000 % _quant_non_intra
  2.8469 % 12.5056 %     6278  1022598 +        0   0.000 % _putAC
  2.5124 %  2.5124 %      216   902448 +        0   0.000 % _add_pred
  2.4562 %  2.4562 %     1728   882240 +        0   0.000 % _idctcol
  2.2815 %  2.2815 %      216   819504 +        0   0.000 % _sub_pred
  2.1133 %  2.1133 %     1728   759090 +        0   0.000 % _idctrow
  2.0234 %  2.0234 %      138   726794 +        0   0.000 % _iquant1_non_intra
  1.8621 %  1.8621 %       47   668841 +        0   0.000 % _dist2
  1.7910 %  1.7910 %       78   643302 +        0   0.000 % _quant_intra
  1.5831 %  1.5831 %       87   568628 +        0   0.000 % _pred_comp
  1.4463 %  2.2382 %    13880   519507 +        0  99.463 % _floor
  1.3815 %  1.3815 %      144   496224 +        0   0.000 % _var_sblk
  1.3149 %  1.3149 %       78   472324 +        0   0.000 % _iquant1_intra
  1.3108 %  1.3108 %       36   470844 +        0   0.000 % _variance
  1.2071 %  1.2071 %       12   433572 +        0   0.000 % _bdist2
  1.0154 %  7.8405 %      103   364745 +        0   0.000 % _putnonintrablk
  0.7920 %  0.7920 %     6103   284496 +        0  99.445 % _ceil
  0.7396 %  6.6983 %       78   265672 +        0   0.000 % _putintrablk
  0.5565 %  1.5108 %      175   199876 +        0   0.000 % _vfprintf
  0.5416 % 25.3910 %       36   194556 +     2268  83.907 % _fullsearch
  0.4723 %  0.4723 %       13   169650 +        0   0.000 % _clearblock
  0.3969 %  0.3969 %     1141   142565 +        0   0.000 % _bcopy
  0.3037 %  0.3324 %      396   109098 +        0   0.000 % ___qdivrem
  0.2345 %  4.8040 %      216    84240 +        0   0.000 % _idct
  0.1867 %  0.5084 %      414    67055 +        0   0.000 % ___sfvwrite
  0.0999 % 21.5775 %        3    35899 +        0   0.000 % _putpict
  0.0914 % 29.7405 %        3    32841 +        0   0.000 % _transform
  0.0914 %  7.4078 %        3    32841 +        0   0.000 % _itransform
  0.0836 %  0.1193 %       67    30026 +        0   0.000 % ___dtoa
  0.0518 %  0.3962 %        3    18604 +        0   0.000 % _stats
  0.0501 %  0.1403 %      288    18009 +        0   0.000 % _fread
  0.0453 %  0.0453 %     1276    16271 +        0   0.000 % ___ucmpdi2
  0.0402 % 99.9986 %        1    14426 +        0   0.000 % _putseq
  0.0373 %  0.1438 %       78    13384 +        0   0.000 % _putDC
  0.0357 %  0.0357 %        9    12807 +        0   0.000 % _border_extend
  0.0330 %  0.0357 %      282    11838 +        0   0.000 % _memchr
  0.0320 %  1.6150 %       29    11486 +        0   0.000 % _pred
  0.0311 % 29.8020 %       36    11157 +        0   0.000 % _frame_ME
  0.0277 %  0.2354 %        3     9945 +        0   0.000 % _read_y_u_v
  0.0273 %  2.0507 %      138     9798 +        0   0.000 % _iquant_non_intra
  0.0259 %  0.4582 %      434     9310 +        0   0.000 % _open
  0.0185 %  0.0258 %       36     6660 +        0   0.000 % _rc_calc_mquant
  0.0178 %  1.3328 %       78     6396 +        0   0.000 % _iquant_intra
  0.0177 %  1.3992 %        3     6368 +        0   0.000 % _calc_actj
  0.0156 %  0.0376 %       36     5596 +        0   0.000 % _putmv
  0.0151 %  2.1025 %       36     5440 +        0   0.000 % _predict_mb
  0.0147 %  0.2147 %      103     5293 +        0   0.000 % _putACfirst
  0.0119 %  2.1144 %        3     4281 +        0   0.000 % _predict
  0.0102 %  0.0478 %       18     3672 +        0   0.000 % _putmvs
  0.0102 % 29.8207 %        3     3669 +        0   0.000 % _motion_estimation
  0.0101 %  1.4661 %      151     3624 +        0   0.000 % _fprintf
  0.0088 %  0.1769 %      198     3168 +        0   0.000 % ___umoddi3
  0.0088 %  0.0132 %       18     3150 +        0   0.000 % _log
  0.0073 %  0.0073 %       21     2610 +        0   0.000 % ___sfp
  0.0073 %  0.0074 %       48     2605 +        0   0.000 % _malloc
  0.0068 %  0.0303 %       35     2450 +        0   0.000 % _putmbtype
  0.0067 %  0.0214 %       36     2396 +        0   0.000 % _putmotioncode
  0.0067 %  0.0106 %       39     2394 +        0   0.000 % ___sflush
  0.0053 %  0.0187 %       35     1890 +        0   0.000 % _putaddrinc
  0.0048 %  0.0048 %      143     1722 +        0   0.000 % ___cmpdi2
  0.0046 %  0.0046 %        3     1644 +        0   0.000 % _dct_type_estimation
  0.0044 %  0.0150 %       21     1566 +        0   0.000 % _fopen
  0.0043 %  0.0045 %       21     1560 +        0   0.000 % ___swhatbuf
  0.0040 %  0.0098 %       21     1433 +        0   0.000 % _fclose
  0.0039 %  0.0588 %       24     1416 +        0   0.000 % _sprintf
  0.0039 %  0.1008 %       52     1404 +        0   0.000 % _putDClum
  0.0039 %  0.1681 %      198     1386 +        0   0.000 % ___udivdi3
  0.0038 %  0.0040 %       27     1377 +        0   0.000 % ___swrite
  0.0038 %  0.0136 %       22     1374 +        0   0.000 % ___smakebuf
  0.0034 %  0.0034 %       67     1206 +        0   0.000 % _isnan
  0.0034 %  0.0034 %       67     1206 +        0   0.000 % _isinf
  0.0030 %  0.0030 %       73     1073 +        0   0.000 % _strlen
  0.0030 %  4.7003 %        3     1062 +        0   0.000 % _calcSNR
  0.0027 %  0.0262 %       20      980 +        0   0.000 % _putcbp
  0.0027 %  0.0668 %        1      972 +        0   0.000 % _putuserdata
  0.0024 %  0.0078 %       13      865 +        0   0.000 % ___swbuf
  0.0023 %  0.0023 %       20      810 +        0   0.000 % _logb
  0.0020 %  0.0489 %       26      702 +        0   0.000 % _putDCchrom
  0.0020 %  0.0094 %       10      702 +        0   0.000 % ___srefill
  0.0019 %  0.0019 %       21      684 +        0   0.000 % ___sflags
  0.0017 %  0.0017 %       51      612 +        0   0.000 % _bitcount
  0.0015 %  0.1155 %        3      552 +        0   0.000 % _writeframe
  0.0015 %  0.0087 %       12      540 +        0   0.000 % ___swsetup
  0.0014 %  0.0014 %       83      486 +        0   0.000 % _finite
  0.0013 %  0.0083 %       28      476 +        0   0.000 % _fflush
  0.0013 %  0.0013 %       21      460 +        0   0.000 % _free
  0.0012 %  0.4013 %        3      422 +        0   0.000 % _rc_update_pict
  0.0011 %  0.0788 %        9      405 +        0   0.000 % _fwrite
  0.0011 %  0.1128 %        3      402 +        0   0.000 % _calc_vbv_delay
  0.0010 %  1.4542 %        3      354 +        0   0.000 % _rc_init_pict
  0.0008 %  0.0008 %       18      288 +        0   0.000 % _ldexp
  0.0007 %  0.0121 %       16      247 +        0   0.000 % _alignbits
  0.0007 %  0.0008 %       21      240 +        0   0.000 % ___sclose
  0.0007 %  0.0007 %       10      234 +        0   0.000 % ___sread
  0.0005 %  0.1091 %        1      197 +        0   0.000 % _rc_init_seq
  0.0005 %  0.0009 %        3      186 +        0   0.000 % _rc_start_mb
  0.0005 %  0.0005 %       18      180 +        0   0.000 % ___negdi2
  0.0005 %  0.0136 %       18      162 +        0   0.000 % _log10
  0.0004 %  0.2358 %        3      141 +        0   0.000 % _readframe
  0.0004 %  0.1535 %        3      136 +        0   0.000 % _putpicthdr
  0.0003 %  0.0003 %        1      121 +        0   0.000 % _frametotc
  0.0003 %  0.0329 %        1       92 +        0   0.000 % _rc_init_GOP
  0.0002 %  0.0110 %        1       63 +        0   0.000 % _putgophdr
  0.0002 %  0.0197 %        1       61 +        0   0.000 % _putseqhdr
  0.0002 %  0.0002 %       27       54 +        0   0.000 % _write
  0.0001 %  0.0002 %        3       45 +        0   0.000 % _vbv_end_of_picture
  0.0001 %100.7170 %        1       40 +        0   0.000 % _main
  0.0001 %  0.0001 %       21       40 +        0   0.000 % _close
  0.0001 %  0.0001 %       21       40 +        0   0.000 % _fstat
  0.0001 %  0.0001 %       10       18 +        0   0.000 % _read
  0.0000 %  0.0064 %        1       14 +        0   0.000 % _putseqend
  0.0000 %  0.0000 %       20        8 +        0   0.000 % _sbrk

pov: Persistence of Vision Raytracer

The benchmark pov is a ray tracer [POV96]. It reads a scene description file, parses it, preprocesses it, then renders an image pixel by pixel. The read, parse, and preprocess routines do not lend themselves to multithreading, and are only executed one time at the beginning of the program. Therefore these routines have been left scalar.

The threading was done to the ray-tracing portion of the code. Each thread is initially assigned one pixel. When that pixel has been completed, an ipc_reg increment command is used to get the next available pixel. At the end of each scan-line, the threads join, and a single thread writes the resulting line of the image to a file. The next line begins with another fork. Figures A.8 and A.9 show the contents of the main loop before and after threading.


...
/* Loop over all columns. */
for (x = opts.First_Column; x < opts.Last_Column; x++)
{
  Check_User_Abort(FALSE);            /* Check for user abort. */
  trace_pixel(x, y, Current_Line[x]); /* Trace current pixel. */
  plot_pixel(x, y, Current_Line[x]);  /* Display pixel. */
}
output_line(y);          /* Write current row to disk.  */
...

Figure A.8. Excerpt from main loop of pov benchmark before threading.

...
/* Loop over all columns. */
m_write_ipc_reg(1,opts.First_Column-1);
m_fork_n(0,max_threads-1);
for (x=m_inc_ipc_reg(1); x<opts.Last_Column; x=m_inc_ipc_reg(1))
{
  Check_User_Abort(FALSE);            /* Check for user abort. */
  trace_pixel(x, y, Current_Line[x]); /* Trace current pixel. */
  plot_pixel(x, y, Current_Line[x]);  /* Display pixel. */
  m_set_shared(Current_Line[x],Current_Line[x],sizeof(COLOUR));
}
m_join(0);
output_line(y);          /* Write current row to disk.  */
...

Figure A.9. Excerpt from main loop of pov benchmark after threading.

Pov uses the Private Memory Model because of the large number of global read/write variables. This model allows threads to be added with just a few lines of code, and no significant rewrite. The loop increment variable is stored in an ipc_reg register, and shows the effectiveness of the atomic read/increment command: each thread selects the next available pixel for processing, regardless of the order in which the threads complete their previous pixels. The command m_set_shared() is used to copy the completed pixel into thread zero's memory space. When all threads have reached the end of the row, thread zero writes it to the file before beginning the next line by forking off the threads again.
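The pixel hand-out can be sketched with a C11 atomic counter standing in for the ipc_reg register. The sketch assumes that m_inc_ipc_reg() returns the value after the increment, which is inferred from the way Figure A.9 initializes the register to First_Column-1. The names start_line, claim_next_column and trace_columns are hypothetical.

```c
#include <stdatomic.h>

/* Stand-in for ipc_reg 1: the last column that has been handed out. */
static atomic_int next_col;

static void start_line(int first_column)
{
    atomic_store(&next_col, first_column - 1);
}

/* Atomically increment and return the new value (assumed semantics). */
static int claim_next_column(void)
{
    return atomic_fetch_add(&next_col, 1) + 1;
}

/* Worker loop: claim columns until the line is exhausted.
   Returns how many columns this worker processed. */
static int trace_columns(int last_column, int *traced)
{
    int count = 0;
    for (int x = claim_next_column(); x < last_column; x = claim_next_column()) {
        traced[x]++;        /* placeholder for trace_pixel / plot_pixel */
        count++;
    }
    return count;
}
```

Because each claim is a single atomic operation, no column is handed to two workers, and a worker that finishes a cheap pixel early simply claims the next one.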

Figure A.10 shows a sample output from pov using the scene description file simple.pov provided with the simulator source code.

Table A.3 is the profile of pov. See Appendix C for description of the fields.

Figure A.10. Output image from pov ray tracer.

Table A.3. Profile of the benchmark pov.
% instruc  % parent  # calls  # instr. + tpenalty % thread. symbol
 19.2683 % 19.2683 %     2755  1534161 +        0 100.000 % _bcopy
 16.5334 % 24.7711 %     1900  1316404 +        0 100.000 % _sqrt
 12.0237 % 12.0237 %    10171   957339 +        0  99.994 % _memcpy
  4.6843 %  4.6843 %     3950   372970 +        0 100.000 % _scalb
  4.0106 %  6.0535 %     1794   319332 +        0 100.000 % _compute_lighted_texture
  2.8172 %  9.1793 %     3985   224306 +        0  99.990 % _Intersection
  2.3439 %  3.8582 %     1537   186624 +        0 100.000 % _create_ray
  2.2189 %  2.2189 %     3950   176670 +        0 100.000 % _logb
  2.1894 %  2.1907 %     2288   174324 +     1194  99.030 % _malloc
  2.1255 % 11.3995 %     5038   169234 +        0 100.000 % _Determine_Apparent_Colour
  1.8588 %  1.8588 %     1177   148003 +        0  99.936 % _Intersect_Plane
  1.8448 %  2.4343 %     1699   146885 +        0 100.000 % _Intersect_Sphere
  1.7671 %  4.7460 %     1699   140699 +        0 100.000 % _All_Sphere_Intersections
  1.7058 %  4.5442 %     1755   135817 +        0 100.000 % _Diffuse
  1.6974 %  3.9346 %     4608   135146 +        3  99.957 % _Trace
  1.4051 %  2.9477 %       47   111875 +        0   0.000 % _Write_Targa_Pixel
  1.3151 %  1.3151 %     2304   104706 +        0   0.000 % _floor
  1.2242 %  3.8260 %     1196    97474 +        0 100.000 % _do_light
  1.2155 %  1.2155 %     3122    96782 +        0 100.000 % _Ray_In_Bound
  1.1743 %  4.7242 %     2259    93496 +      624 100.000 % _pov_malloc
  1.0700 %  1.7613 %     1332    85198 +        0 100.000 % _block_point_light
  0.9771 %  3.4883 %     1177    77800 +        0  99.977 % _All_Plane_Intersections
  0.8584 %  0.8584 %      768    68347 +        0  99.870 % _Clip_Colour
  0.6911 %  0.6911 %     3050    55026 +        0 100.000 % _finite
  0.6656 %  2.9998 %     2304    52992 +        0  99.966 % _trace_pixel
  0.6309 %  2.0354 %     1196    50232 +        0 100.000 % _do_texture_map
  0.5765 %  0.5770 %     3533    45905 +        6 100.000 % _open_istack
  0.5444 %  1.7309 %     1227    43344 +        0  99.336 % _pov_free
  0.5408 %  0.5408 %      598    43056 +        0 100.000 % _Attenuate_Light
  0.5286 %  0.5286 %     2286    42090 +        0  98.033 % _free
  0.5116 %  0.5116 %      304    40736 +        0 100.000 % _do_diffuse
  0.5016 %  0.5016 %      768    39936 +        0  99.870 % _do_fog
  0.4968 %  0.4968 %     2261    39557 +      267 100.000 % _mem_stats_alloc
  0.4356 %  2.0128 %      598    34684 +        0 100.000 % _Compute_Pigment
  0.4258 %  0.5558 %      305    33900 +        0 100.000 % _do_phong
  0.4206 %  0.4206 %     1288    33488 +        0 100.000 % _Point_In_Clip
  0.4050 %  0.6443 %      150    32250 +        0 100.000 % _log__D
  0.3853 % 15.5184 %      818    30675 +        0 100.000 % _block_light_source
  0.3681 %  1.1761 %      793    29310 +      648  97.637 % _Start_Non_Adaptive_Tracing
  0.3665 %  1.6976 %     2304    29184 +        0  99.870 % _Do_Finite_Atmosphere
  0.3383 %  0.3404 %      768    26936 +      168 100.000 % _initialize_ray_container_state
  0.3259 %  0.7498 %      450    25950 +        0 100.000 % _pow
  0.3079 %  0.3079 %      598    24518 +        0 100.000 % _create_texture_list
  0.3024 %  0.3024 %      344    24080 +        0 100.000 % _Sphere_Normal
  0.2990 %  0.5016 %      768    23808 +        0  99.870 % _do_atmospheric_scattering
  0.2894 %  0.4919 %      768    23040 +        0  99.870 % _do_rainbow
  0.2502 %  0.2767 %      840    19920 +        0   0.000 % _Write_Targa_Line
  0.2411 %  0.2411 %      768    19200 +        0  99.870 % _plot_pixel
  0.2328 %  0.2328 %      598    18538 +        0 100.000 % _Copy_Ray_Containers
  0.2315 %  0.2315 %      768    18432 +        0 100.000 % _All_Light_Source_Intersections
  0.2315 %  0.2315 %      768    18432 +        0 100.000 % _Check_User_Abort
  0.2219 %  0.2219 %      768    17664 +        0  99.870 % _gamma_correct
  0.2026 %  0.2508 %      768    16128 +        0  99.870 % ./spheres.o
  0.2026 %  0.2508 %      768    16128 +        0  99.870 % ./super.o
  0.1809 %  0.4842 %      150    14400 +        0 100.000 % _exp__D
  0.1774 %  0.1774 %     3531    14124 +        0  99.972 % _close_istack
  0.1696 %  0.1696 %     2051    13500 +        0 100.000 % _copysign
  0.1361 %  0.1361 %     2261    10836 +        0  99.336 % _mem_stats_free
  0.1294 %  0.1294 %     1288    10304 +        0 100.000 % _incstack
  0.1068 %  0.1068 %      170     8500 +        0  99.412 % _do_skysphere
  0.1055 %  0.2783 %      105     8400 +        0 100.000 % _filter_shadow_ray
  0.0829 %  0.3254 %      254     6604 +        0 100.000 % _Plane_Normal
  0.0686 %  0.0686 %     1366     5464 +        0 100.000 % _Initialize_Ray_Containers
  0.0619 %  0.3224 %      170     4930 +        0  99.412 % _Do_Infinite_Atmosphere
  0.0543 %  0.0998 %       48     4320 +        0   0.000 % _freopen
  0.0368 %  0.1088 %      288     2933 +        0   0.000 % ___sflush
  0.0283 %  0.0283 %      150     2250 +        0 100.000 % _ldexp
  0.0225 %  0.0244 %       25     1794 +        0   0.000 % ___swhatbuf
  0.0211 %  0.0350 %       83     1679 +        0   0.000 % ___swbuf
  0.0194 %  0.0590 %       50     1541 +        0   0.000 % ___smakebuf
  0.0154 %  0.0320 %      116     1224 +        0   0.000 % ___swrite
  0.0130 %  0.0343 %       49     1035 +        0   0.000 % ___swsetup
  0.0121 %  0.0121 %       30      960 +        0   0.000 % ___sflags
  0.0112 %  0.0112 %       24      888 +        0   0.000 % _Prune_Vista_Tree
  0.0108 %  0.0332 %       47      858 +        0   0.000 % _output_line
  0.0087 %  0.0087 %       24      696 +        0   0.000 % _check_stats
  0.0051 %  0.2018 %      260      408 +        0   0.000 % _fflush
  0.0038 %  0.0054 %        1      301 +        0   0.000 % __fwalk
  0.0036 %  0.0079 %       26      288 +        0   0.000 % ___sclose
  0.0009 %  0.0009 %        1       72 +      216 100.000 % _Inside_Sphere
  0.0008 %  0.0017 %        6       66 +      198 100.000 % _create_istack
  0.0008 %  0.0008 %        1       60 +      180 100.000 % _Inside_Plane
  0.0012 %  0.0012 %       50       96 +        0   0.000 % _close
  0.0008 %  0.0010 %        2       64 +        0   0.000 % _Destroy_Text_Streams
  0.0006 %  0.0053 %        8       48 +        0   0.000 % _close_all
  0.0006 %  0.0006 %       24       48 +        0   0.000 % _dup2
  0.0006 %  0.0006 %       30       48 +        0   0.000 % _open
  0.0006 %  0.0006 %      116       48 +        0   0.000 % _write
  0.0006 %  0.0008 %        5       47 +        0   0.000 % _Destroy_IStacks
  0.0006 %  0.0006 %       25       46 +        0   0.000 % _fstat
  0.0005 %  0.0017 %        4       37 +        0   0.000 % _Destroy_Frame
  0.0004 %  0.0006 %        5       32 +        0   0.000 % _Free_Noise_Tables
  0.0004 %  0.0006 %        2       30 +        0   0.000 % _destroy_libraries
  0.0004 %  0.0004 %        2       29 +        0   0.000 % _FreeFontInfo
  0.0001 %  0.0001 %        1       11 +       33 100.000 % _Inside_Light_Source
  0.0003 %  0.0009 %        4       26 +        0   0.000 % _Close_Targa_File
  0.0003 %  0.0003 %        2       25 +        0   0.000 % _Deinitialize_Radiosity_Code
  0.0003 %  0.0003 %        2       24 +        0   0.000 % _Destroy_Skysphere
  0.0003 %  0.0003 %        2       21 +        0   0.000 % _Destroy_Light_Buffers
  0.0003 %118.0327 %        4       20 +        0   0.000 % _main
  0.0003 %  0.0003 %        2       20 +        0   0.000 % _exit
  0.0002 %106.3712 %       29       18 +        0   0.000 % _FrameRender
  0.0002 %  0.0004 %        5       17 +        0   0.000 % _Deinitialize_Lighting_Code
  0.0002 %  0.0006 %        3       15 +        0   0.000 % _Terminate_POV
  0.0002 %  0.0004 %        2       14 +        0   0.000 % _destroy_shellouts
  0.0002 %  0.0004 %        4       14 +        0   0.000 % _Terminate_Renderer
  0.0002 %  0.0002 %        2       13 +        0   0.000 % _Free_Iteration_Stack
  0.0002 %  0.0005 %        5       13 +        0   0.000 % _Deinitialize_VLBuffer_Code
  0.0002 %  0.0011 %        5       12 +        0   0.000 % _Destroy_Camera
  0.0001 %  0.0001 %        2       11 +        0   0.000 % _Deinitialize_Atmosphere_Code
  0.0001 %  0.0001 %        1       11 +        0   0.000 % _destroy_histogram
  0.0001 %  0.0001 %        2       10 +        0   0.000 % _Destroy_Random_Generators
  0.0001 %  0.0001 %       23        4 +       12 100.000 % _sbrk
  0.0001 %  0.0003 %        3        9 +        0   0.000 % _Deinitialize_BBox_Code
  0.0001 %  0.0003 %        3        9 +        0   0.000 % _Destroy_Blob_Queue
  0.0001 %  0.0003 %        3        9 +        0   0.000 % _Deinitialize_Mesh_Code
  0.0001 %  0.0001 %        2        9 +        0   0.000 % _Destroy_Vista_Buffer
  0.0001 %  0.0001 %        2        8 +        0   0.000 % _Destroy_Atmosphere
  0.0001 %  0.0001 %        2        8 +        0   0.000 % _Destroy_Bounding_Slabs
  0.0001 %  0.0006 %        2        8 +        0   0.000 % __cleanup
  0.0001 %  0.0002 %        1        6 +        0   0.000 % _mem_release_all
  0.0001 %  0.0001 %        2        6 +        0   0.000 % _mem_stats_init
  0.0000 %  0.0000 %        1        1 +        0   0.000 % __exit

APPENDIX B: SIMULATOR REFERENCE

Multithreaded Superscalar Simulator

based on SDSP by Steven Wallace

with multithreaded extensions by Mark Pontius

NAME

ss - simulate the SDSP multithreaded microprocessor

SYNOPSIS

ss [-options] executable [executable options]

ss [-options] -scalar -tracegen executable [executable options]

ss [-options] -trace tracefile1 ... tracefileN

DESCRIPTION

Ss simulates the specified executable or tracefiles as if they were run on a multithreaded SDSP microprocessor.

The first version reads an SDSP executable program and simulates it.

The second version reads an SDSP executable program and creates a tracefile.

The third version reads n tracefiles, and treats them as coarse grain threads to be simulated.

The simulation consists of a cycle-by-cycle correct emulation of the real microprocessor, with register file, Instruction Window, Reorder Buffer, Scheduling Unit, Execution Units, Instruction and Data Cache, and Thread Control, all fully supported.

COMMAND LINE OPTIONS

Ss will allow the following command line options. They may be placed in any order, and later options will override earlier ones. The internal variable names are listed to give a better idea of exactly what each option does. This is followed by a description of where and how that option may be used. Note that TRUE=1 and FALSE=0.

This is an incomplete list. Items that were unused in this thesis have been omitted. Please refer to the man page for other options.

  • Sets the degree of associativity for the Instruction Cache. Examples: a 1 Kbyte, direct mapped (1 way associative) cache with line size of 32 bytes has 32 entries direct mapped to the lower 10 bits of memory addresses. A 4 way associative cache with the same parameters would have 32 entries grouped in 8 groups of 4, with each group mapped to the lower 8 bits of memory addresses. If cache line replacement is required and there are no empty entries in the associative group, a pseudo random number is generated to determine which entry will be replaced.
  • Sets the degree of associativity for the Data Cache. See -ai above for more explanation.
  • Sets the number of Integer Execution Units.

  • Enables a very simple not-taken branch prediction algorithm.

  • Creates a raw (binary) statistics file with the name <benchmark>.bstat that can be post-processed by showstats to display all results and parameters of the run. If not specified, only a few multithreaded statistics are printed to screen at the end of the run.
  • Sets the number of entries in the BTB (Branch Target Buffer) used in branch prediction. Each entry is direct mapped onto instruction address space, every linesize bytes, and contains the predicted next address for that fetch block. It also contains two bits of history for each entry, so that two mispredictions are required for it to change the state of the prediction. The special case of size 0 indicates a perfect predictor (never wrong).
  • If set, each thread has an independent BTB. These BTBs retain their history across join/fork, but always apply only to their own thread number.
  • Sets the number of instructions decoded per cycle. Each instruction is 4 bytes long, so the fetch block is adjusted to match.
  • Allows completion from any number of slots (blocks) in the bottom of the RBIW. For a higher block to complete, bypassing a stalled block below, it must not contain any valid instructions from the same thread as a stalled block. This allows a different thread to bypass, or a block that has been invalidated due to a mispredicted branch (thus eating up a bubble).
  • This sets the number of cycles between consecutive dcache accesses when pipelined requests are made.
  • Sets the number of cycles latency before a Data Cache miss returns data. Requests may be pipelined one every dcache_interval cycles.
  • A type of prefetching, in which two consecutive cache lines are fetched, so that alignment of the first instruction is not a problem. This is now default, so see -nodualfetch below to turn it off.
  • A type of prefetching that can allow fetching past a control transfer. Doesn't work.
  • Sets the number of Floating Point Execution Units.

  • This option pipes the trace generation output through gzip. This is useful for creating large trace files that may otherwise overflow disk space (or quota). This can slow down the simulation significantly. Does nothing if tracegen isn't set. It does not affect the automatic detection of .gz extensions on read tracefiles.
  • Hybrid mode enables portions of a benchmark to skip the superscalar simulation, and just run through the scalar part. It is started and stopped by m_quick_run(on/off) calls within the benchmark. If hybrid mode is disabled, these instructions are ignored. Hybrid/quick_run can speed up benchmarks by a factor of 10.
  • Interactive mode displays detailed information every cycle on what instructions are being executed, as well as the state of many variables. It may be toggled on and off in the X interface control panel for single stepping through a small section of benchmark.
  • This sets the number of cycles between consecutive icache accesses when pipelined requests are made.
  • Redirect stdin from file. .in is appended to the benchmark name.

  • Sets the number of cycles latency before an Instruction Cache miss returns the requested block. Requests may be pipelined one every icache_interval cycles.
  • Line size. Used by both the Instruction Cache and the Data Cache.

  • Stops the simulation after the given number of instructions has been reached. Note that fetching will continue until the end of the current fetch block, so a few extra instructions may be fetched. Primarily used in -scalar -tracegen mode, which doesn't have this problem (fetch block is effectively 1).
  • Set the number of Load Execution Units. See -ls below for more information.

  • By setting this parameter <= load_number and store_number, the load and store units can appear as one. If set to >= load_number + store_number, the load and store units appear independent. If set between those two limits, some units will be capable of either function, and others can only do one.
  • Set the number of integer Multiply Execution Units.

  • Enables Multiprogram model (Private Memory). In this model, memory is not shared between processes, and data results must be explicitly copied with the m_set_shared() instruction. This model simulates slower, and is not as realistic, but allows easy coarse grained threading for some benchmarks.
  • Disables interactive mode. See -i above.

  • With bypassing off, results from Functional Units must be written into the RBIW in one cycle before dependent instructions can be issued in the next cycle. With bypassing on (default), result writes can be bypassed to the dependent instructions being issued in the same cycle.
  • Turns off dualfetch (see above).

  • Redirect stdout to file. .out is appended to the benchmark name.

  • When reading trace files as threads, paralleltraces determines how many may run at any one time. The first set begins at cycle one. The next set begins only when all of the first set have completed. Simulation statistics are kept between runs. Each thread is still remapped to a unique address space, even ones that are run at different times.
  • Makes instruction fetching perfect. No fetch block alignment checks are made and a fetch block may contain several branch instructions. This is the ideal prefetch model.
  • Creates an execution profile which contains the % of instructions in each routine and the % of each routine that is threaded. It uses the symbols in the source file, so it cannot work with trace inputs. If hybrid mode is enabled, quick_run instructions are not counted. If the -bstat option is used, the profile information is included in the bstat file and can be viewed with the -profile option to showstats. If -bstat is not used, then the profile table is dumped to standard output.
  • After a branch misprediction is detected by the benchmark (branch executes), this is how many idle cycles it takes before execution along the correct path may continue.
  • Simulates instruction prefetching. Cache line size default is double, so on any instruction fetch cycle, more instructions than needed will be read. The extra instructions are kept waiting until the next cycle, where they may be grouped with instructions from the next fetch. This allows many alignment problems to be avoided. It is not perfect, because of branches. See the -perfect option above for an ideal prefetch model.
  • Sets the maximum number of write ports in the Reorder Buffer/Instruction Window (RBIW). If more instructions complete in a cycle than there are write ports, the remaining units must wait until the following cycle to return their data.
  • The size of the Instruction Cache, given as a number of address bits (the cache holds 2^bits bytes). A 16 bit icache is 64K bytes. See the note above in -ai for how to set the associativity of the cache.
  • This handy flag turns off all superscalar modeling. It runs about 10 times faster, and is useful for debugging benchmarks and generating trace files. It does not model a one-way superscalar version of the processor because none of the resource checking is done. If more than one thread is used, it will simply switch threads every 4 instructions (unless controlled by an exclusive_run section of the benchmark) to approximate superscalar operation for inter-thread communication purposes.
  • With this option, detailed scalar statistics are kept such as register usage distribution, run length, instruction class distribution, and register lifetime.
  • arg may be a combination of the letters XRLCIBPF. It determines what algorithm and priorities are used for thread switching, and is a bit mask sum of the available options. The options are specified by one-letter abbreviations (in any order) from the following list:
      X = Coarse Switch (only switch threads when the current one has a low priority; at least one of the priority options I, B, P, or F must also be specified, or the thread will run to completion before switching).
      R = Round Robin order (threads are checked in numerical order 1, 2, 3, etc.).
      L = Least Recently Used order (threads with the least recent access are checked first).
      C = Count order (threads with the fewest instructions in the window are checked first).
      I = Istall priority (threads with icache misses in progress get low priority).
      B = Bad Branch priority (threads with known bad branch delay slots get low priority).
      P = Prediction priority (threads with any predicted branches that haven't been completed get low priority).
      F = Floating Point Unit priority (threads with floating point instructions in the window get low priority).
    Only one of the order parameters (R, L, or C) may be specified; if none is, R is assumed.
  • Set the number of Store Execution Units. See -ls above for more information.

  • Trace files are read, bypassing the scalar execution portion of the simulator. If more than one tracefile is supplied, each is assigned to one thread. If the trace filename ends in .gz, it is piped in through gunzip to decompress it as it is read. This slows down the simulation slightly, but places much less demand on disk space. See -tracegen below for creating trace files.
  • The benchmark is run through the scalar execution portion of the simulator. Each instruction is written as a binary data structure to a trace file. This only works with scalar programs, and cannot make a multithreaded trace. For disk usage estimation, each instruction occupies 48 bytes. If the -gz option listed above is used, the trace file is piped through gzip to reduce file size. Each gzip'ed instruction takes approximately 14 bytes. Since benchmarks are usually measured in the millions of instructions, this can yield quite large files.
  • Sets the size of the stack used by each thread other than thread 0. An odd number here is useful for reducing cache interference.
  • Sets the number of issue ports in the Instruction Window. If more instructions are ready for execution than there are issue ports, the newest instructions must wait until the next cycle.
  • If -profile is also set, then a trace of which procedures are being executed is printed to standard output. Each time an instruction is fetched, and it is not in the same routine as last cycle, the message "visualtrace: symbol" is printed. The hierarchy of routines is shown by indenting an additional space each time a subroutine is called, and indenting one less space each time the parent routine is returned to. This doesn't work for recursive calls, or if the call stack gets more than 100 deep. If this is the case, the indentation will stop following the hierarchy, but otherwise correct execution will continue. Trap instructions that are encountered are displayed as TRAP name (#num), indented as any procedure call would be. This is very useful for debugging programs. If -profile is not set, only TRAP calls are displayed.
  • Sets the number of blocks in the Reorder Buffer/Instruction Window (RBIW). Each block contains ss.max_decode instructions. Every cycle in which a stall does not occur, the blocks are shifted down one making room for new instructions to be loaded into the top, and retiring instructions in the bottom. More blocks means each instruction has more time in the window in which to complete before causing a stall.
  • This enables the X window interface. It consists of a control window and several optional data windows. The controls allow single stepping through a benchmark, or trace runs that stop after a number of cycles or a particular instruction address is reached. The Instruction Window can be observed, and is color coded to indicate instruction state. The file ~/app-defaults/sdsp is used to configure this interface.
  • The size (in bits) of the Data Cache. See notes above in -s and -ad for more information on the cache.

APPENDIX C: SHOWSTATS REFERENCE

NAME

showstats - display processor statistics for a run of ss

SYNOPSIS

showstats [-options] files

DESCRIPTION

Showstats displays the statistics from a run of the ss superscalar simulation. The raw statistics files are created by giving the -bstat option to ss, which generates a file with the .bstat extension. This is the file given on the showstats command line.

COMMAND LINE ARGUMENTS

Showstats will allow the following command line options. They may be placed in any order, and later options will override earlier ones. This is a partial list; refer to the man page for other options.

  • Displays the following additional information. 3+: header(magic, stat_type, version, time, length) 2+: traptrace(iov: base, length) 1+: traptrace(argcount, args, return, errorno, iovcount)

OUTPUT FORMATS

Showstats can display many different statistics for a given run. These are summarized below in the order they appear within the various output displays. The variable or equation used is shown so you can see exactly what is being measured. These variable names may be searched for in the simulator source code. Each entry also notes which command line options of the simulator ss would have been used to set the parameter.

-table outputs:

  • The maximum number of threads in the threadfile at any one time. It was controlled by the benchmark, or the -paralleltraces option if tracefiles were used.
  • The number of instructions that may be fetched (decoded) each cycle. This was controlled by the -c option.
  • The number of blocks in the RBIW. This was controlled by the -w option.

  • How many blocks (fetch blocks) in the bottom of the RBIW are allowed to complete. Only one block completes in any one cycle, but if the bottom block is stalled, this number shows how far up the window was searched for blocks that could bypass it. A value of 1 means no completion bypassing. This was set by the -cslots option.
  • The number of Integer ALU Execution Units available. It was set by the -alu option.
  • The number of Integer Multiplication Execution Units available. It was set by the -mul option.
  • The number of Floating Point Execution Units available. It was set by the -fpu option.
  • The number of Load and Store Execution Units. If only one number is shown, then load and store units have been merged into one. If not, three numbers are shown like this: loads/stores (merged). It was set by the -ls, -load, and -store options.
  • The number of bytes in a cache line. It was set by the -l option.

  • Size of the Instruction Cache in bytes, and its associativity. It was set by the -s and -ai options.
  • Size of the Data Cache in bytes, and its associativity. It was set by the -z and -ad options.
  • Size of the branch target buffer in entries. If the flag bsimple is set, then bsimple is displayed. If ss.bsize is 0, perfect pred is displayed. If ss.ind_predict is set, the word each is shown after ss.bsize to indicate separate BTBs for each thread. It was set by the -btbsize, -bsimple, and -btbthread options.
  • The memory model used during the run, either shared-mem, multiprocess (private mem), or multiprogram (multiple trace files). It was set by the -multiprocess or -trace option.
  • The type of prefetching used (if any). If perfectflag is set, perfectfetch is displayed. If prefetchflag is set, prefetch is displayed. If dualflag is set, dual fetch is displayed. If extendflag is set, extend fetch is displayed. If none are set, no prefetch is displayed. Note that these items are checked in this order, and if more than one is set, only the first is displayed. It was set by the -perfect, -prefetch, -dual and -extend options.
  • The thread scheduling algorithm used. This field displays a summary of the various options that may be individually enabled or disabled. See ss for more details on what they mean. The options are as follows: X= Coarse Switch, R= Round Robin order, L= Least Recently Used order, C= Count order, I= Istall priority, B= Bad Branch priority, P= Prediction priority, F= Floating Point Unit priority. These were set by the -schedule command line option.
  • The number of superscalar cycles simulated.

  • An indication of floating point workload in a benchmark. This is the total number of cycles floating point instructions would have taken if they were executed one at a time. With more than one fpu, they likely would have executed in parallel, but this is not shown here. The normalized float_work field in the thesis took this number and divided it by the number of instructions.
  • The total number of valid instructions fetched. It does not count junk instructions fetched by a mispredicted branch, nop's, or garbage due to misalignment. It also does not count quick_instructions (see below).
  • The number of instructions that were executed by m_quick_run() while hybrid mode was set. These instructions were not counted in any other statistics shown here.
  • Cycles Per Instruction. The average rate of completing instructions.

  • Instructions Per Cycle. The average throughput of the processor. The only difference between this and avg fetch below, is that this counts all cycles, including icache misses and Scheduling Unit stalls.
  • Average number of instructions fetched per fetched block. This takes into account mis-alignment of fetch blocks, branches within blocks, and nop's.
  • Average number of instructions issued per cycle. Only instructions issued are counted, so it doesn't count trap_os instructions which are handled differently by the simulator.
  • Percentage of cycles that the processor is waiting. This is the sum of the following three items.
  • Percentage of cycles that the RBIW cannot shift due to incomplete instructions in the last block. This is typically caused by long latency instructions, register dependence or Functional Unit bottlenecks.
  • Percentage of cycles that fetch garbage instructions due to bad branch prediction.
  • Percentage of cycles that fetch fails due to an Instruction Cache miss. This includes the initial fetch, as well as any retries because it is the only available thread.
  • Number of times that a block is swapped out of the Instruction Cache. This does not count Instruction Cache misses that don't result in a swap. It is an indication of thrashing or too small a cache if this is a large number.
  • Number of times that a block is swapped out of the Data Cache. See notes above on i swap.
  • Percentage of attempted instruction fetches that result in a cache miss. If no Instruction Cache, then displays as N/A.
  • Percentage of Data Cache accesses that result in a cache miss. If no Data Cache, then displays as N/A.
  • Percentage of wasted fetch slots. Takes into account same items as avg fetch shown above.
  • Percentage of times that branch prediction was correct.

  • Average number of cycles wasted per bad branch. Includes any time instructions are invalidated in the current block that didn't need to be, or any fetches along the bad branch path before the misprediction was detected, and (if no other thread fills the slot) the delay cycles incurred by the -pipebub option while the icache is fetching the next correct block.
  • Percentage of cycles where a stall was avoided by bypassing one or more stalled blocks in the bottom of the Instruction Window.
  • Number of cycles that a thread switch was attempted. This translates roughly to the number of cycles fetch is attempted, with the exception of ??
  • Percentage of attempts to change threads that succeeded (had more than one valid and ready thread).
  • Percentage of attempts to change threads that failed because only one thread was valid.
  • Percentage of attempts to change threads that failed because exclusive_run was set. This occurs in benchmarks that are performing atomic transactions.
  • Cumulative number of threads executed.

-profile outputs:

APPENDIX D: RAW DATA

The following is an example output from showstats. This data was used in the number of threads graph in Chapter 6. The fields are explained in Appendix C.

Table D.1. Raw data

           mpeg2e label:               mpeg2e         mpeg2e         mpeg2e         mpeg2e         mpeg2e         mpeg2e         mpeg2e
           mpeg2e threads:                  1              2              3              4              5              6              7
           mpeg2e decode:                   8              8              8              8              8              8              8
           mpeg2e depth:                    4              4              4              4              4              4              4
           mpeg2e cslots:                   2              2              2              2              2              2              2
           mpeg2e alu:                      8              8              8              8              8              8              8
           mpeg2e mul:                      4              4              4              4              4              4              4
           mpeg2e fpu:                      4              4              4              4              4              4              4
           mpeg2e load/store:               2              2              2              2              2              2              2
           mpeg2e line size:               32             32             32             32             32             32             32
           mpeg2e i cache size:   64K (4 way)    64K (4 way)    64K (4 way)    64K (4 way)    64K (4 way)    64K (4 way)    64K (4 way)
           mpeg2e d cache size:   64K (4 way)    64K (4 way)    64K (4 way)    64K (4 way)    64K (4 way)    64K (4 way)    64K (4 way)
           mpeg2e btb size:               256            256            256            256            256            256            256
           mpeg2e thread model:    shared-mem     shared-mem     shared-mem     shared-mem     shared-mem     shared-mem     shared-mem
           mpeg2e prefetch:        dual fetch     dual fetch     dual fetch     dual fetch     dual fetch     dual fetch     dual fetch
           mpeg2e schedule:             LIBPF          LIBPF          LIBPF          LIBPF          LIBPF          LIBPF          LIBPF
           mpeg2e ---results---  ------------   ------------   ------------   ------------   ------------   ------------   ------------
           mpeg2e cycles:          10,685,764      9,041,598      9,017,560      8,609,206      8,800,814      8,787,066      8,953,615
           mpeg2e float work:       9,674,452      9,674,452      9,674,452      9,674,452      9,674,452      9,674,452      9,674,452
           mpeg2e instructions:    35,918,652     35,969,052     36,019,452     36,069,852     36,120,252     36,170,652     36,221,052
           mpeg2e quick_instr:        257,583        257,583        257,583        257,583        257,583        257,583        257,583
           mpeg2e CPI:                  0.297          0.251          0.250          0.239          0.244          0.243          0.247
           mpeg2e IPC:                  3.361          3.978          3.994          4.190          4.104          4.116          4.045
           mpeg2e avg fetch:            6.748          6.742          6.735          6.729          6.722          6.716          6.710
           mpeg2e avg issue:            3.182          3.765          3.779          3.963          3.881          3.891          3.823
           mpeg2e total delays:      57.377 %       49.528 %       49.296 %       46.788 %       47.847 %       47.663 %       48.547 %
           mpeg2e  su stalls:        36.188 %       29.774 %       29.774 %       26.954 %       28.297 %       28.006 %       28.819 %
           mpeg2e  br delays:        20.970 %       19.491 %       19.253 %       19.553 %       19.274 %       19.376 %       19.447 %
           mpeg2e  i delays:          0.219 %        0.263 %        0.268 %        0.281 %        0.276 %        0.281 %        0.282 %
           mpeg2e i swap:            58.342 %       59.035 %       59.935 %       60.133 %       60.272 %       60.827 %       61.896 %
           mpeg2e d swap:            44.818 %       44.829 %       45.371 %       45.049 %       44.778 %       46.691 %       46.535 %
           mpeg2e i miss:             0.088 %        0.090 %        0.092 %        0.092 %        0.093 %        0.095 %        0.097 %
           mpeg2e d miss:             0.035 %        0.035 %        0.036 %        0.036 %        0.035 %        0.037 %        0.037 %
           mpeg2e fetch deficit:     15.722 %       15.807 %       15.889 %       15.971 %       16.052 %       16.133 %       16.205 %
           mpeg2e pred rate:         35.081 %       35.177 %       35.272 %       35.366 %       35.462 %       35.571 %       35.640 %
           mpeg2e br penalty:           2.851          2.231          2.187          2.110          2.116          2.114          2.151
           mpeg2e commit bypass:      0.112 %        9.298 %       14.375 %       25.390 %       21.427 %       20.457 %       19.801 %
           mpeg2e fetch cycles:     6,818,806      6,349,588      6,332,707      6,288,719      6,310,465      6,326,128      6,373,320
           mpeg2e thread sw:          0.000 %       32.505 %       32.765 %       34.459 %       34.911 %       35.382 %       32.938 %
           mpeg2e only thread:      100.000 %       63.238 %       63.381 %       63.824 %       63.627 %       63.437 %       66.873 %
           mpeg2e exclusive th:       0.000 %        0.128 %        0.128 %        0.130 %        0.130 %        0.130 %        0.129 %
           mpeg2e total threads:            1            469            937          1,405          1,873          2,341          2,809
    
           nlfilt label:               nlfilt         nlfilt         nlfilt         nlfilt         nlfilt         nlfilt         nlfilt
           nlfilt threads:                  1              2              3              4              5              6              7
           nlfilt decode:                   8              8              8              8              8              8              8
           nlfilt depth:                    4              4              4              4              4              4              4
           nlfilt cslots:                   2              2              2              2              2              2              2
           nlfilt alu:                      8              8              8              8              8              8              8
           nlfilt mul:                      4              4              4              4              4              4              4
           nlfilt fpu:                      4              4              4              4              4              4              4
           nlfilt load/store:               2              2              2              2              2              2              2
           nlfilt line size:               32             32             32             32             32             32             32
           nlfilt i cache size:   64K (4 way)    64K (4 way)    64K (4 way)    64K (4 way)    64K (4 way)    64K (4 way)    64K (4 way)
           nlfilt d cache size:   64K (4 way)    64K (4 way)    64K (4 way)    64K (4 way)    64K (4 way)    64K (4 way)    64K (4 way)
           nlfilt btb size:               256            256            256            256            256            256            256
           nlfilt thread model:    shared-mem     shared-mem     shared-mem     shared-mem     shared-mem     shared-mem     shared-mem
           nlfilt prefetch:        dual fetch     dual fetch     dual fetch     dual fetch     dual fetch     dual fetch     dual fetch
           nlfilt schedule:             LIBPF          LIBPF          LIBPF          LIBPF          LIBPF          LIBPF          LIBPF
           nlfilt ---results---  ------------   ------------   ------------   ------------   ------------   ------------   ------------
           nlfilt cycles:           2,795,696      2,750,643      2,711,331      2,714,859      2,734,072      2,729,128      2,737,744
           nlfilt float work:         273,725        273,725        273,725        273,725        273,725        273,725        273,725
           nlfilt instructions:    19,192,033     19,192,884     19,192,921     19,192,958     19,192,995     19,193,032     19,193,069
           nlfilt quick_instr:      1,059,949      1,059,949      1,059,949      1,059,949      1,059,949      1,059,949      1,059,949
           nlfilt CPI:                  0.146          0.143          0.141          0.141          0.142          0.142          0.143
           nlfilt IPC:                  6.865          6.978          7.079          7.070          7.020          7.033          7.011
           nlfilt avg fetch:            7.691          7.690          7.690          7.690          7.690          7.690          7.690
           nlfilt avg issue:            6.808          6.920          7.020          7.011          6.962          6.975          6.953
           nlfilt total delays:      11.840 %       10.385 %        9.085 %        9.203 %        9.841 %        9.677 %        9.961 %
           nlfilt  su stalls:         7.901 %        8.573 %        7.735 %        7.891 %        8.537 %        8.372 %        8.659 %
           nlfilt  br delays:         3.850 %        1.717 %        1.254 %        1.216 %        1.208 %        1.210 %        1.207 %
           nlfilt  i delays:          0.089 %        0.095 %        0.095 %        0.095 %        0.096 %        0.096 %        0.095 %
           nlfilt i swap:             0.000 %        0.000 %        0.000 %        0.000 %        0.000 %        0.000 %        0.000 %
           nlfilt d swap:            41.056 %       41.156 %       42.092 %       41.636 %       41.767 %       42.349 %       49.018 %
           nlfilt i miss:             0.020 %        0.020 %        0.020 %        0.020 %        0.020 %        0.020 %        0.020 %
           nlfilt d miss:             0.082 %        0.082 %        0.084 %        0.083 %        0.083 %        0.084 %        0.095 %
           nlfilt fetch deficit:      3.886 %        3.894 %        3.894 %        3.894 %        3.894 %        3.894 %        3.895 %
           nlfilt pred rate:         66.986 %       67.016 %       67.016 %       67.015 %       67.014 %       67.014 %       67.013 %
           nlfilt br penalty:           3.441          1.509          1.087          1.056          1.055          1.055          1.056
           nlfilt commit bypass:      0.013 %        3.715 %        8.055 %        9.571 %        8.672 %        8.884 %        8.188 %
           nlfilt fetch cycles:     2,574,800      2,514,825      2,501,598      2,500,631      2,500,654      2,500,655      2,500,682
           nlfilt thread sw:          0.000 %       95.931 %       97.865 %       99.076 %       99.071 %       99.075 %       98.990 %
           nlfilt only thread:      100.000 %        0.917 %        1.899 %        0.922 %        0.922 %        0.925 %        0.922 %
           nlfilt exclusive th:       0.000 %        0.000 %        0.000 %        0.000 %        0.000 %        0.000 %        0.000 %
           nlfilt total threads:            1              2              3              4              5              6              7
    
              pov label:                  pov            pov            pov            pov            pov            pov            pov
              pov threads:                  1              2              3              4              5              6              7
              pov decode:                   8              8              8              8              8              8              8
              pov depth:                    4              4              4              4              4              4              4
              pov cslots:                   2              2              2              2              2              2              2
              pov alu:                      8              8              8              8              8              8              8
              pov mul:                      4              4              4              4              4              4              4
              pov fpu:                      4              4              4              4              4              4              4
              pov load/store:               2              2              2              2              2              2              2
              pov line size:               32             32             32             32             32             32             32
              pov i cache size:   64K (4 way)    64K (4 way)    64K (4 way)    64K (4 way)    64K (4 way)    64K (4 way)    64K (4 way)
              pov d cache size:   64K (4 way)    64K (4 way)    64K (4 way)    64K (4 way)    64K (4 way)    64K (4 way)    64K (4 way)
              pov btb size:               256            256            256            256            256            256            256
              pov thread model:  multiprocess   multiprocess   multiprocess   multiprocess   multiprocess   multiprocess   multiprocess
              pov prefetch:        dual fetch     dual fetch     dual fetch     dual fetch     dual fetch     dual fetch     dual fetch
              pov schedule:             LIBPF          LIBPF          LIBPF          LIBPF          LIBPF          LIBPF          LIBPF
              pov ---results---  ------------   ------------   ------------   ------------   ------------   ------------   ------------
              pov cycles:           4,676,530      4,322,226      4,176,538      3,982,403      3,936,657      3,954,593      3,978,653
              pov float work:      12,090,622     12,090,853     12,091,084     12,091,315     12,091,546     12,091,777     12,092,008
              pov instructions:     7,962,116      7,963,299      7,964,482      7,965,665      7,966,848      7,968,031      7,969,214
              pov quick_instr:      1,436,048      1,441,419      1,441,289      1,430,442      1,441,256      1,433,365      1,440,614
              pov CPI:                  0.587          0.543          0.524          0.500          0.494          0.496          0.499
              pov IPC:                  1.703          1.842          1.907          2.000          2.024          2.015          2.003
              pov avg fetch:            5.675          5.674          5.674          5.673          5.673          5.673          5.673
              pov avg issue:            1.345          1.455          1.506          1.580          1.599          1.592          1.582
              pov total delays:      74.165 %       72.039 %       71.057 %       69.639 %       69.280 %       69.414 %       69.593 %
              pov  su stalls:        62.936 %       64.335 %       64.582 %       64.272 %       63.878 %       64.021 %       64.298 %
              pov  br delays:        11.036 %        7.547 %        6.331 %        5.223 %        5.277 %        5.252 %        5.177 %
              pov  i delays:          0.193 %        0.156 %        0.145 %        0.145 %        0.125 %        0.141 %        0.118 %
              pov i swap:            40.819 %       38.787 %       35.697 %       34.154 %       31.804 %       33.041 %       31.934 %
              pov d swap:             0.000 %       21.927 %       23.377 %       24.359 %       23.625 %       24.359 %       23.994 %
              pov i miss:             0.129 %        0.124 %        0.118 %        0.116 %        0.112 %        0.114 %        0.112 %
              pov d miss:             0.035 %        0.045 %        0.046 %        0.046 %        0.046 %        0.046 %        0.046 %
              pov fetch deficit:     29.158 %       29.161 %       29.164 %       29.166 %       29.168 %       29.171 %       29.173 %
              pov pred rate:         73.591 %       73.571 %       73.573 %       73.550 %       73.563 %       73.555 %       73.555 %
              pov br penalty:           2.349          1.483          1.202          0.945          0.944          0.944          0.936
              pov commit bypass:      1.456 %       10.472 %       14.683 %       19.460 %       21.316 %       20.206 %       18.952 %
              pov fetch cycles:     1,733,330      1,541,527      1,479,239      1,422,847      1,421,991      1,422,818      1,420,448
              pov thread sw:          0.000 %       75.906 %       87.267 %       92.171 %       94.691 %       94.785 %       95.158 %
              pov only thread:      100.000 %        4.039 %        3.975 %        4.251 %        4.137 %        4.074 %        4.102 %
              pov exclusive th:       0.000 %        0.000 %        0.000 %        0.000 %        0.000 %        0.000 %        0.000 %
              pov total threads:            1             25             49             73             97            121            145
    
           isuite label:               isuite         isuite         isuite
           isuite threads:                  1              2              4
           isuite decode:                   8              8              8
           isuite depth:                    4              4              4
           isuite cslots:                   2              2              2
           isuite alu:                      8              8              8
           isuite mul:                      4              4              4
           isuite fpu:                      4              4              4
           isuite load/store:               2              2              2
           isuite line size:               32             32             32
           isuite i cache size:   64K (4 way)    64K (4 way)    64K (4 way)
           isuite d cache size:   64K (4 way)    64K (4 way)    64K (4 way)
           isuite btb size:               256            256            256
           isuite thread model:  multiprogram   multiprogram   multiprogram
           isuite prefetch:        dual fetch     dual fetch     dual fetch
           isuite schedule:             LIBPF          LIBPF          LIBPF
           isuite ---results---  ------------   ------------   ------------
           isuite cycles:           1,392,791      1,332,572      1,269,990
           isuite float work:               0              0              0
           isuite instructions:     8,000,000      8,000,000      8,000,000
           isuite quick_instr:              0              0              0
           isuite CPI:                  0.174          0.167          0.159
           isuite IPC:                  5.744          6.003          6.299
           isuite avg fetch:            6.495          6.496          6.496
           isuite avg issue:            5.257          5.495          5.766
           isuite total delays:      14.291 %       10.372 %        5.957 %
           isuite  su stalls:         7.113 %        5.541 %        2.184 %
           isuite  br delays:         6.348 %        4.378 %        3.452 %
           isuite  i delays:          0.831 %        0.453 %        0.321 %
           isuite i swap:            24.567 %       25.724 %       26.289 %
           isuite d swap:            92.738 %       93.427 %       93.619 %
           isuite i miss:             0.187 %        0.190 %        0.192 %
           isuite d miss:             2.607 %        2.939 %        3.030 %
           isuite fetch deficit:     18.959 %       18.959 %       18.959 %
           isuite pred rate:         93.896 %       93.889 %       93.912 %
           isuite br penalty:           2.079          1.370          1.034
           isuite commit bypass:      1.177 %       33.668 %       77.466 %
           isuite fetch cycles:     1,293,718      1,258,725      1,242,249
           isuite thread sw:          0.000 %       76.312 %       95.180 %
           isuite only thread:      100.000 %       12.081 %        4.731 %
           isuite exclusive th:       0.000 %        0.000 %        0.000 %
           isuite total threads:            4              4              4
    
           fsuite label:               fsuite         fsuite         fsuite
           fsuite threads:                  1              2              4
           fsuite decode:                   8              8              8
           fsuite depth:                    4              4              4
           fsuite cslots:                   2              2              2
           fsuite alu:                      8              8              8
           fsuite mul:                      4              4              4
           fsuite fpu:                      4              4              4
           fsuite load/store:               2              2              2
           fsuite line size:               32             32             32
           fsuite i cache size:   64K (4 way)    64K (4 way)    64K (4 way)
           fsuite d cache size:   64K (4 way)    64K (4 way)    64K (4 way)
           fsuite btb size:               256            256            256
           fsuite thread model:  multiprogram   multiprogram   multiprogram
           fsuite prefetch:        dual fetch     dual fetch     dual fetch
           fsuite schedule:             LIBPF          LIBPF          LIBPF
           fsuite ---results---  ------------   ------------   ------------
           fsuite cycles:           4,573,652      4,263,629      3,840,978
           fsuite float work:      12,030,035     12,030,035     12,030,035
           fsuite instructions:     8,000,000      8,000,000      8,000,000
           fsuite quick_instr:              0              0              0
           fsuite CPI:                  0.572          0.533          0.480
           fsuite IPC:                  1.749          1.876          2.083
           fsuite avg fetch:            6.336          6.336          6.342
           fsuite avg issue:            1.454          1.560          1.731
           fsuite total delays:      74.916 %       73.076 %       70.110 %
           fsuite  su stalls:        68.230 %       68.442 %       66.113 %
           fsuite  br delays:         6.291 %        4.425 %        3.845 %
           fsuite  i delays:          0.396 %        0.209 %        0.152 %
           fsuite i swap:            44.128 %       45.469 %       58.378 %
           fsuite d swap:             5.201 %        5.409 %        7.040 %
           fsuite i miss:             0.286 %        0.293 %        0.384 %
           fsuite d miss:             0.079 %        0.079 %        0.080 %
           fsuite fetch deficit:     21.029 %       21.028 %       21.029 %
           fsuite pred rate:         66.283 %       66.211 %       66.268 %
           fsuite br penalty:           2.324          1.520          1.192
           fsuite commit bypass:      0.694 %       11.416 %       20.867 %
           fsuite fetch cycles:     1,453,051      1,345,488      1,301,558
           fsuite thread sw:          0.000 %       58.334 %       81.447 %
           fsuite only thread:      100.000 %       15.270 %       13.197 %
           fsuite exclusive th:       0.000 %        0.000 %        0.000 %
           fsuite total threads:            4              4              4
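
    The derived rows of Table D.1 can be recomputed from the raw counts. The sketch below is an illustration and not part of showstats: the helper parse_row and the choice of the first four mpeg2e columns are assumptions. It parses rows in the layout shown above, then computes IPC as instructions divided by cycles and the speedup of each configuration relative to the single-thread run.

```python
def parse_row(line):
    """Split one numeric showstats row into (benchmark, field, values)."""
    head, _, tail = line.partition(":")
    benchmark, field = head.split(None, 1)
    # Drop thousands separators and stand-alone percent signs.
    values = [float(tok.replace(",", "")) for tok in tail.split() if tok != "%"]
    return benchmark, field, values

# First four columns of the mpeg2e block, copied from Table D.1.
rows = [
    "mpeg2e threads:                  1              2              3              4",
    "mpeg2e cycles:          10,685,764      9,041,598      9,017,560      8,609,206",
    "mpeg2e instructions:    35,918,652     35,969,052     36,019,452     36,069,852",
]

table = {}
for row in rows:
    _, field, values = parse_row(row.strip())
    table[field] = values

ipc = [i / c for i, c in zip(table["instructions"], table["cycles"])]
speedup = [table["cycles"][0] / c for c in table["cycles"]]

for n, i, s in zip(table["threads"], ipc, speedup):
    print(f"{int(n)} thread(s): IPC {i:.3f}, speedup {s:.3f}")
```

    For these columns the recomputed IPC values round to the figures in the IPC row of the table, which suggests that the reported IPC is the instructions row divided by the cycles row.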
    

    CD-ROM

    Additional data can be accessed in electronic form on the World Wide Web:

    
    http://www.eng.uci.edu/comp.arch/multithreading/index.html
    

    The data is also available on the CD-ROM included with the hard copy of this thesis at the University of California, Irvine library. The CD-ROM is recorded in single-session 640 MB ISO9660 format and can be read on nearly any CD drive under Unix, DOS, Windows, and many other operating systems. The information is stored on the disk both as a gzip-compressed tar archive for Unix machines and as a zip file for DOS / Windows machines. It contains the following items:

    Any questions about this thesis or the accompanying disk can be directed to:

    mpontius@ece.uci.edu