Characterizing the Impact of Predicated Execution on Branch Prediction

Loading...

Characterizing the Impact of Predicated Execution on Branch Prediction Scott A. Mahlke

Richard E. Hank Roger A. Bringmann John C. Gyllenhaal David M. Gallagher Wen-mei W. Hwu

Center for Reliable and High-Performance Computing University of Illinois Urbana-Champaign, IL 61801

Abstract Branch instructions are recognized as a major impediment to exploiting instruction level parallelism. Even with sophisticated branch prediction techniques, many frequently executed branches remain dicult to predict. An architecture supporting predicated execution may allow the compiler to remove many of these hard-to-predict branches, reducing the number of branch mispredictions and thereby improving performance. We present an in-depth analysis of the characteristics of those branches which are frequently mispredicted and examine the e ectiveness of an advanced compiler to eliminate these branches. Over the benchmarks studied, an average of 27% of the dynamic branches and 56% of the dynamic branch mispredictions are eliminated with predicated execution support.

1 Introduction Branch instructions are recognized as a major impediment to exploiting instruction level parallelism (ILP). They force the compiler and hardware to make frequent predictions of branch directions in an attempt to nd sucient parallelism. Misprediction of these branches can result in severe performance degradation through the introduction of wasted cycles into the instruction stream. This problem is especially serious for superscalar and VLIW processors, where each wasted cycle potentially costs multiple instructions. Branch prediction strategies reduce this problem by allowing the compiler and hardware to continue processing instructions along the predicted control path, thus eliminating these wasted cycles. There are two basic classes of branch prediction strategies: static branch prediction and dynamic branch prediction. Static branch prediction utilizes information available at compile-time to make predictions. Example static prediction schemes are prediction using branch direction (backward

0

taken, forward not taken) [1], heuristics based on the program structure [2], and pro le information [3] [4]. For compilers employing scheduling techniques such as trace scheduling [5] or superblock scheduling [6], static branch prediction is used to identify likely sequences of basic blocks which can be scheduled as single units. Dynamic branch prediction utilizes runtime behavior to make predictions. Example dynamic branch prediction schemes are the branch target bu er (BTB) with a 2-bit saturating counter [7] and two-level adaptive training [8]. For processors employing hardware scheduling, dynamic branch prediction is used to identify a continuous window of instructions. Regardless of the prediction strategy employed, the software or hardware scheduler is presented with a larger block of instructions, enabling it to expose greater amount of ILP. While correct branch prediction can increase ILP, incorrect prediction often results in large performance penalties. Recent studies have shown that imperfect branch prediction can reduce performance by a factor of two to more than ten [9] [10] [11]. These performance penalties are attributed to several conditions. First, a large number of instructions, termed speculative instructions, are often executed from the predicted direction of each branch. When the branch is mispredicted, all speculative instructions must be discarded since they were improperly executed. Thus, the processor wastes a large number of instruction slots when a branch is mispredicted. Note that the amount of speculation grows as the issue width of the processor increases; therefore the negative impact of branch prediction misses also increases as the issue width of the processor increases. The second source of performance loss due to mispredicted branches is the time necessary to undo the e ects of the improperly initiated speculative instructions. This involves allowing pipelines to drain, and invalidating the appropriate instructions from processor bu ers so they do not update the processor state. Third, after a mispredicted branch is discovered, execution must resume on the correct path. This involves computing the proper target address and initiating instruction fetch along this path. At a minimum, several empty pipeline cycles are required for this procedure. Finally, the presence of a large number of branches in the instruction stream places a limit on the potential ILP. A superscalar processor may have to execute multiple branches per cycle to sustain execution of multiple instructions per cycle. Under the assumption that an instruction stream con-

tains 25% branches, an 8-issue superscalar processor must have the capability to execute at least 2 branches per cycle. Handling multiple branches per cycle requires additional pipeline complexity, as well as designing multi-ported structures such as the BTB. In high issue rate processors, it is much easier to duplicate arithmetic function units than to predict and execute multiple branches per cycle. Therefore, a technique that eliminates branches from the instruction stream can signi cantly reduce the cost for achieving high issue rates for branch intensive programs. Predicated execution support provides an e ective means to eliminate branches from an instruction stream. Predicated execution refers to the conditional execution of an instruction based on the value of a boolean source operand, referred to as the predicate of the instruction [12] [13]. This architectural support allows the compiler to use an if-conversion algorithm to convert conditional branches into predicate de ning instructions, and instructions along alternative paths of each branch into predicated instructions [14] [15] [16]. Predicated instructions are fetched regardless of their predicate value. Instructions whose predicate value is true are executed normally. Conversely, instructions whose predicate is false are nulli ed, and thus are prevented from modifying the processor state. Predicated execution allows the compiler to trade instruction fetch eciency for the capability to expose ILP to the hardware along multiple execution paths. Predicated execution o ers the opportunity to improve branch handling in superscalar processors. Eliminating frequently mispredicted branches may lead to a substantial reduction in branch prediction misses. As a result, the performance penalties associated with the eliminated branches are removed. Eliminating branches also reduces the need to handle multiple branches per cycle for wide issue processors. As a side e ect of reducing the number of branches in the instruction stream, the amount of speculation required to sustain full processor utilization is reduced. Therefore, in the case of a mispredicted branch, fewer speculative instructions must be discarded. Finally, predicated execution provides an ecient interface for the compiler to expose multiple execution paths to the hardware. Without compiler support, the cost of maintaining multiple execution paths in hardware grows rapidly. In this paper, we investigate the impact of predicated execution on branch behavior. The objectives of the paper are two-fold. First, the characteristics of branches for a set of benchmarks are analyzed. Branches are characterized based on several features, including type, location, frequency, and bias. Branches which contribute large numbers of mispredictions are isolated and targeted for elimination with predicated execution support. The second objective is to analyze the e ects that predicated execution has on branches and branch prediction characteristics. The ability of the compiler to eliminate the problematic branches with predicated execution is assessed. The analysis presented in this paper is based on a superscalar microarchitectural model that ef ciently supports predicated execution. This model is an extension of the HP PA-RISC architecture. Hyperblock optimization and scheduling techniques are utilized by the compiler to exploit the predicated execution support [16].

for ( i = 0; i < 100; i++ ) if (A[i]  50 ) j = j + 1; else k = k + 1; (a) mov r1,0 mov r1,0 mov r2,0 mov r2,0 ld r3,addr(A) ld r3,addr(A) L1: ld r4,mem(r3+r2) L1: ld r4,mem(r3+r2) bgt r4,50,L2 pred gt p1U ,p2U ,r4,50 add r5,r5,1 add r5,r5,1 (p2) jump L3 add r6,r6,1 (p1) L2: add r6,r6,1 add r1,r1,1 L3: add r1,r1,1 add r2,r2,4 add r2,r2,4 blt r1,100,L1 blt r1,100,L1 (b) (c)

Figure 1: Example of if-conversion, (a) source code segment, (b) assembly code segment, (c) assembly code segment after if-conversion.

2 Predicated Execution In this section, the underlying predicated execution model as well as the architecture and compiler support for predicated execution are summarized. This is necessary to provide a basic understanding of the underlying framework for predicated execution used in this paper.

2.1 Overview of Predicated Execution Predicated execution refers to the conditional execution of instructions based on the boolean value of a source operand, referred to as the predicate. If the value of the predicate is true (a logic 1), the instruction is allowed to execute normally, otherwise the instruction is nulli ed, preventing it from modifying the processor state. Figure 1 contains a simple example to illustrate the concept of predicated execution. For each iteration of the loop in Figure 1(a), either the value of j or k is conditionally incremented. The basic compiler transformation to exploit predicated execution is known as ifconversion [15]. If-conversion replaces conditional branches in the code with comparison instructions which de ne one or more predicates. Instructions control dependent on the branch are then converted to predicated instructions, utilizing the appropriate predicate value. In this manner, control dependences are converted to data dependences. Figures 1(b) and 1(c) show the assembly code for the loop example before and after if-conversion. Note that the variables j and k have been placed in registers r5 and r6. The conditional branch, bgt, in Figure 1(b) is replaced by a predicate de ne instruction, pred gt, in Figure 1(c). The actual semantics of the pred gt instruction will be discussed in the next subsection. It is sucient for this example to say that the predicate p1 is assigned the value 1 if r4 > 50 and 0 otherwise, and the predicate p2 is assigned the complement of p1. The instructions incrementing the values of r5 and r6 are converted to predicated instructions, associated with predicates p1 and p2, respectively. For each loop iteration, either r5 and r6 will be incremented by the predicated add instructions, contingent on the results of the predicate de ne instruction. Note also that the jump instruction becomes unnecessary after if-conversion.

2.2 Architectural Support

The architectural extensions assumed to support predicated execution are based on the Cydra 5 architecture [13] and the HPL Playdoh Architecture [17]. They consist of four major components: an Nx1-bit predicate register le to store the predicate values, an additional source operand for each instruction to specify a predicate for instruction execution, a modi ed decode/issue stage to nullify instructions whose predicate is false, and a set of predicate de ning instructions. The set of predicate de ning instructions consist of a complete set of integer, unsigned, oat, and double comparison opcodes of the form shown below. pred Pout1 , Pout2 , src1, src2 (Pin ) This instruction assigns values to Pout1 and Pout2 according to a comparison of src1 and src2 speci ed by and the predicate speci ed for each destination predicate. The comparison is: equal (eq), not equal (ne), greater than (gt), etc. The boolean value written to a predicate register is a function of the result of the comparison, the input predicate of the de nition instruction (Pin ), and the eld. The predicate speci es one of eight possible functions: unconditional, conditional, OR, AND, or each of their complements. For a typical predicate de nition instruction, the two destination predicates are a given predicate type and its complement to re ect the \then" and \else" paths of an if statement. The reader is referred to [17] for more details regarding the predicate de nition semantics.

2.3 Compiler Support

The compilation techniques utilized in this paper to exploit predicated execution are based on a abstract structure called a hyperblock [16]. A hyperblock is a collection of basic blocks in which control may only enter at the rst basic block, designated as the entry block. Control ow may leave from one or more of the basic blocks in the hyperblock. All control ow between basic blocks in a hyperblock is eliminated via if-conversion. The goal of hyperblocks is to intelligently group basic blocks from many di erent control ow paths into a single manageable block for compiler optimization and scheduling. Basic blocks are systematically included based on two high level goals. First, performance is maximized when the hyperblock captures a large fraction of the predicted control ow paths.1 Thus, any likely blocks to which control may ow should be added to the hyperblock. Second, resources (fetch bandwidth and function units) are limited; therefore including too many blocks will likely result in an overall performance loss. It may be better to leave an infrequently executed block out of the hyperblock rather than insert more predicated instructions which may saturate the processors resources. Exclusion of a block from the hyperblock will require a branch instruction to be left in the hyperblock; however, if the branch is infrequently taken, it should be a highly-predictable branch. Hyperblock formation focuses on eliminating unbiased branches, while leaving highly biased branches alone since little performance gain is 1 Predicted control paths are identi ed using static branch prediction (pro le information in this implementation).

linect = wordct = charct = token = 0; for (;;) { if (--(fp)->cnt < 0) c = filbuf(fp); else c = *(fp)->ptr++; if (c == EOF) break; charct++; if ((’ ’ < c) && (c < 0177)) if (! token) { wordct++; token++; } continue; if (c == ’\n’) linect++; else if ((c != ’ ’) && (c != ’\t’)) continue; token = 0; }

Figure 2: Source code for the inner loop of wc . achieved by eliminating them. Though, careful attention is paid to the size and dependence height of each block selected for inclusion in the hyperblock. Therefore, a 50-50 branch may be left in the program if one of the target blocks requires signi cantly more resources than the other. An example using hyperblock compilation techniques is presented in the next section.

3 A Case Study In order to provide a deeper understanding of the branch characteristics of non-numeric applications and how these branches are e ected by predicated execution, a detailed analysis of one of the benchmark programs is presented in this section. The benchmark chosen is word count (wc). This benchmark was chosen for two reasons. First, it contains a loop which accounts for a large fraction of its execution time, yet is small enough to be presented in the context of a paper. Second, the loop has a non-trivial control structure which presents a challenge to branch handling strategies. The preprocessed C source code for the loop segment is presented in Figure 2. The purpose of this program is to count the number of characters, words, and lines in an input le. A character bu er is processed in the loop and re- lled as necessary until the end-of- le marker is encountered. The corresponding assembly code and control ow graph for the loop segment are presented in Figure 3. The control ow graph is augmented with the execution frequencies of each control transfer for the measured run of the program. This loop is characterized by small basic blocks and a large percentage of branches. Overall, the loop segment contains 13 basic blocks with a total of 34 instructions. Of the 34 instructions, 14 are branches, 8 conditional, 5 unconditional, and 1 subroutine call.

Elimination of Branches with Hyperblock Formation. Branches are eliminated and replaced with predicated

instructions using hyperblock formation. Hyperblock formation consists of several steps. First, all loop-back branches of innermost loops are coalesced into a single backedge. This procedure is illustrated for the example loop in Figure 4a. A new block, N, is created with a single jump instruction to the loop header, A. All loop-back branches are then adjusted to target the new block. Therefore, in Figure 4a, the branches in blocks H, K, L, and M are retargeted from A to N. The purpose of this step is to convert as many loop-back branches to non-loop (i.e. intra-loop) branches as possible. Because

LA: ld r98, mem(r3 + 0) add r27, r98, -1 st mem(r3 + 0), r27 blt r98, 1, LC LB: ld r30, mem(r3 + 4) add r29, r30, 1 st mem(r3 + 4), r29 ld r4, mem(r30+0) LD: beq r4, -1, EXIT LE: ld r33, mem(r73 + 0) add r32, r33, 1 st mem(r73 + 0), r32 bge 32, r4, LG LF: bge r4, 127, LG LH: bne 0, r2, LA LK: ld r36, mem(r72 + 0) add r35, r36, 1 st mem(r72 + 0), r35 add r2, r2, 1 jump LA LG: beq r4, 10, LI LJ: bne r4, 32, LL LM: mov r2, 0 jump LA LI: ld r39, mem(r71 + 0) add r38, r39, 1 st mem(r71+0), r38 jump LM LL: bne r4, 9, LA jump LM LC: mov Parm0, r3 jsr filbuf mov r4, Ret0 jump LD (a)

A

A

LA: pred_clr p4, p6 ld r98, mem(r3 + 0) add r27, r98, -1 st mem(r3 + 0), r27 blt r98, 1, LC ld r30, mem(r3 + 4) add r29, r30, 1 st mem(r3 + 4), r29 ld r4, mem(r30 + 0) beq r4, -1, EXIT ld r33, mem(r73 + 0) add r32, r33, 1 st mem(r73+0), r32 pred_ge p4(OR), p1(U), 32, r4 pred_ge p4(OR), p2(U), r4, 127 (p1)

14

105K

14

105K

B

B

C

105K

14

C

105K 14

105K

D 1

1

E

105K

77K

E 77K

D’ - N’

D

EXIT

F

28K

EXIT

28K 0

77K

F

4K

16K

H 61K

G

H

0

77K

K

24K

I 4K

16K

I

K

G 4K

16K

24K

61K

J

4K

16K

22K

2K

L

J 2K

2K

22K

M

L 2K

28K

25

M

N 28K

(b)

25

pred_eq p3(U), -, 0, r2 (p2) pred_eq p6(OR), p5(U), r4, 10 (p4) pred_eq p7(U), -, r4, 10 (p4) pred_eq p6(OR), p8(U), r4, 32 (p5) ld r36, mem(r72+0) (p3) add r35, r36, 1 (p3) st mem(r72+0), r35 (p3) add r2, r2, 1 (p3) ld r39, mem(r71 + 0) (p7) add r38, r39, 1 (p7) st mem(r71 + 0), r38 (p7) pred_eq p6(OR), -, r4, 9 (p8) mov r2, 0 (p6) jump LA

105k

(a)

(b)

Figure 3: Inner loop segment of wc , (a) assembly code, (b) control ow graph.

Figure 4: Inner loop segment for wc , (a) control ow graph after loop-back branch coalescing, block selection, and tail duplication (b) assembly code after if-conversion.

if-conversion can only remove non-loop branches, this procedure provides more opportunity for eliminating branches with multiple back branches. A similar procedure can be applied for each set of loop-exit branches to the same target. In this loop example, there is only one loop exit, so there is no opportunity for coalescing exit blocks. The second step of hyperblock formation is choosing the set of blocks to be included in the hyperblock. In this example, all blocks are selected for inclusion with the exception of block C. The priority for C is very low due to its low execution frequency and the hazardous instruction (subroutine call to lbuf ) which it contains. The blocks selected for inclusion for the example loop are outlined by the dashed line in Figure 4a. In order to perform if-conversion on the selected blocks, control ow from non-selected blocks to selected blocks must be eliminated. Such paths of control are referred to as side-entry points into the hyperblock. The third step, tail duplication, eliminates side-entry points by duplicating a portion of the selected blocks and re-adjusting the appropriate control ow arcs. In the example loop, a side entry point exists from C to D. This is eliminated by duplicating blocks D through N (pictured as block D0 -N0 ) and re-adjusting the C-D control

ow arc to C-D0 . The control ow graph for the loop after tail duplication is shown in Figure 4a. The nal step of hyperblock formation is to perform ifconversion on the blocks selected for the hyperblock. In our current implementation, a variant of the RK if-conversion algorithm is utilized [15]. The if-conversion algorithm rst calculates the localized control dependence information among the selected basic blocks. One predicate register is then assigned to all basic blocks with the same set of control depen-

dences. Predicate register de ning instructions are inserted into all basic blocks which are the source of the control dependences associated with a particular predicate. Next, all instructions in each selected block are predicated based on the predicate assigned to their block. Finally, all conditional and unconditional branches from selected blocks to other selected blocks are removed. The resultant assembly code for the loop body of the example is presented in Figure 4b. The loop contains eight unique control dependences; thus eight predicate registers are required. Of the 12 original branches in the blocks selected for inclusion, all but three are removed. The remaining branches in the hyperblock are two infrequent exit branches and the unconditional loop-back branch at the bottom of the hyperblock. Comparison of Branch Characteristics. The branch characteristics for wc before and after hyperblock formation are analyzed in the remainder of this section. Tables 1 { 5 present a detailed breakdown of the branch behavior for wc . Each table consists of two parts: the behavior for the base architecture (Base), and the base architecture with predicated execution support (Pred). The base architecture is an 8-issue superscalar processor with uniform function units. The instruction latencies assumed are those of the HP PA-RISC 7100. More details regarding the architecture model and the simulation model are presented in Section 4.1. An overall breakdown of the dynamic branches is presented in Table 1. Branches are broken down into three categories: type (conditional, unconditional, subroutine call/return, or indirect), class (loop or non-loop), and location (inner-loop, outer-loop, or straight-line). For the class

Type Class Location Base Pred

Loop Inner Outer 184K 1 105K 1

Conditional Non-loop Inner Outer StrLn 339K 3 4 105K 3 4

Unconditional Loop Non-loop Inner Outer Inner Outer StrLn 43726 1 5355 0 0 105K 0 23 0 0

Jsr/Rts

Ind

Total

29 29

0 0

572K 315K

Table 1: Dynamic branch breakdown for wc . Base Pred

0 287K 210K

1-10 24898 14

11-20 27606 5

21-30 105K 0

31-40 3 16

41-50 2 8

51-60 0 0

61-70 3 3

71-80 77590 8

81-90 0 0

91-100 3 3

Table 2: Taken frequency distribution for wc . category, loop branches refer to loop-back branches as well as loop-exit branches. All other branches are grouped in non-loop class. For the location category, inner-loop speci es branches contained within innermost loops. Outer-loop speci c branches contained within a loop which is not an innermost loop. All other branches do not reside explicitly within any loop body within their respective function and are thus placed in the straight-line category. From Table 1, with predicated execution support, the total number of dynamic branches was reduced from 572K to 315K, approximately a 45% reduction. This is mainly attributable to the reduction in branches shown in column [conditional, non-loop, inner]. The branches in this category removed are from blocks E, F, G, and J (Figure 3). The dynamic number of branches is also reduced in column [conditional, loop, inner]. Although if-conversion does not directly remove loop branches, the loop-back and exit coalescing performed during hyperblock formation allows this number to be reduced with predicated execution support. One interesting note from Table 1, column [unconditional, loop, inner], is that the number of branches is increased with predicated execution support. This is a result of loop-back branch coalescing. In the transformed code, there is a single unconditional back branch which is executed on every iteration of the loop. Since unconditional loop-back branches were only executed on a subset of iterations in the original code, an increase in the number is observed as shown in column [unconditional, loop, inner]. To further break down the conditional branches, the taken frequency distribution is presented in Table 2. The data shown are the dynamic counts of all conditional branches whose taken percentages lie within the range at the top of each column. For example, for the base architecture, 8% (27,606) of the branches are taken 11-20% of the time. From Table 2, a drastic change is observed from the base to the predicate architecture. With predicated execution support, the taken frequency of nearly all conditional branches is reduced to 0%. This behavior is a direct result of the hyperblock formation procedure. All conditional branches in the loop are eliminated with the exception of the two branches to blocks C and EXIT (Figure 3). These remaining branches are heavily biased to the fall-through path, taken only 1 and 14 times, respectively. The taken frequency distribution chart is a direct measure of the e ectiveness of the compiler at eliminating unbiased branches with predicated execution support. The desired trend is for the compiler to eliminate all unbiased branches, while leaving highly biased branches in the code. If only highly biased conditional branches remain with predicated execution support, very few mispredictions for conditional

branches will occur. Table 3 presents the branch misprediction breakdown for wc . Data for three dynamic prediction models is reported: BTB with a 2-bit saturating counter (Ctr), BTB with pro lebased direction prediction (Pro), and a branch target cache (Btc). For each scheme, a 1024-entry bu er is utilized. In our model, the branch target cache is similar to a BTB, except that the bu er is addressed by the cache block number instead of the instruction address, and simply stores the address to which control owed the last time this cache block was executed. Thus, there is only one branch prediction for all branches within a single cache block. The cache block size simulated is 64 bytes (16 instructions). From Table 3, the most important data is the huge drop in branch prediction misses from the base to the predicate architecture. The number of misses drops by factor of almost 1000 for all models. The reason for this huge drop the compiler successfully removes the branches causing the majority of the misses. Considering the Ctr and Pro models, the two problematic categories of branches are shown in columns [conditional, loop, inner] and [conditional, non-loop, inner]. As shown in column [conditional, non-loop, inner] the hyperblock formation procedure removes all branches of this category, with the exception of 2. The remaining 2 are highly biased as fall-through, and are only mispredicted a total of 33 times. The branches in column [conditional, loop, inner] have been coalesced into a single unconditional branch which is mispredicted only when it is not in the BTB. An interesting pattern is observed in the Btc model. In the base architecture, it predicts [conditional, loop, inner] branches more e ectively than either Ctr or Pro. All models have approximately the same performance for [conditional, non-loop, inner] branches. The poorer performance of the Btc model is the result of the large number of mispredictions for [unconditional, loop, inner] branches. This behavior occurs because there are two unconditional branches which share the same cache block. Therefore, they are constantly changing the predicted target block and causing the other to miss. However, the unconditional branches causing this problem are eliminated by the compiler with predicated execution support. The performance level of all three branch prediction models is approximately equal with predicated execution support for wc . Table 4 presents the static misprediction coverage of all branches. The branch prediction scheme is a BTB with 2-bit saturating counter and the percentages shown are the sum of the conditional and unconditional branch values. Misprediction coverage is de ned as the percentage of static branches which account for a given percentage of dynamic mispredictions. For example, in the base architecture, 8% of the

Type Class Location Base Ctr Pro Btc Pred Ctr Pro Btc

Loop Inner Outer 19521 0 16173 0 14047 0 3 1 3 1 3 1

Conditional Non-loop Inner Outer StrLn 32604 0 2 32987 0 2 33000 0 2 33 0 2 31 0 2 47 0 2

Unconditional Loop Non-loop Inner Outer Inner Outer StrLn 2 1 5 0 0 2 1 5 0 0 43726 1 4082 0 0 2 0 6 0 0 2 0 6 0 0 26 0 23 0 0

Jsr/Rts

Ind

Total

MPR

9 9 9 9 9 9

0 0 0 0 0 0

52144 49179 94867 56 54 111

0.09 0.09 0.17 0.00 0.00 0.00

Table 3: Branch misprediction breakdown for wc . Base Pred

10 0.06 0.07

20 0.06 0.07

30 0.06 0.07

40 0.06 0.14

50 0.06 0.14

60 0.06 0.20

70 0.06 0.29

80 0.08 0.34

90 0.08 0.45

100 0.56 0.52

Table 4: Branch misprediction coverage for wc using a BTB with 2-bit counter. static branches account for 90% of the dynamic mispredictions. Note that 100% of all mispredictions occur in 56% of the static branches; the remaining 44% of branches are either never executed or never mispredicted. The desired distribution is that a small number of static branches should account for a large portion of the misses, as seen in this table. In this situation, the compiler can focus its e orts on eliminating this small number of branches, reducing mispredictions and improving performance. The distribution of the number of instructions between branches and the number of instructions between mispredicted branches using the BTB with 2-bit counter scheme is presented in Table 5. All branch categories are considered in this data. The data shown are the dynamic counts of the number of branches which are separated by the speci ed number of instructions. For example, in the base architecture, 59% (335K) of branches are separated by 3-4 instructions, and 11% (5561) of mispredicted branches are separated by 1 instruction. In general, for the base architecture, there is a very small number of instructions between branches and mispredicted branches. The average values are 2.02 instructions between branches and 32.15 instructions between mispredicted branches. For a superscalar processor, this low number of instructions between mispredicted branches indicates a signi cant branch misprediction overhead. For the predicated architecture, the distribution is somewhat distorted since there are only 56 mispredicted branches in the entire program. The average values are 9.00 instructions between branches and 56355.55 instructions between mispredicted branches. By eliminating branches, the compiler has successfully increased the distance between branches and mispredicted branches for the predicate architecture.

Summary. This section demonstrates how predicated execution can be applied to the benchmark wc with the use of hyperblocks. The compiler was able to reduce the number of dynamic branch instructions by 45%. More importantly, we were able to remove almost all hard to predict branches. As a result, the number of dynamic mispredictions was reduced dramatically (1000 times), regardless of the branch prediction scheme employed. Thus, the results for wc indicated that predicated execution may lessen the need to employ a sophisticated branch prediction scheme. Finally, we saw that hyperblocks greatly increased the number of instructions between mispredicted branches. In the next section, we examine whether the the branch characteristics exhibited by wc are consistent across a set of benchmarks.

4 Experimental Evaluation 4.1 Methodology

The impact of predicated execution on branch behavior is evaluated using hyperblock compilation techniques in this section. The benchmarks studied consist of 022.li , 023.eqntott , 026.compress , 056.ear from SPEC-92, and the Unix utilities cmp , grep , lex , quick sort , wc . The benchmarks are compiled to produce an intermediate code for the target architecture, either base or predicate. The base architecture is an 8 issue superscalar processor, with no limitation placed on the combination of instructions which may be issued each cycle. The base architecture is further assumed to have 64 integer and 64 oating-point registers. The memory system consists of a 64K direct mapped instruction cache and a 64K direct mapped, blocking data cache; both with 64 byte block size. The data cache is write-through with no write allocate and has a miss penalty of 12 cycles. The dynamic branch prediction strategy is one of three models: a 1K entry BTB with 2 bit counter, a 1K entry BTB with pro le-based direction prediction, or a 1K entry branch target cache. The instruction latencies assumed are those of the HP PA-RISC 7100. The predicate architecture is the same as the base architecture with extensions to support predicated execution as described in Section 2.2. A 64-entry predicate register le is assumed in the predicate architecture. Following register allocation and code scheduling, the intermediate code is in a form which could be executed by the target architecture. To allow simulation of the predicated code on the host HP PA-RISC processor, the code is modi ed to remove all predicated instructions. Instructions to emulate the e ects of predicated instructions are inserted by the compiler, using the bit manipulation and conditional nulli cation capabilities of the PA-RISC instruction set. This emulation code is then probed. Execution of the probed code demonstrates correctness of the target architecture code, and also generates an instruction trace containing memory address information, predicate register contents, and branch directions. Note the code utilized speci cally for emulation and trace generation is not simulated; therefore it is not counted in any of the measured statistics. The compiler support utilized in this evaluation is more sophisticated than what was presented in Section 3 for wc . For the base architecture, superblock formation, superblock ILP optimizations, and superblock scheduling are applied to e ectively generate code for a wide issue processor without predicated execution support [6]. For the predicate architecture, the ILP optimizations and scheduling techniques are

Base Pred

Br MP br Br MP br

0 209K 4060 58 5

1 27586 5561 12 7

2 17 4 105K 4

3-4 335K 8 24 8

5-8 4 9 1 9

9-16 3 5487 210K 6

17-32 2 23980 2 3

33-64 0 5444 0 1

65-128 0 5689 0 0

129-256 0 1742 0 0

257+ 0 160 0 13

Table 5: Distance between branches and mispredicted branches for wc using a BTB with 2-bit counter. 0.60

1.00

0.55

0.90

0.50

0.80

0.45

0.70

0.40

0.60

0.35

0.50

0.30

0.40

0.25

0.30

0.20

0.20

0.15

0.10

0.10

0.00

0.05

-0.10

0.00 -0.05

-0.20 022.li

023.eqntott 026.compress

056.ear

cmp

grep

lex

qsort

wc

% Reduction in Dynamic Branches

-0.30

022.li

023.eqntott 026.compress

056.ear

Ctr

cmp

Pro

grep

lex

qsort

wc

Btc

Figure 5: Reduction in total dynamic branches with predicated execution support.

Figure 6: Reduction in the number branch mispredictions with predicated execution support.

applied to hyperblocks to expose additional ILP for a wide issue processor with predicated execution support. An important item to note is the branch behavior data for wc presented in this section is similar to that presented in Section 3. However, there are some distinct di erences as to the categorization of branches. These di erences are caused by the changes to the control ow graph and loop structure introduced by superblock and ILP transformations.

branch counts for each benchmark.

4.2 Results and Analysis The branch analysis introduced in Section 3 was performed for each of the benchmarks. Here, much of the data is presented in graphical form rather than tables. Appendix A contains a complete tabular listing of the data generated for all benchmarks. The table format and contents is the same as that used in Section 3. Total Dynamic Branches. Predicated execution enables the compiler to remove hard to predict branches from the instruction stream. Thus, the total number of dynamic branches in the program would be expected to decrease as the result of hyperblock techniques. Figure 5 shows the reduction in dynamic branches for each benchmark for the predicate architecture. As expected, predication signi cantly reduces the number of dynamic branches for most benchmarks, averaging a 27% reduction across the benchmarks. The one exception was 026.compress, which experiences a small increase in the number of branches for the predicated execution case (12343K to 12852K total dynamic branches, Table 7). The increase in the number of dynamic branches is a side-e ect of the hyperblock formation procedure. As a hyperblock is formed, branches that cannot be removed from the merged blocks are predicated. These branches are predicted regardless of the value of their predicate, increasing the number of times each branch is fetched, and hence increasing the dynamic count for that branch. The reader is referred to Table 7 in Appendix A for a complete breakdown of the dynamic

Total Mispredicted Branches. By allowing hard to predict branches to be removed from the instruction stream, the total number of mispredicted branches should be decreased with predicated execution support. Figure 6 shows this reduction for each of the benchmarks, using the three branch prediction schemes. A complete breakdown of the branch misprediction counts is provided in Appendix A, Table 8. For the majority of benchmarks, the number of mispredictions decreases signi cantly regardless of the prediction scheme employed. For wc and cmp , virtually all mispredictions are removed, regardless of the branch prediction scheme. Although the numbers are not as signi cant as the wc example presented in Section 3, the number of mispredictions is reduced by a factor of 6 for 023.eqntott and a factor of 4 for 056.ear (Table 8). Even though there is a 4% (Figure 5) increase in the number of dynamic branches for 026.compress , all branch prediction schemes achieve a 30% decrease in the number of branch mispredictions. The opposite trend appears for grep : the number of dynamic branches is reduced by 30% (Figure 5) with predication, but there is little decrease in the number of branch mispredictions and even an increase for the Btc model. This behavior occurs because grep is dominated by heavily biased branches. As a result, branches which are eliminated do not signi cantly reduce the number of mispredictions. The observed increase in mispredictions for the Btc model occurs because the branches in the hyperblock happen to be scheduled into the same cache block. Since the Btc model only allows a single prediction per cache block, prediction accuracy is lost. Dynamic Distance Between Branches. Figure 7 presents the distribution, averaged across all benchmarks, of the number of instructions between branches and mispredicted branches. The dynamic branch prediction scheme utilized to derive this data is the BTB with 2-bit counter. The desired trend is for the distributions to shift to the right with predicated execution support. This indicates the compiler is

0.25

0.70 0.65 0.60

0.20

0.55 0.50 0.45

0.15

0.40 0.35 0.30

0.10

0.25 0.20 0.15

0.05

0.10 0.05 0.00

0.00 0

1

Base_br

2

3-4

5-8

9-16

Base_MP_br

17-32

33-64

Pred_br

65-128

129-256

257+

Figure 7: Distribution of the distance between branches and mispredicted branches using a BTB with 2-bit counter.

022.li 023.eqntott 026.compress 056.ear cmp grep lex qsort wc Average

Base Mispred Branches Branches 3.5 43 2.1 28 7.3 78 4.7 129 2.6 433 0.9 126 1.6 160 8.5 54 2.0 46 3.7 122

0

1-10

11-20

21-30

31-40

Pred_MP_br

Pred Mispred Branches Branches 3.7 62 4.8 195 8.7 137 5.3 492 3.6 43940 1.2 100 6.4 290 14.7 128 6.8 28464 6.1 8201

Table 6: Average distance between branches and mispredicted branches using a BTB with 2-bit counter. successfully removing branches from the instruction stream, thereby increasing the distances between branches and mispredicted branches. The distributions indicate the desired behavior is obtained for the predicate architecture. The most notable change is the distance between mispredicted branches distribution which starts in the 3-4 instruction category for the base architecture. However, for the predicate architecture the distribution has been shifted to start in the 17-32 instruction range. This indicates the predicate architecture can consistently process a moderate number of instructions (17-32) before encountering a misprediction. This is especially important for wider issue processors, which cannot be fully utilized unless sucient distance can be established between mispredicted branches. The individual dynamic distributions for each benchmark are given in Table 11 of the appendix. On an individual benchmark basis, the average distance between branches and mispredicted branches is presented in Table 6. With predicated execution support, the compiler has e ectively eliminated branches to increase distance between the remaining branches. The benchmark improved the largest amount is lex , 1.6 to 6.4 instructions. A more magni ed e ect is observed for the increased distance between mispredicted branches. For cmp and wc , almost all mispredictions are eliminated, thus the average distance between mispredictions is extremely high. The anomaly is grep , in which the average distance between mispredicted branches is reduced with predicated execution support. This behavior occurs because the number of mispredictions is relatively unchanged in the predicate architecture (Table 8). However,

41-50

Base

51-60

61-70

71-80

81-90

91-100

Pred

Figure 8: Average taken frequency distribution of conditional branches across all benchmarks. the total number of instructions executed in grep is reduced in the predicate architecture due to additional compiler optimization opportunities exposed by hyperblock formation. As a result, a net reduction in instructions per misprediction is observed. Taken Branch Frequency. Figure 8 presents the average taken frequency distribution for conditional branches across all benchmarks. As discussed in Section 3, the e ectiveness of the compiler support for predicated execution can be shown by eliminating or reducing the less biased branches and leaving the branches that are more heavily biased towards taken or not taken. As the gure shows, this was indeed the case. In particular, the number of branches that fall into the range of taken from 1% of the time to 90% of the time were reduced. In addition, branches that were never taken or taken 91-100% of the time have been increased.

5 Related Work Predicated or guarded execution has been examined by several researchers. Decision tree scheduling utilizes guarded instructions to achieve large performance improvements on deeply pipelined processors [12]. Guarded instructions allow instructions from multiple execution paths to be placed in load/branch delay slots to e ectively hide long execution latencies. Predicated execution support was used extensively in the Cydra 5 system [13] [18]. Predicated execution is integrated into the optimized execution of modulo scheduled inner loops to control the prologue, epilogue, and iteration initiation. Predicated execution also allows loops with conditional branches to be eciently modulo scheduled. Pnevmatikatos and Sohi examine the e ectiveness of guarded execution on dynamically scheduled superscalar processors [19]. They show that full guarding can signi cantly increase the average basic block size and the average dynamic window size. They also show moderate increases in both may be obtained with restricted guarding. A major di erence with this work is that we focus on characterizing the behavior of the branches responsible for mispredictions and the e ectiveness of predicated execution to deal with these branches using hyperblock compilation techniques [16].

6 Conclusions Branch instructions pose serious diculties to exploiting instruction-level parallelism. Even with sophisticated branch prediction techniques, a large percentage of frequently executed branches remain dicult to predict. In this paper, an in-depth analysis of the characteristics of branches is presented. Branches which contribute large numbers of mispredictions are isolated and targeted for elimination with predicated execution support. Compiler support for predicated execution is based on a structure called a hyperblock. The goal of hyperblock formation is to intelligently group basic blocks from many di erent control ow paths into a single manageable block for compiler optimization and scheduling. Hyperblock formation focuses on eliminating unbiased branches, while leaving highly biased branches alone, since little performance gain is achieved by eliminating them. Results show the compiler can substantially reduce the number of dynamic branches with predicated execution support. Across all benchmarks, an average of 27% reduction in the dynamic branches was observed. More importantly, though, the compiler is able to remove a large fraction of the dicult to predict branches. Therefore, the number of mispredictions is also considerably reduced. The addition of predicate support to an architecture utilizing a BTB with a 2-bit counter reduced the branch prediction misses by 75% in 4 of the 9 benchmarks. Over 20% of the branch prediction misses are eliminated in all but one of the benchmarks. The distance between branches and mispredicted branches is also an important measure. With predicated execution support the average distance between branches is increased from 3.7 to 6.1 instructions, and the average distance between mispredicted branches is increased from 122 to 8201, for the benchmarks studied. While predicated execution support was able to reduce the number of mispredicted branches, there are still a large number of problematic branches remaining. These problematic branches motivate future architecture and compiler studies on predicated execution.

Acknowledgements The authors would like to thank all members of the IMPACT research group and the anonymous referees whose comments and suggestions helped to improve the quality of this paper signi cantly. This research has been supported by the National Science Foundation (NSF) under grant MIP-9308013, Joint Services Engineering Programs (JSEP) under Contract N00014-90-J-1270, Intel Corporation, the AMD 29K Advanced Processor Development Division, Hewlett-Packard, SUN Microsystems, NCR, and NASA under Contract NASA NAG 1-613 in cooperation with ICLASS. John Gyllenhaal was also supported by an NSF Graduate Fellowship.

References

[1] J. E. Smith, \A study of branch prediction strategies," in Proceedings of the 8th International Symposium on Computer Architecture, pp. 135{148, May 1981. [2] T. Ball and J. R. Larus, \Branch prediction for free," in Proceedings of the ACM SIGPLAN 1993 Conference on Pro-

[3] [4]

[5] [6]

[7] [8] [9] [10] [11] [12] [13] [14] [15] [16]

[17] [18] [19]

gramming Language Design and Implementation, pp. 300{ 313, June 1993. W. W. Hwu, T. M. Conte, and P. P. Chang, \Comparing software and hardware schemes for reducing the cost of branches," in Proceedings of the 16th International Symposium on Computer Architecture, pp. 224{233, May 1989. J. A. Fisher and S. M. Freudenberger, \Predicting conditional branch directions from previous runs of a program," in Proceedings of 5th International Conference on Architectual Support for Programming Languages and Operating Systems, pp. 85{95, October 1992. J. A. Fisher, \Trace scheduling: A technique for global microcode compaction," IEEE Transactions on Computers, vol. c-30, pp. 478{490, July 1981. W. W. Hwu, S. A. Mahlke, W. Y. Chen, P. P. Chang, N. J. Warter, R. A. Bringmann, R. G. Ouellette, R. E. Hank, T. Kiyohara, G. E. Haab, J. G. Holm, and D. M. Lavery, \The Superblock: An e ective technique for VLIW and superscalar compilation," Journal of Supercomputing, vol. 7, pp. 229{248, January 1993. J. Lee and A. J. Smith, \Branch prediction strategies and branch target bu er design," IEEE Computer, pp. 6{22, January 1984. T. Y. Yeh and Y. N. Patt, \Two-level adaptive training branch prediction," in Proceedings of the 24th Annual International Symposium on Microarchitecture, pp. 51{61, November 1991. M. D. Smith, M. Johnson, and M. A. Horowitz, \Limits on multiple instruction issue," in Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 290{302, April 1989. D. W. Wall, \Limits of instruction-level parallelism," in Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 176{188, April 1991. M. Butler, T. Yeh, Y. Patt, M. Alsup, H. Scales, and M. Shebanow, \Single instruction stream parallelism is greater than two," in Proceedings of the 18th International Symposium on Computer Architecture, pp. 276{286, May 1991. P. Y. Hsu and E. S. Davidson, \Highly concurrent scalar processing," in Proceedings of the 13th International Symposium on Computer Architecture, pp. 386{395, June 1986. B. R. Rau, D. W. L. Yen, W. Yen, and R. A. Towle, \The Cydra 5 departmental supercomputer," IEEE Computer, vol. 22, pp. 12{35, January 1989. J. R. Allen, K. Kennedy, C. Porter eld, and J. Warren, \Conversion of control dependence to data dependence," in Proceedings of the 10th ACM Symposium on Principles of Programming Languages, pp. 177{189, January 1983. J. C. Park and M. S. Schlansker, \On predicated execution," Tech. Rep. HPL-91-58, Hewlett Packard Laboratories, Palo Alto, CA, May 1991. S. A. Mahlke, D. C. Lin, W. Y. Chen, R. E. Hank, and R. A. Bringmann, \E ective compiler support for predicated execution using the hyperblock," in Proceedings of the 25th International Symposium on Microarchitecture, pp. 45{54, December 1992. V. Kathail, M. S. Schlansker, and B. R. Rau, \HPL playdoh architecture speci cation: Version 1.0," Tech. Rep. HPL93-80, Hewlett-Packard Laboratories, Palo Alto, CA 94303, February 1994. J. C. Dehnert, P. Y. Hsu, and J. P. Bratt, \Overlapped loop support in the Cydra 5," in Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 26{38, April 1989. D. N. Pnevmatikatos and G. S. Sohi, \Guarded execution and branch prediction in dynamic ILP processors," in Proceedings of the 21st International Symposium on Computer Architecture, pp. 120{129, April 1994.

A Appendix Type Class Location 022.li Base Pred 023.eqntott Base Pred 026.compress Base Pred 056.ear Base Pred cmp Base Pred grep Base Pred lex Base Pred qsort Base Pred wc Base Pred

Loop Inner Outer 2419K 57667 2149K 191K 28708K 101M 176M 263K 5217K 1758K 4383K 4264K 1065M 117M 878M 6320 210K 13 420K 13 4300 312K 210K 108K 5866K 778K 6195K 902K 5512K 0 2075K 0 222K 58237 210K 23

Conditional Non-loop Inner Outer StrLn 147K 428K 3405K 1141K 341K 1959K 2285K 149M 5017K 956K 1903K 4707K 39 4760K 88 39 2664K 88 234M 43240K 13047K 21 18675 12420K 315K 1 11 0 44 11 0 340K 465 0 118K 465 242K 6958K 31689 69121 1294K 24227 242K 0 1062K 135K 0 455K 6 199K 4 6 35 4

Unconditional Loop Non-loop Inner Outer Inner Outer 138K 4822 2687 263 398K 10376 26629 2725 172K 19578 4465 129K 301K 56317 21 176K 49590 93008 1 463K 306K 871K 1 361K 54997 41728 0 111 233M 1 233M 6311 4037 1 42 0 0 2 0 43 0 8831 0 5539 4792 856 0 9483 51243 105K 10190 42230 60470 54295 4392 18143 236K 0 0 0 1879K 0 0 0 11245 26489 2 4927 13137 7 2 29

StrLn 121K 81179 1614K 1625K 22 22 2480K 2480K 0 0 0 0 4814 5122 108K 399K 0 0

Jsr/Rts

Ind

Total

549K 550K 6873K 6886K 250 250 15005K 15433K 31 31 308 308 15600 16708 614K 614K 29 29

0 0 66 66 0 0 0 0 0 0 1 1 678 3881 0 0 0 0

7275K 6853K 296M 193M 12343K 12852K 1491M 1375M 530K 420K 672K 453K 14107K 8648K 7777K 5559K 523K 223K

Table 7: Dynamic branch breakdown for all benchmarks. Type Class Location 022.li Base Ctr Pro Btc Pred Ctr Pro Btc 023.eqntott Base Ctr Pro Btc Pred Ctr Pro Btc 026.compress Base Ctr Pro Btc Pred Ctr Pro Btc 056.ear Base Ctr Pro Btc Pred Ctr Pro Btc cmp Base Ctr Pro Btc Pred Ctr Pro Btc grep Base Ctr Pro Btc Pred Ctr Pro Btc lex Base Ctr Pro Btc Pred Ctr Pro Btc qsort Base Ctr Pro Btc Pred Ctr Pro Btc wc Base Ctr Pro Btc Pred Ctr Pro Btc

Conditional Loop Non-loop Inner Outer Inner Outer 197K 852 29688 11775 409K 1568 27645 15357 326K 965 45494 16607 216K 37652 32243 8955 187K 33123 26367 15556 349K 65170 34196 13297 1119K 2133K 182K 24080K 3708K 2819K 277K 37272K 2944K 6263K 361K 30110K 2290K 106K 4334 215K 2860K 124K 6406 242K 4712K 116K 6112 400K 274K 49181 41 953K 863K 164K 13 1512K 411K 116K 13 1167K 404K 82058 41 385K 624K 341K 13 740K 544K 142K 13 463K 23091K 15885K 6592 19769K 21644K 14567K 6392 23206K 32762K 19815K 6392 19900K 9334K 6994 0 998 9311K 7894 0 798 22953K 7094 0 2398 1 2 4374 1 1 2 4064 1 397 3 7797 1 19 2 0 5 15 12 0 5 28 3 0 4 452 4329 0 5121 559 4053 0 5322 606 8463 0 9154 4366 4666 0 772 4049 4582 0 745 7590 12720 0 1349 39381 6429 7019 160K 92560 25859 9857 269K 71315 30734 7057 231K 78864 13942 600 35689 82835 16872 23082 51618 142K 55869 1737 84510 740K 0 106K 0 697K 0 101K 0 775K 0 106K 0 133K 0 41875 0 104K 0 34671 0 171K 0 52996 0 13704 35 6 19492 13313 28 3 19674 16019 29 4 24993 20 4 5 10 16 8 3 10 29 5 3 9

StrLn 218K 368K 402K 50515 68877 77203 120K 227K 197K 24064 24459 26728 104 104 104 143 143 143 3077 2277 2777 3077 3377 2377 10 10 10 10 10 10 6 6 6 6 6 6 4169 4510 4827 2649 3290 4454 274K 272K 265K 137K 103K 74833 2 2 2 2 2 2

Unconditional Loop Non-loop Inner Outer Inner Outer 118 153 12 36 76 80 12 33 47614 1103 802 149 55 12 25 128 55 12 25 125 88911 6132 13878 891 8571 1078 490 2450 7979 1078 490 2450 36177 48995 885 54506 3151 1960 294 3245 882 1960 294 3443 113K 60461 294 16269 104 1974 13 8541 104 1974 13 416 7761 63251 13 215K 26 10835 13 286 26 11783 13 286 77686 78255 13 104K 396 693 0 198 396 693 0 198 58031 39550 0 198 99 0 0 297 99 0 0 297 99 0 0 6894 10 0 8 0 10 0 8 0 19 0 42 0 0 1 0 6 0 1 0 6 0 1 0 32 0 5 0 52 0 5 0 38 0 425 0 973 1 5 0 65 1 5 0 39 235 420 0 1583 590 2735 958 724 469 2403 710 608 5306 12798 3070 12641 190 245 55 100 185 245 55 100 32441 37525 274 1020 42 0 0 0 42 0 0 0 106K 0 0 0 131K 0 0 0 104K 0 0 0 190K 0 0 0 2 5 2 7 2 5 2 7 3734 10887 2 678 1 3 2 5 1 3 2 5 1 7 2 5

StrLn 4021 3474 83303 1500 772 48920 4118 3724 735K 9459 9459 1115K 26 26 26 52 52 52 693 693 7086 693 693 74452 0 0 0 0 0 0 0 0 0 0 0 0 856 733 4618 874 1569 3865 42 42 29399 51325 46621 124K 0 0 0 0 0 0

Table 8: Branch misprediction breakdown for all benchmarks.

Jsr/Rts

Ind

Total

MPR

261K 256K 445K 218K 213K 405K 3356K 3355K 6614K 2884K 2872K 4765K 78 78 78 104 104 104 6805K 6175K 9959K 5534K 5534K 8369K 1 1 1 1 1 1 96 7 185 7 7 96 5181 5119 11773 5962 5932 13479 129K 129K 129K 133K 133K 208K 9 9 9 9 9 9

0 0 0 0 0 0 98 98 3154 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 771 771 874 819 819 830 0 0 0 0 0 0 0 0 0 0 0 0

724K 1082K 1370K 565K 545K 1104K 31010K 47676K 47370K 5543K 6147K 11334K 1287K 2543K 1981K 883K 1719K 1411K 65564K 65606K 82552K 14880K 14859K 31416K 4407 4097 8270 44 50 79 10062 9991 19813 9889 9435 24000 228K 412K 396K 139K 186K 378K 1250K 1200K 1413K 627K 526K 822K 33264 33045 56357 61 59 72

0.10 0.15 0.19 0.08 0.08 0.16 0.10 0.16 0.16 0.03 0.03 0.06 0.10 0.21 0.16 0.07 0.13 0.11 0.04 0.04 0.06 0.01 0.01 0.02 0.01 0.01 0.02 0.00 0.00 0.00 0.01 0.01 0.03 0.02 0.02 0.05 0.02 0.03 0.03 0.02 0.02 0.04 0.16 0.15 0.18 0.11 0.09 0.15 0.06 0.06 0.11 0.00 0.00 0.00

022.li

Base Pred 023.eqntott Base Pred 026.compress Base Pred 056.ear Base Pred cmp Base Pred grep Base Pred lex Base Pred qsort Base Pred wc Base Pred

0 4202K 4336K 125M 158M 3980K 5973K 1175M 1216M 409K 407K 445K 223K 10848K 823K 1483K 2313K 331K 210K

1-10 382K 114K 40752K 2124K 2651K 2638K 86393K 37355K 105K 14 195K 197K 1850K 583K 2539K 1083K 22270 14

11-20 452K 126K 34506K 13088K 661K 2262K 38305K 37200K 0 0 0 160 234K 12479 565K 558K 70264 5

21-30 182K 116K 4328K 293K 1011K 4 14876K 70 0 0 0 0 125K 240K 12718 78642 34951 0

31-40 219K 146K 18539K 163K 803K 0 19817K 382 0 0 0 250 88545 6015 114K 0 16 16

41-50 158K 50258 8855K 102K 720K 261K 43835K 19 0 0 1011 0 44539 71519 1503K 204K 15584 8

51-60 93281 104K 9926K 14406 207K 225K 18465K 57 0 0 0 0 38135 54663 295K 0 0 0

61-70 166K 622K 8043K 56995 196K 228K 7005K 0 0 0 0 723 91013 35157 0 0 1306 3

71-80 130K 19165 419K 130K 383K 145K 4011K 0 0 0 0 0 33594 47455 0 0 8 8

81-90 84425 28929 6897K 9391K 311K 36 14 14 0 0 6 0 30043 116K 0 0 4033 0

91-100 370K 122K 29826K 934K 804K 273K 65451K 65258K 11168 13175 14128 14952 490K 6496K 300K 198K 3 3

Table 9: Taken frequency distribution for all benchmarks. 022.li

Base Pred 023.eqntott Base Pred 026.compress Base Pred 056.ear Base Pred cmp Base Pred grep Base Pred lex Base Pred qsort Base Pred wc Base Pred

10 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.02 0.02 0.00 0.00 0.00 0.00 0.02 0.03 0.02 0.03

20 0.00 0.00 0.00 0.00 0.01 0.01 0.00 0.00 0.03 0.03 0.00 0.01 0.00 0.00 0.02 0.03 0.02 0.03

30 0.01 0.00 0.00 0.00 0.01 0.02 0.01 0.01 0.05 0.05 0.01 0.02 0.00 0.00 0.02 0.04 0.02 0.04

40 0.01 0.01 0.00 0.00 0.01 0.02 0.01 0.01 0.07 0.06 0.01 0.02 0.00 0.01 0.03 0.04 0.03 0.04

50 0.01 0.01 0.00 0.00 0.02 0.02 0.02 0.02 0.09 0.08 0.01 0.03 0.00 0.01 0.04 0.05 0.05 0.05

60 0.01 0.01 0.00 0.00 0.03 0.03 0.02 0.02 0.11 0.09 0.02 0.04 0.01 0.01 0.06 0.08 0.05 0.11

70 0.02 0.01 0.01 0.00 0.03 0.03 0.02 0.03 0.14 0.12 0.02 0.04 0.01 0.01 0.07 0.09 0.06 0.17

80 0.02 0.02 0.01 0.01 0.04 0.04 0.03 0.04 0.16 0.14 0.03 0.05 0.03 0.01 0.09 0.10 0.06 0.24

90 0.02 0.03 0.01 0.01 0.07 0.06 0.04 0.04 0.17 0.17 0.03 0.07 0.06 0.02 0.13 0.15 0.09 0.32

100 0.15 0.14 0.15 0.12 0.31 0.27 0.18 0.09 0.33 0.21 0.10 0.18 0.29 0.33 0.45 0.50 0.50 0.38

Table 10: Misprediction coverage for all benchmarks using a BTB with 2-bit counter. 022.li Base Pred 023.eqntott Base Pred 026.compress Base Pred 056.ear Base Pred cmp Base Pred grep Base Pred lex Base Pred qsort Base Pred wc Base Pred

Br MP br Br MP br Br MP br Br MP br Br MP br Br MP br Br MP br Br MP br Br MP br Br MP br Br MP br Br MP br Br MP br Br MP br Br MP br Br MP br Br MP br Br MP br

0 2675K 46872 827K 8699 119M 1110K 12308K 28313 2671K 16343 3094K 14549 66504K 1106K 394M 1699 199K 5 46 5 471K 5 270K 5 6802K 3980 559K 3185 1212K 13 359K 7 348K 66 92054 6

1 942K 6843 2307K 77905 42025K 2745K 5236K 5820 2073K 11972 2151K 125 110M 715K 27198K 1205K 115K 7 52614 4 131K 34 98732 78 3911K 9389 446K 14677 456K 4170 216K 0 48759 11 23 5

2 834K 9795 910K 17157 10517K 923K 110M 9076 1346K 11683 508K 153 500M 209K 307M 617K 11539 3 131K 3 320 8 17668 24 1165K 3382 589K 3421 2393 13 14801 14 1341 39 4 3

3-4 792K 37061 1354K 35384 74928K 2510K 29587K 23577 1587K 22967 1583K 1617 440M 7329K 336M 1898 117K 2 144K 1 33591 32 31131 40 600K 5372 849K 2054 1840K 15653 1623K 4809 26531 592 34 10

5-8 1187K 42965 539K 48790 42437K 1528K 6171K 2101K 1384K 70153 2307K 70444 63641K 3980K 119M 1235K 27589 8 78920 9 22061 45 21933 67 970K 10799 5739K 39177 1557K 225K 459K 5174 50051 207 78902 3

9-16 369K 105K 1723K 94316 4393K 6969K 10489K 37682 1163K 118K 952K 32889 180M 4435K 60810K 4792 59111 14 13165 2 8615 330 8820 102 320K 18677 362K 27609 1373K 15801 538K 24833 31671 6120 26316 3

17-32 414K 182K 338K 170K 564K 6538K 13706K 183K 1108K 231K 1108K 30654 54812K 21368K 59685K 2481K 0 23 0 4 3948 1581 4154 1807 180K 38512 59628 35996 562K 319K 1519K 69725 16145 8089 13151 14

33-64 0 168K 0 396K 5323 5229K 44191 194K 867K 228K 945K 158K 36680K 12436K 30493K 7687 0 766 0 0 118 1879 0 2134 7265 60385 118 39095 421K 442K 692K 253K 0 10476 13151 1

65-128 0 83099 0 136K 0 2411K 60473 1133K 0 321K 39 165K 29738K 4204K 27265K 6384 0 375 0 0 0 2503 0 2844 1350 39225 231 23024 6 191K 14 211K 0 6565 0 1

129-256 0 38330 0 49752 0 1034K 0 83633 6899 216K 9289 319K 7429K 1706K 9909K 19961 0 587 0 3 0 2453 0 2194 1144 16704 0 11130 0 31413 0 42847 0 1064 0 2

257+ 0 2303 0 4915 0 7587 0 1741K 0 37067 0 88904 99 8069K 0 9297K 0 2617 0 13 0 1192 0 594 0 22785 0 17856 0 4478 0 15459 0 35 0 13

Table 11: Distance between branches and mispredicted branches for all benchmarks using a BTB with 2-bit counter.

Loading...

Characterizing the Impact of Predicated Execution on Branch Prediction

Characterizing the Impact of Predicated Execution on Branch Prediction Scott A. Mahlke Richard E. Hank Roger A. Bringmann John C. Gyllenhaal David M...

233KB Sizes 1 Downloads 6 Views

Recommend Documents

Corporate Governance and the Prediction of the Impact of AIFRS
Nov 5, 2011 - Corporate Governance and the Prediction of the Impact of AIFRS. Adoption. JOHN GOODWIN ([email protected]

ture on strategy execution
executing strategies. Global organizations are better than multinational or national organizations at executing strategi

Constitutional Prohibition on the Execution of the Mentally Retarded
States Supreme Court declared on June 20, 2002, in Atkins v. Virginia,2 that the execution of mentally retarded3 crimina

Risk Analysis of Tender Documents on the Execution of Private
Jul 30, 2017 - Abstract. Documents received during the tender, are in the form of drawings, specifications, bill of quan

The Impact of Globalization on the Consumer
impact of such trade on consumers specifically, and residents of different countries in general. It also creates .... of

Evaluations Of High-Impact Weather Prediction Tressa L. Fowler
Evaluations Of High-Impact Weather Prediction. Tressa L. Fowler, Edward Tollerud, Tara Jensen, Eric Gilleland, Barbara B

Execution - University of Florida
Execution. The 4 Disciplines of Execution. “Leadership without execution is incomplete and ineffective. Without the ab

The Trader Interaction Effect on the Impact of Overconfidence on
Dec 5, 2007 - This article extends previous research on how overconfidence affects trading performance in two ways. Firs

Tutorial on the Impact of the Synchronous Generator Model on
model of a synchronous generator represents the machine as a constant ..... generator represented as a constant voltage

The Impact of Socioeconomic Conditions on
mitorios), ausencia de instalaciones sanitarias, años de educación, ocupación/desocupación y cobertura social .....