METHOD FOR ORDERING OPERATIONS FOR SCHEDULING BY A
MODULO SCHEDULER FOR PROCESSORS WITH A LARGE
NUMBER OF FUNCTION UNITS AND RECONFIGURABLE DATA
PATHS
BACKGROUND OF THE INVENTION
[01] The present invention generally relates to computer processing and more specifically to a system and method for ordering operations to be scheduled by clustering related operations in an ordering list.
[02] Instruction scheduling involves assigning operations from an original sequence of operations to specific functional units at specific times in a way to make efficient use of hardware resources. The scheduled operations produce the same result as executing the operations sequentially in an original order but the operations may not be scheduled in that original order. The goal is to efficiently use hardware resources and retain the original result that would be obtained by executing the operations sequentially.
[03] Instruction scheduling operates by scheduling an instruction that is executed for each clock cycle of a processor. Each instruction includes a slot for each functional unit of the processor where an operation may be scheduled. The instruction scheduler then schedules operations for a functional unit during a clock cycle. Typically, instruction schedulers attempt to schedule operations where a minimum number of instructions are used and operations are scheduled for as many functional units as possible for each instruction used.
[04] The process of instruction scheduling orders operations in a scheduling order list, which is typically a list of operations in the order the operations should be executed if they were executed sequentially. Typically, a data dependence graph (DDG) is used to order operations to be scheduled. The DDG is arranged based on the dependencies among a group of operations for a program code. The dependencies of the DDG are represented by edges, which represent delays, i.e., the time delay required between the start of a predecessor operation and the start of a successor operation connected by the edge. Operations in the DDG are assigned heights to establish a priority value for the operation. A height indicates an overall dependency value based on the values of all the edges dependent upon a specific operation. The operation with the greatest height in the DDG becomes the highest priority operation for scheduling. Typically, the operations are ordered starting from operations with the greatest height to operations with the lowest height. Operations are then scheduled sequentially from the first ordered operation to the last ordered operation.
[05] The above approach may work when scheduling a small number of functional units for each clock cycle. However, when scheduling a large number of functional units for each clock cycle, problems result when operations are ordered sequentially from the highest priority to the lowest priority. Thus, the operations that have the most operations dependent on them are scheduled first. If a DDG is depicted as having branches of related operations, the operations of greatest height are typically ordered first. Then, operations of the next greatest height are ordered next, and so on. This method of ordering operations typically orders operations from different branches in a DDG together because operations of a highest priority are usually located at the top of different branches. When operations ordered in this way are scheduled, related operations in branches are scheduled in functional units in a way that inefficiently uses computing resources. For example, the resulting schedule may exhibit fragmentation, increased costs from moving data from functional unit to functional unit, higher resource use cost, and increased communication resource use.
[06] In one example, a resulting schedule using the above method results in a large amount of data movement because operations that use the same variables may not be grouped together. This results in a schedule that requires a large number of data movement resources. Thus, either the processor must include a large number of data movement resources, or a schedule is produced that is inefficient in its use of time because data movement resources are exhausted and the schedule has to be extended in time to compensate.
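The interleaving of branches described above can be seen in a minimal sketch. The graph below is hypothetical: two independent three-operation branches (a1, a2, a3 and b1, b2, b3) feed one final operation named out, every edge delay is assumed to be one, and ties are broken alphabetically. None of these names come from the figures. Sorting purely by decreasing height alternates between the two branches instead of keeping each branch together:

```python
# Hypothetical DDG given as successor lists; every edge delay is assumed to be 1.
succs = {
    "a1": ["a2"], "a2": ["a3"], "a3": ["out"],
    "b1": ["b2"], "b2": ["b3"], "b3": ["out"],
    "out": [],
}

def height(node):
    """Longest delay-weighted path from node to a node with no successors."""
    return max((1 + height(s) for s in succs[node]), default=0)

# Conventional ordering: greatest height first, ties broken by name (assumed).
height_order = sorted(succs, key=lambda n: (-height(n), n))
print(height_order)
```

In the printed list, operations of branch a and branch b alternate, which is the ordering pattern that paragraphs [05] and [06] identify as a source of extra data movement.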
BRIEF SUMMARY OF THE INVENTION
[07] In one embodiment, a method for ordering a plurality of operations that are dependent upon one another in an ordered list to be used for scheduling is provided. The method comprises identifying a current operation in the plurality of operations that is not in the ordered list. Also, it is determined if the current operation has any predecessor operations that are not in the ordered list. If the current operation has predecessor operations, predecessor operations are added to the ordered list. The current operation is then added to the ordered list and a successor operation to the current operation is identified. The successor operation is now considered the current operation and the process reiterates to determine if the current operation has any predecessor operations and continues as above. The process continues until a current operation does not have any successor operations.
[08] In one embodiment, a method for ordering a plurality of operations that are dependent upon one another in an ordered list to be used for scheduling is provided. The method comprises: (a) identifying a current operation in the plurality of operations that is not in the ordered list; (b) determining if the current operation has any predecessor operations that are not in the ordered list; (c) if the current operation has predecessor operations, adding the predecessor operations to the ordered list; (d) adding the current operation to the ordered list; (e) identifying a successor operation to the current operation, wherein the successor operation is considered the current operation; and (f) performing steps (b)-(e) until a successor operation is not identified in step (e).
[09] In another embodiment, computer program products stored on tangible media that direct a processor to order operations as described below are provided.
[10] A further understanding of the nature and advantages of the invention herein may be realized by reference to the remaining portions of the specification and the attached drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
[11] Fig. 1 discloses a system for ordering operations according to one embodiment;
[12] Fig. 2A illustrates an example of a DDG according to one embodiment;
[13] Fig. 2B illustrates the DDG for the operations of Fig. 2A with the height shown inside each node;
[14] Fig. 2C illustrates the DDG of Figs. 2A and 2B with a maximum predecessor height shown inside each node;
[15] Fig. 3 illustrates a method for computing a scheduling priority according to one embodiment;
[16] Fig. 4A illustrates the method for computing scheduling priority of Fig. 3 in more detail according to one embodiment;
[17] Fig. 4B illustrates the descend method according to one embodiment;
[18] Fig. 4C illustrates the climb method according to one embodiment; and
[19] Fig. 5 illustrates an output of a resultant instruction schedule for the ordered operations and program code.
DETAILED DESCRIPTION OF THE INVENTION
[20] Fig. 1 discloses a system 100 for ordering operations according to one embodiment. System 100 is a computing device that outputs an ordered list of operations that may be used to schedule operations in executable instructions. Examples of computing devices include personal computers, work stations, servers, personal digital assistants (PDAs), pocket PCs, and the like. Once the operations are scheduled, the scheduled operations may be executed in a computer processor, such as a RaPiD processor developed by the University of Washington or an adaptable execution unit developed by Quicksilver Technology, Inc. The processor may be included in a cellular phone, personal digital assistant (PDA), global positioning system (GPS) receiver, etc.
[21] In one embodiment, a computer program product including software code stored on a computer readable medium that directs system 100 as described is provided. Examples of computer readable media include RAM, disk drives, floppy disks, CD-ROMs, flash memory, read only memories (ROMs), and the like.
[22] System 100 receives operations that are to be scheduled for a program code. In one embodiment, system 100 organizes the operations into relationships that may be represented by a data dependence graph (DDG). Each node of the representation on the data dependence graph represents an operation and edges in the DDG represent dependencies between connected nodes.
[23] Fig. 2A illustrates an example of a DDG according to one embodiment. As shown, each node represents an operation and the edges represent dependencies between the operations. The numbers in the nodes represent an operation number for identification purposes.
[24] Fig. 2B illustrates the DDG for the operations of Fig. 2A with the height shown inside each node. For purposes of this example, all edges have a weight of one, but it will be understood that different edges may have different weights.
[25] Fig. 2C illustrates the DDG of Figs. 2A and 2B with a maximum predecessor height shown inside each node. The maximum predecessor height (MPH) for a node is the maximum height of any predecessor nodes for the node. A predecessor node is any node the current operation depends on. Predecessor nodes may also be immediate predecessor nodes, which are nodes that the current node depends on directly (connected to it by an edge). For example, the maximum height of any predecessor nodes of node 1 is five (from nodes 13 and 14) and that MPH is assigned to node 1. Also, the MPH of predecessor nodes of node 4 is four (from nodes 11 and 12) and that MPH is assigned to node 4.
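As a hedged illustration of how the height and MPH values might be computed, the following sketch uses an edge list reconstructed from the worked example later in this description. The edge list is an assumption, since Fig. 2A itself is not reproduced in the text, and all edges are given a weight of one. The convention that a node with no predecessors takes its own height as its MPH is also an assumption, inferred from the MPH values the description quotes for nodes 6, 11, and 12. The names EDGES, height, all_preds, and mph are illustrative only.

```python
from functools import lru_cache

# (predecessor, successor) pairs inferred from the worked example in this
# description; an assumption, not a transcription of Fig. 2A.
EDGES = [(13, 10), (14, 10), (10, 7), (7, 3), (6, 3), (3, 2), (4, 2),
         (8, 4), (9, 4), (11, 9), (12, 9), (9, 5), (2, 1)]
NODES = range(1, 15)
succs = {n: [s for p, s in EDGES if p == n] for n in NODES}
preds = {n: [p for p, s in EDGES if s == n] for n in NODES}

@lru_cache(maxsize=None)
def height(n):
    """Longest path (edge weight 1) from n to a node with no successors."""
    return max((1 + height(s) for s in succs[n]), default=0)

@lru_cache(maxsize=None)
def all_preds(n):
    """Every node that n depends on, directly or transitively."""
    found = set(preds[n])
    for p in preds[n]:
        found |= all_preds(p)
    return frozenset(found)

def mph(n):
    """Maximum predecessor height; falls back to n's own height (assumed)."""
    return max((height(p) for p in all_preds(n)), default=height(n))

for n in (13, 1, 4, 6):
    print(n, height(n), mph(n))
```

Under these assumptions, nodes 13 and 14 have a height of five, node 1 has an MPH of five, and node 4 has an MPH of four, which agrees with the values stated for Figs. 2B and 2C.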
[26] In one embodiment, the values computed for the DDG are used by system 100 to compute the order of operations in the ordered list. Specifically, a descend module 102 and a climb module 104 use the values in ordering the operations.
[27] Descend module 102 implements a descend method, described below. Descend module 102 finds successor operations for a current operation according to one embodiment. For example, a successor operation is an operation that is dependent upon the execution of a current operation. In one embodiment, descend module 102 may find all successor operations that are dependent upon a current operation. Additionally, descend module 102 may find immediate successor operations, which are successor operations that are connected by edges in the DDG to the current operation.
[28] Climb module 104 implements a climb method, described below. Climb module 104 finds predecessor operations for a current operation. Predecessor operations are operations that the current operation depends upon. Climb module 104 may find all predecessor operations that the current operation depends upon. Additionally, climb module 104 may find immediate predecessor operations, which are predecessor operations that are connected by an edge in the DDG to the current operation.
[29] Using descend module 102 and climb module 104, system 100 is able to order operations that are dependent upon one another. The ordering of operations effectively keeps branches of the DDG together and roughly orders the operations by decreasing height.
[30] Fig. 3 illustrates a method for computing a scheduling priority according to one embodiment. In one embodiment, a computer implemented process orders operations in an ordered list. In step S300, a list of operations for a program code is received. In step S302, the dependencies among the operations are determined. For example, the dependencies may be represented by the DDGs of Figs. 2A, 2B, and 2C. In step S304, priority values for the operations are determined. For example, the height and MPH of each operation are determined.
[31] In step S306, one or more operations dependent upon one another are determined from the list of operations. In one embodiment, the one or more operations include an operation that has a highest priority value assigned to it. Also, the one or more operations may be operations with the highest MPH not already in the ordered list.
[32] In step S308, an operation that is not in the ordered list is identified from the one or more operations. In one embodiment, the operation is an operation with the highest priority value that is not in the ordered list.
[33] In step S310, if the identified operation has predecessor operations that are not in the ordered list, the predecessor operations are added to the list. In one embodiment, the predecessor operations include all predecessor operations the operation is dependent on. Also, the predecessor operations may be ordered from a greatest to lowest height.
[34] In step S312, once the predecessor operations are added to the list, the identified operation is added to the list. The process then reiterates to step S308, where the process is repeated for another operation of the one or more operations not already in the ordered list. In one embodiment, the operation is a successor operation to the already identified operation.
[35] Figs. 4A, 4B and 4C illustrate one embodiment of a method for computing a scheduling priority. In step S400, an instance of a DDG is constructed from operations to be ordered. In step S402, latencies for edges in the instance of the DDG are determined. Additionally, heights for each operation are computed from the latencies (step S404). In step S406, MPHs for each operation are determined.
[36] In step S408, the process determines if there are any operations to order. If there are no operations to order, the process ends at step S410.
[37] If there are operations to order, an operation N corresponding to a node in the DDG with the greatest height that is not in the ordered list is identified (step S412). It will be understood that any operation may be identified that is not yet in the ordered list and determining an operation with the greatest height is not required. After determining the operation N with the greatest height, the process performs a descend method with operation N in step S414.
[38] Fig. 4B illustrates a flow chart of a process for the descend method according to one embodiment. Descend module 102 performs the descend method in one embodiment. In step S416, the descend method performs a climb method with operation N as the current operation.
[39] Fig. 4C illustrates a flow chart of a process for the climb method according to one embodiment. Climb module 104 performs the climb method in one embodiment. In step S430, the process determines if the current operation is in the ordered list. When the climb method is first called by the descend method, the current operation is the operation N from the descend method. However, the current operation may be determined from the climb method in step S438, described below. If the current operation is in the ordered list, the method proceeds to step S431, where the process determines if the current operation was determined from the descend method or the climb method. In the recursive nature of the method, the current operation may be the operation determined in step S416 from the descend method or a predecessor operation of the operation from step S438 of the climb method.
[40] If the current operation was determined in the descend method, the method returns to step S416 of the descend method in Fig. 4B. In this case, the climb method has been performed for step S416 of the descend method and the method proceeds to step S418.
[41] If the current operation was determined in the climb method, the method returns to step S438 of the climb method. In this case, the climb method has been performed for the predecessor operation and the method reiterates to step S434.
[42] In step S432, any immediate predecessors of the current operation are sorted by decreasing MPH. In step S434, the process determines if the sorted list of predecessors is empty. The list may be empty if there are no predecessors, or if the list has been emptied by the (possibly repeated) application of step S436.
[43] If the list is empty, in step S440, the current operation is appended to the ordered list. The process then proceeds to step S431, described above. If the sorted list of predecessors is not empty in step S434, the first predecessor P in the sorted list is removed in step S436. In step S438, the climb method is recursively invoked with the removed predecessor P as the current operation. This recursive process continues until the operation N from the descend method and all of its predecessor operations are added to the ordered list.
[44] After the operation N from the descend method is added to the ordered list, the climb method has been performed for the current operation in step S416 and returns to the descend method. Referring back to Fig. 4B, in step S418 the process determines if the current operation has any immediate successor operations. If there are no immediate successor operations, the process returns to step S414 of Fig. 4A. In this case, the descend method has been performed in step S414 and the process reiterates to step S408. In one example, the process returns to step S414 when a node of the lowest priority on the DDG is reached.
[45] In step S420, the process selects an immediate successor operation of the current operation. The immediate successor operation will now be the current operation. In one embodiment, the immediate successor operation of a greatest height is selected.
[46] In step S422, the process recursively invokes the descend method with the immediate successor operation as the current operation. The process then continues as described above.
[47] An example of the above methods will now be described with reference to the DDGs in Figs. 2A-2C. The operation corresponding to the node with the greatest height that is not in the scheduling order list is determined first. In this case, the operations corresponding to nodes 13 and 14 both have the greatest height of five.
In one embodiment, heuristics may be used to determine which node is chosen first.
[48] For example, the node with the greatest number of outputs is selected first. One reason is that scheduling nodes with more outputs earlier generally allows more nodes to be scheduled earlier. Also, if one of the nodes in the tie is terminal in the life of a variable, that node is selected. This shortens the lifetime of the variable, possibly reducing the number of registers required. Further, the node with the lowest number of valid locations for scheduling may be selected. Nodes that have fewer valid locations are more difficult to schedule because fewer functional units 106 exist to execute the operations. It will be understood that the above three techniques may be easily combined or used separately. For the purposes of this example, node 13 is chosen first. The descend method is then called with the operation corresponding to node 13 as the current operation.
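One way the three tie-breaking heuristics might be combined is as a single sort key, sketched below. The Candidate record, its field names, the sample values, and the priority order among the heuristics are all assumptions made for illustration; the description only states that the techniques may be combined or used separately.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """Hypothetical per-node facts a scheduler might track (names assumed)."""
    node: int
    num_outputs: int       # number of outputs of the operation
    ends_lifetime: bool    # True if the node is terminal in a variable's life
    valid_locations: int   # number of valid locations for scheduling

def tie_break_key(c):
    # Smaller tuples win: more outputs first, then lifetime-ending nodes,
    # then nodes with fewer valid locations. This priority order is one
    # possible combination, not one prescribed by the description.
    return (-c.num_outputs, not c.ends_lifetime, c.valid_locations)

def pick(candidates):
    """Choose one node among candidates that are tied on height."""
    return min(candidates, key=tie_break_key).node

# Sample values are invented for the two nodes tied at height five.
tied = [Candidate(13, 2, False, 4), Candidate(14, 1, True, 2)]
print(pick(tied))
```

Reordering the tuple in tie_break_key changes which heuristic dominates, and dropping an element applies the remaining heuristics on their own.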
[49] The descend method first calls the climb method for the current operation corresponding to node 13. Node 13 is not yet in the scheduling order list and the climb method determines if there are any immediate predecessors to node 13. In this case, there are no immediate predecessors and the climb method adds node 13 to the end of the scheduling order list. The climb method then returns to the descend method. The scheduling order list now includes node 13.
[50] Next, the descend method determines the successor of node 13. In one embodiment, the successor with the greatest height is determined. In this case, node 10 is the only successor node for node 13. The descend method is then called for node 10.
[51] The descend method calls the climb method for node 10. The climb method determines that node 10 is not in the scheduling order list and determines if node 10 has any immediate predecessor operations. Nodes 13 and 14 are immediate predecessor operations to node 10 and they are sorted by decreasing MPH. Nodes 13 and 14 both have the same MPH and a heuristic may be used to determine which node is chosen first. For purposes of this example, node 13 is chosen and the method determines if node 13 has any immediate predecessors. Node 13 does not have any immediate predecessors and is already in the ordered list; thus, the method proceeds to the next sorted immediate predecessor.
[52] The climb method is then called for the next immediate predecessor, node 14. Node 14 does not have any immediate predecessors and is not in the ordered list. Node 14 is then added to the scheduling order list. The method determines that all the nodes in the sorted list have been processed and the current node 10 is added to the scheduling order list. The process returns to the descend method. The scheduling order list now contains nodes 13, 14, and 10.
[53] The successor node of node 10 with the greatest height is now determined. In this case, node 7 is the only successor and is chosen. The descend method is now performed for node 7 as the current operation. The process then calls the climb method for node 7. Node 7 is not in the scheduling order list and has immediate predecessors, which are ordered by decreasing MPH. An immediate predecessor of a greatest MPH is identified. In this case, the only immediate predecessor to node 7 is node 10. Node 10 is already in the scheduling order list, and the method determines whether there are any more immediate predecessor nodes in the sorted list. There are no more immediate predecessor nodes in the sorted list; thus, node 7 is added to the scheduling order list. The process then returns to the descend method where successors of node 7 are determined. The scheduling order list now includes nodes 13, 14, 10, and 7.
[54] The descend method identifies the successor node of node 7 with the greatest height. Node 7 has one successor node, node 3, and the descend method is performed for node 3.
[55] The climb method is then called for node 3. Node 3 is not in the scheduling order list and the process determines if node 3 has any immediate predecessor operations. Nodes 6 and 7 are immediate predecessors and the method orders the immediate predecessors by decreasing MPH. In this case, the order is node 7 (MPH of 5) followed by node 6 (MPH of 3). Node 7 is already in the scheduling order list and the method proceeds to node 6. Node 6 is not in the scheduling order list and has no immediate predecessor nodes. Thus, node 6 is added to the scheduling order list. The method then adds node 3 to the scheduling order list because there are no more immediate predecessor nodes in the sorted list. The scheduling order list now includes nodes 13, 14, 10, 7, 6, and 3.
[56] The climb method then returns to the descend method where the successor nodes of node 3 are determined. Node 2 is the only successor node and thus the node with the greatest height.
[57] The descend method is performed for node 2. The climb method for node 2 is then called. The method determines that node 2 is not in the scheduling order list and determines any immediate predecessor nodes to node 2. Nodes 3 and 4 are immediate predecessor nodes for node 2. The process then orders the predecessors of node 2 by decreasing MPH. In this case, the order is node 3 (MPH of 5) followed by node 4 (MPH of 4). Node 3 is already in the ordered list and the process proceeds with node 4. Node 4 is not in the scheduling order list and any immediate predecessor nodes of node 4 are determined.
[58] The following steps effectively add all the predecessor nodes of node 4 that are not in the scheduling order list to the scheduling order list.
[59] First, the immediate predecessors of node 4 are determined and sorted by decreasing MPH. Nodes 8 and 9 have the same MPH. For the purposes of this example, the order used is node 9 followed by node 8. Node 9 is not on the scheduling order list and the immediate predecessors of node 9 are determined and sorted by decreasing MPH. Node 11 and node 12 are immediate predecessors of node 9 and have the same MPH of 4. For purposes of this example, the order used is node 11 followed by node 12.
[60] It is determined that node 11 has no immediate predecessors and node 11 is added to the scheduling order list. Node 12 is next in the sorted list and has no immediate predecessors. Node 12 is not on the scheduling order list and is added. The predecessors for node 9 have now been processed and node 9 is added to the scheduling order list. At this point, the scheduling order list includes nodes 13, 14, 10, 7, 6, 3, 11, 12, and 9.
[61] The method then returns to the sorted immediate predecessor list of node 4 and determines the immediate predecessors of node 4 that have not been processed. Node 8 has not been processed and is not in the scheduling order list. Also, node 8 does not have any immediate predecessors and the method adds node 8 to the scheduling order list. All sorted immediate predecessors have now been processed for node 4 and node 4 is now added to the end of the scheduling order list. The scheduling order list now includes nodes 13, 14, 10, 7, 6, 3, 11, 12, 9, 8, and 4.
[62] All the predecessors of node 2 have now been added to the scheduling order list and node 2 is added to the scheduling order list.
[63] The process then returns to the descend method and successors of node 2 are determined. In this case, there is only one successor, node 1. The descend method for node 1 is performed. Subsequently, the descend method performs the climb method for node 1. The process determines that node 2 is the only immediate predecessor node to node 1. Node 2 is in the scheduling order list and node 1 has no other immediate predecessor operations. Thus, node 1 is added to the scheduling order list because there are no more predecessor operations. The scheduling order list now includes nodes 13, 14, 10, 7, 6, 3, 11, 12, 9, 8, 4, 2, and 1. The descend method has now reached the bottom of the DDG and the process returns to determine if there are any nodes to order.
[64] Node 5 has not been included in the scheduling order list. In one embodiment, node 5 may have been included after node 9. However, for this example, the descend method is called for node 5 because node 5 is not yet in the scheduling order list. The climb method is called by the descend method for node 5. Node 9 is determined to be an immediate predecessor of node 5 and is in the scheduling order list. There are no other immediate predecessors and thus node 5 is then added to the scheduling order list. The climb method returns to the descend method and no successors to node 5 are determined. Thus, the method returns and determines that there are no more nodes to order. The final scheduling order list includes nodes 13, 14, 10, 7, 6, 3, 11, 12, 9, 8, 4, 2, 1, and 5.
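The walkthrough above can be reproduced with a short sketch of the descend and climb methods. This is an illustrative approximation under stated assumptions rather than the flow charts of Figs. 4A-4C themselves: the edge list is reconstructed from the walkthrough, all edge weights are one, ties are broken by the lowest node number (which happens to match the choices made for purposes of this example), and a node with no predecessors is assumed to take its own height as its MPH. The reconstructed graph may differ from Fig. 2A in details; for instance, it gives node 9 a higher MPH than node 8 rather than a tie, although the resulting order is the same.

```python
from functools import lru_cache

# (predecessor, successor) pairs reconstructed from the walkthrough (assumed).
EDGES = [(13, 10), (14, 10), (10, 7), (7, 3), (6, 3), (3, 2), (4, 2),
         (8, 4), (9, 4), (11, 9), (12, 9), (9, 5), (2, 1)]
NODES = range(1, 15)
succs = {n: [s for p, s in EDGES if p == n] for n in NODES}
preds = {n: [p for p, s in EDGES if s == n] for n in NODES}

@lru_cache(maxsize=None)
def height(n):
    """Longest path (edge weight 1) from n to a node with no successors."""
    return max((1 + height(s) for s in succs[n]), default=0)

@lru_cache(maxsize=None)
def mph(n):
    """Greatest height among all transitive predecessors of n.

    Falls back to n's own height when n has no predecessors (assumed).
    """
    seen, stack = set(), list(preds[n])
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(preds[p])
    return max((height(p) for p in seen), default=height(n))

def climb(n, order):
    """Append n's unlisted predecessors (highest MPH first), then n."""
    if n in order:
        return
    for p in sorted(preds[n], key=lambda p: (-mph(p), p)):
        climb(p, order)
    order.append(n)

def descend(n, order):
    """Climb at n, then continue with its tallest immediate successor."""
    climb(n, order)
    if succs[n]:
        descend(min(succs[n], key=lambda s: (-height(s), s)), order)

def scheduling_order():
    """Repeat descend from the tallest unlisted node until all are listed."""
    order = []
    while len(order) < len(NODES):
        start = min((n for n in NODES if n not in order),
                    key=lambda n: (-height(n), n))
        descend(start, order)
    return order

print(scheduling_order())
```

Under these assumptions the printed list is 13, 14, 10, 7, 6, 3, 11, 12, 9, 8, 4, 2, 1, 5, matching the final scheduling order list of the walkthrough.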
[65] In one embodiment, the program code may include cycles. In one embodiment, the cycles may be broken using methods known in the art. For example, techniques based on treating cycles (also known as strongly connected components) as super-vertices may be used. See, for example, pages 37 and 38 of B. Ramakrishna Rau, Iterative Modulo Scheduling, HP Labs Technical Report HPL-94-115, and section 2.1 of Vicki H. Allan, Reese B. Jones, Randall M. Lee, and Stephen J. Allan, Software Pipelining, ACM Computing Surveys, 27(3):367-432, September 1995. The super-vertices and the cycle-free graph containing the super-vertices may then be processed as described above.
[66] Embodiments of the present invention generally keep branches of a DDG together in a scheduling list. Thus, an instruction scheduler will tend to place operations that are near each other in the scheduling order list near each other in the resulting schedule. This method of scheduling reduces the distance that data travels during the lifetime of variables for the operations. The goal of keeping schedules short is also met, and the most critical branches are scheduled first, because the scheduling order list is generally ordered by graph height. Also, the amount of data motion is reduced and the need for data movement resources is lessened. Thus, power is saved by reducing data movement, and shorter schedules may result from avoiding exhaustion of data movement resources. Shorter schedules execute in less time and use less power.
[67] An example of a program code that may be input into system 100 is provided below. The code describes an implementation of a finite impulse response (FIR) filter. Upon receiving the code, system 100 orders the operations, generally keeping branches of a representation of a DDG for the code together. Fig. 5 illustrates an output of a resultant instruction schedule for the ordered operations and program code.
In the table, the first column represents a phase of the program code that is being executed. The first row illustrates a type of functional unit that each column represents. Other rows represent an instruction that is executed each clock cycle.
[68] The following is an example of the program code:
Start
// Set the addresses to point to the start of the data (use immediate constant 0 for now).
Move outOrigin, outAddr
Move inOrigin, inBase
Move inBase, inAddr
// Prime the coef history by reading a short array of zeros.
Read zeroIndex, coef
Read zeroIndex, coef
Read zeroIndex, coef
// Set the coefAddr to point to start of coef array.
Move coefOrigin, coefAddr
// start the main loop with a zero overhead loop. Run mainLpCnt iterations.
startLoop mainLpCnt, mainLoop
mainLoop
// Clear out all four accumulators
CLRACC sum0
CLRACC sum1
CLRACC sum2
CLRACC sum3
// start the mac loop.
startLoop macLpCnt, macLoop
macLoop
Read coefAddr, coef
Inc coefAddr
Read inputAddr, input
Inc inputAddr
// use the history property of coef to allow four multiplies for each read
MAC input, coef+0, sum0
MAC input, coef+1, sum1
MAC input, coef+2, sum2
MAC input, coef+3, sum3
loopNext macLoop, macLoopEnd
// now some straightline code to implement the diagonal finish of the mac loop
macLoopEnd
RSS sum0, output // RSS -> round, saturate and shift. Details TBD
Put outAddr, output
Inc outAddr
Read zeroIndex, coef // read from 0 to kick the history along and fill it with 0 for next pass
Read inputAddr, input
Inc inputAddr
MAC input, coef+1, sum1
MAC input, coef+2, sum2
MAC input, coef+3, sum3
//
RSS sum1, output
Put outAddr, output
Inc outAddr
Read zeroIndex, coef // read from 0 to kick the history along and fill it with 0 for next pass
Read inputAddr, input
Inc inputAddr
MAC input, coef+2, sum2
MAC input, coef+3, sum3
//
RSS sum2, output
Put outAddr, output
Inc outAddr
Read zeroIndex, coef // read from 0 to kick the history along and fill it with 0 for next pass
Read inputAddr, input
MAC input, coef+3, sum3
//
RSS sum3, output
Put outAddr, output
Inc outAddr
// end diagonal finish of MAC loop
// rewind coef and input to process another four elements in the input
Move coefOrigin, coefAddr
Add inBase, 4, inBase
Move inBase, inAddr
loopNext mainLoop, mainLoopEnd
mainLoopEnd
halt
[69] The above description is illustrative but not restrictive. Many variations of the invention will become apparent to those skilled in the art upon review of the disclosure. The scope of the invention should, therefore, be determined not with reference to the above description, but instead should be determined with reference to the pending claims along with their full scope or equivalents.