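sim-outorder.c
The main function
Fetch --> dispatch --> issue --> writeback --> commit
Code text --> fetch queue --> RUU/LSQ (--> ready queue) --> event queue --> event removed --> RUU/LSQ
After functionally simulating any fast-forwarded instructions, the simulator steps through a for loop one cycle at a time. Within each cycle the pipeline stages are executed in reverse order: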
/* commit entries from RUU/LSQ to architected register file */
ruu_commit();
/* service function unit release events */
ruu_release_fu();
/* ==> may have ready queue entries carried over from previous cycles */
/* service result completions, also readies dependent operations */
/* ==> inserts operations into ready queue --> register deps resolved */
ruu_writeback();
/* try to locate memory operations that are ready to execute */
/* ==> inserts operations into ready queue --> mem deps resolved */
lsq_refresh();
/* issue operations ready to execute from a previous cycle */
/* <== drains ready queue <-- ready operations commence execution */
ruu_issue();
/* decode and dispatch new operations */
/* ==> insert ops w/ no deps or all regs ready --> reg deps resolved */
ruu_dispatch();
/* call instruction fetch unit if it is not blocked */
if (!ruu_fetch_issue_delay)
  ruu_fetch();
else
  ruu_fetch_issue_delay--;
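The stages run in reverse order because sequential code is being used to simulate hardware whose stages operate concurrently. If the stages ran in pipeline order, an instruction fetched in a given cycle could be dispatched immediately within that same cycle: the result of an earlier stage would modify state that the next stage was supposed to read unchanged. Running the stages in reverse order avoids this.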
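The reverse-order argument can be checked on a toy model. The sketch below is not SimpleScalar code: the two-stage pipeline, the single latch, and the names `stage_fetch`, `stage_dispatch`, and `run_cycles` are all hypothetical. It only shows that calling the later stage first makes an instruction spend a full cycle in the latch, while calling the stages in pipeline order lets it pass through both in one cycle.

```c
#include <assert.h>

/* Hypothetical two-stage pipeline: fetch fills a latch, dispatch drains it. */
struct pipe {
  int latch_full;  /* 1 if fetch has handed an instruction to dispatch */
  int fetched;     /* instructions fetched so far */
  int dispatched;  /* instructions dispatched so far */
};

static void stage_fetch(struct pipe *p) {
  if (!p->latch_full) { p->latch_full = 1; p->fetched++; }
}

static void stage_dispatch(struct pipe *p) {
  if (p->latch_full) { p->latch_full = 0; p->dispatched++; }
}

/* Simulate n cycles; reverse != 0 calls the later stage first. */
static struct pipe run_cycles(int n, int reverse) {
  struct pipe p = {0, 0, 0};
  assert(n >= 0);
  for (int c = 0; c < n; c++) {
    if (reverse) { stage_dispatch(&p); stage_fetch(&p); }
    else         { stage_fetch(&p); stage_dispatch(&p); }
  }
  return p;
}
```

With `run_cycles(1, 1)` nothing has been dispatched after the first cycle, which matches hardware where fetch and dispatch of the same instruction happen in different cycles; with `run_cycles(1, 0)` the instruction is fetched and dispatched in the same cycle.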
Key data structures:
/* a reservation station link: this structure links elements of a RUU
   reservation station list; used for ready instruction queue, event queue, and
   output dependency lists; each RS_LINK node contains a pointer to the RUU
   entry it references along with an instance tag, the RS_LINK is only valid if
   the instruction instance tag matches the instruction RUU entry instance tag;
   this strategy allows entries in the RUU can be squashed and reused without
   updating the lists that point to it, which significantly improves the
   performance of (all to frequent) squash events */
struct RS_link {
  struct RS_link *next; /* next entry in list */
  struct RUU_station *rs; /* referenced RUU resv station */
  INST_TAG_TYPE tag; /* inst instance sequence number */
  union {
    tick_t when; /* time stamp of entry (for eventq) */
    INST_SEQ_TYPE seq; /* inst sequence */
    int opnum; /* input/output operand number */
  } x;
};
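RS_link is used in three places: the ready queue (ready_queue), the event queue (event_queue), and each reservation station's output dependency lists (that is, all the other reservation stations whose inputs depend on that instruction's output).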
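The tag check described in the RS_link comment can be sketched in a few lines. This is an illustration only: `struct station`, `struct link`, `link_init`, `link_valid`, and `station_squash` are hypothetical stand-ins for the simulator's own types and macros. The point is that squashing an entry only bumps its tag, and every stale link is then detected lazily when it is next examined.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int tag_t;        /* stand-in for INST_TAG_TYPE */

struct station { tag_t tag; };     /* stand-in for struct RUU_station */

struct link {                      /* stand-in for struct RS_link */
  struct link *next;
  struct station *rs;
  tag_t tag;                       /* copy of rs->tag taken at link time */
};

/* Point a link at a station and remember the station's current tag. */
static void link_init(struct link *l, struct station *rs) {
  assert(l != NULL && rs != NULL);
  l->next = NULL;
  l->rs = rs;
  l->tag = rs->tag;
}

/* A link is live only while its tag still matches the station's tag. */
static int link_valid(const struct link *l) {
  return l->tag == l->rs->tag;
}

/* Squash: bump the tag, which invalidates every outstanding link at once. */
static void station_squash(struct station *rs) {
  rs->tag++;
}
```

No list is walked at squash time; consumers of the ready queue, event queue, and dependency lists simply skip links for which the validity check fails.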
/* a register update unit (RUU) station, this record is contained in the
   processors RUU, which serves as a collection of ordered reservations
   stations.  The reservation stations capture register results and await
   the time when all operands are ready, at which time the instruction is
   issued to the functional units; the RUU is an order circular queue, in which
   instructions are inserted in fetch (program) order, results are stored in
   the RUU buffers, and later when an RUU entry is the oldest entry in the
   machines, it and its instruction's value is retired to the architectural
   register file in program order, NOTE: the RUU and LSQ share the same
   structure, this is useful because loads and stores are split into two
   operations: an effective address add and a load/store, the add is inserted
   into the RUU and the load/store inserted into the LSQ, allowing the add
   to wake up the load/store when effective address computation has finished */
struct RUU_station {
  /* inst info */
  md_inst_t IR; /* instruction bits */
  enum md_opcode op; /* decoded instruction opcode */
  md_addr_t PC, next_PC, pred_PC; /* inst PC, next PC, predicted PC */
  int in_LSQ; /* non-zero if op is in LSQ */
  int ea_comp; /* non-zero if op is an addr comp */
  int recover_inst; /* start of mis-speculation? */
  int stack_recover_idx; /* non-speculative TOS for RSB pred */
  struct bpred_update_t dir_update; /* bpred direction update info */
  int spec_mode; /* non-zero if issued in spec_mode */
  md_addr_t addr; /* effective address for ld/st's */
  INST_TAG_TYPE tag; /* RUU slot tag, increment to
                        squash operation */
  INST_SEQ_TYPE seq; /* instruction sequence, used to
                        sort the ready list and tag inst */
  unsigned int ptrace_seq; /* pipetrace sequence number */
  int slip;
  /* instruction status */
  int queued; /* operands ready and queued */
  int issued; /* operation is/was executing */
  int completed; /* operation has completed execution */
  /* output operand dependency list, these lists are used to
     limit the number of associative searches into the RUU when
     instructions complete and need to wake up dependent insts */
  int onames[MAX_ODEPS]; /* output logical names (NA=unused) */
  struct RS_link *odep_list[MAX_ODEPS]; /* chains to consuming operations */
  /* input dependent links, the output chains rooted above use these
     fields to mark input operands as ready, when all these fields have
     been set non-zero, the RUU operation has all of its register
     operands, it may commence execution as soon as all of its memory
     operands are known to be read (see lsq_refresh() for details on
     enforcing memory dependencies) */
  int idep_ready[MAX_IDEPS]; /* input operand ready? */
};
/*
 * the create vector maps a logical register to a creator in the RUU (and
 * specific output operand) or the architected register file (if RS_link
 * is NULL)
 */
/* an entry in the create vector */
struct CV_link {
  struct RUU_station *rs; /* creator's reservation station */
  int odep_num; /* specific output operand */
};
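Each register has zero or one CV_link. CV stands for create vector: the entry names the reservation station that will most recently produce that register's value.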
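A minimal sketch of how a create vector could be consulted, assuming a tiny register file. The names `NREGS`, `install_producer`, and `link_input`, and the simplified `struct station`, are hypothetical and chosen for illustration; they are not the simulator's definitions. An entry with a NULL station means the value is already in the architected register file, so the consumer's operand is ready at once.

```c
#include <assert.h>
#include <stddef.h>

#define NREGS 8                      /* hypothetical register count */
#define MAX_IDEPS 3

struct station { int idep_ready[MAX_IDEPS]; };   /* simplified RUU entry */

struct cv_link {
  struct station *rs;                /* in-flight producer, or NULL */
  int odep_num;                      /* which output operand of rs */
};

/* One entry per logical register; zero-initialized, so rs == NULL everywhere. */
static struct cv_link create_vector[NREGS];

/* Record that output operand `odep` of `rs` will produce register `reg`. */
static void install_producer(struct station *rs, int odep, int reg) {
  assert(reg >= 0 && reg < NREGS);
  create_vector[reg].rs = rs;
  create_vector[reg].odep_num = odep;
}

/* Input `idep` of `consumer` reads `reg`: ready only if nothing in flight writes it. */
static void link_input(struct station *consumer, int idep, int reg) {
  assert(reg >= 0 && reg < NREGS && idep >= 0 && idep < MAX_IDEPS);
  consumer->idep_ready[idep] = (create_vector[reg].rs == NULL);
}
```

In the real simulator a non-ready input would also be chained onto the producer's output dependency list so that writeback can wake it later; that step is left out here.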
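There are two sets of registers: regs and spec_regs_R (likewise spec_regs_F and spec_regs_C). The former holds the real architected registers; the latter holds register values produced during speculative execution.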
struct regs_t {
  md_gpr_t regs_R; /* (signed) integer register file */
  md_fpr_t regs_F; /* floating point register file */
  md_ctrl_t regs_C; /* control register file */
  md_addr_t regs_PC; /* program counter */
  md_addr_t regs_NPC; /* next-cycle program counter */
};
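In non-speculative mode, instructions are actually executed during the decode step of dispatch, including their register reads and writes, which go to regs. In speculative mode, i.e. when spec_mode is true, register accesses go through the following macros: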
#define GPR(N) (BITMAP_SET_P(use_spec_R,R_BMAP_SZ, (N))\
? spec_regs_R[N] \
: regs.regs_R[N])
#define SET_GPR(N,EXPR) (spec_mode \
? ((spec_regs_R[N] = (EXPR)), \
BITMAP_SET(use_spec_R, R_BMAP_SZ, (N)),\
spec_regs_R[N]) \
: (regs.regs_R[N] = (EXPR)))
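That is, a register write in speculative mode goes to spec_regs_R and sets the register's bit in the use_spec_R bitmap; otherwise the write goes to regs. On a register read, if the register's bit is set the value is read from spec_regs_R; otherwise it is read from the ordinary register file regs.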
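The overlay behaviour of these macros can be tried out in isolation. The sketch below assumes a 32-entry integer register file and replaces the BITMAP macros with a single unsigned mask; `get_gpr`, `set_gpr`, and `spec_recover` are hypothetical function names standing in for GPR, SET_GPR, and the recovery code, not the simulator's own identifiers.

```c
#include <assert.h>

#define NREGS 32                      /* assumed register count */

static int regs_R[NREGS];             /* architected registers */
static int spec_regs_R[NREGS];        /* speculative copies */
static unsigned int use_spec_R;       /* bit n set: read spec_regs_R[n] */
static int spec_mode;                 /* non-zero while on the wrong path */

/* Read register n, preferring the speculative copy if one was written. */
static int get_gpr(int n) {
  assert(n >= 0 && n < NREGS);
  return ((use_spec_R >> n) & 1u) ? spec_regs_R[n] : regs_R[n];
}

/* Write register n; in speculative mode only the overlay is touched. */
static void set_gpr(int n, int value) {
  assert(n >= 0 && n < NREGS);
  if (spec_mode) {
    spec_regs_R[n] = value;
    use_spec_R |= 1u << n;
  } else {
    regs_R[n] = value;
  }
}

/* Leave speculative mode: clearing the mask discards every overlay value. */
static void spec_recover(void) {
  use_spec_R = 0;
  spec_mode = 0;
}
```

Recovery is cheap because nothing has to be copied back: the architected registers were never modified on the wrong path, so clearing the mask is enough.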
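Data structures used by each stage:
ruu_fetch: code_text -> fetch_data. Reads instructions from the code segment into the fetch queue.
ruu_dispatch: fetch_data -> RUU (-> ready_queue, with state queued). Moves instructions from the fetch queue into reservation stations. An ordinary arithmetic instruction is issued directly if its operands are ready, and so is a store whose operands are ready (here "issued" means placed on ready_queue). Loads, stores, and long-latency instructions go to the front of the ready queue; all others are inserted in instruction-sequence order.
ruu_issue: ready_queue -> event_queue (an RS_link pointing at the reservation station), with state issued. Executes the ready instructions on ready_queue. A store completes immediately (the real memory access happens at commit). A load checks the addresses of earlier stores: on a match the latency is 1; otherwise it accesses the cache, records the latency, and schedules an event that many cycles later. For an ordinary arithmetic instruction the latency is its execution time, and the event is likewise scheduled after that latency.
lsq_refresh: LSQ -> ready_queue. Refreshes the LSQ and handles WAW and WAR. A later store to the same address overrides an earlier one; a load is issued if none of the stores ahead of it has the same address.
ruu_writeback: event_queue -> the event is removed and the rs state is set to completed. If the instruction is a mispredicted branch, the instructions that entered the RUU after it are squashed, speculative writes to memory and registers are undone, the top of the return-address stack is restored to the stack_recover_idx saved for that branch, and the branch penalty is applied. If the branch predictor is configured to update at writeback, it is updated here. Finally the instruction's output dependency lists are cleared, the matching operands of dependent instructions are marked ready, and any instruction whose operands are now all ready enters ready_queue.
ruu_release_fu: decrements the busy value of every busy resource in the resource pool.
ruu_commit: removes completed instructions from the RUU and LSQ. It processes all completed instructions; for a store it performs the real memory access and writes the cache, but generates no event (every instruction that uses the data at that address has already obtained it). If the branch predictor is configured to update at commit, it is updated here. (Register write-back was already done in the dispatch stage.)
Branch predictor updates can happen in the dispatch, writeback, or commit stage.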
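The hand-off between ruu_issue and ruu_writeback relies on a time-ordered event queue. The following is a simplified sketch of such a queue, not the simulator's implementation: `struct event`, `eventq_queue`, and `eventq_next` are hypothetical names, and the tag validation that the real RS_link carries is omitted.

```c
#include <assert.h>
#include <stddef.h>

struct event {
  struct event *next;
  long when;            /* cycle at which the result completes */
  int id;               /* hypothetical payload; the simulator stores an rs */
};

static struct event *event_queue;   /* kept sorted by ascending `when` */

/* Issue side: insert an event, keeping the list sorted by completion time. */
static void eventq_queue(struct event *ev) {
  struct event **pp = &event_queue;
  assert(ev != NULL);
  while (*pp != NULL && (*pp)->when <= ev->when)
    pp = &(*pp)->next;
  ev->next = *pp;
  *pp = ev;
}

/* Writeback side: pop the head if it is due at or before `now`, else NULL. */
static struct event *eventq_next(long now) {
  struct event *ev = event_queue;
  if (ev != NULL && ev->when <= now) {
    event_queue = ev->next;
    ev->next = NULL;
    return ev;
  }
  return NULL;
}
```

Because the list is sorted, writeback only ever inspects the head: once the head is not yet due, no later event can be due either.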
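ruu_fetch():
while (the IFQ is not full; fetch at most ruu_decode_width * fetch_speed instructions)
  /* fetch an instruction at the next predicted fetch address */
  fetch_regs_PC = fetch_pred_PC;
  if (the PC is valid)
    read the instruction from memory into inst (the cache only models the access; it holds no data)
    if a cache and TLB exist
      simulate the cache and TLB accesses to obtain the fetch latency lat
      if lat != cache_il1_lat
        stall fetch: ruu_fetch_issue_delay += lat - 1;
  else
    the instruction is a NOP
  if a branch predictor pred exists
    extract the opcode op
    if op is a control instruction
      fetch_pred_PC = the address predicted by the predictor (this also yields stack_recover_idx)
    else
      fetch_pred_PC is the instruction following the current one
  else
    fetch_pred_PC is the instruction following the current one
  the current instruction enters the fetch queue:
  fetch_data[fetch_tail].IR = inst;
  fetch_data[fetch_tail].regs_PC = fetch_regs_PC;
  fetch_data[fetch_tail].pred_PC = fetch_pred_PC;
  fetch_data[fetch_tail].stack_recover_idx = stack_recover_idx;
  fetch_data[fetch_tail].ptrace_seq = ptrace_seq++;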
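The fetch queue that fetch fills and dispatch drains behaves like a circular buffer. Here is a small sketch under stated assumptions: `IFQ_SIZE`, `struct fetch_rec`, `ifq_push`, and `ifq_pop` are hypothetical, the queue size is assumed to be a power of two so that wrap-around can use a mask, and only three of the per-entry fields are modelled.

```c
#include <assert.h>

#define IFQ_SIZE 4   /* hypothetical queue size; must be a power of two */

struct fetch_rec {
  unsigned int IR;          /* instruction bits */
  unsigned long regs_PC;    /* address the instruction was fetched from */
  unsigned long pred_PC;    /* predicted next fetch address */
};

static struct fetch_rec fetch_data[IFQ_SIZE];
static int fetch_num, fetch_head, fetch_tail;

/* Fetch side: append at the tail; returns 0 if the queue is full. */
static int ifq_push(unsigned int inst, unsigned long pc, unsigned long pred_pc) {
  if (fetch_num == IFQ_SIZE)
    return 0;
  fetch_data[fetch_tail].IR = inst;
  fetch_data[fetch_tail].regs_PC = pc;
  fetch_data[fetch_tail].pred_PC = pred_pc;
  fetch_tail = (fetch_tail + 1) & (IFQ_SIZE - 1);
  fetch_num++;
  return 1;
}

/* Dispatch side: remove from the head; returns 0 if the queue is empty. */
static int ifq_pop(struct fetch_rec *out) {
  assert(out != 0);
  if (fetch_num == 0)
    return 0;
  *out = fetch_data[fetch_head];
  fetch_head = (fetch_head + 1) & (IFQ_SIZE - 1);
  fetch_num--;
  return 1;
}
```

Flushing the queue after a misprediction then amounts to resetting the count and making head equal to tail.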
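ruu_dispatch():
while the fetch queue is not empty, the RUU and LSQ are not full, and the per-cycle dispatch limit has not been reached
  if in in-order mode and the last instruction's operands are not ready, exit the loop
  // read the entry at the head of the fetch queue:
  inst = fetch_data[fetch_head].IR;
  regs.regs_PC = fetch_data[fetch_head].regs_PC;
  pred_PC = fetch_data[fetch_head].pred_PC;
  dir_update_ptr = &(fetch_data[fetch_head].dir_update);
  stack_recover_idx = fetch_data[fetch_head].stack_recover_idx;
  pseq = fetch_data[fetch_head].ptrace_seq;
  regs.regs_NPC = regs.regs_PC + sizeof(md_inst_t);
  // decode and actually execute the instruction
  switch (op)
  // next PC differs from PC + 4 (sizeof(md_inst_t)), i.e. the branch was taken
  br_taken = (regs.regs_NPC != (regs.regs_PC + sizeof(md_inst_t)));
  // predicted PC differs from PC + 4, i.e. the predictor predicted taken
  br_pred_taken = (pred_PC != (regs.regs_PC + sizeof(md_inst_t)));
  if a misprediction occurs under perfect prediction, or a direct jump is predicted taken but with the wrong target (an obvious error)
    correct the next PC and the fetch queue, and set the fetch delay to the branch penalty
    fetch_redirected = TRUE; // record that fetch has been redirected
  if the opcode is not a NOP
    fill in the fields of reservation station rs
    if the operation is a memory access
      rs->op = MD_AGEN_OP; // an add that computes the address
      rs->ea_comp = TRUE;
      fill in the fields of lsq
      set up the input/output dependencies of rs and lsq
      ruu_link_idep(rs, 0, NA); // which register's CV_link rs depends on
      ruu_link_idep(rs, 1, in2);
      ruu_link_idep(rs, 2, in3);
      /* install output after inputs to prevent self reference */
      ruu_install_odep(rs, 0, DTMP); // create the register's CV_link
      if the operands of rs are ready
        readyq_enqueue(rs);
      if the operands of lsq are ready (store instructions only!)
        readyq_enqueue(lsq);
    else
      set up the input/output dependencies of rs
      if the operands of rs are ready
        readyq_enqueue(rs);
  else (NOP) rs = NULL
  if not currently in speculative mode
    if the current instruction is a branch and the predictor is configured to update in the dispatch stage
      update the branch predictor tables
    if (pred_PC != regs.regs_NPC && !fetch_redirected)
      // the prediction was wrong and has not been corrected
      spec_mode = TRUE; // start speculative execution
      rs->recover_inst = TRUE;
      recover_PC = regs.regs_NPC;
end while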
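The three comparisons that drive misprediction handling in dispatch can be pulled out into one small function for experimentation. In this sketch `INST_SIZE` stands in for sizeof(md_inst_t) and is assumed to be 4; `struct branch_check` and `check_branch` are hypothetical names introduced only for the example.

```c
#include <assert.h>

#define INST_SIZE 4UL   /* stand-in for sizeof(md_inst_t); assumed value */

struct branch_check {
  int br_taken;         /* the instruction really changed control flow */
  int br_pred_taken;    /* the predictor predicted a change */
  int enter_spec_mode;  /* dispatch must start wrong-path execution */
};

/* pc: address of the instruction; next_pc: actual successor computed by
   executing it; pred_pc: successor chosen at fetch time;
   fetch_redirected: non-zero if decode already repaired the fetch stream. */
static struct branch_check check_branch(unsigned long pc, unsigned long next_pc,
                                        unsigned long pred_pc,
                                        int fetch_redirected) {
  struct branch_check r;
  r.br_taken = (next_pc != pc + INST_SIZE);
  r.br_pred_taken = (pred_pc != pc + INST_SIZE);
  r.enter_spec_mode = (pred_pc != next_pc && !fetch_redirected);
  return r;
}
```

A correctly predicted taken branch sets both taken flags but not `enter_spec_mode`; a taken branch predicted as fall-through sets `enter_spec_mode`, unless the fetch stream was already redirected at decode.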
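ruu_issue:
node = ready_queue; // take over the ready_queue list
ready_queue = NULL; // and leave ready_queue empty
for as long as the issue width (default 4) has not been reached
| if node is valid
| | get rs from node and set rs->queued = FALSE;
| | if it is a store
| | | it completes immediately: set the rs state to completed
| | else // not a store
| | | if the instruction needs a functional unit (fu)
| | | | if a suitable fu was obtained
| | | | | rs->issued = TRUE; and set the fu's busy value
| | | | | if the instruction is a load
| | | | | | check for an earlier store to the same address; if found, the latency is 1
| | | | | | otherwise look in the TLB and cache, and schedule the event queue entry accordingly
| | | | | end if
| | | | else put rs back on the ready queue
| | | else no fu is needed: schedule the result in the event queue directly, with latency 1
| | end if
| end if (node valid)
| RSLINK_FREE(node); // free the processed link
end for
for each remaining node
| put the ready instructions left on the node list back onto ready_queue
end for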
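The detach, drain, and requeue pattern used by ruu_issue can be illustrated with a much smaller model. Everything here is a hypothetical simplification: `struct op`, `free_fus`, `readyq_enqueue`, and `issue_cycle` are example names, functional units are reduced to a single counter, and loads, stores, and the event queue are left out.

```c
#include <assert.h>
#include <stddef.h>

struct op {
  struct op *next;
  int seq;        /* program-order sequence number */
  int needs_fu;   /* non-zero if a functional unit is required */
  int issued;
};

static struct op *ready_queue;   /* sorted by ascending seq */
static int free_fus;             /* hypothetical count of free functional units */

/* Insert an operation, keeping the queue in sequence order. */
static void readyq_enqueue(struct op *o) {
  struct op **pp = &ready_queue;
  assert(o != NULL);
  while (*pp != NULL && (*pp)->seq < o->seq)
    pp = &(*pp)->next;
  o->next = *pp;
  *pp = o;
}

/* One issue cycle: detach the list, issue up to issue_width operations,
   and requeue anything that could not get a unit or was beyond the width. */
static int issue_cycle(int issue_width) {
  struct op *node = ready_queue, *next = NULL;
  int n_issued = 0;
  ready_queue = NULL;
  for (; node != NULL && n_issued < issue_width; node = next) {
    next = node->next;
    if (node->needs_fu && free_fus == 0) {
      readyq_enqueue(node);        /* no unit free: retry next cycle */
      continue;
    }
    if (node->needs_fu)
      free_fus--;
    node->issued = 1;
    n_issued++;
  }
  for (; node != NULL; node = next) {   /* leftovers beyond the issue width */
    next = node->next;
    readyq_enqueue(node);
  }
  return n_issued;
}
```

Detaching the list first means operations that are requeued during the cycle are not examined again until the next call.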
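lsq_refresh:
std_unknowns[MAX_STD_UNKNOWNS]; an array of addresses whose values are not yet known (an unfinished store targets them)
for each operation in the LSQ
| if it is a store
| | if its address is not ready
| | | stop: a store with an unknown address blocks all later loads and stores
| | else if its operands are not ready
| | | record the store's address in std_unknowns
| | else the operands and address are both ready, so clear any matching address from std_unknowns
| end if
| if it is a load with queued == FALSE that has not issued, has not completed, and whose operands are ready
| | check std_unknowns for the load's address; if there is no match, the load enters the ready queue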
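The scan above can be written as a compact function over an array. This is a sketch under assumptions rather than the simulator's code: `struct lsq_op`, `lsq_scan`, and the `can_issue` flag are hypothetical, the queue is a plain array ordered oldest first, address 0 is used as the "cleared" marker, and a full unknown-address table is silently ignored instead of being treated as an error.

```c
#include <assert.h>

#define MAX_STD_UNKNOWNS 64

enum { OP_LOAD, OP_STORE };

struct lsq_op {
  int kind;             /* OP_LOAD or OP_STORE */
  int addr_ready;       /* effective address has been computed */
  int data_ready;       /* store data operand is available */
  unsigned long addr;
  int can_issue;        /* set for loads that may enter the ready queue */
};

/* Scan the LSQ oldest-first and mark loads with no conflicting earlier store. */
static void lsq_scan(struct lsq_op *q, int n) {
  unsigned long std_unknowns[MAX_STD_UNKNOWNS];
  int n_unknown = 0;
  assert(q != 0 && n >= 0);
  for (int i = 0; i < n; i++) {
    struct lsq_op *op = &q[i];
    if (op->kind == OP_STORE) {
      if (!op->addr_ready)
        break;                      /* unknown address blocks everything younger */
      if (!op->data_ready) {
        if (n_unknown < MAX_STD_UNKNOWNS)
          std_unknowns[n_unknown++] = op->addr;
      } else {
        for (int j = 0; j < n_unknown; j++)
          if (std_unknowns[j] == op->addr)
            std_unknowns[j] = 0;    /* a later complete store supplies the value */
      }
    } else if (op->addr_ready) {
      int conflict = 0;
      for (int j = 0; j < n_unknown; j++)
        if (std_unknowns[j] == op->addr)
          conflict = 1;
      if (!conflict)
        op->can_issue = 1;
    }
  }
}
```

The early break is what makes one store with an unresolved address conservative for every younger memory operation.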
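ruu_writeback:
while (rs = eventq_next_event()) // an event is due in the current sim_cycle
| mark rs as completed
| if (rs->recover_inst), i.e. the instruction is a mispredicted branch
| | ruu_recover // squash every instruction after this one in the RUU and LSQ
| | tracer_recover // discard speculative register values, speculative memory writes, and the fetch queue; leave speculative mode; set fetch_pred_PC = fetch_regs_PC = recover_PC;
| | bpred_recover // pred->retstack.tos = stack_recover_idx;
| | set the branch misprediction delay (3)
| end if
| if the branch predictor is configured to update in the WB stage, update it
| for each output operand of the instruction
| | if the create vector entry for that output register still names this instruction as its creator
| | | clear that create vector entry
| | for each instruction that depends on this result, mark the operand ready
| | | if all of the dependent's operands are now ready and (it is not a memory instruction, or it is a store)
| | | | add it to the ready queue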
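The wake-up step at the end of writeback walks an output dependency chain and marks one input of each consumer ready. The sketch below shows only that step, with hypothetical names (`struct dep_link`, `operands_ready`, `wake_consumers`) and without tag validation or the memory-instruction exception mentioned in the pseudocode.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_IDEPS 3

struct station {
  int idep_ready[MAX_IDEPS];  /* one flag per input operand */
  int queued;                 /* already placed on the ready queue */
};

struct dep_link {             /* one consumer of a produced value */
  struct dep_link *next;
  struct station *rs;         /* the consuming station */
  int opnum;                  /* which of its inputs the value feeds */
};

static int operands_ready(const struct station *rs) {
  return rs->idep_ready[0] && rs->idep_ready[1] && rs->idep_ready[2];
}

/* Walk one output dependency chain; returns how many consumers became ready. */
static int wake_consumers(struct dep_link *chain) {
  int woken = 0;
  for (struct dep_link *l = chain; l != NULL; l = l->next) {
    assert(l->opnum >= 0 && l->opnum < MAX_IDEPS);
    l->rs->idep_ready[l->opnum] = 1;
    if (operands_ready(l->rs) && !l->rs->queued) {
      l->rs->queued = 1;      /* stands in for readyq_enqueue() */
      woken++;
    }
  }
  return woken;
}
```

Because each producer keeps its own list of consumers, completion touches only the stations that actually wait on the result instead of searching the whole RUU.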
ruu_release_fu: for every resource in the resource pool whose busy value is non-zero, decrement it by 1.
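ruu_commit:
while the RUU is not empty and the commit width has not been exceeded
| get rs
| if the instruction is an address computation (the LSQ holds the matching load/store)
| | if that load/store in the LSQ has not completed, break
| | if it is a store
| | | request a store port (an fu)
| | | if an fu was obtained
| | | | set the fu's busy value to its issue latency, issuelat
| | | | access the TLB and cache to write the data back
| | | else there is no free store port, so break
| | end if
| | remove the first element of the LSQ (a load has already completed)
| if the branch predictor is configured to update in the CT stage, update it
| remove the first element of the RUU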
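In-order retirement from the head of a circular queue can be sketched as follows. The model is deliberately reduced and its names are hypothetical: `struct entry`, `store_ports`, `stores_written`, and `commit_cycle` do not exist in the simulator, the LSQ is folded into the same queue, and a store's cache access is represented by a counter.

```c
#include <assert.h>

#define RUU_SIZE 8

struct entry {
  int completed;   /* execution finished */
  int is_store;    /* needs a store port at commit */
};

static struct entry ruu[RUU_SIZE];
static int ruu_head, ruu_num;
static int store_ports;      /* hypothetical free store ports this cycle */
static int stores_written;   /* stands in for the cache write at commit */

/* Retire up to commit_width entries from the head, strictly in order. */
static int commit_cycle(int commit_width) {
  int committed = 0;
  while (ruu_num > 0 && committed < commit_width) {
    struct entry *rs = &ruu[ruu_head];
    if (!rs->completed)
      break;                 /* oldest entry not done: nothing younger may retire */
    if (rs->is_store) {
      if (store_ports == 0)
        break;               /* no store port: retry next cycle */
      store_ports--;
      stores_written++;
    }
    ruu_head = (ruu_head + 1) % RUU_SIZE;
    ruu_num--;
    committed++;
  }
  return committed;
}
```

A completed younger instruction waits behind an incomplete older one, which is what keeps the architected state precise.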
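1. Every jump instruction needs a functional unit. However, if a direct jump is predicted taken and the predicted target differs from the address encoded in the instruction, the prediction is plainly wrong, so the fetch queue can be flushed and the branch penalty applied as early as the decode stage.
2. Macro expansion in the decode stage: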
switch (op)
{
1. #define DEFINST(OP,MSK,NAME,OPFORM,RES,CLASS,O1,O2,I1,I2,I3) \
case OP: \
  /* compute output/input dependencies to out1-2 and in1-3 */ \
  out1 = O1; out2 = O2; \
  in1 = I1; in2 = I2; in3 = I3; \
  /* execute the instruction */ \
  SYMCAT(OP,_IMPL); \
  break;
(omitted) ...
2. #include "machine.def"
default:
}
machine.def contains:
4. #define LDA_IMPL \
{ \
  SET_GPR(RA, GPR(RB) + SEXT(OFS)); \
}
3. DEFINST(LDA, 0x08,
  "lda", "a,o(b)",
  IntALU, F_ICOMP,
  DGPR(RA), DNA, DNA, DGPR(RB), DNA)
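Here item 1 is a macro definition, so it contributes nothing at that point (it behaves like whitespace). Item 2 includes the file machine.def, which amounts to pasting the contents of machine.def into the switch statement. Macro definition 4 again behaves like whitespace, while the code at item 3 matches the macro defined at item 1 and expands into a case OP clause. The SYMCAT(OP,_IMPL) in item 1 in turn names the macro at item 4 and expands to its body, so the switch statement becomes: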
switch (op)
case LDA:
  out1 = DGPR(RA); out2 = DNA;
  in1 = DNA; in2 = DGPR(RB); in3 = DNA;
  SET_GPR(RA, GPR(RB) + SEXT(OFS));
  break;
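The same expansion trick can be reproduced in a single file. Note the substitution: instead of including a separate machine.def, this sketch keeps the instruction table in a list macro (`INST_TABLE`), and the two toy opcodes, their `_IMPL` macros, and the `execute` function are all hypothetical. The `##` token paste plays the role that SYMCAT plays in the text above.

```c
#include <assert.h>

enum opcode { OP_ADDI, OP_SUBI };

/* Hypothetical instruction table standing in for machine.def:
   each row is (opcode, output dependency, input dependency). */
#define INST_TABLE(X) \
  X(OP_ADDI, 1, 2)    \
  X(OP_SUBI, 3, 4)

/* Per-instruction semantics, looked up by pasting "_IMPL" onto the opcode. */
#define OP_ADDI_IMPL (result = a + b)
#define OP_SUBI_IMPL (result = a - b)

/* Decode and execute: every table row expands into one case clause. */
static int execute(enum opcode op, int a, int b, int *out1, int *in1) {
  int result = 0;
  assert(out1 != 0 && in1 != 0);
  switch (op) {
#define DEFINST(OP, O1, I1)                          \
  case OP:                                           \
    /* record dependencies, then run the semantics */ \
    *out1 = O1; *in1 = I1;                           \
    OP##_IMPL;                                       \
    break;
    INST_TABLE(DEFINST)
#undef DEFINST
  default:
    break;
  }
  return result;
}
```

Adding an instruction then means adding one table row and one `_IMPL` macro; the switch statement itself never changes, which is the property the machine.def arrangement provides.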
SimpleScalar CPU simulator source code analysis, bubuko.com