Models

Utility blocks

Utility blocks are C++ utility classes designed to be shared among C++ models, allowing the reuse of functional and timing behavior.

These blocks cannot be instantiated directly from Python generators; instead, they must be instantiated within C++ model classes.

UART adapters

UART adapters simplify the implementation of UART-based peripherals by abstracting the details of the UART protocol.

A variety of adapters are available, offering different levels of control. It is recommended to begin with a high-level adapter, such as UartAdapterBuffered, and transition to a lower-level adapter if more precise control over the UART protocol is required.

UART adapter

class UartAdapter : public vp::Block

UART adapter.

This adapter can be used to interact with a UART interface.

It takes care of the low-level UART protocol bit interactions and provides a high-level interface for sending and receiving bytes.

The high-level interactions with the upper model are done at byte level using callbacks encapsulated into timed events.

Public Functions

UartAdapter(vp::Component *top, vp::Block *parent, std::string name, std::string itf_name)

Construct a new UART adapter.

Parameters:

top – Top component where the UART interface must be created.
parent – Specify the parent block containing this block. This can be the top component if it is its direct parent.
name – Specify the name of the uart adapter within the parent hierarchy.
itf_name – Specify the name of the UART interface. An interface with this name will be created in the top component. This interface must then be connected in the python generator with the same name to another UART interface.

void baudrate_set(int baudrate)

Set the UART baudrate.

This gives the number of bits per second sent or received. Baudrate is set to 115200 by default.

Parameters:: baudrate – Baudrate in number of bits per second.

void data_bits_set(int data_bits)

Set the number of data bits.

This sets the number of data bits exchanged on the interface for each byte exchanged with the upper-level model. It is set to 8 by default and can only be equal or inferior to 8. If it is inferior to 8, the received data bits are shifted right and filled with zero on the left.

Parameters:: data_bits – Number of data bits.

void stop_bits_set(int stop_bits)

Set the number of stop bits.

This sets the number of stop bits exchanged on the interface before switching to the next byte. It is set to 1 by default and can take any value equal or superior to zero.

Parameters:: stop_bits – Number of stop bits.

void parity_set(bool parity)

Set parity enable.

This enables or disables the parity bit. When enabled, an additional parity bit is transmitted or received right after the data bits to check the data integrity. It is set to false by default. The parity can be retrieved along with the received byte by calling rx_get with the parity argument.

Parameters:: parity – true if parity is enabled.

void ctrl_flow_set(bool ctrl_flow)

Set control flow enable.

This enables or disables the control flow. The control flow is disabled by default. When disabled, the cts line, is set to 0 to allow receiving bytes. Once control flow is enabled, the line is set to 1 to forbid receiving bytes. The upper model can then call rx_flow_enable to tell if reception is allowed or forbidden.

Parameters:: ctrl_flow – true if control flow is enabled.

void loopback_set(bool enabled)

Set loopback enable.

This enables or disables the loopback. When enabled, received bits are handled as usual but are also immeditely mirrored on the transmit line to model a loopback from RX to TX.

Parameters:: enabled – true if loopback is enabled.

void rx_ready_event_set(vp::TimeEvent *event)

Set user event for receiving bytes.

This sets the timed event which should be called everytime a byte has been received and is ready to be retrieved. The event is enqueued to the time scheduler with 0 delay, which means it behaves as a function call. The byte must be retrieved immediately. If not, it may be overwritten by the next byte. The timed event is enqueued as soon as the last stop bit is received.

Parameters:: event – The user event to be enqueued everytime a byte is received.

uint8_t rx_get()

Get the last received byte.

This returns the last byte received from the UART interface. This must usually be called from the RX ready event callback to get the byte and let the adapter capture the next one.

Returns:: The last received byte.

void rx_flow_enable(bool enabled)

Set the flow control signal.

This can be used to directly control the RX control flow signal on the UART interface (CTS). The flow control can be enable to allow the remote part to send bytes or can be disabled to forbit it. The remote part may not immediately take it into account and may be still send a few bytes after it has been disabled. For example, FTDI chips can usually send up to 4 bytes after the flow control has been disabled.

Parameters:: enabled – True if the RX flow control must be enabled, i.e. bytes can be received.

bool tx_ready()

Tell if the adapter is ready to send a byte.

This basically tells if the adapter is in idle, i.e. nothing is being sent. This method can be called to check if a new TX byte can be given to the adapter. If a new byte is given while the adapter is not ready, some data of the previous byte may be overwritten. When a new TX byte is pushed, the adapter becomes ready again once the last bit of the byte is sent to the interface. Note that this method will only check if the adapter is ready to send. It may return true even though the TX flow control (the flow control signal returned by the remote part) forbids sending bytes. This can be used to model behaviors where few bytes can be sent after flow control forbids sending bytes.

Returns:: true if a new TX byte can be given to the adapter.

bool tx_flow_ctrl_ready()

Tell if the TX flow control allows sending bytes.

This tells if the remote part allows the adapter to send a byte through the flow control signal. This always returns true when the flow control is disabled. Note that the adapter does not check TX flow control signal from the remote part. As soon as a byte is pushed by the upper model, the byte is sent on the UART interface. It is up to the upper model to check the flow control signal with this method before pushing bytes.

Returns:: true if the remote part allows sending bytes.

void tx_send_byte(uint8_t byte)

Send a byte.

This is pushing a byte to be sent by the adapter. Only one byte can be pushed at a time. If a byte is pushed while the adapter is not ready, some data will be corrupted. The next byte can be pushed once the last bit of the current byte is sent on the interface. The TX ready event is enqueued when this happens, so that the upper model can push the next byte.

Parameters:: byte – The byte to be sent

void tx_ready_event_set(vp::TimeEvent *event)

Set user event for sending bytes.

This sets the timed event which should be called everytime the adapter becomes ready to send a byte. This happens when the last bit of the current byte has been sent on the interface. The event is enqueued to the time scheduler with 0 delay, which means it behaves as a function call.

Parameters:: event – The user event to be enqueued everytime the adapter becomes ready to send a byte.

void tx_flow_ctrl_event_set(vp::TimeEvent *event)

Set user event for TX control flow.

This sets the timed event which should be called everytime the TX flow control signal is is changing. This is mostly useful to continue sending after it has been forbidden by remote part. tx_flow_ctrl_ready can be called to know if TX is allowed or not. The event is enqueued to the time scheduler with 0 delay, which means it behaves as a function call.

Parameters:: event – The user event to be enqueued everytime the TX flow control is changing.

UART buffered adapter

class UartAdapterBuffered : public vp::Block

UART buffered adapter.

This adapter can be used to interact with a UART interface at a higher-level than UartAdapter. This adapter is using the UartAdapter

for the low-level UART interaction and adds on top of it a queue of bytes for both TX and RX, so that the upper model does not need to take care about control flow.

The upper model can for example push as many bytes as needed, and this adapter will take care of sending them when it is possible.

This model is using a FIFO to store received bytes, which then do not need to be popped immediately, as the FIFO is by default infinite.

A size can be given to the FIFO to drop any received byte when it is full and easily model a peripheral FIFO.

A bandwith limit can also be given to let this adapter drives the CTS line to respect the specified number of bytes per second.

The flow control signal driven by the remote side (RTS) is by default immediately taken into account to stop sending bytes. It is then possible to specify a number of bytes which are still sent after the sending is forbidden to reproduce some behaviors like for FTDI chips.

Public Functions

UartAdapterBuffered(vp::Component *top, vp::Block *parent, std::string name, std::string itf_name)

Construct a new UART buffered adapter.

Parameters:

top – Top component where the UART interface must be created.
parent – Specify the parent block containing this block. This can be the top component if it is its direct parent.
name – Specify the name of the uart adapter within the parent hierarchy.
itf_name – Specify the name of the UART interface. An interface with this name will be created in the top component. This interface must then be connected in the python generator with the same name to another UART interface.

void baudrate_set(int baudrate)

Set the UART baudrate.

This gives the number of bits per second sent or received. Baudrate is set to 115200 by default.

Parameters:: baudrate – Baudrate in number of bits per second.

void data_bits_set(int data_bits)

Set the number of data bits.

This sets the number of data bits exchanged on the interface for each byte exchanged with the upper-level model. It is set to 8 by default and can only be equal or inferior to 8. If it is inferior to 8, the received data bits are shifted right and filled with zero on the left.

Parameters:: data_bits – Number of data bits.

void stop_bits_set(int stop_bits)

Set the number of stop bits.

This sets the number of stop bits exchanged on the interface before switching to the next byte. It is set to 1 by default and can take any value equal or superior to zero.

Parameters:: stop_bits – Number of stop bits.

void parity_set(bool parity)

Set parity enable.

This enables or disables the parity bit. When enabled, an additional parity bit is transmitted or received right after the data bits to check the data integrity. It is set to false by default. The parity can be retrieved along with the received byte by calling rx_get with the parity argument.

Parameters:: parity – true if parity is enabled.

void ctrl_flow_set(bool ctrl_flow)

Set control flow enable.

This enables or disables the control flow. The control flow is disabled by default. When disabled, the cts line, is set to 0 to allow receiving bytes. Once control flow is enabled, the line is set again to 0 to wllow receiving bytes, since this adapter has by default an infinite FIFO to store incoming bytes. rx_fifo_size_set or rx_flow_limiter_set can then be called to either limit the FIFO size of the bandwidth and let this adapter drives the cts according to these properties.

Parameters:: ctrl_flow – true if control flow is enabled.

void loopback_set(bool enabled)

Set loopback enable.

This enables or disables the loopback. When enabled, received bits are handled as usual but are also immeditely mirrored on the transmit line to model a loopback from RX to TX.

Parameters:: enabled – true if loopback is enabled.

void rx_ready_event_set(vp::TimeEvent *event)

Set user event for receiving bytes.

This sets the timed event which should be called everytime a byte has been received and is ready to be retrieved. The event is enqueued to the time scheduler with 0 delay, which means it behaves as a function call. The byte can be retrieved immediately or can be left in the adapter fifo and retrieved later. The timed event is enqueued as soon as the last stop bit is received.

Parameters:: event – The user event to be enqueued everytime a byte is received.

bool rx_ready()

Tell if RX bytes are ready to be popped.

Any received byte is pushed into a FIFO. This function can be called to know if the FIFO contains at least 1 byte which can be popped.

Returns:: true if there is at least one byte to pop.

uint8_t rx_pop()

Poppped a byte from the RX fifo.

This should be called only when rx_ready returns true, which means there is at least one byte in the FIFO. The byte is removed from the FIFO and returned.

Returns:: The byte which is popped from the FIFO.

void rx_fifo_size_set(int size)

Set the RX FIFO size.

This can be used to drive the CTS line according to the state of FIFO. By default the FIFO has an infinite size and the CTS line thus always 0. Once a size is set, the CTS line will be 0 only when the FIFO is not full so that the sender stops sending bytes when the FIFO becomes full

@parameter size FIFO size in number of bytes.

void rx_flow_limiter_set(double bandwidth)

Set the allowed bandwidth for RX.

This can be used to drive the CTS line according to a bandwidth. The model will compute the instant bandwidth and will set CTS line to 0 to allow receiving bytes only if the bandwidth is under what has been specified. In practice, the model will keep switching CTS between 0 and 1 to respect the bandwidth in average, which an easy way to model a UART receiver timing behavior at coarse-grain. This is should not be used at the same time as rx_fifo_size_set as they will both try to driver the CTS.

@parameter bandwidth Bandwidth in number of bytes per second.

bool tx_empty()

Tell if there is no more TX byte on-going.

This tells both if the TX FIFO of this adapter is empty and also if the lower-level adapter is not currently sending. This can be used to flush TX.

Returns:: true if no TX is on-going.

void tx_empty_event_set(vp::TimeEvent *event)

Set user event for TX empty.

This sets the timed event which should be called everytime the adapter has no more TX byte to send. This also means the lower-level adapter has no TX on-going. This happend when the last bit of the current byte has been sent on the interface. The event is enqueued to the time scheduler with 0 delay, which means it behaves as a function call.

Parameters:: event – The user event to be enqueued everytime the TX is empty.

void tx_send_byte(uint8_t byte)

Send a byte.

This is pushing a byte to be sent by the adapter. AS many bytes as needed can be pushed since the adapter is having an infinte TX FIFO.

Parameters:: byte – The byte to be sent

void tx_flow_ctrl_threshold_set(uint8_t byte)

Set TX flow control threshold.

This sets the number of bytes which can still be sent after the RTS line goes to 1 to disable sending. This can be used to model a behavior where the peripheral does not react immediately to the signal and continues sending for a while. This is the case for FTDI chips where they can still send 4 bytes.

Parameters:: byte – The number of bytes which can still be sent after RTS goes to 1.

ISS

Modules

The ISS is made of the following modules interacting together:

prefetch: it contains a local memory, typically 16 bytes, containing instructions prefetched from memory, so that the core cam get them locally. The whole memory is refilled at once when the next instruction is missing.
decode: the first time an instruction is executed, it gets decode by this module in order to produce a structure containing all the information neded to model the instruction.
insn_cache: it provides a cache of decoded instruction to the decode module so that decoded instructions can be found from their PC.
exec: it is controlling the execution of instructions.
lsu: it is in charge of generating memory requests. It handles misaligned accesses and asynchronous replies.
irq: checking and handling of interrupts.
exception: handling of exceptions,
core:: manages reset, mstatus, and return from interrupt.
csr:: manages CSR registers.
gdbserver: manages core commands from GDB like breakpoints and step mode.
mmu: MMU model.
pmp: PMP model.
regfile: register file.
resource: implement a model of shared resource for timing purpose, base on bandwidth and latency, used for shared FPU.
syscalls: OpenOCD semi-hosting plus additional semihosting for traces and VCD traces configuration.
timing: set of functions for managing time modeling, like accounting stalls.
trace: ISS instruction traces.

Core implementation

There is one top class per core implementation which is charge of gathering all the modules needed for this implementation.

This is an example for the ri5cy core.

class Iss
{
public:
    Iss(vp::Component &top);

    Exec exec;
    Timing timing;
    Core core;
    Regfile regfile;
    Prefetcher prefetcher;
    Decode decode;
    Irq irq;
    Gdbserver gdbserver;
    Lsu lsu;
    DbgUnit dbgunit;
    Syscalls syscalls;
    Trace trace;
    Csr csr;
    Mmu mmu;
    Pmp pmp;
    Exception exception;

    iss_pulp_nn_t pulp_nn;
    iss_rnnext_t rnnext;

    vp::Component &top;
};

There are currently classes for ri5cy, snitch, spatz and generic riscv cores.

ISS execution model

The ISS model execution is driven by a single clock event, whose main goal is to execute one instruction per cycle. It is using a clock regular event, which is enabled when instructions should be executed, and disabled oterwise. In order to easily model other activity in the ISS, especially for the handling of complex instructions, the event is actually reused for various events, depending on what the core has to do in the next cycle.

As seen as the following diagram, this event can be used for executing instructions, as well for the page-table walk, LSU complex operations or prefetching.

Chosing the right handler for the next cycle is done by changing the callback of the event.

../../../../_images/iss_event_model1.png

The two main callbacks used for executing instructions are the following:

The default slow one, which is checking everything.
A fast one, which is activated as soon as the exec module detects that nothing special has to be checked. It switches back to slow callback as soon as something special happens, like an interrupt arrived, to check if it should be taken.

The fast callback is doing the following steps:

Get the current PC (program counter). This is an integer so that it is compatible with mmu.
Trigger a fetch to this pc. This is just for timing purpose, since instructions are already decoded.
Get the instruction from the pc, using the instruction cache. This returns a pointer to the decoded instruction.
Execute the instruction handler. This is just a call to a function pointer.
Update the PC with the one returned by the instruction callback.

On top of that, the slow handler is adding these steps:

If an exception is flagged, the current instruction is changed to exception one.
Check if we can switch to fast handler (no gdb, no cycle counter, etc).
Check if there any pending interrupt and handle it if so.
Take care of gdbserver commands.

Instruction decoding

The encoding of the instructions in the ISS must be described in a Python script so that the framework can automatically generate the code for decoding instructions and dumping them in the instructions traces.

The encoding is described in this file: core/models/cpu/iss/isa_gen/isa_riscv_gen.py.

There is a first section in it which describes all the possible formats for the instructions.

Each format specifies how registers and immediates must be extracted from the opcode.

Here is a first example for instructions having 2 input registers and one output register:

if format == 'R':
    self.args = [   OutReg(0, Range(7,  5)),
                    InReg (0, Range(15, 5)),
                    InReg (1, Range(20, 5)),
                    ]

A different class must be used for input and output registers. The first argument gives the register instruction index and the second gives the range in the opcode where the register regfile index must be extracted.

The range is in this example a contiguous single range but it can also be made of several ranges put together. In this example, the first argument is the first bit of the range and the second argument is the width of the range.

Here is another example where an immediate is also extracted:

if format == 'I':
    self.args = [   OutReg(0, Range(7,  5)),
                    InReg (0, Range(15, 5)),
                    SignedImm(0, Range(20, 12)),
                ]

The encoding of the instruction must give the instruction label, the instruction format, a string describing the opcode, and additional options like tags.

The opcode can contain a dash for bits which are not relevant for the decoding because they are used to encode registers or immediates.

Here are examples of encoded instructions. The tag can be used later on to track all instructions of a certain kind, like here the ones doing a load, so that we can assign something like latency.

R5('lw',    'L',    '------- ----- ----- 010 ----- 0000011', tags=["load"]),
R5('addi',  'I',    '------- ----- ----- 000 ----- 0010011'),

Instruction decoding cache

To increase the simulation speed of the ISS, the instructions are decoded only once using a cache of decoded instructions.

Decoding an instruction consists in unpacking the opcode to extract all the register and immediate information and find out which handler should take care of emulating the instruction.

All this information takes time to extract, thus they are done only once and then stored into an unpacked instruction structure.

The role of the cache is to maintain a database of decoded instruction and allow the core to quickly get them from the PC.

For that, as shown on the figure below, the exec module query the instruction cache to get the decoded instruction for a certain PC.

To reduce the time neeed to access instructions, the cache is grouping them into pages of instructions.

Since instructions are more likely to execute in sequence, the cache maintains a pointer to the current instruction page, so that it first check if the required instruction is within the current page. If so, this is the fastest look-up we can get and the instruction is returned. If not, the cache looks for the page using a hash table. if this is a hit, the page becomes the current one and the instruction is returned. If this is a miss, a new page is allocated, inserted into the hash and the instruction is returned.

Since it takes quite some time to decode instructions, and we will most likely not use all the instructions in a page, instructions are actually not decoded immediately. Just a pointer to the instruction structure is returned.

Then to avoid adding checks on the critical path, the exec module will execute the model of the instruction by calling the handler associated to the unpacked instruction structure. This handler is initialized at first to be a decoding function, so that the first time the instruction is executed, it is decoded. Then the next executions will directly jump to the right handler.

ISS customization

The ISS is having some kind of flexibility to allow building on top of it several variants of cores.

The following is an example of how the ISS gets customized for the snitch core:

class Snitch(RiscvCommon):

    def __init__(self,
            parent,
            name,
            isa: str='rv32imafdc',
            misa: int=0,
            binaries: list=[],
            fetch_enable: bool=False,
            boot_addr: int=0):


        isa_instance = cpu.iss.isa_gen.isa_riscv_gen.RiscvIsa("snitch_" + isa, isa)

        super().__init__(parent, name, isa=isa_instance, misa=misa, core="snitch", scoreboard=True)

        self.add_c_flags([
            "-DPIPELINE_STAGES=1",
            "-DCONFIG_ISS_CORE=snitch",
        ])

        self.add_sources([
            "cpu/iss/src/snitch/snitch.cpp",
            "cpu/iss/src/spatz.cpp",
        ])

As we can see, snitch core is described with a specific class inheriting from a generic ISS. It overloads a few parameters, like the core type, and the scoreboard which is enabled. Extra source code is given to include some code specific to this core.

Here is another kind of customization which can be done in the ISS generator:

extensions = [ PulpV2(), Xf16(), Xf16alt(), Xf8(), Xfvec(), Xfaux() ]

isa = cpu.iss.isa_gen.isa_riscv_gen.RiscvIsa(name, 'rv32imfc', extensions=extensions)

The ISA can be generated differently first by giving a different isa string, for what concerns generic isa subsets. Then custom isa can be added by adding them to the extensions parameter.

Hereafter is the generic ISS class with all the possible parameters which can be customized to get ISS variants:

def __init__(self,
        parent,
        name,
        isa,
        misa: int=0,
        first_external_pcer: int=0,
        riscv_dbg_unit: bool=False,
        debug_binaries: list=[],
        binaries: list=[],
        debug_handler: int=0,
        power_models: dict={},
        power_models_file: str=None,
        cluster_id: int=0,
        core_id: int=0,
        fetch_enable: bool=False,
        boot_addr: int=0,
        mmu: bool=False,
        pmp: bool=False,
        riscv_exceptions: bool=False,
        core='riscv',
        supervisor=False,
        user=False,
        internal_atomics=False,
        timed=True,
        scoreboard=False,
        cflags=None,
        prefetcher_size=None,
        wrapper="pulp/cpu/iss/default_iss_wrapper.cpp"):

    if supervisor:
        self.add_c_flags(['-DCONFIG_GVSOC_ISS_SUPERVISOR_MODE=1'])

ISS timing customization

To customize the timing of the ISS, there is a first possibility to do it in the decoding tree.

This has the advantage of not requiring any modification to the C++ code, which can be easier to provide several implementations of the same core.

Here is an example where we use the tags on the instructions to identify a certain kind of instructions and attach some latency to it.

def __build_isa(name):
    extensions = [ PulpV2(), Xf16(), Xf16alt(), Xf8(), Xfvec(), Xfaux() ]

    isa = cpu.iss.isa_gen.isa_riscv_gen.RiscvIsa(name, 'rv32imfc', extensions=extensions)


    for insn in isa.get_insns():

        if "load" in insn.tags:
            insn.get_out_reg(0).set_latency(2)
        elif "mul" in insn.tags:
            insn.get_out_reg(0).set_latency(2)
        elif "mulh" in insn.tags:
            insn.set_latency(5)

    return isa

There is another big advantage of timing an instruction here. The decoder will detect it and will decode the instruction in a specific way which will allow taking into account the latency without requiring any additional check at runtime.

In the example below, this is used to model the data dependency which can happen between a load and the next instruction. Checking that at runtime requires monitoring every register access to see if one instruction tries to read a register the cycle after a load read data from memory to this register. This means adding a check at every cycle.

To avoid that, the decoder detects the dependency while decoding the instruction and only if it is found, replace the normal instruction callback by a slower one adding the latency. The good thing is that most of the time, the compiler will take care of avoiding the data dependency, so we avoid doing one check per cycle for a rare case.

Here is another example where we modify the isa to add a shared resource which is modeling the fact that floating-point instructions are offloading to a shared FPU, which can create some contentions.

def __build_cluster_isa():

    def __attach_resource(insn, resource, latency, bandwidth, tags=[]):
        if len(tags) == 0:
            insn.attach_resource(resource, latency, bandwidth)
        else:
            for tag in tags:
                if tag in insn.tags:
                    insn.attach_resource(resource, latency, bandwidth)

    isa = __build_isa('pulp_cluster')

    # Declare the 3 kind of shared resources with appropriate latency and bandwidth
    isa.add_resource('fpu_base', instances=4)
    isa.add_resource('fpu_sqrt', instances=1)

    # And attach resources to instructions
    for insn in isa.get_tree('f').get_insns() + isa.get_tree('sfloat').get_insns():

        # All float operations are handled by the same unit
        __attach_resource(insn, 'fpu_base', latency=1, bandwidth=1, tags=[
            'fmadd', 'fadd', 'fmul', 'fconv', 'fother',
            'sfmadd', 'sfadd', 'sfmul', 'sfconv', 'sfother',
        ])

        # Except div, rem and sqrt which goes to the sqrt unit
        __attach_resource(insn, 'fpu_sqrt', latency=14, bandwidth=14, tags=[
            'fdiv'
        ])

        # Except div, rem and sqrt which goes to the sqrt unit
        __attach_resource(insn, 'fpu_sqrt', latency=10, bandwidth=10, tags=[
            'sfdiv'
        ])


    return isa

The instruction is functionnally executed in the ISS the standard way, and then the resource model is applied to compute the contentions.

For that, each resource is assign a bandwidth and a latency. The latency tells after how many cycles the instruction result is available, which can stall the core if its pipeline is not deep enough. Then the bandwith indicates how many cycles it needs to take the next instruction.

The resource model is keeping track of the cycle where the next instruction can be accepted, based on the bandwidth and will apply stalls if another instruction is coming before that.

For lots of cases, this way of customizing the timing on the isa is not suitable because for example the number of cycles of the instruction depends on something we only know during execution like the value of the registers.

This is the case for example for the iming of the div instruction on ri5cy core, as it depends on the number of zero. The instruction can still be correctly timed by calling some functions in the timing module:

int cycles;

if (divider == 0)
{
    cycles = 1;
}
else if (divider > 0)
{
    cycles = __builtin_clz(divider) + 3;
}
else
{
    cycles = __builtin_clz((~divider) + 1) + 2;
}

iss->timing.stall_insn_dependency_account(cycles);

In this case, the number of cycles is computed depending on the register value, and the core is stalled for the estimated amount of cycles.

The main drawback of this approach is that it makes it more difficult to maintain several implementions of the core as it is increasing the number of ifdef in the code.

Another example is the timing of the compare and branch instruction, where the stalls must be applied only if the branch is taken:

inline void Timing::stall_taken_branch_account()
{
    this->stall_cycles_account(2);
    this->event_branch_account(1);
    this->event_taken_branch_account(1);
}

L1 interconnect

vp::IoReqStatus interleaver::req(vp::Block *__this, vp::IoReq *req)
{
    interleaver *_this = (interleaver *)__this;
    uint64_t offset = req->get_addr();
    bool is_write = req->get_is_write();
    uint64_t size = req->get_size();
    uint8_t *data = req->get_data();

    _this->trace.msg("Received IO req (offset: 0x%llx, size: 0x%llx, is_write: %d)\n", offset, size, is_write);

    int bank_id = (offset >> 2) & _this->bank_mask;
    uint64_t bank_offset = ((offset >> (_this->stage_bits + 2)) << 2) + (offset & 0x3);

    req->set_addr(bank_offset);
    return _this->out[bank_id]->req_forward(req);
}

Converter

void converter::event_handler(vp::Block *__this, vp::ClockEvent *event)
{
    converter *_this = (converter *)__this;
    vp::IoReq *req = _this->pending_req;
    _this->pending_req = req->get_next();

    _this->trace.msg("Sending partial packet (req: %p, offset: 0x%llx, size: 0x%llx, is_write: %d)\n",
        req, req->get_addr(), req->get_size(), req->get_is_write());

    vp::IoReqStatus err = _this->out.req(req);
    if (err == vp::IO_REQ_OK)
    {
        _this->ready_cycle = _this->clock.get_cycles() + req->get_latency() + 1;
        _this->ongoing_size -= req->get_size();
        if (_this->ongoing_size == 0)
        {
        vp::IoReq *req = _this->ongoing_req;
        _this->trace.msg("Finished handling request (req: %p)\n", req);
        _this->ongoing_req = NULL;
        req->set_latency(req->get_latency() + 1);
        req->get_resp_port()->resp(req);

        if (_this->stalled_req)
        {
            req = _this->stalled_req;
            _this->trace.msg("Unstalling request (req: %p)\n", req);
            _this->stalled_req = req->get_next();
            req->get_resp_port()->grant(req);

            _this->process_pending_req(req);
        }
        }
    }
    else
    {
        _this->ready_cycle = INT32_MAX;
    }

    _this->check_state();
    }

    void converter::check_state()
    {
    if (pending_req)
    {
        int64_t cycle = clock.get_cycles();
        int64_t latency = 1;
        if (ready_cycle > cycle) latency = ready_cycle - cycle;
        if (!event->is_enqueued()) event_enqueue(event, latency);
    }
}

vp::IoReqStatus converter::process_pending_req(vp::IoReq *req)
{
    uint64_t offset = req->get_addr();
    uint64_t size = req->get_size();
    uint8_t *data = req->get_data();
    bool is_write = req->get_is_write();

    int mask = output_align - 1;

    ongoing_req = req;
    ongoing_size = size;

    while (size)
    {
        int iter_size = output_width;
        if (offset & mask) iter_size -= offset & mask;
        if (iter_size > size) iter_size = size;

        vp::IoReq *req = out.req_new(offset, data, iter_size, is_write);
        req->set_next(pending_req);
        pending_req = req;


        size -= iter_size;
        offset += iter_size;
        data += iter_size;
    }

    return vp::IO_REQ_PENDING;
}

vp::IoReqStatus converter::process_req(vp::IoReq *req)
{
    uint64_t offset = req->get_addr();
    uint64_t size = req->get_size();
    uint8_t *data = req->get_data();
    bool is_write = req->get_is_write();

    int mask = output_align - 1;

    // Simple case where the request fit, just forward it
    if ((offset & ~mask) == ((offset + size - 1) & ~mask))
    {
        trace.msg("No conversion applied, forwarding request (req: %p)\n", req);
        return out.req_forward(req);
    }

    return this->process_pending_req(req);
}

vp::IoReqStatus converter::req(vp::Block *__this, vp::IoReq *req)
{
    converter *_this = (converter *)__this;
    uint64_t offset = req->get_addr();
    bool is_write = req->get_is_write();
    uint64_t size = req->get_size();
    uint8_t *data = req->get_data();

    _this->trace.msg("Received IO req (req: %p, offset: 0x%llx, size: 0x%llx, is_write: %d)\n", req, offset, size, is_write);

    if (_this->ongoing_req)
    {
        _this->trace.msg("Stalling request (req: %p)\n", req);

        if (_this->stalled_req)
        _this->last_stalled_req->set_next(req);
        else
        _this->stalled_req = req;
        req->set_next(NULL);
        _this->last_stalled_req = req;

        return vp::IO_REQ_DENIED;
    }

    if (_this->process_req(req) == vp::IO_REQ_OK)
        return vp::IO_REQ_OK;

    _this->check_state();

    return vp::IO_REQ_PENDING;
}