Changeset 3126


Timestamp:
Apr 12, 2013, 1:00:03 AM (4 years ago)
Author:
sacerdot
Message:

Split into an empty report + a "stand alone" paper.
The paper needs to be rephrased a bit to read like one.

Location:
Deliverables/D6.3
Files:
1 added
1 edited

  • Deliverables/D6.3/report.tex

    r3125 r3126  
    22
    33\usepackage{../style/cerco}
     4\usepackage{pdfpages}
    45
    56\usepackage{amsfonts}
     
    114115\end{quotation}
    115116
    116 \section{Middle and Long Term Improvements}
    117 In order to identify the middle and long term improvements, we briefly review
    118 here the premises and goals of the CerCo approach to resource analysis.
    119 \begin{itemize}
    120  \item There is a lot of recent and renewed activity in the formal methods
    121    community to accommodate resource analysis using techniques derived from
    122    functional analysis (type systems, logics, abstract interpretation,
    123    amortized analysis applied to data structures, etc.)
    124  \item Most of this work, which currently remains at a theoretical level,
    125    is focused on high level languages and assumes the existence of correct
    126    and compositional resource cost models.
    127  \item High level languages are compiled to object code by compilers that
    128    should respect the functional properties of the program. However, because
    129    of optimizations and the inherently non compositional nature of compilation,
    130    compilers do not respect compositional cost models that are imposed a priori
    131    on the source language. By controlling the compiler and coupling it with a
    132    WCET analyser, it is actually possible to
    133    choose the cost model in such a way that the cost bounds are high enough
    134    to bound the cost of any produced code. This was attempted, for instance,
    135    in the EmBounded project with good success. However, we believe that bounds
    136    obtained in this way have little chance of being tight.
    137  \item Therefore our approach consists in having the compiler generate the
    138    cost model for the user by combining tracking of basic blocks during code
    139    transformations with a static resource analysis on the object code for basic
    140    blocks. We formally prove the compiler to respect the cost model that is
    141    induced on the source level based on very few assumptions: basically, the
    142    cost of a sequence of instructions should be associative and commutative
    143    and it should not depend on the machine status, except its program counter.
    144    Commutativity can be relaxed at the price of introducing more cost updates
    145    in the instrumented source code.
    146  \item The cost model for basic blocks induced on the source language must then
    147    be exploited to derive cost invariants and to prove them automatically.
    148    In CerCo we have shown how even simple invariant generation techniques
    149    are sufficient to enable the fully automatic proving of parametric WCET
    150    bounds for simple C programs and for Lustre programs of arbitrary complexity.
    151 \end{itemize}
     117\newpage
    152118
    153 Compared to traditional WCET techniques, our approach currently has many
    154 similarities, some advantages and some limitations. Both techniques need to
    155 perform data flow analysis on the control flow graph of the program and both
    156 techniques need to estimate the cost of basic blocks of instructions.
    157 
    158 \subsection{Control flow analysis}
    159 
    160 The first main difference is in the control flow analysis. Traditional WCET
    161 starts from object code and reconstructs the control flow graph from it.
    162 Moreover, abstract interpretation is heavily employed to bound the number of
    163 iterations of loops. In order to improve the accuracy of the estimation, control
    164 flow constraints are provided by the user, usually as systems of (linear)
    165 inequalities. In order to do this, the user, helped by the system, needs to
    166 relate the object code control flow graph with the source one, because it is
    167 on the latter that the bounds can be figured out and be understood. This
    168 operation is untrusted and potentially error prone for complex optimizations
    169 (like aggressive loop optimizations). Efficient tools from linear algebra are
    170 then used to solve the systems of inequalities obtained from the abstract
    171 interpreter and from the user constraints.
    172 
    173 In CerCo, instead, we assume full control over the compiler, which
    174 is able to relate, even in non trivial ways, the object code control flow graph
    175 to the source code control flow graph. A clear disadvantage is the
    176 impossibility of applying the tool on the object code produced by third party
    177 compilers. On the other hand, we get rid of the possibility of errors
    178 in the reconstruction of the control flow graph and in the translation of
    179 high level constraints into low level constraints. The second potentially
    180 important advantage is that, once we are dealing with the source language,
    181 we can augment the precision of our dataflow analysis by combining
    182 functional and non functional invariants. This is what we attempted with
    183 the CerCo Cost Annotating Frama-C Plug-In. The Frama-C architecture allows
    184 several plug-ins to perform all kinds of static analysis on the source code,
    185 reusing results from other plug-ins and augmenting the source code with their
    186 results. The techniques are absolutely not limited to linear algebra and
    187 abstract interpretation, and the most important plug-ins call domain specific
    188 and general purpose automated theorem provers to close proof obligations of
    189 arbitrary shape and complexity.
    190 
    191 In principle, the extended flexibility of the analysis should allow for a major
    192 advantage of our technique in terms of precision, also
    193 considering that all the analyses used in traditional WCET can still be implemented
    194 as plug-ins. In particular, the target we have in mind are systems that are
    195 both (hard) real time and safety critical. Being safety critical, we can already
    196 expect them to be fully or partially specified at the functional level.
    197 Therefore we expect that the additional functional invariants should allow us to
    198 augment the precision of the cost bounds, up to the point where the parametric
    199 cost bound is fully precise.
    200 In practice, we have not had the time to perform extensive
    201 comparisons on the kind of code used by industry in production systems.
    202 The first middle term improvement of CerCo would then consist in this kind
    203 of analysis, to support or disprove our expectations. It seems that the
    204 newly established TACLe COST Action (Timing Analysis on Code Level) would be the
    205 best framework to achieve this improvement.
    206 If our
    207 technique remains promising, the next long term improvement would consist in
    208 integrating into the Frama-C plug-in ad-hoc analyses and combinations of analyses
    209 that would augment the coverage and the efficiency of the cost estimation
    210 techniques.
    211 
    212 \subsection{Static analysis of costs of basic blocks}
    213 At the beginning of the project we deliberately decided to focus our
    214 attention on the control flow preservation, the cost model propagation and
    215 the exploitation of the cost model induced on the high level code. For this
    216 reason we have devoted almost no attention to the static analysis of basic
    217 blocks. This was achieved by picking a very simple hardware architecture
    218 (the 8051 microprocessor family) whose cost model is fully predictable and
    219 compositional: the cost of every instruction --- except those that deal with
    220 I/O --- is constant, i.e. it does not depend on the machine status.
    221 We do not regret this choice because, with the limited amount of man power
    222 available in the project, it would have been difficult to also consider this
    223 aspect. However, without showing whether the approach can scale to more complex
    224 architectures, our methodology remains of limited interest to industry.
    225 Therefore, the next important middle term improvement will be the extension
    226 of our methodology to cover pipelines and simple caches. We will now present our
    227 ideas on how we intend to address the problem. The obvious long term
    228 improvement would be to take into consideration multicore systems and complex
    229 memory architectures like the ones currently in use in networks on chip.
    230 The problem of execution time analysis for these systems is currently
    231 considered extremely hard or even infeasible, and at the moment it seems
    232 unlikely that our methodology could contribute to the solution of the problem.
    233 
    234 \subsubsection{Static analysis of costs of basic blocks revisited}
    235 We will now describe what currently seems to be the most interesting technique
    236 for the static analysis of the cost of basic blocks in the presence of complex
    237 hardware architectures that do not have compositional cost models.
    238 
    239 We start by presenting an idealized model of the execution of a generic
    240 microprocessor (with caches) that has all interrupts disabled and no I/O
    241 instructions. We then classify the models according to some properties of
    242 their cost model. Then we show how to extend the labelling approach of CerCo
    243 to cover models that are classified in a certain way.
    244 
    245 \paragraph{The microprocessor model}
    246 
    247 Let $\sigma,\sigma_1,\ldots$ range over $\Sigma$, the set of the fragments of
    248 the microprocessor states that hold the program counter, the program status word and
    249 all the data manipulated by the object code program, i.e. registers and
    250 memory cells. We call these fragments the \emph{data states}.
    251 
    252 Let $\delta,\delta_1,\ldots$ range over $\Delta$, the set of the
    253 fragments of the microprocessor state that holds the \emph{internal state} of the
    254 microprocessor (e.g. the content of the pipeline and caches, the status of
    255 the branch prediction unit, etc.).
    256 The internal state of the microprocessor influences the execution cost of the
    257 next instruction, but it has no effect on the functional behaviour of the
    258 processor. The whole state of the processor is represented by a pair
    259 $(\sigma,\delta)$.
    260 
    261 Let $I,I_1,\ldots$ range over $\mathcal{I}$,
    262 the set of \emph{instructions} of the processor and let
    263 $\gamma,\gamma_1,\ldots$ range over $\Gamma$, the set of \emph{operands}
    264 of instructions after the fetching and decoding passes.
    265 Thus a pair $(I,\gamma)$ represents a \emph{decoded instruction} and already contains
    266 the data required for execution. Execution needs to access the data state only
    267 to write the result.
    268 
    269 Let $fetch: \Sigma \to \mathcal{I} \times \Gamma$ be the function that
    270 performs the fetching and decoding phases, returning the decoded instruction
    271 ready for execution. This is not meant to be the real fetch-decode function,
    272 which also exploits the internal state to speed up execution (e.g. by retrieving
    273 the instruction arguments from caches) and that, in case of pipelines, works
    274 in several stages. However, such a function exists and
    275 it is observationally equivalent to the real fetch-decode.
    276 
    277 We capture the semantics of the microprocessor with the following set of
    278 functions:
    279 \begin{itemize}
    280  \item The \emph{functional transition} function $\longrightarrow : \Sigma \to \Sigma$
    281    over data states. This is the only part of the semantics that is relevant
    282    to functional analysis.
    283  \item The \emph{internal state transition} function
    284    $\Longrightarrow : \Sigma \times \Delta \to \Delta$ that updates the internal
    285    state.
    286  \item The \emph{cost function} $K: \mathcal{I} \times \Gamma \times \Delta \to \mathbb{N}$
    287    that assigns a cost to transitions. Since decoded instructions hold the data
    288    they act on, the cost of an instruction may depend both on the data being
    289    manipulated and on the internal state.
    290 \end{itemize}
    291 
    292 Given a processor state $(\sigma,\delta)$, the processor evolves in the
    293 new state $(\sigma',\delta')$ in $n$ cost units if
    294 $\sigma \longrightarrow \sigma'$ and $(\sigma,\delta) \Longrightarrow \delta'$
    295 and $fetch(\sigma) = (I,\gamma)$ and $K(I,\gamma,\delta) = n$.
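
Purely as an illustration, and under hypothetical names and types (no real
instruction set is assumed), one execution step of this idealized processor
could be sketched in C as follows:
\begin{verbatim}
/* Hypothetical types: sigma (data state) and delta (internal state). */
typedef struct data_state   data_state;    /* pc, status word, registers, memory */
typedef struct intern_state intern_state;  /* pipeline, caches, branch predictor */
typedef struct { int opcode; int operands; } decoded_insn;   /* (I, gamma) */

extern decoded_insn fetch(const data_state *s);               /* fetch-decode            */
extern void fun_step(data_state *s);                          /* sigma --> sigma'        */
extern void int_step(const data_state *s, intern_state *d);   /* (sigma,delta) ==> delta'*/
extern unsigned K(decoded_insn i, const intern_state *d);     /* cost function           */

/* One execution step of the idealized processor, returning its cost. */
unsigned step(data_state *s, intern_state *d) {
  decoded_insn i = fetch(s);   /* decode using the data state only        */
  unsigned n = K(i, d);        /* the cost depends on (I,gamma) and delta */
  int_step(s, d);              /* update the internal state               */
  fun_step(s);                 /* functional transition of the data state */
  return n;
}
\end{verbatim}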
    296 
    297 An \emph{execution history} is a stream of states and transitions
    298 $\sigma_0 \longrightarrow \sigma_1 \longrightarrow \sigma_2 \ldots$
    299 that can be either finite or infinite. Given an execution history, the
    300 corresponding \emph{execution path} is the stream of program counters
    301 obtained from the execution history by forgetting all the remaining information.
    302 The execution path of the history $\sigma_0 \longrightarrow \sigma_1 \longrightarrow \sigma_2 \ldots$ is $pc_0,pc_1,\ldots$ where each $pc_i$ is the program
    303 counter of $\sigma_i$. We denote the set of finite execution paths with $EP$.
    304 
    305 We claim this simple model to be generic enough to cover real world
    306 architectures.
    307 
    308 \paragraph{Classification of cost models}
    309 A cost function is \emph{exact} if it assigns to transitions the real cost
    310 incurred. It is \emph{approximated} if it returns an upper bound of the real
    311 cost.
    312 
    313 A cost function is \emph{operand insensitive} if it does not depend on the
    314 operands of the instructions to be executed. Formally, $K$ is operand insensitive
    315 if there exists a $K': \mathcal{I} \times \Delta \to \mathbb{N}$ such that
    316 $K(I,\gamma,\delta) = K'(I,\delta)$. In this case, with an abuse of terminology,
    317 we will identify $K$ with $K'$.
    318 
    319 The cost functions of simple hardware architectures, in particular RISC ones,
    320 are naturally operand insensitive. In the other cases an exact operand sensitive
    321 cost function can always be turned into an approximated operand insensitive one
    322 by taking $K'(I,\delta) = \max\{K(I,\gamma,\delta)~|~\gamma \in \Gamma\}$.
    323 The question, when one performs this approximation, is how severe the
    324 approximation is. A measure is given by the \emph{jitter}, which is defined
    325 as the difference between the best and worst cases. In our case, the jitter
    326 of the approximation $K'$ would be $\max\{K(I,\gamma,\delta)~|~\gamma \in \Gamma\} - \min\{K(I,\gamma,\delta)~|~\gamma \in \Gamma\}$. According to experts
    327 of WCET analysis, the jitters relative to operand sensitivity in modern
    328 architectures are small enough to make WCET estimations still useful.
    329 Therefore, in the sequel we will focus on operand insensitive cost models only.
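
As a purely hypothetical example, suppose that on some architecture a division
instruction $I_{\mathit{div}}$ can take anywhere between 2 and 6 cycles depending on its
operands, i.e. $\min_\gamma K(I_{\mathit{div}},\gamma,\delta) = 2$ and
$\max_\gamma K(I_{\mathit{div}},\gamma,\delta) = 6$. The operand insensitive approximation
then sets $K'(I_{\mathit{div}},\delta) = 6$, and the jitter introduced by the approximation
is $6 - 2 = 4$ cycles.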
    330 
    331 Note that cost models that are operand insensitive may still have significant
    332 dependencies on the data manipulated by the instructions, because of the
    333 dependency on internal states. For example, an instruction that reads data
    334 from the memory may change the state of the cache and thus greatly affect the
    335 execution time of successive instructions.
    336 
    337 Nevertheless, operand insensitivity is an important property for the labelling
    338 approach. In~\cite{tranquilli} we introduced \emph{dependent labels} and
    339 \emph{dependent costs}, i.e. the possibility of assigning to basic
    340 blocks of instructions costs that also depend on the state of the high level
    341 program at the beginning of the block. The idea we will now try to pursue is
    342 to exploit dependent costs to capture cost models that are sensitive to
    343 the internal states of the microprocessor. Operand sensitivity, however, is
    344 a major issue in the presence of dependent labels: to lift a data sensitive cost
    345 model from the object code to the source language, we need a function that
    346 maps high level states to the operands of the instructions to be executed,
    347 and we need these functions to be simple enough to allow reasoning over them.
    348 Complex optimizations performed by the compiler, however, make the mappings
    349 extremely cumbersome and history dependent. Moreover, keeping
    350 track of status transformations during compilation would be a significant
    351 departure from classical compilation models which we are not willing to
    352 undertake. By assuming operand insensitivity, or by removing operand sensitivity
    353 through approximation, we get rid of part of the problem: we only need to make our costs dependent on the internal
    354 state of the microprocessor. The latter, however, is not at all visible
    355 in the high level code. Our next step is to make it visible.
    356 
    357 In general, the value of the internal state at a certain point in the program
    358 history is affected by all the preceding history. For instance, the
    359 pipeline stages either hold operands of instructions in execution or
    360 bubbles\footnote{A bubble is formed in the pipeline stage $n$ when an instruction is stuck in the pipeline stage $n-1$, waiting for some data which is not available yet.}. The execution history contains data states that in turn contain the
    361 object code data which we do not know how to relate simply to the source code
    362 data. We therefore introduce a new classification.
    363 
    364 A \emph{view} over internal states is a pair $(\mathcal{V},|.|)$ where
    365 $\mathcal{V}$ is a finite non empty set and $|.|:\Delta \to \mathcal{V}$ is
    366 a forgetful function over internal states.
    367 
    368 The operand insensitive cost function $K$ is \emph{dependent on the view
    369 $\mathcal{V}$} if there exists a $K': \mathcal{I} \times \mathcal{V} \to \mathbb{N}$ such that $K(I,\delta) = K'(I,|\delta|)$. In this case, with an abuse of terminology, we will identify $K$ with $K'$.
    370 
    371 Among the possible views, the ones that we will easily be able to work
    372 with in the labelling approach are the \emph{execution history dependent views}.
    373 A view $(\mathcal{V},|.|)$ is execution history dependent with a lookahead
    374 of length $n$ when there exists
    375 a transition function $\hookrightarrow : PC^n \times \mathcal{V} \to \mathcal{V}$
    376 such that for all $(\sigma,\delta)$ and $pc_1,\ldots,pc_n$ such that every
    377 $pc_i$ is the program counter of $\sigma_i$ defined by $\sigma \longrightarrow^i \sigma_i$, we have $(\sigma,\delta) \Longrightarrow \delta'$ iff
    378 $((pc_1,\ldots,pc_n),|\delta|) \hookrightarrow |\delta'|$.
    379 
    380 Less formally, a view is
    381 dependent on the execution history if the evolution of the views is fully
    382 determined by the evolution of the program counters. To better understand
    383 the definition, consider the case where the next instruction to be executed
    384 is a conditional jump. Without knowing the values of the registers, it is
    385 impossible to determine if the true or false branches will be taken. Therefore
    386 it is likely to be impossible to determine the value of the view that follows
    387 the current one. On the other hand, knowing the program counter that will be
    388 reached executing the conditional branch, we also know which
    389 branch will be taken and this may be sufficient to compute the new view.
    390 Lookaheads longer than 1 will be used in the case of pipelines: when executing
    391 one instruction in a system with a pipeline of length $n$, the internal state
    392 of the pipeline already holds information on the next $n$ instructions to
    393 be executed.
    394 
    395 The reference to execution histories in the names is due to the following
    396 fact: every execution history dependent transition function $\hookrightarrow$
    397 can be lifted to the type $EP \times \mathcal{V} \to \mathcal{V}$ by folding
    398 the definition over the path trace:
    399 $((pc_0,\ldots,pc_m),v_0) \hookrightarrow v_n$ iff for all $i \leq m - n$,
    400 $((pc_i,\ldots,pc_{i+n}),v_i) \hookrightarrow v_{i+1}$.
    401 Moreover, the folding is clearly associative:
    402 $(\tau_1 @ \tau_2, v) \hookrightarrow v''$ iff
    403 $(\tau_1,v) \hookrightarrow v'$ and $(\tau_2,v') \hookrightarrow v''$.
    404 
    405 As a final definition, we say that a cost function $K$ is
    406 \emph{data independent} if it is dependent on a view that is execution path
    407 dependent. In two paragraphs we will show how we can extend the
    408 labelling approach to deal with data independent cost models.
    409 
    410 Before that, we show that the class of data independent cost functions
    411 is not so restricted as to be uninteresting. In particular, at least simple
    412 pipeline models admit data independent cost functions.
    413 
    414 \paragraph{A data independent cost function for simple pipelines}
    415 We consider here a simple model for a pipeline with $n$ stages without branch
    416 prediction and hazards. We also assume that the actual values of the operands
    417 of the instruction that is being read have no influence on stalls (i.e. on the
    418 creation of bubbles) or on the execution cost. The type of the operands, however,
    419 can. For example, reading the value 4 from a register may stall the pipeline
    420 if the register has not been written yet, while reading 4 from a different
    421 register may not stall the pipeline.
    422 
    423 The internal states $\Delta$ of the pipeline are $n$-tuples of decoded
    424 instructions or bubbles:
    425 $\Delta = (\mathcal{I} \times \Gamma \cup \mathbbm{1})^n$.
    426 This representation is not meant to describe the real data structures used
    427 in the pipeline: in the implementation the operands are not present in every
    428 stage of the pipeline, but are progressively fetched. A state
    429 $(x_1,x_2,\ldots,(I,\gamma))$ represents the state of the pipeline just before
    430 the completion of instruction $(I,\gamma)$. The first $n-1$ instructions that
    431 follow $I$ may already be stored in the pipeline, unless bubbles have delayed
    432 one or more of them.
    433 
    434 We introduce the following view over internal states:
    435 $(\{0,1\}^n,|.|)$ where
    436 $|(x_1,\ldots,x_n)| = (y_1,\ldots,y_n)$ and $y_i$ is 1 iff $x_i$ is a bubble.
    437 Thus the view only remembers which stages of the pipeline are stuck.
    438 The view holds enough information to reconstruct the internal state given
    439 the current data state: from the data state we can fetch the program counter
    440 of the current and the next $n-1$ instructions and, by simulating at most
    441 $n$ execution steps and by knowing where the bubbles were, we can fill up
    442 the internal state of the pipeline.
    443 
    444 The assumption that operand values have no influence on stalls and
    445 execution times ensures the existence of the data independent cost function
    446 $K: PC \times \{0,1\}^n \to \mathbb{N}$. The transition function for a
    447 pipeline with $n$ stages may require $n$ lookaheads:
    448 $\hookrightarrow : PC^n \times \{0,1\}^n \to \{0,1\}^n$.
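
To fix ideas, the following C fragment sketches what such a transition function
could look like for a hypothetical two stage pipeline; the predicate
\texttt{stalls} is an assumed, architecture specific oracle telling whether the
incoming instruction must wait for a result that is not yet available.
\begin{verbatim}
/* View for a hypothetical 2-stage pipeline: the i-th bit is 1 iff
   the i-th stage holds a bubble.  The lookahead consists of the
   program counters of the next two instructions to be executed. */
typedef unsigned view_t;                 /* only the values 0..3 are used */

/* Assumed, architecture specific predicate: true when the instruction at
   pc2 needs a result that the instruction at pc1 has not produced yet.  */
extern int stalls(int pc1, int pc2);

view_t next_view(int pc1, int pc2, view_t v) {
  view_t stage2 = v & 1;                     /* the bit of stage 1 moves to stage 2   */
  view_t stage1 = stalls(pc1, pc2) ? 1 : 0;  /* stage 1 gets a bubble exactly on a hazard */
  return (stage2 << 1) | stage1;
}
\end{verbatim}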
    449 
    450 While the model is a bit simplistic, it can nevertheless be used to describe
    451 existing pipelines. It is also easy to convince oneself that the same model
    452 also captures static branch prediction: speculative execution of conditional
    453 jumps is performed by always taking the same branch which does not depend on
    454 the execution history. In order to take into account jump predictions based on
    455 the execution history, we just need to incorporate into the state and the view
    456 statistical information on the last executions of the branch.
    457 
    458 \paragraph{The labelling approach for data independent cost models}
    459 We now describe how the labelling approach can be slightly modified to deal
    460 with a data independent cost model $(\hookrightarrow,K)$ built over
    461 $(\mathcal{V},|.|)$.
    462 
    463 In the labelling approach, every basic block in the object code is identified
    464 with a unique label $L$ which is also associated to the corresponding
    465 basic block in the source level. Let us assume that labels are also inserted
    466 after every function call and that this property is preserved during
    467 compilation. Adding labels after calls makes the instrumented code harder
    468 to read and generates more proof obligations on the instrumented code, but
    469 it does not create any additional problems. The preservation during compilation
    470 creates some significant technical complications in the proof of correctness
    471 of the compiler, but those can be solved.
    472 
    473 The static analysis performed in the last step of the basic labelling approach
    474 analyses the object code in order to assign a cost to every label or,
    475 equivalently, to every basic block. The cost is simply the sum of the costs
    476 of every instruction in the basic block.
    477 
    478 In our scenario, instructions no longer have a fixed cost, because the cost function
    479 $K$ takes as input not only a program counter but also a view $v$. Therefore we
    480 replace the static analysis with the computation, for every basic block and
    481 every $v \in \mathcal{V}$, of the sum of the costs of the instructions in the
    482 block, starting in the initial view $v$. Formally, let the sequence of the
    483 program counters of the basic block form the execution path $pc_0,\ldots,pc_n$.
    484 The cost $K(v_0,L)$ associated to the block labelled with
    485 $L$ and the initial view $v_0$ is
    486 $K(pc_0,v_0) + K(pc_1,v_1) + \ldots + K(pc_n,v_n)$ where for every $i < n$,
    487 $((pc_i,\ldots,pc_{i+l}),v_i) \hookrightarrow v_{i+1}$ where $l$ is the
    488 lookahead required. When the lookahead requires program counters outside the
    489 block under analysis, we are free to use dummy ones because the parts of the
    490 view that deal with future behaviour have no impact on the cost of the
    491 previous operations by assumption.
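
The computation just described can be sketched in C as follows, assuming a
lookahead of 1 and hypothetical functions \texttt{K} and \texttt{next\_view}
for the per-instruction cost and the view transition; the static analysis would
invoke \texttt{block\_cost} once for every label and every view in $\mathcal{V}$.
\begin{verbatim}
#define DUMMY_PC (-1)   /* placeholder for program counters outside the block */

extern int K(int pc, int view);                     /* assumed per-instruction cost */
extern int next_view(int pc1, int pc2, int view);   /* assumed view transition      */

/* Cost of the basic block whose execution path is pc[0..len-1],
   when the block is entered in the view entry_view. */
int block_cost(const int pc[], int len, int entry_view) {
  int cost = 0, view = entry_view;
  for (int i = 0; i < len; i++) {
    cost += K(pc[i], view);
    int next_pc = (i + 1 < len) ? pc[i + 1] : DUMMY_PC;
    view = next_view(pc[i], next_pc, view);
  }
  return cost;
}
\end{verbatim}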
    492 
    493 The static analysis can be performed
    494 in linear time in the size of the program because the cardinality of the sets
    495 of labels (i.e. the number of basic blocks) is bounded by the size of the
    496 program and because the set $\mathcal{V}$ is finite. In the case of the pipelines of
    497 the previous paragraph, the static analysis is $2^n$ times more expensive
    498 than the one of the basic labelling approach, where $n$ is the number of
    499 pipeline stages.
    500 
    501 The first important theorem in the labelling approach is the correctness of the
    502 static analysis: if the (dependent) cost associated to a label $L$ is $k$,
    503 then executing a program from the beginning of the basic block to the end
    504 of the basic block should take exactly $k$ cost units. The proof only relies
    505 on associativity and commutativity of the composition of costs. Commutativity
    506 is only required if basic blocks can be nested, i.e. if a basic block does not
    507 terminate when it reaches a call, but it continues after the called function
    508 returns. By assuming to have a cost label after each call, we do not need
    509 commutativity any longer, which indeed does not hold for the definition of $K$ we
    510 just gave. The reason is that, if $pc_i$ is a function call executed in
    511 the view (state) $v_i$, it is not true that, after return, the state
    512 will be $v_{i+1}$ defined by $((pc_i,pc_{i+1}),v_i) \hookrightarrow v_{i+1}$
    513 (assuming a lookahead of 1, which is already problematic).
    514 Indeed $pc_{i+1}$ is the program counter of the instruction that follows
    515 the call, whereas the next program counter to be reached is the one of the
    516 body of the call. Moreover, even if the computation made sense,
    517 $v_{i+1}$ would be the state at the beginning of the execution of the body of
    518 the call, while we should know the state after the function returns.
    519 The latter cannot be
    520 statically predicted. That's why we had to impose labels after calls.
    521 Associativity, on the other hand, trivially holds. Therefore the proof
    522 of correctness of the static analysis can be reused without any change.
    523 
    524 So far, we have computed the dependent cost $K: \mathcal{V} \times \mathcal{L} \to \mathbb{N}$ that associates a cost to basic blocks and views.
    525 The second step consists in statically computing the transition function
    526 $\hookrightarrow : \mathcal{L} \times \mathcal{L} \times \mathcal{V} \to \mathcal{V}$ that
    527 associates to each pair of consecutively executed basic blocks and input view
    528 the view obtained at the end of the execution of the first block.
    529 
    530 The definition is the following:
    531 $(L,L',v) \hookrightarrow v'$ iff $((pc_0,\ldots,pc_n,pc'_0,\ldots,pc'_m),v) \hookrightarrow v'$ where $(pc_0,\ldots,pc_n)$ are the program counters of the block labelled by $L$ and $(pc'_0,\ldots,pc'_m)$ are those of the block labelled
    532 with $L'$. We assume here that $m$ is always greater than or equal to the lookahead
    533 required by the transition function $\hookrightarrow$ over execution paths.
    534 When this is not the case we could make the new transition function take as
    535 input a longer lookahead of labels. Or we may introduce enough
    536 \texttt{NOP}s at the beginning of the block $L'$ to enforce the property.
    537 In the rest of the paragraph, to simplify the presentation, we assume that the
    538 second approach has been followed.
    539 
    540 The extended transition function over labels is not present in the basic labelling approach. Actually, the basic labelling
    541 approach can be understood as the generalized approach where the view $\mathcal{V} = \mathbbm{1}$.
    542 The computation of the extended $\hookrightarrow$ transition function
    543 is again linear in the size of the program.
    544 
    545 Both the versions of $K$ and $\hookrightarrow$ defined over labels can be
    546 lifted to work over traces by folding them over the list of labels in the
    547 trace: for $K$ we have $K((L_1,\ldots,L_n),v) = K(L_1,v) + K((L_2,\ldots,L_n),v')$ where
    548 $(L_1,L_2,v) \hookrightarrow v'$;  for $\hookrightarrow$ we have
    549 $((L_1,\ldots,L_n),v) \hookrightarrow v''$ iff $(L_1,L_2,v) \hookrightarrow v'$ and $((L_2,\ldots,L_n),v') \hookrightarrow v''$. The two definitions are also
    550 clearly associative.
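
For instance, unfolding the two definitions on a trace of three labels yields
$K((L_1,L_2,L_3),v_0) = K(L_1,v_0) + K(L_2,v_1) + K(L_3,v_2)$ where
$(L_1,L_2,v_0) \hookrightarrow v_1$ and $(L_2,L_3,v_1) \hookrightarrow v_2$.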
    551 
    552 The second main theorem of the labelling approach is trace preservation:
    553 the trace produced by the object code is the same as the trace produced by
    554 the source code. Without any need to change the proof, we immediately obtain
    555 as a corollary that for every view $v$, the cost $K(\tau,v)$ computed from
    556 the source code trace $\tau$ is the same as the cost $K(\tau,v)$ computed
    557 on the object code trace, which is again $\tau$.
    558 
    559 The final step of the labelling approach is source code instrumentation.
    560 In the basic labelling approach it consists in adding a global variable
    561 \texttt{\_\_cost}, initialized with 0, which is incremented at the beginning
    562 of every basic block with the cost of the label of the basic block.
    563 Here we just need a more complex instrumentation that keeps track of the
    564 values of the views during execution:
    565 \begin{itemize}
    566  \item We define three global variables \texttt{\_\_cost}, initialized at 0,
    567    \texttt{\_\_label}, initialized with \texttt{NULL}, and
    568    \texttt{\_\_view}, uninitialized.
    569  \item At the beginning of every basic block labelled by $L$ we add the
    570   following code fragment:
    571 
    572   \begin{tabular}{l}
    573   \texttt{\_\_view = \_\_next(\_\_label,}$L$\texttt{,\_\_view);}\\
    574   \texttt{\_\_cost += }$K(\mbox{\texttt{\_\_view}},L)$\texttt{;}\\
    575   \texttt{\_\_label = }$L$\texttt{;}
    576   \end{tabular}
    577 
    578   where \texttt{\_\_next(}$L_1,L_2,v$\texttt{)} = $v'$ iff
    579   $(L_1,L_2,v) \hookrightarrow v'$ unless $L_1$ is \texttt{NULL}.
    580   In that case \texttt{\_\_next(NULL,}$L,v$\texttt{)} = $v_0$ where
    581   $v_0 = |\delta_0|$ and $\delta_0$ is the initial value of the internal state
    582   at the beginning of program execution.
    583  
    584   The first line of the code fragment computes the view at the beginning of
    585   the execution of the block from the view at the end of the previous block.
    586   Then we update the cost variable with the cost of the block. Finally we
    587   remember the current block to use it for the computation of the next view
    588   at the beginning of the next block.
    589 \end{itemize}
    590 
    591 \paragraph{An example of instrumentation in presence of a pipeline}
    592 In Figure~\ref{factprogi} we show how the instrumentation of a program
    593 that computes the factorial of 10 would look in the presence of a pipeline.
    594 The instrumentation has been manually produced. The \texttt{\_\_next}
    595 function says that the body of the internal loop of the \texttt{fact}
    596 function can be executed in two different internal states, summarized by the
    597 views 2 and 3. The view 2 holds at the beginning of even iterations,
    598 while the view 3 holds at the beginning of odd ones. Therefore even and odd
    599 iterations are assigned a different cost. Also the code after the loop can
    600 be executed in two different states, depending on the oddness of the last
    601 loop iteration.
    602 
    603 \begin{figure}[t]
    604 \begin{tiny}
    605 \begin{verbatim}
    606 int fact (int n) {
    607   int i, res = 1;
    608   for (i = 1 ; i <= n ; i++) res *= i;
    609   return res;
    610 }
    611 
    612 int main () {
    613   return (fact(10));
    614 }
    615 \end{verbatim}
    616 \end{tiny}
    617 \caption{A simple program that computes the factorial of 10.\label{factprog}}
    618 \end{figure}
    619 
    620 The definitions of \texttt{\_\_next} and \texttt{\_\_K} are just examples.
    621 For instance, it is possible as well that each one of the 10 iterations
    622 is executed in a different internal state.
    623 
    624 \begin{figure}[t]
    625 \begin{tiny}
    626 \begin{verbatim}
    627 int __cost = 8;
    628 int __label = 0;
    629 int __view;
    630 
    631 void __cost_incr(int incr) {
    632   __cost = __cost + incr;
    633 }
    634 
    635 int __next(int label1, int label2, int view) {
    636        if (label1 == 0 && label2 == 0)              return 0;
    637   else if (label1 == 0 && label2 == 1)              return 1;
    638   else if (label1 == 1 && label2 == 2)              return 2;
    639   else if (label1 == 2 && label2 == 2 && view == 2) return 3;
    640   else if (label1 == 2 && label2 == 2 && view == 3) return 2;
    641   else if (label1 == 2 && label2 == 3 && view == 2) return 1;
    642   else if (label1 == 2 && label2 == 3 && view == 3) return 0;
    643   else if (label1 == 3 && label2 == 4 && view == 0) return 0;
    644   else if (label1 == 3 && label2 == 4 && view == 1) return 0;
    645 }
    646 
    647 int __K(int view, int label) {
    648        if (view == 0 && label == 0) return 3;
    649   else if (view == 1 && label == 1) return 14;
    650   else if (view == 2 && label == 2) return 35;
    651   else if (view == 3 && label == 2) return 26;
    652   else if (view == 0 && label == 3) return 6;
    653   else if (view == 1 && label == 3) return 8;
    654   else if (view == 0 && label == 4) return 6;
    655 }
    656 
    657 int fact(int n)
    658 {
    659   int i;
    660   int res;
    661   __view = __next(__label,1,__view); __cost_incr(__K(__view,1)); __label = 1;
    662   res = 1;
    663   for (i = 1; i <= n; i = i + 1) {
    664      __view = __next(__label,2,__view); __cost_incr(__K(__view,2)); __label = 2;
    665     res = res * i;
    666   }
    667   __view = __next(__label,3,__view); __cost_incr(__K(__view,3)); __label = 3;
    668   return res;
    669 }
    670 
    671 int main(void)
    672 {
    673   int t;
    674   __view = __next(__label,0,__view); __cost_incr(__K(__view,0)); __label = 0;
    675   t = fact(10);
    676   __view = __next(__label,4,__view); __cost_incr(__K(__view,4)); __label = 4;
    677   return t;
    678 }
    679 \end{verbatim}
    680 \end{tiny}
    681 \caption{The instrumented version of the program in Figure~\ref{factprog}.\label{factprogi}}
    682 \end{figure}
    683 
    684 \paragraph{Considerations on the instrumentation}
    685 The example of instrumentation in the previous paragraph shows that the
    686 approach we are proposing exposes at the source level a certain amount
    687 of information about the machine behavior. Syntactically, the additional
    688 details are almost entirely confined to the \texttt{\_\_next} and
    689 \texttt{\_\_K} functions and they do not affect at all the functional behaviour
    690 of the program. In particular, all invariants, proof obligations and proofs
    691 that deal with the functional behavior only are preserved.
    692 
    693 The interesting question, then, is what is the impact of the additional
    694 details on non functional (intensional) invariants and proof obligations.
    695 At the moment, without a working implementation to perform some large scale
    696 tests, it is difficult to understand the level of automation that can be
    697 achieved and the techniques that are likely to work better without introducing
    698 major approximations. In any case, the preliminary considerations of the
    699 project remain valid:
    700 \begin{itemize}
    701  \item The task of computing and proving invariants can be simplified,
    702    even automatically, at the price of precision. For example, the
    703    most aggressive approximation simply replaces the cost function
    704    \texttt{\_\_K} with the function that ignores the view and returns for each
    705    label the maximum cost over all possible views (a sketch is given after this list).
    706    Correspondingly, the function \texttt{\_\_next} can be dropped since it returns views that
    707    are later ignored.
    708 
    709    A more refined possibility consists in approximating the output only on
    710    those labels whose jitter is small or for those that mark basic blocks
    711    that are executed only a small number of times. By simplifying the
    712    \texttt{\_\_next} function accordingly, it is possible to considerably
    713    reduce the search space for automated provers.
    714  \item The situation is not worse than what happens with time analysis on
    715    the object code (the current state of the art). There it is common practice
    716    to analyse larger chunks of execution to minimize the effect of later
    717    approximations. For example, if it is known that a loop can be executed at
    718    most 10 times, computing the cost of 10 iterations yields a
    719    better bound than multiplying by 10 the worst case of a single iteration.
    720 
    721    We clearly can do the same on the source level.
    722    More generally, every algorithm that works in standard WCET tools on the
    723    object code is likely to have a counterpart on the source code.
    724    We also expect to be able to do better working on the source code.
    725    The reason is that we assume to know more functional properties
    726    of the program and more high level invariants, and to have more techniques
    727    and tools at our disposal. Even if at the moment we have no evidence to
    728    support our claims, we think that this approach is worth pursuing in the
    729    long term.
    730 \end{itemize}
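
As an illustration of the most aggressive approximation mentioned in the first
item above, the dependent cost function \texttt{\_\_K} of Figure~\ref{factprogi}
could be replaced by the following view-insensitive function, obtained by taking
for each label the maximum cost over all views (the code is only a manual sketch,
not the output of any tool):
\begin{verbatim}
int __K_max(int label) {
       if (label == 0) return 3;
  else if (label == 1) return 14;
  else if (label == 2) return 35;   /* max(35, 26) over the views 2 and 3 */
  else if (label == 3) return 8;    /* max(6, 8)   over the views 0 and 1 */
  else                 return 6;    /* label 4 */
}
\end{verbatim}
Each call \texttt{\_\_cost\_incr(\_\_K(\_\_view,}$L$\texttt{))} would then become
\texttt{\_\_cost\_incr(\_\_K\_max(}$L$\texttt{))}, and the updates of
\texttt{\_\_view} and \texttt{\_\_label} could be removed altogether.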
    731 
    732 \paragraph{The problem with caches}
    733 Cost models for pipelines --- at least simple ones --- are data independent,
    734 i.e. they are dependent on a view that is execution path dependent. In other
    735 words, the knowledge about the sequence of executed instructions is sufficient
    736 to predict the cost of future instructions.
    737  
    738 The same property does not hold for caches. The cost of accessing a memory
    739 cell strongly depends on the addresses of the memory cells that have been read
    740 in the past. In turn, the accessed addresses are a function of the low level
    741 data state, that cannot be correlated to the source program state.
    742 
    743 The strong correlation between the internal state of caches and the data
    744 accessed in the past is one of the two main culprits for the lack of
    745 precision of static analysis in modern uni-core architectures. The other one
    746 is the lack of precise knowledge on the real behaviour of modern hardware
    747 systems. In order to overcome both problems, which Cazorla et al.~\cite{proartis}
    748 call the ``\emph{Timing Analysis Walls}'', the PROARTIS European Project has
    749 proposed to design ``\emph{a hardware/software architecture whose execution
    750 timing behaviour eradicates dependence on execution history}'' (\cite{proartis}, Section 1.2). The statement is obviously too strong. What is concretely
    751 proposed by PROARTIS is the design of a hardware/software architecture whose
    752 execution timing is \emph{execution path dependent} (our terminology).
    753 
    754 We have already seen that we are able to accommodate in the labelling approach
    755 cost functions that are dependent on views that are execution path dependent.
    756 Before fully embracing the PROARTIS vision,
    757 we need to check if there are other aspects of the PROARTIS proposal
    758 that can be problematic for CerCo.
    759 
    760 \paragraph{Static Probabilistic Time Analysis}
    761 The approach of PROARTIS to achieve execution path dependent cost models
    762 consists in turning the hard-to-analyze deterministic hardware components
    763 (e.g. the cache) into probabilistic hardware components. Intuitively,
    764 algorithms that took decisions based on the program history now throw a die.
    765 The typical example which has been thoroughly studied in
    766 PROARTIS~\cite{proartis2} is that of caches. There the proposal is to replace
    767 the commonly used deterministic placement and replacement algorithms (e.g. LRU)
    768 with fully probabilistic choices: when the cache needs to evict a line, the
    769 line to be evicted is randomly selected according to the uniform distribution.
    770 
    771 The expectation is that probabilistic hardware will have worse performance
    772 in the average case, but it will exhibit the worst case performance only with
    773 negligible probability. Therefore, it becomes no longer interesting to estimate
    774 the actual worst case bound. What becomes interesting is to plot the probability
    775 that the execution time will exceed a certain threshold. For all practical
    776 purposes, a program that misses its deadline with a negligible probability
    777 (e.g. $10^{-9}$ per hour of operation) will be perfectly acceptable when deployed
    778 on a hardware system (e.g. a car or an airplane) that is already specified in
    779 such a way.
    780 
    781 In order to plot the probability distribution of execution times, PROARTIS
    782 proposes two methodologies: Static Probabilistic Time Analysis (SPTA) and
    783 Measurement Based Probabilistic Time Analysis (MBPTA). The first one is
    784 similar to traditional static analysis, but it operates on probabilistic
    785 hardware. It is the one that we would like to embrace. The second one is
    786 based on measurements and it is justified by the following assumption: if
    787 the probabilities associated to every hardware operation are all independent
    788 and identically distributed, then measuring the time spent on full runs of
    789 sub-system components yields a probabilistic estimate that remains valid
    790 when the sub-system is plugged into a larger one. Moreover, the probabilistic
    791 distribution of past runs must be equal to the one of future runs.
    792 
    793 We understand that MBPTA is useful to analyze closed (sub)-systems whose
    794 functional behavior is deterministic. It does not seem to have immediate
    795 applications to parametric time analysis, which we are interested in.
    796 Therefore we focus on SPTA.
    797 
    798 According to~\cite{proartis},
    799 ``\emph{in SPTA, execution time probability distributions for individual operations \ldots are determined statically
    800 from a model of the processor. The design principles of PROARTIS
    801 will ensure \ldots that the probabilities for the execution time of each
    802 instruction are independent. \ldots SPTA is performed by calculating the
    803 convolution ($\oplus$) of the discrete probability distributions which describe
    804 the execution time for each instruction on a CPU; this provides a probability
    805 distribution \ldots representing the timing behaviour of the entire sequence of
    806 instructions.}''
    807 
    808 We will now analyze to what extent we can embrace SPTA in CerCo.
    809 
    810 \paragraph{The labelling approach for Static Probabilistic Time Analysis}
    811 
    812 To summarize, the main practical differences between standard static time
    813 analysis and SPTA are:
    814 \begin{itemize}
    815  \item The cost functions for single instructions or sequences of instructions
    816    no longer return natural numbers (numbers of cost units) but integer-valued
    817    random variables.
    818  \item Cost functions are extended from single instructions to sequences of
    819    instructions by using the associative convolution operator $\oplus$
    820    (a small worked example is given right after this list).
    820 \end{itemize}
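
For example, with purely invented numbers, if an instruction $I_1$ takes 1 cycle
with probability $0.9$ and 10 cycles with probability $0.1$ (say, on a cache
miss), and the same holds independently for an instruction $I_2$, then the
convolution describing the sequence $I_1;I_2$ assigns probability $0.81$ to 2
cycles, $0.18$ to 11 cycles and $0.01$ to 20 cycles.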
    821 
    822 By reviewing the papers that described the labelling approach, it is easy to
    823 become convinced that the codomain of the cost analysis can be lifted from
    824 that of natural numbers to any commutative monoid. Moreover, by imposing labels after
    825 every function call, commutativity can be dropped and the approach works on
    826 every monoid (usually called \emph{cost monoids} in the literature).
    827 Because random variables and convolutions form a monoid, we immediately have
    828 that the labelling approach extends itself to SPTA. The instrumented code
    829 produced by an SPTA-CerCo compiler will then have random variables (on a finite
    830 domain) as costs and convolutions in place of the \texttt{\_\_cost\_incr}
    831 function.
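
A minimal sketch of what such an instrumentation could look like, assuming that
cost distributions are represented extensionally as arrays of probabilities
indexed by the number of cycles (an arbitrary representation choice, truncated
at a hypothetical bound \texttt{MAXC}), is the following:
\begin{verbatim}
#define MAXC 64                 /* hypothetical bound on the represented costs */

double __cost[MAXC] = { 1.0 };  /* P(total cost = 0) = 1 at program start */

/* Replace the current cost distribution with its convolution with the
   distribution `block' of the basic block just entered. */
void __cost_conv(const double block[MAXC]) {
  double res[MAXC] = { 0.0 };
  for (int i = 0; i < MAXC; i++)
    for (int j = 0; i + j < MAXC; j++)
      res[i + j] += __cost[i] * block[j];
  for (int i = 0; i < MAXC; i++)
    __cost[i] = res[i];
}
\end{verbatim}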
    832 
    833 What is left to be understood is the way to state and compute the
    834 probabilistic invariants to do \emph{parametric SPTA}. Indeed, it seems that
    835 PROARTIS was only interested in non parametric PTA. For example, it is
    836 well known that actually computing the convolutions results in an exponential
    837 growth of the memory required to represent the result of the convolutions.
    838 Therefore, the analysis should work symbolically until the moment when
    839 we want to extract information from the convolution.
    840 
    841 Moreover, assuming that the problem of computing invariants is solved,
    842 the actual behavior of automated theorem
    843 provers on probabilistic invariants is to be understood. It is likely that
    844 a good amount of domain specific knowledge about probability theory must be
    845 exploited and incorporated into automatic provers to achieve concrete results.
    846 
    847 Parametric SPTA using the methodology developed in CerCo is a future research
    848 direction that we believe to be worth exploring in the middle and long term.
    849 
    850 \paragraph{Static Probabilistic Time Analysis for Caches in CerCo}
    851 
    852 As a final remark, we note that the analysis in CerCo of systems that implement
    853 probabilistic caches requires a combination of SPTA and data independent cost
    854 models. The need for a probabilistic analysis is obvious but, as we saw in
    855 the previous paragraph, it requires no modification of the labelling approach.
    856 
    857 In order to understand the need for dependent labelling (to work on data
    858 independent cost functions), we need to review the behaviour of probabilistic
    859 caches as proposed by PROARTIS. The interested reader can
    860 consult~\cite{proartis2-articolo30} for further information.
    861 
    862 In a randomized cache, the probability of evicting a given line on every access
    863 is $1/N$ where $N$ stands for the number of cache entries. Therefore the
    864 hit probability of a specific access to such a cache is
    865 $P(hit) = (\frac{N-1}{N})^K$ where $K$ is the number of cache misses between
    866 two consecutive accesses to the same cache entry. For the purposes of our
    867 analysis, we must assume that every cache access could cause an eviction.
    868 Therefore, we define $K$ (the \emph{reuse distance}) to be the number of
    869 memory accesses between two consecutive accesses to the same cache entry,
    870 including the access for which we are computing $K$. In order to compute
    871 $K$ for every code memory address, we need to know the execution path (in
    872 our terminology). In other words, we need a view that records for each
    873 cache entry the number of memory accesses that have occurred since the last
    874 access.
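
As a purely numerical illustration, for a hypothetical randomized cache with
$N = 32$ entries and a reuse distance $K = 10$, the formula above gives
$P(hit) = \left(\frac{31}{32}\right)^{10} \approx 0.73$.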
    875 
    876 For pipelines with $n$ stages, the number of possible views is limited to
    877 $2^n$: a view can usually just be represented by a word. This is not the
    878 case for the views on caches, which are in principle very large. Therefore,
    879 the dependent labelling approach for data independent cost functions that we
    880 have presented here could still be impractical for caches. If that turns out
    881 to be the case, a possible strategy is the use of abstract interpretation
    882 techniques on the object code to reduce the size of views exposed at the
    883 source level, at the price of an early loss of precision in the analysis.
    884 
    885 More research work must be performed at the current stage to understand if
    886 caches can be analyzed, even probabilistically, using the CerCo technology.
    887 This is left for future work and it will be postponed after the work on
    888 pipelines.
    889 
    890 \paragraph{Conclusions}
    891 At the current state of the art, functional properties of programs are better
    892 proved on high level languages, while the non functional ones are proved on the
    893 corresponding object code. The non functional analysis, however, depends on
    894 functional invariants, e.g. to bound or correlate the number of iterations of
    895 loops.
    896 
    897 The aim of the CerCo project is to reconcile the two analyses by performing
    898 non functional analysis on the source code. This requires computing a cost
    899 model on the object code and reflecting the cost model on the source code.
    900 We achieve this in CerCo by designing a certified Cost Annotating Compiler that
    901 keeps track of the transformations of basic blocks, in order to create a
    902 correspondence (not necessarily a bijection) between the basic blocks of the
    903 source and target language. We then prove that the sequences of basic blocks
    904 that are met in the source and target executions are correlated. Then,
    905 we perform a static analysis of the cost of basic blocks on the target language
    906 and we use it to compute the cost model for the source language basic blocks.
    907 Finally, we compute cost invariants on the source code from the inferred cost
    908 model and from the functional program invariants; we generate proof obligations
    909 for the invariants; we use automatic provers to try to close the proof
    910 obligations.
    911 
    912 The cost of single instructions on modern architectures depends on the internal
    913 state of various hardware components (pipelines, caches, branch predicting
    914 units, etc.). The internal states are determined by the previous execution
    915 history. Therefore the cost of basic blocks is parametric on the execution
    916 history, which means both the instructions executed and the data manipulated
    917 by the instructions. The CerCo approach is able to correlate the sequence
    918 of blocks of source instructions with the sequence of blocks of target
    919 instructions. It does not correlate the high level and the low level data.
    920 Therefore we are not able in the general case to lift a cost model parametric
    921 on the execution history to the source code.
    922 
    923 To overcome the problem, we have identified a particular class of cost models
    924 that are not dependent on the data manipulated. We argue that the CerCo
    925 approach can cover this scenario by exposing in the source program a finite
    926 data type of views over internal machine states. The costs of basic blocks
    927 are parametric on these views, and the current view is updated at basic block
    928 entry according to some abstraction of the machine hardware that the user does not
    929 need to understand. Further studies are needed to understand how invariant
    930 generators and automatic provers can cope with these updates and parametric
    931 costs.
    932 
    933 We have argued how pipelines, at least simple ones, are captured by the
    934 previous scenario and can in principle be manipulated using CerCo tools.
    935 The same is not true for caches, whose behaviour deeply depends on the
    936 data manipulated. By embracing the PROARTIS proposal of turning caches into
    937 probabilistic components, we can break the data dependency. Nevertheless,
    938 cache analysis remains more problematic because of the size of the views.
    939 Further studies need to be focused on caches to understand whether the problem of
    940 the size of the views can be tamed in practice without ruining the whole approach.
     119\includepdf[pages={-}]{pipelines.pdf}
    941120
    942121\end{document}