\documentclass[11pt, epsf, a4wide]{article}

\usepackage{../style/cerco}

\usepackage{amsfonts}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage[english]{babel}
\usepackage{graphicx}
\usepackage[utf8x]{inputenc}
\usepackage{listings}
\usepackage{stmaryrd}
\usepackage{url}
\usepackage{bbm}

\begin{document}

\title{??????}
\author{??????}
\date{dd/mm/yy}

\maketitle
\section{Middle and Long Term Improvements}
In order to identify the middle and long term improvements, we briefly review
here the premises and goals of the CerCo approach to resource analysis.
\begin{itemize}
 \item There is a lot of recent and renewed activity in the formal methods
   community to accommodate resource analysis using techniques derived from
   functional analysis (type systems, logics, abstract interpretation,
   amortized analysis applied to data structures, etc.).
 \item Most of this work, which currently remains at a theoretical level,
   is focused on high level languages and assumes the existence of correct
   and compositional resource cost models.
 \item High level languages are compiled to object code by compilers that
   should respect the functional properties of the program. However, because
   of optimizations and the inherently non-compositional nature of compilation,
   compilers do not respect compositional cost models that are imposed a priori
   on the source language. By controlling the compiler and coupling it with a
   WCET analyser, it is actually possible to
   choose the cost model in such a way that the cost bounds are high enough
   to bound the cost of every produced code. This was attempted, for instance,
   in the EmBounded project with good success. However, we believe that bounds
   obtained in this way are unlikely to be tight.
 \item Therefore our approach consists in having the compiler generate the
   cost model for the user by combining tracking of basic blocks during code
   transformations with a static resource analysis on the object code for basic
   blocks. We formally prove that the compiler respects the cost model that is
   induced on the source level, based on very few assumptions: essentially, the
   cost of a sequence of instructions should be associative and commutative,
   and it should not depend on the machine status, except for the program
   counter. Commutativity can be relaxed at the price of introducing more cost
   updates in the instrumented source code.
 \item The cost model for basic blocks induced on the source language must then
   be exploited to derive cost invariants and to prove them automatically.
   In CerCo we have shown how even simple invariant generation techniques
   are sufficient to enable the fully automatic proving of parametric WCET
   bounds for simple C programs and for Lustre programs of arbitrary complexity.
\end{itemize}

Compared to traditional WCET techniques, our approach currently has many
similarities, some advantages and some limitations. Both techniques need to
perform data flow analysis on the control flow graph of the program and both
techniques need to estimate the cost of blocks of instructions.

\subsection{Control flow analysis}

The first main difference is in the control flow analysis. Traditional WCET
starts from object code and reconstructs the control flow graph from it.
Moreover, abstract interpretation is heavily employed to bound the number of
iterations of loops. In order to improve the accuracy of the estimation,
control flow constraints are provided by the user, usually as systems of
(linear) inequalities. In order to do this, the user, helped by the system,
needs to relate the object code control flow graph to the source one, because
it is on the latter that the bounds can be figured out and understood. This
operation is untrusted and potentially error prone for complex optimizations
(like aggressive loop optimizations). Efficient tools from linear algebra are
then used to solve the systems of inequalities obtained from the abstract
interpreter and from the user constraints.

In CerCo, instead, we assume full control over the compiler, which is
able to relate, even in non trivial ways, the object code control flow graph
to the source code control flow graph. A clear disadvantage is the
impossibility of applying the tool to the object code produced by third party
compilers. On the other hand, we eliminate the possibility of errors
in the reconstruction of the control flow graph and in the translation of
high level constraints into low level constraints. The second potentially
important advantage is that, once we are dealing with the source language,
we can increase the precision of our dataflow analysis by combining
functional and non functional invariants. This is what we attempted with
the CerCo Cost Annotating Frama-C Plug-In. The Frama-C architecture allows
several plug-ins to perform all kinds of static analysis on the source code,
reusing results from other plug-ins and augmenting the source code with their
results. The techniques are by no means limited to linear algebra and
abstract interpretation, and the most important plug-ins call domain specific
and general purpose automated theorem provers to close proof obligations of
arbitrary shape and complexity.

In principle, the extended flexibility of the analysis should give our
technique a major advantage in terms of precision, especially
considering that all the analyses used in traditional WCET can still be
implemented as plug-ins. In particular, the targets we have in mind are
systems that are both (hard) real time and safety critical. Being safety
critical, we can already expect them to be fully or partially specified at the
functional level.
Therefore we expect that the additional functional invariants should allow us
to increase the precision of the cost bounds, up to the point where the
parametric cost bound is fully precise.
In practice, we have not had the time to perform extensive
comparisons on the kind of code used by industry in production systems.
The first middle term improvement of CerCo would then consist in this kind
of analysis, to support or disprove our expectations. It seems that the
newly formed TACLe Cost Action (Timing Analysis on Code Level) would be the
best framework in which to achieve this improvement.
If our technique proves promising, the next long term improvement would
consist in integrating into the Frama-C plug-in ad-hoc analyses and
combinations of analyses that would improve the coverage and the efficiency
of the cost estimation techniques.

\subsection{Static analysis of costs of basic blocks}
At the beginning of the project we deliberately decided to focus our
attention on control flow preservation, cost model propagation and
the exploitation of the cost model induced on the high level code. For this
reason we devoted almost no attention to the static analysis of basic
blocks. This was achieved by picking a very simple hardware architecture
(the 8051 microprocessor family) whose cost model is fully predictable and
compositional: the cost of every instruction --- except those that deal with
I/O --- is constant, i.e. it does not depend on the machine status.
We do not regret this choice because, with the limited amount of manpower
available in the project, it would have been difficult to also consider this
aspect. However, without showing that the approach can scale to more complex
architectures, our methodology remains of limited interest to industry.
Therefore, the next important middle term improvement will be the extension
of our methodology to cover pipelines and simple caches. We will now present
our ideas on how we intend to address the problem. The obvious long term
improvement would be to take into consideration multicore systems and complex
memory architectures like the ones currently in use in networks on chip.
The problem of execution time analysis for these systems is currently
considered extremely hard or even unfeasible and at the moment it seems
unlikely that our methodology could contribute to the solution of the problem.

\subsubsection{Static analysis of costs of basic blocks revisited}
We will now describe what currently seems to be the most interesting technique
for the static analysis of the cost of basic blocks in the presence of complex
hardware architectures that do not have compositional cost models.

We start by presenting an idealized model of the execution of a generic
microprocessor (with caches) that has all interrupts disabled and no I/O
instructions. We then classify the models according to some properties of
their cost model. Finally, we show how to extend the labelling approach of
CerCo to cover models that are classified in a certain way.

\paragraph{The microprocessor model}

Let $\sigma,\sigma_1,\ldots$ range over $\Sigma$, the set of fragments of
the microprocessor states that hold the program counter, the program status
word and all the data manipulated by the object code program, i.e. registers
and memory cells. We call these fragments the \emph{data states}.

Let $\delta,\delta_1,\ldots$ range over $\Delta$, the set of fragments of
the microprocessor state that hold the \emph{internal state} of the
microprocessor (e.g. the content of the pipeline and caches, the status of
the branch prediction unit, etc.).
The internal state of the microprocessor influences the execution cost of the
next instruction, but it has no effect on the functional behaviour of the
processor. The whole state of the processor is represented by a pair
$(\sigma,\delta)$.

Let $I,I_1,\ldots$ range over $\mathcal{I}$, the set of \emph{instructions}
of the processor and let $\gamma,\gamma_1,\ldots$ range over $\Gamma$, the set
of \emph{operands} of instructions after the fetching and decoding passes.
Thus a pair $(I,\gamma)$ represents a \emph{decoded instruction} and already
contains the data required for execution. Execution needs to access the data
state only to write the result.

Let $fetch: \Sigma \to \mathcal{I} \times \Gamma$ be the function that
performs the fetching and decoding phases, returning the decoded instruction
ready for execution. This is not meant to be the real fetch-decode function,
which also exploits the internal state to speed up execution (e.g. by
retrieving the instruction arguments from caches) and which, in the case of
pipelines, works in several stages. However, such a function exists and
it is observationally equivalent to the real fetch-decode.

We capture the semantics of the microprocessor with the following set of
functions:
\begin{itemize}
 \item The \emph{functional transition} function $\longrightarrow : \Sigma \to \Sigma$
   over data states. This is the only part of the semantics that is relevant
   to functional analysis.
 \item The \emph{internal state transition} function
   $\Longrightarrow : \Sigma \times \Delta \to \Delta$ that updates the internal
   state.
 \item The \emph{cost function} $K: \mathcal{I} \times \Gamma \times \Delta \to \mathbb{N}$
   that assigns a cost to transitions. Since decoded instructions hold the data
   they act on, the cost of an instruction may depend both on the data being
   manipulated and on the internal state.
\end{itemize}

Given a processor state $(\sigma,\delta)$, the processor evolves into the
new state $(\sigma',\delta')$ in $n$ cost units if
$\sigma \longrightarrow \sigma'$ and $(\sigma,\delta) \Longrightarrow \delta'$
and $fetch(\sigma) = (I,\gamma)$ and $K(I,\gamma,\delta) = n$.
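
To make the model concrete, the following C sketch shows how the four
components fit together in one evolution step. It is a minimal illustration
under our own assumptions: every type, name and stub body below is invented
for the example and belongs to no real CerCo artefact.

\begin{verbatim}
#include <stdint.h>
#include <stdio.h>

typedef struct { uint16_t pc; uint8_t reg[8]; uint8_t mem[64]; } data_state; /* sigma */
typedef struct { uint8_t bubbles; } internal_state;                          /* delta */
typedef struct { int opcode; int arg; } decoded;                             /* (I,gamma) */

/* Dummy stand-ins for the architecture specific functions of the model. */
static decoded fetch(const data_state *s)
  { decoded d = { s->mem[s->pc % 64], 0 }; return d; }
static data_state func_step(data_state s)                       /* sigma -> sigma' */
  { s.pc++; return s; }
static internal_state int_step(data_state s, internal_state d)  /* (sigma,delta) => delta' */
  { d.bubbles = (d.bubbles + (s.pc & 1)) & 1; return d; }
static unsigned K(decoded i, internal_state d)                  /* cost function */
  { return 1u + d.bubbles; }

/* One observable evolution step: the cost only depends on the decoded
   instruction and on the internal state, never on sigma directly. */
static unsigned step(data_state *s, internal_state *d) {
  unsigned n = K(fetch(s), *d);
  *d = int_step(*s, *d);
  *s = func_step(*s);
  return n;
}

int main(void) {
  data_state s = {0};
  internal_state d = {0};
  printf("cost of first step: %u\n", step(&s, &d));
  return 0;
}
\end{verbatim}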

An \emph{execution history} is a stream of states and transitions
$\sigma_0 \longrightarrow \sigma_1 \longrightarrow \sigma_2 \ldots$
that can be either finite or infinite. Given an execution history, the
corresponding \emph{execution path} is the stream of program counters
obtained from the execution history by forgetting all the remaining
information. The execution path of the history
$\sigma_0 \longrightarrow \sigma_1 \longrightarrow \sigma_2 \ldots$ is
$pc_0,pc_1,\ldots$ where each $pc_i$ is the program counter of $\sigma_i$.
We denote the set of finite execution paths with $EP$.

We claim this simple model to be generic enough to cover real world
architectures.

\paragraph{Classification of cost models}
A cost function is \emph{exact} if it assigns to transitions the real cost
incurred. It is \emph{approximated} if it returns an upper bound of the real
cost.

A cost function is \emph{operand insensitive} if it does not depend on the
operands of the instructions to be executed. Formally, $K$ is operand
insensitive if there exists a $K': \mathcal{I} \times \Delta \to \mathbb{N}$
such that $K(I,\gamma,\delta) = K'(I,\delta)$. In this case, with an abuse of
terminology, we will identify $K$ with $K'$.

The cost functions of simple hardware architectures, in particular RISC ones,
are naturally operand insensitive. In the other cases an exact operand
sensitive cost function can always be turned into an approximated operand
insensitive one by taking
$K'(I,\delta) = \max\{K(I,\gamma,\delta)~|~\gamma \in \Gamma\}$.
The question when one performs this approximation is how severe the
approximation is. A measure is given by the \emph{jitter}, which is defined
as the difference between the best and worst cases. In our case, the jitter
of the approximation $K'$ would be $\max\{K(I,\gamma,\delta)~|~\gamma \in \Gamma\} - \min\{K(I,\gamma,\delta)~|~\gamma \in \Gamma\}$. According to experts
in WCET analysis, the jitters relative to operand sensitivity in modern
architectures are small enough to make WCET estimations still useful.
Therefore, in the sequel we will focus on operand insensitive cost models only.
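
As a small illustration, the following C fragment computes the operand
insensitive upper bound and the corresponding jitter for one instruction in a
fixed internal state. The cost table is entirely made up for the example.

\begin{verbatim}
#include <stdio.h>

#define NGAMMA 4  /* number of operand classes of the hypothetical instruction */

/* K(I,gamma,delta) for a fixed I and delta, one entry per gamma */
static const unsigned K_I[NGAMMA] = { 2, 2, 3, 5 };

int main(void) {
  unsigned max = K_I[0], min = K_I[0];
  for (int g = 1; g < NGAMMA; g++) {
    if (K_I[g] > max) max = K_I[g];
    if (K_I[g] < min) min = K_I[g];
  }
  /* K'(I,delta) is the max over all operands; the jitter measures the
     precision lost by forgetting the operands. */
  printf("K'(I,delta) = %u, jitter = %u\n", max, max - min);
  return 0;
}
\end{verbatim}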

Note that cost models that are operand insensitive may still have significant
dependencies on the data manipulated by the instructions, because of the
dependency on internal states. For example, an instruction that reads data
from memory may change the state of the cache and thus greatly affect the
execution time of subsequent instructions.

Nevertheless, operand insensitivity is an important property for the labelling
approach. In~\cite{tranquilli} we introduced \emph{dependent labels} and
\emph{dependent costs}, which make it possible to assign to basic blocks of
instructions costs that also depend on the state of the high level
program at the beginning of the block. The idea we will now try to pursue is
to exploit dependent costs to capture cost models that are sensitive to
the internal states of the microprocessor. Operand sensitivity, however, is
a major issue in the presence of dependent labels: to lift a data sensitive
cost model from the object code to the source language, we need a function
that maps high level states to the operands of the instructions to be
executed, and we need these functions to be simple enough to allow reasoning
over them. Complex optimizations performed by the compiler, however, make the
mappings extremely cumbersome and history dependent. Moreover, keeping
track of status transformations during compilation would be a significant
departure from classical compilation models which we are not willing to
undertake. By assuming operand insensitivity, we get rid of part
of the problem: we only need to make our costs dependent on the internal
state of the microprocessor. The latter, however, is not at all visible
in the high level code. Our next step is to make it visible.

In general, the value of the internal state at a certain point in the program
history is affected by all the preceding history. For instance, the
pipeline stages either hold operands of instructions in execution or
bubbles\footnote{A bubble is formed in pipeline stage $n$ when an instruction
is stuck in pipeline stage $n-1$, waiting for some data which is not available
yet.}. The execution history contains data states that in turn contain the
object code data, which we do not know how to relate simply to the source code
data. We therefore introduce a new classification.

A \emph{view} over internal states is a pair $(\mathcal{V},|.|)$ where
$\mathcal{V}$ is a finite non empty set and $|.|:\Delta \to \mathcal{V}$ is
a forgetful function over internal states.

The operand insensitive cost function $K$ is \emph{dependent on the view
$\mathcal{V}$} if there exists a $K': \mathcal{I} \times \mathcal{V} \to \mathbb{N}$ such that $K(I,\delta) = K'(I,|\delta|)$. In this case, with an abuse of
terminology, we will again identify $K$ with $K'$.

Among the possible views, the ones that we will easily be able to work
with in the labelling approach are the \emph{execution history dependent
views}. A view $(\mathcal{V},|.|)$ is execution history dependent with a
lookahead of length $n$ when there exists
a transition function $\hookrightarrow : PC^n \times \mathcal{V} \to \mathcal{V}$
such that for all $(\sigma,\delta)$ and $pc_1,\ldots,pc_n$ such that every
$pc_i$ is the program counter of $\sigma_i$ defined by $\sigma \longrightarrow^i \sigma_i$, we have $(\sigma,\delta) \Longrightarrow \delta'$ iff
$((pc_1,\ldots,pc_n),|\delta|) \hookrightarrow |\delta'|$.

Less formally, a view is
dependent on the execution history if the evolution of the views is fully
determined by the evolution of the program counters. To better understand
the definition, consider the case where the next instruction to be executed
is a conditional jump. Without knowing the values of the registers, it is
impossible to determine whether the true or the false branch will be taken.
Therefore it is likely to be impossible to determine the value of the view
that follows the current one. On the other hand, knowing the program counter
that will be reached by executing the conditional branch, we also know which
branch will be taken and this may be sufficient to compute the new view.
Lookaheads longer than 1 will be used in the case of pipelines: when executing
one instruction in a system with a pipeline of length $n$, the internal state
of the pipeline already holds information on the next $n$ instructions to
be executed.

The reference to execution histories in the name is due to the following
fact: every execution history dependent transition function $\hookrightarrow$
can be lifted to the type $EP \times \mathcal{V} \to \mathcal{V}$ by folding
the definition over the path trace:
$((pc_0,\ldots,pc_m),v_0) \hookrightarrow v_{m-n+1}$ iff for all $i \leq m - n$,
$((pc_i,\ldots,pc_{i+n}),v_i) \hookrightarrow v_{i+1}$.
Moreover, the folding is clearly associative:
$(\tau_1 @ \tau_2, v) \hookrightarrow v''$ iff
$(\tau_1,v) \hookrightarrow v'$ and $(\tau_2,v') \hookrightarrow v''$.
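
A minimal sketch of this lifting, for a lookahead of length 1 and with a dummy
single-step transition table of our own invention, is the following C
fragment; it also checks the associativity of the fold on one example path
(the junction program counter is shared between the two sub-paths because of
the lookahead).

\begin{verbatim}
#include <stddef.h>
#include <stdio.h>

typedef unsigned pc_t;
typedef unsigned view_t;

/* Dummy stand-in for ((pc_i,pc_{i+1}),v) |-> v'. */
static view_t next_view(pc_t pc, pc_t pc_next, view_t v) {
  return (v + pc + pc_next) % 4;
}

/* Lifting to finite execution paths by folding, as in the text. */
static view_t fold_view(const pc_t *pc, size_t len, view_t v) {
  for (size_t i = 0; i + 1 < len; i++)
    v = next_view(pc[i], pc[i + 1], v);
  return v;
}

int main(void) {
  pc_t path[] = { 0, 4, 8, 12, 16 };
  view_t whole  = fold_view(path, 5, 0);
  view_t prefix = fold_view(path, 3, 0);          /* pc 0,4,8   */
  view_t split  = fold_view(path + 2, 3, prefix); /* pc 8,12,16 */
  printf("whole = %u, split = %u\n", whole, split); /* equal */
  return 0;
}
\end{verbatim}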

As a final definition, we say that a cost function $K$ is
\emph{data independent} if it is dependent on a view that is execution path
dependent. In two paragraphs we will show how we can extend the
labelling approach to deal with data independent cost models.

Before that, we show that the class of data independent cost functions
is not too restricted to be interesting. In particular, at least simple
pipeline models admit data independent cost functions.

\paragraph{A data independent cost function for simple pipelines}
We consider here a simple model for a pipeline with $n$ stages, without branch
prediction and without hazards. We also assume that the actual values of the
operands of the instruction being read have no influence on stalls (i.e. on
the creation of bubbles) nor on the execution cost. The type of the operands,
however, can. For example, reading the value 4 from a register may stall the
pipeline if the register has not been written yet, while reading 4 from a
different register may not stall the pipeline.

The internal states $\Delta$ of the pipeline are $n$-tuples of decoded
instructions or bubbles:
$\Delta = (\mathcal{I} \times \Gamma \cup \mathbbm{1})^n$.
This representation is not meant to describe the real data structures used
in the pipeline: in the implementation the operands are not present in every
stage of the pipeline, but are progressively fetched. A state
$(x_1,x_2,\ldots,(I,\gamma))$ represents the state of the pipeline just before
the completion of instruction $(I,\gamma)$. The first $n-1$ instructions that
follow $I$ may already be stored in the pipeline, unless bubbles have delayed
one or more of them.

We introduce the following view over internal states:
$(\{0,1\}^n,|.|)$ where
$|(x_1,\ldots,x_n)| = (y_1,\ldots,y_n)$ and $y_i$ is 1 iff $x_i$ is a bubble.
Thus the view only remembers which stages of the pipeline are stuck.
The view holds enough information to reconstruct the internal state given
the current data state: from the data state we can fetch the program counters
of the current and the next $n-1$ instructions and, by simulating at most
$n$ execution steps and by knowing where the bubbles were, we can fill up
the internal state of the pipeline.

The assumption on the lack of influence of operand values on stalls and
execution times ensures the existence of the data independent cost function
$K: PC \times \{0,1\}^n \to \mathbb{N}$. The transition function for a
pipeline with $n$ stages may require a lookahead of length $n$:
$\hookrightarrow : PC^n \times \{0,1\}^n \to \{0,1\}^n$.
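
To fix ideas, here is a toy C instance of this view for $n = 4$ stages: the
view is a 4-bit word whose $i$-th bit records a bubble in stage $i$. The
stall predicate is pure invention; in reality it would be derived from the
instructions found at the given program counters.

\begin{verbatim}
#include <stdio.h>

#define N 4  /* number of pipeline stages */

/* Hypothetical stall predicate over the next N program counters. */
static int stalls(const unsigned pc[N]) { return pc[0] % 3 == 0; }

/* hookrightarrow : PC^N x {0,1}^N -> {0,1}^N */
static unsigned next_view(const unsigned pc[N], unsigned view) {
  unsigned v = view >> 1;              /* instructions advance one stage */
  if (stalls(pc)) v |= 1u << (N - 1);  /* a bubble enters the pipeline   */
  return v & ((1u << N) - 1);
}

int main(void) {
  unsigned pcs[N] = { 100, 104, 108, 112 };
  printf("next view = %u\n", next_view(pcs, 0));
  return 0;
}
\end{verbatim}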

While the model is a bit simplistic, it can nevertheless be used to describe
existing pipelines. It is also easy to see that the same model
also captures static branch prediction: speculative execution of conditional
jumps is performed by always taking the same branch, which does not depend on
the execution history. In order to take into account jump prediction based on
the execution history, we just need to incorporate into the status and the
view statistical information on the last executions of the branch.

\paragraph{The labelling approach for data independent cost models}
We now describe how the labelling approach can be slightly modified to deal
with a data independent cost model $(\hookrightarrow,K)$ built over
$(\mathcal{V},|.|)$.

In the labelling approach, every basic block in the object code is identified
by a unique label $L$, which is also associated with the corresponding
basic block at the source level. Let us assume that labels are also inserted
after every function call and that this property is preserved during
compilation. Adding labels after calls makes the instrumented code heavier
to read and generates more proof obligations on the instrumented code, but
it does not create any additional problems. The preservation during
compilation creates some significant technical complications in the proof of
correctness of the compiler, but those can be solved.

The static analysis performed in the last step of the basic labelling approach
analyses the object code in order to assign a cost to every label or,
equivalently, to every basic block. The cost is simply the sum of the costs
of every instruction in the basic block.

In our scenario, instructions no longer have a fixed cost, because the cost
function $K$ takes as input not only a program counter but also a view $v$.
Therefore we replace the static analysis with the computation, for every basic
block and every $v \in \mathcal{V}$, of the sum of the costs of the
instructions in the block, starting in the initial view $v$. Formally, let the
sequence of the program counters of the basic block form the execution path
$pc_0,\ldots,pc_n$. The cost $K(v_0,L)$ associated to the block labelled with
$L$ and the initial view $v_0$ is
$K(pc_0,v_0) + K(pc_1,v_1) + \ldots + K(pc_n,v_n)$ where for every $i < n$,
$((pc_i,\ldots,pc_{i+l}),v_i) \hookrightarrow v_{i+1}$ where $l$ is the
lookahead required. When the lookahead requires program counters outside the
block under analysis, we are free to use dummy ones because the parts of the
view that deal with future behaviour have no impact on the cost of the
previous operations by assumption.
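
The computation just described amounts to a simple fold over the block. The
following C sketch tabulates the dependent cost of one block for every
possible initial view; both tables are dummies of our own making and the
lookahead is fixed to 1.

\begin{verbatim}
#include <stddef.h>
#include <stdio.h>

/* Dummy per instruction cost K : PC x V -> N and view transition. */
static unsigned K_pc(unsigned pc, unsigned view) { return 1 + (view & 1); }
static unsigned next_view(unsigned pc, unsigned pc2, unsigned view) {
  return (view + pc2) % 4;
}

/* Cost of a block under an initial view: sum the per instruction costs
   while threading the view through the block. */
static unsigned block_cost(const unsigned *pc, size_t len, unsigned v) {
  unsigned total = 0;
  for (size_t i = 0; i < len; i++) {
    total += K_pc(pc[i], v);
    /* past the end of the block a dummy program counter suffices: by
       assumption it cannot change the costs already accounted for */
    v = next_view(pc[i], i + 1 < len ? pc[i + 1] : 0, v);
  }
  return total;
}

int main(void) {
  unsigned block[] = { 100, 104, 108 };
  for (unsigned v = 0; v < 4; v++)   /* one entry per initial view */
    printf("K(%u, L) = %u\n", v, block_cost(block, 3, v));
  return 0;
}
\end{verbatim}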

The static analysis can be performed
in linear time in the size of the program because the cardinality of the set
of labels (i.e. the number of basic blocks) is bounded by the size of the
program and because the set $\mathcal{V}$ is finite. In the case of the
pipelines of the previous paragraph, the static analysis is $2^n$ times more
expensive than that of the basic labelling approach, where $n$ is the number
of pipeline stages.

The first important theorem in the labelling approach is the correctness of
the static analysis: if the (dependent) cost associated to a label $L$ is $k$,
then executing a program from the beginning of the basic block to the end
of the basic block should take exactly $k$ cost units. The proof only relies
on associativity and commutativity of the composition of costs. Commutativity
is only required if basic blocks can be nested, i.e. if a basic block does not
terminate when it reaches a call, but continues after the called function
returns. By assuming a cost label after each call, we no longer need
commutativity, which does not hold for the definition of $K$ we
just gave. The reason is that, if $pc_i$ is a function call executed in
the view (state) $v_i$, it is not true that, after the return, the state
will be the $v_{i+1}$ defined by $((pc_i,pc_{i+1}),v_i) \hookrightarrow v_{i+1}$
(assuming a lookahead of 1, which is already problematic).
Indeed $pc_{i+1}$ is the program counter of the instruction that follows
the call, whereas the next program counter to be reached is that of the
body of the call. Moreover, even if the computation made sense,
$v_{i+1}$ would be the state at the beginning of the execution of the body of
the call, while we should know the state after the function returns.
The latter cannot be
statically predicted. That is why we had to impose labels after calls.
Associativity, on the other hand, trivially holds. Therefore the proof
of correctness of the static analysis can be reused without any change.

So far, we have computed the dependent cost $K: \mathcal{V} \times \mathcal{L} \to \mathbb{N}$ that associates a cost to basic blocks and views.
The second step consists in statically computing the transition function
$\hookrightarrow : \mathcal{L} \times \mathcal{L} \times \mathcal{V} \to \mathcal{V}$ that
associates to each pair of consecutively executed basic blocks and input view
the view obtained at the end of the execution of the first block.

The definition is the following:
$(L,L',v) \hookrightarrow v'$ iff $((pc_0,\ldots,pc_n,pc'_0,\ldots,pc'_m),v) \hookrightarrow v'$ where $(pc_0,\ldots,pc_n)$ are the program counters of the
block labelled by $L$ and $(pc'_0,\ldots,pc'_m)$ are those of the block
labelled with $L'$. We assume here that $m$ is always greater than or equal to
the lookahead required by the transition function $\hookrightarrow$ over
execution paths. When this is not the case we could make the new transition
function take as input a longer lookahead of labels. Or we may assume to
introduce enough \texttt{NOP}s at the beginning of the block $L'$ to enforce
the property. In the rest of the paragraph we assume to have followed the
second approach to simplify the presentation.

The extended transition function over labels is not present in the basic
labelling approach. Actually, the basic labelling
approach can be understood as the generalized approach where the set of views
is $\mathcal{V} = \mathbbm{1}$.
The computation of the extended $\hookrightarrow$ transition function
is again linear in the size of the program.

Both the versions of $K$ and $\hookrightarrow$ defined over labels can be
lifted to work over traces by folding them over the list of labels in the
trace: for $K$ we have $K((L_1,\ldots,L_n),v) = K(L_1,v) + K((L_2,\ldots,L_n),v')$ where
$(L_1,L_2,v) \hookrightarrow v'$; for $\hookrightarrow$ we have
$((L_1,\ldots,L_n),v) \hookrightarrow v''$ iff $(L_1,L_2,v) \hookrightarrow v'$ and $((L_2,\ldots,L_n),v') \hookrightarrow v''$. The two definitions are also
clearly associative.

The second main theorem of the labelling approach is trace preservation:
the trace produced by the object code is the same as the trace produced by
the source code. Without any need to change the proof, we immediately obtain
as a corollary that, for every view $v$, the cost $K(\tau,v)$ computed from
the source code trace $\tau$ is the same as the cost $K(\tau,v)$ computed
on the object code trace, which is again $\tau$.

The final step of the labelling approach is source code instrumentation.
In the basic labelling approach it consists in adding a global variable
\texttt{\_\_cost}, initialized with 0, which is incremented at the beginning
of every basic block with the cost of the label of the basic block.
Here we just need a more complex instrumentation that keeps track of the
values of the views during execution:
\begin{itemize}
 \item We define three global variables \texttt{\_\_cost}, initialized at 0,
   \texttt{\_\_label}, initialized with \texttt{NULL}, and
   \texttt{\_\_view}, uninitialized.
 \item At the beginning of every basic block labelled by $L$ we add the
  following code fragment:

  \begin{tabular}{l}
  \texttt{\_\_view = \_\_next(\_\_label,}$L$\texttt{,\_\_view);}\\
  \texttt{\_\_cost += }$K(\mbox{\texttt{\_\_view}},L)$\texttt{;}\\
  \texttt{\_\_label = }$L$\texttt{;}
  \end{tabular}

  where \texttt{\_\_next(}$L_1,L_2,v$\texttt{)} = $v'$ iff
  $(L_1,L_2,v) \hookrightarrow v'$ unless $L_1$ is \texttt{NULL}.
  In that case \texttt{\_\_next(NULL,}$L,v$\texttt{)} = $v_0$ where
  $v_0 = |\delta_0|$ and $\delta_0$ is the initial value of the internal state
  at the beginning of program execution.

  The first line of the code fragment computes the view at the beginning of
  the execution of the block from the view at the end of the previous block.
  Then we update the cost variable with the cost of the block. Finally we
  remember the current block, to use it for the computation of the next view
  at the beginning of the next block.
\end{itemize}

\paragraph{An example of instrumentation in the presence of a pipeline}
In Figure~\ref{factprogi} we show what the instrumentation of a program
that computes the factorial of 10 would look like in the presence of a
pipeline. The instrumentation has been produced by hand. The \texttt{\_\_next}
function says that the body of the internal loop of the \texttt{fact}
function can be executed in two different internal states, summarized by
views 2 and 3. View 2 holds at the beginning of even iterations,
while view 3 holds at the beginning of odd ones. Therefore even and odd
iterations are assigned different costs. The code after the loop can also
be executed in two different states, depending on the parity of the last
loop iteration.

\begin{figure}[t]
\begin{tiny}
\begin{verbatim}
int fact (int n) {
  int i, res = 1;
  for (i = 1 ; i <= n ; i++) res *= i;
  return res;
}

int main () {
  return (fact(10));
}
\end{verbatim}
\end{tiny}
\caption{A simple program that computes the factorial of 10.\label{factprog}}
\end{figure}

The definitions of \texttt{\_\_next} and \texttt{\_\_K} are just examples.
For instance, it is possible as well that each one of the 10 iterations
is executed in a different internal state.

\begin{figure}[t]
\begin{tiny}
\begin{verbatim}
int __cost = 8;
int __label = 0;
int __view;

void __cost_incr(int incr) {
  __cost = __cost + incr;
}

int __next(int label1, int label2, int view) {
       if (label1 == 0 && label2 == 1)              return 1;
  else if (label1 == 0)                             return 0;
  else if (label1 == 1 && label2 == 2)              return 2;
  else if (label1 == 2 && label2 == 2 && view == 2) return 3;
  else if (label1 == 2 && label2 == 2 && view == 3) return 2;
  else if (label1 == 2 && label2 == 3 && view == 2) return 1;
  else if (label1 == 2 && label2 == 3 && view == 3) return 0;
  else if (label1 == 3 && label2 == 4 && view == 0) return 0;
  else if (label1 == 3 && label2 == 4 && view == 1) return 0;
  else                                              return 0; /* unreachable here */
}

int __K(int view, int label) {
       if (view == 0 && label == 0) return 3;
  else if (view == 1 && label == 1) return 14;
  else if (view == 2 && label == 2) return 35;
  else if (view == 3 && label == 2) return 26;
  else if (view == 0 && label == 3) return 6;
  else if (view == 1 && label == 3) return 8;
  else if (view == 0 && label == 4) return 6;
  else                              return 0; /* unreachable here */
}

int fact(int n)
{
  int i;
  int res;
  __view = __next(__label,1,__view); __cost_incr(__K(__view,1)); __label = 1;
  res = 1;
  for (i = 1; i <= n; i = i + 1) {
    __view = __next(__label,2,__view); __cost_incr(__K(__view,2)); __label = 2;
    res = res * i;
  }
  __view = __next(__label,3,__view); __cost_incr(__K(__view,3)); __label = 3;
  return res;
}

int main(void)
{
  int t;
  __view = __next(__label,0,__view); __cost_incr(__K(__view,0)); __label = 0;
  t = fact(10);
  __view = __next(__label,4,__view); __cost_incr(__K(__view,4)); __label = 4;
  return t;
}
\end{verbatim}
\end{tiny}
\caption{The instrumented version of the program in Figure~\ref{factprog}.\label{factprogi}}
\end{figure}

\paragraph{Considerations on the instrumentation}
The example of instrumentation in the previous paragraph shows that the
approach we are proposing exposes at the source level a certain amount
of information about the machine behaviour. Syntactically, the additional
details are almost entirely confined to the \texttt{\_\_next} and
\texttt{\_\_K} functions and they do not affect at all the functional
behaviour of the program. In particular, all invariants, proof obligations and
proofs that deal only with the functional behaviour are preserved.

The interesting question, then, is what the impact of the additional
details is on non functional (intensional) invariants and proof obligations.
At the moment, without a working implementation to perform some large scale
tests, it is difficult to understand the level of automation that can be
achieved and the techniques that are likely to work better without introducing
major approximations. In any case, the preliminary considerations of the
project remain valid:
\begin{itemize}
 \item The task of computing and proving invariants can be simplified,
   even automatically, at the price of some precision. For example, the
   most aggressive approximation simply replaces the cost function
   \texttt{\_\_K} with the function that ignores the view and returns for each
   label the maximum cost over all possible views. Correspondingly, the
   function \texttt{\_\_next} can be dropped since it returns views that
   are later ignored.

   A more refined possibility consists in approximating the output only on
   those labels whose jitter is small or on those that mark basic blocks
   that are executed only a small number of times. By simplifying the
   \texttt{\_\_next} function accordingly, it is possible to considerably
   reduce the search space for automated provers.
 \item The situation is not worse than what happens with time analysis on
   the object code (the current state of the art). There it is common practice
   to analyse larger chunks of execution to minimize the effect of later
   approximations. For example, if it is known that a loop can be executed at
   most 10 times, computing the cost of 10 iterations yields a
   better bound than multiplying the worst case of a single iteration by 10.

   We clearly can do the same at the source level.
   More generally, every algorithm that works in standard WCET tools on the
   object code is likely to have a counterpart on the source code.
   We also expect to be able to do better working on the source code.
   The reason is that we assume to know more functional properties
   of the program and more high level invariants, and to have more techniques
   and tools at our disposal. Even if at the moment we have no evidence to
   support our claims, we think that this approach is worth pursuing in the
   long term.
\end{itemize}

\paragraph{The problem with caches}
Cost models for pipelines --- at least simple ones --- are data independent,
i.e. they are dependent on a view that is execution path dependent. In other
words, the knowledge of the sequence of executed instructions is sufficient
to predict the cost of future instructions.

The same property does not hold for caches. The cost of accessing a memory
cell strongly depends on the addresses of the memory cells that have been read
in the past. In turn, the accessed addresses are a function of the low level
data state, which cannot be correlated to the source program state.

The strong correlation between the internal state of caches and the data
accessed in the past is one of the two main causes of the lack of
precision of static analysis on modern uni-core architectures. The other one
is the lack of precise knowledge of the real behaviour of modern hardware
systems. In order to overcome both problems, which Cazorla et
al.~\cite{proartis} call the ``\emph{Timing Analysis Walls}'', the PROARTIS
European Project has proposed to design ``\emph{a hardware/software
architecture whose execution timing behaviour eradicates dependence on
execution history}'' (\cite{proartis}, Section 1.2). The statement is
obviously too strong. What is concretely proposed by PROARTIS is the design
of a hardware/software architecture whose execution timing is \emph{execution
path dependent} (in our terminology).

We have already seen that we are able to accommodate in the labelling approach
cost functions that are dependent on views that are execution path dependent.
Before fully embracing the PROARTIS vision,
we need to check whether there are other aspects of the PROARTIS proposal
that can be problematic for CerCo.

\paragraph{Static Probabilistic Time Analysis}
The approach of PROARTIS to achieve execution path dependent cost models
consists in turning the hard-to-analyse deterministic hardware components
(e.g. the cache) into probabilistic hardware components. Intuitively,
algorithms that took decisions based on the program history now throw dice.
The typical example, which has been thoroughly studied in
PROARTIS~\cite{proartis2}, is that of caches. There the proposal is to replace
the commonly used deterministic placement and replacement algorithms (e.g. LRU)
with fully probabilistic choices: when the cache needs to evict a page, the
page to be evicted is randomly selected according to the uniform distribution.

The expectation is that probabilistic hardware will have worse performance
in the average case, but it will exhibit the worst case performance only with
negligible probability. Therefore, it is no longer interesting to estimate
the actual worst case bound. What becomes interesting is to plot the
probability that the execution time will exceed a certain threshold. For all
practical purposes, a program that misses its deadline with a negligible
probability (e.g. $10^{-9}$ per hour of operation) will be perfectly
acceptable when deployed on a hardware system (e.g. a car or an airplane)
that is already specified in such a way.

In order to plot the probability distribution of execution times, PROARTIS
proposes two methodologies: Static Probabilistic Time Analysis (SPTA) and
Measurement Based Probabilistic Time Analysis (MBPTA). The first one is
similar to traditional static analysis, but it operates on probabilistic
hardware. It is the one that we would like to embrace. The second one is
based on measurements and it is justified by the following assumption: if
the probabilities associated to every hardware operation are all independent
and identically distributed, then measuring the time spent on full runs of
sub-system components yields a probabilistic estimate that remains valid
when the sub-system is plugged into a larger one. Moreover, the probabilistic
distribution of past runs must be equal to that of future runs.

We understand that MBPTA is useful to analyse closed (sub-)systems whose
functional behaviour is deterministic. It does not seem to have immediate
applications to parametric time analysis, which we are interested in.
Therefore we focus on SPTA.

According to~\cite{proartis},
``\emph{in SPTA, execution time probability distributions for individual operations \ldots are determined statically
from a model of the processor. The design principles of PROARTIS
will ensure \ldots that the probabilities for the execution time of each
instruction are independent. \ldots SPTA is performed by calculating the
convolution ($\oplus$) of the discrete probability distributions which describe
the execution time for each instruction on a CPU; this provides a probability
distribution \ldots representing the timing behaviour of the entire sequence of
instructions.}''

We will now analyse to what extent we can embrace SPTA in CerCo.

\paragraph{The labelling approach for Static Probabilistic Time Analysis}

To summarize, the main practical differences between standard static time
analysis and SPTA are the following:
\begin{itemize}
 \item The cost functions for single instructions or sequences of instructions
   no longer return natural numbers (numbers of cost units) but integral
   random variables.
 \item Cost functions are extended from single instructions to sequences of
   instructions by using the associative convolution operator $\oplus$.
\end{itemize}

By reviewing the papers that described the labelling approach, it is easy to
become convinced that the codomain of the cost analysis can be lifted from
the natural numbers to any group. Moreover, by imposing labels after
every function call, commutativity can be dropped and the approach works on
every monoid (usually called a \emph{cost monoid} in the literature).
Because random variables and convolutions form a monoid, we immediately have
that the labelling approach extends to SPTA. The instrumented code
produced by an SPTA-CerCo compiler will then have random variables (on a finite
domain) as costs and convolutions in place of the \texttt{\_\_cost\_incr}
function.
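
To illustrate the kind of computation such an instrumentation would perform,
the following C sketch convolves two discrete cost distributions; the
representation (arrays of probabilities indexed by cost units) and all the
numbers are our own illustrative choices.

\begin{verbatim}
#include <stdio.h>

#define MAXC 8  /* bound on the support of each distribution */

/* (a (+) b)[k] = sum over i+j=k of a[i]*b[j] */
static void convolve(const double a[MAXC], const double b[MAXC],
                     double out[2 * MAXC]) {
  for (int k = 0; k < 2 * MAXC; k++) out[k] = 0.0;
  for (int i = 0; i < MAXC; i++)
    for (int j = 0; j < MAXC; j++)
      out[i + j] += a[i] * b[j];
}

int main(void) {
  /* instruction 1: cost 2 with probability 0.9, cost 5 with probability 0.1 */
  double i1[MAXC] = { [2] = 0.9, [5] = 0.1 };
  /* instruction 2: cost 1 with probability 0.5, cost 3 with probability 0.5 */
  double i2[MAXC] = { [1] = 0.5, [3] = 0.5 };
  double seq[2 * MAXC];
  convolve(i1, i2, seq);
  for (int k = 0; k < 2 * MAXC; k++)
    if (seq[k] > 0.0) printf("P(cost = %d) = %.2f\n", k, seq[k]);
  return 0;
}
\end{verbatim}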

What is left to be understood is the way to state and compute the
probabilistic invariants needed to do \emph{parametric SPTA}. Indeed, it seems
that PROARTIS was only interested in non parametric PTA. For example, it is
well known that actually computing the convolutions results in an exponential
growth of the memory required to represent the result of the convolutions.
Therefore, the analysis should work symbolically until the moment when
we are interested in extracting information from the convolution.

Moreover, assuming that the problem of computing invariants is solved,
the actual behaviour of automated theorem
provers on probabilistic invariants remains to be understood. It is likely
that a good amount of domain specific knowledge about probability theory must
be exploited and incorporated into automatic provers to achieve concrete
results.

Parametric SPTA using the methodology developed in CerCo is a future research
direction that we believe to be worth exploring in the middle and long term.

\paragraph{Static Probabilistic Time Analysis for Caches in CerCo}

As a final remark, we note that the analysis in CerCo of systems that implement
probabilistic caches requires a combination of SPTA and data independent cost
models. The need for a probabilistic analysis is obvious but, as we saw in
the previous paragraph, it requires no modification of the labelling approach.

In order to understand the need for dependent labelling (to work on data
independent cost functions), we need to review the behaviour of probabilistic
caches as proposed by PROARTIS. The interested reader can
consult~\cite{proartis2-articolo30} for further information.

In a randomized cache, the probability of evicting a given line on every access
is $1/N$ where $N$ stands for the number of cache entries. Therefore the
hit probability of a specific access to such a cache is
$P(hit) = (\frac{N-1}{N})^K$ where $K$ is the number of cache misses between
two consecutive accesses to the same cache entry. For the purposes of our
analysis, we must assume that every cache access could cause an eviction.
Therefore, we define $K$ (the \emph{reuse distance}) to be the number of
memory accesses between two consecutive accesses to the same cache entry,
including the access for which we are computing $K$. In order to compute
$K$ for every code memory address, we need to know the execution path (in
our terminology). In other words, we need a view that records for each
cache entry the number of memory accesses that have occurred since the last
access.
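
As a quick numerical illustration of the formula, the following C fragment
plots $P(hit)$ for a hypothetical randomized cache with $N = 32$ entries and
increasing reuse distances.

\begin{verbatim}
#include <math.h>
#include <stdio.h>

int main(void) {
  const double N = 32.0;  /* hypothetical number of cache entries */
  for (int K = 1; K <= 256; K *= 4)
    printf("K = %3d  P(hit) = %.3f\n", K, pow((N - 1.0) / N, K));
  return 0;
}
\end{verbatim}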

For pipelines with $n$ stages, the number of possible views is limited to
$2^n$: a view can usually just be represented by a machine word. This is not
the case for the views on caches, which are in principle very large. Therefore,
the dependent labelling approach for data independent cost functions that we
have presented here could still be impractical for caches. If that turns out
to be the case, a possible strategy is the use of abstract interpretation
techniques on the object code to reduce the size of the views exposed at the
source level, at the price of an early loss of precision in the analysis.

More research work must be performed at the current stage to understand
whether caches can be analysed, even probabilistically, using the CerCo
technology. This is left for future work and will be postponed until after the
work on pipelines.

\paragraph{Conclusions}
At the current state of the art, the functional properties of programs are
better proved on high level languages, while the non functional ones are
proved on the corresponding object code. The non functional analysis, however,
depends on functional invariants, e.g. to bound or correlate the number of
iterations of loops.

The aim of the CerCo project is to reconcile the two analyses by performing
the non functional analysis on the source code. This requires computing a cost
model on the object code and reflecting the cost model on the source code.
We achieve this in CerCo by designing a certified Cost Annotating Compiler
that keeps track of transformations of basic blocks, in order to create a
correspondence (not necessarily a bijection) between the basic blocks of the
source and target language. We then prove that the sequences of basic blocks
that are met in the source and target executions are correlated. Then
we perform a static analysis of the cost of basic blocks on the target
language and we use it to compute the cost model for the source language basic
blocks. Finally, we compute cost invariants on the source code from the
inferred cost model and from the functional program invariants; we generate
proof obligations for the invariants; and we use automatic provers to try to
close the proof obligations.

The cost of single instructions on modern architectures depends on the internal
state of various hardware components (pipelines, caches, branch prediction
units, etc.). The internal states are determined by the previous execution
history. Therefore the cost of basic blocks is parametric on the execution
history, which means both the instructions executed and the data manipulated
by the instructions. The CerCo approach is able to correlate the sequence
of blocks of source instructions with the sequence of blocks of target
instructions. It does not correlate the high level and the low level data.
Therefore we are not able, in the general case, to lift to the source code a
cost model that is parametric on the execution history.

To overcome the problem, we have identified a particular class of cost models
that do not depend on the data manipulated. We argue that the CerCo
approach can cover this scenario by exposing in the source program a finite
data type of views over internal machine states. The cost of basic blocks
is parametric on these views, and the current view is updated at basic block
entry according to some abstraction of the machine hardware that the user does
not need to understand. Further studies are needed to understand how invariant
generators and automatic provers can cope with these updates and parametric
costs.

We have argued that pipelines, at least simple ones, are captured by the
previous scenario and can in principle be handled using the CerCo tools.
The same is not true for caches, whose behaviour deeply depends on the
data manipulated. By embracing the PROARTIS proposal of turning caches into
probabilistic components, we can break the data dependency. Nevertheless,
cache analysis remains more problematic because of the size of the views.
Further studies need to focus on caches to understand whether the problem of
the size of the views can be tamed in practice without ruining the whole
approach.

\end{document}