\documentclass[11pt, epsf, a4wide]{article}

\usepackage{../style/cerco}

\usepackage{amsfonts}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage[english]{babel}
\usepackage{graphicx}
\usepackage[utf8x]{inputenc}
\usepackage{listings}
\usepackage{stmaryrd}
\usepackage{url}
\usepackage{bbm}

\begin{document}

\title{Dependent labelling applied to stateful hardware components}
\author{G. Pulcini$^*$, C. Sacerdoti Coen\footnote{The project CerCo acknowledges the financial support of the Future and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the European Commission, under FET-Open grant number: 243881}\\~\\
\texttt{\{pulcini,sacerdot\}@cs.unibo.it}\\~
\\Dipartimento di Informatica - Scienza ed Ingegneria\\
Universit\`a di Bologna}
\date{03/04/2013}

\maketitle

\abstract{The Certified Complexity (CerCo) EU Project~\cite{cerco} aims at integrating and unifying the functional and non functional analyses of safety and time critical software by performing both of them together on the source code. It is based on the labelling approach~\cite{labelling}, which makes it possible to write compilers that induce on the source code a sound and precise cost model for basic blocks. The cost model is computed on object code blocks and then transferred to the source code by reversing the optimizations applied to the control flow~\cite{paolo}.

In this technical report we address the important issue of stateful hardware, whose instruction costs are a function of the internal state. Typical examples are pipelines and caches. In order to avoid loss of precision, the cost model must be parametric on the state, which has no correspondent at the source level. We show how to enrich the source code with the minimal amount of information that allows costs to be computed precisely. We also briefly argue in favour of probabilistic static time analysis and we examine how it is already supported by the labelling approach.
}

\section{Introduction}
The labelling approach~\cite{labelling,paolo} is a new technique to implement
compilers that induce on the source code a sound and precise cost model for
basic blocks. The cost model is computed on object code blocks and then
transferred to the source code by reversing the optimizations applied to the
control flow~\cite{paolo}. In the rest of this technical report we assume
that the reader is familiar with the labelling approach.

At the end of the CerCo project, we are convinced that the labelling approach
can support most optimizations currently implemented by real world compilers.
Moreover, we have formally certified the code of a compiler that transfers
cost models according to the labelling approach, greatly reducing the trusted
code base of time analysis. The main issue that is left is to identify the
classes of cost models that can be expressed at the source level and that we
can automatically reason about. In particular, we know that cost models for
modern hardware are complex objects that assign to each instruction a cost which
also depends on the internal state of stateful hardware components that are
designed to optimize the average case. Pipelines, caches and branch predictors are
typical examples of components that have no impact on the functional behavior
of the code, but that greatly improve its performance.

Ignorance of the internal state of these components during a WCET analysis
forces one to assume the worst case, leading to useless time bounds. Therefore,
standard WCET tools tend to statically execute code fragments as much as
possible, to minimize the ignorance of the state. For loops, for example, are
totally unrolled. In CerCo we could do the same, departing from the labelling
approach. However, it is more natural to stay closer to the labelling
approach and avoid losing precision by computing parametric exact costs.
This is the topic of this research report.

In Section~\ref{premises} we briefly review the premises and goals of the
CerCo approach in order to compare it to the traditional
WCET one, which is able to deal with modern hardware.
In Section~\ref{controlflow} we compare the control flow analyses performed
in CerCo and in traditional WCET, arguing that our approach has the edge.
In Section~\ref{static} we compare the static analyses performed in CerCo
and in traditional WCET. Only the latter can deal with modern hardware.
Section~\ref{static2} contains the new material. We revise our static analysis
and the labelling approach in general to accommodate modern hardware.
Tentative conclusions are drawn in Section~\ref{conclusions}.

\section{\label{premises}CerCo: premises and goals}

We briefly review
here the premises and goals of the CerCo approach to resource analysis.
\begin{itemize}
 \item There is a lot of recent and renewed activity in the formal methods
   community to accommodate resource analysis using techniques derived from
   functional analysis (type systems, logics, abstract interpretation,
   amortized analysis applied to data structures, etc.).
 \item Most of this work, which currently remains at a theoretical level,
   is focused on high level languages and it assumes the existence of correct
   and compositional resource cost models.
 \item High level languages are compiled to object code by compilers that
   should respect the functional properties of the program. However, because
   of optimizations and the inherently non compositional nature of compilation,
   compilers do not respect compositional cost models that are imposed a priori
   on the source language. By controlling the compiler and coupling it with a
   WCET analyser, it is actually possible to
   choose the cost model in such a way that the cost bounds are high enough
   to bound the cost of every produced code. This was attempted for instance
   in the EmBounded project~\cite{embounded} with good success. However, we believe that bounds
   obtained in this way have little chance of being tight.
 \item Therefore our approach consists in having the compiler generate the
   cost model for the user by combining tracking of basic blocks during code
   transformations with a static resource analysis on the object code for basic
   blocks. We formally prove that the compiler respects the cost model that is
   induced on the source level, based on very few assumptions: essentially,
   the composition of the costs of a sequence of instructions should be
   associative and commutative,
   and the cost should not depend on the machine status, except for its program counter.
   Commutativity can be relaxed at the price of introducing more cost updates
   in the instrumented source code.
 \item The cost model for basic blocks induced on the source language must then
   be exploited to derive cost invariants and to prove them automatically.
   In CerCo we have shown how even simple invariant generation techniques
   are sufficient to enable the fully automatic proving of parametric WCET
   bounds for simple C programs and for Lustre programs of arbitrary complexity.
\end{itemize}

Compared to traditional WCET techniques, our approach currently has many
similarities, some advantages and some limitations. Both techniques need to
perform data flow analysis on the control flow graph of the program and both
techniques need to estimate the cost of blocks of instructions.

\section{Control flow analysis\label{controlflow}}

The first main difference is in the control flow analysis. Traditional WCET
starts from object code and reconstructs the control flow graph from it.
Moreover, abstract interpretation is heavily employed to bound the number of
executions of loops. In order to improve the accuracy of the estimation, control
flow constraints are provided by the user, usually as systems of (linear)
inequalities. In order to do this, the user, helped by the system, needs to
relate the object code control flow graph with the source one, because it is
on the latter that the bounds can be figured out and understood. This
operation is untrusted and potentially error prone for complex optimizations
(like aggressive loop optimizations). Efficient tools from linear algebra are
then used to solve the systems of inequalities obtained from the abstract
interpreter and from the user constraints.

In CerCo, instead, we assume full control of the compiler, which
is able to relate, even in non trivial ways, the object code control flow graph
to the source code control flow graph. A clear disadvantage is the
impossibility of applying the tool to the object code produced by third party
compilers. On the other hand, we remove the possibility of errors
in the reconstruction of the control flow graph and in the translation of
high level constraints into low level constraints. The second potentially
important advantage is that, once we are dealing with the source language,
we can augment the precision of our dataflow analysis by combining
functional and non functional invariants. This is what we attempted with
the CerCo Cost Annotating Frama-C Plug-In. The Frama-C architecture allows
several plug-ins to perform all kinds of static analysis on the source code,
reusing results from other plug-ins and augmenting the source code with their
results. The techniques are absolutely not limited to linear algebra and
abstract interpretation, and the most important plug-ins call domain specific
and general purpose automated theorem provers to close proof obligations of
arbitrary shape and complexity.

In principle, the extended flexibility of the analysis should give our
technique a major advantage in terms of precision, especially
considering that all the analyses used in traditional WCET can still be implemented
as plug-ins. In particular, the target we have in mind are systems that are
both (hard) real time and safety critical. Being safety critical, we can already
expect them to be fully or partially specified at the functional level.
Therefore we expect the additional functional invariants to allow us to
augment the precision of the cost bounds, up to the point where the parametric
cost bound is fully precise.
In practice, we have not had the time to perform extensive
comparisons on the kind of code used by industry in production systems.
The first medium-term improvement of CerCo would then consist in this kind
of analysis, to support or disprove our expectations. It seems that the
newborn TACLe COST Action (Timing Analysis on Code Level) would be the
best framework to achieve this improvement.
Should our
technique remain promising, the next long-term improvement would consist in
integrating into the Frama-C plug-in ad hoc analyses and combinations of analyses
that would improve the coverage and efficiency of the cost estimation
techniques.

\section{Static analysis of costs of basic blocks\label{static}}
At the beginning of the project we deliberately decided to focus our
attention on control flow preservation, cost model propagation and
the exploitation of the cost model induced on the high level code. For this
reason we devoted almost no attention to the static analysis of basic
blocks. This was achieved by picking a very simple hardware architecture
(the 8051 microprocessor family) whose cost model is fully predictable and
compositional: the cost of every instruction --- except those that deal with
I/O --- is constant, i.e. it does not depend on the machine status.
We do not regret this choice because, with the limited amount of manpower
available in the project, it would have been difficult to also consider this
aspect. However, without showing that the approach can scale to more complex
architectures, our methodology remains of limited interest to industry.
Therefore, the next important medium-term improvement will be the extension
of our methodology to cover pipelines and simple caches. We will now present our
ideas on how we intend to address the problem. The obvious long-term
improvement would be to take into consideration multicore systems and complex
memory architectures like the ones currently in use in networks-on-chip.
The problem of execution time analysis for these systems is currently
considered extremely hard or even infeasible and at the moment it seems
unlikely that our methodology could contribute to its solution.

\section{Static analysis of costs of basic blocks revisited\label{static2}}
We will now describe what currently seems to be the most interesting technique
for the static analysis of the cost of basic blocks in the presence of complex
hardware architectures that do not have compositional cost models.

We start by presenting an idealized model of the execution of a generic
microprocessor (with caches) that has all interrupts disabled and no I/O
instructions. We then classify such models according to some properties of
their cost model. Finally, we show how to extend the labelling approach of CerCo
to cover models that are classified in a certain way.

\paragraph{The microprocessor model}

Let $\sigma,\sigma_1,\ldots$ range over $\Sigma$, the set of the fragments of
the microprocessor states that hold the program counter, the program status word and
all the data manipulated by the object code program, i.e. registers and
memory cells. We call these fragments the \emph{data states}.

Let $\delta,\delta_1,\ldots$ range over $\Delta$, the set of the
fragments of the microprocessor state that hold the \emph{internal state} of the
microprocessor (e.g. the content of the pipeline and caches, the status of
the branch prediction unit, etc.).
The internal state of the microprocessor influences the execution cost of the
next instruction, but it has no effect on the functional behaviour of the
processor. The whole state of the processor is represented by a pair
$(\sigma,\delta)$.

Let $I,I_1,\ldots$ range over $\mathcal{I}$, the
set of \emph{instructions} of the processor and let
$\gamma,\gamma_1,\ldots$ range over $\Gamma$, the set of \emph{operands}
of instructions after the fetching and decoding passes.
Thus a pair $(I,\gamma)$ represents a \emph{decoded instruction} and already contains
the data required for execution. Execution needs to access the data state only
to write the result.

Let $fetch: \Sigma \to \mathcal{I} \times \Gamma$ be the function that
performs the fetching and decoding phases, returning the decoded instruction
ready for execution. This is not meant to be the real fetch-decode function,
which also exploits the internal state to speed up execution (e.g. by retrieving
the instruction arguments from caches) and which, in the case of pipelines, works
in several stages. However, such a function exists and
it is observationally equivalent to the real fetch-decode.

We capture the semantics of the microprocessor with the following set of
functions:
\begin{itemize}
 \item The \emph{functional transition} function $\longrightarrow : \Sigma \to \Sigma$
   over data states. This is the only part of the semantics that is relevant
   to functional analysis.
 \item The \emph{internal state transition} function
   $\Longrightarrow : \Sigma \times \Delta \to \Delta$ that updates the internal
   state.
 \item The \emph{cost function} $K: \mathcal{I} \times \Gamma \times \Delta \to \mathbb{N}$
   that assigns a cost to transitions. Since decoded instructions hold the data
   they act on, the cost of an instruction may depend both on the data being
   manipulated and on the internal state.
\end{itemize}

Given a processor state $(\sigma,\delta)$, the processor evolves into the
new state $(\sigma',\delta')$ in $n$ cost units if
$\sigma \longrightarrow \sigma'$ and $(\sigma,\delta) \Longrightarrow \delta'$
and $fetch(\sigma) = (I,\gamma)$ and $K(I,\gamma,\delta) = n$.

An \emph{execution history} is a stream of states and transitions
$\sigma_0 \longrightarrow \sigma_1 \longrightarrow \sigma_2 \ldots$
that can be either finite or infinite. Given an execution history, the
corresponding \emph{execution path} is the stream of program counters
obtained from the execution history by forgetting all the remaining information.
The execution path of the history $\sigma_0 \longrightarrow \sigma_1 \longrightarrow \sigma_2 \ldots$ is $pc_0,pc_1,\ldots$ where each $pc_i$ is the program
counter of $\sigma_i$. We denote the set of finite execution paths by $EP$.

We claim this simple model to be generic enough to cover real world
architectures.

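For concreteness, the model can be transcribed almost literally into code.
The following C sketch is ours and purely illustrative: all names and
representations are hypothetical, no such data structures exist in the CerCo
tools, and the four semantic functions are left abstract since their bodies
depend on the concrete processor.

\begin{verbatim}
typedef struct {            /* data state sigma                        */
  unsigned pc, psw;         /* program counter, program status word    */
  unsigned regs[8];         /* registers                               */
  unsigned char mem[65536]; /* memory cells                            */
} data_state;

typedef struct {            /* internal state delta (pipeline, caches,
                               branch prediction unit, ...)            */
  unsigned char repr[64];   /* opaque: only cost-relevant              */
} internal_state;

typedef struct { unsigned opcode, operands[2]; } decoded_instr; /* (I,gamma) */

decoded_instr  fetch(const data_state *s);                /* fetch-decode    */
data_state     fun_step(const data_state *s);             /* sigma -> sigma' */
internal_state int_step(const data_state *s,
                        const internal_state *d);         /* => on delta     */
unsigned       cost(decoded_instr i, const internal_state *d);  /* K         */

/* One processor step: charge K(I,gamma,delta) cost units, then evolve
   both components of the state.                                       */
unsigned step(data_state *s, internal_state *d) {
  decoded_instr i = fetch(s);
  unsigned n = cost(i, d);
  *d = int_step(s, d);      /* s is still the pre-step data state here */
  *s = fun_step(s);
  return n;
}
\end{verbatim}
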
\paragraph{Classification of cost models}
A cost function is \emph{exact} if it assigns to transitions the real cost
incurred. It is \emph{approximated} if it returns an upper bound of the real
cost.

A cost function is \emph{operand insensitive} if it does not depend on the
operands of the instructions to be executed. Formally, $K$ is operand insensitive
if there exists a $K': \mathcal{I} \times \Delta \to \mathbb{N}$ such that
$K(I,\gamma,\delta) = K'(I,\delta)$. In this case, with an abuse of terminology,
we will identify $K$ with $K'$.

The cost functions of simple hardware architectures, in particular RISC ones,
are naturally operand insensitive. In the other cases an exact operand sensitive
cost function can always be turned into an approximated operand insensitive one
by taking $K'(I,\delta) = \max\{K(I,\gamma,\delta)~|~\gamma \in \Gamma\}$.
The question, when one performs this approximation, is how severe the
approximation is. A measure is given by the \emph{jitter}, which is defined
as the difference between the best and worst cases. In our case, the jitter
of the approximation $K'$ would be $\max\{K(I,\gamma,\delta)~|~\gamma \in \Gamma\} - \min\{K(I,\gamma,\delta)~|~\gamma \in \Gamma\}$. According to experts
in WCET analysis, the jitter due to operand sensitivity in modern
architectures is small enough to keep WCET estimations useful.
Therefore, in the sequel we will focus on operand insensitive cost models only.

Note that cost models that are operand insensitive may still have significant
dependencies on the data manipulated by the instructions, because of the
dependency on internal states. For example, an instruction that reads data
from memory may change the state of the cache and thus greatly affect the
execution time of subsequent instructions.

Nevertheless, operand insensitivity is an important property for the labelling
approach. In~\cite{paolo} we introduced \emph{dependent labels} and
\emph{dependent costs}, which make it possible to assign to basic
blocks of instructions costs that also depend on the state of the high level
program at the beginning of the block. The idea we will now try to pursue is
to exploit dependent costs to capture cost models that are sensitive to
the internal states of the microprocessor. Operand sensitivity, however, is
a major issue in the presence of dependent labels: to lift a data sensitive cost
model from the object code to the source language, we need a function that
maps high level states to the operands of the instructions to be executed,
and we need these functions to be simple enough to allow reasoning over them.
Complex optimizations performed by the compiler, however, make the mappings
extremely cumbersome and history dependent. Moreover, keeping
track of status transformations during compilation would be a significant
departure from classical compilation models which we are not willing to
undertake. By assuming operand insensitivity, we get rid of part
of the problem: we only need to make our costs dependent on the internal
state of the microprocessor. The latter, however, is not at all visible
in the high level code. Our next step is to make it visible.

In general, the value of the internal state at a certain point in the program
history is affected by all the preceding history. For instance, the
pipeline stages either hold operands of instructions in execution or bubbles%
\footnote{A bubble is formed in pipeline stage $n$ when an instruction is stuck in pipeline stage $n-1$, waiting for some data which is not available yet.}. The execution history contains data states that in turn contain the
object code data, which we do not know how to relate simply to the source code
data. We therefore introduce a new classification.

A \emph{view} over internal states is a pair $(\mathcal{V},|.|)$ where
$\mathcal{V}$ is a finite non empty set and $|.|:\Delta \to \mathcal{V}$ is
a forgetful function over internal states.

The operand insensitive cost function $K$ is \emph{dependent on the view
$\mathcal{V}$} if there exists a $K': \mathcal{I} \times \mathcal{V} \to \mathbb{N}$ such that $K(I,\delta) = K'(I,|\delta|)$. In this case, with an abuse of terminology, we will identify $K$ with $K'$.

Among the possible views, the ones that we will easily be able to work
with in the labelling approach are the \emph{execution history dependent views}.
A view $(\mathcal{V},|.|)$ is execution history dependent with a lookahead
of length $n$ when there exists
a transition function $\hookrightarrow : PC^n \times \mathcal{V} \to \mathcal{V}$
such that for all $(\sigma,\delta)$ and $pc_1,\ldots,pc_n$ such that every
$pc_i$ is the program counter of $\sigma_i$ defined by $\sigma \longrightarrow^i \sigma_i$, we have $(\sigma,\delta) \Longrightarrow \delta'$ iff
$((pc_1,\ldots,pc_n),|\delta|) \hookrightarrow |\delta'|$.

Less formally, a view is
dependent on the execution history if the evolution of the views is fully
determined by the evolution of the program counters. To better understand
the definition, consider the case where the next instruction to be executed
is a conditional jump. Without knowing the values of the registers, it is
impossible to determine whether the true or the false branch will be taken. Therefore
it is likely to be impossible to determine the value of the view that follows
the current one. On the other hand, knowing the program counter that will be
reached by executing the conditional branch, we also know which
branch will be taken, and this may be sufficient to compute the new view.
Lookaheads longer than 1 will be used in the case of pipelines: when executing
one instruction in a system with a pipeline of length $n$, the internal state
of the pipeline already holds information on the next $n$ instructions to
be executed.

The reference to execution histories in the name is due to the following
fact: every execution history dependent transition function $\hookrightarrow$
can be lifted to the type $EP \times \mathcal{V} \to \mathcal{V}$ by folding
the definition over the path trace:
$((pc_0,\ldots,pc_m),v_0) \hookrightarrow v_{m-n+1}$ iff for all $i \leq m - n$,
$((pc_i,\ldots,pc_{i+n}),v_i) \hookrightarrow v_{i+1}$.
Moreover, the folding is clearly associative:
$(\tau_1 @ \tau_2, v) \hookrightarrow v''$ iff
$(\tau_1,v) \hookrightarrow v'$ and $(\tau_2,v') \hookrightarrow v''$.

As a final definition, we say that a cost function $K$ is
\emph{data independent} if it is dependent on a view that is execution path
dependent. Below we show how we can extend the
labelling approach to deal with data independent cost models.

Before that, we show that the class of data independent cost functions
is not too restricted to be interesting. In particular, at least simple
pipeline models admit data independent cost functions.
\paragraph{A data independent cost function for simple pipelines}
We consider here a simple model for a pipeline with $n$ stages, without branch
prediction and hazards. We also assume that the actual values of the operands
of the instruction being read have no influence on stalls (i.e. on the
creation of bubbles) nor on the execution cost. The type of the operands, however,
can. For example, reading the value 4 from a register may stall the pipeline
if the register has not been written yet, while reading 4 from a different
register may not stall the pipeline.

The internal states $\Delta$ of the pipeline are $n$-tuples of decoded
instructions or bubbles:
$\Delta = (\mathcal{I} \times \Gamma \cup \mathbbm{1})^n$.
This representation is not meant to describe the real data structures used
in the pipeline: in the implementation the operands are not present in every
stage of the pipeline, but are progressively fetched. A state
$(x_1,x_2,\ldots,(I,\gamma))$ represents the state of the pipeline just before
the completion of instruction $(I,\gamma)$. The first $n-1$ instructions that
follow $I$ may already be stored in the pipeline, unless bubbles have delayed
one or more of them.

We introduce the following view over internal states:
$(\{0,1\}^n,|.|)$ where
$|(x_1,\ldots,x_n)| = (y_1,\ldots,y_n)$ and $y_i$ is 1 iff $x_i$ is a bubble.
Thus the view only remembers which stages of the pipeline are stuck.
The view holds enough information to reconstruct the internal state given
the current data state: from the data state we can fetch the program counters
of the current and the next $n-1$ instructions and, by simulating at most
$n$ execution steps and knowing where the bubbles were, we can fill in
the internal state of the pipeline.

The assumptions on the lack of influence of operand values on stalls and
execution times ensure the existence of the data independent cost function
$K: PC \times \{0,1\}^n \to \mathbb{N}$. The transition function for a
pipeline with $n$ stages may require $n$ lookaheads:
$\hookrightarrow : PC^n \times \{0,1\}^n \to \{0,1\}^n$.

While the model is a bit simplistic, it can nevertheless be used to describe
existing pipelines. It is also easy to see that the same model
captures static branch prediction: speculative execution of conditional
jumps is performed by always taking the same branch, which does not depend on
the execution history. In order to take into account jump predictions based on
the execution history, we just need to incorporate into the state and the view
statistical information on the last executions of the branch.

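To illustrate, the transition function of this bubble-vector view could be
sketched in C as follows. The sketch is ours: the pipeline depth, the bit
encoding and the \texttt{causes\_stall} oracle are all hypothetical, the
latter standing for the hazard rules that would be extracted from the
documentation of a concrete pipeline.

\begin{verbatim}
#define N_STAGES 5                    /* n, the pipeline depth          */

typedef unsigned view_t;              /* a bit vector in {0,1}^n:
                                         bit i set = stage i holds a
                                         bubble                         */

/* Hypothetical hazard oracle: must the incoming instruction wait for a
   result still in flight, given the next N_STAGES program counters and
   the current bubble pattern?                                          */
int causes_stall(const unsigned pc[N_STAGES], view_t v);

/* One view transition (lookahead N_STAGES): advance the pipeline by one
   stage and insert a bubble at the entry stage on a stall.             */
view_t next_view(const unsigned pc[N_STAGES], view_t v) {
  view_t advanced = (v << 1) & ((1u << N_STAGES) - 1);
  return causes_stall(pc, v) ? (advanced | 1u) : advanced;
}
\end{verbatim}
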
\paragraph{The labelling approach for data independent cost models}
We now describe how the labelling approach can be slightly modified to deal
with a data independent cost model $(\hookrightarrow,K)$ built over
$(\mathcal{V},|.|)$.

In the labelling approach, every basic block in the object code is identified
by a unique label $L$ which is also associated to the corresponding
basic block at the source level. Let us assume that labels are also inserted
after every function call and that this property is preserved during
compilation. Adding labels after calls makes the instrumented code heavier
to read and generates more proof obligations on the instrumented code, but
it does not create any additional problems. The preservation during compilation
creates some significant technical complications in the proof of correctness
of the compiler, but those can be solved.

The static analysis performed in the last step of the basic labelling approach
analyses the object code in order to assign a cost to every label or,
equivalently, to every basic block. The cost is simply the sum of the costs
of every instruction in the basic block.

In our scenario, instructions no longer have a fixed cost, because the cost function
$K$ takes as input not only a program counter but also a view $v$. Therefore we
replace the static analysis with the computation, for every basic block and
every $v \in \mathcal{V}$, of the sum of the costs of the instructions in the
block, starting in the initial view $v$. Formally, let the sequence of the
program counters of the basic block form the execution path $pc_0,\ldots,pc_n$.
The cost $K(v_0,L)$ associated to the block labelled with
$L$ and the initial view $v_0$ is
$K(pc_0,v_0) + K(pc_1,v_1) + \ldots + K(pc_n,v_n)$ where for every $i < n$,
$((pc_i,\ldots,pc_{i+l}),v_i) \hookrightarrow v_{i+1}$ and $l$ is the
lookahead required. When the lookahead requires program counters outside the
block under analysis, we are free to use dummy ones, because the parts of the
view that deal with future behaviour have no impact on the cost of the
previous operations by assumption.

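The following C sketch renders this computation; it is ours and purely
illustrative, with \texttt{instr\_cost} and \texttt{view\_step} standing for
tabulations of $K$ and $\hookrightarrow$ extracted from the processor model,
and an arbitrary lookahead length.

\begin{verbatim}
#define LOOKAHEAD 2     /* l, illustrative */

unsigned instr_cost(unsigned pc, unsigned view);         /* K(pc, v)     */
unsigned view_step(const unsigned *pcs, unsigned view);  /* one |-> step */

/* Cost of the block with program counters pcs[0..n], starting in view
   v0. The caller pads pcs with LOOKAHEAD dummy counters at the end: by
   assumption future counters cannot influence the cost of earlier
   instructions.                                                        */
unsigned block_cost(const unsigned *pcs, int n, unsigned v0) {
  unsigned v = v0, total = 0;
  for (int i = 0; i <= n; i++) {
    total += instr_cost(pcs[i], v);   /* K(pc_i, v_i)                   */
    v = view_step(pcs + i, v);        /* consumes pc_i .. pc_{i+l}      */
  }
  return total;
}
\end{verbatim}

The static analysis would tabulate \texttt{block\_cost} for every label and
every one of the finitely many views.
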
The static analysis can be performed
in linear time in the size of the program because the cardinality of the set
of labels (i.e. the number of basic blocks) is bounded by the size of the
program and because the set $\mathcal{V}$ is finite. In the case of the pipelines of
the previous paragraph, the static analysis is $2^n$ times more expensive
than that of the basic labelling approach, where $n$ is the number of
pipeline stages.

The first important theorem in the labelling approach is the correctness of the
static analysis: if the (dependent) cost associated to a label $L$ is $k$,
then executing a program from the beginning of the basic block to the end
of the basic block should take exactly $k$ cost units. The proof only relies
on associativity and commutativity of the composition of costs. Commutativity
is only required if basic blocks can be nested, i.e. if a basic block does not
terminate when it reaches a call, but continues after the called function
returns. By assuming a cost label after each function call, we no longer need
commutativity, which indeed does not hold for the definition of $K$ we
just gave. The reason is that, if $pc_i$ is a function call executed in
the view (state) $v_i$, it is not true that, after the return, the state
will be the $v_{i+1}$ defined by $((pc_i,pc_{i+1}),v_i) \hookrightarrow v_{i+1}$
(assuming a lookahead of 1, which is already problematic).
Indeed $pc_{i+1}$ is the program counter of the instruction that follows
the call, whereas the next program counter to be reached is that of the
body of the call. Moreover, even if the computation made sense,
$v_{i+1}$ would be the state at the beginning of the execution of the body of
the call, while we should know the state after the function returns.
The latter cannot be
statically predicted. That is why we had to impose labels after calls.
Associativity, on the other hand, trivially holds. Therefore the proof
of correctness of the static analysis can be reused without any change.

So far, we have computed the dependent cost $K: \mathcal{V} \times \mathcal{L} \to \mathbb{N}$ that associates a cost to basic blocks and views.
The second step consists in statically computing the transition function
$\hookrightarrow : \mathcal{L} \times \mathcal{L} \times \mathcal{V} \to \mathcal{V}$ that
associates to each pair of consecutively executed basic blocks and input view
the view obtained at the end of the execution of the first block.

The definition is the following:
$(L,L',v) \hookrightarrow v'$ iff $((pc_0,\ldots,pc_n,pc'_0,\ldots,pc'_m),v) \hookrightarrow v'$ where $(pc_0,\ldots,pc_n)$ are the program counters of the block labelled by $L$ and $(pc'_0,\ldots,pc'_m)$ are those of the block labelled
by $L'$. We assume here that $m$ is always greater than or equal to the lookahead
required by the transition function $\hookrightarrow$ over execution paths.
When this is not the case we could make the new transition function take as
input a longer lookahead of labels. Alternatively, we may introduce enough
\texttt{NOP}s at the beginning of the block $L'$ to enforce the property.
In the rest of the paragraph we assume the second approach has been followed,
to simplify the presentation.

The extended transition function over labels is not present in the basic labelling approach. Actually, the basic labelling
approach can be understood as the generalized approach where the view is trivial: $\mathcal{V} = \mathbbm{1}$.
The computation of the extended $\hookrightarrow$ transition function
is again linear in the size of the program.

Both the versions of $K$ and $\hookrightarrow$ defined over labels can be
lifted to work over traces by folding them over the list of labels in the
trace: for $K$ we have $K((L_1,\ldots,L_n),v) = K(L_1,v) + K((L_2,\ldots,L_n),v')$ where
$(L_1,L_2,v) \hookrightarrow v'$; for $\hookrightarrow$ we have
$((L_1,\ldots,L_n),v) \hookrightarrow v''$ iff $(L_1,L_2,v) \hookrightarrow v'$ and $((L_2,\ldots,L_n),v') \hookrightarrow v''$. The two definitions are also
clearly associative.

The second main theorem of the labelling approach is trace preservation:
the trace produced by the object code is the same as the trace produced by
the source code. Without any need to change the proof, we immediately obtain
as a corollary that for every view $v$, the cost $K(\tau,v)$ computed from
the source code trace $\tau$ is the same as the cost $K(\tau,v)$ computed
on the object code trace, which is again $\tau$.

The final step of the labelling approach is source code instrumentation.
In the basic labelling approach it consists in adding a global variable
\texttt{\_\_cost}, initialized to 0, which is incremented at the beginning
of every basic block with the cost of the label of the basic block.
Here we just need a more complex instrumentation that keeps track of the
values of the views during execution:
\begin{itemize}
 \item We define three global variables: \texttt{\_\_cost}, initialized to 0,
   \texttt{\_\_label}, initialized to \texttt{NULL}, and
   \texttt{\_\_view}, uninitialized.
 \item At the beginning of every basic block labelled by $L$ we add the
  following code fragment:

  \begin{tabular}{l}
  \texttt{\_\_view = \_\_next(\_\_label,}$L$\texttt{,\_\_view);}\\
  \texttt{\_\_cost += }$K(\mbox{\texttt{\_\_view}},L)$\texttt{;}\\
  \texttt{\_\_label = }$L$\texttt{;}
  \end{tabular}

  where \texttt{\_\_next(}$L_1,L_2,v$\texttt{)} = $v'$ iff
  $(L_1,L_2,v) \hookrightarrow v'$, unless $L_1$ is \texttt{NULL}.
  In that case \texttt{\_\_next(NULL,}$L$\texttt{,}$v$\texttt{)} = $v_0$ where
  $v_0 = |\delta_0|$ and $\delta_0$ is the initial value of the internal state
  at the beginning of program execution.

  The first line of the code fragment computes the view at the beginning of
  the execution of the block from the view at the end of the previous block.
  Then we update the cost variable with the cost of the block. Finally, we
  record the current block label to use it in the computation of the next view
  at the beginning of the next block.
\end{itemize}

\paragraph{An example of instrumentation in presence of a pipeline}
Figure~\ref{factprogi} shows what the instrumentation of a program
that computes the factorial of 10 would look like in the presence of a pipeline.
The instrumentation has been produced by hand. The \texttt{\_\_next}
function says that the body of the internal loop of the \texttt{fact}
function can be executed in two different internal states, summarized by
views 2 and 3. View 2 holds at the beginning of even iterations,
while view 3 holds at the beginning of odd ones. Therefore even and odd
iterations are assigned different costs. The code after the loop can also
be executed in two different states, depending on the parity of the last
loop iteration.

\begin{figure}[t]
\begin{tiny}
\begin{verbatim}
int fact (int n) {
  int i, res = 1;
  for (i = 1 ; i <= n ; i++) res *= i;
  return res;
}

int main () {
  return (fact(10));
}
\end{verbatim}
\end{tiny}
\caption{A simple program that computes the factorial of 10.\label{factprog}}
\end{figure}

The definitions of \texttt{\_\_next} and \texttt{\_\_K} are just examples.
For instance, it is equally possible that each one of the 10 iterations
is executed in a different internal state.

\begin{figure}[t]
\begin{tiny}
\begin{verbatim}
int __cost = 8;
int __label = 0;   /* 0 encodes NULL */
int __view;

void __cost_incr(int incr) {
  __cost = __cost + incr;
}

int __next(int label1, int label2, int view) {
       if (label1 == 0 && label2 == 0)              return 0;
  else if (label1 == 0 && label2 == 1)              return 1;
  else if (label1 == 1 && label2 == 2)              return 2;
  else if (label1 == 2 && label2 == 2 && view == 2) return 3;
  else if (label1 == 2 && label2 == 2 && view == 3) return 2;
  else if (label1 == 2 && label2 == 3 && view == 2) return 1;
  else if (label1 == 2 && label2 == 3 && view == 3) return 0;
  else if (label1 == 3 && label2 == 4 && view == 0) return 0;
  else /* label1 == 3, label2 == 4, view == 1 */    return 0;
}

int __K(int view, int label) {
       if (view == 0 && label == 0) return 3;
  else if (view == 1 && label == 1) return 14;
  else if (view == 2 && label == 2) return 35;
  else if (view == 3 && label == 2) return 26;
  else if (view == 0 && label == 3) return 6;
  else if (view == 1 && label == 3) return 8;
  else if (view == 0 && label == 4) return 6;
  else                              return 0;
}

int fact(int n)
{
  int i;
  int res;
  __view = __next(__label,1,__view); __cost_incr(__K(__view,1)); __label = 1;
  res = 1;
  for (i = 1; i <= n; i = i + 1) {
    __view = __next(__label,2,__view); __cost_incr(__K(__view,2)); __label = 2;
    res = res * i;
  }
  __view = __next(__label,3,__view); __cost_incr(__K(__view,3)); __label = 3;
  return res;
}

int main(void)
{
  int t;
  __view = __next(__label,0,__view); __cost_incr(__K(__view,0)); __label = 0;
  t = fact(10);
  __view = __next(__label,4,__view); __cost_incr(__K(__view,4)); __label = 4;
  return t;
}
\end{verbatim}
\end{tiny}
\caption{The instrumented version of the program in Figure~\ref{factprog}.\label{factprogi}}
\end{figure}

\paragraph{Considerations on the instrumentation}
The example of instrumentation in the previous paragraph shows that the
approach we are proposing exposes at the source level a certain amount
of information about the machine behavior. Syntactically, the additional
details are almost entirely confined to the \texttt{\_\_next} and
\texttt{\_\_K} functions and they do not affect the functional behaviour
of the program at all. In particular, all invariants, proof obligations and proofs
that deal only with the functional behavior are preserved.

The interesting question, then, is what impact the additional
details have on non functional (intensional) invariants and proof obligations.
At the moment, without a working implementation to perform some large scale
tests, it is difficult to understand the level of automation that can be
achieved and the techniques that are likely to work better without introducing
major approximations. In any case, the preliminary considerations of the
project remain valid:
\begin{itemize}
 \item The task of computing and proving invariants can be simplified,
   even automatically, at the price of precision. For example, the
   most aggressive approximation simply replaces the cost function
   \texttt{\_\_K} with the function that ignores the view and returns for each
   label the maximum cost over all possible views. Correspondingly, the
   function \texttt{\_\_next} can be dropped since it returns views that
   are later ignored.

   A more refined possibility consists in approximating the output only on
   those labels whose jitter is small or on those that mark basic blocks
   that are executed only a small number of times. By simplifying the
   \texttt{\_\_next} function accordingly, it is possible to considerably
   reduce the search space for automated provers.
 \item The situation is not worse than what happens with time analysis on
   the object code (the current state of the art). There it is common practice
   to analyse larger chunks of execution to minimize the effect of later
   approximations. For example, if it is known that a loop can be executed at
   most 10 times, computing the cost of 10 iterations yields a
   better bound than multiplying the worst case of a single iteration by 10.

   We can clearly do the same at the source level.
   More generally, every algorithm that works in standard WCET tools on the
   object code is likely to have a counterpart on the source code.
   We also expect to be able to do better working on the source code.
   The reason is that we assume to know more functional properties
   of the program and more high level invariants, and to have more techniques
   and tools at our disposal. Even if at the moment we have no evidence to
   support our claims, we think that this approach is worth pursuing in the
   long term.
\end{itemize}

\paragraph{The problem with caches}
Cost models for pipelines --- at least simple ones --- are data independent,
i.e. they are dependent on a view that is execution path dependent. In other
words, knowledge of the sequence of executed instructions is sufficient
to predict the cost of future instructions.

The same property does not hold for caches. The cost of accessing a memory
cell strongly depends on the addresses of the memory cells that have been read
in the past. In turn, the accessed addresses are a function of the low level
data state, which cannot be correlated to the source program state.

The strong correlation between the internal state of caches and the data
accessed in the past is one of the two main causes of the lack of
precision of static analysis on modern uni-core architectures. The other one
is the lack of precise knowledge of the real behaviour of modern hardware
systems. In order to overcome both problems, which Cazorla et al.~\cite{proartis}
call the ``\emph{Timing Analysis Walls}'', the PROARTIS European Project has
proposed to design ``\emph{a hardware/software architecture whose execution
timing behaviour eradicates dependence on execution history}'' (\cite{proartis}, Section 1.2). The statement is obviously too strong. What is concretely
proposed by PROARTIS is the design of a hardware/software architecture whose
execution timing is \emph{execution path dependent} (our terminology).

We have already seen that we are able to accommodate in the labelling approach
cost functions that are dependent on views that are execution path dependent.
Before fully embracing the PROARTIS vision,
we need to check whether there are other aspects of the PROARTIS proposal
that can be problematic for CerCo.

\paragraph{Static Probabilistic Time Analysis}
The approach of PROARTIS to achieving execution path dependent cost models
consists in turning the hard-to-analyze deterministic hardware components
(e.g. the cache) into probabilistic hardware components. Intuitively,
algorithms that took decisions based on the program history now throw dice.
The typical example, which has been thoroughly studied in
PROARTIS~\cite{proartis2}, is that of caches. There the proposal is to replace
the commonly used deterministic placement and replacement algorithms (e.g. LRU)
with fully probabilistic choices: when the cache needs to evict a line, the
line to be evicted is selected at random according to the uniform distribution.

The expectation is that probabilistic hardware will have worse performance
in the average case, but will exhibit the worst case performance only with
negligible probability. Therefore, it is no longer interesting to estimate
the actual worst case bound. What becomes interesting is to plot the probability
that the execution time will exceed a certain threshold. For all practical
purposes, a program that misses its deadline with a negligible probability
(e.g. $10^{-9}$ per hour of operation) will be perfectly acceptable when deployed
on a hardware system (e.g. a car or an airplane) that is already specified in
such a way.

In order to plot the probability distribution of execution times, PROARTIS
proposes two methodologies: Static Probabilistic Time Analysis (SPTA) and
Measurement Based Probabilistic Time Analysis (MBPTA). The first one is
similar to traditional static analysis, but it operates on probabilistic
hardware. It is the one that we would like to embrace. The second one is
based on measurements and is justified by the following assumption: if
the probabilities associated to every hardware operation are all independent
and identically distributed, then measuring the time spent on full runs of
sub-system components yields a probabilistic estimate that remains valid
when the sub-system is plugged into a larger one. Moreover, the probability
distribution of past runs must be equal to that of future runs.

We understand that MBPTA is useful to analyze closed (sub-)systems whose
functional behavior is deterministic. It does not seem to have immediate
applications to parametric time analysis, which is what we are interested in.
Therefore we focus on SPTA.

According to~\cite{proartis},
``\emph{in SPTA, execution time probability distributions for individual operations \ldots are determined statically
from a model of the processor. The design principles of PROARTIS
will ensure \ldots that the probabilities for the execution time of each
instruction are independent. \ldots SPTA is performed by calculating the
convolution ($\oplus$) of the discrete probability distributions which describe
the execution time for each instruction on a CPU; this provides a probability
distribution \ldots representing the timing behaviour of the entire sequence of
instructions.}''

We will now analyze to what extent we can embrace SPTA in CerCo.

\paragraph{The labelling approach for Static Probabilistic Time Analysis}

To summarize, the main practical differences between standard static time
analysis and SPTA are the following:
\begin{itemize}
 \item The cost functions for single instructions or sequences of instructions
   no longer return natural numbers (numbers of cost units) but integer-valued
   random variables.
 \item Cost functions are extended from single instructions to sequences of
   instructions by means of the associative convolution operator $\oplus$.
\end{itemize}

By reviewing the papers that describe the labelling approach, it is easy to
convince oneself that the codomain of the cost analysis can be lifted from
natural numbers to any group. Moreover, by imposing labels after
every function call, commutativity can be dropped and the approach works on
every monoid (usually called a \emph{cost monoid} in the literature).
Because random variables with convolution form a monoid, the labelling
approach immediately extends to SPTA. The instrumented code
produced by an SPTA-CerCo compiler will then have random variables (on a finite
domain) as costs and convolutions in place of the \texttt{\_\_cost\_incr}
function.

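To make the cost monoid concrete, the following C sketch (ours; the bounded
support and its size are assumptions made to keep the code finite) shows a
discrete distribution type whose composition is convolution and whose neutral
element is the point distribution at time 0.

\begin{verbatim}
#define MAX_T 256            /* illustrative bound on execution times  */

typedef struct { double p[MAX_T]; } dist;   /* p[t] = P(time = t)      */

/* Neutral element of the monoid: all the mass on time 0.              */
dist unit(void) { dist d = {{0}}; d.p[0] = 1.0; return d; }

/* Convolution, the monoid operation: the time of "a followed by b" is
   the sum of two independent times, so c(t) = sum_k a(k) * b(t-k).
   Mass beyond MAX_T-1 is silently truncated in this sketch.           */
dist convolve(const dist *a, const dist *b) {
  dist c = {{0}};
  for (int t = 0; t < MAX_T; t++)
    for (int k = 0; k <= t; k++)
      c.p[t] += a->p[k] * b->p[t - k];
  return c;
}
\end{verbatim}

An SPTA instrumentation would then replace \texttt{\_\_cost\_incr} with a
call to \texttt{convolve} on a global distribution-valued \texttt{\_\_cost}.
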
What is left to be understood is how to state and compute the
probabilistic invariants needed for \emph{parametric SPTA}. Indeed, it seems that
PROARTIS was only interested in non parametric PTA. For example, it is
well known that actually computing the convolutions results in an exponential
growth of the memory required to represent their result.
Therefore, the analysis should work symbolically until the moment when
we want to extract information from the convolution.

Moreover, assuming that the problem of computing invariants is solved,
the actual behavior of automated theorem
provers on probabilistic invariants remains to be understood. It is likely that
a good amount of domain specific knowledge about probability theory must be
exploited and incorporated into automatic provers to achieve concrete results.

Parametric SPTA using the methodology developed in CerCo is a future research
direction that we believe to be worth exploring in the medium and long term.

\paragraph{Static Probabilistic Time Analysis for Caches in CerCo}

As a final remark, we note that the analysis in CerCo of systems that implement
probabilistic caches requires a combination of SPTA and data independent cost
models. The need for a probabilistic analysis is obvious but, as we saw in
the previous paragraph, it requires no modification of the labelling approach.

In order to understand the need for dependent labelling (to work on data
independent cost functions), we need to review the behaviour of probabilistic
caches as proposed by PROARTIS. The interested reader can
consult~\cite{proartis2} for further information.

In a randomized cache, the probability of evicting a given line on every access
is $1/N$ where $N$ stands for the number of cache entries. Therefore the
hit probability of a specific access to such a cache is
$P(hit) = (\frac{N-1}{N})^K$ where $K$ is the number of cache misses between
two consecutive accesses to the same cache entry. For the purposes of our
analysis, we must assume that every cache access could cause an eviction.
Therefore, we define $K$ (the \emph{reuse distance}) to be the number of
memory accesses between two consecutive accesses to the same cache entry,
including the access for which we are computing $K$. In order to compute
$K$ for every code memory address, we need to know the execution path (in
our terminology). In other words, we need a view that records, for each
cache entry, the number of memory accesses that have occurred since the last
access.

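For illustration, the hit probability and the shape of such a view can be
sketched in C as follows; the code is ours and the number of cache entries
is arbitrary.

\begin{verbatim}
#include <math.h>

#define N_ENTRIES 1024   /* illustrative cache size N */

/* The (large) view: for each cache entry, the number of memory
   accesses since the entry was last touched.                          */
typedef struct { unsigned since_last[N_ENTRIES]; } cache_view;

/* P(hit) = ((N-1)/N)^K for an access with reuse distance K: each of
   the K intervening accesses evicts the entry with probability 1/N.   */
double hit_probability(unsigned K) {
  return pow((double)(N_ENTRIES - 1) / (double)N_ENTRIES, (double)K);
}
\end{verbatim}
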
For pipelines with $n$ stages, the number of possible views is limited to
$2^n$: a view can usually be represented by a single machine word. This is not the
case for the views on caches, which are in principle very large. Therefore,
the dependent labelling approach for data independent cost functions that we
have presented here could still be impractical for caches. If that turns out
to be the case, a possible strategy is the use of abstract interpretation
techniques on the object code to reduce the size of the views exposed at the
source level, at the price of an early loss of precision in the analysis.

More research work must be performed at the current stage to understand whether
caches can be analyzed, even probabilistically, using the CerCo technology.
This is left for future work and will be postponed until after the work on
pipelines.

\section{Conclusions\label{conclusions}}
At the current state of the art, the functional properties of programs are best
proved on high level languages, while the non functional ones are proved on the
corresponding object code. The non functional analysis, however, depends on
functional invariants, e.g. to bound or correlate the number of executions of
loops.

The aim of the CerCo project is to reconcile the two analyses by performing the
non functional analysis on the source code. This requires computing a cost
model on the object code and reflecting the cost model onto the source code.
We achieve this in CerCo by designing a certified Cost Annotating Compiler that
keeps track of transformations of basic blocks, in order to create a
correspondence (not necessarily a bijection) between the basic blocks of the
source and target language. We then prove that the sequences of basic blocks
that are met in the source and target executions are correlated. Then,
we perform a static analysis of the cost of basic blocks on the target language
and we use it to compute the cost model for the source language basic blocks.
Finally, we compute cost invariants on the source code from the inferred cost
model and from the functional program invariants; we generate proof obligations
for the invariants; and we use automatic provers to try to close the proof
obligations.

The cost of single instructions on modern architectures depends on the internal
state of various hardware components (pipelines, caches, branch prediction
units, etc.). The internal states are determined by the previous execution
history. Therefore the cost of basic blocks is parametric on the execution
history, which comprises both the instructions executed and the data manipulated
by those instructions. The CerCo approach is able to correlate the sequence
of blocks of source instructions with the sequence of blocks of target
instructions. It does not correlate the high level and the low level data.
Therefore we are not able, in the general case, to lift a cost model parametric
on the execution history to the source code.

To overcome the problem, we have identified a particular class of cost models
that are not dependent on the data manipulated. We argue that the CerCo
approach can cover this scenario by exposing in the source program a finite
data type of views over internal machine states. The costs of basic blocks
are parametric on these views, and the current view is updated at basic block
entry according to some abstraction of the machine hardware that need not
be understood by the user. Further studies are needed to understand how invariant
generators and automatic provers can cope with these updates and parametric
costs.

We have argued how pipelines, at least simple ones, are captured by the
previous scenario and can in principle be handled using CerCo tools.
The same is not true for caches, whose behaviour deeply depends on the
data manipulated. By embracing the PROARTIS proposal of turning caches into
probabilistic components, we can break the data dependency. Nevertheless,
cache analysis remains more problematic because of the size of the views.
Further studies need to focus on caches to understand whether the size of
the views can be tamed in practice without ruining the whole approach.

\begin{thebibliography}{}

\bibitem{cerco} \textbf{Certified Complexity}, R. Amadio, A. Asperti, N. Ayache, B. Campbell, D. Mulligan, R. Pollack, Y. Régis-Gianas, C. Sacerdoti Coen, I. Stark, in Procedia Computer Science, Volume 7, 2011, Proceedings of the 2nd European Future Technologies Conference and Exhibition 2011 (FET 11), 175-177.

\bibitem{labelling} \textbf{Certifying and Reasoning on Cost Annotations in C Programs}, N. Ayache, R.M. Amadio, Y. Régis-Gianas, in Proc. FMICS, Springer LNCS 7437: 32-46, 2012, DOI:10.1007/978-3-642-32469-7\_3.

\bibitem{paolo} \textbf{Indexed Labels for Loop Iteration Dependent Costs}, P. Tranquilli, in Proceedings of the 11th International Workshop on Quantitative Aspects of Programming Languages and Systems (QAPL 2013), Rome, 23rd-24th March 2013, Electronic Proceedings in Theoretical Computer Science, to appear in 2013.

\bibitem{embounded} \textbf{The EmBounded project (project paper)}, K. Hammond, R. Dyckhoff, C. Ferdinand, R. Heckmann, M. Hofmann, H. Loidl, G. Michaelson, J. Serot, A. Wallace, in Trends in Functional Programming, Volume 6, Intellect Press, 2006.

\bibitem{proartis} \textbf{PROARTIS: Probabilistically Analysable Real-Time Systems}, F.J. Cazorla, E. Quiñones, T. Vardanega, L. Cucu, B. Triquet, G. Bernat, E. Berger, J. Abella, F. Wartel, M. Houston, et al., in ACM Transactions on Embedded Computing Systems, 2012.

\bibitem{proartis2} \textbf{A Cache Design for Probabilistic Real-Time Systems}, L. Kosmidis, J. Abella, E. Quiñones, and F. Cazorla, in Design, Automation, and Test in Europe (DATE), Grenoble, France, 03/2013.

\end{thebibliography}

\end{document}