\documentclass[11pt, epsf, a4wide]{article}

\usepackage{../style/cerco}

\usepackage{amsfonts}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage[english]{babel}
\usepackage{graphicx}
\usepackage[utf8x]{inputenc}
\usepackage{listings}
\usepackage{stmaryrd}
\usepackage{url}
\usepackage{bbm}

\title{
INFORMATION AND COMMUNICATION TECHNOLOGIES\\
(ICT)\\
PROGRAMME\\
\vspace*{1cm}Project FP7-ICT-2009-C-243881 \cerco{}}

\lstdefinelanguage{matita-ocaml}
  {keywords={ndefinition,ncoercion,nlemma,ntheorem,nremark,ninductive,nrecord,nqed,nlet,let,in,rec,match,return,with,Type,try},
   morekeywords={[2]nwhd,nnormalize,nelim,ncases,ndestruct},
   morekeywords={[3]type,of},
   mathescape=true,
  }

\lstset{language=matita-ocaml,basicstyle=\small\tt,columns=flexible,breaklines=false,
        keywordstyle=\color{red}\bfseries,
        keywordstyle=[2]\color{blue},
        keywordstyle=[3]\color{blue}\bfseries,
        commentstyle=\color{green},
        stringstyle=\color{blue},
        showspaces=false,showstringspaces=false}

\lstset{extendedchars=false}
\lstset{inputencoding=utf8x}
\DeclareUnicodeCharacter{8797}{:=}
\DeclareUnicodeCharacter{10746}{++}
\DeclareUnicodeCharacter{9001}{\ensuremath{\langle}}
\DeclareUnicodeCharacter{9002}{\ensuremath{\rangle}}

\date{}
\author{}
\begin{document}

\thispagestyle{empty}

\vspace*{-1cm}
\begin{center}
\includegraphics[width=0.6\textwidth]{../style/cerco_logo.png}
\end{center}

\begin{minipage}{\textwidth}
\maketitle
\end{minipage}

\vspace*{0.5cm}
\begin{center}
\begin{LARGE}
\textbf{
Report n. D6.3\\
Final Report on User Validation}
\end{LARGE}
\end{center}

\vspace*{2cm}
\begin{center}
\begin{large}
Version 1.0
\end{large}
\end{center}

\vspace*{0.5cm}
\begin{center}
\begin{large}
Main Authors:\\
Dominic P. Mulligan and Claudio Sacerdoti Coen
\end{large}
\end{center}

\vspace*{\fill}

\noindent
Project Acronym: \cerco{}\\
Project full title: Certified Complexity\\
Proposal/Contract no.: FP7-ICT-2009-C-243881 \cerco{}\\

\clearpage
\pagestyle{myheadings}
\markright{\cerco{}, FP7-ICT-2009-C-243881}

\newpage

\vspace*{7cm}
\paragraph{Abstract}
This deliverable reviews the user validation experience gained in the CerCo project with the exploitation of the cost annotations produced by the CerCo compiler, and identifies middle and long term improvements that could bring the compiler to its full potential. In particular, we discuss how the labelling approach can be extended to hardware architectures with pipelines and caches, by exposing at the source level views over the internal machine state, and we relate the approach to the static probabilistic time analysis proposed by the PROARTIS project.

\newpage

\tableofcontents

\newpage

\section{Task}
\label{sect.task}

The Grant Agreement describes deliverable D6.3 as follows:
\begin{quotation}
\textbf{Final Report on User Validation}: An articulated analysis and critique of the user validation experiences. In particular we will review the effectiveness of the techniques for the exploitation of cost annotations that have been employed in the project and that have been validated on simple and non-trivial examples. We will also identify additional techniques that could be exploited in the middle and long term to bring the CerCo compiler to its full potential.
\end{quotation}

\section{Middle and Long Term Improvements}
In order to identify middle and long term improvements, we briefly review
here the premises and goals of the CerCo approach to resource analysis.
\begin{itemize}
 \item There is a lot of recent and renewed activity in the formal methods
   community to accommodate resource analysis using techniques derived from
   functional analysis (type systems, logics, abstract interpretation,
   amortized analysis applied to data structures, etc.).
 \item Most of this work, which currently remains at a theoretical level,
   is focused on high level languages and it assumes the existence of correct
   and compositional resource cost models.
 \item High level languages are compiled to object code by compilers that
   should respect the functional properties of the program. However, because
   of optimizations and the inherently non-compositional nature of compilation,
   compilers do not respect compositional cost models that are imposed a priori
   on the source language. By controlling the compiler and coupling it with a
   WCET analyser, it is actually possible to
   choose the cost model in such a way that the cost bounds are high enough
   to bound the cost of every piece of produced code. This was attempted, for
   instance, in the EmBounded project with good success. However, we believe
   that bounds obtained in this way are unlikely to be tight.
 \item Therefore our approach consists in having the compiler generate the
   cost model for the user by combining tracking of basic blocks during code
   transformations with a static resource analysis on the object code for basic
   blocks. We formally prove that the compiler respects the cost model that is
   induced on the source level based on very few assumptions: basically, the
   cost of a sequence of instructions should be associative and commutative
   and it should not depend on the machine status, except its program counter.
   Commutativity can be relaxed at the price of introducing more cost updates
   in the instrumented source code.
 \item The cost model for basic blocks induced on the source language must then
   be exploited to derive cost invariants and to prove them automatically
   (a minimal sketch of the resulting instrumentation follows this list).
   In CerCo we have shown how even simple invariant generation techniques
   are sufficient to enable the fully automatic proving of parametric WCET
   bounds for simple C programs and for Lustre programs of arbitrary complexity.
\end{itemize}
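
For illustration, the following is a minimal sketch of the kind of
instrumented source code produced by the basic labelling approach. The
program, the label names and the cost constants are invented for the example
and do not come from an actual run of the compiler.

\begin{verbatim}
int __cost = 0;

/* Each basic block increments __cost by the constant that the
   compiler inferred for the corresponding object code block. */
int sum(int *a, int n) {
  int i, s;
  __cost += 12;              /* cost of the block labelled L_entry */
  s = 0;
  for (i = 0; i < n; i++) {
    __cost += 35;            /* cost of the block labelled L_body  */
    s += a[i];
  }
  __cost += 7;               /* cost of the block labelled L_exit  */
  return s;
}
\end{verbatim}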

Compared to traditional WCET techniques, our approach currently has many
similarities, some advantages and some limitations. Both techniques need to
perform data flow analysis on the control flow graph of the program and both
techniques need to estimate the cost of blocks of instructions.

\subsection{Control flow analysis}

The first main difference is in the control flow analysis. Traditional WCET
analysis starts from object code and reconstructs the control flow graph from
it. Moreover, abstract interpretation is heavily employed to bound the number
of iterations of loops. In order to improve the accuracy of the estimation,
control flow constraints are provided by the user, usually as systems of
(linear) inequalities. In order to do this, the user, helped by the system,
needs to relate the object code control flow graph with the source one,
because it is on the latter that the bounds can be figured out and understood.
This operation is untrusted and potentially error prone for complex
optimizations (like aggressive loop optimizations). Efficient tools from
linear algebra are then used to solve the systems of inequalities obtained
from the abstract interpreter and from the user constraints.

In CerCo, instead, we assume full control of the compiler, which
is able to relate, even in non-trivial ways, the object code control flow
graph to the source code control flow graph. A clear disadvantage is the
impossibility of applying the tool to the object code produced by third party
compilers. On the other hand, we eliminate the possibility of errors
in the reconstruction of the control flow graph and in the translation of
high level constraints into low level constraints. The second potentially
important advantage is that, once we are dealing with the source language,
we can augment the precision of our dataflow analysis by combining
functional and non-functional invariants. This is what we attempted with
the CerCo Cost Annotating Frama-C Plug-In. The Frama-C architecture allows
several plug-ins to perform all kinds of static analysis on the source code,
reusing results from other plug-ins and augmenting the source code with their
results. The techniques are not limited to linear algebra and
abstract interpretation, and the most important plug-ins call domain specific
and general purpose automated theorem provers to close proof obligations of
arbitrary shape and complexity.

In principle, the extended flexibility of the analysis should give our
technique a major advantage in terms of precision, especially
considering that all the analyses used in traditional WCET can still be
implemented as plug-ins. In particular, the target we have in mind is systems
that are both (hard) real time and safety critical. Being safety critical, we
can already expect them to be fully or partially specified at the functional
level. Therefore we expect that the additional functional invariants should
make it possible to increase the precision of the cost bounds, up to the
point where the parametric cost bound is fully precise.
In practice, we have not had the time to perform extensive
comparisons on the kind of code used by industry in production systems.
The first middle term improvement of CerCo would then consist in this kind
of analysis, to support or disprove our expectations. It seems that the
newborn TACLe COST Action (Timing Analysis on Code Level) would be the
best framework in which to achieve this improvement.
In the case where our
technique remains promising, the next long term improvement would consist in
integrating into the Frama-C plug-in ad-hoc analyses, and combinations of
analyses, that would increase the coverage and efficiency of the cost
estimation techniques.

\subsection{Static analysis of costs of basic blocks}
At the beginning of the project we deliberately decided to focus our
attention on control flow preservation, cost model propagation and
the exploitation of the cost model induced on the high level code. For this
reason we devoted almost no attention to the static analysis of basic
blocks. This was achieved by picking a very simple hardware architecture
(the 8051 microprocessor family) whose cost model is fully predictable and
compositional: the cost of every instruction --- except those that deal with
I/O --- is constant, i.e. it does not depend on the machine status.
We do not regret this choice because, with the limited amount of man power
available in the project, it would have been difficult to also consider this
aspect. However, without showing that the approach can scale to more complex
architectures, our methodology remains of limited interest to industry.
Therefore, the next important middle term improvement will be the extension
of our methodology to cover pipelines and simple caches. We will now present
our ideas on how we intend to address the problem. The obvious long term
improvement would be to take into consideration multicore systems and complex
memory architectures like the ones currently in use in networks-on-chip.
The problem of execution time analysis for these systems is currently
considered extremely hard or even infeasible and at the moment it seems
unlikely that our methodology could contribute to the solution of the problem.

\subsubsection{Static analysis of costs of basic blocks revisited}
We will now describe what currently seems to be the most interesting technique
for the static analysis of the cost of basic blocks in the presence of complex
hardware architectures that do not have compositional cost models.

We start by presenting an idealized model of the execution of a generic
microprocessor (with caches) that has all interrupts disabled and no I/O
instructions. We then classify such models according to some properties of
their cost model. Finally, we show how to extend the labelling approach of
CerCo to cover models that are classified in a certain way.

\paragraph{The microprocessor model}

Let $\sigma,\sigma_1,\ldots$ range over $\Sigma$, the set of the fragments of
the microprocessor states that hold the program counter, the program status
word and all the data manipulated by the object code program, i.e. registers
and memory cells. We call these fragments the \emph{data states}.

Let $\delta,\delta_1,\ldots$ range over $\Delta$, the set of the
fragments of the microprocessor state that hold the \emph{internal state} of
the microprocessor (e.g. the content of the pipeline and caches, the status
of the branch prediction unit, etc.).
The internal state of the microprocessor influences the execution cost of the
next instruction, but it has no effect on the functional behaviour of the
processor. The whole state of the processor is represented by a pair
$(\sigma,\delta)$.

Let $I,I_1,\ldots$ range over $\mathcal{I}$, the
set of \emph{instructions} of the processor and let
$\gamma,\gamma_1,\ldots$ range over $\Gamma$, the set of \emph{operands}
of instructions after the fetching and decoding passes.
Thus a pair $(I,\gamma)$ represents a \emph{decoded instruction} and already
contains the data required for execution. Execution needs to access the data
state only to write the result.

Let $fetch: \Sigma \to \mathcal{I} \times \Gamma$ be the function that
performs the fetching and decoding phases, returning the decoded instruction
ready for execution. This is not meant to be the real fetch-decode function,
which also exploits the internal state to speed up execution (e.g. by
retrieving the instruction arguments from caches) and which, in the case of
pipelines, works in several stages. However, such a function exists and
it is observationally equivalent to the real fetch-decode.

We capture the semantics of the microprocessor with the following set of
functions:
\begin{itemize}
 \item The \emph{functional transition} function $\longrightarrow : \Sigma \to \Sigma$
   over data states. This is the only part of the semantics that is relevant
   to functional analysis.
 \item The \emph{internal state transition} function
   $\Longrightarrow : \Sigma \times \Delta \to \Delta$ that updates the internal
   state.
 \item The \emph{cost function} $K: \mathcal{I} \times \Gamma \times \Delta \to \mathbb{N}$
   that assigns a cost to transitions. Since decoded instructions hold the data
   they act on, the cost of an instruction may depend both on the data being
   manipulated and on the internal state.
\end{itemize}

Given a processor state $(\sigma,\delta)$, the processor evolves into the
new state $(\sigma',\delta')$ in $n$ cost units if
$\sigma \longrightarrow \sigma'$ and $(\sigma,\delta) \Longrightarrow \delta'$
and $fetch(\sigma) = (I,\gamma)$ and $K(I,\gamma,\delta) = n$.
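
To fix ideas, the model can be read as the skeleton of a cycle-accurate
simulator. The following C sketch is only illustrative: the types and the
four helper functions are hypothetical placeholders for $\Sigma$, $\Delta$,
$fetch$, $\longrightarrow$, $\Longrightarrow$ and $K$.

\begin{verbatim}
typedef struct { unsigned pc; /* registers, memory, ... */ } sigma_t;
typedef struct { unsigned bits; /* pipeline, caches, ... */ } delta_t;
typedef struct { int opcode; int operands; } decoded_t;  /* (I,gamma) */

decoded_t fetch(const sigma_t *s);                /* fetch-decode     */
void functional_step(sigma_t *s);                 /* sigma --> sigma' */
void internal_step(const sigma_t *s, delta_t *d); /* (s,d) ==> d'     */
unsigned cost(decoded_t i, const delta_t *d);     /* K(I,gamma,delta) */

/* One evolution step of the processor: returns the cost units n. */
unsigned step(sigma_t *s, delta_t *d) {
  decoded_t i = fetch(s);
  unsigned n = cost(i, d);   /* the cost reads the internal state */
  internal_step(s, d);       /* update pipeline, caches, ...      */
  functional_step(s);        /* update the data state             */
  return n;
}
\end{verbatim}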

An \emph{execution history} is a stream of states and transitions
$\sigma_0 \longrightarrow \sigma_1 \longrightarrow \sigma_2 \ldots$
that can be either finite or infinite. Given an execution history, the
corresponding \emph{execution path} is the stream of program counters
obtained from the execution history by forgetting all the remaining
information. The execution path of the history $\sigma_0 \longrightarrow \sigma_1 \longrightarrow \sigma_2 \ldots$ is $pc_0,pc_1,\ldots$ where each $pc_i$ is
the program counter of $\sigma_i$. We denote the set of finite execution
paths with $EP$.

We claim this simple model to be generic enough to cover real world
architectures.

\paragraph{Classification of cost models}
A cost function is \emph{exact} if it assigns to transitions the real cost
incurred. It is \emph{approximated} if it returns an upper bound of the real
cost.

A cost function is \emph{operand insensitive} if it does not depend on the
operands of the instructions to be executed. Formally, $K$ is operand
insensitive if there exists a $K': \mathcal{I} \times \Delta \to \mathbb{N}$
such that $K(I,\gamma,\delta) = K'(I,\delta)$. In this case, with an abuse of
terminology, we will identify $K$ with $K'$.

The cost functions of simple hardware architectures, in particular RISC ones,
are naturally operand insensitive. In the other cases an exact operand
sensitive cost function can always be turned into an approximated operand
insensitive one by taking
$K'(I,\delta) = \max\{K(I,\gamma,\delta)~|~\gamma \in \Gamma\}$.
The question, when one performs this approximation, is how severe the
approximation is. A measure is given by the \emph{jitter}, which is defined
as the difference between the best and worst cases. In our case, the jitter
of the approximation $K'$ would be $\max\{K(I,\gamma,\delta)~|~\gamma \in \Gamma\} - \min\{K(I,\gamma,\delta)~|~\gamma \in \Gamma\}$. According to experts
in WCET analysis, the jitters relative to operand sensitivity in modern
architectures are small enough to keep WCET estimations useful.
Therefore, in the sequel we will focus on operand insensitive cost models only.
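
As a purely illustrative instance, suppose that in a fixed internal state
$\delta$ a hypothetical multiplication instruction $I$ costs 2 cycles when one
of its operands is zero and 5 cycles otherwise. Then the operand insensitive
approximation and its jitter are
\[
K'(I,\delta) = \max\{2,5\} = 5,
\qquad
\mathit{jitter} = \max\{2,5\} - \min\{2,5\} = 3 .
\]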

Note that cost models that are operand insensitive may still have significant
dependencies on the data manipulated by the instructions, because of the
dependency on internal states. For example, an instruction that reads data
from memory may change the state of the cache and thus greatly affect the
execution time of subsequent instructions.

Nevertheless, operand insensitivity is an important property for the labelling
approach. In~\cite{tranquilli} we introduced \emph{dependent labels} and
\emph{dependent costs}, which make it possible to assign costs to basic
blocks of instructions that also depend on the state of the high level
program at the beginning of the block. The idea we will now try to pursue is
to exploit dependent costs to capture cost models that are sensitive to
the internal states of the microprocessor. Operand sensitivity, however, is
a major issue in the presence of dependent labels: to lift a data sensitive
cost model from the object code to the source language, we need a function
that maps high level states to the operands of the instructions to be
executed, and we need these functions to be simple enough to allow reasoning
over them. Complex optimizations performed by the compiler, however, make the
mappings extremely cumbersome and history dependent. Moreover, keeping
track of status transformations during compilation would be a significant
departure from classical compilation models which we are not willing to
undertake. By removing operand sensitivity, we get rid of part
of the problem: we only need to make our costs dependent on the internal
state of the microprocessor. The latter, however, is not at all visible
in the high level code. Our next step is to make it visible.

In general, the value of the internal state at a certain point in the program
history is affected by all the preceding history. For instance, the
pipeline stages either hold operands of instructions in execution or
bubbles\footnote{A bubble is formed in pipeline stage $n$ when an instruction
is stuck in pipeline stage $n-1$, waiting for some data which is not available
yet.}. The execution history contains data states that in turn contain the
object code data, which we do not know how to relate simply to the source
code data. We therefore introduce a new classification.

A \emph{view} over internal states is a pair $(\mathcal{V},|.|)$ where
$\mathcal{V}$ is a finite non empty set and $|.|:\Delta \to \mathcal{V}$ is
a forgetful function over internal states.

The operand insensitive cost function $K$ is \emph{dependent on the view
$\mathcal{V}$} if there exists a $K': \mathcal{I} \times \mathcal{V} \to \mathbb{N}$ such that $K(I,\delta) = K'(I,|\delta|)$. In this case, with an abuse of terminology, we will identify $K$ with $K'$.

Among the possible views, the ones that we will easily be able to work
with in the labelling approach are the \emph{execution history dependent views}.
A view $(\mathcal{V},|.|)$ is execution history dependent with a lookahead
of length $n$ when there exists
a transition function $\hookrightarrow : PC^n \times \mathcal{V} \to \mathcal{V}$
such that for all $(\sigma,\delta)$ and $pc_1,\ldots,pc_n$ such that every
$pc_i$ is the program counter of $\sigma_i$ defined by $\sigma \longrightarrow^i \sigma_i$, we have $(\sigma,\delta) \Longrightarrow \delta'$ iff
$((pc_1,\ldots,pc_n),|\delta|) \hookrightarrow |\delta'|$.

Less formally, a view is
dependent on the execution history if the evolution of the views is fully
determined by the evolution of the program counters. To better understand
the definition, consider the case where the next instruction to be executed
is a conditional jump. Without knowing the values of the registers, it is
impossible to determine whether the true or the false branch will be taken.
Therefore it is likely to be impossible to determine the value of the view
that follows the current one. On the other hand, knowing the program counter
that will be reached by executing the conditional branch, we also know which
branch will be taken and this may be sufficient to compute the new view.
Lookaheads longer than 1 will be used in the case of pipelines: when executing
one instruction in a system with a pipeline of length $n$, the internal state
of the pipeline already holds information on the next $n$ instructions to
be executed.

The reference to execution histories in the name is due to the following
fact: every execution history dependent transition function $\hookrightarrow$
can be lifted to the type $EP \times \mathcal{V} \to \mathcal{V}$ by folding
the definition over the path trace:
$((pc_0,\ldots,pc_m),v_0) \hookrightarrow v_{m-n+1}$ iff for all $i \leq m - n$,
$((pc_i,\ldots,pc_{i+n}),v_i) \hookrightarrow v_{i+1}$.
Moreover, the folding is clearly associative:
$(\tau_1 @ \tau_2, v) \hookrightarrow v''$ iff
$(\tau_1,v) \hookrightarrow v'$ and $(\tau_2,v') \hookrightarrow v''$.
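
For instance, with a lookahead of length $n = 1$ and a path of three program
counters ($m = 2$), the folded relation unfolds into two elementary steps:
\[
((pc_0,pc_1,pc_2),v_0) \hookrightarrow v_2
\quad\mbox{iff}\quad
((pc_0,pc_1),v_0) \hookrightarrow v_1
~\mbox{and}~
((pc_1,pc_2),v_1) \hookrightarrow v_2 .
\]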

As a final definition, we say that a cost function $K$ is
\emph{data independent} if it is dependent on a view that is execution path
dependent. In the next two paragraphs we will show how we can extend the
labelling approach to deal with data independent cost models.

Before that, we show that the class of data independent cost functions
is not too restricted to be interesting. In particular, at least simple
pipeline models admit data independent cost functions.
\paragraph{A data independent cost function for simple pipelines}
We consider here a simple model for a pipeline with $n$ stages, without branch
prediction and hazards. We also assume that the actual values of the operands
of the instruction being read have no influence on stalls (i.e. the
creation of bubbles) nor on the execution cost. The type of the operands,
however, can. For example, reading a value from one register may stall the
pipeline if that register has not been written yet, while reading the same
value from a different register may not stall the pipeline.

The internal states $\Delta$ of the pipeline are $n$-tuples of decoded
instructions or bubbles:
$\Delta = (\mathcal{I} \times \Gamma \cup \mathbbm{1})^n$.
This representation is not meant to describe the real data structures used
in the pipeline: in the implementation the operands are not present in every
stage of the pipeline, but are progressively fetched. A state
$(x_1,x_2,\ldots,(I,\gamma))$ represents the state of the pipeline just before
the completion of instruction $(I,\gamma)$. The first $n-1$ instructions that
follow $I$ may already be stored in the pipeline, unless bubbles have delayed
one or more of them.

We introduce the following view over internal states:
$(\{0,1\}^n,|.|)$ where
$|(x_1,\ldots,x_n)| = (y_1,\ldots,y_n)$ and $y_i$ is 1 iff $x_i$ is a bubble.
Thus the view only remembers which stages of the pipeline are stuck.
The view holds enough information to reconstruct the internal state given
the current data state: from the data state we can fetch the program counter
of the current and the next $n-1$ instructions and, by simulating at most
$n$ execution steps and by knowing where the bubbles were, we can fill in
the internal state of the pipeline.

The assumptions on the lack of influence of operand values on stalls and
execution times ensure the existence of a data independent cost function
$K: PC \times \{0,1\}^n \to \mathbb{N}$. The transition function for a
pipeline with $n$ stages may require a lookahead of length $n$:
$\hookrightarrow : PC^n \times \{0,1\}^n \to \{0,1\}^n$.
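
As an illustration, for a hypothetical two-stage pipeline the view can be
packed into two bits and the transition function becomes a small table. The
encoding, the stall predicate and the cost values below are invented for the
example; they are not derived from a real pipeline.

\begin{verbatim}
/* View for a hypothetical 2-stage pipeline: bit i of the view is 1
   iff stage i holds a bubble, so only the values 0..3 are used. */
typedef unsigned view_t;

/* Invented architectural knowledge, normally extracted from the
   object code: a data hazard predicate and a base cost per pc. */
int causes_stall(unsigned pc1, unsigned pc2) {
  return pc1 == 0x10 && pc2 == 0x14;         /* invented hazard */
}
unsigned base_cost(unsigned pc) { (void)pc; return 1; }

/* Transition function: next view from a lookahead of two program
   counters and the current view. */
view_t next_view(unsigned pc1, unsigned pc2, view_t v) {
  if (causes_stall(pc1, pc2)) return v | 1;  /* insert a bubble */
  return v >> 1;                             /* bubbles drain   */
}

/* Data independent cost: depends only on the pc and the view. */
unsigned K(unsigned pc, view_t v) {
  unsigned stalls = (v & 1) + ((v >> 1) & 1);
  return base_cost(pc) + stalls;      /* one extra cycle per bubble */
}
\end{verbatim}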

While the model is a bit simplistic, it can nevertheless be used to describe
existing pipelines. It is also easy to see that the same model
also captures static branch prediction: speculative execution of conditional
jumps is performed by always taking the same branch, which does not depend on
the execution history. In order to take into account jump predictions based
on the execution history, we just need to incorporate in the status and the
view statistical information on the last executions of the branch.

\paragraph{The labelling approach for data independent cost models}
We now describe how the labelling approach can be slightly modified to deal
with a data independent cost model $(\hookrightarrow,K)$ built over
$(\mathcal{V},|.|)$.

In the labelling approach, every basic block in the object code is identified
by a unique label $L$, which is also associated with the corresponding
basic block at the source level. Let us assume that labels are also inserted
after every function call and that this property is preserved during
compilation. Adding labels after calls makes the instrumented code harder
to read and generates more proof obligations on the instrumented code, but
it does not create any additional problems. The preservation during compilation
creates some significant technical complications in the proof of correctness
of the compiler, but those can be solved.

The static analysis performed in the last step of the basic labelling approach
analyses the object code in order to assign a cost to every label or,
equivalently, to every basic block. The cost is simply the sum of the costs
of the instructions in the basic block.

In our scenario, single instructions no longer have a fixed cost, because the
cost function $K$ takes as input not only a program counter but also a view
$v$. Therefore we replace the static analysis with the computation, for every
basic block and every $v \in \mathcal{V}$, of the sum of the costs of the
instructions in the block, starting in the initial view $v$. Formally, let
the sequence of the program counters of the basic block form the execution
path $pc_0,\ldots,pc_n$. The cost $K(v_0,L)$ associated to the block labelled
$L$ and the initial view $v_0$ is
$K(pc_0,v_0) + K(pc_1,v_1) + \ldots + K(pc_n,v_n)$ where for every $i < n$,
$((pc_i,\ldots,pc_{i+l}),v_i) \hookrightarrow v_{i+1}$, where $l$ is the
lookahead required. When the lookahead requires program counters outside the
block under analysis, we are free to use dummy ones, because the parts of the
view that deal with future behaviour have no impact on the cost of the
previous operations by assumption.
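
Concretely, the per-label analysis is a fold of $K$ and $\hookrightarrow$
over the program counters of the block, repeated for every initial view. A
minimal sketch, reusing the hypothetical \texttt{next\_view} and \texttt{K}
of the pipeline example above (lookahead of length 1, i.e. two consecutive
program counters per transition):

\begin{verbatim}
/* Dependent cost of one basic block: fold the cost function and the
   view transition over the block's program counters, starting from
   the initial view v0. */
unsigned block_cost(const unsigned *pc, int len, view_t v0) {
  unsigned total = 0;
  view_t v = v0;
  for (int i = 0; i < len; i++) {
    total += K(pc[i], v);
    if (i + 1 < len)
      v = next_view(pc[i], pc[i + 1], v);
    /* at the end of the block a dummy pc could be supplied: by
       assumption it cannot affect the costs already accumulated */
  }
  return total;
}
\end{verbatim}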

The static analysis can be performed
in linear time in the size of the program, because the cardinality of the set
of labels (i.e. the number of basic blocks) is bounded by the size of the
program and because the set $\mathcal{V}$ is finite. In the case of the
pipelines of the previous paragraph, the static analysis is $2^n$ times more
expensive than that of the basic labelling approach, where $n$ is the number
of pipeline stages.

The first important theorem in the labelling approach is the correctness of
the static analysis: if the (dependent) cost associated to a label $L$ is $k$,
then executing a program from the beginning of the basic block to the end
of the basic block should take exactly $k$ cost units. The proof only relies
on associativity and commutativity of the composition of costs. Commutativity
is only required if basic blocks can be nested, i.e. if a basic block does not
terminate when it reaches a call, but continues after the called function
returns. By assuming a cost label after each call, we no longer need
commutativity, which indeed does not hold for the definition of $K$ we
just gave. The reason is that, if $pc_i$ is a function call executed in
the view (state) $v_i$, it is not true that, after the return, the state
will be the $v_{i+1}$ defined by $(pc_i,pc_{i+1},v_i) \hookrightarrow v_{i+1}$
(assuming a lookahead of 1, which is already problematic).
Indeed $pc_{i+1}$ is the program counter of the instruction that follows
the call, whereas the next program counter to be reached is that of the
body of the call. Moreover, even if the computation made sense,
$v_{i+1}$ would be the state at the beginning of the execution of the body of
the call, while we should know the state after the function returns.
The latter cannot be
statically predicted. That is why we had to impose labels after calls.
Associativity, on the other hand, trivially holds. Therefore the proof
of correctness of the static analysis can be reused without any change.

So far, we have computed the dependent cost $K: \mathcal{V} \times \mathcal{L} \to \mathbb{N}$ that associates a cost to basic blocks and views.
The second step consists in statically computing the transition function
$\hookrightarrow : \mathcal{L} \times \mathcal{L} \times \mathcal{V} \to \mathcal{V}$ that
associates to each pair of consecutively executed basic blocks and input view
the view obtained at the end of the execution of the first block.

The definition is the following:
$(L,L',v) \hookrightarrow v'$ iff $((pc_0,\ldots,pc_n,pc'_0,\ldots,pc'_m),v) \hookrightarrow v'$ where $(pc_0,\ldots,pc_n)$ are the program counters of the block labelled by $L$ and $(pc'_0,\ldots,pc'_m)$ are those of the block labelled
by $L'$. We assume here that $m$ is always greater than or equal to the
lookahead required by the transition function $\hookrightarrow$ over execution
paths. When this is not the case we could make the new transition function
take as input a longer lookahead of labels. Or we may assume that enough
\texttt{NOP}s are introduced at the beginning of the block $L'$ to enforce
the property. In the rest of the paragraph we assume that the second approach
has been followed, to simplify the presentation.

The extended transition function over labels is not present in the basic
labelling approach. Actually, the basic labelling
approach can be understood as the generalized approach where the set of views
is $\mathcal{V} = \mathbbm{1}$.
The computation of the extended $\hookrightarrow$ transition function
is again linear in the size of the program.

Both the versions of $K$ and $\hookrightarrow$ defined over labels can be
lifted to work over traces by folding them over the list of labels in the
trace: for $K$ we have $K((L_1,\ldots,L_n),v) = K(L_1,v) + K((L_2,\ldots,L_n),v')$ where
$(L_1,L_2,v) \hookrightarrow v'$; for $\hookrightarrow$ we have
$((L_1,\ldots,L_n),v) \hookrightarrow v''$ iff $(L_1,L_2,v) \hookrightarrow v'$ and $((L_2,\ldots,L_n),v') \hookrightarrow v''$. The two definitions are also
clearly associative.
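
The same folds can be written directly against the instrumented-code
primitives; a minimal sketch, where \texttt{\_\_K} and \texttt{\_\_next} are
the (hypothetical) tables with the signatures used in the instrumentation of
Figure~\ref{factprogi} below:

\begin{verbatim}
int __K(int view, int label);                  /* dependent cost table */
int __next(int label1, int label2, int view);  /* view transition      */

/* Cost of a trace of labels starting from view v: fold __K and
   __next over the list of labels, as in the definition above. */
int trace_cost(const int *label, int len, int v) {
  int total = 0;
  for (int i = 0; i < len; i++) {
    total += __K(v, label[i]);
    if (i + 1 < len)
      v = __next(label[i], label[i + 1], v);
  }
  return total;
}
\end{verbatim}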

The second main theorem of the labelling approach is trace preservation:
the trace produced by the object code is the same as the trace produced by
the source code. Without any need to change the proof, we immediately obtain
as a corollary that for every view $v$, the cost $K(\tau,v)$ computed from
the source code trace $\tau$ is the same as the cost $K(\tau,v)$ computed
on the object code trace, which is again $\tau$.

The final step of the labelling approach is source code instrumentation.
In the basic labelling approach it consists in adding a global variable
\texttt{\_\_cost}, initialized with 0, which is incremented at the beginning
of every basic block with the cost of the label of the basic block.
Here we just need a more complex instrumentation that keeps track of the
values of the views during execution:
\begin{itemize}
 \item We define three global variables: \texttt{\_\_cost}, initialized at 0,
   \texttt{\_\_label}, initialized with \texttt{NULL}, and
   \texttt{\_\_view}, uninitialized.
 \item At the beginning of every basic block labelled by $L$ we add the
  following code fragment:

  \begin{tabular}{l}
  \texttt{\_\_view = \_\_next(\_\_label,}$L$\texttt{,\_\_view);}\\
  \texttt{\_\_cost += }$K(\mbox{\texttt{\_\_view}},L)$\texttt{;}\\
  \texttt{\_\_label = }$L$\texttt{;}
  \end{tabular}

  where \texttt{\_\_next(}$L_1$\texttt{,}$L_2$\texttt{,}$v$\texttt{)} $= v'$
  iff $(L_1,L_2,v) \hookrightarrow v'$, unless $L_1$ is \texttt{NULL}.
  In that case \texttt{\_\_next(NULL,}$L$\texttt{,}$v$\texttt{)} $= v_0$ where
  $v_0 = |\delta_0|$ and $\delta_0$ is the initial value of the internal state
  at the beginning of program execution.

  The first line of the code fragment computes the view at the beginning of
  the execution of the block from the view at the end of the previous block.
  Then we increment the cost variable by the cost of the block. Finally we
  record the current block, to be used for the computation of the next view
  at the beginning of the next block.
\end{itemize}

\paragraph{An example of instrumentation in the presence of a pipeline}
In Figure~\ref{factprogi} we show what the instrumentation of a program
that computes the factorial of 10 looks like in the presence of a pipeline.
The instrumentation has been produced by hand. The \texttt{\_\_next}
function says that the body of the internal loop of the \texttt{fact}
function can be executed in two different internal states, summarized by
views 2 and 3. View 2 holds at the beginning of even iterations,
while view 3 holds at the beginning of odd ones. Therefore even and odd
iterations are assigned a different cost. The code after the loop can also
be executed in two different states, depending on the parity of the last
loop iteration.

\begin{figure}[t]
\begin{tiny}
\begin{verbatim}
int fact (int n) {
  int i, res = 1;
  for (i = 1 ; i <= n ; i++) res *= i;
  return res;
}

int main () {
  return (fact(10));
}
\end{verbatim}
\end{tiny}
\caption{A simple program that computes the factorial of 10.\label{factprog}}
\end{figure}

The definitions of \texttt{\_\_next} and \texttt{\_\_K} are just examples.
It would be equally possible, for instance, for each of the 10 iterations
to be executed in a different internal state.

\begin{figure}[t]
\begin{tiny}
\begin{verbatim}
int __cost = 8;
int __label = 0;
int __view;

void __cost_incr(int incr) {
  __cost = __cost + incr;
}

int __next(int label1, int label2, int view) {
       if (label1 == 0 && label2 == 1)              return 1;
  else if (label1 == 0)                             return 0;
  else if (label1 == 1 && label2 == 2)              return 2;
  else if (label1 == 2 && label2 == 2 && view == 2) return 3;
  else if (label1 == 2 && label2 == 2 && view == 3) return 2;
  else if (label1 == 2 && label2 == 3 && view == 2) return 1;
  else if (label1 == 2 && label2 == 3 && view == 3) return 0;
  else if (label1 == 3 && label2 == 4 && view == 0) return 0;
  else if (label1 == 3 && label2 == 4 && view == 1) return 0;
  else                                              return 0;
}

int __K(int view, int label) {
       if (view == 0 && label == 0) return 3;
  else if (view == 1 && label == 1) return 14;
  else if (view == 2 && label == 2) return 35;
  else if (view == 3 && label == 2) return 26;
  else if (view == 0 && label == 3) return 6;
  else if (view == 1 && label == 3) return 8;
  else if (view == 0 && label == 4) return 6;
  else                              return 0;
}

int fact(int n)
{
  int i;
  int res;
  __view = __next(__label,1,__view); __cost_incr(__K(__view,1)); __label = 1;
  res = 1;
  for (i = 1; i <= n; i = i + 1) {
    __view = __next(__label,2,__view); __cost_incr(__K(__view,2)); __label = 2;
    res = res * i;
  }
  __view = __next(__label,3,__view); __cost_incr(__K(__view,3)); __label = 3;
  return res;
}

int main(void)
{
  int t;
  __view = __next(__label,0,__view); __cost_incr(__K(__view,0)); __label = 0;
  t = fact(10);
  __view = __next(__label,4,__view); __cost_incr(__K(__view,4)); __label = 4;
  return t;
}
\end{verbatim}
\end{tiny}
\caption{The instrumented version of the program in Figure~\ref{factprog}.\label{factprogi}}
\end{figure}

\paragraph{Considerations on the instrumentation}
The example of instrumentation in the previous paragraph shows that the
approach we are proposing exposes at the source level a certain amount
of information about the machine behaviour. Syntactically, the additional
details are almost entirely confined to the \texttt{\_\_next} and
\texttt{\_\_K} functions and they do not affect the functional behaviour
of the program at all. In particular, all invariants, proof obligations and
proofs that deal only with the functional behaviour are preserved.

The interesting question, then, is what is the impact of the additional
details on non-functional (intensional) invariants and proof obligations.
At the moment, without a working implementation to perform some large scale
tests, it is difficult to understand the level of automation that can be
achieved and the techniques that are likely to work better without introducing
major approximations. In any case, the preliminary considerations of the
project remain valid:
\begin{itemize}
 \item The task of computing and proving invariants can be simplified,
   even automatically, by trading precision for simplicity. For example, the
   most aggressive approximation simply replaces the cost function
   \texttt{\_\_K} with the function that ignores the view and returns for each
   label the maximum cost over all possible views (see the sketch after this
   list). Correspondingly, the
   function \texttt{\_\_next} can be dropped since it returns views that
   are later ignored.

   A more refined possibility consists in approximating the output only on
   those labels whose jitter is small, or on those that mark basic blocks
   that are executed only a small number of times. By simplifying the
   \texttt{\_\_next} function accordingly, it is possible to considerably
   reduce the search space for automated provers.
 \item The situation is no worse than what happens with time analysis on
   the object code (the current state of the art). There it is common practice
   to analyse larger chunks of execution to minimize the effect of later
   approximations. For example, if it is known that a loop can be executed at
   most 10 times, computing the cost of 10 iterations yields a
   better bound than multiplying the worst case of a single iteration by 10.

   We can clearly do the same at the source level.
   More generally, every algorithm that works in standard WCET tools on the
   object code is likely to have a counterpart on the source code.
   We also expect to be able to do better working on the source code.
   The reason is that we assume to know more functional properties
   of the program and more high level invariants, and to have more techniques
   and tools at our disposal. Even if at the moment we have no evidence to
   support our claims, we think that this approach is worth pursuing in the
   long term.
\end{itemize}
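
The aggressive approximation mentioned in the first item above can be
sketched as follows, where \texttt{NVIEWS} is a hypothetical bound on the
number of views and \texttt{\_\_K} is the dependent cost table of the
instrumentation:

\begin{verbatim}
#define NVIEWS 4        /* invented bound on the number of views */

int __K(int view, int label);      /* the dependent cost table */

/* View-insensitive upper bound: the maximum of __K over all views.
   The bound stays correct; the precision lost on a label is at most
   its jitter. */
int __K_max(int label) {
  int v, max = 0;
  for (v = 0; v < NVIEWS; v++)
    if (__K(v, label) > max) max = __K(v, label);
  return max;
}
\end{verbatim}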

\paragraph{The problem with caches}
Cost models for pipelines --- at least simple ones --- are data independent,
i.e. they are dependent on a view that is execution path dependent. In other
words, knowledge of the sequence of executed instructions is sufficient
to predict the cost of future instructions.

The same property does not hold for caches. The cost of accessing a memory
cell strongly depends on the addresses of the memory cells that have been read
in the past. In turn, the accessed addresses are a function of the low level
data state, which cannot be correlated to the source program state.

The strong correlation between the internal state of caches and the data
accessed in the past is one of the two main causes of the lack of
precision of static analysis on modern uni-core architectures. The other one
is the lack of precise knowledge of the real behaviour of modern hardware
systems. In order to overcome both problems, which Cazorla et
al.~\cite{proartis} call the ``\emph{Timing Analysis Walls}'', the PROARTIS
European Project has proposed to design ``\emph{a hardware/software
architecture whose execution timing behaviour eradicates dependence on
execution history}'' (\cite{proartis}, Section 1.2). The statement is
obviously too strong. What is concretely proposed by PROARTIS is the design
of a hardware/software architecture whose execution timing is
\emph{execution path dependent} (our terminology).

We have already seen that we are able to accommodate in the labelling approach
cost functions that are dependent on views that are execution path dependent.
Before fully embracing the PROARTIS vision,
we need to check whether there are other aspects of the PROARTIS proposal
that could be problematic for CerCo.

\paragraph{Static Probabilistic Time Analysis}
The approach of PROARTIS to achieving execution path dependent cost models
consists in turning the hard-to-analyse deterministic hardware components
(e.g. the cache) into probabilistic hardware components. Intuitively,
algorithms that used to take decisions based on the program history now throw
dice. The typical example, which has been thoroughly studied in
PROARTIS~\cite{proartis2}, is that of caches. There the proposal is to replace
the commonly used deterministic placement and replacement algorithms (e.g.
LRU) with fully probabilistic choices: when the cache needs to evict a line,
the line to be evicted is randomly selected according to the uniform
distribution.

The expectation is that probabilistic hardware will have worse performance
in the average case, but it will exhibit the worst case performance only with
negligible probability. Therefore, it is no longer interesting to estimate
the actual worst case bound. What becomes interesting is to plot the
probability that the execution time will exceed a certain threshold. For all
practical purposes, a program that misses its deadline with a negligible
probability (e.g. $10^{-9}$ per hour of operation) will be perfectly
acceptable when deployed on a hardware system (e.g. a car or an airplane)
that is already specified in such a way.

In order to plot the probability distribution of execution times, PROARTIS
proposes two methodologies: Static Probabilistic Time Analysis (SPTA) and
Measurement Based Probabilistic Time Analysis (MBPTA). The first one is
similar to traditional static analysis, but it operates on probabilistic
hardware. It is the one that we would like to embrace. The second one is
based on measurements and it is justified by the following assumption: if
the probabilities associated to every hardware operation are all independent
and identically distributed, then measuring the time spent on full runs of
sub-system components yields a probabilistic estimate that remains valid
when the sub-system is plugged into a larger one. Moreover, the probability
distribution of past runs must be equal to that of future runs.

We understand that MBPTA is useful to analyse closed (sub-)systems whose
functional behaviour is deterministic. It does not seem to have immediate
applications to parametric time analysis, which is what we are interested in.
Therefore we focus on SPTA.

According to~\cite{proartis},
``\emph{in SPTA, execution time probability distributions for individual operations \ldots are determined statically
from a model of the processor. The design principles of PROARTIS
will ensure \ldots that the probabilities for the execution time of each
instruction are independent. \ldots SPTA is performed by calculating the
convolution ($\oplus$) of the discrete probability distributions which describe
the execution time for each instruction on a CPU; this provides a probability
distribution \ldots representing the timing behaviour of the entire sequence of
instructions.}''

We will now analyse to what extent we can embrace SPTA in CerCo.

\paragraph{The labelling approach for Static Probabilistic Time Analysis}

To summarize, the main practical differences between standard static time
analysis and SPTA are:
\begin{itemize}
 \item The cost functions for single instructions or sequences of instructions
   no longer return a natural number (a number of cost units) but an integral
   random variable.
 \item Cost functions are extended from single instructions to sequences of
   instructions by using the associative convolution operator $\oplus$
   (a small sketch of the operator is given below).
\end{itemize}
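
For concreteness, here is a minimal sketch of the convolution of two discrete
execution time distributions, each represented (by invention, for this
example) as an array of probabilities indexed by cost:

\begin{verbatim}
/* Convolution of two discrete execution time distributions.
   a[i] is the probability that the first instruction takes i cost
   units (0 <= i < na); similarly for b. The output out must have
   room for na + nb - 1 entries; out[k] is the probability that the
   two instructions together take k cost units. */
void convolve(const double *a, int na, const double *b, int nb,
              double *out) {
  for (int k = 0; k < na + nb - 1; k++) out[k] = 0.0;
  for (int i = 0; i < na; i++)
    for (int j = 0; j < nb; j++)
      out[i + j] += a[i] * b[j];
}
\end{verbatim}

Iterating the operator over a sequence of instructions yields the
distribution of the whole sequence; the growth of the support under repeated
convolution is the source of the memory blow-up discussed below.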

By reviewing the papers that describe the labelling approach, it is easy to
become convinced that the codomain of the cost analysis can be lifted from
natural numbers to any group. Moreover, by imposing labels after
every function call, commutativity can be dropped and the approach works on
every monoid (usually called a \emph{cost monoid} in the literature).
Because random variables and convolutions form a monoid, we immediately have
that the labelling approach extends to SPTA. The instrumented code
produced by an SPTA-CerCo compiler will then have random variables (on a finite
domain) as costs and convolutions in place of the \texttt{\_\_cost\_incr}
function.

What is left to be understood is the way to state and compute the
probabilistic invariants needed to do \emph{parametric SPTA}. Indeed, it
seems that PROARTIS was only interested in non-parametric PTA. For example,
it is well known that actually computing the convolutions results in an
exponential growth of the memory required to represent the result of the
convolutions. Therefore, the analysis should work symbolically until the
moment we are interested in extracting information from the convolution.

Moreover, assuming that the problem of computing invariants is solved,
the actual behaviour of automated theorem
provers on probabilistic invariants remains to be understood. It is likely
that a good amount of domain specific knowledge about probability theory must
be exploited and incorporated into automatic provers to achieve concrete
results.

Parametric SPTA using the methodology developed in CerCo is a future research
direction that we believe to be worth exploring in the middle and long term.

\paragraph{Static Probabilistic Time Analysis for Caches in CerCo}

As a final remark, we note that the analysis in CerCo of systems that
implement probabilistic caches requires a combination of SPTA and data
independent cost models. The need for a probabilistic analysis is obvious
but, as we saw in the previous paragraph, it requires no modification of the
labelling approach.

In order to understand the need for dependent labelling (to work on data
independent cost functions), we need to review the behaviour of probabilistic
caches as proposed by PROARTIS. The interested reader can
consult~\cite{proartis2-articolo30} for further information.

In a randomized cache, the probability of evicting a given line on every
access is $1/N$ where $N$ stands for the number of cache entries. Therefore
the hit probability of a specific access to such a cache is
$P(hit) = (\frac{N-1}{N})^K$ where $K$ is the number of cache misses between
two consecutive accesses to the same cache entry. For the purposes of our
analysis, we must assume that every cache access could cause an eviction.
Therefore, we define $K$ (the \emph{reuse distance}) to be the number of
memory accesses between two consecutive accesses to the same cache entry,
including the access for which we are computing $K$. In order to compute
$K$ for every code memory address, we need to know the execution path (in
our terminology). In other words, we need a view that records for each
cache entry the number of memory accesses that have occurred since the last
access.
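
As a purely illustrative computation, for a hypothetical cache with $N = 256$
entries and a reuse distance $K = 100$, the formula above gives
\[
P(hit) = \left(\frac{255}{256}\right)^{100} \approx e^{-100/256} \approx 0.68 .
\]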

For pipelines with $n$ stages, the number of possible views is limited to
$2^n$: a view can usually be represented by a single machine word. This is
not the case for the views on caches, which are in principle very large.
Therefore, the dependent labelling approach for data independent cost
functions that we have presented here could still be impractical for caches.
If that turns out to be the case, a possible strategy is the use of abstract
interpretation techniques on the object code to reduce the size of the views
exposed at the source level, at the price of an early loss of precision in
the analysis.

More research must be carried out to understand whether
caches can be analysed, even probabilistically, using the CerCo technology.
This is left for future work and will be postponed until after the work on
pipelines.

\paragraph{Conclusions}
At the current state of the art, the functional properties of programs are
best proved on high level languages, but the non-functional ones are proved
on the corresponding object code. The non-functional analysis, however,
depends on functional invariants, e.g. to bound or correlate the number of
iterations of loops.

The aim of the CerCo project is to reconcile the two analyses by performing
the non-functional analysis on the source code. This requires computing a
cost model on the object code and reflecting the cost model on the source
code. We achieve this in CerCo by designing a certified Cost Annotating
Compiler that keeps track of the transformations of basic blocks, in order to
create a correspondence (not necessarily a bijection) between the basic
blocks of the source and target language. We then prove that the sequences of
basic blocks that are met in the source and target executions are correlated.
Then we perform a static analysis of the cost of basic blocks on the target
language and we use it to compute the cost model for the source language
basic blocks. Finally, we compute cost invariants on the source code from the
inferred cost model and from the functional program invariants; we generate
proof obligations for the invariants; and we use automatic provers to try to
close the proof obligations.

The cost of single instructions on modern architectures depends on the
internal state of various hardware components (pipelines, caches, branch
prediction units, etc.). The internal states are determined by the previous
execution history. Therefore the cost of basic blocks is parametric on the
execution history, which means both the instructions executed and the data
manipulated by the instructions. The CerCo approach is able to correlate the
sequence of blocks of source instructions with the sequence of blocks of
target instructions. It does not correlate the high level and the low level
data. Therefore we are not able, in the general case, to lift a cost model
that is parametric on the execution history to the source code.

To overcome the problem, we have identified a particular class of cost models
that do not depend on the data manipulated. We argue that the CerCo
approach can cover this scenario by exposing in the source program a finite
data type of views over internal machine states. The cost of basic blocks
is parametric on these views, and the current view is updated at basic block
entry according to some abstraction of the machine hardware that the user
does not need to understand. Further studies are needed to understand how
invariant generators and automatic provers can cope with these updates and
parametric costs.

We have argued that pipelines, at least simple ones, are captured by the
previous scenario and can in principle be handled using the CerCo tools.
The same is not true for caches, whose behaviour deeply depends on the
data manipulated. By embracing the PROARTIS proposal of turning caches into
probabilistic components, we can break the data dependency. Nevertheless,
cache analysis remains more problematic because of the size of the views.
Further studies need to focus on caches to understand whether the problem of
the size of the views can be tamed in practice without ruining the whole
approach.

\end{document}