# Changeset 3659 for Papers/jar-cerco-2017/proof.tex

Ignore:
Timestamp:
Mar 16, 2017, 12:54:45 PM (4 years ago)
Message:

more cannibalising, adding paolo's report on indexed labelling technique

File:
1 edited

### Legend:

Unmodified
 r3657 %   Technical issues in front end (Brian?) %   Main theorem statement \section{Introduction} In~\cite{D2.1}, Armadio \emph{et al} propose an approach for building a compiler for a large fragment of the \textsc{c} programming language. The novelty of their proposal lies in the fact that their proposed design is capable of lifting execution cost information from the compiled code and presenting it to the user. This idea is foundational for the CerCo project, which strives to produce a mechanically certified version of such a compiler. To summarise, Armadio's proposal consisted of decorations' on the source code, along with the insertion of labels at key points. These labels are preserved as compilation progresses, from one intermediate language to another. Once the final object code is produced, such labels should correspond to the parts of the compiled code that have a constant cost. Two properties must hold of any cost estimate. The first property, paramount to the correctness of the method, is \emph{soundness}, that is, that the actual execution cost is bounded by the estimate. In the labelling approach, this is guaranteed if every loop in the control flow of the compiled code passes through at least one cost label. The second property, optional but desirable, is \emph{preciseness}: the estimate \emph{is} the actual cost. In the labelling approach, this will be true if, for every label, every possible execution of the compiled code starting from such a label yields the same cost before hitting another one. In simple architectures such as the 8051 micro-controller this can be guaranteed by placing labels at the start of any branch in the control flow, and by ensuring that no labels are duplicated. The reader should note that the above mentioned requirements must hold when executing the code obtained at the end of the compilation chain. So even if one is careful about injecting the labels at suitable places in the source code, the requirements might still fail because of two main obstacles: \begin{itemize} \item The compilation process introduces important changes in the control flow, inserting loops or branches. For example, the insertion of functions in the source code replacing instructions that are unavailable in the target architecture. This require loops to be inserted (for example, for multi-word division and generic shift in the 8051 architecture), or effort spent in providing unbranching translations of higher level instructions~\cite{D2.2}. \item Even when the compiled code \emph{does}---as far as the the syntactic control flow graph is concerned---respect the conditions for soundness and preciseness, the cost of blocks of instructions might not be independent of context, so that different passes through a label might have different costs. This becomes a concern if one wishes to apply the approach to more complex architectures, for example one with caching or pipelining. \end{itemize} The first point unveils a weakness of the current labelling approach when it comes to some common code transformations performed along a compilation chain. In particular, most \emph{loop optimisations} are disruptive, in the sense outlined in the first bulletpoint above. An example optimisation of this kind is \emph{loop peeling}. This optimisation is employed by compilers in order to trigger other optimisations, for example, dead code elimination or invariant code motion. Here, a first iteration of the loop is hoisted out of the body of the loop, possibly being assigned a different cost than later iterations. The second bulletpoint above highlights another weakness. Different tools allow to predict up to a certain extent the behaviour of cache. For example, the well known tool \s{aiT}~\cite{absint}---based on abstract interpretation---allows the user to estimate the worst-case execution time (\textsc{wcet}) of a piece of source code, taking into account advanced features of the target architecture. While such a tool is not fit for a compositional approach which is central to CerCo's project\footnote{\s{aiT} assumes the cache is empty at the start of computation, and treats each procedure call separately, unrolling a great part of the control flow.}, \s{aiT}'s ability to produce tight estimates of execution costs would sthill enhance the effectiveness of the CerCo compiler, \eg{} by integrating such techniques in its development. A typical case where cache analysis yields a difference in the execution cost of a block is in loops: the first iteration will usually stumble upon more cache misses than subsequent iterations. If one looks closely, the source of the weakness of the labelling approach as presented in~\cite{D2.1} is common to both points: the inability to state different costs for different occurrences of labels, where the difference might be originated by labels being duplicated along the compilation, or the costs being sensitive to the current state of execution. The preliminary work we present here addresses this weakness by introducing cost labels that are dependent on which iteration of its containing loops it occurs in. This is achieved by means of \emph{indexed labels}; all cost labels are decorated with formal indices coming from the loops containing such labels. These indices allow us to rebuild, even after multiple loop transformations, which iterations of the original loops in the source code a particular label occurrence belongs to. During the annotation stage of the source code, this information is presented to the user by means of \emph{dependent costs}. We concentrate on integrating the labelling approach with two loop transformations. For general information on general compiler optimisations (and loop optimisations in particular) we refer the reader to the vast literature on the subject (\eg\cite{muchnick,morgan}). \paragraph{Loop peeling} As already mentioned, loop peeling consists in preceding the loop with a copy of its body, appropriately guarded. This is used, in general, to trigger further optimisations, such as those that rely on execution information which can be computed at compile time, but which is erased by further iterations of the loop, or those that use the hoisted code to be more effective at eliminating redundant code. Integrating this transformation in to the labelling approach would also allow the integration of the common case of cache analysis explained above; the analysis of cache hits and misses usually benefits from a form of \emph{virtual} loop peeling~\cite{cacheprediction}. \paragraph{Loop unrolling} This optimisation consists of the repetition of several copies of the body of the loop inside the loop itself (inserting appropriate guards, or avoiding them altogether if enough information about the loop's guard is available at compile time). This can limit the number of (conditional or unconditional) jumps executed by the code and trigger further optimisations dealing with pipelining, if appropriate for the architecture. \\\\ Whilst we cover only two loop optimisations in this report, we argue that the work presented herein poses a good foundation for extending the labelling approach, in order to cover more and more common optimisations, as well as gaining insight into how to integrate advanced cost estimation techniques, such as cache analysis, into the CerCo compiler. Moreover loop peeling itself has the fortuitous property of enhancing and enabling other optimisations. Experimentation with CerCo's untrusted prototype compiler, which implements constant propagation and partial redundancy elimination~\cite{PRE,muchnick}, show how loop peeling enhances those other optimisations. \paragraph{Outline} We will present our approach on a minimal toy' imperative language, \imp{} with \s{goto}s, which we present in Section~\ref{sec:defimp} along with formal definitions of the loop transformations. This language already presents most of the difficulties encountered when dealing with \textsc{c}, so we stick to it for the sake of this presentation. In Section~\ref{sec:labelling} we summarize the labelling approach as presented in~\cite{D2.1}. Section~\ref{sec:indexedlabels} presents \emph{indexed labels}, our proposal for dependent labels which are able to describe precise costs even in the presence of the various loop transformations we consider. Finally Section~\ref{sec:conc} goes into more detail regarding the implementation of indexed labels in CerCo's untrusted compiler and speculates on further work on the subject. \section{\imp{} with goto}\label{sec:defimp} We briefly outline the toy language, \imp{} with \s{goto}s. The language was designed in order to pose problems for the existing labelling approach, and as a testing ground for our new notion of indexed labels. The syntax and operational semantics of our toy language are presented in~\ref{fig:minimp}. Note, we may augment the language further, with \s{break} and \s{continue}, at no further expense. \begin{figureplr} $$\begin{gathered} \begin{array}{nlBl>(R<)n} \multicolumn{4}{C}{\bfseries Syntax}\\ \multicolumn{4}{ncn}{ \ell,\ldots \hfill \text{(labels)} \hfill x,y,\ldots \hfill \text{(identifiers)} \hfill e,f,\ldots \hfill \text{(expression)} }\\ P,S,T,\ldots &\gramm& \s{skip} \mid s;t \mid \sop{if}e\sbin{then}s\sbin{else}t \mid \sop{while} e \sbin{do} s \mid x \ass e \\&\mid& \ell : s \mid \sop{goto}\ell& \spanr{-2}{statements}\\ \\ \multicolumn{4}{C}{\bfseries Semantics}\\ K,\ldots &\gramm& \s{halt} \mid S \cdot K & continuations \end{array} \\[15pt] \s{find}(\ell,S,K) \ass \left\{\begin{array}{lL} \bot & if S=\s{skip},\sop{goto} \ell' or x\ass e,\\ (T, K) & if S=\ell:T,\\ \s{find}(\ell,T,K) & otherwise, if S = \ell':T,\\ \s{find}(\ell,T_1,T_2\cdot K) & if defined and S=T_1;T_2,\\ \s{find}(\ell,T_1,K) & if defined and S=\sop{if}b\sbin{then}T_1\sbin{else}T_2,\\ \s{find}(\ell,T_2,K) & otherwise, if S=T_1;T_2 or \sop{if}b\sbin{then}T_1\sbin{else}T_2,\\ \s{find}(\ell,T,S\cdot K) & if S = \sop{while}b\sbin{do}T. \end{array}\right. \\[15pt] \begin{array}{lBl} (x:=e,K,s) &\to_P& (\s{skip},K,s[v/x]) \qquad\mbox{if }(e,s)\eval v \\ \\ (S;T,K,s) &\to_P& (S,T\cdot K,s) \\ \\ (\s{if} \ b \ \s{then} \ S \ \s{else} \ T,K,s) &\to_P&\left\{ \begin{array}{ll} (S,K,s) &\mbox{if }(b,s)\eval v \neq 0 \\ (T,K,s) &\mbox{if }(b,s)\eval 0 \end{array} \right. \\ \\ (\s{while} \ b \ \s{do} \ S ,K,s) &\to_P&\left\{ \begin{array}{ll} (S,\s{while} \ b \ \s{do} \ S \cdot K,s) &\mbox{if }(b,s)\eval v \neq 0 \\ (\s{skip},K,s) &\mbox{if }(b,s)\eval 0 \end{array} \right. \\ \\ (\s{skip},S\cdot K,s) &\to_P&(S,K,s) \\ \\ (\ell : S, K, s) &\to_P& (S,K,s) \\ \\ (\sop{goto}\ell,K,s) &\to_P& (\s{find}(\ell,P,\s{halt}),s) \\ \\ \end{array} \end{gathered}$$ \caption{The syntax and operational semantics of \imp.} \label{fig:minimp} \end{figureplr} The precise grammar for expressions is not particularly relevant so we do not give one in full. For the sake of conciseness we also treat boolean and arithmetic expressions together (with the usual \textsc{c} convention of an expression being true iff non-zero). We may omit the \s{else} clause of a conditional if it leads to a \s{skip} statement. We will presuppose that all programs are \emph{well-labelled}, \ie every label labels at most one occurrence of a statement in a program, and every \s{goto} points to a label actually present in the program. The \s{find} helper function has the task of not only finding the labelled statement in the program, but also building the correct continuation. The continuation built by \s{find} replaces the current continuation in the case of a jump. \paragraph{Further down the compilation chain} We abstract over the rest of the compilation chain. We posit the existence of a suitable notion of `sequential instructions', wherein each instruction has a single natural successor to which we can add our own, for every language $L$ further down the compilation chain. \subsection{Loop transformations} We call a loop $L$ \emph{single-entry} in $P$ if there is no \s{goto} to $P$ outside of $L$ which jumps into $L$.\footnote{This is a reasonable aproximation: it defines a loop as multi-entry if it has an external but unreachable \s{goto} jumping into it.} Many loop optimisations do not preserve the semantics of multi-entry loops in general, or are otherwise rendered ineffective. Usually compilers implement a single-entry loop detection which avoids the multi-entry ones from being targeted by optimisations~\cite{muchnick,morgan}. The loop transformations we present are local, \ie they target a single loop and transform it. Which loops are targeted may be decided by some \emph{ad hoc} heuristic. However, the precise details of which loops are targetted and how is not important here. \paragraph{Loop peeling} $$\sop{while}b\sbin{do}S \mapsto \sop{if}b\sbin{then} S; \sop{while} b \sbin{do} S[\ell'_i/\ell_i]$$ where $\ell'_i$ is a fresh label for any $\ell_i$ labelling a statement in $S$. This relabelling is safe for \s{goto}s occurring outside the loop because of the single-entry condition. Note that for \s{break} and \s{continue} statements, those should be replaced with \s{goto}s in the peeled body $S$. \paragraph{Loop unrolling} $$\sop{while}b\sbin{do}S\mapsto \sop{while} b \sbin{do} (S ; \sop{if} b \sbin{then} (S[\ell^1_i/\ell_i] ; \cdots \sop{if} b \sbin{then} S[\ell^n_i/\ell_i]) \cdots)$$ where $\ell^j_i$ are again fresh labels for any $\ell_i$ labelling a statement in $S$. This is a wilfully na\"{i}ve version of loop unrolling, which usually targets less general loops. The problem this transformation poses to CerCo's labelling approach are independent of the sophistication of the actual transformation. \begin{example} In \autoref{fig:example1} we show a program (a wilfully inefficient computation of of the sum of the first $n$ factorials) and a possible transformation of it, combining loop peeling and loop unrolling. \begin{figureplr} $$\fbox{\begin{array}{l} s\ass 0;\\ i\ass 0;\\ \sop{while}i= 1)?+y\verb+:+z \mapsto \verb+(_i_0 == 0)?+x\verb+:+y,  \item  c\texttt{?}x\verb+:(+d\texttt{?}x\texttt{:}y\verb+)+ \mapsto \texttt{(}c\texttt{ || }d\texttt{)?}x\texttt{:}y,  \item \begin{tabular}[t]{np{\linewidth}n} \verb+(_i_0 == 0)?+x\verb+:(_i_0 % 2 == 0 && _i_0 >= 2)?+y\verb+:+z \mapsto \\\hfill \verb+(_i_0 == 0)?+x\verb+:(_i_0 % 2 == 0)?+y\verb+:+z. \end{tabular} \end{itemize} The second transformation tends to accumulate disjunctions, to the detriment of readability. A further transformation swaps two branches of the ternary expression if the negation of the condition can be expressed with fewer clauses. For example:$$ \verb+(_i_0 % 3 == 0 || _i_0 % 3 == 1)?+x\verb+:+y \mapsto \verb+(_i_0 % 3 == 2)?+y\verb+:+x. $$Picking up again the example depicted in \autoref{ssec:detailedex}, \label{pag:continued} we can see that the cost in \eqref{eq:ex} can be simplified to the following, using some of the transformation described above:$$ \kappa^\iota(\gamma)= \indetern{i_0 = 0} a {\indetern{i_0\bmod 2 = 1} {\indetern{i_1=0} b {\indetern{i_1 = 1} c {\indetern{i_1\bmod 2 = 0} de } } } {\indetern{i_1 \bmod 2 = 0} fg } } $$One should keep in mind that the example was wilfully complicated, in practice the cost expressions produced have rarely more clauses than the number of nested loops containing the annotation. \paragraph{Updates to the frama-C cost plugin} Cerco's frama-C~\cite{framac} cost plugin\todo{is there a reference for this?}{} has been updated to take into account our new notion of dependent costs. The frama-c framework expands ternary expressions to branch statements, introducing temporaries along the way. This makes the task of analyzing ternary cost expressions rather daunting. It was deemed necessary to provide an option in the compiler to use actual branch statements for cost annotations rather than ternary expressions, so that at least frama-C's use of temporaries in cost annotation could be avoided. The cost analysis carried out by the plugin now takes into account such dependent costs. The only limitation (which actually simplifies the code) is that, within a dependent cost, simple conditions with modulus on the same loop index should not be modulo different numbers. This corresponds to a reasonable limitation on the number of times loop unrolling may be applied to the same loop: at most once. \paragraph{Further work} For the time being, indexed labels are only implemented in the untrusted Ocaml compiler, while they are not present yet in the Matita code. Porting them should pose no significant problem. Once ported, the task of proving properties about them in Matita can begin. Because most of the executable operational semantics of the languages across the frontend and the backend are oblivious to cost labels, it should be expected that the bulk of the semantic preservation proofs that still needs to be done will not get any harder because of indexed labels. The only trickier point that we foresee would be in the translation of \s{Clight} to \s{Cminor}, where we pass from structured indexed loops to atomic instructions on loop indices. An invariant which should probably be proved and provably preserved along the compilation chain is the non-overlap of indexings for the same atom. Then, supposing cost correctness for the unindexed approach, the indexed one will just need to amend the proof that$$\forall C\text{ constant indexing}.\forall \alpha\la I\ra\text{ appearing in the compiled code}. \kappa(\alpha)\circ (I\circ C) = \kappa(\alpha\la I \ra). $$Here, C represents a snapshot of loop indices in the compiled code, while I\circ C is the corresponding snapshot in the source code. Semantics preservation will ensure that when, with snapshot C, we emit \alpha\la I\ra (that is, we have \alpha\la I\circ C\ra in the trace), \alpha must also be emitted in the source code with indexing I\circ C, so the cost \kappa(\alpha)\circ (I\circ C) applies. Aside from carrying over the proofs, we would like to extend the approach to more loop transformations. Important examples are loop inversion (where a for loop is reversed, usually to make iterations appear to be truly independent) or loop interchange (where two nested loops are swapped, usually to have more loop invariants or to enhance strength reduction). This introduces interesting changes to the approach, where we would have indexings such as:$$i_0\mapsto n - i_0\quad\text{or}\quad i_0\mapsto i_1, i_1\mapsto i_0. In particular dependency over actual variables of the code would enter the frame, as indexings would depend on the number of iterations of a well-behaving guarded loop (the $n$ in the first example). Finally, as stated in the introduction, the approach should allow some integration of techniques for cache analysis, a possibility that for now has been put aside as the standard 8051 target architecture for the CerCo project lacks a cache. Two possible developments for this line of work present themselves: \begin{enumerate} \item One could extend the development to some 8051 variants, of which some have been produced with a cache. \item One could make the compiler implement its own cache: this cannot apply to \textsc{ram} accesses of the standard 8051 architecture, as the difference in cost of accessing the two types of \textsc{ram} is only one clock cycle, which makes any implementation of cache counterproductive. So for this proposal, we could either artificially change the accessing cost of \textsc{ram} of the model just for the sake of possible future adaptations to other architectures, or otherwise model access to an external memory by means of the serial port. \end{enumerate} \section{Compiler proof}