source: Papers/sttt/main.tex @ 3476

Last change on this file since 3476 was 3476, checked in by mulligan, 5 years ago

beginning reorientation of paper, rewrote abstract

  • Property svn:executable set to *
File size: 46.7 KB
Line 
1\documentclass[twocolumn,draft]{svjour3}
2\usepackage{algpseudocode}
3%\usepackage{algorithmicx}
4\usepackage{alltt}
5\usepackage{amsfonts}
6\usepackage{amsmath}
7\usepackage[british]{babel}
8\usepackage{caption}
9\usepackage{hyperref}
10\usepackage[utf8]{inputenc}
11\usepackage{listings}
12\usepackage{subcaption}
13
14\renewcommand{\verb}{\lstinline}
15\def\lstlanguagefiles{lst-grafite.tex}
16\lstset{language=Grafite}
17
18\begin{document}
19
20\title{On the proof of correctness of a verified optimising assembler
21\thanks{Research supported by the CerCo project, within the Future and Emerging Technologies (FET) programme of the Seventh Framework Programme for Research of the European Commission, under FET-Open grant number 243881}}
22\author{Jaap Boender \and Dominic P. Mulligan \and Claudio Sacerdoti Coen}
23\institute{Department of Computer Science, University of Middlesex \and Computer Laboratory, University of Cambridge \and Dipartimento di Informatica, University of Bologna}
24
25\maketitle
26
27\begin{abstract}
28Optimising assemblers face the `branch displacement' or `jump encoding' problem, i.e. how best to choose between concrete machine code jump instructions --- typically of differing instruction and offset sizes --- when expanding pseudo-instructions.
29Ideally, an optimising assembler would choose the set of jump expansions that minimises the size of the resulting machine code program, a task that is provably \textsc{np}-hard.
30
31As part of CerCo (`Certified Complexity') --- an \textsc{eu}-funded project to develop a verified concrete complexity preserving compiler for a large subset of the C programming language --- we have implemented and proved correct an optimising assembler within the Matita interactive theorem prover.
32Our assembler targets the instruction set of a typical micro-controller, the Intel \textsc{mcs}-51 series.
33As is common in embedded systems development, this micro-controller has a paucity of available code memory and therefore we face an additional pressure in reducing the size of any assembled machine code program.
34Out of necessity, then, our assembler implements an algorithm for solving the branch displacement problem, and we must prove that this algorithm is correct.
35
36However, the efficient expansion of pseudoinstructions, namely jumps, into machine instructions is complex.
37We therefore isolate the decision making over how jumps should be expanded from the expansion process itself as much as possible using `policies', making the proof of correctness for the assembler more straightforward.
38Our proof strategy contains a tracking facility for `good addresses' and only programs that use good addresses have their semantics preserved under assembly, as we observe that it is impossible for an assembler to preserve the semantics of every assembly program.
39Our strategy offers increased flexibility over the traditional approach to proving the correctness of assemblers, wherein addresses in assembly are kept opaque and immutable.
40In particular, we may experiment with allowing the benign manipulation of addresses.
41
42We discuss wider issues associated with a proof of correctness for an assembler, detail our algorithm solving the `branch displacement' problem, and discuss our proof of correctness, employing `policies', for the assembler.
43
44\keywords{Formal verification, interactive theorem proving, assembler, branch displacement optimisation, CerCo (`Certified Complexity'), MCS-51 microcontroller, Matita proof assistant}
45\end{abstract}
46
47\section{Introduction}
48
49The problem of branch displacement optimisation, also known as jump encoding, is
50a well-known problem in assembler design~\cite{Hyde2006}. Its origin lies in the
51fact that in many architecture sets, the encoding (and therefore size) of some
52instructions depends on the distance to their operand (the instruction 'span').
53The branch displacement optimisation problem consists of encoding these
54span-dependent instructions in such a way that the resulting program is as
55small as possible.
56
57This problem is the subject of the present paper. After introducing the problem
58in more detail, we will discuss the solutions used by other compilers, present
59the algorithm we use in the CerCo assembler, and discuss its verification,
60that is the proofs of termination and correctness using the Matita proof
61assistant~\cite{Asperti2007}.
62
63Formulating the final statement of correctness and finding the loop invariants
64have been non-trivial tasks and are, indeed, the main contribution of this
65paper. It has required considerable care and fine-tuning to formulate not
66only the minimal statement required for the ulterior proof of correctness of
67the assembler, but also the minimal set of invariants needed for the proof
68of correctness of the algorithm.
69
70The research presented in this paper has been executed within the CerCo project
71which aims at formally verifying a C compiler with cost annotations. The
72target architecture for this project is the MCS-51, whose instruction set
73contains span-dependent instructions. Furthermore, its maximum addressable
74memory size is very small (64 Kb), which makes it important to generate
75programs that are as small as possible. With this optimisation, however, comes increased complexity and hence increased possibility for error. We must make sure that the branch instructions are encoded correctly, otherwise the assembled program will behave unpredictably.
76
77All Matita files related to this development can be found on the CerCo
78website, \url{http://cerco.cs.unibo.it}. The specific part that contains the
79branch displacement algorithm is in the {\tt ASM} subdirectory, in the files
80{\tt PolicyFront.ma}, {\tt PolicyStep.ma} and {\tt Policy.ma}.
81
82\section{The branch displacement optimisation problem}
83
84In most modern instruction sets that have them, the only span-dependent
85instructions are branch instructions. Taking the ubiquitous x86-64 instruction
86set as an example, we find that it contains eleven different forms of the
87unconditional branch instruction, all with different ranges, instruction sizes
88and semantics (only six are valid in 64-bit mode, for example). Some examples
89are shown in Figure~\ref{f:x86jumps} (see also~\cite{IntelDev}).
90
91\begin{figure}[h]
92\begin{center}
93\begin{tabular}{|l|l|l|}
94\hline
95Instruction & Size (bytes) & Displacement range \\
96\hline
97Short jump & 2 & -128 to 127 bytes \\
98Relative near jump & 5 & $-2^{32}$ to $2^{32}-1$ bytes \\
99Absolute near jump & 6 & one segment (64-bit address) \\
100Far jump & 8 & entire memory (indirect jump) \\
101\hline
102\end{tabular}
103\end{center}
104\caption{List of x86 branch instructions}
105\label{f:x86jumps}
106\end{figure}
107
108The chosen target architecture of the CerCo project is the Intel MCS-51, which
109features three types of branch instructions (or jump instructions; the two terms
110are used interchangeably), as shown in Figure~\ref{f:mcs51jumps}.
111
112\begin{figure}[h]
113\begin{center}
114\begin{tabular}{|l|l|l|l|}
115\hline
116Instruction & Size    & Execution time & Displacement range \\
117            & (bytes) & (cycles) & \\
118\hline
119SJMP (`short jump') & 2 & 2 & -128 to 127 bytes \\
120AJMP (`absolute jump') & 2 & 2 & one segment (11-bit address) \\
121LJMP (`long jump') & 3 & 3 & entire memory \\
122\hline
123\end{tabular}
124\end{center}
125\caption{List of MCS-51 branch instructions}
126\label{f:mcs51jumps}
127\end{figure}
128
129Conditional branch instructions are only available in short form, which
130means that a conditional branch outside the short address range has to be
131encoded using three branch instructions (for instructions whose logical
132negation is available, it can be done with two branch instructions, but for
133some instructions this is not the case). The call instruction is
134only available in absolute and long forms.
135
136Note that even though the MCS-51 architecture is much less advanced and much
137simpler than the x86-64 architecture, the basic types of branch instruction
138remain the same: a short jump with a limited range, an intra-segment jump and a
139jump that can reach the entire available memory.
140 
141Generally, in code fed to the assembler as input, the only
142difference between branch instructions is semantics, not span. This
143means that a distinction is made between an unconditional branch and the
144several kinds of conditional branch, but not between their short, absolute or
145long variants.
146
147The algorithm used by the assembler to encode these branch instructions into
148the different machine instructions is known as the {\em branch displacement
149algorithm}. The optimisation problem consists of finding as small an encoding as
150possible, thus minimising program length and execution time.
151
152Similar problems, e.g. the branch displacement optimisation problem for other
153architectures, are known to be NP-complete~\cite{Robertson1979,Szymanski1978},
154which could make finding an optimal solution very time-consuming.
155
156The canonical solution, as shown by Szymanski~\cite{Szymanski1978} or more
157recently by Dickson~\cite{Dickson2008} for the x86 instruction set, is to use a
158fixed point algorithm that starts with the shortest possible encoding (all
159branch instruction encoded as short jumps, which is likely not a correct
160solution) and then iterates over the source to re-encode those branch
161instructions whose target is outside their range.
162
163\subsection*{Adding absolute jumps}
164
165In both papers mentioned above, the encoding of a jump is only dependent on the
166distance between the jump and its target: below a certain value a short jump
167can be used; above this value the jump must be encoded as a long jump.
168
169Here, termination of the smallest fixed point algorithm is easy to prove. All
170branch instructions start out encoded as short jumps, which means that the
171distance between any branch instruction and its target is as short as possible
172(all the intervening jumps are short).
173If, in this situation, there is a branch instruction $b$ whose span is not
174within the range for a short jump, we can be sure that we can never reach a
175situation where the span of $j$ is so small that it can be encoded as a short
176jump. This argument continues to hold throughout the subsequent iterations of
177the algorithm: short jumps can change into long jumps, but not \emph{vice versa},
178as spans only increase. Hence, the algorithm either terminates early when a fixed
179point is reached or when all short jumps have been changed into long jumps.
180
181Also, we can be certain that we have reached an optimal solution: a short jump
182is only changed into a long jump if it is absolutely necessary.
183
184However, neither of these claims (termination nor optimality) hold when we add
185the absolute jump. With absolute jumps, the encoding of a branch
186instruction no longer depends only on the distance between the branch
187instruction and its target. An absolute jump is possible when instruction and
188target are in the same segment (for the MCS-51, this means that the first 5
189bytes of their addresses have to be equal). It is therefore entirely possible
190for two branch instructions with the same span to be encoded in different ways
191(absolute if the branch instruction and its target are in the same segment,
192long if this is not the case).
193
194\begin{figure}[t]
195\begin{subfigure}[b]{.45\linewidth}
196\small
197\begin{alltt}
198    jmp X
199    \ldots
200L\(\sb{0}\): \ldots
201% Start of new segment if
202% jmp X is encoded as short
203    \ldots
204    jmp L\(\sb{0}\)
205\end{alltt}
206\caption{Example of a program where a long jump becomes absolute}
207\label{f:term_example}
208\end{subfigure}
209\hfill
210\begin{subfigure}[b]{.45\linewidth}
211\small
212\begin{alltt}
213L\(\sb{0}\): jmp X
214X:  \ldots
215    \ldots
216L\(\sb{1}\): \ldots
217% Start of new segment if
218% jmp X is encoded as short
219    \ldots
220    jmp L\(\sb{1}\)
221    \ldots
222    jmp L\(\sb{1}\)
223    \ldots
224    jmp L\(\sb{1}\) 
225    \ldots
226\end{alltt}
227\caption{Example of a program where the fixed-point algorithm is not optimal}
228\label{f:opt_example}
229\end{subfigure}
230\end{figure}
231
232This invalidates our earlier termination argument: a branch instruction, once encoded
233as a long jump, can be re-encoded during a later iteration as an absolute jump.
234Consider the program shown in Figure~\ref{f:term_example}. At the start of the
235first iteration, both the branch to {\tt X} and the branch to $\mathtt{L}_{0}$
236are encoded as small jumps. Let us assume that in this case, the placement of
237$\mathtt{L}_{0}$ and the branch to it are such that $\mathtt{L}_{0}$ is just
238outside the segment that contains this branch. Let us also assume that the
239distance between $\mathtt{L}_{0}$ and the branch to it is too large for the
240branch instruction to be encoded as a short jump.
241
242All this means that in the second iteration, the branch to $\mathtt{L}_{0}$ will
243be encoded as a long jump. If we assume that the branch to {\tt X} is encoded as
244a long jump as well, the size of the branch instruction will increase and
245$\mathtt{L}_{0}$ will be `propelled' into the same segment as its branch
246instruction, because every subsequent instruction will move one byte forward.
247Hence, in the third iteration, the branch to $\mathtt{L}_{0}$ can be encoded as
248an absolute jump. At first glance, there is nothing that prevents us from
249constructing a configuration where two branch instructions interact in such a
250way as to iterate indefinitely between long and absolute encodings.
251
252This situation mirrors the explanation by Szymanski~\cite{Szymanski1978} of why
253the branch displacement optimisation problem is NP-complete. In this explanation,
254a condition for NP-completeness is the fact that programs be allowed to contain
255{\em pathological} jumps. These are branch instructions that can normally not be
256encoded as a short(er) jump, but gain this property when some other branch
257instructions are encoded as a long(er) jump. This is exactly what happens in
258Figure~\ref{f:term_example}. By encoding the first branch instruction as a long
259jump, another branch instruction switches from long to absolute (which is
260shorter).
261
262In addition, our previous optimality argument no longer holds. Consider the
263program shown in Figure~\ref{f:opt_example}. Suppose that the distance between
264$\mathtt{L}_{0}$ and $\mathtt{L}_{1}$ is such that if {\tt jmp X} is encoded
265as a short jump, there is a segment border just after $\mathtt{L}_{1}$. Let
266us also assume that all three branches to $\mathtt{L}_{1}$ are in the same
267segment, but far enough away from $\mathtt{L}_{1}$ that they cannot be encoded
268as short jumps.
269
270Then, if {\tt jmp X} were to be encoded as a short jump, which is clearly
271possible, all of the branches to $\mathtt{L}_{1}$ would have to be encoded as
272long jumps. However, if {\tt jmp X} were to be encoded as a long jump, and
273therefore increase in size, $\mathtt{L}_{1}$ would be `propelled' across the
274segment border, so that the three branches to $\mathtt{L}_{1}$ could be encoded
275as absolute jumps. Depending on the relative sizes of long and absolute jumps,
276this solution might actually be smaller than the one reached by the smallest
277fixed point algorithm.
278
279\section{Our algorithm}
280
281\subsection{Design decisions}
282
283Given the NP-completeness of the problem, finding optimal solutions
284(using, for example, a constraint solver) can potentially be very costly.
285
286The SDCC compiler~\cite{SDCC2011}, which has a backend targeting the MCS-51
287instruction set, simply encodes every branch instruction as a long jump
288without taking the distance into account. While certainly correct (the long
289jump can reach any destination in memory) and a very fast solution to compute,
290it results in a less than optimal solution in terms of output size and
291execution time.
292
293On the other hand, the {\tt gcc} compiler suite, while compiling
294C on the x86 architecture, uses a greatest fix point algorithm. In other words,
295it starts with all branch instructions encoded as the largest jumps
296available, and then tries to reduce the size of branch instructions as much as
297possible.
298
299Such an algorithm has the advantage that any intermediate result it returns
300is correct: the solution where every branch instruction is encoded as a large
301jump is always possible, and the algorithm only reduces those branch
302instructions whose destination address is in range for a shorter jump.
303The algorithm can thus be stopped after a determined number of steps without
304sacrificing correctness.
305
306The result, however, is not necessarily optimal. Even if the algorithm is run
307until it terminates naturally, the fixed point reached is the {\em greatest}
308fixed point, not the least fixed point. Furthermore, {\tt gcc} (at least for
309the x86 architecture) only uses short and long jumps. This makes the algorithm
310more efficient, as shown in the previous section, but also results in a less
311optimal solution.
312
313In the CerCo assembler, we opted at first for a least fixed point algorithm,
314taking absolute jumps into account.
315
316Here, we ran into a problem with proving termination, as explained in the
317previous section: if we only take short and long jumps into account, the jump
318encoding can only switch from short to long, but never in the other direction.
319When we add absolute jumps, however, it is theoretically possible for a branch
320instruction to switch from absolute to long and back, as previously explained.
321Proving termination then becomes difficult, because there is nothing that
322precludes a branch instruction from oscillating back and forth between absolute
323and long jumps indefinitely.
324
325To keep the algorithm in the same complexity class and more easily
326prove termination, we decided to explicitly enforce the `branch instructions
327must always grow longer' requirement: if a branch instruction is encoded as a
328long jump in one iteration, it will also be encoded as a long jump in all the
329following iterations. Therefore the encoding of any branch instruction
330can change at most two times: once from short to absolute (or long), and once
331from absolute to long.
332
333There is one complicating factor. Suppose that a branch instruction is encoded
334in step $n$ as an absolute jump, but in step $n+1$ it is determined that
335(because of changes elsewhere) it can now be encoded as a short jump. Due to
336the requirement that the branch instructions must always grow longer,
337the branch encoding will be encoded as an absolute jump in step
338$n+1$ as well.
339
340This is not necessarily correct. A branch instruction that can be
341encoded as a short jump cannot always also be encoded as an absolute jump, as a
342short jump can bridge segments, whereas an absolute jump cannot. Therefore,
343in this situation we have decided to encode the branch instruction as a long
344jump, which is always correct.
345
346The resulting algorithm, therefore, will not return the least fixed point, as
347it might have too many long jumps. However, it is still better than the
348algorithms from SDCC and {\tt gcc}, since even in the worst case, it will still
349return a smaller or equal solution.
350
351Experimenting with our algorithm on the test suite of C programs included with
352gcc 2.3.3 has shown that on average, about 25 percent of jumps are encoded as short or absolute jumps by the algorithm. As not all instructions are jumps, this does not make for a large reduction in size, but it can make for a reduction in execution time: if jumps
353are executed multiple times, for example in loops, the fact that short jumps take less cycles to execute than long jumps can have great effect.
354
355As for complexity, there are at most $2n$ iterations, where $n$ is the number of
356branch instructions. Practical tests within the CerCo project on small to
357medium pieces of code have shown that in almost all cases, a fixed point is
358reached in 3 passes. Only in one case did the algorithm need 4. This is not surprising: after all, the difference between short/absolute and
359long jumps is only one byte (three for conditional jumps). For a change from
360short/absolute to long to have an effect on other jumps is therefore relatively
361uncommon, which explains why a fixed point is reached so quickly.
362
363\subsection{The algorithm in detail}
364
365The branch displacement algorithm forms part of the translation from
366pseudocode to assembler. More specifically, it is used by the function that
367translates pseudo-addresses (natural numbers indicating the position of the
368instruction in the program) to actual addresses in memory. Note that in pseudocode, all instructions are of size 1.
369
370Our original intention was to have two different functions, one function
371$\mathtt{policy}: \mathbb{N} \rightarrow \{\mathtt{short\_jump},
372\mathtt{absolute\_jump}, \mathtt{long\_jump}\}$ to associate jumps to their
373intended encoding, and a function $\sigma: \mathbb{N} \rightarrow
374\mathtt{Word}$ to associate pseudo-addresses to machine addresses. $\sigma$
375would use $\mathtt{policy}$ to determine the size of jump instructions. This turned out to be suboptimal from the algorithmic point of view and
376impossible to prove correct.
377
378From the algorithmic point of view, in order to create the $\mathtt{policy}$
379function, we must necessarily have a translation from pseudo-addresses
380to machine addresses (i.e. a $\sigma$ function): in order to judge the distance
381between a jump and its destination, we must know their memory locations.
382Conversely, in order to create the $\sigma$ function, we need to have the
383$\mathtt{policy}$ function, otherwise we do not know the sizes of the jump
384instructions in the program.
385
386Much the same problem appears when we try to prove the algorithm correct: the
387correctness of $\mathtt{policy}$ depends on the correctness of $\sigma$, and
388the correctness of $\sigma$ depends on the correctness of $\mathtt{policy}$.
389
390We solved this problem by integrating the $\mathtt{policy}$ and $\sigma$
391algorithms. We now have a function
392$\sigma: \mathbb{N} \rightarrow \mathtt{Word} \times \mathtt{bool}$ which
393associates a pseudo-address to a machine address. The boolean denotes a forced
394long jump; as noted in the previous section, if during the fixed point
395computation an absolute jump changes to be potentially re-encoded as a short
396jump, the result is actually a long jump. It might therefore be the case that
397jumps are encoded as long jumps without this actually being necessary, and this
398information needs to be passed to the code generating function.
399
400The assembler function encodes the jumps by checking the distance between
401source and destination according to $\sigma$, so it could select an absolute
402jump in a situation where there should be a long jump. The boolean is there
403to prevent this from happening by indicating the locations where a long jump
404should be encoded, even if a shorter jump is possible. This has no effect on
405correctness, since a long jump is applicable in any situation.
406
407\begin{figure}[t]
408\small
409\begin{algorithmic}
410\Function{f}{$labels$,$old\_sigma$,$instr$,$ppc$,$acc$}
411  \State $\langle added, pc, sigma \rangle \gets acc$
412  \If {$instr$ is a backward jump to $j$}
413    \State $length \gets \mathrm{jump\_size}(pc,sigma_1(labels(j)))$
414    \Comment compute jump distance
415  \ElsIf {$instr$ is a forward jump to $j$}
416    \State $length \gets \mathrm{jump\_size}(pc,old\_sigma_1(labels(j))+added)$
417  \EndIf
418  \State $old\_length \gets \mathrm{old\_sigma_1}(ppc)$
419  \State $new\_length \gets \mathrm{max}(old\_length, length)$
420  \Comment length must never decrease
421  \State $old\_size \gets \mathrm{old\_sigma_2}(ppc)$
422  \State $new\_size \gets \mathrm{instruction\_size}(instr,new\_length)$
423  \Comment compute size in bytes
424  \State $new\_added \gets added+(new\_size-old\_size)$
425  \Comment keep track of total added bytes
426  \State $new\_sigma \gets old\_sigma$
427  \State $new\_sigma_1(ppc+1) \gets pc+new\_size$
428  \State $new\_sigma_2(ppc) \gets new\_length$
429  \Comment update $\sigma$ \\
430  \Return $\langle new\_added, pc+new\_size, new\_sigma \rangle$
431\EndFunction
432\end{algorithmic}
433\caption{The heart of the algorithm}
434\label{f:jump_expansion_step}
435\end{figure}
436
437The algorithm, shown in Figure~\ref{f:jump_expansion_step}, works by folding the
438function {\sc f} over the entire program, thus gradually constructing $sigma$.
439This constitutes one step in the fixed point calculation; successive steps
440repeat the fold until a fixed point is reached. We have abstracted away the case where an instruction is not a jump, since the size of these instructions is constant.
441
442Parameters of the function {\sc f} are:
443\begin{itemize}
444  \item a function $labels$ that associates a label to its pseudo-address;
445  \item $old\_sigma$, the $\sigma$ function returned by the previous
446    iteration of the fixed point calculation;
447  \item $instr$, the instruction currently under consideration;
448  \item $ppc$, the pseudo-address of $instr$;
449  \item $acc$, the fold accumulator, which contains $added$ (the number of
450  bytes added to the program size with respect to the previous iteration), $pc$
451  (the highest memory address reached so far), and of course $sigma$, the
452    $\sigma$ function under construction.
453\end{itemize}
454The first two are parameters that remain the same through one iteration, the
455final three are standard parameters for a fold function (including $ppc$,
456which is simply the number of instructions of the program already processed).
457
458The $\sigma$ functions used by {\sc f} are not of the same type as the final
459$\sigma$ function: they are of type
460$\sigma: \mathbb{N} \rightarrow \mathbb{N} \times \{\mathtt{short\_jump},
461\mathtt{absolute\_jump},\mathtt{long\_jump}\}$; a function that associates a
462pseudo-address with a memory address and a jump length. We do this to
463ease the comparison of jump lengths between iterations. In the algorithm,
464we use the notation $sigma_1(x)$ to denote the memory address corresponding to
465$x$, and $sigma_2(x)$ for the jump length corresponding to $x$.
466
467Note that the $\sigma$ function used for label lookup varies depending on
468whether the label is behind our current position or ahead of it. For
469backward branches, where the label is behind our current position, we can use
470$sigma$ for lookup, since its memory address is already known. However, for
471forward branches, the memory address of the address of the label is not yet
472known, so we must use $old\_sigma$.
473
474We cannot use $old\_sigma$ without change: it might be the case that we have
475already increased the size of some branch instructions before, making the
476program longer and moving every instruction forward. We must compensate for this
477by adding the size increase of the program to the label's memory address
478according to $old\_sigma$, so that branch instruction spans do not get
479compromised.
480
481%Note also that we add the pc to $sigma$ at location $ppc+1$, whereas we add the
482%jump length at location $ppc$. We do this so that $sigma(ppc)$ will always
483%return a pair with the start address of the instruction at $ppc$ and the
484%length of its branch instruction (if any); the end address of the program can
485%be found at $sigma(n+1)$, where $n$ is the number of instructions in the
486%program.
487
488\section{The proof}
489
490In this section, we present the correctness proof for the algorithm in more
491detail. The main correctness statement is shown, slightly simplified, in~Figure~\ref{statement}.
492%
493\begin{figure}[t]
494\small
495\begin{alignat*}{6}
496\mathtt{sigma}&\omit\rlap{$\mathtt{\_policy\_specification} \equiv
497\lambda program.\lambda sigma.$} \notag\\
498  & \omit\rlap{$sigma\ 0 = 0\ \wedge$} \notag\\
499  & \mathbf{let}\ & & \omit\rlap{$instr\_list \equiv code\ program\ \mathbf{in}$} \notag\\
500  &&& \omit\rlap{$\forall ppc.ppc < |instr\_list| \rightarrow$} \notag\\
501  &&& \mathbf{let}\ && pc \equiv sigma\ ppc\ \mathbf{in} \notag\\
502  &&& \mathbf{let}\ && instruction \equiv \mathtt{fetch\_pseudo\_instruction}\ instr\_list\ ppc\ \mathbf{in} \notag\\
503  &&& \mathbf{let}\ && next\_pc \equiv sigma\ (ppc+1)\ \mathbf{in}\notag\\
504  &&&&& next\_pc = pc + \mathtt{instruction\_size}\ sigma\ ppc\ instruction\ \wedge\notag\\
505  &&&&& (pc + \mathtt{instruction\_size}\ sigma\ ppc\ instruction < 2^{16}\ \vee\notag\\
506  &&&&& (\forall ppc'.ppc' < |instr\_list| \rightarrow ppc < ppc' \rightarrow \notag\\
507  &&&&& \mathbf{let}\ instruction' \equiv \mathtt{fetch\_pseudo\_instruction}\ instr\_list\ ppc'\ \mathbf{in} \notag\\
508  &&&&&\ \mathtt{instruction\_size}\ sigma\ ppc'\ instruction' = 0)\ \wedge \notag\\
509  &&&&& pc + \mathtt{instruction\_size}\ sigma\ ppc\ instruction = 2^{16})
510\end{alignat*}
511\caption{Main correctness statement\label{statement}}
512\label{sigmapolspec}
513\end{figure}
514%
515Informally, this means that when fetching a pseudo-instruction at $ppc$, the
516translation by $\sigma$ of $ppc+1$ is the same as $\sigma(ppc)$ plus the size
517of the instruction at $ppc$.  That is, an instruction is placed consecutively
518after the previous one, and there are no overlaps. The rest of the statement deals with memory size: either the next instruction fits within memory ($next\_pc < 2^{16}$) or it ends exactly at the limit memory,
519in which case it must be the last translated instruction in the program (enforced by specfiying that the size of all subsequent instructions is 0: there may be comments or cost annotations that are not translated).
520
521Finally, we enforce that the program starts at address 0, i.e. $\sigma(0) = 0$. It may seem strange that we do not explicitly include a safety property stating that every jump instruction is of the right type with respect to its target (akin to the lemma from Figure~\ref{sigmasafe}), but this is not necessary. The distance is recalculated according to the instruction addresses from $\sigma$, which implicitly expresses safety.
522
523Since our computation is a least fixed point computation, we must prove
524termination in order to prove correctness: if the algorithm is halted after
525a number of steps without reaching a fixed point, the solution is not
526guaranteed to be correct. More specifically, branch instructions might be
527encoded which do not coincide with the span between their location and their
528destination.
529
530Proof of termination rests on the fact that the encoding of branch
531instructions can only grow larger, which means that we must reach a fixed point
532after at most $2n$ iterations, with $n$ the number of branch instructions in
533the program. This worst case is reached if at every iteration, we change the
534encoding of exactly one branch instruction; since the encoding of any branch
535instruction can change first from short to absolute, and then to long, there
536can be at most $2n$ changes.
537
538%The proof has been carried out using the ``Russell'' style from~\cite{Sozeau2006}.
539%We have proven some invariants of the {\sc f} function from the previous
540%section; these invariants are then used to prove properties that hold for every
541%iteration of the fixed point computation; and finally, we can prove some
542%properties of the fixed point.
543
544\subsection{Fold invariants}
545
546In this section, we present the invariants that hold during the fold of {\sc f}
547over the program. These will be used later on to prove the properties of the
548iteration. During the fixed point computation, the $\sigma$ function is
549implemented as a trie for ease of access; computing $\sigma(x)$ is achieved by
550looking up the value of $x$ in the trie. Actually, during the fold, the value
551we pass along is a pair $\mathbb{N} \times \mathtt{ppc\_pc\_map}$. The first
552component is the number of bytes added to the program so far with respect to
553the previous iteration, and the second component, {\tt ppc\_pc\_map}, is the
554actual $\sigma$ trie (which we'll call $strie$ to avoid confusion).
555%
556{\small
557\begin{alignat*}{2}
558\mathtt{out} & \mathtt{\_of\_program\_none} \equiv \lambda prefix.\lambda strie. \notag\\
559& \forall i.i < 2^{16} \rightarrow (i > |prefix| \leftrightarrow
560 \mathtt{lookup\_opt}\ i\ (\mathtt{snd}\ strie) = \mathtt{None})
561\end{alignat*}}
562%
563The first invariant states that any pseudo-address not yet examined is not
564present in the lookup trie.
565%
566{\small
567\begin{alignat*}{2}
568\mathtt{not} & \mathtt{\_jump\_default} \equiv \lambda prefix.\lambda strie.\forall i.i < |prefix| \rightarrow\notag\\
569& \neg\mathtt{is\_jump}\ (\mathtt{nth}\ i\ prefix) \rightarrow \mathtt{lookup}\ i\ (\mathtt{snd}\ strie) = \mathtt{short\_jump}
570\end{alignat*}}
571%
572This invariant states that when we try to look up the jump length of a
573pseudo-address where there is no branch instruction, we will get the default
574value, a short jump.
575%
576{\small
577\begin{alignat*}{4}
578\mathtt{jump} & \mathtt{\_increase} \equiv \lambda pc.\lambda op.\lambda p.\forall i.i < |prefix| \rightarrow \notag\\ 
579& \mathbf{let}\  oj \equiv \mathtt{lookup}\ i\ (\mathtt{snd}\ op)\ \mathbf{in} \notag\\
580& \mathbf{let}\ j \equiv \mathtt{lookup}\ i\ (\mathtt{snd}\ p)\ \mathbf{in}\ \mathtt{jmpleq}\ oj\ j
581\end{alignat*}}
582%
583This invariant states that between iterations (with $op$ being the previous
584iteration, and $p$ the current one), jump lengths either remain equal or
585increase. It is needed for proving termination.
586%
587\begin{figure}[h]
588\small
589\begin{alignat*}{6}
590\mathtt{sigma} & \omit\rlap{$\mathtt{\_compact\_unsafe} \equiv \lambda prefix.\lambda strie.\forall n.n < |prefix| \rightarrow$}\notag\\
591& \mathbf{match}\ && \omit\rlap{$\mathtt{lookup\_opt}\ n\ (\mathtt{snd}\ strie)\ \mathbf{with}$}\notag\\
592&&& \omit\rlap{$\mathtt{None} \Rightarrow \mathrm{False}$} \notag\\
593&&& \omit\rlap{$\mathtt{Some}\ \langle pc, j \rangle \Rightarrow$} \notag\\
594&&& \mathbf{match}\ && \mathtt{lookup\_opt}\ (n+1)\ (\mathtt{snd}\ strie)\ \mathbf{with}\notag\\
595&&&&& \mathtt{None} \Rightarrow \mathrm{False} \notag\\
596&&&&& \mathtt{Some}\ \langle pc_1, j_1 \rangle \Rightarrow
597    pc_1 = pc + \notag\\
598&&&&& \ \ \mathtt{instruction\_size\_jmplen}\ j\ (\mathtt{nth}\ n\ prefix)
599\end{alignat*}
600\caption{Temporary safety property}
601\label{sigmacompactunsafe}
602\end{figure}
603%
604We now proceed with the safety lemmas. The lemma in
605Figure~\ref{sigmacompactunsafe} is a temporary formulation of the main
606property {\tt sigma\_policy\_specification}. Its main difference from the
607final version is that it uses {\tt instruction\_size\_jmplen} to compute the
608instruction size. This function uses $j$ to compute the span of branch
609instructions  (i.e. it uses the $\sigma$ under construction), instead
610of looking at the distance between source and destination. This is because
611$\sigma$ is still under construction; we will prove below that after the
612final iteration, {\tt sigma\_compact\_unsafe} is equivalent to the main
613property in Figure~\ref{sigmasafe} which holds at the end of the computation.
614%
615\begin{figure}[h]
616\small
617\begin{alignat*}{6}
618\mathtt{sigma} & \omit\rlap{$\mathtt{\_safe} \equiv \lambda prefix.\lambda labels.\lambda old\_strie.\lambda strie.\forall i.i < |prefix| \rightarrow$} \notag\\
619& \omit\rlap{$\forall dest\_label.\mathtt{is\_jump\_to\ (\mathtt{nth}\ i\ prefix})\ dest\_label \rightarrow$} \notag\\
620& \mathbf{let} && \omit\rlap{$\ paddr \equiv \mathtt{lookup}\ labels\ dest\_label\ \mathbf{in}$} \notag\\
621& \mathbf{let} && \omit\rlap{$\ \langle j, src, dest \rangle \equiv \mathbf{if} \ paddr\ \leq\ i\ \mathbf{then}$}\notag\\
622&&&&& \mathbf{let}\ \langle \_, j \rangle \equiv \mathtt{lookup}\ i\ (\mathtt{snd}\ strie)\ \mathbf{in} \notag\\
623&&&&& \mathbf{let}\ \langle pc\_plus\_jl, \_ \rangle \equiv \mathtt{lookup}\ (i+1)\ (\mathtt{snd}\ strie)\ \mathbf{in}\notag\\
624&&&&& \mathbf{let}\ \langle addr, \_ \rangle \equiv \mathtt{lookup}\ paddr\ (\mathtt{snd}\ strie)\ \mathbf{in}\notag\\
625&&&&& \langle j, pc\_plus\_jl, addr \rangle\notag\\
626&&&\mathbf{else} \notag\\
627&&&&&\mathbf{let}\ \langle \_, j \rangle \equiv \mathtt{lookup}\ i\ (\mathtt{snd}\ strie)\ \mathbf{in} \notag\\
628&&&&&\mathbf{let}\ \langle pc\_plus\_jl, \_ \rangle \equiv \mathtt{lookup}\ (i+1)\ (\mathtt{snd}\ old\_strie)\ \mathbf{in}\notag\\
629&&&&&\mathbf{let}\ \langle addr, \_ \rangle \equiv \mathtt{lookup}\ paddr\ (\mathtt{snd}\ old\_strie)\ \mathbf{in}\notag\\
630&&&&&\langle j, pc\_plus\_jl, addr \rangle \mathbf{in}\ \notag\\
631&&&\mathbf{match} && \ j\ \mathbf{with} \notag\\
632&&&&&\mathrm{short\_jump} \Rightarrow \mathtt{short\_jump\_valid}\ src\ dest\notag\\
633&&&&&\mathrm{absolute\_jump} \Rightarrow \mathtt{absolute\_jump\_valid}\ src\ dest\notag\\
634&&&&&\mathrm{long\_jump} \Rightarrow \mathrm{True}
635\end{alignat*}
636\caption{Safety property}
637\label{sigmasafe}
638\end{figure}
639%
640We compute the distance using the memory address of the instruction
641plus its size. This follows the behaviour of the MCS-51 microprocessor, which
642increases the program counter directly after fetching, and only then executes
643the branch instruction (by changing the program counter again).
644
645There are also some simple, properties to make sure that our policy
646remains consistent, and to keep track of whether the fixed point has been
647reached. We do not include them here in detail. Two of these properties give the values of $\sigma$ for the start and end of the program; $\sigma(0) = 0$ and $\sigma(n)$, where $n$ is the number of instructions up until now, is equal to the maximum memory address so far. There are also two properties that deal with what happens when the previous
648iteration does not change with respect to the current one. $added$ is a
649variable that keeps track of the number of bytes we have added to the program
650size by changing the encoding of branch instructions. If $added$ is 0, the program
651has not changed and vice versa.
652
653%{\small
654%\begin{align*}
655%& \mathtt{lookup}\ 0\ (\mathtt{snd}\ strie) = 0 \notag\\
656%& \mathtt{lookup}\ |prefix|\ (\mathtt{snd}\ strie) = \mathtt{fst}\ strie
657%\end{align*}}
658
659
660%{\small
661%\begin{align*}
662%& added = 0\ \rightarrow\ \mathtt{policy\_pc\_equal}\ prefix\ old\_strie\ strie \notag\\
663%& \mathtt{policy\_jump\_equal}\ prefix\ old\_strie\ strie\ \rightarrow\ added = 0
664%\end{align*}}
665
666We need to use two different formulations, because the fact that $added$ is 0
667does not guarantee that no branch instructions have changed.  For instance,
668it is possible that we have replaced a short jump with an absolute jump, which
669does not change the size of the branch instruction. Therefore {\tt policy\_pc\_equal} states that $old\_sigma_1(x) = sigma_1(x)$, whereas {\tt policy\_jump\_equal} states that $old\_sigma_2(x) = sigma_2(x)$. This formulation is sufficient to prove termination and compactness. 
670
671Proving these invariants is simple, usually by induction on the prefix length.
672
673\subsection{Iteration invariants}
674
675These are invariants that hold after the completion of an iteration. The main
676difference between these invariants and the fold invariants is that after the
677completion of the fold, we check whether the program size does not supersede
67864 Kb, the maximum memory size the MCS-51 can address. The type of an iteration therefore becomes an option type: {\tt None} in case
679the program becomes larger than 64 Kb, or $\mathtt{Some}\ \sigma$
680otherwise. We also no longer pass along the number of bytes added to the
681program size, but a boolean that indicates whether we have changed something
682during the iteration or not.
683
684If the iteration returns {\tt None}, which means that it has become too large for memory, there is an invariant that states that the previous iteration cannot
685have every branch instruction encoded as a long jump. This is needed later in the proof of termination. If the iteration returns $\mathtt{Some}\ \sigma$, the fold invariants are retained without change.
686
687Instead of using {\tt sigma\_compact\_unsafe}, we can now use the proper
688invariant:
689%
690{\small
691\begin{alignat*}{6}
692\mathtt{sigma} & \omit\rlap{$\mathtt{\_compact} \equiv \lambda program.\lambda sigma.$} \notag\\
693& \omit\rlap{$\forall n.n < |program|\ \rightarrow$} \notag\\
694& \mathbf{match}\ && \omit\rlap{$\mathtt{lookup\_opt}\ n\ (\mathtt{snd}\ sigma)\ \mathbf{with}$}\notag\\
695&&& \omit\rlap{$\mathrm{None}\ \Rightarrow\ \mathrm{False}$}\notag\\
696&&& \omit\rlap{$\mathrm{Some}\ \langle pc, j \rangle \Rightarrow$}\notag\\
697&&& \mathbf{match}\ && \mathtt{lookup\_opt}\ (n+1)\ (\mathtt{snd}\ sigma)\ \mathbf{with}\notag\\
698&&&&& \mathrm{None}\ \Rightarrow\ \mathrm{False}\notag\\
699&&&&& \mathrm{Some} \langle pc1, j1 \rangle \Rightarrow\notag\\
700&&&&& \ \ pc1 = pc + \mathtt{instruction\_size}\ n\ (\mathtt{nth}\ n\ program)
701\end{alignat*}}
702%
703This is almost the same invariant as ${\tt sigma\_compact\_unsafe}$, but differs in that it
704computes the sizes of branch instructions by looking at the distance between
705position and destination using $\sigma$. In actual use, the invariant is qualified: $\sigma$ is compact if there have
706been no changes (i.e. the boolean passed along is {\tt true}). This is to
707reflect the fact that we are doing a least fixed point computation: the result
708is only correct when we have reached the fixed point.
709
710There is another, trivial, invariant in case the iteration returns
711$\mathtt{Some}\ \sigma$: it must hold that $\mathtt{fst}\ sigma < 2^{16}$.
712We need this invariant to make sure that addresses do not overflow.
713
714The proof of {\tt nec\_plus\_ultra} goes as follows: if we return {\tt None},
715then the program size must be greater than 64 Kb. However, since the
716previous iteration did not return {\tt None} (because otherwise we would
717terminate immediately), the program size in the previous iteration must have
718been smaller than 64 Kb.
719
720Suppose that all the branch instructions in the previous iteration are
721encoded as long jumps. This means that all branch instructions in this
722iteration are long jumps as well, and therefore that both iterations are equal
723in the encoding of their branch instructions. Per the invariant, this means that
724$added = 0$, and therefore that all addresses in both iterations are equal.
725But if all addresses are equal, the program sizes must be equal too, which
726means that the program size in the current iteration must be smaller than
72764 Kb. This contradicts the earlier hypothesis, hence not all branch
728instructions in the previous iteration are encoded as long jumps.
729
730The proof of {\tt sigma\_compact} follows from {\tt sigma\_compact\_unsafe} and
731the fact that we have reached a fixed point, i.e. the previous iteration and
732the current iteration are the same. This means that the results of
733{\tt instruction\_size\_jmplen} and {\tt instruction\_size} are the same.
734
735\subsection{Final properties}
736
737These are the invariants that hold after $2n$ iterations, where $n$ is the
738program size (we use the program size for convenience; we could also use the
739number of branch instructions, but this is more complex). Here, we only
740need {\tt out\_of\_program\_none}, {\tt sigma\_compact} and the fact that
741$\sigma(0) = 0$.
742
743Termination can now be proved using the fact that there is a $k \leq 2n$, with
744$n$ the length of the program, such that iteration $k$ is equal to iteration
745$k+1$. There are two possibilities: either there is a $k < 2n$ such that this
746property holds, or every iteration up to $2n$ is different. In the latter case,
747since the only changes between the iterations can be from shorter jumps to
748longer jumps, in iteration $2n$ every branch instruction must be encoded as
749a long jump. In this case, iteration $2n$ is equal to iteration $2n+1$ and the
750fixed point is reached.
751
752\section{Conclusion}
753
754In the previous sections we have discussed the branch displacement optimisation
755problem, presented an optimised solution, and discussed the proof of
756termination and correctness for this algorithm, as formalised in Matita.
757
758The algorithm we have presented is fast and correct, but not optimal; a true
759optimal solution would need techniques like constraint solvers. While outside
760the scope of the present research, it would be interesting to see if enough
761heuristics could be found to make such a solution practical for implementing
762in an existing compiler; this would be especially useful for embedded systems,
763where it is important to have as small a solution as possible.
764
765In itself the algorithm is already useful, as it results in a smaller solution
766than the simple `every branch instruction is long' used up until now---and with
767only 64 Kb of memory, every byte counts. It also results in a smaller solution
768than the greatest fixed point algorithm that {\tt gcc} uses. It does this
769without sacrificing speed or correctness.
770
771The certification of an assembler that relies on the branch displacement
772algorithm described in this paper was presented in~\cite{lastyear}.
773The assembler computes the $\sigma$ map as described in this paper and
774then works in two passes. In the first pass it builds a map
775from instruction labels to addresses in the assembly code. In the
776second pass it iterates over the code, translating every pseudo jump
777at address $src$ to a label l associated to the assembly instruction at
778address $dst$ to a jump of the size dictated by $(\sigma\ src)$ to
779$(\sigma\ dst)$. In case of conditional jumps, the translated jump may be
780implemented with a series of instructions.
781
782The proof of correctness abstracts over the algorithm used and only relies on
783{\tt sigma\_policy\_specification} (page~\ref{sigmapolspec}). It is a variation
784of a standard 1-to-many forward simulation proof~\cite{Leroy2009}. The
785relation R between states just maps every code address $ppc$ stored in
786registers or memory to $(\sigma\ ppc)$. To identify the code addresses,
787an additional data structure is always kept together with the source
788state and is updated by the semantics. The semantics is preserved
789only for those programs whose source code operations
790$(f\ ppc_1\ \ldots\ ppc_n)$ applied to code addresses $ppc_1 \ldots ppc_n$ are
791such that $(f\ (\sigma\ ppc_1)\ ...\ (\sigma\ ppc_n) = f\ ppc_1\ ppc_n))$. For
792example, an injective $\sigma$ preserves a binary equality test f for code
793addresses, but not pointer subtraction.
794
795The main lemma (fetching simulation), which relies on\\
796{\tt sigma\_policy\_specification} and is established by structural induction
797over the source code, says that fetching an assembly instruction at
798position ppc is equal to fetching the translation of the instruction at
799position $(\sigma\ ppc)$, and that the new incremented program counter is at
800the beginning of the next instruction (compactness). The only exception is
801when the instruction fetched is placed at the end of code memory and is
802followed only by dead code. Execution simulation is trivial because of the
803restriction over well behaved programs w.r.t. sigma. The condition
804$\sigma\ 0 = 0$ is necessary because the hardware model prescribes that the
805first instruction to be executed will be at address 0. For the details
806see~\cite{lastyear}.
807
808Instead of verifying the algorithm directly, another solution to the problem
809would be to run an optimisation algorithm, and then verify the safety of the
810result using a verified validator. Such a validator would be easier to verify
811than the algorithm itself and it would also be efficient, requiring only a
812linear pass over the source code to test the specification. However, it is
813surely also interesting to formally prove that the assembler never rejects
814programs that should be accepted, i.e. that the algorithm itself is correct.
815This is the topic of the current paper.
816
817\subsection{Related work}
818
819As far as we are aware, this is the first formal discussion of the branch
820displacement optimisation algorithm.
821
822The CompCert project is another verified compiler project.
823Their backend~\cite{Leroy2009} generates assembly code for (amongst others) subsets of the
824PowerPC and x86 (32-bit) architectures. At the assembly code stage, there is
825no distinction between the span-dependent jump instructions, so a branch
826displacement optimisation algorithm is not needed.
827
828%An offshoot of the CompCert project is the CompCertTSO project, which adds
829%thread concurrency and synchronisation to the CompCert compiler~\cite{TSO2011}.
830%This compiler also generates assembly code and therefore does not include a
831%branch displacement algorithm.
832 
833%Finally, there is also the Piton stack~\cite{Moore1996}, which not only includes the
834%formal verification of a compiler, but also of the machine architecture
835%targeted by that compiler, a microprocessor called the FM9001.
836%However, this architecture does not have different
837%jump sizes (branching is simulated by assigning values to the program counter),
838%so the branch displacement problem is irrelevant.
839 
840
841
842\bibliography{biblio}
843\bibliographystyle{spbasic}
844
845\end{document}
Note: See TracBrowser for help on using the repository browser.