% source: Papers/sttt/main.tex @ 3477
\title{On the proof of correctness of a verified optimising assembler
\thanks{Research supported by the CerCo project, within the Future and Emerging Technologies (FET) programme of the Seventh Framework Programme for Research of the European Commission, under FET-Open grant number 243881}}
\author{Jaap Boender \and Dominic P. Mulligan \and Claudio Sacerdoti Coen}
%\institute{Department of Computer Science, University of Middlesex \and Computer Laboratory, University of Cambridge \and Dipartimento di Informatica, University of Bologna}
Optimising assemblers face the `branch displacement' or `jump encoding' problem, that is: how best to choose between concrete machine code jump instructions---typically of differing instruction and offset sizes---when expanding pseudo-instructions.
Ideally, on space-constrained hardware, an optimising assembler would choose the set of pseudo-instruction expansions that minimises the size of the resulting machine code program, a task that is provably \textsc{np}-hard.

As part of CerCo (`Certified Complexity')---an \textsc{eu}-funded project to develop a verified concrete complexity preserving compiler for a large subset of the C programming language---we have implemented and proved correct an optimising assembler within the Matita interactive theorem prover.
Our assembler targets the instruction set of a typical micro-controller, the Intel \textsc{mcs-51} series.
As is common with embedded systems development, this micro-controller has a paucity of available code memory, and we therefore face additional pressure to reduce the size of any assembled machine code program.
Out of necessity, then, our assembler implements an algorithm for solving the branch displacement problem, and we have proved that this algorithm is correct.

However, this efficient expansion of jump pseudo-instructions into machine code equivalents is complex, and could therefore unnecessarily complicate the proof of correctness for the rest of the assembler.
We therefore isolate the decision making over how jumps should be expanded from the expansion process itself as much as possible using `policies', making the rest of the proof of correctness more straightforward.
Our proof strategy contains a tracking facility for `good addresses': only programs that use good addresses have their semantics preserved under assembly, as we observe that it is impossible for an assembler to preserve the semantics of every assembly program.
Our strategy offers increased flexibility over the traditional approach to proving the correctness of assemblers, wherein addresses in assembly are kept opaque and immutable.
In particular, our approach permits experimentation with the benign manipulation of addresses.

We discuss wider issues associated with a proof of correctness for an assembler, detail our algorithm solving the `branch displacement' problem, and discuss our proof of correctness, employing `policies', for the assembler.

%\keywords{Formal verification, interactive theorem proving, assembler, branch displacement optimisation, CerCo (`Certified Complexity'), MCS-51 microcontroller, Matita proof assistant}

We consider the formalisation of an assembler for the Intel MCS-51 8-bit microprocessor in the Matita proof assistant~\cite{asperti:user:2007}.
This formalisation forms a major component of the EU-funded CerCo (`Certified Complexity') project~\cite{cerco:2011}, concerning the construction and formalisation of a concrete complexity preserving compiler for a large subset of the C programming language.

The \textsc{mcs-51}, commonly called the 8051, is an 8-bit Harvard architecture \textsc{cisc} instruction set micro-controller dating from the early 1980s, originally manufactured by Intel.
An extended variant, the \textsc{mcs-52} or 8052, was subsequently released, adding extra \textsc{ram} and \textsc{rom} above and beyond that offered by the original \textsc{mcs-51}, and an extra timer.
Derivatives are still widely manufactured by a number of semiconductor foundries, with the processor being used especially in embedded systems.

The MCS-51 has a relative paucity of features compared to its more modern brethren: it lacks any caching or pipelining, meaning that the timing of execution is predictable, which makes the MCS-51 very attractive for CerCo's ends.
However, this paucity of features---though an advantage in many respects---also quickly becomes a hindrance, as the MCS-51 features a relatively minuscule series of memory spaces by modern standards.
As a result our C compiler, to be able to successfully compile realistic programs for embedded devices, ought to produce `tight' machine code.

To do this, we must solve the `branch displacement' problem---deciding how best to expand pseudojumps to labels in assembly language to machine code jumps.
This problem, also known as `jump encoding', is well known amongst assembler implementors~\cite{Hyde2006}.
It arises when pseudojumps can be expanded in different ways to real machine instructions, but the different expansions are not equivalent (e.g. jumps that differ in the distance that they `span', or instructions that differ in the number of clock cycles needed to completely execute them) and are not always correct (e.g. correctness is only up to global constraints over the compiled code).
For instance, some jump instructions (short jumps) are very small and can be executed quickly, but they can only reach destinations within a certain distance from the current instruction.
When the destinations are too far away, larger and slower long jumps must be used.
The use of a long jump may increase the distance between another pseudojump and its target, forcing another long jump to be used, in a cascade.
The job of the optimising assembler is to expand every pseudo-instruction in such a way that all global constraints are satisfied and the compiled program is minimal in size and fast in concrete time complexity.
This problem is known to be computationally hard for most CISC architectures (\textsc{np}-hard, see~\cite{hyde:branch:2006}).

To simplify the CerCo C compiler we have chosen to implement an optimising assembler whose input language the compiler will target.
Labels, conditional jumps to labels, a program preamble containing global data, and a \texttt{MOV} instruction for moving this global data into the MCS-51's single 16-bit register all feature in our assembly language.
We simplify further by ignoring linking, assuming that all our assembly programs are pre-linked.

The requirements of the CerCo project add yet more complications to our proof of correctness, as we must also address a cost model.
CerCo imposes a cost model on C programs or, more specifically, on simple blocks of instructions.
This cost model is induced by the compilation process itself, and its non-compositional nature allows us to assign different costs to identical C statements depending on how they are compiled.
In short, we aim to obtain a very precise costing for a program by embracing the compilation process, not ignoring it.
At the assembler level, this is reflected by our need to induce a cost model on the assembly code as a function of the assembly program and the strategy used to solve the branch displacement problem.
In particular, our optimising assembler should also return a map that assigns a cost (in clock cycles) to every instruction in the source program.
We expect the induced cost to be preserved by the assembler: we will prove that the compiled code tightly simulates the source code by taking exactly the predicted amount of time.

Note that the temporal `tightness' of the simulation is a fundamental prerequisite of the correctness of the simulation, as some functions of the MCS-51---timers and \textsc{i/o}---depend on the microprocessor's clock.
If the pseudo- and concrete clock differ, the result of an \textsc{i/o} operation may not be preserved.

% This problem is the subject of the present paper. After introducing the problem
% in more detail, we will discuss the solutions used by other compilers, present
% the algorithm we use in the CerCo assembler, and discuss its verification,
% that is the proofs of termination and correctness using the Matita proof
% assistant~\cite{Asperti2007}.

% Formulating the final statement of correctness and finding the loop invariants
% have been non-trivial tasks and are, indeed, the main contribution of this
% paper. It has required considerable care and fine-tuning to formulate not
% only the minimal statement required for the ulterior proof of correctness of
% the assembler, but also the minimal set of invariants needed for the proof
% of correctness of the algorithm.

% The research presented in this paper has been executed within the CerCo project
% which aims at formally verifying a C compiler with cost annotations. The
% target architecture for this project is the MCS-51, whose instruction set
% contains span-dependent instructions. Furthermore, its maximum addressable
% memory size is very small (64 Kb), which makes it important to generate
% programs that are as small as possible. With this optimisation, however, comes increased complexity and hence increased possibility for error. We must make sure that the branch instructions are encoded correctly, otherwise the assembled program will behave unpredictably.

% All Matita files related to this development can be found on the CerCo
% website, \url{}. The specific part that contains the
% branch displacement algorithm is in the {\tt ASM} subdirectory, in the files
% {\tt}, {\tt} and {\tt}.

\section{The branch displacement optimisation problem}

In most modern instruction sets that have them, the only span-dependent instructions are branch instructions.
Taking the ubiquitous x86-64 instruction set as an example, we find that it contains eleven different forms of the unconditional branch instruction, all with different ranges, instruction sizes and semantics (only six are valid in 64-bit mode, for example).
Some examples are shown in Figure~\ref{f:x86jumps} (see also~\cite{IntelDev}).

\begin{figure}[t]
\begin{center}
\begin{tabular}{|l|l|l|}
\hline
Instruction & Size (bytes) & Displacement range \\
\hline
Short jump & 2 & $-128$ to $127$ bytes \\
Relative near jump & 5 & $-2^{31}$ to $2^{31}-1$ bytes \\
Absolute near jump & 6 & one segment (64-bit address) \\
Far jump & 8 & entire memory (indirect jump) \\
\hline
\end{tabular}
\end{center}
\caption{List of x86 branch instructions}
\label{f:x86jumps}
\end{figure}

The chosen target architecture of the CerCo project is the Intel MCS-51, which features three types of branch instructions (or jump instructions; the two terms are used interchangeably), as shown in Figure~\ref{f:mcs51jumps}.

\begin{figure}[t]
\begin{center}
\begin{tabular}{|l|l|l|l|}
\hline
Instruction & Size    & Execution time & Displacement range \\
            & (bytes) & (cycles)       & \\
\hline
SJMP (`short jump') & 2 & 2 & $-128$ to $127$ bytes \\
AJMP (`absolute jump') & 2 & 2 & one segment (11-bit address) \\
LJMP (`long jump') & 3 & 3 & entire memory \\
\hline
\end{tabular}
\end{center}
\caption{List of MCS-51 branch instructions}
\label{f:mcs51jumps}
\end{figure}

Conditional branch instructions are only available in short form, which means that a conditional branch outside the short address range has to be encoded using three branch instructions (for instructions whose logical negation is available, it can be done with two branch instructions, but for some instructions this is not the case).
The call instruction is only available in absolute and long forms.

Note that even though the MCS-51 architecture is much less advanced and much simpler than the x86-64 architecture, the basic types of branch instruction remain the same: a short jump with a limited range, an intra-segment jump, and a jump that can reach the entire available memory.

Generally, in the code fed to the assembler as input, the only difference between branch instructions is semantics, not span.
This means that a distinction is made between an unconditional branch and the several kinds of conditional branch, but not between their short, absolute or long variants.

The algorithm used by the assembler to encode these branch instructions into the different machine instructions is known as the {\em branch displacement algorithm}.
The optimisation problem consists of finding as small an encoding as possible, thus minimising program length and execution time.

Similar problems, e.g. the branch displacement optimisation problem for other architectures, are known to be NP-complete~\cite{Robertson1979,Szymanski1978}, which could make finding an optimal solution very time-consuming.

The canonical solution, as shown by Szymanski~\cite{Szymanski1978} or more recently by Dickson~\cite{Dickson2008} for the x86 instruction set, is to use a fixed point algorithm that starts with the shortest possible encoding (all branch instructions encoded as short jumps, which is likely not a correct solution) and then iterates over the source to re-encode those branch instructions whose target is outside their range.
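
As an illustration, this least fixed point computation can be sketched as follows (our own Python rendering, not taken from any of the cited assemblers; the two-byte short jump, three-byte long jump, and signed 8-bit displacement are simplifying assumptions):

```python
# Sketch of the canonical least fixed point branch displacement
# algorithm for an instruction set with only short and long jumps.
# A program is a list of (opcode, target) pairs; non-jumps have
# target = None and occupy one byte.

SHORT_SIZE, LONG_SIZE, OTHER_SIZE = 2, 3, 1

def addresses(program, encoding):
    """Address of every instruction (plus the end address) under `encoding`."""
    addrs, pc = [], 0
    for i, (_, target) in enumerate(program):
        addrs.append(pc)
        if target is None:
            pc += OTHER_SIZE
        else:
            pc += SHORT_SIZE if encoding[i] == 'short' else LONG_SIZE
    addrs.append(pc)
    return addrs

def branch_displacement(program):
    # Start from the smallest (likely incorrect) solution: all jumps short.
    encoding = ['short' if t is not None else None for _, t in program]
    changed = True
    while changed:
        changed = False
        addrs = addresses(program, encoding)
        for i, (_, target) in enumerate(program):
            if target is None or encoding[i] == 'long':
                continue
            # Displacement is computed relative to the next instruction.
            disp = addrs[target] - addrs[i + 1]
            if not -128 <= disp <= 127:
                encoding[i] = 'long'  # encodings only ever grow
                changed = True
    return encoding
```

Since encodings only grow, the loop terminates once every out-of-range jump has become long, mirroring the termination argument given below.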

\subsection*{Adding absolute jumps}

In both papers mentioned above, the encoding of a jump depends only on the distance between the jump and its target: below a certain value a short jump can be used; above this value the jump must be encoded as a long jump.

Here, termination of the smallest fixed point algorithm is easy to prove.
All branch instructions start out encoded as short jumps, which means that the distance between any branch instruction and its target is as short as possible (all the intervening jumps are short).
If, in this situation, there is a branch instruction $b$ whose span is not within the range for a short jump, we can be sure that we can never reach a situation where the span of $b$ is so small that it can be encoded as a short jump.
This argument continues to hold throughout the subsequent iterations of the algorithm: short jumps can change into long jumps, but not \emph{vice versa}, as spans only increase.
Hence, the algorithm terminates, either early when a fixed point is reached or when all short jumps have been changed into long jumps.

Also, we can be certain that we have reached an optimal solution: a short jump is only changed into a long jump if it is absolutely necessary.

However, neither of these claims (termination nor optimality) holds when we add the absolute jump.
With absolute jumps, the encoding of a branch instruction no longer depends only on the distance between the branch instruction and its target.
An absolute jump is possible when instruction and target are in the same segment (for the MCS-51, this means that the five most significant bits of their 16-bit addresses have to be equal).
It is therefore entirely possible for two branch instructions with the same span to be encoded in different ways (absolute if the branch instruction and its target are in the same segment, long if this is not the case).
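
Concretely, since an \texttt{AJMP} carries an 11-bit target, the segment test amounts to comparing the five most significant bits of two 16-bit addresses. A sketch of this predicate (our own hypothetical helper, glossing over the detail that the relevant program counter is that of the instruction following the jump):

```python
def same_segment(pc, target):
    """An absolute jump (AJMP) is only possible when both 16-bit
    addresses lie in the same 2 KiB segment, i.e. when their five
    most significant bits coincide."""
    return (pc >> 11) == (target >> 11)
```

For example, addresses `0x07FF` and `0x0800` are one byte apart yet lie in different segments, which is why a short jump can bridge segments while an absolute jump cannot.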

\begin{figure}[t]
\begin{alltt}
    jmp X
    \ldots
L\(\sb{0}\): \ldots
% Start of new segment if
% jmp X is encoded as short
    \ldots
    jmp L\(\sb{0}\)
\end{alltt}
\caption{Example of a program where a long jump becomes absolute}
\label{f:term_example}
\end{figure}

\begin{figure}[t]
\begin{alltt}
L\(\sb{0}\): jmp X
X:  \ldots
    \ldots
L\(\sb{1}\): \ldots
% Start of new segment if
% jmp X is encoded as short
    \ldots
    jmp L\(\sb{1}\)
    \ldots
    jmp L\(\sb{1}\)
    \ldots
    jmp L\(\sb{1}\)
    \ldots
\end{alltt}
\caption{Example of a program where the fixed-point algorithm is not optimal}
\label{f:opt_example}
\end{figure}

This invalidates our earlier termination argument: a branch instruction, once encoded as a long jump, can be re-encoded during a later iteration as an absolute jump.
Consider the program shown in Figure~\ref{f:term_example}.
At the start of the first iteration, both the branch to {\tt X} and the branch to $\mathtt{L}_{0}$ are encoded as short jumps.
Let us assume that in this case, the placement of $\mathtt{L}_{0}$ and the branch to it are such that $\mathtt{L}_{0}$ is just outside the segment that contains this branch.
Let us also assume that the distance between $\mathtt{L}_{0}$ and the branch to it is too large for the branch instruction to be encoded as a short jump.

All this means that in the second iteration, the branch to $\mathtt{L}_{0}$ will be encoded as a long jump.
If we assume that the branch to {\tt X} is encoded as a long jump as well, the size of the branch instruction will increase and $\mathtt{L}_{0}$ will be `propelled' into the same segment as its branch instruction, because every subsequent instruction will move one byte forward.
Hence, in the third iteration, the branch to $\mathtt{L}_{0}$ can be encoded as an absolute jump.
At first glance, there is nothing that prevents us from constructing a configuration where two branch instructions interact in such a way as to iterate indefinitely between long and absolute encodings.

This situation mirrors the explanation by Szymanski~\cite{Szymanski1978} of why the branch displacement optimisation problem is NP-complete.
In this explanation, a condition for NP-completeness is the fact that programs be allowed to contain {\em pathological} jumps.
These are branch instructions that can normally not be encoded as a short(er) jump, but gain this property when some other branch instructions are encoded as a long(er) jump.
This is exactly what happens in Figure~\ref{f:term_example}: by encoding the first branch instruction as a long jump, another branch instruction switches from long to absolute (which is shorter).

In addition, our previous optimality argument no longer holds.
Consider the program shown in Figure~\ref{f:opt_example}.
Suppose that the distance between $\mathtt{L}_{0}$ and $\mathtt{L}_{1}$ is such that if {\tt jmp X} is encoded as a short jump, there is a segment border just after $\mathtt{L}_{1}$.
Let us also assume that all three branches to $\mathtt{L}_{1}$ are in the same segment, but far enough away from $\mathtt{L}_{1}$ that they cannot be encoded as short jumps.

Then, if {\tt jmp X} were to be encoded as a short jump, which is clearly possible, all of the branches to $\mathtt{L}_{1}$ would have to be encoded as long jumps.
However, if {\tt jmp X} were to be encoded as a long jump, and therefore increase in size, $\mathtt{L}_{1}$ would be `propelled' across the segment border, so that the three branches to $\mathtt{L}_{1}$ could be encoded as absolute jumps.
Depending on the relative sizes of long and absolute jumps, this solution might actually be smaller than the one reached by the smallest fixed point algorithm.

\section{Our algorithm}

\subsection{Design decisions}

Given the NP-completeness of the problem, finding optimal solutions (using, for example, a constraint solver) can potentially be very costly.

The SDCC compiler~\cite{SDCC2011}, which has a backend targeting the MCS-51 instruction set, simply encodes every branch instruction as a long jump without taking the distance into account.
While certainly correct (the long jump can reach any destination in memory) and a very fast solution to compute, it results in a less than optimal solution in terms of output size and execution time.

On the other hand, the {\tt gcc} compiler suite, while compiling C on the x86 architecture, uses a greatest fixed point algorithm.
In other words, it starts with all branch instructions encoded as the largest jumps available, and then tries to reduce the size of branch instructions as much as possible.

Such an algorithm has the advantage that any intermediate result it returns is correct: the solution where every branch instruction is encoded as a large jump is always possible, and the algorithm only reduces those branch instructions whose destination address is in range for a shorter jump.
The algorithm can thus be stopped after a determined number of steps without sacrificing correctness.

The result, however, is not necessarily optimal.
Even if the algorithm is run until it terminates naturally, the fixed point reached is the {\em greatest} fixed point, not the least fixed point.
Furthermore, {\tt gcc} (at least for the x86 architecture) only uses short and long jumps.
This makes the algorithm more efficient, as shown in the previous section, but also results in a less optimal solution.

In the CerCo assembler, we opted at first for a least fixed point algorithm, taking absolute jumps into account.

Here, we ran into a problem with proving termination, as explained in the previous section: if we only take short and long jumps into account, the jump encoding can only switch from short to long, but never in the other direction.
When we add absolute jumps, however, it is theoretically possible for a branch instruction to switch from absolute to long and back, as previously explained.
Proving termination then becomes difficult, because there is nothing that precludes a branch instruction from oscillating back and forth between absolute and long jumps indefinitely.

To keep the algorithm in the same complexity class and more easily prove termination, we decided to explicitly enforce the `branch instructions must always grow longer' requirement: if a branch instruction is encoded as a long jump in one iteration, it will also be encoded as a long jump in all the following iterations.
Therefore the encoding of any branch instruction can change at most twice: once from short to absolute (or long), and once from absolute to long.

There is one complicating factor.
Suppose that a branch instruction is encoded in step $n$ as an absolute jump, but in step $n+1$ it is determined that (because of changes elsewhere) it could now be encoded as a short jump.
Due to the requirement that branch instructions must always grow longer, the branch will be encoded as an absolute jump in step $n+1$ as well.

This is not necessarily correct.
A branch instruction that can be encoded as a short jump cannot always also be encoded as an absolute jump, as a short jump can bridge segments, whereas an absolute jump cannot.
Therefore, in this situation we have decided to encode the branch instruction as a long jump, which is always correct.

The resulting algorithm, therefore, will not return the least fixed point, as it might have too many long jumps.
However, it is still better than the algorithms from SDCC and {\tt gcc}, since even in the worst case, it will still return a smaller or equal solution.
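
The decision logic just described can be summarised as follows (an illustrative Python sketch of our own, not the Matita implementation; the distance test and segment test are simplified): lengths are ordered short $<$ absolute $<$ long, an encoding may only grow, and an absolute encoding retained purely by monotonicity is forced to long when it would cross a segment boundary.

```python
ORDER = {'short': 0, 'absolute': 1, 'long': 2}

def new_jump_length(old_length, pc, target):
    """Re-encode one branch, never shrinking the previous encoding."""
    if -128 <= target - pc <= 127:
        best = 'short'
    elif (pc >> 11) == (target >> 11):   # same 2 KiB segment
        best = 'absolute'
    else:
        best = 'long'
    # Encodings may only grow: keep the larger of old and new.
    length = best if ORDER[best] > ORDER[old_length] else old_length
    # An absolute jump retained only by monotonicity may be wrong if the
    # branch now crosses a segment boundary; a long jump is always
    # correct, so force it in that case.
    if length == 'absolute' and (pc >> 11) != (target >> 11):
        length = 'long'
    return length
```

Note how a branch that was absolute, and whose span has since shrunk to short-jump range across a segment border, comes out as long rather than (incorrectly) absolute.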

Experimenting with our algorithm on the test suite of C programs included with gcc 2.3.3 has shown that, on average, about 25 percent of jumps are encoded as short or absolute jumps by the algorithm.
As not all instructions are jumps, this does not make for a large reduction in size, but it can make for a reduction in execution time: if jumps are executed multiple times, for example in loops, the fact that short jumps take fewer cycles to execute than long jumps can have a great effect.

As for complexity, there are at most $2n$ iterations, where $n$ is the number of branch instructions.
Practical tests within the CerCo project on small to medium pieces of code have shown that in almost all cases, a fixed point is reached in 3 passes; only in one case did the algorithm need 4.
This is not surprising: after all, the difference between short/absolute and long jumps is only one byte (three for conditional jumps).
For a change from short/absolute to long to have an effect on other jumps is therefore relatively uncommon, which explains why a fixed point is reached so quickly.

\subsection{The algorithm in detail}

The branch displacement algorithm forms part of the translation from pseudocode to assembler.
More specifically, it is used by the function that translates pseudo-addresses (natural numbers indicating the position of the instruction in the program) to actual addresses in memory.
Note that in pseudocode, all instructions are of size 1.

Our original intention was to have two different functions: a function $\mathtt{policy}: \mathbb{N} \rightarrow \{\mathtt{short\_jump}, \mathtt{absolute\_jump}, \mathtt{long\_jump}\}$ to associate jumps to their intended encoding, and a function $\sigma: \mathbb{N} \rightarrow \mathtt{Word}$ to associate pseudo-addresses to machine addresses.
$\sigma$ would use $\mathtt{policy}$ to determine the size of jump instructions.
This turned out to be suboptimal from the algorithmic point of view and impossible to prove correct.

From the algorithmic point of view, in order to create the $\mathtt{policy}$ function, we must necessarily have a translation from pseudo-addresses to machine addresses (i.e. a $\sigma$ function): in order to judge the distance between a jump and its destination, we must know their memory locations.
Conversely, in order to create the $\sigma$ function, we need to have the $\mathtt{policy}$ function, otherwise we do not know the sizes of the jump instructions in the program.

Much the same problem appears when we try to prove the algorithm correct: the correctness of $\mathtt{policy}$ depends on the correctness of $\sigma$, and the correctness of $\sigma$ depends on the correctness of $\mathtt{policy}$.

We solved this problem by integrating the $\mathtt{policy}$ and $\sigma$ algorithms.
We now have a function $\sigma: \mathbb{N} \rightarrow \mathtt{Word} \times \mathtt{bool}$ which associates a pseudo-address to a machine address.
The boolean denotes a forced long jump; as noted in the previous section, if during the fixed point computation an absolute jump could be re-encoded as a short jump, the result is actually a long jump.
It might therefore be the case that jumps are encoded as long jumps without this actually being necessary, and this information needs to be passed to the code generating function.

The assembler function encodes the jumps by checking the distance between source and destination according to $\sigma$, so it could select an absolute jump in a situation where there should be a long jump.
The boolean is there to prevent this from happening by indicating the locations where a long jump should be encoded, even if a shorter jump is possible.
This has no effect on correctness, since a long jump is applicable in any situation.
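
For illustration, the final choice made by the assembler, given a $\sigma$ returning a machine address together with the forced-long boolean, might look as follows (a hypothetical Python sketch of our own; the real code generation is written in Matita and is rather more involved):

```python
def encode_jump(sigma, ppc, target_ppc):
    """sigma maps a pseudo-address to (machine address, forced long?)."""
    pc, forced_long = sigma[ppc]
    target, _ = sigma[target_ppc]
    if forced_long:
        return 'LJMP'   # always correct, even where a shorter jump fits
    if -128 <= target - pc <= 127:
        return 'SJMP'
    if (pc >> 11) == (target >> 11):   # same 2 KiB segment
        return 'AJMP'
    return 'LJMP'
```

The boolean thus overrides the purely distance-based choice exactly in the problematic case described above.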

\begin{figure}[t]
\begin{algorithmic}
\Function{f}{$labels$, $old\_sigma$, $instr$, $ppc$, $acc$}
  \State $\langle added, pc, sigma \rangle \gets acc$
  \If {$instr$ is a backward jump to $j$}
    \State $length \gets \mathrm{jump\_size}(pc,sigma_1(labels(j)))$
    \Comment compute jump distance
  \ElsIf {$instr$ is a forward jump to $j$}
    \State $length \gets \mathrm{jump\_size}(pc,old\_sigma_1(labels(j))+added)$
  \EndIf
  \State $old\_length \gets \mathrm{old\_sigma_2}(ppc)$
  \State $new\_length \gets \mathrm{max}(old\_length, length)$
  \Comment length must never decrease
  \State $old\_size \gets \mathrm{instruction\_size}(instr,old\_length)$
  \State $new\_size \gets \mathrm{instruction\_size}(instr,new\_length)$
  \Comment compute size in bytes
  \State $new\_added \gets added+(new\_size-old\_size)$
  \Comment keep track of total added bytes
  \State $new\_sigma \gets old\_sigma$
  \State $new\_sigma_1(ppc+1) \gets pc+new\_size$
  \State $new\_sigma_2(ppc) \gets new\_length$
  \Comment update $\sigma$ \\
  \Return $\langle new\_added, pc+new\_size, new\_sigma \rangle$
\EndFunction
\end{algorithmic}
\caption{The heart of the algorithm}
\label{f:jump_expansion_step}
\end{figure}

The algorithm, shown in Figure~\ref{f:jump_expansion_step}, works by folding the function {\sc f} over the entire program, thus gradually constructing $sigma$.
This constitutes one step in the fixed point calculation; successive steps repeat the fold until a fixed point is reached.
We have abstracted away the case where an instruction is not a jump, since the size of these instructions is constant.

Parameters of the function {\sc f} are:
\begin{itemize}
  \item a function $labels$ that associates a label to its pseudo-address;
  \item $old\_sigma$, the $\sigma$ function returned by the previous iteration of the fixed point calculation;
  \item $instr$, the instruction currently under consideration;
  \item $ppc$, the pseudo-address of $instr$;
  \item $acc$, the fold accumulator, which contains $added$ (the number of bytes added to the program size with respect to the previous iteration), $pc$ (the highest memory address reached so far), and of course $sigma$, the $\sigma$ function under construction.
\end{itemize}
The first two are parameters that remain the same through one iteration; the final three are standard parameters for a fold function (including $ppc$, which is simply the number of instructions of the program already processed).

The $\sigma$ functions used by {\sc f} are not of the same type as the final $\sigma$ function: they are of type $\sigma: \mathbb{N} \rightarrow \mathbb{N} \times \{\mathtt{short\_jump}, \mathtt{absolute\_jump},\mathtt{long\_jump}\}$; that is, a function that associates a pseudo-address with a memory address and a jump length.
We do this to ease the comparison of jump lengths between iterations.
In the algorithm, we use the notation $sigma_1(x)$ to denote the memory address corresponding to $x$, and $sigma_2(x)$ for the jump length corresponding to $x$.

Note that the $\sigma$ function used for label lookup varies depending on whether the label is behind our current position or ahead of it.
For backward branches, where the label is behind our current position, we can use $sigma$ for lookup, since its memory address is already known.
However, for forward branches, the memory address of the label is not yet known, so we must use $old\_sigma$.

We cannot use $old\_sigma$ without change: it might be the case that we have already increased the size of some branch instructions before, making the program longer and moving every instruction forward.
We must compensate for this by adding the size increase of the program to the label's memory address according to $old\_sigma$, so that branch instruction spans are not underestimated.
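
Putting the pieces together, one step of the fold and the surrounding fixed point iteration can be rendered in executable form as follows (our own simplified sketch, not the Matita code: jump classification, instruction sizes and the initial $\sigma$ are toy assumptions, and $\sigma$ is represented as a dictionary from pseudo-addresses to address/length pairs):

```python
SIZES = {'short': 2, 'absolute': 2, 'long': 3}   # jump sizes; others are 1 byte
ORDER = {None: -1, 'short': 0, 'absolute': 1, 'long': 2}

def jump_size(pc, target):
    if -128 <= target - pc <= 127:
        return 'short'
    if (pc >> 11) == (target >> 11):   # same 2 KiB segment
        return 'absolute'
    return 'long'

def f(labels, old_sigma, instr, ppc, acc):
    added, pc, sigma = acc
    length = None
    if instr[0] == 'jmp':
        target_ppc = labels[instr[1]]
        if target_ppc <= ppc:          # backward: address already known
            length = jump_size(pc, sigma[target_ppc][0])
        else:                          # forward: compensate with bytes added
            length = jump_size(pc, old_sigma[target_ppc][0] + added)
    old_length = old_sigma[ppc][1]
    # length must never decrease
    new_length = old_length if ORDER[old_length] >= ORDER[length] else length
    old_size = SIZES.get(old_length, 1)
    new_size = SIZES.get(new_length, 1)
    sigma = dict(sigma)
    _, next_length = sigma.get(ppc + 1, (0, None))
    sigma[ppc] = (pc, new_length)                  # sigma_2(ppc) := new_length
    sigma[ppc + 1] = (pc + new_size, next_length)  # sigma_1(ppc+1) := pc + size
    return (added + (new_size - old_size), pc + new_size, sigma)

def fixed_point(program, labels):
    sigma = {ppc: (0, None) for ppc in range(len(program) + 1)}
    while True:
        acc = (0, 0, dict(sigma))
        for ppc, instr in enumerate(program):
            acc = f(labels, sigma, instr, ppc, acc)
        if acc[2] == sigma:
            return sigma
        sigma = acc[2]
```

Because jump lengths only ever grow and sizes are bounded, the iteration reaches a fixed point after at most a few passes, as discussed above.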

%Note also that we add the pc to $sigma$ at location $ppc+1$, whereas we add the
%jump length at location $ppc$. We do this so that $sigma(ppc)$ will always
%return a pair with the start address of the instruction at $ppc$ and the
%length of its branch instruction (if any); the end address of the program can
%be found at $sigma(n+1)$, where $n$ is the number of instructions in the
%program.

\section{The proof}

In this section, we present the correctness proof for the algorithm in more detail.
The main correctness statement is shown, slightly simplified, in Figure~\ref{statement}.

\mathtt{sigma}&\omit\rlap{$\mathtt{\_policy\_specification} \equiv
\lambda program.\lambda sigma.$} \notag\\
  & \omit\rlap{$sigma\ 0 = 0\ \wedge$} \notag\\
  & \mathbf{let}\ & & \omit\rlap{$instr\_list \equiv code\ program\ \mathbf{in}$} \notag\\
  &&& \omit\rlap{$\forall ppc.ppc < |instr\_list| \rightarrow$} \notag\\
  &&& \mathbf{let}\ && pc \equiv sigma\ ppc\ \mathbf{in} \notag\\
  &&& \mathbf{let}\ && instruction \equiv \mathtt{fetch\_pseudo\_instruction}\ instr\_list\ ppc\ \mathbf{in} \notag\\
  &&& \mathbf{let}\ && next\_pc \equiv sigma\ (ppc+1)\ \mathbf{in}\notag\\
  &&&&& next\_pc = pc + \mathtt{instruction\_size}\ sigma\ ppc\ instruction\ \wedge\notag\\
  &&&&& (pc + \mathtt{instruction\_size}\ sigma\ ppc\ instruction < 2^{16}\ \vee\notag\\
  &&&&& (\forall ppc'.ppc' < |instr\_list| \rightarrow ppc < ppc' \rightarrow \notag\\
  &&&&& \mathbf{let}\ instruction' \equiv \mathtt{fetch\_pseudo\_instruction}\ instr\_list\ ppc'\ \mathbf{in} \notag\\
  &&&&&\ \mathtt{instruction\_size}\ sigma\ ppc'\ instruction' = 0)\ \wedge \notag\\
  &&&&& pc + \mathtt{instruction\_size}\ sigma\ ppc\ instruction = 2^{16})
\caption{Main correctness statement\label{statement}}
Informally, this means that when fetching a pseudo-instruction at $ppc$, the
translation by $\sigma$ of $ppc+1$ is the same as $\sigma(ppc)$ plus the size
of the instruction at $ppc$. That is, an instruction is placed consecutively
after the previous one, and there are no overlaps. The rest of the statement deals with memory size: either the next instruction fits within memory ($next\_pc < 2^{16}$) or it ends exactly at the memory limit,
in which case it must be the last translated instruction in the program (enforced by specifying that the size of all subsequent instructions is 0: there may be comments or cost annotations that are not translated).

Finally, we enforce that the program starts at address 0, i.e.\ $\sigma(0) = 0$. It may seem strange that we do not explicitly include a safety property stating that every jump instruction is of the right type with respect to its target (akin to the lemma from Figure~\ref{sigmasafe}), but this is not necessary. The distance is recalculated according to the instruction addresses from $\sigma$, which implicitly expresses safety.
Since our computation is a least fixed point computation, we must prove
termination in order to prove correctness: if the algorithm is halted after
a number of steps without reaching a fixed point, the solution is not
guaranteed to be correct. More specifically, branch instructions might be
encoded which do not coincide with the span between their location and their
destination.
Proof of termination rests on the fact that the encoding of branch
instructions can only grow larger, which means that we must reach a fixed point
after at most $2n$ iterations, with $n$ the number of branch instructions in
the program. This worst case is reached if at every iteration, we change the
encoding of exactly one branch instruction; since the encoding of any branch
instruction can change first from short to absolute, and then to long, there
can be at most $2n$ changes.
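The $2n$ bound can be illustrated with a toy Python simulation (this is not the verified algorithm, just an illustration of the counting argument): encodings only move up the chain short, absolute, long, so with $n$ branches there are at most $2n$ strict increases before a fixed point.

```python
# Encodings form a three-element chain that can only grow.
SHORT, ABSOLUTE, LONG = 0, 1, 2

def iterate(encodings):
    """One worst-case step: grow exactly one branch that can still grow."""
    for i, e in enumerate(encodings):
        if e < LONG:
            return encodings[:i] + [e + 1] + encodings[i + 1:]
    return encodings  # fixed point: nothing can grow any further

n = 4
enc = [SHORT] * n
steps = 0
while True:
    new = iterate(enc)
    if new == enc:
        break
    enc, steps = new, steps + 1
print(steps)  # 8, i.e. 2n iterations before the fixed point
```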
%The proof has been carried out using the ``Russell'' style from~\cite{Sozeau2006}.
%We have proven some invariants of the {\sc f} function from the previous
%section; these invariants are then used to prove properties that hold for every
%iteration of the fixed point computation; and finally, we can prove some
%properties of the fixed point.

\subsection{Fold invariants}
In this section, we present the invariants that hold during the fold of {\sc f}
over the program. These will be used later on to prove the properties of the
iteration. During the fixed point computation, the $\sigma$ function is
implemented as a trie for ease of access; computing $\sigma(x)$ is achieved by
looking up the value of $x$ in the trie. Actually, during the fold, the value
we pass along is a pair $\mathbb{N} \times \mathtt{ppc\_pc\_map}$. The first
component is the number of bytes added to the program so far with respect to
the previous iteration, and the second component, {\tt ppc\_pc\_map}, is the
actual $\sigma$ trie (which we'll call $strie$ to avoid confusion).
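A rough Python model of this fold state may help fix intuitions; the shapes below are assumed from the prose (a plain dictionary stands in for the trie) and are not the Matita code.

```python
# State threaded through the fold: (bytes added so far, ppc_pc_map),
# where ppc_pc_map is itself a pair of the program size so far and a
# map from pseudo-addresses to (machine address, jump length).
added = 0
strie = (0, {})   # (fst: size so far, snd: ppc -> (pc, jump_length))

def record(state, ppc, size, jump_length):
    """Lay out one instruction of `size` bytes at the current end."""
    added, (total, mapping) = state
    mapping = dict(mapping)
    mapping[ppc] = (total, jump_length)   # instruction starts at current total
    return (added, (total + size, mapping))

state = (added, strie)
state = record(state, 0, 1, 'short_jump')   # a 1-byte instruction
state = record(state, 1, 3, 'long_jump')    # a 3-byte branch
print(state[1][0])     # program size so far: 4
print(state[1][1][1])  # entry for instruction 1: (1, 'long_jump')
```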
\mathtt{out} & \mathtt{\_of\_program\_none} \equiv \lambda prefix.\lambda strie. \notag\\
& \forall i.i < 2^{16} \rightarrow (i > |prefix| \leftrightarrow
 \mathtt{lookup\_opt}\ i\ (\mathtt{snd}\ strie) = \mathtt{None})
The first invariant states that any pseudo-address not yet examined is not
present in the lookup trie.
\mathtt{not} & \mathtt{\_jump\_default} \equiv \lambda prefix.\lambda strie.\forall i.i < |prefix| \rightarrow\notag\\
& \neg\mathtt{is\_jump}\ (\mathtt{nth}\ i\ prefix) \rightarrow \mathtt{lookup}\ i\ (\mathtt{snd}\ strie) = \mathtt{short\_jump}
This invariant states that when we try to look up the jump length of a
pseudo-address where there is no branch instruction, we will get the default
value, a short jump.
\mathtt{jump} & \mathtt{\_increase} \equiv \lambda prefix.\lambda op.\lambda p.\forall i.i < |prefix| \rightarrow \notag\\
& \mathbf{let}\  oj \equiv \mathtt{lookup}\ i\ (\mathtt{snd}\ op)\ \mathbf{in} \notag\\
& \mathbf{let}\ j \equiv \mathtt{lookup}\ i\ (\mathtt{snd}\ p)\ \mathbf{in}\ \mathtt{jmpleq}\ oj\ j
This invariant states that between iterations (with $op$ being the previous
iteration, and $p$ the current one), jump lengths either remain equal or
increase. It is needed for proving termination.
\mathtt{sigma} & \omit\rlap{$\mathtt{\_compact\_unsafe} \equiv \lambda prefix.\lambda strie.\forall n.n < |prefix| \rightarrow$}\notag\\
& \mathbf{match}\ && \omit\rlap{$\mathtt{lookup\_opt}\ n\ (\mathtt{snd}\ strie)\ \mathbf{with}$}\notag\\
&&& \omit\rlap{$\mathtt{None} \Rightarrow \mathrm{False}$} \notag\\
&&& \omit\rlap{$\mathtt{Some}\ \langle pc, j \rangle \Rightarrow$} \notag\\
&&& \mathbf{match}\ && \mathtt{lookup\_opt}\ (n+1)\ (\mathtt{snd}\ strie)\ \mathbf{with}\notag\\
&&&&& \mathtt{None} \Rightarrow \mathrm{False} \notag\\
&&&&& \mathtt{Some}\ \langle pc_1, j_1 \rangle \Rightarrow
    pc_1 = pc + \notag\\
&&&&& \ \ \mathtt{instruction\_size\_jmplen}\ j\ (\mathtt{nth}\ n\ prefix)
\caption{Temporary safety property}
We now proceed with the safety lemmas. The lemma in
Figure~\ref{sigmacompactunsafe} is a temporary formulation of the main
property {\tt sigma\_policy\_specification}. Its main difference from the
final version is that it uses {\tt instruction\_size\_jmplen} to compute the
instruction size. This function uses $j$ to compute the span of branch
instructions (i.e.\ it uses the $\sigma$ under construction), instead
of looking at the distance between source and destination. This is because
$\sigma$ is still under construction; we will prove below that after the
final iteration, {\tt sigma\_compact\_unsafe} is equivalent to the main
property in Figure~\ref{sigmasafe}, which holds at the end of the computation.
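The shape of this compactness check can be sketched in Python. The sizes below are examples (on the MCS-51, {\tt SJMP} and {\tt AJMP} occupy two bytes and {\tt LJMP} three, though conditional branches expand differently), and the names are illustrative.

```python
# Example per-encoding sizes; real sizes depend on the instruction.
SIZES = {'short_jump': 2, 'absolute_jump': 2, 'long_jump': 3}

def sigma_compact_unsafe(n_instrs, mapping, size_of=lambda j: SIZES[j]):
    """Check that consecutive entries exist and are laid out back to back."""
    for n in range(n_instrs):
        cur, nxt = mapping.get(n), mapping.get(n + 1)
        if cur is None or nxt is None:
            return False          # a missing entry falsifies the property
        (pc, j), (pc1, _) = cur, nxt
        if pc1 != pc + size_of(j):
            return False          # a gap or overlap falsifies it too
    return True

mapping = {0: (0, 'short_jump'), 1: (2, 'long_jump'), 2: (5, 'short_jump')}
print(sigma_compact_unsafe(2, mapping))  # True: 0+2 = 2 and 2+3 = 5
```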
\mathtt{sigma} & \omit\rlap{$\mathtt{\_safe} \equiv \lambda prefix.\lambda labels.\lambda old\_strie.\lambda strie.\forall i.i < |prefix| \rightarrow$} \notag\\
& \omit\rlap{$\forall dest\_label.\mathtt{is\_jump\_to}\ (\mathtt{nth}\ i\ prefix)\ dest\_label \rightarrow$} \notag\\
& \mathbf{let} && \omit\rlap{$\ paddr \equiv \mathtt{lookup}\ labels\ dest\_label\ \mathbf{in}$} \notag\\
& \mathbf{let} && \omit\rlap{$\ \langle j, src, dest \rangle \equiv \mathbf{if} \ paddr\ \leq\ i\ \mathbf{then}$}\notag\\
&&&&& \mathbf{let}\ \langle \_, j \rangle \equiv \mathtt{lookup}\ i\ (\mathtt{snd}\ strie)\ \mathbf{in} \notag\\
&&&&& \mathbf{let}\ \langle pc\_plus\_jl, \_ \rangle \equiv \mathtt{lookup}\ (i+1)\ (\mathtt{snd}\ strie)\ \mathbf{in}\notag\\
&&&&& \mathbf{let}\ \langle addr, \_ \rangle \equiv \mathtt{lookup}\ paddr\ (\mathtt{snd}\ strie)\ \mathbf{in}\notag\\
&&&&& \langle j, pc\_plus\_jl, addr \rangle\notag\\
&&&\mathbf{else} \notag\\
&&&&&\mathbf{let}\ \langle \_, j \rangle \equiv \mathtt{lookup}\ i\ (\mathtt{snd}\ strie)\ \mathbf{in} \notag\\
&&&&&\mathbf{let}\ \langle pc\_plus\_jl, \_ \rangle \equiv \mathtt{lookup}\ (i+1)\ (\mathtt{snd}\ old\_strie)\ \mathbf{in}\notag\\
&&&&&\mathbf{let}\ \langle addr, \_ \rangle \equiv \mathtt{lookup}\ paddr\ (\mathtt{snd}\ old\_strie)\ \mathbf{in}\notag\\
&&&&&\langle j, pc\_plus\_jl, addr \rangle \mathbf{in}\ \notag\\
&&&\mathbf{match} && \ j\ \mathbf{with} \notag\\
&&&&&\mathrm{short\_jump} \Rightarrow \mathtt{short\_jump\_valid}\ src\ dest\notag\\
&&&&&\mathrm{absolute\_jump} \Rightarrow \mathtt{absolute\_jump\_valid}\ src\ dest\notag\\
&&&&&\mathrm{long\_jump} \Rightarrow \mathrm{True}
\caption{Safety property}
We compute the distance using the memory address of the instruction
plus its size. This follows the behaviour of the MCS-51 microprocessor, which
increases the program counter directly after fetching, and only then executes
the branch instruction (by changing the program counter again).
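This convention can be written out as a one-line computation; the helper below is illustrative, with the signed 8-bit range of a short ({\tt SJMP}-style) jump as a worked example.

```python
def branch_offset(src_addr, instr_size, dest_addr):
    """Displacement as the MCS-51 sees it: measured from the address just
    past the branch, because the pc is incremented before execution."""
    return dest_addr - (src_addr + instr_size)

def fits_short(src_addr, instr_size, dest_addr):
    """A short jump holds a signed 8-bit displacement."""
    return -128 <= branch_offset(src_addr, instr_size, dest_addr) <= 127

print(branch_offset(0x0100, 2, 0x0130))  # 46: reachable by a short jump
print(fits_short(0x0100, 2, 0x0200))     # False: 254 bytes is out of range
```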
There are also some simple properties that ensure that our policy
remains consistent, and that keep track of whether the fixed point has been
reached. We do not include them here in detail. Two of these properties give the values of $\sigma$ for the start and end of the program: $\sigma(0) = 0$ and $\sigma(n)$, where $n$ is the number of instructions up until now, is equal to the maximum memory address so far. There are also two properties that deal with what happens when the previous
iteration does not change with respect to the current one. $added$ is a
variable that keeps track of the number of bytes we have added to the program
size by changing the encoding of branch instructions. If $added$ is 0, the program
has not changed, and vice versa.
%& \mathtt{lookup}\ 0\ (\mathtt{snd}\ strie) = 0 \notag\\
%& \mathtt{lookup}\ |prefix|\ (\mathtt{snd}\ strie) = \mathtt{fst}\ strie
%& added = 0\ \rightarrow\ \mathtt{policy\_pc\_equal}\ prefix\ old\_strie\ strie \notag\\
%& \mathtt{policy\_jump\_equal}\ prefix\ old\_strie\ strie\ \rightarrow\ added = 0
We need to use two different formulations, because the fact that $added$ is 0
does not guarantee that no branch instructions have changed. For instance,
it is possible that we have replaced a short jump with an absolute jump, which
does not change the size of the branch instruction. Therefore {\tt policy\_pc\_equal} states that $old\_sigma_1(x) = sigma_1(x)$, whereas {\tt policy\_jump\_equal} states that $old\_sigma_2(x) = sigma_2(x)$. This formulation is sufficient to prove termination and compactness.
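The distinction between the two formulations can be demonstrated with a small Python model (assumed names and shapes): comparing only the addresses versus only the jump lengths of two successive tries.

```python
def policy_pc_equal(n, old, new):
    """Equality of the first components (machine addresses) only."""
    return all(old[i][0] == new[i][0] for i in range(n))

def policy_jump_equal(n, old, new):
    """Equality of the second components (jump lengths) only."""
    return all(old[i][1] == new[i][1] for i in range(n))

# A short jump upgraded to an absolute jump of the same size: addresses are
# unchanged (so added = 0) even though an encoding did change.
old = {0: (0, 'short_jump'), 1: (2, 'short_jump')}
new = {0: (0, 'short_jump'), 1: (2, 'absolute_jump')}
print(policy_pc_equal(2, old, new))    # True
print(policy_jump_equal(2, old, new))  # False
```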
Proving these invariants is simple, usually by induction on the prefix length.

\subsection{Iteration invariants}
These are invariants that hold after the completion of an iteration. The main
difference between these invariants and the fold invariants is that after the
completion of the fold, we check whether the program size does not exceed
64 Kb, the maximum memory size the MCS-51 can address. The type of an iteration therefore becomes an option type: {\tt None} in case
the program becomes larger than 64 Kb, or $\mathtt{Some}\ \sigma$
otherwise. We also no longer pass along the number of bytes added to the
program size, but a boolean that indicates whether we have changed something
during the iteration or not.
If the iteration returns {\tt None}, which means that the program has become too large for memory, there is an invariant that states that the previous iteration cannot
have every branch instruction encoded as a long jump. This is needed later in the proof of termination. If the iteration returns $\mathtt{Some}\ \sigma$, the fold invariants are retained without change.
Instead of using {\tt sigma\_compact\_unsafe}, we can now use the proper
version of the compactness property, {\tt sigma\_compact}.
\mathtt{sigma} & \omit\rlap{$\mathtt{\_compact} \equiv \lambda program.\lambda sigma.$} \notag\\
& \omit\rlap{$\forall n.n < |program|\ \rightarrow$} \notag\\
& \mathbf{match}\ && \omit\rlap{$\mathtt{lookup\_opt}\ n\ (\mathtt{snd}\ sigma)\ \mathbf{with}$}\notag\\
&&& \omit\rlap{$\mathrm{None}\ \Rightarrow\ \mathrm{False}$}\notag\\
&&& \omit\rlap{$\mathrm{Some}\ \langle pc, j \rangle \Rightarrow$}\notag\\
&&& \mathbf{match}\ && \mathtt{lookup\_opt}\ (n+1)\ (\mathtt{snd}\ sigma)\ \mathbf{with}\notag\\
&&&&& \mathrm{None}\ \Rightarrow\ \mathrm{False}\notag\\
&&&&& \mathrm{Some}\ \langle pc_1, j_1 \rangle \Rightarrow\notag\\
&&&&& \ \ pc_1 = pc + \mathtt{instruction\_size}\ sigma\ n\ (\mathtt{nth}\ n\ program)
This is almost the same invariant as {\tt sigma\_compact\_unsafe}, but differs in that it
computes the sizes of branch instructions by looking at the distance between
position and destination using $\sigma$. In actual use, the invariant is qualified: $\sigma$ is compact if there have
been no changes (i.e.\ the boolean passed along is {\tt true}). This is to
reflect the fact that we are doing a least fixed point computation: the result
is only correct when we have reached the fixed point.
There is another, trivial, invariant in case the iteration returns
$\mathtt{Some}\ \sigma$: it must hold that $\mathtt{fst}\ \sigma < 2^{16}$.
We need this invariant to make sure that addresses do not overflow.
The proof of {\tt nec\_plus\_ultra} goes as follows: if we return {\tt None},
then the program size must be greater than 64 Kb. However, since the
previous iteration did not return {\tt None} (because otherwise we would
terminate immediately), the program size in the previous iteration must have
been smaller than 64 Kb.
Suppose that all the branch instructions in the previous iteration are
encoded as long jumps. This means that all branch instructions in this
iteration are long jumps as well, and therefore that both iterations are equal
in the encoding of their branch instructions. Per the invariant, this means that
$added = 0$, and therefore that all addresses in both iterations are equal.
But if all addresses are equal, the program sizes must be equal too, which
means that the program size in the current iteration must be smaller than
64 Kb. This contradicts the earlier hypothesis, hence not all branch
instructions in the previous iteration are encoded as long jumps.
The proof of {\tt sigma\_compact} follows from {\tt sigma\_compact\_unsafe} and
the fact that we have reached a fixed point, i.e.\ the previous iteration and
the current iteration are the same. This means that the results of
{\tt instruction\_size\_jmplen} and {\tt instruction\_size} are the same.
\subsection{Final properties}

These are the invariants that hold after $2n$ iterations, where $n$ is the
program size (we use the program size for convenience; we could also use the
number of branch instructions, but this is more complex). Here, we only
need {\tt out\_of\_program\_none}, {\tt sigma\_compact} and the fact that
$\sigma(0) = 0$.
Termination can now be proved using the fact that there is a $k \leq 2n$, with
$n$ the length of the program, such that iteration $k$ is equal to iteration
$k+1$. There are two possibilities: either there is a $k < 2n$ such that this
property holds, or every iteration up to $2n$ is different. In the latter case,
since the only changes between the iterations can be from shorter jumps to
longer jumps, in iteration $2n$ every branch instruction must be encoded as
a long jump. In this case, iteration $2n$ is equal to iteration $2n+1$ and the
fixed point is reached.
In the previous sections we have discussed the branch displacement optimisation
problem, presented an optimised solution, and discussed the proof of
termination and correctness for this algorithm, as formalised in Matita.
The algorithm we have presented is fast and correct, but not optimal; a truly
optimal solution would need techniques like constraint solvers. While outside
the scope of the present research, it would be interesting to see whether enough
heuristics could be found to make such a solution practical for implementation
in an existing compiler; this would be especially useful for embedded systems,
where it is important to have as small a solution as possible.
In itself the algorithm is already useful, as it results in a smaller solution
than the simple `every branch instruction is long' used up until now---and with
only 64 Kb of memory, every byte counts. It also results in a smaller solution
than the greatest fixed point algorithm that {\tt gcc} uses. It does this
without sacrificing speed or correctness.
The certification of an assembler that relies on the branch displacement
algorithm described in this paper was presented in~\cite{lastyear}.
The assembler computes the $\sigma$ map as described in this paper and
then works in two passes. In the first pass it builds a map
from instruction labels to addresses in the assembly code. In the
second pass it iterates over the code, translating every pseudo jump
at address $src$ to a label $l$ associated with the assembly instruction at
address $dst$ into a jump of the size dictated by $(\sigma\ src)$ to
$(\sigma\ dst)$. In the case of conditional jumps, the translated jump may be
implemented with a series of instructions.
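The two passes can be sketched as follows; this is a deliberately simplified Python illustration, with invented instruction names and an invented $\sigma$, not the certified assembler.

```python
program = [('label', 'loop'), ('nop',), ('jump', 'loop')]

# Pass 1: map labels to pseudo-addresses.
labels = {arg: ppc for ppc, (op, *rest) in enumerate(program)
          for arg in rest if op == 'label'}

# Pass 2: translate each pseudo jump using a given sigma
# (here labels occupy no code space, so two ppcs share an address).
sigma = {0: 0, 1: 0, 2: 1, 3: 4}

def translate(instr, sigma, labels):
    op, *rest = instr
    if op == 'jump':
        return ('LJMP', sigma[labels[rest[0]]])  # jump to sigma(dst)
    if op == 'nop':
        return ('NOP',)
    return None  # pseudo-instructions such as labels emit no code

print([translate(i, sigma, labels) for i in program])
# [None, ('NOP',), ('LJMP', 0)]
```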
The proof of correctness abstracts over the algorithm used and only relies on
{\tt sigma\_policy\_specification} (page~\ref{sigmapolspec}). It is a variation
of a standard 1-to-many forward simulation proof~\cite{Leroy2009}. The
relation $R$ between states simply maps every code address $ppc$ stored in
registers or memory to $(\sigma\ ppc)$. To identify the code addresses,
an additional data structure is always kept together with the source
state and is updated by the semantics. The semantics is preserved
only for those programs whose source code operations
$(f\ ppc_1\ \ldots\ ppc_n)$ applied to code addresses $ppc_1 \ldots ppc_n$ are
such that $f\ (\sigma\ ppc_1)\ \ldots\ (\sigma\ ppc_n) = f\ ppc_1\ \ldots\ ppc_n$. For
example, an injective $\sigma$ preserves a binary equality test $f$ for code
addresses, but not pointer subtraction.
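The example in the last sentence can be made concrete with a few lines of Python: an injective relocation commutes with an equality test on code addresses, but not with pointer subtraction, because relocation changes distances.

```python
# An injective sigma: addresses grow, but by varying gaps.
sigma = {0: 0, 1: 2, 2: 5, 3: 9}

eq = lambda a, b: a == b     # a well-behaved operation on code addresses
sub = lambda a, b: a - b     # a badly-behaved one

a, b = 3, 1
print(eq(sigma[a], sigma[b]) == eq(a, b))    # True: equality is preserved
print(sub(sigma[a], sigma[b]) == sub(a, b))  # False: 7 != 2
```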
The main lemma (fetching simulation), which relies on\\
{\tt sigma\_policy\_specification} and is established by structural induction
over the source code, says that fetching an assembly instruction at
position $ppc$ is equal to fetching the translation of the instruction at
position $(\sigma\ ppc)$, and that the new incremented program counter is at
the beginning of the next instruction (compactness). The only exception is
when the instruction fetched is placed at the end of code memory and is
followed only by dead code. Execution simulation is trivial because of the
restriction to well behaved programs w.r.t.\ $\sigma$. The condition
$\sigma\ 0 = 0$ is necessary because the hardware model prescribes that the
first instruction to be executed will be at address 0. For the details,
see~\cite{lastyear}.
Instead of verifying the algorithm directly, another solution to the problem
would be to run an optimisation algorithm, and then verify the safety of the
result using a verified validator. Such a validator would be easier to verify
than the algorithm itself and it would also be efficient, requiring only a
linear pass over the source code to test the specification. However, it is
surely also interesting to formally prove that the assembler never rejects
programs that should be accepted, i.e.\ that the algorithm itself is correct.
This is the topic of the current paper.
\subsection{Related work}

As far as we are aware, this is the first formal discussion of the branch
displacement optimisation algorithm.
The CompCert project is another verified compiler project.
Their backend~\cite{Leroy2009} generates assembly code for (amongst others) subsets of the
PowerPC and x86 (32-bit) architectures. At the assembly code stage, there is
no distinction between the span-dependent jump instructions, so a branch
displacement optimisation algorithm is not needed.
%An offshoot of the CompCert project is the CompCertTSO project, which adds
%thread concurrency and synchronisation to the CompCert compiler~\cite{TSO2011}.
%This compiler also generates assembly code and therefore does not include a
%branch displacement algorithm.

%Finally, there is also the Piton stack~\cite{Moore1996}, which not only includes the
%formal verification of a compiler, but also of the machine architecture
%targeted by that compiler, a microprocessor called the FM9001.
%However, this architecture does not have different
%jump sizes (branching is simulated by assigning values to the program counter),
%so the branch displacement problem is irrelevant.