Index: Papers/jarassembler2015/jarassembler2014.tex
===================================================================
 Papers/jarassembler2015/jarassembler2014.tex (revision 3555)
+++ (revision )
@@ 1,923 +1,0 @@
%\documentclass[twocolumn,draft]{svjour3}
\documentclass[a4paper]{article}
\usepackage{algpseudocode}
%\usepackage{algorithmicx}
\usepackage{alltt}
\usepackage{amsfonts}
\usepackage{amsmath}
\usepackage[british]{babel}
\usepackage{caption}
\usepackage[colorlinks]{hyperref}
\usepackage[utf8]{inputenc}
\usepackage{listings}
\usepackage{microtype}
\usepackage{subcaption}

\usepackage{kpfonts}
\usepackage[T1]{fontenc}

\renewcommand{\verb}{\lstinline}
\def\lstlanguagefiles{lstgrafite.tex}
\lstset{language=Grafite}

\begin{document}

\title{On the proof of correctness of a verified optimising assembler
\thanks{Research supported by the CerCo project, within the Future and Emerging Technologies (FET) programme of the Seventh Framework Programme for Research of the European Commission, under FETOpen grant number 243881}}
\author{Jaap Boender \and Dominic P. Mulligan \and Claudio Sacerdoti Coen}
%\institute{Department of Computer Science, University of Middlesex \and Computer Laboratory, University of Cambridge \and Dipartimento di Informatica, University of Bologna}

\maketitle

\begin{abstract}
Optimising assemblers face the `branch displacement' or `jump encoding' problem, that is: how best to choose between concrete machine code jump instructionstypically of differing instruction and offset sizeswhen expanding pseudoinstructions.
Ideally, on space constrained hardware, an optimising assembler would choose the set of pseudoinstruction expansions that minimises the size of the resulting machine code program, a task that is provably \textsc{np}hard.

As part of CerCo (`Certified Complexity')an \textsc{eu}funded project to develop a verified concrete complexity preserving compiler for a large subset of the C programming languagewe have implemented and proved correct an optimising assembler within the Matita interactive theorem prover.
Our assembler targets the instruction set of a typical microcontroller, the Intel \textsc{mcs51} series.
As is common with embedded systems development this microcontroller has a paucity of available code memory and therefore we face an additional pressure in reducing the size of any assembled machine code program.
Out of necessity, then, our assembler implements an algorithm for solving the branch displacement problem, and we have proved that this algorithm is correct.

However, this efficient expansion of jump pseudoinstructions into machine code equivalents is complex, and therefore could unneccessarily complicate the proof of correctness for the rest of the assembler.
We therefore isolate the decision making over how jumps should be expanded from the expansion process itself as much as possible using `policies', making the rest of the proof of correctness more straightforward.
Our proof strategy contains a tracking facility for `good addresses' and only programs that use good addresses have their semantics preserved under assembly, as we observe that it is impossible for an assembler to preserve the semantics of every assembly program.
Our strategy offers increased flexibility over the traditional approach to proving the correctness of assemblers, wherein addresses in assembly are kept opaque and immutable.
In particular, our approach permits experimentation with the benign manipulation of addresses.

We discuss wider issues associated with a proof of correctness for an assembler, detail our algorithm solving the `branch displacement' problem, and discuss our proof of correctness, employing `policies', for the assembler.

%\keywords{Formal verification, interactive theorem proving, assembler, branch displacement optimisation, CerCo (`Certified Complexity'), MCS51 microcontroller, Matita proof assistant}
\end{abstract}

\section{Introduction}

We consider the formalisation of an assembler for the Intel MCS51 8bit microprocessor in the Matita proof assistant~\cite{asperti:user:2007}.
This formalisation forms a major component of the EUfunded CerCo (`Certified Complexity') project~\cite{cerco:2011}, concerning the construction and formalisation of a concrete complexity preserving compiler for a large subset of the C programming language.
Though this document describes the formalisation of an assembler for a specific microprocessor, we believe that the techniques described are general, and should be of interest to anybody providing a proof of correctness for an optimising assembler.

The \textsc{mcs51}, commonly called the 8051, is an 8bit Harvard architecture \textsc{cisc} instruction set microprocessor dating from the early 1980s.
An extended variant, the \textsc{mcs52}, or 8052, was subsequently released adding extra \textsc{ram} and \textsc{rom} above and beyond that offered by the original \textsc{mcs51}, along with an additional timer.
Originally manufactured and marketed by Intel, the microprocessor quickly gained and maintained its popularity, especially in embedded systems development, with dozens of semiconductor foundries subsequently releasing derivative copies many of which are still manufactured today.

Admittedly, a typical \textsc{mcs51} derivative lacks many features compared to its more modern brethren.
In particular, most derivatives lack a cache, any pipelining features, branch speculation, outoforder execution, or other advanced microarchitectural features we now take for granted on stateoftheart microprocessors.
As a direct result program execution is much more predictable: as part of the processor's specification provided by the manufacturer we can be sure that a \texttt{NOP} instruction will always take exactly one processor cycle to execute, whilst an \texttt{ADD} instruction always takes two processor cycles, and so on and so forth for all instructions in the processor's instruction set.

Certainty surrounding execution times makes the \textsc{mcs51} especially attractive for certain embedded development tasks.
Moreover, this same certainty also makes the \textsc{mcs51} especially attractive for CerCo's ends, wherein we aim to produce a verified concretecomplexity preserving compiler that lifts a cost model for a C programinduced on the program's machine code image under compilationback through the compilation chain.
This cost model will ultimately be made manifest as complexity assertions annotating the original C source code, and these annotations may then be used by the programmerperhaps in conjunction with external reasoning toolsto derive accurate resource bounds for their program.
As a result, knowing that a given block of machine code instructions will take a fixed number of cycles to execute, irrespective of the program context or processor state when executed, permits us to generate a cycleaccurate cost model on the original C source code and make several simplifying assumptions in the correctness proof of our compiler.

Yet, such a goal introduces two complications.
First, as the CerCo cost model is induced on machine code, the CerCo compilation chain must ultimately produce machine code, rather than assembly code as in other verified compiler projects such as CompCert~\cite{Leroy2009}.
Ultimately, for the CerCo compilation chain to be fully verified, a verified assembler and a verified compiler must both be produced.

Second, the presence of an assembler allows us to simplify the proof of correctness of the CerCo compiler proper.
Rather than having the compiler choose particular jump, call or move machine code instructions, the compiler may produce idealised pseudoinstructions

To do this, we must solve the `branch displacement' problemdeciding how best to expand pseudojumps to labels in assembly language to machine code jumps.
This problem, also known as `jump encoding', is
a wellknown problem amongst assembler implementors~\cite{Hyde2006} and arises when pseudojumps can be expanded
in different ways to real machine instructions, but the different expansions
are not equivalent (e.g. jumps that differ in the distance that they `span', or instructions that differ in the number of clock cycles needed to completely execute them) and are not always
correct (e.g. correctness is only up to global constraints over the compiled
code). For instance, some jump instructions (short jumps) are very small
and can be executed quickly, but they can only reach destinations within a
certain distance from the current instruction. When the destinations are
too far away, larger and slower long jumps must be used. The use of a long jump may
augment the distance between another pseudojump and its target, forcing
another long jump use, in a cascade. The job of the optimising
compiler (assembler) is to individually expand every pseudoinstruction in such a way
that all global constraints are satisfied and that the compiled program
is minimal in size and faster in concrete time complexity.
This problem is known to be computationally hard for most CISC architectures (\textsc{np}hard, see~\cite{hyde:branch:2006}).

To simplify the CerCo C compiler we have chosen to implement an optimising assembler whose input language the compiler will target.
Labels, conditional jumps to labels, a program preamble containing global data and a \texttt{MOV} instruction for moving this global data into the MCS51's one 16bit register all feature in our assembly language.
We further simplify by ignoring linking, assuming that all our assembly programs are prelinked.

The requirements of the CerCo project add yet more complications to our proof of correctness, as we must also address a cost model.
CerCo imposes a cost model on C programs or, more specifically, on simple blocks of instructions.
This cost model is induced by the compilation process itself, and its noncompositional nature allows us to assign different costs to identical C statements depending on how they are compiled.
In short, we aim to obtain a very precise costing for a program by embracing the compilation process, not ignoring it.
At the assembler level, this is reflected by our need to induce a cost
model on the assembly code as a function of the assembly program and the
strategy used to solve the branch displacement problem. In particular, our
optimising assembler should also return a map that assigns a cost (in clock
cycles) to every instruction in the source program. We expect the induced cost
to be preserved by the assembler: we will prove that the compiled code
tightly simulates the source code by taking exactly the predicted amount of
time.
Note that the temporal `tightness' of the simulation is a fundamental prerequisite
of the correctness of the simulation, as some functions of the MCS51timers and \textsc{i/o}depend on the microprocessor's clock.
If the pseudo and concrete clock differ the result of an \textsc{i/o} operation may not be preserved.





















% This problem is the subject of the present paper. After introducing the problem
% in more detail, we will discuss the solutions used by other compilers, present
% the algorithm we use in the CerCo assembler, and discuss its verification,
% that is the proofs of termination and correctness using the Matita proof
% assistant~\cite{Asperti2007}.

% Formulating the final statement of correctness and finding the loop invariants
% have been nontrivial tasks and are, indeed, the main contribution of this
% paper. It has required considerable care and finetuning to formulate not
% only the minimal statement required for the ulterior proof of correctness of
% the assembler, but also the minimal set of invariants needed for the proof
% of correctness of the algorithm.

% The research presented in this paper has been executed within the CerCo project
% which aims at formally verifying a C compiler with cost annotations. The
% target architecture for this project is the MCS51, whose instruction set
% contains spandependent instructions. Furthermore, its maximum addressable
% memory size is very small (64 Kb), which makes it important to generate
% programs that are as small as possible. With this optimisation, however, comes increased complexity and hence increased possibility for error. We must make sure that the branch instructions are encoded correctly, otherwise the assembled program will behave unpredictably.

% All Matita files related to this development can be found on the CerCo
% website, \url{http://cerco.cs.unibo.it}. The specific part that contains the
% branch displacement algorithm is in the {\tt ASM} subdirectory, in the files
% {\tt PolicyFront.ma}, {\tt PolicyStep.ma} and {\tt Policy.ma}.

\section{The branch displacement optimisation problem}

In most modern instruction sets that have them, the only spandependent
instructions are branch instructions. Taking the ubiquitous x8664 instruction
set as an example, we find that it contains eleven different forms of the
unconditional branch instruction, all with different ranges, instruction sizes
and semantics (only six are valid in 64bit mode, for example). Some examples
are shown in Figure~\ref{f:x86jumps} (see also~\cite{IntelDev}).

\begin{figure}[h]
\begin{center}
\begin{tabular}{lll}
\hline
Instruction & Size (bytes) & Displacement range \\
\hline
Short jump & 2 & 128 to 127 bytes \\
Relative near jump & 5 & $2^{32}$ to $2^{32}1$ bytes \\
Absolute near jump & 6 & one segment (64bit address) \\
Far jump & 8 & entire memory (indirect jump) \\
\hline
\end{tabular}
\end{center}
\caption{List of x86 branch instructions}
\label{f:x86jumps}
\end{figure}

The chosen target architecture of the CerCo project is the Intel MCS51, which
features three types of branch instructions (or jump instructions; the two terms
are used interchangeably), as shown in Figure~\ref{f:mcs51jumps}.

\begin{figure}[h]
\begin{center}
\begin{tabular}{llll}
\hline
Instruction & Size & Execution time & Displacement range \\
 & (bytes) & (cycles) & \\
\hline
SJMP (`short jump') & 2 & 2 & 128 to 127 bytes \\
AJMP (`absolute jump') & 2 & 2 & one segment (11bit address) \\
LJMP (`long jump') & 3 & 3 & entire memory \\
\hline
\end{tabular}
\end{center}
\caption{List of MCS51 branch instructions}
\label{f:mcs51jumps}
\end{figure}

Conditional branch instructions are only available in short form, which
means that a conditional branch outside the short address range has to be
encoded using three branch instructions (for instructions whose logical
negation is available, it can be done with two branch instructions, but for
some instructions this is not the case). The call instruction is
only available in absolute and long forms.

Note that even though the MCS51 architecture is much less advanced and much
simpler than the x8664 architecture, the basic types of branch instruction
remain the same: a short jump with a limited range, an intrasegment jump and a
jump that can reach the entire available memory.

Generally, in code fed to the assembler as input, the only
difference between branch instructions is semantics, not span. This
means that a distinction is made between an unconditional branch and the
several kinds of conditional branch, but not between their short, absolute or
long variants.

The algorithm used by the assembler to encode these branch instructions into
the different machine instructions is known as the {\em branch displacement
algorithm}. The optimisation problem consists of finding as small an encoding as
possible, thus minimising program length and execution time.

Similar problems, e.g. the branch displacement optimisation problem for other
architectures, are known to be NPcomplete~\cite{Robertson1979,Szymanski1978},
which could make finding an optimal solution very timeconsuming.

The canonical solution, as shown by Szymanski~\cite{Szymanski1978} or more
recently by Dickson~\cite{Dickson2008} for the x86 instruction set, is to use a
fixed point algorithm that starts with the shortest possible encoding (all
branch instruction encoded as short jumps, which is likely not a correct
solution) and then iterates over the source to reencode those branch
instructions whose target is outside their range.

\subsection*{Adding absolute jumps}

In both papers mentioned above, the encoding of a jump is only dependent on the
distance between the jump and its target: below a certain value a short jump
can be used; above this value the jump must be encoded as a long jump.

Here, termination of the smallest fixed point algorithm is easy to prove. All
branch instructions start out encoded as short jumps, which means that the
distance between any branch instruction and its target is as short as possible
(all the intervening jumps are short).
If, in this situation, there is a branch instruction $b$ whose span is not
within the range for a short jump, we can be sure that we can never reach a
situation where the span of $j$ is so small that it can be encoded as a short
jump. This argument continues to hold throughout the subsequent iterations of
the algorithm: short jumps can change into long jumps, but not \emph{vice versa},
as spans only increase. Hence, the algorithm either terminates early when a fixed
point is reached or when all short jumps have been changed into long jumps.

Also, we can be certain that we have reached an optimal solution: a short jump
is only changed into a long jump if it is absolutely necessary.

However, neither of these claims (termination nor optimality) hold when we add
the absolute jump. With absolute jumps, the encoding of a branch
instruction no longer depends only on the distance between the branch
instruction and its target. An absolute jump is possible when instruction and
target are in the same segment (for the MCS51, this means that the first 5
bytes of their addresses have to be equal). It is therefore entirely possible
for two branch instructions with the same span to be encoded in different ways
(absolute if the branch instruction and its target are in the same segment,
long if this is not the case).

\begin{figure}[t]
\begin{subfigure}[b]{.45\linewidth}
\small
\begin{alltt}
 jmp X
 \ldots
L\(\sb{0}\): \ldots
% Start of new segment if
% jmp X is encoded as short
 \ldots
 jmp L\(\sb{0}\)
\end{alltt}
\caption{Example of a program where a long jump becomes absolute}
\label{f:term_example}
\end{subfigure}
\hfill
\begin{subfigure}[b]{.45\linewidth}
\small
\begin{alltt}
L\(\sb{0}\): jmp X
X: \ldots
 \ldots
L\(\sb{1}\): \ldots
% Start of new segment if
% jmp X is encoded as short
 \ldots
 jmp L\(\sb{1}\)
 \ldots
 jmp L\(\sb{1}\)
 \ldots
 jmp L\(\sb{1}\)
 \ldots
\end{alltt}
\caption{Example of a program where the fixedpoint algorithm is not optimal}
\label{f:opt_example}
\end{subfigure}
\end{figure}

This invalidates our earlier termination argument: a branch instruction, once encoded
as a long jump, can be reencoded during a later iteration as an absolute jump.
Consider the program shown in Figure~\ref{f:term_example}. At the start of the
first iteration, both the branch to {\tt X} and the branch to $\mathtt{L}_{0}$
are encoded as small jumps. Let us assume that in this case, the placement of
$\mathtt{L}_{0}$ and the branch to it are such that $\mathtt{L}_{0}$ is just
outside the segment that contains this branch. Let us also assume that the
distance between $\mathtt{L}_{0}$ and the branch to it is too large for the
branch instruction to be encoded as a short jump.

All this means that in the second iteration, the branch to $\mathtt{L}_{0}$ will
be encoded as a long jump. If we assume that the branch to {\tt X} is encoded as
a long jump as well, the size of the branch instruction will increase and
$\mathtt{L}_{0}$ will be `propelled' into the same segment as its branch
instruction, because every subsequent instruction will move one byte forward.
Hence, in the third iteration, the branch to $\mathtt{L}_{0}$ can be encoded as
an absolute jump. At first glance, there is nothing that prevents us from
constructing a configuration where two branch instructions interact in such a
way as to iterate indefinitely between long and absolute encodings.

This situation mirrors the explanation by Szymanski~\cite{Szymanski1978} of why
the branch displacement optimisation problem is NPcomplete. In this explanation,
a condition for NPcompleteness is the fact that programs be allowed to contain
{\em pathological} jumps. These are branch instructions that can normally not be
encoded as a short(er) jump, but gain this property when some other branch
instructions are encoded as a long(er) jump. This is exactly what happens in
Figure~\ref{f:term_example}. By encoding the first branch instruction as a long
jump, another branch instruction switches from long to absolute (which is
shorter).

In addition, our previous optimality argument no longer holds. Consider the
program shown in Figure~\ref{f:opt_example}. Suppose that the distance between
$\mathtt{L}_{0}$ and $\mathtt{L}_{1}$ is such that if {\tt jmp X} is encoded
as a short jump, there is a segment border just after $\mathtt{L}_{1}$. Let
us also assume that all three branches to $\mathtt{L}_{1}$ are in the same
segment, but far enough away from $\mathtt{L}_{1}$ that they cannot be encoded
as short jumps.

Then, if {\tt jmp X} were to be encoded as a short jump, which is clearly
possible, all of the branches to $\mathtt{L}_{1}$ would have to be encoded as
long jumps. However, if {\tt jmp X} were to be encoded as a long jump, and
therefore increase in size, $\mathtt{L}_{1}$ would be `propelled' across the
segment border, so that the three branches to $\mathtt{L}_{1}$ could be encoded
as absolute jumps. Depending on the relative sizes of long and absolute jumps,
this solution might actually be smaller than the one reached by the smallest
fixed point algorithm.

\section{Our algorithm}

\subsection{Design decisions}

Given the NPcompleteness of the problem, finding optimal solutions
(using, for example, a constraint solver) can potentially be very costly.

The SDCC compiler~\cite{SDCC2011}, which has a backend targeting the MCS51
instruction set, simply encodes every branch instruction as a long jump
without taking the distance into account. While certainly correct (the long
jump can reach any destination in memory) and a very fast solution to compute,
it results in a less than optimal solution in terms of output size and
execution time.

On the other hand, the {\tt gcc} compiler suite, while compiling
C on the x86 architecture, uses a greatest fix point algorithm. In other words,
it starts with all branch instructions encoded as the largest jumps
available, and then tries to reduce the size of branch instructions as much as
possible.

Such an algorithm has the advantage that any intermediate result it returns
is correct: the solution where every branch instruction is encoded as a large
jump is always possible, and the algorithm only reduces those branch
instructions whose destination address is in range for a shorter jump.
The algorithm can thus be stopped after a determined number of steps without
sacrificing correctness.

The result, however, is not necessarily optimal. Even if the algorithm is run
until it terminates naturally, the fixed point reached is the {\em greatest}
fixed point, not the least fixed point. Furthermore, {\tt gcc} (at least for
the x86 architecture) only uses short and long jumps. This makes the algorithm
more efficient, as shown in the previous section, but also results in a less
optimal solution.

In the CerCo assembler, we opted at first for a least fixed point algorithm,
taking absolute jumps into account.

Here, we ran into a problem with proving termination, as explained in the
previous section: if we only take short and long jumps into account, the jump
encoding can only switch from short to long, but never in the other direction.
When we add absolute jumps, however, it is theoretically possible for a branch
instruction to switch from absolute to long and back, as previously explained.
Proving termination then becomes difficult, because there is nothing that
precludes a branch instruction from oscillating back and forth between absolute
and long jumps indefinitely.

To keep the algorithm in the same complexity class and more easily
prove termination, we decided to explicitly enforce the `branch instructions
must always grow longer' requirement: if a branch instruction is encoded as a
long jump in one iteration, it will also be encoded as a long jump in all the
following iterations. Therefore the encoding of any branch instruction
can change at most two times: once from short to absolute (or long), and once
from absolute to long.

There is one complicating factor. Suppose that a branch instruction is encoded
in step $n$ as an absolute jump, but in step $n+1$ it is determined that
(because of changes elsewhere) it can now be encoded as a short jump. Due to
the requirement that the branch instructions must always grow longer,
the branch encoding will be encoded as an absolute jump in step
$n+1$ as well.

This is not necessarily correct. A branch instruction that can be
encoded as a short jump cannot always also be encoded as an absolute jump, as a
short jump can bridge segments, whereas an absolute jump cannot. Therefore,
in this situation we have decided to encode the branch instruction as a long
jump, which is always correct.

The resulting algorithm, therefore, will not return the least fixed point, as
it might have too many long jumps. However, it is still better than the
algorithms from SDCC and {\tt gcc}, since even in the worst case, it will still
return a smaller or equal solution.

Experimenting with our algorithm on the test suite of C programs included with
gcc 2.3.3 has shown that on average, about 25 percent of jumps are encoded as short or absolute jumps by the algorithm. As not all instructions are jumps, this does not make for a large reduction in size, but it can make for a reduction in execution time: if jumps
are executed multiple times, for example in loops, the fact that short jumps take less cycles to execute than long jumps can have great effect.

As for complexity, there are at most $2n$ iterations, where $n$ is the number of
branch instructions. Practical tests within the CerCo project on small to
medium pieces of code have shown that in almost all cases, a fixed point is
reached in 3 passes. Only in one case did the algorithm need 4. This is not surprising: after all, the difference between short/absolute and
long jumps is only one byte (three for conditional jumps). For a change from
short/absolute to long to have an effect on other jumps is therefore relatively
uncommon, which explains why a fixed point is reached so quickly.

\subsection{The algorithm in detail}

The branch displacement algorithm forms part of the translation from
pseudocode to assembler. More specifically, it is used by the function that
translates pseudoaddresses (natural numbers indicating the position of the
instruction in the program) to actual addresses in memory. Note that in pseudocode, all instructions are of size 1.

Our original intention was to have two different functions, one function
$\mathtt{policy}: \mathbb{N} \rightarrow \{\mathtt{short\_jump},
\mathtt{absolute\_jump}, \mathtt{long\_jump}\}$ to associate jumps to their
intended encoding, and a function $\sigma: \mathbb{N} \rightarrow
\mathtt{Word}$ to associate pseudoaddresses to machine addresses. $\sigma$
would use $\mathtt{policy}$ to determine the size of jump instructions. This turned out to be suboptimal from the algorithmic point of view and
impossible to prove correct.

From the algorithmic point of view, in order to create the $\mathtt{policy}$
function, we must necessarily have a translation from pseudoaddresses
to machine addresses (i.e. a $\sigma$ function): in order to judge the distance
between a jump and its destination, we must know their memory locations.
Conversely, in order to create the $\sigma$ function, we need to have the
$\mathtt{policy}$ function, otherwise we do not know the sizes of the jump
instructions in the program.

Much the same problem appears when we try to prove the algorithm correct: the
correctness of $\mathtt{policy}$ depends on the correctness of $\sigma$, and
the correctness of $\sigma$ depends on the correctness of $\mathtt{policy}$.

We solved this problem by integrating the $\mathtt{policy}$ and $\sigma$
algorithms. We now have a function
$\sigma: \mathbb{N} \rightarrow \mathtt{Word} \times \mathtt{bool}$ which
associates a pseudoaddress to a machine address. The boolean denotes a forced
long jump; as noted in the previous section, if during the fixed point
computation an absolute jump changes to be potentially reencoded as a short
jump, the result is actually a long jump. It might therefore be the case that
jumps are encoded as long jumps without this actually being necessary, and this
information needs to be passed to the code generating function.

The assembler function encodes the jumps by checking the distance between
source and destination according to $\sigma$, so it could select an absolute
jump in a situation where there should be a long jump. The boolean is there
to prevent this from happening by indicating the locations where a long jump
should be encoded, even if a shorter jump is possible. This has no effect on
correctness, since a long jump is applicable in any situation.

\begin{figure}[t]
\small
\begin{algorithmic}
\Function{f}{$labels$,$old\_sigma$,$instr$,$ppc$,$acc$}
 \State $\langle added, pc, sigma \rangle \gets acc$
 \If {$instr$ is a backward jump to $j$}
 \State $length \gets \mathrm{jump\_size}(pc,sigma_1(labels(j)))$
 \Comment compute jump distance
 \ElsIf {$instr$ is a forward jump to $j$}
 \State $length \gets \mathrm{jump\_size}(pc,old\_sigma_1(labels(j))+added)$
 \EndIf
 \State $old\_length \gets \mathrm{old\_sigma_1}(ppc)$
 \State $new\_length \gets \mathrm{max}(old\_length, length)$
 \Comment length must never decrease
 \State $old\_size \gets \mathrm{old\_sigma_2}(ppc)$
 \State $new\_size \gets \mathrm{instruction\_size}(instr,new\_length)$
 \Comment compute size in bytes
 \State $new\_added \gets added+(new\_sizeold\_size)$
 \Comment keep track of total added bytes
 \State $new\_sigma \gets old\_sigma$
 \State $new\_sigma_1(ppc+1) \gets pc+new\_size$
 \State $new\_sigma_2(ppc) \gets new\_length$
 \Comment update $\sigma$ \\
 \Return $\langle new\_added, pc+new\_size, new\_sigma \rangle$
\EndFunction
\end{algorithmic}
\caption{The heart of the algorithm}
\label{f:jump_expansion_step}
\end{figure}

The algorithm, shown in Figure~\ref{f:jump_expansion_step}, works by folding the
function {\sc f} over the entire program, thus gradually constructing $sigma$.
This constitutes one step in the fixed point calculation; successive steps
repeat the fold until a fixed point is reached. We have abstracted away the case where an instruction is not a jump, since the size of these instructions is constant.

Parameters of the function {\sc f} are:
\begin{itemize}
 \item a function $labels$ that associates a label to its pseudoaddress;
 \item $old\_sigma$, the $\sigma$ function returned by the previous
 iteration of the fixed point calculation;
 \item $instr$, the instruction currently under consideration;
 \item $ppc$, the pseudoaddress of $instr$;
 \item $acc$, the fold accumulator, which contains $added$ (the number of
 bytes added to the program size with respect to the previous iteration), $pc$
 (the highest memory address reached so far), and of course $sigma$, the
 $\sigma$ function under construction.
\end{itemize}
The first two are parameters that remain the same through one iteration, the
final three are standard parameters for a fold function (including $ppc$,
which is simply the number of instructions of the program already processed).

The $\sigma$ functions used by {\sc f} are not of the same type as the final
$\sigma$ function: they are of type
$\sigma: \mathbb{N} \rightarrow \mathbb{N} \times \{\mathtt{short\_jump},
\mathtt{absolute\_jump},\mathtt{long\_jump}\}$; a function that associates a
pseudoaddress with a memory address and a jump length. We do this to
ease the comparison of jump lengths between iterations. In the algorithm,
we use the notation $sigma_1(x)$ to denote the memory address corresponding to
$x$, and $sigma_2(x)$ for the jump length corresponding to $x$.

Note that the $\sigma$ function used for label lookup varies depending on
whether the label is behind our current position or ahead of it. For
backward branches, where the label is behind our current position, we can use
$sigma$ for lookup, since its memory address is already known. However, for
forward branches, the memory address of the address of the label is not yet
known, so we must use $old\_sigma$.

We cannot use $old\_sigma$ without change: it might be the case that we have
already increased the size of some branch instructions before, making the
program longer and moving every instruction forward. We must compensate for this
by adding the size increase of the program to the label's memory address
according to $old\_sigma$, so that branch instruction spans do not get
compromised.

%Note also that we add the pc to $sigma$ at location $ppc+1$, whereas we add the
%jump length at location $ppc$. We do this so that $sigma(ppc)$ will always
%return a pair with the start address of the instruction at $ppc$ and the
%length of its branch instruction (if any); the end address of the program can
%be found at $sigma(n+1)$, where $n$ is the number of instructions in the
%program.

\section{The proof}

In this section, we present the correctness proof for the algorithm in more
detail. The main correctness statement is shown, slightly simplified, in~Figure~\ref{statement}.
%
\begin{figure}[t]
\small
\begin{alignat*}{6}
\mathtt{sigma}&\omit\rlap{$\mathtt{\_policy\_specification} \equiv
\lambda program.\lambda sigma.$} \notag\\
 & \omit\rlap{$sigma\ 0 = 0\ \wedge$} \notag\\
 & \mathbf{let}\ & & \omit\rlap{$instr\_list \equiv code\ program\ \mathbf{in}$} \notag\\
 &&& \omit\rlap{$\forall ppc.ppc < instr\_list \rightarrow$} \notag\\
 &&& \mathbf{let}\ && pc \equiv sigma\ ppc\ \mathbf{in} \notag\\
 &&& \mathbf{let}\ && instruction \equiv \mathtt{fetch\_pseudo\_instruction}\ instr\_list\ ppc\ \mathbf{in} \notag\\
 &&& \mathbf{let}\ && next\_pc \equiv sigma\ (ppc+1)\ \mathbf{in}\notag\\
 &&&&& next\_pc = pc + \mathtt{instruction\_size}\ sigma\ ppc\ instruction\ \wedge\notag\\
 &&&&& (pc + \mathtt{instruction\_size}\ sigma\ ppc\ instruction < 2^{16}\ \vee\notag\\
 &&&&& (\forall ppc'.ppc' < instr\_list \rightarrow ppc < ppc' \rightarrow \notag\\
 &&&&& \mathbf{let}\ instruction' \equiv \mathtt{fetch\_pseudo\_instruction}\ instr\_list\ ppc'\ \mathbf{in} \notag\\
 &&&&&\ \mathtt{instruction\_size}\ sigma\ ppc'\ instruction' = 0)\ \wedge \notag\\
 &&&&& pc + \mathtt{instruction\_size}\ sigma\ ppc\ instruction = 2^{16})
\end{alignat*}
\caption{Main correctness statement\label{statement}}
\label{sigmapolspec}
\end{figure}
%
Informally, this means that when fetching a pseudoinstruction at $ppc$, the
translation by $\sigma$ of $ppc+1$ is the same as $\sigma(ppc)$ plus the size
of the instruction at $ppc$. That is, an instruction is placed consecutively
after the previous one, and there are no overlaps. The rest of the statement deals with memory size: either the next instruction fits within memory ($next\_pc < 2^{16}$) or it ends exactly at the limit memory,
in which case it must be the last translated instruction in the program (enforced by specfiying that the size of all subsequent instructions is 0: there may be comments or cost annotations that are not translated).

Finally, we enforce that the program starts at address 0, i.e. $\sigma(0) = 0$. It may seem strange that we do not explicitly include a safety property stating that every jump instruction is of the right type with respect to its target (akin to the lemma from Figure~\ref{sigmasafe}), but this is not necessary. The distance is recalculated according to the instruction addresses from $\sigma$, which implicitly expresses safety.

Since our computation is a least fixed point computation, we must prove
termination in order to prove correctness: if the algorithm is halted after
a number of steps without reaching a fixed point, the solution is not
guaranteed to be correct. More specifically, branch instructions might be
encoded which do not coincide with the span between their location and their
destination.

Proof of termination rests on the fact that the encoding of branch
instructions can only grow larger, which means that we must reach a fixed point
after at most $2n$ iterations, with $n$ the number of branch instructions in
the program. This worst case is reached if at every iteration, we change the
encoding of exactly one branch instruction; since the encoding of any branch
instruction can change first from short to absolute, and then to long, there
can be at most $2n$ changes.

%The proof has been carried out using the ``Russell'' style from~\cite{Sozeau2006}.
%We have proven some invariants of the {\sc f} function from the previous
%section; these invariants are then used to prove properties that hold for every
%iteration of the fixed point computation; and finally, we can prove some
%properties of the fixed point.

\subsection{Fold invariants}

In this section, we present the invariants that hold during the fold of {\sc f}
over the program. These will be used later on to prove the properties of the
iteration. During the fixed point computation, the $\sigma$ function is
implemented as a trie for ease of access; computing $\sigma(x)$ is achieved by
looking up the value of $x$ in the trie. Actually, during the fold, the value
we pass along is a pair $\mathbb{N} \times \mathtt{ppc\_pc\_map}$. The first
component is the number of bytes added to the program so far with respect to
the previous iteration, and the second component, {\tt ppc\_pc\_map}, is the
actual $\sigma$ trie (which we'll call $strie$ to avoid confusion).
%
{\small
\begin{alignat*}{2}
\mathtt{out} & \mathtt{\_of\_program\_none} \equiv \lambda prefix.\lambda strie. \notag\\
& \forall i.i < 2^{16} \rightarrow (i > prefix \leftrightarrow
 \mathtt{lookup\_opt}\ i\ (\mathtt{snd}\ strie) = \mathtt{None})
\end{alignat*}}
%
The first invariant states that any pseudoaddress not yet examined is not
present in the lookup trie.
%
{\small
\begin{alignat*}{2}
\mathtt{not} & \mathtt{\_jump\_default} \equiv \lambda prefix.\lambda strie.\forall i.i < prefix \rightarrow\notag\\
& \neg\mathtt{is\_jump}\ (\mathtt{nth}\ i\ prefix) \rightarrow \mathtt{lookup}\ i\ (\mathtt{snd}\ strie) = \mathtt{short\_jump}
\end{alignat*}}
%
This invariant states that when we try to look up the jump length of a
pseudoaddress where there is no branch instruction, we will get the default
value, a short jump.
%
{\small
\begin{alignat*}{4}
\mathtt{jump} & \mathtt{\_increase} \equiv \lambda pc.\lambda op.\lambda p.\forall i.i < prefix \rightarrow \notag\\
& \mathbf{let}\ oj \equiv \mathtt{lookup}\ i\ (\mathtt{snd}\ op)\ \mathbf{in} \notag\\
& \mathbf{let}\ j \equiv \mathtt{lookup}\ i\ (\mathtt{snd}\ p)\ \mathbf{in}\ \mathtt{jmpleq}\ oj\ j
\end{alignat*}}
%
This invariant states that between iterations (with $op$ being the previous
iteration, and $p$ the current one), jump lengths either remain equal or
increase. It is needed for proving termination.
%
\begin{figure}[h]
\small
\begin{alignat*}{6}
\mathtt{sigma} & \omit\rlap{$\mathtt{\_compact\_unsafe} \equiv \lambda prefix.\lambda strie.\forall n.n < prefix \rightarrow$}\notag\\
& \mathbf{match}\ && \omit\rlap{$\mathtt{lookup\_opt}\ n\ (\mathtt{snd}\ strie)\ \mathbf{with}$}\notag\\
&&& \omit\rlap{$\mathtt{None} \Rightarrow \mathrm{False}$} \notag\\
&&& \omit\rlap{$\mathtt{Some}\ \langle pc, j \rangle \Rightarrow$} \notag\\
&&& \mathbf{match}\ && \mathtt{lookup\_opt}\ (n+1)\ (\mathtt{snd}\ strie)\ \mathbf{with}\notag\\
&&&&& \mathtt{None} \Rightarrow \mathrm{False} \notag\\
&&&&& \mathtt{Some}\ \langle pc_1, j_1 \rangle \Rightarrow
 pc_1 = pc + \notag\\
&&&&& \ \ \mathtt{instruction\_size\_jmplen}\ j\ (\mathtt{nth}\ n\ prefix)
\end{alignat*}
\caption{Temporary safety property}
\label{sigmacompactunsafe}
\end{figure}
%
We now proceed with the safety lemmas. The lemma in
Figure~\ref{sigmacompactunsafe} is a temporary formulation of the main
property {\tt sigma\_policy\_specification}. Its main difference from the
final version is that it uses {\tt instruction\_size\_jmplen} to compute the
instruction size. This function uses $j$ to compute the span of branch
instructions (i.e. it uses the $\sigma$ under construction), instead
of looking at the distance between source and destination. This is because
$\sigma$ is still under construction; we will prove below that after the
final iteration, {\tt sigma\_compact\_unsafe} is equivalent to the main
property in Figure~\ref{sigmasafe} which holds at the end of the computation.
%
\begin{figure}[h]
\small
\begin{alignat*}{6}
\mathtt{sigma} & \omit\rlap{$\mathtt{\_safe} \equiv \lambda prefix.\lambda labels.\lambda old\_strie.\lambda strie.\forall i.i < prefix \rightarrow$} \notag\\
& \omit\rlap{$\forall dest\_label.\mathtt{is\_jump\_to\ (\mathtt{nth}\ i\ prefix})\ dest\_label \rightarrow$} \notag\\
& \mathbf{let} && \omit\rlap{$\ paddr \equiv \mathtt{lookup}\ labels\ dest\_label\ \mathbf{in}$} \notag\\
& \mathbf{let} && \omit\rlap{$\ \langle j, src, dest \rangle \equiv \mathbf{if} \ paddr\ \leq\ i\ \mathbf{then}$}\notag\\
&&&&& \mathbf{let}\ \langle \_, j \rangle \equiv \mathtt{lookup}\ i\ (\mathtt{snd}\ strie)\ \mathbf{in} \notag\\
&&&&& \mathbf{let}\ \langle pc\_plus\_jl, \_ \rangle \equiv \mathtt{lookup}\ (i+1)\ (\mathtt{snd}\ strie)\ \mathbf{in}\notag\\
&&&&& \mathbf{let}\ \langle addr, \_ \rangle \equiv \mathtt{lookup}\ paddr\ (\mathtt{snd}\ strie)\ \mathbf{in}\notag\\
&&&&& \langle j, pc\_plus\_jl, addr \rangle\notag\\
&&&\mathbf{else} \notag\\
&&&&&\mathbf{let}\ \langle \_, j \rangle \equiv \mathtt{lookup}\ i\ (\mathtt{snd}\ strie)\ \mathbf{in} \notag\\
&&&&&\mathbf{let}\ \langle pc\_plus\_jl, \_ \rangle \equiv \mathtt{lookup}\ (i+1)\ (\mathtt{snd}\ old\_strie)\ \mathbf{in}\notag\\
&&&&&\mathbf{let}\ \langle addr, \_ \rangle \equiv \mathtt{lookup}\ paddr\ (\mathtt{snd}\ old\_strie)\ \mathbf{in}\notag\\
&&&&&\langle j, pc\_plus\_jl, addr \rangle \mathbf{in}\ \notag\\
&&&\mathbf{match} && \ j\ \mathbf{with} \notag\\
&&&&&\mathrm{short\_jump} \Rightarrow \mathtt{short\_jump\_valid}\ src\ dest\notag\\
&&&&&\mathrm{absolute\_jump} \Rightarrow \mathtt{absolute\_jump\_valid}\ src\ dest\notag\\
&&&&&\mathrm{long\_jump} \Rightarrow \mathrm{True}
\end{alignat*}
\caption{Safety property}
\label{sigmasafe}
\end{figure}
%
We compute the distance using the memory address of the instruction
plus its size. This follows the behaviour of the MCS51 microprocessor, which
increases the program counter directly after fetching, and only then executes
the branch instruction (by changing the program counter again).

There are also some simple, properties to make sure that our policy
remains consistent, and to keep track of whether the fixed point has been
reached. We do not include them here in detail. Two of these properties give the values of $\sigma$ for the start and end of the program; $\sigma(0) = 0$ and $\sigma(n)$, where $n$ is the number of instructions up until now, is equal to the maximum memory address so far. There are also two properties that deal with what happens when the previous
iteration does not change with respect to the current one. $added$ is a
variable that keeps track of the number of bytes we have added to the program
size by changing the encoding of branch instructions. If $added$ is 0, the program
has not changed and vice versa.

%{\small
%\begin{align*}
%& \mathtt{lookup}\ 0\ (\mathtt{snd}\ strie) = 0 \notag\\
%& \mathtt{lookup}\ prefix\ (\mathtt{snd}\ strie) = \mathtt{fst}\ strie
%\end{align*}}


%{\small
%\begin{align*}
%& added = 0\ \rightarrow\ \mathtt{policy\_pc\_equal}\ prefix\ old\_strie\ strie \notag\\
%& \mathtt{policy\_jump\_equal}\ prefix\ old\_strie\ strie\ \rightarrow\ added = 0
%\end{align*}}

We need to use two different formulations, because the fact that $added$ is 0
does not guarantee that no branch instructions have changed. For instance,
it is possible that we have replaced a short jump with an absolute jump, which
does not change the size of the branch instruction. Therefore {\tt policy\_pc\_equal} states that $old\_sigma_1(x) = sigma_1(x)$, whereas {\tt policy\_jump\_equal} states that $old\_sigma_2(x) = sigma_2(x)$. This formulation is sufficient to prove termination and compactness.

Proving these invariants is simple, usually by induction on the prefix length.

\subsection{Iteration invariants}

These are invariants that hold after the completion of an iteration. The main
difference between these invariants and the fold invariants is that after the
completion of the fold, we check whether the program size does not supersede
64 Kb, the maximum memory size the MCS51 can address. The type of an iteration therefore becomes an option type: {\tt None} in case
the program becomes larger than 64 Kb, or $\mathtt{Some}\ \sigma$
otherwise. We also no longer pass along the number of bytes added to the
program size, but a boolean that indicates whether we have changed something
during the iteration or not.

If the iteration returns {\tt None}, which means that it has become too large for memory, there is an invariant that states that the previous iteration cannot
have every branch instruction encoded as a long jump. This is needed later in the proof of termination. If the iteration returns $\mathtt{Some}\ \sigma$, the fold invariants are retained without change.

Instead of using {\tt sigma\_compact\_unsafe}, we can now use the proper
invariant:
%
{\small
\begin{alignat*}{6}
\mathtt{sigma} & \omit\rlap{$\mathtt{\_compact} \equiv \lambda program.\lambda sigma.$} \notag\\
& \omit\rlap{$\forall n.n < program\ \rightarrow$} \notag\\
& \mathbf{match}\ && \omit\rlap{$\mathtt{lookup\_opt}\ n\ (\mathtt{snd}\ sigma)\ \mathbf{with}$}\notag\\
&&& \omit\rlap{$\mathrm{None}\ \Rightarrow\ \mathrm{False}$}\notag\\
&&& \omit\rlap{$\mathrm{Some}\ \langle pc, j \rangle \Rightarrow$}\notag\\
&&& \mathbf{match}\ && \mathtt{lookup\_opt}\ (n+1)\ (\mathtt{snd}\ sigma)\ \mathbf{with}\notag\\
&&&&& \mathrm{None}\ \Rightarrow\ \mathrm{False}\notag\\
&&&&& \mathrm{Some} \langle pc1, j1 \rangle \Rightarrow\notag\\
&&&&& \ \ pc1 = pc + \mathtt{instruction\_size}\ n\ (\mathtt{nth}\ n\ program)
\end{alignat*}}
%
This is almost the same invariant as ${\tt sigma\_compact\_unsafe}$, but differs in that it
computes the sizes of branch instructions by looking at the distance between
position and destination using $\sigma$. In actual use, the invariant is qualified: $\sigma$ is compact if there have
been no changes (i.e. the boolean passed along is {\tt true}). This is to
reflect the fact that we are doing a least fixed point computation: the result
is only correct when we have reached the fixed point.

There is another, trivial, invariant in case the iteration returns
$\mathtt{Some}\ \sigma$: it must hold that $\mathtt{fst}\ sigma < 2^{16}$.
We need this invariant to make sure that addresses do not overflow.

The proof of {\tt nec\_plus\_ultra} goes as follows: if we return {\tt None},
then the program size must be greater than 64 Kb. However, since the
previous iteration did not return {\tt None} (because otherwise we would
terminate immediately), the program size in the previous iteration must have
been smaller than 64 Kb.

Suppose that all the branch instructions in the previous iteration are
encoded as long jumps. This means that all branch instructions in this
iteration are long jumps as well, and therefore that both iterations are equal
in the encoding of their branch instructions. Per the invariant, this means that
$added = 0$, and therefore that all addresses in both iterations are equal.
But if all addresses are equal, the program sizes must be equal too, which
means that the program size in the current iteration must be smaller than
64 Kb. This contradicts the earlier hypothesis, hence not all branch
instructions in the previous iteration are encoded as long jumps.

The proof of {\tt sigma\_compact} follows from {\tt sigma\_compact\_unsafe} and
the fact that we have reached a fixed point, i.e. the previous iteration and
the current iteration are the same. This means that the results of
{\tt instruction\_size\_jmplen} and {\tt instruction\_size} are the same.

\subsection{Final properties}

These are the invariants that hold after $2n$ iterations, where $n$ is the
program size (we use the program size for convenience; we could also use the
number of branch instructions, but this is more complex). Here, we only
need {\tt out\_of\_program\_none}, {\tt sigma\_compact} and the fact that
$\sigma(0) = 0$.

Termination can now be proved using the fact that there is a $k \leq 2n$, with
$n$ the length of the program, such that iteration $k$ is equal to iteration
$k+1$. There are two possibilities: either there is a $k < 2n$ such that this
property holds, or every iteration up to $2n$ is different. In the latter case,
since the only changes between the iterations can be from shorter jumps to
longer jumps, in iteration $2n$ every branch instruction must be encoded as
a long jump. In this case, iteration $2n$ is equal to iteration $2n+1$ and the
fixed point is reached.

\section{Conclusion}

In the previous sections we have discussed the branch displacement optimisation
problem, presented an optimised solution, and discussed the proof of
termination and correctness for this algorithm, as formalised in Matita.

The algorithm we have presented is fast and correct, but not optimal; a true
optimal solution would need techniques like constraint solvers. While outside
the scope of the present research, it would be interesting to see if enough
heuristics could be found to make such a solution practical for implementing
in an existing compiler; this would be especially useful for embedded systems,
where it is important to have as small a solution as possible.

In itself the algorithm is already useful, as it results in a smaller solution
than the simple `every branch instruction is long' used up until nowand with
only 64 Kb of memory, every byte counts. It also results in a smaller solution
than the greatest fixed point algorithm that {\tt gcc} uses. It does this
without sacrificing speed or correctness.

The certification of an assembler that relies on the branch displacement
algorithm described in this paper was presented in~\cite{lastyear}.
The assembler computes the $\sigma$ map as described in this paper and
then works in two passes. In the first pass it builds a map
from instruction labels to addresses in the assembly code. In the
second pass it iterates over the code, translating every pseudo jump
at address $src$ to a label l associated to the assembly instruction at
address $dst$ to a jump of the size dictated by $(\sigma\ src)$ to
$(\sigma\ dst)$. In case of conditional jumps, the translated jump may be
implemented with a series of instructions.

The proof of correctness abstracts over the algorithm used and only relies on
{\tt sigma\_policy\_specification} (page~\ref{sigmapolspec}). It is a variation
of a standard 1tomany forward simulation proof~\cite{Leroy2009}. The
relation R between states just maps every code address $ppc$ stored in
registers or memory to $(\sigma\ ppc)$. To identify the code addresses,
an additional data structure is always kept together with the source
state and is updated by the semantics. The semantics is preserved
only for those programs whose source code operations
$(f\ ppc_1\ \ldots\ ppc_n)$ applied to code addresses $ppc_1 \ldots ppc_n$ are
such that $(f\ (\sigma\ ppc_1)\ ...\ (\sigma\ ppc_n) = f\ ppc_1\ ppc_n))$. For
example, an injective $\sigma$ preserves a binary equality test f for code
addresses, but not pointer subtraction.

The main lemma (fetching simulation), which relies on\\
{\tt sigma\_policy\_specification} and is established by structural induction
over the source code, says that fetching an assembly instruction at
position ppc is equal to fetching the translation of the instruction at
position $(\sigma\ ppc)$, and that the new incremented program counter is at
the beginning of the next instruction (compactness). The only exception is
when the instruction fetched is placed at the end of code memory and is
followed only by dead code. Execution simulation is trivial because of the
restriction over well behaved programs w.r.t. sigma. The condition
$\sigma\ 0 = 0$ is necessary because the hardware model prescribes that the
first instruction to be executed will be at address 0. For the details
see~\cite{lastyear}.

Instead of verifying the algorithm directly, another solution to the problem
would be to run an optimisation algorithm, and then verify the safety of the
result using a verified validator. Such a validator would be easier to verify
than the algorithm itself and it would also be efficient, requiring only a
linear pass over the source code to test the specification. However, it is
surely also interesting to formally prove that the assembler never rejects
programs that should be accepted, i.e. that the algorithm itself is correct.
This is the topic of the current paper.

\subsection{Related work}

As far as we are aware, this is the first formal discussion of the branch
displacement optimisation algorithm.

The CompCert project is another verified compiler project.
Their backend~\cite{Leroy2009} generates assembly code for (amongst others) subsets of the
PowerPC and x86 (32bit) architectures. At the assembly code stage, there is
no distinction between the spandependent jump instructions, so a branch
displacement optimisation algorithm is not needed.

%An offshoot of the CompCert project is the CompCertTSO project, which adds
%thread concurrency and synchronisation to the CompCert compiler~\cite{TSO2011}.
%This compiler also generates assembly code and therefore does not include a
%branch displacement algorithm.

%Finally, there is also the Piton stack~\cite{Moore1996}, which not only includes the
%formal verification of a compiler, but also of the machine architecture
%targeted by that compiler, a microprocessor called the FM9001.
%However, this architecture does not have different
%jump sizes (branching is simulated by assigning values to the program counter),
%so the branch displacement problem is irrelevant.


\bibliographystyle{alpha}
\bibliography{biblio}
%\bibliographystyle{spbasic}

\end{document}
Index: Papers/jarassembler2015/old_paper.tex
===================================================================
 Papers/jarassembler2015/old_paper.tex (revision 3556)
+++ Papers/jarassembler2015/old_paper.tex (revision 3556)
@@ 0,0 +1,923 @@
+%\documentclass[twocolumn,draft]{svjour3}
+\documentclass[a4paper]{article}
+\usepackage{algpseudocode}
+%\usepackage{algorithmicx}
+\usepackage{alltt}
+\usepackage{amsfonts}
+\usepackage{amsmath}
+\usepackage[british]{babel}
+\usepackage{caption}
+\usepackage[colorlinks]{hyperref}
+\usepackage[utf8]{inputenc}
+\usepackage{listings}
+\usepackage{microtype}
+\usepackage{subcaption}
+
+\usepackage{kpfonts}
+\usepackage[T1]{fontenc}
+
+\renewcommand{\verb}{\lstinline}
+\def\lstlanguagefiles{lstgrafite.tex}
+\lstset{language=Grafite}
+
+\begin{document}
+
+\title{On the proof of correctness of a verified optimising assembler
+\thanks{Research supported by the CerCo project, within the Future and Emerging Technologies (FET) programme of the Seventh Framework Programme for Research of the European Commission, under FETOpen grant number 243881}}
+\author{Jaap Boender \and Dominic P. Mulligan \and Claudio Sacerdoti Coen}
+%\institute{Department of Computer Science, University of Middlesex \and Computer Laboratory, University of Cambridge \and Dipartimento di Informatica, University of Bologna}
+
+\maketitle
+
+\begin{abstract}
+Optimising assemblers face the `branch displacement' or `jump encoding' problem, that is: how best to choose between concrete machine code jump instructionstypically of differing instruction and offset sizeswhen expanding pseudoinstructions.
+Ideally, on space constrained hardware, an optimising assembler would choose the set of pseudoinstruction expansions that minimises the size of the resulting machine code program, a task that is provably \textsc{np}hard.
+
+As part of CerCo (`Certified Complexity')an \textsc{eu}funded project to develop a verified concrete complexity preserving compiler for a large subset of the C programming languagewe have implemented and proved correct an optimising assembler within the Matita interactive theorem prover.
+Our assembler targets the instruction set of a typical microcontroller, the Intel \textsc{mcs51} series.
+As is common with embedded systems development this microcontroller has a paucity of available code memory and therefore we face an additional pressure in reducing the size of any assembled machine code program.
+Out of necessity, then, our assembler implements an algorithm for solving the branch displacement problem, and we have proved that this algorithm is correct.
+
+However, this efficient expansion of jump pseudoinstructions into machine code equivalents is complex, and therefore could unneccessarily complicate the proof of correctness for the rest of the assembler.
+We therefore isolate the decision making over how jumps should be expanded from the expansion process itself as much as possible using `policies', making the rest of the proof of correctness more straightforward.
+Our proof strategy contains a tracking facility for `good addresses' and only programs that use good addresses have their semantics preserved under assembly, as we observe that it is impossible for an assembler to preserve the semantics of every assembly program.
+Our strategy offers increased flexibility over the traditional approach to proving the correctness of assemblers, wherein addresses in assembly are kept opaque and immutable.
+In particular, our approach permits experimentation with the benign manipulation of addresses.
+
+We discuss wider issues associated with a proof of correctness for an assembler, detail our algorithm solving the `branch displacement' problem, and discuss our proof of correctness, employing `policies', for the assembler.
+
+%\keywords{Formal verification, interactive theorem proving, assembler, branch displacement optimisation, CerCo (`Certified Complexity'), MCS51 microcontroller, Matita proof assistant}
+\end{abstract}
+
+\section{Introduction}
+
+We consider the formalisation of an assembler for the Intel MCS51 8bit microprocessor in the Matita proof assistant~\cite{asperti:user:2007}.
+This formalisation forms a major component of the EUfunded CerCo (`Certified Complexity') project~\cite{cerco:2011}, concerning the construction and formalisation of a concrete complexity preserving compiler for a large subset of the C programming language.
+Though this document describes the formalisation of an assembler for a specific microprocessor, we believe that the techniques described are general, and should be of interest to anybody providing a proof of correctness for an optimising assembler.
+
+The \textsc{mcs51}, commonly called the 8051, is an 8bit Harvard architecture \textsc{cisc} instruction set microprocessor dating from the early 1980s.
+An extended variant, the \textsc{mcs52}, or 8052, was subsequently released adding extra \textsc{ram} and \textsc{rom} above and beyond that offered by the original \textsc{mcs51}, along with an additional timer.
+Originally manufactured and marketed by Intel, the microprocessor quickly gained and maintained its popularity, especially in embedded systems development, with dozens of semiconductor foundries subsequently releasing derivative copies many of which are still manufactured today.
+
+Admittedly, a typical \textsc{mcs51} derivative lacks many features compared to its more modern brethren.
+In particular, most derivatives lack a cache, any pipelining features, branch speculation, outoforder execution, or other advanced microarchitectural features we now take for granted on stateoftheart microprocessors.
+As a direct result program execution is much more predictable: as part of the processor's specification provided by the manufacturer we can be sure that a \texttt{NOP} instruction will always take exactly one processor cycle to execute, whilst an \texttt{ADD} instruction always takes two processor cycles, and so on and so forth for all instructions in the processor's instruction set.
+
+Certainty surrounding execution times makes the \textsc{mcs51} especially attractive for certain embedded development tasks.
+Moreover, this same certainty also makes the \textsc{mcs51} especially attractive for CerCo's ends, wherein we aim to produce a verified concretecomplexity preserving compiler that lifts a cost model for a C programinduced on the program's machine code image under compilationback through the compilation chain.
+This cost model will ultimately be made manifest as complexity assertions annotating the original C source code, and these annotations may then be used by the programmerperhaps in conjunction with external reasoning toolsto derive accurate resource bounds for their program.
+As a result, knowing that a given block of machine code instructions will take a fixed number of cycles to execute, irrespective of the program context or processor state when executed, permits us to generate a cycleaccurate cost model on the original C source code and make several simplifying assumptions in the correctness proof of our compiler.
+
+Yet, such a goal introduces two complications.
+First, as the CerCo cost model is induced on machine code, the CerCo compilation chain must ultimately produce machine code, rather than assembly code as in other verified compiler projects such as CompCert~\cite{Leroy2009}.
+Ultimately, for the CerCo compilation chain to be fully verified, a verified assembler and a verified compiler must both be produced.
+
+Second, the presence of an assembler allows us to simplify the proof of correctness of the CerCo compiler proper.
+Rather than having the compiler choose particular jump, call or move machine code instructions, the compiler may produce idealised pseudoinstructions
+
+To do this, we must solve the `branch displacement' problemdeciding how best to expand pseudojumps to labels in assembly language to machine code jumps.
+This problem, also known as `jump encoding', is
+a wellknown problem amongst assembler implementors~\cite{Hyde2006} and arises when pseudojumps can be expanded
+in different ways to real machine instructions, but the different expansions
+are not equivalent (e.g. jumps that differ in the distance that they `span', or instructions that differ in the number of clock cycles needed to completely execute them) and are not always
+correct (e.g. correctness is only up to global constraints over the compiled
+code). For instance, some jump instructions (short jumps) are very small
+and can be executed quickly, but they can only reach destinations within a
+certain distance from the current instruction. When the destinations are
+too far away, larger and slower long jumps must be used. The use of a long jump may
+augment the distance between another pseudojump and its target, forcing
+another long jump use, in a cascade. The job of the optimising
+compiler (assembler) is to individually expand every pseudoinstruction in such a way
+that all global constraints are satisfied and that the compiled program
+is minimal in size and faster in concrete time complexity.
+This problem is known to be computationally hard for most CISC architectures (\textsc{np}hard, see~\cite{hyde:branch:2006}).
+
+To simplify the CerCo C compiler we have chosen to implement an optimising assembler whose input language the compiler will target.
+Labels, conditional jumps to labels, a program preamble containing global data and a \texttt{MOV} instruction for moving this global data into the MCS51's one 16bit register all feature in our assembly language.
+We further simplify by ignoring linking, assuming that all our assembly programs are prelinked.
+
+The requirements of the CerCo project add yet more complications to our proof of correctness, as we must also address a cost model.
+CerCo imposes a cost model on C programs or, more specifically, on simple blocks of instructions.
+This cost model is induced by the compilation process itself, and its noncompositional nature allows us to assign different costs to identical C statements depending on how they are compiled.
+In short, we aim to obtain a very precise costing for a program by embracing the compilation process, not ignoring it.
+At the assembler level, this is reflected by our need to induce a cost
+model on the assembly code as a function of the assembly program and the
+strategy used to solve the branch displacement problem. In particular, our
+optimising assembler should also return a map that assigns a cost (in clock
+cycles) to every instruction in the source program. We expect the induced cost
+to be preserved by the assembler: we will prove that the compiled code
+tightly simulates the source code by taking exactly the predicted amount of
+time.
+Note that the temporal `tightness' of the simulation is a fundamental prerequisite
+of the correctness of the simulation, as some functions of the MCS51timers and \textsc{i/o}depend on the microprocessor's clock.
+If the pseudo and concrete clock differ the result of an \textsc{i/o} operation may not be preserved.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+% This problem is the subject of the present paper. After introducing the problem
+% in more detail, we will discuss the solutions used by other compilers, present
+% the algorithm we use in the CerCo assembler, and discuss its verification,
+% that is the proofs of termination and correctness using the Matita proof
+% assistant~\cite{Asperti2007}.
+
+% Formulating the final statement of correctness and finding the loop invariants
+% have been nontrivial tasks and are, indeed, the main contribution of this
+% paper. It has required considerable care and finetuning to formulate not
+% only the minimal statement required for the ulterior proof of correctness of
+% the assembler, but also the minimal set of invariants needed for the proof
+% of correctness of the algorithm.
+
+% The research presented in this paper has been executed within the CerCo project
+% which aims at formally verifying a C compiler with cost annotations. The
+% target architecture for this project is the MCS51, whose instruction set
+% contains spandependent instructions. Furthermore, its maximum addressable
+% memory size is very small (64 Kb), which makes it important to generate
+% programs that are as small as possible. With this optimisation, however, comes increased complexity and hence increased possibility for error. We must make sure that the branch instructions are encoded correctly, otherwise the assembled program will behave unpredictably.
+
+% All Matita files related to this development can be found on the CerCo
+% website, \url{http://cerco.cs.unibo.it}. The specific part that contains the
+% branch displacement algorithm is in the {\tt ASM} subdirectory, in the files
+% {\tt PolicyFront.ma}, {\tt PolicyStep.ma} and {\tt Policy.ma}.
+
+\section{The branch displacement optimisation problem}
+
+In most modern instruction sets that have them, the only spandependent
+instructions are branch instructions. Taking the ubiquitous x8664 instruction
+set as an example, we find that it contains eleven different forms of the
+unconditional branch instruction, all with different ranges, instruction sizes
+and semantics (only six are valid in 64bit mode, for example). Some examples
+are shown in Figure~\ref{f:x86jumps} (see also~\cite{IntelDev}).
+
+\begin{figure}[h]
+\begin{center}
+\begin{tabular}{lll}
+\hline
+Instruction & Size (bytes) & Displacement range \\
+\hline
+Short jump & 2 & 128 to 127 bytes \\
+Relative near jump & 5 & $2^{32}$ to $2^{32}1$ bytes \\
+Absolute near jump & 6 & one segment (64bit address) \\
+Far jump & 8 & entire memory (indirect jump) \\
+\hline
+\end{tabular}
+\end{center}
+\caption{List of x86 branch instructions}
+\label{f:x86jumps}
+\end{figure}
+
+The chosen target architecture of the CerCo project is the Intel MCS51, which
+features three types of branch instructions (or jump instructions; the two terms
+are used interchangeably), as shown in Figure~\ref{f:mcs51jumps}.
+
+\begin{figure}[h]
+\begin{center}
+\begin{tabular}{llll}
+\hline
+Instruction & Size & Execution time & Displacement range \\
+ & (bytes) & (cycles) & \\
+\hline
+SJMP (`short jump') & 2 & 2 & 128 to 127 bytes \\
+AJMP (`absolute jump') & 2 & 2 & one segment (11bit address) \\
+LJMP (`long jump') & 3 & 3 & entire memory \\
+\hline
+\end{tabular}
+\end{center}
+\caption{List of MCS51 branch instructions}
+\label{f:mcs51jumps}
+\end{figure}
+
+Conditional branch instructions are only available in short form, which
+means that a conditional branch outside the short address range has to be
+encoded using three branch instructions (for instructions whose logical
+negation is available, it can be done with two branch instructions, but for
+some instructions this is not the case). The call instruction is
+only available in absolute and long forms.
+
+Note that even though the MCS51 architecture is much less advanced and much
+simpler than the x8664 architecture, the basic types of branch instruction
+remain the same: a short jump with a limited range, an intrasegment jump and a
+jump that can reach the entire available memory.
+
+Generally, in code fed to the assembler as input, the only
+difference between branch instructions is semantics, not span. This
+means that a distinction is made between an unconditional branch and the
+several kinds of conditional branch, but not between their short, absolute or
+long variants.
+
+The algorithm used by the assembler to encode these branch instructions into
+the different machine instructions is known as the {\em branch displacement
+algorithm}. The optimisation problem consists of finding as small an encoding as
+possible, thus minimising program length and execution time.
+
+Similar problems, e.g. the branch displacement optimisation problem for other
+architectures, are known to be NPcomplete~\cite{Robertson1979,Szymanski1978},
+which could make finding an optimal solution very timeconsuming.
+
+The canonical solution, as shown by Szymanski~\cite{Szymanski1978} or more
+recently by Dickson~\cite{Dickson2008} for the x86 instruction set, is to use a
+fixed point algorithm that starts with the shortest possible encoding (all
+branch instruction encoded as short jumps, which is likely not a correct
+solution) and then iterates over the source to reencode those branch
+instructions whose target is outside their range.
+
+\subsection*{Adding absolute jumps}
+
+In both papers mentioned above, the encoding of a jump is only dependent on the
+distance between the jump and its target: below a certain value a short jump
+can be used; above this value the jump must be encoded as a long jump.
+
+Here, termination of the smallest fixed point algorithm is easy to prove. All
+branch instructions start out encoded as short jumps, which means that the
+distance between any branch instruction and its target is as short as possible
+(all the intervening jumps are short).
+If, in this situation, there is a branch instruction $b$ whose span is not
+within the range for a short jump, we can be sure that we can never reach a
+situation where the span of $j$ is so small that it can be encoded as a short
+jump. This argument continues to hold throughout the subsequent iterations of
+the algorithm: short jumps can change into long jumps, but not \emph{vice versa},
+as spans only increase. Hence, the algorithm either terminates early when a fixed
+point is reached or when all short jumps have been changed into long jumps.
+
+Also, we can be certain that we have reached an optimal solution: a short jump
+is only changed into a long jump if it is absolutely necessary.
+
+However, neither of these claims (termination nor optimality) hold when we add
+the absolute jump. With absolute jumps, the encoding of a branch
+instruction no longer depends only on the distance between the branch
+instruction and its target. An absolute jump is possible when instruction and
+target are in the same segment (for the MCS51, this means that the first 5
+bytes of their addresses have to be equal). It is therefore entirely possible
+for two branch instructions with the same span to be encoded in different ways
+(absolute if the branch instruction and its target are in the same segment,
+long if this is not the case).
+
+\begin{figure}[t]
+\begin{subfigure}[b]{.45\linewidth}
+\small
+\begin{alltt}
+ jmp X
+ \ldots
+L\(\sb{0}\): \ldots
+% Start of new segment if
+% jmp X is encoded as short
+ \ldots
+ jmp L\(\sb{0}\)
+\end{alltt}
+\caption{Example of a program where a long jump becomes absolute}
+\label{f:term_example}
+\end{subfigure}
+\hfill
+\begin{subfigure}[b]{.45\linewidth}
+\small
+\begin{alltt}
+L\(\sb{0}\): jmp X
+X: \ldots
+ \ldots
+L\(\sb{1}\): \ldots
+% Start of new segment if
+% jmp X is encoded as short
+ \ldots
+ jmp L\(\sb{1}\)
+ \ldots
+ jmp L\(\sb{1}\)
+ \ldots
+ jmp L\(\sb{1}\)
+ \ldots
+\end{alltt}
+\caption{Example of a program where the fixedpoint algorithm is not optimal}
+\label{f:opt_example}
+\end{subfigure}
+\end{figure}
+
+This invalidates our earlier termination argument: a branch instruction, once encoded
+as a long jump, can be reencoded during a later iteration as an absolute jump.
+Consider the program shown in Figure~\ref{f:term_example}. At the start of the
+first iteration, both the branch to {\tt X} and the branch to $\mathtt{L}_{0}$
+are encoded as small jumps. Let us assume that in this case, the placement of
+$\mathtt{L}_{0}$ and the branch to it are such that $\mathtt{L}_{0}$ is just
+outside the segment that contains this branch. Let us also assume that the
+distance between $\mathtt{L}_{0}$ and the branch to it is too large for the
+branch instruction to be encoded as a short jump.
+
+All this means that in the second iteration, the branch to $\mathtt{L}_{0}$ will
+be encoded as a long jump. If we assume that the branch to {\tt X} is encoded as
+a long jump as well, the size of the branch instruction will increase and
+$\mathtt{L}_{0}$ will be `propelled' into the same segment as its branch
+instruction, because every subsequent instruction will move one byte forward.
+Hence, in the third iteration, the branch to $\mathtt{L}_{0}$ can be encoded as
+an absolute jump. At first glance, there is nothing that prevents us from
+constructing a configuration where two branch instructions interact in such a
+way as to iterate indefinitely between long and absolute encodings.
+
+This situation mirrors the explanation by Szymanski~\cite{Szymanski1978} of why
+the branch displacement optimisation problem is NPcomplete. In this explanation,
+a condition for NPcompleteness is the fact that programs be allowed to contain
+{\em pathological} jumps. These are branch instructions that can normally not be
+encoded as a short(er) jump, but gain this property when some other branch
+instructions are encoded as a long(er) jump. This is exactly what happens in
+Figure~\ref{f:term_example}. By encoding the first branch instruction as a long
+jump, another branch instruction switches from long to absolute (which is
+shorter).
+
+In addition, our previous optimality argument no longer holds. Consider the
+program shown in Figure~\ref{f:opt_example}. Suppose that the distance between
+$\mathtt{L}_{0}$ and $\mathtt{L}_{1}$ is such that if {\tt jmp X} is encoded
+as a short jump, there is a segment border just after $\mathtt{L}_{1}$. Let
+us also assume that all three branches to $\mathtt{L}_{1}$ are in the same
+segment, but far enough away from $\mathtt{L}_{1}$ that they cannot be encoded
+as short jumps.
+
+Then, if {\tt jmp X} were to be encoded as a short jump, which is clearly
+possible, all of the branches to $\mathtt{L}_{1}$ would have to be encoded as
+long jumps. However, if {\tt jmp X} were to be encoded as a long jump, and
+therefore increase in size, $\mathtt{L}_{1}$ would be `propelled' across the
+segment border, so that the three branches to $\mathtt{L}_{1}$ could be encoded
+as absolute jumps. Depending on the relative sizes of long and absolute jumps,
+this solution might actually be smaller than the one reached by the smallest
+fixed point algorithm.
+
+\section{Our algorithm}
+
+\subsection{Design decisions}
+
+Given the NPcompleteness of the problem, finding optimal solutions
+(using, for example, a constraint solver) can potentially be very costly.
+
+The SDCC compiler~\cite{SDCC2011}, which has a backend targeting the MCS51
+instruction set, simply encodes every branch instruction as a long jump
+without taking the distance into account. While certainly correct (the long
+jump can reach any destination in memory) and a very fast solution to compute,
+it results in a less than optimal solution in terms of output size and
+execution time.
+
+On the other hand, the {\tt gcc} compiler suite, while compiling
+C on the x86 architecture, uses a greatest fix point algorithm. In other words,
+it starts with all branch instructions encoded as the largest jumps
+available, and then tries to reduce the size of branch instructions as much as
+possible.
+
+Such an algorithm has the advantage that any intermediate result it returns
+is correct: the solution where every branch instruction is encoded as a large
+jump is always possible, and the algorithm only reduces those branch
+instructions whose destination address is in range for a shorter jump.
+The algorithm can thus be stopped after a determined number of steps without
+sacrificing correctness.
+
+The result, however, is not necessarily optimal. Even if the algorithm is run
+until it terminates naturally, the fixed point reached is the {\em greatest}
+fixed point, not the least fixed point. Furthermore, {\tt gcc} (at least for
+the x86 architecture) only uses short and long jumps. This makes the algorithm
+more efficient, as shown in the previous section, but also results in a less
+optimal solution.
+
+In the CerCo assembler, we opted at first for a least fixed point algorithm,
+taking absolute jumps into account.
+
+Here, we ran into a problem with proving termination, as explained in the
+previous section: if we only take short and long jumps into account, the jump
+encoding can only switch from short to long, but never in the other direction.
+When we add absolute jumps, however, it is theoretically possible for a branch
+instruction to switch from absolute to long and back, as previously explained.
+Proving termination then becomes difficult, because there is nothing that
+precludes a branch instruction from oscillating back and forth between absolute
+and long jumps indefinitely.
+
+To keep the algorithm in the same complexity class and more easily
+prove termination, we decided to explicitly enforce the `branch instructions
+must always grow longer' requirement: if a branch instruction is encoded as a
+long jump in one iteration, it will also be encoded as a long jump in all the
+following iterations. Therefore the encoding of any branch instruction
+can change at most two times: once from short to absolute (or long), and once
+from absolute to long.
+
+There is one complicating factor. Suppose that a branch instruction is encoded
+in step $n$ as an absolute jump, but in step $n+1$ it is determined that
+(because of changes elsewhere) it can now be encoded as a short jump. Due to
+the requirement that the branch instructions must always grow longer,
+the branch encoding will be encoded as an absolute jump in step
+$n+1$ as well.
+
+This is not necessarily correct. A branch instruction that can be
+encoded as a short jump cannot always also be encoded as an absolute jump, as a
+short jump can bridge segments, whereas an absolute jump cannot. Therefore,
+in this situation we have decided to encode the branch instruction as a long
+jump, which is always correct.
+
+The resulting algorithm, therefore, will not return the least fixed point, as
+it might have too many long jumps. However, it is still better than the
+algorithms from SDCC and {\tt gcc}, since even in the worst case, it will still
+return a smaller or equal solution.
+
+Experimenting with our algorithm on the test suite of C programs included with
+gcc 2.3.3 has shown that on average, about 25 percent of jumps are encoded as short or absolute jumps by the algorithm. As not all instructions are jumps, this does not make for a large reduction in size, but it can make for a reduction in execution time: if jumps
+are executed multiple times, for example in loops, the fact that short jumps take less cycles to execute than long jumps can have great effect.
+
+As for complexity, there are at most $2n$ iterations, where $n$ is the number of
+branch instructions. Practical tests within the CerCo project on small to
+medium pieces of code have shown that in almost all cases, a fixed point is
+reached in 3 passes. Only in one case did the algorithm need 4. This is not surprising: after all, the difference between short/absolute and
+long jumps is only one byte (three for conditional jumps). For a change from
+short/absolute to long to have an effect on other jumps is therefore relatively
+uncommon, which explains why a fixed point is reached so quickly.
+
+\subsection{The algorithm in detail}
+
+The branch displacement algorithm forms part of the translation from
+pseudocode to assembler. More specifically, it is used by the function that
+translates pseudoaddresses (natural numbers indicating the position of the
+instruction in the program) to actual addresses in memory. Note that in pseudocode, all instructions are of size 1.
+
+Our original intention was to have two different functions, one function
+$\mathtt{policy}: \mathbb{N} \rightarrow \{\mathtt{short\_jump},
+\mathtt{absolute\_jump}, \mathtt{long\_jump}\}$ to associate jumps to their
+intended encoding, and a function $\sigma: \mathbb{N} \rightarrow
+\mathtt{Word}$ to associate pseudoaddresses to machine addresses. $\sigma$
+would use $\mathtt{policy}$ to determine the size of jump instructions. This turned out to be suboptimal from the algorithmic point of view and
+impossible to prove correct.
+
+From the algorithmic point of view, in order to create the $\mathtt{policy}$
+function, we must necessarily have a translation from pseudoaddresses
+to machine addresses (i.e. a $\sigma$ function): in order to judge the distance
+between a jump and its destination, we must know their memory locations.
+Conversely, in order to create the $\sigma$ function, we need to have the
+$\mathtt{policy}$ function, otherwise we do not know the sizes of the jump
+instructions in the program.
+
+Much the same problem appears when we try to prove the algorithm correct: the
+correctness of $\mathtt{policy}$ depends on the correctness of $\sigma$, and
+the correctness of $\sigma$ depends on the correctness of $\mathtt{policy}$.
+
+We solved this problem by integrating the $\mathtt{policy}$ and $\sigma$
+algorithms. We now have a function
+$\sigma: \mathbb{N} \rightarrow \mathtt{Word} \times \mathtt{bool}$ which
+associates a pseudoaddress to a machine address. The boolean denotes a forced
+long jump; as noted in the previous section, if during the fixed point
+computation an absolute jump changes to be potentially reencoded as a short
+jump, the result is actually a long jump. It might therefore be the case that
+jumps are encoded as long jumps without this actually being necessary, and this
+information needs to be passed to the code generating function.
+
+The assembler function encodes the jumps by checking the distance between
+source and destination according to $\sigma$, so it could select an absolute
+jump in a situation where there should be a long jump. The boolean is there
+to prevent this from happening by indicating the locations where a long jump
+should be encoded, even if a shorter jump is possible. This has no effect on
+correctness, since a long jump is applicable in any situation.
+
+\begin{figure}[t]
+\small
+\begin{algorithmic}
+\Function{f}{$labels$,$old\_sigma$,$instr$,$ppc$,$acc$}
+ \State $\langle added, pc, sigma \rangle \gets acc$
+ \If {$instr$ is a backward jump to $j$}
+ \State $length \gets \mathrm{jump\_size}(pc,sigma_1(labels(j)))$
+ \Comment compute jump distance
+ \ElsIf {$instr$ is a forward jump to $j$}
+ \State $length \gets \mathrm{jump\_size}(pc,old\_sigma_1(labels(j))+added)$
+ \EndIf
+ \State $old\_length \gets \mathrm{old\_sigma_1}(ppc)$
+ \State $new\_length \gets \mathrm{max}(old\_length, length)$
+ \Comment length must never decrease
+ \State $old\_size \gets \mathrm{old\_sigma_2}(ppc)$
+ \State $new\_size \gets \mathrm{instruction\_size}(instr,new\_length)$
+ \Comment compute size in bytes
+ \State $new\_added \gets added+(new\_sizeold\_size)$
+ \Comment keep track of total added bytes
+ \State $new\_sigma \gets old\_sigma$
+ \State $new\_sigma_1(ppc+1) \gets pc+new\_size$
+ \State $new\_sigma_2(ppc) \gets new\_length$
+ \Comment update $\sigma$ \\
+ \Return $\langle new\_added, pc+new\_size, new\_sigma \rangle$
+\EndFunction
+\end{algorithmic}
+\caption{The heart of the algorithm}
+\label{f:jump_expansion_step}
+\end{figure}
+
+The algorithm, shown in Figure~\ref{f:jump_expansion_step}, works by folding the
+function {\sc f} over the entire program, thus gradually constructing $sigma$.
+This constitutes one step in the fixed point calculation; successive steps
+repeat the fold until a fixed point is reached. We have abstracted away the case where an instruction is not a jump, since the size of these instructions is constant.
+
+Parameters of the function {\sc f} are:
+\begin{itemize}
+ \item a function $labels$ that associates a label to its pseudoaddress;
+ \item $old\_sigma$, the $\sigma$ function returned by the previous
+ iteration of the fixed point calculation;
+ \item $instr$, the instruction currently under consideration;
+ \item $ppc$, the pseudoaddress of $instr$;
+ \item $acc$, the fold accumulator, which contains $added$ (the number of
+ bytes added to the program size with respect to the previous iteration), $pc$
+ (the highest memory address reached so far), and of course $sigma$, the
+ $\sigma$ function under construction.
+\end{itemize}
+The first two are parameters that remain the same through one iteration, the
+final three are standard parameters for a fold function (including $ppc$,
+which is simply the number of instructions of the program already processed).
+
+The $\sigma$ functions used by {\sc f} are not of the same type as the final
+$\sigma$ function: they are of type
+$\sigma: \mathbb{N} \rightarrow \mathbb{N} \times \{\mathtt{short\_jump},
+\mathtt{absolute\_jump},\mathtt{long\_jump}\}$; a function that associates a
+pseudoaddress with a memory address and a jump length. We do this to
+ease the comparison of jump lengths between iterations. In the algorithm,
+we use the notation $sigma_1(x)$ to denote the memory address corresponding to
+$x$, and $sigma_2(x)$ for the jump length corresponding to $x$.
+
+Note that the $\sigma$ function used for label lookup varies depending on
+whether the label is behind our current position or ahead of it. For
+backward branches, where the label is behind our current position, we can use
+$sigma$ for lookup, since its memory address is already known. However, for
+forward branches, the memory address of the address of the label is not yet
+known, so we must use $old\_sigma$.
+
+We cannot use $old\_sigma$ without change: it might be the case that we have
+already increased the size of some branch instructions before, making the
+program longer and moving every instruction forward. We must compensate for this
+by adding the size increase of the program to the label's memory address
+according to $old\_sigma$, so that branch instruction spans do not get
+compromised.
+
+%Note also that we add the pc to $sigma$ at location $ppc+1$, whereas we add the
+%jump length at location $ppc$. We do this so that $sigma(ppc)$ will always
+%return a pair with the start address of the instruction at $ppc$ and the
+%length of its branch instruction (if any); the end address of the program can
+%be found at $sigma(n+1)$, where $n$ is the number of instructions in the
+%program.
+
+\section{The proof}
+
+In this section, we present the correctness proof for the algorithm in more
+detail. The main correctness statement is shown, slightly simplified, in~Figure~\ref{statement}.
+%
+\begin{figure}[t]
+\small
+\begin{alignat*}{6}
+\mathtt{sigma}&\omit\rlap{$\mathtt{\_policy\_specification} \equiv
+\lambda program.\lambda sigma.$} \notag\\
+ & \omit\rlap{$sigma\ 0 = 0\ \wedge$} \notag\\
+ & \mathbf{let}\ & & \omit\rlap{$instr\_list \equiv code\ program\ \mathbf{in}$} \notag\\
+ &&& \omit\rlap{$\forall ppc.ppc < instr\_list \rightarrow$} \notag\\
+ &&& \mathbf{let}\ && pc \equiv sigma\ ppc\ \mathbf{in} \notag\\
+ &&& \mathbf{let}\ && instruction \equiv \mathtt{fetch\_pseudo\_instruction}\ instr\_list\ ppc\ \mathbf{in} \notag\\
+ &&& \mathbf{let}\ && next\_pc \equiv sigma\ (ppc+1)\ \mathbf{in}\notag\\
+ &&&&& next\_pc = pc + \mathtt{instruction\_size}\ sigma\ ppc\ instruction\ \wedge\notag\\
+ &&&&& (pc + \mathtt{instruction\_size}\ sigma\ ppc\ instruction < 2^{16}\ \vee\notag\\
+ &&&&& (\forall ppc'.ppc' < instr\_list \rightarrow ppc < ppc' \rightarrow \notag\\
+ &&&&& \mathbf{let}\ instruction' \equiv \mathtt{fetch\_pseudo\_instruction}\ instr\_list\ ppc'\ \mathbf{in} \notag\\
+ &&&&&\ \mathtt{instruction\_size}\ sigma\ ppc'\ instruction' = 0)\ \wedge \notag\\
+ &&&&& pc + \mathtt{instruction\_size}\ sigma\ ppc\ instruction = 2^{16})
+\end{alignat*}
+\caption{Main correctness statement\label{statement}}
+\label{sigmapolspec}
+\end{figure}
+%
+Informally, this means that when fetching a pseudoinstruction at $ppc$, the
+translation by $\sigma$ of $ppc+1$ is the same as $\sigma(ppc)$ plus the size
+of the instruction at $ppc$. That is, an instruction is placed consecutively
+after the previous one, and there are no overlaps. The rest of the statement deals with memory size: either the next instruction fits within memory ($next\_pc < 2^{16}$) or it ends exactly at the limit memory,
+in which case it must be the last translated instruction in the program (enforced by specfiying that the size of all subsequent instructions is 0: there may be comments or cost annotations that are not translated).
+
+Finally, we enforce that the program starts at address 0, i.e. $\sigma(0) = 0$. It may seem strange that we do not explicitly include a safety property stating that every jump instruction is of the right type with respect to its target (akin to the lemma from Figure~\ref{sigmasafe}), but this is not necessary. The distance is recalculated according to the instruction addresses from $\sigma$, which implicitly expresses safety.
+
+Since our computation is a least fixed point computation, we must prove
+termination in order to prove correctness: if the algorithm is halted after
+a number of steps without reaching a fixed point, the solution is not
+guaranteed to be correct. More specifically, branch instructions might be
+encoded which do not coincide with the span between their location and their
+destination.
+
+Proof of termination rests on the fact that the encoding of branch
+instructions can only grow larger, which means that we must reach a fixed point
+after at most $2n$ iterations, with $n$ the number of branch instructions in
+the program. This worst case is reached if at every iteration, we change the
+encoding of exactly one branch instruction; since the encoding of any branch
+instruction can change first from short to absolute, and then to long, there
+can be at most $2n$ changes.
+
+%The proof has been carried out using the ``Russell'' style from~\cite{Sozeau2006}.
+%We have proven some invariants of the {\sc f} function from the previous
+%section; these invariants are then used to prove properties that hold for every
+%iteration of the fixed point computation; and finally, we can prove some
+%properties of the fixed point.
+
+\subsection{Fold invariants}
+
+In this section, we present the invariants that hold during the fold of {\sc f}
+over the program. These will be used later on to prove the properties of the
+iteration. During the fixed point computation, the $\sigma$ function is
+implemented as a trie for ease of access; computing $\sigma(x)$ is achieved by
+looking up the value of $x$ in the trie. Actually, during the fold, the value
+we pass along is a pair $\mathbb{N} \times \mathtt{ppc\_pc\_map}$. The first
+component is the number of bytes added to the program so far with respect to
+the previous iteration, and the second component, {\tt ppc\_pc\_map}, is the
+actual $\sigma$ trie (which we'll call $strie$ to avoid confusion).
+%
+{\small
+\begin{alignat*}{2}
+\mathtt{out} & \mathtt{\_of\_program\_none} \equiv \lambda prefix.\lambda strie. \notag\\
+& \forall i.i < 2^{16} \rightarrow (i > prefix \leftrightarrow
+ \mathtt{lookup\_opt}\ i\ (\mathtt{snd}\ strie) = \mathtt{None})
+\end{alignat*}}
+%
+The first invariant states that any pseudoaddress not yet examined is not
+present in the lookup trie.
+%
+{\small
+\begin{alignat*}{2}
+\mathtt{not} & \mathtt{\_jump\_default} \equiv \lambda prefix.\lambda strie.\forall i.i < prefix \rightarrow\notag\\
+& \neg\mathtt{is\_jump}\ (\mathtt{nth}\ i\ prefix) \rightarrow \mathtt{lookup}\ i\ (\mathtt{snd}\ strie) = \mathtt{short\_jump}
+\end{alignat*}}
+%
+This invariant states that when we try to look up the jump length of a
+pseudoaddress where there is no branch instruction, we will get the default
+value, a short jump.
+%
+{\small
+\begin{alignat*}{4}
+\mathtt{jump} & \mathtt{\_increase} \equiv \lambda pc.\lambda op.\lambda p.\forall i.i < prefix \rightarrow \notag\\
+& \mathbf{let}\ oj \equiv \mathtt{lookup}\ i\ (\mathtt{snd}\ op)\ \mathbf{in} \notag\\
+& \mathbf{let}\ j \equiv \mathtt{lookup}\ i\ (\mathtt{snd}\ p)\ \mathbf{in}\ \mathtt{jmpleq}\ oj\ j
+\end{alignat*}}
+%
+This invariant states that between iterations (with $op$ being the previous
+iteration, and $p$ the current one), jump lengths either remain equal or
+increase. It is needed for proving termination.
+%
+\begin{figure}[h]
+\small
+\begin{alignat*}{6}
+\mathtt{sigma} & \omit\rlap{$\mathtt{\_compact\_unsafe} \equiv \lambda prefix.\lambda strie.\forall n.n < prefix \rightarrow$}\notag\\
+& \mathbf{match}\ && \omit\rlap{$\mathtt{lookup\_opt}\ n\ (\mathtt{snd}\ strie)\ \mathbf{with}$}\notag\\
+&&& \omit\rlap{$\mathtt{None} \Rightarrow \mathrm{False}$} \notag\\
+&&& \omit\rlap{$\mathtt{Some}\ \langle pc, j \rangle \Rightarrow$} \notag\\
+&&& \mathbf{match}\ && \mathtt{lookup\_opt}\ (n+1)\ (\mathtt{snd}\ strie)\ \mathbf{with}\notag\\
+&&&&& \mathtt{None} \Rightarrow \mathrm{False} \notag\\
+&&&&& \mathtt{Some}\ \langle pc_1, j_1 \rangle \Rightarrow
+ pc_1 = pc + \notag\\
+&&&&& \ \ \mathtt{instruction\_size\_jmplen}\ j\ (\mathtt{nth}\ n\ prefix)
+\end{alignat*}
+\caption{Temporary safety property}
+\label{sigmacompactunsafe}
+\end{figure}
+%
+We now proceed with the safety lemmas. The lemma in
+Figure~\ref{sigmacompactunsafe} is a temporary formulation of the main
+property {\tt sigma\_policy\_specification}. Its main difference from the
+final version is that it uses {\tt instruction\_size\_jmplen} to compute the
+instruction size. This function uses $j$ to compute the span of branch
+instructions (i.e. it uses the $\sigma$ under construction), instead
+of looking at the distance between source and destination. This is because
+$\sigma$ is still under construction; we will prove below that after the
+final iteration, {\tt sigma\_compact\_unsafe} is equivalent to the main
+property in Figure~\ref{sigmasafe} which holds at the end of the computation.
+%
+\begin{figure}[h]
+\small
+\begin{alignat*}{6}
+\mathtt{sigma} & \omit\rlap{$\mathtt{\_safe} \equiv \lambda prefix.\lambda labels.\lambda old\_strie.\lambda strie.\forall i.i < prefix \rightarrow$} \notag\\
+& \omit\rlap{$\forall dest\_label.\mathtt{is\_jump\_to\ (\mathtt{nth}\ i\ prefix})\ dest\_label \rightarrow$} \notag\\
+& \mathbf{let} && \omit\rlap{$\ paddr \equiv \mathtt{lookup}\ labels\ dest\_label\ \mathbf{in}$} \notag\\
+& \mathbf{let} && \omit\rlap{$\ \langle j, src, dest \rangle \equiv \mathbf{if} \ paddr\ \leq\ i\ \mathbf{then}$}\notag\\
+&&&&& \mathbf{let}\ \langle \_, j \rangle \equiv \mathtt{lookup}\ i\ (\mathtt{snd}\ strie)\ \mathbf{in} \notag\\
+&&&&& \mathbf{let}\ \langle pc\_plus\_jl, \_ \rangle \equiv \mathtt{lookup}\ (i+1)\ (\mathtt{snd}\ strie)\ \mathbf{in}\notag\\
+&&&&& \mathbf{let}\ \langle addr, \_ \rangle \equiv \mathtt{lookup}\ paddr\ (\mathtt{snd}\ strie)\ \mathbf{in}\notag\\
+&&&&& \langle j, pc\_plus\_jl, addr \rangle\notag\\
+&&&\mathbf{else} \notag\\
+&&&&&\mathbf{let}\ \langle \_, j \rangle \equiv \mathtt{lookup}\ i\ (\mathtt{snd}\ strie)\ \mathbf{in} \notag\\
+&&&&&\mathbf{let}\ \langle pc\_plus\_jl, \_ \rangle \equiv \mathtt{lookup}\ (i+1)\ (\mathtt{snd}\ old\_strie)\ \mathbf{in}\notag\\
+&&&&&\mathbf{let}\ \langle addr, \_ \rangle \equiv \mathtt{lookup}\ paddr\ (\mathtt{snd}\ old\_strie)\ \mathbf{in}\notag\\
+&&&&&\langle j, pc\_plus\_jl, addr \rangle \mathbf{in}\ \notag\\
+&&&\mathbf{match} && \ j\ \mathbf{with} \notag\\
+&&&&&\mathrm{short\_jump} \Rightarrow \mathtt{short\_jump\_valid}\ src\ dest\notag\\
+&&&&&\mathrm{absolute\_jump} \Rightarrow \mathtt{absolute\_jump\_valid}\ src\ dest\notag\\
+&&&&&\mathrm{long\_jump} \Rightarrow \mathrm{True}
+\end{alignat*}
+\caption{Safety property}
+\label{sigmasafe}
+\end{figure}
+%
+We compute the distance using the memory address of the instruction
+plus its size. This follows the behaviour of the MCS51 microprocessor, which
+increases the program counter directly after fetching, and only then executes
+the branch instruction (by changing the program counter again).
+
+There are also some simple, properties to make sure that our policy
+remains consistent, and to keep track of whether the fixed point has been
+reached. We do not include them here in detail. Two of these properties give the values of $\sigma$ for the start and end of the program; $\sigma(0) = 0$ and $\sigma(n)$, where $n$ is the number of instructions up until now, is equal to the maximum memory address so far. There are also two properties that deal with what happens when the previous
+iteration does not change with respect to the current one. $added$ is a
+variable that keeps track of the number of bytes we have added to the program
+size by changing the encoding of branch instructions. If $added$ is 0, the program
+has not changed and vice versa.
+
+%{\small
+%\begin{align*}
+%& \mathtt{lookup}\ 0\ (\mathtt{snd}\ strie) = 0 \notag\\
+%& \mathtt{lookup}\ prefix\ (\mathtt{snd}\ strie) = \mathtt{fst}\ strie
+%\end{align*}}
+
+
+%{\small
+%\begin{align*}
+%& added = 0\ \rightarrow\ \mathtt{policy\_pc\_equal}\ prefix\ old\_strie\ strie \notag\\
+%& \mathtt{policy\_jump\_equal}\ prefix\ old\_strie\ strie\ \rightarrow\ added = 0
+%\end{align*}}
+
+We need to use two different formulations, because the fact that $added$ is 0
+does not guarantee that no branch instructions have changed. For instance,
+it is possible that we have replaced a short jump with an absolute jump, which
+does not change the size of the branch instruction. Therefore {\tt policy\_pc\_equal} states that $old\_sigma_1(x) = sigma_1(x)$, whereas {\tt policy\_jump\_equal} states that $old\_sigma_2(x) = sigma_2(x)$. This formulation is sufficient to prove termination and compactness.
+
+Proving these invariants is simple, usually by induction on the prefix length.
+
+\subsection{Iteration invariants}
+
+These are invariants that hold after the completion of an iteration. The main
+difference between these invariants and the fold invariants is that after the
+completion of the fold, we check whether the program size does not supersede
+64 Kb, the maximum memory size the MCS51 can address. The type of an iteration therefore becomes an option type: {\tt None} in case
+the program becomes larger than 64 Kb, or $\mathtt{Some}\ \sigma$
+otherwise. We also no longer pass along the number of bytes added to the
+program size, but a boolean that indicates whether we have changed something
+during the iteration or not.
+
+If the iteration returns {\tt None}, which means that it has become too large for memory, there is an invariant that states that the previous iteration cannot
+have every branch instruction encoded as a long jump. This is needed later in the proof of termination. If the iteration returns $\mathtt{Some}\ \sigma$, the fold invariants are retained without change.
+
+Instead of using {\tt sigma\_compact\_unsafe}, we can now use the proper
+invariant:
+%
+{\small
+\begin{alignat*}{6}
+\mathtt{sigma} & \omit\rlap{$\mathtt{\_compact} \equiv \lambda program.\lambda sigma.$} \notag\\
+& \omit\rlap{$\forall n.n < program\ \rightarrow$} \notag\\
+& \mathbf{match}\ && \omit\rlap{$\mathtt{lookup\_opt}\ n\ (\mathtt{snd}\ sigma)\ \mathbf{with}$}\notag\\
+&&& \omit\rlap{$\mathrm{None}\ \Rightarrow\ \mathrm{False}$}\notag\\
+&&& \omit\rlap{$\mathrm{Some}\ \langle pc, j \rangle \Rightarrow$}\notag\\
+&&& \mathbf{match}\ && \mathtt{lookup\_opt}\ (n+1)\ (\mathtt{snd}\ sigma)\ \mathbf{with}\notag\\
+&&&&& \mathrm{None}\ \Rightarrow\ \mathrm{False}\notag\\
+&&&&& \mathrm{Some} \langle pc1, j1 \rangle \Rightarrow\notag\\
+&&&&& \ \ pc1 = pc + \mathtt{instruction\_size}\ n\ (\mathtt{nth}\ n\ program)
+\end{alignat*}}
+%
+This is almost the same invariant as ${\tt sigma\_compact\_unsafe}$, but differs in that it
+computes the sizes of branch instructions by looking at the distance between
+position and destination using $\sigma$. In actual use, the invariant is qualified: $\sigma$ is compact if there have
+been no changes (i.e. the boolean passed along is {\tt true}). This is to
+reflect the fact that we are doing a least fixed point computation: the result
+is only correct when we have reached the fixed point.
+
+There is another, trivial, invariant in case the iteration returns
+$\mathtt{Some}\ \sigma$: it must hold that $\mathtt{fst}\ sigma < 2^{16}$.
+We need this invariant to make sure that addresses do not overflow.
+
+The proof of {\tt nec\_plus\_ultra} goes as follows: if we return {\tt None},
+then the program size must be greater than 64 Kb. However, since the
+previous iteration did not return {\tt None} (because otherwise we would
+terminate immediately), the program size in the previous iteration must have
+been smaller than 64 Kb.
+
+Suppose that all the branch instructions in the previous iteration are
+encoded as long jumps. This means that all branch instructions in this
+iteration are long jumps as well, and therefore that both iterations are equal
+in the encoding of their branch instructions. Per the invariant, this means that
+$added = 0$, and therefore that all addresses in both iterations are equal.
+But if all addresses are equal, the program sizes must be equal too, which
+means that the program size in the current iteration must be smaller than
+64 Kb. This contradicts the earlier hypothesis, hence not all branch
+instructions in the previous iteration are encoded as long jumps.
+
+The proof of {\tt sigma\_compact} follows from {\tt sigma\_compact\_unsafe} and
+the fact that we have reached a fixed point, i.e. the previous iteration and
+the current iteration are the same. This means that the results of
+{\tt instruction\_size\_jmplen} and {\tt instruction\_size} are the same.
+
+\subsection{Final properties}
+
+These are the invariants that hold after $2n$ iterations, where $n$ is the
+program size (we use the program size for convenience; we could also use the
+number of branch instructions, but this is more complex). Here, we only
+need {\tt out\_of\_program\_none}, {\tt sigma\_compact} and the fact that
+$\sigma(0) = 0$.
+
+Termination can now be proved using the fact that there is a $k \leq 2n$, with
+$n$ the length of the program, such that iteration $k$ is equal to iteration
+$k+1$. There are two possibilities: either there is a $k < 2n$ such that this
+property holds, or every iteration up to $2n$ is different. In the latter case,
+since the only changes between the iterations can be from shorter jumps to
+longer jumps, in iteration $2n$ every branch instruction must be encoded as
+a long jump. In this case, iteration $2n$ is equal to iteration $2n+1$ and the
+fixed point is reached.
+
+\section{Conclusion}
+
+In the previous sections we have discussed the branch displacement optimisation
+problem, presented an optimised solution, and discussed the proof of
+termination and correctness for this algorithm, as formalised in Matita.
+
+The algorithm we have presented is fast and correct, but not optimal; a true
+optimal solution would need techniques like constraint solvers. While outside
+the scope of the present research, it would be interesting to see if enough
+heuristics could be found to make such a solution practical for implementing
+in an existing compiler; this would be especially useful for embedded systems,
+where it is important to have as small a solution as possible.
+
+In itself the algorithm is already useful, as it results in a smaller solution
+than the simple `every branch instruction is long' used up until nowand with
+only 64 Kb of memory, every byte counts. It also results in a smaller solution
+than the greatest fixed point algorithm that {\tt gcc} uses. It does this
+without sacrificing speed or correctness.
+
+The certification of an assembler that relies on the branch displacement
+algorithm described in this paper was presented in~\cite{lastyear}.
+The assembler computes the $\sigma$ map as described in this paper and
+then works in two passes. In the first pass it builds a map
+from instruction labels to addresses in the assembly code. In the
+second pass it iterates over the code, translating every pseudo jump
+at address $src$ to a label l associated to the assembly instruction at
+address $dst$ to a jump of the size dictated by $(\sigma\ src)$ to
+$(\sigma\ dst)$. In case of conditional jumps, the translated jump may be
+implemented with a series of instructions.
+
+The proof of correctness abstracts over the algorithm used and only relies on
+{\tt sigma\_policy\_specification} (page~\ref{sigmapolspec}). It is a variation
+of a standard 1tomany forward simulation proof~\cite{Leroy2009}. The
+relation R between states just maps every code address $ppc$ stored in
+registers or memory to $(\sigma\ ppc)$. To identify the code addresses,
+an additional data structure is always kept together with the source
+state and is updated by the semantics. The semantics is preserved
+only for those programs whose source code operations
+$(f\ ppc_1\ \ldots\ ppc_n)$ applied to code addresses $ppc_1 \ldots ppc_n$ are
+such that $(f\ (\sigma\ ppc_1)\ ...\ (\sigma\ ppc_n) = f\ ppc_1\ ppc_n))$. For
+example, an injective $\sigma$ preserves a binary equality test f for code
+addresses, but not pointer subtraction.
+
+The main lemma (fetching simulation), which relies on\\
+{\tt sigma\_policy\_specification} and is established by structural induction
+over the source code, says that fetching an assembly instruction at
+position ppc is equal to fetching the translation of the instruction at
+position $(\sigma\ ppc)$, and that the new incremented program counter is at
+the beginning of the next instruction (compactness). The only exception is
+when the instruction fetched is placed at the end of code memory and is
+followed only by dead code. Execution simulation is trivial because of the
+restriction over well behaved programs w.r.t. sigma. The condition
+$\sigma\ 0 = 0$ is necessary because the hardware model prescribes that the
+first instruction to be executed will be at address 0. For the details
+see~\cite{lastyear}.
+
+Instead of verifying the algorithm directly, another solution to the problem
+would be to run an optimisation algorithm, and then verify the safety of the
+result using a verified validator. Such a validator would be easier to verify
+than the algorithm itself and it would also be efficient, requiring only a
+linear pass over the source code to test the specification. However, it is
+surely also interesting to formally prove that the assembler never rejects
+programs that should be accepted, i.e. that the algorithm itself is correct.
+This is the topic of the current paper.
+
+\subsection{Related work}
+
+As far as we are aware, this is the first formal discussion of the branch
+displacement optimisation algorithm.
+
+The CompCert project is another verified compiler project.
+Their backend~\cite{Leroy2009} generates assembly code for (amongst others) subsets of the
+PowerPC and x86 (32bit) architectures. At the assembly code stage, there is
+no distinction between the spandependent jump instructions, so a branch
+displacement optimisation algorithm is not needed.
+
+%An offshoot of the CompCert project is the CompCertTSO project, which adds
+%thread concurrency and synchronisation to the CompCert compiler~\cite{TSO2011}.
+%This compiler also generates assembly code and therefore does not include a
+%branch displacement algorithm.
+
+%Finally, there is also the Piton stack~\cite{Moore1996}, which not only includes the
+%formal verification of a compiler, but also of the machine architecture
+%targeted by that compiler, a microprocessor called the FM9001.
+%However, this architecture does not have different
+%jump sizes (branching is simulated by assigning values to the program counter),
+%so the branch displacement problem is irrelevant.
+
+
+\bibliographystyle{alpha}
+\bibliography{biblio}
+%\bibliographystyle{spbasic}
+
+\end{document}