\documentclass[11pt,epsf,a4wide]{article}
\usepackage[mathletters]{ucs}
\usepackage[utf8x]{inputenc}
\usepackage{listings}
\usepackage{../../style/cerco}
\newcommand{\ocaml}{OCaml}
\newcommand{\clight}{Clight}
\newcommand{\matita}{Matita}
\newcommand{\sdcc}{\texttt{sdcc}}
\newcommand{\textSigma}{\ensuremath{\Sigma}}
% LaTeX Companion, p 74
\newcommand{\todo}[1]{\marginpar{\raggedright - #1}}
\lstdefinelanguage{coq}
{keywords={Definition,Lemma,Theorem,Remark,Qed,Save,Inductive,Record},
morekeywords={[2]if,then,else},
}
\lstdefinelanguage{matita}
{keywords={definition,lemma,theorem,remark,inductive,record,qed,let,rec,match,with,Type,and,on},
morekeywords={[2]whd,normalize,elim,cases,destruct},
mathescape=true,
morecomment=[n]{(*}{*)},
}
\lstset{language=matita,basicstyle=\small\tt,columns=flexible,breaklines=false,
keywordstyle=\color{red}\bfseries,
keywordstyle=[2]\color{blue},
commentstyle=\color{green},
stringstyle=\color{blue},
showspaces=false,showstringspaces=false}
\lstset{extendedchars=false}
\lstset{inputencoding=utf8x}
\DeclareUnicodeCharacter{8797}{:=}
\DeclareUnicodeCharacter{10746}{++}
\DeclareUnicodeCharacter{9001}{\ensuremath{\langle}}
\DeclareUnicodeCharacter{9002}{\ensuremath{\rangle}}
\title{
INFORMATION AND COMMUNICATION TECHNOLOGIES\\
(ICT)\\
PROGRAMME\\
\vspace*{1cm}Project FP7-ICT-2009-C-243881 \cerco{}}
\date{ }
\author{}
\begin{document}
\thispagestyle{empty}
\vspace*{-1cm}
\begin{center}
\includegraphics[width=0.6\textwidth]{../../style/cerco_logo.png}
\end{center}
\begin{minipage}{\textwidth}
\maketitle
\end{minipage}
\vspace*{0.5cm}
\begin{center}
\begin{LARGE}
\bf
Report n. D3.2\\
CIC encoding: Front-end\\
\end{LARGE}
\end{center}
\vspace*{2cm}
\begin{center}
\begin{large}
Version 1.0
\end{large}
\end{center}
\vspace*{0.5cm}
\begin{center}
\begin{large}
Main Authors:\\
Brian~Campbell
\end{large}
\end{center}
\vspace*{\fill}
\noindent
Project Acronym: \cerco{}\\
Project full title: Certified Complexity\\
Proposal/Contract no.: FP7-ICT-2009-C-243881 \cerco{}\\
\clearpage \pagestyle{myheadings} \markright{\cerco{}, FP7-ICT-2009-C-243881}
\newpage
\vspace*{7cm}
\paragraph{Abstract}
We describe the translation of the front-end of the \cerco{} compiler from the
\ocaml{} prototype to the Calculus of Inductive Constructions (CIC) in the
\matita{} proof assistant. This transforms programs in the C-like
\textsf{Clight} language to the \textsf{RTLabs} language, which is reasonably
target-independent and in the form of a control flow graph.
We also report on progress enriching these transformations with dependent
types so as to establish key invariants in each intermediate language, which
removes potential sources of failure within the compiler and
will assist in future work proving correctness properties.
This work was Task~3.2 of the \cerco{} project, translating the prototype
described in Deliverable~2.2~\cite{d2.2} into CIC using the intermediate languages
formalized in Deliverables~3.1~\cite{d3.1} and~3.3. It will feed into the front-end
correctness proofs in Task~3.4 and is a counterpart to the back-end
formalization in Task~4.2.
\newpage
\tableofcontents
% TODO: clear up any -ize vs -ise
% CHECK: clear up any "front end" vs "front-end"
% CHECK: clear up any mentions of languages that aren't textsf'd.
% CHECK: fix unicode in listings
% CHEKC: capitalise deliverable/task when referring to a particular one
\section{Introduction}
The \cerco{} compiler has been prototyped in \ocaml{}~\cite{d2.1,d2.2}, but
the certified compiler will be a program written in the Calculus of Inductive
Constructions (CIC), as realised by the \matita{} proof assistant. This
deliverable reports on the translation of the front-end of the compiler into
CIC and the subsequent efforts to start exploiting dependent types to maintain
invariants and rule out potential sources of failure in the compiler.
The input language for the formalized compiler is the \textsf{Clight}
language. This is a C-like language with side-effect free expressions that
was adapted from the CompCert project~\cite{compcertfm06}\footnote{We will
also use their CIL-based C parser to generate \textsf{Clight} abstract syntax
trees, but will not formalize this code.} and provided with an executable
semantics. See~\cite{d3.1} for more details on the syntax and semantics.
\begin{figure}
\begin{center}
\begin{minipage}{.8\linewidth}
\begin{tabbing}
\quad \= $\downarrow$ \quad \= \kill
\textsf{C} (unformalized)\\
\> $\downarrow$ \> CIL parser (unformalized \ocaml)\\
\textsf{Clight}\\
\> $\downarrow$ \> cast removal\\
\> $\downarrow$ \> add runtime functions\\
\> $\downarrow$ \> labelling\\
\> $\downarrow$ \> stack variable allocation and control structure
simplification\\
\textsf{Cminor}\\
\> $\downarrow$ \> generate global variable initialisation code\\
\> $\downarrow$ \> transform to RTL graph\\
\textsf{RTLabs}\\
\> $\downarrow$ \> start of target specific back-end\\
\>\quad \vdots
\end{tabbing}
\end{minipage}
\end{center}
\caption{Front-end languages and transformations}
\label{fig:summary}
\end{figure}
The front-end of the compiler is summarised in Figure~\ref{fig:summary}. The
two intermediate languages involved are
\begin{description}
\item[\textsf{Cminor}] --- a C-like language where local variables are not
explicitly allocated memory and control structures are simpler
\item[\textsf{RTLabs}] --- a language in the form of a control flow graph
which retains the values and front-end operations from \textsf{Cminor}
\end{description}
More details on the formalisation of the syntax and semantics of these
languages can be found in the accompanying Deliverable 3.4.
Development of the formalized front-end was conducted in concert with the
development of these intermediate languages to facilitate testing.
\subsection{Revisions to the prototype compiler}
We have been tracking revisions to the prototype compiler during the
development of the formalized version. Most of these changes were minor, but
one exception is a major change to the structure of the compiler.
The original plan for the front-end featured a \textsf{Clight} to
\textsf{Clight8} phase near the start which replaced all of the integer
values and operations by 8 bit counterparts, while pointers were split into
bytes at a later stage. Experience has shown that it would be difficult to
produce good code with this approach. Instead, we now have:
\begin{itemize}
\item full size integers, pointers and operations until code selection (the
first part of the back-end after \textsf{RTLabs}), and
\item a cast removal stage which simplifies \textsf{Clight} expressions such
as
\begin{lstlisting}[language=C,belowskip=0pt]
(char)((int)x + (int)y)
\end{lstlisting}
into equivalent operations on simpler types, \lstinline'x+y' in this case.
The cast removal is important because C requires \emph{arithmetic promotion}
of integer types to (at least) \lstinline'int' before an operation is
performed. The \textsf{Clight} semantics do not perform the promotions,
instead they are added as casts by the CIL-based parser. However, our targets
benefit immensely from performing operations on the smallest possible integer
type, so it is important that we remove promotions where possible.
\end{itemize}
This document describes the formalized front-end after these changes.
\section{\textsf{Clight} phases}
In addition to the conversion to \textsf{Cminor}, there are several
transformations which act directly on the \textsf{Clight} language.
\subsection{Cast simplification}
We noted above that the arithmetic promotion required by C (and implemented in
the CIL-based parser) adds numerous casts, causing arithmetic operations to be
performed on 32 bit integers. If left alone, the resulting code will be
larger and slower. This phase removes many of the casts so that the
operations can be performed more efficiently.
The prototype version worked by recognising fixed patterns in the
\textsf{Clight} abstract syntax tree such as
\[ (t)((t_1)e_1\ op\ (t_2)e_2), \]
subject to restrictions on the types. These are replaced with a simpler
version without the casts. Such `deep' pattern matching is slightly awkward in
\matita{} and this approach does not capture compositions of operations, such as
\begin{lstlisting}[language=C]
(char)(((int)a + (int)b) + (int)c)
\end{lstlisting}
where \lstinline'a', \lstinline'b' and \lstinline'c' are of type
\lstinline'char', because the intermediate expression is not cast to and from
\lstinline'char'.
The formalized version uses a different method, recursively examining each
expression constructor to see if the expression can be coerced to some
`desired' type. For example, when processing the above expression it reaches
each \lstinline'int' cast with a desired type of \lstinline'char', notes that
the subexpression is of type \lstinline'char' and eliminates the cast.
Moreover, when the recursive processing is complete the \lstinline'char' cast
is also eliminated because its subexpression is already of the correct type.
This has been implemented in \matita. We have also performed a few proofs that
the arithmetic behind these changes is correct to gain confidence in the
technique. During Task 3.4 we will extend these proofs to cover more
operations and show that the semantics of the expressions are equivalent, not
just the underlying arithmetic.
\subsection{Labelling}
This phase adds cost labels to the \textsf{Clight} program. It is a fairly
simple recursive definition, and was straightforward to port to \matita. The
generation of cost labels was handled by our generic identifiers code,
described in the accompanying Deliverable 3.3 on intermediate languages.
\subsection{Runtime functions}
Some operations on integers do not have a simple translation to the target
machine code. In particular, we need to replace operations for 16 and 32-bit
division and most bitwise shifts with calls to runtime functions. These
functions need to be added to the program at an early stage because of their
impact on execution time: any loops must be available to our labelling
mechanism so that we can report on how long the resulting machine code will
take to execute.
We follow the prototype in replacing the affected expressions, which requires
us to break up expressions into multiple statements because function calls are
not permitted in \textsf{Clight} expressions. We may investigate moving these
substitutions to a later stage of the compiler if they prove difficult to
reason about. However, this would also require adjusting the semantics so
that the costs still appear in the evaluation of \textsf{Clight} programs.
The prototype adds the functions themselves by generating C code as text and
reparsing the program. This is unsuitable for formalization, so we generate
\textsf{Clight} abstract syntax trees directly.
\subsection{Conversion to \textsf{Cminor}}
The conversion to \textsf{Cminor} performs two essential tasks. First, it
determines which local variables need to be stored in memory and generates
explicit memory accesses for them. Second, it must translate the control
structures (\lstinline'for', \lstinline'while', \dots) into \textsf{Cminor}'s
more basic structures.
These are both performed by code similar to that in the prototype, although
the use of generic fold operations on statements and expressions has been
replaced by simpler recursive definitions.
There are two additional pieces of work that the formalized translation must
do. The \textsf{Cminor} definition features some mild constraints of the
types of expressions, which we can enforce in the translation using some type
checking. The error monad is used to dispose of ill-typed \textsf{Clight}
programs.
The other difficulty is that we need to generate fresh temporary variables to
store function results in before they are written to memory. This is
necessary because \textsf{Clight} allows arbitrary \emph{lvalue} expressions
as the destination for the returned value, but \textsf{Cminor} only allows
local variables. All other variable names in the \textsf{Cminor} program came
from the \textsf{Clight} program, but we need to construct a method for
generating fresh names for the temporaries.
Our identifiers are based on binary numbers, and generation of fresh names is
handled by keeping track of the highest allocated number. Normally this is
initialised at zero, but if initialised by the largest existing identifier in
the \textsf{Clight} program then the generated names will be fresh.
To do this, we extract the maximum identifier by recursively finding the maximum
variable name used in every expression, statement and function of the program.
\section{\textsf{Cminor} phases}
\textsf{Cminor} programs are processed by two passes: one deals with the
initialisation of global variables, and the other produces \textsf{RTLabs}
code.
\subsection{Initialisation code}
This replaces the initialisation data with explicit code in the main function.
The only remarkable point in the formalization is that we have two slightly
different instantiations of the \textsf{Cminor} syntax: one with
initialisation data that this pass takes as input, and one with only size
information that is the output. In addition to reflecting the purpose of this
pass in its type, it also ensures that the pass cannot be accidentally omitted.
\subsection{Conversion to \textsf{RTLabs}}
This pass breaks down the structure of the \textsf{Cminor} program into a
control flow graph, but maintains the same set of operations. The algorithm
is stateful in the sense that it builds up the \textsf{RTLabs} function body
incrementally, but all of the relevant state is already present in the
function record (including the fresh register and graph label name generators)
and the prototype passes this around. Thus the formalized code is very
similar in nature.
One possible variation would be to explicitly define a state monad to carry
the function under construction around, but it is not yet clear if this will
make the correctness results easier to prove.
\section{Adding and using invariants}
The compiler phases described above all use the error monad to deal with
inconsistencies in the program being transformed. In particular, lookups in
environments may fail, control flow graphs may have missing statements and
various structural problems may be present. We would like to show that these
failures are absent where possible by establishing that programs are well
formed early in the compilation process.
This work overlaps with Deliverable 3.3 (where more details of the additions
to the syntax and semantics of the intermediate languages can be found) and
Task 3.4 on the correctness of the compiler. Thus this work is experimental
in nature, and will evolve during Task 3.4.
The use of the invariants follows a common pattern. Each language embeds
invariants in the function record that constrain the function body by other
information in the record (such as the list of local variables and types, or
the set of labels declared). However, during the transformations they
typically need to be refined to constraints on individual statements and
expressions with respect to data structures used in the transformation.
A similar change in invariants is required between the transformation and the
new function.
For example, consider the use of local variables in the \textsf{Cminor} to
\textsf{RTLabs} stage. We start with
\begin{lstlisting}[language=matita]
record internal_function : Type[0] ≝
{ f_return : option typ
; f_params : list (ident $\times$ typ)
; f_vars : list (ident $\times$ typ)
; f_stacksize : nat
; f_body : stmt
; f_inv : stmt_P ($\lambda$s.stmt_vars ($\lambda$i.Exists ? ($\lambda$x.\fst x = i) (f_params @ f_vars)) s $\wedge$
stmt_labels ($\lambda$l.Exists ? ($\lambda$l'.l' = l) (labels_of f_body)) s) f_body
}.
\end{lstlisting}
where the first half of \lstinline'f_inv' requires every variable in the
function body to appear in the parameter or variable list. In the translation
to \textsf{RTLabs}, variable lookups are performed in a map to \textsf{RTLabs}
pseudo-registers:
\begin{lstlisting}[language=matita]
definition env ≝ identifier_map SymbolTag register.
let rec add_expr (le:label_env) (env:env) (ty:typ) (e:expr ty)
(Env:expr_vars ty e (present ?? env))
(dst:register) (f:partial_fn le) on e
: $\Sigma$f':partial_fn le. fn_graph_included le f f' ≝
match e return $\lambda$ty,e.expr_vars ty e (present ?? env) $\rightarrow$
$\Sigma$f':partial_fn le. fn_graph_included le f f' with
[ Id _ i $\Rightarrow$ $\lambda$Env.
let r ≝ lookup_reg env i Env in
...
\end{lstlisting}
Note that \lstinline'lookup_reg' returns a register without any possibility of
error. The reason this works is that the pattern match on \lstinline'e'
refines the type of the invariant \lstinline'Env' to a proof that the variable
\lstinline'i' is present. We then pass this proof to the lookup function to
rule out failure.
When this map \lstinline'env' is constructed at the start of the phase, we
prove that the proof \lstinline'f_inv' from the function implies the invariant
on variables needed by \lstinline'add_expr' and its equivalent on statements:
\begin{lstlisting}[language=matita]
lemma populates_env : $\forall$l,e,u,l',e',u'.
populate_env e u l = $\langle$l',e',u'$\rangle$ $\rightarrow$
$\forall$i. Exists ? ($\lambda$x.\fst x = i) l $\rightarrow$
present ?? e' i.
\end{lstlisting}
A similar mechanism is used to show that \texttt{goto} labels are always
declared in the function.
Also note the return type of \lstinline'add_expr' is a dependent pair. We
build the resulting \textsf{RTLabs} function incrementally, using a type
\lstinline'partial_fn' that does not contain the final invariant for
functions. We always require the \lstinline'fn_graph_included' property for
partially built functions to show that the graph only gets larger, a key part
of the proof that the resulting control flow graph is closed. Dependent pairs
are used in a similar manner in the \textsf{Clight} to \textsf{Cminor} phase
too.
This work does not currently cover all of the possible sources of failure; in
particular some structural constraints on functions are not yet covered and
some properties of \textsf{RTLabs} programs that may be useful for later
stages or the correctness proofs are not produced. Moreover, we may
experiment with variations to try to make the proof obligations and syntax
simpler to deal with. However, it does show that retrofitting these
properties using dependent types in \matita{} is feasible.
\section{Testing}
To provide some early testing and bug fixing of this code we constructed it in
concert with the executable semantics described in Deliverable 3.3, and
\matita{} term pretty printers in the prototype compiler. Using these, we
were able to test the phases individually and together by running programs
within the proof assistant itself, and comparing the results with the expected
output.
\section{Conclusion}
We have formalized the front-end of the \cerco{} compiler in the \matita{}
proof assistant, and shown that invariants can be added to the intermediate
languages to help show properties of it. This work provides a solid basis for
the compiler correctness proofs in Task 3.4.
\bibliographystyle{plain}
\bibliography{report}
\end{document}