Changeset 3243


Timestamp:
Apr 30, 2013, 11:34:54 PM
Author:
sacerdot
Message:

The report, spellchecking needed.

File:
1 edited

  • Deliverables/D6.6/report.tex

    r3234 r3243  
    6161\begin{LARGE}
    6262\textbf{
    63 Report n. D5.2\\
    64 Trusted CerCo Prototype}
     63Report n. D6.6\\
     64Packages for Linux distributions and Live CD }
    6565\end{LARGE}
    6666\end{center}
     
    7777\begin{large}
    7878Main Authors:\\
    79 XXXX %Dominic P. Mulligan and Claudio Sacerdoti Coen
     79Claudio Sacerdoti Coen
    8080\end{large}
    8181\end{center}
     
    9696\vspace*{7cm}
    9797\paragraph{Abstract}
    98 The Trusted CerCo Prototype is meant to be the last piece of software produced
    99 in the CerCo project. It consists of
    100 \begin{enumerate}
    101  \item A compiler from a large subset of C to the Intel HEX format for the
    102    object code of the 8051 microprocessor family. The compiler also computes
    103    the cost models for the time spent in basic blocks and the stack space used
    104    by the functions in the source code. The cost models are exposed to the
    105    user by producing an instrumented C code, obtained by injecting into the
    106    original source code instructions that update three global variables
    107    recording the current clock, the current stack usage and the maximum stack usage so far.
    108  \item A plug-in for the Frama-C Software Analyser architecture. The plug-in
    109    takes a C file as input, compiles it with the CerCo compiler and then
    110    automatically infers cost invariants for every loop and every function in
    111    the source code. Frama-C can feed the invariants to various theorem
    112    provers for automatic verification. Those that are not automatically
    113    verified can be checked manually by the user with a theorem prover.
    114 \end{enumerate}
    115 
    116 The plug-in is fully functional and it does not need to be formally verified
    117 because it does not belong to the trusted code base of programs verified using
    118 CerCo's technology. A bug in the plug-in can at worst produce wrong time/space
    119 invariants and bounds, which will then be rejected by the automatic provers.
    120 
    121 The compiler is currently fully functional, even though it is not fully
    122 certified yet. The certification will continue after the end of the project.
    123 
    124 In this deliverable we discuss the status of the Trusted CerCo Prototype at the
    125 time of the official end of the project.
     98We provide Debian packages for all the software developed in CerCo and for all
     99its software dependencies. The Debian packages are for the forthcoming stable
     100Debian distribution. Moreover, we provide a Live CD, also based on Debian, that
     101contains a pre-installed version of all the packages and a copy of the
     102formalization.
    126103\newpage
    127104
     
    133110\label{sect.task}
    134111
    135 The Grant Agreement describes deliverable D5.2 as follows:
     112The Grant Agreement describes deliverable D6.6 as follows:
    136113\begin{quotation}
    137 \textbf{Trusted CerCo Prototype}: Final, fully trustable version of the
    138 system.
     114\textbf{Packages for Linux distributions and Live CD}:
     115In order
     116to foster adoption of the CerCo compiler in a wider community, we will provide packages for selected Linux
     117distributions and a Live CD with the software developed in the project. We will also consider and discuss the
     118integration of our software in an extensible platform dedicated to source-code analysis of C software, like
     119Frama-C. However, at the moment it is unclear how these rapidly changing platforms will evolve in the next
     120two years, if integration would provide an added value and if it would be possible to practically achieve the
     121integration in the project timeframe without an actual involvement of the platform developers (that would need to
     122contribute man-power to the task).
    139123\end{quotation}
    140124
    141 This report describes the state of the implementation of the Trusted CerCo
    142 Prototype at the official end of the project.
    143 
    144 \section{The Trusted CerCo Compiler}
    145 \subsection{Command line interface}
    146 The Trusted CerCo Compiler is the first component of the Trusted CerCo
    147 Prototype, the second one being the Frama-C plug-in already developed in
    148 D5.1 and not modified. The Trusted CerCo compiler replaces the Untrusted
    149 CerCo Compiler already delivered in D5.1 as part of the Untrusted CerCo
    150 Prototype. The Trusted and Untrusted compilers are meant to provide the
    151 same command line interface, so that they can be easily swapped without
    152 affecting the plug-in. Actually, the two compilers share the following
    153 affecting the plug-in. In particular, the two compilers share the following
    154 minimal subset of options, which are sufficient for the plug-in:
    155 \begin{verbatim}
    156 Usage: acc [options] file...
    157   -a                        Add cost annotations on the source code.
    158   -is                       Outputs and interprets all the compilation passes,
    159                             showing the execution traces
    160   -o                        Prefix of the output files.
    161 \end{verbatim}
    162 
    163 \begin{figure}[t]
    164 \begin{verbatim}
    165 char a[] = {3, 2, 7, -4};
    166 char treshold = 4;
    167 
    168 int main() {
    169         char j;
    170         char *p = a;
    171         int found = 0;
    172         for (j=0; j < 4; j++) {
    173                 if (*p <= treshold) { found++; }
    174                 p++;
    175         }
    176         return found;
    177 }
    178 \end{verbatim}
    179 \caption{A simple C program~\label{test1}}
    180 \end{figure}
    181 
    182 \begin{figure}[!p]
    183 \begin{verbatim}
    184 int __cost = 33;
    185 
    186 int __stack_size = 5, __stack_size_max = 5;
    187 
    188 void __cost_incr(int incr) {
    189   __cost = __cost + incr;
    190 }
    191 
    192 void __stack_size_incr(int incr) {
    193   __stack_size = __stack_size + incr;
    194   __stack_size_max = __stack_size_max < __stack_size ? __stack_size : __stack_size_max;
    195 }
    196 
    197 unsigned char a[4] = { 3, 2, 7, 252, };
    198 
    199 unsigned char treshold = 4;
    200 
    201 int main(void)
    202 {
    203   unsigned char j;
    204   unsigned char *p;
    205   int found;
    206   __stack_size_incr(0);
    207   __cost_incr(91);
    208   p = a;
    209   found = 0;
    210   for (j = (unsigned char)0; (int)j < 4; j = (unsigned char)((int)j + 1)) {
    211     __cost_incr(76);
    212     if ((int)*p <= (int)treshold) {
    213       __cost_incr(144);
    214       found = found + 1;
    215     } else {
    216       __cost_incr(122);
    217       /*skip*/;
    218     }
    219     p = p + 1;
    220   }
    221   __cost_incr(37);
    222   /*skip*/;
    223   __stack_size_incr(-0);
    224   return found;
    225   __stack_size_incr(-0);
    226 
    227 }
    228 \end{verbatim}
    229 \caption{The instrumented version of the program in Figure~\ref{test1}\label{itest1}}
    230 \end{figure}
    231 
    232 Let the file \texttt{test.c} contain the code presented in Figure~\ref{test1}.
    233 By calling \texttt{acc -a test.c}, the user obtains the following files:
    234 \begin{itemize}
    235  \item \texttt{test-instrumented.c} (Figure~\ref{itest1}): the file is a copy
    236    of \texttt{test.c} obtained by adding code that keeps track of the amount
    237    of time and space used by the source code.
    238 
    239    The global variable
    240    \texttt{\_\_cost} records the number of clock cycles used by the
    241    microprocessor. Its initial value may be non-zero because we emit object
    242    code to initialize the global variables of the program. The initialization
    243    code is not part of \texttt{main} (to allow \texttt{main} to be called
    244    recursively, even if the standard is unclear on whether this should be allowed).
    245    The initial value of \texttt{\_\_cost} is the time spent in the
    246    initialization code.
    247    The \texttt{\_\_cost} variable is incremented at the beginning of every
    248    basic block using the \texttt{\_\_cost\_incr} function, whose argument
    249    is the number of clock cycles that will be spent by the basic block that
    250    is beginning. Therefore the value stored in the variable is always a safe
    251    upper bound of the actual time. Moreover, the difference between the values
    252    of the \texttt{\_\_cost} variable at the beginning and at the end of a
    253    function is also exact (see the worked example after this list). The latter
    254    statement --- with several technical complications --- is the one that will eventually be certified in Matita.
    255 
    256    The global variables \texttt{\_\_stack\_size} and
    257    \texttt{\_\_stack\_size\_max} are handled similarly to \texttt{\_\_cost}.
    258    Their value is meant to represent the amount of external data memory
    259    currently in use and the maximum amount of memory used so far by the program.
    260    The two values are updated by the \texttt{\_\_stack\_size\_incr} function
    261    at the beginning of every function body, at its end and just before every
    262    \texttt{return}. A negative increment is used to represent stack release.
    263    The initial value of the two variables may not be zero because some
    264    external data memory is used for global and static variables. Moreover, the
    265    last byte of external data memory may not be addressable in the 8051, so
    266    we avoid using it. The compiled code runs correctly only until stack
    267    overflow, which happens when the value of \texttt{\_\_stack\_size} reaches
    268    $2^{16}$. It is up to the user to state and maintain the invariant that no
    269    stack overflow occurs. In case of stack overflow, the compiled code no
    270    longer simulates the source code. Automatic provers are often able to
    271    prove the invariant in simple cases.
    272 
    273    In order to instrument the code, all basic blocks that are hidden in the
    274    source code but that will contain code in the object code need to be made
    275    manifest. In particular, \texttt{if-then} instructions without an explicit
    276    \texttt{else} are given one (compare Figures~\ref{test1} and \ref{itest1}).
    277  \item \texttt{test.hex}: the file is an Intel HEX representation of the
    278    object code for an 8051 microprocessor. The file can be loaded in any
    279    8051 emulator (like the MCU 8051 IDE) for running or disassembling it.
    280    Moreover, an executable semantics (an emulator) for the 8051 is linked
    281    with the CerCo compilers and can be used to run the object code at the
    282    end of the compilation.
    283  \item \texttt{test.cerco} and \texttt{test.stack\_cerco}: these two files
    284    are used to pass information from the CerCo compilers to the Frama-C plug-in
    285    and are not of direct interest to the user. They just list the
    286    global variables and increment functions inserted by the compiler and used
    287    later by the plug-in to compute the cost invariants.
    288 \end{itemize}
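As a worked example, the constants in Figure~\ref{itest1} let us compute by
hand the final value of \texttt{\_\_cost} for the program in
Figure~\ref{test1}: the loop body executes four times and the \texttt{then}
branch is taken twice (for the elements 3 and 2), hence
\[
\texttt{\_\_cost} = 33 + 91 + 4 \cdot 76 + 2 \cdot 144 + 2 \cdot 122 + 37 = 997
\]
clock cycles. This is precisely the kind of computation that the plug-in
automates and turns into invariants to be discharged by automatic provers.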
    289 
    290 When the option \texttt{-is} is used, the compilers run the program after
    291 every intermediate compilation pass. The untrusted compiler only outputs the
    292 trace of (cost) observables and the final value returned by \texttt{main}; the
    293 trusted compiler also observes every function call and return (the stack-usage
    294 observables) and therefore allows the user to track program execution more
    295 precisely. The traces observed after every pass should always be equal up to
    296 the initial observable that corresponds to the initialization of global
    297 variables. That observable is currently only emitted in back-end languages.
    298 The preservation of the traces will be guaranteed by the main theorem of the
    299 formalization (the forward simulation theorem) once the proof is completed.
    300 
    301 The compilers also create a file
    302 for each pass with the intermediate code. However, the syntaxes used by the two
    303 compilers for the intermediate passes are not the same and we do not yet
    304 output any intermediate syntax for the assembly code. The latter can be
    305 obtained by disassembling the object code, e.g. using the MCU 8051 IDE.
    306 
    307 \subsection{Code structure}
    308 The code of the trusted compiler consists of three different parts:
    309 \begin{itemize}
    310  \item Code extracted to OCaml from the formalization of the compiler in Matita.
    311    This is the code that will eventually be fully certified using Matita.
    312    It is fully contained in the \texttt{extracted} directory (excluding
    313    the subdirectory \texttt{untrusted}).
    314    It translates the source C-light code to:
    315    \begin{itemize}
    316     \item An instrumented version of the same C-light code. The instrumented
    317       version is obtained by inserting cost emission statements~\texttt{COST k}
    318       at the beginning of basic blocks. The emitted labels are the ones that
    319       are observed when calling \texttt{acc -is}. They will be replaced in the
    320       final instrumented code with increments of the \texttt{\_\_cost}
    321       variable.
    322     \item The object code for the 8051 as a list of bytes to be loaded in
    323       code memory. The code is coupled with a trie over bitvectors that maps
    324       program counters to cost labels \texttt{k}. The intended semantics is
    325       that, when the current program counter is associated to \texttt{k},
    326       the observable label \texttt{k} should be emitted during the execution
    327       of the next object code instruction. Similarly, a second trie, reminiscent
    328       of a symbol table, maps program counters to function names to emit the
    329       observables associated to stack usage.
    330    \end{itemize}
    331    During the ERTL to LTL pass, which deals with register allocation,
    332    the extracted code calls two untrusted components that we are now going to
    333    describe.
    334  \item Untrusted code called by the trusted compiler (directory
    335    \texttt{extracted/untrusted}). The two main untrusted components in this
    336     directory are
    337   \begin{itemize}
    338    \item The \texttt{Fix.ml} module by Fran\c{c}ois Pottier that provides a
    339     generic algorithm to compute the least solution of a system of monotonic
    340     equations, described in the paper \cite{LazyLeastFixedPointsInML}.
    341     The trusted code uses this piece of code to perform liveness analysis and
    342     only assumes that the module computes a pre-fixpoint of the system of
    343     equations provided. The performance of this code is critical to the compiler,
    344     and the implementation exploits some clever programming techniques
    345     and imperative data structures that would be difficult to prove correct.
    346     Checking whether the result is actually a pre-fixpoint is instead very
    347     simple to do using only functional code of low computational complexity.
    348     Therefore we do not plan to prove this piece of code correct. Instead, we
    349     will certify a verifier for the results of the computation (see the sketch after this list).
    350    
    351     Note: this module, as well as the next one, has been reused from the
    352     untrusted compiler. Therefore both have been thoroughly tested for bugs
    353     during the last year of the project.
    354    \item The \texttt{coloring.ml} module, taken from Fran\c{c}ois Pottier's
    355     PP compiler, used in a compilers course for undergraduates.
    356     The module takes an interference graph
    357     (a data structure implemented in module \texttt{untrusted\_interference.ml})
    358     and colors it, assigning to each node of the interference graph either a
    359     color or the constant \texttt{Spilled}. A heuristic is used to drive the
    360     algorithm: the caller must provide a function that returns the number of
    361     times a register is accessed, to decrease its likelihood of being spilled.
    362     To minimise bugs and reduce code size, we actually implement the heuristic
    363     in Matita and then extract the code. Therefore the untrusted code also calls
    364     back extracted code for untrusted computations.
    365 
    366     Finally, module \texttt{compute\_colouring.ml} is the one that actually
    367     computes the colouring of an interference graph of registers given the
    368     result of the liveness analysis. The code first creates the interference
    369     graph; then colours it once using real registers as colours to determine
    370     which pseudo-registers need spilling;
    371     finally it colours it again using real registers and spilling slots as
    372     colours, to assign to each pseudo-register either a spilling slot or a
    373     real register.
    374   \end{itemize}
    375   The directory also contains a few more files that provide glue code between
    376   OCaml's data structures from the standard library (e.g. booleans and lists)
    377   and the corresponding data structures extracted from Matita. The glue is
    378   necessary to translate results back and forth between the trusted (extracted)
    379   and untrusted components, because the latter have been written using data
    380   structures from the standard library of OCaml.
    381  \item Untrusted code that drives the trusted compiler (directory
    382    \texttt{driver}). The directory contains the \texttt{acc.ml} module
    383    that implements the command line interface of the trusted compiler.
    384    It also contains or links three untrusted components which are safety
    385    critical and therefore enter the trusted code base of CerCo:
    386    \begin{itemize}
    387     \item The CIL parser~\cite{CIL} that parses standard C code and simplifies
    388       it to C-light code.
    389     \item A pretty-printer for C-light code, used to print the
    390       \texttt{-instrumented.c} file.
    391     \item The instrumenter module (integrated with the pretty-printer for
    392       C-light code). It takes the output of the compiler and replaces every
    393       statement \texttt{COST k} (that emits the cost label \texttt{k}) with
    394       \texttt{\_\_cost\_incr(n)} where $n$ is the cost associated to
    395       \texttt{k} in the map returned by the compiler. A similar operation is
    396       performed to update the stack-related global variables. Eventually this
    397       module will need to be certified with the following specification: if
    398       a run of the labelled program produces the trace
    399       $\tau = k_1 \ldots k_n$, the
    400       equivalent run of the instrumented program should yield the same
    401       result and be such that at the end the value of the \texttt{\_\_cost}
    402       variable is equal to $\sum_{k \in \tau} K(k)$, where $K(k)$ is the
    403       lookup of the cost of $k$ in the cost map $K$ returned by the compiler.
    404       A similar statement is expected for stack usage.
    405    \end{itemize}
    406 \end{itemize}
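To illustrate the planned verifier for the results of \texttt{Fix.ml}, the
following OCaml sketch shows the core check. The names and types are
illustrative only and do not reproduce the actual interface of
\texttt{Fix.ml}: we merely assume a lattice of properties and a function
giving the right-hand side of each equation.
\begin{verbatim}
module type LATTICE = sig
  type property
  val leq : property -> property -> bool  (* the partial order *)
end

module MakeVerifier (P : LATTICE) = struct
  (* [rhs v valuation] evaluates the right-hand side of the equation
     for variable [v] under [valuation]; [sol] is the candidate
     solution returned by the untrusted solver. *)
  let is_prefixpoint (vars : 'v list)
      (rhs : 'v -> ('v -> P.property) -> P.property)
      (sol : 'v -> P.property) : bool =
    List.for_all (fun v -> P.leq (rhs v sol) (sol v)) vars
end
\end{verbatim}
Any valuation accepted by such a check is a sound over-approximation of the
least solution, which is all that the correctness proof of the pass needs.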
    407 
    408 \subsection{Functional differences w.r.t. the untrusted compiler}
    409 From the user's perspective, the trusted and untrusted compilers have some
    410 differences:
    411 \begin{itemize}
    412  \item The untrusted compiler puts the initialization code for global variables
    413    at the beginning of \texttt{main}. The trusted compiler uses a separate
    414    function. Therefore only the trusted compiler allows \texttt{main} to be
    415    called recursively. To account for the initialization code, the
    416    \texttt{\_\_cost} variable is not always initialized to 0 in the trusted
    417    compiler, while it is always 0 in the untrusted code.
    418  \item The two compilers do not compile the code in exactly the same way, even
    419    though they adopt the same compilation scheme.
    420    Therefore the object code produced is different and the control blocks are
    421    given different costs. On average, the code generated by the trusted
    422    compiler is about 3 times faster and may use less stack space.
    423  \item The trusted compiler is slightly slower than the untrusted one and the
    424    trusted executable semantics are also slightly slower than the untrusted
    425    ones.
    426    The only passes that are significantly slower at the moment are the
    427    policy computation pass, which is a preliminary step to assembly generation,
    428    and the assembly pass. These are the passes that manipulate the largest quantity
    429    of data, because assembly programs are much larger than the corresponding
    430    Clight sources. The reason for the slowness is currently under investigation.
    431    It is likely to be due to the quality of the extracted code (see
    432    Section~\ref{quality}).
    433  \item Neither compiler's back-end handles external functions,
    434    because we have not implemented a linker. The trusted compiler fails
    435    during compilation, while the untrusted compiler silently turns every
    436    external function call into a \texttt{NOP}.
    437  \item The untrusted compiler had ad-hoc options to deal with C files
    438    generated from a Lustre compiler. The ad-hoc code simplified the C files
    439    by avoiding some calls to external functions and added some
    440    debugging code to print the actual reaction time of every Lustre node.
    441    The trusted compiler does not implement any ad-hoc Lustre option yet.
    442 \end{itemize}
    443 
    444 \subsection{Implementation differences w.r.t. the untrusted compiler}\label{implementation}
    445 The code of the trusted compiler greatly differs from the code of the
    446 untrusted prototype. The main architectural difference concerns the
    447 representation of back-end languages. In the trusted compiler we have adopted
    448 a unified syntax (data structure), semantics and pretty-printing for every
    449 back-end language.
    450 In order to accommodate the differences between the original languages, the
    451 syntax and semantics have been abstracted over a number of parameters, like
    452 the type of the arguments of the instructions. For example, early languages use
    453 pseudo-registers to hold data, while late languages store data in real machine
    454 registers or stack locations (see the sketch after the list below). The
    455 unification of the languages has brought a few benefits and can potentially bring new ones in the future:
    456 \begin{itemize}
    457  \item Code size reduction and faster detection and correction of bugs.
    458 
    459    About three quarters of the code for the semantics and pretty-printing of
    460    back-end languages is shared, while one quarter is pass-dependent. Sharing
    461    the semantics automatically implies sharing semantic properties, i.e.
    462    reducing the number of lemmas to be proved to one sixth (6 being the number of
    463    back-end intermediate languages). Moreover, several back-end passes
    464    --- a pass between two alternative semantics for RTL, the RTL to ERTL pass
    465    and the ERTL to LTL pass --- transform a graph instance of the generic
    466    back-end intermediate language to another graph instance. The
    467    graph-to-graph transformation has also been generalized and parameterized
    468    over the pass-specific details. While the code saved in this way is not
    469    much, several significant lemmas are provided once and for all on the
    470    transformation. At the time of writing this deliverable,
    471    the generalized languages, semantics, transformations and related
    472    properties take about 3,900 lines of Matita code (definitions and lemmas).
    473 
    474    We also benefit from the typical advantage of code sharing over cut\&paste:
    475    once a bug is found and fixed, the fix immediately applies to every
    476    instance. This becomes particularly significant for code certification,
    477    where one simple bug fix usually requires complex work to repair the
    478    related correctness proofs.
    479 
    480  \item Some passes and several proofs can be given in greater generality,
    481    allowing more reuse.
    482 
    483    For example, in the untrusted prototype the
    484    LTL to LIN pass turned a graph language into a linearized language
    485    with the very same instructions and similar semantics. In particular, the
    486    two semantics shared the same execute phase, while fetching was different.
    487    The pass consisted of a visit of the graph, emitting the
    488    instructions in visit order. When the visit detected a cycle, the back-link
    489    arc was represented by a newly introduced explicit \texttt{GOTO} statement.
    490 
    491    Obviously, the transformation just described could be applied as well to
    492    any language with a \texttt{GOTO} statement. In our formalization in Matita,
    493    we have rewritten and proved correct the translation between any two
    494    instances of the generalized back-end languages such that: 1) the
    495    fetching-related parameters of the two passes are instantiated respectively
    496    with graphs and linear operations; 2) the execute-related parameters are
    497    shared.
    498 
    499    Obviously, we could also prove correct the reverse translation, from a
    500    linear to a graph-based version of the same language. The two combined
    501    passes would allow us to temporarily switch to a graph-based representation
    502    only when a data-flow analysis over the code is required. In our compiler,
    503    for example, at the moment only the ERTL to LTL pass is based on a data
    504    flow analysis. A similar pair of translations could also be provided between
    505    one of the two representations and a static single assignment (SSA) one.
    506    As a final observation, the insertion of another graph-based language after
    507    the LTL one is now made easy: the linearization pass need not be redone
    508    for the new pass.
    509    
    510 \item Pass commutation and reuse.
    511    Every pass is responsible for reducing a difference
    512    between the source and target languages. For example, the RTL to ERTL pass
    513    is responsible for the parameter passing conversion, while the ERTL to LTL
    514    pass performs pseudo-register allocation. At the moment, both CompCert
    515    and CerCo fix the relative order of the two passes in opposite and
    516    partially arbitrary ways, and it is not possible to simply switch the
    517    two passes. We believe instead that it should be possible to generalize
    518    most passes in such a way that they could be freely composed in any order,
    519    also with repetitions. For example, real world compilers like GCC perform
    520    some optimizations like constant propagation multiple times, after
    521    every optimization that is likely to trigger more constant propagation.
    522    Thanks to our generalized intermediate language, we can already implement
    523    a generic constant propagation pass that can be freely applied.
    524 %   For instance, table below shows the amount of Matita
    525 %   code taken by the description of the syntax, semantics and pretty printing
    526 %   rules for the back-end passes. It also compares the size with the untrusted
    527 %   C code.
    528 %   \begin{tabular}[h]{lrr}
    529 %       & Untrusted code & Matita code \\
    530 %    RTL    &   615          &    456 \\
    531 %   ERTL    &   891          &    342 \\
    532 %   LIN     &   568          &     79 \\
    533 %   LTL     &   636          &     94 \\
    534 %   LTL/LIN &                &    466 \\
    535 %   joint   &                &   1615 \\ \hline \\
    536 %   Tot.    & 2,710          &   3,052 \\ 
    537 %   The tables shows that the size of all untrusted code languages is
    538 %   comparable.
    539 \end{itemize}
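The following OCaml fragment gives a toy rendition of the parameterization
described above. It is illustrative only: the real joint syntax is defined in
Matita and is considerably richer.
\begin{verbatim}
(* The joint back-end syntax, parameterized over the type of
   instruction arguments. *)
type 'reg joint_instr =
  | Move      of 'reg * 'reg
  | Op        of string * 'reg * 'reg * 'reg
  | Call      of string
  | CostLabel of string          (* the COST k emission statements *)

(* Early passes instantiate 'reg with pseudo-registers ... *)
type pseudo = Pseudo of int
type rtl_instr = pseudo joint_instr

(* ... while late passes use hardware registers or stack slots. *)
type hw = Reg of int | StackSlot of int
type ltl_instr = hw joint_instr
\end{verbatim}
A graph-to-graph transformation written once against
\texttt{'reg joint\_instr} is then automatically available, with its lemmas,
to every instantiation.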
    540 
    541 
    542 \subsection{Quality of the extracted code}\label{quality}
    543 
    544 We have extracted the Matita code of the compiler to OCaml in order to compile
    545 and execute it in an efficient way and without any need to install Matita.
    546 The implementation of code extraction for Matita has been obtained by
    547 generalizing Coq's implementation over the data structures of Coq, and then
    548 instantiating the resulting code for Matita. Differences between the two calculi
    549 have also been taken into account during the generalization. Therefore we can
    550 expect the extraction procedure to be reasonably bug-free, because bugs in the
    551 core of the code extraction would likely have been detected in Coq as well.
    552 
    553 The quality of the extracted code ranges from sub-optimal to poor, depending
    554 on the part of the formalization. We analyse here the causes for the poor
    555 quality:
    556 \begin{itemize}
    557  \item Useless computations. The extraction procedure
    558    removes from the extracted code almost all of the non-computational parts,
    559    replacing those that are not removed with low-complexity code.
    560    However, following Coq's tradition, detection of the useless parts is not
    561    done according to the computationally expensive algorithm by
    562    Berardi~\cite{berardixxx}. Instead, the user decides which data structures
    563    are computationally relevant by declaring them in one of the
    564    \texttt{Type\_i} sorts of the Calculus of (Co)Inductive Constructions.
    565    The non-computational structures are declared in \texttt{Prop}, the sort
    566    of impredicative, possibly classical propositions. Every computation that
    567    produces a data structure in \texttt{Prop} is guaranteed to be computationally
    568    irrelevant. Computations that produce data structures in \texttt{Type\_i},
    569    instead, may actually be relevant or irrelevant, even if the extraction code
    570    conservatively considers them relevant. The result is extracted
    571    OCaml code that computes values that are passed to functions that do
    572    not use them, or that are returned to callers that ignore
    573    them.
    574 
    575    The phenomenon is particularly visible when the dependently typed discipline
    576    is employed, which is our choice for CerCo. Under this discipline, terms
    577    can be passed to type formers. For example, the data type for back-end
    578    languages in CerCo is parameterized over the list of global variables that
    579    may be referred to by the code. Another example is the type of vectors
    580    that is parameterized over a natural number (the size of the vector), or
    581    the type of vector tries which is parameterized over the fixed height of
    582    the tree and that can be read and updated only using keys (vectors of
    583    bits) whose length is the height of the trie. Functions that compute these
    584    dependent types also have to compute the new indexes (parameters) for the
    585    types, even if this information is used only for typing. For example,
    586    appending two vectors requires the computation of the length of the result
    587    vector just to type the result. In turn, this computation requires the
    588    lengths of the two input vectors. Therefore, functions that call append
    589    have to compute the lengths of the vectors to append even if the lengths
    590    will actually be ignored by the extracted code of the append function, as the sketch at the end of this list illustrates.
    591 
    592    In the literature there are proposals for allowing the user to more
    593    accurately distinguish computational from non-computational arguments of
    594    functions. The proposals introduce two different types of
    595    $\lambda$-abstractions and ad-hoc typing rules to ensure that computationally
    596    irrelevant bound variables are not used in computationally relevant
    597    positions. An OCaml prototype that implements these ideas for Coq is
    598    available~\cite{xxyy}, but it is heavily affected by bugs. We have not yet
    599    tried anything along these lines in Matita. To avoid modifying the system,
    600    another approach, based on the explicit use of a non-computational monad,
    601    has also been proposed in the literature, but it introduces many
    602    complications in the formalization and it cannot be used in every situation.
    603 
    604    Improving code extraction to remove irrelevant code from code extracted
    605    from Matita more aggressively is left as future work.
    606    At the moment, it seems that useless computations are indeed responsible
    607    for the poor performance of some parts of the extracted code. We have
    608    experimented with a few manual optimizations of the extracted code and
    609    saw that a few minor patches already allow a 25\% speed-up of the
    610    assembly pass. The code released with this deliverable is the one without
    611    the manual patches, to maximize reliability.
    612  \item Poor typing. A nice theoretical result is that the terms of the
    613    Calculus of Constructions (CoC),
    614    the upper-right corner of Barendregt's cube, can be re-typed in System
    615    $F_\omega$, the corresponding typed lambda calculus without dependent
    616    types~\cite{christinexx}. The calculi implemented by Coq and Matita, however,
    617    are more expressive than CoC, and several type constructions have no
    618    counterparts in System $F_\omega$. Moreover, core OCaml does not even
    619    implement $F_\omega$, but only the Hindley-Milner fragment of it.
    620    Therefore, in all situations listed below, code extraction is not able to
    621    type the extracted code using informative types; instead it falls back on
    622    OCaml's super-type --- abbreviated \texttt{\_\_} in the extracted code ---
    623    and on the unsafe cast \texttt{Obj.magic}. The lack of more precise typing
    624    has very limited impact on the performance of the compiler's OCaml code,
    625    but it makes it very hard to plug the extracted code together with untrusted
    626    code. The latter needs to introduce explicit unsafe casts between the
    627    super-type and the concrete types used by instances of the generic
    628    functions. The code written in this way is very error prone. For this reason
    629    we have decided to write parts of the untrusted code of the CerCo compiler
    630    in Matita as well (e.g. the pretty-printing functions), in order to benefit
    631    from the type checker of Matita.
    632 
    633    The exact situations that trigger uses of the super-type are:
    634    \begin{enumerate}
    635    \item The calculi feature a cumulative
    636     hierarchy of universes that allows writing functions that can be used
    637     both as term formers and type formers, according to the arguments that are
    638     passed to them. In System $F_\omega$, instead, terms and types are
    639     syntactically distinct. Extracting terms according to all their possible
    640     uses may be impractical because the number of uses is exponential in the
    641     number of arguments of sort $Type_i$ with $i \geq 2$.
    642    \item Case analysis and recursion over inhabitants of primitive inductive
    643     types can be used in types (strong elimination), which is not allowed in
    644     $F_\omega$. In the CerCo compiler we largely exploit type formers declared
    645     in this way, for example to provide the same level of type safety achieved
    646     in the untrusted compiler via polymorphic variants~\cite{guarriguesxx}.
    647     In particular, we have terms that syntactically describe, as first-class
    648     citizens, the large number of combinations of operand modes of object code
    649     instructions. On the instructions we provide ``generic'' functions that work
    650     on some combinations of the operand modes, and whose type is computed by
    651     primitive recursion on the syntactic description of the operand modes of
    652     the argument of the function.
    653    \item $\Sigma$-types and, more generally, dependently typed records can have
    654     at the same time fields that are type declarations and fields that are
    655     terms. This situation happens all the time in CerCo because we are sticking
    656     to the dependently typed discipline and because we often generalize our
    657     data structures over the types used in them. Concretely, the generalization
    658     happens over a record containing a type --- e.g. the type of
    659     (pseudo)-registers for back-end languages --- some terms working on the
    660     type --- e.g. functions to set/get values from (pseudo)-registers --- and
    661     properties over them. In System $F_\omega$, term and type abstractions
    662     are kept syntactically separate and there is no way to pack them in
    663     records.
    664   \item The type of the extracted function does not belong to the Hindley-Milner
    665     fragment. Sometimes simple code transformations could be used to make the
    666     function typeable, but the increased complexity of the extraction code
    667     would outweigh the benefits.
    668  \end{enumerate}
    669 \end{itemize}
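The useless computations caused by dependent types can be made concrete with a
hand-written OCaml rendition of what extraction typically produces for vector
append; this is not actual extracted CerCo code, but it follows the usual
extraction scheme for length-indexed vectors.
\begin{verbatim}
type nat = O | S of nat

let rec plus n m = match n with O -> m | S n' -> S (plus n' m)

(* The length index survives extraction inside the constructor. *)
type 'a vector = VNil | VCons of 'a * nat * 'a vector

let rec append (n : nat) (m : nat) (v : 'a vector) (w : 'a vector)
    : 'a vector =
  match v with
  | VNil -> w
  | VCons (x, n', v') -> VCons (x, plus n' m, append n' m v' w)
\end{verbatim}
The call to \texttt{plus} only rebuilds the length index that typing required
in Matita: no consumer of the resulting vector ever inspects it, yet it is
computed at every recursive step.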
    670 
    671 We should note how other projects, in particular CompCert, decided from
    672 the beginning to avoid dependent types in order to guarantee high quality of the
    673 extracted code and maximal control over it. Therefore, at the current state of the art of
    674 code extraction, there seems to be a huge trade-off between extracted code
    675 quality and exploitation of advanced typing and proving techniques in the
    676 source code. In the near future, the code base of CerCo can provide an
    677 interesting example of a large formalization based on dependent types and
    678 in need of high quality of extracted code. Therefore we plan to use it as a
    679 driver and test bench for future works in the improvement of code extraction.
    680 In particular, we are planning to study the following improvements to the
    681 code extraction of Matita:
    682 \begin{itemize}
    683  \item We already have a prototype that extracts code from Matita to Haskell,
    684   exploiting several GHC extensions that give access to a very large subset of
    685   System $F_\omega$. However, the prototype is not fully functional yet because we
    686   still need to solve at the theoretical level a problem of interaction
    687   between $F_\omega$ types and strong elimination. Roughly speaking, the
    688   two branches of a strong elimination always admit a most general unifier in
    689   Hindley-Milner plus the super-type, but the same property
    690   is lost for $F_\omega$. As a consequence, we lose modularity in code
    691   extraction and we need to figure out static analysis techniques to reduce
    692   the impact of the loss of modularity.
    693  \item The two most recent versions of OCaml have introduced first class
    694   modules, which are exactly the feature needed for extracting code that uses
    695   records containing both type and term declarations. However, the syntax
    696   required for first-class modules is extremely cumbersome and it requires
    697   the explicit introduction of type expressions to make manifest the type
    698   declaration/definition fields of the module. This complicates code
    699   extraction with the need to perform some computations at extraction time,
    700   which is not in the tradition of code extraction. Moreover, the actual
    701   performance of OCaml code that uses first class modules heavily is unknown
    702   to us. We plan to experiment with first class modules for extraction very
    703   soon.
    704   \item Algebraic data types are generalized to families of algebraic data
    705     types in the calculi implemented by Coq and Matita, even if the two
    706     generalizations are different. Families of algebraic data types are
    707     traditionally not supported by programming languages, but some
    708     restrictions have recently been considered under the name of
    709     Generalized Algebraic Data Types (GADTs) and they are now implemented in
    710     recent versions of OCaml and GHC. Future work includes investigating
    711     the possibility of exploiting GADTs during code extraction (see the sketch below).
    712 \end{itemize}
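As a hypothetical example of the last point, a type former defined in Matita
by strong elimination over operand modes could, in principle, be rendered with
an OCaml GADT instead of collapsing everything to the super-type:
\begin{verbatim}
type direct   = Direct
type indirect = Indirect

(* The operand mode is recorded in the type index ... *)
type _ operand =
  | Reg : int -> direct operand
  | Mem : int -> indirect operand

(* ... so a "generic" function keeps a precise type per mode. *)
let eval : type m. m operand -> int = function
  | Reg r -> r
  | Mem a -> a
\end{verbatim}
Whether the restrictions imposed by GADTs are compatible with the full
generality of strong elimination is exactly what remains to be investigated.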
    713 
    714 \section{The Cost-Annotating Plug-In}
    715 
    716 The functionalities of the Cost Annotating Plug-In have already been described
    717 in Deliverables~D5.1 and D5.3.
    718 The plug-in interfaces with the CerCo compiler via
    719 the command line. Therefore there was no need to update the plug-in code for
    720 integration in the Trusted CerCo Prototype. It is even possible
    721 to install the untrusted and the trusted compilers at the same time. The
    722 \texttt{-cost-acc} flag of the plug-in can be used to select the executable
    723 to be used for compilation.
    724 
    725 The code of the plug-in has been modified w.r.t. D5.1 in order to also take
    726 care of the cost model for stack-size consumption. From the user's point of view,
    727 time and space cost annotations and invariants are handled in a similar way.
    728 Nevertheless, we expect automated theorem provers to face more difficulties
    729 in dealing with stack usage because elapsed time is additive, whereas what
    730 is interesting for space usage is the maximum amount of stack used, which is
    731 not additive (as summarized below). On the other hand, programs produced by
    732 our compiler require more stack only at function calls. Therefore the proof
    733 obligations generated to bound the maximum stack size of non-recursive programs are trivial.
    734 Most C programs, and in particular those used in time critical systems, avoid
    735 recursive functions.
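The difference can be summarized as follows: for the sequential composition of
two program fragments $P_1 ; P_2$ we have, roughly,
\[
\mathrm{cost}(P_1; P_2) = \mathrm{cost}(P_1) + \mathrm{cost}(P_2)
\qquad
\mathrm{stack}(P_1; P_2) = \max(\mathrm{stack}(P_1), \mathrm{stack}(P_2))
\]
and reasoning about $\max$ is harder to automate than reasoning about sums.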
    736 
    737 \section{Connection with other deliverables}
    738 \label{subsect.connection.other.deliverables}
    739 
    740 This deliverable represents the final milestone of the CerCo project.
    741 The software delivered links together most of the software already developed
    742 in previous deliverables, and it benefits from the studies performed in
    743 other deliverables. In particular:
    744 
    745 \begin{itemize}
    746  \item The compiler links the code extracted from the executable formal
    747   semantics of C (D3.1), machine code (D4.1), front-end intermediate languages
    748   (D3.3) and back-end intermediate languages (D4.3). The \texttt{-is} flag
    749   of the compiler invokes the semantics of every intermediate
    750   representation of the program to be compiled. The executability of the
    751   C and machine code languages has been important to debug the two
    752   semantics, which are part of the trusted code base of the compiler.
    753   The executability of the intermediate languages has been important during
    754   development for early detection of bugs both in the semantics and in the
    755   compiler passes. They are also useful for users to profile programs in early
    756   compilation stages and detect where the code spends most of its time.
    757   Those semantics, however, are not part of the trusted code base.
    758  \item The compiler links the code extracted from the CIC encoding of the
    759   Front-end (D3.2) and Back-end (D4.2). The two encodings have been partially
    760   proved correct in D3.4 (Front End Correctness Proof) and D4.4 (Back End
    761   Correctness Proof). The formal proofs to be delivered in those deliverables
    762   have not been completed. The most urgent future work after the end of the
    763   project will consist in completing those proofs.
    764  \item Debian Packages have been provided in D6.6 for the Cost-Annotating
    765   Plug-In, the Untrusted CerCo Compiler and the Trusted CerCo Compiler.
    766   The first and the third, installed together, form a full installation of the
    767   Trusted CerCo Prototype. In order to allow interested people to test the
    768   prototype more easily, we also provided in D6.6 a Live CD based on
    769   Debian with the CerCo Packages pre-installed.
    770 \end{itemize}
     125Integration with Frama-C was achieved in D5.1--D5.3.
     126This report gives an overview of the packages that have been provided and the
     127contents of the Live CD.
     128
     129\section{Debian packages}
     130We show here the description of every Debian package made for the
     131software developed in CerCo, as reported by \texttt{apt-cache show}.
     132Only the relevant fields are shown.
     133\begin{enumerate}
     134 \item Package name: \texttt{acc}, Version: 0.2-1\\
     135   \textbf{CerCo's Annotating C Compiler}\\
     136 CerCo's Annotating C Compiler is a C compiler for the 8051 microprocessor.
     137 It produces both an Intel HEX code memory representation and an instrumented
     138 version of the source code. The instrumented source code is a copy of the
     139 original source code augmented with additional instructions to keep track of
     140 the exact number of clock cycles spent by the microprocessor to execute the
     141 program basic blocks. Combined with CerCo's Frama-C Cost Plugin, it allows
     142 the user to certify exact parametric time and stack bounds for the source code.
     143 Limitations: the source code must all be in one .c file and it must not
     144 contain external calls.
     145 \item Package name: \texttt{acc-trusted}, Version: 0.2-1\\
     146   \textbf{CerCo's Trusted Annotating C Compiler}\\
     147 CerCo's Annotating C Compiler is a C compiler for the 8051 microprocessor.
     148 It produces both an Intel HEX code memory representation and an instrumented
     149 version of the source code. The instrumented source code is a copy of the
     150 original source code augmented with additional instructions to keep track of
     151 the exact number of clock cycles spent by the microprocessor to execute the
     152 program basic blocks. Combined with CerCo's Frama-C Cost Plugin, it allows
     153 the user to certify exact parametric time and stack bounds for the source code.
     154 Limitations: the source code must all be in one .c file and it must not
     155 contain external calls.
     156 This version of the compiler has been certified for preservation of the
     157 inferred costs using the interactive theorem prover Matita.
     158 \item Package name: \texttt{frama-c-cost-plugin}, Version: 0.2-1\\
     159   Depends: \texttt{acc | acc-trusted},
     160   Recommends: \texttt{why, why3, cvc3}\\
     161   \textbf{CerCo's Frama-C Cost Plugin}\\
     162 CerCo's Frama-C Cost Plugin has been developed by the CerCo project\\
     163 http://cerco.cs.unibo.it.
     164 It compiles a C source file to 8051 Intel HEX format using either the acc
     165 or the acc-trusted compiler. The compilation also produces an instrumented
     166 copy of the source code, obtained by inserting instructions to keep track of
     167 the number of clock cycles spent by the program on the real hardware. The plugin
     168 implements an invariant generator that computes parametric bounds for the
     169 total program execution time and stack usage. Moreover, the plugin can invoke
     170 the Jessie verification condition generator to produce proof obligations that
     171 certify the computed bounds. The proof obligations may be automatically
     172 discharged using the why3 tool, which invokes automated theorem provers. The cvc3
     173 prover works particularly well with the proof obligations generated by the cost plugin.
     174\end{enumerate}
     175
     176The following additional Debian packages are required by CerCo's Frama-C
     177Cost Plugin. They do not install software developed in CerCo, but their Debian
     178packages have been prepared by CerCo. Among all the theorem provers
     179that can be used in combination with Why3 (and the CerCo Cost Plugin), we
     180have decided to package cvc3 because its licence allows free redistribution
     181and because it is powerful enough to close most proof obligations generated
     182by CerCo on common examples. The user can still manually install and try other
     183provers that will be automatically recognized by why3 just by running
     184\texttt{why3config}. We do not provide any Debian packages for them.
     185
     186\begin{enumerate}
     187 \item Package name: \texttt{why}, Version: 2.31+cerco-1\\
     188   \textbf{Software verification tool}\\
     189 Why aims at being a verification conditions generator (VCG) back-end
     190 for other verification tools. It provides a powerful input language
     191 including higher-order functions, polymorphism, references, arrays and
     192 exceptions. It generates proof obligations for many systems: the proof
     193 assistants Coq, PVS, Isabelle/HOL, HOL 4, HOL Light, Mizar and the
     194 decision procedures Simplify, Alt-Ergo, Yices, CVC Lite and haRVey.
     195 \item Package name: \texttt{why3}, Version: 0.73+cerco-1\\
     196   \textbf{Software verification tool}\\
     197 Why aims at being a verification conditions generator (VCG) back-end
     198 for other verification tools. It provides a powerful input language
     199 including higher-order functions, polymorphism, references, arrays and
     200 exceptions. It generates proof obligations for many systems: the proof
     201 assistants Coq, PVS, Isabelle/HOL, HOL 4, HOL Light, Mizar and the
     202 decision procedures Simplify, Alt-Ergo, Yices, CVC Lite and haRVey.
     203 \item Package name: \texttt{cvc3}, Version: 2.4.1-1\\
     204  \textbf{CVC3 is an automatic theorem prover for Satisfiability Modulo Theories}\\
     205  CVC3 is an automatic theorem prover for Satisfiability Modulo Theories (SMT) problems. It can be used to prove the validity (or, dually, the satisfiability) of first-order formulas in a large number of built-in logical theories and their combination.
     206 CVC3 is the last offspring of a series of popular SMT provers, which originated at Stanford University with the SVC system. In particular, it builds on the code base of CVC Lite, its most recent predecessor. Its high level design follows that of the Sammy prover.
     207 CVC3 works with a version of first-order logic with polymorphic types and has a wide variety of features including:
     208 * several built-in base theories: rational and integer linear arithmetic, arrays, tuples, records, inductive data types, bit vectors, and equality over uninterpreted function symbols;
     209 * support for quantifiers;
     210 * an interactive text-based interface;
     211 * a rich C and C++ API for embedding in other systems;
     212 * proof and model generation abilities;
     213 * predicate subtyping;
     214 * essentially no limit on its use for research or commercial purposes (see license).
     215\end{enumerate}
     216
     217Finally, we have released and packaged a new version of Matita that contains
     218several improvements driven by CerCo, like code extraction.
     219
     220\begin{enumerate}
     221 \item Package name: \texttt{matita}, Version: 0.99.2-1\\
     222   \textbf{Interactive theorem prover}\\
     223 Matita is a graphical interactive theorem prover based on the Calculus of
     224 (Co)Inductive Constructions.
     225\end{enumerate}
     226
     227All Debian packages mentioned above can be downloaded from the CerCo website
     228\texttt{http://cerco.cs.unibo.it} and are compatible with the \emph{testing}
     229release of Debian, soon to be frozen to become the forthcoming \emph{stable}
     230release. The source code to recompile the Debian packages from scratch is part
     231of the CerCo repository and can also be downloaded from the website.
     232
     233\section{Live CD}
     234
     235We also provide a Live CD to easily test all the software implemented in CerCo
     236without having to recompile the software by hand or install a Debian
     237distribution. The Live CD can be downloaded from the CerCo website
     238\texttt{http://cerco.cs.unibo.it}. It is a Live Debian CD that can either be
     239burnt onto a physical disc and used to boot a real machine, or be used to
     240bootstrap a virtual machine using any virtualizer, such as \texttt{virtualbox}
     241or \texttt{qemu}.
     242
     243The Live CD contains an X server and a standard GNOME environment, and it
     244connects automatically to any available network. The virtualbox guest tools
     245have also been installed to facilitate communication between the host and
     246the virtual machine, e.g. for file transfer.
     247
     248The Live CD also comes with all the Debian packages described in the
     249previous section pre-installed. The user can simply boot the system and invoke the
     250\texttt{cerco} executable to compile any compatible C file to the Intel HEX
     251format for the 8051. The executable also computes the cost model for the
     252source program, its cost invariants and tries to certify them using the
     253cvc3 theorem prover. In case of failure, the flag \texttt{-ide} can be used
     254to open an interactive why3 session to experiment with different theorem
     255prover flags (e.g. increasing the prover timeout).
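A typical session on the Live CD might therefore look as follows (command
names as given above; the precise invocation syntax beyond \texttt{-ide} is an
assumption here):
\begin{verbatim}
$ cerco test.c        # compile, compute the cost model and the
                      # invariants, and try to certify them with cvc3
$ cerco -ide test.c   # on failure, open an interactive why3 session
\end{verbatim}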
     256
     257Further information on the \texttt{cerco} executable can be found in
     258Deliverable D5.2.
     259
     260Finally, the Live CD also includes a pre-installed copy of the interactive
     261theorem prover Matita and a snapshot of the compiler certification. The snapshot
     262can be found in the user's home directory, under \texttt{cerco/src}. Matita
     263files can be opened by running \texttt{matita}. The other
     264directory, \texttt{cerco/driver}, contains the extracted code together with
     265the untrusted code that forms the CerCo Trusted Compiler.
    771266
    772267\end{document}