Changeset 3243 for Deliverables/D6.6

Timestamp: Apr 30, 2013, 11:34:54 PM
Files: 1 edited

Legend:
- Unmodified
- Added
- Removed
Deliverables/D6.6/report.tex
(r3234 → r3243)

  \begin{LARGE}
  \textbf{
Removed:
  Report n. D5.2\\
  Trusted CerCo Prototype}
Added:
  Report n. D6.6\\
  Packages for Linux distributions and Live CD}
  \end{LARGE}
  \end{center}
…
  \begin{large}
  Main Authors:\\
Removed:
  XXXX %Dominic P. Mulligan and Claudio Sacerdoti Coen
Added:
  Claudio Sacerdoti Coen
  \end{large}
  \end{center}
…
  \vspace*{7cm}
  \paragraph{Abstract}
Removed:
  The Trusted CerCo Prototype is meant to be the last piece of software
  produced in the CerCo project. It consists of:
  \begin{enumerate}
  \item A compiler from a large subset of C to the Intel HEX format for the
    object code of the 8051 microprocessor family. The compiler also computes
    the cost models for the time spent in basic blocks and the stack space
    used by the functions in the source code. The cost models are exposed to
    the user by producing an instrumented C program, obtained by injecting
    into the original source code some instructions that update three global
    variables recording the current clock, the current stack usage and the
    maximum stack usage so far.
  \item A plug-in for the Frama-C software analyser architecture. The plug-in
    takes a C file in input, compiles it with the CerCo compiler and then
    automatically infers cost invariants for every loop and every function in
    the source code. The invariants can be fed by Frama-C to various theorem
    provers to verify them automatically. Those that are not automatically
    verified can be checked manually by the user with a theorem prover.
  \end{enumerate}

  The plug-in is fully functional and does not need to be formally verified,
  because it does not belong to the trusted code base of programs verified
  using CerCo's technology: a bug in the plug-in can only infer wrong
  time/space invariants and bounds, which will then be rejected by the
  automatic provers.

  The compiler is currently fully functional, even if not fully certified
  yet.
  The certification will continue after the end of the project.

  In this deliverable we discuss the status of the Trusted CerCo Prototype
  at the time of the official end of the project.

Added:
  We provide Debian packages for all the software developed in CerCo and for
  all its software dependencies. The Debian packages are for the forthcoming
  stable Debian distribution. Moreover, we provide a Live CD, also based on
  Debian, that contains a pre-installed version of all the packages and a
  copy of the formalization.

  \newpage
…
  \label{sect.task}

Removed:
  The Grant Agreement describes deliverable D5.2 as follows:
Added:
  The Grant Agreement describes deliverable D6.6 as follows:

  \begin{quotation}
Removed:
  \textbf{Trusted CerCo Prototype}: Final, fully trustable version of the
  system.
Added:
  \textbf{Packages for Linux distributions and Live CD}:
  In order to foster adoption of the CerCo compiler in a wider community, we
  will provide packages for selected Linux distributions and a Live CD with
  the software developed in the project. We will also consider and discuss
  the integration of our software in an extensible platform dedicated to
  source-code analysis of C software, like Frama-C. However, at the moment it
  is unclear how these rapidly changing platforms will evolve in the next two
  years, if integration would provide an added value and if it would be
  possible to practically achieve the integration in the project timeframe
  without an actual involvement of the platform developers (that would need
  to contribute man-power to the task).

  \end{quotation}

Removed:
  This report describes the state of the implementation of the Trusted CerCo
  Prototype at the official end of the project.
Removed:

\section{The Trusted CerCo Compiler}
\subsection{Command line interface}
The Trusted CerCo Compiler is the first component of the Trusted CerCo
Prototype, the second being the Frama-C plug-in already developed in D5.1
and not modified here. The Trusted CerCo Compiler replaces the Untrusted
CerCo Compiler already delivered in D5.1 as part of the Untrusted CerCo
Prototype. The trusted and untrusted compilers are meant to provide the same
command line interface, so that they can be easily swapped without affecting
the plug-in. Actually, the two compilers share the following minimal subset
of options, which is sufficient for the plug-in:

\begin{verbatim}
Usage: acc [options] file...
  -a    Add cost annotations on the source code.
  -is   Outputs and interprets all the compilation passes,
        showing the execution traces
  -o    Prefix of the output files.
\end{verbatim}

\begin{figure}[t]
\begin{verbatim}
char a[] = {3, 2, 7, -4};
char treshold = 4;

int main() {
  char j;
  char *p = a;
  int found = 0;
  for (j=0; j < 4; j++) {
    if (*p <= treshold) { found++; }
    p++;
  }
  return found;
}
\end{verbatim}
\caption{A simple C program~\label{test1}}
\end{figure}

\begin{figure}[!p]
\begin{verbatim}
int __cost = 33;

int __stack_size = 5, __stack_size_max = 5;

void __cost_incr(int incr) {
  __cost = __cost + incr;
}

void __stack_size_incr(int incr) {
  __stack_size = __stack_size + incr;
  __stack_size_max = __stack_size_max < __stack_size ?
                     __stack_size : __stack_size_max;
}

unsigned char a[4] = { 3, 2, 7, 252, };

unsigned char treshold = 4;

int main(void)
{
  unsigned char j;
  unsigned char *p;
  int found;
  __stack_size_incr(0);
  __cost_incr(91);
  p = a;
  found = 0;
  for (j = (unsigned char)0; (int)j < 4; j = (unsigned char)((int)j + 1)) {
    __cost_incr(76);
    if ((int)*p <= (int)treshold) {
      __cost_incr(144);
      found = found + 1;
    } else {
      __cost_incr(122);
      /*skip*/;
    }
    p = p + 1;
  }
  __cost_incr(37);
  /*skip*/;
  __stack_size_incr(-0);
  return found;
  __stack_size_incr(-0);
}
\end{verbatim}
\caption{The instrumented version of the program in Figure~\ref{test1}\label{itest1}}
\end{figure}

Suppose the file \texttt{test.c} contains the code presented in
Figure~\ref{test1}. By calling \texttt{acc -a test.c}, the user obtains the
following files:
\begin{itemize}
\item \texttt{test-instrumented.c} (Figure~\ref{itest1}): the file is a copy
  of \texttt{test.c} obtained by adding code that keeps track of the amount
  of time and space used by the source code.

  The global variable \texttt{\_\_cost} records the number of clock cycles
  used by the microprocessor. Its initial value may not be zero because we
  emit object code to initialize the global variables of the program. The
  initialization code is not part of \texttt{main} (to allow \texttt{main}
  to be recursive, even if the standard is not clear on whether this should
  be allowed). The initial value of \texttt{\_\_cost} is the time spent in
  the initialization code. The \texttt{\_\_cost} variable is incremented at
  the beginning of every basic block using the \texttt{\_\_cost\_incr}
  function, whose argument is the number of clock cycles that will be spent
  by the basic block that is beginning.
  Therefore the value stored in the variable is always a safe upper bound of
  the actual time. Moreover, the difference between the values of the
  \texttt{\_\_cost} variable at the beginning and at the end of a function
  is also exact. The latter statement --- with several technical
  complications --- is the one that will eventually be certified in Matita.

  The global variables \texttt{\_\_stack\_size} and
  \texttt{\_\_stack\_size\_max} are handled similarly to \texttt{\_\_cost}.
  Their values represent the amount of external data memory currently in use
  and the maximum amount of memory used so far by the program. The two
  values are updated by the \texttt{\_\_stack\_size\_incr} function at the
  beginning of every function body, at its end and just before every
  \texttt{return}. A negative increment is used to represent stack release.
  The initial value of the two variables may not be zero because some
  external data memory is used for global and static variables. Moreover,
  the last byte of external data memory may not be addressable in the 8051,
  so we avoid using it. The compiled code runs correctly only until stack
  overflow, which happens when the value of \texttt{\_\_stack\_size} reaches
  $2^{16}$. It is up to the user to state and maintain the invariant that no
  stack overflow occurs; in case of stack overflow, the compiled code no
  longer simulates the source code. Automatic provers are often able to
  prove the invariant in simple cases.

  In order to instrument the code, all basic blocks that are hidden in the
  source code but that will contain code in the object code need to be made
  manifest. In particular, \texttt{if-then} instructions without an explicit
  \texttt{else} are given one (compare Figures~\ref{test1} and
  \ref{itest1}).
\item \texttt{test.hex}: the file is an Intel HEX representation of the
  object code for an 8051 microprocessor.
  The file can be loaded in any 8051 emulator (like the MCU 8051 IDE) for
  running or disassembling it. Moreover, an executable semantics (an
  emulator) for the 8051 is linked with the CerCo compilers and can be used
  to run the object code at the end of the compilation.
\item \texttt{test.cerco} and \texttt{test.stack\_cerco}: these two files
  are used to pass information from the CerCo compilers to the Frama-C
  plug-in and are not interesting to the user. They just list the global
  variables and increment functions inserted by the compiler and used later
  by the plug-in to compute the cost invariants.
\end{itemize}

When the option \texttt{-is} is used, the compilers run the program after
every intermediate compilation pass. The untrusted compiler only outputs the
trace of (cost) observables and the final value returned by \texttt{main};
the trusted compiler also observes every function call and return (the
stack-usage observables) and therefore allows the user to track program
execution more closely. The traces observed after every pass should always
be equal, up to the initial observable that corresponds to the
initialization of global variables. That observable is currently only
emitted in back-end languages. The preservation of the traces will actually
be granted by the main theorem of the formalization (the forward simulation
theorem) once the proof is completed.

The compilers also create a file for each pass with the intermediate code.
However, the syntaxes used by the two compilers for the intermediate passes
are not the same, and we do not yet output any intermediate syntax for the
assembly code. The latter can be obtained by disassembling the object code,
e.g. using the MCU 8051 IDE.
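The trace comparison performed under \texttt{-is} can be illustrated by a small OCaml sketch. This is our illustration, not part of the compiler; in particular the name of the initialization observable is hypothetical. Two traces agree when they are equal after dropping a leading initialization observable, if present.

```ocaml
(* Sketch (not CerCo code): traces are lists of observed cost labels.
   A later pass may prepend one extra observable for the initialization
   of global variables, which must be ignored in the comparison. *)
type observable = Label of string

(* Hypothetical name for the initialization observable. *)
let is_init = function Label l -> l = "__init_globals"

(* Drop a leading initialization observable, if any. *)
let strip_init = function
  | o :: rest when is_init o -> rest
  | trace -> trace

(* Two traces agree up to the initialization observable. *)
let traces_agree t1 t2 = strip_init t1 = strip_init t2
```

For instance, `traces_agree [Label "__init_globals"; Label "k1"] [Label "k1"]` holds, while two traces emitting different cost labels do not agree.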
\subsection{Code structure}
The code of the trusted compiler is made of three different parts:
\begin{itemize}
\item Code extracted to OCaml from the formalization of the compiler in
  Matita. This is the code that will eventually be fully certified using
  Matita. It is fully contained in the \texttt{extracted} directory (not
  comprising the subdirectory \texttt{unstrusted}). It translates the source
  C-light code to:
  \begin{itemize}
  \item An instrumented version of the same C-light code. The instrumented
    version is obtained by inserting cost emission statements~\texttt{COST k}
    at the beginning of basic blocks. The emitted labels are the ones that
    are observed when calling \texttt{acc -is}. They will be replaced in the
    final instrumented code by increments of the \texttt{\_\_cost} variable.
  \item The object code for the 8051 as a list of bytes to be loaded in code
    memory. The code is coupled with a trie over bitvectors that maps
    program counters to cost labels \texttt{k}. The intended semantics is
    that, when the current program counter is associated to \texttt{k}, the
    observable label \texttt{k} should be emitted during the execution of
    the next object code instruction. Similarly, a second trie, reminiscent
    of a symbol table, maps program counters to function names, to emit the
    observables associated to stack usage.
  \end{itemize}
  During the ERTL to LTL pass that deals with register allocation, the
  source code calls two untrusted components that we are now going to
  describe.
\item Untrusted code called by the trusted compiler (directory
  \texttt{extracted/unstrusted}).
  The two main untrusted components in this directory are:
  \begin{itemize}
  \item The \texttt{Fix.ml} module by Fran\c{c}ois Pottier, which provides a
    generic algorithm to compute the least solution of a system of monotonic
    equations, described in the paper \cite{LazyLeastFixedPointsInML}. The
    trusted code uses this piece of code to perform liveness analysis and
    only assumes that the module computes a pre-fixpoint of the system of
    equations provided. The performance of this piece of code is critical to
    the compiler, and the implementation used exploits some clever
    programming techniques and imperative data structures that would be
    difficult to prove correct. Checking whether the result of the
    computation actually is a pre-fixpoint is instead very simple to do
    using only functional code of low computational complexity. Therefore we
    do not plan to prove this piece of code correct. Instead, we will
    certify a verifier for the results of the computation.

    Note: this module, as well as the next one, has been reused from the
    untrusted compiler. Therefore both have been thoroughly tested for bugs
    during the last year of the project.
  \item The \texttt{coloring.ml} module, taken from Fran\c{c}ois Pottier's
    PP compiler, used in an undergraduate compilers course. The module takes
    an interference graph (a data structure implemented in module
    \texttt{untrusted\_interference.ml}) and colors it, assigning to each
    node of the interference graph either a color or the constant
    \texttt{Spilled}. A heuristic is used to drive the algorithm: the caller
    must provide a function that returns the number of times a register is
    accessed, to decrease its likelihood of being spilled. To minimise bugs
    and reduce code size, we actually implement the heuristics in Matita and
    then extract the code. Therefore the untrusted code also calls back
    extracted code for untrusted computations.
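The verification strategy adopted for \texttt{Fix.ml} --- checking a posteriori that the returned solution is a pre-fixpoint --- can be sketched in OCaml as follows. This is an illustrative fragment, not CerCo code: the representation of equation systems (variables as integers, values as string sets) is our assumption.

```ocaml
(* Sketch: instead of proving the fixpoint solver correct, verify that
   the solution it returned is a pre-fixpoint of the system. *)
module S = Set.Make (String)

(* A system of equations: the set of variables, and for each variable i
   its right-hand side, evaluated against a candidate solution. *)
type system = { vars : int list; rhs : (int -> S.t) -> int -> S.t }

(* [is_prefixpoint sys sol] checks rhs(sol)(i) <= sol(i) for every
   variable i: a cheap, purely functional check, in contrast with the
   imperative machinery needed to *find* the solution. *)
let is_prefixpoint sys sol =
  List.for_all (fun i -> S.subset (sys.rhs sol i) (sol i)) sys.vars
```

A certified checker of this shape only needs the monotonicity of the right-hand sides to conclude that any pre-fixpoint is a sound (over-)approximation of the least solution used by liveness analysis.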
    Finally, module \texttt{compute\_colouring.ml} is the one that actually
    computes the colouring of an interference graph of registers, given the
    result of the liveness analysis. The code first creates the interference
    graph; then it colours the graph once, using real registers as colours,
    to determine which pseudo-registers need spilling; finally it colours
    the graph again, using real registers and spilling slots as colours, to
    assign to each pseudo-register either a spilling slot or a real
    register.
  \end{itemize}
  The directory also contains a few more files that provide glue code
  between OCaml's data structures from the standard library (e.g. booleans
  and lists) and the corresponding data structures extracted from Matita.
  The glue is necessary to translate results back and forth between the
  trusted (extracted) and untrusted components, because the latter have been
  written using data structures from the standard library of OCaml.
\item Untrusted code that drives the trusted compiler (directory
  \texttt{driver}). The directory contains the \texttt{acc.ml} module that
  implements the command line interface of the trusted compiler. It also
  contains or links three untrusted components which are safety critical and
  which enter the trusted code base of CerCo:
  \begin{itemize}
  \item The CIL parser~\cite{CIL} that parses standard C code and simplifies
    it to C-light code.
  \item A pretty-printer for C-light code, used to print the
    \texttt{-instrumented.c} file.
  \item The instrumenter module (integrated with the pretty-printer for
    C-light code). It takes the output of the compiler and replaces every
    statement \texttt{COST k} (that emits the cost label \texttt{k}) with
    \texttt{\_\_cost\_incr(n)}, where $n$ is the cost associated to
    \texttt{k} in the map returned by the compiler. A similar operation is
    performed to update the stack-related global variables.
    Eventually this module needs to be certified with the following
    specification: if a run of the labelled program produces the trace
    $\tau = k_1 \ldots k_n$, the equivalent run of the instrumented program
    should yield the same result and be such that, at the end, the value of
    the \texttt{\_\_cost} variable is equal to $\Sigma_{k \in \tau} K(k)$,
    where $K(k)$ is the lookup of the cost of $k$ in the cost map $K$
    returned by the compiler. A similar statement is expected for stack
    usage.
  \end{itemize}
\end{itemize}

\subsection{Functional differences w.r.t. the untrusted compiler}
From the user's perspective, the trusted and untrusted compilers differ in
several respects:
\begin{itemize}
\item The untrusted compiler puts the initialization code for global
  variables at the beginning of \texttt{main}; the trusted compiler uses a
  separate function. Therefore only the trusted compiler allows recursive
  calls to \texttt{main}. To pay for the initialization code, the
  \texttt{\_\_cost} variable is not always initialized to 0 by the trusted
  compiler, while it is always 0 in the untrusted one.
\item The two compilers do not compile the code in exactly the same way,
  even if they adopt the same compilation scheme. Therefore the object code
  produced is different and the control blocks are given different costs. On
  average, the code generated by the trusted compiler is about 3 times
  faster and may use less stack space.
\item The trusted compiler is slightly slower than the untrusted one, and
  the trusted executable semantics are also slightly slower than the
  untrusted ones. The only passes that are significantly slower at the
  moment are the policy computation pass, which is a preliminary to the
  assembly, and the assembly pass itself. These are the passes that
  manipulate the largest quantity of data, because assembly programs are
  much larger than the corresponding Clight sources.
  The reason for the slowness is currently under investigation. It is likely
  to be due to the quality of the extracted code (see
  Section~\ref{quality}).
\item The back-ends of both compilers do not handle external functions,
  because we have not implemented a linker. The trusted compiler fails
  during compilation, while the untrusted compiler silently turns every
  external function call into a \texttt{NOP}.
\item The untrusted compiler had ad-hoc options to deal with C files
  generated by a Lustre compiler. The ad-hoc code simplified the C files by
  avoiding some calls to external functions, and it added some debugging
  code to print the actual reaction time of every Lustre node. The trusted
  compiler does not implement any ad-hoc Lustre option yet.
\end{itemize}

\subsection{Implementation differences w.r.t. the untrusted compiler}
The code of the trusted compiler differs greatly from the code of the
untrusted prototype. The main architectural difference is the representation
of back-end languages. In the trusted compiler we have adopted a unified
syntax (data structure), semantics and pretty-printing for every back-end
language. In order to accommodate the differences between the original
languages, the syntax and semantics have been abstracted over a number of
parameters, like the type of arguments of the instructions. For example,
early languages use pseudo-registers to hold data, while late languages
store data in real machine registers or stack locations. The unification of
the languages has brought a few benefits and can potentially bring new ones
in the future:
\begin{itemize}
\item Code size reduction and faster detection and correction of bugs.

  About three quarters of the code for the semantics and pretty-printing of
  back-end languages is shared, while one quarter is pass-dependent.
  Sharing the semantics automatically implies sharing semantic properties,
  i.e. reducing to one sixth the number of lemmas to be proved (6 being the
  number of back-end intermediate languages). Moreover, several back-end
  passes --- a pass between two alternative semantics for RTL, the RTL to
  ERTL pass and the ERTL to LTL pass --- transform a graph instance of the
  generic back-end intermediate language into another graph instance. The
  graph-to-graph transformation has also been generalized and parameterized
  over the pass-specific details. While the code saved in this way is not
  much, several significant lemmas are provided once and for all on the
  transformation. At the time this deliverable was written, the generalized
  languages, semantics, transformations and related properties took about
  3,900 lines of Matita code (definitions and lemmas).

  We also benefit from the typical advantage of code sharing over
  cut\&paste: once a bug is found and fixed, the fix immediately applies to
  every instance. This becomes particularly significant for code
  certification, where one simple bug fix usually requires complex work to
  fix the related correctness proofs.

\item Some passes and several proofs can be given in greater generality,
  allowing more reuse.

  For example, in the untrusted prototype the LTL to LIN pass turned a graph
  language into a linearized language with the very same instructions and a
  similar semantics. In particular, the two semantics shared the same
  execute phase, while fetching was different. The pass consisted in
  performing a visit of the graph, emitting the instructions in visit order.
  When the visit detected a cycle, the back-link arc was represented by a
  newly introduced \texttt{GOTO} statement.

  Obviously, the transformation just described could be applied as well to
  any language with a \texttt{GOTO} statement.
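The linearization scheme just described can be sketched in OCaml as follows. This is an illustrative fragment, not the actual pass: the instruction representation is simplified to statements with a single optional successor, and instructions are emitted as strings.

```ocaml
(* Sketch: linearize a control graph by visiting it from the entry label,
   emitting instructions in visit order; a branch back to an
   already-emitted label is materialized as an explicit goto. *)
type graph_stmt = Instr of string * int option  (* payload, successor *)

(* [linearize nodes entry] flattens an association list of
   (label, statement) pairs starting from [entry]. *)
let linearize nodes entry =
  let tbl = Hashtbl.create 17 in
  List.iter (fun (l, s) -> Hashtbl.replace tbl l s) nodes;
  let emitted = Hashtbl.create 17 in
  let rec visit l acc =
    if Hashtbl.mem emitted l then
      (* back-link detected: introduce an explicit goto *)
      List.rev (Printf.sprintf "goto L%d" l :: acc)
    else begin
      Hashtbl.replace emitted l ();
      match Hashtbl.find_opt tbl l with
      | None -> List.rev acc
      | Some (Instr (payload, succ)) ->
          let acc = Printf.sprintf "L%d: %s" l payload :: acc in
          (match succ with
           | None -> List.rev acc
           | Some l' -> visit l' acc)
    end
  in
  visit entry []
```

On a two-node loop, the visit emits both instructions once and closes the cycle with a single `goto` back to the loop header, exactly as in the LTL to LIN pass described above.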
  In our formalization in Matita, we have rewritten and proved correct the
  translation between any two instances of the generalized back-end
  languages such that: 1) the fetching-related parameters of the two passes
  are instantiated respectively with graph and linear operations; 2) the
  execute-related parameters are shared.

  Obviously, we could also prove correct the reverse translation, from a
  linear to a graph-based version of the same language. The two combined
  passes would allow us to switch temporarily to a graph-based
  representation only when a data-flow analysis over the code is required.
  In our compiler, for example, at the moment only the RTL to ERTL pass is
  based on a data-flow analysis. A similar pair of translations could also
  be provided between one of the two representations and a static single
  assignment (SSA) one. As a final observation, the insertion of another
  graph-based language after the LTL one is now made easy: the linearization
  pass need not be redone for the new pass.

\item Pass commutation and reuse.

  Every pass is responsible for reducing a difference between the source and
  target languages. For example, the RTL to ERTL pass is responsible for the
  parameter passing conversion, while the ERTL to LTL pass performs
  pseudo-register allocation. At the moment, both CompCert and CerCo fix the
  relative order of the two passes in opposite and partially arbitrary ways,
  and it is not possible to simply switch the two passes. We believe instead
  that it should be possible to generalize most passes in such a way that
  they could be freely composed in any order, also with repetitions. For
  example, real-world compilers like GCC perform some optimizations, like
  constant propagation, multiple times: after every optimization that is
  likely to trigger more constant propagation.
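A single constant-propagation rewrite of the kind mentioned can be sketched as follows. This is an illustrative OCaml fragment over a toy two-instruction language, not CerCo's generalized intermediate language.

```ocaml
(* Sketch: one constant-propagation sweep over a straight-line fragment
   of a register-based language. Known constants are accumulated left to
   right; known operands are replaced and constant additions are folded. *)
type operand = Reg of string | Imm of int
type instr =
  | Load of string * operand            (* r := op *)
  | Add of string * operand * operand   (* r := op1 + op2 *)

let propagate prog =
  let subst env = function
    | Reg r when List.mem_assoc r env -> Imm (List.assoc r env)
    | op -> op
  in
  let step (env, acc) i =
    match i with
    | Load (r, op) ->
        (match subst env op with
         | Imm n -> ((r, n) :: List.remove_assoc r env, Load (r, Imm n) :: acc)
         | op' -> (List.remove_assoc r env, Load (r, op') :: acc))
    | Add (r, a, b) ->
        (match subst env a, subst env b with
         | Imm m, Imm n ->
             ((r, m + n) :: List.remove_assoc r env,
              Load (r, Imm (m + n)) :: acc)
         | a', b' -> (List.remove_assoc r env, Add (r, a', b') :: acc))
  in
  List.rev (snd (List.fold_left step ([], []) prog))
```

Because the sweep is a pure function from programs to programs, it can be applied repeatedly, after any pass that may expose new constants, which is precisely the kind of free composition argued for above.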
  Thanks to our generalized intermediate language, we can already implement
  a generic constant propagation pass that can be freely applied.
% For instance, the table below shows the amount of Matita code taken by
% the description of the syntax, semantics and pretty-printing rules for
% the back-end passes. It also compares the size with the untrusted C code.
% \begin{tabular}[h]{lrr}
%         & Untrusted code & Matita code \\
% RTL     & 615            & 456  \\
% ERTL    & 891            & 342  \\
% LIN     & 568            & 79   \\
% LTL     & 636            & 94   \\
% LTL/LIN &                & 466  \\
% joint   &                & 1615 \\ \hline
% Tot.    & 2,710          & 3,052 \\
% \end{tabular}
% The table shows that the sizes of all untrusted code languages are
% comparable.
\end{itemize}

\subsection{Quality of the extracted code}\label{quality}

We have extracted the Matita code of the compiler to OCaml in order to
compile and execute it efficiently and without any need to install Matita.
The implementation of code extraction for Matita has been obtained by
generalizing that of Coq over the data structures of Coq, and then
instantiating the resulting code for Matita. Differences between the two
calculi have also been taken into account during the generalization.
Therefore we can expect the extraction procedure to be reasonably bug-free,
because bugs in the core of the code extraction would be likely to be
detected in Coq as well.

The quality of the extracted code ranges from sub-optimal to poor, depending
on the part of the formalization. We analyse here the causes of the poor
quality:
\begin{itemize}
\item Useless computations. The extraction procedure removes from the
  extracted code almost all of the non-computational parts, replacing the
  ones that are not removed with code of low complexity.
  However, following Coq's tradition, detection of the useless parts is not
  done according to the computationally expensive algorithm by
  Berardi~\cite{berardixxx}. Instead, the user decides which data structures
  should be assigned computational interest by declaring them in one of the
  \texttt{Type\_i} sorts of the Calculus of (Co)Inductive Constructions. The
  non-computational structures are declared in \texttt{Prop}, the sort of
  impredicative, possibly classical propositions. Every computation that
  produces a data structure in \texttt{Prop} is guaranteed to be
  computationally irrelevant. Computations that produce data structures in
  \texttt{Type\_i}, instead, may actually be relevant or irrelevant, even if
  the extraction code conservatively considers them relevant. The result is
  extracted OCaml code that computes values that will be passed to functions
  that do not use them, or that will be returned to callers that ignore
  them.

  The phenomenon is particularly visible when the dependently typed
  discipline is employed, which is our choice for CerCo. Under this
  discipline, terms can be passed to type formers. For example, the data
  type for back-end languages in CerCo is parameterized over the list of
  global variables that may be referred to by the code. Another example is
  the type of vectors, which is parameterized over a natural number that is
  the size of the vector, or the type of vector tries, which is
  parameterized over the fixed height of the tree and which can be read and
  updated only using keys (vectors of bits) whose length is the height of
  the trie. Functions that compute these dependent types also have to
  compute the new indexes (parameters) for the types, even if this
  information is used only for typing. For example, appending two vectors
  requires the computation of the length of the result vector just to type
  the result.
  In turn, this computation requires the lengths of the two input vectors.
  Therefore, functions that call append have to compute the lengths of the
  vectors to append, even if the lengths will actually be ignored by the
  extracted code of the append function.

  In the literature there are proposals for allowing the user to distinguish
  computational from non-computational arguments of functions more
  accurately. The proposals introduce two different types of
  $\lambda$-abstraction and ad-hoc typing rules to ensure that
  computationally irrelevant bound variables are not used in computationally
  relevant positions. An OCaml prototype that implements these ideas for Coq
  is available~\cite{xxyy}, but it is heavily affected by bugs. We have not
  yet tried to do anything along these lines in Matita. To avoid modifying
  the system, another approach, based on the explicit use of a
  non-computational monad, has also been proposed in the literature, but it
  introduces many complications in the formalization and it cannot be used
  in every situation.

  Improving the code extraction of Matita to remove irrelevant code more
  aggressively is left as future work. At the moment, it seems that useless
  computations are indeed responsible for the poor performance of some parts
  of the extracted code. We have experimented with a few manual
  optimizations of the extracted code, and we saw that a few minor patches
  already allow a 25\% speed-up of the assembly pass. The code released with
  this deliverable is the one without the manual patches, to maximize
  reliability.
\item Poor typing. A nice theoretical result is that the terms of the
  Calculus of Constructions (CoC), the upper-right corner of Barendregt's
  cube, can be re-typed in System $F_\omega$, the corresponding typed lambda
  calculus without dependent types~\cite{christinexx}.
  The calculi implemented by Coq and Matita, however, are more expressive
  than CoC, and several type constructions have no counterpart in System
  $F_\omega$. Moreover, core OCaml does not even implement $F_\omega$, but
  only the Hindley-Milner fragment of it. Therefore, in all the situations
  listed below, code extraction is not able to type the extracted code using
  informative types; it uses instead the super-type \texttt{Obj.magic} of
  OCaml --- abbreviated \texttt{\_\_} in the extracted code. The lack of
  more precise typing has a very limited impact on the performance of the
  compiler's OCaml code, but it makes it very hard to plug the extracted
  code together with untrusted code. The latter needs to introduce explicit
  unsafe casts between the super-type and the concrete types used by
  instances of the generic functions, and code written in this way is very
  error prone. For this reason we have decided to write parts of the
  untrusted code of the CerCo compiler (e.g. the pretty-printing functions)
  in Matita as well, in order to benefit from the type checker of Matita.

  The exact situations that trigger uses of the super-type are:
  \begin{enumerate}
  \item The calculi feature a cumulative hierarchy of universes that allows
    one to write functions that can be used both as term formers and as
    type formers, according to the arguments that are passed to them. In
    System $F_\omega$, instead, terms and types are syntactically distinct.
    Extracting terms according to all their possible uses may be
    impractical, because the number of uses is exponential in the number of
    arguments of sort $Type_i$ with $i \geq 2$.
  \item Case analysis and recursion over inhabitants of primitive inductive
    types can be used in types (strong elimination), which is not allowed
    in $F_\omega$.
In the CerCo compiler we largely exploit type formers declared
in this way, for example to provide the same level of type safety achieved
in the untrusted compiler via polymorphic variants~\cite{guarriguesxx}.
In particular, we have terms that syntactically describe, as first-class
citizens, the large number of combinations of operand modes of object code
instructions. On the instructions we provide ``generic'' functions that work
on some combinations of the operand modes, and whose type is computed by
primitive recursion on the syntactic description of the operand modes of
the argument of the function.
\item $\Sigma$-types and, more generally, dependently typed records can have
at the same time fields that are type declarations and fields that are
terms. This situation happens all the time in CerCo because we stick
to the dependently typed discipline and because we often generalize our
data structures over the types used in them. Concretely, the generalization
happens over a record containing a type --- e.g. the type of
(pseudo-)registers for back-end languages --- some terms working on the
type --- e.g. functions to set/get values from (pseudo-)registers --- and
properties over them. In System $F_\omega$ term and type abstractions
are kept syntactically separate and there is no way to pack them in
records.
\item The type of the extracted function does not belong to the Hindley-Milner
fragment. Sometimes simple code transformations could be used to make the
function typeable, but the increased complexity of code extraction would
outweigh the benefits.
\end{enumerate}
\end{itemize}

We should note that other projects, in particular CompCert, decided from
the beginning to avoid dependent types in order to guarantee high quality of
the extracted code and maximal control over it.
Therefore, at the current state of the art of
code extraction, there seems to be a huge trade-off between extracted code
quality and exploitation of advanced typing and proving techniques in the
source code. In the near future, the code base of CerCo can provide an
interesting example of a large formalization based on dependent types and
in need of high quality extracted code. Therefore we plan to use it as a
driver and test bench for future work on the improvement of code extraction.
In particular, we are planning to study the following improvements to the
code extraction of Matita:
\begin{itemize}
\item We already have a prototype that extracts code from Matita to GHC
Haskell, exploiting several GHC extensions that support a very large subset
of System $F_\omega$. However, the prototype is not fully functional yet
because we still need to solve, at the theoretical level, a problem of
interaction between $F_\omega$ types and strong elimination. Roughly
speaking, the two branches of a strong elimination always admit a most
general unifier in Hindley-Milner plus the super-type, but the same
property is lost for $F_\omega$. As a consequence, we lose modularity in
code extraction and we need to devise static analysis techniques to reduce
the impact of this loss of modularity.
\item The two most recent versions of OCaml have introduced first-class
modules, which are exactly the feature needed to extract code that uses
records containing both type and term declarations. However, the syntax
required for first-class modules is extremely cumbersome: it requires
the explicit introduction of type expressions to make manifest the type
declaration/definition fields of the module. This complicates code
extraction with the need to perform some computations at extraction time,
which is not in the tradition of code extraction.
Moreover, the actual
performance of OCaml code that uses first-class modules heavily is unknown
to us. We plan to experiment with first-class modules for extraction very
soon.
\item Algebraic data types are generalized to families of algebraic data
types in the calculi implemented by Coq and Matita, even if the two
generalizations are different. Families of algebraic data types are
traditionally not supported by programming languages, but some
restrictions have recently been considered under the name of
Generalized Algebraic Data Types (GADTs) and are now implemented in
recent versions of OCaml and GHC. A future work is the investigation of
the possibility of exploiting GADTs during code extraction.
\end{itemize}

\section{The Cost-Annotating Plug-In}

The functionalities of the Cost-Annotating Plug-In have already been described
in Deliverables~D5.1 and D5.3.
The plug-in interfaces with the CerCo compiler via
the command line. Therefore there was no need to update the plug-in code for
integration in the Trusted CerCo Prototype. It is actually possible
to install the untrusted and the trusted compilers at the same time. The
\texttt{-cost-acc} flag of the plug-in can be used to select the executable
to be used for compilation.

The code of the plug-in has been modified w.r.t. D5.1 in order to also take
care of the cost model for stack-size consumption. From the user's point of
view, time and space cost annotations and invariants are handled in a similar
way. Nevertheless, we expect automated theorem provers to face more
difficulties in dealing with stack usage because elapsed time is additive,
whereas what is interesting for space usage is the maximum amount of stack
used, which is not additive. On the other hand, programs produced by our
compiler require more stack only at function calls.
Therefore the proof obligations generated
to bound the maximum stack size of non-recursive programs are trivial.
Most C programs, and in particular those used in time-critical systems, avoid
recursive functions.

\section{Connection with other deliverables}
\label{subsect.connection.other.deliverables}

This deliverable represents the final milestone of the CerCo project.
The software delivered links together most of the software already developed
in previous deliverables, and it benefits from the studies performed in
other deliverables. In particular:

\begin{itemize}
\item The compiler links the code extracted from the executable formal
semantics of C (D3.1), machine code (D4.1), front-end intermediate languages
(D3.3) and back-end intermediate languages (D4.3). The \texttt{-is} flag
of the compiler invokes the semantics of every intermediate
representation of the program to be compiled. The executability of the
C and machine code semantics has been important to debug these two
semantics, which are part of the trusted code base of the compiler.
The executability of the intermediate languages has been important during
development for early detection of bugs both in the semantics and in the
compiler passes. These semantics are also useful to users, who can profile
programs in early compilation stages to detect where the code spends the
most time. Those semantics, however, are not part of the trusted code base.
\item The compiler links the code extracted from the CIC encodings of the
front-end (D3.2) and back-end (D4.2). The two encodings have been partially
proved correct in D3.4 (Front-end Correctness Proof) and D4.4 (Back-end
Correctness Proof). The formal proofs to be delivered in those deliverables
have not been completed. The most urgent future work after the end of the
project will consist in completing those proofs.
\item Debian packages have been provided in D6.6 for the Cost-Annotating
Plug-In, the untrusted CerCo compiler and the trusted CerCo compiler.
The first and third, installed together, form a full installation of the
Trusted CerCo Prototype. In order to allow interested people to test the
prototype more easily, we also provide in D6.6 a Live CD based on
Debian with the CerCo packages pre-installed.
\end{itemize}

Integration with Frama-C has been done in D5.1-D5.3.
This report overviews the packages that have been provided and the contents
of the Live CD.

\section{Debian packages}
We show here the description of every Debian package made for the
software developed in CerCo, as reported by \texttt{apt-cache show}.
Only the relevant fields are shown.
\begin{enumerate}
\item Package name: \texttt{acc}, Version: 0.2-1\\
\textbf{CerCo's Annotating C Compiler}\\
CerCo's Annotating C Compiler is a C compiler for the 8051 microprocessor.
It produces both an Intel HEX code memory representation and an instrumented
version of the source code. The instrumented source code is a copy of the
original source code augmented with additional instructions that keep track
of the exact number of clock cycles spent by the microprocessor to execute
the program's basic blocks. Combined with CerCo's Frama-C Cost Plugin, it
allows one to certify exact parametric time and stack bounds for the source
code.
Limitations: the source code must all be in one .c file and it must not
contain external calls.
\item Package name: \texttt{acc-trusted}, Version: 0.2-1\\
\textbf{CerCo's Trusted Annotating C Compiler}\\
CerCo's Trusted Annotating C Compiler is a C compiler for the 8051
microprocessor. It produces both an Intel HEX code memory representation and
an instrumented version of the source code.
The instrumented source code is a copy of the
original source code augmented with additional instructions that keep track
of the exact number of clock cycles spent by the microprocessor to execute
the program's basic blocks. Combined with CerCo's Frama-C Cost Plugin, it
allows one to certify exact parametric time and stack bounds for the source
code.
Limitations: the source code must all be in one .c file and it must not
contain external calls.
This version of the compiler has been certified for preservation of the
inferred costs using the interactive theorem prover Matita.
\item Package name: \texttt{frama-c-cost-plugin}, Version: 0.2-1\\
Depends: \texttt{acc | acc-trusted},
Recommends: \texttt{why, why3, cvc3}\\
\textbf{CerCo's Frama-C Cost Plugin}\\
CerCo's Frama-C Cost Plugin has been developed by the CerCo project\\
http://cerco.cs.unibo.it.
It compiles a C source file to 8051 Intel HEX using either the acc
or the acc-trusted compiler. The compilation also produces an instrumented
copy of the source code, obtained by inserting instructions that keep track
of the number of clock cycles spent by the program on the real hardware. The
plugin implements an invariant generator that computes parametric bounds for
the total program execution time and stack usage. Moreover, the plugin can
invoke the Jessie verification condition generator to produce proof
obligations that certify the computed bounds. The proof obligations may be
automatically closed using the why3 tool, which invokes automated theorem
provers. The cvc3 prover works particularly well with the proof obligations
generated by the cost plugin.
\end{enumerate}

The following additional Debian packages are required by CerCo's Frama-C
Cost Plugin. They do not install software developed in CerCo, but their
Debian packages themselves have been made by CerCo.
Among all the theorem provers
that can be used in combination with Why3 (and the CerCo Cost Plugin), we
have decided to package cvc3 because its licence allows free redistribution
and because it is powerful enough to close most proof obligations generated
by CerCo on common examples. The user can still manually install and try
other provers, which will be automatically recognized by why3 simply by
running \texttt{why3config}. We do not provide any Debian packages for them.

\begin{enumerate}
\item Package name: \texttt{why}, Version: 2.31+cerco-1\\
\textbf{Software verification tool}\\
Why aims at being a verification condition generator (VCG) back-end
for other verification tools. It provides a powerful input language
including higher-order functions, polymorphism, references, arrays and
exceptions. It generates proof obligations for many systems: the proof
assistants Coq, PVS, Isabelle/HOL, HOL 4, HOL Light, Mizar and the
decision procedures Simplify, Alt-Ergo, Yices, CVC Lite and haRVey.
\item Package name: \texttt{why3}, Version: 0.73+cerco-1\\
\textbf{Software verification tool}\\
Why aims at being a verification condition generator (VCG) back-end
for other verification tools. It provides a powerful input language
including higher-order functions, polymorphism, references, arrays and
exceptions. It generates proof obligations for many systems: the proof
assistants Coq, PVS, Isabelle/HOL, HOL 4, HOL Light, Mizar and the
decision procedures Simplify, Alt-Ergo, Yices, CVC Lite and haRVey.
\item Package name: \texttt{cvc3}, Version: 2.4.1-1\\
\textbf{CVC3 is an automatic theorem prover for Satisfiability Modulo
Theories}\\
CVC3 is an automatic theorem prover for Satisfiability Modulo Theories (SMT)
problems.
It can be used to prove the validity (or, dually, the satisfiability) of
first-order formulas in a large number of built-in logical theories and
their combination.
CVC3 is the latest offspring of a series of popular SMT provers, which
originated at Stanford University with the SVC system. In particular, it
builds on the code base of CVC Lite, its most recent predecessor. Its
high-level design follows that of the Sammy prover.
CVC3 works with a version of first-order logic with polymorphic types and
has a wide variety of features, including:
\begin{itemize}
\item several built-in base theories: rational and integer linear
arithmetic, arrays, tuples, records, inductive data types, bit vectors, and
equality over uninterpreted function symbols;
\item support for quantifiers;
\item an interactive text-based interface;
\item a rich C and C++ API for embedding in other systems;
\item proof and model generation abilities;
\item predicate subtyping;
\item essentially no limit on its use for research or commercial purposes
(see license).
\end{itemize}
\end{enumerate}

Finally, we have released and packaged a new version of Matita that contains
several improvements driven by CerCo, like code extraction.

\begin{enumerate}
\item Package name: \texttt{matita}, Version: 0.99.2-1\\
\textbf{Interactive theorem prover}\\
Matita is a graphical interactive theorem prover based on the Calculus of
(Co)Inductive Constructions.
\end{enumerate}

All Debian packages mentioned above can be downloaded from CerCo's website
\texttt{http://cerco.cs.unibo.it} and are compatible with the \emph{testing}
release of Debian, soon to be frozen to become the forthcoming \emph{stable}
release. The source code needed to recompile the Debian packages from scratch
is part of the CerCo repository and can also be downloaded from the website.
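The package descriptions above repeatedly refer to the instrumented copy of
the source code produced by the compilers. The following plain C fragment is
only a hypothetical sketch of what such an instrumented copy looks like: the
counter names (\texttt{\_\_cost}, \texttt{\_\_stack}, \texttt{\_\_stack\_max}),
the frame size and the per-block cycle counts are made up for the example and
are not the actual output of \texttt{acc}.

```c
/* Illustrative sketch only: counter names and costs are invented,
   not the actual instrumentation emitted by acc. */
unsigned long __cost = 0;       /* clock cycles elapsed so far   */
unsigned long __stack = 0;      /* current stack usage, in bytes */
unsigned long __stack_max = 0;  /* maximum stack usage observed  */

static void __cost_incr(unsigned long k) { __cost += k; }

static void __stack_incr(unsigned long k) {
  __stack += k;
  if (__stack > __stack_max) __stack_max = __stack;
}

static void __stack_decr(unsigned long k) { __stack -= k; }

/* An instrumented copy of a source function: each basic block is
   preceded by a __cost_incr with the cycle count of the corresponding
   object code block, and each call tracks its frame size. */
int fact(int n) {
  __stack_incr(4);              /* hypothetical 4-byte frame           */
  __cost_incr(23);              /* made-up cost of the entry block     */
  if (n <= 1) {
    __stack_decr(4);
    return 1;
  }
  __cost_incr(47);              /* made-up cost of the recursive block */
  {
    int r = n * fact(n - 1);
    __stack_decr(4);
    return r;
  }
}
```

The sketch also illustrates the remark made in the plug-in section: the
clock counter is additive across calls, while the stack counter records a
maximum, which is why bounds on \texttt{\_\_stack\_max} compose less easily
than bounds on \texttt{\_\_cost}.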

\section{Live CD}

We also provide a Live CD to easily test all the software implemented in
CerCo without having to recompile the software by hand or install a Debian
distribution. The Live CD can be downloaded from CerCo's website
\texttt{http://cerco.cs.unibo.it}. It is a Debian Live CD that can either be
burnt to a physical disc used to boot a real machine, or be used to
bootstrap a virtual machine using any virtualizer, such as
\texttt{virtualbox} or \texttt{qemu}.

The Live CD contains an X server and a standard GNOME environment, and it
connects automatically to any available network. The virtualbox guest tools
have also been installed to facilitate communication between the host and
the virtual machine, e.g. for file transfer.

The Live CD also has pre-installed all the Debian packages described in the
previous section. The user can simply boot the system and invoke the
\texttt{cerco} executable to compile any compatible C file to the Intel HEX
format for the 8051. The executable also computes the cost model for the
source program and its cost invariants, and tries to certify the latter
using the cvc3 theorem prover. In case of failure, the flag \texttt{-ide}
can be used to open an interactive why3 session to experiment with different
theorem prover settings (e.g. increasing the prover timeout).

Further information on the \texttt{cerco} executable can be found in
Deliverable D5.2.

Finally, the Live CD also has a pre-installed copy of the interactive
theorem prover Matita and a snapshot of the compiler certification. The
snapshot can be found in the user's home directory, under
\texttt{cerco/src}. Matita files can be opened by running \texttt{matita}.
The other directory, \texttt{cerco/driver}, contains the extracted code
together with the untrusted code that forms the CerCo Trusted Compiler.

\end{document}