On unpacking the ckit sources, you should see a src directory, a doc directory and a README file (and possibly other directories, depending on the distribution).
The src directory contains the following subdirectories:
- CM.make();To test the parser on "test.c", type
- ParseToAst.fileToAst "test.c";This parses and typechecks "test.c" and returns an abstract syntax tree for "test.c". Alternatively, to parse, type-check and then pretty-print "test.c", type
- ParseToAst.fileToC "test.c";
C source programs are processed in two steps. The lexer and parser translate the source to parse trees (Parser.parseFile), and the "elaboration" or static semantics phase (BuildAst.makeAst) performs type checking and translates to abstract syntax. The parse tree datatypes are defined in parse/parse-tree-sig.sml and the abstract syntax types in ast/ast-sig.sml. These definitions are fairly straightforward and should be self-explanatory.
Top level driving functions are in module ParseToAst
(see
ast/parse-to-ast-sig.sml). The following subsections describe some
commonly used ckit functions.
ParseToAst.fileToAst: string -> ParseToAst.astBundle
This is the main function to parse a file and produce abstract syntax.
When applied to a string (the C source file name), it produces a
bundle of information of type astBundle
:
type astBundle = {ast: Ast.ast, tidtab: Bindings.tidBinding Tidtab.uidtab, errorCount: int, warningCount: int, auxiliaryInfo: {astTypes: Tables.astIndexTab, implicits: Tables.astIndexTab, env: State.symtab}}where:
ParseToAst.fileToC : string -> unit
Process a file and pretty print the resulting ast.
Parser.parseFile : Error.errorState -> string -> ParseTree.externalDecl list
Parser.parseFile
(see parser/parser-sig.sml).
This function takes an errorState
and the
name of a (preprocessed) C source file and returns a list of external
declaration parse trees corresponding to the top-level declarations in
the source file. See parser/parse-tree-sig.sml for definitions of
the parse tree types and parser/util/error-sig.sml for documentation
on Error.errorState.
The frontend consists of a number of phases. The first phase consists of a lexer/parser (written using ml-lex and ml-yacc respectively). The output of this phase is a data-structure (parse tree) that is a simple "unprocessed" form that closely follows the structure of C's grammar. The next phase inputs the parse tree data-structure, type checks it, and produces a "processed" abstract syntax tree representation (Ast).
These are built using ml-lex and ml-yacc. The lex and yacc files can be found in src/parser/grammar/[c.lex,c.grm]. The parser performs only a minimal amount of syntactic processing. Many syntactic restrictions are enforced during the type-checking phase e.g restrictions on the number and combination of type specifiers used in a type.
Similarly, most scoping issues are addressed during type-checking. One exception is typedef. This must be handled during parsing because typedefs introduce new types and these can dramatically alter the shape of parse trees. In principle, the scoping of typedefs could be delayed till later processing, but in practice this is not feasible: in particular, if typedefs are not processed during parsing, then we cannot distinguish between declaration forms and expressions. Consider, the following program.
char x; f() { typedef int x; { x * x; } }Here, "
x * x
" declares x
as a pointer to an integer.
However, if the typedef is commented out, then
"x * x
" is interpreted as an expression.
The treatment of typedefs involves a subtle interaction between the parser and lexer. When the parser recognizes a typedef for an identifier, it communicates to the lexer that the identifier should now be treated as a "type". Parser lookahead introduces additional complication: we cannot lex a token until any preceding typedefs have been processed. In particular, we must limit lookahead to one symbol. In fact, this only works because C's grammar requires typedefs to end in a semicolon --- this semicolon acts as a buffer so that a typedef will be completely processed before any use of the new type is lexed. Note that typedefs can be scoped (e.g. see the above program), and so the parser must tell the lexer to undo the effect of a typedef when the typedef's scope is exited. Another complication is the error recovery mechanism of ml-yacc.
The parser produces parse trees (see src/parser/parse-tree-sig.sml). This data structure is a simple "unprocessed" form that closely follows the structure of C's grammar. These parse trees are built up by the actions of the ml-yacc grammar.
Any language extensions is likely to involve extensions to the lexer, parser and to the parse tree datatype. When extending the lexer and parser, care must be taken to preserve the interaction between the lexer, the parser, and the use of one-token lookahead. Extensions to the parse tree datatype are supported via a collection of "Ext" constructors in the parse tree datatypes. The file extensions/c/parse-tree-ext.sml contains the default "empty extension" for standard C.
Files:
BuildAst (src/ast/build-ast.sml) consumes parse trees and builds up abstract syntax trees (Ast's) while performing type checking. Ast's (src/ast/ast.sml) are defined so that each of the major syntactic categories (statements, expressions, and declarations) have a unique integer index associated with them. These indices are used to associate information with specific parts of the code. Care must be taken to preserve their uniqueness when performing code transformations.
Objects (global variables, local variables, functions, etc) and struct/union fields are assigned globally unique integers called program identifiers (pids). This simplifies treatment of scope in Ast. Similarly, types introduced by structs, unions, enums and typedefs are assigned globally unique integers called type identifiers (tids).
BuildAst performs the following tasks:
PPLib.suppressPidUnderscores: controls printing of pids PPLib.suppressPidGlobalUnderscores: controls printing of pids for global objects PPLib.suppressTidUnderscores: controls printing of tids.Files:
Files:
Program phrases (expressions, declarations, statements) are annotated in the abstract syntax with source code locations, which are represented by a data structure that determines a region within a source file. See src/parser/sourcemap-sig.sml.