The essence of the code analysis library VivaCore.
The essence of the code analysis library VivaCore.
Annotation. The article tells developers about VivaCore library, preconditions of its creation, its possibilities, structure and scope of use. This article was written simultaneously with the development of VivaCore library and that's why some of the details of the final realization may differ from the features described here. But this won't prevent the developers from getting acquainted with the general work principles of the library, mechanisms of analysis and processing of C and C++ source code.
Introduction.
VivaCore is an open source library for work with C and C++ code. The library is intended for realizing code refactoring systems, systems of static and dynamic analysis, systems of transformation and optimization, systems of language extensions code, subsystems of syntax highlighting, systems of building documentation on the code and other similar tools on its basis.
The idea of development of the library appeared when our team was creating the static code analyzer Viva64 [1]. The Viva64 tool is intended for diagnosing errors in C/C++ programs related to the peculiarities of code migration on 64-bit Windows systems.
During the process of development of Viva64 our team faced the absence of open libraries convenient for realization of such projects. OpenC++ (OpenCxx) library [2] was chosen as the basis and on the whole we were satisfied with our choice. But during the development of the static analyzer our team made a lot of corrections and improvements in OpenC++ library. And now when the development of the first versions of Viva64 is over we would like to offer the exterior developers our remade variant of OpenC++ library which we called VivaCore. We think that the changes we've made can help the developers a lot who are going to begin the development of software products in the sphere of analysis or processing of C/C++ code.
The license of VivaCore library allows to use, copy, distribute and modify it in binary form or as an original code both for commercial and non-commercial use free, without any payments to the library's authors. One should only mention the authors of the original libraries (OpenC++ and VivaCore).
The difference between VivaCore library and OpenC++ library.
The main difference between VivaCore and OpenC++ is that VivaCore is a living project and it continues to gain functionality. Unfortunately, OpenC++ library hasn't developed for a long time. The latest change in the library was in 2004. And the latest change related to the support of new keywords was in 2003. This correction is an unlucky try to add wchar_t data type, which caused adding about five errors of different types.
The question is evident why we introduce our own library and didn't make changes in OpenC++. We don't have enough free resources for that. A lot of modifications were carried out and it is a very difficult task to introduce all the changes into OpenC++. Many changes are of a specific character and may not fit the general ideology of OpenC++ library and that's why they will need additional adaptation. All this makes the task of OpenC++ library update very resource-intensive, and from our viewpoint unreasonable.
Let's list the key functional abilities realized in VivaCore library in comparison with OpenC++.
- The classical C language is supported. A different set of tokens is used and it gives an opportunity to name the variables as ?class' or declare a function in the following old C style:
PureC_Foo(ptr)
char *ptr;
{
...
} - Great work was carried out to support the specific character of C++ syntax used in the development in VisualStudio 2005/2008 environment. For example, the library processes keywords __noop, __if_exists, __ptr32, __pragma, __interface, etc.
- Some new keywords and other constructions included into new language standards are supported. In particular, a keyword register is added and evocation of templated function with the use of the word template: object.template foo<int>();.
- Computation of literal constant is realized.
- The library is adopted and optimized for the work on 64-bit systems with the help of Viva64 code analyzer.
- A lot of errors and defects are corrected. Such an example is support of string literals separated by space (const wchar_t *str=L"begin" L"end") or divided into two strings by a slash:
const char *name=Viva\
Core";
Another example is correct processing of expressions like "bool r=a < 1 || b > (int) 2;" taken by OpenC++ for templates.
And so on. - The mechanism of primary preprocessing of the original text is created which helps to realize some specific code modifications.
In the nearest future in the VivaCore library it is planned to realize:
- its own preprocessor (on the basis of The Wave C++ preprocessor library), instead of an exterior one (for example, Visual Studio preprocessor). This will allow to avoid some positioning errors on the numbers of the code strings when showing the compilation errors, and also provide better control over the process of code parse and analysis;
- support of coding of complex types, occupying more than 127 symbols in the coded form;
- a simple application presenting the basic principles of the use of VivaCore library.
We should mention that despite all the listed differences between OpenC++ and VivaCore libraries, they have a lot in common and that's why the documentation on OpenC++ doesn't lost its relevance. Our team will try to pay attention the documenting of VivaCore library. But as far as this documentation will cover the differences and new abilities realized in VivaCore library, it will be useful to get acquainted with the OpenC++ documentation in any case.
The scope of use of VivaCore library.
VivaCore library may be of interest for companies and organizations that are creating or are planning to create tools for working with the code. Of course it is impossible to list all the acceptable spheres and methods of use, but we'll list some directions in order to show VivaCore from different viewpoints. In brackets we put the products related to every solution class. One shouldn't think that they are realized on the basis of VivaCore - it is just the examples of solutions. So, with the help of VivaCore one can develop:
- code refactoring (VisualAssist, DevExpress Refactoring, JetBrains Resharper);
- static analyzers of general and specialized purpose (Viva64, lint, Gimpel Software PC-Lint, Microsoft FxCop, Parasoft C++test);
- dynamic code analyzers (Compuware BoundsChecker, AutomatedQA AQTime);
- C/C++ languages extensions, including support of metaprogramming (OpenTS);
- automated code testing (Parasoft C++test)
- code transformation, for example, for optimization;
- syntax highlighting (Whole Tomato Software VisualAssist, any up-to-date development environment);
- systems of code documentation building (Synopsis, Doxygen);
- high-precision determination of changes in the original code or analysis of changes evolution;
- search of duplicating code on the level of grammatical language constructions;
- metrics calculation (C and C++ Code Counter - CCCC);
- support of coding standards (Gimpel Software PC-Lint);
- tools simplifying the code migration on other program and hardware platforms (Viva64);
- automatic code generation;
- code visualizers, systems of building dependency graphs (Source-Navigator);
- code formatting (Ochre SourceStyler).
More detailed about the use of code parse technology one may learn from the fundamental book on compilers [3]. We also recommend you to get acquainted with principles of program analysis [4].
One shouldn't mix up VivaCore and professional multifunctional parsers of C/C++ code. If a user needs a front-end code parser supporting fully the modern C++ language standard and allowing to create his own compiler for a specific platform, he should pay attention to GCC or other expensive commercial solutions. For example, such solutions are offered by Semantic Designs [5].
But if a company develops a tool requiring classical analysis of C/C++ code, then the rational solution is the use of a convenient and open code library, that is of VivaCore.
Basic terms.
Before we begin to speak about VivaCore library in details, let's recollect some terms which will be used during the description.
Processing - a mechanism browsing the incoming ".c/.cpp" file, executing preprocessor's directions in it, including the content of other files pointed in #include directions into it etc. As a result, a file appears which doesn't contain preprocessor's directions, all the used macros are open, instead of #include directions the content of the corresponding files is placed. The file with the result of preprocessing usually has ".i" suffix. The result of preprocessing is called a translation unit.
Syntax analysis (parsing) - the process of analysis of the incoming sequence of symbols with the purpose of parsing of the grammar structure. Usually syntax analysis is divided into two levels: lexical analysis and grammatical analysis.
Lexical analysis - the procedure of processing the incoming sequence of symbols with the purpose to get at the output the sequence of symbols called lexemes (or "tokens"). Each lexeme can be presented as a structure containing the lexeme type and, if necessary, the corresponding meaning.
Grammatical analysis (grammatical parsing) - the process of comparison of the linear sequence of lexemes (words, tokens) of the language with its formal grammar. The result is usually the derivation tree or the abstract syntax tree.
Abstract Syntax Tree (AST) - the finite, marked, orientated tree in which the inner tops are assimilated with the programming language operators, and the leaves with the corresponding operands. So, the leaves are empty operators and they are just variables and constants. The AST differs from the Derivation Tree (DT) in that way that there are no nodes for those syntax rules which don't influence the program's semantics. The classical example of such absence is the grouping brackets, as in the AST the grouping of operands is defined by the tree structure explicitly.
Metaprogramming - creation of programs which create other programs as the result of their work or change and complement themselves during execution. In the sphere of VivaCore library metaprogramming should be understood as the possibility of extending the syntax and functionality of C/C++ language with the purpose of creating one's own programming language. Metaprograms created on this programming language may be then translated into C/C++ code with the use of VivaCore and compiled by an exterior compiler.
Tree traversal - traversal of all the tops and leaves of the syntax tree with the purpose of gathering information of different kinds, analysis and modification.
The general structure of VivaCore library.
The general functional structure of VivaCore library is shown on picture 1. At this moment the library is intended for the full integration with the user's application and is represented as a set of original source code.
Picture 1. General functional structure of VivaCore library.
We offer you to examine the library's functional blocks in that order in which they process the incoming original source code as it is shown on picture 2. We'll see what actions a functional block produces, what information it allows to get and how it can be modified for specific purposes.
Picture 2. The sequence of code processing.
1) Input subsystem.
VivaCore library may process correctly only the original C/C++ code processed by the preprocessor. Further, the possibility of using the preprocessor from The Wave C++ preprocessor library is taken into consideration, but it won't be realized in the first version of the library. To get a preprocessed file one may use a compiler (for example, Microsoft Visual C++) and get a processed file which usually has "i" extension.
In particular cases unprocessed C/C++ files may be input, but in this case one should work with VivaCore not further than on the level of splitting the file into lexemes. This will be enough for calculating metrics or for other purposes. But one shouldn't try to build and analyze the Parse Tree, because the result, most probably, will be unsuitable for processing.
Having the preprocessed code, the user may transfer it to the input subsystem as a file or memory buffer. The purpose of the input subsystem is to arrange the data in the inner structures of VivaCore library. The input system accepts also configurational data which report what libraries should be considered system and what user.
See in the code: VivaConfiguration, ReadFile.
2) Preprocessor subsystem.
We would like to point out that this subsystem doesn't provide code preprocessing in its classical meaning. As was said before, it is the preprocessed code that should be input into VivaCore library. The subsystem under consideration serves to solve the following tasks:
- Splitting the source code into strings and classifying them into two logical groups. To the first group the system code is referred (the code of the compiler's libraries and so on). The second group includes the user code which is of interest for analysis. As a result, while developing a static analyzer, the user has an opportunity to decide if he will analyze the code of system libraries or not.
- Specialized modification of the program text in memory. Such an example is the removal of constructions of a concrete development environment and of constructions not related to C/C++ languages from the code. For example, Viva64 analyzer removes such key constructions as SA_Success or SA_FormatString which is present in Visual Studio's header files.
See in the code: VivaPreprocessor, CreateStringInfo, IsInterestingLine, GetLineNumberByPtr, PreprocessingItem, SkipUninterestingCode.
3) Lexer.
So, we've come to those levels of data processing which are of practical interest for developers. Having parsed the code into lexemes, the user has an opportunity to calculate many metrics, realize a specific algorithm of syntax highlighting in different applications.
Lexical analyzer VivaCore parses the source code into a set of objects of Token type (see file Token.h) which contain information about the lexeme type, its location in the program text and its length. The lexeme types are enumerated in file tokennames.h. Examples of lexeme types:
CLASS - "class"-keyword
WCHAR - "wchar_t"-keyword
If necessary, the user can extend the set of lexemes. It may be required when it is necessary to support the specific syntax of a concrete language realization or when developing one's own language extension.
When adding lexemes it is necessary to define them in tokennames.h file and add into tables "table" or "tableC" in file Lex.cc. The former table is intended for processing of C++ files and the latter for C files. This is natural, as the set of lexemes in C and C++ languages is different. For example, there is no CLASS lexeme in C language, because the word ?class? is not a keyword in C and can indicate a name of a variable.
The set of lexemes can be represented both as a simple array and by transferring it into a file. Lexemes are kept in array tokens_ in Lex class. You can get an array wholly and as separate lexemes as well by using functions GetToken, LookAhead, CanLookAhead.
The user can get lexemes in a form of the following unstructured text or by using function DumpEx in the following formatted form:
Code:
258 LC_ID 5 258 lc_id 5 91 [ 1 262 6 1 93 ] 1 59 ; 1 303 struct 6 123 { 1 282 char 4 42 * 1 258 locale 6
See in the code: Token, Lex, TokenContainer.
4) Parser.
The parser is intended for building the derivation tree that may be further subjected to analysis and transformation. Pay attention that the parser of VivaCore library builds not an abstract syntax tree but just the derivation tree. This allows to carry out the support of metaprogram constructions easier, which may be added by the user into C or C++ language. If the user will urgently need to work just with the abstract syntax tree, we hope it won't be difficult to improve the analyzer so that it could traverse the whole derivation tree and remove nodes and leaves that are not used in abstract syntax.
Building of the tree in VivaCore library takes place in functions of Parser class. The nodes and leaves of the tree are the objects whose classes are inherited from base classes NonLeaf and Leaf. On picture 3 a part of hierarchy of classes used for presentation of the tree is shown.
Picture 3. A part of hierarchy of classes used for building the derivation tree.
As it is seen from the picture, Ptree class is a base for all the others and serves to organizing the single interface for work with other classes. In Ptree class there is a set of pure virtual functions realized in descendants. For example, function "virtual bool IsLeaf() const=0;" is realized in NonLeaf and Leaf classes. Actually the classes realize only this function and they serve to make the hierarchy of classes more logical and nice.
As far as the work with the tree occupies the great size of the library, Ptree has a large set of functions for work with the tree nodes. For convenience these functions are presented as analogs of functions for work with lists in Lisp language. Here are some of them: Car, Cdr, Cadr, Cddr, LastNth, Length, Eq.
To get general idea about the parser's work let's show the derivation tree as an example which will be built on the basis of the following code:
Code:
int MyFoo(const float value) { if (value < 1.0) return sizeof(unsigned long *); return value * 4.0f < 10.0f ? 0 : 1; }
[attachmentid=6109]
Picture 4.1. Color indication of nodes of the semantic tree.
Picture 4.2. Presentation of the function header.
Picture 4.3. Presentation of the function body.
Picture 4.4. Presentation of the function body.
We should mention one more important component of the analyzer's work. This is the gathering of information about types of various objects (functions, variables etc), what is carried out in Encoding class. Information about a type is represented in the form of a specially coded string with whose format you may get acquainted in file Encoding.cc. There is also a special class TypeInfo in the library that allows to get information about types and also modify it. For example, by using such functions as IsFunction, IsPointerType, IsBuiltInType it is easy to identify the type of the element processed.
The description of the approaches to the addition of new types of nodes and leaves is a difficult task and cannot be observed in this review article. The rational solution is to choose one of the classes, for example PtreeExprStatement and examine all the places in the code where the creation of objects of this object takes place as well as work with them and so on.
The tree that we get finally may be saved in the format of ".?/.cpp" file but it is pointless. It will be reasonable after changing the derivation tree what can take place on the following steps. By saving the tree at the moment in the form of the program code we'll have the same what we had on the input. However, it may be rather useful for testing changes introduced into lexer and parser.
Of more interest is the possibility to save the tree for further processing in any format which the user would like to realize. Such an example is the following textual presentation of the program code, described earlier:
Code:
PtreeDeclaration:[ 0 NonLeaf:[ LeafINT:int ] PtreeDeclarator:[ Leaf:MyFoo Leaf:( NonLeaf:[ NonLeaf:[ NonLeaf:[ LeafCONST:const NonLeaf:[ LeafFLOAT:float ] ] PtreeDeclarator:[ Leaf:value ] ] ] Leaf:) ] [{ NonLeaf:[ PtreeIfStatement:[ LeafReserved:if Leaf:( PtreeInfixExpr:[ LeafName:value Leaf:< Leaf:1.0 ] Leaf:) PtreeReturnStatement:[ LeafReserved:return PtreeSizeofExpr:[ Leaf:sizeof Leaf:( NonLeaf:[ NonLeaf:[ LeafUNSIGNED:unsigned LeafLONG:long ] PtreeDeclarator:[ Leaf:* ] ] Leaf:) ] Leaf:; ] ] PtreeReturnStatement:[ LeafReserved:return PtreeCondExpr:[ PtreeInfixExpr:[ PtreeInfixExpr:[ LeafName:value Leaf:* Leaf:4.0f ] Leaf:< Leaf:10.0f ] Leaf:? Leaf:0 Leaf:: Leaf:1 ] Leaf:; ] ] Leaf:} }] ]
See in the code: Parser, Ptree, Leaf, NonLeaf, Encoding, TypeInfo, Typeof, PtreeUtil.
5) The tree traversal.
For the developers of static code analyzers (the detailed introduction into this task book [6]) or systems of building documentation on the code of the most interesting should be the step of the tree traversal which is realized with the use of classes Walker, ClassWalker, ClassBodyWalker. The tree traversal may be carried out several times what allows to create systems modifying the code in several traversals or provide analysis taking into consideration the information got during previous tree traversals.
Walker class serves for traversing base C/C++ constructions.
ClassWalker class is inherited from Walker class and adds functionality related to the specific features of classes, present in C++ language.
When it is necessary to traverse the class body, objects of ClassBodyWalker class are created and used for a short time.
If no changes are introduced into VivaCore library the simple traversal over all the tree elements will take place. The tree won't change then.
If the user realizes functionality that will modify some tree elements the library may rebuild some of its other nodes as well. As an example, let's examine the code translating unary operations:
Code:
Ptree* ClassWalker::TranslateUnary(Ptree* exp) { using namespace PtreeUtil; Ptree* unaryop=exp->Car(); Ptree* right=PtreeUtil::Second(exp); Ptree* right2=Translate(right); if(right == right2) return exp; else return new (GC_QuickAlloc) PtreeUnaryExpr(unaryop, PtreeUtil::List(right2)); }
To make it clearer, let's examine this example in details.
The processing of the node that corresponds a unary operation over some expression and that has PtreeUnaryExpr type begins. The first element in the list that is extracted with the help of exp->Car() operation, is the unary operation itself. The second element extracted with the help of PtreeUtil::Second(exp) is the expression over which the unary operation is carried out.
The translation of the expression takes place and the result is placed into right2 variable. If this address differs from the current it means that the expression was changed. In this case creation of a new object of PtreeUnaryExpr type takes place, which will be taken back from TranslateUnary function. Otherwise nothing was changed and the same object returns, which was on the input.
If the user needs to carry out the gathering of information while traversing the tree or its modification, it will be the most natural that it will be inherited from ClassWalker and ClassBodyWalker classes.
Let's show an example taken from the static analyzer Viva64 where specialized analysis takes place while passing "throw" operator.
Code:
Ptree* VivaWalker::TranslateThrow(Ptree *p) { Ptree *result=ClassWalker::TranslateThrow(p); Ptree* oprnd=PtreeUtil::Second(result); if (oprnd != NULL) { //if oprnd==NULL then this is "throw;". if (!CreateWiseType(oprnd)) { return result; } if (IsErrorActive(115) && !ApplyRuleN10(oprnd->m_wiseType.m_simpleType)) { AddError(VivaErrors::V115(), p, 115); } } return result; }
Speaking about the tree traversal, we should also mention a very important Environment class providing the gathering of information about types of various objects in different scopes.
Here is an example of using Environment class represented by env object for getting the type of declTypeInfo object:
Code:
TypeInfo declTypeInfo; if (env->Lookup(decl, declTypeInfo)) { ... }
6) Support of metaprogramming.
Metaprogramming is based on the approach to code generation when the program code is written not manually but is created by a generating program on the basis of another, simpler program. This approach becomes reasonable if different additional rules are produced while programming (more high-level paradigms, fulfillment of the requirements of exterior libraries, stereotype methods of realization of particular functions). At this point, a part of the code loses its substantial sense and becomes just a mechanical fulfillment of the rules. When this part becomes considerable, a thought may appear to define only the substantial part manually and let all the rest be added automatically. This is the purpose of the generator.
Sometimes this generator is necessary for translation of an invented language into C/C++ operators. VivaCore has a mechanism for convenient creation of C/C++ extensions on the basis of metaobjects. It is possible to change or build new syntax trees in order to save them into the code in C/C++ language.
You may get acquainted with the metaprogramming paradigm and methods of using metaobjects in detail in the documentation on OpenC++ library.
See in the code: Metaclass.
7) Saving of the results.
As it was said already, you may carry out saving of the necessary information on any step of the procedure of the processing of the original code inside VivaCore library. We mentioned also that the derived and changed derivation tree may be saved in the form of the source code or in any other format. So, we won't repeat it once more. It is also clear that one may carry out the gathering of the necessary information, for example at static analysis or calculation of metrics, in different ways and that's why it's senseless to enumerate the means of realization.
Let's say only some words about the use of XML format that was mentioned more than once in this article. XML is a textual format intended for keeping of structured data, for exchange of information between programs and different subsystems of information processing. XML is a simpler subset of SGML language.
We use XML for exporting various information in hope that it will make it easier for exterior developers to use VivaCore in their program developments in other programming languages. For example, it will be very convenient for C# programs. And what is not less important, XML as a data format make it simpler to structure the information and present it in a form familiar to a programmer.
Conclusion.
We understand that after reading this article, more new questions about VivaCore may appear than the answers received. But the good news is that our Viva64.com team is always ready to communicate, to discuss appearing questions and give recommendations on the use of VivaCore. Write us!
Resources.
- Evgeniy Ryzhkov. Viva64: what is it and for whom is it meant? http://www.viva64.com/articles/Viva6...s_and_for.html
- OpenC++ library. http://opencxx.sourceforge.net
- Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools (2nd Edition). Addison Wesley, 2006, 1000 pages.
- Flemming Nielson, Hanne R. Nielson, Chris Hankin. Principles of Program Analysis. Springer, 2004, 452 pages.
- Semantic Designs site. http://www.semdesigns.com
- Patrick Cousot. Static Analysis. Springer, 2001, 450 pages.
0 comments:
Comment here / Ask your Query !!