skip to main content

developerWorks  >  Autonomic computing | Information Management  >

Building a CDT-based editor, Part 4: Advanced CDT Parsing and the Persisted Document Object Model

Understanding CDT Indexing and the PDOM

developerWorks
Document options

Document options requiring JavaScript are not displayed

Discuss

Sample code


Rate this page

Help us improve this content


Level: Intermediate

Matthew Scarpino (mattscar@yahoo.com), Java Developer, Eclipse Engineering LLC

12 May 2006

This article, the fourth in a five-part series, introduces the second and more sophisticated of the parsers used by Eclipse's C/C++ Development Tooling (CDT). This new process structures its information in a Persisted Document Object Model (PDOM) and enables indexing, code completion, and content assist. If you intend to improve or extend the CDT for your own custom tool, understanding the PDOM and the new parsing will be essential.

Why learn about another parser?

External linkage is one of the many features of C/C++ that make it hard to analyze code. No matter where you declare a variable or function, the extern keyword allows other files to use it without knowing where it was defined. When it comes to displaying this interdependence, Eclipse's Outline View, shown on the left of Figure 1, isn't helpful - it only shows elements of single source files. But on the right, the Index View presents each element of every file in every CDT project. If you're trying to make sense of complex, interconnected projects, this second approach is vital.



Figure 1: Outline and Index Views
Outline and Index Views

We could build an Index View using the parser I described in the last article, but it would have to parse every source file in the CDT workspace whenever any code was modified. This isn't practical, so to make capabilities like the Index View possible, the CDT developers (thanks!) created a new parsing process that stores its results in memory-mapped files. This persistence enables the parser to access all the stored elements in a project while only analyzing files that have recently changed. In addition to the Index View, this improved parsing also makes possible features like code completion and content assist.

This doesn't mean that the discussion in Part 3 is irrelevant. The new parsers function in mainly the same way as the old Parser - they contain similar methods and create similar Abstract Syntax Trees (ASTs). But much of the new process is new and I'll summarize its operation in four steps:

  1. Starting the Indexer Job
  2. Parsing a TranslationUnit
  3. Updating the PDOM
  4. PDOM Structure and Storage

So that you can see how these concepts are implemented in code, I've updated the Bare Bones C/C++ Development Tooling (BBCDT) to create an index of the project's elements and display specific node types. To activate it, I've updated the previous ASTAction to visit elements in the AST and the PDOM's storage. You can download the updated BBCDT here.

Step 1: Starting the Indexer Job

When the Core Plugin starts, it creates a PDOMManager to oversee the new parsing and persistence. The manager does little on its own, but responds when new CDT elements (particularly CProjects and TranslationUnits) appear in the workspace. When a new CProject is created, the PDOMManager creates an indexer to find and store all the elements in the project's source files.

The indexer implements IPDOMIndexer but its class depends on the user's CDT preferences. Figure 2 shows the three indexing options, and 'No Indexing' is chosen by default. In the default case, the PDOMManager creates a NullPDOMIndexer, which does nothing. It creates a FastPDOMIndexer if the 'Fast Indexing' option is chosen and a FullPDOMIndexer if the 'Full Indexing' option is chosen. Once the indexer is created, the manager tells it to start creating an index.



Figure 2: Indexer Preferences
Indexer Preferences

The goal of indexing is to record all the named elements in the indexer's project's source code and store them with their related data. This is a complicated process and can run in the background, so it's performed as an Eclipse Job. Specifically, the indexer creates an indexing Job with a priority of Job.LONG and adds it to the platform's queue.

If the project has just been created, the Job doesn't accomplish much - there's no source code to index. But it does build a very important object for the project called a PDOM. This provides the routines that store the indexer's data within the Persisted Document Object Model (PDOM). I'll cover the PDOM and its operation shortly.

The Protean PDOM

As I write this, the PDOM and its parsing is not finalized. In fact, many bewildered respondents on the CDT newsgroup will tell you it's not even stable. But this is the future of CDT code analysis, and it's important to understand the concepts even if the details change. It plays an important role in enabling the CDT's features (content assist, code completion, indexing, etc.) and this role will only expand as development continues.

Unlike the Parser in the last article, the indexer doesn't begin immediately or automatically. Instead, the PDOMManager waits for changes to the CProject, such as a new, modified, or deleted source file. When it detects a change, it tells the indexer to start analyzing, and this time the indexer Job works in earnest. In particular, it finds the altered TranslationUnit and the ILanguage that represents the language of the unit's source code. Then it starts the parsing process by calling ILanguage.getASTTranslationUnit().

Step 2: Parsing the TranslationUnit

The Parser in the previous article made no distinction between programming languages. It looks for C and C++ elements and uses them to create a generic AST. But the new process works differently. Each TranslationUnit has an ILanguage object that specifies, among other things, how the unit should be parsed. If you'd like to add a custom language to the CDT, creating your own ILanguage is the first place to start. Thankfully, the CDT provides two by default: GCCLanguage and GPPLanguage.

Similarly, the CDT provides two default parsers that implement ISourceCodeParser: GNUCSourceParser and GNUCPPSourceParser. The parsing process itself isn't different from that described in the previous article. The code reader creates a character buffer from the source file, the scanner combines characters into ITokens, and the parser combines ITokens into IASTNodes. Even the method names (translationUnit(), declaration(), statement(), et al) are the same. The main differences between the new parsing process and the old are listed as follows:

  • The new parsers are language specific - nodes extend CASTNode or CPPASTNode.
  • The new parsers always run in COMPLETE_PARSE mode, so they always parse inside functions.
  • The new parsers store additional parsing information (problems, node locations) in an ILocationResolver instead of a callback object.
  • The new parsers only determine an element's scope when it is specifically requested.
  • The top node in a new AST is an IASTTranslationUnit, whose structure can be traversed by ASTVisitors.

Remember: The old Parser returned an IASTCompilationUnit. The PDOM parsers return IASTTranslationUnits, or more specifically, CASTTranslationUnits and CPPASTTranslationUnits.

The third difference is important. The parser can't know how an element is used unless it knows the element's scope. The old Parser creates a top-level scope when it starts the parse with translationUnit(). Then, it passes the scope to further methods, such as declaration(), and these create and pass along sub-scopes as necessary. When the old Parser finishes, it has the scope of every element in its AST.

But the new parser creates a single scope and doesn't change it during the course of its operation. This is because the indexer is only interested in a specific set of nodes. To explain this and other interesting features, I need to describe what the indexer does with the parser's AST and how it interacts with the PDOM.

Step 3: Updating the PDOM

After the indexer receives the IASTTranslationUnit, it gets a write lock on the PDOM and begins persisting the TranslationUnit's data. First, it calls IASTTranslationUnit.getIncludeDirectives() and IASTTranslationUnit.getMacroDefinitions() to acquire the unit's included files and macro definitions. Then, it stores a reference to the unit's file in the PDOM along with references to the included/macro files. These references take the form of PDOMFiles that can be stored with PDOM.addFile().

To search through, or visit, the AST, the indexer creates a new ASTVisitor and calls IASTTranslationUnit.accept(ASTVisitor). Each of the IASTNodes in the AST has an accept() method, and each works in much the same way:

  1. It checks one of the ASTVisitor's boolean fields to determine if the visitor is interested in visiting this node.
  2. If the visitor is interested, the node calls the visitor's visit() method. This returns one of three values:
    • PROCESS_SKIP: The visitor isn't interested in the node's children, so accept() returns true.
    • PROCESS_ABORT: The visitor isn't interested in visiting at all, so accept() returns false.
    • PROCESS_CONTINUE: The visitor wants to visit the node's children, so accept() continues to the next step.
  3. If the boolean field is false or visit() returns PROCESS_CONTINUE, accept() calls the accept() method of each of its children and passes the visitor as its argument.

For example, if a C function declaration has a declaration specifier, such as typedef or static, its corresponding CASTFunctionDefinition.accept() method will check the visitor's shouldVisitDeclarations field. If this is true, it will call the visitor's visit() method. If the field is false, or visit() returns PROCESS_CONTINUE, the function node will access its child and call CASTSimpleDeclSpecifier.accept(). When the visitor reaches one of the final, or terminal, nodes, accept() simply returns true.

Table 1 shows the boolean fields that control which types of nodes are allowed to call the visitor's visit() method. By default, all of them are set to false, so visit() is never called.



Table 1. The ASTVisitor fields that control node visitation
shouldVisitNamesshouldVisitDeclarationsshouldVisitInitializers
shouldVisitDeclaratorsshouldVisitDeclSpecifiersshouldVisitExpressions
shouldVisitTypeIdsshouldVisitEnumeratorsshouldVisitTranslationUnit
shouldVisitParameterDeclarationsshouldVisitStatementsshouldVisitProblems

The best way to understand how ASTVisitors work is to look at an example. The code listing below shows the visitor used by the PDOMFullIndexerJob to find IASTName nodes. To visit these nodes, it sets shouldVisitNames and shouldVisitDeclarations to true.



Listing 1. The PDOMFullIndexerJob's ASTVisitor
ast.accept(new ASTVisitor() {
   // Set the boolean fields so that name and declaration nodes call visit()
   {
      shouldVisitNames = true;
      shouldVisitDeclarations = true;
   }

   // Perform the visit
   public int visit(IASTName name) {
      try {
         IASTFileLocation fileloc = name.getFileLocation();
         if (fileloc != null) {
            PDOMFile file = pdom.addFile(fileloc.getFileName());
            linkage.addName(name, file);
         }
         return PROCESS_CONTINUE;
      } catch (Throwable e) {
         CCorePlugin.log(e);
         return ++errorCount > MAX_ERRORS ? PROCESS_ABORT : PROCESS_CONTINUE;
      }
   };
});

As shown, the indexer's visitor is only interested in IASTNames, which are the only nodes stored in the PDOM. These represent specific identifiers, such as the names of functions, variables, and enumerated types. When the visitor reaches one of these nodes, it accesses a PDOMFile object representing the name's source file. Then, it uses the PDOM's linkage object (PDOMLinkage) to add the IASTName to the index.

But before the IASTName can be stored, the linkage needs to know its type, or binding - whether it's the name of a variable, function, struct, etc. To find out, the linkage finds the node's parent and its scope. With this scope, the linkage figures outs how the name is used and encapsulates this information in an IBinding. This is why the new parsing process makes scope determinations after the parse - it's only interested in the scopes of IASTNames.

To clarify names and bindings, I debugged the CDT's parsing of "Hello, World!" and provided the AST breakdown shown in Figure 3. The AST contains a single IASTDeclarationUnit that represents the main() function. The CASTFunctionDeclarator holds a CASTName whose IBinding is a CFunction.



Figure 3: Hello World AST
Hello World AST

The linkage adapts the IBinding into a persistible PDOMBinding and the IASTName into a PDOMName. The constructors for these new objects access the PDOM to store their information. To show how this works, I need to back up and explain how the PDOM operates.

Step 4: The PDOM Structure

If you've created any CDT projects, I recommend you open your workspace directory and the .metadata/plugins/org.eclipse.cdt.core directory underneath it. There, you'll see a *.pdom file for each of your projects. When the PDOM is created for a project, it constructs a Database object to manage its low-level storage. The Database builds a new RandomAccessFile and gives it the project name and the time in milliseconds - these are the *.pdom files you see.

Java IO and Java NIO

Most of the Java input/output routines I've seen still use the Java IO Application Programming Interface (API). This moves data in byte streams, providing InputStreams, OutputStreams, Writers, and Readers. This is fine when performance isn't an issue, but the classes in the Java New I/O (NIO) API move data using the same efficient, bulk-transfer methods as your operating system. This isn't important for small files, but makes a big difference when file sizes approach the size of operating system pages (e.g. 2kB, 4kB, 8kB). Not coincidentally, the Database initializes its *.pdom files at 16 kB.

In the CDT, the most important Buffer is the MappedByteBuffer, which is created on top of an existing RandomAccessFile. Normal file access requires buffering, heap allocation, and file I/O routines. But MappedByteBuffers are immediately updated by the virtual memory controller, and the memory doesn't affect the Java heap. Finally, since MappedByteBuffers lie outside process memory, they can be accessed efficiently by multiple processes at once. However, because it operates so closely to the operating system, its behavior is more OS-dependent than Java IO routines.

This Database has more in common with regular databases than just the name. It divides memory into variable-length records composed of 16-byte blocks. Each record represents a different PDOMNode, such as a PDOMLinkage or PDOMBinding. To enable memory access, it provides put and set methods for ints, chars, and bytes.

To make reading and writing to the file as efficient as possible, the Database uses an array ofChunks which manage MappedByteBuffers on the *.pdom file. When the Database creates this file, it sets its size to 16kB and allocates the first Chunk in the array to buffer. The Database creates further 16kB Chunks as the index requires more memory.

The first object added to the database is the PDOMLinkage, and new linkages are added as new languages are used in the project. When its constructor is called, it a series of tasks:

  • It sets its initial record size at 24 bytes
  • It stores a 0 to represent the node's type (linkage)
  • It stores a 0 to show that it has no parent node
  • It stores the size of memory needed to store the linkage's name
  • It allocates the memory and stores the linkage's name
  • It stores the size of memory needed to store the linkage's ID
  • It allocates the memory and stores the linkage's ID
  • If this is the first project linkage, the PDOM stores its record size at the LINKAGES address. If not, it finds the next available address and stores the size there.

The memory allocation is performed by Database.malloc(int size), which starts by determining the minimum number of blocks needed to provide the requested memory. If there isn't enough memory available, it creates a new Chunk. Otherwise, it accesses the Chunk that can provide the memory, clears the required number of blocks, and returns the location of the first allocated block.

The storage process is similar for other objects, such as when the linkage creates a new PDOMBinding as part of storing a named node. If this binding isn't contained within the scope of another binding, the linkage sets itself as the binding's parent. Then, the PDOMBinding constructor sets its initial record size at 24 bytes and performs most of the same steps as those listed above.

The PDOMName works differently. There are three ways a name can be associated with its binding - as a definition, a declaration, or as a reference. When the PDOMName determines which role it plays, it updates the PDOMBinding's record. Then it stores the location of the PDOMFile whose file contains it, along with its location and length. Finally, it populates the file's index by adding its block location to the PDOMFile.

Because of the parent/child relationship of PDOMNodes, the index forms a tree similar to the parser's AST. The first time PDOMLinkage.getIndex() is called, it creates a BTree using the PDOM's Database and the root block location. You can search through this tree with an IPDOMVisitor. This functions like the ASTVisitor with three important differences:

  • No boolean fields - all nodes are visited
  • Only a single visit(IPDOMNode node) method, which checks if the visited node is of a certain type
  • Contains a leave(IPDOMNode node) to stop search when a particular node type is reached

You can also search through the PDOM using the static methods provided in the DOMSearchUtil class.

Updating the BBCDT

I've added the PDOM classes, and considering the size of the application, 'Bare Bones' just doesn't seem appropriate. I've set the PDOMManager to create onlyPDOMFullIndexers for project indexing, so there are no preferences or descriptors involved. Also, I changed the ASTAction to perform two tasks: search through the IASTTranslationUnit with a BBASTVisitor and search through the PDOM with a BBPDOMVisitor. These classes are in the org.dworks.bbcdt.ui.action package. The result of clicking ASTAction is shown in Figure 4.



Figure 4: Contents of the BBCDT Index
Contents of the BBCDT Index

The BBASTVisitor searches through the AST, finding IASTNames and IASTForStatements. When it discovers a loop, it displays three expressions: the initializer, condition, and iteration. The BBPDOMVisitor finds functions in the BBPDOMVisitor and displays the number present.

Conclusion

Parsing, Java NIO, and pseudo-database access - indexing is an complicated process, no question. But this complexity is needed to provide efficient, all-embracing code analysis and storage. Now that I've explained the PDOM, it's time to see how the CDT applies it. In the next article, I'll discuss code completion and content assist - how they work and specifically, how they access the PDOM index.




Back to top


Download

DescriptionNameSizeDownload method
BBCDT code for article 4bbcdt_part4.zip4.7MBHTTP
Information about download methodsGet Adobe® Reader®


Back to top


Resources



Back to top


About the author

Author1 photo

Matthew Scarpino is a project manager and Java developer at Eclipse Engineering LLC. He is the lead author of SWT/JFace in Action (Manning, 2004) and made a minor but important contribution to the Standard Widget Toolkit (SWT). He enjoys Irish folk music, marathon running, the poetry of William Blake, and the Graphical Editing Framework (GEF). You can reach Matthew at mattscar@yahoo.com.




Back to top


Rate this page


Please take a moment to complete this form to help us better serve you.



YesNoDon't know
 


 


12345
Not
useful
Extremely
useful
 


Back to top


This is the first trademark attribution statement. This is the second trademark attribution statement. Other company, product, or service names may be trademarks or service marks of others.


    About IBMPrivacyContact