RE: [cdt-dev] some questions about CDT c/cpp parsing

Super, my favourite type of questions. I should really put together an architecture document on this stuff. Something to do over the summer. In the meantime, answers below.

 

hi,

I have a few questions about how the CDT parser code is organised that I was hoping people on this list might be able to help with. I'm aware that this area of the code has an active history, so if some of the questions below are too broad-brush to make sense, please say so.


(a) is the parser(s) compliant with a particular standard?


> We had a copy of the ISO standards in front of us when building the C and C++ parsers. We also had the gcc manual which lists gcc extensions. The parsers used today are parsers for gcc.


(b) I've only heard of DOM in the context of XML, but the code mentions both AST and DOM. Is DOM being used here in the same sense as AST? And just to confirm, in the PDOM is it being used differently, where it really is a higher-level representation than the AST?


> I am seeing the term DOM used generally to mean a programmatic interface to the contents of a document at a higher level than text. So the CDT DOM is the programmatic interface for getting at the structure and semantics of code. Today we support C and C++, and I want to add more, like C# and Fortran, in the future. The AST is the part of the DOM that represents the structure, or syntax, of the code. The parser directly creates an AST. From the AST you can get Bindings, which represent logical semantic entities and bind all declarations and references of those entities together so you can navigate them. The PDOM is pieces of the DOM persisted to a database; it serves as an index into the code and provides cheap replacements for DOM elements during the “Fast” parsing used by the new Fast indexer.
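To make the AST vs. Binding distinction concrete, here is a minimal sketch that walks an AST and resolves each name to its Binding. It assumes the org.eclipse.cdt.core.dom.ast API (ASTVisitor, IASTName.resolveBinding()); the exact surface may vary between CDT versions, and the translation unit is assumed to have been obtained as described under (c) below.

    import org.eclipse.cdt.core.dom.ast.ASTVisitor;
    import org.eclipse.cdt.core.dom.ast.IASTName;
    import org.eclipse.cdt.core.dom.ast.IASTTranslationUnit;
    import org.eclipse.cdt.core.dom.ast.IBinding;

    public class BindingDump {
        // Walk the syntactic AST and resolve each IASTName to its semantic Binding.
        public static void dumpBindings(IASTTranslationUnit tu) {
            tu.accept(new ASTVisitor() {
                {
                    shouldVisitNames = true; // we only care about IASTName nodes here
                }
                @Override
                public int visit(IASTName name) {
                    IBinding binding = name.resolveBinding(); // syntax -> semantics
                    System.out.println(name + " -> " + binding.getName());
                    return PROCESS_CONTINUE;
                }
            });
        }
    }

Every IASTName that resolves to the same IBinding refers to the same logical entity, which is what lets you navigate between declarations and references.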


(c) is there a way to tell which code on HEAD belongs to the current parser(s), and which is unused or to be phased out? e.g. by package?

> The DOM is everything in *.core.dom.* The C parser is GNUCSourceParser and the C++ parser is GNUCPPSourceParser. There is also the DOMScanner which is the scanner. Take a look at the getTranslationUnit methods of GCCLanguage and GPPLanguage to see how the parsing is kicked off. The *.core.parser* stuff is the old parser used by the CModel and should be phased out.
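For orientation, here is a rough sketch of kicking off a DOM parse through GCCLanguage. The method name and signature shown (getASTTranslationUnit taking a FileContent, IScannerInfo, etc.) follow a later CDT API than the getTranslationUnit methods mentioned above, so treat them as assumptions and check GCCLanguage/GPPLanguage in your own tree; the file path is made up.

    import java.util.Collections;

    import org.eclipse.cdt.core.dom.ast.IASTTranslationUnit;
    import org.eclipse.cdt.core.dom.ast.gnu.c.GCCLanguage;
    import org.eclipse.cdt.core.parser.FileContent;
    import org.eclipse.cdt.core.parser.IncludeFileContentProvider;
    import org.eclipse.cdt.core.parser.ParserUtil;
    import org.eclipse.cdt.core.parser.ScannerInfo;
    import org.eclipse.core.runtime.CoreException;

    public class ParseExample {
        // Parse one C file in isolation: no macros, no include paths, headers not read.
        public static IASTTranslationUnit parse(String path) throws CoreException {
            FileContent content = FileContent.createForExternalFileLocation(path);
            ScannerInfo scanInfo = new ScannerInfo(
                    Collections.<String, String> emptyMap(), new String[0]);
            return GCCLanguage.getDefault().getASTTranslationUnit(
                    content,                                             // source to parse
                    scanInfo,                                            // macros + include paths
                    IncludeFileContentProvider.getEmptyFilesProvider(),  // don't pull headers off disk
                    null,                                                // no index
                    0,                                                   // no extra options
                    ParserUtil.getParserLogService());
        }
    }

The equivalent call on GPPLanguage.getDefault() runs the C++ parser (GNUCPPSourceParser) instead.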


(d) is there a way to hook into the parser to allow for vendor specific language customizations? (or at least tell the parser some sections are safe to ignore)

> This has always been on our wish list and I don’t think we have a formal way of doing that yet. One problem would be knowing which parser to use for a given file, since we are currently triggering off the ContentType, which would need to be different depending on what compiler you were using. The next would be how to add rules to the parsers, or whether you would create a whole new parser.


(e) the CModel also seems to store information about entities from the AST. Is this something worth knowing more about when getting an overview of the source code parsing, or can it be considered independent? Maybe a better question is whether the PDOM entities would replace some of the CModel entities in the future?

> The CModel gets its information from the old parser currently. It’s a different AST from the DOM AST. Also, the CModel doesn’t really “store” anything, i.e., it is not persisted. It is used mainly as content for the Outline and C/C++ Projects views. It is built on the fly, so it needs to use a fast parser. Hopefully, with the PDOM and the faster parsing that it brings, we can use it to create the CModel. But, no, it will not replace any of the elements in the CModel; we have too many object contributions registered to CModel elements to do that.


(f) my understanding is that, at least in theory, you can't parse a C/C++ source file without specifying the set of macros that apply. How is this addressed by the parser, and also by the PDOM?

> Yes, we get the list of Macros and Include Paths from the build system. We call it ScannerInfo which you’ll see in the getTranslationUnit methods mentioned above. In the future, we’ll be standardizing on getting this information directly from the CModel. The PDOM stores includes and macros for each file and we use that information in the fast indexer when we skip over parsing header files.
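To show what the ScannerInfo carries, a small sketch; the -D and -I values here are invented for illustration, and in a real workspace they would come from the build system (or, later, the CModel) as described above.

    import java.util.HashMap;
    import java.util.Map;

    import org.eclipse.cdt.core.parser.IScannerInfo;
    import org.eclipse.cdt.core.parser.ScannerInfo;

    public class ScannerInfoExample {
        // Hand the parser the same -D macros and -I include paths the compiler would see.
        public static IScannerInfo buildScannerInfo() {
            Map<String, String> macros = new HashMap<String, String>();
            macros.put("DEBUG", "1");           // hypothetical -DDEBUG=1
            macros.put("VERSION", "\"2.0\"");   // hypothetical -DVERSION="2.0"
            String[] includePaths = new String[] {
                    "/usr/include",             // hypothetical -I entries
                    "/opt/myproject/include"
            };
            return new ScannerInfo(macros, includePaths);
        }
    }

The IScannerInfo returned here is what gets passed into the parse call sketched under (c).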


(g) are there different levels of parsing, for picking up a summary of the members of a source file? (I've seen there are constants for not following includes, but not, e.g., for skipping the parsing of function bodies, assuming this is feasible)

> In the old parser, we had the concept of Quick Parse, which skipped over headers and function bodies, and that is what we use to populate the CModel. I don’t think we have full support for that mode in the new DOM parser, at least not the skipping of function bodies. As mentioned earlier, the Fast parse mode skips over headers if it can find information about them in the PDOM. Skipping function bodies would need to be added before we can use it for the CModel.
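Assuming an ILanguage that exposes an OPTION_SKIP_FUNCTION_BODIES flag (not guaranteed to be present, or honoured by the DOM parsers, in the version discussed here), a quick-parse-style call would look roughly like this sketch:

    import org.eclipse.cdt.core.dom.ast.IASTTranslationUnit;
    import org.eclipse.cdt.core.dom.ast.gnu.c.GCCLanguage;
    import org.eclipse.cdt.core.model.ILanguage;
    import org.eclipse.cdt.core.parser.FileContent;
    import org.eclipse.cdt.core.parser.IncludeFileContentProvider;
    import org.eclipse.cdt.core.parser.ParserUtil;
    import org.eclipse.cdt.core.parser.ScannerInfo;
    import org.eclipse.core.runtime.CoreException;

    public class QuickParseExample {
        // Quick-parse style: skip headers and (where supported) function bodies.
        public static IASTTranslationUnit quickParse(String path) throws CoreException {
            return GCCLanguage.getDefault().getASTTranslationUnit(
                    FileContent.createForExternalFileLocation(path),
                    new ScannerInfo(),                                   // no macros/includes for the sketch
                    IncludeFileContentProvider.getEmptyFilesProvider(),  // don't read headers
                    null,                                                // no index
                    ILanguage.OPTION_SKIP_FUNCTION_BODIES,               // the mode in question
                    ParserUtil.getParserLogService());
        }
    }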

I have attempted to answer some questions myself from the source code, but have been getting a little overwhelmed, so any help would be appreciated :).
thanks,
Andrew

 

Yes, it is overwhelming and you are not alone. There are a number of people getting interested in indexing (e.g. folk from Wind River and IBM) and I’ll need to put together an architecture guide ASAP to help everyone get started. Even a picture would be a great start…

 

Doug Schaefer, QNX Software Systems

Eclipse CDT Project Lead, Tools PMC member

http://cdtdoug.blogspot.com

