Super, my favourite type of questions. I should really put together an
architecture document on this stuff. Something to do over the summer. In the
meantime, answers below.
hi,
I've a few questions about how the CDT parser code is organised that I was hoping
people on this list might be able to help with. I'm aware that there has been
an active history in this area of code (?), so if some of the questions below
are too broad brush to make sense then please say.
(a) is the parser(s) compliant with a particular standard?
> We had a copy of the ISO standards in front of us when building the C and
C++ parsers. We also had the gcc manual which lists gcc extensions. The parsers
used today are parsers for gcc.
(b) I've only heard of dom in the context of xml, but the
code mentions both ast and dom. Is dom being used here in the same sense as
ast? and just to confirm, in the pdom context it's being used in a different
context where it really is a higher-level representation than the ast?
> I am seeing the term DOM used to generally mean a programmatic interface
to the contents of a document at a higher level than text. So the CDT DOM is
the programmatic interface to get at the structure and semantics of code. Today
we support C and C++ and I want to add more, like C# and Fortran, in the
future. The AST is the part of the DOM that represents the structure or syntax
of the code. The parser directly creates an AST. From the AST, you can get
Bindings, which represent logical semantic things and bind all declarations and
references of those things together so you can navigate them. The PDOM is
pieces of the DOM persisted to a database. It serves as an index into the code
and as a cheap replacement for the DOM elements during the “Fast”
parsing used by the new Fast indexer.
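As a rough illustration of the AST/Binding relationship described above, here is a toy sketch. It is not the CDT API: BindingSketch, Name, Binding and resolve are hypothetical names made up for this example, and a real resolver would honour scopes rather than matching on the identifier text alone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model, not the CDT API: name occurrences found in the
// syntax tree are grouped under one binding per identifier.
class BindingSketch {
    /** One occurrence of an identifier in the syntax tree. */
    static final class Name {
        final String text;
        final int line;
        final boolean isDeclaration;

        Name(String text, int line, boolean isDeclaration) {
            this.text = text;
            this.line = line;
            this.isDeclaration = isDeclaration;
        }
    }

    /** The semantic entity that every occurrence of a name resolves to. */
    static final class Binding {
        final String name;
        final List<Name> declarations = new ArrayList<>();
        final List<Name> references = new ArrayList<>();

        Binding(String name) {
            this.name = name;
        }
    }

    /** Groups occurrences by identifier text (assumption: no scoping). */
    static Map<String, Binding> resolve(List<Name> names) {
        Map<String, Binding> bindings = new LinkedHashMap<>();
        for (Name n : names) {
            Binding b = bindings.computeIfAbsent(n.text, Binding::new);
            if (n.isDeclaration) {
                b.declarations.add(n);
            } else {
                b.references.add(n);
            }
        }
        return bindings;
    }

    public static void main(String[] args) {
        List<Name> names = Arrays.asList(
            new Name("foo", 1, true),
            new Name("foo", 7, false),
            new Name("bar", 3, true),
            new Name("foo", 9, false));
        Binding foo = resolve(names).get("foo");
        // Navigation: from any occurrence of foo you can reach all the others.
        for (Name ref : foo.references) {
            System.out.println("foo referenced at line " + ref.line);
        }
    }
}
```

The point of the design is that navigation goes through the binding: given any one occurrence, you look up its binding and from there reach every declaration and reference.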
(c) is there a way to tell which code on HEAD belongs
to the current parser(s), and which is unused or to be phased out? e.g. by
package?
> The DOM is everything in *.core.dom.*
The C parser is GNUCSourceParser and the C++ parser is GNUCPPSourceParser.
There is also the DOMScanner which is the scanner. Take a look at the getTranslationUnit
methods of GCCLanguage and GPPLanguage to see how the parsing is kicked off. The
*.core.parser* stuff is the old parser used by the CModel and should be phased out.
(d) is there a way to hook into the parser to allow for
vendor specific language customizations? (or at least tell the parser some
sections are safe to ignore)
> This has always been on our wish list and I don’t think we have a formal way
of doing that yet. One problem would be knowing which parser to use for a given
file, since we are currently triggering off ContentType, which would need to be
different depending on what compiler you were using. The next would be how to
add rules to the parsers, or whether you would create a whole new parser.
(e) the CModel also seems to store information about
entities from the ast, is this something worth knowing more about when getting
an overview of the source code parsing, or can it be considered independent?
Maybe a better question is whether the pdom entities would replace some of the
cmodel entities in the future?
> The CModel currently gets its information from the old parser. It’s a
different AST than the DOM AST. Also, the CModel doesn’t really “store”
anything, i.e., it is not persisted. It is used mainly as content for the
Outline and C/C++ Projects view. It is built on the fly, so it needs to use a
fast parser. Hopefully, with the PDOM and the faster parsing that it brings, we
can use the DOM parser to create the CModel. But, no, the PDOM will not replace
any of the elements in the CModel; we have too many object contributions
registered to CModel elements to do that.
(f) my understanding is that, at least in theory, you
can't parse a c/cpp source file without specifying a set of macros which apply.
How is this addressed by the parser, and also the pdom?
> Yes, we get the list of Macros and Include
Paths from the build system. We call it ScannerInfo which you’ll see in
the getTranslationUnit methods mentioned above. In the future, we’ll be
standardizing on getting this information directly from the CModel. The PDOM
stores includes and macros for each file and we use that information in the
fast indexer when we skip over parsing header files.
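To make the macro dependency concrete, here is a toy sketch of why the same file yields different code for different macro sets. It is not the CDT scanner or the ScannerInfo interface: MacroSketch and visibleLines are hypothetical, and only #ifdef, #else, #endif and #define are handled.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical toy preprocessor, not the CDT scanner.
class MacroSketch {
    /**
     * Returns the lines that survive conditional compilation for the given
     * predefined macros. Assumes well-formed input (balanced #ifdef/#endif).
     */
    static List<String> visibleLines(List<String> source, Set<String> predefined) {
        Set<String> macros = new HashSet<>(predefined);
        Deque<Boolean> branches = new ArrayDeque<>(); // one entry per open #ifdef
        List<String> out = new ArrayList<>();
        for (String raw : source) {
            String line = raw.trim();
            if (line.startsWith("#ifdef ")) {
                branches.push(macros.contains(line.substring(7).trim()));
            } else if (line.equals("#else")) {
                branches.push(!branches.pop()); // flip the innermost branch
            } else if (line.equals("#endif")) {
                branches.pop();
            } else if (!branches.contains(Boolean.FALSE)) {
                // Only reached when every enclosing branch is active.
                if (line.startsWith("#define ")) {
                    macros.add(line.substring(8).trim().split("\\s+")[0]);
                } else {
                    out.add(raw);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> source = Arrays.asList(
            "#ifdef _WIN32",
            "int open_handle(void);",
            "#else",
            "int open_fd(void);",
            "#endif");
        System.out.println(visibleLines(source, Collections.singleton("_WIN32")));
        System.out.println(visibleLines(source, Collections.<String>emptySet()));
    }
}
```

The two calls in main see different declarations in the same file, which is why the parser has to be handed the macros and include paths before it can produce a meaningful AST.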
(g) are there different levels of parsing, for picking
up a summary of the members of a source file? (I've seen there are constants
for not following includes, but not e.g. skipping parsing the body of a
function, assuming this is feasible)
> In the old parser, we had the concept
of Quick Parse which skipped over headers and function bodies and that is what
we use to populate the CModel. I don’t think we have full support for
that mode in the new DOM parser, at least the skipping function body part. As
mentioned earlier, the Fast parse mode skips over headers if it can find
information about them in the PDOM. Skipping function bodies would need to be
done before we can use it for the CModel.
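The header-skipping idea can be sketched roughly as follows. This is a toy model under stated assumptions, not the Fast indexer: FastParseSketch, its fields and its parse method are hypothetical, and the "index" is just a map from header name to the macros recorded for it by an earlier parse.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model of skipping headers that an index already knows about.
class FastParseSketch {
    final Map<String, List<String>> files;         // file name -> source lines
    final Map<String, Set<String>> indexedMacros;  // header -> macros stored by an earlier parse
    final List<String> parsed = new ArrayList<>(); // files actually parsed, in order
    final Set<String> macros = new HashSet<>();    // macros known so far

    FastParseSketch(Map<String, List<String>> files,
                    Map<String, Set<String>> indexedMacros) {
        this.files = files;
        this.indexedMacros = indexedMacros;
    }

    void parse(String file) {
        if (parsed.contains(file)) {
            return; // guard against include cycles
        }
        parsed.add(file);
        for (String raw : files.getOrDefault(file, Collections.<String>emptyList())) {
            String line = raw.trim();
            if (line.startsWith("#include ")) {
                String header = line.substring(9).trim().replace("\"", "");
                Set<String> known = indexedMacros.get(header);
                if (known != null) {
                    macros.addAll(known); // reuse stored macros, skip the header
                } else {
                    parse(header);        // not indexed yet: parse it fully
                }
            } else if (line.startsWith("#define ")) {
                macros.add(line.substring(8).trim().split("\\s+")[0]);
            }
        }
    }
}
```

If main.c includes a.h and b.h and only a.h is in the index, calling parse("main.c") parses main.c and b.h, while the macros of a.h are taken from the index without opening the file.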
I have attempted to answer some questions myself from the source code, but have
been getting a little overwhelmed, so any help would be appreciated :).
thanks,
Andrew
Yes, it is overwhelming and you are not
alone. There are a number of people getting interested in indexing (e.g. folk
from Wind River and IBM) and I’ll need
to put together an architecture guide ASAP to help everyone get started. Even a
picture would be a great start…