459065 – wrong string parsing

Bug 459065 - wrong string parsing

Summary: wrong string parsing

Status:	NEW

Alias:	None

Product:	CDT
Classification:	Tools
Component:	cdt-parser (show other bugs)
Version:	8.4.0
Hardware:	PC Linux

Importance:	P3 major (vote)
Target Milestone:	---
Assignee:	Project Inbox
QA Contact:	Jonah Graham

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2015-02-03 12:40 EST by Marco Trudel
Modified:	2020-09-04 15:21 EDT (History)
CC List:	3 users (show)

See Also:

Attachments
proposed patch (1.35 KB, patch) 2015-02-04 11:06 EST, Marco Trudel	no flags	Details \| Diff
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Marco Trudel

2015-02-03 12:40:19 EST

printf("%ld\n", sizeof("\0""0")); // outputs 3 (\0, 0, implicit \0)
is parsed as (the string is concatenated):
   printf("%ld\n", sizeof("\00")); // outputs 2 (\00, implicit \0)
it should probably be:
   printf("%ld\n", sizeof("\0000")); // outputs 3 (\000, 0, implicit \0)

Comment 1 Nathan Ridge

2015-02-04 00:18:06 EST

I see something slightly different: CDT evaluates 'sizeof("\0")' as 3, suggesting that it accounts for the implicit null-terminator, but not for the escape character.

'sizeof("\0""0")' is evaluated as 4, which suffers from the same problem again, but there doesn't seem to be any additional problem related to concatenation.

Comment 2 Marco Trudel

2015-02-04 08:07:48 EST

Sorry, I was not clear enough. I'm talking about the AST created when parsing such strings. One IASTLiteralExpression is created with the value "\00", which is wrong. Consider:

import org.eclipse.cdt.core.dom.ast.*;
import org.eclipse.cdt.core.dom.ast.gnu.c.GCCLanguage;
import org.eclipse.cdt.core.model.ILanguage;
import org.eclipse.cdt.core.parser.*;

public class Test
{
  public static void main(String[] args) throws Exception
  {
    String code = "char *s = \"\\0\"\"0\";"; // char *s = "\0""0";
    IASTTranslationUnit tu =  GCCLanguage.getDefault().getASTTranslationUnit(
        FileContent.create("foo.c", code.toCharArray()),
        new ScannerInfo(),
        null,
        null,
        ILanguage.OPTION_IS_SOURCE_UNIT,
        new NullLogService()
      );
    tu.accept(new ASTVisitor(true)
      {
        @Override
        public int visit(IASTExpression node)
        {
          if(node instanceof IASTLiteralExpression)
          {
            System.out.println("AST literal: " + node.toString()); // outputs "\00"
          }
          return PROCESS_CONTINUE;
        }
      });
  }
}

Comment 3 Nathan Ridge

2015-02-04 10:28:49 EST

I don't necessarily think that's wrong. I'm not aware of IASTLiteralExpression.toString() making any kind of specific guarantees about what sort of string it returns, such as one with escape sequences processed.

If we want to expose the string literal with escape sequences processed in the AST, we could specify that toString() does so, or add a new API to IASTLiteralExpression, but that would be a new feature, not a bug.

Incorrect calculation of 'sizeof' for such literals, however, *is* a bug, because it can cause CDT to produce false-positive errors (and other IDE misbehaviour) in certain code.

Comment 4 Marco Trudel

2015-02-04 11:06:55 EST

Created attachment 250505 [details]
proposed patch

Fixes the concatenation of string parts.
E.g. the parsing of "\1""2":
- resulting string without the fix: "\12"
- resulting string without the fix: "\0012"

Comment 5 Marco Trudel

2015-02-04 11:11:47 EST

(In reply to Marco Trudel from comment #4)
> Created attachment 250505 [details]
> proposed patch
> 
> Fixes the concatenation of string parts.
> E.g. the parsing of "\1""2":
> - resulting string without the fix: "\12"
> - resulting string without the fix: "\0012"

"\0012" is of course _with_ the fix.

This fixes the parsing of real world applications:
- links, http://links.twibright.com/
- vim, http://www.vim.org/

Comment 6 Marco Trudel

2015-02-04 11:19:39 EST

(In reply to Nathan Ridge from comment #3)
> I don't necessarily think that's wrong. I'm not aware of
> IASTLiteralExpression.toString() making any kind of specific guarantees
> about what sort of string it returns, such as one with escape sequences
> processed.

As a user of CDT I assume that if the AST contains an IASTLiteralExpression, it represents the correct value as found in the original source code.
The AST of parsing "12 + 34" really shouldn't be:
+ binary expression of "56" and "78"

> If we want to expose the string literal with escape sequences processed in
> the AST, we could specify that toString() does so, or add a new API to
> IASTLiteralExpression, but that would be a new feature, not a bug.
> 
> Incorrect calculation of 'sizeof' for such literals, however, *is* a bug,
> because it can cause CDT to produce false-positive errors (and other IDE
> misbehaviour) in certain code.

Comment 7 Nathan Ridge

2015-02-04 11:25:15 EST

(In reply to Marco Trudel from comment #4)
> Created attachment 250505 [details]
> proposed patch
> 
> Fixes the concatenation of string parts.
> E.g. the parsing of "\1""2":
> - resulting string without the fix: "\12"
> - resulting string without the fix: "\0012"

I feel like I'm missing something here. Where does the "00" come from?

In any case, could you submit the patch to Gerrit please: https://wiki.eclipse.org/CDT/git#Using_Gerrit_for_CDT

Comment 8 Nathan Ridge

2015-02-04 11:26:23 EST

(In reply to Marco Trudel from comment #6)
> (In reply to Nathan Ridge from comment #3)
> > I don't necessarily think that's wrong. I'm not aware of
> > IASTLiteralExpression.toString() making any kind of specific guarantees
> > about what sort of string it returns, such as one with escape sequences
> > processed.
> 
> As a user of CDT I assume that if the AST contains an IASTLiteralExpression,
> it represents the correct value as found in the original source code.

Well, the characters found in the source _are_ '\', '0', and '0'.

Comment 9 Marco Trudel

2015-02-04 11:36:33 EST

(In reply to Nathan Ridge from comment #7)
> (In reply to Marco Trudel from comment #4)
> > Created attachment 250505 [details]
> > proposed patch
> > 
> > Fixes the concatenation of string parts.
> > E.g. the parsing of "\1""2":
> > - resulting string without the fix: "\12"
> > - resulting string without the fix: "\0012"
> 
> I feel like I'm missing something here. Where does the "00" come from?

Copy paste error. I meant: resulting string with the fix: "\0012"

In case you don't understand why the additional 00 is needed:
"\1""2" is a string with three elements: {'\1', '2', '\0'}
CDT makes "\12" out of that. This is a string with 2 elements: {'\12', '\0'}
Now the fix is padding the octal literal, so it's three elements again: {'\001', '2', '\0'}
The string CDT has in its AST is not the string that's in the source code.

> In any case, could you submit the patch to Gerrit please:
> https://wiki.eclipse.org/CDT/git#Using_Gerrit_for_CDT

Comment 10 Marco Trudel

2015-02-04 11:41:35 EST

(In reply to Nathan Ridge from comment #8)
> (In reply to Marco Trudel from comment #6)
> > (In reply to Nathan Ridge from comment #3)
> > > I don't necessarily think that's wrong. I'm not aware of
> > > IASTLiteralExpression.toString() making any kind of specific guarantees
> > > about what sort of string it returns, such as one with escape sequences
> > > processed.
> > 
> > As a user of CDT I assume that if the AST contains an IASTLiteralExpression,
> > it represents the correct value as found in the original source code.
> 
> Well, the characters found in the source _are_ '\', '0', and '0'.

No. The characters found in the source _are_ '"', '\', '0', '"', '"', '0', and '"'.
CDT says it found '"', '\', '0', '0', and '"'. This is not the same. If the different parts of the strings are merged, it has to be done correct.

Comment 11 Nathan Ridge

2015-02-04 13:12:35 EST

Ah, I didn't know about octal literals! Thanks for your patience, I understand now :)

So there seem to be two bugs here:

  (1) When merging different fragments of a string literal,
      we can unintentionally create an octal literal that,
      after being expanded, produces different characters
      than if the two fragments were expanded independently
      and then concatenated.

  (2) 'sizeof' does not process escape sequences when
      determining the size of a string.

Your patch seems to deal with (1). (Which is fine, we can fix (2) in a separate patch or bug). If you post it to Gerrit I'd be happy to review it.

Comment 12 Marco Trudel

2015-02-05 04:47:25 EST

(In reply to Nathan Ridge from comment #11)
> Ah, I didn't know about octal literals! Thanks for your patience, I
> understand now :)
> 
> So there seem to be two bugs here:
> 
>   (1) When merging different fragments of a string literal,
>       we can unintentionally create an octal literal that,
>       after being expanded, produces different characters
>       than if the two fragments were expanded independently
>       and then concatenated.

Exactly. Sorry, I should have added a little explanation instead of just an example.

>   (2) 'sizeof' does not process escape sequences when
>       determining the size of a string.

I don't care much about that. I found the binding/type/sizeof computation of CDT to be somewhat buggy/incomplete. It was easier to rewrite them from scratch myself. Maybe you could create a new bug for this issue?

> Your patch seems to deal with (1). (Which is fine, we can fix (2) in a
> separate patch or bug). If you post it to Gerrit I'd be happy to review it.

Will do. But to be honest, this is quite low on my todo list since this took me hours last time (Gerrit is much more complicated than just a pull request on GitHub).

Comment 13 Nathan Ridge

2015-02-05 12:16:24 EST

(In reply to Marco Trudel from comment #12)
> >   (2) 'sizeof' does not process escape sequences when
> >       determining the size of a string.
> 
> I don't care much about that. I found the binding/type/sizeof computation of
> CDT to be somewhat buggy/incomplete. It was easier to rewrite them from
> scratch myself. 

I'm trying to fix those bugs :) If you have any specific ones in mind, I would appreciate you filing bugs for them so that they can be fixed.

If at all possible, you may also want to consider contributing some of your reimplemented functionality back to the project, so that others can benefit from it.

> Maybe you could create a new bug for this issue?

Will do.

> > Your patch seems to deal with (1). (Which is fine, we can fix (2) in a
> > separate patch or bug). If you post it to Gerrit I'd be happy to review it.
> 
> Will do. But to be honest, this is quite low on my todo list since this took
> me hours last time (Gerrit is much more complicated than just a pull request
> on GitHub).

I would encourage you post to cdt-dev about this. If contributors are being discouraged from contributing due to difficulties with process, that needs to be addressed. Perhaps the project will consider maintaining a GitHub mirror, or something.

Comment 14 Nathan Ridge

2015-02-05 12:23:12 EST

(In reply to Nathan Ridge from comment #13)
> > Will do. But to be honest, this is quite low on my todo list since this took
> > me hours last time (Gerrit is much more complicated than just a pull request
> > on GitHub).
> 
> I would encourage you post to cdt-dev about this. If contributors are being
> discouraged from contributing due to difficulties with process, that needs
> to be addressed. Perhaps the project will consider maintaining a GitHub
> mirror, or something.

(That said, I personally find GitHub much less intuitive than Gerrit. It makes me maintain my own fork of the repo on GitHub, so that there are three clones I have to work with: 1) the official repo, 2) my fork on GitHub, 3) my local clone. I have to make sure I keep both (2) and (3) up to date with commits from (1), and contributing a patch involves two steps, first getting it into (2), and then into (3). By contrast, with Gerrit I only have to maintain my own local clone.)

Comment 15 Nathan Ridge

2015-02-06 01:46:55 EST

(In reply to Nathan Ridge from comment #13)
> > >   (2) 'sizeof' does not process escape sequences when
> > >       determining the size of a string.
> 
> > Maybe you could create a new bug for this issue?
> 
> Will do.

Filed bug 459279.