Bug 492609 - Unable to delete module that failed to correctly deploy
Summary: Unable to delete module that failed to correctly deploy
Status: RESOLVED FIXED
Alias: None
Product: CFT
Classification: ECD
Component: General (show other bugs)
Version: unspecified   Edit
Hardware: PC Mac OS X
: P3 normal
Target Milestone: 1.0 RC1   Edit
Assignee: Elson Yuen CLA
QA Contact:
URL:
Whiteboard:
Keywords: plan
Depends on:
Blocks:
 
Reported: 2016-04-27 18:00 EDT by Nieraj Singh CLA
Modified: 2016-05-10 15:56 EDT (History)
1 user (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Nieraj Singh CLA 2016-04-27 18:00:49 EDT
This recurring error should be fixed by RC1.

When an application failed to deploy due to errors in publish operation, for example:

- wrong buildpack

The module cannot be deleted because in DeleteModulesOperation, module deletion is stopped due to this check:

if (appModule == null || !appModule.isDeployed()) {
				continue;
}

This check is in place for another issue:

Bug 485228

We need to better distinguish preventing deleting of existing app on CF when cancelling deployment, which the above fixes, compared to a user explicitly deleting a module.

Cleaning the server does not fix the problem
Comment 1 Nieraj Singh CLA 2016-04-28 17:23:52 EDT
To duplicate this error, add/remove or drag/drop a new application onto a CF server, and in the deployment wizard enter an incorrect buildpack.

Once CF shows an error, try deleting the module. The module appears to be deleted in the Servers view, but when deploying the application again, the deployment wizard does not open and the application is auto-deployed with the wrong buildpack.

The user does not have the option to change the buildpack on any subsequent publish even if they deleted the module from the view (but clearly not from the server).
Comment 2 Nieraj Singh CLA 2016-04-28 17:26:16 EDT
A possible cause for this error is that if an error occurs in a publish operation, and the application is successfully created in CF before the error occurs, the publish state of the module is still UNKNOWN even though the application exists in CF.

To find the cause of the error, trace where module publish state is set to UNKNOWN when an error occurs during any publish operation. If the application already exists in CF when publish error occurs, the publish state should reflect the state of the app in CF (e.g. STOP, START, etc..).
Comment 3 Elson Yuen CLA 2016-05-05 15:15:40 EDT
I take a look at the problem. Based on my investigation, the problem is indeed coming from apppModule.isDeployed() return false.  However, the problem is not coming from the module status is unknown but the module does not exist at all.

In org.eclipse.cft.server.core.internal.client.CloudFoundryApplicationModule.isDeployed(), it has:
exists() && getState() != IServer.STATE_UNKNOWN

exists() is false and I also check the actual CF server and the application itself does not exist in the system.  Therefore, the cause of the problem is the cloud server in Eclipse still keeps the cloud module around on the server that is out of sync with the actual server itself.

Therefore, the fix is to remove the cloud module application from the server instance to match that the actual CF server does not really have that application exist.
Comment 4 Elson Yuen CLA 2016-05-05 15:19:28 EDT
Code released to master
Comment 5 Nieraj Singh CLA 2016-05-05 15:45:15 EDT
Thanks for looking at this.

I also took a look, but I came up with a different solution

I believe generally the problem is what you mentioned, that locally we are holding on to the module even if app does not exist, so Server is out of synch with CF.

But I believe the code that does this is this internal "cache" of modules that are being deployed. See ModuleCache, which tags and untags a module during publish to prevent a module that is being added from also being deleted by a parallel process.

The problem is that if error occurs during publish, this cache does not "untag" the module that was being added, resulting in the cache holding on to the module, whether the app was created in CF or not.

So next time a user tries to delete the module, when addRemoveModules(..) is called on CloudFoundryServer, the logic that checks ModuleCache will prevent the module from being deleted from the cache, as it is still being tagged as "in deployment"

I believe a suitable fix is:

1. Rename the "tag/untag" methods in ModuleCache to better reflect what they really are doing, which is prevent a module that is being added from accidentally being deleted by a parallel process in:

org.eclipse.cft.server.core.internal.CloudFoundryServer.addAndDeleteModules()

A better rename for these methods I think would be:

tagAsUndeployed(IModule module) -> tagAddModuleInProgress(IModule)
tagAsDeployed(IModule module) -> tagAddModuleCompleted(IModule)
isUndeployed(IModule) -> isAddingModule(IModule)

2. When CoreException is handled in AbstractPublishApplicationOperation:

Here we do an update of modules, as expected, but we also forget to untag the module from the cache. The module should be untagged any time a publish operation terminates, whether it successfully published the module, or it failed. I believe here is the actual error.

I was able to verify that if we do a proper untag of module in step two above (so in the error handling code in AbstractPublishApplicationOperation)  , the Server stays correctly in synch with Cloud Foundry:

A. If publish failed, but app was created, the "update modules" step in error handler correctly creates the module in Server
B. If publish failed, but app was not created, the "update modules" step in error handler correctly removes the module in Server
Comment 6 Nieraj Singh CLA 2016-05-05 16:04:54 EDT
My solution was in error handling in AbstractPublishApplicationOperation:

catch (CoreException e) {

// Untag module from cache			getBehaviour().getCloudFoundryServer().tagAsCompleted(getModule());

// Proceed with regular update module
getBehaviour().operations().updateModule(getModule()).run(monitor);

...

}
Comment 7 Elson Yuen CLA 2016-05-05 16:51:14 EDT
My original fix was based on the fact that the cloud application has been created in the PushApplication operation so it should be cleaned up before leaving the method if the application has not been created yet.

Based on the latest findings on comment #5, there is a better way to fix the problem that will cover the another failing scenario as described in the comment.  

Therefore, I am reopening this and I am going to roll back my change.  On a offline discussion with Nieraj, we also need to similar module update operation on the operation cancel case as well since I notice that cancelling the publish operation may cause the same out of sync problem as well.
Comment 8 Nieraj Singh CLA 2016-05-10 15:56:45 EDT
Pushed the fix to master. Now modules are being cleaned from the internal cache both when:

1. Operation is cancelled
2. Any errors handled during publish operation

Also added a test case to test the fix for condition 2 above. Testing cancellation reliably is a bit more challenging than testing for errors during publish, and requires additional testing harness support for asynch execution so it will be omitted.