[ptp-user] Job submission problems with PTP 5.0.3 and Torque RM

Hello all,

After the recent 5.0.3 update, I'm back to trying to get PTP working well on a cluster to which I have access.  The cluster uses Torque 2.5.7 as its batch queue software.  PTP 5.0.3 and Eclipse 3.7.1 don't seem to be able to talk fully to this Torque installation, and there are other problems that I can't diagnose.  Perhaps someone else has already figured these out, or has suggestions for how to fix or work around them?

For the following description I'm in the Parallel Runtime perspective, and working with a local MPI-based C++ project.

The first hint that something is wrong is that it isn't clear whether the PBS-Generic-Batch resource manager is fully started.  I created a stock PBS-Generic-Batch RM.  I use the context menu to start the RM, and after a second or so the RM icon changes from grey to green.  However, if I select it and look at the properties view, the properties view still indicates the RM is in the STOPPED state, with "num machines" and "num queues" both 0.  Furthermore, nothing shows up in the "Machines" or "Jobs list" views.  These views seem contradictory - the RM view shows that it has started, but the others suggest it has not.

I'm able to create a Run Configuration for my program.  Interestingly, the set of available queues in the "Run Configuration" dialog's Resources tab is the set of queues on our system, so PTP must have been able to obtain the correct set of queues.

If I attempt to Run my Run Configuration (i.e., submit it to the batch queue), the progress view gets as far as saying 'submit-batch' with a large hex number (a GUID?) but sticks at 75%.  PTP creates a file in my home directory named $(GUID)managed_file_for_script, but with a different GUID from the one shown with the submit-batch progress bar.

Eventually I have to cancel the submit-batch operation in the progress view.  When I do so, a message is displayed on screen and written to the workspace log file saying that the qsub command failed because it couldn't find the batch script.  The message shows that the command was trying to use the path $HOME$HOME$GUIDmanaged_file_for_script (i.e., the home directory path appears twice).  I can't see how to modify the PBS-Generic-Batch XML file to keep it from building the path with $HOME twice.
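
To illustrate the kind of path the error message shows, here is a minimal Python sketch of one way a doubled prefix can arise: prepending a base directory to a path that is already absolute.  Whether PTP actually does this is an assumption, not something confirmed here; the function names, the home directory and the script name are all hypothetical.

```python
import posixpath

def naive_prefix(base_dir, path):
    """Prepend base_dir unconditionally (hypothetical buggy behaviour)."""
    return base_dir + path

def safe_prefix(base_dir, path):
    """Prepend base_dir only when path is relative.

    posixpath.join discards base_dir when path is already absolute.
    """
    return posixpath.join(base_dir, path)

# Hypothetical values standing in for $HOME and the generated script path.
home = "/home/user"
script = "/home/user/1234managed_file_for_script"

doubled = naive_prefix(home, script)   # home directory appears twice
correct = safe_prefix(home, script)    # unchanged, since script is absolute
```

With these values, `doubled` has the $HOME$HOME shape reported by qsub, while `correct` is the original script path.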

Just to explore, I created the directories and a symlink so that $HOME$HOME existed and pointed to $HOME.  PTP was able to submit the job but it produced no output to the Console view nor did it change the Machines or Jobs list views.
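
The workaround above can be sketched in Python as follows.  This is only an illustration of the directory-plus-symlink trick, run against a temporary stand-in directory rather than a real home directory; `link_doubled_home` and `demo` are hypothetical names, and the script file name is shortened (no GUID prefix).

```python
import os
import tempfile

def link_doubled_home(home):
    """Make the path home+home resolve to home via a symlink.

    Creates the intermediate directories, then a symlink at home+home
    that points back to home.  Returns the doubled path.
    """
    doubled = home + home
    os.makedirs(os.path.dirname(doubled), exist_ok=True)
    if not os.path.lexists(doubled):
        os.symlink(home, doubled, target_is_directory=True)
    return doubled

def demo():
    """Check that a file in the stand-in home is visible via the doubled path."""
    home = tempfile.mkdtemp()  # stand-in for $HOME
    script = os.path.join(home, "managed_file_for_script")
    open(script, "w").close()
    doubled = link_doubled_home(home)
    return os.path.exists(os.path.join(doubled, "managed_file_for_script"))
```

This only makes the wrong path resolve; it does nothing about why the path is built with the home directory twice.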

Does anyone have any ideas?

Phil Roth

P.S. Using the diagnostic advice given previously on this list, I found that the LML_da_driver.pl script does not correctly detect the Torque version.  When given the --version flag, the qstat command in Torque 2.5.7 writes its output to stderr, and the script sends stderr to /dev/null.  If I change the script so that it redirects stderr to stdout for this test, the script determines the version correctly, but that has no effect on the problems I describe above.
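
The stderr issue can be reproduced outside the script with a short Python sketch.  This is not the LML_da_driver.pl code; the helper names are hypothetical, the exact text qstat prints is an assumption, and a small Python child process stands in for `qstat --version` so that the example runs without Torque installed.

```python
import re
import subprocess
import sys

def capture_output(cmd, merge_stderr):
    """Run cmd and return its stdout as text.

    With merge_stderr=True, stderr is folded into stdout (like 2>&1);
    otherwise stderr is discarded (like 2>/dev/null).
    """
    stderr = subprocess.STDOUT if merge_stderr else subprocess.DEVNULL
    result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=stderr,
                            text=True, check=False)
    return result.stdout

def parse_version(text):
    """Return the first dotted version number found in text, or None."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    return match.group(0) if match else None

# Stand-in for `qstat --version`: writes an assumed version line to stderr.
fake_qstat = [sys.executable, "-c",
              "import sys; sys.stderr.write('version: 2.5.7\\n')"]

discarded = parse_version(capture_output(fake_qstat, merge_stderr=False))
merged = parse_version(capture_output(fake_qstat, merge_stderr=True))
```

Here `discarded` is None because the version line never reaches stdout, while `merged` picks it up once stderr is redirected.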


-- 
Philip C. Roth | +1 865 241-1543 | http://ft.ornl.gov/~rothpc




