[ptp-dev] BUG report of PBS JAXB resource manager

Hi all,

Recently I have been trying PTP 5.0.6 with the PBS-Generic-Batch (LML_JAXB) resource manager.
The underlying RMS is PBS Pro 11.2.0, which works well on its own.
I ran into two problems:

1. With the default installation and configuration, starting the PBS-Generic-Batch resource manager
pops up an error window with the message:
"LML DA Driver (Local) has encountered a problem. Server finished with exit code 1".
The PBS-Generic-Batch resource manager then fails to start.

Temporary directories appear under the $HOME/.eclipsesettings directory, for example tmp_node0_31974 (here node0 is the hostname of the front-end node of the PBS cluster).
Each temporary directory contains a "report.log" file and a "request.xml" file.
The contents of report.log are as follows:

   ------------------------------------------------------------------------------------------
     LLVIEW Data Access Workflow Manager Driver 1.15, starting at (Thu Mar 22 05:09:19 CST 2012)
      command line args:
   ------------------------------------------------------------------------------------------
   LML_da_driver.pl: temporary directory not found, create new directory ./tmp_node0_31947 ...
   LML_da_driver.pl: tmpdir created (./tmp_node0_31947)
   LML_da_driver.pl: requestfile=-
   LML_da_driver.pl: parsing XML requestfile in 0.0027 sec
   LML_da_driver.pl: check request for rms hint ...
   LML_da_driver.pl: check_for rms, got hint from request ... (TORQUE)
   LML_da_driver.pl: check_rms_TORQUE: found pbsnodes by which (/opt/pbs/11.2.0.113417/bin/pbsnodes)
   LML_da_driver.pl: check_rms_TORQUE: found qstat by which (/opt/pbs/11.2.0.113417/bin/qstat)
   LML_da_driver.pl: check_rms_TORQUE: PBSpro found
   LML_da_driver.pl: check_rms_TORQUE: seems not to be a TORQUE system
   LML_da_driver.pl: rms/TORQUE/da_check_info_LML.pl unable to locate rms TORQUE
   LML_da_driver.pl: ERROR LML_da_driver.pl: could not determine rms, exiting ...


It seems that LML_da_driver.pl assumes that the underlying RMS is TORQUE, not PBS.
If I modify LML_da_driver.pl to force the rms to PBS (just before it echoes "check_for rms, got hint from request ..."), everything works well.
How should I configure LML_da_driver so that it detects the correct underlying RMS?
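
As a rough illustration of the distinction the driver has to make, the sketch below classifies a batch system from the text of its version banner. It is not taken from LML_da_driver.pl. The function name classify_rms is hypothetical, and the banner patterns (a "pbs_version" or "PBSPro" marker for PBS Pro, a plain "version" line for TORQUE) are assumptions that should be checked against the real output on the cluster.

```shell
#!/bin/bash
# Hypothetical helper, not part of LML_da_driver.pl.
# Classifies a batch system from its version banner text.
# Assumption: PBS Pro banners contain "pbs_version" or "PBSPro";
# any other banner mentioning a version is treated as TORQUE-like.
classify_rms() {
    local banner="$1"
    case "$banner" in
        *pbs_version*|*PBSPro*) echo "PBS" ;;     # checked first: "pbs_version" also contains "version"
        *[Vv]ersion*)           echo "TORQUE" ;;
        *)                      echo "unknown" ;;
    esac
}

# Possible use on a live system (assumes this qstat accepts --version):
#   classify_rms "$(qstat --version 2>&1)"
```

With a check of this kind, a hint of TORQUE in the request could fall back to PBS when the banner identifies PBS Pro, instead of aborting.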

2. Even after the PBS-Generic-Batch resource manager starts successfully, I still cannot launch a batch job from within PTP.

The following is the batch script generated by PTP from my job configuration:
-------------------------------------------------------------------------
#!/bin/bash
#PBS -q workq
#PBS -N ptp_job
#PBS -l nodes=1
#PBS -l walltime=00:30:00
#PBS -V
MPI_ARGS="-np 4"
if [ "-np" == "${MPI_ARGS}" ] ; then
 MPI_ARGS=
fi
COMMAND=mpirun
if [ -n "${COMMAND}" ] ; then
 COMMAND="${COMMAND} ${MPI_ARGS} /vol/test/demoApp/Debug/testMPI "
else
 COMMAND="/vol/test/demoApp/Debug/testMPI "
fi
cd /home/jiangjie
${COMMAND}

-------------------------------------------------------------------------

The following is the configuration output:
-------------------------------------------------------------------------
Job_Name=ptp_job
Resource_List.nodes=1
Resource_List.walltime=00:30:00
control.address=localhost
control.queue.name=workq
control.user.name=jiangjie
control.working.dir=/home/jiangjie
current_controller=Basic.PBS.Settings
destination=workq
directory=/home/jiangjie
enabled_Basic.PBS.Settings=Account_Name Job_Name Resource_List.mem Resource_List.nodes Resource_List.walltime destination export_all mpiCommand mpiCores
executableDirectory=/vol/test/demoApp/Debug
executablePath=/vol/test/demoApp/Debug/testMPI
export_all=-V
invalid_Basic.PBS.Settings=script_path
managed_file_for_script=/home/jiangjie/home/jiangjie/7bafb232-c374-4eec-8687-6ffc653d86a1managed_file_for_script
mpiCommand=mpirun
mpiCores=4
org.eclipse.debug.core.appendEnvironmentVariables=true
org.eclipse.ptp.launch.ATTR_CONSOLE=true
org.eclipse.ptp.launch.ATTR_COPY_EXECUTABLE_FROM_LOCAL=false
org.eclipse.ptp.launch.ATTR_REMOTE_EXECUTABLE_PATH=/vol/test/demoApp/Debug/testMPI
org.eclipse.ptp.launch.ATTR_SYNC_AFTER=false
org.eclipse.ptp.launch.ATTR_SYNC_BEFORE=false
org.eclipse.ptp.launch.ATTR_SYNC_RULES=[]
org.eclipse.ptp.launch.PROJECT_ATTR=testMPI
org.eclipse.ptp.launch.RESOURCE_MANAGER_NAME=eeb2b0e5-4035-4131-8f64-7e38ab9d179c
ptpDirectory=/home/jiangjie/.eclipsesettings
queues=[workq]
script=#!/bin/bash
#PBS -q workq
#PBS -N ptp_job
#PBS -l nodes=1
#PBS -l walltime=00:30:00
#PBS -V
MPI_ARGS="-np 4"
if [ "-np" == "${MPI_ARGS}" ] ; then
 MPI_ARGS=
fi
COMMAND=mpirun
if [ -n "${COMMAND}" ] ; then
 COMMAND="${COMMAND} ${MPI_ARGS} /vol/test/demoApp/Debug/testMPI "
else
 COMMAND="/vol/test/demoApp/Debug/testMPI "
fi
cd /home/jiangjie
${COMMAND}

stderr_remote_path=${ptp_rm:directory#value}/${ptp_rm:Job_Name#value}.e${ptp_rm:@jobId#default}
stdout_remote_path=${ptp_rm:directory#value}/${ptp_rm:Job_Name#value}.o${ptp_rm:@jobId#default}
valid_Basic.PBS.Settings=Account_Name Job_Name Resource_List.mem Resource_List.nodes Resource_List.walltime bindir control.address control.queue.name control.user.name control.working.dir current_controller destination directory enabled_Basic.PBS.Settings executableDirectory executablePath export_all invalid_Basic.PBS.Settings managed_file_for_script mpiCommand mpiCores ptpDirectory queues script stderr_remote_path stdout_remote_path valid_Basic.PBS.Settings visible_Basic.PBS.Settings
visible_Basic.PBS.Settings=Account_Name Job_Name Resource_List.mem Resource_List.nodes Resource_List.walltime destination export_all mpiCommand mpiCores
-------------------------------------------------------------------------------

Note the output line
"managed_file_for_script=/home/jiangjie/home/jiangjie/7bafb232-c374-4eec-8687-6ffc653d86a1managed_file_for_script".
The correct path to the batch script should be "/home/jiangjie/7bafb232-c374-4eec-8687-6ffc653d86a1managed_file_for_script".
The console also shows "submit-batch: 9954302e-5f0c-4bbd-bd17-805df85936d1: qsub /home/jiangjie/home/jiangjie/7bafb232-c374-4eec-8687-6ffc653d86a1managed_file_for_script".
The wrong script path, with /home/jiangjie prefixed twice, may be what causes the job launch to hang.
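
The doubled prefix is what one would get if the working directory were prepended to a path that is already absolute. The sketch below reproduces that effect and shows a guarded join that avoids it. It is not PTP code: resolve_path is a hypothetical helper, and the idea that PTP builds the path by plain concatenation is an assumption, not something confirmed from its source.

```shell
#!/bin/bash
# Hypothetical helper, not PTP code: resolve a path against a base
# directory without prefixing a path that is already absolute.
resolve_path() {
    local base="$1" path="$2"
    case "$path" in
        /*) echo "$path" ;;              # already absolute: use as-is
        *)  echo "${base%/}/${path}" ;;  # relative: prefix the base once
    esac
}

# Values copied from the configuration output above.
workdir=/home/jiangjie
script=/home/jiangjie/7bafb232-c374-4eec-8687-6ffc653d86a1managed_file_for_script

# Assumed failure mode: blind concatenation doubles the prefix.
naive="${workdir}${script}"
# Guarded join keeps the absolute path unchanged.
fixed="$(resolve_path "$workdir" "$script")"

echo "naive: $naive"
echo "fixed: $fixed"
```

Running qsub by hand on the path printed as "fixed" would show whether the job itself is accepted once the path is correct.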

How can I fix it?


Regards,
Jie
 		 	   		  
