
Re: [ptp-dev] BUG report of PBS JAXB resource manager

Jie,

Yes, the RM is a bit misnamed. It should really be Torque-Generic-Batch. 

In order to use PBS, you need to import the RM definition into your workspace. Go to Import..., then open Resource Managers>Resource Manager Definition From Plug-in. Choose the PBS-Generic-Batch RM from the combo. This should create a folder in your workspace called resourceManagers. Open it and double-click the XML file. Go to the end of the file and change line 828 from '<monitor-data schedulerType="TORQUE">' to '<monitor-data schedulerType="PBS">'.
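
If you'd rather make that edit from a shell, a one-liner along these lines should work (the file name and workspace path below are guesses on my part; use whatever the import actually created):

   # Adjust the path to wherever the resourceManagers folder was created;
   # the file name is an assumption.
   sed -i 's|<monitor-data schedulerType="TORQUE">|<monitor-data schedulerType="PBS">|' \
       ~/workspace/resourceManagers/pbs-generic-batch.xml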

Now, go to the System Monitoring perspective and add a new RM. Use the definition you just edited (should have "(1)" after the name).

You can also edit the batch script that gets generated to see if that fixes the hanging problem.
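
For example, you could submit the generated script by hand, outside of PTP, to see whether qsub itself hangs (the path below is the corrected one from your console output; substitute the actual file name):

   # Submit the PTP-generated script directly to isolate the hang
   qsub /home/jiangjie/7bafb232-c374-4eec-8687-6ffc653d86a1managed_file_for_script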

If you're planning to change the XML after you have used it to create an RM, go to Preferences, then select Parallel Tools>Resource Managers>Configurable Resource Manager and check the "Always load XML from URL" option. Stopping and restarting the RM will then reload the new definition.

Cheers,
Greg

On Mar 21, 2012, at 10:03 AM, JiangJie wrote:

> 
> Hi all,
> 
> Recently I have been trying PTP 5.0.6 with the PBS-Generic-Batch (LML_JAXB) resource manager.
> The underlying RMS is PBS Pro 11.2.0, and PBS itself works well.
> I ran into two problems:
> 
> 1. With the default installation and configuration, starting the PBS-Generic-Batch resource manager
> pops up an error window with the message:
> "LML DA Driver (Local) has encountered a problem. Server finished with exit code 1".
> The PBS-Generic-Batch resource manager then fails to start.
> 
> Under the $HOME/.eclipsesettings directory, some temporary directories appear, for example tmp_node0_31974 (here node0 is the hostname of the front-end node of the PBS cluster).
> Each temporary directory contains a "report.log" and a "request.xml" file.
> Here are the contents of report.log:
> 
>    ------------------------------------------------------------------------------------------
>      LLVIEW Data Access Workflow Manager Driver 1.15, starting at (Thu Mar 22 05:09:19 CST 2012)
>       command line args:
>    ------------------------------------------------------------------------------------------
>    LML_da_driver.pl: temporary directory not found, create new directory ./tmp_node0_31947 ...
>    LML_da_driver.pl: tmpdir created (./tmp_node0_31947)
>    LML_da_driver.pl: requestfile=-
>    LML_da_driver.pl: parsing XML requestfile in 0.0027 sec
>    LML_da_driver.pl: check request for rms hint ...
>    LML_da_driver.pl: check_for rms, got hint from request ... (TORQUE)
>    LML_da_driver.pl: check_rms_TORQUE: found pbsnodes by which (/opt/pbs/11.2.0.113417/bin/pbsnodes)
>    LML_da_driver.pl: check_rms_TORQUE: found qstat by which (/opt/pbs/11.2.0.113417/bin/qstat)
>    LML_da_driver.pl: check_rms_TORQUE: PBSpro found
>    LML_da_driver.pl: check_rms_TORQUE: seems not to be a TORQUE system
>    LML_da_driver.pl: rms/TORQUE/da_check_info_LML.pl unable to locate rms TORQUE
>    LML_da_driver.pl: ERROR LML_da_driver.pl: could not determine rms, exiting ...
> 
> 
> It seems that LML_da_driver.pl assumes that the underlying RMS is TORQUE, not PBS. 
> If I modify LML_da_driver.pl and force the rms to PBS (just before it echoes "check_for rms, got hint from request ..."), everything works well.
> So how should LML_da_driver be configured so that it detects the correct underlying RMS?
> 
> 2. Even after starting the PBS-Generic-Batch resource manager successfully, I still cannot launch a batch job from within PTP.
> 
> The following is the batch script generated by PTP from my job configuration:
> -------------------------------------------------------------------------
> #!/bin/bash
> #PBS -q workq
> #PBS -N ptp_job
> #PBS -l nodes=1
> #PBS -l walltime=00:30:00
> #PBS -V
> MPI_ARGS="-np 4"
> if [ "-np" == "${MPI_ARGS}" ] ; then
>  MPI_ARGS=
> fi
> COMMAND=mpirun
> if [ -n "${COMMAND}" ] ; then
>  COMMAND="${COMMAND} ${MPI_ARGS} /vol/test/demoApp/Debug/testMPI "
> else
>  COMMAND="/vol/test/demoApp/Debug/testMPI "
> fi
> cd /home/jiangjie
> ${COMMAND}
> 
> -------------------------------------------------------------------------
> 
> And the following is the configuration output:
> -------------------------------------------------------------------------
> Job_Name=ptp_job
> Resource_List.nodes=1
> Resource_List.walltime=00:30:00
> control.address=localhost
> control.queue.name=workq
> control.user.name=jiangjie
> control.working.dir=/home/jiangjie
> current_controller=Basic.PBS.Settings
> destination=workq
> directory=/home/jiangjie
> enabled_Basic.PBS.Settings=Account_Name Job_Name Resource_List.mem Resource_List.nodes Resource_List.walltime destination export_all mpiCommand mpiCores
> executableDirectory=/vol/test/demoApp/Debug
> executablePath=/vol/test/demoApp/Debug/testMPI
> export_all=-V
> invalid_Basic.PBS.Settings=script_path
> managed_file_for_script=/home/jiangjie/home/jiangjie/7bafb232-c374-4eec-8687-6ffc653d86a1managed_file_for_script
> mpiCommand=mpirun
> mpiCores=4
> org.eclipse.debug.core.appendEnvironmentVariables=true
> org.eclipse.ptp.launch.ATTR_CONSOLE=true
> org.eclipse.ptp.launch.ATTR_COPY_EXECUTABLE_FROM_LOCAL=false
> org.eclipse.ptp.launch.ATTR_REMOTE_EXECUTABLE_PATH=/vol/test/demoApp/Debug/testMPI
> org.eclipse.ptp.launch.ATTR_SYNC_AFTER=false
> org.eclipse.ptp.launch.ATTR_SYNC_BEFORE=false
> org.eclipse.ptp.launch.ATTR_SYNC_RULES=[]
> org.eclipse.ptp.launch.PROJECT_ATTR=testMPI
> org.eclipse.ptp.launch.RESOURCE_MANAGER_NAME=eeb2b0e5-4035-4131-8f64-7e38ab9d179c
> ptpDirectory=/home/jiangjie/.eclipsesettings
> queues=[workq]
> script=#!/bin/bash
> #PBS -q workq
> #PBS -N ptp_job
> #PBS -l nodes=1
> #PBS -l walltime=00:30:00
> #PBS -V
> MPI_ARGS="-np 4"
> if [ "-np" == "${MPI_ARGS}" ] ; then
>  MPI_ARGS=
> fi
> COMMAND=mpirun
> if [ -n "${COMMAND}" ] ; then
>  COMMAND="${COMMAND} ${MPI_ARGS} /vol/test/demoApp/Debug/testMPI "
> else
>  COMMAND="/vol/test/demoApp/Debug/testMPI "
> fi
> cd /home/jiangjie
> ${COMMAND}
> 
> stderr_remote_path=${ptp_rm:directory#value}/${ptp_rm:Job_Name#value}.e${ptp_rm:@jobId#default}
> stdout_remote_path=${ptp_rm:directory#value}/${ptp_rm:Job_Name#value}.o${ptp_rm:@jobId#default}
> valid_Basic.PBS.Settings=Account_Name Job_Name Resource_List.mem Resource_List.nodes Resource_List.walltime bindir control.address control.queue.name control.user.name control.working.dir current_controller destination directory enabled_Basic.PBS.Settings executableDirectory executablePath export_all invalid_Basic.PBS.Settings managed_file_for_script mpiCommand mpiCores ptpDirectory queues script stderr_remote_path stdout_remote_path valid_Basic.PBS.Settings visible_Basic.PBS.Settings
> visible_Basic.PBS.Settings=Account_Name Job_Name Resource_List.mem Resource_List.nodes Resource_List.walltime destination export_all mpiCommand mpiCores
> -------------------------------------------------------------------------------
> 
> Note the output line 
> "managed_file_for_script=/home/jiangjie/home/jiangjie/7bafb232-c374-4eec-8687-6ffc653d86a1managed_file_for_script".
> The correct path to the batch script should be "/home/jiangjie/7bafb232-c374-4eec-8687-6ffc653d86a1managed_file_for_script"!
> And the console also shows "submit-batch: 9954302e-5f0c-4bbd-bd17-805df85936d1: qsub /home/jiangjie/home/jiangjie/7bafb232-c374-4eec-8687-6ffc653d86a1managed_file_for_script".
> Maybe it is this wrong script path that causes the job launch to hang.
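> 
> To illustrate, the doubled prefix looks like the working directory being prepended to a path that is already absolute. A rough shell sketch of what the output suggests (just an illustration, not PTP's actual code):
> 
>    # control.working.dir from the configuration output above
>    WORKING_DIR=/home/jiangjie
>    # absolute path where the managed script was staged (the correct location)
>    SCRIPT=/home/jiangjie/7bafb232-c374-4eec-8687-6ffc653d86a1managed_file_for_script
>    # naive concatenation reproduces the doubled path passed to qsub
>    echo "${WORKING_DIR}${SCRIPT}"
>    # -> /home/jiangjie/home/jiangjie/7bafb232-c374-4eec-8687-6ffc653d86a1managed_file_for_script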
> 
> How can this be fixed?
> 
> 
> Regards,
> Jie
> 
> _______________________________________________
> ptp-dev mailing list
> ptp-dev@xxxxxxxxxxx
> https://dev.eclipse.org/mailman/listinfo/ptp-dev


