I'm using Grid Engine 6.2u5 and Open MPI 1.3.3. I'm submitting a parallel
job and would like to specify a rankfile to set processor binding, but I'm
getting errors.
The $PE_HOSTFILE generated by gridengine is:
amos.cora.nwra.com 4 clouds.q_at_[hidden] UNDEFINED
andrew.cora.nwra.com 4 clouds.q_at_[hidden] UNDEFINED
The rankfile I'm using is:
rank 0=amos.cora.nwra.com slot=0
rank 1=andrew.cora.nwra.com slot=0
rank 2=amos.cora.nwra.com slot=4
rank 3=andrew.cora.nwra.com slot=4
rank 4=amos.cora.nwra.com slot=1
rank 5=andrew.cora.nwra.com slot=1
rank 6=amos.cora.nwra.com slot=5
rank 7=andrew.cora.nwra.com slot=5
The error I'm getting is:
Rankfile claimed host amos.cora.nwra.com that was not allocated or
oversubscribed it's slots:
--------------------------------------------------------------------------
[amos:05727] [[44126,0],0] ORTE_ERROR_LOG: Bad parameter in file
rmaps_rank_file.c at line 108
[amos:05727] [[44126,0],0] ORTE_ERROR_LOG: Bad parameter in file
base/rmaps_base_map_job.c at line 87
[amos:05727] [[44126,0],0] ORTE_ERROR_LOG: Bad parameter in file
base/plm_base_launch_support.c at line 77
[amos:05727] [[44126,0],0] ORTE_ERROR_LOG: Bad parameter in file
plm_rsh_module.c at line 990
--------------------------------------------------------------------------
A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
launch so we are aborting.
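Since the message mentions hosts that were "not allocated" or "oversubscribed", here is a minimal Python sketch of a sanity check along those lines. It is not part of Open MPI or Grid Engine, and the function names are hypothetical. It only assumes that the first two fields of each $PE_HOSTFILE line are the host name and the slot count, and it compares the number of ranks per host in the rankfile against that count. It does not try to interpret the slot= values themselves.

```python
from collections import Counter

# The two files exactly as posted above.
PE_HOSTFILE = """\
amos.cora.nwra.com 4 clouds.q_at_[hidden] UNDEFINED
andrew.cora.nwra.com 4 clouds.q_at_[hidden] UNDEFINED
"""

RANKFILE = """\
rank 0=amos.cora.nwra.com slot=0
rank 1=andrew.cora.nwra.com slot=0
rank 2=amos.cora.nwra.com slot=4
rank 3=andrew.cora.nwra.com slot=4
rank 4=amos.cora.nwra.com slot=1
rank 5=andrew.cora.nwra.com slot=1
rank 6=amos.cora.nwra.com slot=5
rank 7=andrew.cora.nwra.com slot=5
"""


def parse_pe_hostfile(text):
    """Return {host: allocated_slots}, using only the first two fields."""
    alloc = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            alloc[fields[0]] = alloc.get(fields[0], 0) + int(fields[1])
    return alloc


def parse_rankfile(text):
    """Return a list of (rank, host, slot) from 'rank N=host slot=S' lines."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("rank "):
            continue
        left, slot = line.split(" slot=")
        rank_part, host = left.split("=", 1)
        entries.append((int(rank_part.split()[1]), host.strip(), slot.strip()))
    return entries


def check_rankfile(alloc, entries):
    """List hosts that are missing from the allocation or given too many ranks."""
    problems = []
    ranks_per_host = Counter(host for _, host, _ in entries)
    for host, count in ranks_per_host.items():
        if host not in alloc:
            problems.append(f"{host}: not in the allocation")
        elif count > alloc[host]:
            problems.append(f"{host}: {count} ranks but only {alloc[host]} slots")
    return problems


if __name__ == "__main__":
    # Prints [] for the files above: 4 ranks on each host, 4 slots allocated.
    print(check_rankfile(parse_pe_hostfile(PE_HOSTFILE), parse_rankfile(RANKFILE)))
```

By this count both hosts appear in the allocation and neither is given more ranks than it has slots, so a simple rank-per-host mismatch does not seem to explain the error.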
Any ideas?
Thanks!
- Orion
--
Orion Poplawski
Technical Manager 303-415-9701 x222
NWRA/CoRA Division FAX: 303-415-9702
3380 Mitchell Lane orion_at_[hidden]
Boulder, CO 80301 http://www.cora.nwra.com