I'm afraid the rank-file mapper in 1.3.3 has several known problems that users have described on this list. Hopefully those are fixed in the upcoming 1.3.4 release.
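
Until then, it may be worth ruling out a problem in the files themselves. Below is a minimal Python sketch of such a check. It is not Open MPI's own logic, and the function names (parse_rankfile, parse_hostfile, check) are hypothetical. It assumes only the "rank N=host slot=S" and "host slots=N" line forms shown in the quoted message, and it assumes a host listed without slots= counts as one slot.

```python
import re

# Matches lines such as: rank 0=drdb0235.en slot=0:0
RANK_RE = re.compile(r"^rank\s+(\d+)\s*=\s*(\S+)\s+slot\s*=\s*(\S+)$")


def parse_rankfile(text):
    """Return a list of (rank, host, slot) tuples from rankfile text."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = RANK_RE.match(line)
        if m is None:
            raise ValueError(f"unparseable rankfile line: {line!r}")
        entries.append((int(m.group(1)), m.group(2), m.group(3)))
    return entries


def parse_hostfile(text):
    """Return {host: slots}; a missing slots= is assumed to mean 1."""
    hosts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        slots = 1
        for field in fields[1:]:
            if field.startswith("slots="):
                slots = int(field.split("=", 1)[1])
        hosts[fields[0]] = slots
    return hosts


def check(rank_text, host_text, nprocs):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    entries = parse_rankfile(rank_text)
    hosts = parse_hostfile(host_text)

    # Every rank 0..nprocs-1 should appear exactly once.
    ranks = sorted(rank for rank, _, _ in entries)
    if ranks != list(range(nprocs)):
        problems.append(f"ranks {ranks} do not match 0..{nprocs - 1}")

    # Each host named in the rankfile should be in the hostfile,
    # with at least as many slots as ranks assigned to it.
    per_host = {}
    for _, host, _ in entries:
        per_host[host] = per_host.get(host, 0) + 1
    for host, used in per_host.items():
        if host not in hosts:
            problems.append(f"{host} is not in the hostfile")
        elif used > hosts[host]:
            problems.append(f"{host}: {used} ranks but only {hosts[host]} slots")
    return problems


# The files exactly as posted in the quoted message.
RANKFILE = """\
rank 0=drdb0235.en slot=0:0
rank 1=drdb0235.en slot=0:1
rank 2=drdb0235.en slot=0:2
rank 3=drdb0235.en slot=0:3
"""
HOSTFILE = "drdb0235.en slots=8\n"

print(check(RANKFILE, HOSTFILE, 4))  # prints []
```

For the files as posted with -n 4, this sketch reports no problems, which would be consistent with the error coming from the mapper rather than from the rankfile or hostfile.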


On Aug 31, 2009, at 10:01 AM, Sacerdoti, Federico wrote:

Hi,
 
I am trying to use the rankmap to bind a 4-process MPI job to one socket of a two-socket, 8-core machine. However, I'm getting a strange error.
 
CMDS USED
orterun --hostfile hostlist.1 -n 4  --mca rmaps_rank_file_path ./rankmap.1 desres-netscan  -o $OUTDIR
 
$ cat rankmap.1
rank 0=drdb0235.en slot=0:0
rank 1=drdb0235.en slot=0:1
rank 2=drdb0235.en slot=0:2
rank 3=drdb0235.en slot=0:3
 
$ cat hostlist.1
drdb0235.en slots=8
 
ERROR SEEN
--------------------------------------------------------------------------
Rankfile claimed host drdb0235.en that was not allocated or oversubscribed it's slots:
--------------------------------------------------------------------------
[drdb0235.en.desres.deshaw.com:14242] [[37407,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 108
[drdb0235.en.desres.deshaw.com:14242] [[37407,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 87
[drdb0235.en.desres.deshaw.com:14242] [[37407,0],0] ORTE_ERROR_LOG: Bad parameter in file base/plm_base_launch_support.c at line 77
[drdb0235.en.desres.deshaw.com:14242] [[37407,0],0] ORTE_ERROR_LOG: Bad parameter in file plm_rsh_module.c at line 985
 
From looking at the code in rmaps_rank_file.c, it seems the error occurs when the node-gathering code wraps twice around the hostlist. However, I don't see why that is happening.
 
If I specify 8 slots in the rankmap, I see a different error: Error, invalid rank (4) in the rankfile (./rankmap.1)
 
Thanks,
Federico
 
