I'm afraid the rank-file mapper in 1.3.3 has several known problems that users have described on this list. Hopefully those are fixed in the upcoming 1.3.4 release.

On Aug 31, 2009, at 10:01 AM, Sacerdoti, Federico wrote:

I am trying to use the rankmap to bind a 4-process MPI job to one socket of a two-socket, 8-core machine. However, I'm getting a strange error.
$ orterun --hostfile hostlist.1 -n 4 --mca rmaps_rank_file_path ./rankmap.1 desres-netscan -o $OUTDIR
$ cat rankmap.1
rank 0=drdb0235.en slot=0:0
rank 1=drdb0235.en slot=0:1
rank 2=drdb0235.en slot=0:2
rank 3=drdb0235.en slot=0:3
$ cat hostlist.1
drdb0235.en slots=8
Rankfile claimed host drdb0235.en that was not allocated or oversubscribed it's slots:
[drdb0235.en.desres.deshaw.com:14242] [[37407,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 108
[drdb0235.en.desres.deshaw.com:14242] [[37407,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 87
[drdb0235.en.desres.deshaw.com:14242] [[37407,0],0] ORTE_ERROR_LOG: Bad parameter in file base/plm_base_launch_support.c at line 77
[drdb0235.en.desres.deshaw.com:14242] [[37407,0],0] ORTE_ERROR_LOG: Bad parameter in file plm_rsh_module.c at line 985
From looking at the code in rmaps_rank_file.c, it seems the error occurs when the node-gathering code wraps twice around the hostlist. However, I don't see why that is happening.
If I specify 8 slots in the rankmap, I see a different error: Error, invalid rank (4) in the rankfile (./rankmap.1)
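
For anyone who wants to rule out a problem in the files themselves, here is a minimal Python sketch of a sanity check for the rankfile and hostfile shown above. It is not Open MPI code and does not reproduce the logic in rmaps_rank_file.c. The helper names (parse_hostfile, parse_rankfile, check) are hypothetical, and the three checks (rank below the -n value, host present in the hostfile, no more ranks on a host than it has slots) are assumptions inferred from the two error messages in this thread.

```python
import re

# Hypothetical helpers, not part of Open MPI. They only understand the two
# line formats that appear in this thread.
RANK_RE = re.compile(r"^rank\s+(\d+)\s*=\s*(\S+)\s+slot\s*=\s*(\S+)$")


def parse_hostfile(text):
    """Return {host: slots} from lines like 'drdb0235.en slots=8'."""
    hosts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        slots = 1  # assumption: one slot when no 'slots=' field is given
        for field in parts[1:]:
            if field.startswith("slots="):
                slots = int(field.split("=", 1)[1])
        hosts[parts[0]] = slots
    return hosts


def parse_rankfile(text):
    """Return [(rank, host, slot)] from lines like 'rank 0=host slot=0:0'."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = RANK_RE.match(line)
        if match is None:
            raise ValueError(f"unparseable rankfile line: {line!r}")
        entries.append((int(match.group(1)), match.group(2), match.group(3)))
    return entries


def check(rank_entries, hosts, nprocs):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    ranks_per_host = {}
    for rank, host, _slot in rank_entries:
        if rank >= nprocs:
            problems.append(f"rank {rank} is outside 0..{nprocs - 1}")
        if host not in hosts:
            problems.append(f"rank {rank} names unlisted host {host}")
            continue
        ranks_per_host[host] = ranks_per_host.get(host, 0) + 1
    for host, used in ranks_per_host.items():
        if used > hosts[host]:
            problems.append(f"{host}: {used} ranks but only {hosts[host]} slots")
    return problems
```

Under these assumptions, check() finds nothing wrong with rankmap.1 and hostlist.1 for -n 4, so the files look internally consistent. With 8 rankfile entries and -n 4 it flags ranks 4 through 7, which at least matches the shape of the "invalid rank (4)" message.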
