Found the bug - we indeed failed to update the jdata->num_procs field when adding the non-rf-mapped procs to the job.

Fix coming shortly.

On Jul 15, 2009, at 2:40 PM, Ralph Castain wrote:

Ah - interesting scenario!

Definitely a "bug" in the code, then. What it looks like, though, is that the jdata->num_procs is wrong. There shouldn't be any way that the num_procs in the node array is different than jdata->num_procs.

My guess is that the rank_file mapper isn't correctly maintaining the bookkeeping when we map the procs beyond those in the rankfile. I'll dig into it - have to fix something for Lenny anyway.
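
To make the expected invariant concrete, here is a minimal sketch in C. The types and function names (sketch_node_t, sketch_job_t, counts_consistent, map_proc) are hypothetical, simplified stand-ins and not the real ORTE structures or the rank_file mapper code. The idea is only that the job-level count and the per-node counts must be updated together whenever a proc is mapped.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins; NOT the real ORTE types. */
typedef struct {
    int num_procs;          /* procs mapped onto this node */
} sketch_node_t;

typedef struct {
    int num_procs;          /* job-level count (plays the role of jdata->num_procs) */
    int num_nodes;          /* number of entries in nodes[] */
    sketch_node_t *nodes;   /* per-node bookkeeping */
} sketch_job_t;

/* Sum the per-node counts across the whole node array. */
static int count_mapped_procs(const sketch_job_t *job)
{
    int total = 0;
    for (int n = 0; n < job->num_nodes; n++) {
        total += job->nodes[n].num_procs;
    }
    return total;
}

/* The invariant: the job-level count equals the sum over all nodes. */
static bool counts_consistent(const sketch_job_t *job)
{
    return count_mapped_procs(job) == job->num_procs;
}

/* Map one more proc onto a node, updating BOTH counters in one place
 * so the invariant cannot be broken by forgetting one of them. */
static void map_proc(sketch_job_t *job, int node_index)
{
    assert(node_index >= 0 && node_index < job->num_nodes);
    job->nodes[node_index].num_procs++;
    job->num_procs++;
}
```

Bumping only the per-node counter for procs mapped beyond the rankfile entries is the kind of slip that would leave counts_consistent() returning false.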

Meantime, this change looks fine regardless as it (a) is better code and (b) protects us against such errors.

Thanks for catching it!

On Wed, Jul 15, 2009 at 2:30 PM, George Bosilca wrote:
I think I found a better solution (in r21688). Here is what I was trying to do.

I have a more or less homogeneous cluster. In fact all processors are identical, except that some are quad core and some dual core. Of course I care how my processes are mapped on the quad cores, but not really on the dual cores.

My approach was to use the following configuration files.

In /home/bosilca/.openmpi/mca-params.conf I have:

rmaps_rank_file_path = /home/bosilca/.openmpi/rankfile
rmaps_rank_file_priority = 100

In /home/bosilca/.openmpi/machinefile I have the full description of the cluster. As an example:
node01 slots=4
node02 slots=4
node03 slots=2
node04 slots=2

And in the /home/bosilca/.openmpi/rankfile file I have:
rank 0=+n0 slot=0
rank 1=+n0 slot=1
rank 2=+n1 slot=0
rank 3=+n1 slot=1

As long as I spawned jobs with fewer than 4 processes, everything worked fine. But when I used more than 4 processes, orterun segfaulted. After debugging, I found that the nodes, lrank and nrank arrays were allocated based on jdata->num_procs, but then filled based on the total number of processes in the jdata->nodes array. As it appears that jdata->num_procs is somehow modified based on the number of entries in the rankfile, we end up writing outside the allocation and then segfault. Now with the latest patch, we can cope with such a scenario by only packing the known information (and thus not writing outside the allocated arrays).
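
The failure mode and the guard can be sketched as follows. This is a simplified illustration in C, not the actual nidmap encoding code from r21688: the types and the pack_node_ids function are hypothetical, and the only point is that the fill loop is bounded by the capacity the array was allocated with rather than by what the node array contains.

```c
#include <assert.h>

/* Hypothetical, simplified stand-ins; NOT the real ORTE types. */
typedef struct {
    int num_procs;          /* procs mapped onto this node */
} sketch_node_t;

typedef struct {
    int num_procs;          /* job-level count, possibly stale */
    int num_nodes;          /* number of entries in nodes[] */
    sketch_node_t *nodes;   /* per-node bookkeeping */
} sketch_job_t;

/* Record, for each proc in node order, the index of the node it lives on.
 * The array was sized from the job-level count, so stop once `capacity`
 * entries are written even if the nodes claim to hold more procs.
 * Returns the number of entries actually written. */
static int pack_node_ids(const sketch_job_t *job, int *node_of_proc, int capacity)
{
    assert(capacity >= 0);
    int written = 0;
    for (int n = 0; n < job->num_nodes; n++) {
        for (int p = 0; p < job->nodes[n].num_procs; p++) {
            if (written >= capacity) {
                return written;   /* pack only what fits; never overrun */
            }
            node_of_proc[written++] = n;
        }
    }
    return written;
}
```

With four nodes holding 2, 2, 1 and 1 procs but an array sized for a stale count of 4, this writes four entries and stops, where an unguarded loop would write six. The guard only avoids the crash: the two extra procs are silently left out of the packed data until the count itself is corrected.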

This might not be the best approach, but it is doing what I'm looking for ...


On Jul 15, 2009, at 15:50, Ralph Castain wrote:

The routed comm system relies on each daemon having complete information as to where every process is located, so the expectation was that only full maps would ever be sent. Thus, the nidmap code is set up to always send a full map.

I don't know how to even generate a "partial" map. I assume you are doing something offline? Is this to update changed info? If so, you'll also have to do something to update the daemon's maps or the comm system will break down.


On Wed, Jul 15, 2009 at 1:40 PM, George Bosilca wrote:
I have a question regarding the mapping. How can I declare a partial mapping? In fact, I only care about how some of the processes are mapped on some specific nodes. Right now, if the rmaps doesn't contain information about all nodes, we give up (before this patch we segfaulted).

Does it mean we always have to declare the whole mapping, or is it just that we overlooked this strange case?


Begin forwarded message:

Author: bosilca
Date: 2009-07-15 15:36:53 EDT (Wed, 15 Jul 2009)
New Revision: 21686

Reorder the nidmap encoding function. Add a check to make sure we don't write
outside the boundaries of the allocated array.

However, the problem is still there. If we have an rmaps file containing only
partial information, num_procs gets set to the wrong value (the number of
hosts in the rmaps file instead of the number of processes requested on the
command line).

devel mailing list
