Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] now 1.9 [was: I have still a problem with rankfiles in openmpi-1.6.4rc3]
From: Eugene Loh (eugene.loh_at_[hidden])
Date: 2013-02-09 13:42:33


On 02/09/13 00:32, Ralph Castain wrote:
> On Feb 6, 2013, at 2:59 PM, Eugene Loh <eugene.loh_at_[hidden]> wrote:
>
>> On 02/06/13 04:29, Siegmar Gross wrote:
>>> thank you very much for your answer. I have compiled your program
>>> and get different behaviours for openmpi-1.6.4rc3 and openmpi-1.9.
>> I think what's happening is that although you specified "0:0" or "0:1" in the rankfile, the string "0,0" or "0,1" is getting passed in (at least in the runs I looked at). That colon became a comma. So, it's just by accident that myrankfile_0 is working out all right.
>>
>> Could someone who knows the code better than I do help me narrow this down? E.g., where is the rankfile parsed? For what it's worth, by the time mpirun reaches orte_odls_base_default_get_add_procs_data(), orte_job_data already contains the corrupted cpu_bitmap string.
>
> You'll want to look at orte/mca/rmaps/rank_file/rmaps_rank_file.c - the bit map is now computed in mpirun and then sent to the daemons

Actually, I'm getting lost in this code. Anyhow, I don't think the problem is specific to Solaris; it also shows up on Linux.
E.g., I can reproduce the problem with 1.9a1r28035 on Linux using GCC compilers.

Siegmar: can you confirm that this is also a problem on Linux? E.g., with OMPI 1.9, on one of your Linux nodes (linpc0?) try

     % cat myrankfile
     rank 0=linpc0 slot=0:1
     % mpirun --report-bindings --rankfile myrankfile numactl --show

For me, the binding I get is not 0:1 but 0,1.
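
To spell out why the colon-versus-comma difference matters, here is a rough sketch of how the two strings read under the rankfile slot syntax as I understand it from the mpirun man page: "S:C" names cores on socket S, while a bare list names logical CPUs. This is an illustration only, not the parser in rmaps_rank_file.c; the function name parse_slot and the returned dictionary layout are made up for this example.

```python
def parse_slot(spec):
    """Interpret a rankfile slot string (simplified, hypothetical helper).

    "S:C"  -> cores C on socket S
    "A,B"  -> logical CPUs A and B (no socket given)
    Ranges such as "0-2" are expanded inclusively.
    """
    def expand(items):
        # Turn "0,2-3" into [0, 2, 3].
        out = []
        for part in items.split(","):
            if "-" in part:
                lo, hi = part.split("-")
                out.extend(range(int(lo), int(hi) + 1))
            else:
                out.append(int(part))
        return out

    if ":" in spec:
        socket, cores = spec.split(":", 1)
        return {"socket": int(socket), "cores": expand(cores)}
    return {"cpus": expand(spec)}


print(parse_slot("0:1"))  # what the rankfile asked for
print(parse_slot("0,1"))  # what actually seems to be passed along
```

Under that reading, "0:1" asks for one core (core 1 on socket 0), whereas "0,1" asks for two logical CPUs, so the two strings describe different bindings.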

Could someone else take a look at this?