Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] problem with rankfile and openmpi-1.6.4rc3r27923
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-01-29 15:24:49


Aha - I'm able to replicate it, will fix.

On Jan 29, 2013, at 11:57 AM, Ralph Castain <rhc_at_[hidden]> wrote:

> Using an svn checkout of the current 1.6 branch, it works fine for me:
>
> [rhc_at_odin ~/v1.6]$ cat rf
> rank 0=odin127 slot=0:0-1,1:0-1
> rank 1=odin128 slot=1
>
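> (For reference, a sketch of the rankfile format used above; the annotations are mine, not part of the original mail, and read each line as "rank N on this host, bound to these socket:core ranges":)

```
# rankfile syntax: rank <N>=<host> slot=<socket>:<cores>[,<socket>:<cores>]
rank 0=odin127 slot=0:0-1,1:0-1   # rank 0 on odin127: cores 0-1 on socket 0 and on socket 1
rank 1=odin128 slot=1             # rank 1 on odin128: logical slot 1 (socket 0, core 1 in the binding output below)
```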
> [rhc_at_odin ~/v1.6]$ mpirun -n 2 -rf ./rf --report-bindings hostname
> [odin127.cs.indiana.edu:12078] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
> [odin128.cs.indiana.edu:12156] MCW rank 1 bound to socket 0[core 1]: [. B][. .] (slot list 1)
> odin127.cs.indiana.edu
> odin128.cs.indiana.edu
>
> Note that those two nodes were indeed allocated by Slurm - are you using a resource manager? Or is the allocation being defined by the rankfile?
>
> If the latter, please add --display-allocation to your cmd line and let's see what it thinks was allocated. Also, if you configure OMPI with --enable-debug, you could add "-mca ras_base_verbose 5" to the cmd line and get further diagnostic output.
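> Combined, such a diagnostic invocation might look like this (a sketch using Siegmar's rankfile name; the verbose RAS output requires an OMPI build configured with --enable-debug):

```
mpiexec -report-bindings --display-allocation \
        -mca ras_base_verbose 5 \
        -np 1 -rf rf_linpc1 hostname
```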
>
>
> On Jan 29, 2013, at 10:54 AM, Siegmar Gross <Siegmar.Gross_at_[hidden]> wrote:
>
>> Hi,
>>
>> today I installed openmpi-1.6.4rc3r27923. Unfortunately, I
>> still have a problem with rankfiles when I start a process on a
>> remote machine.
>>
>>
>> tyr rankfiles 114 ssh linpc1 ompi_info | grep "Open MPI:"
>> Open MPI: 1.6.4rc3r27923
>>
>> tyr rankfiles 115 cat rf_linpc1
>> rank 0=linpc1 slot=0:0-1,1:0-1
>>
>> tyr rankfiles 116 mpiexec -report-bindings -np 1 \
>> -rf rf_linpc1 hostname
>> ------------------------------------------------------------------
>> All nodes which are allocated for this job are already filled.
>> ------------------------------------------------------------------
>>
>>
>> The following command still works:
>>
>> tyr rankfiles 119 mpiexec -report-bindings -np 1 -host linpc1 \
>> -cpus-per-proc 4 -bycore -bind-to-core hostname
>> [linpc1:32262] MCW rank 0 bound to socket 0[core 0-1]
>> socket 1[core 0-1]: [B B][B B]
>> linpc1
>> tyr rankfiles 120
>>
>>
>> Everything is fine if I use the rankfile on the local machine.
>>
>> linpc1 rankfiles 103 ompi_info | grep "Open MPI:"
>> Open MPI: 1.6.4rc3r27923
>>
>> linpc1 rankfiles 104 cat rf_linpc1
>> rank 0=linpc1 slot=0:0-1,1:0-1
>>
>> linpc1 rankfiles 105 mpiexec -report-bindings -np 1 \
>> -rf rf_linpc1 hostname
>> [linpc1:32385] MCW rank 0 bound to socket 0[core 0-1]
>> socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
>> linpc1
>> linpc1 rankfiles 106
>>
>>
>> In my opinion, it should also work when I start a process on a
>> remote machine. Could somebody look into this issue once more?
>> Thank you very much in advance for your help.
>>
>>
>> Kind regards
>>
>> Siegmar
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>