Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] problem with rankfile and openmpi-1.6.4rc3r27923
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-01-29 15:24:49


Aha - I'm able to replicate it, will fix.

On Jan 29, 2013, at 11:57 AM, Ralph Castain <rhc_at_[hidden]> wrote:

> Using an svn checkout of the current 1.6 branch, it works fine for me:
>
> [rhc_at_odin ~/v1.6]$ cat rf
> rank 0=odin127 slot=0:0-1,1:0-1
> rank 1=odin128 slot=1
>
> [rhc_at_odin ~/v1.6]$ mpirun -n 2 -rf ./rf --report-bindings hostname
> [odin127.cs.indiana.edu:12078] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
> [odin128.cs.indiana.edu:12156] MCW rank 1 bound to socket 0[core 1]: [. B][. .] (slot list 1)
> odin127.cs.indiana.edu
> odin128.cs.indiana.edu
>
> Note that those two nodes were indeed allocated by Slurm - are you using a resource manager? Or is the allocation being defined by the rankfile?
>
> If the latter, please add --display-allocation to your cmd line and let's see what it thinks was allocated. Also, if you configure OMPI with --enable-debug, you could add "-mca ras_base_verbose 5" to the cmd line to get further diagnostic output.
>
>
> On Jan 29, 2013, at 10:54 AM, Siegmar Gross <Siegmar.Gross_at_[hidden]> wrote:
>
>> Hi
>>
>> Today I installed openmpi-1.6.4rc3r27923. Unfortunately I still
>> have a problem with rankfiles when I start a process on a remote
>> machine.
>>
>>
>> tyr rankfiles 114 ssh linpc1 ompi_info | grep "Open MPI:"
>> Open MPI: 1.6.4rc3r27923
>>
>> tyr rankfiles 115 cat rf_linpc1
>> rank 0=linpc1 slot=0:0-1,1:0-1
>>
>> tyr rankfiles 116 mpiexec -report-bindings -np 1 \
>> -rf rf_linpc1 hostname
>> ------------------------------------------------------------------
>> All nodes which are allocated for this job are already filled.
>> ------------------------------------------------------------------
>>
>>
>> The following command still works.
>>
>> tyr rankfiles 119 mpiexec -report-bindings -np 1 -host linpc1 \
>> -cpus-per-proc 4 -bycore -bind-to-core hostname
>> [linpc1:32262] MCW rank 0 bound to socket 0[core 0-1]
>> socket 1[core 0-1]: [B B][B B]
>> linpc1
>> tyr rankfiles 120
>>
>>
>> Everything is fine if I use the rankfile on the local machine.
>>
>> linpc1 rankfiles 103 ompi_info | grep "Open MPI:"
>> Open MPI: 1.6.4rc3r27923
>>
>> linpc1 rankfiles 104 cat rf_linpc1
>> rank 0=linpc1 slot=0:0-1,1:0-1
>>
>> linpc1 rankfiles 105 mpiexec -report-bindings -np 1 \
>> -rf rf_linpc1 hostname
>> [linpc1:32385] MCW rank 0 bound to socket 0[core 0-1]
>> socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
>> linpc1
>> linpc1 rankfiles 106
>>
>>
>> In my opinion it should also work when I start a process on a
>> remote machine. Can somebody look into this issue once more?
>> Thank you very much in advance for your help.
>>
>>
>> Kind regards
>>
>> Siegmar
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
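
A minimal sketch of the diagnostic run Ralph suggests above, adapting Siegmar's failing command from the thread (the host linpc1 and rankfile rf_linpc1 are taken from his mail; per Ralph's note, the ras_base_verbose output only appears if Open MPI was configured with --enable-debug):

  mpiexec --display-allocation -mca ras_base_verbose 5 \
      -report-bindings -np 1 -rf rf_linpc1 hostname

The --display-allocation output should show which nodes mpirun believes are in the allocation, which is what the "All nodes which are allocated for this job are already filled" error depends on.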