I think I know what goes wrong. Since the two processes are in different
'universes', they will have exactly the same 'Open MPI name', and
therefore the algorithm in intercomm_merge cannot determine which
process should be first and which should be second.
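As a rough illustration of why identical names produce the reported
symptom, here is a toy model in Python. It is not Open MPI's actual
code: the ProcName tuple, the jobid/vpid values, and the "strictly
smaller name goes first" rule are assumptions made only for this sketch.

```python
from typing import NamedTuple


class ProcName(NamedTuple):
    """Hypothetical stand-in for an Open MPI process name."""
    jobid: int
    vpid: int


def merged_rank(local: ProcName, remote: ProcName) -> int:
    """Rank of `local` after merging two single-process groups.

    Assumed rule: the side with the strictly smaller name goes first.
    """
    return 0 if local < remote else 1


# Same universe: distinct (illustrative) jobids break the tie.
app1, app2 = ProcName(jobid=1, vpid=0), ProcName(jobid=2, vpid=0)
print(merged_rank(app1, app2), merged_rank(app2, app1))  # prints "0 1"

# Separate universes: both processes carry the same name,
# so each side decides it is "second".
a = b = ProcName(jobid=0, vpid=0)
print(merged_rank(a, b), merged_rank(b, a))  # prints "1 1"
```

With equal names neither side can claim to be first, so in this model
both end up as rank 1, which matches the behavior described in the
original report.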
In practice, all jobs that are connected at some point in their
lifetime have to be in the same MPI universe, so that all jobs will
have different jobids and therefore different names. To use the same
universe, you have to start the orted daemon in persistent mode, so
the sequence should be:
    orted --seed --persistent --scope public
    mpirun -np x ./app1
    mpirun -np y ./app2
In this case everything should work as expected: you can do the
comm_join between app1 and app2, and the intercomm_merge should work
as well.
Hope this helps
Edgar Gabriel wrote:
> Could you provide me with a simple test code for that? Comm_join and
> intercomm_merge should work; I would have a look at that...
> (A separate answer to your second email is coming soon.)
> Robert Latham wrote:
>>I've got a bit of an odd bug here. I've been playing around with MPI
>>process management routines and I noticed the following behavior with
>>Two processes (a and b), linked with ompi, but started independently
>>(no mpiexec, just started the programs directly).
>>- a and b: call MPI_Init
>>- a: open a unix network socket on 'fd'
>>- b: connect to a's socket
>>- a and b: call MPI_Comm_join over 'fd'
>>- a and b: call MPI_Intercomm_merge, get intracommunicator.
>>These steps all work fine.
>>Now the odd part: a and b call MPI_Comm_rank and MPI_Comm_size over
>>the intracommunicator. Both (correctly) think Comm_size is two, but
>>both also think (incorrectly) that they are rank 1.
Department of Computer Science email:gabriel_at_[hidden]
University of Houston http://www.cs.uh.edu/~gabriel
Philip G. Hoffman Hall, Room 524 Tel: +1 (713) 743-3857
Houston, TX-77204, USA Fax: +1 (713) 743-3335