Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] MPI_Comm_connect/Accept
From: Aurélien Bouteiller (bouteill_at_[hidden])
Date: 2008-04-08 16:23:00


Still no luck here,

I launch these three processes:
term1$ ompi-server -d --report-uri URIFILE

term2$ mpirun -mca routed unity -ompi-server file:URIFILE -np 1 simple_accept

term3$ mpirun -mca routed unity -ompi-server file:URIFILE -np 1 simple_connect
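
For reference, this is roughly what the two test programs do at the MPI
level. It is only a minimal sketch, not the actual simple_accept /
simple_connect sources; the service name and the "publish globally" info
key are assumptions on my part (see Ralph's note below about the info key;
the code in ompi/mca/pubsub/orte/pubsub_orte.c is authoritative):

/* Hedged sketch: one binary, run once as "server" and once as "client"
 * from two separate mpiruns that both point at the same ompi-server. */
#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;

    MPI_Init(&argc, &argv);
    if (argc > 1 && 0 == strcmp(argv[1], "server")) {
        MPI_Info info;
        MPI_Open_port(MPI_INFO_NULL, port);
        /* Assumed key name: ask for the name to be published on the
         * ompi-server rather than kept local to this mpirun. */
        MPI_Info_create(&info);
        MPI_Info_set(info, "ompi_global_scope", "true");
        MPI_Publish_name("test_service", info, port);
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        MPI_Unpublish_name("test_service", info, port);
        MPI_Close_port(port);
        MPI_Info_free(&info);
    } else {
        /* client: resolve the published name, then connect */
        MPI_Lookup_name("test_service", MPI_INFO_NULL, port);
        printf("Found port < %s >\n", port);
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
    }
    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}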

The output of ompi-server shows a successful publish and lookup, and I get
the correct port on the client side. However, the result is the same as
when not using the Publish/Lookup mechanism: the connect fails, saying the
port cannot be reached.

Found port < 1940389889.0;tcp://
160.36.252.99:49777;tcp6://2002:a024:ed65:9:21b:63ff:fecb:
28:49778;tcp6://fec0::9:21b:63ff:fecb:28:49778;tcp6://2002:a024:ff7f:
9:21b:63ff:fecb:28:49778:300 >
[abouteil.nomad.utk.edu:60339] [[29620,1],0] ORTE_ERROR_LOG: A message
is attempting to be sent to a process whose contact information is
unknown in file ../../../../../trunk/orte/mca/rml/oob/rml_oob_send.c
at line 140
[abouteil.nomad.utk.edu:60339] [[29620,1],0] attempted to send to
[[29608,1],0]
[abouteil.nomad.utk.edu:60339] [[29620,1],0] ORTE_ERROR_LOG: A message
is attempting to be sent to a process whose contact information is
unknown in file ../../../../../trunk/ompi/mca/dpm/orte/dpm_orte.c at
line 455
[abouteil.nomad.utk.edu:60339] *** An error occurred in MPI_Comm_connect
[abouteil.nomad.utk.edu:60339] *** on communicator MPI_COMM_SELF
[abouteil.nomad.utk.edu:60339] *** MPI_ERR_UNKNOWN: unknown error
[abouteil.nomad.utk.edu:60339] *** MPI_ERRORS_ARE_FATAL (goodbye)

I took a look at the source code, and I think the problem comes from a
conceptual mistake in MPI_Comm_connect. The function "connect_accept" in
dpm_orte.c takes an orte_process_name_t as the destination port. This
structure only contains the jobid and the vpid (always set to 0, which I
guess means you plan to contact the HNP of that job). Obviously, if the
accepting process does not share the same HNP as the connecting process,
there is no way for MPI_Comm_connect to fill in this field correctly. The
whole purpose of the port_name string is to provide a consistent way to
reach the remote endpoint without a complicated name resolution service. I
think this function should take the port_name instead (the string returned
by Open_port) and contact that endpoint directly over OOB to get the
contact information it needs from there, rather than from the local HNP.
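
To make the argument concrete: at the MPI level, the contract is that the
string alone identifies the endpoint. Below is a hedged sketch of the
point-1 scenario from my first mail, where the port string printed by the
accepting side is simply pasted onto the command line of an unrelated
mpirun (the program name and argument handling here are mine, not from any
existing test):

/* Hedged sketch: connect using nothing but the raw port string produced
 * by MPI_Open_port on the accepting side. The string already carries the
 * OOB/TCP contact information, so dpm should not need to ask the local
 * HNP anything about the remote job. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm server;

    MPI_Init(&argc, &argv);
    if (argc < 2) {
        fprintf(stderr, "usage: %s '<port string from MPI_Open_port>'\n",
                argv[0]);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Comm_connect(argv[1], MPI_INFO_NULL, 0, MPI_COMM_SELF, &server);
    MPI_Comm_disconnect(&server);
    MPI_Finalize();
    return 0;
}

(The port string contains semicolons, so it has to be quoted on the shell
command line.)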

Aurelien

On Apr 4, 2008, at 15:21, Ralph H Castain wrote:
> Okay, I have a partial fix in there now. You'll have to use -mca
> routed
> unity as I still need to fix it for routed tree.
>
> Couple of things:
>
> 1. I fixed the --debug flag so it automatically turns on the debug
> output
> from the data server code itself. Now ompi-server will tell you when
> it is
> accessed.
>
> 2. remember, we added an MPI_Info key that specifies if you want the
> data
> stored locally (on your own mpirun) or globally (on the ompi-
> server). If you
> specify nothing, there is a precedence built into the code that
> defaults to
> "local". So you have to tell us that this data is to be published
> "global"
> if you want to connect multiple mpiruns.
>
> I believe Jeff wrote all that up somewhere - could be in an email
> thread,
> though. Been too long ago for me to remember... ;-) You can look it
> up in
> the code though as a last resort - it is in
> ompi/mca/pubsub/orte/pubsub_orte.c.
>
> Ralph
>
>
>
> On 4/4/08 12:55 PM, "Ralph H Castain" <rhc_at_[hidden]> wrote:
>
>> Well, something got borked in here - will have to fix it, so this
>> will
>> probably not get done until next week.
>>
>>
>> On 4/4/08 12:26 PM, "Ralph H Castain" <rhc_at_[hidden]> wrote:
>>
>>> Yeah, you didn't specify the file correctly...plus I found a bug
>>> in the code
>>> when I looked (out-of-date a little in orterun).
>>>
>>> I am updating orterun (commit soon) and will include a better help
>>> message
>>> about the proper format of the orterun cmd-line option. The syntax
>>> is:
>>>
>>> -ompi-server uri
>>>
>>> or -ompi-server file:filename-where-uri-exists
>>>
>>> Problem here is that you gave it a uri of "test", which means
>>> nothing. ;-)
>>>
>>> Should have it up-and-going soon.
>>> Ralph
>>>
>>> On 4/4/08 12:02 PM, "Aurélien Bouteiller" <bouteill_at_[hidden]>
>>> wrote:
>>>
>>>> Ralph,
>>>>
>>>> I've not been very successful at using ompi-server. I tried this:
>>>>
>>>> xterm1$ ompi-server --debug-devel -d --report-uri test
>>>> [grosse-pomme.local:01097] proc_info: hnp_uri NULL
>>>> daemon uri NULL
>>>> [grosse-pomme.local:01097] [[34900,0],0] ompi-server: up and
>>>> running!
>>>>
>>>>
>>>> xterm2$ mpirun -ompi-server test -np 1 mpi_accept_test
>>>> Port name:
>>>> 2285895681.0;tcp://192.168.0.101:50065;tcp://
>>>> 192.168.0.150:50065:300
>>>>
>>>> xterm3$ mpirun -ompi-server test -np 1 simple_connect
>>>> --------------------------------------------------------------------------
>>>> Process rank 0 attempted to lookup from a global ompi_server that
>>>> could not be contacted. This is typically caused by either not
>>>> specifying the contact info for the server, or by the server not
>>>> currently executing. If you did specify the contact info for a
>>>> server, please check to see that the server is running and start
>>>> it again (or have your sys admin start it) if it isn't.
>>>>
>>>> --------------------------------------------------------------------------
>>>> [grosse-pomme.local:01122] *** An error occurred in MPI_Lookup_name
>>>> [grosse-pomme.local:01122] *** on communicator MPI_COMM_WORLD
>>>> [grosse-pomme.local:01122] *** MPI_ERR_NAME: invalid name argument
>>>> [grosse-pomme.local:01122] *** MPI_ERRORS_ARE_FATAL (goodbye)
>>>> --------------------------------------------------------------------------
>>>>
>>>>
>>>>
>>>> The server code calls Open_port and then Publish_name. It looks like
>>>> the Lookup_name function cannot reach the ompi-server. The ompi-server
>>>> in debug mode does not show any output when a new event occurs (like
>>>> when the server is launched). Is there something wrong in the way I
>>>> use it?
>>>>
>>>> Aurelien
>>>>
>>>> On Apr 3, 2008, at 17:21, Ralph Castain wrote:
>>>>> Take a gander at ompi/tools/ompi-server - I believe I put a man
>>>>> page
>>>>> in
>>>>> there. You might just try "man ompi-server" and see if it shows
>>>>> up.
>>>>>
>>>>> Holler if you have a question - not sure I documented it very
>>>>> thoroughly at
>>>>> the time.
>>>>>
>>>>>
>>>>> On 4/3/08 3:10 PM, "Aurélien Bouteiller" <bouteill_at_[hidden]>
>>>>> wrote:
>>>>>
>>>>>> Ralph,
>>>>>>
>>>>>>
>>>>>> I am using trunk. Is there documentation for ompi-server? Sounds
>>>>>> exactly like what I need to fix point 1.
>>>>>>
>>>>>> Aurelien
>>>>>>
>>>>>> On Apr 3, 2008, at 17:06, Ralph Castain wrote:
>>>>>>> I guess I'll have to ask the basic question: what version are
>>>>>>> you
>>>>>>> using?
>>>>>>>
>>>>>>> If you are talking about the trunk, there no longer is a
>>>>>>> "universe"
>>>>>>> concept
>>>>>>> anywhere in the code. Two mpiruns can connect/accept to each
>>>>>>> other
>>>>>>> as long
>>>>>>> as they can make contact. To facilitate that, we created an
>>>>>>> "ompi-
>>>>>>> server"
>>>>>>> tool that is supposed to be run by the sys-admin (or a user,
>>>>>>> doesn't
>>>>>>> matter
>>>>>>> which) on the head node - there are various ways to tell mpirun
>>>>>>> how to
>>>>>>> contact the server, or it can self-discover it.
>>>>>>>
>>>>>>> I have tested publish/lookup pretty thoroughly and it seems to
>>>>>>> work. I
>>>>>>> haven't spent much time testing connect/accept except via
>>>>>>> comm_spawn, which
>>>>>>> seems to be working. Since that uses the same mechanism, I would
>>>>>>> have
>>>>>>> expected connect/accept to work as well.
>>>>>>>
>>>>>>> If you are talking about 1.2.x, then the story is totally
>>>>>>> different.
>>>>>>>
>>>>>>> Ralph
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 4/3/08 2:29 PM, "Aurélien Bouteiller" <bouteill_at_[hidden]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> I'm trying to figure out how complete the implementation of
>>>>>>>> Comm_connect/Accept is. I found two problematic cases.
>>>>>>>>
>>>>>>>> 1) Two different programs are started by two different mpiruns.
>>>>>>>> One calls accept, the second one calls connect. I would not expect
>>>>>>>> MPI_Publish_name/Lookup_name to work, because they do not share
>>>>>>>> the same HNP. Still, I would expect to be able to connect by
>>>>>>>> copying (with printf-scanf) the port_name string generated by
>>>>>>>> Open_port, especially considering that in Open MPI the port_name
>>>>>>>> is a string containing the TCP address and port of rank 0 in the
>>>>>>>> server communicator. However, doing so results in "no route to
>>>>>>>> host" and the connecting application aborts. Is the problem
>>>>>>>> related to an explicit check of the universes on the accepting
>>>>>>>> HNP? Do I expect too much from the MPI standard? Is it because my
>>>>>>>> two applications do not share the same universe? Should we (re)add
>>>>>>>> the ability to use the same universe for several mpiruns?
>>>>>>>>
>>>>>>>> 2) The second issue is when the program sets up a port and then
>>>>>>>> accepts multiple clients on this port. Everything works fine for
>>>>>>>> the first client, then accept stalls forever while waiting for the
>>>>>>>> second one. My understanding of the standard is that it should
>>>>>>>> work: 5.4.2 states "it must call MPI_Open_port to establish a port
>>>>>>>> [...] it must call MPI_Comm_accept to accept connections from
>>>>>>>> clients". I understand that with one MPI_Open_port I should be
>>>>>>>> able to manage several MPI clients. Am I understanding the
>>>>>>>> standard correctly here, and should we fix this?
>>>>>>>>
>>>>>>>> Here is a copy of the non-working code for reference.
>>>>>>>>
>>>>>>>> /*
>>>>>>>>  * Copyright (c) 2004-2007 The Trustees of the University of Tennessee.
>>>>>>>>  *                         All rights reserved.
>>>>>>>>  * $COPYRIGHT$
>>>>>>>>  *
>>>>>>>>  * Additional copyrights may follow
>>>>>>>>  *
>>>>>>>>  * $HEADER$
>>>>>>>>  */
>>>>>>>> #include <stdlib.h>
>>>>>>>> #include <stdio.h>
>>>>>>>> #include <mpi.h>
>>>>>>>>
>>>>>>>> int main(int argc, char *argv[])
>>>>>>>> {
>>>>>>>>     char port[MPI_MAX_PORT_NAME];
>>>>>>>>     int rank;
>>>>>>>>     int np;
>>>>>>>>
>>>>>>>>     MPI_Init(&argc, &argv);
>>>>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &np);
>>>>>>>>
>>>>>>>>     if(rank)
>>>>>>>>     {
>>>>>>>>         MPI_Comm comm;
>>>>>>>>         /* client */
>>>>>>>>         MPI_Recv(port, MPI_MAX_PORT_NAME, MPI_CHAR, 0, 0,
>>>>>>>>                  MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>>>>>>>         printf("Read port: %s\n", port);
>>>>>>>>         MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &comm);
>>>>>>>>
>>>>>>>>         MPI_Send(&rank, 1, MPI_INT, 0, 1, comm);
>>>>>>>>         MPI_Comm_disconnect(&comm);
>>>>>>>>     }
>>>>>>>>     else
>>>>>>>>     {
>>>>>>>>         int nc = np - 1;
>>>>>>>>         MPI_Comm *comm_nodes = (MPI_Comm *) calloc(nc, sizeof(MPI_Comm));
>>>>>>>>         MPI_Request *reqs = (MPI_Request *) calloc(nc, sizeof(MPI_Request));
>>>>>>>>         int *event = (int *) calloc(nc, sizeof(int));
>>>>>>>>         int i;
>>>>>>>>
>>>>>>>>         MPI_Open_port(MPI_INFO_NULL, port);
>>>>>>>>         /* MPI_Publish_name("test_service_el", MPI_INFO_NULL, port); */
>>>>>>>>         printf("Port name: %s\n", port);
>>>>>>>>         for(i = 1; i < np; i++)
>>>>>>>>             MPI_Send(port, MPI_MAX_PORT_NAME, MPI_CHAR, i, 0,
>>>>>>>>                      MPI_COMM_WORLD);
>>>>>>>>
>>>>>>>>         for(i = 0; i < nc; i++)
>>>>>>>>         {
>>>>>>>>             MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF,
>>>>>>>>                             &comm_nodes[i]);
>>>>>>>>             printf("Accept %d\n", i);
>>>>>>>>             MPI_Irecv(&event[i], 1, MPI_INT, 0, 1, comm_nodes[i],
>>>>>>>>                       &reqs[i]);
>>>>>>>>             printf("IRecv %d\n", i);
>>>>>>>>         }
>>>>>>>>         MPI_Close_port(port);
>>>>>>>>         MPI_Waitall(nc, reqs, MPI_STATUSES_IGNORE);
>>>>>>>>         for(i = 0; i < nc; i++)
>>>>>>>>         {
>>>>>>>>             printf("event[%d] = %d\n", i, event[i]);
>>>>>>>>             MPI_Comm_disconnect(&comm_nodes[i]);
>>>>>>>>             printf("Disconnect %d\n", i);
>>>>>>>>         }
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     MPI_Finalize();
>>>>>>>>     return EXIT_SUCCESS;
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> * Dr. Aurélien Bouteiller
>>>>>>>> * Sr. Research Associate at Innovative Computing Laboratory
>>>>>>>> * University of Tennessee
>>>>>>>> * 1122 Volunteer Boulevard, suite 350
>>>>>>>> * Knoxville, TN 37996
>>>>>>>> * 865 974 6321