Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] MPI_Comm_connect/Accept
From: Ralph H Castain (rhc_at_[hidden])
Date: 2008-04-04 14:55:20


Well, something got borked in here - will have to fix it, so this will
probably not get done until next week.

On 4/4/08 12:26 PM, "Ralph H Castain" <rhc_at_[hidden]> wrote:

> Yeah, you didn't specify the file correctly... plus I found a bug in the code
> when I looked (orterun was a little out-of-date).
>
> I am updating orterun (commit soon) and will include a better help message
> about the proper format of the orterun cmd-line option. The syntax is:
>
> -ompi-server uri
>
> or -ompi-server file:filename-where-uri-exists
>
> Problem here is that you gave it a uri of "test", which means nothing. ;-)
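>
> Assuming your "ompi-server --report-uri test" run wrote its uri into the
> file named "test", the mpirun commands quoted below would then presumably
> need to look something like:
>
>   mpirun -ompi-server file:test -np 1 mpi_accept_test
>   mpirun -ompi-server file:test -np 1 simple_connect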
>
> Should have it up-and-going soon.
> Ralph
>
> On 4/4/08 12:02 PM, "Aurélien Bouteiller" <bouteill_at_[hidden]> wrote:
>
>> Ralph,
>>
>> I've not been very successful at using ompi-server. I tried this:
>>
>> xterm1$ ompi-server --debug-devel -d --report-uri test
>> [grosse-pomme.local:01097] proc_info: hnp_uri NULL
>> daemon uri NULL
>> [grosse-pomme.local:01097] [[34900,0],0] ompi-server: up and running!
>>
>>
>> xterm2$ mpirun -ompi-server test -np 1 mpi_accept_test
>> Port name:
>> 2285895681.0;tcp://192.168.0.101:50065;tcp://192.168.0.150:50065:300
>>
>> xterm3$ mpirun -ompi-server test -np 1 simple_connect
>> --------------------------------------------------------------------------
>> Process rank 0 attempted to lookup from a global ompi_server that
>> could not be contacted. This is typically caused by either not
>> specifying the contact info for the server, or by the server not
>> currently executing. If you did specify the contact info for a
>> server, please check to see that the server is running and start
>> it again (or have your sys admin start it) if it isn't.
>>
>> --------------------------------------------------------------------------
>> [grosse-pomme.local:01122] *** An error occurred in MPI_Lookup_name
>> [grosse-pomme.local:01122] *** on communicator MPI_COMM_WORLD
>> [grosse-pomme.local:01122] *** MPI_ERR_NAME: invalid name argument
>> [grosse-pomme.local:01122] *** MPI_ERRORS_ARE_FATAL (goodbye)
>> --------------------------------------------------------------------------
>>
>>
>>
>> The server code calls MPI_Open_port and then MPI_Publish_name. It looks
>> like MPI_Lookup_name cannot reach the ompi-server. The ompi-server in
>> debug mode does not show any output when a new event occurs (such as when
>> the server is launched). Is there something wrong in the way I use it?
>>
>> Aurelien
>>
>> On 3 Apr 2008, at 17:21, Ralph Castain wrote:
>>> Take a gander at ompi/tools/ompi-server - I believe I put a man page in
>>> there. You might just try "man ompi-server" and see if it shows up.
>>>
>>> Holler if you have a question - not sure I documented it very thoroughly
>>> at the time.
>>>
>>>
>>> On 4/3/08 3:10 PM, "Aurélien Bouteiller" <bouteill_at_[hidden]>
>>> wrote:
>>>
>>>> Ralph,
>>>>
>>>>
>>>> I am using trunk. Is there documentation for ompi-server? It sounds
>>>> exactly like what I need to fix point 1.
>>>>
>>>> Aurelien
>>>>
>>>> On 3 Apr 2008, at 17:06, Ralph Castain wrote:
>>>>> I guess I'll have to ask the basic question: what version are you using?
>>>>>
>>>>> If you are talking about the trunk, there no longer is a "universe"
>>>>> concept anywhere in the code. Two mpiruns can connect/accept to each
>>>>> other as long as they can make contact. To facilitate that, we created
>>>>> an "ompi-server" tool that is supposed to be run by the sys admin (or a
>>>>> user, doesn't matter which) on the head node - there are various ways
>>>>> to tell mpirun how to contact the server, or it can self-discover it.
>>>>>
>>>>> I have tested publish/lookup pretty thoroughly and it seems to work. I
>>>>> haven't spent much time testing connect/accept except via comm_spawn,
>>>>> which seems to be working. Since that uses the same mechanism, I would
>>>>> have expected connect/accept to work as well.
>>>>>
>>>>> If you are talking about 1.2.x, then the story is totally different.
>>>>>
>>>>> Ralph
>>>>>
>>>>>
>>>>>
>>>>> On 4/3/08 2:29 PM, "Aurélien Bouteiller" <bouteill_at_[hidden]>
>>>>> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I'm trying to figure out how complete the implementation of
>>>>>> Comm_connect/Accept is. I found two problematic cases.
>>>>>>
>>>>>> 1) Two different programs are started by two different mpiruns. One
>>>>>> calls accept, the second one calls connect. I would not expect
>>>>>> MPI_Publish_name/Lookup_name to work because they do not share the
>>>>>> HNP. Still, I would expect to be able to connect by copying (with
>>>>>> printf-scanf) the port_name string generated by Open_port, especially
>>>>>> considering that in Open MPI the port_name is a string containing the
>>>>>> tcp address and port of rank 0 in the server communicator. However,
>>>>>> doing so results in "no route to host" and the connecting application
>>>>>> aborts. Is the problem related to an explicit check of the universes
>>>>>> on the accept HNP? Do I expect too much from the MPI standard? Is it
>>>>>> because my two applications do not share the same universe? Should we
>>>>>> (re)add the ability to use the same universe for several mpiruns?
>>>>>>
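>>>>>> To make point 1 concrete, here is a minimal sketch of that scenario
>>>>>> (illustrative only, not the exact code I ran): the first mpirun opens
>>>>>> a port and prints it, then the printed port string is pasted onto the
>>>>>> command line of a second, separate mpirun, which connects to it.
>>>>>>
>>>>>> #include <stdio.h>
>>>>>> #include <string.h>
>>>>>> #include <mpi.h>
>>>>>>
>>>>>> int main(int argc, char *argv[])
>>>>>> {
>>>>>>     char port[MPI_MAX_PORT_NAME];
>>>>>>     MPI_Comm inter;
>>>>>>
>>>>>>     MPI_Init(&argc, &argv);
>>>>>>     if(argc < 2)
>>>>>>     {
>>>>>>         /* "accept" side: open a port, print it, wait for one client */
>>>>>>         MPI_Open_port(MPI_INFO_NULL, port);
>>>>>>         printf("Port name: %s\n", port);
>>>>>>         fflush(stdout);
>>>>>>         MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
>>>>>>         MPI_Close_port(port);
>>>>>>     }
>>>>>>     else
>>>>>>     {
>>>>>>         /* "connect" side: the port string is pasted as argv[1]
>>>>>>          * (it contains ';', so it must be quoted in the shell) */
>>>>>>         strncpy(port, argv[1], MPI_MAX_PORT_NAME - 1);
>>>>>>         port[MPI_MAX_PORT_NAME - 1] = '\0';
>>>>>>         MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
>>>>>>     }
>>>>>>     MPI_Comm_disconnect(&inter);
>>>>>>     MPI_Finalize();
>>>>>>     return 0;
>>>>>> }
>>>>>>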
>>>>>> 2) The second issue is when the program sets up a port and then
>>>>>> accepts multiple clients on this port. Everything works fine for the
>>>>>> first client, and then accept stalls forever while waiting for the
>>>>>> second one. My understanding of the standard is that this should work:
>>>>>> 5.4.2 states "it must call MPI_Open_port to establish a port [...] it
>>>>>> must call MPI_Comm_accept to accept connections from clients". I
>>>>>> understand that with one MPI_Open_port I should be able to manage
>>>>>> several MPI clients. Am I understanding the standard correctly here,
>>>>>> and should we fix this?
>>>>>>
>>>>>> Here is a copy of the non-working code for reference.
>>>>>>
>>>>>> /*
>>>>>>  * Copyright (c) 2004-2007 The Trustees of the University of Tennessee.
>>>>>>  * All rights reserved.
>>>>>>  * $COPYRIGHT$
>>>>>>  *
>>>>>>  * Additional copyrights may follow
>>>>>>  *
>>>>>>  * $HEADER$
>>>>>>  */
>>>>>> #include <stdlib.h>
>>>>>> #include <stdio.h>
>>>>>> #include <mpi.h>
>>>>>>
>>>>>> int main(int argc, char *argv[])
>>>>>> {
>>>>>>     char port[MPI_MAX_PORT_NAME];
>>>>>>     int rank;
>>>>>>     int np;
>>>>>>
>>>>>>     MPI_Init(&argc, &argv);
>>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &np);
>>>>>>
>>>>>>     if(rank)
>>>>>>     {
>>>>>>         MPI_Comm comm;
>>>>>>         /* client: receive the port name from rank 0, then connect */
>>>>>>         MPI_Recv(port, MPI_MAX_PORT_NAME, MPI_CHAR, 0, 0,
>>>>>>                  MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>>>>>         printf("Read port: %s\n", port);
>>>>>>         MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &comm);
>>>>>>
>>>>>>         MPI_Send(&rank, 1, MPI_INT, 0, 1, comm);
>>>>>>         MPI_Comm_disconnect(&comm);
>>>>>>     }
>>>>>>     else
>>>>>>     {
>>>>>>         /* server: open one port, then accept each of the np-1
>>>>>>          * clients on that same port */
>>>>>>         int nc = np - 1;
>>>>>>         MPI_Comm *comm_nodes = (MPI_Comm *) calloc(nc, sizeof(MPI_Comm));
>>>>>>         MPI_Request *reqs = (MPI_Request *) calloc(nc, sizeof(MPI_Request));
>>>>>>         int *event = (int *) calloc(nc, sizeof(int));
>>>>>>         int i;
>>>>>>
>>>>>>         MPI_Open_port(MPI_INFO_NULL, port);
>>>>>>         /* MPI_Publish_name("test_service_el", MPI_INFO_NULL, port); */
>>>>>>         printf("Port name: %s\n", port);
>>>>>>         for(i = 1; i < np; i++)
>>>>>>             MPI_Send(port, MPI_MAX_PORT_NAME, MPI_CHAR, i, 0,
>>>>>>                      MPI_COMM_WORLD);
>>>>>>
>>>>>>         for(i = 0; i < nc; i++)
>>>>>>         {
>>>>>>             MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF,
>>>>>>                             &comm_nodes[i]);
>>>>>>             printf("Accept %d\n", i);
>>>>>>             MPI_Irecv(&event[i], 1, MPI_INT, 0, 1, comm_nodes[i],
>>>>>>                       &reqs[i]);
>>>>>>             printf("IRecv %d\n", i);
>>>>>>         }
>>>>>>         MPI_Close_port(port);
>>>>>>         MPI_Waitall(nc, reqs, MPI_STATUSES_IGNORE);
>>>>>>         for(i = 0; i < nc; i++)
>>>>>>         {
>>>>>>             printf("event[%d] = %d\n", i, event[i]);
>>>>>>             MPI_Comm_disconnect(&comm_nodes[i]);
>>>>>>             printf("Disconnect %d\n", i);
>>>>>>         }
>>>>>>     }
>>>>>>
>>>>>>     MPI_Finalize();
>>>>>>     return EXIT_SUCCESS;
>>>>>> }
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> * Dr. Aurélien Bouteiller
>>>>>> * Sr. Research Associate at Innovative Computing Laboratory
>>>>>> * University of Tennessee
>>>>>> * 1122 Volunteer Boulevard, suite 350
>>>>>> * Knoxville, TN 37996
>>>>>> * 865 974 6321
>>>>>>
>>>>>> _______________________________________________
>>>>>> devel mailing list
>>>>>> devel_at_[hidden]
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel