Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] RE : MPI_Comm_connect() fails
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-03-17 15:54:51


Edgar --

Can you make a patch for the 1.2 series?

On Mar 17, 2008, at 3:45 PM, Edgar Gabriel wrote:

> Martin,
>
> I found the problem in the inter-allgather and fixed it in patch
> 17849. The same test, however, using MPI_Intercomm_create (just to
> simplify my life compared to Connect/Accept) with 2 vs. 4 processes in
> the two groups passes for me -- and did fail with the previous version.
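[Editor's note: below is a minimal sketch of the kind of test Edgar describes -- two groups of unequal size joined with MPI_Intercomm_create(), then the same MPI_Allgather() call as in Martin's programs over the resulting intercommunicator. The file name, the split point (ranks 0-1 vs. the rest) and the tag value are illustrative assumptions, not Edgar's actual test code.]

/* intercomm_test.c -- hypothetical sketch, not Edgar's actual test.
 * Run with at least 3 processes, e.g. "mpiexec -n 6 ./intercomm_test"
 * for a 2 vs. 4 split. The split point (2) and tag (123) are arbitrary. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int world_rank, local_rank, remote_size, ii;
    MPI_Comm local_comm, intercomm;
    int *rem_rank_tbl;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Split MPI_COMM_WORLD into two groups of different sizes. */
    const int in_first_group = (world_rank < 2);
    MPI_Comm_split(MPI_COMM_WORLD, in_first_group, world_rank, &local_comm);

    /* The remote leader is the lowest world rank of the other group. */
    const int remote_leader = in_first_group ? 2 : 0;
    MPI_Intercomm_create(local_comm, 0, MPI_COMM_WORLD, remote_leader, 123,
                         &intercomm);

    MPI_Comm_rank(local_comm, &local_rank);
    MPI_Comm_remote_size(intercomm, &remote_size);

    /* Each process contributes one int and receives one int per process
     * of the *remote* group, exactly as in Martin's aserver.c/aclient.c. */
    rem_rank_tbl = malloc(remote_size * sizeof(*rem_rank_tbl));
    MPI_Allgather(&local_rank, 1, MPI_INT,
                  rem_rank_tbl, 1, MPI_INT,
                  intercomm);

    if (world_rank == 0) {
        printf("remote ranks seen by world rank 0:");
        for (ii = 0; ii < remote_size; ii++) {
            printf(" %d", rem_rank_tbl[ii]);
        }
        printf("\n");
    }

    free(rem_rank_tbl);
    MPI_Comm_free(&intercomm);
    MPI_Comm_free(&local_comm);
    MPI_Finalize();
    return 0;
}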
>
>
> Thanks
> Edgar
>
>
> Audet, Martin wrote:
>> Hi Jeff,
>>
>> As I said in my last message (see below), the patch (or at least
>> the patch I got) doesn't fix the problem for me. Whether I apply it
>> over Open MPI 1.2.5 or 1.2.6rc2, I still get the same problem:
>>
>> The client aborts with a truncation error message while the server
>> freezes when, for example, the server is started on 3 processes and
>> the client on 2 processes.
>>
>> Feel free to try for yourself the two small client and server
>> programs I posted in my first message.
>>
>> Thanks,
>>
>> Martin
>>
>>
>> Subject: [OMPI users] RE : users Digest, Vol 841, Issue 3
>> From: Audet, Martin (Martin.Audet_at_[hidden])
>> Date: 2008-03-13 17:04:25
>>
>> Hi Georges,
>>
>> Thanks for your patch, but I'm not sure I got it correctly. The
>> patch I got modifies a few arguments passed to isend()/irecv()/recv()
>> in coll_basic_allgather.c. Here is the patch I applied:
>>
>> Index: ompi/mca/coll/basic/coll_basic_allgather.c
>> ===================================================================
>> --- ompi/mca/coll/basic/coll_basic_allgather.c (revision 17814)
>> +++ ompi/mca/coll/basic/coll_basic_allgather.c (working copy)
>> @@ -149,7 +149,7 @@
>> }
>>
>> /* Do a send-recv between the two root procs. to avoid
>> deadlock */
>> - err = MCA_PML_CALL(isend(sbuf, scount, sdtype, 0,
>> + err = MCA_PML_CALL(isend(sbuf, scount, sdtype, root,
>> MCA_COLL_BASE_TAG_ALLGATHER,
>> MCA_PML_BASE_SEND_STANDARD,
>> comm, &reqs[rsize]));
>> @@ -157,7 +157,7 @@
>> return err;
>> }
>>
>> - err = MCA_PML_CALL(irecv(rbuf, rcount, rdtype, 0,
>> + err = MCA_PML_CALL(irecv(rbuf, rcount, rdtype, root,
>> MCA_COLL_BASE_TAG_ALLGATHER, comm,
>> &reqs[0]));
>> if (OMPI_SUCCESS != err) {
>> @@ -186,14 +186,14 @@
>> return err;
>> }
>>
>> - err = MCA_PML_CALL(isend(rbuf, rsize * rcount, rdtype, 0,
>> + err = MCA_PML_CALL(isend(rbuf, rsize * scount, sdtype, root,
>> MCA_COLL_BASE_TAG_ALLGATHER,
>> MCA_PML_BASE_SEND_STANDARD, comm,
>> &req));
>> if (OMPI_SUCCESS != err) {
>> goto exit;
>> }
>>
>> - err = MCA_PML_CALL(recv(tmpbuf, size * scount, sdtype, 0,
>> + err = MCA_PML_CALL(recv(tmpbuf, size * rcount, rdtype, root,
>> MCA_COLL_BASE_TAG_ALLGATHER, comm,
>> MPI_STATUS_IGNORE));
>> if (OMPI_SUCCESS != err) {
>>
>> However, with this patch I still have the problem. If I start the
>> server with three processes and the client with two, the client
>> prints:
>>
>> [audet_at_linux15 dyn_connect]$ mpiexec --universe univ1 -n 2 ./aclient '0.2.0:2000'
>> intercomm_flag = 1
>> intercomm_remote_size = 3
>> rem_rank_tbl[3] = { 0 1 2}
>> [linux15:26114] *** An error occurred in MPI_Allgather
>> [linux15:26114] *** on communicator
>> [linux15:26114] *** MPI_ERR_TRUNCATE: message truncated
>> [linux15:26114] *** MPI_ERRORS_ARE_FATAL (goodbye)
>> mpiexec noticed that job rank 0 with PID 26113 on node linux15
>> exited on signal 15 (Terminated).
>> [audet_at_linux15 dyn_connect]$
>>
>> and aborts. The server on the other side simply hangs (as before).
>>
>> Regards,
>>
>> Martin
>>
>> -----Original Message-----
>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_open-
>> mpi.org] On Behalf Of Jeff Squyres
>> Sent: March 14, 2008 19:45
>> To: Open MPI Users
>> Subject: Re: [OMPI users] RE : MPI_Comm_connect() fails
>>
>> Yes, please let us know if this fixes it. We're working on a 1.2.6
>> release; we can definitely put this fix in there if it's correct.
>>
>> Thanks!
>>
>>
>> On Mar 13, 2008, at 4:07 PM, George Bosilca wrote:
>>
>>> I dug into the sources and I think you correctly pinpointed the bug.
>>> It seems we have a mismatch between the local and remote sizes in
>>> the inter-communicator allgather in the 1.2 series (which explains
>>> the message truncation error when the local and remote groups have a
>>> different number of processes). Attached to this email you can find
>>> a patch that [hopefully] solves this problem. If you can, please test
>>> it and let me know if it solves your problem.
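[Editor's note: as background on the error class George mentions, MPI_ERR_TRUNCATE is raised whenever a posted receive is shorter than the message that arrives, which is exactly what a local/remote size mix-up inside the collective would produce. A minimal, self-contained sketch (illustrative only, not Open MPI internals) that triggers the same error:]

/* truncate_demo.c -- illustrative only: rank 0 sends 3 ints, rank 1 posts
 * a receive for only 2, and the default MPI_ERRORS_ARE_FATAL handler
 * aborts with "MPI_ERR_TRUNCATE: message truncated".
 * Run with "mpiexec -n 2 ./truncate_demo". */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    int sendbuf[3] = { 10, 20, 30 };
    int recvbuf[3] = { 0, 0, 0 };

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(sendbuf, 3, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* 3 ints go out */
    } else if (rank == 1) {
        MPI_Recv(recvbuf, 2, MPI_INT, 0, 0, MPI_COMM_WORLD,    /* only 2 fit -> */
                 MPI_STATUS_IGNORE);                            /* truncation   */
    }

    MPI_Finalize();
    return 0;
}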
>>>
>>> Thanks,
>>> george.
>>>
>>> <inter_allgather.patch>
>>>
>>>
>>> On Mar 13, 2008, at 1:11 PM, Audet, Martin wrote:
>>>
>>>> Hi,
>>>>
>>>> After re-checking the MPI standard (www.mpi-forum.org and MPI - The
>>>> Complete Reference), I'm more and more convinced that my small
>>>> example programs, which establish an intercommunicator with
>>>> MPI_Comm_connect()/MPI_Comm_accept() over an MPI port and exchange
>>>> data over it with MPI_Allgather(), are correct. In particular,
>>>> calling MPI_Allgather() with recvcount=1 (its fifth argument)
>>>> instead of the total number of MPI_INTs that will be received (i.e.
>>>> intercomm_remote_size in the examples) is both correct and
>>>> consistent with MPI_Allgather() behavior on an intracommunicator
>>>> (i.e. a "normal" communicator).
>>>>
>>>> MPI_Allgather(&comm_rank, 1, MPI_INT,
>>>> rem_rank_tbl, 1, MPI_INT,
>>>> intercomm);
>>>>
>>>> Also, the recvbuf argument (the fourth argument) of MPI_Allgather()
>>>> in the examples should have a size of intercomm_remote_size (i.e.
>>>> the size of the remote group), not the sum of the local and remote
>>>> group sizes, in both the client and server processes. The standard
>>>> says that for all-to-all types of operations over an
>>>> intercommunicator, processes send and receive data from the remote
>>>> group only (anyway, it is not possible to exchange data with
>>>> processes of the local group over an intercommunicator).
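[Editor's note: to make the sizing concrete with the 3-process server / 2-process client numbers used later in this thread, here is the relevant fragment -- the same call as in the programs quoted below, with the per-side counts spelled out in comments; intercomm and comm_rank are assumed to be set up as in those programs.]

int remote_size;
MPI_Comm_remote_size(intercomm, &remote_size);   /* 2 on the server, 3 on the client */

/* The receive buffer only holds the remote group's contributions:
 *   server side: 2 * recvcount = 2 ints
 *   client side: 3 * recvcount = 3 ints */
int *rem_rank_tbl = malloc(remote_size * sizeof(*rem_rank_tbl));

MPI_Allgather(&comm_rank, 1, MPI_INT,     /* each process sends one int           */
              rem_rank_tbl, 1, MPI_INT,   /* and receives one int per REMOTE rank */
              intercomm);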
>>>>
>>>> So, for me there is no reason to stop the process with an error
>>>> message complaining about message truncation. There should be no
>>>> truncation: the sendcount, sendtype, recvcount and recvtype
>>>> arguments of MPI_Allgather() are correct and consistent.
>>>>
>>>> So again, for me the Open MPI behavior with my example looks more
>>>> and more like a bug...
>>>>
>>>> Concerning George's comment about valgrind and TCP/IP, I totally
>>>> agree: messages reported by valgrind are only a clue of a bug,
>>>> especially in this context, not a proof of one. Another clue is that
>>>> my small examples work perfectly with MPICH2 ch3:sock.
>>>>
>>>> Regards,
>>>>
>>>> Martin Audet
>>>>
>>>>
>>>> ------------------------------
>>>>
>>>> Message: 4
>>>> Date: Thu, 13 Mar 2008 08:21:51 +0100
>>>> From: jody <jody.xha_at_[hidden]>
>>>> Subject: Re: [OMPI users] RE : MPI_Comm_connect() fails
>>>> To: "Open MPI Users" <users_at_[hidden]>
>>>> Message-ID:
>>>> <9b0da5ce0803130021l4ead0f91qaf43e4ac7d332c93_at_[hidden]>
>>>> Content-Type: text/plain; charset=ISO-8859-1
>>>>
>>>> Hi,
>>>> I think the recvcount argument you pass to MPI_Allgather should not
>>>> be 1, but instead the number of MPI_INTs your buffer rem_rank_tbl
>>>> can contain. As it stands now, you tell MPI_Allgather that it may
>>>> only receive 1 MPI_INT.
>>>>
>>>> Furthermore, I'm not sure, but I think your receive buffer should be
>>>> large enough to contain messages from *all* processes, and not just
>>>> from the "far side".
>>>>
>>>> Jody
>>>>
>>>>
>>>>
>>>> ------------------------------
>>>>
>>>> Message: 6
>>>> Date: Thu, 13 Mar 2008 09:06:47 -0500
>>>> From: George Bosilca <bosilca_at_[hidden]>
>>>> Subject: Re: [OMPI users] RE : MPI_Comm_connect() fails
>>>> To: Open MPI Users <users_at_[hidden]>
>>>> Message-ID: <82E9FF28-FB87-4FFB-A492-DDE472D5DEA7_at_[hidden]>
>>>> Content-Type: text/plain; charset="us-ascii"
>>>>
>>>> I am not aware of any problems with the allreduce/allgather. But we
>>>> are aware of the problem with valgrind, which reports non-initialized
>>>> values when used with TCP. It's a long story, but I can guarantee
>>>> that this should not affect a correct MPI application.
>>>>
>>>> george.
>>>>
>>>> PS: For those who want to know the details: we have to send a header
>>>> over TCP which contains some very basic information, including the
>>>> size of the fragment. Unfortunately, we have a 2-byte gap in the
>>>> header. As we never initialize these 2 unused bytes, but we do send
>>>> them over the wire, valgrind correctly detects the non-initialized
>>>> data transfer.
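[Editor's note: below is a hypothetical sketch of the effect George describes -- NOT Open MPI's real header layout. The compiler leaves a 2-byte alignment hole between a 2-byte and a 4-byte field, and writing the whole struct to a socket or pipe without zeroing it first makes valgrind report "Syscall param write(buf) points to uninitialised byte(s)", just as writev() is flagged in Martin's traces.]

/* padding_demo.c -- illustrative sketch of uninitialised struct padding
 * going over the wire. */
#include <string.h>
#include <unistd.h>
#include <stdint.h>

struct frag_header {
    uint16_t type;       /* 2 bytes                                     */
                         /* 2 bytes of padding inserted by the compiler */
    uint32_t frag_size;  /* size of the fragment that follows           */
};

static void send_header(int fd, uint32_t frag_size)
{
    struct frag_header hdr;

    /* Comment this memset out and valgrind flags the write() below,
     * because the padding bytes go over the wire uninitialised even
     * though the receiver never looks at them. */
    memset(&hdr, 0, sizeof(hdr));

    hdr.type = 1;
    hdr.frag_size = frag_size;
    (void)write(fd, &hdr, sizeof(hdr));
}

int main(void)
{
    int fds[2];
    if (pipe(fds) == 0) {
        send_header(fds[1], 42);
        close(fds[1]);
        close(fds[0]);
    }
    return 0;
}

Zeroing the header (or laying out the fields so there is no hole) silences the warning without changing anything the receiver actually reads, which matches George's point that the reports are harmless for a correct application.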
>>>>
>>>>
>>>> On Mar 12, 2008, at 3:58 PM, Audet, Martin wrote:
>>>>
>>>>> Hi again,
>>>>>
>>>>> Thanks Pak for the link and for suggesting to start an "orted"
>>>>> daemon; by doing so my client and server jobs were able to
>>>>> establish an intercommunicator between them.
>>>>>
>>>>> However, I modified my programs to perform an MPI_Allgather() of a
>>>>> single "int" over the new intercommunicator to test communication a
>>>>> little bit, and I did encounter problems. I am now wondering if
>>>>> there is a problem in MPI_Allgather() itself for intercommunicators.
>>>>> Note that the same program runs without problems with MPICH2
>>>>> (ch3:sock).
>>>>>
>>>>> For example if I start orted as follows:
>>>>>
>>>>> orted --persistent --seed --scope public --universe univ1
>>>>>
>>>>> and then start the server with three process:
>>>>>
>>>>> mpiexec --universe univ1 -n 3 ./aserver
>>>>>
>>>>> it prints:
>>>>>
>>>>> Server port = '0.2.0:2000'
>>>>>
>>>>> Now if I start the client with two process as follow (using the
>>>>> server port):
>>>>>
>>>>> mpiexec --universe univ1 -n 2 ./aclient '0.2.0:2000'
>>>>>
>>>>> The server prints:
>>>>>
>>>>> intercomm_flag = 1
>>>>> intercomm_remote_size = 2
>>>>> rem_rank_tbl[2] = { 0 1}
>>>>>
>>>>> which is the correct output. The client then prints:
>>>>>
>>>>> intercomm_flag = 1
>>>>> intercomm_remote_size = 3
>>>>> rem_rank_tbl[3] = { 0 1 2}
>>>>> [linux15:30895] *** An error occurred in MPI_Allgather
>>>>> [linux15:30895] *** on communicator
>>>>> [linux15:30895] *** MPI_ERR_TRUNCATE: message truncated
>>>>> [linux15:30895] *** MPI_ERRORS_ARE_FATAL (goodbye)
>>>>> mpiexec noticed that job rank 0 with PID 30894 on node linux15
>>>>> exited on signal 15 (Terminated).
>>>>>
>>>>> As you can see, the first messages are correct but the client job
>>>>> terminates with an error (and the server hangs).
>>>>>
>>>>> After re-reading the documentation about MPI_Allgather() over an
>>>>> intercommunicator, I don't see anything wrong in my simple code.
>>>>> Also, if I run the client and server processes with valgrind, I get
>>>>> a few messages like:
>>>>>
>>>>> ==29821== Syscall param writev(vector[...]) points to uninitialised byte(s)
>>>>> ==29821==    at 0x36235C2130: writev (in /lib64/libc-2.3.5.so)
>>>>> ==29821==    by 0x7885583: mca_btl_tcp_frag_send (in /home/publique/openmpi-1.2.5/lib/openmpi/mca_btl_tcp.so)
>>>>> ==29821==    by 0x788501B: mca_btl_tcp_endpoint_send (in /home/publique/openmpi-1.2.5/lib/openmpi/mca_btl_tcp.so)
>>>>> ==29821==    by 0x7467947: mca_pml_ob1_send_request_start_prepare (in /home/publique/openmpi-1.2.5/lib/openmpi/mca_pml_ob1.so)
>>>>> ==29821==    by 0x7461494: mca_pml_ob1_isend (in /home/publique/openmpi-1.2.5/lib/openmpi/mca_pml_ob1.so)
>>>>> ==29821==    by 0x798BF9D: mca_coll_basic_allgather_inter (in /home/publique/openmpi-1.2.5/lib/openmpi/mca_coll_basic.so)
>>>>> ==29821==    by 0x4A5069C: PMPI_Allgather (in /home/publique/openmpi-1.2.5/lib/libmpi.so.0.0.0)
>>>>> ==29821==    by 0x400EED: main (aserver.c:53)
>>>>> ==29821==  Address 0x40d6cac is not stack'd, malloc'd or (recently) free'd
>>>>>
>>>>> in both the MPI_Allgather() and MPI_Comm_disconnect() calls for the
>>>>> client and the server, with valgrind always reporting that the
>>>>> addresses in question are "not stack'd, malloc'd or (recently)
>>>>> free'd".
>>>>>
>>>>> So is there a problem with MPI_Allgather() on intercommunicators,
>>>>> or am I doing something wrong?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Martin
>>>>>
>>>>>
>>>>> /* aserver.c */
>>>>> #include <stdio.h>
>>>>> #include <mpi.h>
>>>>>
>>>>> #include <assert.h>
>>>>> #include <stdlib.h>
>>>>>
>>>>> int main(int argc, char **argv)
>>>>> {
>>>>>     int comm_rank, comm_size;
>>>>>     char port_name[MPI_MAX_PORT_NAME];
>>>>>     MPI_Comm intercomm;
>>>>>     int ok_flag;
>>>>>
>>>>>     int intercomm_flag;
>>>>>     int intercomm_remote_size;
>>>>>     int *rem_rank_tbl;
>>>>>     int ii;
>>>>>
>>>>>     MPI_Init(&argc, &argv);
>>>>>
>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &comm_rank);
>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &comm_size);
>>>>>
>>>>>     ok_flag = (comm_rank != 0) || (argc == 1);
>>>>>     MPI_Bcast(&ok_flag, 1, MPI_INT, 0, MPI_COMM_WORLD);
>>>>>
>>>>>     if (!ok_flag) {
>>>>>         if (comm_rank == 0) {
>>>>>             fprintf(stderr, "Usage: %s\n", argv[0]);
>>>>>         }
>>>>>         MPI_Abort(MPI_COMM_WORLD, 1);
>>>>>     }
>>>>>
>>>>>     MPI_Open_port(MPI_INFO_NULL, port_name);
>>>>>
>>>>>     if (comm_rank == 0) {
>>>>>         printf("Server port = '%s'\n", port_name);
>>>>>     }
>>>>>     MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &intercomm);
>>>>>
>>>>>     MPI_Close_port(port_name);
>>>>>
>>>>>     MPI_Comm_test_inter(intercomm, &intercomm_flag);
>>>>>     if (comm_rank == 0) {
>>>>>         printf("intercomm_flag = %d\n", intercomm_flag);
>>>>>     }
>>>>>     assert(intercomm_flag != 0);
>>>>>     MPI_Comm_remote_size(intercomm, &intercomm_remote_size);
>>>>>     if (comm_rank == 0) {
>>>>>         printf("intercomm_remote_size = %d\n", intercomm_remote_size);
>>>>>     }
>>>>>     rem_rank_tbl = malloc(intercomm_remote_size*sizeof(*rem_rank_tbl));
>>>>>     MPI_Allgather(&comm_rank, 1, MPI_INT,
>>>>>                   rem_rank_tbl, 1, MPI_INT,
>>>>>                   intercomm);
>>>>>     if (comm_rank == 0) {
>>>>>         printf("rem_rank_tbl[%d] = {", intercomm_remote_size);
>>>>>         for (ii = 0; ii < intercomm_remote_size; ii++) {
>>>>>             printf(" %d", rem_rank_tbl[ii]);
>>>>>         }
>>>>>         printf("}\n");
>>>>>     }
>>>>>     free(rem_rank_tbl);
>>>>>
>>>>>     MPI_Comm_disconnect(&intercomm);
>>>>>
>>>>>     MPI_Finalize();
>>>>>
>>>>>     return 0;
>>>>> }
>>>>>
>>>>> /* aclient.c */
>>>>> #include <stdio.h>
>>>>> #include <unistd.h>
>>>>>
>>>>> #include <mpi.h>
>>>>>
>>>>> #include <assert.h>
>>>>> #include <stdlib.h>
>>>>>
>>>>> int main(int argc, char **argv)
>>>>> {
>>>>>     int comm_rank, comm_size;
>>>>>     int ok_flag;
>>>>>     MPI_Comm intercomm;
>>>>>
>>>>>     int intercomm_flag;
>>>>>     int intercomm_remote_size;
>>>>>     int *rem_rank_tbl;
>>>>>     int ii;
>>>>>
>>>>>     MPI_Init(&argc, &argv);
>>>>>
>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &comm_rank);
>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &comm_size);
>>>>>
>>>>>     ok_flag = (comm_rank != 0) || ((argc == 2) && argv[1] && (*argv[1] != '\0'));
>>>>>     MPI_Bcast(&ok_flag, 1, MPI_INT, 0, MPI_COMM_WORLD);
>>>>>
>>>>>     if (!ok_flag) {
>>>>>         if (comm_rank == 0) {
>>>>>             fprintf(stderr, "Usage: %s mpi_port\n", argv[0]);
>>>>>         }
>>>>>         MPI_Abort(MPI_COMM_WORLD, 1);
>>>>>     }
>>>>>
>>>>>     while (MPI_Comm_connect((comm_rank == 0) ? argv[1] : 0,
>>>>>                             MPI_INFO_NULL, 0, MPI_COMM_WORLD, &intercomm) != MPI_SUCCESS) {
>>>>>         if (comm_rank == 0) {
>>>>>             printf("MPI_Comm_connect() failed, sleeping and retrying...\n");
>>>>>         }
>>>>>         sleep(1);
>>>>>     }
>>>>>
>>>>>     MPI_Comm_test_inter(intercomm, &intercomm_flag);
>>>>>     if (comm_rank == 0) {
>>>>>         printf("intercomm_flag = %d\n", intercomm_flag);
>>>>>     }
>>>>>     assert(intercomm_flag != 0);
>>>>>     MPI_Comm_remote_size(intercomm, &intercomm_remote_size);
>>>>>     if (comm_rank == 0) {
>>>>>         printf("intercomm_remote_size = %d\n", intercomm_remote_size);
>>>>>     }
>>>>>     rem_rank_tbl = malloc(intercomm_remote_size*sizeof(*rem_rank_tbl));
>>>>>     MPI_Allgather(&comm_rank, 1, MPI_INT,
>>>>>                   rem_rank_tbl, 1, MPI_INT,
>>>>>                   intercomm);
>>>>>     if (comm_rank == 0) {
>>>>>         printf("rem_rank_tbl[%d] = {", intercomm_remote_size);
>>>>>         for (ii = 0; ii < intercomm_remote_size; ii++) {
>>>>>             printf(" %d", rem_rank_tbl[ii]);
>>>>>         }
>>>>>         printf("}\n");
>>>>>     }
>>>>>     free(rem_rank_tbl);
>>>>>
>>>>>     MPI_Comm_disconnect(&intercomm);
>>>>>
>>>>>     MPI_Finalize();
>>>>>
>>>>>     return 0;
>>>>> }
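[Editor's note: for anyone who wants to try these two programs, they build with the usual Open MPI wrapper compiler; the run sequence below simply combines the commands already given earlier in this thread -- the universe name and the port string are whatever orted and the server print on your machine.]

mpicc aserver.c -o aserver
mpicc aclient.c -o aclient

orted --persistent --seed --scope public --universe univ1
mpiexec --universe univ1 -n 3 ./aserver              # prints e.g. Server port = '0.2.0:2000'
mpiexec --universe univ1 -n 2 ./aclient '0.2.0:2000'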
>>>>>
>>
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>
>
> --
> Edgar Gabriel
> Assistant Professor
> Parallel Software Technologies Lab http://pstl.cs.uh.edu
> Department of Computer Science University of Houston
> Philip G. Hoffman Hall, Room 524 Houston, TX-77204, USA
> Tel: +1 (713) 743-3857 Fax: +1 (713) 743-3335

-- 
Jeff Squyres
Cisco Systems