
Open MPI User's Mailing List Archives


Subject: [OMPI users] RE : Unable to connect to a server using MX MTL with TCP
From: Audet, Martin (Martin.Audet_at_[hidden])
Date: 2010-06-04 19:24:28


Sorry,

I forgot the attachments...

Martin

________________________________________
From: users-bounces_at_[hidden] [users-bounces_at_[hidden]] on behalf of Audet, Martin [Martin.Audet_at_[hidden]]
Sent: June 4, 2010 19:18
To: users_at_[hidden]
Subject: [OMPI users] Unable to connect to a server using MX MTL with TCP

Hi OpenMPI_Users and OpenMPI_Developers,

I'm unable to connect a client application using MPI_Comm_connect() to a server job (the server job calls MPI_Open_port() before calling MPI_Comm_accept()) when the server job uses the MX MTL (it works without problems when the server uses the MX BTL). The server job runs on a cluster connected to a Myrinet 10G network (MX 1.2.11) in addition to an ordinary Ethernet network. The client runs on a different machine that is not connected to the Myrinet network but is reachable via the Ethernet network.

Attached to this message are the simple server and client programs (87 lines total), simpleserver.c and simpleclient.c.
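In short, the programs follow the standard connect/accept pattern; the following is only a condensed sketch of that pattern (an illustration, not the attached 87 lines themselves), with the client and server roles merged into one file for brevity:

   /* Condensed sketch of the connect/accept pattern used by the attached
    * programs (an illustration only, not the attached files themselves). */
   #include <stdio.h>
   #include <mpi.h>

   int main(int argc, char **argv)
   {
       MPI_Comm inter;

       MPI_Init(&argc, &argv);

       if (argc > 1) {
           /* client: the port name printed by the server is passed on the command line */
           MPI_Comm_connect(argv[1], MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
           printf("MPI_Comm_connect() successful...\n");
       } else {
           /* server: open a port, print it and wait for a client to connect */
           char port[MPI_MAX_PORT_NAME];
           MPI_Open_port(MPI_INFO_NULL, port);
           printf("Server port = '%s'\n", port);
           MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
           printf("MPI_Comm_accept() successful...\n");
           MPI_Close_port(port);
       }

       /* both sides tear down the intercommunicator; the server backtrace below
        * shows the crash under PMPI_Comm_disconnect -> mca_pml_cm -> mca_mtl_mx */
       MPI_Comm_disconnect(&inter);

       MPI_Finalize();
       return 0;
   }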

Note that we are using Open MPI 1.4.2 on x86_64 Linux (server: Fedora 7, client: Fedora 12).

Compiling these programs with mpicc on the server front node (fn1) and client workstation (linux15) works well:

   [audet_at_fn1 bench]$ mpicc simpleserver.c -o simpleserver

   [audet_at_linux15 mpi]$ mpicc simpleclient.c -o simpleclient

Then we start the server on the cluster (the job is started on cluster node cn18), asking it to use the MX MTL:

   [audet_at_fn1 bench]$ mpiexec -x MX_RCACHE=2 -machinefile machinefile_cn18 --mca mtl mx --mca pml cm -n 1 ./simpleserver

It prints the server port (note that we use MX_RCACHE=2 to avoid a warning; it doesn't affect the current issue):

   Server port = '3548905472.0;tcp://172.17.15.20:39517+3548905473.0;tcp://172.17.10.18:47427:300'

Then we start the client on the workstation, passing it this port name:

   [audet_at_linux15 mpi]$ mpiexec -n 1 ./simpleclient '3548905472.0;tcp://172.17.15.20:39517+3548905473.0;tcp://172.17.10.18:47427:300'

The server process core dumps as follows:

   MPI_Comm_accept() sucessful...
   [cn18:24582] *** Process received signal ***
   [cn18:24582] Signal: Segmentation fault (11)
   [cn18:24582] Signal code: Address not mapped (1)
   [cn18:24582] Failing at address: 0x38
   [cn18:24582] [ 0] /lib64/libpthread.so.0 [0x305de0dd20]
   [cn18:24582] [ 1] /usr/local/openmpi-1.4.2/lib/openmpi/mca_mtl_mx.so [0x2aaaad6a7e6d]
   [cn18:24582] [ 2] /usr/local/openmpi-1.4.2/lib/openmpi/mca_pml_cm.so [0x2aaaad4a319d]
   [cn18:24582] [ 3] /usr/local/openmpi/lib/libmpi.so.0(ompi_dpm_base_disconnect_init+0xbf) [0x2aaaaab1403f]
   [cn18:24582] [ 4] /usr/local/openmpi-1.4.2/lib/openmpi/mca_dpm_orte.so [0x2aaaaed0eb19]
   [cn18:24582] [ 5] /usr/local/openmpi/lib/libmpi.so.0(PMPI_Comm_disconnect+0xa0) [0x2aaaaaaf4f20]
   [cn18:24582] [ 6] ./simpleserver(main+0x14c) [0x400d04]
   [cn18:24582] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x305ce1daa4]
   [cn18:24582] [ 8] ./simpleserver [0x400b09]
   [cn18:24582] *** End of error message ***
   --------------------------------------------------------------------------
   mpiexec noticed that process rank 0 with PID 24582 on node cn18 exited on signal 11 (Segmentation fault).
   --------------------------------------------------------------------------
   [audet_at_fn1 bench]$

And the client stops with the following error message:

   --------------------------------------------------------------------------
   At least one pair of MPI processes are unable to reach each other for
   MPI communications. This means that no Open MPI device has indicated
   that it can be used to communicate between these processes. This is
   an error; Open MPI requires that all MPI processes be able to reach
   each other. This error can sometimes be the result of forgetting to
   specify the "self" BTL.

     Process 1 ([[31386,1],0]) is on host: linux15
     Process 2 ([[54152,1],0]) is on host: cn18
     BTLs attempted: self sm tcp

   Your MPI job is now going to abort; sorry.
   --------------------------------------------------------------------------
   MPI_Comm_connect() sucessful...
   Error in comm_disconnect_waitall
   [audet_at_linux15 mpi]$

I really don't understand this message because the client can connect to the server using TCP over Ethernet.

Moreover, if I add MCA options when starting the server so that the TCP BTL is included, the same problem happens (the argument list then becomes: '--mca mtl mx --mca pml cm --mca btl tcp,shared,self').

However, if I remove all MCA options when starting the server (so that the MX BTL is used), no such problem appears. Everything also works fine if I start the server with an explicit request to use the MX and TCP BTLs (e.g. with the options '--mca btl mx,tcp,sm,self').
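Concretely, the working BTL invocation looks like this (same machinefile and MX_RCACHE setting as the MTL command above, only the MCA options differ):

   [audet_at_fn1 bench]$ mpiexec -x MX_RCACHE=2 -machinefile machinefile_cn18 --mca btl mx,tcp,sm,self -n 1 ./simpleserver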

For running our server application, we really prefer to use the MX MTL over the MX BTL since our application is much faster with the MTL (although the usual ping-pong test is only slightly faster with the MTL).

Also enclosed is the output of 'ompi_info --all' run on the cluster node (cn18) and the workstation (linux15).

Please help me. I think my problem is only a matter of wrong MCA parameters (which are still obscure to me).

Thanks,

Martin Audet, Research Officer
Industrial Material Institute
National Research Council of Canada
75 de Mortagne, Boucherville, QC, J4B 6Y4, Canada

_______________________________________________
users mailing list
users_at_[hidden]
http://www.open-mpi.org/mailman/listinfo.cgi/users