
Open MPI Development Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-06-28 02:25:06


On Jun 26, 2007, at 5:06 PM, Georg Wassen wrote:

> Hello all,
>
> I temporarily worked around my former problem by using synchronous
> communication and shifting the initialization
> into the first call of a collective operation.
>
> But nevertheless, I found a performance bug in btl_openib.
>
> When I execute the attached sendrecv.c on 4 (or more) nodes of a
> Pentium D cluster with InfiniBand, each receiving process gets only
> 8 messages within a few seconds and then does nothing for at least
> 20 sec. (I executed the following command and hit Ctrl-C 20 sec.
> after the last output.)

This sounds like it could be a progression issue. When the openib
BTL is used by itself, we crank the frequency of the file descriptor
progression engine down very low because most progression will come
from verbs (not select/poll). I wonder if this is somehow related.
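
To make that concrete, here is a purely illustrative sketch (not
actual OMPI code; the function names and the interval are invented)
of why anything that only generates work on the file-descriptor path
can sit idle for a long time while a verbs-style transport is active:

    /* Toy model of a progress loop: the verbs path is polled on
     * every pass, the fd path only every Nth pass. */
    #define FD_POLL_INTERVAL 1000   /* the "cranked down" frequency */

    static void poll_verbs(void) { /* stand-in for polling the CQ  */ }
    static void poll_fds(void)   { /* stand-in for select()/poll() */ }

    static void progress(void)
    {
        static unsigned long calls = 0;
        poll_verbs();                         /* fast path: every call */
        if (++calls % FD_POLL_INTERVAL == 0)
            poll_fds();                       /* slow path: rarely     */
    }

    int main(void)
    {
        unsigned long i;
        for (i = 0; i < 5000; i++)
            progress();
        return 0;
    }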

FWIW: having to use PML_CALL() is by design. The MPI API layer does
all the error checking: ensuring that MPI_INIT completed, validating
parameters, etc. We never invoke the top-level MPI API from
elsewhere in the OMPI code base (except from within ROMIO; we didn't
want to make wholesale changes to that package because it would make
importing each new version extremely difficult). There are also
fault tolerance reasons why it's not good to call back up to the
top-level MPI API.
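
For reference, the internal path looks roughly like the sketch
below. I'm writing the macro, constant, and header names
(MCA_PML_CALL, MCA_PML_BASE_SEND_STANDARD, etc.) from memory, so
check them against the tree you're working on; it only compiles
inside the OMPI source anyway.

    /* Sketch: send through the PML interface instead of MPI_Send().
     * Names and signatures are approximate. */
    #include "ompi/communicator/communicator.h"
    #include "ompi/datatype/datatype.h"
    #include "ompi/mca/pml/pml.h"

    static int internal_send(void *buf, int count, ompi_datatype_t *dtype,
                             int dst, int tag, ompi_communicator_t *comm)
    {
        /* No MPI_INIT / parameter checks happen here; that is the
         * whole point of staying below the MPI API layer. */
        return MCA_PML_CALL(send(buf, count, dtype, dst, tag,
                                 MCA_PML_BASE_SEND_STANDARD, comm));
    }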

But I agree with Andrew: if this is init-level stuff that does not
need to be exchanged on a per-communicator basis, then the modex is
probably your best bet. Avoid using the RML directly if possible.
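
In case it helps, a rough sketch of the modex usage is below. The
function names (mca_pml_base_modex_send / _recv), the headers, and
the payload struct are from my recollection of the 1.2 branch and
have moved around between releases, so treat them as placeholders:

    /* Sketch: publish a small blob once at init time and read each
     * peer's copy later.  Names and signatures are approximate. */
    #include "ompi/mca/pml/base/pml_base_module_exchange.h"
    #include "ompi/proc/proc.h"

    struct my_blob { int whatever; };   /* hypothetical payload */

    /* during component init, before the modex is gathered */
    static int publish_my_blob(mca_base_component_t *comp)
    {
        struct my_blob b = { 42 };
        return mca_pml_base_modex_send(comp, &b, sizeof(b));
    }

    /* later, once the peer procs are known */
    static int read_peer_blob(mca_base_component_t *comp,
                              ompi_proc_t *peer)
    {
        struct my_blob *b;
        size_t size;
        return mca_pml_base_modex_recv(comp, peer, (void **) &b, &size);
    }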

> wassen_at_elrohir:~/src/mpi_test$ mpirun -np 4 -host
> pd-01,pd-02,pd-03,pd-04 -mca btl openib,self sendrecv
> [3] received data[0]=1
> [1] received data[0]=1
> [1] received data[1]=2
> [1] received data[2]=3
> [1] received data[3]=4
> [1] received data[4]=5
> [1] received data[5]=6
> [1] received data[6]=7
> [1] received data[7]=8
> [2] received data[0]=1
> [2] received data[1]=2
> [2] received data[2]=3
> [2] received data[3]=4
> [2] received data[4]=5
> [2] received data[5]=6
> [2] received data[6]=7
> [2] received data[7]=8
> [3] received data[1]=2
> [3] received data[2]=3
> [3] received data[3]=4
> [3] received data[4]=5
> [3] received data[5]=6
> [3] received data[6]=7
> [3] received data[7]=8
> {20 sec. later...}
> mpirun: killing job...
>
> When I execute the same program with "-mca btl udapl,self" or "-mca
> btl tcp,self", it runs fine and terminates in less than a second.
> Tried with Open MPI 1.2.1 and 1.2.3. The test program runs fine
> with several other MPIs (intel-mpi and mvapich with InfiniBand,
> mp-mpich with SCI).
>
> I hope my information suffices to reproduce the problem.
>
> Best regards,
> Georg Wassen.
>
> ps. I know that I could transmit the array in one MPI_Send, but
> this is extracted from my real problem.
>
>
>
> --------------------1st node-----------------------
> wassen_at_pd-01:~$ /opt/infiniband/bin/ibv_devinfo
> hca_id: mthca0
> fw_ver: 1.2.0
> node_guid: 0002:c902:0020:b680
> sys_image_guid: 0002:c902:0020:b683
> vendor_id: 0x02c9
> vendor_part_id: 25204
> hw_ver: 0xA0
> board_id: MT_0230000001
> phys_port_cnt: 1
> port: 1
> state: PORT_ACTIVE (4)
> max_mtu: 2048 (4)
> active_mtu: 2048 (4)
> sm_lid: 1
> port_lid: 1
> port_lmc: 0x00
>
> ---------------------------------------------------------
> wassen_at_pd-01:~$ /sbin/ifconfig
> ...
> ib0     Link encap:UNSPEC  HWaddr 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
>         inet addr:192.168.0.11  Bcast:192.168.0.255  Mask:255.255.255.0
>         inet6 addr: fe80::202:c902:20:b681/64 Scope:Link
>         UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
>         RX packets:260 errors:0 dropped:0 overruns:0 frame:0
>         TX packets:331 errors:0 dropped:2 overruns:0 carrier:0
>         collisions:0 txqueuelen:128
>         RX bytes:14356 (14.0 KiB)  TX bytes:24960 (24.3 KiB)
> -------------------------------------------------------
> #include "mpi.h"
> #include <stdio.h>
>
> #define NUM 16
>
> int main(int argc, char **argv) {
> int myrank, count;
> MPI_Status status;
>
>
> int data[NUM] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
> int i, j;
>
> MPI_Init(&argc, &argv);
> MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
> MPI_Comm_size(MPI_COMM_WORLD, &count);
>
> if (myrank == 0) {
> for (i=1; i<count; i++) {
> for (j=0; j<NUM; j++) {
> MPI_Send(&data[j], 1, MPI_INT, i, 99, MPI_COMM_WORLD);
> }
> }
> } else {
> for (j=0; j<NUM; j++) {
> MPI_Recv(&data[j], 1, MPI_INT, 0, 99, MPI_COMM_WORLD, &status);
> printf("[%d] received data[%d]=%d\n", myrank, j, data[j]);
> }
> }
>
> MPI_Finalize();
> }
> <config.log.gz>
> <ompi_info_all.txt.gz>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Jeff Squyres
Cisco Systems