Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI-1.7.3 - cuda support
From: KESTENER Pierre (pierre.kestener_at_[hidden])
Date: 2013-10-30 17:34:55


Thanks for your help, it is working now; I hadn't noticed that limitation.

Best regards,

Pierre Kestener.

________________________________
From: users [users-bounces_at_[hidden]] on behalf of Rolf vandeVaart [rvandevaart_at_[hidden]]
Sent: Wednesday, October 30, 2013 17:26
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI-1.7.3 - cuda support

The CUDA-aware support is only available when running with the verbs interface to InfiniBand. It does not work with the PSM interface, which is what your installation is using.
To verify this, you need to disable PSM. There are several ways to do that, but try running like this:

mpirun --mca pml ob1 ...

This will force the use of the verbs support layer (openib) with the CUDA-aware support.
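
As an illustration of one of the other ways mentioned above, the same MCA parameter can also be set through the environment, since Open MPI reads MCA parameters from variables named OMPI_MCA_<name>. The sketch below is not from the original message: the binary name is taken from the error output quoted later in the thread, and the process count and lack of program arguments are placeholders.

```shell
# Equivalent to passing "--mca pml ob1" on the mpirun command line:
# Open MPI picks up MCA parameters from OMPI_MCA_<name> environment variables.
export OMPI_MCA_pml=ob1

# Hypothetical launch line; binary name, path and process count are placeholders.
launch_cmd="mpirun -np 2 ./jacobi_cuda_aware_mpi"

# Only launch if mpirun and the binary are actually present on this machine;
# otherwise just print what would have been run.
if command -v mpirun >/dev/null 2>&1 && [ -x ./jacobi_cuda_aware_mpi ]; then
    $launch_cmd
else
    echo "would run: OMPI_MCA_pml=$OMPI_MCA_pml $launch_cmd"
fi
```

Setting the variable in a job script can be convenient when the mpirun line itself is generated by a batch system, but the command-line form shown above is equivalent.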

From: users [mailto:users-bounces_at_[hidden]] On Behalf Of KESTENER Pierre
Sent: Wednesday, October 30, 2013 12:07 PM
To: users_at_[hidden]
Subject: Re: [OMPI users] OpenMPI-1.7.3 - cuda support

Dear Rolf,

Thanks for looking into this.
Here is the complete backtrace for execution using 2 GPUs on the same node:

(cuda-gdb) bt
#0 0x00007ffff711d885 in raise () from /lib64/libc.so.6
#1 0x00007ffff711f065 in abort () from /lib64/libc.so.6
#2 0x00007ffff0387b8d in psmi_errhandler_psm (ep=<value optimized out>,
    err=PSM_INTERNAL_ERR, error_string=<value optimized out>,
    token=<value optimized out>) at psm_error.c:76
#3 0x00007ffff0387df1 in psmi_handle_error (ep=0xfffffffffffffffe,
    error=PSM_INTERNAL_ERR, buf=<value optimized out>) at psm_error.c:154
#4 0x00007ffff0382f6a in psmi_am_mq_handler_rtsmatch (toki=0x7fffffffc6a0,
    args=0x7fffed0461d0, narg=<value optimized out>,
    buf=<value optimized out>, len=<value optimized out>) at ptl.c:200
#5 0x00007ffff037a832 in process_packet (ptl=0x737818, pkt=0x7fffed0461c0,
    isreq=<value optimized out>) at am_reqrep_shmem.c:2164
#6 0x00007ffff037d90f in amsh_poll_internal_inner (ptl=0x737818, replyonly=0)
    at am_reqrep_shmem.c:1756
#7 amsh_poll (ptl=0x737818, replyonly=0) at am_reqrep_shmem.c:1810
#8 0x00007ffff03a0329 in __psmi_poll_internal (ep=0x737538,
    poll_amsh=<value optimized out>) at psm.c:465
#9 0x00007ffff039f0af in psmi_mq_wait_inner (ireq=0x7fffffffc848)
    at psm_mq.c:299
#10 psmi_mq_wait_internal (ireq=0x7fffffffc848) at psm_mq.c:334
#11 0x00007ffff037db21 in amsh_mq_send_inner (ptl=0x737818,
    mq=<value optimized out>, epaddr=0x6eb418, flags=<value optimized out>,
    tag=844424930131968, ubuf=0x1308350000, len=32768)
    at am_reqrep_shmem.c:2339
#12 amsh_mq_send (ptl=0x737818, mq=<value optimized out>, epaddr=0x6eb418,
    flags=<value optimized out>, tag=844424930131968, ubuf=0x1308350000,
    len=32768) at am_reqrep_shmem.c:2387
#13 0x00007ffff039ed71 in __psm_mq_send (mq=<value optimized out>,
    dest=<value optimized out>, flags=<value optimized out>,
    stag=<value optimized out>, buf=<value optimized out>,
    len=<value optimized out>) at psm_mq.c:413
#14 0x00007ffff05c4ea8 in ompi_mtl_psm_send ()
   from /gpfslocal/pub/openmpi/1.7.3/lib/openmpi/mca_mtl_psm.so
#15 0x00007ffff1eeddea in mca_pml_cm_send ()
   from /gpfslocal/pub/openmpi/1.7.3/lib/openmpi/mca_pml_cm.so
#16 0x00007ffff79253da in PMPI_Sendrecv ()
   from /gpfslocal/pub/openmpi/1.7.3/lib/libmpi.so.1
#17 0x00000000004045ef in ExchangeHalos (cartComm=0x715460,
    devSend=0x1308350000, hostSend=0x7b8710, hostRecv=0x7c0720,
    devRecv=0x1308358000, neighbor=1, elemCount=4096) at CUDA_Aware_MPI.c:70
#18 0x00000000004033d8 in TransferAllHalos (cartComm=0x715460,
    domSize=0x7fffffffcd80, topIndex=0x7fffffffcd60, neighbors=0x7fffffffcd90,
    copyStream=0xaa4450, devBlocks=0x7fffffffcd30,
    devSideEdges=0x7fffffffcd20, devHaloLines=0x7fffffffcd10,
    hostSendLines=0x7fffffffcd00, hostRecvLines=0x7fffffffccf0) at Host.c:400
#19 0x000000000040363c in RunJacobi (cartComm=0x715460, rank=0, size=2,
    domSize=0x7fffffffcd80, topIndex=0x7fffffffcd60, neighbors=0x7fffffffcd90,
    useFastSwap=0, devBlocks=0x7fffffffcd30, devSideEdges=0x7fffffffcd20,
    devHaloLines=0x7fffffffcd10, hostSendLines=0x7fffffffcd00,
    hostRecvLines=0x7fffffffccf0, devResidue=0x1310480000,
    copyStream=0xaa4450, iterations=0x7fffffffcd44,
    avgTransferTime=0x7fffffffcd48) at Host.c:466
#20 0x0000000000401ccb in main (argc=4, argv=0x7fffffffcea8) at Jacobi.c:60
Pierre.

________________________________
From: KESTENER Pierre
Sent: Wednesday, October 30, 2013 16:34
To: users_at_[hidden]
Cc: KESTENER Pierre
Subject: OpenMPI-1.7.3 - cuda support

Hello,

I'm having problems running a simple CUDA-aware MPI application, the one found at
https://github.com/parallel-forall/code-samples/tree/master/posts/cuda-aware-mpi-example

I changed the symbol ENV_LOCAL_RANK to OMPI_COMM_WORLD_LOCAL_RANK.
My cluster has 2 K20m GPUs per node, with the QLogic IB stack.

The normal CUDA/MPI application works fine,
but the CUDA-aware MPI app crashes when using 2 MPI processes over the 2 GPUs of the same node.
The error message is:
    Assertion failure at ptl.c:200: nbytes == msglen
I can send the complete backtrace from cuda-gdb if needed.

The same app, when run on 2 GPUs on 2 different nodes, gives a different error:
    jacobi_cuda_aware_mpi:28280 terminated with signal 11 at PC=2aae9d7c9f78 SP=7fffc06c21f8. Backtrace:
    /gpfslocal/pub/local/lib64/libinfinipath.so.4(+0x8f78)[0x2aae9d7c9f78]

Can someone give me hints on where to look to track down this problem?
Thank you.

Pierre Kestener.
