Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] 1.7.5rc1, error "COLL-ML ml_discover_hierarchy exited with error."
From: Rolf vandeVaart (rvandevaart_at_[hidden])
Date: 2014-03-03 19:41:51


Something is going wrong in the ml collective component, so if you disable it, things work.
I just reconfigured without any CUDA-aware support and I see the same failure, so it has nothing to do with CUDA.
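
For readers who want to apply the workaround outside a single command line, here is a minimal sketch of three common ways to exclude the ml component at run time. It assumes a standard Open MPI installation with `mpirun` on the PATH; the helper name `run_without_ml` and the variable `MCA_PARAMS_LINE` are illustrative only, and `osu_alltoall` is simply the benchmark used in this thread.

```shell
# Sketch: three ways to exclude the "ml" coll component at run time.

# 1. Per invocation, on the mpirun command line.
#    The caret is quoted so the shell passes it through unchanged.
run_without_ml() {
    mpirun --mca coll '^ml' "$@"
}

# 2. Per shell session, through the environment.
#    Open MPI reads MCA parameters from variables named OMPI_MCA_<param>.
export OMPI_MCA_coll='^ml'

# 3. Per user, as a line in $HOME/.openmpi/mca-params.conf.
#    The line is only printed here, so the sketch has no side effects.
MCA_PARAMS_LINE='coll = ^ml'
echo "$MCA_PARAMS_LINE"

# Launch only if both Open MPI and the benchmark are actually installed.
if command -v mpirun >/dev/null 2>&1 && command -v osu_alltoall >/dev/null 2>&1; then
    run_without_ml -np 4 osu_alltoall
fi
```

Any one of the three is enough on its own. A value given on the command line takes precedence over the environment variable and the parameter file, so the per-invocation form is the safest choice for a one-off test.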

It looks like Jeff Squyres just filed a ticket for it:

https://svn.open-mpi.org/trac/ompi/ticket/4331

>-----Original Message-----
>From: users [mailto:users-bounces_at_[hidden]] On Behalf Of Filippo Spiga
>Sent: Monday, March 03, 2014 7:32 PM
>To: Open MPI Users
>Subject: Re: [OMPI users] 1.7.5rc1, error "COLL-ML ml_discover_hierarchy
>exited with error."
>
>Dear Rolf,
>
>your suggestion works!
>
>$ mpirun -np 4 --map-by ppr:1:socket -bind-to core --mca coll ^ml osu_alltoall
># OSU MPI All-to-All Personalized Exchange Latency Test v4.2
># Size Avg Latency(us)
>1 8.02
>2 2.96
>4 2.91
>8 2.91
>16 2.96
>32 3.07
>64 3.25
>128 3.74
>256 3.85
>512 4.11
>1024 4.79
>2048 5.91
>4096 15.84
>8192 24.88
>16384 35.35
>32768 56.20
>65536 66.88
>131072 114.89
>262144 209.36
>524288 396.12
>1048576 765.65
>
>
>Can you clarify exactly where the problem comes from?
>
>Regards,
>Filippo
>
>
>On Mar 4, 2014, at 12:17 AM, Rolf vandeVaart <rvandevaart_at_[hidden]>
>wrote:
>> Can you try running with --mca coll ^ml and see if things work?
>>
>> Rolf
>>
>>> -----Original Message-----
>>> From: users [mailto:users-bounces_at_[hidden]] On Behalf Of Filippo
>>> Spiga
>>> Sent: Monday, March 03, 2014 7:14 PM
>>> To: Open MPI Users
>>> Subject: [OMPI users] 1.7.5rc1, error "COLL-ML ml_discover_hierarchy
>>> exited with error."
>>>
>>> Dear Open MPI developers,
>>>
>>> I hit an unexpected error running the OSU osu_alltoall benchmark with
>>> Open MPI 1.7.5rc1. Here is the error:
>>>
>>> $ mpirun -np 4 --map-by ppr:1:socket -bind-to core osu_alltoall
>>> In bcol_comm_query hmca_bcol_basesmuma_allocate_sm_ctl_memory failed
>>> In bcol_comm_query hmca_bcol_basesmuma_allocate_sm_ctl_memory failed
>>> [tesla50][[6927,1],1][../../../../../ompi/mca/coll/ml/coll_ml_module.c:2996:mca_coll_ml_comm_query] COLL-ML ml_discover_hierarchy exited with error.
>>>
>>> [tesla50:42200] In base_bcol_masesmuma_setup_library_buffers and mpool was not successfully setup!
>>> [tesla50][[6927,1],0][../../../../../ompi/mca/coll/ml/coll_ml_module.c:2996:mca_coll_ml_comm_query] COLL-ML ml_discover_hierarchy exited with error.
>>>
>>> [tesla50:42201] In base_bcol_masesmuma_setup_library_buffers and mpool was not successfully setup!
>>> # OSU MPI All-to-All Personalized Exchange Latency Test v4.2
>>> # Size Avg Latency(us)
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 3 with PID 4508 on node tesla51 exited on signal 11 (Segmentation fault).
>>> --------------------------------------------------------------------------
>>> 2 total processes killed (some possibly by mpirun during cleanup)
>>>
>>> Any idea where this comes from?
>>>
>>> I compiled Open MPI using Intel 12.1, the latest Mellanox stack and CUDA 6.0RC.
>>> Attached are the outputs from configure, make and run. The configure
>>> was:
>>>
>>> export MXM_DIR=/opt/mellanox/mxm
>>> export KNEM_DIR=$(find /opt -maxdepth 1 -type d -name "knem*" -print0)
>>> export FCA_DIR=/opt/mellanox/fca
>>> export HCOLL_DIR=/opt/mellanox/hcoll
>>>
>>> ../configure CC=icc CXX=icpc F77=ifort FC=ifort \
>>>   FFLAGS="-xSSE4.2 -axAVX -ip -O3 -fno-fnalias" \
>>>   FCFLAGS="-xSSE4.2 -axAVX -ip -O3 -fno-fnalias" \
>>>   --prefix=<...> --enable-mpirun-prefix-by-default \
>>>   --with-fca=$FCA_DIR --with-mxm=$MXM_DIR --with-knem=$KNEM_DIR \
>>>   --with-cuda=$CUDA_INSTALL_PATH --enable-mpi-thread-multiple \
>>>   --with-hwloc=internal --with-verbs 2>&1 | tee config.out
>>>
>>>
>>> Thanks in advance,
>>> Regards
>>>
>>> Filippo
>>>
>>> --
>>> Mr. Filippo SPIGA, M.Sc.
>>> http://www.linkedin.com/in/filippospiga ~ skype: filippo.spiga
>>>
>>> <Nobody will drive us out of Cantor's paradise.> ~ David Hilbert
>>>
>>> *****
>>> Disclaimer: "Please note this message and any attachments are
>>> CONFIDENTIAL and may be privileged or otherwise protected from disclosure.
>>> The contents are not to be disclosed to anyone other than the addressee.
>>> Unauthorized recipients are requested to preserve this
>>> confidentiality and to advise the sender immediately of any error in
>>> transmission."
>>
>> -----------------------------------------------------------------------------------
>> This email message is for the sole use of the intended
>> recipient(s) and may contain confidential information. Any
>> unauthorized review, use, disclosure or distribution is prohibited.
>> If you are not the intended recipient, please contact the sender by
>> reply email and destroy all copies of the original message.
>> -----------------------------------------------------------------------------------
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>