Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] application with mxm hangs on startup
From: Pavel Mezentsev (pavel.mezentsev_at_[hidden])
Date: 2012-08-24 10:13:57


I have the latest version installed: 1.1.1227.
The version that originally came with the latest Mellanox OFED is 1.0.601,
and Open MPI fails to build against that version.

I have made some progress: I managed to launch the application with Open MPI
1.6.0.
However, I intermittently get errors, and they occur more often as the
number of processes grows.
Here is what I get:
MXM was unable to create an endpoint. Please make sure that the network
link is
active on the node and the hardware is functioning.

  Error: Shared memory error

--------------------------------------------------------------------------
[1345812628.431295] [b23:8172 :0] shm_queue.c:285 MXM ERROR Slave proccess
cannot obtain shared memory segment id.
[1345812628.431322] [b23:8172 :0] shm_ep.c:133 MXM ERROR Unable to
attach endpoint
[b23:08172] [[12734,1],10] selected pml ob1, but peer [[12734,1],0] on b23
selected pml cm
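
When failures like these are intermittent, it can help to see which hosts and which MXM source locations report errors most often. The following is a minimal sketch, not part of MXM or Open MPI: `mxm_error_summary` is a hypothetical helper name, and it assumes each log record sits on a single line in the format shown above (the records here were wrapped by the mail client).

```shell
# Hypothetical helper: summarise "MXM ERROR" records per host and source location.
# Assumes one record per line, e.g.
#   [timestamp] [host:pid :0] file.c:line MXM ERROR message
# Reads the files given as arguments, or standard input if none are given.
mxm_error_summary() {
    awk '
        {
            # Find the "MXM ERROR" marker; the field before it is file.c:line.
            for (i = 2; i < NF; i++) {
                if ($i == "MXM" && $(i + 1) == "ERROR") {
                    host = $2
                    sub(/^\[/, "", host)   # drop the leading bracket
                    sub(/:.*/, "", host)   # drop the pid and anything after it
                    count[host " " $(i - 1)]++
                    break
                }
            }
        }
        END {
            for (key in count) print count[key], key
        }
    ' "$@" | sort -k1,1nr -k2
}
```

Saving the job output to a file and running `mxm_error_summary` on it prints one line per host and source location, most frequent first. Lines without an `MXM ERROR` marker, such as the pml mismatch message, are ignored.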

Apart from the errors, the performance with MXM is pretty disappointing.
Out of 6 collectives (Allreduce, Reduce, Barrier, Bcast, Allgather, Allgatherv)
only Reduce shows better performance on large messages. At what scale
should the situation change? By default MXM is only used starting at 128
processes, but even at 256 the results with MXM are worse than with the usual
openib,sm,self. Am I missing something? Maybe I need to tune something?
There is very little information on the subject apart from the
README on the Mellanox web site.
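
For a like-for-like comparison, the two transport selections can be written out explicitly so that every rank is asked for the same PML. This is a minimal sketch under stated assumptions: `imb_cmd` is a hypothetical helper that only prints a command line, the `cm` and `ob1` PML names are taken from the error message above, and pairing `--mca pml cm` with `--mca mtl mxm` is assumed to be the appropriate selection for this Open MPI build. Hostfile, binding, and FCA options are omitted.

```shell
# Hypothetical helper: print (do not run) an mpirun command for one transport.
# usage: imb_cmd mxm|openib NP
imb_cmd() {
    transport=$1
    np=$2
    case "$transport" in
        # MXM path: cm PML with the mxm MTL, threshold lowered to 0 processes.
        mxm)    sel="--mca pml cm --mca mtl mxm --mca mtl_mxm_np 0" ;;
        # Baseline path: ob1 PML with the usual openib,sm,self BTLs.
        openib) sel="--mca pml ob1 --mca btl openib,sm,self" ;;
        *)      echo "unknown transport: $transport" >&2; return 1 ;;
    esac
    echo "mpirun -np $np $sel ./IMB-MPI1 -npmin $np Allreduce Reduce Barrier Bcast Allgather Allgatherv"
}
```

Calling `imb_cmd mxm 256` and `imb_cmd openib 256` gives two command lines that differ only in the transport selection, which keeps the benchmark comparison limited to that one variable.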

I've run the tests on two configurations:
16 nodes with Intel SB processors and FDR InfiniBand
10 nodes with AMD Interlagos processors and QDR InfiniBand.

Regards, Pavel Mezentsev.

2012/8/24 Mike Dubman <mike.ompi_at_[hidden]>

>
> Hi,
> Could you please download latest mxm from
> http://www.mellanox.com/products/mxm/ and retry?
> The mxm version which comes with OFED 1.5.3 was tested with OMPI 1.6.0.
>
> Regards
> M
>
> On Wed, Aug 22, 2012 at 2:22 PM, Pavel Mezentsev <
> pavel.mezentsev_at_[hidden]> wrote:
>
>> I've tried to launch the application on nodes with QDR Infiniband. The
>> first attempt with 2 processes worked, but the following was printed to the
>> output:
>> [1345633953.436676] [b01:2523 :0] mpool.c:99 MXM ERROR Invalid
>> mempool parameter(s)
>> [1345633953.436676] [b01:2522 :0] mpool.c:99 MXM ERROR Invalid
>> mempool parameter(s)
>> --------------------------------------------------------------------------
>> MXM was unable to create an endpoint. Please make sure that the network
>> link is
>> active on the node and the hardware is functioning.
>>
>> Error: Invalid parameter
>>
>> --------------------------------------------------------------------------
>>
>> The results from this launch didn't differ from the results of the launch
>> without MXM.
>>
>> Then I've tried to launch it with 256 processes, but got the same message
>> from each process and then the application crashed. After that I'm
>> observing the same behavior as with FDR: application hangs in
>> the beginning.
>>
>> Best regards, Pavel Mezentsev.
>>
>>
>> 2012/8/22 Pavel Mezentsev <pavel.mezentsev_at_[hidden]>
>>
>>> Hello!
>>>
>>> I've built Open MPI 1.6.1rc3 with MXM support, but when I try to
>>> launch an application using this MTL, it hangs and I can't figure out why.
>>>
>>> If I launch it with np below 128 then everything works fine since mxm
>>> isn't used. I've tried setting the threshold to 0 and launching 2 processes
>>> with the same result: hangs on startup.
>>> What could be causing this problem?
>>>
>>> Here is the command I execute:
>>> /opt/openmpi/1.6.1/mxm-test/bin/mpirun \
>>> -np $NP \
>>> -hostfile hosts_fdr2 \
>>> --mca mtl mxm \
>>> --mca btl ^tcp \
>>> --mca mtl_mxm_np 0 \
>>> -x OMP_NUM_THREADS=$NT \
>>> -x LD_LIBRARY_PATH \
>>> --bind-to-core \
>>> -npernode 16 \
>>> --mca coll_fca_np 0 -mca coll_fca_enable 0 \
>>> ./IMB-MPI1 -npmin $NP Allreduce Reduce Barrier Bcast \
>>> Allgather Allgatherv
>>>
>>> I'm performing the tests on nodes with Intel SB processors and FDR.
>>> Openmpi was configured with the following parameters:
>>> CC=icc CXX=icpc F77=ifort FC=ifort ./configure \
>>> --prefix=/opt/openmpi/1.6.1rc3/mxm-test --with-mxm=/opt/mellanox/mxm \
>>> --with-fca=/opt/mellanox/fca --with-knem=/usr/share/knem
>>> I'm using the latest ofed from mellanox: 1.5.3-3.1.0 on centos 6.1 with
>>> default kernel: 2.6.32-131.0.15.
>>> The compilation with default mxm (1.0.601) failed so I installed the
>>> latest version from mellanox: 1.1.1227
>>>
>>> Best regards, Pavel Mezentsev.
>>>
>>
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>