Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Segmentation fault error with IB support when number of processes is greater 129
From: Svyatoslav Korneev (svyatoslav.korneev_at_[hidden])
Date: 2013-03-12 11:53:27


Thank you for the reply.

So my segv disappeared after I switched off udapl. Thank you. I have
more questions. I configured the second interface, ib1, and now I get
the following warning:

WARNING: There are more than one active ports on host
'compute-0-0.local', but the
default subnet GID prefix was detected on more than one of these
ports. If these ports are connected to different physical IB
networks, this configuration will fail in Open MPI. This version of
Open MPI requires that every physically separate IB subnet that is
used between connected MPI processes must have different subnet ID
values.

Please see this FAQ entry for more details:

  http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_default_gid_prefix to 0

I've read this link and, as I understand it, this is not an issue for
me, because my host has two ports and both ports are in the same subnet.
But the problem with the limits is still unclear to me. Editing
/etc/security/limits.conf does not help. So my next step is:
Mellanox has advised the Open MPI community to increase the
log_num_mtt value. But where should I set this constant?
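
For reference, here is a rough sketch of the arithmetic I understand the FAQ entry to describe: registerable memory is roughly 2^log_num_mtt * 2^log_mtts_per_seg * page size. The values log_num_mtt=20 and log_mtts_per_seg=3 and the 4096-byte page size are assumptions, chosen only because they reproduce the 32768 MiB reported in the warning; they were not read from my nodes. The helper names are made up for illustration, and the "twice the physical memory" target is my reading of the FAQ recommendation.

```python
MIB = 1024 * 1024


def max_registerable_bytes(log_num_mtt, log_mtts_per_seg, page_size=4096):
    """Registerable memory implied by the two module parameters (assumed formula)."""
    return (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size


def min_log_num_mtt(target_bytes, log_mtts_per_seg, page_size=4096):
    """Smallest log_num_mtt whose registerable memory reaches target_bytes."""
    n = 0
    while max_registerable_bytes(n, log_mtts_per_seg, page_size) < target_bytes:
        n += 1
    return n


# Assumed current settings; they reproduce the "Registerable memory" line.
print(max_registerable_bytes(20, 3) // MIB, "MiB registerable now")

# "Total memory: 262125 MiB" from the warning, doubled as the target.
total_bytes = 262125 * MIB
print("log_num_mtt needed:", min_log_num_mtt(2 * total_bytes, 3))
```

If these assumptions hold, the values would be module parameters of the mlx4_core driver, which are usually set with an options line in a file under /etc/modprobe.d/ and take effect after the driver is reloaded. The exact file name and location depend on the distribution, so please treat that as a guess to verify.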

Regards,
Svyatoslav

On Mon, Mar 11, 2013 at 5:49 PM, Jeff Squyres (jsquyres)
<jsquyres_at_[hidden]> wrote:
> If you have multiple active ports, you might as well utilize them. After all, you paid for them!
>
> You should disable udapl; it's really not useful unless you're going to run Intel MPI.
>
> You do need to fix the registration memory issue - see the FAQ for that.
>
> So yes, none of these may be contributing to the segv. But they should still be cleared up before chasing the segv, if for no other reason than to decrease the number of variables and side effects that you're chasing.
>
> Also, please upgrade to the latest version of Open MPI.
>
> Sent from my phone. No type good.
>
> On Mar 11, 2013, at 10:29 AM, "Svyatoslav Korneev" <svyatoslav.korneev_at_[hidden]> wrote:
>
>> Hello,
>>
>> Thank you for your answer.
>>
>> But below 129 processes the code runs fine even with these warnings.
>>
>> I have the following warnings:
>>
>> 1. WARNING: There are more than one active ports on host
>> 'compute-0-0.local'
>>
>> 2. WARNING: Failed to open "OpenIB-cma-1" [DAT_INVALID_ADDRESS:].
>>
>> 3. open_hca: getaddr_netdev ERROR: Success. Is
>> ib1 configured?
>>
>> 4. open_hca: device mthca0 not found
>>
>> 5. library load failure: libdaplscm.so.2: cannot open shared object
>> file: No such file or directory
>>
>> 6. WARNING: It appears that your OpenFabrics subsystem is configured to only
>> allow registering part of your physical memory.
>>
>>
>> It looks messy, but these warnings are not critical, or am I wrong?
>>
>> Warnings 1-3 point to the unconfigured second port; should I configure
>> a network interface for it? I tried to fix warning 6 by following the
>> manual, but it did not work. What about warning 5: what is this library
>> libdaplscm.so.2?
>>
>> What do you think, would it be better to install MLNX_OFED on my
>> nodes, since I have a NIC from this vendor?
>>
>> Regards,
>> Svyatoslav
>>
>> On Mon, Mar 11, 2013 at 4:04 PM, Jeff Squyres (jsquyres)
>> <jsquyres_at_[hidden]> wrote:
>>> Did you check the FAQ entries listed on all the warning messages that you're getting? You should probably fix those first.
>>>
>>> Sent from my phone. No type good.
>>>
>>> On Mar 10, 2013, at 4:30 AM, "Svyatoslav Korneev" <svyatoslav.korneev_at_[hidden]> wrote:
>>>
>>>> Dear Community,
>>>>
>>>> I have 4 compute nodes and a front-end. The compute nodes are connected
>>>> via IB and Ethernet, and the front-end has Ethernet only. Each compute
>>>> node has 4 CPUs on board, each CPU has 16 cores, so the total number of
>>>> cores per node is 64. The IB network controller is a Mellanox MT26428,
>>>> and the IB switch is a Qlogic 12000. I installed Rocks Cluster Linux 6.1
>>>> on my cluster, and this system ships with Open MPI out of the box.
>>>> ompi_info reports Open MPI version 1.6.2.
>>>>
>>>> The good news is that IB and Open MPI with IB support work out of the
>>>> box, but I face a really strange bug. If I try to run HelloWorld on
>>>> more than 129 processes with IB support, it gives me a segmentation
>>>> fault. I get this error even if I try to start it on one node. Below
>>>> 129 processes everything works fine, on 1 node or on 4 nodes, except
>>>> for warning messages (listed below). Without IB support everything
>>>> works fine with an arbitrary number of processes.
>>>>
>>>> Does anybody have an idea regarding my issue?
>>>>
>>>> Thank you.
>>>>
>>>>
>>>> Warning messages for HelloWorld, running two processes (one process
>>>> per node):
>>>>
>>>> mpirun --mca btl_openib_verbose 1 --mca btl ^tcp -hostfile
>>>> machinefile -n 2 a.out
>>>> [compute-0-0.local][[43740,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query]
>>>> Querying INI files for vendor 0x02c9, part ID 26428
>>>> [compute-0-0.local][[43740,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query]
>>>> Found corresponding INI values: Mellanox Hermon
>>>> [compute-0-0.local][[43740,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query]
>>>> Querying INI files for vendor 0x0000, part ID 0
>>>> [compute-0-0.local][[43740,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query]
>>>> Found corresponding INI values: default
>>>> --------------------------------------------------------------------------
>>>> WARNING: There are more than one active ports on host
>>>> 'compute-0-0.local', but the
>>>> default subnet GID prefix was detected on more than one of these
>>>> ports. If these ports are connected to different physical IB
>>>> networks, this configuration will fail in Open MPI. This version of
>>>> Open MPI requires that every physically separate IB subnet that is
>>>> used between connected MPI processes must have different subnet ID
>>>> values.
>>>>
>>>> Please see this FAQ entry for more details:
>>>>
>>>> http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid
>>>>
>>>> NOTE: You can turn off this warning by setting the MCA parameter
>>>> btl_openib_warn_default_gid_prefix to 0.
>>>> --------------------------------------------------------------------------
>>>> [compute-0-1.local][[43740,1],1][btl_openib_ini.c:166:ompi_btl_openib_ini_query]
>>>> Querying INI files for vendor 0x02c9, part ID 26428
>>>> [compute-0-1.local][[43740,1],1][btl_openib_ini.c:185:ompi_btl_openib_ini_query]
>>>> Found corresponding INI values: Mellanox Hermon
>>>> [compute-0-1.local][[43740,1],1][btl_openib_ini.c:166:ompi_btl_openib_ini_query]
>>>> Querying INI files for vendor 0x0000, part ID 0
>>>> [compute-0-1.local][[43740,1],1][btl_openib_ini.c:185:ompi_btl_openib_ini_query]
>>>> Found corresponding INI values: default
>>>> --------------------------------------------------------------------------
>>>> WARNING: Failed to open "OpenIB-cma-1" [DAT_INVALID_ADDRESS:].
>>>> This may be a real error or it may be an invalid entry in the uDAPL
>>>> Registry which is contained in the dat.conf file. Contact your local
>>>> System Administrator to confirm the availability of the interfaces in
>>>> the dat.conf file.
>>>> --------------------------------------------------------------------------
>>>> compute-0-0.local:20229: open_hca: getaddr_netdev ERROR: Success. Is
>>>> ib1 configured?
>>>> compute-0-0.local:20229: open_hca: device mthca0 not found
>>>> compute-0-0.local:20229: open_hca: device mthca0 not found
>>>> compute-0-1.local:58701: open_hca: getaddr_netdev ERROR: Success. Is
>>>> ib1 configured?
>>>> compute-0-1.local:58701: open_hca: device mthca0 not found
>>>> compute-0-1.local:58701: open_hca: device mthca0 not found
>>>> DAT: library load failure: libdaplscm.so.2: cannot open shared object
>>>> file: No such file or directory
>>>> DAT: library load failure: libdaplscm.so.2: cannot open shared object
>>>> file: No such file or directory
>>>> DAT: library load failure: libdaplscm.so.2: cannot open shared object
>>>> file: No such file or directory
>>>> DAT: library load failure: libdaplscm.so.2: cannot open shared object
>>>> file: No such file or directory
>>>> [compute-0-0.local][[43740,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query]
>>>> Querying INI files for vendor 0x02c9, part ID 26428
>>>> [compute-0-0.local][[43740,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query]
>>>> Found corresponding INI values: Mellanox Hermon
>>>> [compute-0-1.local][[43740,1],1][btl_openib_ini.c:166:ompi_btl_openib_ini_query]
>>>> Querying INI files for vendor 0x02c9, part ID 26428
>>>> [compute-0-1.local][[43740,1],1][btl_openib_ini.c:185:ompi_btl_openib_ini_query]
>>>> Found corresponding INI values: Mellanox Hermon
>>>> --------------------------------------------------------------------------
>>>> WARNING: It appears that your OpenFabrics subsystem is configured to only
>>>> allow registering part of your physical memory. This can cause MPI jobs to
>>>> run with erratic performance, hang, and/or crash.
>>>>
>>>> This may be caused by your OpenFabrics vendor limiting the amount of
>>>> physical memory that can be registered. You should investigate the
>>>> relevant Linux kernel module parameters that control how much physical
>>>> memory can be registered, and increase them to allow registering all
>>>> physical memory on your machine.
>>>>
>>>>
>>>> See this Open MPI FAQ item for more information on these Linux kernel module
>>>> parameters:
>>>>
>>>> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>>>>
>>>> Local host: compute-0-0.local
>>>> Registerable memory: 32768 MiB
>>>> Total memory: 262125 MiB
>>>>
>>>> Your MPI job will continue, but may be behave poorly and/or hang.
>>>> --------------------------------------------------------------------------
>>>> [compute-0-0.local][[43740,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query]
>>>> Querying INI files for vendor 0x02c9, part ID 26428
>>>> [compute-0-0.local][[43740,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query]
>>>> Found corresponding INI values: Mellanox Hermon
>>>> [compute-0-1.local][[43740,1],1][btl_openib_ini.c:166:ompi_btl_openib_ini_query]
>>>> Querying INI files for vendor 0x02c9, part ID 26428
>>>> [compute-0-1.local][[43740,1],1][btl_openib_ini.c:185:ompi_btl_openib_ini_query]
>>>> Found corresponding INI values: Mellanox Hermon
>>>> Hello world from process 0 of 2
>>>> Hello world from process 1 of 2
>>>> [compute-0-0.local:20227] 1 more process has sent help message
>>>> help-mpi-btl-openib.txt / default subnet prefix
>>>> [compute-0-0.local:20227] Set MCA parameter "orte_base_help_aggregate"
>>>> to 0 to see all help / error messages
>>>> [compute-0-0.local:20227] 9 more processes have sent help message
>>>> help-mpi-btl-udapl.txt / dat_ia_open fail
>>>> [compute-0-0.local:20227] 3 more processes have sent help message
>>>> help-mpi-btl-openib.txt / reg mem limit low
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users