Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Segmentation fault error with IB support when number of processes is greater 129
From: Svyatoslav Korneev (svyatoslav.korneev_at_[hidden])
Date: 2013-03-11 10:28:07


Hello,

Thank you for your answer.

But below 129 processes the code runs well, even with these warnings.

I get the following warnings:

1. WARNING: There are more than one active ports on host
'compute-0-0.local'

2. WARNING: Failed to open "OpenIB-cma-1" [DAT_INVALID_ADDRESS:].

3. open_hca: getaddr_netdev ERROR: Success. Is
ib1 configured?

4. open_hca: device mthca0 not found

5. library load failure: libdaplscm.so.2: cannot open shared object
file: No such file or directory

6. WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory.

It looks messy, but these warnings are not critical, or am I wrong?
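In case it is relevant: warnings 2-5 all seem to come from the uDAPL BTL, which Open MPI also tries to open. Here is a sketch of a run that restricts Open MPI to the openib, sm and self BTLs and to a single IB port (mlx4_0 and port 1 are my guesses for the MT26428; ibv_devinfo would show the real device name):

```shell
# Only open the BTLs we actually need; the uDAPL/DAT warnings (2-5)
# come from the udapl BTL, which is then never loaded.
# btl_openib_if_include limits openib to one device:port, which
# should also silence warning 1 about the second active port.
mpirun --mca btl openib,sm,self \
       --mca btl_openib_if_include mlx4_0:1 \
       -hostfile machinefile -n 2 ./a.out
```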

Warnings 1-3 point to the unconfigured second port; should I configure a
network interface for it? I tried to fix warning 6 by following the FAQ,
but it did not work. And what about warning 5: what is this library
libdaplscm.so.2?
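About warning 6: if I read the FAQ entry right, the mlx4 driver can register at most 2^log_num_mtt * 2^log_mtts_per_seg * page_size bytes. A quick check (the parameter values below are my assumption about the defaults; /sys/module/mlx4_core/parameters/ shows the real ones) reproduces the 32768 MiB from the warning:

```python
# Estimate the maximum registerable memory from the mlx4_core module
# parameters (assumed defaults; check /sys/module/mlx4_core/parameters/).
log_num_mtt = 20       # assumed default: log2 of the number of MTT entries
log_mtts_per_seg = 3   # assumed default: log2 of MTTs per segment
page_size = 4096       # bytes

reg_bytes = (2 ** log_num_mtt) * (2 ** log_mtts_per_seg) * page_size
print(reg_bytes // 2 ** 20, "MiB")  # 32768 MiB, as in the warning
```

So to cover the 262125 MiB of physical memory, log_num_mtt would need to go up to about 24 (2^24 * 2^3 * 4096 bytes = 512 GiB), set via an options line for mlx4_core in /etc/modprobe.d followed by a driver reload.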

What do you think: would it be better to install MLNX_OFED on my
nodes, since my NICs are from Mellanox?

Regards,
Svyatoslav

On Mon, Mar 11, 2013 at 4:04 PM, Jeff Squyres (jsquyres)
<jsquyres_at_[hidden]> wrote:
> Did you check the FAQ entries listed in all the warning messages that you're getting? You should probably fix those first.
>
> Sent from my phone. No type good.
>
> On Mar 10, 2013, at 4:30 AM, "Svyatoslav Korneev" <svyatoslav.korneev_at_[hidden]> wrote:
>
>> Dear Community,
>>
>> I have 4 compute nodes and a front-end. The compute nodes are connected
>> via IB and Ethernet; the front-end has Ethernet only. Each compute node
>> has 4 CPUs on board, each CPU has 16 cores, so the total number of cores
>> per node is 64. The IB network controller is a Mellanox MT26428, the IB
>> switch is a QLogic 12000. I installed Rocks Cluster Linux 6.1 on my
>> cluster, and this system ships with Open MPI out of the box. ompi_info
>> reports Open MPI version 1.6.2.
>>
>> The good news is that IB and Open MPI with IB support work out of the
>> box, but I face a really strange bug. If I try to run HelloWorld on more
>> than 129 processes with IB support, it gives me a segmentation fault.
>> I get this error even if I try to start it on one node. Below 129
>> processes everything works fine, on 1 node or on 4 nodes, apart from
>> warning messages (listed below). Without IB support everything works
>> fine on an arbitrary number of processes.
>>
>> Anybody have an idea regarding my issue?
>>
>> Thank you.
>>
>>
>> Warning messages for HelloWorld, running on two processes (one process
>> per node):
>>
>> mpirun --mca btl_openib_verbose 1 --mca btl ^tcp -hostfile
>> machinefile -n 2 a.out
>> [compute-0-0.local][[43740,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query]
>> Querying INI files for vendor 0x02c9, part ID 26428
>> [compute-0-0.local][[43740,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query]
>> Found corresponding INI values: Mellanox Hermon
>> [compute-0-0.local][[43740,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query]
>> Querying INI files for vendor 0x0000, part ID 0
>> [compute-0-0.local][[43740,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query]
>> Found corresponding INI values: default
>> --------------------------------------------------------------------------
>> WARNING: There are more than one active ports on host
>> 'compute-0-0.local', but the
>> default subnet GID prefix was detected on more than one of these
>> ports. If these ports are connected to different physical IB
>> networks, this configuration will fail in Open MPI. This version of
>> Open MPI requires that every physically separate IB subnet that is
>> used between connected MPI processes must have different subnet ID
>> values.
>>
>> Please see this FAQ entry for more details:
>>
>> http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid
>>
>> NOTE: You can turn off this warning by setting the MCA parameter
>> btl_openib_warn_default_gid_prefix to 0.
>> --------------------------------------------------------------------------
>> [compute-0-1.local][[43740,1],1][btl_openib_ini.c:166:ompi_btl_openib_ini_query]
>> Querying INI files for vendor 0x02c9, part ID 26428
>> [compute-0-1.local][[43740,1],1][btl_openib_ini.c:185:ompi_btl_openib_ini_query]
>> Found corresponding INI values: Mellanox Hermon
>> [compute-0-1.local][[43740,1],1][btl_openib_ini.c:166:ompi_btl_openib_ini_query]
>> Querying INI files for vendor 0x0000, part ID 0
>> [compute-0-1.local][[43740,1],1][btl_openib_ini.c:185:ompi_btl_openib_ini_query]
>> Found corresponding INI values: default
>> --------------------------------------------------------------------------
>> WARNING: Failed to open "OpenIB-cma-1" [DAT_INVALID_ADDRESS:].
>> This may be a real error or it may be an invalid entry in the uDAPL
>> Registry which is contained in the dat.conf file. Contact your local
>> System Administrator to confirm the availability of the interfaces in
>> the dat.conf file.
>> --------------------------------------------------------------------------
>> compute-0-0.local:20229: open_hca: getaddr_netdev ERROR: Success. Is
>> ib1 configured?
>> compute-0-0.local:20229: open_hca: device mthca0 not found
>> compute-0-0.local:20229: open_hca: device mthca0 not found
>> compute-0-1.local:58701: open_hca: getaddr_netdev ERROR: Success. Is
>> ib1 configured?
>> compute-0-1.local:58701: open_hca: device mthca0 not found
>> compute-0-1.local:58701: open_hca: device mthca0 not found
>> DAT: library load failure: libdaplscm.so.2: cannot open shared object
>> file: No such file or directory
>> DAT: library load failure: libdaplscm.so.2: cannot open shared object
>> file: No such file or directory
>> DAT: library load failure: libdaplscm.so.2: cannot open shared object
>> file: No such file or directory
>> DAT: library load failure: libdaplscm.so.2: cannot open shared object
>> file: No such file or directory
>> [compute-0-0.local][[43740,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query]
>> Querying INI files for vendor 0x02c9, part ID 26428
>> [compute-0-0.local][[43740,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query]
>> Found corresponding INI values: Mellanox Hermon
>> [compute-0-1.local][[43740,1],1][btl_openib_ini.c:166:ompi_btl_openib_ini_query]
>> Querying INI files for vendor 0x02c9, part ID 26428
>> [compute-0-1.local][[43740,1],1][btl_openib_ini.c:185:ompi_btl_openib_ini_query]
>> Found corresponding INI values: Mellanox Hermon
>> --------------------------------------------------------------------------
>> WARNING: It appears that your OpenFabrics subsystem is configured to only
>> allow registering part of your physical memory. This can cause MPI jobs to
>> run with erratic performance, hang, and/or crash.
>>
>> This may be caused by your OpenFabrics vendor limiting the amount of
>> physical memory that can be registered. You should investigate the
>> relevant Linux kernel module parameters that control how much physical
>> memory can be registered, and increase them to allow registering all
>> physical memory on your machine.
>>
>>
>> See this Open MPI FAQ item for more information on these Linux kernel module
>> parameters:
>>
>> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>>
>> Local host: compute-0-0.local
>> Registerable memory: 32768 MiB
>> Total memory: 262125 MiB
>>
>> Your MPI job will continue, but may be behave poorly and/or hang.
>> --------------------------------------------------------------------------
>> [compute-0-0.local][[43740,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query]
>> Querying INI files for vendor 0x02c9, part ID 26428
>> [compute-0-0.local][[43740,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query]
>> Found corresponding INI values: Mellanox Hermon
>> [compute-0-1.local][[43740,1],1][btl_openib_ini.c:166:ompi_btl_openib_ini_query]
>> Querying INI files for vendor 0x02c9, part ID 26428
>> [compute-0-1.local][[43740,1],1][btl_openib_ini.c:185:ompi_btl_openib_ini_query]
>> Found corresponding INI values: Mellanox Hermon
>> Hello world from process 0 of 2
>> Hello world from process 1 of 2
>> [compute-0-0.local:20227] 1 more process has sent help message
>> help-mpi-btl-openib.txt / default subnet prefix
>> [compute-0-0.local:20227] Set MCA parameter "orte_base_help_aggregate"
>> to 0 to see all help / error messages
>> [compute-0-0.local:20227] 9 more processes have sent help message
>> help-mpi-btl-udapl.txt / dat_ia_open fail
>> [compute-0-0.local:20227] 3 more processes have sent help message
>> help-mpi-btl-openib.txt / reg mem limit low
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>