
Open MPI User's Mailing List Archives


From: Mostyn Lewis (Mostyn.Lewis_at_[hidden])
Date: 2007-11-06 17:10:34


Andrew,

Thanks for looking. These machines are Sun X2200s and, judging by the OUI of
the card, it's a generic Sun-branded Mellanox HCA.
This is SuSE SLES10 SP1 with the QuickSilver (SilverStorm) 4.1.0.0.1 software
release.

02:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode) (rev a0)
HCA #0: MT25208 Tavor Compat, Lion Cub, revision A0
   Primary image is valid, unknown source
   Secondary image is valid, unknown source

   Vital Product Data
     Product Name: Lion cub
     P/N: 375-3382-01
     E/C: A1
     S/N: 1388FMH-0728200266
     Freq/Power: N/A
     Checksum: Ok
     Date Code: N/A

s1471:/proc/iba/mt23108/1/port2 # cat info
Port 2 Info
    PortState: Active PhysState: LinkUp DownDefault: Polling
    LID: 0x0392 LMC: 0
    Subnet: 0xfe80000000000000 GUID: 0x0003ba000100430e
    SMLID: 0x0001 SMSL: 0 RespTimeout : 33 ms SubnetTimeout: 536 ms
    M_KEY: 0x0000000000000000 Lease: 0 s Protect: Readonly
    MTU: Active: 2048 Supported: 2048 VL Stall: 0
    LinkWidth: Active: 4x Supported: 1-4x Enabled: 1-4x
    LinkSpeed: Active: 2.5Gb Supported: 2.5Gb Enabled: 2.5Gb
    VLs: Active: 4+1 Supported: 4+1 HOQLife: 4096 ns
    Capability 0x02010048: CR CM SL Trap
    Violations: M_Key: 0 P_Key: 0 Q_Key: 0
    ErrorLimits: Overrun: 15 LocalPhys: 15 DiagCode: 0x0000
    P_Key Enforcement: In: Off Out: Off FilterRaw: In: Off Out: Off

s1471:/proc/iba/mt23108/1/port2 # cat /etc/dat.conf
#
# DAT 1.1 configuration file
#
# Each entry should have the following fields:
#
# <ia_name> <api_version> <threadsafety> <default> <lib_path> \
# <provider_version> <ia_params> <platform_params>
#
# [ICS VERSION STRING: @(#) ./config/dat.conf.64 4_1_0_0_1G [10/22/07 19:25]

# Following are examples of valid entries:
#Hca u1.1 nonthreadsafe default /lib64/libdapl.so ri.1.1 " " " "
#Hca0 u1.1 nonthreadsafe default /lib64/libdapl.so ri.1.1 "InfiniHost0 " " "
#Hca1 u1.1 nonthreadsafe default /lib64/libdapl.so ri.1.1 "InfiniHost1 " " "
#Hca0Port1 u1.1 nonthreadsafe default /lib64/libdapl.so ri.1.1 "InfiniHost0 ib1" " "
#Hca0Port2 u1.1 nonthreadsafe default /lib64/libdapl.so ri.1.1 "InfiniHost0 ib2" " "
#=======
InfiniHost0 u1.1 nonthreadsafe default /lib64/libdapl.so ri.1.1 " " " "
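
Incidentally, this is roughly how I convince myself which DAT provider that
single active entry resolves to (a quick sanity check along the lines Andrew
suggests below; the paths are the ones from this box):

   # show only the active, non-comment provider entries
   grep -v '^#' /etc/dat.conf

   # confirm which library the entry points at and which package owns it
   ls -l /lib64/libdapl.so
   rpm -qf $(readlink -f /lib64/libdapl.so)

   # check that it links against the SilverStorm/QuickSilver stack, not OFED
   ldd /lib64/libdapl.so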

QLogic now say they can reproduce it.

However, since we use the SilverStorm stack a lot, with many compilers and
for things such as IB transport for the Lustre filesystem, we try to stick to
not too many flavors of IB/MPI; but we also sometimes use OFED and QLogic's
OFED for their PathScale cards. We also throw in Scali, MVAPICH and MPICH, so
we have a real mix to handle.

Regarding the lack of mvapi support in Open MPI, that leaves just uDAPL for
stacks such as SilverStorm :-(
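
For what it's worth, when we want to be certain which transport Open MPI picks,
we pin the BTL on the mpirun command line instead of relying on auto-selection.
A rough sketch, using the same install prefix as above (the openib and PSM
variants only apply where those stacks are installed, as Andrew suggests
further down):

   # force the uDAPL BTL (plus shared memory and self)
   mpirun --prefix /tools/openmpi/1.3a1r16632_svn/infinicon/gcc64/4.1.2/udapl/suse_sles_10/x86_64/opteron \
          --mca btl self,sm,udapl -np 8 -machinefile H ./a.out

   # on an OFED-based build, the openib BTL instead:
   #   mpirun --mca btl self,sm,openib -np 8 -machinefile H ./a.out
   # or, for PathScale cards, the PSM MTL:
   #   mpirun --mca pml cm --mca mtl psm -np 8 -machinefile H ./a.out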

Thanks for looking,
Mostyn

On Tue, 6 Nov 2007, Andrew Friedley wrote:

>
>
> Mostyn Lewis wrote:
>> Andrew,
>>
>> Failure looks like:
>>
>>>> + mpirun --prefix /tools/openmpi/1.3a1r16632_svn/infinicon/gcc64/4.1.2/udapl/suse_sles_10/x86_64/opteron -np 8 -machinefile H ./a.out
>>>> Process 0 of 8 on s1470
>>>> Process 1 of 8 on s1470
>>>> Process 4 of 8 on s1469
>>>> Process 2 of 8 on s1470
>>>> Process 7 of 8 on s1469
>>>> Process 5 of 8 on s1469
>>>> Process 6 of 8 on s1469
>>>> Process 3 of 8 on s1470
>>>> 30989:a.out *->0 (f=noaffinity,0,1,2,3)
>>>> 30988:a.out *->0 (f=noaffinity,0,1,2,3)
>>>> 30990:a.out *->0 (f=noaffinity,0,1,2,3)
>>>> 30372:a.out *->0 (f=noaffinity,0,1,2,3)
>>>> 30991:a.out *->0 (f=noaffinity,0,1,2,3)
>>>> 30370:a.out *->0 (f=noaffinity,0,1,2,3)
>>>> 30369:a.out *->0 (f=noaffinity,0,1,2,3)
>>>> 30371:a.out *->0 (f=noaffinity,0,1,2,3)
>>>> get ASYNC ERROR = 6
>
> I thought this might be coming from the uDAPL BTL, but I don't see where
> in the code this could possibly be printed from.
>
>>>> [s1469:30369] *** Process received signal ***
>>>> [s1469:30369] Signal: Segmentation fault (11)
>>>> [s1469:30369] Signal code: Address not mapped (1)
>>>> [s1469:30369] Failing at address: 0x110
>>>> [s1469:30369] [ 0] /lib64/libpthread.so.0 [0x2b528ceefc10]
>>>> [s1469:30369] [ 1] /lib64/libdapl.so(dapl_llist_next_entry+0x25) [0x2b528fba5df5]
>>>> [s1469:30369] *** End of error message ***
>>
>>>> and in a /var/log/messages I see:
>>>>
>>>> Nov 5 14:46:00 s1469 sshd[30363]: Accepted publickey for mostyn from 10.173.132.37 port 36211 ssh2
>>>> Nov 5 14:46:25 s1469 kernel: TVpd: !ERROR! Async Event:TAVOR_EQE_TYPE_CQ_ERR: (CQ Access Error) cqn:641
>>>> Nov 5 14:46:25 s1469 kernel: a.out[30374]: segfault at 0000000000000110 rip 00002b528fba5df5 rsp 00000000410010b0 error 4
>>>>
>
> This makes me wonder if you're using the right DAT libraries. Take a
> look at your dat.conf (it's usually found in /etc) and make sure that it
> is configured properly for the QLogic stack and does NOT contain any
> lines for anything else (like OFED-based interfaces). Usually each
> line contains the path to a specific library to use for a particular
> interface; make sure it's the library you want. You might have to
> contact your uDAPL vendor for help with that.
>
>>>> This is reproducible.
>>>>
>>>> Is this OpenMPI or your libdapl that's doing this, you think?
>
> I can't be sure -- every uDAPL implementation seems to interpret the
> spec differently (or completely changes or leaves out some functionality),
> making it hell to provide portable uDAPL support. And currently the
> uDAPL BTL has seen little or no testing outside of Sun's and OFED's uDAPL.
>
> What kind of interface adapters are you using? Sounds like some kind of
> IB hardware; if possible I recommend using the OFED (openib BTL) or PSM
> (PSM MTL) interfaces instead of uDAPL.
>
> Andrew
>
>>>>
>>>> + ompi_info
>>>> Open MPI: 1.3a1svn11022007
>>>> Open MPI SVN revision: svn11022007
>>>> Open RTE: 1.3a1svn11022007
>>>> Open RTE SVN revision: svn11022007
>>>> OPAL: 1.3a1svn11022007
>>>> OPAL SVN revision: svn11022007
>>>> Prefix: /tools/openmpi/1.3a1r16632_svn/infinicon/gcc64/4.1.2/udapl/suse_sles_10/x86_64/opteron
>>>> Configured architecture: x86_64-unknown-linux-gnu
>>>> Configure host: s1471
>>>> Configured by: root
>>>> Configured on: Fri Nov 2 16:20:29 PDT 2007
>>>> Configure host: s1471
>>>> Built by: mostyn
>>>> Built on: Fri Nov 2 16:30:07 PDT 2007
>>>> Built host: s1471
>>>> C bindings: yes
>>>> C++ bindings: yes
>>>> Fortran77 bindings: yes (all)
>>>> Fortran90 bindings: yes
>>>> Fortran90 bindings size: small
>>>> C compiler: gcc
>>>> C compiler absolute: /usr/bin/gcc
>>>> C++ compiler: g++
>>>> C++ compiler absolute: /usr/bin/g++
>>>> Fortran77 compiler: gfortran
>>>> Fortran77 compiler abs: /usr/bin/gfortran
>>>> Fortran90 compiler: gfortran
>>>> Fortran90 compiler abs: /usr/bin/gfortran
>>>> C profiling: yes
>>>> C++ profiling: yes
>>>> Fortran77 profiling: yes
>>>> Fortran90 profiling: yes
>>>> C++ exceptions: no
>>>> Thread support: posix (mpi: no, progress: no)
>>>> Sparse Groups: no
>>>> Internal debug support: no
>>>> MPI parameter check: runtime
>>>> Memory profiling support: no
>>>> Memory debugging support: no
>>>> libltdl support: yes
>>>> Heterogeneous support: yes
>>>> mpirun default --prefix: no
>>>> MPI I/O support: yes
>>>> MCA backtrace: execinfo (MCA v1.0, API v1.0, Component v1.3)
>>>> MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.3)
>>>> MCA paffinity: linux (MCA v1.0, API v1.1, Component v1.3)
>>>> MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.3)
>>>> MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.3)
>>>> MCA timer: linux (MCA v1.0, API v1.0, Component v1.3)
>>>> MCA installdirs: env (MCA v1.0, API v1.0, Component v1.3)
>>>> MCA installdirs: config (MCA v1.0, API v1.0, Component v1.3)
>>>> MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
>>>> MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
>>>> MCA coll: basic (MCA v1.0, API v1.1, Component v1.3)
>>>> MCA coll: inter (MCA v1.0, API v1.1, Component v1.3)
>>>> MCA coll: self (MCA v1.0, API v1.1, Component v1.3)
>>>> MCA coll: sm (MCA v1.0, API v1.1, Component v1.3)
>>>> MCA coll: tuned (MCA v1.0, API v1.1, Component v1.3)
>>>> MCA io: romio (MCA v1.0, API v1.0, Component v1.3)
>>>> MCA mpool: rdma (MCA v1.0, API v1.0, Component v1.3)
>>>> MCA mpool: sm (MCA v1.0, API v1.0, Component v1.3)
>>>> MCA pml: cm (MCA v1.0, API v1.0, Component v1.3)
>>>> MCA pml: dr (MCA v1.0, API v1.0, Component v1.3)
>>>> MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.3)
>>>> MCA bml: r2 (MCA v1.0, API v1.0, Component v1.3)
>>>> MCA rcache: vma (MCA v1.0, API v1.0, Component v1.3)
>>>> MCA btl: self (MCA v1.0, API v1.0.1, Component v1.3)
>>>> MCA btl: sm (MCA v1.0, API v1.0.1, Component v1.3)
>>>> MCA btl: udapl (MCA v1.0, API v1.0, Component v1.3)
>>>> MCA topo: unity (MCA v1.0, API v1.0, Component v1.3)
>>>> MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.3)
>>>> MCA osc: rdma (MCA v1.0, API v1.0, Component v1.3)
>>>> MCA errmgr: hnp (MCA v1.0, API v1.3, Component v1.3)
>>>> MCA errmgr: orted (MCA v1.0, API v1.3, Component v1.3)
>>>> MCA errmgr: proxy (MCA v1.0, API v1.3, Component v1.3)
>>>> MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.3)
>>>> MCA gpr: replica (MCA v1.0, API v1.0, Component v1.3)
>>>> MCA grpcomm: basic (MCA v1.0, API v2.0, Component v1.3)
>>>> MCA iof: proxy (MCA v1.0, API v1.0, Component v1.3)
>>>> MCA iof: svc (MCA v1.0, API v1.0, Component v1.3)
>>>> MCA ns: proxy (MCA v1.0, API v2.0, Component v1.3)
>>>> MCA ns: replica (MCA v1.0, API v2.0, Component v1.3)
>>>> MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
>>>> MCA odls: default (MCA v1.0, API v1.3, Component v1.3)
>>>> MCA ras: dash_host (MCA v1.0, API v1.3, Component v1.3)
>>>> MCA ras: localhost (MCA v1.0, API v1.3, Component v1.3)
>>>> MCA ras: slurm (MCA v1.0, API v1.3, Component v1.3)
>>>> MCA rds: hostfile (MCA v1.0, API v1.3, Component v1.3)
>>>> MCA rds: proxy (MCA v1.0, API v1.3, Component v1.3)
>>>> MCA rmaps: round_robin (MCA v1.0, API v1.3, Component v1.3)
>>>> MCA rmgr: proxy (MCA v1.0, API v2.0, Component v1.3)
>>>> MCA rmgr: urm (MCA v1.0, API v2.0, Component v1.3)
>>>> MCA rml: oob (MCA v1.0, API v1.0, Component v1.3)
>>>> MCA routed: tree (MCA v1.0, API v1.0, Component v1.3)
>>>> MCA routed: unity (MCA v1.0, API v1.0, Component v1.3)
>>>> MCA pls: proxy (MCA v1.0, API v1.3, Component v1.3)
>>>> MCA pls: rsh (MCA v1.0, API v1.3, Component v1.3)
>>>> MCA pls: slurm (MCA v1.0, API v1.3, Component v1.3)
>>>> MCA sds: env (MCA v1.0, API v1.0, Component v1.3)
>>>> MCA sds: pipe (MCA v1.0, API v1.0, Component v1.3)
>>>> MCA sds: seed (MCA v1.0, API v1.0, Component v1.3)
>>>> MCA sds: singleton (MCA v1.0, API v1.0, Component v1.3)
>>>> MCA sds: slurm (MCA v1.0, API v1.0, Component v1.3)
>>>> MCA filem: rsh (MCA v1.0, API v1.0, Component v1.3)
>>
>>
>> Regards,
>> Mostyn
>>
>>
>> On Tue, 6 Nov 2007, Andrew Friedley wrote:
>>
>>> All thread support is disabled by default in Open MPI; the uDAPL BTL is
>>> not thread safe, nor does it make use of a threaded uDAPL implementation.
>>> For completeness, the thread support is controlled by the
>>> --enable-mpi-threads and --enable-progress-threads options to the
>>> configure script.
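
[A minimal sketch of making those defaults explicit at configure time, assuming
the 1.3-era option names given above; threads are already off by default, so
the --disable flags only make that explicit, and the prefix path is the one
used elsewhere in this thread:

    ./configure \
        --prefix=/tools/openmpi/1.3a1r16632_svn/infinicon/gcc64/4.1.2/udapl/suse_sles_10/x86_64/opteron \
        --disable-mpi-threads --disable-progress-threads
    make all install
]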
>>>
>>> The reference you're seeing to libpthread.so.0 is a side effect of the
>>> way we print backtraces when crashes occur, and it can be ignored.
>>>
>>> How exactly does your MPI program fail? Make sure you take a look at
>>> http://www.open-mpi.org/community/help/ and provide all relevant
>>> information.
>>>
>>> Andrew
>>>
>>> Mostyn Lewis wrote:
>>>> I'm trying to build a uDAPL Open MPI from last Friday's SVN, using the
>>>> QLogic/QuickSilver/SilverStorm 4.1.0.0.1 software. I can get it
>>>> built, and it works within a single machine. Over IB between two
>>>> machines it fails near the termination of a job. QLogic says they
>>>> don't have a threaded uDAPL (libpthread is in the traceback).
>>>>
>>>> How do you (can you?) configure pthreads away altogether?
>>>>
>>>> Mostyn
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>