Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] btl_openib_cpc_include rdmacm questions
From: Brock Palen (brockp_at_[hidden])
Date: 2011-04-28 10:17:23


Here is the output of running with btl_base_verbose set to 100:

mpirun --mca btl_openib_cpc_include rdmacm --mca btl_base_verbose 100 NPmpi


[nyx0665.engin.umich.edu:06399] mca: base: components_open: Looking for btl components
[nyx0666.engin.umich.edu:07210] mca: base: components_open: Looking for btl components
[nyx0665.engin.umich.edu:06399] mca: base: components_open: opening btl components
[nyx0665.engin.umich.edu:06399] mca: base: components_open: found loaded component ofud
[nyx0665.engin.umich.edu:06399] mca: base: components_open: component ofud has no register function
[nyx0665.engin.umich.edu:06399] mca: base: components_open: component ofud open function successful
[nyx0665.engin.umich.edu:06399] mca: base: components_open: found loaded component openib
[nyx0665.engin.umich.edu:06399] mca: base: components_open: component openib has no register function
[nyx0665.engin.umich.edu:06399] mca: base: components_open: component openib open function successful
[nyx0665.engin.umich.edu:06399] mca: base: components_open: found loaded component self
[nyx0665.engin.umich.edu:06399] mca: base: components_open: component self has no register function
[nyx0665.engin.umich.edu:06399] mca: base: components_open: component self open function successful
[nyx0665.engin.umich.edu:06399] mca: base: components_open: found loaded component sm
[nyx0665.engin.umich.edu:06399] mca: base: components_open: component sm has no register function
[nyx0665.engin.umich.edu:06399] mca: base: components_open: component sm open function successful
[nyx0665.engin.umich.edu:06399] mca: base: components_open: found loaded component tcp
[nyx0665.engin.umich.edu:06399] mca: base: components_open: component tcp has no register function
[nyx0665.engin.umich.edu:06399] mca: base: components_open: component tcp open function successful
[nyx0666.engin.umich.edu:07210] mca: base: components_open: opening btl components
[nyx0666.engin.umich.edu:07210] mca: base: components_open: found loaded component ofud
[nyx0666.engin.umich.edu:07210] mca: base: components_open: component ofud has no register function
[nyx0666.engin.umich.edu:07210] mca: base: components_open: component ofud open function successful
[nyx0666.engin.umich.edu:07210] mca: base: components_open: found loaded component openib
[nyx0666.engin.umich.edu:07210] mca: base: components_open: component openib has no register function
[nyx0666.engin.umich.edu:07210] mca: base: components_open: component openib open function successful
[nyx0666.engin.umich.edu:07210] mca: base: components_open: found loaded component self
[nyx0666.engin.umich.edu:07210] mca: base: components_open: component self has no register function
[nyx0666.engin.umich.edu:07210] mca: base: components_open: component self open function successful
[nyx0666.engin.umich.edu:07210] mca: base: components_open: found loaded component sm
[nyx0666.engin.umich.edu:07210] mca: base: components_open: component sm has no register function
[nyx0666.engin.umich.edu:07210] mca: base: components_open: component sm open function successful
[nyx0666.engin.umich.edu:07210] mca: base: components_open: found loaded component tcp
[nyx0666.engin.umich.edu:07210] mca: base: components_open: component tcp has no register function
[nyx0666.engin.umich.edu:07210] mca: base: components_open: component tcp open function successful
[nyx0665.engin.umich.edu:06399] select: initializing btl component ofud
[nyx0665.engin.umich.edu:06399] select: init of component ofud returned failure
[nyx0665.engin.umich.edu:06399] select: module ofud unloaded
[nyx0665.engin.umich.edu:06399] select: initializing btl component openib
[nyx0666.engin.umich.edu:07210] select: initializing btl component ofud
[nyx0666.engin.umich.edu:07210] select: init of component ofud returned failure
[nyx0666.engin.umich.edu:07210] select: module ofud unloaded
[nyx0666.engin.umich.edu:07210] select: initializing btl component openib
[nyx0665.engin.umich.edu:06399] openib BTL: rdmacm IP address not found on port
[nyx0665.engin.umich.edu:06399] openib BTL: rdmacm CPC unavailable for use on mthca0:1; skipped
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host: nyx0665.engin.umich.edu
  Local device: mthca0
  Local port: 1
  CPCs attempted: rdmacm
--------------------------------------------------------------------------
[nyx0665.engin.umich.edu:06399] select: init of component openib returned failure
[nyx0665.engin.umich.edu:06399] select: module openib unloaded
[nyx0665.engin.umich.edu:06399] select: initializing btl component self
[nyx0665.engin.umich.edu:06399] select: init of component self returned success
[nyx0665.engin.umich.edu:06399] select: initializing btl component sm
[nyx0665.engin.umich.edu:06399] select: init of component sm returned success
[nyx0665.engin.umich.edu:06399] select: initializing btl component tcp
[nyx0665.engin.umich.edu:06399] select: init of component tcp returned success
[nyx0666.engin.umich.edu:07210] openib BTL: rdmacm IP address not found on port
[nyx0666.engin.umich.edu:07210] openib BTL: rdmacm CPC unavailable for use on mthca0:1; skipped
[nyx0666.engin.umich.edu:07210] select: init of component openib returned failure
[nyx0666.engin.umich.edu:07210] select: module openib unloaded
[nyx0666.engin.umich.edu:07210] select: initializing btl component self
[nyx0666.engin.umich.edu:07210] select: init of component self returned success
[nyx0666.engin.umich.edu:07210] select: initializing btl component sm
[nyx0666.engin.umich.edu:07210] select: init of component sm returned success
[nyx0666.engin.umich.edu:07210] select: initializing btl component tcp
[nyx0666.engin.umich.edu:07210] select: init of component tcp returned success
0: nyx0665
1: nyx0666
[nyx0666.engin.umich.edu:07210] btl: tcp: attempting to connect() to address 10.164.2.153 on port 516
[nyx0665.engin.umich.edu:06399] btl: tcp: attempting to connect() to address 10.164.2.154 on port 4
Now starting the main loop
  0: 1 bytes 1948 times --> 0.14 Mbps in 53.29 usec
  1: 2 bytes 1876 times --> 0.29 Mbps in 52.74 usec
  2: 3 bytes 1896 times --> 0.43 Mbps in 53.04 usec
  3: 4 bytes 1256 times --> 0.57 Mbps in 53.55 usec
  4: 6 bytes 1400 times --> 0.85 Mbps in 54.03 usec
  5: 8 bytes 925 times --> mpirun: killing job...

--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 6399 on node nyx0665.engin.umich.edu exited on signal 0 (Unknown signal 0).
--------------------------------------------------------------------------
mpirun: clean termination accomplished

[nyx0665.engin.umich.edu:06398] 1 more process has sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[nyx0665.engin.umich.edu:06398] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
2 total processes killed (some possibly by mpirun during cleanup)
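
With this much verbose output it can help to tally which BTL components initialized on each host. The following is a minimal sketch, not part of Open MPI: the function name summarize_btl_init is hypothetical, and the regular expression assumes only the "select: init of component ... returned ..." line format seen in the output above.

```python
import re
from collections import defaultdict

# Matches lines such as:
# [nyx0665.engin.umich.edu:06399] select: init of component openib returned failure
# The pattern is an assumption based on the output shown in this message.
SELECT_RE = re.compile(
    r"^\[(?P<host>[^:\]]+):(?P<pid>\d+)\] select: init of component "
    r"(?P<component>\S+) returned (?P<result>success|failure)$"
)


def summarize_btl_init(lines):
    """Return {host: {component: 'success' or 'failure'}} for select lines."""
    summary = defaultdict(dict)
    for line in lines:
        match = SELECT_RE.match(line.strip())
        if match:
            summary[match["host"]][match["component"]] = match["result"]
    return dict(summary)


# A few lines copied from the output above.
sample = """\
[nyx0665.engin.umich.edu:06399] select: init of component ofud returned failure
[nyx0665.engin.umich.edu:06399] openib BTL: rdmacm IP address not found on port
[nyx0665.engin.umich.edu:06399] select: init of component openib returned failure
[nyx0665.engin.umich.edu:06399] select: init of component self returned success
[nyx0665.engin.umich.edu:06399] select: init of component tcp returned success
"""

for host, components in summarize_btl_init(sample.splitlines()).items():
    for component, result in components.items():
        print(f"{host}: {component} -> {result}")
```

For the run above, such a tally would show openib failing on both hosts while self, sm, and tcp succeed, which matches the job falling back to TCP.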


We were being bitten by a number of codes hanging in collectives, which was resolved by using rdmacm. As a workaround, we tried setting rdmacm as the default until the two bugs in Bugzilla are resolved. Then we hit this problem on our DDR/SDR gear.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp_at_[hidden]
(734)936-1985

On Apr 28, 2011, at 8:07 AM, Jeff Squyres wrote:

> On Apr 27, 2011, at 10:02 AM, Brock Palen wrote:
>
>> Argh, our messed-up environment with three generations of InfiniBand bit us.
>> Setting btl_openib_cpc_include to rdmacm causes IB to not be used on the old DDR IB on some of our hosts. Note that jobs will never run across both our old DDR IB and our new QDR hardware, where rdmacm does work.
>
> Hmm -- odd. I use RDMACM on some old DDR (and SDR!) IB hardware and it seems to work fine.
>
> Do you have any indication as to why OMPI is refusing to use rdmacm on your older hardware, other than "No OF connection schemes reported..."? Try running with --mca btl_base_verbose 100 (beware: it will be a truckload of output). Make sure that you have rdmacm support available on those machines, both in OMPI and in OFED/the OS.
>
>> I am doing some testing with:
>> export OMPI_MCA_btl_openib_cpc_include=rdmacm,oob,xoob
>>
>> What I want to know is whether there is a way to tell mpirun to 'dump all resolved MCA settings', or something similar.
>
> I'm not quite sure what you're asking here -- do you want to override MCA params on specific hosts?
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
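
On the question of seeing which MCA settings are in effect: the quoted message sets a parameter through an OMPI_MCA_-prefixed environment variable, so one partial check is to list every such variable in the current environment. This is only a sketch. The helper name ompi_mca_env is hypothetical, and it covers environment-sourced settings only, not values from parameter files or the mpirun command line.

```python
import os


def ompi_mca_env(environ=None):
    """Return {param_name: value} for every OMPI_MCA_* variable in environ.

    Only environment-sourced settings are visible here; values given on the
    mpirun command line or in parameter files are not reported.
    """
    prefix = "OMPI_MCA_"
    if environ is None:
        environ = os.environ
    return {
        name[len(prefix):]: value
        for name, value in sorted(environ.items())
        if name.startswith(prefix)
    }


if __name__ == "__main__":
    for name, value in ompi_mca_env().items():
        print(f"{name} = {value}")
```

Run in the shell that launches mpirun, this shows for example whether btl_openib_cpc_include is still set to rdmacm,oob,xoob from an earlier export.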