Some more info:
 
The MOFED package that you download includes MXM, but it is an older version (v1.1). A newer version of MXM (v1.5) is available.
 
So, after installing MOFED, please remove the bundled MXM (rpm -e mxm), then download and install the new MXM (v1.5) from:
http://www.mellanox.com/page/products_dyn?product_family=135&mtag=mxm
(the one that matches your OS)
 
Thanks,
Alina


On Wed, Jan 23, 2013 at 11:28 AM, Alina Sklarevich <alinas@dev.mellanox.co.il> wrote:
Hello Francesco,
 
Please download and install MOFED from:
http://www.mellanox.com/page/products_dyn?product_family=26&mtag=linux_sw_drivers
(the one that matches your OS)
 
Then MXM will be compatible with your OS.
 
Thanks,
Alina.


On Mon, Jan 21, 2013 at 5:00 PM, Francesco Simula <francesco.simula@roma1.infn.it> wrote:

Hi Mike,

that is a question I'm not sure I can answer, because I didn't install the OFED package. Looking for it on the OpenFabrics.org site, I see that the archive contains the packages that my distro ships individually, e.g. 'libmthca-1.0.5-6.el5.rpm', 'libmlx4-1.0.1-7.el5.rpm', etc.

The complete list of packages that I was instructed to install to make the IB HCAs work under CentOS/Red Hat is here:

http://people.redhat.com/dledford/infiniband_get_started.html

Taking the 'libmthca' package as an example, its version is 1.0.5, the same as the 'libmthca-1.0.5-0.1-gbe5eef3.src.rpm' source package inside 'OFED-3.5-rc5.tgz' from the OpenFabrics site; on the other hand, the 'libibmad' package for CentOS 5.8 is version 1.3.3, which does not match 'libibmad-1.3.9-1.src.rpm' inside the tarball.

Do you think I should erase the relevant packages and rely on a completely recompiled OFED instead? As a desperate measure I will, but my understanding was that using OMPI 1.6.3 with MXM did not require this...

 

Best regards,

Francesco

 

On 2013-01-19 20:06, Mike Dubman wrote:

Also, what MOFED/OFED version do you have?
MXM is compiled per OFED/MOFED version; does the active OFED match the mxm.rpm you selected?
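
As a rough aid for the matching question above, the MXM package names quoted further down this thread appear to encode a version, an architecture and a distro tag. The following is a minimal Python sketch that splits such a name into its fields. The pattern is inferred only from the two file names in this thread, the helper name parse_mxm_rpm is hypothetical, and the "commit" label for the hex field is a guess; other Mellanox packages may be named differently.

```python
import re

# Pattern inferred from the two MXM file names quoted in this thread
# (assumption); it is not an official naming specification.
MXM_RPM = re.compile(
    r"^mxm-(?P<version>\d+\.\d+)\.(?P<commit>[0-9a-f]+)-(?P<release>\d+)"
    r"\.(?P<arch>\w+)-(?P<distro>\w+)\.rpm$"
)


def parse_mxm_rpm(filename):
    """Return the fields encoded in an MXM rpm file name, or None if it does not fit."""
    match = MXM_RPM.match(filename)
    return match.groupdict() if match else None


info = parse_mxm_rpm("mxm-1.5.f583875-1.x86_64-centos5u7.rpm")
print(info["version"], info["arch"], info["distro"])
```

The distro tag obtained this way can then be compared by eye with the OS release and the installed OFED/MOFED stack; the sketch itself does not query the system.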

On Thu, Jan 17, 2013 at 4:09 PM, Francesco Simula <francesco.simula@roma1.infn.it> wrote:
I tried building from OMPI 1.6.3 tarball with the following ./configure:
./configure --prefix=/apotto/home1/homedirs/fsimula/Lavoro/openmpi-1.6.3/install/ \
--disable-mpi-io \
--disable-io-romio \
--enable-dependency-tracking \
--without-slurm \
--with-platform=optimized \
--disable-mpi-f77 \
--disable-mpi-f90 \
--with-openib \
--disable-static \
--enable-shared \
--disable-vt \
--enable-pty-support \
--enable-mca-no-build=btl-ofud,pml-bfo \
--with-mxm=/opt/mellanox/mxm \
--with-mxm-libdir=/opt/mellanox/mxm/lib

As you can see from the last two lines, I want to enable the MXM transport layer on a cluster made of SuperMicro X8DTG-D boards with dual Xeons and Mellanox MT26428 HCAs; the OS is CentOS 5.8.

I tried two different MXM .rpm packages: 'mxm-1.1.ad085ef-1.x86_64-centos5u7.rpm', taken from here:
http://www.mellanox.com/downloads/hpc/mxm/v1.1/mxm-latest.tar

and 'mxm-1.5.f583875-1.x86_64-centos5u7.rpm' taken from here:
http://www.mellanox.com/downloads/hpc/mxm/v1.5/mxm-latest.tar

With both, although the compilation completes successfully, a simple test (osu_bw from the OSU Micro-Benchmarks 3.8) fails with the kind of message reported below; the lines:

rdma_dev.c:122  MXM DEBUG Port 1 on mlx4_0 has a link layer different from IB. Skipping it
rdma_dev.c:155  MXM ERROR An active IB port on a Mellanox device, with lid [any] gid [any] not found

make it seem as if MXM cannot access the HCA hardware: is that so? The very same test works when using '-mca pml ob1' (thus using the openib BTL).

I'm quite ready to start pulling my hair out; any suggestions?

The output of /usr/bin/ibv_devinfo for the two cluster nodes follows:
[cut]
hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.7.000
        node_guid:                      0025:90ff:ff07:0ac4
        sys_image_guid:                 0025:90ff:ff07:0ac7
        vendor_id:                      0x02c9
        vendor_part_id:                 26428
        hw_ver:                         0xB0
        board_id:                       SM_1061000001000
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 4
                        port_lid:               6
                        port_lmc:               0x00
[/cut]

[cut]
hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.7.000
        node_guid:                      0025:90ff:ff07:0acc
        sys_image_guid:                 0025:90ff:ff07:0acf
        vendor_id:                      0x02c9
        vendor_part_id:                 26428
        hw_ver:                         0xB0
        board_id:                       SM_1061000001000
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 4
                        port_lid:               8
                        port_lmc:               0x00
[/cut]
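
The MXM debug line quoted earlier complains about the link layer of port 1, while the two ibv_devinfo listings above contain no link_layer field at all. As far as I know, newer ibv_devinfo builds print a link_layer line for each port; that is background knowledge, not something established in this thread, and it does not by itself prove the cause of the failure. A minimal Python sketch (the helper name is hypothetical) that flags ports whose listing lacks the field:

```python
def ports_missing_link_layer(devinfo_text):
    """Return (hca_id, port) pairs whose port section has no link_layer line."""
    ports = []  # each entry: [hca_id, port_number, link_layer_seen]
    hca = None
    for raw in devinfo_text.splitlines():
        # Split "key:   value" on the first colon only, so GUIDs stay intact.
        key, _, value = raw.strip().partition(":")
        value = value.strip()
        if key == "hca_id":
            hca = value
        elif key == "port":
            ports.append([hca, int(value), False])
        elif key == "link_layer" and ports:
            ports[-1][2] = True
    return [(h, p) for h, p, seen in ports if not seen]


# Shortened excerpt of the first listing above.
sample = """\
hca_id: mlx4_0
        transport:                      InfiniBand (0)
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        port_lid:               6
                        port_lmc:               0x00
"""
print(ports_missing_link_layer(sample))
```

An empty result would mean every listed port reports a link layer; a non-empty one only says the field is absent from the text, which may simply reflect the version of the tool that produced it.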

The complete output of the failing test follows:

[fsimula@agape5 osu-micro-benchmarks-3.8]$ mpirun -x MXM_LOG_LEVEL=poll -mca pml cm -mca mtl_mxm_np 1 -np 2 -host agape4,agape5 install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw H H
[1358430343.266782] [agape5:8596 :0] config_parser.c:168  MXM DEBUG
[1358430343.266815] [agape5:8596 :0] config_parser.c:168  MXM DEBUG default: MXM_HANDLE_ERRORS=bt
[1358430343.266826] [agape5:8596 :0] config_parser.c:168  MXM DEBUG default: MXM_GDB_PATH=/usr/bin/gdb
[1358430343.266838] [agape5:8596 :0] config_parser.c:168  MXM DEBUG default: MXM_DUMP_SIGNO=1
[1358430343.266851] [agape5:8596 :0] config_parser.c:168  MXM DEBUG default: MXM_DUMP_LEVEL=conn
[1358430343.266924] [agape5:8596 :0] config_parser.c:168  MXM DEBUG default: MXM_ASYNC_MODE=THREAD
[1358430343.266936] [agape5:8596 :0] config_parser.c:168  MXM DEBUG default: MXM_TIME_ACCURACY=0.1
[1358430343.266956] [agape5:8596 :0] config_parser.c:168  MXM DEBUG default: MXM_PTLS=self,shm,rdma
[1358430343.267249] [agape5:8596 :0]     mpool.c:265  MXM DEBUG mpool 'ptl_self_recv_ev': allocated chunk 0xc075f40 of 96016 bytes with 1000 elements
[1358430343.267308] [agape5:8596 :0]     mpool.c:156  MXM DEBUG mpool 'ptl_self_recv_ev': align 16, maxelems 1000, elemsize 88, padding 8
[1358430343.267316] [agape5:8596 :0]      self.c:410  MXM DEBUG Created ptl_self
[1358430343.267333] [agape5:8596 :0]   shm_ptl.c:56   MXM DEBUG Created ptl_shm
[1358430343.268457] [agape5:8596 :0]  rdma_ptl.c:65   MXM TRACE Got 1 IB devices
[1358430343.268640] [agape5:8596 :0]  rdma_ptl.c:112  MXM DEBUG added device mlx4_0
[1358430343.268665] [agape5:8596 :0]    memreg.c:187  MXM TRACE Created memory registration cache on 1 devices
[1358430343.268676] [agape5:8596 :0]  rdma_ptl.c:133  MXM DEBUG Created ptl_rdma
[1358430343.268689] [agape5:8596 :0]     event.c:353  MXM FUNC  mxm_event_init(event=0x2b73e0ee3038 mode=2 time_accuracy=160000000)
[1358430343.268698] [agape5:8596 :0]    timerq.c:55   MXM FUNC  mxm_timerq_init(timerq=0x2b73e0ee3060 accuracy=160000000 max_interval=1600000000)
[1358430343.268706] [agape5:8596 :0]     event.c:292  MXM FUNC  mxm_event_add_thread_context(thread=0x2b73e0ee30d0)
[1358430343.268732] [agape5:8596 :0]     event.c:198  MXM FUNC  mxm_set_fd_nonblock(fd=10)
[1358430343.268741] [agape5:8596 :0]     event.c:198  MXM FUNC  mxm_set_fd_nonblock(fd=11)
[1358430343.268841] [agape5:8596 :0]       mxm.c:162  MXM INFO  context 0x2b73e0ee3010 created
[1358430343.269090] [agape5:8596 :1]     event.c:41   MXM FUNC  __call_handler(handler->cb=0x2b73e0ab28a0 handler->arg=0x2b73e0ee3038)
[1358430343.269104] [agape5:8596 :1]    timerq.c:88   MXM FUNC  mxm_timerq_sweep(timerq=0x2b73e0ee3060 current_time=568595527963578)
[1358430343.274685] [agape5:8596 :0] config_parser.c:168  MXM DEBUG default: MXM_ENABLE_HUGETLB=1
[1358430343.274700] [agape5:8596 :0] config_parser.c:168  MXM DEBUG default: MXM_ENABLE_TIMEOUTS=y
[1358430343.274709] [agape5:8596 :0] config_parser.c:168  MXM DEBUG default: MXM_ACK_TIMEOUT=0.3
[1358430343.274721] [agape5:8596 :0] config_parser.c:168  MXM DEBUG default: MXM_POLL_INTERVAL=0.1
[1358430343.274742] [agape5:8596 :0] config_parser.c:168  MXM DEBUG default: MXM_WINDOW_SIZE=512
[1358430343.274755] [agape5:8596 :0] config_parser.c:168  MXM DEBUG default: MXM_TX_BATCH=1
[1358430343.274764] [agape5:8596 :0] config_parser.c:168  MXM DEBUG default: MXM_CQ_MODERATION=64
[1358430343.274773] [agape5:8596 :0] config_parser.c:168  MXM DEBUG default: MXM_DRAIN_CQ=n
[1358430343.274782] [agape5:8596 :0] config_parser.c:168  MXM DEBUG default: MXM_RNDV_THRESH=65536
[1358430343.274791] [agape5:8596 :0] config_parser.c:168  MXM DEBUG default: MXM_ZCOPY_THRESH=2040
[1358430343.274815] [agape5:8596 :0] config_parser.c:168  MXM DEBUG default: MXM_RESIZE_CQ=y
[1358430343.274826] [agape5:8596 :0] config_parser.c:168  MXM DEBUG default: MXM_UD_MTU=65536
[1358430343.274836] [agape5:8596 :0] config_parser.c:168  MXM DEBUG default: MXM_UD_RX_QUEUE_LEN=16000
[1358430343.274849] [agape5:8596 :0] config_parser.c:168  MXM DEBUG default: MXM_UD_TX_QUEUE_LEN=64
[1358430343.274859] [agape5:8596 :0] config_parser.c:168  MXM DEBUG default: MXM_UD_RX_MAX_BUFFERS=128000
[1358430343.274877] [agape5:8596 :0] config_parser.c:168  MXM DEBUG default: MXM_UD_TX_MAX_BUFFERS=8192
[1358430343.274887] [agape5:8596 :0] config_parser.c:168  MXM DEBUG default: MXM_UD_RX_DROP_RATE=0
[1358430343.274896] [agape5:8596 :0] config_parser.c:168  MXM DEBUG default: MXM_UD_ENABLE_NAK=y
[1358430343.274904] [agape5:8596 :0] config_parser.c:168  MXM DEBUG default: MXM_UD_RX_FILL_THRESH=0.6
[1358430343.274915] [agape5:8596 :0] config_parser.c:168  MXM DEBUG default: MXM_UD_TX_MAX_INLINE=128
[1358430343.274925] [agape5:8596 :0] config_parser.c:168  MXM DEBUG default: MXM_SHM_RX_MAX_BUFFERS=2000
[1358430343.274941] [agape5:8596 :0] config_parser.c:168  MXM DEBUG default: MXM_RDMA_ALLOC=1
[1358430343.274968] [agape5:8596 :0]        ep.c:36   MXM FUNC  mxm_ep_create(context=0x2b73e0ee3010)
[1358430343.274984] [agape5:8596 :0]      self.c:380  MXM DEBUG Created ptl_self EP(rank=3767085072)
[1358430343.275028] [agape5:8596 :0] shm_queue.c:230  MXM DEBUG shm_ep=0, shmid=6815750
[1358430343.275072] [agape5:8596 :0]     mpool.c:265  MXM DEBUG mpool 'shm_ep_recv': allocated chunk 0x2aaaadd0c010 of 65824016 bytes with 2000 elements
[1358430343.278550] [agape5:8596 :0]     mpool.c:156  MXM DEBUG mpool 'shm_ep_recv': align 16, maxelems 2000, elemsize 32904, padding 8
[1358430343.278584] [agape5:8596 :0]    timerq.c:139  MXM FUNC  mxm_timer_schedule(timerq=0x2b73e0ee3060 timer=0xc029538 expiration=568595550657300)
[1358430343.278594] [agape5:8596 :0]    timerq.c:43   MXM FUNC  mxm_timerq_insert_timer(put timer 0xc029538 expiration 568595550657300 in slot 10)
[1358430343.278608] [agape5:8596 :0]    timerq.c:145  MXM TRACE added timer 0xc029538 expiration 568595550657300 interval 160000000
[1358430343.278617] [agape5:8596 :0]    shm_ep.c:176  MXM DEBUG Created ptl_shm EP (rank=0, ctx_id=1)
[1358430343.278641] [agape5:8596 :0]   rdma_ep.c:317  MXM FUNC  mxm_rdma_ep_create()
[1358430343.278722] [agape5:8596 :0]  rdma_dev.c:194  MXM FUNC  mxm_rdma_dev_init(dev=0xc0b3f00)
[1358430343.278924] [agape5:8596 :0]  rdma_dev.c:122  MXM DEBUG Port 1 on mlx4_0 has a link layer different from IB. Skipping it
[1358430343.278939] [agape5:8596 :0]  rdma_dev.c:155  MXM ERROR An active IB port on a Mellanox device, with lid [any] gid [any] not found
[1358430343.278954] [agape5:8596 :0]    timerq.c:150  MXM FUNC  mxm_timer_cancel(timerq=0x2b73e0ee3060 timer=0xc029538)
[1358430343.279454] [agape5:8596 :0]     mpool.c:184  MXM DEBUG mpool 'shm_ep_recv': destroyed
[1358430343.279466] [agape5:8596 :0]      self.c:287  MXM FUNC  mxm_self_ep_destroy(ep=0xc094600)
--------------------------------------------------------------------------
MXM was unable to create an endpoint. Please make sure that the network link is
active on the node and the hardware is functioning.

  Error: No such device

--------------------------------------------------------------------------
[1358430343.287336] [agape5:8596 :0]     event.c:400  MXM FUNC  mxm_event_cleanup(event=0x2b73e0ee3038)
[1358430343.287348] [agape5:8596 :0]     event.c:338  MXM FUNC  mxm_event_remove_thread_context(thread=0x2b73e0ee30d0)
[1358430343.287355] [agape5:8596 :0]     event.c:145  MXM FUNC  mxm_event_thread_wakeup()
[1358430343.371011] [agape5:8596 :0]    timerq.c:76   MXM FUNC  mxm_timerq_cleanup(timerq=0x2b73e0ee3060)
[1358430343.371030] [agape5:8596 :0]    memreg.c:194  MXM TRACE Destroying memory registration cache
[1358430343.371129] [agape5:8596 :0]   shm_ptl.c:34   MXM FUNC  ptl_shm_destroy(ptl=0xc0729b0)
[1358430343.371139] [agape5:8596 :0]      self.c:340  MXM FUNC  mxm_self_destroy(ptl=0xc0699a0)
[1358430343.371148] [agape5:8596 :0]     mpool.c:184  MXM DEBUG mpool 'ptl_self_recv_ev': destroyed
[1358430343.371156] [agape5:8596 :0]       mxm.c:197  MXM INFO  context 0x2b73e0ee3010 destroyed
--------------------------------------------------------------------------
No available pml components were found!

This means that there are no components of this type installed on your
system or all the components reported that they could not be used.

This is a fatal error; your MPI process is likely to abort.  Check the
output of the "ompi_info" command and ensure that components of this
type are available on your system.  You may also wish to check the
value of the "component_path" MCA parameter and ensure that it has at
least one directory that contains valid MCA components.
--------------------------------------------------------------------------
[agape5:08596] PML cm cannot be selected
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 8596 on
node agape5 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
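
With MXM_LOG_LEVEL=poll the two relevant lines are buried among FUNC, TRACE and DEBUG entries. A minimal Python sketch for pulling out only the entries of a given level follows. The regular expression is derived from the log lines shown above and may not hold for other MXM versions, and the helper name is hypothetical.

```python
import re

# Layout assumed from the output above:
# [time] [host:pid :tid] source.c:line  MXM LEVEL message
LOG_LINE = re.compile(
    r"^\[(?P<time>\d+\.\d+)\] \[(?P<host>[^:\]]+):(?P<pid>\d+)\s*:(?P<tid>\d+)\]\s+"
    r"(?P<src>\S+):(?P<line>\d+)\s+MXM (?P<level>[A-Z]+)\s*(?P<msg>.*)$"
)


def mxm_messages(log_text, levels=("ERROR",)):
    """Return (source file, line number, message) for MXM log lines at the given levels."""
    found = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line)
        if match and match.group("level") in levels:
            found.append(
                (match.group("src"), int(match.group("line")), match.group("msg"))
            )
    return found


# Three consecutive lines copied from the output above.
log = """\
[1358430343.278722] [agape5:8596 :0]  rdma_dev.c:194  MXM FUNC  mxm_rdma_dev_init(dev=0xc0b3f00)
[1358430343.278924] [agape5:8596 :0]  rdma_dev.c:122  MXM DEBUG Port 1 on mlx4_0 has a link layer different from IB. Skipping it
[1358430343.278939] [agape5:8596 :0]  rdma_dev.c:155  MXM ERROR An active IB port on a Mellanox device, with lid [any] gid [any] not found
"""
for src, lineno, msg in mxm_messages(log):
    print(src, lineno, msg)
```

Passing levels=("DEBUG", "ERROR") also keeps the "link layer different from IB" line, which is the one that precedes the error here. Lines that do not match the assumed layout, such as the Open MPI help banners, are skipped.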

Regards,
Francesco

_______________________________________________
users mailing list
users@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
