
From: Troy Telford (ttelford_at_[hidden])
Date: 2006-05-30 12:45:52


I've been having trouble using Open MPI with a medium-sized cluster:

This cluster has three fabrics: Gigabit Ethernet, 10G Myrinet MX, and
InfiniBand. Myrinet works great. IB and GigE have issues:

Using the 'openib' BTL (kernel 2.6.16.1 for the drivers, openib.org RC4
userspace libraries & tools). This example uses the IMB benchmark, but the
problem is not limited to IMB:

*********************************************************************
[root_at_zartan ~]# mpirun -np 90 -mca btl openib
-machinefile /etc/pdsh/machines /tmp/IMB-MPI1

#----------------------------------------------------------------
# Benchmarking Reduce
# #processes = 64
# ( 26 additional processes waiting in MPI_Barrier)
#----------------------------------------------------------------
        #bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
             0 1000 0.04 0.04 0.04
[0,1,63][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 12 for wr_id 47003518529948 opcode 0

[0,1,63][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47003518767232 opcode 0

[0,1,63][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47003518965544 opcode 0

[0,1,63][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47003547253820 opcode 0

[0,1,63][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47003547286872 opcode 0

[0,1,63][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47003547319924 opcode 0

[0,1,63][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47003547352976 opcode 0

[0,1,63][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47003547386028 opcode 0

[0,1,63][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47003547419080 opcode 0

[0,1,63][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47003547452132 opcode 0

[0,1,63][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47003549606016 opcode 0

[0,1,63][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47003549639068 opcode 0

**********************************************************************
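I haven't ruled out an MCA-parameter mistake on my part: my understanding is
that when the BTL list is restricted by hand, the 'self' (loopback) BTL usually
needs to be listed explicitly. The variation I plan to try next (same hosts and
binary as above, just a different BTL list) is:

mpirun -np 90 -mca btl openib,self \
    -machinefile /etc/pdsh/machines /tmp/IMB-MPI1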

With the 'tcp' BTL, I get the following error(s):

[root_at_zartan ~]# mpirun -np 90 -mca btl tcp
-machinefile /etc/pdsh/machines /tmp/IMB-MPI1

Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x6
[0] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/libopal.so.0
[0x2adefe5248ca]
[1] func:/lib64/libpthread.so.0 [0x2adefeb2e380]
[2]
func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/openmpi/mca_btl_tcp.so(mca_btl_tcp_proc_remove+0xbb)
[0x2adf018139ab]
[3]
func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/openmpi/mca_btl_tcp.so
[0x2adf01811bec]
[4]
func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/openmpi/mca_btl_tcp.so(mca_btl_tcp_add_procs+0x155)
[0x2adf0180f445]
[5]
func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/openmpi/mca_bml_r2.so(mca_bml_r2_add_procs+0x26b)
[0x2adf011912db]
[6]
func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xcc)
[0x2adf00f75d5c]
[7]
func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/libmpi.so.0(ompi_mpi_init
+0x590) [0x2adefe295c90]
[8] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/libmpi.so.0(MPI_Init
+0x83) [0x2adefe2812d3]
[9] func:/tmp/IMB-MPI1(main+0x29) [0x402eb9]
[10] func:/lib64/libc.so.6(__libc_start_main+0xdc) [0x2adefec534cc]
[11] func:/tmp/IMB-MPI1 [0x402df9]
*** End of error message ***
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x6
*** End of error message ***
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x6
*** End of error message ***
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x6
*** End of error message ***
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x6
*** End of error message ***
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x6
*** End of error message ***
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x6
*** End of error message ***
4 additional processes aborted (not shown)
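Since the backtrace shows the segfault happening inside MPI_Init (via
mca_btl_tcp_add_procs -> mca_btl_tcp_proc_remove), I don't think IMB itself is
a factor. A minimal init-only program should be enough to reproduce it; this is
just my own sketch (hypothetical file name init_only.c), not something from the
failing run:

/* init_only.c: minimal test that only initializes and finalizes MPI.
 * If the tcp BTL problem is in add_procs, this should hit the same
 * segfault during MPI_Init without running any benchmark code. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank = -1, size = 0;

    MPI_Init(&argc, &argv);                 /* trace shows the crash here */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d initialized\n", rank, size);
    MPI_Finalize();
    return 0;
}

I'd build it with mpicc and launch it with the same mpirun line as above,
substituting the test binary for /tmp/IMB-MPI1.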

Any thoughts/ideas on how to fix this?

--
  Troy Telford