Hi All,
 
We are running Open MPI 1.3.2 with OFED 1.5 on an 8-node cluster with 10 Gb iWARP Ethernet cards.
 
The node names are n130, n131, n132, n133, n134, n135, n136, and n137. The corresponding 10 Gb interface hostnames are n130x, n131x, ..., n137x.
 
Our /root/mpd.hosts file contains the following entries:
 
n130x
n131x
n134x
n135x
n136x
n132x
n133x
n137x
 
We are not able to run Open MPI across all 8 nodes:
 
mpirun -n 8 -np 8 -hostfile /root/mpd.hosts -mca btl openib,self,sm --mca orte_base_help_aggregate 0 --mca btl_base_verbose 10 --mca btl_openib_verbose 100 /usr/mpi/gcc/openmpi-1.3.2/tests/IMB-3.1/IMB-MPI1 Barrier
 
Output:
=================================================================================
 
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.
 
  Process 1 ([[33322,1],0]) is on host: n130
  Process 2 ([[33322,1],5]) is on host: n132x
  BTLs attempted: openib self sm
 
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.
 
  Process 1 ([[33322,1],2]) is on host: n134
  Process 2 ([[33322,1],5]) is on host: n132x
  BTLs attempted: openib self sm
 
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.
 
  Process 1 ([[33322,1],5]) is on host: n132
  Process 2 ([[33322,1],0]) is on host: n130
  BTLs attempted: openib self sm
 
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.
 
  Process 1 ([[33322,1],7]) is on host: n137
  Process 2 ([[33322,1],0]) is on host: n130
  BTLs attempted: openib self sm
 
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.
 
  Process 1 ([[33322,1],3]) is on host: n135
  Process 2 ([[33322,1],5]) is on host: n132x
  BTLs attempted: openib self sm
 
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.
 
  Process 1 ([[33322,1],6]) is on host: n133
  Process 2 ([[33322,1],0]) is on host: n130
  BTLs attempted: openib self sm
 
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.
 
  Process 1 ([[33322,1],1]) is on host: n131
  Process 2 ([[33322,1],5]) is on host: n132x
  BTLs attempted: openib self sm
 
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.
 
  Process 1 ([[33322,1],4]) is on host: n136
  Process 2 ([[33322,1],5]) is on host: n132x
  BTLs attempted: openib self sm
 
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
 
  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
 
  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
 
  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
 
  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
 
  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
 
  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
 
  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[n134:4888] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init_thread
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
*** An error occurred in MPI_Init_thread
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
*** An error occurred in MPI_Init_thread
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
*** An error occurred in MPI_Init_thread
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
*** An error occurred in MPI_Init_thread
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
*** An error occurred in MPI_Init_thread
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[n137:4890] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
[n135:4883] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
[n133:4850] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
[n136:4866] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
[n131:4866] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
[n132:4855] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpirun has exited due to process rank 3 with PID 4883 on
node n135x exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
 
  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[n130:4885] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
=================================================================================
 
We are able to run the same command across all 8 nodes when using the tcp BTL instead:
 
mpirun -n 8 -np 8 -hostfile /root/mpd.hosts  -mca btl tcp,self,sm --mca orte_base_help_aggregate 0 --mca btl_base_verbose 10 --mca btl_openib_verbose 100 /usr/mpi/gcc/openmpi-1.3.2/tests/IMB-3.1/IMB-MPI1 Barrier
 
 
If we remove n132, n133, and n137 from the mpd.hosts file, we are able to run Open MPI on the remaining 5 nodes with the openib,sm,self BTLs (a sketch of that reduced run is below).
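For illustration, the working 5-node run is just the same command with a trimmed hostfile; the file name /root/mpd.hosts.5 here is only an example, not what we actually use:

n130x
n131x
n134x
n135x
n136x

mpirun -n 5 -np 5 -hostfile /root/mpd.hosts.5 -mca btl openib,self,sm /usr/mpi/gcc/openmpi-1.3.2/tests/IMB-3.1/IMB-MPI1 Barrier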
 
So there seems to be a problem specific to n132, n133, and n137. We are able to run Open MPI across just those 3 nodes, but when we try to run them together with the other 5 nodes, or with even one of them (n130, n131, n134, n135, n136), we get the error below (a minimal two-node reproduction is sketched first):
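For example, a two-node run pairing n130x with n132x already fails; the two-entry hostfile name /root/mpd.hosts.2 here is only illustrative:

n130x
n132x

mpirun -n 2 -np 2 -hostfile /root/mpd.hosts.2 -mca btl openib,self,sm --mca orte_base_help_aggregate 0 /usr/mpi/gcc/openmpi-1.3.2/tests/IMB-3.1/IMB-MPI1 Barrier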
 
Output:
===============
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.
 
  Process 1 ([[33304,1],1]) is on host: n132
  Process 2 ([[33304,1],0]) is on host: n130
  BTLs attempted: openib self sm
 
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
 
  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.
 
  Process 1 ([[33304,1],0]) is on host: n130
  Process 2 ([[33304,1],1]) is on host: 100
  BTLs attempted: openib self sm
 
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
 
  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[n130:4929] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init_thread
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[n132:4963] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 4929 on
node n130 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
-----------------------------------------------------------
 
We are able to run Intel MPI and MVAPICH2 on all 8 nodes, but we face this problem only with Open MPI. Can anyone help us figure out what the real issue is with those 3 nodes?
 
Please find the attached log for details.
 
 
Thanks,
Hardik