Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] selected pml cm, but peer [[2469, 1], 0] on compute-0-0 selected pml ob1
From: Nysal Jan (jnysal_at_[hidden])
Date: 2009-03-19 04:07:27


fs1 is selecting the "cm" PML, whereas the other nodes are selecting the
"ob1" PML component. You can force ob1 to be used on all nodes by passing
"--mca pml ob1" to mpirun.

What kind of hardware/NIC does fs1 have?
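The "cm" PML normally gets selected only when a matching MTL component
(e.g. MX or PSM) is usable on that node, so one quick check would be to
compare the MTL components on fs1 against a compute node, the same way you
ran "ompi_info | grep btl":

  ompi_info | grep mtl

If fs1 lists an MTL that the other machines don't, that would explain why
it picks cm while the rest pick ob1.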

--Nysal

On Wed, 2009-03-18 at 17:17 -0400, Gary Draving wrote:
> Hi all,
>
> has anyone ever seen an error like this? It seems like I have some setting
> wrong in Open MPI. I thought I had it set up like the other machines, but
> it seems as though I have missed something. I only get the error when
> adding machine "fs1" to the hostfile list. The other 40+ machines seem
> fine.
>
> [fs1.calvin.edu:01750] [[2469,1],6] selected pml cm, but peer
> [[2469,1],0] on compute-0-0 selected pml ob1
>
> When I use ompi_info, the output looks like it does on my other machines:
>
> [root@fs1 openmpi-1.3]# ompi_info | grep btl
> MCA btl: ofud (MCA v2.0, API v2.0, Component v1.3)
> MCA btl: openib (MCA v2.0, API v2.0, Component v1.3)
> MCA btl: self (MCA v2.0, API v2.0, Component v1.3)
> MCA btl: sm (MCA v2.0, API v2.0, Component v1.3)
>
> The whole error is below, any help would be greatly appreciated.
>
> Gary
>
> [admin@dahl 00.greetings]$ /usr/local/bin/mpirun --mca btl ^tcp
> --hostfile machines -np 7 greetings
> [fs1.calvin.edu:01959] [[2212,1],6] selected pml cm, but peer
> [[2212,1],0] on compute-0-0 selected pml ob1
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
> PML add procs failed
> --> Returned "Unreachable" (-12) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [fs1.calvin.edu:1959] Abort before MPI_INIT completed successfully; not
> able to guarantee that all other processes were killed!
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications. This means that no Open MPI device has indicated
> that it can be used to communicate between these processes. This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other. This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
> Process 1 ([[2212,1],3]) is on host: dahl.calvin.edu
> Process 2 ([[2212,1],0]) is on host: compute-0-0
> BTLs attempted: openib self sm
>
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [dahl.calvin.edu:16884] Abort before MPI_INIT completed successfully;
> not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [compute-0-0.local:1591] Abort before MPI_INIT completed successfully;
> not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [fs2.calvin.edu:8826] Abort before MPI_INIT completed successfully; not
> able to guarantee that all other processes were killed!
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 3 with PID 16884 on
> node dahl.calvin.edu exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> [dahl.calvin.edu:16879] 3 more processes have sent help message
> help-mpi-runtime / mpi_init:startup:internal-failure
> [dahl.calvin.edu:16879] Set MCA parameter "orte_base_help_aggregate" to
> 0 to see all help / error messages
> [dahl.calvin.edu:16879] 2 more processes have sent help message
> help-mca-bml-r2.txt / unreachable proc
>