
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Heterogeneous cluster problem - mixing AMD and Intel nodes
From: Brice Goglin (Brice.Goglin_at_[hidden])
Date: 2014-03-02 02:19:41

What's your mpirun or mpiexec command-line?
The error "BTLs attempted: self sm tcp" says that it didn't even try the
MX BTL (for Open-MX). Did you use the MX MTL instead?
Are you sure that you actually use Open-MX when not mixing AMD and Intel
nodes?
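A few concrete things to check (a sketch only: "./app" and the host names
are placeholders, and the flags assume standard Open MPI 1.6.x MCA syntax):

```shell
# 1. Check whether the mx BTL/MTL components were actually built in
#    on each node -- if this prints nothing, Open-MX support is missing:
ompi_info | grep -i "mx"

# 2. Force the MX BTL so that a silent fallback to tcp becomes a
#    visible error instead:
mpirun --mca btl mx,sm,self -np 2 -host AMD-Node-1,Intel-Node-1 ./app

# 3. Or select the MX MTL explicitly (the MTL path requires the cm PML):
mpirun --mca pml cm --mca mtl mx -np 2 -host AMD-Node-1,Intel-Node-1 ./app

# 4. See exactly which communication components are considered and
#    discarded during startup:
mpirun --mca btl_base_verbose 100 -np 2 ./app
```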


On 02/03/2014 08:06, Victor wrote:
> I got 4 x AMD A-10 6800K nodes on loan for a few months and added them
> to my existing Intel nodes.
> All nodes share the relevant directories via NFS. I have OpenMPI 1.6.5,
> which was built with Open-MX 1.5.3 support, networked via GbE.
> All nodes run Ubuntu 12.04.
> Problem:
> I can run a job EITHER on 4 x AMD nodes OR on 2 x Intel nodes, but I
> cannot run a job on any combination of an AMD and an Intel node, i.e. 1 x
> AMD node + 1 x Intel node = error below.
> The error that I get during job setup is:
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications. This means that no Open MPI device has indicated
> that it can be used to communicate between these processes. This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other. This error can sometimes be the result of forgetting to
> specify the "self" BTL.
> Process 1 ([[2229,1],1]) is on host: AMD-Node-1
> Process 2 ([[2229,1],8]) is on host: Intel-Node-1
> BTLs attempted: self sm tcp
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> MPI_INIT has failed because at least one MPI process is unreachable
> from another. This *usually* means that an underlying communication
> plugin -- such as a BTL or an MTL -- has either not loaded or not
> allowed itself to be used. Your MPI job will now abort.
> You may wish to try to narrow down the problem;
> * Check the output of ompi_info to see which BTL/MTL plugins are
> available.
> * Run your application with MPI_THREAD_SINGLE.
> * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
> if using MTL-based communications) to see exactly which
> communication plugins were considered and/or discarded.
> --------------------------------------------------------------------------
> [AMD-Node-1:3932] *** An error occurred in MPI_Init
> [AMD-Node-1:3932] *** on a NULL communicator
> [AMD-Node-1:3932] *** Unknown error
> [AMD-Node-1:3932] *** MPI_ERRORS_ARE_FATAL: your MPI job will now
> abort
> --------------------------------------------------------------------------
> An MPI process is aborting at a time when it cannot guarantee that all
> of its peer processes in the job will be killed properly. You should
> double check that everything has shut down cleanly.
> Reason: Before MPI_INIT completed
> Local host: AMD-Node-1
> PID: 3932
> --------------------------------------------------------------------------
> What I would like to know is: is it actually difficult (or impossible)
> to mix AMD and Intel machines in the same cluster and have them run the
> same job? Or am I missing something, obvious or not so obvious, in the
> communication stack on the Intel nodes, for example?
> I set up the AMD nodes just yesterday using the same OpenMPI and
> Open-MX versions, but I may have inadvertently done something
> different. So I am thinking (hoping) that it is possible to run such a
> heterogeneous cluster, and that all I need to do is ensure that all
> OpenMPI modules are correctly installed on all nodes.
> I need the extra 32 GB of RAM that the AMD nodes bring, as I need to
> validate our CFD application, and our additional Intel nodes are still
> not here (ETA 2 weeks).
> Thank you,
> Victor
> _______________________________________________
> users mailing list
> users_at_[hidden]