Open MPI User's Mailing List Archives

Subject: [OMPI users] Heterogeneous cluster problem - mixing AMD and Intel nodes
From: Victor (victor.major_at_[hidden])
Date: 2014-03-02 02:06:03


I got 4 x AMD A-10 6800K nodes on loan for a few months and added them to
my existing Intel nodes.

All nodes share the relevant directories via NFS. I have OpenMPI 1.6.5,
which was built with Open-MX 1.5.3 support, networked via GbE.

All nodes run Ubuntu 12.04.

Problem:

I can run a job EITHER on 4 x AMD nodes OR on 2 x Intel nodes, but I cannot
run a job on any combination of AMD and Intel nodes, i.e. 1 x AMD node + 1
x Intel node = error below.

The error that I get during job setup is:

>
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications. This means that no Open MPI device has indicated
> that it can be used to communicate between these processes. This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other. This error can sometimes be the result of forgetting to
> specify the "self" BTL.
> Process 1 ([[2229,1],1]) is on host: AMD-Node-1
> Process 2 ([[2229,1],8]) is on host: Intel-Node-1
> BTLs attempted: self sm tcp
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> MPI_INIT has failed because at least one MPI process is unreachable
> from another. This *usually* means that an underlying communication
> plugin -- such as a BTL or an MTL -- has either not loaded or not
> allowed itself to be used. Your MPI job will now abort.
> You may wish to try to narrow down the problem;
> * Check the output of ompi_info to see which BTL/MTL plugins are
> available.
> * Run your application with MPI_THREAD_SINGLE.
> * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
> if using MTL-based communications) to see exactly which
> communication plugins were considered and/or discarded.
> --------------------------------------------------------------------------
> [AMD-Node-1:3932] *** An error occurred in MPI_Init
> [AMD-Node-1:3932] *** on a NULL communicator
> [AMD-Node-1:3932] *** Unknown error
> [AMD-Node-1:3932] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
> --------------------------------------------------------------------------
> An MPI process is aborting at a time when it cannot guarantee that all
> of its peer processes in the job will be killed properly. You should
> double check that everything has shut down cleanly.
> Reason: Before MPI_INIT completed
> Local host: AMD-Node-1
> PID: 3932
> --------------------------------------------------------------------------

What I would like to know is whether it is actually difficult (or
impossible) to mix AMD and Intel machines in the same cluster and have them
run the same job. Or am I missing something, obvious or not so obvious, in
the communication stack on the Intel nodes, for example?

I set up the AMD nodes just yesterday. I used the same OpenMPI and Open-MX
versions, but I may have inadvertently done something differently. So I am
thinking (hoping) that it is possible to run such a heterogeneous cluster,
and that all I need to do is ensure that all OpenMPI modules are correctly
installed on all nodes.
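
One way I could check that the same modules are installed everywhere is to
save the output of ompi_info on each node and compare the BTL/MTL/PML
components. Below is a minimal Python sketch of that comparison. The
function names are hypothetical, and the assumed line format
("MCA btl: tcp (MCA v2.0, API v2.0, Component v1.6.5)") is what I believe
ompi_info prints for 1.6.x, so the regular expression may need adjusting.

```python
import re
from collections import defaultdict

# Assumption: ompi_info lists each plugin on a line shaped like
#   "MCA btl: tcp (MCA v2.0, API v2.0, Component v1.6.5)"
_MCA_LINE = re.compile(
    r"^\s*MCA\s+(\w+):\s+(\S+)\s+\(.*Component\s+v([\d.]+\w*)\)"
)


def parse_components(ompi_info_text, frameworks=("btl", "mtl", "pml")):
    """Map framework -> {component name: component version}."""
    found = defaultdict(dict)
    for line in ompi_info_text.splitlines():
        match = _MCA_LINE.match(line)
        if match and match.group(1) in frameworks:
            framework, name, version = match.groups()
            found[framework][name] = version
    return dict(found)


def diff_components(node_a, node_b):
    """List (framework, component, version_a, version_b) where nodes differ.

    A version of None means the component is missing on that node.
    """
    diffs = []
    for framework in sorted(set(node_a) | set(node_b)):
        comps_a = node_a.get(framework, {})
        comps_b = node_b.get(framework, {})
        for name in sorted(set(comps_a) | set(comps_b)):
            if comps_a.get(name) != comps_b.get(name):
                diffs.append(
                    (framework, name, comps_a.get(name), comps_b.get(name))
                )
    return diffs
```

The idea would be to run ompi_info on AMD-Node-1 and on Intel-Node-1, feed
each saved output to parse_components(), and pass the two results to
diff_components(). An empty list would suggest the plugin sets match; any
entry would point at a component that is missing or has a different version
on one side.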

I need the extra 32 GB of RAM that the AMD nodes bring, as I need to
validate our CFD application, and our additional Intel nodes are still not
here (ETA 2 weeks).

Thank you,

Victor