Subject: [OMPI users] Heterogeneous cluster problem - mixing AMD and Intel nodes
From: Victor (victor.major_at_[hidden])
Date: 2014-03-02 02:06:03


I have 4 x AMD A10-6800K nodes on loan for a few months and have added them to
my existing Intel nodes.

All nodes share the relevant directories via NFS. I have OpenMPI 1.6.5, which
was built with Open-MX 1.5.3 support, and the nodes are networked via GbE.

All nodes run Ubuntu 12.04.
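
For reference, I launch the jobs with mpirun and a plain hostfile, roughly as
in the sketch below (the slot counts and the binary name here are placeholders,
not my actual values):

    # hostfile -- one line per node, slots = cores to use on that node
    AMD-Node-1   slots=4
    Intel-Node-1 slots=4

    mpirun -np 8 --hostfile hostfile ./my_cfd_app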

Problem:

I can run a job EITHER on 4 x AMD nodes OR on 2 x Intel nodes, but I cannot
run a job on any combination of AMD and Intel nodes, i.e. 1 x AMD node + 1 x
Intel node produces the error below.
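
If it helps to take our CFD code out of the picture, a bare-bones test program
should be enough to exercise the same path, since the error already occurs
inside MPI_Init. The sketch below (the file name and the ring exchange are just
my own illustration, nothing specific to our application) is what I have in
mind:

    /* hello_ring.c - report the host of each rank, then pass a token
     * around the ranks so neighbouring pairs must actually communicate. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, len, token;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);              /* the abort already happens here */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(host, &len);
        printf("rank %d of %d on %s\n", rank, size, host);

        if (size > 1) {                      /* simple ring exchange */
            if (rank == 0) {
                token = 42;
                MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0,
                         MPI_COMM_WORLD);
            }
        }

        MPI_Finalize();
        return 0;
    }

It can be compiled with "mpicc hello_ring.c -o hello_ring" and run with the
same kind of hostfile as above.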

The error that I get during job setup is:

>
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications. This means that no Open MPI device has indicated
> that it can be used to communicate between these processes. This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other. This error can sometimes be the result of forgetting to
> specify the "self" BTL.
> Process 1 ([[2229,1],1]) is on host: AMD-Node-1
> Process 2 ([[2229,1],8]) is on host: Intel-Node-1
> BTLs attempted: self sm tcp
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> MPI_INIT has failed because at least one MPI process is unreachable
> from another. This *usually* means that an underlying communication
> plugin -- such as a BTL or an MTL -- has either not loaded or not
> allowed itself to be used. Your MPI job will now abort.
> You may wish to try to narrow down the problem;
> * Check the output of ompi_info to see which BTL/MTL plugins are
> available.
> * Run your application with MPI_THREAD_SINGLE.
> * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
> if using MTL-based communications) to see exactly which
> communication plugins were considered and/or discarded.
> --------------------------------------------------------------------------
> [AMD-Node-1:3932] *** An error occurred in MPI_Init
> [AMD-Node-1:3932] *** on a NULL communicator
> [AMD-Node-1:3932] *** Unknown error
> [AMD-Node-1:3932] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
> --------------------------------------------------------------------------
> An MPI process is aborting at a time when it cannot guarantee that all
> of its peer processes in the job will be killed properly. You should
> double check that everything has shut down cleanly.
> Reason: Before MPI_INIT completed
> Local host: AMD-Node-1
> PID: 3932
> --------------------------------------------------------------------------
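
For reference, the checks suggested in the message above amount to something
like the following (using the placeholder hostfile and test binary from
earlier):

    ompi_info | grep btl      # run on every node to list the BTL components
    mpirun --mca btl_base_verbose 100 -np 8 --hostfile hostfile ./hello_ring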

What I would like to know is whether it is actually difficult (or even
impossible) to mix AMD and Intel machines in the same cluster and have them
run the same job, or whether I am missing something obvious (or not so
obvious), for example in the communication stack on the Intel nodes.

I set up the AMD nodes just yesterday using the same OpenMPI and Open-MX
versions, but I may have inadvertently done something different. I am
therefore thinking (hoping) that it is possible to run such a heterogeneous
cluster, and that all I need to do is ensure that the OpenMPI modules are
correctly installed on all nodes.
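
As a first check along those lines, I assume it is enough to compare what each
node reports for the version and for the available BTL components, and then to
force plain TCP to take shared memory and Open-MX out of the equation; e.g.
something like:

    ompi_info | grep "Open MPI:"    # same version on every node?
    ompi_info | grep "MCA btl"      # same BTL components on every node?
    mpirun --mca btl tcp,self -np 8 --hostfile hostfile ./hello_ring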

I need the extra 32 GB of RAM that the AMD nodes bring in order to validate
our CFD application, and our additional Intel nodes are still not here (ETA
2 weeks).

Thank you,

Victor