
Open MPI User's Mailing List Archives


Subject: [OMPI users] design advice for multi-threaded load balancing application
From: Brian Budge (brian.budge_at_[hidden])
Date: 2013-06-12 17:10:25

Hi all -

I have an application where the master node spawns slaves to
perform computation (using the singleton Comm_spawn_multiple paradigm
available in Open MPI). The master only decides what work to do,
and provides data common to all the computations.

The slaves are multi-threaded, and locally handle load balancing via a
non-blocking thread-safe queue.

Work is load balanced between nodes like so:

1) The master doles out half the work in a round-robin fashion
2) The master will replace work when it receives completed work from a slave
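In sketch form, the assignment logic in the two steps above looks roughly like this (the function and parameter names are invented for illustration; the real code sends each item to the chosen slave with an MPI send):

```c
/* Rough sketch of the balancing scheme above. Names (initial_assignment,
 * num_work, num_slaves) are made up for illustration only. */

/* Step 1: the first half of the work items go out round-robin.
 * Returns the slave index for item i, or -1 if the master holds it back. */
int initial_assignment(int i, int num_work, int num_slaves) {
    if (i < num_work / 2)
        return i % num_slaves;
    return -1; /* step 2: handed out later, when a slave returns a result */
}
```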

I currently have a design where I have built Open MPI with
multi-threading enabled, and I allow any thread to send (work) or
broadcast (common data and control messages) to the other nodes. A
dedicated thread handles receives, including the receive end of
broadcasts. The receive thread hands the work data off to the local
load-balancing mechanism, and sets the common data in a thread-safe
fashion.
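The hand-off from the receive thread to the workers could look something like the sketch below, using a simple mutex/condvar queue in place of the non-blocking one (all names here are invented; the real structure is lock-free):

```c
#include <pthread.h>

/* Minimal thread-safe work queue sketch: the receive thread pushes
 * incoming work items, worker threads pop them. A stand-in for the
 * non-blocking queue in the actual application. */
#define QCAP 64

typedef struct {
    int items[QCAP];
    int head, tail, count;
    pthread_mutex_t mu;
    pthread_cond_t nonempty;
} work_queue;

void wq_init(work_queue *q) {
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->mu, NULL);
    pthread_cond_init(&q->nonempty, NULL);
}

void wq_push(work_queue *q, int item) {   /* called by the receive thread */
    pthread_mutex_lock(&q->mu);
    q->items[q->tail] = item;
    q->tail = (q->tail + 1) % QCAP;
    q->count++;
    pthread_cond_signal(&q->nonempty);
    pthread_mutex_unlock(&q->mu);
}

int wq_pop(work_queue *q) {               /* called by worker threads */
    pthread_mutex_lock(&q->mu);
    while (q->count == 0)
        pthread_cond_wait(&q->nonempty, &q->mu);
    int item = q->items[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    pthread_mutex_unlock(&q->mu);
    return item;
}
```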

When worker threads complete work quickly, they pound MPI with sends,
which leads to a ton of lock contention. Another issue I'm facing is
that the messages are sometimes very small but numerous, which I
suspect adds a lot of overhead in MPI and/or the various network
layers.
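One common mitigation for many tiny messages (not something the application does today) is to aggregate several of them into one buffer before a single send. A hypothetical packing sketch, with all names invented:

```c
#include <string.h>

/* Hypothetical message aggregation: pack (length, bytes) records into
 * one buffer so several small results travel in a single MPI send.
 * The receiver would walk the buffer, reading each length prefix. */
typedef struct {
    char buf[1024];
    int used;   /* bytes consumed so far */
    int nmsgs;  /* records packed */
} batch;

void batch_init(batch *b) { b->used = 0; b->nmsgs = 0; }

/* returns 1 on success, 0 if the record would not fit */
int batch_add(batch *b, const void *msg, int len) {
    if (b->used + (int)sizeof(int) + len > (int)sizeof(b->buf))
        return 0;
    memcpy(b->buf + b->used, &len, sizeof(int));        /* length prefix */
    memcpy(b->buf + b->used + sizeof(int), msg, len);   /* payload */
    b->used += (int)sizeof(int) + len;
    b->nmsgs++;
    return 1;
}
```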

I'm thinking of going to a THREAD_FUNNELED design instead of a
THREAD_MULTIPLE design, but I'm unsure of the best way to accomplish
this. For example, is it advisable to have multiple Isends and/or
multiple Irecvs in flight at once (essentially allowing the data to
be staged concurrently), or is it better to have only one Isend at a
time? If I Iprobe and then Irecv, and then Iprobe again, presumably I
will not get the same message, because retrieval of that message has
already been started?
Currently, I Isend data to all receiving nodes to describe the details
of a broadcast, but I Waitall before calling Bcast. Is there anything
to be careful of if I move to more asynchronous communication? If I
don't Waitall, are there cases where I can deadlock? (I haven't
thought of any.)

All my communication is somewhat generic in the sense that
Probe/Iprobe accept MPI_ANY_SOURCE and MPI_ANY_TAG, and a Bcast is
only initiated on the receiver side once it has received a control
message specifying the sender's rank and the size of the message.

Thanks for any and all suggestions and comments,