I am copying your email from the web site because I had enabled the
option to receive all the emails as a single daily digest.
On 11/04/2012 05:27 PM, George Markomanolis wrote:
> > Dear all,
> >
> > I am trying to execute an experiment by oversubscribing the nodes. I
> > have several clusters available (I can use up to 8-10 different
> > clusters during one execution), with around 1,300 cores in total. I am
> > executing the EP benchmark from the NAS suite, which means there is
> > not much MPI traffic, just a few collective MPI calls.
> >
> > The number of MPI processes per node depends on the available memory
> > of each node. Thus, in the machinefile I have declared a node 13 times
> > if I want 13 MPI processes on it. Is that correct?
> You *can* do it that way, or you could just use "slots=13" for that
> node in the file, and list it only once.
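As a sketch of that suggestion (node01 is a hypothetical hostname), a
single hostfile line with an explicit slot count replaces the 13
repeated lines:

```
# hostfile: request 13 process slots on node01
# (equivalent to listing "node01" on 13 separate lines)
node01 slots=13
```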
OK, but I assume the result is the same, right?
> > Giving a machinefile of 32768 nodes when I want to execute 32768
> > processes, does OpenMPI behave as if there is no oversubscription?
> Yes, it should - I assume you mean "slots" and not "nodes" in the
> above statement, since you indicate that you listed each node multiple
> times to set the number of slots on that node.
Yes, I mean slots.
> > If yes, how can I give a machinefile where there is a different
> > number of MPI processes on each node? The maximum number of MPI
> > processes that I have on a node is 388.
> Just assign the number of slots on each node to be the number of
> processes you want on that node.
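For instance (all hostnames here are hypothetical), a hostfile with
per-node slot counts sized to each node's memory might look like:

```
# hostfile: a different slot count on each node
node01 slots=388    # the largest node
node02 slots=13
node03 slots=48
```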
> > My problem is that I can execute 16384 processes but not 32768. In
> > the first case I need around 3 minutes for the execution, but in the
> > second case the benchmark has not even started after 7 hours. There
> > is no error; I just cancel the job myself, but I assume something is
> > wrong because 7 hours is far too long. I have to say that I executed
> > the instance of 16384 processes without any problem. I added some
> > debug info to the benchmark and I can see that the execution is stuck
> > in MPI_Init; it never passes this point. For the instance of 16384
> > processes, the MPI_Init call takes around 2 minutes to finish. I am
> > checking the memory of all the nodes and there is at least 0.5 GB of
> > free memory on each node.
> > I know about the parameter mpi_yield_when_idle, but I have read that
> > it will not improve performance if there are not many MPI messages.
> > I tried it anyway and nothing changed. I also tried
> > mpi_preconnect_mpi just in case, but again nothing. Could you please
> > suggest a reason why this is happening?
> You indicated that these jobs are actually spanning multiple clusters
> - true? If so, when you cross that 16384 boundary, do you also cross
> clusters? Is it possible one or more of the additional clusters is
> blocking communications?
I have tried both configurations. I used exactly the same nodes with
fewer MPI processes per node in order to check whether one site is
blocking the others, and I tried half of the machinefile for the
instance of 16384 processes in order to see whether there is any issue
with using so many MPI processes per node. Both ran fine with the
instance of 16384 MPI processes. I also tried combining different
quarters of the machinefile in order to check whether there is any
issue with the combination of specific sites, and again I had no
problem.
> > Moreover, I used just one node with 48GB of memory to execute 2048
> > MPI processes without any problem; of course, I just had to wait a
> > I am using OpenMPI v1.4.1 and all the clusters are 64-bit.
> > I execute the benchmark with the following command:
> > mpirun --mca pml ob1 --mca btl tcp,self --mca btl_tcp_if_exclude
> > ib0,lo,myri0 -machinefile machines -np 32768 ep.D.32768
> You could just leave off the "-np N" part of the command line - we'll
> assign one process to every slot specified in the machinefile.
> > Best regards,
> > George Markomanolis