Dear Ralph,

I am copying your email from the web site because I have enabled the option to receive all the emails only once per day.

On 11/04/2012 05:27 PM, George Markomanolis wrote:
> Dear all, 

> I am trying to execute an experiment by oversubscribing the nodes. I have several clusters available (I can use up to 8-10 different clusters during one execution), with around 1300 cores in total. I am executing the EP benchmark from the NAS suite, which means that there are not a lot of MPI messages, just some collective MPI calls.
> The number of MPI processes per node depends on the available memory of each node. Thus, in the machinefile I have declared a node 13 times if I want 13 MPI processes on it. Is that correct?

You *can* do it that way, or you could just use "slots=13" for that node in the file, and list it only once.

OK, but I assume the result is the same, right?
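
To illustrate the two equivalent forms, here is a minimal sketch that collapses a machinefile with repeated host lines into one "slots=N" line per host. It assumes every input line holds only a hostname (no existing "slots=" field); the function name and the hostnames are made up for the example.

```python
from collections import Counter

def collapse_machinefile(lines):
    """Turn repeated host lines into one 'host slots=N' line per host."""
    counts = Counter()
    for raw in lines:
        # Drop anything after '#' and surrounding whitespace.
        host = raw.split("#", 1)[0].strip()
        if host:
            counts[host] += 1
    # Counter keeps first-seen order (Python 3.7+).
    return [f"{host} slots={n}" for host, n in counts.items()]

# Hypothetical hostnames, for illustration only.
example = ["node-a"] * 13 + ["node-b"] * 2
print("\n".join(collapse_machinefile(example)))
```

With the hypothetical input above, node-a is listed once with slots=13 instead of 13 times.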

> If I give a machinefile of 32768 nodes when I want to execute 32768 processes, does OpenMPI behave as if there is no oversubscription?

Yes, it should - I assume you mean "slots" and not "nodes" in the above statement, since you indicate that you listed each node multiple times to set the number of slots on that node.

Yes, I mean slots.
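
A quick way to check this before launching is to add up the slots in the machinefile and compare the total with the number of processes. The sketch below assumes that a host line without a "slots=" field counts as one slot, which matches listing a node once per process; the function names are hypothetical.

```python
def total_slots(lines):
    """Sum the slots in a machinefile; a host without 'slots=' counts as 1."""
    total = 0
    for raw in lines:
        fields = raw.split("#", 1)[0].split()
        if not fields:
            continue  # blank line or comment
        slots = 1  # assumption: one slot per plain host line
        for field in fields[1:]:
            if field.startswith("slots="):
                slots = int(field[len("slots="):])
        total += slots
    return total

def is_oversubscribed(lines, nprocs):
    """True if more processes are requested than slots are declared."""
    return nprocs > total_slots(lines)
```

For the run discussed here, one would expect total_slots to return 32768 for the full machinefile, so that is_oversubscribed(lines, 32768) is False.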

> If yes, how can I give a machinefile with a different number of MPI processes on each node? The maximum number of MPI processes that I have on a node is 388.

Just assign the number of slots on each node to be the number of processes you want on that node.
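
Since the number of processes per node depends on the memory of each node, the per-node slot counts could be derived with something like the sketch below. The memory figures, the per-process footprint, the hostnames, and the function name are all hypothetical; only the "host slots=N" line format is taken from the discussion above.

```python
def slots_from_memory(free_mb_by_host, mb_per_process):
    """Build 'host slots=N' lines, with N set by how many processes fit in memory."""
    lines = []
    for host, free_mb in free_mb_by_host.items():
        slots = free_mb // mb_per_process  # whole processes that fit
        if slots > 0:
            lines.append(f"{host} slots={slots}")
    return lines

# Hypothetical values, for illustration only.
free_mb = {"node-a": 48000, "node-b": 2600}
for line in slots_from_memory(free_mb, mb_per_process=120):
    print(line)
```

Hosts that cannot fit even one process are left out of the generated machinefile.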


> My problem is that I can execute 16384 processes but not 32768. In the first case the execution takes around 3 minutes, but in the second case the benchmark does not even start, even after 7 hours. There is no error; I cancel the job myself, but I assume that something is wrong because 7 hours is too long. I have to say that I executed the instance of 16384 processes without any problem. I added some debug info to the benchmark and I can see that the execution is stuck in MPI_Init; it never gets past this point. For the instance of 16384 processes, the MPI_Init call takes around 2 minutes to finish. I have checked the memory of all the nodes and there is at least 0.5 GB of free memory on each node.
> I know about the parameter mpi_yield_when_idle, but I have read that it will not improve performance if there are not a lot of MPI messages. I tried it anyway and nothing changed. I also tried mpi_preconnect_mpi just in case, but again nothing changed. Could you please suggest a reason why this is happening?

You indicated that these jobs are actually spanning multiple clusters - true? If so, when you cross that 16384 boundary, do you also cross clusters? Is it possible one or more of the additional clusters is blocking communications?

I have tried both configurations. First, I used exactly the same nodes with fewer MPI processes per node, in order to check whether one site is blocking the others. Second, I used half of the machinefile for the 16384 instance, in order to see whether using so many MPI processes per node causes any issue. Both ran fine with 16384 MPI processes. I also combined different quarters of the machinefile, in order to check whether a particular combination of sites causes a problem, and again I had no problem.

> Moreover, I used just one node with 48 GB of memory to execute 2048 MPI processes without any problem; of course, I just had to wait a long time.
> I am using OpenMPI v1.4.1 and all the clusters are 64-bit.
> I execute the benchmark with the following command: 
> mpirun --mca pml ob1 --mca btl tcp,self --mca btl_tcp_if_exclude ib0,lo,myri0 -machinefile machines -np 32768 ep.D.32768 

You could just leave off the "-np N" part of the command line - we'll assign one process to every slot specified in the machinefile.

OK, nice

Best regards,
George Markomanolis
