Hi,
I'm working with MPI_Comm_spawn and I am getting some error messages.
The code is relatively simple:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <unistd.h>   /* gethostname */
#include <mpi.h>

int main(int argc, char ** argv){
    int i;
    int rank, size, child_rank;
    char nomehost[20];
    MPI_Comm parent, intercomm1, intercomm2;
    int erro;
    int level, curr_level;

    MPI_Init(&argc, &argv);
    level = atoi(argv[1]);
    MPI_Comm_get_parent(&parent);
    if(parent == MPI_COMM_NULL){
        /* root of the tree */
        rank=0;
    }
    else{
        /* spawned process: the parent sends us our rank in the tree */
        MPI_Recv(&rank, 1, MPI_INT, 0, 0, parent, MPI_STATUS_IGNORE);
    }
    curr_level = (int) log2(rank+1);
    printf(" --> rank: %d and curr_level: %d\n", rank, curr_level);

    // Node propagation
    if(curr_level < level){
        // 2^(curr_level+1) - 1 + 2*(rank - (2^curr_level - 1)) = 2*rank + 1
        child_rank = 2*rank + 1;
        printf("(%d) Before create rank %d\n", rank, child_rank);
        MPI_Comm_spawn(argv[0], &argv[1], 1, MPI_INFO_NULL, 0,
                       MPI_COMM_SELF, &intercomm1, &erro);
        printf("(%d) After create rank %d\n", rank, child_rank);
        MPI_Send(&child_rank, 1, MPI_INT, 0, 0, intercomm1);
        //sleep(1);
        child_rank = child_rank + 1;
        printf("(%d) Before create rank %d\n", rank, child_rank);
        MPI_Comm_spawn(argv[0], &argv[1], 1, MPI_INFO_NULL, 0,
                       MPI_COMM_SELF, &intercomm2, &erro);
        printf("(%d) After create rank %d\n", rank, child_rank);
        MPI_Send(&child_rank, 1, MPI_INT, 0, 0, intercomm2);
    }

    gethostname(nomehost, 20);
    printf("(%d) in %s\n", rank, nomehost);
    MPI_Finalize();
    return(0);
}
The program creates a binary tree of processes, spawning children until
it reaches the depth given by the variable "level". If the level is 2,
the tree will be:
       (0)
      /   \
   (1)     (2)
   / \     / \
 (3) (4) (5) (6)
The error messages are (when I use 1 host):
Compiling: mpicc test.c -o test -lm
Running: mpirun -np 1 ./test 3
--> rank: 0 and curr_level: 0
(0) Before create rank 1
(0) After create rank 1
(0) Before create rank 2
--> rank: 1 and curr_level: 1
(1) Before create rank 3
[cacau.ic.uff.br:17892] [[31928,0],0] ORTE_ERROR_LOG: Not found in
file base/plm_base_launch_support.c at line 75
When I use 2 hosts, the error is worse. The code is similar to the one
shown here (I have to set the hosts with MPI_Info_set before spawning).
Using LAM/MPI, the program runs normally.
I think something goes wrong when I call MPI_Comm_spawn twice in a row
and the child processes spawn other processes too.
It seems to be a race condition, because the error does not always
happen (when the level is 2, for example). With 3 levels or more, the
error shows up repeatedly.
A similar error was previously posted in another thread:
http://www.open-mpi.org/community/lists/users/2009/12/11601.php
However, I am using the stable version 1.4.4 and the problem still happens.
Do the developers plan to fix it?
Thanks,
Fernanda