On Mar 7, 2011, at 3:24 AM, Federico Golfrè Andreasi wrote:

Hi Ralph,

thank you very much for the detailed response.

I apologize, I was not clear: I would like to use the MPI_Comm_spawn_multiple function.

Shouldn't matter - it's the same code path.

(I've attached the example program I use.)

I'm rebuilding for C++ as I don't typically use that language - will report back later.


In any case, I tried your test program, just compiling it with:
/home/fandreasi/openmpi-1.7/bin/mpicc loop_spawn.c -o loop_spawn
/home/fandreasi/openmpi-1.7/bin/mpicc loop_child.c -o loop_child
and executing it on a single machine with:
/home/fandreasi/openmpi-1.7/bin/mpiexec ./loop_spawn ./loop_child

I should have been clearer - this is not the correct way to run the program. The correct way is:

mpiexec -n 1 ./loop_spawn

loop_child is just the executable being comm_spawn'd.

but it hangs at different loop iterations after printing:
"Child 26833:exiting"
and, looking at top, both processes (loop_spawn and loop_child) are still alive.

I'm starting to think that some environment setting is incorrect, or that I need to compile OpenMPI with some options.
I compiled it setting only the --prefix option for ./configure.
Do I need to do something else?

No, that should work.


I have a 64-bit Linux CentOS 4 machine,
with gcc 3.4.

I think that this is my main problem now.



Just to answer the other (minor) topics:
- Regarding the version mismatch: I use a Linux cluster where the /home/ directory is shared among the compute nodes,
and I've edited my .bashrc and .bash_profile to export the correct LD_LIBRARY_PATH.
- Thank you for the useful trick about svn.

No idea, then - all that error says is that the receiving code and the sending code are mismatched.



Thank you very much !!!
Federico.






On Mar 5, 2011, at 19:05, Ralph Castain <rhc@open-mpi.org> wrote:
Hi Federico

I tested the trunk today and it works fine for me - I let it spin for 1000 cycles without issue. My test program is essentially identical to what you describe - you can see it in the orte/test/mpi directory. The "master" is loop_spawn.c, and the "slave" is loop_child.c. I only tested it on a single machine, though - will have to test multi-machine later. You might see if that makes a difference.

The error you report in your attachment is a classic symptom of mismatched versions. Remember, we don't forward your LD_LIBRARY_PATH, so it has to be correct on your remote machine.
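
One way to check this on each node is sketched below in C. It is only an illustration and not part of Open MPI: `path_list_contains` is a hypothetical helper that reports whether a directory is one of the colon-separated entries of a path list such as LD_LIBRARY_PATH. The directory /home/fandreasi/openmpi-1.7/lib mentioned after the code is an assumption, derived from the install prefix quoted in this thread plus the usual lib subdirectory.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not an Open MPI function).
 * Returns 1 if `dir` is one of the colon-separated entries of `list`,
 * ignoring a single trailing '/' on either side; returns 0 otherwise,
 * including when either argument is NULL or `dir` is empty. */
int path_list_contains(const char *list, const char *dir)
{
    if (list == NULL || dir == NULL)
        return 0;

    size_t dir_len = strlen(dir);
    if (dir_len > 1 && dir[dir_len - 1] == '/')
        dir_len--;                      /* ignore one trailing slash */
    if (dir_len == 0)
        return 0;

    const char *entry = list;
    for (;;) {
        /* Each entry runs up to the next ':' or the end of the string. */
        const char *end = strchr(entry, ':');
        size_t len = end ? (size_t)(end - entry) : strlen(entry);
        if (len > 1 && entry[len - 1] == '/')
            len--;                      /* ignore one trailing slash */

        if (len == dir_len && strncmp(entry, dir, len) == 0)
            return 1;
        if (end == NULL)
            return 0;                   /* last entry checked, no match */
        entry = end + 1;
    }
}
```

A spawned process could call `path_list_contains(getenv("LD_LIBRARY_PATH"), "/home/fandreasi/openmpi-1.7/lib")` (with `<stdlib.h>` included for `getenv`) and print the result together with its hostname. A 0 on any node would suggest that the environment there does not point at the intended install.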

As for r22794 - we don't keep anything that old on our web site. If you want to build it, the best way to get the code is to do a subversion checkout of the developer's trunk at that revision level:


Remember to run autogen before configure.


On Mar 4, 2011, at 4:43 AM, Federico Golfrè Andreasi wrote:


Hi Ralph,

I'm getting stuck with spawning stuff,

I've downloaded the trunk snapshot from the 1st of March (openmpi-1.7a1r24472.tar.bz2),
and I'm testing with a small program that does the following:
 - the master program starts and each rank prints its hostname
 - the master program spawns a slave program of the same size
 - each rank of the slave (spawned) program prints its hostname
 - end
It is not always able to complete the program run; I see two different behaviours:
 1. not all the slaves print their hostname and the program ends suddenly
 2. both programs end correctly but the orted daemon is still alive and I need to press Ctrl-C to exit


I've tried to recompile my test program with a previous snapshot (openmpi-1.7a1r22794.tar.bz2),
for which I have only the compiled version of OpenMPI (on another machine).
It gives me an error before starting (I've attached it).
Browsing the FAQ I found some tips, and I verified that the program is compiled with the correct OpenMPI version
and that the LD_LIBRARY_PATH is consistent.
So I would like to recompile openmpi-1.7a1r22794.tar.bz2, but where can I find it?


Thank you,
Federico










On Feb 23, 2011, at 03:43, Ralph Castain <rhc.openmpi@gmail.com> wrote:
Apparently not. I will investigate when I return from vacation next week.



On Feb 22, 2011, at 12:42 AM, Federico Golfrè Andreasi <federico.golfre@gmail.com> wrote:

Hi Ralph,

I've tested spawning with the OpenMPI 1.5 release, but that fix is not there.
Are you sure you've added it?

Thank you,
Federico



2010/10/19 Ralph Castain <rhc@open-mpi.org>
The fix should be there - just didn't get mentioned.

Let me know if it isn't and I'll ensure it is in the next one...but I'd be very surprised if it isn't already in there.


On Oct 19, 2010, at 3:03 AM, Federico Golfrè Andreasi wrote:

Hi Ralph!

I saw that the new release 1.5 is out.
I didn't find this fix in the "list of changes". Is it present but not mentioned because it is a minor fix?

Thank you,
Federico



2010/4/1 Ralph Castain <rhc@open-mpi.org>
Hi there!

It will be in the 1.5.0 release, but not 1.4.2 (couldn't backport the fix). I understand that will come out sometime soon, but no firm date has been set.


On Apr 1, 2010, at 4:05 AM, Federico Golfrè Andreasi wrote:

Hi Ralph,


I've downloaded and tested the openmpi-1.7a1r22817 snapshot,
and it works fine for (multiple) spawning of more than 128 processes.

That fix will be included in the next release of OpenMPI, right?
Do you know when it will be released? Or where I can find that info?

Thank you,
Federico



2010/3/1 Ralph Castain <rhc@open-mpi.org>
http://www.open-mpi.org/nightly/trunk/

I'm not sure this patch will solve your problem, but it is worth a try.




_______________________________________________
users mailing list
users@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




<OpenMPI.error>


<master.cpp><slave.cpp>