
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Number of processes and spawn
From: Federico Golfrè Andreasi (federico.golfre_at_[hidden])
Date: 2011-03-08 04:42:00


Hi Ralph,

I've done some more tests; I hope this can help.

*Using OpenMPI-1.5*

- The program works correctly, doing a multiple spawn, up to 128 CPUs.
- When spawning onto more than 128 CPUs, it hangs during the spawn.
  I've discovered that just before the spawn, all the processes lying on
one node go down.
  I've tried to eliminate those nodes from the hostfile, but the same
behaviour then occurs on other nodes.

I've attached the output log files.

*Using OpenMPI-1.7a1r24472*

- The program works correctly with more than 128 CPUs.
- Sometimes (not with the same number of processes), after the program ends (it
prints **** THE SLAVE END ****), the orted daemon is not released.
  None of the master/slave programs appear in top, but I can find
an mpiexec process on the launching node and 1 orted process on every compute
node.
- Sometimes (not with the same number of processes) during the spawn it prints
a warning message of ORTE_ERROR_LOG (I've attached this file as well).

Let me know if I can do some more tests that could help,
or if I should check some environment settings or hardware.

Thank you,
Federico.

On March 7, 2011 at 15:24, Ralph Castain <rhc_at_[hidden]> wrote:

>
> On Mar 7, 2011, at 3:24 AM, Federico Golfrè Andreasi wrote:
>
> Hi Ralph,
>
> thank you very much for the detailed response.
>
> I apologize for not being clear: I would like to use the
> MPI_Comm_spawn_multiple function.
>
>
> Shouldn't matter - it's the same code path.
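For reference, a minimal sketch of an MPI_Comm_spawn_multiple call; the executable name "./slave" and the per-command process counts are illustrative assumptions, not Federico's attached program:

```c
/* Minimal MPI_Comm_spawn_multiple sketch. The "./slave" name and the
 * process counts below are illustrative assumptions. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Two spawn "commands", each launching 2 copies of the slave. */
    char *cmds[2]     = { "./slave", "./slave" };
    int   maxprocs[2] = { 2, 2 };
    MPI_Info infos[2] = { MPI_INFO_NULL, MPI_INFO_NULL };
    int   errcodes[4];   /* one entry per spawned process (2 + 2) */
    MPI_Comm intercomm;

    MPI_Comm_spawn_multiple(2, cmds, MPI_ARGVS_NULL, maxprocs, infos,
                            0 /* root */, MPI_COMM_WORLD,
                            &intercomm, errcodes);

    /* The parent communicates with the children over the
     * inter-communicator, then disconnects. */
    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}
```

This follows the same code path as MPI_Comm_spawn, which is why Ralph says the single-command test exercises it as well.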
>
> (I've attached the example program I use) .
>
>
> I'm rebuilding for C++ as I don't typically use that language - will report
> back later.
>
>
> In any case I tried your test program, just compiling it with:
> /home/fandreasi/openmpi-1.7/bin/mpicc loop_spawn.c -o loop_spawn
> /home/fandreasi/openmpi-1.7/bin/mpicc loop_child.c -o loop_child
> and executing it on a single machine with
> /home/fandreasi/openmpi-1.7/bin/mpiexec ./loop_spawn ./loop_child
>
>
> I should have been clearer - this is not the correct way to run the
> program. The correct way is:
>
> mpiexec -n 1 ./loop_spawn
>
> loop_child is just the executable being comm_spawn'd.
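For context, the child side of a comm_spawn typically looks like the sketch below; it is based on the description of loop_child.c in this thread (the real file is in orte/test/mpi), not a copy of it:

```c
/* Sketch of a comm_spawn'd child: it retrieves the inter-communicator
 * to its parent and disconnects before finalizing, which is what lets
 * each loop iteration in the parent complete. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Comm parent;
    MPI_Comm_get_parent(&parent);

    if (parent != MPI_COMM_NULL) {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("Child %d: exiting\n", rank);
        MPI_Comm_disconnect(&parent);  /* unblocks the parent's wait */
    }

    MPI_Finalize();
    return 0;
}
```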
>
> but it hangs at different loop iterations after printing:
> "Child 26833: exiting"
> although, looking at top, both processes (loop_spawn and loop_child) are
> still alive.
>
> I'm starting to think that some of my environment settings are not correct, or
> that I need to compile OpenMPI with some options.
> I compile it just passing the --prefix option to ./configure.
> Do I need to do anything else?
>
>
> No, that should work.
>
>
> I have a Linux CentOS 4, 64-bit machine,
> with gcc 3.4.
>
> I think that this is my main problem now.
>
>
>
> Just to answer the other (minor) topics:
> - Regarding the version mismatch: I use a Linux cluster where the /home/
> directory is shared among the compute nodes,
> and I've edited my .bashrc and .bash_profile to export the correct
> LD_LIBRARY_PATH.
> - Thank you for the useful trick about svn.
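The .bashrc/.bash_profile edits described above would look something like this; the install prefix is taken from the mpicc paths quoted earlier in the thread:

```shell
# Sketch of the environment settings described above; the install
# prefix matches the compile commands quoted earlier in this thread.
export PATH=/home/fandreasi/openmpi-1.7/bin:$PATH
export LD_LIBRARY_PATH=/home/fandreasi/openmpi-1.7/lib:$LD_LIBRARY_PATH
```

Note Ralph's point below: Open MPI does not forward LD_LIBRARY_PATH, so these settings must take effect on every remote node, including for non-interactive shells.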
>
>
> No idea, then - all that error says is that the receiving code and the
> sending code are mismatched.
>
>
>
> Thank you very much!
> Federico.
>
>
>
>
>
>
> On March 5, 2011 at 19:05, Ralph Castain <rhc_at_[hidden]>
> wrote:
>
>> Hi Federico
>>
>> I tested the trunk today and it works fine for me - I let it spin for 1000
>> cycles without issue. My test program is essentially identical to what you
>> describe - you can see it in the orte/test/mpi directory. The "master" is
>> loop_spawn.c, and the "slave" is loop_child.c. I only tested it on a single
>> machine, though - will have to test multi-machine later. You might see if
>> that makes a difference.
>>
>> The error you report in your attachment is a classic symptom of mismatched
>> versions. Remember, we don't forward your ld_lib_path, so it has to be
>> correct on your remote machine.
>>
>> As for r22794 - we don't keep anything that old on our web site. If you
>> want to build it, the best way to get the code is to do a subversion
>> checkout of the developer's trunk at that revision level:
>>
>> svn co -r 22794 http://svn.open-mpi.org/svn/ompi/trunk
>>
>> Remember to run autogen before configure.
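Put together, the checkout-and-build sequence Ralph describes would be along these lines; the directory names and install prefix are illustrative choices:

```shell
# Check out the trunk at the requested revision and build it.
# The prefix below is an assumption; pick any writable directory.
svn co -r 22794 http://svn.open-mpi.org/svn/ompi/trunk ompi-trunk-r22794
cd ompi-trunk-r22794
./autogen.sh                        # regenerates configure from the repo sources
./configure --prefix=$HOME/ompi-r22794
make all install
```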
>>
>>
>> On Mar 4, 2011, at 4:43 AM, Federico Golfrè Andreasi wrote:
>>
>>
>> Hi Ralph,
>>
>> I'm getting stuck with spawning stuff,
>>
>> I've downloaded the snapshot from the trunk of 1st of March (
>> openmpi-1.7a1r24472.tar.bz2),
>> I'm testing using a small program that does the following:
>> - the master program starts and each rank prints its hostname
>> - the master program spawns a slave program with the same size
>> - each rank of the slave (spawned) program prints its hostname
>> - end
>> It is not always able to complete the program run; I see two different behaviours:
>> 1. not all the slaves print their hostname and the program ends suddenly
>> 2. both programs end correctly, but the orted daemon is still alive and I need
>> to press Ctrl-C to exit
>>
>>
>> I've tried to recompile my test program with a previous snapshot
>> (openmpi-1.7a1r22794.tar.bz2),
>> of which I only have the compiled version of OpenMPI (on another machine).
>> It gives me an error before starting (I've attached it).
>> Browsing the FAQ I found some tips, and I verified that I compile the program
>> with the correct OpenMPI version
>> and that the LD_LIBRARY_PATH is consistent.
>> So I would like to re-compile openmpi-1.7a1r22794.tar.bz2, but where
>> can I find it?
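A quick way to check the version and path consistency mentioned above is to run a few diagnostic commands on the head node and on a compute node (e.g. via ssh) and compare the output; "loop_spawn" here is the test binary from earlier in the thread:

```shell
# Diagnostic commands for version-mismatch problems; run them on every
# node involved and compare the output.
which mpicc mpiexec
ompi_info | head -n 3            # reports the Open MPI build actually in PATH
echo "$LD_LIBRARY_PATH"
ldd ./loop_spawn | grep -i mpi   # shows which libmpi the binary will load
```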
>>
>>
>> Thank you,
>> Federico
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On February 23, 2011 at 03:43, Ralph Castain <rhc.openmpi_at_[hidden]> wrote:
>>
>>> Apparently not. I will investigate when I return from vacation next week.
>>>
>>>
>>> Sent from my iPad
>>>
>>> On Feb 22, 2011, at 12:42 AM, Federico Golfrè Andreasi <
>>> federico.golfre_at_[hidden]> wrote:
>>>
>>> Hi Ralph,
>>>
>>> I've tested spawning with the OpenMPI 1.5 release, but that fix is not
>>> there.
>>> Are you sure you added it?
>>>
>>> Thank you,
>>> Federico
>>>
>>>
>>>
>>> 2010/10/19 Ralph Castain <rhc_at_[hidden]>
>>>
>>>> The fix should be there - just didn't get mentioned.
>>>>
>>>> Let me know if it isn't and I'll ensure it is in the next one...but I'd
>>>> be very surprised if it isn't already in there.
>>>>
>>>>
>>>> On Oct 19, 2010, at 3:03 AM, Federico Golfrè Andreasi wrote:
>>>>
>>>> Hi Ralph!
>>>>
>>>> I saw that the new release 1.5 is out.
>>>> I didn't find this fix in the "list of changes"; is it present but not
>>>> mentioned, since it is a minor fix?
>>>>
>>>> Thank you,
>>>> Federico
>>>>
>>>>
>>>>
>>>> 2010/4/1 Ralph Castain <rhc_at_[hidden]>
>>>>
>>>>> Hi there!
>>>>>
>>>>> It will be in the 1.5.0 release, but not 1.4.2 (couldn't backport the
>>>>> fix). I understand that will come out sometime soon, but no firm date has
>>>>> been set.
>>>>>
>>>>>
>>>>> On Apr 1, 2010, at 4:05 AM, Federico Golfrè Andreasi wrote:
>>>>>
>>>>> Hi Ralph,
>>>>>
>>>>>
>>>>> I've downloaded and tested the openmpi-1.7a1r22817 snapshot,
>>>>> and it works fine for (multiple) spawning more than 128 processes.
>>>>>
>>>>> That fix will be included in the next release of OpenMPI, right?
>>>>> Do you know when it will be released? Or where can I find that info?
>>>>>
>>>>> Thank you,
>>>>> Federico
>>>>>
>>>>>
>>>>>
>>>>> 2010/3/1 Ralph Castain <rhc_at_[hidden]>
>>>>>
>>>>>> http://www.open-mpi.org/nightly/trunk/
>>>>>>
>>>>>> I'm not sure this patch will solve your problem, but it is worth a
>>>>>> try.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> users_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>> <OpenMPI.error>
>>
>>
>>
> <master.cpp><slave.cpp>
>
>
>