
MTT Devel Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2006-08-30 12:31:29


On 8/30/06 12:10 PM, "Josh Hursey" <jjhursey_at_[hidden]> wrote:

>> MTT directly sets environment variables in its own environment (via
>> $ENV{whatever} = "foo") before using fork/exec to launch compiles and
>> runs. Hence, the forked children inherit the environment variables that
>> we set (e.g., PATH and LD_LIBRARY_PATH).
>>
>> So if you source the env vars files that MTT drops, that should be
>> sufficient.
>
> Does it drop them to a file, or is it printed in the debugging output
> anywhere? I'm having a bit of trouble finding these strings in the
> output.

It does not put these in the --debug output.

The files that it drops are in the scratch dir. You'll need to go into
<scratch>/installs, and the exact subdirectories depend on what your INI
file section names are. You'll go to:

    <scratch>/installs/<mpi get>/<mpi install>/<mpi_version>/

There should be files there named "mpi_installed_vars.[csh|sh]" that you
can source, depending on your shell. They should set PATH and
LD_LIBRARY_PATH.

The intent of these files is for exactly this purpose -- for a human to test
borked MPI installs inside the MTT scratch tree.
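As a minimal sketch of how you'd use one of these files (the file contents
and install prefix below are illustrative stand-ins, not MTT's exact
output), sourcing the sh-flavor file just prepends the install's bin/ and
lib/ directories:

```shell
# Illustrative stand-in for MTT's dropped mpi_installed_vars.sh; the
# install prefix here is hypothetical -- substitute your own scratch tree.
cat > /tmp/mpi_installed_vars.sh <<'EOF'
MPI_ROOT=/tmp/hypothetical-scratch/installs/ompi/install
export PATH="$MPI_ROOT/bin:$PATH"
export LD_LIBRARY_PATH="$MPI_ROOT/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
EOF

# Source it and confirm the install's bin/ dir now leads PATH.
. /tmp/mpi_installed_vars.sh
echo "$PATH" | cut -d: -f1
# -> /tmp/hypothetical-scratch/installs/ompi/install/bin
```

After sourcing the real file, `which mpirun` should resolve inside the MTT
scratch tree; if it doesn't, some other installation is ahead of it on your
PATH.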

>>
>> As for setting the values on *remote* nodes, we do it solely via the
>> --prefix option. I wonder if --prefix is broken under SLURM...? That
>> might be something to check -- you might be inadvertently mixing
>> installations of OMPI...?
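[A hand-rolled sketch of what --prefix amounts to on a remote node -- the
prefix path below is hypothetical -- is prepending the install's bin/ and
lib/ directories before the daemon is launched there:]

```shell
# What mpirun --prefix <dir> effectively arranges on each remote node
# before starting its daemon: put <dir>/bin on PATH and <dir>/lib on
# LD_LIBRARY_PATH. The prefix here is hypothetical, for illustration only.
prefix=/tmp/hypothetical-ompi/install
remote_path="$prefix/bin:$PATH"
remote_ldpath="$prefix/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

echo "$remote_path" | cut -d: -f1     # -> /tmp/hypothetical-ompi/install/bin
echo "$remote_ldpath" | cut -d: -f1   # -> /tmp/hypothetical-ompi/install/lib
```

[If a node already has a different OMPI earlier on its PATH, skipping
--prefix (or getting it wrong) is exactly how installations get mixed.]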
>
> Yep I'll check it out.
>
> Cheers,
> Josh
>
>>
>>
>> On 8/30/06 10:36 AM, "Josh Hursey" <jjhursey_at_[hidden]> wrote:
>>
>>> I'm trying to replicate the MTT environment as much as possible, and
>>> have a couple of questions.
>>>
>>> Assume there is no mpirun in my PATH/LD_LIBRARY_PATH when I start
>>> MTT. After MTT builds Open MPI, how does it export these variables so
>>> that it can build the tests? How does it export these when it runs
>>> those tests (solely via --prefix)?
>>>
>>> Cheers,
>>> josh
>>>
>>> On Aug 30, 2006, at 10:25 AM, Josh Hursey wrote:
>>>
>>>> I already tried that. However, I'm trying it in a couple different
>>>> ways and getting some mixed results. Let me formulate the error cases
>>>> and get back to you.
>>>>
>>>> Cheers,
>>>> Josh
>>>>
>>>> On Aug 30, 2006, at 10:17 AM, Ralph H Castain wrote:
>>>>
>>>>> Well, why don't you try first separating this from MTT? Just run the
>>>>> command manually in batch mode and see if it works. If that works,
>>>>> then the problem is with MTT. Otherwise, we have a problem with
>>>>> notification.
>>>>>
>>>>> Or are you saying that you have already done this?
>>>>> Ralph
>>>>>
>>>>>
>>>>> On 8/30/06 8:03 AM, "Josh Hursey" <jjhursey_at_[hidden]> wrote:
>>>>>
>>>>>> yet another point (sorry for the spam). This may not be an MTT
>>>>>> issue but a broken ORTE on the trunk :(
>>>>>>
>>>>>> When I try to run in an allocation (srun -N 16 -A), things run fine.
>>>>>> But if I try to run in batch mode (srun -N 16 -b myscript.sh), then
>>>>>> I see the same hang as in MTT. Seems that mpirun is not getting
>>>>>> properly notified of the completion of the job. :(
>>>>>>
>>>>>> I'll try to investigate a bit further today. Any thoughts on what
>>>>>> might be causing this?
>>>>>>
>>>>>> Cheers,
>>>>>> Josh
>>>>>>
>>>>>> On Aug 30, 2006, at 9:54 AM, Josh Hursey wrote:
>>>>>>
>>>>>>> forgot this bit in my mail. With the mpirun just hanging out
>>>>>>> there, I attached GDB and got the following stack trace:
>>>>>>> (gdb) bt
>>>>>>> #0 0x0000003d1b9bd1af in poll () from /lib64/tls/libc.so.6
>>>>>>> #1 0x0000002a956e6389 in opal_poll_dispatch (base=0x5136d0, arg=0x513730, tv=0x7fbfffee70) at poll.c:191
>>>>>>> #2 0x0000002a956e28b6 in opal_event_base_loop (base=0x5136d0, flags=5) at event.c:584
>>>>>>> #3 0x0000002a956e26b7 in opal_event_loop (flags=5) at event.c:514
>>>>>>> #4 0x0000002a956db7c7 in opal_progress () at runtime/opal_progress.c:259
>>>>>>> #5 0x000000000040334c in opal_condition_wait (c=0x509650, m=0x509600) at ../../../opal/threads/condition.h:81
>>>>>>> #6 0x0000000000402f52 in orterun (argc=9, argv=0x7fbffff0b8) at orterun.c:444
>>>>>>> #7 0x00000000004028a3 in main (argc=9, argv=0x7fbffff0b8) at main.c:13
>>>>>>>
>>>>>>> Seems that mpirun is waiting for things to complete :/
>>>>>>>
>>>>>>> On Aug 30, 2006, at 9:53 AM, Josh Hursey wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> On Aug 30, 2006, at 7:19 AM, Jeff Squyres wrote:
>>>>>>>>
>>>>>>>>> On 8/29/06 8:57 PM, "Josh Hursey" <jjhursey_at_[hidden]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>>> Does this apply to *all* tests, or only some of the tests
>>>>>>>>>>> (like allgather)?
>>>>>>>>>>
>>>>>>>>>> All of the tests: Trivial and ibm. They all time out :(
>>>>>>>>>
>>>>>>>>> Blah. The trivial tests are simply "hello world", so they should
>>>>>>>>> take just about no time at all.
>>>>>>>>>
>>>>>>>>> Is this running under SLURM? I put the code in there to know how
>>>>>>>>> many procs to use in SLURM but have not tested it in eons. I
>>>>>>>>> doubt that's the problem, but that's one thing to check.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Yep, it is in SLURM, and it seems that the 'number of procs' code
>>>>>>>> is working fine, as it changes with different allocations.
>>>>>>>>
>>>>>>>>> Can you set a super-long timeout (e.g., a few minutes), and
>>>>>>>>> while one of the trivial tests is running, do some ps's on the
>>>>>>>>> relevant nodes and see what, if anything, is running? E.g.,
>>>>>>>>> mpirun, the test executable on the nodes, etc.
>>>>>>>>
>>>>>>>> Without setting a long timeout, it seems that mpirun is running,
>>>>>>>> but nothing else, and only on the launching node.
>>>>>>>>
>>>>>>>> When a test starts you see the mpirun launching properly:
>>>>>>>> $ ps aux | grep ...
>>>>>>>> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
>>>>>>>> mpiteam 15117 0.5 0.8 113024 33680 ? S 09:32 0:06 perl ./client/mtt --debug --scratch /u/mpiteam/tmp/mtt-scratch --file /u/mpiteam/local/etc/ompi-iu-odin-core.ini --verbose --print-time
>>>>>>>> mpiteam 15294 0.0 0.0 0 0 ? Z 09:32 0:00 [sh] <defunct>
>>>>>>>> mpiteam 28453 0.2 0.0 38072 3536 ? S 09:50 0:00 mpirun -mca btl tcp,self -np 32 --prefix /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/install collective/allgather_in_place
>>>>>>>> mpiteam 28454 0.0 0.0 41716 2040 ? Sl 09:50 0:00 srun --nodes=16 --ntasks=16 --nodelist=odin022,odin021,odin020,odin019,odin018,odin017,odin016,odin015,odin014,odin013,odin012,odin011,odin010,odin009,odin008,odin007 orted --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1 --num_procs 16 --vpid_start 0 --universe mpiteam_at_[hidden]:default-universe-28453 --nsreplica "0.0.0;tcp://129.79.240.107:40904" --gprreplica "0.0.0;tcp://129.79.240.107:40904"
>>>>>>>> mpiteam 28455 0.0 0.0 23212 1804 ? Ssl 09:50 0:00 srun --nodes=16 --ntasks=16 --nodelist=odin022,odin021,odin020,odin019,odin018,odin017,odin016,odin015,odin014,odin013,odin012,odin011,odin010,odin009,odin008,odin007 orted --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1 --num_procs 16 --vpid_start 0 --universe mpiteam_at_[hidden]:default-universe-28453 --nsreplica "0.0.0;tcp://129.79.240.107:40904" --gprreplica "0.0.0;tcp://129.79.240.107:40904"
>>>>>>>> mpiteam 28472 0.0 0.0 36956 2256 ? S 09:50 0:00 /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/install/bin/orted --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1 --num_procs 16 --vpid_start 0 --universe mpiteam_at_[hidden]:default-universe-28453 --nsreplica "0.0.0;tcp://129.79.240.107:40904" --gprreplica "0.0.0;tcp://129.79.240.107:40904"
>>>>>>>> mpiteam 28482 0.1 0.0 64296 3564 ? S 09:50 0:00 collective/allgather_in_place
>>>>>>>> mpiteam 28483 0.1 0.0 64296 3564 ? S 09:50 0:00 collective/allgather_in_place
>>>>>>>>
>>>>>>>> But once the test finishes, mpirun seems to just be hanging out.
>>>>>>>> $ ps aux | grep ...
>>>>>>>> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
>>>>>>>> mpiteam 15083 0.0 0.0 52760 1040 ? S 09:31 0:00 /bin/bash /var/tmp/slurmd/job148126/script
>>>>>>>> root 15086 0.0 0.0 42884 3172 ? Ss 09:31 0:00 sshd: mpiteam [priv]
>>>>>>>> mpiteam 15088 0.0 0.0 43012 3252 ? S 09:31 0:00 sshd: mpiteam_at_pts/1
>>>>>>>> mpiteam 15089 0.0 0.0 56680 1912 pts/1 Ss 09:31 0:00 -tcsh
>>>>>>>> mpiteam 15117 0.5 0.8 113024 33680 ? S 09:32 0:06 perl ./client/mtt --debug --scratch /u/mpiteam/tmp/mtt-scratch --file /u/mpiteam/local/etc/ompi-iu-odin-core.ini --verbose --print-time
>>>>>>>> mpiteam 15294 0.0 0.0 0 0 ? Z 09:32 0:00 [sh] <defunct>
>>>>>>>> mpiteam 28453 0.0 0.0 38204 3568 ? S 09:50 0:00 mpirun -mca btl tcp,self -np 32 --prefix /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/install collective/allgather_in_place
>>>>>>>>
>>>>>>>> Thoughts?
>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> mtt-users mailing list
>>>>>>>> mtt-users_at_[hidden]
>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/mtt-users
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems