From: Josh Hursey (jjhursey_at_[hidden])
Date: 2006-08-30 14:38:23


This fixes the hanging and gets me running (and passing) some/most of
the tests [Trivial and ibm]. Yay!

I have a 16 processor job running on Odin at the moment that seems to
be going well so far.

Thanks for your help.

Want me to file a bug about the tcsh problem below?

-- Josh

On Aug 30, 2006, at 2:30 PM, Jeff Squyres wrote:

> Bah!
>
> This is the result of perl expanding $? to 0 -- it seems that I need to
> escape $? so that it's not output as 0.
>
> Sorry about that!
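
The same trap can be reproduced in plain sh, where $? is also a special variable (the last exit status). The sketch below is only an analogy: MTT itself is written in Perl, and the function names gen_unescaped and gen_escaped are made up for illustration. It shows how an unescaped $? inside a generated csh line turns into a literal 0, and how escaping the dollar sign keeps it intact for csh to interpret.

```shell
#!/bin/sh
# Hypothetical generator functions (not MTT code; MTT is Perl).

# Buggy variant: inside an unquoted heredoc the shell expands $?
# (the last exit status) before the text is written out.
gen_unescaped() {
    true   # make the last exit status a predictable 0
    cat <<EOF
if ($?LD_LIBRARY_PATH == 0) then
EOF
}

# Fixed variant: escaping the dollar sign leaves $? in the output,
# so csh/tcsh sees its own "is this variable set?" test.
gen_escaped() {
    cat <<EOF
if (\$?LD_LIBRARY_PATH == 0) then
EOF
}
```

Calling gen_unescaped prints the broken "0LD_LIBRARY_PATH" form seen in the dropped csh file, while gen_escaped prints the intended "$?LD_LIBRARY_PATH" test.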
>
> So is this just for the sourcing files, or for your overall (hanging)
> problems?
>
>
> On 8/30/06 2:28 PM, "Josh Hursey" <jjhursey_at_[hidden]> wrote:
>
>> So here are the results of my exploration. I have things running now.
>> The problem was that the user that I am running under does not set
>> the LD_LIBRARY_PATH variable at any point. So when MTT tries to
>> export the variable it does:
>> if (0LD_LIBRARY_PATH == 0) then
>>     setenv LD_LIBRARY_PATH /san/<rest_of_path>/install/lib
>> else
>>     setenv LD_LIBRARY_PATH /san/<rest_of_path>/install/lib:$LD_LIBRARY_PATH
>> endif
>>
>> So this causes tcsh to emit an error that LD_LIBRARY_PATH is not
>> defined, and the variable never gets set because of that error.
>>
>> I fixed this by always declaring it in the .cshrc file to "". However
>> MTT could do a sanity check before trying to check the value to see
>> if it is defined. Something like:
>>
>> <code>
>> if ($?LD_LIBRARY_PATH) then
>> else
>>     setenv LD_LIBRARY_PATH ""
>> endif
>>
>> if (0LD_LIBRARY_PATH == 0) then
>>     setenv LD_LIBRARY_PATH /san/<rest_of_path>/install/lib
>> else
>>     setenv LD_LIBRARY_PATH /san/<rest_of_path>/install/lib:$LD_LIBRARY_PATH
>> endif
>> </code>
>>
>> or something of the sort.
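
For the sh flavor of the vars file, the same "is it defined?" sanity check can be folded into a single parameter expansion. This is a minimal sketch, not what MTT actually emits: the function name prepend_lib_dir is hypothetical, and the directory passed in stands for whatever install/lib path applies.

```shell
#!/bin/sh
# Hypothetical helper: put a directory first on LD_LIBRARY_PATH,
# whether or not the variable was set beforehand.
prepend_lib_dir() {
    # ${VAR:+:$VAR} expands to ":<old value>" only when VAR is set and
    # non-empty, so no stray trailing colon appears when it is unset.
    LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export LD_LIBRARY_PATH
}
```

With this form there is no separate branch for the undefined case, so an account that never sets LD_LIBRARY_PATH behaves the same as one that does.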
>>
>> As another note, could we start a "How to debug MTT" Wiki page with
>> some of the information that Jeff sent in this message regarding the
>> dumping of env vars? I think that would be helpful when getting
>> things started.
>>
>> Thanks for all your help, I'm sure I'll have more questions in the
>> near future.
>>
>> Cheers,
>> Josh
>>
>>
>> On Aug 30, 2006, at 12:31 PM, Jeff Squyres wrote:
>>
>>> On 8/30/06 12:10 PM, "Josh Hursey" <jjhursey_at_[hidden]> wrote:
>>>
>>>>> MTT directly sets environment variables in its own environment (via
>>>>> $ENV{whatever} = "foo") before using fork/exec to launch compiles and
>>>>> runs. Hence, the forked children inherit the environment variables that
>>>>> we set (e.g., PATH and LD_LIBRARY_PATH).
>>>>>
>>>>> So if you source the env vars files that MTT drops, that should be
>>>>> sufficient.
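
The inheritance described here (set the variables in the parent, then let the child pick them up) can be sketched in sh as well. This is an illustration under assumptions, not MTT's Perl code: run_with_install is a hypothetical name, and the prefix argument stands for an install directory with bin/ and lib/ beneath it.

```shell
#!/bin/sh
# Hypothetical helper: run a command with PATH and LD_LIBRARY_PATH
# pointing at a given install prefix.
run_with_install() {
    (
        # The subshell keeps the caller's environment untouched.
        prefix=$1
        shift
        PATH="$prefix/bin:$PATH"
        LD_LIBRARY_PATH="$prefix/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
        export PATH LD_LIBRARY_PATH
        # The child process inherits the exported variables.
        "$@"
    )
}
```

For example, run_with_install <prefix> sh -c 'echo "$LD_LIBRARY_PATH"' prints a value that starts with <prefix>/lib, while the calling shell's own variables are unchanged.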
>>>>
>>>> Does it drop them to a file, or is it printed in the debugging
>>>> output
>>>> anywhere? I'm having a bit of trouble finding these strings in the
>>>> output.
>>>
>>> It does not put these in the -debug output.
>>>
>>> The files that it drops are in the scratch dir. You'll need to go into
>>> scratch/installs, and then it depends on what your INI file section names
>>> are. You'll go to:
>>>
>>> <scratch>/installs/<mpi get>/<mpi install>/<mpi_version>/
>>>
>>> And there should be files named "mpi_installed_vars.[csh|sh]" that you can
>>> source, depending on your shell. It should set PATH and LD_LIBRARY_PATH.
>>>
>>> The intent of these files is for exactly this purpose -- for a human to
>>> test borked MPI installs inside the MTT scratch tree.
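
For sh/bash users, picking up the dropped file could look like the sketch below. Only the file name mpi_installed_vars.sh comes from the message above; the function name source_mtt_vars is hypothetical, and the directory argument is whatever <scratch>/installs/... path matches your INI section names.

```shell
#!/bin/sh
# Hypothetical helper: source MTT's sh vars file from a given install
# directory, failing loudly if the file is not there.
source_mtt_vars() {
    vars_file="$1/mpi_installed_vars.sh"
    if [ ! -r "$vars_file" ]; then
        echo "cannot read $vars_file" >&2
        return 1
    fi
    # Source (not execute) so the settings land in the current shell.
    . "$vars_file"
}
```

After a successful call, commands such as "which mpirun" or "echo $LD_LIBRARY_PATH" in the same shell show what that install will actually use.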
>>>
>>>>>
>>>>> As for setting the values on *remote* nodes, we do it solely via the
>>>>> --prefix option. I wonder if --prefix is broken under SLURM...? That might
>>>>> be something to check -- you might be inadvertently mixing installations of
>>>>> OMPI...?
>>>>
>>>> Yep I'll check it out.
>>>>
>>>> Cheers,
>>>> Josh
>>>>
>>>>>
>>>>>
>>>>> On 8/30/06 10:36 AM, "Josh Hursey" <jjhursey_at_[hidden]> wrote:
>>>>>
>>>>>> I'm trying to replicate the MTT environment as much as possible,
>>>>>> and
>>>>>> have a couple of questions.
>>>>>>
>>>>>> Assume there is no mpirun in my PATH/LD_LIBRARY_PATH when I start
>>>>>> MTT. After MTT builds Open MPI, how does it export these
>>>>>> variables so
>>>>>> that it can build the tests? How does it export these when it
>>>>>> runs
>>>>>> those tests (solely via --prefix)?
>>>>>>
>>>>>> Cheers,
>>>>>> josh
>>>>>>
>>>>>> On Aug 30, 2006, at 10:25 AM, Josh Hursey wrote:
>>>>>>
>>>>>>> I already tried that. However I'm trying it in a couple
>>>>>>> different
>>>>>>> ways and getting some mixed results. Let me formulate the error
>>>>>>> cases
>>>>>>> and get back to you.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Josh
>>>>>>>
>>>>>>> On Aug 30, 2006, at 10:17 AM, Ralph H Castain wrote:
>>>>>>>
>>>>>>>> Well, why don't you try first separating this from MTT? Just run the
>>>>>>>> command manually in batch mode and see if it works. If that works,
>>>>>>>> then the problem is with MTT. Otherwise, we have a problem with
>>>>>>>> notification.
>>>>>>>>
>>>>>>>> Or are you saying that you have already done this?
>>>>>>>> Ralph
>>>>>>>>
>>>>>>>>
>>>>>>>> On 8/30/06 8:03 AM, "Josh Hursey" <jjhursey_at_[hidden]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Yet another point (sorry for the spam). This may not be an MTT issue
>>>>>>>>> but a broken ORTE on the trunk :(
>>>>>>>>>
>>>>>>>>> When I try to run in an allocation (srun -N 16 -A) things run fine.
>>>>>>>>> But if I try to run in batch mode (srun -N 16 -b myscript.sh) then I
>>>>>>>>> see the same hang as in MTT. It seems that mpirun is not getting
>>>>>>>>> properly notified of the completion of the job. :(
>>>>>>>>>
>>>>>>>>> I'll try to investigate a bit further today. Any thoughts on
>>>>>>>>> what
>>>>>>>>> might be causing this?
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Josh
>>>>>>>>>
>>>>>>>>> On Aug 30, 2006, at 9:54 AM, Josh Hursey wrote:
>>>>>>>>>
>>>>>>>>>> Forgot this bit in my mail. With the mpirun just hanging out there I
>>>>>>>>>> attached GDB and got the following stack trace:
>>>>>>>>>> (gdb) bt
>>>>>>>>>> #0 0x0000003d1b9bd1af in poll () from /lib64/tls/libc.so.6
>>>>>>>>>> #1 0x0000002a956e6389 in opal_poll_dispatch (base=0x5136d0, arg=0x513730, tv=0x7fbfffee70) at poll.c:191
>>>>>>>>>> #2 0x0000002a956e28b6 in opal_event_base_loop (base=0x5136d0, flags=5) at event.c:584
>>>>>>>>>> #3 0x0000002a956e26b7 in opal_event_loop (flags=5) at event.c:514
>>>>>>>>>> #4 0x0000002a956db7c7 in opal_progress () at runtime/opal_progress.c:259
>>>>>>>>>> #5 0x000000000040334c in opal_condition_wait (c=0x509650, m=0x509600) at ../../../opal/threads/condition.h:81
>>>>>>>>>> #6 0x0000000000402f52 in orterun (argc=9, argv=0x7fbffff0b8) at orterun.c:444
>>>>>>>>>> #7 0x00000000004028a3 in main (argc=9, argv=0x7fbffff0b8) at main.c:13
>>>>>>>>>>
>>>>>>>>>> Seems that mpirun is waiting for things to complete :/
>>>>>>>>>>
>>>>>>>>>> On Aug 30, 2006, at 9:53 AM, Josh Hursey wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Aug 30, 2006, at 7:19 AM, Jeff Squyres wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 8/29/06 8:57 PM, "Josh Hursey" <jjhursey_at_[hidden]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Does this apply to *all* tests, or only some of the tests
>>>>>>>>>>>>> (like
>>>>>>>>>>>>> allgather)?
>>>>>>>>>>>>>
>>>>>>>>>>>>> All of the tests: Trivial and ibm. They all timeout :(
>>>>>>>>>>>>
>>>>>>>>>>>> Blah. The trivial tests are simply "hello world", so they should
>>>>>>>>>>>> take just about no time at all.
>>>>>>>>>>>>
>>>>>>>>>>>> Is this running under SLURM? I put the code in there to know how
>>>>>>>>>>>> many procs to use in SLURM but have not tested it in eons. I doubt
>>>>>>>>>>>> that's the problem, but that's one thing to check.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Yep, it is in SLURM, and it seems that the 'number of procs' code is
>>>>>>>>>>> working fine, as it changes with different allocations.
>>>>>>>>>>>
>>>>>>>>>>>> Can you set a super-long timeout (e.g., a few minutes), and while
>>>>>>>>>>>> one of the trivial tests is running, do some ps's on the relevant
>>>>>>>>>>>> nodes and see what, if anything, is running? E.g., mpirun, the test
>>>>>>>>>>>> executable on the nodes, etc.
>>>>>>>>>>>
>>>>>>>>>>> Even without setting a long timeout, it seems that mpirun is
>>>>>>>>>>> running, but nothing else, and only on the launching node.
>>>>>>>>>>>
>>>>>>>>>>> When a test starts you see the mpirun launching properly:
>>>>>>>>>>> $ ps aux | grep ...
>>>>>>>>>>> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
>>>>>>>>>>> mpiteam 15117 0.5 0.8 113024 33680 ? S 09:32 0:06 perl ./client/mtt --debug --scratch /u/mpiteam/tmp/mtt-scratch --file /u/mpiteam/local/etc/ompi-iu-odin-core.ini --verbose --print-time
>>>>>>>>>>> mpiteam 15294 0.0 0.0 0 0 ? Z 09:32 0:00 [sh] <defunct>
>>>>>>>>>>> mpiteam 28453 0.2 0.0 38072 3536 ? S 09:50 0:00 mpirun -mca btl tcp,self -np 32 --prefix /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/install collective/allgather_in_place
>>>>>>>>>>> mpiteam 28454 0.0 0.0 41716 2040 ? Sl 09:50 0:00 srun --nodes=16 --ntasks=16 --nodelist=odin022,odin021,odin020,odin019,odin018,odin017,odin016,odin015,odin014,odin013,odin012,odin011,odin010,odin009,odin008,odin007 orted --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1 --num_procs 16 --vpid_start 0 --universe mpiteam_at_[hidden]:default-universe-28453 --nsreplica "0.0.0;tcp://129.79.240.107:40904" --gprreplica "0.0.0;tcp://129.79.240.107:40904"
>>>>>>>>>>> mpiteam 28455 0.0 0.0 23212 1804 ? Ssl 09:50 0:00 srun --nodes=16 --ntasks=16 --nodelist=odin022,odin021,odin020,odin019,odin018,odin017,odin016,odin015,odin014,odin013,odin012,odin011,odin010,odin009,odin008,odin007 orted --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1 --num_procs 16 --vpid_start 0 --universe mpiteam_at_[hidden]:default-universe-28453 --nsreplica "0.0.0;tcp://129.79.240.107:40904" --gprreplica "0.0.0;tcp://129.79.240.107:40904"
>>>>>>>>>>> mpiteam 28472 0.0 0.0 36956 2256 ? S 09:50 0:00 /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/install/bin/orted --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1 --num_procs 16 --vpid_start 0 --universe mpiteam_at_[hidden]:default-universe-28453 --nsreplica "0.0.0;tcp://129.79.240.107:40904" --gprreplica "0.0.0;tcp://129.79.240.107:40904"
>>>>>>>>>>> mpiteam 28482 0.1 0.0 64296 3564 ? S 09:50 0:00 collective/allgather_in_place
>>>>>>>>>>> mpiteam 28483 0.1 0.0 64296 3564 ? S 09:50 0:00 collective/allgather_in_place
>>>>>>>>>>>
>>>>>>>>>>> But once the test finishes, mpirun seems to just be hanging out.
>>>>>>>>>>> $ ps aux | grep ...
>>>>>>>>>>> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
>>>>>>>>>>> mpiteam 15083 0.0 0.0 52760 1040 ? S 09:31 0:00 /bin/bash /var/tmp/slurmd/job148126/script
>>>>>>>>>>> root 15086 0.0 0.0 42884 3172 ? Ss 09:31 0:00 sshd: mpiteam [priv]
>>>>>>>>>>> mpiteam 15088 0.0 0.0 43012 3252 ? S 09:31 0:00 sshd: mpiteam_at_pts/1
>>>>>>>>>>> mpiteam 15089 0.0 0.0 56680 1912 pts/1 Ss 09:31 0:00 -tcsh
>>>>>>>>>>> mpiteam 15117 0.5 0.8 113024 33680 ? S 09:32 0:06 perl ./client/mtt --debug --scratch /u/mpiteam/tmp/mtt-scratch --file /u/mpiteam/local/etc/ompi-iu-odin-core.ini --verbose --print-time
>>>>>>>>>>> mpiteam 15294 0.0 0.0 0 0 ? Z 09:32 0:00 [sh] <defunct>
>>>>>>>>>>> mpiteam 28453 0.0 0.0 38204 3568 ? S 09:50 0:00 mpirun -mca btl tcp,self -np 32 --prefix /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/install collective/allgather_in_place
>>>>>>>>>>>
>>>>>>>>>>> Thoughts?
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Jeff Squyres
>>>>>>>>>>>> Server Virtualization Business Unit
>>>>>>>>>>>> Cisco Systems
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> mtt-users mailing list
>>>>>>>>>>> mtt-users_at_[hidden]
>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/mtt-users
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
>
> --
> Jeff Squyres
> Server Virtualization Business Unit
> Cisco Systems