Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] 1.3 hangs running 2 exes with different names (Ralph Castain)
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-01-25 06:57:26


It took me quite a while, but I have finally traced this back to a bug
in 1.3.0. The confusion arose because the issue was originally reported
as a problem with exes that had different names. That proved incorrect.

The key was your final statement about having both exes available on
all nodes. This is correct: 1.3.0 hangs unless every exe is present on
every node, which is obviously not intended behavior. We didn't pick
this up in our tests because all of our test environments use
NFS-mounted file systems, so the exes are always available on all
nodes.
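
Until the fix is available, one way to avoid the hang on 1.3.0 is to
check before launching that every exe named on the mpirun line exists
on every host. The following is only a minimal Python sketch of that
check, not part of Open MPI. The helper names parse_app_contexts and
missing_executables are hypothetical, the per-host file listings are
typed in by hand from the "ls /tmp" output of Test 2 rather than
gathered over ssh, and each app context is assumed to contain -host and
to end with the exe (no program arguments).

```python
import shlex
from typing import Dict, List, Set, Tuple


def parse_app_contexts(command: str) -> List[Tuple[str, str]]:
    """Return (host, exe) pairs from a colon-separated mpirun command line.

    Assumption: every app context has a -host option and ends with the exe.
    """
    pairs: List[Tuple[str, str]] = []
    context: List[str] = []
    # A trailing ":" is appended so the last app context is flushed too.
    for token in shlex.split(command) + [":"]:
        if token != ":":
            context.append(token)
            continue
        if context:
            host = context[context.index("-host") + 1]
            pairs.append((host, context[-1]))
        context = []
    return pairs


def missing_executables(
    command: str, files_on_host: Dict[str, Set[str]]
) -> List[Tuple[str, str]]:
    """List (host, exe) pairs where an exe of the job is absent from a host."""
    pairs = parse_app_contexts(command)
    hosts = sorted({host for host, _ in pairs})
    exes = sorted({exe for _, exe in pairs})
    return [
        (host, exe)
        for host in hosts
        for exe in exes
        if exe not in files_on_host.get(host, set())
    ]


if __name__ == "__main__":
    # Command line and /tmp contents taken from Test 2 in the quoted report.
    test2_command = (
        "/tmp/openmpi-1.3/bin/mpirun -d -n 1 -host compil03 /tmp/a.out "
        ": -n 1 -host compil02 /tmp/a.out_64"
    )
    test2_files = {
        "compil02": {"/tmp/a.out_64"},
        "compil03": {"/tmp/a.out"},
    }
    for host, exe in missing_executables(test2_command, test2_files):
        print(f"{exe} is missing on {host}")
```

An empty result corresponds to the layout of Test 4, where both exes
are present on both nodes and the job runs.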

It will be fixed in 1.3.1. Thanks for your patience in helping to
track it down!
Ralph

On Jan 23, 2009, at 10:45 AM, Geoffroy Pignot wrote:

> Hi Ralph,
>
> Thanks for taking the time to look into my problem. As you can see, it
> happens when I don't have both exes available on both nodes.
> When they are (Test 4), it works. I don't know whether my particular
> libdir causes the problem, but I'll try on Monday with a more
> classical setup.
>
> I'll keep you informed.
>
> Geoffroy
>
>
>
> Hi Geoffroy
>
> Hmmm....well, I redid my tests to mirror yours, and still cannot
> replicate this problem. I tried it with both slurm and ssh
> environments - no difference in the results.
>
> % make hello
>
> % cp hello hello2
>
> % ls
> hello hello2
>
> % mpirun -n 1 -host odin038 ./hello : -n 1 -host odin039 ./hello2
> Hello World, I am 0 of 2
> Hello World, I am 1 of 2
>
> I have tried a variety of combinations, including giving a fake
> executable as one of the apps, and have not been able to replicate
> your observed behavior. In all cases, it works correctly.
>
> It looks like you are using rsh/ssh as your launch environment. All I
> can advise at this stage is to again check to ensure that
> the .login/.cshrc (or whatever) on your remote nodes isn't setting
> your path to point at another OMPI installation. The fact that you can
> run at all would seem to indicate that things are okay, but I honestly
> have no ideas at this stage as to why you are seeing this behavior.
>
> Sorry I can't be of more help...
> Ralph
>
> On Jan 23, 2009, at 12:57 AM, Geoffroy Pignot wrote:
>
> > Hello
> >
> > I redid a few tests with my hello world; here are my results.
> >
> > First of all, my config:
> > configure --prefix=/tmp/openmpi-1.3 --libdir=/tmp/openmpi-1.3/lib64
> > --enable-heterogeneous
> > You will find my ompi_info -param all all output attached.
> > compil02 and compil03 are identical Rh43 64-bit nodes.
> >
> > Test 1 :
> > compil02% ls /tmp
> > a.out openmpi-1.3
> >
> > compil03% ls /tmp
> > a.out openmpi-1.3
> >
> > /tmp/openmpi-1.3/bin/mpirun -d -n 1 -host compil03 /tmp/a.out : -n 1
> > -host compil02 /tmp/a.out
> > WORKS
> >
> > Test 2 :
> > compil02% mv a.out a.out_64 ; ls /tmp
> > a.out_64 openmpi-1.3
> >
> > compil03% ls /tmp
> > a.out openmpi-1.3
> >
> > compil03% /tmp/openmpi-1.3/bin/mpirun -d -n 1 -host compil03 /tmp/a.out : -n 1
> > -host compil02 /tmp/a.out_64
> > [compil03:03774] procdir: /tmp/openmpi-sessions-
> > gpignot_at_compil03_0/20717/0/0
> > [compil03:03774] jobdir: /tmp/openmpi-sessions-
> > gpignot_at_compil03_0/20717/0
> > [compil03:03774] top: openmpi-sessions-gpignot_at_compil03_0
> > [compil03:03774] tmp: /tmp
> > [compil03:03774] mpirun: reset PATH: /tmp/openmpi-1.3/bin:/u/gpignot/jobmgr/bin:.:/cgg/lv5000/jobmgr/bin:/cgg/lv5000/jobmgr/exec/Linux2.6-x86_64/PIV:/cgg/jobmgr/bin:/cgg/jobmgr/exec/Linux2.6-x86_64/PIV:/cgg/lv5000/bin:/cgg/lv5000/exec/Linux2.6-x86_64/PIV:/cgg/util:/bin:/usr/bin:/usr/sbin:/etc:/usr/etc:/usr/local/bin:/usr/bin/X11:/nfs/softs/TOOLS/bin:/nfs/netapp1/DEVTOOLS/bin:/nfs/netapp1/DEVTOOLS/free/Linux2.6-x86_64/bin:/cgg/localdev:/cgg/Applis/bin
> > [compil03:03774] mpirun: reset LD_LIBRARY_PATH: /tmp/openmpi-1.3/lib64:/tmp/openmpi-1.3/lib64
> > [compil02:10684] procdir: /tmp/openmpi-sessions-
> > gpignot_at_compil02_0/20717/0/1
> > [compil02:10684] jobdir: /tmp/openmpi-sessions-
> > gpignot_at_compil02_0/20717/0
> > [compil02:10684] top: openmpi-sessions-gpignot_at_compil02_0
> > [compil02:10684] tmp: /tmp
> > [compil03:03774] [[20717,0],0] node[0].name compil03 daemon 0 arch
> > ffc91200
> > [compil03:03774] [[20717,0],0] node[1].name compil02 daemon 1 arch
> > ffc91200
> > [compil02:10684] [[20717,0],1] node[0].name compil03 daemon 0 arch
> > ffc91200
> > [compil02:10684] [[20717,0],1] node[1].name compil02 daemon 1 arch
> > ffc91200
> > [compil03:03774] Info: Setting up debugger process table for
> > applications
> > MPIR_being_debugged = 0
> > MPIR_debug_state = 1
> > MPIR_partial_attach_ok = 1
> > MPIR_i_am_starter = 0
> > MPIR_proctable_size = 2
> > MPIR_proctable:
> > (i, host, exe, pid) = (0, compil03, /tmp/a.out, 0)
> > (i, host, exe, pid) = (1, compil02, /tmp/a.out_64, 0)
> >
> > HANGS: both exes have pid 0
> >
> > Test 3 :
> >
> > compil02% cp a.out_64 a.out ; ls /tmp
> > a.out_64 a.out openmpi-1.3
> >
> > compil03% ls /tmp
> > a.out openmpi-1.3
> >
> > [compil03:03777] procdir: /tmp/openmpi-sessions-
> > gpignot_at_compil03_0/20626/0/0
> > [compil03:03777] jobdir: /tmp/openmpi-sessions-
> > gpignot_at_compil03_0/20626/0
> > [compil03:03777] top: openmpi-sessions-gpignot_at_compil03_0
> > [compil03:03777] tmp: /tmp
> > [compil03:03777] mpirun: reset PATH: /tmp/openmpi-1.3/bin:/u/gpignot/jobmgr/bin:.:/cgg/lv5000/jobmgr/bin:/cgg/lv5000/jobmgr/exec/Linux2.6-x86_64/PIV:/cgg/jobmgr/bin:/cgg/jobmgr/exec/Linux2.6-x86_64/PIV:/cgg/lv5000/bin:/cgg/lv5000/exec/Linux2.6-x86_64/PIV:/cgg/util:/bin:/usr/bin:/usr/sbin:/etc:/usr/etc:/usr/local/bin:/usr/bin/X11:/nfs/softs/TOOLS/bin:/nfs/netapp1/DEVTOOLS/bin:/nfs/netapp1/DEVTOOLS/free/Linux2.6-x86_64/bin:/cgg/localdev:/cgg/Applis/bin
> > [compil03:03777] mpirun: reset LD_LIBRARY_PATH: /tmp/openmpi-1.3/lib64:/tmp/openmpi-1.3/lib64
> > [compil02:10786] procdir: /tmp/openmpi-sessions-
> > gpignot_at_compil02_0/20626/0/1
> > [compil02:10786] jobdir: /tmp/openmpi-sessions-
> > gpignot_at_compil02_0/20626/0
> > [compil02:10786] top: openmpi-sessions-gpignot_at_compil02_0
> > [compil02:10786] tmp: /tmp
> > [compil03:03777] [[20626,0],0] node[0].name compil03 daemon 0 arch
> > ffc91200
> > [compil03:03777] [[20626,0],0] node[1].name compil02 daemon 1 arch
> > ffc91200
> > [compil02:10786] [[20626,0],1] node[0].name compil03 daemon 0 arch
> > ffc91200
> > [compil02:10786] [[20626,0],1] node[1].name compil02 daemon 1 arch
> > ffc91200
> > [compil03:03777] Info: Setting up debugger process table for
> > applications
> > MPIR_being_debugged = 0
> > MPIR_debug_state = 1
> > MPIR_partial_attach_ok = 1
> > MPIR_i_am_starter = 0
> > MPIR_proctable_size = 2
> > MPIR_proctable:
> > (i, host, exe, pid) = (0, compil03, /tmp/a.out, 0)
> > (i, host, exe, pid) = (1, compil02, /tmp/a.out_64, 10787)
> > [compil02:10787] procdir: /tmp/openmpi-sessions-
> > gpignot_at_compil02_0/20626/1/1
> > [compil02:10787] jobdir: /tmp/openmpi-sessions-
> > gpignot_at_compil02_0/20626/1
> > [compil02:10787] top: openmpi-sessions-gpignot_at_compil02_0
> > [compil02:10787] tmp: /tmp
> > [compil02:10787] [[20626,1],1] node[0].name compil03 daemon 0 arch
> > ffc91200
> > [compil02:10787] [[20626,1],1] node[1].name compil02 daemon 1 arch
> > ffc91200
> >
> > HANGS: goes a little further, but one pid is still 0
> >
> > Test 4 :
> >
> > compil02% ls /tmp
> > a.out_64 a.out openmpi-1.3
> >
> > compil03% cp a.out a.out_64 ; ls /tmp
> > a.out_64 a.out openmpi-1.3
> >
> > compil03% /tmp/openmpi-1.3/bin/mpirun -d -n 1 -host compil03 /tmp/a.out : -n 1
> > -host compil02 /tmp/a.out_64
> > [compil03:03789] procdir: /tmp/openmpi-sessions-
> > gpignot_at_compil03_0/20638/0/0
> > [compil03:03789] jobdir: /tmp/openmpi-sessions-
> > gpignot_at_compil03_0/20638/0
> > [compil03:03789] top: openmpi-sessions-gpignot_at_compil03_0
> > [compil03:03789] tmp: /tmp
> > [compil03:03789] mpirun: reset PATH: /tmp/openmpi-1.3/bin:/u/gpignot/jobmgr/bin:.:/cgg/lv5000/jobmgr/bin:/cgg/lv5000/jobmgr/exec/Linux2.6-x86_64/PIV:/cgg/jobmgr/bin:/cgg/jobmgr/exec/Linux2.6-x86_64/PIV:/cgg/lv5000/bin:/cgg/lv5000/exec/Linux2.6-x86_64/PIV:/cgg/util:/bin:/usr/bin:/usr/sbin:/etc:/usr/etc:/usr/local/bin:/usr/bin/X11:/nfs/softs/TOOLS/bin:/nfs/netapp1/DEVTOOLS/bin:/nfs/netapp1/DEVTOOLS/free/Linux2.6-x86_64/bin:/cgg/localdev:/cgg/Applis/bin
> > [compil03:03789] mpirun: reset LD_LIBRARY_PATH: /tmp/openmpi-1.3/lib64:/tmp/openmpi-1.3/lib64
> > [compil02:10937] procdir: /tmp/openmpi-sessions-
> > gpignot_at_compil02_0/20638/0/1
> > [compil02:10937] jobdir: /tmp/openmpi-sessions-
> > gpignot_at_compil02_0/20638/0
> > [compil02:10937] top: openmpi-sessions-gpignot_at_compil02_0
> > [compil02:10937] tmp: /tmp
> > [compil03:03789] [[20638,0],0] node[0].name compil03 daemon 0 arch
> > ffc91200
> > [compil03:03789] [[20638,0],0] node[1].name compil02 daemon 1 arch
> > ffc91200
> > [compil02:10937] [[20638,0],1] node[0].name compil03 daemon 0 arch
> > ffc91200
> > [compil02:10937] [[20638,0],1] node[1].name compil02 daemon 1 arch
> > ffc91200
> > [compil03:03789] Info: Setting up debugger process table for
> > applications
> > MPIR_being_debugged = 0
> > MPIR_debug_state = 1
> > MPIR_partial_attach_ok = 1
> > MPIR_i_am_starter = 0
> > MPIR_proctable_size = 2
> > MPIR_proctable:
> > (i, host, exe, pid) = (0, compil03, /tmp/a.out, 3792)
> > (i, host, exe, pid) = (1, compil02, /tmp/a.out_64, 10938)
> > [compil03:03792] procdir: /tmp/openmpi-sessions-
> > gpignot_at_compil03_0/20638/1/0
> > [compil03:03792] jobdir: /tmp/openmpi-sessions-
> > gpignot_at_compil03_0/20638/1
> > [compil03:03792] top: openmpi-sessions-gpignot_at_compil03_0
> > [compil03:03792] tmp: /tmp
> > [compil03:03792] [[20638,1],0] node[0].name compil03 daemon 0 arch
> > ffc91200
> > [compil03:03792] [[20638,1],0] node[1].name compil02 daemon 1 arch
> > ffc91200
> > [compil02:10938] procdir: /tmp/openmpi-sessions-
> > gpignot_at_compil02_0/20638/1/1
> > [compil02:10938] jobdir: /tmp/openmpi-sessions-
> > gpignot_at_compil02_0/20638/1
> > [compil02:10938] top: openmpi-sessions-gpignot_at_compil02_0
> > [compil02:10938] tmp: /tmp
> > [compil02:10938] [[20638,1],1] node[0].name compil03 daemon 0 arch
> > ffc91200
> > [compil02:10938] [[20638,1],1] node[1].name compil02 daemon 1 arch
> > ffc91200
> > Hello world from process 0 of 2
> > Hello world from process 1 of 2
> > [compil03:03792] sess_dir_finalize: proc session dir not empty -
> > leaving
> > [compil02:10938] sess_dir_finalize: proc session dir not empty -
> > leaving
> > [compil03:03789] sess_dir_finalize: proc session dir not empty -
> > leaving
> > [compil02:10937] sess_dir_finalize: proc session dir not empty -
> > leaving
> > [compil03:03789] sess_dir_finalize: job session dir not empty -
> > leaving
> > [compil02:10937] sess_dir_finalize: job session dir not empty -
> > leaving
> > [compil03:03789] sess_dir_finalize: proc session dir not empty -
> > leaving
> > orterun: exiting with status 0
> >
> > WORKS PERFECTLY
> >
> >
> > I don't understand exactly what is going on, but I doubt that
> > this behaviour is considered normal.
> >
> > Thanks in advance for your comments
> >
> > Geoffroy
> >
> >
> >
> > <geoffroy_ompi_info>
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users