Open MPI Development Mailing List Archives

From: Nathan DeBardeleben (ndebard_at_[hidden])
Date: 2005-08-17 10:06:08


Just to clarify:
  1: With no orted started (meaning mpirun or the registry programs will
start one by themselves), those programs lock up.
  2: Starting orted by hand (trying to get these programs to connect to
a centralized one) causes the connecting programs to segfault.
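
For context, the registry test programs mentioned here are just small ORTE
clients. Below is a minimal sketch of that shape; only the
orte_init()/orte_finalize() entry points come from the backtrace quoted
further down, while the header path and the error handling are my assumptions.

  /* minimal sketch of an ORTE client along the lines of sub_test.c;
   * the include path is an assumption, and success is assumed to be 0 */
  #include <stdio.h>
  #include "orte/runtime/runtime.h"   /* assumed location of the prototypes */

  int main(int argc, char *argv[])
  {
      if (orte_init() != 0) {          /* the call that hangs or segfaults */
          fprintf(stderr, "orte_init failed\n");
          return 1;
      }
      /* ... registry (GPR) subscribe/put/get calls would go here ... */
      orte_finalize();
      return 0;
  }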

-- Nathan
Correspondence
---------------------------------------------------------------------
Nathan DeBardeleben, Ph.D.
Los Alamos National Laboratory
Parallel Tools Team
High Performance Computing Environments
phone: 505-667-3428
email: ndebard_at_[hidden]
---------------------------------------------------------------------

Nathan DeBardeleben wrote:

>So I dropped an .ompi_ignore into that directory, reconfigured, and
>the compile worked (yay!).
>However, not a lot of progress: mpirun locks up, and all my registry test
>programs lock up as well. If I start the orted by hand, then any of my
>registry-calling programs segfaults:
>
>
>
>>[sparkplug]~/ptp > gdb sub_test
>>GNU gdb 6.1
>>Copyright 2004 Free Software Foundation, Inc.
>>GDB is free software, covered by the GNU General Public License, and you are
>>welcome to change it and/or distribute copies of it under certain conditions.
>>Type "show copying" to see the conditions.
>>There is absolutely no warranty for GDB. Type "show warranty" for details.
>>This GDB was configured as "x86_64-suse-linux"...Using host libthread_db library "/lib64/tls/libthread_db.so.1".
>>
>>(gdb) run
>>Starting program: /home/ndebard/ptp/sub_test
>>
>>Program received signal SIGSEGV, Segmentation fault.
>>0x0000000000000000 in ?? ()
>>(gdb) where
>>#0 0x0000000000000000 in ?? ()
>>#1 0x00000000004598a5 in orte_init_stage1 () at orte_init_stage1.c:419
>>#2 0x00000000004155cf in orte_system_init () at orte_system_init.c:38
>>#3 0x00000000004150ef in orte_init () at orte_init.c:46
>>#4 0x00000000004148a1 in main (argc=1, argv=0x7fbffff178) at sub_test.c:60
>>(gdb)
>>
>>
>
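
A note on the backtrace above: frame #0 at address 0x0 means execution jumped
through a NULL function pointer from within orte_init_stage1(). The sketch
below (hypothetical names, not the OMPI code) shows that failure mode and the
kind of guard that avoids it.

  /* hypothetical illustration of a call through an unset function pointer,
   * the usual cause of a SIGSEGV at address 0x0; names are not from OMPI */
  #include <stdio.h>

  typedef struct {
      int (*init)(void);               /* filled in when a component is selected */
  } module_t;

  static module_t selected = { NULL }; /* never populated, e.g. component ignored */

  int main(void)
  {
      if (selected.init == NULL) {     /* without this guard, the call below
                                          crashes with frame #0 at 0x0 */
          fprintf(stderr, "no module selected\n");
          return 1;
      }
      return selected.init();
  }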
>Yes, I recompiled everything.
>
>Here's an example of me trying something a little more complicated
>(which I believe locks up for the same reason - something borked with
>the registry interaction).
>
>
>
>>>[sparkplug]~/ompi-test > bjssub -s 10000 -n 10 -i bash
>>>Waiting for interactive job nodes.
>>>(nodes 18 16 17 18 19 20 21 22 23 24 25)
>>>Starting interactive job.
>>>NODES=16,17,18,19,20,21,22,23,24,25
>>>JOBID=18
>>>
>>>
>>So I got my nodes.
>>
>>
>>
>>>ndebard_at_sparkplug:~/ompi-test> export OMPI_MCA_ptl_base_exclude=sm
>>>ndebard_at_sparkplug:~/ompi-test> export OMPI_MCA_pls_bproc_seed_priority=101
>>>
>>>
>>And set these envvars as we need to for Greg's bproc; without the
>>second export, the machine's load maxes out and it locks up.
>>
>>
>>
>>>ndebard_at_sparkplug:~/ompi-test> bpstat
>>>Node(s)   Status  Mode        User      Group
>>>100-128   down    ----------  root      root
>>>0-15      up      ---x------  vchandu   vchandu
>>>16-25     up      ---x------  ndebard   ndebard
>>>26-27     up      ---x------  root      root
>>>28-30     up      ---x--x--x  root      root
>>>ndebard_at_sparkplug:~/ompi-test> env | grep NODES
>>>NODES=16,17,18,19,20,21,22,23,24,25
>>>
>>>
>>Yes, I really have the nodes.
>>
>>
>>
>>>ndebard_at_sparkplug:~/ompi-test> mpicc -o test-mpi test-mpi.c
>>>ndebard_at_sparkplug:~/ompi-test>
>>>
>>>
>>Recompiled for good measure.
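
The source of test-mpi.c isn't shown anywhere in the thread; presumably it is
a trivial MPI program along these lines (an assumption, included only so the
test is easy to reproduce). It would be compiled exactly as in the transcript
above (mpicc -o test-mpi test-mpi.c).

  /* hypothetical stand-in for test-mpi.c; the real file is not shown */
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char *argv[])
  {
      int rank, size;

      MPI_Init(&argc, &argv);           /* never returns if the job hangs as described */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      printf("hello from rank %d of %d\n", rank, size);
      MPI_Finalize();
      return 0;
  }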
>>
>>
>>
>>>ndebard_at_sparkplug:~/ompi-test> ls /tmp/openmpi-sessions-ndebard*
>>>/bin/ls: /tmp/openmpi-sessions-ndebard*: No such file or directory
>>>
>>>
>>Proof that there's no leftover old session directory.
>>
>>
>>
>>>ndebard_at_sparkplug:~/ompi-test> mpirun -np 1 test-mpi
>>>
>>>
>>It never responds at this point, but I can kill it with ^C.
>>
>>
>>
>>>mpirun: killing job...
>>>Killed
>>>ndebard_at_sparkplug:~/ompi-test>
>>>
>>>
>
>-- Nathan
>
>
>
>Jeff Squyres wrote:
>
>
>
>>Is this what Tim Prins was working on?
>>
>>
>>On Aug 16, 2005, at 5:21 PM, Tim S. Woodall wrote:
>>
>>
>>
>>
>>
>>>I'm not sure why this is even building... Is someone working on this?
>>>I thought we had .ompi_ignore files in this directory.
>>>
>>>Tim
>>>
>>>
>>>Nathan DeBardeleben wrote:
>>>
>>>
>>>
>>>
>>>>So I'm seeing all these nice emails about people developing on OMPI
>>>>today, yet I can't get it to compile. Am I out here in limbo on this,
>>>>or are others in the same boat? The errors I'm seeing are about some
>>>>bproc code calling undefined functions; they are included again below.
>>>>
>>>>-- Nathan
>>>>
>>>>
>>>>
>>>>Nathan DeBardeleben wrote:
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>>Back from training and trying to test this, but now OMPI doesn't
>>>>>compile at all:
>>>>>
>>>>>>gcc -DHAVE_CONFIG_H -I. -I. -I../../../../include
>>>>>>-I../../../../include -I../../../.. -I../../../..
>>>>>>-I../../../../include -I../../../../opal -I../../../../orte
>>>>>>-I../../../../ompi -g -Wall -Wundef -Wno-long-long -Wsign-compare
>>>>>>-Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic
>>>>>>-Werror-implicit-function-declaration -fno-strict-aliasing -MT
>>>>>>ras_lsf_bproc.lo -MD -MP -MF .deps/ras_lsf_bproc.Tpo -c
>>>>>>ras_lsf_bproc.c -o ras_lsf_bproc.o
>>>>>>ras_lsf_bproc.c: In function `orte_ras_lsf_bproc_node_insert':
>>>>>>ras_lsf_bproc.c:32: error: implicit declaration of function `orte_ras_base_node_insert'
>>>>>>ras_lsf_bproc.c: In function `orte_ras_lsf_bproc_node_query':
>>>>>>ras_lsf_bproc.c:37: error: implicit declaration of function `orte_ras_base_node_query'
>>>>>>make[4]: *** [ras_lsf_bproc.lo] Error 1
>>>>>>make[4]: Leaving directory `/home/ndebard/ompi/orte/mca/ras/lsf_bproc'
>>>>>>make[3]: *** [all-recursive] Error 1
>>>>>>make[3]: Leaving directory `/home/ndebard/ompi/orte/mca/ras'
>>>>>>make[2]: *** [all-recursive] Error 1
>>>>>>make[2]: Leaving directory `/home/ndebard/ompi/orte/mca'
>>>>>>make[1]: *** [all-recursive] Error 1
>>>>>>make[1]: Leaving directory `/home/ndebard/ompi/orte'
>>>>>>make: *** [all-recursive] Error 1
>>>>>>[sparkplug]~/ompi >
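
For what it's worth, the failures above are the compiler rejecting calls for
which it can see no prototype; the build passes
-Werror-implicit-function-declaration, so the missing declarations become hard
errors. Here is a generic, self-contained toy (not the OMPI source) showing
the same class of error and the shape of the fix.

  /* toy.c: generic illustration only, not the OMPI code; compile with
   *   gcc -Wall -Werror-implicit-function-declaration -c toy.c */
  #include <stdio.h>

  /* without this prototype (or a header that provides it), the call in
   * insert_node() triggers "implicit declaration of function 'do_insert'" */
  static int do_insert(int node);

  int insert_node(int node)
  {
      return do_insert(node);          /* needs a declaration in scope here */
  }

  static int do_insert(int node)
  {
      printf("inserting node %d\n", node);
      return 0;
  }

Presumably the in-tree fix is along the same lines: making the
orte_ras_base_node_insert()/orte_ras_base_node_query() prototypes visible to
ras_lsf_bproc.c, or simply .ompi_ignore-ing the component as discussed above.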
>>>>>>
>>>>>This is a clean SVN checkout from this morning, configured with:
>>>>>
>>>>>>[sparkplug]~/ompi > ./configure --enable-static --disable-shared
>>>>>>--without-threads --prefix=/home/ndebard/local/ompi
>>>>>>--with-devel-headers
>>>>>>
>>>>>-- Nathan
>>>>>
>>>>>
>>>>>
>>>>>Brian Barrett wrote:
>>>>>
>>>>>>This is now fixed in SVN. You should no longer need the
>>>>>>--build=i586... hack to compile 32-bit code on Opterons.
>>>>>>
>>>>>>Brian
>>>>>>
>>>>>>On Aug 12, 2005, at 3:17 PM, Brian Barrett wrote:
>>>>>>
>>>>>>>On Aug 12, 2005, at 3:13 PM, Nathan DeBardeleben wrote:
>>>>>>>
>>>>>>>>We've got a 64-bit Linux (SUSE) box here. For a variety of reasons
>>>>>>>>(Java, JNI, linking in with OMPI libraries, etc., which I won't get
>>>>>>>>into), I need to compile OMPI 32-bit (or get 64-bit versions of a
>>>>>>>>lot of other libraries).
>>>>>>>>I get various compile errors when I try different things, but first
>>>>>>>>let me explain the system we have:
>>>>>>>>
>>>>>>><snip>
>>>>>>>
>>>>>>>>This goes on and on and on, actually, and the 'is incompatible with
>>>>>>>>i386:x86-64 output' message looks to be repeated for every line
>>>>>>>>before the error that actually caused the make to bomb.
>>>>>>>>
>>>>>>>>Any suggestions at all? Surely someone must have tried to force
>>>>>>>>OMPI to build in 32-bit mode on a 64-bit machine.
>>>>>>>>
>>>>>>>I don't think anyone has tried to build 32-bit on an Opteron, which
>>>>>>>is the cause of the problems...
>>>>>>>
>>>>>>>I think I know how to fix this, but that won't happen until later in
>>>>>>>the weekend. I can't think of a good workaround until then. Well, one
>>>>>>>possibility is to set the target like you were doing and disable
>>>>>>>ROMIO. Actually, you'll also need to disable Fortran 77. So
>>>>>>>something like:
>>>>>>>
>>>>>>>./configure [usual options] --build=i586-suse-linux --disable-io-romio --disable-f77
>>>>>>>
>>>>>>>might just do the trick.
>>>>>>>
>>>>>>>Brian
>>>>>>>
>>>>>>>
>>>>>>>--
>>>>>>>Brian Barrett
>>>>>>>Open MPI developer
>>>>>>>http://www.open-mpi.org/