Open MPI Development Mailing List Archives

From: Tim S. Woodall (twoodall_at_[hidden])
Date: 2005-08-18 11:18:13


Brian,

Wasn't the introduction of sds part of your changes for Red Storm? Any ideas
why it would be NULL here?

Thanks,
Tim

Rainer Keller wrote:

>Hello,
>I see the "same" (well, probably not exactly the same) thing here on an
>Opteron in 64-bit mode (built with -g and so on); I get:
>
>#0 0x0000000040085160 in orte_sds_base_contact_universe ()
>at ../../../../../orte/mca/sds/base/sds_base_interface.c:29
>29 return orte_sds_base_module->contact_universe();
>(gdb) where
>#0 0x0000000040085160 in orte_sds_base_contact_universe ()
>at ../../../../../orte/mca/sds/base/sds_base_interface.c:29
>#1 0x0000000040063e95 in orte_init_stage1 ()
>at ../../../orte/runtime/orte_init_stage1.c:185
>#2 0x0000000040017e7d in orte_system_init ()
>at ../../../orte/runtime/orte_system_init.c:38
>#3 0x00000000400148f5 in orte_init () at ../../../orte/runtime/orte_init.c:46
>#4 0x000000004000dfc7 in main (argc=4, argv=0x7fbfffe8a8)
>at ../../../../orte/tools/orterun/orterun.c:291
>#5 0x0000002a95c0c017 in __libc_start_main () from /lib64/libc.so.6
>#6 0x000000004000bf2a in _start ()
>(gdb)
>within mpirun
>
>orte_sds_base_module is NULL here...
>This is without a persistent orted; just mpirun...
>
>CU,
>ray
>
>
>On Thursday 18 August 2005 16:57, Nathan DeBardeleben wrote:
>
>
>>FYI, this only happens when I let OMPI compile 64-bit on Linux. When I
>>throw in CFLAGS=FFLAGS=CXXFLAGS=-m32, orted, my myriad of test
>>codes, mpirun, registry subscription codes, and JNI all work like a champ.
>>It appears to me that something is wrong with the 64-bit build.
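Nathan's workaround amounts to forcing a fully 32-bit build by passing -m32 to each compiler at configure time. A hedged sketch of that invocation; the prefix is illustrative, and on some toolchains LDFLAGS=-m32 may also be needed so the link step matches:

```shell
# Build 32-bit Open MPI on a 64-bit Linux host by forcing -m32 onto the
# C, C++, and Fortran 77 compilers when configure probes them.
./configure CFLAGS=-m32 CXXFLAGS=-m32 FFLAGS=-m32 LDFLAGS=-m32 \
    --prefix=$HOME/local/ompi
make all install
```

Setting the flags on the configure command line (rather than exporting them mid-build) keeps configure's compiler feature tests consistent with the flags used for compilation.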
>>
>>-- Nathan
>>Correspondence
>>---------------------------------------------------------------------
>>Nathan DeBardeleben, Ph.D.
>>Los Alamos National Laboratory
>>Parallel Tools Team
>>High Performance Computing Environments
>>phone: 505-667-3428
>>email: ndebard_at_[hidden]
>>---------------------------------------------------------------------
>>
>>Tim S. Woodall wrote:
>>
>>
>>>Nathan,
>>>
>>>I'll try to reproduce this sometime this week - but I'm pretty swamped.
>>>Is Greg also seeing the same behavior?
>>>
>>>Thanks,
>>>Tim
>>>
>>>Nathan DeBardeleben wrote:
>>>
>>>
>>>>To expand on this further, orte_init() seg faults on both bluesteel
>>>>(32bit linux) and sparkplug (64bit linux) equally. The required
>>>>condition is that orted must be running first (which of course we
>>>>require for our work - a persistent orte daemon and registry).
>>>>
>>>>
>>>>
>>>>>[bluesteel]~/ptp > ./dump_info
>>>>>Segmentation fault
>>>>>[bluesteel]~/ptp > gdb dump_info
>>>>>GNU gdb 6.1
>>>>>Copyright 2004 Free Software Foundation, Inc.
>>>>>GDB is free software, covered by the GNU General Public License, and
>>>>>you are
>>>>>welcome to change it and/or distribute copies of it under certain
>>>>>conditions.
>>>>>Type "show copying" to see the conditions.
>>>>>There is absolutely no warranty for GDB. Type "show warranty" for
>>>>>details.
>>>>>This GDB was configured as "x86_64-suse-linux"...Using host
>>>>>libthread_db library "/lib64/tls/libthread_db.so.1".
>>>>>
>>>>>(gdb) run
>>>>>Starting program: /home/ndebard/ptp/dump_info
>>>>>
>>>>>Program received signal SIGSEGV, Segmentation fault.
>>>>>0x0000000000000000 in ?? ()
>>>>>(gdb) where
>>>>>#0 0x0000000000000000 in ?? ()
>>>>>#1 0x000000000045997d in orte_init_stage1 () at orte_init_stage1.c:419
>>>>>#2 0x00000000004156a7 in orte_system_init () at orte_system_init.c:38
>>>>>#3 0x00000000004151c7 in orte_init () at orte_init.c:46
>>>>>#4 0x0000000000414cbb in main (argc=1, argv=0x7fbffff298) at
>>>>>dump_info.c:185
>>>>>(gdb)
>>>>>
>>>>>
>>>>-- Nathan
>>>>
>>>>Nathan DeBardeleben wrote:
>>>>
>>>>
>>>>>Just to clarify:
>>>>>1: with no orted started (meaning the mpirun or registry programs will
>>>>>start one themselves), those programs lock up.
>>>>>2: starting orted by hand (trying to get these programs to connect to
>>>>>a centralized one) causes the connecting programs to segfault.
>>>>>
>>>>>-- Nathan
>>>>>
>>>>>Nathan DeBardeleben wrote:
>>>>>
>>>>>
>>>>>>So I dropped an .ompi_ignore into that directory, reconfigured, and
>>>>>>the compile worked (yay!).
>>>>>>However, not a lot of progress: mpirun locks up, and all my registry test
>>>>>>programs lock up as well. If I start the orted by hand, then any of my
>>>>>>registry-calling programs segfault:
>>>>>>
>>>>>>
>>>>>>>[sparkplug]~/ptp > gdb sub_test
>>>>>>>GNU gdb 6.1
>>>>>>>Copyright 2004 Free Software Foundation, Inc.
>>>>>>>GDB is free software, covered by the GNU General Public License, and
>>>>>>>you are
>>>>>>>welcome to change it and/or distribute copies of it under certain
>>>>>>>conditions.
>>>>>>>Type "show copying" to see the conditions.
>>>>>>>There is absolutely no warranty for GDB. Type "show warranty" for
>>>>>>>details.
>>>>>>>This GDB was configured as "x86_64-suse-linux"...Using host
>>>>>>>libthread_db library "/lib64/tls/libthread_db.so.1".
>>>>>>>
>>>>>>>(gdb) run
>>>>>>>Starting program: /home/ndebard/ptp/sub_test
>>>>>>>
>>>>>>>Program received signal SIGSEGV, Segmentation fault.
>>>>>>>0x0000000000000000 in ?? ()
>>>>>>>(gdb) where
>>>>>>>#0 0x0000000000000000 in ?? ()
>>>>>>>#1 0x00000000004598a5 in orte_init_stage1 () at orte_init_stage1.c:419
>>>>>>>#2 0x00000000004155cf in orte_system_init () at orte_system_init.c:38
>>>>>>>#3 0x00000000004150ef in orte_init () at orte_init.c:46
>>>>>>>#4 0x00000000004148a1 in main (argc=1, argv=0x7fbffff178) at sub_test.c:60
>>>>>>>(gdb)
>>>>>>>
>>>>>>>
>>>>>>Yes, I recompiled everything.
>>>>>>
>>>>>>Here's an example of me trying something a little more complicated
>>>>>>(which I believe locks up for the same reason - something borked with
>>>>>>the registry interaction).
>>>>>>
>>>>>>
>>>>>>
>>>>>>>>[sparkplug]~/ompi-test > bjssub -s 10000 -n 10 -i bash
>>>>>>>>Waiting for interactive job nodes.
>>>>>>>>(nodes 18 16 17 18 19 20 21 22 23 24 25)
>>>>>>>>Starting interactive job.
>>>>>>>>NODES=16,17,18,19,20,21,22,23,24,25
>>>>>>>>JOBID=18
>>>>>>>>
>>>>>>>>
>>>>>>>so i got my nodes
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>ndebard_at_sparkplug:~/ompi-test> export OMPI_MCA_ptl_base_exclude=sm
>>>>>>>>ndebard_at_sparkplug:~/ompi-test> export
>>>>>>>>OMPI_MCA_pls_bproc_seed_priority=101
>>>>>>>>
>>>>>>>>
>>>>>>>and set these env vars as we need for Greg's bproc; without the
>>>>>>>2nd export, the machine's load maxes out and it locks up.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>ndebard_at_sparkplug:~/ompi-test> bpstat
>>>>>>>>Node(s)   Status  Mode        User     Group
>>>>>>>>100-128   down    ----------  root     root
>>>>>>>>0-15      up      ---x------  vchandu  vchandu
>>>>>>>>16-25     up      ---x------  ndebard  ndebard
>>>>>>>>26-27     up      ---x------  root     root
>>>>>>>>28-30     up      ---x--x--x  root     root
>>>>>>>>ndebard_at_sparkplug:~/ompi-test> env | grep NODES
>>>>>>>>NODES=16,17,18,19,20,21,22,23,24,25
>>>>>>>>
>>>>>>>>
>>>>>>>yes, i really have the nodes
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>ndebard_at_sparkplug:~/ompi-test> mpicc -o test-mpi test-mpi.c
>>>>>>>>ndebard_at_sparkplug:~/ompi-test>
>>>>>>>>
>>>>>>>>
>>>>>>>recompile for good measure
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>ndebard_at_sparkplug:~/ompi-test> ls /tmp/openmpi-sessions-ndebard*
>>>>>>>>/bin/ls: /tmp/openmpi-sessions-ndebard*: No such file or directory
>>>>>>>>
>>>>>>>>
>>>>>>>proof that there's no left over old directory
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>ndebard_at_sparkplug:~/ompi-test> mpirun -np 1 test-mpi
>>>>>>>>
>>>>>>>>
>>>>>>>it never responds at this point - but I can kill it with ^C.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>mpirun: killing job...
>>>>>>>>Killed
>>>>>>>>ndebard_at_sparkplug:~/ompi-test>
>>>>>>>>
>>>>>>>>
>>>>>>-- Nathan
>>>>>>
>>>>>>Jeff Squyres wrote:
>>>>>>
>>>>>>
>>>>>>>Is this what Tim Prins was working on?
>>>>>>>
>>>>>>>On Aug 16, 2005, at 5:21 PM, Tim S. Woodall wrote:
>>>>>>>
>>>>>>>
>>>>>>>>I'm not sure why this is even building... Is someone working on this?
>>>>>>>>I thought we had .ompi_ignore files in this directory.
>>>>>>>>
>>>>>>>>Tim
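The .ompi_ignore mechanism Tim refers to marks a component directory to be skipped by the build. A sketch of the usual workflow, assuming a source checkout in which autogen.sh performs the component scan; the directory path is the one from the compile errors quoted further down, and the configure options are the ones Nathan reports using:

```shell
# Drop an .ompi_ignore marker into the offending component directory so
# the build system skips it, then regenerate so the component list is
# re-scanned before configuring.
touch orte/mca/ras/lsf_bproc/.ompi_ignore
./autogen.sh
./configure --enable-static --disable-shared --without-threads \
    --prefix=$HOME/local/ompi --with-devel-headers
make all
```

Because the ignore list is read when the build system is generated, touching the file without re-running autogen.sh would leave the component in the build.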
>>>>>>>>
>>>>>>>>Nathan DeBardeleben wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>>So I'm seeing all these nice emails about people developing on OMPI
>>>>>>>>>today, yet I can't get it to compile. Am I out here in limbo on this,
>>>>>>>>>or are others in the same boat? The errors I'm seeing are about some
>>>>>>>>>bproc code calling undefined functions; they are included again below.
>>>>>>>>>
>>>>>>>>>-- Nathan
>>>>>>>>>
>>>>>>>>>Nathan DeBardeleben wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>Back from training and trying to test this, but now OMPI doesn't
>>>>>>>>>>compile at all:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>gcc -DHAVE_CONFIG_H -I. -I. -I../../../../include
>>>>>>>>>>>-I../../../../include -I../../../.. -I../../../..
>>>>>>>>>>>-I../../../../include -I../../../../opal -I../../../../orte
>>>>>>>>>>>-I../../../../ompi -g -Wall -Wundef -Wno-long-long -Wsign-compare
>>>>>>>>>>>-Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic
>>>>>>>>>>>-Werror-implicit-function-declaration -fno-strict-aliasing -MT
>>>>>>>>>>>ras_lsf_bproc.lo -MD -MP -MF .deps/ras_lsf_bproc.Tpo -c
>>>>>>>>>>>ras_lsf_bproc.c -o ras_lsf_bproc.o
>>>>>>>>>>>ras_lsf_bproc.c: In function `orte_ras_lsf_bproc_node_insert':
>>>>>>>>>>>ras_lsf_bproc.c:32: error: implicit declaration of function
>>>>>>>>>>>`orte_ras_base_node_insert'
>>>>>>>>>>>ras_lsf_bproc.c: In function `orte_ras_lsf_bproc_node_query':
>>>>>>>>>>>ras_lsf_bproc.c:37: error: implicit declaration of function
>>>>>>>>>>>`orte_ras_base_node_query'
>>>>>>>>>>>make[4]: *** [ras_lsf_bproc.lo] Error 1
>>>>>>>>>>>make[4]: Leaving directory
>>>>>>>>>>>`/home/ndebard/ompi/orte/mca/ras/lsf_bproc'
>>>>>>>>>>>make[3]: *** [all-recursive] Error 1
>>>>>>>>>>>make[3]: Leaving directory `/home/ndebard/ompi/orte/mca/ras'
>>>>>>>>>>>make[2]: *** [all-recursive] Error 1
>>>>>>>>>>>make[2]: Leaving directory `/home/ndebard/ompi/orte/mca'
>>>>>>>>>>>make[1]: *** [all-recursive] Error 1
>>>>>>>>>>>make[1]: Leaving directory `/home/ndebard/ompi/orte'
>>>>>>>>>>>make: *** [all-recursive] Error 1
>>>>>>>>>>>[sparkplug]~/ompi >
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>Clean SVN checkout this morning with configure:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>[sparkplug]~/ompi > ./configure --enable-static --disable-shared
>>>>>>>>>>>--without-threads --prefix=/home/ndebard/local/ompi
>>>>>>>>>>>--with-devel-headers
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>-- Nathan
>>>>>>>>>>
>>>>>>>>>>Brian Barrett wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>This is now fixed in SVN. You should no longer need the
>>>>>>>>>>>--build=i586... hack to compile 32 bit code on Opterons.
>>>>>>>>>>>
>>>>>>>>>>>Brian
>>>>>>>>>>>
>>>>>>>>>>>On Aug 12, 2005, at 3:17 PM, Brian Barrett wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>On Aug 12, 2005, at 3:13 PM, Nathan DeBardeleben wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>We've got a 64bit Linux (SUSE) box here. For a variety of
>>>>>>>>>>>>>reasons (Java, JNI, linking in with OMPI libraries, etc which I
>>>>>>>>>>>>>won't get into)
>>>>>>>>>>>>>I need to compile OMPI 32 bit (or get 64bit versions of a lot of
>>>>>>>>>>>>>other
>>>>>>>>>>>>>libraries).
>>>>>>>>>>>>>I get various compile errors when I try different things, but
>>>>>>>>>>>>>first
>>>>>>>>>>>>>let
>>>>>>>>>>>>>me explain the system we have:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>><snip>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>This goes on and on, actually. The 'is incompatible with
>>>>>>>>>>>>>i386:x86-64 output' message is repeated for every line before the
>>>>>>>>>>>>>error that actually caused the make to bomb.
>>>>>>>>>>>>>
>>>>>>>>>>>>>Any suggestions at all? Surely someone must have tried to force
>>>>>>>>>>>>>OMPI to
>>>>>>>>>>>>>build in 32bit mode on a 64bit machine.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>I don't think anyone has tried to build 32 bit on an Opteron,
>>>>>>>>>>>>which is the cause of the problems...
>>>>>>>>>>>>
>>>>>>>>>>>>I think I know how to fix this, but won't happen until later in
>>>>>>>>>>>>the weekend. I can't think of a good workaround until then.
>>>>>>>>>>>>Well, one possibility is to set the target like you were doing
>>>>>>>>>>>>and disable ROMIO. Actually, you'll also need to disable
>>>>>>>>>>>>Fortran 77. So something like:
>>>>>>>>>>>>
>>>>>>>>>>>>./configure [usual options] --build=i586-suse-linux --disable-io-
>>>>>>>>>>>>romio --disable-f77
>>>>>>>>>>>>
>>>>>>>>>>>>might just do the trick.
>>>>>>>>>>>>
>>>>>>>>>>>>Brian
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>--
>>>>>>>>>>>>Brian Barrett
>>>>>>>>>>>>Open MPI developer
>>>>>>>>>>>>http://www.open-mpi.org/
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>_______________________________________________
>>>>>>>>>>>>devel mailing list
>>>>>>>>>>>>devel_at_[hidden]
>>>>>>>>>>>>http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>>>>>
>>>>>>>>>>>>