
Open MPI Development Mailing List Archives


From: Nathan DeBardeleben (ndebard_at_[hidden])
Date: 2005-08-18 09:57:08


FYI, this only happens when I let OMPI compile 64-bit on Linux. When I
throw CFLAGS=FFLAGS=CXXFLAGS=-m32 in there, orted, my myriad of test
codes, mpirun, registry subscription codes, and JNI all work like a champ.
It appears to me that something is wrong with the 64-bit build.
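For reference, a minimal sketch of the 32-bit configure invocation implied above (the install prefix is an example path, not the one used in this report):

```shell
# Force a 32-bit Open MPI build on a 64-bit Linux host by passing
# -m32 to the C, C++, and Fortran 77 compilers at configure time.
# The prefix below is an example, not the path used in this thread.
./configure CFLAGS=-m32 CXXFLAGS=-m32 FFLAGS=-m32 \
    --prefix=$HOME/local/ompi-32
make all install
```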

-- Nathan
Correspondence
---------------------------------------------------------------------
Nathan DeBardeleben, Ph.D.
Los Alamos National Laboratory
Parallel Tools Team
High Performance Computing Environments
phone: 505-667-3428
email: ndebard_at_[hidden]
---------------------------------------------------------------------

Tim S. Woodall wrote:

>Nathan,
>
>I'll try to reproduce this sometime this week - but I'm pretty swamped.
>Is Greg also seeing the same behavior?
>
>Thanks,
>Tim
>
>Nathan DeBardeleben wrote:
>
>
>>To expand on this further, orte_init() seg faults on both bluesteel
>>(32-bit Linux) and sparkplug (64-bit Linux) equally. The required
>>condition is that orted must be running first (which of course we
>>require for our work - a persistent orte daemon and registry).
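As a sketch of that setup: the orted flag names below are assumptions from this era of the code base, so check `orted --help` on your build before relying on them.

```shell
# Start a persistent ORTE daemon/registry by hand, then run a
# registry client against it. The --seed/--persistent/--scope
# flags are assumed and may differ by revision.
orted --seed --persistent --scope public &

# Programs calling orte_init() should now attach to the running
# daemon instead of spawning their own:
./dump_info
```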
>>
>>>[bluesteel]~/ptp > ./dump_info
>>>Segmentation fault
>>>[bluesteel]~/ptp > gdb dump_info
>>>GNU gdb 6.1
>>>Copyright 2004 Free Software Foundation, Inc.
>>>GDB is free software, covered by the GNU General Public License, and
>>>you are
>>>welcome to change it and/or distribute copies of it under certain
>>>conditions.
>>>Type "show copying" to see the conditions.
>>>There is absolutely no warranty for GDB. Type "show warranty" for
>>>details.
>>>This GDB was configured as "x86_64-suse-linux"...Using host
>>>libthread_db library "/lib64/tls/libthread_db.so.1".
>>>
>>>(gdb) run
>>>Starting program: /home/ndebard/ptp/dump_info
>>>
>>>Program received signal SIGSEGV, Segmentation fault.
>>>0x0000000000000000 in ?? ()
>>>(gdb) where
>>>#0 0x0000000000000000 in ?? ()
>>>#1 0x000000000045997d in orte_init_stage1 () at orte_init_stage1.c:419
>>>#2 0x00000000004156a7 in orte_system_init () at orte_system_init.c:38
>>>#3 0x00000000004151c7 in orte_init () at orte_init.c:46
>>>#4 0x0000000000414cbb in main (argc=1, argv=0x7fbffff298) at
>>>dump_info.c:185
>>>(gdb)
>>>
>>
>>Nathan DeBardeleben wrote:
>>
>>>Just to clarify:
>>>1: no orted started (meaning the mpirun or registry programs will
>>>start one by themselves) causes those programs to lock up.
>>>2: starting orted by hand (trying to get these programs to connect to
>>>a centralized one) causes the connecting programs to seg fault.
>>>
>>>Nathan DeBardeleben wrote:
>>>
>>>>So I dropped an .ompi_ignore into that directory, reconfigured, and
>>>>the compile worked (yay!).
>>>>However, not a lot of progress: mpirun locks up, and all my registry test
>>>>programs lock up as well. If I start the orted by hand, then any of my
>>>>registry-calling programs causes a segfault:
>>>>
>>>>>[sparkplug]~/ptp > gdb sub_test
>>>>>GNU gdb 6.1
>>>>>Copyright 2004 Free Software Foundation, Inc.
>>>>>GDB is free software, covered by the GNU General Public License, and
>>>>>you are
>>>>>welcome to change it and/or distribute copies of it under certain
>>>>>conditions.
>>>>>Type "show copying" to see the conditions.
>>>>>There is absolutely no warranty for GDB. Type "show warranty" for
>>>>>details.
>>>>>This GDB was configured as "x86_64-suse-linux"...Using host
>>>>>libthread_db library "/lib64/tls/libthread_db.so.1".
>>>>>
>>>>>(gdb) run
>>>>>Starting program: /home/ndebard/ptp/sub_test
>>>>>
>>>>>Program received signal SIGSEGV, Segmentation fault.
>>>>>0x0000000000000000 in ?? ()
>>>>>(gdb) where
>>>>>#0 0x0000000000000000 in ?? ()
>>>>>#1 0x00000000004598a5 in orte_init_stage1 () at orte_init_stage1.c:419
>>>>>#2 0x00000000004155cf in orte_system_init () at orte_system_init.c:38
>>>>>#3 0x00000000004150ef in orte_init () at orte_init.c:46
>>>>>#4 0x00000000004148a1 in main (argc=1, argv=0x7fbffff178) at
>>>>>sub_test.c:60
>>>>>(gdb)
>>>>>
>>>>Yes, I recompiled everything.
>>>>
>>>>Here's an example of me trying something a little more complicated
>>>>(which I believe locks up for the same reason - something borked with
>>>>the registry interaction).
>>>>
>>>>>>[sparkplug]~/ompi-test > bjssub -s 10000 -n 10 -i bash
>>>>>>Waiting for interactive job nodes.
>>>>>>(nodes 18 16 17 18 19 20 21 22 23 24 25)
>>>>>>Starting interactive job.
>>>>>>NODES=16,17,18,19,20,21,22,23,24,25
>>>>>>JOBID=18
>>>>>>
>>>>>So I got my nodes.
>>>>>
>>>>>>ndebard_at_sparkplug:~/ompi-test> export OMPI_MCA_ptl_base_exclude=sm
>>>>>>ndebard_at_sparkplug:~/ompi-test> export
>>>>>>OMPI_MCA_pls_bproc_seed_priority=101
>>>>>>
>>>>>and set these env vars as we need to in order to use Greg's bproc; without
>>>>>the 2nd export, the machine's load maxes out and it locks up.
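As an aside, the same MCA parameters can be passed per run on the mpirun command line instead of via OMPI_MCA_* environment variables; a sketch:

```shell
# Equivalent to the two exports above, expressed as -mca options
# on the mpirun command line.
mpirun -mca ptl_base_exclude sm \
       -mca pls_bproc_seed_priority 101 \
       -np 1 test-mpi
```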
>>>>>
>>>>>>ndebard_at_sparkplug:~/ompi-test> bpstat
>>>>>>Node(s)  Status  Mode        User     Group
>>>>>>100-128  down    ----------  root     root
>>>>>>0-15     up      ---x------  vchandu  vchandu
>>>>>>16-25    up      ---x------  ndebard  ndebard
>>>>>>26-27    up      ---x------  root     root
>>>>>>28-30    up      ---x--x--x  root     root
>>>>>>ndebard_at_sparkplug:~/ompi-test> env | grep NODES
>>>>>>NODES=16,17,18,19,20,21,22,23,24,25
>>>>>>
>>>>>Yes, I really have the nodes.
>>>>>
>>>>>>ndebard_at_sparkplug:~/ompi-test> mpicc -o test-mpi test-mpi.c
>>>>>>ndebard_at_sparkplug:~/ompi-test>
>>>>>>
>>>>>Recompile for good measure.
>>>>>
>>>>>>ndebard_at_sparkplug:~/ompi-test> ls /tmp/openmpi-sessions-ndebard*
>>>>>>/bin/ls: /tmp/openmpi-sessions-ndebard*: No such file or directory
>>>>>>
>>>>>Proof that there's no leftover old directory.
>>>>>
>>>>>>ndebard_at_sparkplug:~/ompi-test> mpirun -np 1 test-mpi
>>>>>>
>>>>>It never responds at this point, but I can kill it with ^C.
>>>>>
>>>>>>mpirun: killing job...
>>>>>>Killed
>>>>>>ndebard_at_sparkplug:~/ompi-test>
>>>>>>
>>>>
>>>>Jeff Squyres wrote:
>>>>
>>>>>Is this what Tim Prins was working on?
>>>>>
>>>>>
>>>>>On Aug 16, 2005, at 5:21 PM, Tim S. Woodall wrote:
>>>>>
>>>>>>I'm not sure why this is even building... Is someone working on this?
>>>>>>I thought we had .ompi_ignore files in this directory.
>>>>>>
>>>>>>Tim
>>>>>>
>>>>>>
>>>>>>Nathan DeBardeleben wrote:
>>>>>>
>>>>>>>So I'm seeing all these nice emails about people developing on OMPI
>>>>>>>today, yet I can't get it to compile. Am I out here in limbo on this,
>>>>>>>or are others in the same boat? The errors I'm seeing are about some
>>>>>>>bproc code calling undefined functions; they are linked again below.
>>>>>>>
>>>>>>>
>>>>>>>Nathan DeBardeleben wrote:
>>>>>>>
>>>>>>>>Back from training and trying to test this but now OMPI doesn't
>>>>>>>>compile
>>>>>>>>at all:
>>>>>>>>
>>>>>>>>>gcc -DHAVE_CONFIG_H -I. -I. -I../../../../include
>>>>>>>>>-I../../../../include -I../../../.. -I../../../..
>>>>>>>>>-I../../../../include -I../../../../opal -I../../../../orte
>>>>>>>>>-I../../../../ompi -g -Wall -Wundef -Wno-long-long -Wsign-compare
>>>>>>>>>-Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic
>>>>>>>>>-Werror-implicit-function-declaration -fno-strict-aliasing -MT
>>>>>>>>>ras_lsf_bproc.lo -MD -MP -MF .deps/ras_lsf_bproc.Tpo -c
>>>>>>>>>ras_lsf_bproc.c -o ras_lsf_bproc.o
>>>>>>>>>ras_lsf_bproc.c: In function `orte_ras_lsf_bproc_node_insert':
>>>>>>>>>ras_lsf_bproc.c:32: error: implicit declaration of function
>>>>>>>>>`orte_ras_base_node_insert'
>>>>>>>>>ras_lsf_bproc.c: In function `orte_ras_lsf_bproc_node_query':
>>>>>>>>>ras_lsf_bproc.c:37: error: implicit declaration of function
>>>>>>>>>`orte_ras_base_node_query'
>>>>>>>>>make[4]: *** [ras_lsf_bproc.lo] Error 1
>>>>>>>>>make[4]: Leaving directory
>>>>>>>>>`/home/ndebard/ompi/orte/mca/ras/lsf_bproc'
>>>>>>>>>make[3]: *** [all-recursive] Error 1
>>>>>>>>>make[3]: Leaving directory `/home/ndebard/ompi/orte/mca/ras'
>>>>>>>>>make[2]: *** [all-recursive] Error 1
>>>>>>>>>make[2]: Leaving directory `/home/ndebard/ompi/orte/mca'
>>>>>>>>>make[1]: *** [all-recursive] Error 1
>>>>>>>>>make[1]: Leaving directory `/home/ndebard/ompi/orte'
>>>>>>>>>make: *** [all-recursive] Error 1
>>>>>>>>>[sparkplug]~/ompi >
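The .ompi_ignore workaround mentioned later in this thread, as a shell sketch: the component directory comes from the error output above, and re-running autogen.sh after adding the file is an assumption about the build machinery.

```shell
# Tell the build system to skip the broken lsf_bproc RAS component.
# An .ompi_ignore file causes the directory to be excluded when the
# configure machinery is regenerated.
touch orte/mca/ras/lsf_bproc/.ompi_ignore
./autogen.sh
./configure --enable-static --disable-shared --without-threads \
    --prefix=$HOME/local/ompi --with-devel-headers
make
```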
>>>>>>>>>
>>>>>>>>Clean SVN checkout this morning with configure:
>>>>>>>>
>>>>>>>>>[sparkplug]~/ompi > ./configure --enable-static --disable-shared
>>>>>>>>>--without-threads --prefix=/home/ndebard/local/ompi
>>>>>>>>>--with-devel-headers
>>>>>>>>>
>>>>>>>>
>>>>>>>>Brian Barrett wrote:
>>>>>>>>
>>>>>>>>>This is now fixed in SVN. You should no longer need the
>>>>>>>>>--build=i586... hack to compile 32 bit code on Opterons.
>>>>>>>>>
>>>>>>>>>Brian
>>>>>>>>>
>>>>>>>>>On Aug 12, 2005, at 3:17 PM, Brian Barrett wrote:
>>>>>>>>>
>>>>>>>>>>On Aug 12, 2005, at 3:13 PM, Nathan DeBardeleben wrote:
>>>>>>>>>>
>>>>>>>>>>>We've got a 64bit Linux (SUSE) box here. For a variety of reasons
>>>>>>>>>>>(Java, JNI, linking in with OMPI libraries, etc which I won't get
>>>>>>>>>>>into)
>>>>>>>>>>>I need to compile OMPI 32 bit (or get 64bit versions of a lot of
>>>>>>>>>>>other
>>>>>>>>>>>libraries).
>>>>>>>>>>>I get various compile errors when I try different things, but
>>>>>>>>>>>first
>>>>>>>>>>>let
>>>>>>>>>>>me explain the system we have:
>>>>>>>>>>>
>>>>>>>>>><snip>
>>>>>>>>>>
>>>>>>>>>>>This goes on and on and on, actually. And the 'is incompatible with
>>>>>>>>>>>i386:x86-64 output' looks to be repeated for every line before the
>>>>>>>>>>>error which actually caused the make to bomb.
>>>>>>>>>>>
>>>>>>>>>>>Any suggestions at all? Surely someone must have tried to force
>>>>>>>>>>>OMPI to build in 32-bit mode on a 64-bit machine.
>>>>>>>>>>>
>>>>>>>>>>I don't think anyone has tried to build 32 bit on an Opteron, which
>>>>>>>>>>is the cause of the problems...
>>>>>>>>>>
>>>>>>>>>>I think I know how to fix this, but won't happen until later in the
>>>>>>>>>>weekend. I can't think of a good workaround until then. Well, one
>>>>>>>>>>possibility is to set the target like you were doing and disable
>>>>>>>>>>ROMIO. Actually, you'll also need to disable Fortran 77. So
>>>>>>>>>>something like:
>>>>>>>>>>
>>>>>>>>>>./configure [usual options] --build=i586-suse-linux --disable-io-
>>>>>>>>>>romio --disable-f77
>>>>>>>>>>
>>>>>>>>>>might just do the trick.
>>>>>>>>>>
>>>>>>>>>>Brian
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>--
>>>>>>>>>>Brian Barrett
>>>>>>>>>>Open MPI developer
>>>>>>>>>>http://www.open-mpi.org/
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>_______________________________________________
>>>>>>>>>>devel mailing list
>>>>>>>>>>devel_at_[hidden]
>>>>>>>>>>http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>