
Open MPI Development Mailing List Archives


From: Rainer Keller (Keller_at_[hidden])
Date: 2005-08-18 10:45:05


Hello,
I see the "same" (well, probably not exactly the same) thing here on an Opteron
in 64-bit mode (built with -g and so on); I get:

#0 0x0000000040085160 in orte_sds_base_contact_universe ()
at ../../../../../orte/mca/sds/base/sds_base_interface.c:29
29 return orte_sds_base_module->contact_universe();
(gdb) where
#0 0x0000000040085160 in orte_sds_base_contact_universe ()
at ../../../../../orte/mca/sds/base/sds_base_interface.c:29
#1 0x0000000040063e95 in orte_init_stage1 ()
at ../../../orte/runtime/orte_init_stage1.c:185
#2 0x0000000040017e7d in orte_system_init ()
at ../../../orte/runtime/orte_system_init.c:38
#3 0x00000000400148f5 in orte_init () at ../../../orte/runtime/orte_init.c:46
#4 0x000000004000dfc7 in main (argc=4, argv=0x7fbfffe8a8)
at ../../../../orte/tools/orterun/orterun.c:291
#5 0x0000002a95c0c017 in __libc_start_main () from /lib64/libc.so.6
#6 0x000000004000bf2a in _start ()
(gdb)
within mpirun

orte_sds_base_module is NULL here...
This is without a persistent orted; just mpirun...

CU,
ray

On Thursday 18 August 2005 16:57, Nathan DeBardeleben wrote:
> FYI, this only happens when I let OMPI compile 64-bit on Linux. When I
> add CFLAGS=FFLAGS=CXXFLAGS=-m32, then orted, my myriad of test
> codes, mpirun, registry subscription codes, and JNI all work like a champ.
> It appears to me that something is wrong with the 64-bit build.
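The 32-bit-on-64-bit workaround described above amounts to forcing -m32 at configure time. A minimal sketch (the prefix path is an example, not from the thread):

```shell
# Force a 32-bit build of Open MPI on a 64-bit (x86_64) Linux host.
# Adjust --prefix to your own install location.
./configure CFLAGS=-m32 CXXFLAGS=-m32 FFLAGS=-m32 \
    --prefix=$HOME/local/ompi-32 \
    --enable-static --disable-shared
make all install
```

This is a configure fragment only; whether it builds cleanly depends on having 32-bit versions of the toolchain and libraries installed.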
>
> -- Nathan
> Correspondence
> ---------------------------------------------------------------------
> Nathan DeBardeleben, Ph.D.
> Los Alamos National Laboratory
> Parallel Tools Team
> High Performance Computing Environments
> phone: 505-667-3428
> email: ndebard_at_[hidden]
> ---------------------------------------------------------------------
>
> Tim S. Woodall wrote:
> >Nathan,
> >
> >I'll try to reproduce this sometime this week - but I'm pretty swamped.
> >Is Greg also seeing the same behavior?
> >
> >Thanks,
> >Tim
> >
> >Nathan DeBardeleben wrote:
> >>To expand on this further, orte_init() seg faults on both bluesteel
> >>(32bit linux) and sparkplug (64bit linux) equally. The required
> >>condition is that orted must be running first (which of course we
> >>require for our work - a persistent orte daemon and registry).
> >>
> >>>[bluesteel]~/ptp > ./dump_info
> >>>Segmentation fault
> >>>[bluesteel]~/ptp > gdb dump_info
> >>>GNU gdb 6.1
> >>>Copyright 2004 Free Software Foundation, Inc.
> >>>GDB is free software, covered by the GNU General Public License, and
> >>>you are
> >>>welcome to change it and/or distribute copies of it under certain
> >>>conditions.
> >>>Type "show copying" to see the conditions.
> >>>There is absolutely no warranty for GDB. Type "show warranty" for
> >>>details.
> >>>This GDB was configured as "x86_64-suse-linux"...Using host
> >>>libthread_db library "/lib64/tls/libthread_db.so.1".
> >>>
> >>>(gdb) run
> >>>Starting program: /home/ndebard/ptp/dump_info
> >>>
> >>>Program received signal SIGSEGV, Segmentation fault.
> >>>0x0000000000000000 in ?? ()
> >>>(gdb) where
> >>>#0 0x0000000000000000 in ?? ()
> >>>#1 0x000000000045997d in orte_init_stage1 () at orte_init_stage1.c:419
> >>>#2 0x00000000004156a7 in orte_system_init () at orte_system_init.c:38
> >>>#3 0x00000000004151c7 in orte_init () at orte_init.c:46
> >>>#4 0x0000000000414cbb in main (argc=1, argv=0x7fbffff298) at
> >>>dump_info.c:185
> >>>(gdb)
> >>
> >>
> >>Nathan DeBardeleben wrote:
> >>>Just to clarify:
> >>>1: no orted started (meaning the mpirun or registry programs will
> >>>start one by themselves) causes those programs to lock up.
> >>>2: starting orted by hand (trying to get these programs to connect to
> >>>a centralized one) causes the connecting programs to seg fault.
> >>>
> >>>
> >>>Nathan DeBardeleben wrote:
> >>>>So I dropped an .ompi_ignore into that directory, reconfigured, and
> >>>>compile worked (yay!).
> >>>>However, not a lot of progress: mpirun locks up, all my registry test
> >>>>programs lock up as well. If I start the orted by hand, then any of my
> >>>>registry-calling programs segfault:
> >>>>>[sparkplug]~/ptp > gdb sub_test
> >>>>>GNU gdb 6.1
> >>>>>Copyright 2004 Free Software Foundation, Inc.
> >>>>>GDB is free software, covered by the GNU General Public License, and
> >>>>>you are
> >>>>>welcome to change it and/or distribute copies of it under certain
> >>>>>conditions.
> >>>>>Type "show copying" to see the conditions.
> >>>>>There is absolutely no warranty for GDB. Type "show warranty" for
> >>>>>details.
> >>>>>This GDB was configured as "x86_64-suse-linux"...Using host
> >>>>>libthread_db library "/lib64/tls/libthread_db.so.1".
> >>>>>
> >>>>>(gdb) run
> >>>>>Starting program: /home/ndebard/ptp/sub_test
> >>>>>
> >>>>>Program received signal SIGSEGV, Segmentation fault.
> >>>>>0x0000000000000000 in ?? ()
> >>>>>(gdb) where
> >>>>>#0 0x0000000000000000 in ?? ()
> >>>>>#1 0x00000000004598a5 in orte_init_stage1 () at orte_init_stage1.c:419
> >>>>>#2 0x00000000004155cf in orte_system_init () at orte_system_init.c:38
> >>>>>#3 0x00000000004150ef in orte_init () at orte_init.c:46
> >>>>>#4 0x00000000004148a1 in main (argc=1, argv=0x7fbffff178) at sub_test.c:60
> >>>>>(gdb)
> >>>>
> >>>>Yes, I recompiled everything.
> >>>>
> >>>>Here's an example of me trying something a little more complicated
> >>>>(which I believe locks up for the same reason - something borked with
> >>>>the registry interaction).
> >>>>
> >>>>>>[sparkplug]~/ompi-test > bjssub -s 10000 -n 10 -i bash
> >>>>>>Waiting for interactive job nodes.
> >>>>>>(nodes 18 16 17 18 19 20 21 22 23 24 25)
> >>>>>>Starting interactive job.
> >>>>>>NODES=16,17,18,19,20,21,22,23,24,25
> >>>>>>JOBID=18
> >>>>>
> >>>>>So I got my nodes.
> >>>>>
> >>>>>>ndebard_at_sparkplug:~/ompi-test> export OMPI_MCA_ptl_base_exclude=sm
> >>>>>>ndebard_at_sparkplug:~/ompi-test> export
> >>>>>>OMPI_MCA_pls_bproc_seed_priority=101
> >>>>>
> >>>>>and set these env vars as we need to use Greg's bproc; without the
> >>>>>2nd export, the machine's load maxes out and it locks up.
> >>>>>
> >>>>>>ndebard_at_sparkplug:~/ompi-test> bpstat
> >>>>>>Node(s) Status Mode       User    Group
> >>>>>>100-128 down   ---------- root    root
> >>>>>>0-15    up     ---x------ vchandu vchandu
> >>>>>>16-25   up     ---x------ ndebard ndebard
> >>>>>>26-27   up     ---x------ root    root
> >>>>>>28-30   up     ---x--x--x root    root
> >>>>>>ndebard_at_sparkplug:~/ompi-test> env | grep NODES
> >>>>>>NODES=16,17,18,19,20,21,22,23,24,25
> >>>>>
> >>>>>Yes, I really have the nodes.
> >>>>>
> >>>>>>ndebard_at_sparkplug:~/ompi-test> mpicc -o test-mpi test-mpi.c
> >>>>>>ndebard_at_sparkplug:~/ompi-test>
> >>>>>
> >>>>>recompile for good measure
> >>>>>
> >>>>>>ndebard_at_sparkplug:~/ompi-test> ls /tmp/openmpi-sessions-ndebard*
> >>>>>>/bin/ls: /tmp/openmpi-sessions-ndebard*: No such file or directory
> >>>>>
> >>>>>Proof that there's no leftover old directory.
> >>>>>
> >>>>>>ndebard_at_sparkplug:~/ompi-test> mpirun -np 1 test-mpi
> >>>>>
> >>>>>it never responds at this point - but I can kill it with ^C.
> >>>>>
> >>>>>>mpirun: killing job...
> >>>>>>Killed
> >>>>>>ndebard_at_sparkplug:~/ompi-test>
> >>>>
> >>>>
> >>>>Jeff Squyres wrote:
> >>>>>Is this what Tim Prins was working on?
> >>>>>
> >>>>>On Aug 16, 2005, at 5:21 PM, Tim S. Woodall wrote:
> >>>>>>I'm not sure why this is even building... Is someone working on this?
> >>>>>>I thought we had .ompi_ignore files in this directory.
> >>>>>>
> >>>>>>Tim
> >>>>>>
> >>>>>>Nathan DeBardeleben wrote:
> >>>>>>>So I'm seeing all these nice emails about people developing on OMPI
> >>>>>>>today yet I can't get it to compile. Am I out here in limbo on this
> >>>>>>>or
> >>>>>>>are others in the same boat? The errors I'm seeing are about some
> >>>>>>>bproc code calling undefined functions; they are included again below.
> >>>>>>>
> >>>>>>>
> >>>>>>>Nathan DeBardeleben wrote:
> >>>>>>>>Back from training and trying to test this but now OMPI doesn't
> >>>>>>>>compile
> >>>>>>>>
> >>>>>>>>at all:
> >>>>>>>>>gcc -DHAVE_CONFIG_H -I. -I. -I../../../../include
> >>>>>>>>>-I../../../../include -I../../../.. -I../../../..
> >>>>>>>>>-I../../../../include -I../../../../opal -I../../../../orte
> >>>>>>>>>-I../../../../ompi -g -Wall -Wundef -Wno-long-long -Wsign-compare
> >>>>>>>>>-Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic
> >>>>>>>>>-Werror-implicit-function-declaration -fno-strict-aliasing -MT
> >>>>>>>>>ras_lsf_bproc.lo -MD -MP -MF .deps/ras_lsf_bproc.Tpo -c
> >>>>>>>>>ras_lsf_bproc.c -o ras_lsf_bproc.o
> >>>>>>>>>ras_lsf_bproc.c: In function `orte_ras_lsf_bproc_node_insert':
> >>>>>>>>>ras_lsf_bproc.c:32: error: implicit declaration of function
> >>>>>>>>>`orte_ras_base_node_insert'
> >>>>>>>>>ras_lsf_bproc.c: In function `orte_ras_lsf_bproc_node_query':
> >>>>>>>>>ras_lsf_bproc.c:37: error: implicit declaration of function
> >>>>>>>>>`orte_ras_base_node_query'
> >>>>>>>>>make[4]: *** [ras_lsf_bproc.lo] Error 1
> >>>>>>>>>make[4]: Leaving directory
> >>>>>>>>>`/home/ndebard/ompi/orte/mca/ras/lsf_bproc'
> >>>>>>>>>make[3]: *** [all-recursive] Error 1
> >>>>>>>>>make[3]: Leaving directory `/home/ndebard/ompi/orte/mca/ras'
> >>>>>>>>>make[2]: *** [all-recursive] Error 1
> >>>>>>>>>make[2]: Leaving directory `/home/ndebard/ompi/orte/mca'
> >>>>>>>>>make[1]: *** [all-recursive] Error 1
> >>>>>>>>>make[1]: Leaving directory `/home/ndebard/ompi/orte'
> >>>>>>>>>make: *** [all-recursive] Error 1
> >>>>>>>>>[sparkplug]~/ompi >
> >>>>>>>>
> >>>>>>>>Clean SVN checkout this morning with configure:
> >>>>>>>>>[sparkplug]~/ompi > ./configure --enable-static --disable-shared
> >>>>>>>>>--without-threads --prefix=/home/ndebard/local/ompi
> >>>>>>>>>--with-devel-headers
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>Brian Barrett wrote:
> >>>>>>>>>This is now fixed in SVN. You should no longer need the
> >>>>>>>>>--build=i586... hack to compile 32 bit code on Opterons.
> >>>>>>>>>
> >>>>>>>>>Brian
> >>>>>>>>>
> >>>>>>>>>On Aug 12, 2005, at 3:17 PM, Brian Barrett wrote:
> >>>>>>>>>>On Aug 12, 2005, at 3:13 PM, Nathan DeBardeleben wrote:
> >>>>>>>>>>>We've got a 64bit Linux (SUSE) box here. For a variety of
> >>>>>>>>>>> reasons (Java, JNI, linking in with OMPI libraries, etc which I
> >>>>>>>>>>> won't get into)
> >>>>>>>>>>>I need to compile OMPI 32 bit (or get 64bit versions of a lot of
> >>>>>>>>>>>other
> >>>>>>>>>>>libraries).
> >>>>>>>>>>>I get various compile errors when I try different things, but
> >>>>>>>>>>>first
> >>>>>>>>>>>let
> >>>>>>>>>>>me explain the system we have:
> >>>>>>>>>>
> >>>>>>>>>><snip>
> >>>>>>>>>>
> >>>>>>>>>>>This goes on and on and on actually. And the 'is incompatible
> >>>>>>>>>>>with
> >>>>>>>>>>>i386:x86-64 output' looks to be repeated for every line before
> >>>>>>>>>>>this
> >>>>>>>>>>>error which actually caused the Make to bomb.
> >>>>>>>>>>>
> >>>>>>>>>>>Any suggestions at all? Surely someone must have tried to force
> >>>>>>>>>>>OMPI to
> >>>>>>>>>>>build in 32bit mode on a 64bit machine.
> >>>>>>>>>>
> >>>>>>>>>>I don't think anyone has tried to build 32 bit on an Opteron,
> >>>>>>>>>> which is the cause of the problems...
> >>>>>>>>>>
> >>>>>>>>>>I think I know how to fix this, but won't happen until later in
> >>>>>>>>>> the weekend. I can't think of a good workaround until then.
> >>>>>>>>>> Well, one possibility is to set the target like you were doing
> >>>>>>>>>> and disable ROMIO. Actually, you'll also need to disable
> >>>>>>>>>> Fortran 77. So something like:
> >>>>>>>>>>
> >>>>>>>>>>./configure [usual options] --build=i586-suse-linux --disable-io-
> >>>>>>>>>>romio --disable-f77
> >>>>>>>>>>
> >>>>>>>>>>might just do the trick.
> >>>>>>>>>>
> >>>>>>>>>>Brian
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>--
> >>>>>>>>>>Brian Barrett
> >>>>>>>>>>Open MPI developer
> >>>>>>>>>>http://www.open-mpi.org/
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>_______________________________________________
> >>>>>>>>>>devel mailing list
> >>>>>>>>>>devel_at_[hidden]
> >>>>>>>>>>http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
---------------------------------------------------------------------
Dipl.-Inf. Rainer Keller             email: keller_at_[hidden]
  High Performance Computing         Tel: ++49 (0)711-685 5858
    Center Stuttgart (HLRS)          Fax: ++49 (0)711-678 7626
  Nobelstrasse 19,  R. O0.030        http://www.hlrs.de/people/keller
  70550 Stuttgart