Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Bus error with openmpi-1.7.4rc1 on Solaris
From: Siegmar Gross (Siegmar.Gross_at_[hidden])
Date: 2013-12-19 08:52:15


Hi,

First of all, thank you very much for your help.

1st patch:

> Can you apply the following patch to a trunk tarball and see if it works
> for you?

2nd patch:

> Found the problem. Was accessing a boolean variable using intval. That
> is a bug that has gone unnoticed on all platforms but thankfully Solaris
> caught it.
>
> Please try the attached patch.

I applied both patches manually to openmpi-1.9a1r29972, because
my patch program couldn't apply them. Unfortunately, I still
get a bus error. I hope I didn't make a mistake applying your
patches, so here is a "diff" of my files against the originals
(the lines marked "<" are my patched versions). By the way,
I tried to apply your patches with "patch -b -i <your file>".
Do I need to use a different command?

tyr openmpi-1.9a1r29972 161 ls -l opal/mca/base/mca_base_var.c*
-rw-r--r-- 1 fd1026 inf 60418 Dec 19 08:35 opal/mca/base/mca_base_var.c
-rw-r--r-- 1 fd1026 inf 60236 Dec 19 03:05 opal/mca/base/mca_base_var.c.orig
tyr openmpi-1.9a1r29972 162 diff opal/mca/base/mca_base_var.c*
1685,1689c1685
< if (MCA_BASE_VAR_TYPE_BOOL == var->mbv_type) {
< ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->boolval, &tmp);
< } else {
< ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->intval, &tmp);
< }
---
>         ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->intval, &tmp);
tyr openmpi-1.9a1r29972 163 
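
The change above reads the union member that matches the variable's type instead of always reading intval. A minimal sketch of that idea in C follows. The names var_type_t, var_value_t and value_as_int are hypothetical stand-ins for illustration only; they are not the real definitions from mca_base_var.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the real Open MPI definitions. */
typedef enum { VAR_TYPE_INT, VAR_TYPE_BOOL } var_type_t;

typedef union {
    int  intval;
    bool boolval;
} var_value_t;

/* Return the stored value as an int, reading only the union member
 * that matches the type tag. */
static int value_as_int(var_type_t type, const var_value_t *value)
{
    assert(NULL != value);
    if (VAR_TYPE_BOOL == type) {
        return value->boolval ? 1 : 0;
    }
    return value->intval;
}
```

If only boolval was stored, reading intval would also pick up the bytes that the bool store did not touch, so the result would depend on byte order and on whatever was in the storage before.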
tyr openmpi-1.9a1r29972 165 ls -l opal/util/net.c*
-rw-r--r-- 1 fd1026 inf 12922 Dec 19 07:55 opal/util/net.c
-rw-r--r-- 1 fd1026 inf 12675 Dec 19 03:05 opal/util/net.c.orig
tyr openmpi-1.9a1r29972 166 diff opal/util/net.c*
267,271c267,268
<             struct sockaddr_in inaddr1, inaddr2;
<             /* Use temporary variables and memcpy's so that we don't
<                run into bus errors on Solaris/SPARC */
<             memcpy(&inaddr1, addr1, sizeof(inaddr1));
<             memcpy(&inaddr2, addr2, sizeof(inaddr2));
---
>             const struct sockaddr_in *inaddr1 = (struct sockaddr_in*) addr1;
>             const struct sockaddr_in *inaddr2 = (struct sockaddr_in*) addr2;
274,275c271,272
<             if((inaddr1.sin_addr.s_addr & netmask) ==
<                (inaddr2.sin_addr.s_addr & netmask)) {
---
>             if((inaddr1->sin_addr.s_addr & netmask) ==
>                (inaddr2->sin_addr.s_addr & netmask)) {
284,290c281,284
<             struct sockaddr_in6 inaddr1, inaddr2;
<             /* Use temporary variables and memcpy's so that we don't
<                run into bus errors on Solaris/SPARC */
<             memcpy(&inaddr1, addr1, sizeof(inaddr1));
<             memcpy(&inaddr2, addr2, sizeof(inaddr2));
<             struct in6_addr *a6_1 = (struct in6_addr*) &inaddr1.sin6_addr;
<             struct in6_addr *a6_2 = (struct in6_addr*) &inaddr2.sin6_addr;
---
>             const struct sockaddr_in6 *inaddr1 = (struct sockaddr_in6*) addr1;
>             const struct sockaddr_in6 *inaddr2 = (struct sockaddr_in6*) addr2;
>             struct in6_addr *a6_1 = (struct in6_addr*) &inaddr1->sin6_addr;
>             struct in6_addr *a6_2 = (struct in6_addr*) &inaddr2->sin6_addr;
tyr openmpi-1.9a1r29972 167 
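
The net.c change above copies each address into a local structure with memcpy instead of casting the incoming pointer, so the fields are read from properly aligned storage. A minimal sketch of the IPv4 case follows, assuming the netmask is already in network byte order. The function name same_ipv4_network and its signature are hypothetical and not taken from opal/util/net.c.

```c
#include <arpa/inet.h>   /* htonl, for callers that build a netmask */
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: returns 1 if both IPv4 socket addresses lie in
 * the same network, 0 otherwise. addr1 and addr2 may point to storage
 * with any alignment, because the bytes are copied into aligned local
 * variables before any field is read. */
static int same_ipv4_network(const void *addr1, const void *addr2,
                             uint32_t netmask)
{
    struct sockaddr_in in1, in2;

    assert(NULL != addr1 && NULL != addr2);
    memcpy(&in1, addr1, sizeof(in1));
    memcpy(&in2, addr2, sizeof(in2));

    return (in1.sin_addr.s_addr & netmask) ==
           (in2.sin_addr.s_addr & netmask);
}
```

For a /24 network a caller would pass htonl(0xFFFFFF00) as the netmask. On SPARC a direct load through a misaligned pointer raises SIGBUS, which is why the copy matters there even though the cast appears to work on other platforms.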
Here is my debug information.
tyr fd1026 52 cd /usr/local/openmpi-1.9_64_cc/bin/
tyr bin 53 /opt/solstudio12.3/bin/sparcv9/dbx ompi_info
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.9' in your .dbxrc
Reading ompi_info
Reading ld.so.1
Reading libmpi.so.0.0.0
Reading libopen-rte.so.0.0.0
Reading libopen-pal.so.0.0.0
Reading libsendfile.so.1
Reading libpicl.so.1
Reading libkstat.so.1
Reading liblgrp.so.1
Reading libsocket.so.1
Reading libnsl.so.1
Reading librt.so.1
Reading libm.so.2
Reading libthread.so.1
Reading libc.so.1
Reading libdoor.so.1
Reading libaio.so.1
Reading libmd.so.1
(dbx) run -a
Running: ompi_info -a 
(process id 10998)
Reading libc_psr.so.1
...
    MCA compress: parameter "compress_base_verbose" (current value:
                  "-1", data source: default, level: 8 dev/detail,
                  type: int)
                  Verbosity level for the compress framework (0 = no
                  verbosity)
t_at_1 (l_at_1) signal BUS (invalid address alignment) in var_value_string
  at line 1680 in file "mca_base_var.c"
 1680  ret = asprintf (value_string, var_type_formats[var->mbv_type], value[0]);
(dbx) 
(dbx) 
(dbx) check -all
dbx: warning: check -all will be turned on in the next run of the process
access checking - OFF
memuse checking - OFF
(dbx) run -a
Running: ompi_info -a 
(process id 11000)
Reading rtcapihook.so
Reading libdl.so.1
Reading rtcaudit.so
Reading libmapmalloc.so.1
Reading rtcboot.so
Reading librtc.so
Reading libmd_psr.so.1
RTC: Enabling Error Checking...
RTC: Using UltraSparc trap mechanism
RTC: See `help rtc showmap' and `help rtc limitations' for details.
RTC: Running program...
Read from uninitialized (rui) on thread 1:
Attempting to read 4 bytes at address 0xffffffff7fffd5f8
    which is 184 bytes above the current stack pointer
Variable is 'index'
t_at_1 (l_at_1) stopped in var_find at line 802 in file "mca_base_var.c"
  802       return (OPAL_SUCCESS != ret) ? ret : index;
(dbx) 
In my opinion it is the same error as before.
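
The runtime check reports a read of the uninitialized variable 'index' in var_find. One pattern that can produce such a report is a helper that writes its output parameter only on success. The output above does not confirm that var_find follows this pattern, and the report may also be a false positive. The following is only a minimal sketch under that assumption, and sketch_lookup, sketch_find and the constants are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_SUCCESS    0
#define SKETCH_NOT_FOUND -1

static const char *const sketch_names[] = { "alpha", "beta", "gamma" };

/* Hypothetical lookup: writes *index only when the name is found. */
static int sketch_lookup(const char *name, int *index)
{
    size_t i;

    for (i = 0; i < sizeof(sketch_names) / sizeof(sketch_names[0]); ++i) {
        if (0 == strcmp(name, sketch_names[i])) {
            *index = (int) i;
            return SKETCH_SUCCESS;
        }
    }
    return SKETCH_NOT_FOUND;
}

/* Returns the index of the name, or a negative error code. */
static int sketch_find(const char *name)
{
    int index = -1;   /* defined value even if the lookup fails */
    int ret;

    assert(NULL != name);
    ret = sketch_lookup(name, &index);
    return (SKETCH_SUCCESS != ret) ? ret : index;
}
```

Initializing index at its declaration gives it a defined value on every path, so a checker has nothing to report regardless of how the compiler evaluates the conditional expression.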
I also still get a bus error with "make check".
tyr bin 54 cd /export2/src/openmpi-1.9/openmpi-1.9a1r29972-SunOS.sparc.64_cc/test/datatype/.libs/
tyr .libs 55 /opt/solstudio12.3/bin/sparcv9/dbx ddt_raw
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.9' in your .dbxrc
Reading ddt_raw
Reading ld.so.1
Reading libmpi.so.0.0.0
Reading libopen-rte.so.0.0.0
Reading libopen-pal.so.0.0.0
Reading libsendfile.so.1
Reading libpicl.so.1
Reading libkstat.so.1
Reading liblgrp.so.1
Reading libsocket.so.1
Reading libnsl.so.1
Reading librt.so.1
Reading libm.so.2
Reading libthread.so.1
Reading libc.so.1
Reading libdoor.so.1
Reading libaio.so.1
Reading libmd.so.1
(dbx) run
Running: ddt_raw 
(process id 11018)
Reading libc_psr.so.1
#
 * TEST INVERSED VECTOR
 #
t_at_1 (l_at_1) signal BUS (invalid address alignment) in opal_convertor_raw
  at line 71 in file "opal_convertor_raw.c"
   71       DO_DEBUG( opal_output( 0, "opal_convertor_raw( %p, {%p, %u}, %lu )\n", (void*)pConvertor,
(dbx) 
Again, I think it is the same error. My small program has the
same problem.
tyr small_prog 62 mpicc init_finalize.c
tyr small_prog 63 /opt/solstudio12.3/bin/sparcv9/dbx \
  /usr/local/openmpi-1.9_64_cc/bin/mpiexec
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.9' in your .dbxrc
Reading mpiexec
Reading ld.so.1
Reading libopen-rte.so.0.0.0
Reading libopen-pal.so.0.0.0
Reading libsendfile.so.1
Reading libpicl.so.1
Reading libkstat.so.1
Reading liblgrp.so.1
Reading libsocket.so.1
Reading libnsl.so.1
Reading librt.so.1
Reading libm.so.2
Reading libthread.so.1
Reading libc.so.1
Reading libdoor.so.1
Reading libaio.so.1
Reading libmd.so.1
(dbx) 
(dbx) run -np 1 a.out
Running: mpiexec -np 1 a.out 
(process id 11050)
Reading libc_psr.so.1
Reading mca_shmem_mmap.so
Reading libmp.so.2
Reading libscf.so.1
Reading libuutil.so.1
Reading libgen.so.1
Reading mca_shmem_posix.so
Reading mca_shmem_sysv.so
Reading mca_ess_env.so
Reading mca_ess_hnp.so
Reading mca_ess_singleton.so
Reading mca_ess_tool.so
Reading mca_pstat_test.so
Reading mca_state_app.so
Reading mca_state_hnp.so
Reading mca_state_novm.so
Reading mca_state_orted.so
Reading mca_state_staged_hnp.so
Reading mca_state_staged_orted.so
Reading mca_state_tool.so
Reading mca_errmgr_default_app.so
Reading mca_errmgr_default_hnp.so
Reading mca_errmgr_default_orted.so
Reading mca_errmgr_default_tool.so
Reading mca_plm_rsh.so
Reading mca_oob_tcp.so
Reading mca_rml_oob.so
Reading mca_routed_binomial.so
Reading mca_routed_debruijn.so
Reading mca_routed_direct.so
Reading mca_routed_radix.so
Reading mca_db_hash.so
Reading mca_db_print.so
Reading mca_grpcomm_bad.so
Reading mca_ras_simulator.so
Reading mca_rmaps_lama.so
Reading mca_rmaps_mindist.so
Reading mca_rmaps_ppr.so
Reading mca_rmaps_rank_file.so
Reading mca_rmaps_resilient.so
Reading mca_rmaps_round_robin.so
Reading mca_rmaps_seq.so
Reading mca_rmaps_staged.so
Reading mca_odls_default.so
Reading mca_iof_hnp.so
Reading mca_iof_mr_hnp.so
Reading mca_iof_mr_orted.so
Reading mca_iof_orted.so
Reading mca_iof_tool.so
Reading mca_filem_raw.so
Reading mca_dfs_app.so
Reading mca_dfs_orted.so
Reading mca_dfs_test.so
Now the program hangs.
^Cdbx: warning: Interrupt ignored but forwarded to child.
t_at_1 (l_at_1) signal INT (Interrupt) in __pollsys at 0xffffffff7d5dc740
0xffffffff7d5dc740: __pollsys+0x0004:   ta       %icc,0x0000000000000040
Current function is orterun
 1049           opal_event_loop(orte_event_base, OPAL_EVLOOP_ONCE);
(dbx) 
(dbx) 
(dbx) 
(dbx) check -all
dbx: warning: check -all will be turned on in the next run of the process
access checking - OFF
memuse checking - OFF
(dbx) run -np 1 a.out
Running: mpiexec -np 1 a.out 
(process id 11054)
Reading rtcapihook.so
Reading libdl.so.1
Reading rtcaudit.so
Reading libmapmalloc.so.1
Reading rtcboot.so
Reading librtc.so
Reading libmd_psr.so.1
RTC: Enabling Error Checking...
RTC: Using UltraSparc trap mechanism
RTC: See `help rtc showmap' and `help rtc limitations' for details.
RTC: Running program...
Read from uninitialized (rui) on thread 1:
Attempting to read 4 bytes at address 0xffffffff7fffd438
    which is 184 bytes above the current stack pointer
Variable is 'index'
t_at_1 (l_at_1) stopped in var_find at line 802 in file "mca_base_var.c"
  802       return (OPAL_SUCCESS != ret) ? ret : index;
(dbx) 
I'm sorry to cause you so much trouble with Solaris. I still
hope that you can solve the problem(s). Once again, thank you
very much in advance for your help.
Kind regards
Siegmar