Simply to keep track of what's going on:

I checked the build environment for openmpi and the system's settings; they were built using gcc 3.4.4 with -Os, which is reputed to be unstable and problematic with this compiler version. I've asked Prasanna to rebuild using -O2, but this could take a while since the entire system (or at least all the libs openmpi links to) needs to be rebuilt.
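
For anyone who wants to repeat the check on their own nodes, here is a rough sketch of how the optimization level could be read back. It assumes Portage recorded the build flags in a CFLAGS file inside the package's /var/db/pkg directory; the helper name check_opt_level is made up for illustration.

```shell
# Sketch only: report the optimization flag a package was built with.
# Assumption: the package directory contains a CFLAGS file written by Portage.
# check_opt_level is a hypothetical helper, not part of any tool.
check_opt_level() {
    cflags_file="$1/CFLAGS"
    if [ ! -r "$cflags_file" ]; then
        echo "unknown"
        return 1
    fi
    # Keep the last -O flag seen, since gcc honours the last one given.
    opt=$(awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^-O/) last = $i }
               END { print last }' "$cflags_file")
    echo "${opt:-none}"
}

# Example call (path taken from this thread):
#   check_opt_level /var/db/pkg/sys-cluster/openmpi-1.2.7
```

The same check would have to be repeated for each library openmpi links to, since they were all built with the same system-wide flags.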


Eric Thibodeau wrote:

    Please send me your /etc/make.conf and the contents of /var/db/pkg/sys-cluster/openmpi-1.2.7/

You can package this with the following command line:

tar -cjf data.tbz /etc/make.conf /var/db/pkg/sys-cluster/openmpi-1.2.7/

And simply send me the data.tbz file.



Prasanna Ranganathan wrote:

 I did make sure at the beginning that only eth0 was activated on all the
nodes. Nevertheless, I am currently verifying the NIC configuration on all
the nodes and making sure things are as expected.

While trying different things, I did come across this peculiar error which I
had detailed in one of my previous mails in this thread.

I am testing the helloWorld program in the following trivial case:

mpirun -np 1 -host localhost /main/mpiHelloWorld

This works fine. However, the same command with --debug-daemons added:


mpirun -np 1 -host localhost --debug-daemons /main/mpiHelloWorld

always fails as follows:

Daemon [0,0,1] checking in as pid 2059 on host localhost
[idx1:02059] [0,0,1] orted: received launch callback
idx1 is node 0 of 1
ranks sum to 0
[idx1:02059] [0,0,1] orted_recv_pls: received message from [0,0,0]
[idx1:02059] [0,0,1] orted_recv_pls: received exit
[idx1:02059] *** Process received signal ***
[idx1:02059] Signal: Segmentation fault (11)
[idx1:02059] Signal code:  (128)
[idx1:02059] Failing at address: (nil)
[idx1:02059] [ 0] /lib/ [0x2afa8c597f30]
[idx1:02059] [ 1] /usr/lib64/
[idx1:02059] [ 2] /usr/lib64/
[idx1:02059] [ 3] /usr/lib64/
[idx1:02059] [ 4] orted(main+0x8a6) [0x4024ae]
[idx1:02059] *** End of error message ***

The failure happens with more verbose output when using the -d flag.

Does this point to some bug in OpenMPI or am I missing something here?

I have attached ompi_info output on this node.




_______________________________________________ users mailing list