Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openMPI shared with NFS, but says different version
From: Cristobal Navarro (axischire_at_[hidden])
Date: 2010-07-28 14:53:29


To clarify:

I can still run a hello world on all 16 threads, but after a few more
repetitions of the example the kernel crashes :(

fcluster_at_agua:~$ mpirun --hostfile localhostfile -np 16 testMPI/hola
Process 0 on agua out of 16
Process 2 on agua out of 16
Process 14 on agua out of 16
Process 8 on agua out of 16
Process 1 on agua out of 16
Process 7 on agua out of 16
Process 9 on agua out of 16
Process 3 on agua out of 16
Process 4 on agua out of 16
Process 10 on agua out of 16
Process 15 on agua out of 16
Process 5 on agua out of 16
Process 6 on agua out of 16
Process 11 on agua out of 16
Process 13 on agua out of 16
Process 12 on agua out of 16
fcluster_at_agua:~$

On Wed, Jul 28, 2010 at 2:47 PM, Cristobal Navarro <axischire_at_[hidden]> wrote:

>
>
> On Wed, Jul 28, 2010 at 11:09 AM, Gus Correa <gus_at_[hidden]> wrote:
>
>> Hi Cristobal
>>
>> In case you are not using the full path name for mpiexec/mpirun,
>> what does "which mpirun" say?
>>
>
> --> $which mpirun
> /opt/openmpi-1.4.2
>
>>
>> Oftentimes this is a source of confusion: old versions may
>> come first on the PATH.
>>
>> Gus
>>
>
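To illustrate Gus's point about older versions coming first on the PATH, here is a minimal Python sketch that lists every `mpirun` found on the PATH in search order, so a shadowing copy is easy to spot. The helper name `find_all_on_path` is hypothetical (it is not part of Open MPI), and the sketch assumes a POSIX-style PATH.

```python
import os


def find_all_on_path(name, path=None):
    """Return every executable called `name` on PATH, in search order.

    Hypothetical helper for illustration. Empty PATH entries (which a
    POSIX shell treats as the current directory) are skipped here.
    """
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, name)
        # Keep only regular files that the current user may execute.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    # The first hit is the one the shell would run; any later hits are shadowed.
    for index, hit in enumerate(find_all_on_path("mpirun")):
        print(("used:     " if index == 0 else "shadowed: ") + hit)
```

If more than one line is printed, the "shadowed" entries are candidates for the leftover older installation.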
> The Open MPI version problem is now gone; I can confirm that the version is
> consistent now :), thanks.
>
> However, I keep getting this kernel crash randomly when I execute with -np
> higher than 5.
> These are Xeons with Hyperthreading on; is that a problem?
>
> I'm trying to locate the kernel error in the logs, but after rebooting from a
> crash, the error is not in kern.log (nor in kern.log.1).
> All I remember is that it starts with "Kernel BUG..."
> and at some point it mentions a certain CPU X, where that CPU can be any from 0
> to 15 (I'm testing only on the main node). Does anyone know where the log of the
> kernel error could be?
>
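For the log search described above, a small Python sketch can scan several log files for the "Kernel BUG" text in one go. The function name `find_kernel_bug_lines` is hypothetical, and the `/var/log/kern.log*` and `/var/log/syslog*` locations are assumptions based on a typical Debian/Ubuntu layout. A hard crash may never be written to disk at all, so an empty result does not prove that no BUG occurred.

```python
import glob


def find_kernel_bug_lines(paths, marker="kernel bug"):
    """Return (path, line_number, text) for each line containing `marker`.

    Hypothetical helper for illustration. Matching is case-insensitive.
    Unreadable files are skipped; compressed rotations (.gz) are not
    decompressed, so they will simply not match.
    """
    found = []
    for path in paths:
        try:
            with open(path, "r", errors="replace") as log:
                for number, line in enumerate(log, start=1):
                    if marker in line.lower():
                        found.append((path, number, line.rstrip("\n")))
        except OSError:
            continue
    return found


if __name__ == "__main__":
    # Assumed log locations; adjust for the distribution in use.
    candidates = sorted(glob.glob("/var/log/kern.log*") + glob.glob("/var/log/syslog*"))
    for path, number, text in find_kernel_bug_lines(candidates):
        print(f"{path}:{number}: {text}")
```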
>>
>> Cristobal Navarro wrote:
>>
>>>
>>> On Tue, Jul 27, 2010 at 7:29 PM, Gus Correa <gus_at_[hidden]> wrote:
>>>
>>> Hi Cristobal
>>>
>>> Does it run only on the head node alone?
>>> (Fuego? Agua? Acatenango?)
>>> Try to put only the head node on the hostfile and execute with
>>> mpiexec.
>>> This may help sort out what is going on.
>>> Hopefully it will run on the head node.
>>>
>>> --> I will try with only the head node and post the results back.
>>>
>>> Also, do you have Infiniband connecting the nodes?
>>>
>>> --> No, we are just using a normal 100 Mbit/s network, since I am still
>>> only testing.
>>>
>>> The error messages refer to the openib btl (i.e. Infiniband),
>>> and complain of
>>> "perhaps a missing symbol, or compiled for a different
>>> version of Open MPI?".
>>> It sounds like a mixup of versions/builds.
>>>
>>>
>>> --> i agree, somewhere there must be the remains of the older version
>>>
>>> Did you configure/build OpenMPI from source, or did you install
>>> it with apt-get?
>>> It may be easier/less confusing to install from source.
>>> If you did, what configure options did you use?
>>>
>>>
>>> --> I installed from source: ./configure --prefix=/opt/openmpi-1.4.2
>>> --with-sge --without-xgrid --disable-static
>>>
>>> Also, as for the OpenMPI runtime environment,
>>> it is not enough to set it on the command line,
>>> because it will be effective only on the head node.
>>> You need to either add the OpenMPI directories to PATH and LD_LIBRARY_PATH
>>> in your .bashrc/.cshrc files (assuming these files and your home
>>> directory are *also* shared with the nodes via NFS),
>>> or use the --prefix option of mpiexec to point to the OpenMPI main
>>> directory.
>>>
>>>
>>> yes, all nodes have their PATH and LD_LIBRARY_PATH set up properly inside
>>> the login scripts ( .bashrc in my case )
>>>
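The PATH and LD_LIBRARY_PATH check discussed above can be sketched in Python as follows. The function name `check_openmpi_env` is hypothetical, and the sketch assumes the usual `bin` and `lib` subdirectories under the install prefix (some systems use `lib64` instead).

```python
import os


def check_openmpi_env(prefix, env=None):
    """Report whether <prefix>/bin is on PATH and <prefix>/lib is on
    LD_LIBRARY_PATH.

    Hypothetical helper for illustration; the bin/lib layout under the
    prefix is an assumption.
    """
    if env is None:
        env = os.environ

    def entries(var):
        # Normalise entries so a trailing slash does not cause a false negative.
        return [os.path.normpath(p) for p in env.get(var, "").split(os.pathsep) if p]

    bin_dir = os.path.normpath(os.path.join(prefix, "bin"))
    lib_dir = os.path.normpath(os.path.join(prefix, "lib"))
    return {
        "PATH": bin_dir in entries("PATH"),
        "LD_LIBRARY_PATH": lib_dir in entries("LD_LIBRARY_PATH"),
    }


if __name__ == "__main__":
    # /opt/openmpi-1.4.2 is the prefix given in this thread.
    print(check_openmpi_env("/opt/openmpi-1.4.2"))
```

The result is only meaningful if the script runs on each node in the same kind of non-interactive shell that mpirun uses to start remote processes, because an interactive login may read different startup files.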
>>> Needless to say, you need to check and ensure that the OpenMPI
>>> directory (and maybe your home directory, and your work directory)
>>> is (are) really mounted on the nodes.
>>>
>>>
>>> --> yes, doublechecked that they are
>>>
>>> I hope this helps,
>>>
>>>
>>> --> thanks really!
>>>
>>> Gus Correa
>>>
>>> Update: I just reinstalled Open MPI with the same parameters, and it
>>> seems that the problem has gone. I couldn't test it fully, but when I
>>> get back to the lab I'll confirm.
>>>
>>> best regards! Cristobal
>>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>
>
>