Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Error with multiple MPI runs inside one Slurm allocation (with QLogic PSM)
From: Gutierrez, Samuel K (samuel_at_[hidden])
Date: 2012-04-02 12:20:50


Sorry to hijack the thread, but I have a question regarding the failed PSM initialization.

Some of our users oversubscribe a node with multiple mpiruns in order to run their regression tests. Recently, a user reported the same "Could not detect network connectivity" error.

My question: is there a way to allow this type of behavior? That is, to oversubscribe a node with multiple mpiruns. For example, say I have a node with 16 processing elements and I want to run 8 instances of "mpirun -n 3 mpi_foo" on that single node simultaneously; I don't care about performance.
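
For concreteness, here is roughly the launch pattern I have in mind (mpi_foo stands in for the users' regression-test binary):

  # 8 concurrent mpiruns of 3 ranks each on one 16-PE node, deliberately oversubscribed
  for i in $(seq 1 8); do
      mpirun -n 3 mpi_foo &
  done
  wait   # block until all 8 runs have finished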

Please note that oversubscription within one node and a **single** mpirun works as expected. The error only shows up when another mpirun wants to join the party.

Thanks,

Lost in Los Alamos

 
On Apr 2, 2012, at 9:40 AM, Ralph Castain wrote:

> I'm not sure the 1.4 series can support that behavior. Each mpirun only knows about itself - it has no idea something else is going on.
>
> If you attempted to bind, all procs of the same rank from each run would bind to the same CPU.
>
> All you can really do is use -host to tell the fourth run not to use the first node. Or use the devel trunk, which has more ability to separate runs.
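>
> For example, something along these lines (using the node names from the output below) would keep that run off the first node:
>
>   # give this particular mpirun an explicit host list that excludes cn1381
>   mpirun -n 4 -host cn1382 get-allowed-cpu-ompi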
>
> Sent from my iPad
>
> On Apr 2, 2012, at 6:53 AM, Rémi Palancher <remi_at_[hidden]> wrote:
>
>> Hi there,
>>
>> I'm encountering a problem when trying to run multiple mpiruns in parallel inside
>> one SLURM allocation spanning multiple nodes, using a QLogic interconnect with
>> PSM.
>>
>> I'm using Open MPI version 1.4.5 compiled with GCC 4.4.5 on Debian Lenny.
>>
>> My cluster is composed of 12-core nodes.
>>
>> Here is how I'm able to reproduce the problem:
>>
>> Allocate 20 CPUs on 2 nodes:
>>
>> frontend $ salloc -N 2 -n 20
>> frontend $ srun hostname | sort | uniq -c
>> 12 cn1381
>> 8 cn1382
>>
>> My job was allocated 12 CPUs on node cn1381 and 8 CPUs on cn1382.
>>
>> For each task, my test MPI program parses the value of Cpus_allowed_list in
>> /proc/$PID/status and prints it.
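>>
>> (Each task essentially does the equivalent of:
>>
>>   grep Cpus_allowed_list /proc/self/status
>>
>> and prints the value together with its task number and hostname.)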
>>
>> If I run it on all 20 allocated CPUs, it works well:
>>
>> frontend $ mpirun get-allowed-cpu-ompi 1
>> Launch 1 Task 00 of 20 (cn1381): 0
>> Launch 1 Task 01 of 20 (cn1381): 1
>> Launch 1 Task 02 of 20 (cn1381): 2
>> Launch 1 Task 03 of 20 (cn1381): 3
>> Launch 1 Task 04 of 20 (cn1381): 4
>> Launch 1 Task 05 of 20 (cn1381): 7
>> Launch 1 Task 06 of 20 (cn1381): 5
>> Launch 1 Task 07 of 20 (cn1381): 9
>> Launch 1 Task 08 of 20 (cn1381): 8
>> Launch 1 Task 09 of 20 (cn1381): 10
>> Launch 1 Task 10 of 20 (cn1381): 6
>> Launch 1 Task 11 of 20 (cn1381): 11
>> Launch 1 Task 12 of 20 (cn1382): 4
>> Launch 1 Task 13 of 20 (cn1382): 5
>> Launch 1 Task 14 of 20 (cn1382): 6
>> Launch 1 Task 15 of 20 (cn1382): 7
>> Launch 1 Task 16 of 20 (cn1382): 8
>> Launch 1 Task 17 of 20 (cn1382): 10
>> Launch 1 Task 18 of 20 (cn1382): 9
>> Launch 1 Task 19 of 20 (cn1382): 11
>>
>> Here you can see that Slurm gave me CPUs 0-11 on cn1381 and CPUs 4-11 on cn1382.
>>
>> Now I'd like to run multiple MPI runs in parallel, 4 tasks each, inside my job.
>>
>> frontend $ cat params.txt
>> 1
>> 2
>> 3
>> 4
>> 5
>>
>> It works well when I launch 3 runs in parallel, which only use the 12 CPUs of
>> the first node (3 runs x 4 tasks = 12 CPUs):
>>
>> frontend $ xargs -P 3 -n 1 mpirun -n 4 get-allowed-cpu-ompi < params.txt
>> Launch 2 Task 00 of 04 (cn1381): 1
>> Launch 2 Task 01 of 04 (cn1381): 2
>> Launch 2 Task 02 of 04 (cn1381): 4
>> Launch 2 Task 03 of 04 (cn1381): 7
>> Launch 1 Task 00 of 04 (cn1381): 0
>> Launch 1 Task 01 of 04 (cn1381): 3
>> Launch 1 Task 02 of 04 (cn1381): 5
>> Launch 1 Task 03 of 04 (cn1381): 6
>> Launch 3 Task 00 of 04 (cn1381): 9
>> Launch 3 Task 01 of 04 (cn1381): 8
>> Launch 3 Task 02 of 04 (cn1381): 10
>> Launch 3 Task 03 of 04 (cn1381): 11
>> Launch 4 Task 00 of 04 (cn1381): 0
>> Launch 4 Task 01 of 04 (cn1381): 3
>> Launch 4 Task 02 of 04 (cn1381): 1
>> Launch 4 Task 03 of 04 (cn1381): 5
>> Launch 5 Task 00 of 04 (cn1381): 2
>> Launch 5 Task 01 of 04 (cn1381): 4
>> Launch 5 Task 02 of 04 (cn1381): 7
>> Launch 5 Task 03 of 04 (cn1381): 6
>>
>> But when I try to launch 4 or more runs in parallel, which need to use the
>> CPUs of the other node as well, it fails:
>>
>> frontend $ xargs -P 4 -n 1 mpirun -n 4 get-allowed-cpu-ompi < params.txt
>> cn1381.23245ipath_userinit: assign_context command failed: Network is down
>> cn1381.23245can't open /dev/ipath, network down (err=26)
>> --------------------------------------------------------------------------
>> PSM was unable to open an endpoint. Please make sure that the network link is
>> active on the node and the hardware is functioning.
>>
>> Error: Could not detect network connectivity
>> --------------------------------------------------------------------------
>> cn1381.23248ipath_userinit: assign_context command failed: Network is down
>> cn1381.23248can't open /dev/ipath, network down (err=26)
>> --------------------------------------------------------------------------
>> PSM was unable to open an endpoint. Please make sure that the network link is
>> active on the node and the hardware is functioning.
>>
>> Error: Could not detect network connectivity
>> --------------------------------------------------------------------------
>> cn1381.23247ipath_userinit: assign_context command failed: Network is down
>> cn1381.23247can't open /dev/ipath, network down (err=26)
>> cn1381.23249ipath_userinit: assign_context command failed: Network is down
>> cn1381.23249can't open /dev/ipath, network down (err=26)
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or environment
>> problems. This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>>
>> PML add procs failed
>> --> Returned "Error" (-1) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> *** The MPI_Init() function was called before MPI_INIT was invoked.
>> *** This is disallowed by the MPI standard.
>> *** Your MPI job will now abort.
>> *** The MPI_Init() function was called before MPI_INIT was invoked.
>> *** This is disallowed by the MPI standard.
>> *** Your MPI job will now abort.
>> *** The MPI_Init() function was called before MPI_INIT was invoked.
>> *** This is disallowed by the MPI standard.
>> *** Your MPI job will now abort.
>> [cn1381:23245] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
>> *** The MPI_Init() function was called before MPI_INIT was invoked.
>> *** This is disallowed by the MPI standard.
>> *** Your MPI job will now abort.
>> [cn1381:23247] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
>> [cn1381:23242] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
>> [cn1381:23243] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
>> --------------------------------------------------------------------------
>> mpirun has exited due to process rank 2 with PID 23245 on
>> node cn1381 exiting without calling "finalize". This may
>> have caused other processes in the application to be
>> terminated by signals sent by mpirun (as reported here).
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or environment
>> problems. This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>>
>> PML add procs failed
>> --> Returned "Error" (-1) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> *** The MPI_Init() function was called before MPI_INIT was invoked.
>> *** This is disallowed by the MPI standard.
>> *** Your MPI job will now abort.
>> *** The MPI_Init() function was called before MPI_INIT was invoked.
>> *** This is disallowed by the MPI standard.
>> *** Your MPI job will now abort.
>> [cn1381:23246] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
>> *** The MPI_Init() function was called before MPI_INIT was invoked.
>> *** This is disallowed by the MPI standard.
>> *** Your MPI job will now abort.
>> [cn1381:23248] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
>> *** The MPI_Init() function was called before MPI_INIT was invoked.
>> *** This is disallowed by the MPI standard.
>> *** Your MPI job will now abort.
>> [cn1381:23249] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
>> [cn1381:23244] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
>> --------------------------------------------------------------------------
>> mpirun has exited due to process rank 2 with PID 23248 on
>> node cn1381 exiting without calling "finalize". This may
>> have caused other processes in the application to be
>> terminated by signals sent by mpirun (as reported here).
>> --------------------------------------------------------------------------
>> [ivanoe1:24981] 3 more processes have sent help message help-mtl-psm.txt / unable to open endpoint
>> [ivanoe1:24981] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>> [ivanoe1:24981] 3 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure
>> [ivanoe1:24983] 3 more processes have sent help message help-mtl-psm.txt / unable to open endpoint
>> [ivanoe1:24983] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>> [ivanoe1:24983] 3 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure
>> Launch 3 Task 00 of 04 (cn1381): 0
>> Launch 3 Task 01 of 04 (cn1381): 1
>> Launch 3 Task 02 of 04 (cn1381): 2
>> Launch 3 Task 03 of 04 (cn1381): 3
>> Launch 1 Task 00 of 04 (cn1381): 4
>> Launch 1 Task 01 of 04 (cn1381): 5
>> Launch 1 Task 02 of 04 (cn1381): 6
>> Launch 1 Task 03 of 04 (cn1381): 8
>> Launch 5 Task 00 of 04 (cn1381): 7
>> Launch 5 Task 01 of 04 (cn1381): 9
>> Launch 5 Task 02 of 04 (cn1381): 10
>> Launch 5 Task 03 of 04 (cn1381): 11
>>
>> As far as I can understand, Open MPI tries to launch all runs on the same node
>> (cn1381 in my case) and forgets about the other node (cn1382). Am I right? How can I
>> avoid this behaviour?
>>
>> Here are the Open MPI variables set in my environment:
>> $ env | grep OMPI
>> OMPI_MCA_mtl=psm
>> OMPI_MCA_pml=cm
>>
>> You can find attached to this email the config.log and the output of the
>> following commands:
>> frontend $ ompi_info --all > ompi_info_all.txt
>> frontend $ mpirun --bynode --npernode 1 --tag-output ompi_info -v ompi full \
>> --parsable > ompi_nodes.txt
>>
>> Thanks in advance for any kind of help!
>>
>> Best regards,
>> --
>> Rémi Palancher
>> http://rezib.org
>> <config.log.gz>
>> <ompi_info_all.txt.gz>
>> <ompi_nodes.txt>