MPIEXEC and Intel MPI library 1.0.1

Anton Starikov A.Starikov at utwente.nl
Mon May 2 12:06:52 EDT 2005


Can you describe in detail:
1) Which fabric do you use?
2) You wrote that the version is 1.0.1; does that actually mean version
1.0.035? So far that is the latest version available from Intel, and I
have tested my patch with it.


The MPI_Finalize hang itself may be related to a general problem with
MPICH2 that is described on the mpiexec site. Try adding "-kill" to the
mpiexec options.

Anton Starikov.

Thomas Zeiser wrote:
> Dear All!
> 
> I just tested the patch to get Intel MPI running.  
> 
> With version 1.0 of Intel MPI everything is fine. 
> 
> However, when I try the recent update (Intel MPI 1.0.1), I get very
> strange results:
> 
> - I start the MPI program with
>   mpiexec -comm pmi [-verbose] ./test-f-g77-intelmpi101
> 
> - the processes are correctly started on all nodes (twice each on
>   snode164 and snode163; verified with "ps")
> 
>   mpiexec: resolve_exe: using absolute exe "./test-f-g77-intelmpi101".
>   mpiexec: accept_pmi_conn: got request: cmd=initack pmiid=0.
>   mpiexec: accept_pmi_conn: rank 0 checks in.
>   mpiexec: accept_pmi_conn: got request: cmd=init pmi_version=1.1.
>   mpiexec: accept_pmi_conn: got request: cmd=initack pmiid=1.
>   mpiexec: accept_pmi_conn: rank 1 checks in.
>   mpiexec: accept_pmi_conn: got request: cmd=init pmi_version=1.1.
>   mpiexec: accept_pmi_conn: got request: cmd=initack pmiid=2.
>   mpiexec: accept_pmi_conn: rank 2 checks in.
>   mpiexec: accept_pmi_conn: got request: cmd=init pmi_version=1.1.
>   mpiexec: accept_pmi_conn: got request: cmd=initack pmiid=3.
>   mpiexec: accept_pmi_conn: rank 3 checks in.
>   mpiexec: accept_pmi_conn: got request: cmd=init pmi_version=1.1.
>   mpiexec: All 4 tasks started.
> 
> - in the test program, all MPI processes get their hostname using
>   MPI_GET_PROCESSOR_NAME and send it with MPI_SEND to the master.
>   The master receives the messages with MPI_RECV and outputs them
>   (it's the simple test.f* program from the Intel MPI test
>   directory). The output is the following:
> 
>   Hello world: rank  0 of  4 running on snode164
>   Hello world: rank  1 of  4 running on snode164
>   Hello world: rank  2 of  4 running on snode164
>   Hello world: rank  3 of  4 running on snode164
> 
>   All processes seem to be running on the same node!
> 
> - next, the program calls MPI_FINALIZE. However, the processes
>   hang. When I then kill the processes one by one, I get
> 
>   accept_pmi_conn: waiting for info
>   accept_pmi_conn: waiting for info
>   accept_pmi_conn: waiting for info
>   accept_pmi_conn: waiting for info
>   wait_one_task_start: evt = 2, task 0 host snode164
>   wait_one_task_start: evt = 3, task 1 host snode164
>   wait_one_task_start: evt = 4, task 2 host snode163
>   wait_one_task_start: evt = 5, task 3 host snode163
>   wait_tasks: waiting for snode164 snode164 snode163 snode163
>   wait_tasks: waiting for snode164 snode163 snode163
>   wait_tasks: waiting for snode164 snode163
>   wait_tasks: waiting for snode163
> 
>   Killed
>   ABORT - process 3: failure: Other MPI error
>   mpiexec: wait_tasks: numspawned = 4, got evt 6 for tid 2 host snode164 status 1.
>   mpiexec: wait_tasks: numspawned = 3, got evt 9 for tid 5 host snode163 status 13.
>   Killed
>   mpiexec: wait_tasks: numspawned = 2, got evt 7 for tid 3 host snode164 status 1.
>   Killed
>   mpiexec: wait_tasks: numspawned = 1, got evt 8 for tid 4 host snode163 status 1.
>   mpiexec: Warning: tasks 0-2 exited with status 1.
>   mpiexec: Warning: task 3 exited with status 13.
>
> Any ideas (except using TotalView to get better insight)?
> 
> 
> Kind regards,
> 
> Thomas Zeiser
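
The report above contains two views of task placement that disagree: the verbose "wait_one_task_start" lines show mpiexec launching tasks 2 and 3 on snode163, while all four ranks print snode164 from MPI_GET_PROCESSOR_NAME. A minimal Python sketch for cross-checking the two follows, assuming that mpiexec task i corresponds to MPI rank i (as the "pmiid=N ... rank N checks in" lines suggest). The helper names (`parse`, `mismatches`) are hypothetical and not part of mpiexec or Intel MPI; the input strings are copied from the report.

```python
import re

# Verbose mpiexec lines recording the host each task was launched on (from the report).
MPIEXEC_LOG = """\
wait_one_task_start: evt = 2, task 0 host snode164
wait_one_task_start: evt = 3, task 1 host snode164
wait_one_task_start: evt = 4, task 2 host snode163
wait_one_task_start: evt = 5, task 3 host snode163
"""

# Lines printed by the test program using MPI_GET_PROCESSOR_NAME (from the report).
PROGRAM_OUTPUT = """\
Hello world: rank  0 of  4 running on snode164
Hello world: rank  1 of  4 running on snode164
Hello world: rank  2 of  4 running on snode164
Hello world: rank  3 of  4 running on snode164
"""

TASK_RE = re.compile(r"task (\d+) host (\S+)")
RANK_RE = re.compile(r"rank\s+(\d+) of\s+\d+ running on (\S+)")


def parse(pattern, text):
    """Map each task/rank number to the host name on its matching line."""
    return {int(m.group(1)): m.group(2) for m in pattern.finditer(text)}


def mismatches(log, output):
    """Return (rank, launched_host, reported_host) for every rank whose
    reported processor name differs from the host mpiexec launched it on.
    Assumption: mpiexec task i is MPI rank i."""
    launched = parse(TASK_RE, log)
    reported = parse(RANK_RE, output)
    return [
        (rank, launched[rank], reported[rank])
        for rank in sorted(reported)
        if rank in launched and launched[rank] != reported[rank]
    ]


if __name__ == "__main__":
    for rank, where, says in mismatches(MPIEXEC_LOG, PROGRAM_OUTPUT):
        print(f"rank {rank}: launched on {where}, reports {says}")
```

With the data from the report, ranks 2 and 3 are flagged: they were started on snode163 but report snode164.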



More information about the mpiexec mailing list