XGrid MPI Tutorial


--Thiebaut 18:25, 3 May 2009 (UTC)


Back to XGrid page

A simple Hello program to run with MPI

This step checks that the necessary software is installed on your Mac:

  • mpicc: to compile
  • mpirun: to run the program

Here's the code (taken from http://macresearch.org/runing_mpi_job_through_xgrid):

/* hellompi.c
    D. Thiebaut
    Taken from http://macresearch.org/runing_mpi_job_through_xgrid
    Compile with:
    mpicc -o hello hellompi.c

    Run with:
    mpirun -n 8 hello

    where "-n 8" specifies running 8 processes.  Change as you see fit.

    Output:
    Process 1 on xgridmac.local out of 8
    Process 0 on xgridmac.local out of 8
    Process 3 on xgridmac.local out of 8
    Process 4 on xgridmac.local out of 8
    Process 5 on xgridmac.local out of 8
    Process 7 on xgridmac.local out of 8
    Process 2 on xgridmac.local out of 8
    Process 6 on xgridmac.local out of 8

*/
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
	int numprocs, rank, namelen;
	char processor_name[MPI_MAX_PROCESSOR_NAME];

	/* Start MPI, then ask how many processes there are and which one we are. */
	MPI_Init(&argc, &argv);
	MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);
	/* Name of the host this process is running on. */
	MPI_Get_processor_name(processor_name, &namelen);

	printf("Process %d on %s out of %d\n", rank, processor_name, numprocs);

	MPI_Finalize();
	return 0;
}

To compile and run:

   mpicc -o hello hellompi.c
   mpirun -n 8 hello

where -n 8 specifies the number of processes to launch.

Typical output (the order of the lines will vary depending on how busy the nodes are):

   Process 1 on xgridmac.local out of 8
   Process 0 on xgridmac.local out of 8
   Process 3 on xgridmac.local out of 8
   Process 4 on xgridmac.local out of 8
   Process 5 on xgridmac.local out of 8
   Process 7 on xgridmac.local out of 8
   Process 2 on xgridmac.local out of 8
   Process 6 on xgridmac.local out of 8

Hello on the XGrid, using MPI

First make sure mpirun is installed on your XGrid system:

xgrid -job run /usr/bin/which mpirun
/usr/bin/mpirun

To run on the XGrid, use the xgrid -job run option:

xgrid -job run /usr/bin/mpirun -n 8 hello
Process 0 on node5.smith.edu out of 8
Process 3 on node5.smith.edu out of 8
Process 4 on node5.smith.edu out of 8
Process 5 on node5.smith.edu out of 8
Process 6 on node5.smith.edu out of 8
Process 7 on node5.smith.edu out of 8
Process 2 on node5.smith.edu out of 8
Process 1 on node5.smith.edu out of 8

This works fine.

Currently stuck!

I'm stuck when I try to run this program with more than 8 processes, which I expected would force MPI to start processes on other nodes. But currently, even if I start mpirun with -n 88, for example, all 88 processes run on node5.

Things I tried

  • I inserted a delay in hello.c so that each process runs for 1 second, but that didn't help: all processes still ran on one node.
  • I tried giving mpirun a hostfile, that is, a text file containing the names of all 11 nodes. I tried listing the nodes as node0.smith.edu, node1.smith.edu, etc., then as mathgrid0.smith.edu, mathgrid1.smith.edu, etc., and even as mathgrid0, mathgrid1, etc. To no avail.
  • My guess is that we should be able to start mpirun on the local Mac we are sitting at and specify the XGrid nodes via the -hostfile option of the mpirun command. I tried this and was prompted for a password, but our XGrid password was not accepted:
mpirun -hostfile hostfile.txt -np 8 hellosleep 
Password:
Password:
Password:
thiebaut@mathgrid0.smith.edu's password: 
Connection closed by 131.229.72.40

My hostfile.txt looked something like this:

mathgrid0.smith.edu

I also tried using IP addresses:

131.229.72.45
131.229.72.46
131.229.72.50
  • I also tried closing the firewall on my Mac Pro and ssh-ing to all the XGrid hosts so that their IP addresses would be added to the list of known hosts. This seemed to help, but I am still prompted for a password.
Here are the error messages I get:
[xgridmac.local:01172] [0,0,0] ORTE_ERROR_LOG: Timeout in file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/pls/base/pls_base_orted_cmds.c at line 275
[xgridmac.local:01172] [0,0,0] ORTE_ERROR_LOG: Timeout in file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/pls/rsh/pls_rsh_module.c at line 1164
[xgridmac.local:01172] [0,0,0] ORTE_ERROR_LOG: Timeout in file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/errmgr/hnp/errmgr_hnp.c at line 90
[xgridmac.local:01172] ERROR: A daemon on node 131.229.72.50 failed to start as expected.
[xgridmac.local:01172] ERROR: There may be more information available from
[xgridmac.local:01172] ERROR: the remote shell (see above).
[xgridmac.local:01172] The daemon received a signal 2.
[xgridmac.local:01172] ERROR: A daemon on node 131.229.72.46 failed to start as expected.
[xgridmac.local:01172] ERROR: There may be more information available from
[xgridmac.local:01172] ERROR: the remote shell (see above).
[xgridmac.local:01172] The daemon received a signal 2.
[xgridmac.local:01172] ERROR: A daemon on node 131.229.72.45 failed to start as expected.
[xgridmac.local:01172] ERROR: There may be more information available from
[xgridmac.local:01172] ERROR: the remote shell (see above).
[xgridmac.local:01172] The daemon received a signal 2.
To be continued!


