Posts Tagged ‘Numpy’


mpi4py parallel IO example

September 23, 2010

For about nine months I have been running Python jobs in parallel using mpi4py and NumPy. I recently had to write a new algorithm with MPI, so I decided to do the IO in parallel as well. Below is a small example of reading data in parallel; mpi4py is rather lacking in examples of this. It is not pretty, but it does work.

import mpi4py.MPI as MPI
import numpy as np
class Particle_parallel():
    """ Particle_parallel - distributed reading of x-y-z coordinates.

    Designed to split the vectors as evenly as possible, except for rounding
    on the last processor.

    File format:
        32bit int :Data dimensions which should = 3
        32bit int :n_particles
        64bit float (n_particles) : x-coordinates
        64bit float (n_particles) : y-coordinates
        64bit float (n_particles) : z-coordinates
    """
    def __init__(self, file_name, comm):
        self.comm = comm
        self.rank = self.comm.Get_rank()
        self.size = self.comm.Get_size()
        self.data_type_size = 8
        # Open the file collectively in read-only mode
        self.mpi_file = MPI.File.Open(self.comm, file_name, MPI.MODE_RDONLY)
        self.data_dim = np.zeros(1, dtype = np.dtype('i4'))
        self.n_particles = np.zeros(1, dtype = np.dtype('i4'))
        self.file_name = file_name
        self.debug = True

    def info(self):
        """ Distrubute the required information for reading to all ranks.

        Every rank must run this funciton.
        Each machine needs data_dim and n_particles.
        """
        # get info on all machines
        self.mpi_file.Read_all([self.data_dim, MPI.INT])
        self.mpi_file.Read_all([self.n_particles, MPI.INT])
        self.data_start = self.mpi_file.Get_position()

    def read(self):
        """ Read data and return the processors part of the coordinates to:
            self.x_proc
            self.y_proc
            self.z_proc
        """
        assert self.data_dim != 0
        # First establish this rank's vector size; force float division so
        # np.ceil actually rounds up
        default_size = int(np.ceil(float(self.n_particles) / self.size))
        # Rounding errors here should not be a problem unless
        # default size is very small
        end_size = int(self.n_particles - default_size * (self.size - 1))
        assert end_size >= 1
        if (self.rank == (self.size - 1)):
            self.proc_vector_size = end_size
        else:
            self.proc_vector_size = default_size
        # Create individual processor pointers (byte offsets) into the file
        x_start = int(self.data_start + self.rank * default_size *
                self.data_type_size)
        y_start = int(self.data_start + self.rank * default_size *
                self.data_type_size +  self.n_particles *
                self.data_type_size * 1)
        z_start = int(self.data_start + self.rank * default_size *
                self.data_type_size + self.n_particles *
                self.data_type_size * 2)
        self.x_proc = np.zeros(self.proc_vector_size)
        self.y_proc = np.zeros(self.proc_vector_size)
        self.z_proc = np.zeros(self.proc_vector_size)
        # Seek to x
        self.mpi_file.Seek(x_start)
        if self.debug:
            print('MPI Read')
        self.mpi_file.Read([self.x_proc, MPI.DOUBLE])
        if self.debug:
            print('MPI Read done')
        self.mpi_file.Seek(y_start)
        self.mpi_file.Read([self.y_proc, MPI.DOUBLE])
        self.mpi_file.Seek(z_start)
        self.mpi_file.Read([self.z_proc, MPI.DOUBLE])
        self.comm.Barrier()
        return self.x_proc, self.y_proc, self.z_proc

    def Close(self):
        self.mpi_file.Close()
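
For completeness, here is a minimal usage sketch. The file name particles.bin and the script name are placeholders; the script is launched with something like mpirun -np 4 python read_particles.py.

# Minimal usage sketch; 'particles.bin' is a placeholder for a file
# written in the format described in the class docstring.
import mpi4py.MPI as MPI

comm = MPI.COMM_WORLD
reader = Particle_parallel('particles.bin', comm)
reader.info()                  # every rank reads the header collectively
x, y, z = reader.read()        # each rank gets its slice of the coordinates
print('rank %d read %d particles' % (comm.Get_rank(), x.size))
reader.Close()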

Python and VTK

September 8, 2010

I have recently been working on using data gathered in vitro as the geometric basis for some computational fluid dynamics (CFD) simulations I am running. The simulations are solved using OpenFOAM, therefore I import the geometry as a series of .STL files.

The idea is that the data provided to me will be able to describe where the solid–fluid boundary is. From this I should be able to generate an .STL surface. The most reliable way (I have observed) to do this, given the data I am provided, is to generate a volume such that solid and fluid phases are distinguishable. This allows an iso-surface (and from this an .STL) to be generated.
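
As a rough sketch of that pipeline: the code below builds a dummy phase volume in NumPy, wraps it in a vtkImageData, extracts an iso-surface and writes it out as an .STL. The array contents, iso-value and file name are placeholders, and the SetInput call is the VTK 5 style (newer releases use SetInputData).

# Minimal volume -> iso-surface -> STL sketch; the phase array,
# iso-value and output file name are placeholders.
import numpy as np
import vtk
from vtk.util import numpy_support

# Hypothetical phase field: 0.0 = solid, 1.0 = fluid
phase = np.zeros((64, 64, 64))
phase[16:48, 16:48, 16:48] = 1.0

# Wrap the NumPy volume in a vtkImageData (x varies fastest, hence order='F')
image = vtk.vtkImageData()
image.SetDimensions(phase.shape[0], phase.shape[1], phase.shape[2])
scalars = numpy_support.numpy_to_vtk(phase.ravel(order='F'), deep=True)
image.GetPointData().SetScalars(scalars)

# Extract the solid-fluid interface as the 0.5 iso-surface
contour = vtk.vtkContourFilter()
contour.SetInput(image)  # VTK 5 style; use SetInputData(image) on VTK 6+
contour.SetValue(0, 0.5)

# Write the surface as an .STL for OpenFOAM
writer = vtk.vtkSTLWriter()
writer.SetInputConnection(contour.GetOutputPort())
writer.SetFileName('boundary.stl')
writer.Write()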

I can employ Tecplot or ParaView to do this, assuming I have an appropriate data file. Rather than painstakingly duplicate the VTK data format IO for ParaView, I decided to use the VTK Python bindings and generate the files, and later the contours, myself.

VTK is an excellent tool. The Python bindings are comprehensive and, despite the package size, I managed to get things moving without too much trouble. The NumPy support allows it to mesh nicely with any Python-based calculations I had. The error messages were informative and the Doxygen documentation has decent descriptions for many classes. All of the classes even have help available in the interpreter, although this is somewhat hidden (you need to use dir to list the available functions and then ask for help on each of them individually).
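
For instance, poking around a wrapped class from the interpreter looks something like this (vtkSTLReader is just an arbitrary example):

import vtk

reader = vtk.vtkSTLReader()
# dir() lists everything the wrapped class exposes ...
print([name for name in dir(reader) if name.startswith('Set')])
# ... and help() gives the docstring for an individual method
help(reader.SetFileName)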

The downsides: Python examples are sparse compared with C++ / Tcl, and some of the classes have very similar functions with slightly unpredictable behaviour, for example:

# vtk_data is of type vtk.vtkImageData()
im_FFT = vtk.vtkImageFFT()
# SetInputConnection() wants a pipeline connection (an output port) ...
im_FFT.SetInputConnection(vtk_data.GetProducerPort())
# ... while SetInput() takes the data object directly (VTK 5 API)
im_FFT.SetInput(vtk_data)

The examples (and to some extent the book) presume that you have a compatible data file to start with. There are no examples of how to bring in large quantities of data from another part of a program.

Recommendations: get yourself a copy of the VTK book for the first few days of working with VTK. It introduces concepts in a straightforward manner and increased my understanding substantially. Once you are familiar with VTK it is no longer required.

Next project: a Tecplot (.plt) to .vt* converter. I have a very limited version working; however, it requires more work to be robust.


Binary data and Python: Just use NumPy!

April 6, 2010

To post-process some CFD data, I have been using Python to manipulate binary files generated for Tecplot. The challenge is how to import large vectors of binary numbers into NumPy ndarrays while also processing the binary metadata.

This task appears straightforward as shown below:

import numpy as np
import struct

file_in = 'strange_binary_format.dat'
fd = open(file_in,'rb')
# buffer data -- this is bulky
buffer = fd.read()
# read 1000 doubles from the buffer from byte "position" forward
position = 0
no_of_doubles = 1000
read_format = str(no_of_doubles) + 'd'
read_size = struct.calcsize(read_format)
# put data into numpy arrays . . this is very slow and memory intensive
# might be due to struct.unpack returning as a tuple of floats
numpy_data = np.array(struct.unpack(read_format,
                      buffer[position:position + read_size]))

However, this method is inefficient (and possibly causes memory leaks!). The struct.unpack function returns a tuple of 1000 individual floats, resulting in significant overhead. The result is both memory intensive and slow. A later attempt is shown below:

import numpy as np
import struct

file_in = 'strange_binary_format.dat'
fd = open(file_in,'rb')
position = 0
no_of_doubles = 1000
# move to position in file
fd.seek(position,0)

# straight to numpy data (no buffering) 
numpy_data = np.fromfile(fd, dtype = np.dtype('d'), count = no_of_doubles)

The NumPy function fromfile is significantly more efficient in terms of both time and memory.
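
If you want to check this on your own machine, a rough comparison along the following lines does the job (the file name and vector length here are arbitrary):

# Rough timing comparison of struct.unpack versus np.fromfile;
# the file name and vector length are arbitrary.
import numpy as np
import struct
import time

n = 5 * 1000 * 1000
np.random.rand(n).tofile('timing_test.dat')

t0 = time.time()
with open('timing_test.dat', 'rb') as fd:
    via_struct = np.array(struct.unpack(str(n) + 'd', fd.read()))
t_struct = time.time() - t0

t0 = time.time()
with open('timing_test.dat', 'rb') as fd:
    via_fromfile = np.fromfile(fd, dtype=np.dtype('d'), count=n)
t_fromfile = time.time() - t0

print('struct.unpack: %.2f s   np.fromfile: %.2f s' % (t_struct, t_fromfile))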

From this experience I have a rule for numerically intensive computing with Python: NumPy / SciPy functions will almost always be faster!


Python for scientific computing

March 27, 2010

I am currently undertaking a Ph.D. in which I am researching blood flow using an academic computational fluid dynamics (CFD) code (Viper). As in many numerical investigations, the pre-processing and post-processing are as important as the algorithm itself. Until early this year MATLAB was my language of choice for processing data; however, I have recently embraced Python (particularly NumPy) as a fantastic alternative.

MATLAB is a nice, clean language for dealing with vectors; however, even with the alternative of Octave, the vendor lock-in becomes terrible when you want to scale up your code for production runs on clusters. Python, in contrast, is completely free, and scientific computation is implemented by several fantastic open source packages such as SciPy and NumPy, which easily duplicate the core functionality of MATLAB in a Pythonic, object-oriented fashion. Additionally, unlike MATLAB, the overhead of starting an interpreter is small and Python is almost universally available on *nix systems.
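
As a trivial illustration (the numbers here are arbitrary), a MATLAB-style vectorised calculation carries over almost directly:

# The MATLAB idiom  y = sin(linspace(0, 2*pi, 100)).^2  in NumPy
import numpy as np

x = np.linspace(0.0, 2.0 * np.pi, 100)
y = np.sin(x) ** 2
print(y.mean(), y.max())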

Yet the numerical capacity of Python is not the primary reason I have come to like the language. What won me over are the core features of the language, such as:

  • Clean code and indentation based control
  • Fantastic support for file I/O
  • Huge standard library including integrated support for debugging
  • A fantastic community with plenty of practical resources and superb documentation

All said, the transition (despite some moments of pain) has sharpened my programming and extended my abilities appreciably (including writing my first decent MPI code).