Civil-Comp Proceedings
ISSN 1759-3433 CCP: 90
PROCEEDINGS OF THE FIRST INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED AND GRID COMPUTING FOR ENGINEERING Edited by:
Paper 16
Unified Kernel and User Space Distributed Tracing for Message Passing Analysis
B. Poirier, R. Roy and M. Dagenais
Department of Computer Engineering, École Polytechnique, Montreal, Canada
B. Poirier, R. Roy, M. Dagenais, "Unified Kernel and User Space Distributed Tracing for Message Passing Analysis", in , (Editors), "Proceedings of the First International Conference on Parallel, Distributed and Grid Computing for Engineering", Civil-Comp Press, Stirlingshire, UK, Paper 16, 2009. doi:10.4203/ccp.90.16
Keywords: performance analysis, event tracing, time synchronization, message passing, operating system.
Summary
Commonly available tracing tools cannot fully trace a distributed
system. Whereas profilers record a sampled overview of a system,
tracers can record a complete list of events. There are tools for
application tracing, kernel tracing and network monitoring, but each
of these, taken individually, records events from only one part of a
complete system. Some tools, such as DTrace and KTAU, allow user and
kernel space traces to be merged. A complete tracer would enable the
debugging, monitoring and optimization of distributed systems, grid
computing systems and client-server programs from the application
level down to the operating system and device driver level.
In this paper we present the design and architecture of a tool that traces an entire distributed system with minimal impact. This is accomplished in two parts: user space and kernel space traces are merged at execution time, whereas the resulting distributed traces are synchronized afterwards, during a retrospective analysis. Offline trace synchronization uses algorithms based on linear regression or on geometric analysis of the offsets of individual messages. To achieve this, we have used the Linux Trace Toolkit Next Generation (LTTng) tracer for the Linux kernel, extended with user space trace points and an MPI tracing library. Time synchronization is based on identifying message exchanges using the traced TCP events. We have tested the impact of tracing on the MPIBench communication benchmark and the Dbench filesystem benchmark. The tracer can collect millions of events per second from user and kernel space, with an impact on communication times of between 10 and 15%. We have then analyzed these traces to calculate clock parameters and synchronized all the events to a common timebase with an estimated standard deviation lower than 130 µs.
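The regression-based offline synchronization mentioned above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the data layout and the assumption of equal delays in both directions are all hypothetical choices made for illustration. Each traced message exchange gives one timestamp on the clock of node A and one on the clock of node B. A least-squares line is fitted through the messages in each direction, and the two lines are averaged so that equal network delays cancel out.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def estimate_clock(a_to_b, b_to_a):
    """Estimate (drift, offset) such that t_B ~= drift * t_A + offset.

    a_to_b: (send time on clock A, receive time on clock B) pairs.
    b_to_a: (receive time on clock A, send time on clock B) pairs.
    Assumption (hypothetical): delays are equal in both directions, so
    the true clock line lies midway between the two fitted lines.
    """
    upper = fit_line(*zip(*a_to_b))  # shifted up by the network delay
    lower = fit_line(*zip(*b_to_a))  # shifted down by the network delay
    drift = (upper[0] + lower[0]) / 2
    offset = (upper[1] + lower[1]) / 2
    return drift, offset


def to_reference(t_b, drift, offset):
    """Convert a timestamp from clock B to the timebase of clock A."""
    return (t_b - offset) / drift


if __name__ == "__main__":
    # Synthetic exchanges: clock B runs 10 ppm fast and starts 5 s ahead.
    true_drift, true_offset, delay = 1.00001, 5.0, 0.0001
    a_to_b = [(t, true_drift * t + true_offset + delay) for t in range(10)]
    b_to_a = [(t + 0.5, true_drift * (t + 0.5) + true_offset - delay)
              for t in range(10)]
    drift, offset = estimate_clock(a_to_b, b_to_a)
    print(drift, offset)
```

Averaging the two directions is only one simple way of handling the delay. With asymmetric or highly variable delays the estimate is biased, which is one reason to consider a geometric analysis of the individual message offsets instead.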