Using OpenMP
Scientific and Engineering Computation
William Gropp and Ewing Lusk, editors; Janusz Kowalik, founding editor
Data-Parallel Programming on MIMD Computers, Philip J. Hatcher and Michael J. Quinn, 1991
Unstructured Scientific Computation on Scalable Multiprocessors, edited by Piyush Mehrotra,
Joel Saltz, and Robert Voigt, 1992
Parallel Computational Fluid Dynamics: Implementation and Results, edited by Horst
D. Simon, 1992
Enterprise Integration Modeling: Proceedings of the First International Conference, edited by
Charles J. Petrie, Jr., 1992
The High Performance Fortran Handbook, Charles H. Koelbel, David B. Loveman, Robert
S. Schreiber, Guy L. Steele Jr., and Mary E. Zosel, 1994
PVM: Parallel Virtual Machine—A Users’ Guide and Tutorial for Network Parallel
Computing, Al Geist, Adam Beguelin, Jack Dongarra, Weicheng Jiang, Bob Manchek, and
Vaidy Sunderam, 1994
Practical Parallel Programming, Gregory V. Wilson, 1995
Enabling Technologies for Petaflops Computing, Thomas Sterling, Paul Messina, and Paul
H. Smith, 1995
An Introduction to High-Performance Scientific Computing, Lloyd D. Fosdick, Elizabeth
R. Jessup, Carolyn J. C. Schauble, and Gitta Domik, 1995
Parallel Programming Using C++, edited by Gregory V. Wilson and Paul Lu, 1996
Using PLAPACK: Parallel Linear Algebra Package, Robert A. van de Geijn, 1997
Fortran 95 Handbook, Jeanne C. Adams, Walter S. Brainerd, Jeanne T. Martin, Brian T. Smith,
and Jerrold L. Wagener, 1997
MPI—The Complete Reference: Volume 1, The MPI Core, Marc Snir, Steve Otto, Steven
Huss-Lederman, David Walker, and Jack Dongarra, 1998
MPI—The Complete Reference: Volume 2, The MPI-2 Extensions, William Gropp, Steven
Huss-Lederman, Andrew Lumsdaine, Ewing Lusk, Bill Nitzberg, William Saphir, and Marc Snir,
1998
A Programmer’s Guide to ZPL, Lawrence Snyder, 1999
How to Build a Beowulf, Thomas L. Sterling, John Salmon, Donald J. Becker, and Daniel
F. Savarese, 1999
Using MPI: Portable Parallel Programming with the Message-Passing Interface, second
edition, William Gropp, Ewing Lusk, and Anthony Skjellum, 1999
Using MPI-2: Advanced Features of the Message-Passing Interface, William Gropp, Ewing
Lusk, and Rajeev Thakur, 1999
Beowulf Cluster Computing with Linux, Thomas Sterling, 2001
Beowulf Cluster Computing with Windows, Thomas Sterling, 2001
Scalable Input/Output: Achieving System Balance, Daniel A. Reed, 2003
Using OpenMP
Portable Shared Memory Parallel Programming
Barbara Chapman, Gabriele Jost, Ruud van der Pas
The MIT Press
Cambridge, Massachusetts
London, England
© 2008 Massachusetts Institute of Technology
All rights reserved. No part of this book may be reproduced in any form by any electronic or
mechanical means (including photocopying, recording, or information storage and retrieval)
without permission in writing from the publisher.
This book was set in LaTeX by the authors and was printed and bound in the United States of America.
Library of Congress Cataloging-in-Publication Data
Chapman, Barbara, 1954-
Using OpenMP : portable shared memory parallel programming / Barbara Chapman, Gabriele
Jost, Ruud van der Pas.
p. cm. – (Scientific and engineering computation)
Includes bibliographical references and index.
ISBN-13: 978-0-262-53302-7 (paperback : alk. paper)
1. Parallel programming (Computer science) 2. Application program interfaces
(Computer software) I. Jost, Gabriele. II. Pas, Ruud van der. III. Title.
QA76.642.C49 2007
005.2’75–dc22
2007026656
Dedicated to the memory of Ken Kennedy, who inspired in so many of us a
passion for High Performance Computing
Contents
Series Foreword xiii
Foreword xv
Preface xix
1 Introduction 1
1.1 Why Parallel Computers Are Here to Stay 1
1.2 Shared-Memory Parallel Computers 3
1.2.1 Cache Memory Is Not Shared 4
1.2.2 Implications of Private Cache Memory 6
1.3 Programming SMPs and the Origin of OpenMP 6
1.3.1 What Are the Needs? 7
1.3.2 A Brief History of Saving Time 7
1.4 What Is OpenMP? 8
1.5 Creating an OpenMP Program 9
1.6 The Bigger Picture 11
1.7 Parallel Programming Models 13
1.7.1 Realization of Shared- and Distributed-Memory Models 14
1.8 Ways to Create Parallel Programs 15
1.8.1 A Simple Comparison 16
1.9 A Final Word 21
2 Overview of OpenMP 23
2.1 Introduction 23
2.2 The Idea of OpenMP 23
2.3 The Feature Set 25
2.3.1 Creating Teams of Threads 25
2.3.2 Sharing Work among Threads 26
2.3.3 The OpenMP Memory Model 28
2.3.4 Thread Synchronization 29
2.3.5 Other Features to Note 30
2.4 OpenMP Programming Styles 31
2.5 Correctness Considerations 32
2.6 Performance Considerations 33
2.7 Wrap-Up 34
3 Writing a First OpenMP Program 35
3.1 Introduction 35
3.2 Matrix Times Vector Operation 37
3.2.1 C and Fortran Implementations of the Problem 38
3.2.2 A Sequential Implementation of the Matrix Times Vector Operation 38
3.3 Using OpenMP to Parallelize the Matrix Times Vector Product 41
3.4 Keeping Sequential and Parallel Programs as a Single Source Code 47
3.5 Wrap-Up 50
4 OpenMP Language Features 51
4.1 Introduction 51
4.2 Terminology 52
4.3 Parallel Construct 53
4.4 Sharing the Work among Threads in an OpenMP Program 57
4.4.1 Loop Construct 58
4.4.2 The Sections Construct 60
4.4.3 The Single Construct 64
4.4.4 Workshare Construct 66
4.4.5 Combined Parallel Work-Sharing Constructs 68
4.5 Clauses to Control Parallel and Work-Sharing Constructs 70
4.5.1 Shared Clause 71
4.5.2 Private Clause 72
4.5.3 Lastprivate Clause 73
4.5.4 Firstprivate Clause 75
4.5.5 Default Clause 77
4.5.6 Nowait Clause 78
4.5.7 Schedule Clause 79
4.6 OpenMP Synchronization Constructs 83
4.6.1 Barrier Construct 84
4.6.2 Ordered Construct 86
4.6.3 Critical Construct 87
4.6.4 Atomic Construct 90
4.6.5 Locks 93
4.6.6 Master Construct 94
4.7 Interaction with the Execution Environment 95
4.8 More OpenMP Clauses 100
4.8.1 If Clause 100
4.8.2 Num threads Clause 102
4.8.3 Ordered Clause 102
4.8.4 Reduction Clause 105
4.8.5 Copyin Clause 110
4.8.6 Copyprivate Clause 110
4.9 Advanced OpenMP Constructs 111
4.9.1 Nested Parallelism 111
4.9.2 Flush Directive 114
4.9.3 Threadprivate Directive 118
4.10 Wrap-Up 123
5 How to Get Good Performance by Using OpenMP 125
5.1 Introduction 125
5.2 Performance Considerations for Sequential Programs 125
5.2.1 Memory Access Patterns and Performance 126
5.2.2 Translation-Lookaside Buffer 128
5.2.3 Loop Optimizations 129
5.2.4 Use of Pointers and Contiguous Memory in C 136
5.2.5 Using Compilers 137
5.3 Measuring OpenMP Performance 138
5.3.1 Understanding the Performance of an OpenMP Program 140
5.3.2 Overheads of the OpenMP Translation 142
5.3.3 Interaction with the Execution Environment 143
5.4 Best Practices 145
5.4.1 Optimize Barrier Use 145
5.4.2 Avoid the Ordered Construct 147
5.4.3 Avoid Large Critical Regions 147
5.4.4 Maximize Parallel Regions 148
5.4.5 Avoid Parallel Regions in Inner Loops 148
5.4.6 Address Poor Load Balance 150
5.5 Additional Performance Considerations 152
5.5.1 The Single Construct Versus the Master Construct 153
5.5.2 Avoid False Sharing 153
5.5.3 Private Versus Shared Data 156
5.6 Case Study: The Matrix Times Vector Product 156
5.6.1 Testing Circumstances and Performance Metrics 157
5.6.2 A Modified OpenMP Implementation 158
5.6.3 Performance Results for the C Version 159
5.6.4 Performance Results for the Fortran Version 164
5.7 Fortran Performance Explored Further 167
5.8 An Alternative Fortran Implementation 180
5.9 Wrap-Up 189
6 Using OpenMP in the Real World 191
6.1 Scalability Challenges for OpenMP 191
6.2 Achieving Scalability on cc-NUMA Architectures 193
6.2.1 Memory Placement and Thread Binding: Why Do We Care? 193
6.2.2 Examples of Vendor-Specific cc-NUMA Support 196
6.2.3 Implications of Data and Thread Placement on cc-NUMA Performance 199
6.3 SPMD Programming 200
Case Study 1: A CFD Flow Solver 201
6.4 Combining OpenMP and Message Passing 207
6.4.1 Case Study 2: The NAS Parallel Benchmark BT 211
6.4.2 Case Study 3: The Multi-Zone NAS Parallel Benchmarks 214
6.5 Nested OpenMP Parallelism 216
6.5.1 Case Study 4: Employing Nested OpenMP for Multi-Zone CFD Benchmarks 221
6.6 Performance Analysis of OpenMP Programs 228
6.6.1 Performance Profiling of OpenMP Programs 228
6.6.2 Interpreting Timing Information 230
6.6.3 Using Hardware Counters 239
6.7 Wrap-Up 241
7 Troubleshooting 243
7.1 Introduction 243
7.2 Common Misunderstandings and Frequent Errors 243
7.2.1 Data Race Conditions 243
7.2.2 Default Data-Sharing Attributes 246
7.2.3 Values of Private Variables 249
7.2.4 Problems with the Master Construct 250
7.2.5 Assumptions about Work Scheduling 252
7.2.6 Invalid Nesting of Directives 252
7.2.7 Subtle Errors in the Use of Directives 255
7.2.8 Hidden Side Effects, or the Need for Thread Safety 255
7.3 Deeper Trouble: More Subtle Problems 259
7.3.1 Memory Consistency Problems 259
7.3.2 Erroneous Assumptions about Memory Consistency 262
7.3.3 Incorrect Use of Flush 264
7.3.4 A Well-Masked Data Race 266
7.3.5 Deadlock Situations 268
7.4 Debugging OpenMP Codes 271
7.4.1 Verification of the Sequential Version 271
7.4.2 Verification of the Parallel Code 272
7.4.3 How Can Tools Help? 272
7.5 Wrap-Up 276
8 Under the Hood: How OpenMP Really Works 277
8.1 Introduction 277
8.2 The Basics of Compilation 278
8.2.1 Optimizing the Code 279
8.2.2 Setting Up Storage for the Program’s Data 280
8.3 OpenMP Translation 282
8.3.1 Front-End Extensions 283
8.3.2 Normalization of OpenMP Constructs 284
8.3.3 Translating Array Statements 286
8.3.4 Translating Parallel Regions 286
8.3.5 Implementing Worksharing 291
8.3.6 Implementing Clauses on Worksharing Constructs 294
8.3.7 Dealing with Orphan Directives 297
8.3.8 OpenMP Data Environment 298
8.3.9 Do Idle Threads Sleep? 300
8.3.10 Handling Synchronization Constructs 302
8.4 The OpenMP Runtime System 303
8.5 Impact of OpenMP on Compiler Optimizations 304
8.6 Wrap-Up 304
9 The Future of OpenMP 307
9.1 Introduction 307
9.2 The Architectural Challenge 309
9.3 OpenMP for Distributed-Memory Systems 311
9.4 Increasing the Expressivity of OpenMP 312
9.4.1 Enhancing OpenMP Features 312
9.4.2 New Features and New Kinds of Applications 314
9.5 How Might OpenMP Evolve? 317
9.6 In Conclusion 318
A Glossary 321
References 331
Index 349
Series Foreword
The Scientific and Engineering Computation Series from MIT Press aims to provide
practical and immediately usable information to scientists and engineers engaged
at the leading edge of modern computing. Aspects of modern computing first
presented in research papers and at computer science conferences are presented
here with the intention of accelerating the adoption and impact of these ideas in
scientific and engineering applications. Such aspects include parallelism, language
design and implementation, systems software, numerical libraries, and scientific
visualization.
This book is a tutorial on OpenMP, an approach to writing parallel programs for
the shared-memory model of parallel computation. Now that all commodity proces-
sors are becoming multicore, OpenMP provides one of the few programming models
that allows computational scientists to easily take advantage of the parallelism of-
fered by these processors. This book includes a complete description of how to
use OpenMP in both C and Fortran for real-world programs, provides suggestions
for achieving high performance with OpenMP, and discusses how OpenMP-enabled
compilers work. The book concludes with a discussion of future directions for
OpenMP.
William Gropp and Ewing Lusk, Editors
Foreword
Programming languages evolve just as natural languages do, driven by human de-
sires to express thoughts more cleverly, succinctly, or elegantly than in the past.
A big difference is the fact that one key receiver of programs is nonhuman. These
nonhumans evolve faster than humans do, helping drive language mutation after
mutation, and—together with the human program writers and readers—naturally
selecting among the mutations.
In the 1970s, vector and parallel computer evolution was on the move. Program-
ming assistance was provided by language extensions—first to Fortran and then
to C—in the form of directives and pragmas, respectively. Vendors differentiated
themselves by providing “better” extensions than did their competitors; and by
the mid-1980s things had gotten out of hand for software vendors. At Kuck and
Associates (KAI), we had the problem of dealing with the whole industry, so Bruce
Leasure and I set out to fix things by forming an industrywide committee, the
Parallel Computing Forum (PCF). PCF struck a nerve and became very active.
In a few years we had a draft standard that we took through ANSI, and after a
few more years it became the ANSI X3.H5 draft. Our stamina gave out before it
became an official ANSI standard, but the industry paid attention, and extensions
evolved more uniformly.
This situation lasted for a few years, but the 1980s were a golden era for parallel
architectural evolution, with many people writing parallel programs, so extensions
again diverged, and programming needs grew. KAI took on the challenge of re-
thinking things and defining parallel profiling and correctness-checking tools at the
same time, with the goal of innovative software development products. By the
mid-1990s we had made a lot of progress and had discussed it a bit with some
hardware vendors. When SGI bought Cray in April 1996, they had an immediate
directive problem (two distinct extensions) and approached us about working with
them. Together we refined what we had, opened up to the industry, and formed
the Architecture Review Board (ARB). OpenMP was born 18 months later, as the
New York Times reported:
NEW STANDARD FOR PARALLEL PROCESSING WORKSTATIONS
Compaq, Digital, Intel, IBM and Silicon Graphics have agreed to
support OpenMP, a new standard developed by Silicon Graphics and
Kuck & Associates to allow programmers to write a single version
of their software that will run on parallel processor computers
using Unix or Windows NT operating systems. The new standard will
hasten the trend in which scientists and engineers choose high-end
workstations rather than supercomputers for complex computational
applications. (NYT 28 Oct. 1997)
OpenMP has been adopted by many software developers in the past decade, but it
has competed with traditional hand threading at the one extreme and MPI at the
other. These alternatives are much lower-level expressions of parallelism: threading
allows more control, MPI more scalability. Both usually require much more initial
effort to think through the details of program control, data decomposition, and
expressing thoughts with assembly-language-style calls. The multicore revolution
now demands simple parallel application development, which OpenMP provides
with language extensions and tools. While OpenMP has limitations rooted in its
technical origins, the ARB continues to drive the standard forward.
The supercomputing needs of the New York Times article have now been largely
replaced by scalable clusters of commodity multicore processors. What was a work-
station is now a desktop or laptop multicore system. The need for effective parallel
software development continues to grow in importance.
This book provides an excellent introduction to parallel programming and Open-
MP. It covers the language, the performance of OpenMP programs (with one hun-
dred pages of details about Fortran and C), common sources of errors, scalability
via nested parallelism and combined OpenMP/MPI programs, OpenMP implemen-
tation issues, and future ideas. Few books cover the topics in this much detail; it
includes the new OpenMP 2.5 specification, as well as hints about OpenMP 3.0
discussions and issues.
The book should be welcomed by academia, where there is rising interest in un-
dergraduate parallel programming courses. Today, parallel programming is taught
in most universities, but only as a graduate course. With multicore processors now
used everywhere, introductory courses need to add parallel programming. Because
performance is little discussed in any undergraduate programming courses today,
parallel programming for performance is hard to incorporate. OpenMP helps to
bridge this gap because it can be added simply to sequential programs and comes
with multiple scheduling algorithms that can easily provide an experimental ap-
proach to parallel performance tuning.
OpenMP has some deceptive simplicities, both good and bad. It is easy to start
using, placing substantial burden on the system implementers. In that sense, it puts
off some experienced users and beginners with preconceived ideas about POSIX or
WinThreads, who decide that parallel programming can’t be that simple and who
want to indicate on which processor each thread is going to run (and other unnec-
essary details). OpenMP also allows for very strong correctness checking versus
the correctness of the sequential program to which OpenMP directives are added.
Intel Thread Checker and other tools can dynamically pinpoint, to the line num-
ber, most OpenMP parallel programming bugs. Thus, OpenMP implementations
indeed remove annoying burdens from developers. This book will help educate the
community about such benefits.
On the other hand, the simplicity of getting started with OpenMP can lead
one to believing that any sequential program can be made into a high-performance
parallel program, which is not true. Architectural and program constraints must be
considered in scaling up any parallel program. MPI forces one to think about this
immediately and in that sense is less seductive than OpenMP. However, OpenMP
scalability is being extended with nested parallelism and by Intel’s ClusterOpenMP
with new directives to distinguish shared- and distributed-memory variables. In
the end, a high-performance OpenMP or OpenMP/MPI program may need a lot
of work, but getting started with OpenMP remains quite easy, and this book treats
the intricacies of scaling via nesting and hybrid OpenMP/MPI.
OpenMP is supported by thriving organizations. The ARB membership now in-
cludes most of the world’s leading computer manufacturers and software providers.
The ARB is a technical body that works to define new features and fix problems.
Any interested programmer can join cOMPunity, a forum of academic and industrial
researchers and developers who help drive the standard forward.
I am pleased that the authors asked me to write this foreword, and I hope that
readers learn to use the full expressibility and power of OpenMP. This book should
provide an excellent introduction to beginners, and the performance section should
help those with some experience who want to push OpenMP to its limits.
David J. Kuck
Intel Fellow, Software and Solutions Group
Director, Parallel and Distributed Solutions
Intel Corporation
Urbana, IL, USA
March 14, 2007
Preface
At Supercomputing 1997, a major conference on High Performance Computing,
Networking, and Storage held in San Jose, California, a group of High Performance
Computing experts from industry and research laboratories used an informal “Birds
of a Feather” session to unveil a new, portable programming interface for shared-
memory parallel computers. They called it OpenMP. The proposers included repre-
sentatives from several hardware companies and from the software house Kuck and
Associates, as well as scientists from the Department of Energy who wanted a way
to write programs that could exploit the parallelism in shared memory machines
provided by several major hardware manufacturers.
This initiative could not have been more timely. A diversity of programming
models for those early shared-memory systems was in use. They were all different
enough to inhibit an easy port between them. It was good to end this undesirable
situation and propose a unified model.
A company was set up to own and maintain the new informal standard. It
was named the OpenMP Architecture Review Board (ARB). Since that time, the
number of vendors involved in the specification and maintenance of OpenMP has
steadily grown. There has been increased involvement of application developers,
compiler experts, and language specialists in the ARB too.
The original proposal provided directives, a user-level library, and several environ-
ment variables that could be used to turn Fortran 77 programs into shared-memory
parallel programs with minimal effort. Fairly soon after the first release, the speci-
fication was further developed to enable its use with C/C++ programs and to take
features of Fortran 90 more fully into account. Since then, the bindings for Fortran
and C/C++ have been merged, both for simplicity and to ensure that they are as
similar as possible. Over time, support for OpenMP has been added to more and
more compilers. So we can safely say that today OpenMP provides a compact,
yet flexible shared-memory programming model for Fortran, C, and C++ that is
widely available to application developers.
Many people collaborated in order to produce the first specification of OpenMP.
Since that time, many more have worked hard in the committees set up by the
ARB to clarify certain features of the language, to consider extensions, and to
make their implementations more compatible with each other. Proposals for a
standard means to support interactions between implementations and external tools
have been intensively debated. Ideas for new features have been implemented in
research prototypes. Other people have put considerable effort into promoting the
use of OpenMP and in teaching novices and experts alike how to utilize its features
to solve a variety of programming needs. One of the authors founded a not-for-
profit company called cOMPunity to help researchers participate more fully in the
evolution of OpenMP and to promote interactions between vendors, researchers,
and users. Many volunteers helped cOMPunity achieve its goals.
At the time of writing, hardware companies are poised to introduce a whole
new generation of computers. They are designing and building multicore platforms
capable of supporting the simultaneous execution of a growing number of threads
in a shared-memory system. Even laptops are already small parallel computers.
The question is when and how the software will be adapted to take advantage of
this trend. For a while, improved throughput is going to be the main benefit of
multicore technology. It is quite typical to deploy multiple independent activities
on a laptop or PC, but how many cores are needed for this? At some point, users
will expect individual applications to take advantage of the additional processing
power. To do so, a parallel programming model is required. We think OpenMP is
in a perfect position to satisfy this need — not only today, but also in the future.
Why a book on OpenMP? After all, the OpenMP specification can be downloaded
from the web. The answer lies in the fact that, although the specification has
been written in a relatively informal style and has numerous examples, it is still
not a particularly suitable starting point for learning how to write real programs.
Moreover, some of the factors that may influence a program’s performance are
not mentioned anywhere in that document. Despite its apparent simplicity, then,
additional information is needed. This book fills in those gaps.
Chapter 1 provides background information and explains where OpenMP is ap-
plicable, as well as how it differs from other programming interfaces.
Chapter 2 gives a brief overview of the features of OpenMP. It is intended as a
high-level introduction that can be read either before or after trying out OpenMP.
Among other topics, it explains how OpenMP deals with problems arising from the
complex memory hierarchy present on most modern computers.
Chapter 3 is an essential chapter for novice parallel programmers. It discusses a
complete OpenMP program (in both Fortran and C versions) that exploits a couple
of the most widely used features, and it explains the basics of the OpenMP syntax.
Chapter 4 provides an extensive overview of the OpenMP programming model,
with many examples. First, the most widely used features are introduced, with
a focus on those that enable work to be shared among multiple threads. Then,
some important additional elements of the API are presented. Finally, we de-
scribe some of OpenMP’s lesser-used parts. In the early sections, our examples are
straightforward. Later, we give solutions to some more challenging programming
problems.
Chapters 5 and 6 discuss how to get good performance with OpenMP. We in-
clude a number of programming tips, along with an extended example that gives
insight into the process of investigating performance problems. With the growing
number of threads available on new platforms, the strategies given in Chapter 6 for
achieving higher levels of scalability are likely to be important for many application
developers.
Chapter 7 discusses problems of program correctness. Troubleshooting any ap-
plication can be hard, but shared-memory parallel programming adds another di-
mension to this effort. In particular, certain kinds of bugs are nondeterministic.
Whether they manifest themselves may depend on one or more external factors,
such as the number of threads used, the load on the system, the compiler, and the
OpenMP library implementation.
Chapter 8 shows how the compiler translates an OpenMP program to turn it into
an application capable of parallel execution. Since OpenMP provides a fairly high
level programming model, knowledge of what happens behind the scenes may help
the reader understand the impact of its translation and the workings of OpenMP-
aware compilers, performance tools, and debuggers. It may also give deeper insight
into techniques and strategies for obtaining high levels of performance.
Chapter 9 describes some of the trends that are likely to influence extensions to
the OpenMP specification. Included are comments on language features we expect
to be included in the reasonably near future.
Acknowledgments
A number of people have worked very hard to help maintain OpenMP, provide
feedback to users, debate and develop syntax for new language features, implement
those features, and teach others how to use them. It is their work that we present
here. We also acknowledge here the continuous efforts of many colleagues on the
various committees of the OpenMP Architecture Review Board. We particularly
mention Mark Bull, from the University of Edinburgh, without whom progress on
the language front is difficult to conceive.
We thank our colleagues who have contributed to the activities of cOMPunity,
which enables the participation of researchers and application developers in the
work of the ARB. These include Eduard Ayguade, Rudi Eigenmann, Dieter an
Mey, Mark Bull, Guy Robinson, and Mitsuhisa Sato.
We thank Michael Resch and colleagues at the High Performance Computing
Center (HLRS) of the University of Stuttgart, Germany, for providing logistical
support for the creation of this manuscript and for offering a pleasant working
environment and good company for one of us during a part of the writing phase.
We particularly thank Matthias Müller, originally from HLRS, but now at the
Dresden University of Technology, for his comments, encouragement, and support
and for getting us started with the publisher’s software.
Our sincere gratitude goes to the following organizations and individuals that
have helped us throughout the writing of this book: Lei Huang, Chunhua Liao,
and students in the HPC Tools lab at the University of Houston provided mate-
rial for some examples and criticized our efforts. We benefited from many helpful
discussions on OpenMP scalability issues with the staff of NASA Ames Research
Center. In particular, we thank Michael Aftosmis and Marsha Berger for the flow-
Cart example and Henry Jin for many interesting discussions of the NAS Parallel
Benchmarks and OpenMP in general. Our thanks go to colleagues at CEPBA (Eu-
ropean Center for Parallelism of Barcelona) and UPC (Universitat Politecnica de
Catalunya), especially Judit Gimenez and Jesus Labarta for fruitful collaborations
in performance analysis of large-scale OpenMP applications, and Eduard Ayguade,
Marc Gonzalez, and Xavier Martorell for sharing their experience in OpenMP com-
piler technology.
Nawal Copty, Eric Duncan, and Yuan Lin at Sun Microsystems gave their help
in answering a variety of questions on OpenMP in general and also on compiler and
library implementation issues.
We gratefully acknowledge copious feedback on draft versions of this book from
Tim Mattson (Intel Corporation) and Nawal Copty and Richard Friedman (both
at Sun Microsystems). They helped us find a number of mistakes and made many
suggestions for modifications and improvements. Remaining errors are, of course,
entirely our responsibility.
Last but not least, our gratitude goes to our families for their continued patience
and encouragement. Special thanks go to Dave Barker (a husband) for tolerating
awakening to the sound of a popcorn popper (the keyboard) in the wee hours and
for providing helpful feedback throughout the project; to Carola and Jonathan
(two children) for cheerfully putting up with drafts of book chapters lying around
in many likely, and some unlikely, places; and to Marion, Vincent, Stéphanie, and
Juliette, who never complained and who provided loving support throughout this
journey.
Using OpenMP
1 Introduction
OpenMP enables the creation of shared-memory parallel programs. In this chapter,
we describe the evolution of computers that has led to the specification of OpenMP
and that has made it relevant to mainstream computing. We put our subject matter
into a broader context by giving a brief overview of parallel computing and the main
approaches taken to create parallel programs. Our discussion of these topics is not
intended to be comprehensive.
1.1 Why Parallel Computers Are Here to Stay
No matter how fast computers are, technology is being developed to make them
even faster. Our appetite for compute power and memory seems insatiable. A more
powerful machine leads to new kinds of applications, which in turn fuel our demand
for yet more powerful systems. The result of this continued technological progress
is nothing short of breathtaking: the laptops a couple of us are using to type this
script would have been among the fastest machines on the planet just a decade ago,
if they had been around at the time.
In order to achieve their breakneck speed, today’s computer systems are highly
complex [85]. They are made up of multiple components, or functional units, that
may be able to operate simultaneously and have specific tasks, such as adding two
integer numbers or determining whether a value is greater than zero. As a result, a
computer might be able to fetch a datum from memory, multiply two floating-point
numbers, and evaluate a branch condition all at the same time. This is a very low
level of parallel processing and is often referred to as “instruction-level parallelism,”
or ILP. A processor that supports this is said to have a superscalar architecture.
Nowadays it is a common feature in general-purpose microprocessors, even those
used in laptops and PCs.
Careful reordering of these operations may keep the machine’s components busy.
The lion’s share of the work of finding such a suitable ordering of operations is
performed by the compiler (although it can be supported in hardware). To accom-
plish this, compiler writers developed techniques to determine dependences between
operations and to find an ordering that efficiently utilizes the instruction-level par-
allelism and keeps many functional units and paths to memory busy with useful
work. Modern compilers put considerable effort into this kind of instruction-level
optimization. For instance, software pipelining may modify the sequence of in-
structions in a loop nest, often overlapping instructions from different iterations to
ensure that as many instructions as possible complete every clock cycle. Unfortu-
nately, several studies [95] showed that typical applications are not likely to contain
more than three or four different instructions that can be fed to the computer at a
time in this way. Thus, there is limited payoff for extending the hardware support
for this kind of instruction-level parallelism.
Back in the 1980s, several vendors produced computers that exploited another
kind of architectural parallelism. (Actually, the idea was older than that, but it didn't take off until the 1980s.)
They built machines consisting of multiple com-
plete processors with a common shared memory. These shared-memory parallel, or
multiprocessor, machines could work on several jobs at once, simply by parceling
them out to the different processors. They could process programs with a variety
of memory needs, too, and were thus suitable for many different workloads. As a
result, they became popular in the server market, where they have remained impor-
tant ever since. Both small and large shared-memory parallel computers (in terms
of number of processors) have been built: at the time of writing, many of them have
two or four CPUs, but there also exist shared-memory systems with more than a
thousand CPUs in use, and the number that can be configured is growing. The
technology used to connect the processors and memory has improved significantly
since the early days [44]. Recent developments in hardware technology have made
architectural parallelism of this kind important for mainstream computing.
In the past few decades, the components used to build both high-end and desktop
machines have continually decreased in size. Shortly before 1990, Intel announced
that the company had put a million transistors onto a single chip (the i860). A
few years later, several companies had fit 10 million onto a chip. In the meantime,
technological progress has made it possible to put billions of transistors on a single
chip. As data paths became shorter, the rate at which instructions were issued
could be increased. Raising the clock speed became a major source of advances in
processor performance. This approach has inherent limitations, however, particu-
larly with respect to power consumption and heat emission, which is increasingly
hard to dissipate.
Recently, therefore, computer architects have begun to emphasize other strategies
for increasing hardware performance and making better use of the available space on
the chip. Given the limited usefulness of adding functional units, they have returned
to the ideas of the 1980s: multiple processors that share memory are configured in a
single machine and, increasingly, on a chip. This new generation of shared-memory
parallel computers is inexpensive and is intended for general-purpose usage.
Some recent computer designs permit a single processor to execute multiple in-
struction streams in an interleaved way. Simultaneous multithreading, for example,
interleaves instructions from multiple applications in an attempt to use more of the
hardware components at any given time. For instance, the computer might add two
values from one set of instructions and, at the same time, fetch a value from memory
that is needed to perform an operation in a different set of instructions. An ex-
ample is Intel’s hyperthreadingTM
technology. Other recent platforms (e.g., IBM’s
Power5, AMD’s Opteron and Sun’s UltraSPARC IV, IV+, and T1 processors) go
even further, replicating substantial parts of a processor’s logic on a single chip and
behaving much like shared-memory parallel machines. This approach is known as
multicore. Simultaneous multithreading platforms, multicore machines, and shared-
memory parallel computers all provide system support for the execution of multiple
independent instruction streams, or threads. Moreover, these technologies may be
combined to create computers that can execute high numbers of threads.
Given the limitations of alternative strategies for creating more powerful com-
puters, the use of parallelism in general-purpose hardware is likely to be more
pronounced in the near future. Some PCs and laptops are already multicore or
multithreaded. Soon, processors will routinely have many cores and possibly the
ability to execute multiple instruction streams within each core. In other words,
multicore technology is going mainstream [159]. It is vital that application soft-
ware be able to make effective use of the parallelism that is present in our hardware
[171]. But despite major strides in compiler technology, the programmer will need
to help, by describing the concurrency that is contained in application codes. In
this book, we will discuss one of the easiest ways in which this can be done.
1.2 Shared-Memory Parallel Computers
Throughout this book, we will refer to shared-memory parallel computers as SMPs.
Early SMPs included computers produced by Alliant, Convex, Sequent [146], En-
core, and Synapse [10] in the 1980s. Larger shared-memory machines included
IBM’s RP3 research computer [149] and commercial systems such as the BBN But-
terfly [23]. Later SGI’s Power Challenge [65] and Sun Microsystem’s Enterprise
servers entered the market, followed by a variety of desktop SMPs.
The term SMP was originally coined to designate a symmetric multiprocessor sys-
tem, a shared-memory parallel computer whose individual processors share memory
(and I/O) in such a way that each of them can access any memory location with
the same speed; that is, they have a uniform memory access (UMA) time. Many
small shared-memory machines are symmetric in this sense. Larger shared-memory
machines, however, usually do not satisfy this definition; even though the differ-
ence may be relatively small, some memory may be “nearer to” one or more of
the processors and thus accessed faster by them. We say that such machines have
cache-coherent non-uniform memory access (cc-NUMA). Early innovative attempts
to build cc-NUMA shared-memory machines were undertaken by Kendall Square
Research (KSR1 [62]) and Denelcor (the Denelcor HEP). More recent examples
of large NUMA platforms with cache coherency are SGI’s Origin and Altix series,
HP’s Exemplar, and Sun Fire E25K.
Today, the major hardware vendors all offer some form of shared-memory parallel
computer, with sizes ranging from two to hundreds – and, in a few cases, thousands
– of processors.
Conveniently, the acronym SMP can also stand for “shared-memory parallel
computer,” and we will use it to refer to all shared-memory systems, including
cc-NUMA platforms. By and large, the programmer can ignore this difference,
although techniques that we will explore in later parts of the book can help take
cc-NUMA characteristics into account.
1.2.1 Cache Memory Is Not Shared
Somewhat confusing is the fact that even SMPs have some memory that is not
shared. To explain why this is the case and what the implications for applications
programming are, we present some background information. One of the major
challenges facing computer architects today is the growing discrepancy in processor
and memory speed. Processors have been consistently getting faster. But the more
rapidly they can perform instructions, the quicker they need to receive the values
of operands from memory. Unfortunately, the speed with which data can be read
from and written to memory has not increased at the same rate. In response,
the vendors have built computers with hierarchical memory systems, in which a
small, expensive, and very fast memory called cache memory, or “cache” for short,
supplies the processor with data and instructions at high rates [74]. Each processor
of an SMP needs its own private cache if it is to be fed quickly; hence, not all
memory is shared.
Figure 1.1 shows an example of a generic, cache-based dual-core processor. There
are two levels of cache. The term level is used to denote how far away (in terms
of access time) a cache is from the CPU, or core. The higher the level, the longer
it takes to access the cache(s) at that level. At level 1 we distinguish a cache for
data (“Data Cache”), one for instructions (“Instr. Cache”), and the “Translation-
Lookaside Buffer” (or TLB for short). The last of these is an address cache. It
is discussed in Section 5.2.2. These three caches are all private to a core: other
core(s) cannot access them. Our figure shows only one cache at the second level. It
is most likely bigger than each of the level-1 caches, and it is shared by both cores.
It is also unified, which means that it contains instructions as well as data.
Figure 1.1: Block diagram of a generic, cache-based dual core processor
– In this imaginary processor, there are two levels of cache. Those closest to the core are
called “level 1.” The higher the level, the farther away from the CPU (measured in access
time) the cache is. The level-1 cache is private to the core, but the cache at the second
level is shared. Both cores can use it to store and retrieve instructions, as well as data.
Data is copied into cache from main memory: blocks of consecutive memory
locations are transferred at a time. Since the cache is very small in comparison
to main memory, a new block may displace data that was previously copied in.
An operation can be (almost) immediately performed if the values it needs are
available in cache. But if they are not, there will be a delay while the corresponding
data is retrieved from main memory. Hence, it is important to manage cache
carefully. Since neither the programmer nor the compiler can directly put data
into—or remove data from—cache, it is useful to learn how to structure program
code to indirectly make sure that cache is utilized well. (The techniques developed to accomplish this task are useful for sequential programming, too; they are briefly covered in Section 5.2.3.)
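To make this concrete, the following short C sketch (our illustration, not one of the book's examples; the array size and names are arbitrary) contrasts two ways of summing the elements of a matrix. Because C stores arrays in row-major order, the first version walks through memory consecutively and reuses each cache line fully, whereas the second jumps through memory with a large stride and wastes most of every line it fetches.

#include <stdio.h>

#define N 1000

static double a[N][N];

/* Row-wise traversal: consecutive iterations access consecutive memory
   locations, so each cache line brought in from main memory is used
   completely before it is displaced. */
static double sum_rowwise(void)
{
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Column-wise traversal: successive accesses are N doubles apart, so most
   of each cache line goes unused and far more memory traffic results. */
static double sum_columnwise(void)
{
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}

int main(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 1.0;
    printf("%f %f\n", sum_rowwise(), sum_columnwise());
    return 0;
}

Both functions compute the same result; only the order of the memory accesses differs, which is exactly the kind of restructuring referred to above.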
1.2.2 Implications of Private Cache Memory
In a uniprocessor system, new values computed by the processor are written back
to cache, where they remain until their space is required for other data. At that
point, any new values that have not already been copied back to main memory are
stored back there. This strategy does not work for SMP systems. When a processor
of an SMP stores results of local computations in its private cache, the new values
are accessible only to code executing on that processor. If no extra precautions are
taken, they will not be available to instructions executing elsewhere on an SMP
machine until after the corresponding block of data is displaced from cache. But it
may not be clear when this will happen. In fact, since the old values might still be
in other private caches, code executing on other processors might continue to use
them even then.
This is known as the memory consistency problem. A number of strategies have
been developed to help overcome it. Their purpose is to ensure that updates to data
that have taken place on one processor are made known to the program running on
other processors, and to make the modified values available to them if needed. A
system that provides this functionality transparently is said to be cache coherent.
Fortunately, the OpenMP application developer does not need to understand how
cache coherency works on a given computer. Indeed, OpenMP can be implemented
on a computer that does not provide cache coherency, since it has its own set of rules
on how data is shared among the threads running on different processors. Instead,
the programmer must be aware of the OpenMP memory model, which provides for
shared and private data and specifies when updated shared values are guaranteed
to be available to all of the code in an OpenMP program.
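As a small foretaste of that model (a sketch of our own, with illustrative names rather than an example taken from later chapters), the fragment below declares an array to be shared among the threads while giving each thread a private copy of its thread number. The barrier implied at the end of the parallel region is one of the points at which OpenMP guarantees that updates to shared data become visible to all threads.

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int data[8] = {0};    /* shared: a single copy, visible to every thread */
    int tid;              /* made private below: one copy per thread        */

    #pragma omp parallel shared(data) private(tid)
    {
        tid = omp_get_thread_num();
        if (tid < 8)
            data[tid] = tid * tid;   /* each thread writes its own element  */
    }  /* implied barrier: updates to the shared array are now visible      */

    for (int i = 0; i < 8; i++)
        printf("data[%d] = %d\n", i, data[i]);
    return 0;
}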
1.3 Programming SMPs and the Origin of OpenMP
Once the vendors had the technology to build moderately priced SMPs, they needed
to ensure that their compute power could be exploited by individual applications.
This is where things got sticky. Compilers had always been responsible for adapting
a program to make best use of a machine’s internal parallelism. Unfortunately, it is
very hard for them to do so for a computer with multiple processors or cores. The
reason is that the compilers must then identify independent streams of instructions
that can be executed in parallel. Techniques to extract such instruction streams
from a sequential program do exist; and, for simple programs, it may be worthwhile
trying out a compiler’s automatic (shared-memory) parallelization options. How-
ever, the compiler often does not have enough information to decide whether it is
possible to split up a program in this way. It also cannot make large-scale changes
to code, such as replacing an algorithm that is not suitable for parallelization. Thus,
most of the time the compiler will need some help from the user.
1.3.1 What Are the Needs?
To understand how programmers might express a code’s parallelism, the hardware
manufacturers looked carefully at existing technology. Beginning in the 1980s,
scientists engaged in solving particularly tough computational problems attempted
to exploit the SMPs of the day to speed up their code and to perform much larger
computations than were possible on a uniprocessor. To get the multiple processors
to collaborate to execute a single application, they looked for regions of code whose
instructions could be shared among the processors. Much of the time, they focused
on distributing the work in loop nests to the processors.
In most programs, code executed on one processor required results that had been
calculated on another one. In principle, this was not a problem because a value
produced by one processor could be stored in main memory and retrieved from
there by code running on other processors as needed. However, the programmer
needed to ensure that the value was retrieved after it had been produced, that is,
that the accesses occurred in the required order. Since the processors operated
independently of one another, this was a nontrivial difficulty: their clocks were not
synchronized, and they could and did execute their portions of the code at slightly
different speeds.
Accordingly, the vendors of SMPs in the 1980s provided special notation to spec-
ify how the work of a program was to be parceled out to the individual processors of
an SMP, as well as to enforce an ordering of accesses by different threads to shared
data. The notation mainly took the form of special instructions, or directives, that
could be added to programs written in sequential languages, especially Fortran.
The compiler used this information to create the actual code for execution by each
processor. Although this strategy worked, it had the obvious deficiency that a
program written for one SMP did not necessarily execute on another one.
1.3.2 A Brief History of Saving Time
Toward the end of the 1980s, vendors began to collaborate to improve this state of
affairs. An informal industry group called the Parallel Computing Forum (PCF)
agreed on a set of directives for specifying loop parallelism in Fortran programs;
their work was published in 1991 [59]. An official ANSI subcommittee called X3H5
was set up to develop an ANSI standard based on PCF. A document for the new
standard was drafted in 1994 [19], but it was never formally adopted. Interest in
PCF and X3H5 had dwindled with the rise of other kinds of parallel computers that
promised a scalable and more cost-effective approach to parallel programming. The
X3H5 standardization effort had missed its window of opportunity.
But this proved to be a temporary setback. OpenMP was defined by the OpenMP
Architecture Review Board (ARB), a group of vendors who joined forces during the
latter half of the 1990s to provide a common means for programming a broad
range of SMP architectures. OpenMP was based on the earlier PCF work. The
first version, consisting of a set of directives that could be used with Fortran, was
introduced to the public in late 1997. OpenMP compilers began to appear shortly
thereafter. Since that time, bindings for C and C++ have been introduced, and the
set of features has been extended. Compilers are now available for virtually all SMP
platforms. The number of vendors involved in maintaining and further developing
its features has grown. Today, almost all the major computer manufacturers, major
compiler companies, several government laboratories, and groups of researchers
belong to the ARB.
One of the biggest advantages of OpenMP is that the ARB continues to work to
ensure that OpenMP remains relevant as computer technology evolves. OpenMP
is under cautious, but active, development; and features continue to be proposed
for inclusion into the application programming interface. Applications live vastly
longer than computer architectures and hardware technologies; and, in general, ap-
plication developers are careful to use programming languages that they believe will
be supported for many years to come. The same is true for parallel programming
interfaces.
1.4 What Is OpenMP?
OpenMP is a shared-memory application programming interface (API) whose fea-
tures, as we have just seen, are based on prior efforts to facilitate shared-memory
parallel programming. Rather than an officially sanctioned standard, it is an
agreement reached between the members of the ARB, who share an interest in a
portable, user-friendly, and efficient approach to shared-memory parallel program-
ming. OpenMP is intended to be suitable for implementation on a broad range of
SMP architectures. As multicore machines and multithreading processors spread in
the marketplace, it might be increasingly used to create programs for uniprocessor
computers also.
Like its predecessors, OpenMP is not a new programming language. Rather, it
is notation that can be added to a sequential program in Fortran, C, or C++ to
describe how the work is to be shared among threads that will execute on different
processors or cores and to order accesses to shared data as needed. The appropriate
insertion of OpenMP features into a sequential program will allow many, perhaps
most, applications to benefit from shared-memory parallel architectures—often with
minimal modification to the code. In practice, many applications have considerable
parallelism that can be exploited.
The success of OpenMP can be attributed to a number of factors. One is its
strong emphasis on structured parallel programming. Another is that OpenMP is
comparatively simple to use, since the burden of working out the details of the
parallel program is up to the compiler. It has the major advantage of being widely
adopted, so that an OpenMP application will run on many different platforms.
But above all, OpenMP is timely. With the strong growth in deployment of
both small and large SMPs and other multithreading hardware, the need for a
shared-memory programming standard that is easy to learn and apply is accepted
throughout the industry. The vendors behind OpenMP collectively deliver a large
fraction of the SMPs in use today. Their involvement with this de facto standard
ensures its continued applicability to their architectures.
1.5 Creating an OpenMP Program
OpenMP’s directives let the user tell the compiler which instructions to execute
in parallel and how to distribute them among the threads that will run the code.
An OpenMP directive is an instruction in a special format that is understood by
OpenMP compilers only. In fact, it looks like a comment to a regular Fortran
compiler or a pragma to a C/C++ compiler, so that the program may run just
as it did beforehand if a compiler is not OpenMP-aware. The API does not have
many different directives, but they are powerful enough to cover a variety of needs.
In the chapters that follow, we will introduce the basic idea of OpenMP and then
each of the directives in turn, giving examples and discussing their main uses.
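For example, a directive that parallelizes a simple loop in C looks as follows (a minimal sketch with made-up variable names, shown only to illustrate the format; the directives themselves are introduced properly in Chapters 3 and 4). A compiler that is not OpenMP-aware treats the pragma as an unknown directive and simply compiles the sequential loop.

#include <stdio.h>

#define N 1000

int main(void)
{
    double a[N], b[N];

    for (int i = 0; i < N; i++)
        b[i] = (double) i;

    /* Tells an OpenMP compiler to divide the iterations of the following
       loop among the threads of the team; ignored by other compilers.   */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i];

    printf("a[%d] = %f\n", N - 1, a[N - 1]);
    return 0;
}

With an OpenMP compilation option (for instance -fopenmp on GCC; the flag varies by compiler), the loop iterations are executed by multiple threads; without it, the program runs sequentially and produces the same results.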
The first step in creating an OpenMP program from a sequential one is to identify
the parallelism it contains. Basically, this means finding instructions, sequences of
instructions, or even large regions of code that may be executed concurrently by
different processors.
Sometimes, this is an easy task. Sometimes, however, the developer must reor-
ganize portions of a code to obtain independent instruction sequences. It may even
be necessary to replace an algorithm with an alternative one that accomplishes
the same task but offers more exploitable parallelism. This can be a challenging
problem. Fortunately, there are some typical kinds of parallelism in programs, and
a variety of strategies for exploiting them have been developed. A good deal of
knowledge also exists about algorithms and their suitability for parallel execution.
A growing body of literature is being devoted to this topic [102, 60] and to the de-
sign of parallel programs [123, 152, 72, 34]. In this book, we will introduce some of
these strategies by way of examples and will describe typical approaches to creating
parallel code using OpenMP.
The second step in creating an OpenMP program is to express, using OpenMP,
the parallelism that has been identified. A huge practical benefit of OpenMP is
that it can be applied to incrementally create a parallel program from an existing
sequential code. The developer can insert directives into a portion of the program
and leave the rest in its sequential form. Once the resulting program version has
been successfully compiled and tested, another portion of the code can be paral-
lelized. The programmer can terminate this process once the desired speedup has
been obtained.
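A tiny sketch of this incremental style (again our own, with made-up arrays): only the first loop carries a directive, while the second is left in its original sequential form and can be addressed in a later increment once the first change has been compiled and tested.

#include <stdio.h>

#define N 1000

int main(void)
{
    double a[N];
    double sum = 0.0;

    /* Step 1: this loop has already been parallelized and verified. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 0.5 * (double) i;

    /* Step 2: still sequential; a candidate for a later increment
       (for example, using a reduction), if more speedup is needed. */
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f\n", sum);
    return 0;
}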
Although creating an OpenMP program in this way can be easy, sometimes sim-
ply inserting directives is not enough. The resulting code may not deliver the
expected level of performance, and it may not be obvious how to remedy the situa-
tion. Later, we will introduce techniques that may help improve a parallel program,
and we will give insight into how to investigate performance problems. Armed with
this information, one may be able to take a simple OpenMP program and make it
run better, maybe even significantly better. It is essential that the resulting code be
correct, and thus we also discuss the perils and pitfalls of the process. Finding cer-
tain kinds of bugs in parallel programs can be difficult, so an application developer
should endeavor to prevent them by adopting best practices from the start.
Generally, one can quickly and easily create parallel programs by relying on the
implementation to work out the details of parallel execution. This is how OpenMP
directives work. Unfortunately, however, it is not always possible to obtain high
performance by a straightforward, incremental insertion of OpenMP directives into
a sequential program. To address this situation, OpenMP designers included several
features that enable the programmer to specify more details of the parallel code.
Later in the book, we will describe a completely different way of using OpenMP
to take advantage of these features. Although it requires quite a bit more work,
users may find that getting their hands downright dirty by creating the code for
each thread can be a lot of fun. And, this may be the ticket to getting OpenMP to
solve some very large problems on a very big machine.
1.6 The Bigger Picture
Many kinds of computer architectures have been built that exploit parallelism [55].
In fact, parallel computing has been an indispensable technology in many cutting-
edge disciplines for several decades. One of the earliest kinds of parallel systems
were the powerful and expensive vector computers that used the idea of pipelining
instructions to apply the same operation to many data objects in turn (e.g., Cyber-
205 [114], CRAY-1 [155], Fujitsu Facom VP-200 [135]). These systems dominated
the high end of computing for several decades, and machines of this kind are still
deployed. Other platforms were built that simultaneously applied the same oper-
ation to many data objects (e.g. CM2 [80], MasPar [140]). Many systems have
been produced that connect multiple independent computers via a network; both
proprietary and off-the-shelf networks have been deployed. Early products based
on this approach include Intel’s iPSC series [28] and machines built by nCUBE
and Meiko [22]. Memory is associated with each of the individual computers in
the network and is thus distributed across the machine. These distributed-memory
parallel systems are often referred to as massively parallel computers (MPPs) be-
cause very large systems can be put together this way. Information on some of
the fastest machines built during the past decade and the technology used to build
them can be found at http://www.top500.org.
Many MPPs are in use today, especially for scientific computing. If distributed-
memory computers are designed with additional support that enables memory to be
shared between all the processors, they are also SMPs according to our definition.
Such platforms are often called distributed shared-memory computers (DSMs) to
emphasize the distinctive nature of this architecture (e.g., SGI Origin [106]). When
distributed-memory computers are constructed by using standard workstations or
PCs and an off-the-shelf network, they are usually called clusters [169]. Clusters,
which are often composed of SMPs, are much cheaper to build than proprietary
MPPs. This technology has matured in recent years, so that clusters are common
in universities and laboratories as well as in some companies. Thus, although SMPs
are the most widespread kind of parallel computer in use, there are many other kinds
of parallel machines in the marketplace, particularly for high-end applications.
Figure 1.2 shows the difference in these architectures: in (a) we see a shared-
memory system where processors share main memory but have their own private
cache; (b) depicts an MPP in which memory is distributed among the processors,
or nodes, of the system. The platform in (c) is identical to (b) except for the fact
that the distributed memories are accessible to all processors. The cluster in (d)
consists of a set of independent computers linked by a network.
[Figure 1.2 appears here as four diagrams: (a) an SMP in which processors, each with a
private cache, share memory over a cache-coherent interconnect; (b) an MPP in which each
processor has its own memory and the processors are joined by a non-cache-coherent
interconnect; (c) a distributed-shared-memory system in which the per-processor memories
are joined by a cache-coherent interconnect; (d) a cluster of independent computers linked
by a network.]
Figure 1.2: Distributed- and shared-memory computers – The machine in
(a) has physically shared memory, whereas the others have distributed memory. However,
the memory in (c) is accessible to all processors.
An equally broad range of applications makes use of parallel computers [61].
Very early adopters of this technology came from such disciplines as aeronautics,
aerospace, and chemistry, where vehicles were designed, materials tested, and their
properties evaluated long before they were constructed. Scientists in many dis-
ciplines have achieved monumental insights into our universe by running parallel
programs that model real-world phenomena with high levels of accuracy. Theoret-
ical results were confirmed in ways that could not be done via experimentation.
Parallel computers have been used to improve the design and production of goods
from automobiles to packaging for refrigerators and to ensure that the designs com-
ply with pricing and behavioral constraints, including regulations. They have been
used to study natural phenomena that we are unable to fully observe, such as the
formation of galaxies and the interactions of molecules. But they are also rou-
tinely used in weather forecasting, and improvements in the accuracy of our daily
forecasts are mainly the result of deploying increasingly fast (and large) parallel
computers. More recently, they have been widely used in Hollywood and elsewhere
to generate highly realistic film sequences and special effects. In this context, too,
the ability to build bigger parallel computers has led to higher-quality results, here
in the form of more realistic imagery. Of course, parallel computers are also used
to digitally remaster old film and to perform many other tasks involving image
processing. Other areas using substantial parallel computing include drug design,
financial and economic forecasting, climate modeling, surveillance, and medical
imaging. It is routine in many areas of engineering, chemistry, and physics, and
almost all commercial databases are able to exploit parallel machines.
1.7 Parallel Programming Models
Just as there are several different classes of parallel hardware, so too are there
several distinct models of parallel programming. Each of them has a number of
concrete realizations. OpenMP realizes a shared-memory (or shared address space)
programming model. This model assumes, as its name implies, that programs
will be executed on one or more processors that share some or all of the available
memory. Shared-memory programs are typically executed by multiple independent
threads (execution states that are able to process an instruction stream); the threads
share data but may also have some additional, private data. Shared-memory ap-
proaches to parallel programming must provide, in addition to a normal range of
instructions, a means for starting up threads, assigning work to them, and coordi-
nating their accesses to shared data, including ensuring that certain operations are
performed by only one thread at a time [15].
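The following C sketch, which uses OpenMP as one concrete realization of the model (the
variable names are invented here), illustrates the kind of coordination that is meant: the
threads update a shared counter, and each update must be performed by one thread at a time.

#include <stdio.h>

int main(void)
{
   int shared_count = 0;   /* visible to all threads */
   int i;

   #pragma omp parallel for
   for (i = 0; i < 100; i++) {
      /* Only one thread at a time may execute this update; without such
         coordination, concurrent updates to the shared variable could
         interfere and results could be lost. */
      #pragma omp critical
      shared_count = shared_count + 1;
   }

   printf("count = %d\n", shared_count);
   return 0;
}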
A different programming model has been proposed for distributed-memory sys-
tems. Generically referred to as “message passing,” this model assumes that pro-
grams will be executed by one or more processes, each of which has its own pri-
vate address space [69]. Message-passing approaches to parallel programming must
provide a means to initiate and manage the participating processes, along with
operations for sending and receiving messages, and possibly for performing special
operations across data distributed among the different processes. The pure message-
passing model assumes that processes cooperate to exchange messages whenever one
of them needs data produced by another one. However, some recent models are
based on “single-sided communication.” These assume that a process may inter-
act directly with memory across a network to read and write data anywhere on a
machine.
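A minimal C sketch of the two-sided style described above, using MPI (the message tag and the
value sent are chosen arbitrarily for illustration): process 1 holds a value in its private
memory, and process 0 can obtain it only through an explicit message exchange.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
   int rank, value;
   MPI_Status status;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);

   if (rank == 1) {
      value = 42;                    /* data in process 1's private memory */
      MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
   } else if (rank == 0) {
      MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
      printf("process 0 received %d\n", value);
   }

   MPI_Finalize();
   return 0;
}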
Various realizations of both shared- and distributed-memory programming mod-
els have been defined and deployed. An ideal API for parallel programming is
expressive enough to permit the specification of many parallel algorithms, is easy
to use, and leads to efficient programs. Moreover, the more transparent its imple-
mentation is, the easier it is likely to be for the programmer to understand how to
obtain good performance. Unfortunately, there are trade-offs between these goals
and parallel programming APIs differ in the features provided and in the manner
and complexity of their implementation. Some are a collection of library routines
with which the programmer may specify some or all of the details of parallel execu-
tion (e.g., GA [141] and Pthreads [108] for shared-memory programming and MPI
for MPPs), while others such as OpenMP and HPF [101] take the form of addi-
tional instructions to the compiler, which is expected to utilize them to generate
the parallel code.
1.7.1 Realization of Shared- and Distributed-Memory Models
Initially, vendors of both MPPs and SMPs provided their own custom sets of in-
structions for exploiting the parallelism in their machines. Application developers
had to work hard to modify their codes when they were ported from one machine
to another. As the number of parallel machines grew and as more and more par-
allel programs were written, developers began to demand standards for parallel
programming. Fortunately, such standards now exist.
MPI, or the Message Passing Interface, was defined in the early 1990s by a
group of researchers and vendors who based their work on existing vendor APIs
[69, 137, 147]. It provides a comprehensive set of library routines for managing
processes and exchanging messages. MPI is widely used in high-end computing,
where problems are so large that many computers are needed to attack them. It
is comparatively easy to implement on a broad variety of platforms and therefore
provides excellent portability. However, the portability comes at a cost. Creating
a parallel program based on this API typically requires a major reorganization of
the original sequential code. The development effort can be large and complex
compared to a compiler-supported approach such as that offered by OpenMP.
One can also combine some programming APIs. In particular, MPI and OpenMP
may be used together in a program, which may be useful if a program is to be
executed on MPPs that consist of multiple SMPs (possibly with multiple cores
each). Reasons for doing so include exploiting a finer granularity of parallelism than
possible with MPI, reducing memory usage, or reducing network communication.
Various commercial codes have been programmed using both MPI and OpenMP.
Combining MPI and OpenMP effectively is nontrivial, however, and in Chapter 6
we return to this topic and to the challenge of creating OpenMP codes that will
work well on large systems.
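The following C sketch (with an invented array size and initialization) indicates the shape of
such a hybrid program: OpenMP threads share the work on each process's local block of data,
and MPI then combines the per-process results across the nodes.

#include <stdio.h>
#include <mpi.h>

#define N_LOCAL 1000   /* size of the block owned by each MPI process */

int main(int argc, char *argv[])
{
   double local[N_LOCAL], local_sum = 0.0, global_sum;
   int    rank, i;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);

   for (i = 0; i < N_LOCAL; i++)
      local[i] = rank + i * 0.001;

   /* OpenMP threads within one process (one SMP node) share this loop. */
   #pragma omp parallel for reduction(+:local_sum)
   for (i = 0; i < N_LOCAL; i++)
      local_sum += local[i];

   /* MPI combines the per-process partial results across the machine. */
   MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
              MPI_COMM_WORLD);

   if (rank == 0) printf("global sum = %f\n", global_sum);

   MPI_Finalize();
   return 0;
}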
1.8 Ways to Create Parallel Programs
In this section, we briefly compare OpenMP with the most important alternatives
for programming shared-memory machines. Some vendors also provide custom
APIs on their platforms. Although such APIs may be fast (this is, after all, the
purpose of a custom API), programs written using them may have to be substan-
tially rewritten to function on a different machine. We do not consider APIs that
were not designed for broad use.
Automatic parallelization: Many compilers provide a flag, or option, for auto-
matic program parallelization. When this is selected, the compiler analyzes the
program, searching for independent sets of instructions, and in particular for loops
whose iterations are independent of one another. It then uses this information to
generate explicitly parallel code. One of the ways in which this could be realized is
to generate OpenMP directives, which would enable the programmer to view and
possibly improve the resulting code. The difficulty with relying on the compiler to
detect and exploit parallelism in an application is that it may lack the necessary
information to do a good job. For instance, it may need to know the values that
will be assumed by loop bounds or the range of values of array subscripts, but this
is often unknown ahead of run time. In order to preserve correctness, the compiler
has to conservatively assume that a loop is not parallel whenever it cannot prove
the contrary. Needless to say, the more complex the code, the more likely it is that
this will occur. Moreover, it will in general not attempt to parallelize regions larger
than loop nests. For programs with a simple structure, it may be worth trying this
option.
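The two loops below, invented for illustration, show the kind of distinction involved. In the
first, each iteration writes a distinct element a[i], so the iterations can be proven
independent and the loop could be parallelized automatically. In the second, the values in the
index array idx are unknown at compile time; two iterations might write the same element, so
the compiler must conservatively leave the loop sequential.

#include <stdio.h>
#define N 100

int main(void)
{
   double a[N], b[N];
   int    idx[N], i;

   for (i = 0; i < N; i++) { a[i] = 0.0; b[i] = i; idx[i] = i; }

   /* Clearly independent iterations: a candidate for automatic
      parallelization (or, equivalently, for a parallel loop directive). */
   for (i = 0; i < N; i++)
      a[i] = 2.0 * b[i];

   /* The element written depends on idx[i]; unless the compiler can prove
      that idx contains no repeated values, it must assume a possible
      conflict and generate sequential code. */
   for (i = 0; i < N; i++)
      a[idx[i]] = a[idx[i]] + b[i];

   printf("a[N-1] = %f\n", a[N - 1]);
   return 0;
}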
MPI: The Message Passing Interface [137] was developed to facilitate portable
programming for distributed-memory architectures (MPPs), where multiple pro-
cesses execute independently and communicate data as needed by exchanging mes-
sages. The API was designed to be highly expressive and to enable the creation of
efficient parallel code, as well as to be broadly implementable. As a result of its suc-
cess in these respects, it is the most widely used API for parallel programming in the
high-end technical computing community, where MPPs and clusters are common.
Since most vendors of shared-memory systems also provide MPI implementations
that leverage the shared address space, we include it here.
Creating an MPI program can be tricky. The programmer must create the code
that will be executed by each process, and this implies a good deal of reprogram-
ming. The need to restructure the entire program does not allow for incremental
parallelization as does OpenMP. It can be difficult to create a single program ver-
sion that will run efficiently on many different systems, since the relative cost of
communicating data and performing computations varies from one system to an-
other and may suggest different approaches to extracting parallelism. Care must
be taken to avoid certain programming errors, particularly deadlock, where two or
more processes each wait in perpetuity for the other to send a message. A good
introduction to MPI programming is provided in [69] and [147].
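The following deliberately broken C sketch (not a pattern to imitate; it assumes exactly two
processes) shows the simplest form of this error: both processes post a blocking receive
before either of them sends, so neither call can ever complete.

#include <mpi.h>

int main(int argc, char *argv[])
{
   int rank, other, in, out;
   MPI_Status status;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   other = 1 - rank;       /* partner process: assumes ranks 0 and 1 only */
   out   = rank;

   /* Both processes block here, each waiting for a message that the other
      has not yet sent: a classic deadlock. */
   MPI_Recv(&in, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &status);
   MPI_Send(&out, 1, MPI_INT, other, 0, MPI_COMM_WORLD);

   MPI_Finalize();
   return 0;
}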
Since many MPPs consist of a collection of SMPs, MPI is increasingly mixed
with OpenMP to create a program that directly matches the hardware. A recent
revision of the standard, MPI-2 ([58]), facilitates their integration.
Pthreads: This is a set of threading interfaces developed by the IEEE (Institute of
Electrical and Electronics Engineers) committees in charge of specifying a Portable
Operating System Interface (POSIX). It realizes the shared-memory programming
model via a collection of routines for creating, managing and coordinating a col-
lection of threads. Thus, like MPI, it is a library. Some features were primarily
designed for uniprocessors, where context switching enables a time-sliced execu-
tion of multiple threads, but it is also suitable for programming small SMPs. The
Pthreads library aims to be expressive as well as portable, and it provides a fairly
comprehensive set of features to create, terminate, and synchronize threads and to
prevent different threads from trying to modify the same values at the same time: it
includes mutexes, locks, condition variables, and semaphores. However, program-
ming with Pthreads is much more complex than with OpenMP, and the resulting
code is likely to differ substantially from a prior sequential program (if there is one).
Even simple tasks are performed via multiple steps, and thus a typical program will
contain many calls to the Pthreads library. For example, to execute a simple loop in
parallel, the programmer must declare threading structures, create and terminate
the threads individually, and compute the loop bounds for each thread. If interac-
tions occur within loop iterations, the amount of thread-specific code can increase
substantially. Compared to Pthreads, the OpenMP API directives make it easy to
specify parallel loop execution, to synchronize threads, and to specify whether or
not data is to be shared. For many applications, this is sufficient.
1.8.1 A Simple Comparison
The code snippets below demonstrate the implementation of a dot product in each
of the programming APIs MPI, Pthreads, and OpenMP. We do not explain in detail
the features used here, as our goal is simply to illustrate the flavor of each, although
we will introduce those used in the OpenMP code in later chapters.
Sequential Dot-Product
#include <stdio.h>

int main(argc, argv)
int argc;
char *argv[];
{
   int    i, n;
   double sum;
   double a[256], b[256];

   n = 256;
   for (i = 0; i < n; i++) {
      a[i] = i * 0.5;
      b[i] = i * 2.0;
   }

   sum = 0;
   for (i = 0; i < n; i++) {
      sum = sum + a[i]*b[i];
   }

   printf("sum = %f\n", sum);
   return 0;
}
The sequential program multiplies the individual elements of two arrays and saves
the result in the variable sum; sum is a so-called reduction variable.
Dot-Product in MPI
#include <stdio.h>
#include <mpi.h>

int main(argc, argv)
int argc;
char *argv[];
{
   int    i, n, numprocs, myid, my_first, my_last;
   double sum, sum_local;
   double a[256], b[256];

   n = 256;
   MPI_Init(&argc, &argv);
   MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
   MPI_Comm_rank(MPI_COMM_WORLD, &myid);

   my_first = myid * n / numprocs;
   my_last  = (myid + 1) * n / numprocs;

   for (i = 0; i < n; i++) {
      a[i] = i * 0.5;
      b[i] = i * 2.0;
   }

   sum_local = 0;
   for (i = my_first; i < my_last; i++) {
      sum_local = sum_local + a[i]*b[i];
   }

   MPI_Allreduce(&sum_local, &sum, 1, MPI_DOUBLE, MPI_SUM,
                 MPI_COMM_WORLD);

   if (myid == 0) printf("sum = %f\n", sum);

   MPI_Finalize();
   return 0;
}
Under MPI, all data is local. To implement the dot-product, each process builds
a partial sum, the sum of its local data. To do so, each executes a portion of the
original loop. Data and loop iterations are thus distributed manually among the
processes by the programmer. In a subsequent step, the partial sums have to be
communicated and combined to obtain the global result. MPI provides the global
communication routine MPI_Allreduce for this purpose.
Dot-Product in Pthreads
#include <stdio.h>
#include <pthread.h>

#define NUMTHRDS 4

double sum;
double a[256], b[256];
int    n = 256;

pthread_t       thd[NUMTHRDS];
pthread_mutex_t mutexsum;

void *dotprod(void *arg);

int main(argc, argv)
int argc;
char *argv[];
{
   pthread_attr_t attr;
   int   i;
   void *status;

   for (i = 0; i < n; i++) {
      a[i] = i * 0.5;
      b[i] = i * 2.0;
   }

   pthread_mutex_init(&mutexsum, NULL);
   pthread_attr_init(&attr);
   pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

   for (i = 0; i < NUMTHRDS; i++) {
      /* Pass the thread number through the void* argument. */
      pthread_create(&thd[i], &attr, dotprod, (void *)(long)i);
   }
   pthread_attr_destroy(&attr);

   for (i = 0; i < NUMTHRDS; i++) {
      pthread_join(thd[i], &status);
   }

   printf("sum = %f\n", sum);
   pthread_mutex_destroy(&mutexsum);
   pthread_exit(NULL);
}

void *dotprod(void *arg)
{
   int    myid, i, my_first, my_last;
   double sum_local;

   myid     = (int)(long)arg;
   my_first = myid * n / NUMTHRDS;
   my_last  = (myid + 1) * n / NUMTHRDS;

   sum_local = 0;
   for (i = my_first; i < my_last; i++) {
      sum_local = sum_local + a[i]*b[i];
   }

   /* Only one thread at a time may update the global sum. */
   pthread_mutex_lock(&mutexsum);
   sum = sum + sum_local;
   pthread_mutex_unlock(&mutexsum);

   pthread_exit((void *)0);
}
In the Pthreads programming API, all data is shared but logically distributed
among the threads. Access to globally shared data needs to be explicitly synchro-
nized by the user. In the dot-product implementation shown, each thread builds a
partial sum and then adds its contribution to the global sum. Access to the global
sum is protected by a lock so that only one thread at a time updates this variable.
We note that the implementation effort in Pthreads is as high as, if not higher than,
in MPI.
Dot-Product in OpenMP
#include <stdio.h>

int main(argc, argv)
int argc; char *argv[];
{
   int    i, n;
   double sum;
   double a[256], b[256];

   n = 256;
   for (i = 0; i < n; i++) {
      a[i] = i * 0.5;
      b[i] = i * 2.0;
   }

   sum = 0;
   #pragma omp parallel for reduction(+:sum)
   for (i = 0; i < n; i++) {
      sum = sum + a[i]*b[i];
   }

   printf("sum = %f\n", sum);
   return 0;
}
Under OpenMP, all data is shared by default. In this case, we are able to paral-
lelize the loop simply by inserting a directive that tells the compiler to parallelize it,
and identifying sum as a reduction variable. The details of assigning loop iterations
to threads, having the different threads build partial sums and their accumulation
into a global sum are left to the compiler. Since (apart from the usual variable dec-
larations and initializations) nothing else needs to be specified by the programmer,
this code fragment illustrates the simplicity that is possible with OpenMP.
1.9 A Final Word
Given the trend toward bigger SMPs and multithreading computers, it is vital that
strategies and techniques for creating shared-memory parallel programs become
widely known. Explaining how to use OpenMP in conjunction with the major
programming languages Fortran, C, and C++ to write such parallel programs is
the purpose of this book. Under OpenMP, one can easily introduce threading in
such a way that the same program may run sequentially as well as in parallel. The
application developer can rely on the compiler to work out the details of the parallel
code or may decide to explicitly assign work to threads. In short, OpenMP is a
very flexible medium for creating parallel code.
The discussion of language features in this book is based on the OpenMP 2.5
specification, which merges the previously separate specifications for Fortran and
C/C++. At the time of writing, the ARB is working on the OpenMP 3.0 specifica-
tion, which will expand the model to provide additional convenience and expressiv-
ity for the range of architectures that it supports. Further information on this, as
well as up-to-date news, can be found at the ARB website http://www.openmp.org
and at the website of its user community, http://www.compunity.org. The com-
plete OpenMP specification can also be downloaded from the ARB website.
2 Overview of OpenMP
In this chapter we give an overview of the OpenMP programming interface and
compare it with other approaches to parallel programming for SMPs.
2.1 Introduction
The OpenMP Application Programming Interface (API) was developed to enable
portable shared memory parallel programming. It aims to support the paralleliza-
tion of applications from many disciplines. Moreover, its creators intended to pro-
vide an approach that was relatively easy to learn as well as apply. The API is
designed to permit an incremental approach to parallelizing an existing code, in
which portions of a program are parallelized, possibly in successive steps. This is
a marked contrast to the all-or-nothing conversion of an entire program in a single
step that is typically required by other parallel programming paradigms. It was
also considered highly desirable to enable programmers to work with a single source
code: if a single set of source files contains the code for both the sequential and
the parallel versions of a program, then program maintenance is much simplified.
These goals have done much to give the OpenMP API its current shape, and they
continue to guide the OpenMP Architecture Review Board (ARB) as it works to
provide new features.
2.2 The Idea of OpenMP
A thread is a runtime entity that is able to independently execute a stream of
instructions. OpenMP builds on a large body of work that supports the specification
of programs for execution by a collection of cooperating threads [15]. The operating
system creates a process to execute a program: it will allocate some resources to
that process, including pages of memory and registers for holding values of objects.
If multiple threads collaborate to execute a program, they will share the resources,
including the address space, of the corresponding process. The individual threads
need just a few resources of their own: a program counter and an area in memory
to save variables that are specific to each thread (including registers and a stack). Multiple
threads may be executed on a single processor or core via context switches; they may
be interleaved via simultaneous multithreading. Threads running simultaneously on
multiple processors or cores may work concurrently to execute a parallel program.
Multithreaded programs can be written in various ways, some of which permit
complex interactions between threads. OpenMP attempts to provide ease of pro-
gramming and to help the user avoid a number of potential programming errors,
[Figure 2.1 appears here: the initial thread forks a team of threads at the start of a
parallel region; at the end of the region the team joins and only the initial thread
continues.]
Figure 2.1: The fork-join programming model supported by OpenMP –
The program starts as a single thread of execution, the initial thread. A team of threads
is forked at the beginning of a parallel region and joined at the end.
by offering a structured approach to multithreaded programming. It supports the
so-called fork-join programming model [48], which is illustrated in Figure 2.1. Un-
der this approach, the program starts as a single thread of execution, just like a
sequential program. The thread that executes this code is referred to as the ini-
tial thread. Whenever an OpenMP parallel construct is encountered by a thread
while it is executing the program, it creates a team of threads (this is the fork),
becomes the master of the team, and collaborates with the other members of the
team to execute the code dynamically enclosed by the construct. At the end of
the construct, only the original thread, or master of the team, continues; all others
terminate (this is the join). Each portion of code enclosed by a parallel construct
is called a parallel region.
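A minimal C example of this behavior (the printed messages are chosen only for illustration):
the initial thread forks a team at the parallel construct, the team executes the enclosed
block, and after the join only the initial thread continues. How many threads form the team is
decided by the implementation and can be influenced by the user.

#include <stdio.h>
#include <omp.h>

int main(void)
{
   printf("before the parallel region: one initial thread\n");

   /* Fork: a team of threads executes this parallel region. */
   #pragma omp parallel
   {
      printf("hello from thread %d of %d\n",
             omp_get_thread_num(), omp_get_num_threads());
   }  /* Join: the team terminates; only the initial (master) thread continues. */

   printf("after the parallel region: back to a single thread\n");
   return 0;
}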
OpenMP expects the application developer to give a high-level specification of the
parallelism in the program and the method for exploiting that parallelism. Thus it
provides notation for indicating the regions of an OpenMP program that should be
executed in parallel; it also enables the provision of additional information on how
this is to be accomplished. The job of the OpenMP implementation is to sort out
the low-level details of actually creating independent threads to execute the code
and to assign work to them according to the strategy specified by the programmer.
2.3 The Feature Set
The OpenMP API comprises a set of compiler directives, runtime library routines,
and environment variables to specify shared-memory parallelism in Fortran and
C/C++ programs. An OpenMP directive is a specially formatted comment or
pragma that generally applies to the executable code immediately following it in
the program. A directive or OpenMP routine generally affects only those threads
that encounter it. Many of the directives are applied to a structured block of code,
a sequence of executable statements with a single entry at the top and a single
exit at the bottom in Fortran programs, and an executable statement in C/C++
(which may be a compound statement with a single entry and single exit). In other
words, the program may not branch into or out of blocks of code associated with
directives. In Fortran programs, the start and end of the applicable block of code
are explicitly marked by OpenMP directives. Since the end of the block is explicit
in C/C++, only the start needs to be marked.
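In C, for instance, the structured block is simply the statement, often a compound statement
in braces, that follows the directive. The small fragment below is an invented illustration of
this form:

#include <stdio.h>

int main(void)
{
   #pragma omp parallel
   {                              /* single entry at the top of the block */
      printf("inside the structured block\n");
      /* Jumping into or out of this block, for example with goto or
         return, is not permitted. */
   }                              /* single exit at the bottom */
   return 0;
}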
OpenMP provides means for the user to
• create teams of threads for parallel execution,
• specify how to share work among the members of a team,
• declare both shared and private variables, and
• synchronize threads and enable them to perform certain operations exclusively
(i.e., without interference by other threads).
In the following sections, we give an overview of the features of the API. In
subsequent chapters we describe these features and show how they can be used to
create parallel programs.
2.3.1 Creating Teams of Threads
A team of threads is created to execute the code in a parallel region of an OpenMP
program. To accomplish this, the programmer simply specifies the parallel region
by inserting a parallel directive immediately before the code that is to be executed
in parallel to mark its start; in Fortran programs, the end is also marked by an
end parallel directive. Additional information can be supplied along with the
parallel directive. This is mostly used to enable threads to have private copies of
some data for the duration of the parallel region and to initialize that data. At the
end of a parallel region is an implicit barrier synchronization: this means that no
thread can progress until all other threads in the team have reached that point in the
program.
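A small C sketch of these points (with invented variable names): the parallel directive marks
the region, the private clause gives each thread its own copy of a work variable, and the
implicit barrier at the end of the region means that no thread proceeds past it before all the
others have arrived.

#include <stdio.h>
#include <omp.h>

int main(void)
{
   int local = 0;   /* each thread receives its own copy via private(local) */

   #pragma omp parallel private(local)
   {
      local = omp_get_thread_num();      /* thread-private work value */
      printf("thread %d has local = %d\n", omp_get_thread_num(), local);
   }  /* implicit barrier: all threads wait here before the region ends */

   return 0;
}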
would in most cases be disproportionate for the Glossary) of the
ascertained, probable, or supposed origin of the principal lines and
line-combinations in English poetry. The arrangement is logical
rather than alphabetical. Slight repetition, on some points, of matter
previously given is unavoidable.)
A. Lines
I. Alliterative.—Enough has probably been said above of the old
alliterative line and its generic character; while the later variations,
which came upon it after its revival, have also been noticed and
exemplified. Its origin is quite unknown; but the presence of closely
allied forms, in the different Scandinavian and Teutonic languages,
assures, beyond doubt, a natural rise from some speech-rhythm or
tune-rhythm proper to the race and tongue. It is also probable that
the remarkable difference of lengths—short, normal, and extended—
which is observable in O.E. poetry is of the highest antiquity. It has
at any rate persevered to the present day in the metrical successors
of this line; and there is probably no other poetry which has—at a
majority of its periods, if not throughout—indulged in such variety of
line-length as English. Nor, perhaps, is there any which contains,
even in its oldest and roughest forms, a metrical or quasi-metrical
arrangement more close to the naturally increased, but not
denaturalised, emphasis of impassioned utterance, more thoroughly
born from the primeval oak and rock.
II. Short Lines.—Despite the tendency to variation of lines above
noted, A.S. poetry did not favour very short ones; and its faithful
disciple and champion, Guest, accordingly condemns them in
modern English poetry. This is quite wrong. In the bobs and other
examples in Middle English we find the line shortened almost, if not
actually, to the monosyllable, and this liberty has persisted through
all the best periods of English verse since, though frequently
frowned upon by pedantry. Its origin is, beyond all reasonable
doubt, to be traced to French and Provençal influence, especially to
that of the short refrain; but it is so congenial to the general
tendency noted above that very little suggestion must have been
needed. It must, however, be said that very short lines, in
combination with long ones, almost necessitate rhyme to punctuate
and illumine the divisions of symphonic effect; and, consequently, it
was not till rhyme came in that they could be safely and successfully
used. But when this was mastered there was no further difficulty. In
all the best periods of English lyric writing—in that of Alison and its
fellows, in the carols of the fifteenth century, in late Elizabethan and
Caroline lyric, and in nineteenth-century poetry—the admixture of
very short lines has been a main secret of lyrical success; and in
most cases it has probably been hardly at all a matter of deliberate
imitation, but due to an instinctive sense of the beauty and
convenience of the adjustment.
III. Octosyllable.—The historical origin of the octosyllabic (or, as the
accentual people call it, the four-beat or four-stress line) is one of
the most typical in the whole range of prosody, though the lesson of
the type may be differently interpreted. Taking it altogether, there is
perhaps no metre in which so large a body of modern, including
mediæval, poetry has been composed. But, although it is simply
dimeter iambic, acatalectic or catalectic as the case may be, it is
quite vain to try to discover frequent and continuous patterns of
origin for it in strictly classical prosody.[162] Odd lines, rarely exact,
in choric odes prove nothing, and the really tempting
Ἄμμων Ὀλύμπου δέσποτα
of Pindar is an uncompleted fragment which might have gone off
into any varieties of Pindaric. There are a few fragments of Alcman—
Ὥρας δ' ἔσηκε τρεῖς, θέρος
and of the genuine Anacreon—
Μηδ' ὥστε κῦμα πόντιον,
in the metre, while the spurious verse of the Anacreontea, a
catalectic form with trisyllabic equivalence, seems to have been
actually practised by the real poet. Alternately used, it is, of course,
frequent in the epodes of Horace, in Martial, etc. But the fact
remains that, as has been said, it is not a classical metre to any but
a very small extent, though those who attach no value to anything
but the beats may find it in bulk in the anapæstic dimeter of Greek
and Latin choruses. It is in the Latin hymns—that is to say, in Latin
after it had undergone a distinct foreign admixture—that the metre
first appears firmly and distinctly established. In the fourth century,
St. Ambrose without rhyme, and Hilary with it, employ the iambic
dimeter, and it soon becomes almost the staple, though Prudentius,
contemporary with both of them and more of a regular poet, while
he does use it, seems to prefer other metres. By the time, however,
when the modern prosodies began to take form, it was thoroughly
well settled; and every Christian nation in Europe knew examples of
it by heart.
It still, however, remains a problem exactly why this particular metre
should, as a matter of direct literary imitation, have commended
itself so widely to the northern nations. They had nearly or quite as
many examples in the same class of the trochaic dimeter
Gaude, plaude, Magdalena
and they paid no attention to this, though their southern neighbours
did. They had, from the time of Pope Damasus[163] downwards, and
in almost all the hymn-writers, mixed dactylic metres to choose
from; but for a staple they went to this. It seems impossible that
there should not have been some additional and natural reasons for
the adoption—reasons which, if they had not actually brought it
about without any literary patterns at all, directed poets to those
patterns irresistibly. Nor, as it seems to the present writer, is it at all
difficult to discover, as far at least as English is concerned, what
these reasons were.
The discovery might be made out of one's own head; but here as
elsewhere Layamon is a most important assistant and safeguard. A
mere glance at any edition of alliterative verse, printed in half lines,
will show that it has a rough resemblance on the page to
octosyllabics, though the outline is more irregular. A moderately
careful study of Layamon shows, as has been indicated, that, in
writing this verse with new influences at work upon him, he
substitutes octosyllabic couplet for it constantly. And the history in
the same way shows that this occasional substitution became a
habitual one with others. Not that there is any mystical virtue in four
feet, despite their frequency in the actual creation: but that, as an
equivalent of the old half line, the choice lies practically between
three and four. Now a three-foot line, though actually tried as in the
Bestiary and in parts of Horn, is, as a general norm, too short, is
ineffective and jingly, brings the rhyme too quick, and hampers the
exhibition of the sense by a too staccato and piecemeal
presentment. The abundant adoption of the octosyllable in French
no doubt assisted the spread in English. But it is not unimportant to
observe that English translators and adapters of French octosyllabic
poems by no means always preserve the metre, and that English
octosyllables often represent French poems which are differently
metred in the original.
IV. Decasyllable.—A connected literary origin for this great line—the
ancient staple of French poetry, the modern staple of English, and
(in still greater modernity) of German to some extent, as well as
(with the extension of one syllable necessitated by the prevailing
rhythm of the language) of Italian throughout its history—has always
been found extraordinarily difficult to assign. That some have even
been driven to the line which furnishes the opening couplet of the
Alcaic
Quam si clientum longa negotia,
or
Vides ut alta stet nive candidum,
an invariably hendecasyllabic line of the most opposite rhythm,
constitution, and division, will show the straits which must have
oppressed them. The fact is that there is nothing, either in Greek or
Latin prosody, in the least resembling it or suggestive of it. To
connect it with these prosodies at all reasonably, it would be
necessary to content ourselves with the supposition, not illogical or
impossible, but not very explanatory, that somebody found the
iambic dimeter too short, and the iambic trimeter too long, and split
the difference.
In another way, and abandoning the attempt to find parents or
sponsors in antiquity for this remarkable foundling, a not wholly
dissimilar conjecture becomes really illuminative—that the line of ten
syllables (or eleven with weak ending) proved itself the most
useful in the modern languages. As a matter of fact it appears in the
very earliest French poem we possess—the tenth- or perhaps even
ninth-century Hymn of St. Eulalia:
Bel auret corps, bellezour anima,
and in the (at youngest) tenth-century Provençal Boethius:
No credet Deu lo nostre creator.
If it still seem pusillanimous to be content with such an explanation,
one can share one's pusillanimity with Dante, who contents himself
with saying that the line of eleven syllables seems the stateliest and
most excellent, as well by reason of the length of time it occupies as
of the extent of subject, construction and language of which it is
capable. And in English, with which we are specially, if not indeed
wholly, concerned, history brings us the reinforcement of showing
that the decasyllable literally forced itself, in practice, upon the
English poet.
This all-important fact has been constantly obscured by the habit of
saying that Chaucer invented the heroic couplet in English—that
he, at any rate, borrowed it first from the French. Whether he did so
as a personal fact we cannot say, for he is not here to tell us. That
he need not have done so there is ample and irrefragable evidence.
In the process of providing substitutes for the old unmetrical line, it
is not only obvious that the decasyllable—which, from a period
certainly anterior to the rise of Middle English, had been the staple
metre, in long assonanced tirades or batches, of the French
Chansons de geste—must have suggested itself. It is still more
certain that it did. It is found in an unpolished and haphazard
condition, but unmistakable, in the Orison of our Lady (early
thirteenth century); it occurs in Genesis and Exodus, varying the
octosyllable itself, in the middle of that age; it is scattered about the
Romances, in the same company, at what must have been early
fourteenth century at latest; it occurs constantly in Hampole's Prick
of Conscience at the middle of this century; and there are solid
blocks of it in the Vernon MS., which was written (i.e. copied from
earlier work), at latest, before Chaucer is likely to have started the
Legend of Good Women or the Canterbury Tales. That his practice
settled and established it—though for long the octosyllable still
outbid it in couplet, and it was written chiefly in the stanza form of
rhyme-royal—is true. But by degrees the qualities which Dante had
alleged made it prevail, and prepared it as the line-length for blank
verse as well as for the heroic couplet, and for the bulk of narrative
stanza-writing. No doubt Chaucer was assisted by the practice of
Machault and other French poets. But there should be still less doubt
that, without that practice, he might, and probably would, have
taken it up. For the first real master of versification—whether he
were Chaucer, or (in unhappy default of him) somebody else, who
must have turned up sooner or later—could not but have seen, for
his own language, what Dante saw for his.
V. Alexandrine.—The Alexandrine or verse of twelve syllables,
iambically divided, does not resemble its relation, the octosyllable, in
having a doubtful classical ancestry; or its other relation, the
decasyllable, in having none. It is, from a certain point of view, the
exact representative of the great iambic trimeter which was the
staple metre of Greek tragedy, and was largely used in Greek and
Roman verse. The identity of the two was recognised in English as
early as the Mirror for Magistrates, and indeed could escape no one
who had the knowledge and used it in the most obvious way.
At the same time it is necessary frankly to say that this resemblance
—at least, as giving the key to origin—is, in all probability, wholly
delusive. There are twelve syllables in each line, and there are
iambics in both. But to any one who has acquired—as it is the
purpose of this book to help its readers to acquire or develop—a
prosodic sense, like the much-talked-of historic sense, it will seem
to be a matter of no small weight, that while the cæsura (central
pause) of the ancient trimeter is penthemimeral (at the fifth
syllable), or hephthemimeral (at the seventh), that of the modern
Alexandrine is, save by rare, and not often justified, license,
invariably at the sixth or middle—a thing which actually alters the
whole rhythmical constitution and effect of the line.[164] Nor is the
name to be neglected. Despite the strenuous effort of modern times
to upset traditional notions, it remains a not seriously disputed fact
that the name Alexandrine comes from the French Roman
d'Alexandre, not earlier than the late twelfth century, and itself
following upon at least one decasyllabic Alexandreid. The metre,
however, suited French, and, as it had done on this particular
subject, ousted the decasyllable in the Chansons de geste generally;
while, with some intervals and revolts, it has remained the dress-
clothes of French poetry ever since, and even imposed itself as such
upon German for a considerable time.
In English, however, though, by accident and in special and partial
use, it has occupied a remarkable place, it has never been anything
like a staple. One of the most singular statements in Guest's English
Rhythms is that the verse of six accents (as he calls it) was
"formerly" the one most commonly used in our language. The
present writer is entirely unable to identify this "formerly": and the
examples which Guest produces, of single and occasional occurrence
in O.E. and early M.E., seem to him for the most part to have
nothing to do with the form. But it was inevitable that on the one
hand the large use of the metre in French, and on the other its
nearness as a metrical adjustment to the old long line or stave,
should make it appear sometimes. The six-syllable lines of the
Bestiary and Horn are attempts to reproduce it in halves, and Robert
of Brunne reproduces it as a whole.[165] It appears not seldom in
the great metrical miscellany of the Vernon MS., and many of
Langland's accentual-alliterative lines reduce themselves to, or close
to, it; while it very often makes a fugitive and unkempt appearance in
fifteenth-century doggerel. Not a few of the poems of the Mirror for
Magistrates are composed in it, and as an alternative to the
fourteener (this was possibly what Guest was thinking of) it figures
in the poulter's measure of the early and middle sixteenth century.
Sidney used it for the sonnet. But it was not till Drayton's Polyolbion
that it obtained the position of continuous metre for a long poem:
and this has never been repeated since, except in Browning's Fifine
at the Fair.
So the most important appearances by far of the Alexandrine in
English are not continuous, but come where it is employed to vary and complete
other lines. There are two of these in especial: the first among the
greatest metrical devices in English, the other (though variously
judged and not very widely employed) a great improvement. The
first is the addition, to an eight-line arrangement in decasyllables, of
a ninth in Alexandrine which constitutes the Spenserian stanza and
will be spoken of below. The other is the employment of the
Alexandrine as a variation of decasyllable in couplet, in triplet and
singly, which is, according to some, including the present writer,
visible in the riding-rhyme of Chaucer; which is often present in
the blank verse of Shakespeare; not absent from that of Milton in his
earlier attempts; employed in decasyllabic couplet by Cowley, and
Fortran 95 Handbook, Jeanne C. Adams, Walter S. Brainerd, Jeanne T. Martin, Brian T. Smith, and
Jerrold L. Wagener, 1997
MPI—The Complete Reference: Volume 1, The MPI Core, Marc Snir, Steve Otto, Steven
Huss-Lederman, David Walker, and Jack Dongarra, 1998
MPI—The Complete Reference: Volume 2, The MPI-2 Extensions, William Gropp, Steven
Huss-Lederman, Andrew Lumsdaine, Ewing Lusk, Bill Nitzberg, William Saphir, and Marc Snir, 1998
A Programmer’s Guide to ZPL, Lawrence Snyder, 1999
How to Build a Beowulf, Thomas L. Sterling, John Salmon, Donald J. Becker, and Daniel
F. Savarese, 1999
Using MPI: Portable Parallel Programming with the Message-Passing Interface, second edition,
William Gropp, Ewing Lusk, and Anthony Skjellum, 1999
Using MPI-2: Advanced Features of the Message-Passing Interface, William Gropp, Ewing Lusk,
and Rajeev Thakur, 1999
Beowulf Cluster Computing with Linux, Thomas Sterling, 2001
Beowulf Cluster Computing with Windows, Thomas Sterling, 2001
Scalable Input/Output: Achieving System Balance, Daniel A. Reed, 2003
Using OpenMP
Portable Shared Memory Parallel Programming

Barbara Chapman, Gabriele Jost, Ruud van der Pas

The MIT Press
Cambridge, Massachusetts
London, England
© 2008 Massachusetts Institute of Technology

All rights reserved. No part of this book may be reproduced in any form by any electronic or
mechanical means (including photocopying, recording, or information storage and retrieval)
without permission in writing from the publisher.

This book was set in LaTeX by the authors and was printed and bound in the United States of
America.

Library of Congress Cataloging-in-Publication Data

Chapman, Barbara, 1954-
Using OpenMP : portable shared memory parallel programming / Barbara Chapman, Gabriele
Jost, Ruud van der Pas.
p. cm. – (Scientific and engineering computation)
Includes bibliographical references and index.
ISBN-13: 978-0-262-53302-7 (paperback : alk. paper)
1. Parallel programming (Computer science) 2. Application program interfaces (Computer
software) I. Jost, Gabriele. II. Pas, Ruud van der. III. Title.
QA76.642.C49 2007
005.2’75–dc22
2007026656
Dedicated to the memory of Ken Kennedy, who inspired in so many of us a passion for High
Performance Computing
Contents

Series Foreword
Foreword
Preface

1 Introduction
    1.1 Why Parallel Computers Are Here to Stay
    1.2 Shared-Memory Parallel Computers
        1.2.1 Cache Memory Is Not Shared
        1.2.2 Implications of Private Cache Memory
    1.3 Programming SMPs and the Origin of OpenMP
        1.3.1 What Are the Needs?
        1.3.2 A Brief History of Saving Time
    1.4 What Is OpenMP?
    1.5 Creating an OpenMP Program
    1.6 The Bigger Picture
    1.7 Parallel Programming Models
        1.7.1 Realization of Shared- and Distributed-Memory Models
    1.8 Ways to Create Parallel Programs
        1.8.1 A Simple Comparison
    1.9 A Final Word

2 Overview of OpenMP
    2.1 Introduction
    2.2 The Idea of OpenMP
    2.3 The Feature Set
        2.3.1 Creating Teams of Threads
        2.3.2 Sharing Work among Threads
        2.3.3 The OpenMP Memory Model
        2.3.4 Thread Synchronization
        2.3.5 Other Features to Note
    2.4 OpenMP Programming Styles
    2.5 Correctness Considerations
    2.6 Performance Considerations
    2.7 Wrap-Up

3 Writing a First OpenMP Program
    3.1 Introduction
    3.2 Matrix Times Vector Operation
        3.2.1 C and Fortran Implementations of the Problem
        3.2.2 A Sequential Implementation of the Matrix Times Vector Operation
    3.3 Using OpenMP to Parallelize the Matrix Times Vector Product
    3.4 Keeping Sequential and Parallel Programs as a Single Source Code
    3.5 Wrap-Up

4 OpenMP Language Features
    4.1 Introduction
    4.2 Terminology
    4.3 Parallel Construct
    4.4 Sharing the Work among Threads in an OpenMP Program
        4.4.1 Loop Construct
        4.4.2 The Sections Construct
        4.4.3 The Single Construct
        4.4.4 Workshare Construct
        4.4.5 Combined Parallel Work-Sharing Constructs
    4.5 Clauses to Control Parallel and Work-Sharing Constructs
        4.5.1 Shared Clause
        4.5.2 Private Clause
        4.5.3 Lastprivate Clause
        4.5.4 Firstprivate Clause
        4.5.5 Default Clause
        4.5.6 Nowait Clause
        4.5.7 Schedule Clause
    4.6 OpenMP Synchronization Constructs
        4.6.1 Barrier Construct
        4.6.2 Ordered Construct
        4.6.3 Critical Construct
        4.6.4 Atomic Construct
        4.6.5 Locks
        4.6.6 Master Construct
    4.7 Interaction with the Execution Environment
    4.8 More OpenMP Clauses
        4.8.1 If Clause
        4.8.2 Num threads Clause
        4.8.3 Ordered Clause
        4.8.4 Reduction Clause
        4.8.5 Copyin Clause
        4.8.6 Copyprivate Clause
    4.9 Advanced OpenMP Constructs
        4.9.1 Nested Parallelism
        4.9.2 Flush Directive
        4.9.3 Threadprivate Directive
    4.10 Wrap-Up

5 How to Get Good Performance by Using OpenMP
    5.1 Introduction
    5.2 Performance Considerations for Sequential Programs
        5.2.1 Memory Access Patterns and Performance
        5.2.2 Translation-Lookaside Buffer
        5.2.3 Loop Optimizations
        5.2.4 Use of Pointers and Contiguous Memory in C
        5.2.5 Using Compilers
    5.3 Measuring OpenMP Performance
        5.3.1 Understanding the Performance of an OpenMP Program
        5.3.2 Overheads of the OpenMP Translation
        5.3.3 Interaction with the Execution Environment
    5.4 Best Practices
        5.4.1 Optimize Barrier Use
        5.4.2 Avoid the Ordered Construct
        5.4.3 Avoid Large Critical Regions
        5.4.4 Maximize Parallel Regions
        5.4.5 Avoid Parallel Regions in Inner Loops
        5.4.6 Address Poor Load Balance
    5.5 Additional Performance Considerations
        5.5.1 The Single Construct Versus the Master Construct
        5.5.2 Avoid False Sharing
        5.5.3 Private Versus Shared Data
    5.6 Case Study: The Matrix Times Vector Product
        5.6.1 Testing Circumstances and Performance Metrics
        5.6.2 A Modified OpenMP Implementation
        5.6.3 Performance Results for the C Version
        5.6.4 Performance Results for the Fortran Version
    5.7 Fortran Performance Explored Further
    5.8 An Alternative Fortran Implementation
    5.9 Wrap-Up

6 Using OpenMP in the Real World
    6.1 Scalability Challenges for OpenMP
    6.2 Achieving Scalability on cc-NUMA Architectures
        6.2.1 Memory Placement and Thread Binding: Why Do We Care?
        6.2.2 Examples of Vendor-Specific cc-NUMA Support
        6.2.3 Implications of Data and Thread Placement on cc-NUMA Performance
    6.3 SPMD Programming
        Case Study 1: A CFD Flow Solver
    6.4 Combining OpenMP and Message Passing
        6.4.1 Case Study 2: The NAS Parallel Benchmark BT
        6.4.2 Case Study 3: The Multi-Zone NAS Parallel Benchmarks
    6.5 Nested OpenMP Parallelism
        6.5.1 Case Study 4: Employing Nested OpenMP for Multi-Zone CFD Benchmarks
    6.6 Performance Analysis of OpenMP Programs
        6.6.1 Performance Profiling of OpenMP Programs
        6.6.2 Interpreting Timing Information
        6.6.3 Using Hardware Counters
    6.7 Wrap-Up

7 Troubleshooting
    7.1 Introduction
    7.2 Common Misunderstandings and Frequent Errors
        7.2.1 Data Race Conditions
        7.2.2 Default Data-Sharing Attributes
        7.2.3 Values of Private Variables
        7.2.4 Problems with the Master Construct
        7.2.5 Assumptions about Work Scheduling
        7.2.6 Invalid Nesting of Directives
        7.2.7 Subtle Errors in the Use of Directives
        7.2.8 Hidden Side Effects, or the Need for Thread Safety
    7.3 Deeper Trouble: More Subtle Problems
        7.3.1 Memory Consistency Problems
        7.3.2 Erroneous Assumptions about Memory Consistency
        7.3.3 Incorrect Use of Flush
        7.3.4 A Well-Masked Data Race
        7.3.5 Deadlock Situations
    7.4 Debugging OpenMP Codes
        7.4.1 Verification of the Sequential Version
        7.4.2 Verification of the Parallel Code
        7.4.3 How Can Tools Help?
    7.5 Wrap-Up

8 Under the Hood: How OpenMP Really Works
    8.1 Introduction
    8.2 The Basics of Compilation
        8.2.1 Optimizing the Code
        8.2.2 Setting Up Storage for the Program’s Data
    8.3 OpenMP Translation
        8.3.1 Front-End Extensions
        8.3.2 Normalization of OpenMP Constructs
        8.3.3 Translating Array Statements
        8.3.4 Translating Parallel Regions
        8.3.5 Implementing Worksharing
        8.3.6 Implementing Clauses on Worksharing Constructs
        8.3.7 Dealing with Orphan Directives
        8.3.8 OpenMP Data Environment
        8.3.9 Do Idle Threads Sleep?
        8.3.10 Handling Synchronization Constructs
    8.4 The OpenMP Runtime System
    8.5 Impact of OpenMP on Compiler Optimizations
    8.6 Wrap-Up

9 The Future of OpenMP
    9.1 Introduction
    9.2 The Architectural Challenge
    9.3 OpenMP for Distributed-Memory Systems
    9.4 Increasing the Expressivity of OpenMP
        9.4.1 Enhancing OpenMP Features
        9.4.2 New Features and New Kinds of Applications
    9.5 How Might OpenMP Evolve?
    9.6 In Conclusion

A Glossary
References
Index
Series Foreword

The Scientific and Engineering Computation Series from MIT Press aims to provide practical
and immediately usable information to scientists and engineers engaged at the leading edge of
modern computing. Aspects of modern computing first presented in research papers and at
computer science conferences are presented here with the intention of accelerating the adoption
and impact of these ideas in scientific and engineering applications. Such aspects include
parallelism, language design and implementation, systems software, numerical libraries, and
scientific visualization.

This book is a tutorial on OpenMP, an approach to writing parallel programs for the
shared-memory model of parallel computation. Now that all commodity processors are becoming
multicore, OpenMP provides one of the few programming models that allows computational
scientists to easily take advantage of the parallelism offered by these processors. This book
includes a complete description of how to use OpenMP in both C and Fortran for real-world
programs, provides suggestions for achieving high performance with OpenMP, and discusses how
OpenMP-enabled compilers work. The book concludes with a discussion of future directions for
OpenMP.

William Gropp and Ewing Lusk, Editors
Foreword

Programming languages evolve just as natural languages do, driven by human desires to express
thoughts more cleverly, succinctly, or elegantly than in the past. A big difference is the fact that
one key receiver of programs is nonhuman. These nonhumans evolve faster than humans do,
helping drive language mutation after mutation, and—together with the human program writers
and readers—naturally selecting among the mutations.

In the 1970s, vector and parallel computer evolution was on the move. Programming assistance
was provided by language extensions—first to Fortran and then to C—in the form of directives
and pragmas, respectively. Vendors differentiated themselves by providing “better” extensions
than did their competitors; and by the mid-1980s things had gotten out of hand for software
vendors. At Kuck and Associates (KAI), we had the problem of dealing with the whole industry,
so Bruce Leasure and I set out to fix things by forming an industrywide committee, the Parallel
Computing Forum (PCF). PCF struck a nerve and became very active. In a few years we had a
draft standard that we took through ANSI, and after a few more years it became the ANSI X3.H5
draft. Our stamina gave out before it became an official ANSI standard, but the industry paid
attention, and extensions evolved more uniformly.

This situation lasted for a few years, but the 1980s were a golden era for parallel architectural
evolution, with many people writing parallel programs, so extensions again diverged, and
programming needs grew. KAI took on the challenge of rethinking things and defining parallel
profiling and correctness-checking tools at the same time, with the goal of innovative software
development products. By the mid-1990s we had made a lot of progress and had discussed it a
bit with some hardware vendors. When SGI bought Cray in April 1996, they had an immediate
directive problem (two distinct extensions) and approached us about working with them.
Together we refined what we had, opened up to the industry, and formed the Architecture
Review Board (ARB). OpenMP was born 18 months later, as the New York Times reported:

    NEW STANDARD FOR PARALLEL PROCESSING WORKSTATIONS
    Compaq, Digital, Intel, IBM and Silicon Graphics have agreed to support OpenMP, a new
    standard developed by Silicon Graphics and Kuck Associates to allow programmers to write a
    single version of their software that will run on parallel processor computers using Unix or
    Windows NT operating systems. The new standard will hasten the trend in which scientists
    and engineers choose high-end workstations rather than supercomputers for complex
    computational applications. (NYT 28 Oct. 1997)

OpenMP has been adopted by many software developers in the past decade, but it has competed
with traditional hand threading at the one extreme and MPI at the other. These alternatives are
much lower-level expressions of parallelism: threading allows more control, MPI more scalability.
Both usually require much more initial effort to think through the details of program control,
data decomposition, and expressing thoughts with assembly-language-style calls. The multicore
revolution now demands simple parallel application development, which OpenMP provides with
language extensions and tools. While OpenMP has limitations rooted in its technical origins, the
ARB continues to drive the standard forward.

The supercomputing needs of the New York Times article have now been largely replaced by
scalable clusters of commodity multicore processors. What was a workstation is now a desktop
or laptop multicore system. The need for effective parallel software development continues to
grow in importance.

This book provides an excellent introduction to parallel programming and OpenMP. It covers the
language, the performance of OpenMP programs (with one hundred pages of details about
Fortran and C), common sources of errors, scalability via nested parallelism and combined
OpenMP/MPI programs, OpenMP implementation issues, and future ideas. Few books cover the
topics in this much detail; it includes the new OpenMP 2.5 specification, as well as hints about
OpenMP 3.0 discussions and issues.

The book should be welcomed by academia, where there is rising interest in undergraduate
parallel programming courses. Today, parallel programming is taught in most universities, but
only as a graduate course. With multicore processors now used everywhere, introductory courses
need to add parallel programming. Because performance is little discussed in any undergraduate
programming courses today, parallel programming for performance is hard to incorporate.
OpenMP helps to bridge this gap because it can be added simply to sequential programs and
comes with multiple scheduling algorithms that can easily provide an experimental approach to
parallel performance tuning.

OpenMP has some deceptive simplicities, both good and bad. It is easy to start using, placing
substantial burden on the system implementers. In that sense, it puts off some experienced users
and beginners with preconceived ideas about POSIX or WinThreads, who decide that parallel
programming can’t be that simple and who want to indicate on which processor each thread is
going to run (and other unnecessary details). OpenMP also allows for very strong correctness
checking versus the correctness of the sequential program to which OpenMP directives are added.
Intel Thread Checker and other tools can dynamically pinpoint, to the line number, most OpenMP
parallel programming bugs. Thus, OpenMP implementations indeed remove annoying burdens
from developers. This book will help educate the community about such benefits.

On the other hand, the simplicity of getting started with OpenMP can lead one to believing that
any sequential program can be made into a high-performance parallel program, which is not
true. Architectural and program constraints must be considered in scaling up any parallel
program. MPI forces one to think about this immediately and in that sense is less seductive than
OpenMP. However, OpenMP scalability is being extended with nested parallelism and by Intel’s
ClusterOpenMP with new directives to distinguish shared- and distributed-memory variables. In
the end, a high-performance OpenMP or OpenMP/MPI program may need a lot of work, but
getting started with OpenMP remains quite easy, and this book treats the intricacies of scaling
via nesting and hybrid OpenMP/MPI.

OpenMP is supported by thriving organizations. The ARB membership now includes most of the
world’s leading computer manufacturers and software providers. The ARB is a technical body
that works to define new features and fix problems. Any interested programmer can join
cOMPunity, a forum of academic and industrial researchers and developers who help drive the
standard forward.

I am pleased that the authors asked me to write this foreword, and I hope that readers learn to
use the full expressibility and power of OpenMP. This book should provide an excellent
introduction to beginners, and the performance section should help those with some experience
who want to push OpenMP to its limits.

David J. Kuck
Intel Fellow, Software and Solutions Group
Director, Parallel and Distributed Solutions
Intel Corporation
Urbana, IL, USA
March 14, 2007
Preface

At Supercomputing 1997, a major conference on High Performance Computing, Networking, and
Storage held in San Jose, California, a group of High Performance Computing experts from
industry and research laboratories used an informal “Birds of a Feather” session to unveil a new,
portable programming interface for shared-memory parallel computers. They called it OpenMP.
The proposers included representatives from several hardware companies and from the software
house Kuck and Associates, as well as scientists from the Department of Energy who wanted a
way to write programs that could exploit the parallelism in shared memory machines provided
by several major hardware manufacturers.

This initiative could not have been more timely. A diversity of programming models for those
early shared-memory systems were in use. They were all different enough to inhibit an easy port
between them. It was good to end this undesirable situation and propose a unified model. A
company was set up to own and maintain the new informal standard. It was named the OpenMP
Architecture Review Board (ARB). Since that time, the number of vendors involved in the
specification and maintenance of OpenMP has steadily grown. There has been increased
involvement of application developers, compiler experts, and language specialists in the ARB too.

The original proposal provided directives, a user-level library, and several environment variables
that could be used to turn Fortran 77 programs into shared-memory parallel programs with
minimal effort. Fairly soon after the first release, the specification was further developed to
enable its use with C/C++ programs and to take features of Fortran 90 more fully into account.
Since then, the bindings for Fortran and C/C++ have been merged, both for simplicity and to
ensure that they are as similar as possible. Over time, support for OpenMP has been added to
more and more compilers. So we can safely say that today OpenMP provides a compact, yet
flexible shared-memory programming model for Fortran, C, and C++ that is widely available to
application developers.

Many people collaborated in order to produce the first specification of OpenMP. Since that time,
many more have worked hard in the committees set up by the ARB to clarify certain features of
the language, to consider extensions, and to make their implementations more compatible with
each other. Proposals for a standard means to support interactions between implementations
and external tools have been intensively debated. Ideas for new features have been implemented
in research prototypes. Other people have put considerable effort into promoting the use of
OpenMP and in teaching novices and experts alike how to utilize its features to solve a variety of
programming needs. One of the authors founded a not-for-profit company called cOMPunity to
help researchers participate more fully in the evolution of OpenMP and to promote interactions
between vendors, researchers, and users. Many volunteers helped cOMPunity achieve its goals.

At the time of writing, hardware companies are poised to introduce a whole new generation of
computers. They are designing and building multicore platforms capable of supporting the
simultaneous execution of a growing number of threads in a shared-memory system. Even
laptops are already small parallel computers. The question is when and how the software will be
adapted to take advantage of this trend. For a while, improved throughput is going to be the
main benefit of multicore technology. It is quite typical to deploy multiple independent activities
on a laptop or PC, but how many cores are needed for this? At some point, users will expect
individual applications to take advantage of the additional processing power. To do so, a parallel
programming model is required. We think OpenMP is in a perfect position to satisfy this need —
not only today, but also in the future.

Why a book on OpenMP? After all, the OpenMP specification can be downloaded from the web.
The answer lies in the fact that, although the specification has been written in a relatively
informal style and has numerous examples, it is still not a particularly suitable starting point for
learning how to write real programs. Moreover, some of the factors that may influence a
program’s performance are not mentioned anywhere in that document. Despite its apparent
simplicity, then, additional information is needed. This book fills in those gaps.

Chapter 1 provides background information and explains where OpenMP is applicable, as well as
how it differs from other programming interfaces.

Chapter 2 gives a brief overview of the features of OpenMP. It is intended as a high-level
introduction that can be read either before or after trying out OpenMP. Among other topics, it
explains how OpenMP deals with problems arising from the complex memory hierarchy present
on most modern computers.

Chapter 3 is an essential chapter for novice parallel programmers. It discusses a complete
OpenMP program (in both Fortran and C versions) that exploits a couple of the most widely used
features, and it explains the basics of the OpenMP syntax.

Chapter 4 provides an extensive overview of the OpenMP programming model, with many
examples. First, the most widely used features are introduced, with a focus on those that enable
work to be shared among multiple threads. Then, some important additional elements of the API
are presented. Finally, we describe some of OpenMP’s lesser-used parts. In the early sections,
our examples are straightforward. Later, we give solutions to some more challenging
programming problems.

Chapters 5 and 6 discuss how to get good performance with OpenMP. We include a number of
programming tips, along with an extended example that gives insight into the process of
investigating performance problems. With the growing number of threads available on new
platforms, the strategies given in Chapter 6 for achieving higher levels of scalability are likely to
be important for many application developers.

Chapter 7 discusses problems of program correctness. Troubleshooting any application can be
hard, but shared-memory parallel programming adds another dimension to this effort. In
particular, certain kinds of bugs are nondeterministic. Whether they manifest themselves may
depend on one or more external factors, such as the number of threads used, the load on the
system, the compiler, and the OpenMP library implementation.

Chapter 8 shows how the compiler translates an OpenMP program to turn it into an application
capable of parallel execution. Since OpenMP provides a fairly high level programming model,
knowledge of what happens behind the scenes may help the reader understand the impact of its
translation and the workings of OpenMP-aware compilers, performance tools, and debuggers. It
may also give deeper insight into techniques and strategies for obtaining high levels of
performance.

Chapter 9 describes some of the trends that are likely to influence extensions to the OpenMP
specification. Included are comments on language features we expect to be included in the
reasonably near future.

Acknowledgments

A number of people have worked very hard to help maintain OpenMP, provide feedback to users,
debate and develop syntax for new language features, implement those features, and teach
others how to use them. It is their work that we present here. We also acknowledge here the
continuous efforts of many colleagues on the various committees of the OpenMP Architecture
Review Board. We particularly mention Mark Bull, from the University of Edinburgh, without
whom progress on the language front is difficult to conceive.

We thank our colleagues who have contributed to the activities of cOMPunity, which enables the
participation of researchers and application developers in the work of the ARB. These include
Eduard Ayguade, Rudi Eigenmann, Dieter an Mey, Mark Bull, Guy Robinson, and Mitsuhisa Sato.

We thank Michael Resch and colleagues at the High Performance Computing Center (HLRS) of
the University of Stuttgart, Germany, for providing logistical support for the creation of this
manuscript and for offering a pleasant working environment and good company for one of us
during a part of the writing phase. We particularly thank Matthias Müller, originally from HLRS,
but now at the Dresden University of Technology, for his comments, encouragement, and support
and for getting us started with the publisher’s software.

Our sincere gratitude goes to the following organizations and individuals that have helped us
throughout the writing of this book: Lei Huang, Chunhua Liao, and students in the HPC Tools lab
at the University of Houston provided material for some examples and criticized our efforts. We
benefited from many helpful discussions on OpenMP scalability issues with the staff of NASA
Ames Research Center. In particular, we thank Michael Aftosmis and Marsha Berger for the
flowCart example and Henry Jin for many interesting discussions of the NAS Parallel Benchmarks
and OpenMP in general. Our thanks go to colleagues at CEPBA (European Center for Parallelism
of Barcelona) and UPC (Universitat Politecnica de Catalunya), especially Judit Gimenez and Jesus
Labarta for fruitful collaborations in performance analysis of large-scale OpenMP applications,
and Eduard Ayguade, Marc Gonzalez, and Xavier Martorell for sharing their experience in
OpenMP compiler technology. Nawal Copty, Eric Duncan, and Yuan Lin at Sun Microsystems
gave their help in answering a variety of questions on OpenMP in general and also on compiler
and library implementation issues.

We gratefully acknowledge copious feedback on draft versions of this book from Tim Mattson
(Intel Corporation) and Nawal Copty and Richard Friedman (both at Sun Microsystems). They
helped us find a number of mistakes and made many suggestions for modifications and
improvements. Remaining errors are, of course, entirely our responsibility.

Last but not least, our gratitude goes to our families for their continued patience and
encouragement. Special thanks go to Dave Barker (a husband) for tolerating awakening to the
sound of a popcorn popper (the keyboard) in the wee hours and for providing helpful feedback
throughout the project; to Carola and Jonathan (two children) for cheerfully putting up with
drafts of book chapters lying around in many likely, and some unlikely, places; and to Marion,
Vincent, Stéphanie, and Juliette, who never complained and who provided loving support
throughout this journey.
  • 28.
  • 30.
    1 Introduction OpenMP enablesthe creation of shared-memory parallel programs. In this chapter, we describe the evolution of computers that has led to the specification of OpenMP and that has made it relevant to mainstream computing. We put our subject matter into a broader context by giving a brief overview of parallel computing and the main approaches taken to create parallel programs. Our discussion of these topics is not intended to be comprehensive. 1.1 Why Parallel Computers Are Here to Stay No matter how fast computers are, technology is being developed to make them even faster. Our appetite for compute power and memory seems insatiable. A more powerful machine leads to new kinds of applications, which in turn fuel our demand for yet more powerful systems. The result of this continued technological progress is nothing short of breathtaking: the laptops a couple of us are using to type this script would have been among the fastest machines on the planet just a decade ago, if they had been around at the time. In order to achieve their breakneck speed, today’s computer systems are highly complex [85]. They are made up of multiple components, or functional units, that may be able to operate simultaneously and have specific tasks, such as adding two integer numbers or determining whether a value is greater than zero. As a result, a computer might be able to fetch a datum from memory, multiply two floating-point numbers, and evaluate a branch condition all at the same time. This is a very low level of parallel processing and is often referred to as “instruction-level parallelism,” or ILP. A processor that supports this is said to have a superscalar architecture. Nowadays it is a common feature in general-purpose microprocessors, even those used in laptops and PCs. Careful reordering of these operations may keep the machine’s components busy. The lion’s share of the work of finding such a suitable ordering of operations is performed by the compiler (although it can be supported in hardware). To accom- plish this, compiler writers developed techniques to determine dependences between operations and to find an ordering that efficiently utilizes the instruction-level par- allelism and keeps many functional units and paths to memory busy with useful work. Modern compilers put considerable effort into this kind of instruction-level optimization. For instance, software pipelining may modify the sequence of in- structions in a loop nest, often overlapping instructions from different iterations to ensure that as many instructions as possible complete every clock cycle. Unfortu- nately, several studies [95] showed that typical applications are not likely to contain
  • 31.
    2 Chapter 1 morethan three or four different instructions that can be fed to the computer at a time in this way. Thus, there is limited payoff for extending the hardware support for this kind of instruction-level parallelism. Back in the 1980s, several vendors produced computers that exploited another kind of architectural parallelism.1 They built machines consisting of multiple com- plete processors with a common shared memory. These shared-memory parallel, or multiprocessor, machines could work on several jobs at once, simply by parceling them out to the different processors. They could process programs with a variety of memory needs, too, and were thus suitable for many different workloads. As a result, they became popular in the server market, where they have remained impor- tant ever since. Both small and large shared-memory parallel computers (in terms of number of processors) have been built: at the time of writing, many of them have two or four CPUs, but there also exist shared-memory systems with more than a thousand CPUs in use, and the number that can be configured is growing. The technology used to connect the processors and memory has improved significantly since the early days [44]. Recent developments in hardware technology have made architectural parallelism of this kind important for mainstream computing. In the past few decades, the components used to build both high-end and desktop machines have continually decreased in size. Shortly before 1990, Intel announced that the company had put a million transistors onto a single chip (the i860). A few years later, several companies had fit 10 million onto a chip. In the meantime, technological progress has made it possible to put billions of transistors on a single chip. As data paths became shorter, the rate at which instructions were issued could be increased. Raising the clock speed became a major source of advances in processor performance. This approach has inherent limitations, however, particu- larly with respect to power consumption and heat emission, which is increasingly hard to dissipate. Recently, therefore, computer architects have begun to emphasize other strategies for increasing hardware performance and making better use of the available space on the chip. Given the limited usefulness of adding functional units, they have returned to the ideas of the 1980s: multiple processors that share memory are configured in a single machine and, increasingly, on a chip. This new generation of shared-memory parallel computers is inexpensive and is intended for general-purpose usage. Some recent computer designs permit a single processor to execute multiple in- struction streams in an interleaved way. Simultaneous multithreading, for example, interleaves instructions from multiple applications in an attempt to use more of the 1Actually, the idea was older than that, but it didn’t take off until the 1980s.
hardware components at any given time. For instance, the computer might add two values from one set of instructions and, at the same time, fetch a value from memory that is needed to perform an operation in a different set of instructions. An example is Intel's hyperthreading™ technology. Other recent platforms (e.g., IBM's Power5, AMD's Opteron and Sun's UltraSPARC IV, IV+, and T1 processors) go even further, replicating substantial parts of a processor's logic on a single chip and behaving much like shared-memory parallel machines. This approach is known as multicore. Simultaneous multithreading platforms, multicore machines, and shared-memory parallel computers all provide system support for the execution of multiple independent instruction streams, or threads. Moreover, these technologies may be combined to create computers that can execute high numbers of threads.

Given the limitations of alternative strategies for creating more powerful computers, the use of parallelism in general-purpose hardware is likely to be more pronounced in the near future. Some PCs and laptops are already multicore or multithreaded. Soon, processors will routinely have many cores and possibly the ability to execute multiple instruction streams within each core. In other words, multicore technology is going mainstream [159]. It is vital that application software be able to make effective use of the parallelism that is present in our hardware [171]. But despite major strides in compiler technology, the programmer will need to help, by describing the concurrency that is contained in application codes. In this book, we will discuss one of the easiest ways in which this can be done.

1.2 Shared-Memory Parallel Computers

Throughout this book, we will refer to shared-memory parallel computers as SMPs. Early SMPs included computers produced by Alliant, Convex, Sequent [146], Encore, and Synapse [10] in the 1980s. Larger shared-memory machines included IBM's RP3 research computer [149] and commercial systems such as the BBN Butterfly [23]. Later SGI's Power Challenge [65] and Sun Microsystems' Enterprise servers entered the market, followed by a variety of desktop SMPs.

The term SMP was originally coined to designate a symmetric multiprocessor system, a shared-memory parallel computer whose individual processors share memory (and I/O) in such a way that each of them can access any memory location with the same speed; that is, they have a uniform memory access (UMA) time. Many small shared-memory machines are symmetric in this sense. Larger shared-memory machines, however, usually do not satisfy this definition; even though the difference may be relatively small, some memory may be "nearer to" one or more of the processors and thus accessed faster by them. We say that such machines have
cache-coherent non-uniform memory access (cc-NUMA). Early innovative attempts to build cc-NUMA shared-memory machines were undertaken by Kendall Square Research (KSR1 [62]) and Denelcor (the Denelcor HEP). More recent examples of large NUMA platforms with cache coherency are SGI's Origin and Altix series, HP's Exemplar, and Sun Fire E25K.

Today, the major hardware vendors all offer some form of shared-memory parallel computer, with sizes ranging from two to hundreds – and, in a few cases, thousands – of processors. Conveniently, the acronym SMP can also stand for "shared-memory parallel computer," and we will use it to refer to all shared-memory systems, including cc-NUMA platforms. By and large, the programmer can ignore this difference, although techniques that we will explore in later parts of the book can help take cc-NUMA characteristics into account.

1.2.1 Cache Memory Is Not Shared

Somewhat confusing is the fact that even SMPs have some memory that is not shared. To explain why this is the case and what the implications for applications programming are, we present some background information.

One of the major challenges facing computer architects today is the growing discrepancy in processor and memory speed. Processors have been consistently getting faster. But the more rapidly they can perform instructions, the quicker they need to receive the values of operands from memory. Unfortunately, the speed with which data can be read from and written to memory has not increased at the same rate. In response, the vendors have built computers with hierarchical memory systems, in which a small, expensive, and very fast memory called cache memory, or "cache" for short, supplies the processor with data and instructions at high rates [74]. Each processor of an SMP needs its own private cache if it is to be fed quickly; hence, not all memory is shared.

Figure 1.1 shows an example of a generic, cache-based dual-core processor. There are two levels of cache. The term level is used to denote how far away (in terms of access time) a cache is from the CPU, or core. The higher the level, the longer it takes to access the cache(s) at that level. At level 1 we distinguish a cache for data ("Data Cache"), one for instructions ("Instr. Cache"), and the "Translation-Lookaside Buffer" (or TLB for short). The last of these is an address cache. It is discussed in Section 5.2.2. These three caches are all private to a core: other core(s) cannot access them. Our figure shows only one cache at the second level. It
is most likely bigger than each of the level-1 caches, and it is shared by both cores. It is also unified, which means that it contains instructions as well as data.

[Figure 1.1: Block diagram of a generic, cache-based dual-core processor – In this imaginary processor, there are two levels of cache. Those closest to the core are called "level 1." The higher the level, the farther away from the CPU (measured in access time) the cache is. The level-1 cache is private to the core, but the cache at the second level is shared. Both cores can use it to store and retrieve instructions, as well as data.]

Data is copied into cache from main memory: blocks of consecutive memory locations are transferred at a time. Since the cache is very small in comparison to main memory, a new block may displace data that was previously copied in. An operation can be (almost) immediately performed if the values it needs are available in cache. But if they are not, there will be a delay while the corresponding data is retrieved from main memory. Hence, it is important to manage cache carefully. Since neither the programmer nor the compiler can directly put data into—or remove data from—cache, it is useful to learn how to structure program code to indirectly make sure that cache is utilized well.[2]

[2] The techniques developed to accomplish this task are useful for sequential programming, too. They are briefly covered in Section 5.2.3.
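To make this point concrete, the small C sketch below is our own illustration; it is not one of this book's examples, and the array size is an arbitrary assumption. Because C stores two-dimensional arrays row by row, the row-wise loop visits consecutive memory locations and reuses each cache block it fetches, while the column-wise loop strides through memory and is likely to incur far more cache misses, even though both loops compute the same result.

   #include <stdio.h>
   #define N 1024

   static double a[N][N];

   int main(void)
   {
      double sum = 0.0;
      int i, j;

      /* Row-wise traversal: the inner loop touches consecutive
         addresses, so each cache block brought in from main memory
         is reused for several array elements.                       */
      for (i = 0; i < N; i++)
         for (j = 0; j < N; j++)
            sum += a[i][j];

      /* Column-wise traversal computes the same sum, but the inner
         loop jumps N*sizeof(double) bytes between accesses and may
         touch a different cache block on every iteration.           */
      for (j = 0; j < N; j++)
         for (i = 0; i < N; i++)
            sum += a[i][j];

      printf("sum = %f\n", sum);
      return 0;
   }

On most cache-based machines the first loop nest runs noticeably faster than the second, which is precisely the kind of indirect cache management referred to above.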
1.2.2 Implications of Private Cache Memory

In a uniprocessor system, new values computed by the processor are written back to cache, where they remain until their space is required for other data. At that point, any new values that have not already been copied back to main memory are stored back there. This strategy does not work for SMP systems. When a processor of an SMP stores results of local computations in its private cache, the new values are accessible only to code executing on that processor. If no extra precautions are taken, they will not be available to instructions executing elsewhere on an SMP machine until after the corresponding block of data is displaced from cache. But it may not be clear when this will happen. In fact, since the old values might still be in other private caches, code executing on other processors might continue to use them even then.

This is known as the memory consistency problem. A number of strategies have been developed to help overcome it. Their purpose is to ensure that updates to data that have taken place on one processor are made known to the program running on other processors, and to make the modified values available to them if needed. A system that provides this functionality transparently is said to be cache coherent.

Fortunately, the OpenMP application developer does not need to understand how cache coherency works on a given computer. Indeed, OpenMP can be implemented on a computer that does not provide cache coherency, since it has its own set of rules on how data is shared among the threads running on different processors. Instead, the programmer must be aware of the OpenMP memory model, which provides for shared and private data and specifies when updated shared values are guaranteed to be available to all of the code in an OpenMP program.

1.3 Programming SMPs and the Origin of OpenMP

Once the vendors had the technology to build moderately priced SMPs, they needed to ensure that their compute power could be exploited by individual applications. This is where things got sticky. Compilers had always been responsible for adapting a program to make best use of a machine's internal parallelism. Unfortunately, it is very hard for them to do so for a computer with multiple processors or cores. The reason is that the compilers must then identify independent streams of instructions that can be executed in parallel. Techniques to extract such instruction streams from a sequential program do exist; and, for simple programs, it may be worthwhile trying out a compiler's automatic (shared-memory) parallelization options. However, the compiler often does not have enough information to decide whether it is
possible to split up a program in this way. It also cannot make large-scale changes to code, such as replacing an algorithm that is not suitable for parallelization. Thus, most of the time the compiler will need some help from the user.

1.3.1 What Are the Needs?

To understand how programmers might express a code's parallelism, the hardware manufacturers looked carefully at existing technology. Beginning in the 1980s, scientists engaged in solving particularly tough computational problems attempted to exploit the SMPs of the day to speed up their code and to perform much larger computations than were possible on a uniprocessor. To get the multiple processors to collaborate to execute a single application, they looked for regions of code whose instructions could be shared among the processors. Much of the time, they focused on distributing the work in loop nests to the processors.

In most programs, code executed on one processor required results that had been calculated on another one. In principle, this was not a problem because a value produced by one processor could be stored in main memory and retrieved from there by code running on other processors as needed. However, the programmer needed to ensure that the value was retrieved after it had been produced, that is, that the accesses occurred in the required order. Since the processors operated independently of one another, this was a nontrivial difficulty: their clocks were not synchronized, and they could and did execute their portions of the code at slightly different speeds.

Accordingly, the vendors of SMPs in the 1980s provided special notation to specify how the work of a program was to be parceled out to the individual processors of an SMP, as well as to enforce an ordering of accesses by different threads to shared data. The notation mainly took the form of special instructions, or directives, that could be added to programs written in sequential languages, especially Fortran. The compiler used this information to create the actual code for execution by each processor. Although this strategy worked, it had the obvious deficiency that a program written for one SMP did not necessarily execute on another one.

1.3.2 A Brief History of Saving Time

Toward the end of the 1980s, vendors began to collaborate to improve this state of affairs. An informal industry group called the Parallel Computing Forum (PCF) agreed on a set of directives for specifying loop parallelism in Fortran programs; their work was published in 1991 [59]. An official ANSI subcommittee called X3H5 was set up to develop an ANSI standard based on PCF. A document for the new
standard was drafted in 1994 [19], but it was never formally adopted. Interest in PCF and X3H5 had dwindled with the rise of other kinds of parallel computers that promised a scalable and more cost-effective approach to parallel programming. The X3H5 standardization effort had missed its window of opportunity. But this proved to be a temporary setback.

OpenMP was defined by the OpenMP Architecture Review Board (ARB), a group of vendors who joined forces during the latter half of the 1990s to provide a common means for programming a broad range of SMP architectures. OpenMP was based on the earlier PCF work. The first version, consisting of a set of directives that could be used with Fortran, was introduced to the public in late 1997. OpenMP compilers began to appear shortly thereafter. Since that time, bindings for C and C++ have been introduced, and the set of features has been extended. Compilers are now available for virtually all SMP platforms. The number of vendors involved in maintaining and further developing its features has grown. Today, almost all the major computer manufacturers, major compiler companies, several government laboratories, and groups of researchers belong to the ARB.

One of the biggest advantages of OpenMP is that the ARB continues to work to ensure that OpenMP remains relevant as computer technology evolves. OpenMP is under cautious, but active, development; and features continue to be proposed for inclusion into the application programming interface. Applications live vastly longer than computer architectures and hardware technologies; and, in general, application developers are careful to use programming languages that they believe will be supported for many years to come. The same is true for parallel programming interfaces.

1.4 What Is OpenMP?

OpenMP is a shared-memory application programming interface (API) whose features, as we have just seen, are based on prior efforts to facilitate shared-memory parallel programming. Rather than an officially sanctioned standard, it is an agreement reached between the members of the ARB, who share an interest in a portable, user-friendly, and efficient approach to shared-memory parallel programming. OpenMP is intended to be suitable for implementation on a broad range of SMP architectures. As multicore machines and multithreading processors spread in the marketplace, it might be increasingly used to create programs for uniprocessor computers also.

Like its predecessors, OpenMP is not a new programming language. Rather, it is notation that can be added to a sequential program in Fortran, C, or C++ to
describe how the work is to be shared among threads that will execute on different processors or cores and to order accesses to shared data as needed. The appropriate insertion of OpenMP features into a sequential program will allow many, perhaps most, applications to benefit from shared-memory parallel architectures—often with minimal modification to the code. In practice, many applications have considerable parallelism that can be exploited.

The success of OpenMP can be attributed to a number of factors. One is its strong emphasis on structured parallel programming. Another is that OpenMP is comparatively simple to use, since the burden of working out the details of the parallel program is up to the compiler. It has the major advantage of being widely adopted, so that an OpenMP application will run on many different platforms.

But above all, OpenMP is timely. With the strong growth in deployment of both small and large SMPs and other multithreading hardware, the need for a shared-memory programming standard that is easy to learn and apply is accepted throughout the industry. The vendors behind OpenMP collectively deliver a large fraction of the SMPs in use today. Their involvement with this de facto standard ensures its continued applicability to their architectures.

1.5 Creating an OpenMP Program

OpenMP's directives let the user tell the compiler which instructions to execute in parallel and how to distribute them among the threads that will run the code. An OpenMP directive is an instruction in a special format that is understood by OpenMP compilers only. In fact, it looks like a comment to a regular Fortran compiler or a pragma to a C/C++ compiler, so that the program may run just as it did beforehand if a compiler is not OpenMP-aware. The API does not have many different directives, but they are powerful enough to cover a variety of needs. In the chapters that follow, we will introduce the basic idea of OpenMP and then each of the directives in turn, giving examples and discussing their main uses.

The first step in creating an OpenMP program from a sequential one is to identify the parallelism it contains. Basically, this means finding instructions, sequences of instructions, or even large regions of code that may be executed concurrently by different processors.

Sometimes, this is an easy task. Sometimes, however, the developer must reorganize portions of a code to obtain independent instruction sequences. It may even be necessary to replace an algorithm with an alternative one that accomplishes the same task but offers more exploitable parallelism. This can be a challenging problem. Fortunately, there are some typical kinds of parallelism in programs, and
a variety of strategies for exploiting them have been developed. A good deal of knowledge also exists about algorithms and their suitability for parallel execution. A growing body of literature is being devoted to this topic [102, 60] and to the design of parallel programs [123, 152, 72, 34]. In this book, we will introduce some of these strategies by way of examples and will describe typical approaches to creating parallel code using OpenMP.

The second step in creating an OpenMP program is to express, using OpenMP, the parallelism that has been identified. A huge practical benefit of OpenMP is that it can be applied to incrementally create a parallel program from an existing sequential code. The developer can insert directives into a portion of the program and leave the rest in its sequential form. Once the resulting program version has been successfully compiled and tested, another portion of the code can be parallelized. The programmer can terminate this process once the desired speedup has been obtained.

Although creating an OpenMP program in this way can be easy, sometimes simply inserting directives is not enough. The resulting code may not deliver the expected level of performance, and it may not be obvious how to remedy the situation. Later, we will introduce techniques that may help improve a parallel program, and we will give insight into how to investigate performance problems. Armed with this information, one may be able to take a simple OpenMP program and make it run better, maybe even significantly better. It is essential that the resulting code be correct, and thus we also discuss the perils and pitfalls of the process. Finding certain kinds of bugs in parallel programs can be difficult, so an application developer should endeavor to prevent them by adopting best practices from the start.

Generally, one can quickly and easily create parallel programs by relying on the implementation to work out the details of parallel execution. This is how OpenMP directives work. Unfortunately, however, it is not always possible to obtain high performance by a straightforward, incremental insertion of OpenMP directives into a sequential program. To address this situation, OpenMP designers included several features that enable the programmer to specify more details of the parallel code. Later in the book, we will describe a completely different way of using OpenMP to take advantage of these features. Although it requires quite a bit more work, users may find that getting their hands downright dirty by creating the code for each thread can be a lot of fun. And, this may be the ticket to getting OpenMP to solve some very large problems on a very big machine.
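To hint at what that lower-level style looks like, the following C sketch is our own illustration; the array, its size, and the chunking scheme are assumptions, not an example from a later chapter. Instead of letting a worksharing directive distribute the loop, each thread queries its own thread number and works out the bounds of its chunk of the iteration space.

   #include <stdio.h>
   #include <omp.h>
   #define N 1000

   int main(void)
   {
      static double a[N];

      #pragma omp parallel
      {
         int nthreads = omp_get_num_threads();  /* size of the team     */
         int myid     = omp_get_thread_num();   /* this thread's number */
         int chunk    = (N + nthreads - 1) / nthreads;
         int first    = myid * chunk;
         int last     = (first + chunk < N) ? (first + chunk) : N;
         int i;

         /* Each thread fills only its own chunk of the shared array. */
         for (i = first; i < last; i++)
            a[i] = 0.5 * i;
      }  /* all chunks are complete beyond this point */

      printf("a[N-1] = %f\n", a[N - 1]);
      return 0;
   }

The same effect could be obtained with a single parallel loop directive; the point here is only that OpenMP also permits this explicit, per-thread division of work when the programmer wants that level of control.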
1.6 The Bigger Picture

Many kinds of computer architectures have been built that exploit parallelism [55]. In fact, parallel computing has been an indispensable technology in many cutting-edge disciplines for several decades. One of the earliest kinds of parallel systems were the powerful and expensive vector computers that used the idea of pipelining instructions to apply the same operation to many data objects in turn (e.g., Cyber-205 [114], CRAY-1 [155], Fujitsu Facom VP-200 [135]). These systems dominated the high end of computing for several decades, and machines of this kind are still deployed. Other platforms were built that simultaneously applied the same operation to many data objects (e.g., CM2 [80], MasPar [140]).

Many systems have been produced that connect multiple independent computers via a network; both proprietary and off-the-shelf networks have been deployed. Early products based on this approach include Intel's iPSC series [28] and machines built by nCUBE and Meiko [22]. Memory is associated with each of the individual computers in the network and is thus distributed across the machine. These distributed-memory parallel systems are often referred to as massively parallel computers (MPPs) because very large systems can be put together this way. Information on some of the fastest machines built during the past decade and the technology used to build them can be found at http://www.top500.org.

Many MPPs are in use today, especially for scientific computing. If distributed-memory computers are designed with additional support that enables memory to be shared between all the processors, they are also SMPs according to our definition. Such platforms are often called distributed shared-memory computers (DSMs) to emphasize the distinctive nature of this architecture (e.g., SGI Origin [106]). When distributed-memory computers are constructed by using standard workstations or PCs and an off-the-shelf network, they are usually called clusters [169]. Clusters, which are often composed of SMPs, are much cheaper to build than proprietary MPPs. This technology has matured in recent years, so that clusters are common in universities and laboratories as well as in some companies. Thus, although SMPs are the most widespread kind of parallel computer in use, there are many other kinds of parallel machines in the marketplace, particularly for high-end applications.

Figure 1.2 shows the difference in these architectures: in (a) we see a shared-memory system where processors share main memory but have their own private cache; (b) depicts an MPP in which memory is distributed among the processors, or nodes, of the system. The platform in (c) is identical to (b) except for the fact that the distributed memories are accessible to all processors. The cluster in (d) consists of a set of independent computers linked by a network.
[Figure 1.2: Distributed- and shared-memory computers – The machine in (a) has physically shared memory, whereas the others have distributed memory. However, the memory in (c) is accessible to all processors.]

An equally broad range of applications makes use of parallel computers [61]. Very early adopters of this technology came from such disciplines as aeronautics, aerospace, and chemistry, where vehicles were designed, materials tested, and their properties evaluated long before they were constructed. Scientists in many disciplines have achieved monumental insights into our universe by running parallel programs that model real-world phenomena with high levels of accuracy. Theoretical results were confirmed in ways that could not be done via experimentation. Parallel computers have been used to improve the design and production of goods from automobiles to packaging for refrigerators and to ensure that the designs comply with pricing and behavioral constraints, including regulations. They have been used to study natural phenomena that we are unable to fully observe, such as the formation of galaxies and the interactions of molecules. But they are also routinely used in weather forecasting, and improvements in the accuracy of our daily forecasts are mainly the result of deploying increasingly fast (and large) parallel computers. More recently, they have been widely used in Hollywood and elsewhere
to generate highly realistic film sequences and special effects. In this context, too, the ability to build bigger parallel computers has led to higher-quality results, here in the form of more realistic imagery. Of course, parallel computers are also used to digitally remaster old film and to perform many other tasks involving image processing. Other areas using substantial parallel computing include drug design, financial and economic forecasting, climate modeling, surveillance, and medical imaging. It is routine in many areas of engineering, chemistry, and physics, and almost all commercial databases are able to exploit parallel machines.

1.7 Parallel Programming Models

Just as there are several different classes of parallel hardware, so too are there several distinct models of parallel programming. Each of them has a number of concrete realizations. OpenMP realizes a shared-memory (or shared address space) programming model. This model assumes, as its name implies, that programs will be executed on one or more processors that share some or all of the available memory. Shared-memory programs are typically executed by multiple independent threads (execution states that are able to process an instruction stream); the threads share data but may also have some additional, private data. Shared-memory approaches to parallel programming must provide, in addition to a normal range of instructions, a means for starting up threads, assigning work to them, and coordinating their accesses to shared data, including ensuring that certain operations are performed by only one thread at a time [15].

A different programming model has been proposed for distributed-memory systems. Generically referred to as "message passing," this model assumes that programs will be executed by one or more processes, each of which has its own private address space [69]. Message-passing approaches to parallel programming must provide a means to initiate and manage the participating processes, along with operations for sending and receiving messages, and possibly for performing special operations across data distributed among the different processes. The pure message-passing model assumes that processes cooperate to exchange messages whenever one of them needs data produced by another one. However, some recent models are based on "single-sided communication." These assume that a process may interact directly with memory across a network to read and write data anywhere on a machine.

Various realizations of both shared- and distributed-memory programming models have been defined and deployed. An ideal API for parallel programming is expressive enough to permit the specification of many parallel algorithms, is easy
to use, and leads to efficient programs. Moreover, the more transparent its implementation is, the easier it is likely to be for the programmer to understand how to obtain good performance. Unfortunately, there are trade-offs between these goals and parallel programming APIs differ in the features provided and in the manner and complexity of their implementation. Some are a collection of library routines with which the programmer may specify some or all of the details of parallel execution (e.g., GA [141] and Pthreads [108] for shared-memory programming and MPI for MPPs), while others such as OpenMP and HPF [101] take the form of additional instructions to the compiler, which is expected to utilize them to generate the parallel code.

1.7.1 Realization of Shared- and Distributed-Memory Models

Initially, vendors of both MPPs and SMPs provided their own custom sets of instructions for exploiting the parallelism in their machines. Application developers had to work hard to modify their codes when they were ported from one machine to another. As the number of parallel machines grew and as more and more parallel programs were written, developers began to demand standards for parallel programming. Fortunately, such standards now exist.

MPI, or the Message Passing Interface, was defined in the early 1990s by a group of researchers and vendors who based their work on existing vendor APIs [69, 137, 147]. It provides a comprehensive set of library routines for managing processes and exchanging messages. MPI is widely used in high-end computing, where problems are so large that many computers are needed to attack them. It is comparatively easy to implement on a broad variety of platforms and therefore provides excellent portability. However, the portability comes at a cost. Creating a parallel program based on this API typically requires a major reorganization of the original sequential code. The development effort can be large and complex compared to a compiler-supported approach such as that offered by OpenMP.

One can also combine some programming APIs. In particular, MPI and OpenMP may be used together in a program, which may be useful if a program is to be executed on MPPs that consist of multiple SMPs (possibly with multiple cores each). Reasons for doing so include exploiting a finer granularity of parallelism than possible with MPI, reducing memory usage, or reducing network communication. Various commercial codes have been programmed using both MPI and OpenMP. Combining MPI and OpenMP effectively is nontrivial, however, and in Chapter 6 we return to this topic and to the challenge of creating OpenMP codes that will work well on large systems.
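To give a flavor of such a hybrid program, the sketch below is our own minimal illustration; the quantity being summed, the problem size, and the block distribution are arbitrary assumptions, and it is not one of the commercial codes mentioned above. Work is first divided coarsely across MPI processes, and each process then uses an OpenMP directive to spread its block of iterations over the threads of the SMP node on which it runs.

   #include <stdio.h>
   #include <mpi.h>
   #define N 1000000

   int main(int argc, char *argv[])
   {
      int rank, nprocs, i, first, last;
      double local = 0.0, total = 0.0;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      /* Coarse-grained decomposition across MPI processes ...        */
      first = rank * (N / nprocs);
      last  = (rank == nprocs - 1) ? N : first + (N / nprocs);

      /* ... and fine-grained parallelism across the threads of each
         process via OpenMP.                                          */
      #pragma omp parallel for reduction(+:local)
      for (i = first; i < last; i++)
         local += 1.0 / (double)(i + 1);

      MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
      if (rank == 0) printf("total = %f\n", total);

      MPI_Finalize();
      return 0;
   }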
1.8 Ways to Create Parallel Programs

In this section, we briefly compare OpenMP with the most important alternatives for programming shared-memory machines. Some vendors also provide custom APIs on their platforms. Although such APIs may be fast (this is, after all, the purpose of a custom API), programs written using them may have to be substantially rewritten to function on a different machine. We do not consider APIs that were not designed for broad use.

Automatic parallelization: Many compilers provide a flag, or option, for automatic program parallelization. When this is selected, the compiler analyzes the program, searching for independent sets of instructions, and in particular for loops whose iterations are independent of one another. It then uses this information to generate explicitly parallel code. One of the ways in which this could be realized is to generate OpenMP directives, which would enable the programmer to view and possibly improve the resulting code. The difficulty with relying on the compiler to detect and exploit parallelism in an application is that it may lack the necessary information to do a good job. For instance, it may need to know the values that will be assumed by loop bounds or the range of values of array subscripts: but this is often unknown ahead of run time. In order to preserve correctness, the compiler has to conservatively assume that a loop is not parallel whenever it cannot prove the contrary. Needless to say, the more complex the code, the more likely it is that this will occur. Moreover, it will in general not attempt to parallelize regions larger than loop nests. For programs with a simple structure, it may be worth trying this option.

MPI: The Message Passing Interface [137] was developed to facilitate portable programming for distributed-memory architectures (MPPs), where multiple processes execute independently and communicate data as needed by exchanging messages. The API was designed to be highly expressive and to enable the creation of efficient parallel code, as well as to be broadly implementable. As a result of its success in these respects, it is the most widely used API for parallel programming in the high-end technical computing community, where MPPs and clusters are common. Since most vendors of shared-memory systems also provide MPI implementations that leverage the shared address space, we include it here.

Creating an MPI program can be tricky. The programmer must create the code that will be executed by each process, and this implies a good deal of reprogramming. The need to restructure the entire program does not allow for incremental parallelization as does OpenMP. It can be difficult to create a single program version that will run efficiently on many different systems, since the relative cost of
communicating data and performing computations varies from one system to another and may suggest different approaches to extracting parallelism. Care must be taken to avoid certain programming errors, particularly deadlock where two or more processes each wait in perpetuity for the other to send a message. A good introduction to MPI programming is provided in [69] and [147]. Since many MPPs consist of a collection of SMPs, MPI is increasingly mixed with OpenMP to create a program that directly matches the hardware. A recent revision of the standard, MPI-2 ([58]), facilitates their integration.

Pthreads: This is a set of threading interfaces developed by the IEEE (Institute of Electrical and Electronics Engineers) committees in charge of specifying a Portable Operating System Interface (POSIX). It realizes the shared-memory programming model via a collection of routines for creating, managing and coordinating a collection of threads. Thus, like MPI, it is a library. Some features were primarily designed for uniprocessors, where context switching enables a time-sliced execution of multiple threads, but it is also suitable for programming small SMPs. The Pthreads library aims to be expressive as well as portable, and it provides a fairly comprehensive set of features to create, terminate, and synchronize threads and to prevent different threads from trying to modify the same values at the same time: it includes mutexes, locks, condition variables, and semaphores. However, programming with Pthreads is much more complex than with OpenMP, and the resulting code is likely to differ substantially from a prior sequential program (if there is one). Even simple tasks are performed via multiple steps, and thus a typical program will contain many calls to the Pthreads library. For example, to execute a simple loop in parallel, the programmer must declare threading structures, create and terminate the threads individually, and compute the loop bounds for each thread. If interactions occur within loop iterations, the amount of thread-specific code can increase substantially. Compared to Pthreads, the OpenMP API directives make it easy to specify parallel loop execution, to synchronize threads, and to specify whether or not data is to be shared. For many applications, this is sufficient.

1.8.1 A Simple Comparison

The code snippets below demonstrate the implementation of a dot product in each of the programming APIs MPI, Pthreads, and OpenMP. We do not explain in detail the features used here, as our goal is simply to illustrate the flavor of each, although we will introduce those used in the OpenMP code in later chapters.
Sequential Dot-Product

   #include <stdio.h>

   int main(argc, argv)
   int argc;
   char *argv[];
   {
      double sum;
      double a[256], b[256];
      int i, n;

      n = 256;
      for (i = 0; i < n; i++) {
         a[i] = i * 0.5;
         b[i] = i * 2.0;
      }
      sum = 0;
      for (i = 0; i < n; i++) {
         sum = sum + a[i]*b[i];
      }
      printf("sum = %f\n", sum);
   }

The sequential program multiplies the individual elements of two arrays and saves the result in the variable sum; sum is a so-called reduction variable.

Dot-Product in MPI

   #include <stdio.h>
   #include <mpi.h>

   int main(argc, argv)
   int argc;
   char *argv[];
   {
      double sum, sum_local;
      double a[256], b[256];
      int i, n, numprocs, myid, my_first, my_last;

      n = 256;
      MPI_Init(&argc, &argv);
      MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
      MPI_Comm_rank(MPI_COMM_WORLD, &myid);
      my_first = myid * n/numprocs;
      my_last  = (myid + 1) * n/numprocs;

      for (i = 0; i < n; i++) {
         a[i] = i * 0.5;
         b[i] = i * 2.0;
      }
      sum_local = 0;
      for (i = my_first; i < my_last; i++) {
         sum_local = sum_local + a[i]*b[i];
      }
      MPI_Allreduce(&sum_local, &sum, 1, MPI_DOUBLE,
                    MPI_SUM, MPI_COMM_WORLD);
      if (myid == 0) printf("sum = %f\n", sum);
      MPI_Finalize();
   }

Under MPI, all data is local. To implement the dot-product, each process builds a partial sum, the sum of its local data. To do so, each executes a portion of the original loop. Data and loop iterations are accordingly manually shared among processors by the programmer. In a subsequent step, the partial sums have to be communicated and combined to obtain the global result. MPI provides the global communication routine MPI_Allreduce for this purpose.

Dot-Product in Pthreads

   #include <stdio.h>
   #include <pthread.h>

   #define NUMTHRDS 4

   double sum;
   double a[256], b[256];
   void *status;
   int n = 256;
   pthread_t thd[NUMTHRDS];
   pthread_mutex_t mutexsum;
   void *dotprod(void *arg);

   int main(argc, argv)
   int argc;
   char *argv[];
   {
      int i;
      pthread_attr_t attr;
      for (i = 0; i < n; i++) {
         a[i] = i * 0.5;
         b[i] = i * 2.0;
      }

      pthread_mutex_init(&mutexsum, NULL);
      pthread_attr_init(&attr);
      pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

      for (i = 0; i < NUMTHRDS; i++) {
         pthread_create(&thd[i], &attr, dotprod, (void *)i);
      }
      pthread_attr_destroy(&attr);

      for (i = 0; i < NUMTHRDS; i++) {
         pthread_join(thd[i], &status);
      }

      printf("sum = %f\n", sum);
      pthread_mutex_destroy(&mutexsum);
      pthread_exit(NULL);
   }

   void *dotprod(void *arg)
   {
      int myid, i, my_first, my_last;
      double sum_local;

      myid = (int)arg;
      my_first = myid * n/NUMTHRDS;
      my_last  = (myid + 1) * n/NUMTHRDS;

      sum_local = 0;
      for (i = my_first; i < my_last; i++) {
         sum_local = sum_local + a[i]*b[i];
      }
      pthread_mutex_lock(&mutexsum);
      sum = sum + sum_local;
      pthread_mutex_unlock(&mutexsum);
      pthread_exit((void *) 0);
   }

In the Pthreads programming API, all data is shared but logically distributed among the threads. Access to globally shared data needs to be explicitly synchronized by the user. In the dot-product implementation shown, each thread builds a partial sum and then adds its contribution to the global sum. Access to the global sum is protected by a lock so that only one thread at a time updates this variable. We note that the implementation effort in Pthreads is as high as, if not higher than, in MPI.

Dot-Product in OpenMP

   #include <stdio.h>

   int main(argc, argv)
   int argc;
   char *argv[];
   {
      double sum;
      double a[256], b[256];
      int i, n = 256;

      for (i = 0; i < n; i++) {
         a[i] = i * 0.5;
         b[i] = i * 2.0;
      }
      sum = 0;
      #pragma omp parallel for reduction(+:sum)
      for (i = 0; i < n; i++) {
         sum = sum + a[i]*b[i];
      }
      printf("sum = %f\n", sum);
   }
Under OpenMP, all data is shared by default. In this case, we are able to parallelize the loop simply by inserting a directive that tells the compiler to parallelize it, and identifying sum as a reduction variable. The details of assigning loop iterations to threads, having the different threads build partial sums and their accumulation into a global sum are left to the compiler. Since (apart from the usual variable declarations and initializations) nothing else needs to be specified by the programmer, this code fragment illustrates the simplicity that is possible with OpenMP.

1.9 A Final Word

Given the trend toward bigger SMPs and multithreading computers, it is vital that strategies and techniques for creating shared-memory parallel programs become widely known. Explaining how to use OpenMP in conjunction with the major programming languages Fortran, C, and C++ to write such parallel programs is the purpose of this book. Under OpenMP, one can easily introduce threading in such a way that the same program may run sequentially as well as in parallel. The application developer can rely on the compiler to work out the details of the parallel code or may decide to explicitly assign work to threads. In short, OpenMP is a very flexible medium for creating parallel code.

The discussion of language features in this book is based on the OpenMP 2.5 specification, which merges the previously separate specifications for Fortran and C/C++. At the time of writing, the ARB is working on the OpenMP 3.0 specification, which will expand the model to provide additional convenience and expressivity for the range of architectures that it supports. Further information on this, as well as up-to-date news, can be found at the ARB website http://www.openmp.org and at the website of its user community, http://www.compunity.org. The complete OpenMP specification can also be downloaded from the ARB website.
2 Overview of OpenMP

In this chapter we give an overview of the OpenMP programming interface and compare it with other approaches to parallel programming for SMPs.

2.1 Introduction

The OpenMP Application Programming Interface (API) was developed to enable portable shared memory parallel programming. It aims to support the parallelization of applications from many disciplines. Moreover, its creators intended to provide an approach that was relatively easy to learn as well as apply. The API is designed to permit an incremental approach to parallelizing an existing code, in which portions of a program are parallelized, possibly in successive steps. This is a marked contrast to the all-or-nothing conversion of an entire program in a single step that is typically required by other parallel programming paradigms. It was also considered highly desirable to enable programmers to work with a single source code: if a single set of source files contains the code for both the sequential and the parallel versions of a program, then program maintenance is much simplified. These goals have done much to give the OpenMP API its current shape, and they continue to guide the OpenMP Architecture Review Board (ARB) as it works to provide new features.

2.2 The Idea of OpenMP

A thread is a runtime entity that is able to independently execute a stream of instructions. OpenMP builds on a large body of work that supports the specification of programs for execution by a collection of cooperating threads [15]. The operating system creates a process to execute a program: it will allocate some resources to that process, including pages of memory and registers for holding values of objects. If multiple threads collaborate to execute a program, they will share the resources, including the address space, of the corresponding process. The individual threads need just a few resources of their own: a program counter and an area in memory to save variables that are specific to it (including registers and a stack). Multiple threads may be executed on a single processor or core via context switches; they may be interleaved via simultaneous multithreading. Threads running simultaneously on multiple processors or cores may work concurrently to execute a parallel program.

Multithreaded programs can be written in various ways, some of which permit complex interactions between threads. OpenMP attempts to provide ease of programming and to help the user avoid a number of potential programming errors,
by offering a structured approach to multithreaded programming. It supports the so-called fork-join programming model [48], which is illustrated in Figure 2.1. Under this approach, the program starts as a single thread of execution, just like a sequential program. The thread that executes this code is referred to as the initial thread. Whenever an OpenMP parallel construct is encountered by a thread while it is executing the program, it creates a team of threads (this is the fork), becomes the master of the team, and collaborates with the other members of the team to execute the code dynamically enclosed by the construct. At the end of the construct, only the original thread, or master of the team, continues; all others terminate (this is the join). Each portion of code enclosed by a parallel construct is called a parallel region.

[Figure 2.1: The fork-join programming model supported by OpenMP – The program starts as a single thread of execution, the initial thread. A team of threads is forked at the beginning of a parallel region and joined at the end.]

OpenMP expects the application developer to give a high-level specification of the parallelism in the program and the method for exploiting that parallelism. Thus it provides notation for indicating the regions of an OpenMP program that should be executed in parallel; it also enables the provision of additional information on how this is to be accomplished. The job of the OpenMP implementation is to sort out the low-level details of actually creating independent threads to execute the code and to assign work to them according to the strategy specified by the programmer.
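As a minimal illustration of this fork-join behavior, the short C program below is our own sketch rather than a listing from this chapter. Execution begins with a single initial thread, a team is forked at the parallel construct, and the team joins back into a single thread when the construct ends.

   #include <stdio.h>
   #include <omp.h>

   int main(void)
   {
      printf("before the parallel region: one initial thread\n");

      /* Fork: a team of threads executes the enclosed block. */
      #pragma omp parallel
      {
         printf("hello from thread %d of %d\n",
                omp_get_thread_num(), omp_get_num_threads());
      }  /* Join: only the initial (master) thread continues. */

      printf("after the parallel region: one thread again\n");
      return 0;
   }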
2.3 The Feature Set

The OpenMP API comprises a set of compiler directives, runtime library routines, and environment variables to specify shared-memory parallelism in Fortran and C/C++ programs. An OpenMP directive is a specially formatted comment or pragma that generally applies to the executable code immediately following it in the program. A directive or OpenMP routine generally affects only those threads that encounter it. Many of the directives are applied to a structured block of code, a sequence of executable statements with a single entry at the top and a single exit at the bottom in Fortran programs, and an executable statement in C/C++ (which may be a compound statement with a single entry and single exit). In other words, the program may not branch into or out of blocks of code associated with directives. In Fortran programs, the start and end of the applicable block of code are explicitly marked by OpenMP directives. Since the end of the block is explicit in C/C++, only the start needs to be marked.

OpenMP provides means for the user to

• create teams of threads for parallel execution,
• specify how to share work among the members of a team,
• declare both shared and private variables, and
• synchronize threads and enable them to perform certain operations exclusively (i.e., without interference by other threads).

In the following sections, we give an overview of the features of the API. In subsequent chapters we describe these features and show how they can be used to create parallel programs.

2.3.1 Creating Teams of Threads

A team of threads is created to execute the code in a parallel region of an OpenMP program. To accomplish this, the programmer simply specifies the parallel region by inserting a parallel directive immediately before the code that is to be executed in parallel to mark its start; in Fortran programs, the end is also marked by an end parallel directive. Additional information can be supplied along with the parallel directive. This is mostly used to enable threads to have private copies of some data for the duration of the parallel region and to initialize that data. At the end of a parallel region is an implicit barrier synchronization: this means that no thread can progress until all other threads in the team have reached that point in the program.
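The following small C sketch is our own illustration of these points, with hypothetical variable names (tid, shared_data) that do not come from this book. The parallel directive gives each thread its own private copy of tid while the array remains shared, and the implicit barrier at the end of the region guarantees that all threads have finished before the initial thread reads the results.

   #include <stdio.h>
   #include <omp.h>
   #define N 8

   int main(void)
   {
      int shared_data[N];
      int tid;                       /* private inside the region */

      #pragma omp parallel private(tid) shared(shared_data)
      {
         tid = omp_get_thread_num();    /* each thread has its own tid */
         if (tid < N)
            shared_data[tid] = tid * tid;
      }  /* implicit barrier: all threads are done before we continue */

      printf("shared_data[0] = %d\n", shared_data[0]);
      return 0;
   }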
    Random documents withunrelated content Scribd suggests to you:
  • 56.
    verse to therhythmed prose of his Prophetic books, Blake struck definitely away from the monotonous and select metres of the eighteenth century, and anticipated the liberty, multiplicity, and variety of the nineteenth. And he differed, almost equally, from all but one or two of his older contemporaries, and from most of his younger for many years, in the colour and fingering of his verse. Bowles, William Lisle (1762-1850).—A generally mediocre poet, who, however, deserves a place of honour here for the sonnets which he published in 1789, and which had an immense influence on Coleridge, Southey, and others of his juniors, not merely in restoring that great form to popularity, but by inculcating description and study of nature in connection with the thoughts and passions of men. Browne, William (1591-1643).—A Jacobean poet of the loosely named Spenserian school—effective in various metres, but a special and early exponent of the enjambed couplet. Browning, Elizabeth Barrett (1806-1861).—Remarkable here for her adoption of the nineteenth-century principle of the widest possible metrical experiment and variety. In actual metre effective, though sometimes a little slipshod. In rhyme a portent and a warning. Perhaps the worst rhymester in the English language—perpetrating, and attempting to defend on a mistaken view of assonance, cacophonies so hideous that they need not sully this page. Browning, Robert (1812-1889).—Often described as a loose and rugged metrist, and a licentious, if not criminal, rhymester. Nothing of the sort. Extraordinarily bold in both capacities, and sometimes, perhaps, as usually happens in these cases, a little too bold; but in metre practically never, in rhyme very seldom (and then only for purposes of designed contrast, like the farce in tragedy), overstepping actual bounds. A great master of broken metres, internal rhyme, heavily equivalenced lines, and all the tours de force of English prosody.
  • 57.
    Burns, Robert (1759-1796).—Ofthe very greatest importance in historical prosody, because of the shock which his fresh dialect administered to the conventional poetic diction of the eighteenth century, and his unusual and broken measures (especially the famous Burns-metre) to its notions of metric. An admirable performer on the strings that he tried; a master of musical fingering of verse; and to some extent a pioneer of the revival of substitution. Byron, George Gordon, Lord (1788-1824).—Usually much undervalued as a prosodist, even by those who admire him as a poet. Really of great importance in this respect, owing to the variety, and in some cases the novelty, of his accomplishment, and to its immense popularity. His Spenserians in Childe Harold not of the highest class, but the light octaves of Beppo and Don Juan the very best examples of the metre in English. Some fine but rhetorical blank verse, and a great deal of fluent octosyllabic couplet imitated from Scott. But his lyrics of most importance, combining popular appeal with great variety, and sometimes positive novelty, of adjustment and cadence. Diction is his weakest point. Campbell, Thomas (1777-1844).—Not prosodically remarkable in his longer poems, but very much so in some of his shorter, especially The Battle of the Baltic, where the bold shortening of the last line, effective in itself, has proved suggestive to others of even better things, such as the half-humorous, half-plaintive measure of Holmes's The Last Leaf and Locker's Grandmamma. Campion, Thomas (?-1619).—Equally remarkable for the sweetness and variety of his rhymed lyrics in various ordinary measures, and as the advocate and practitioner of a system of rhymeless verse, different from the usual hexametrical attempts of his contemporaries, but still adjusted to classical patterns.
  • 58.
    Canning, George (1770-1827).—Influential,in the general breaking-up of the conventional metres and diction of the eighteenth century, by his parodies of Darwin and his light lyrical pieces in the Anti-Jacobin. Chamberlayne, William (1619-1689).—Remarkable as, in Pharonnida, one of the chief exponents of the beauties, but still more of the dangers, of the enjambed heroic couplet; in his England's Jubile as a rather early, and by no means unaccomplished, practitioner of the rival form. To be carefully distinguished from his contemporary, Robert Chamberlain (fl. c. 1640), a very poor poetaster who wrote a few English hexameters. Chatterton, Thomas (1752-1770).—Of some interest here because his manufactured diction was a protest against the conventional language of eighteenth-century poetry. Of more, because he ventured upon equivalence in octosyllabic couplet, and wrote ballad and other lyrical stanzas, entirely different in form and cadence from those of most of his contemporaries, and less artificial even than those of Collins and Gray. Chaucer, Geoffrey (1340?-1400).—The reducer of the first stage of English prosody to complete form and order; the greatest master of prosodic harmony in our language before the later sixteenth century, and one of the greatest (with value for capacity in language) of all time; the introducer of the decasyllabic couplet—if not absolutely, yet systematically and on a large scale—and of the seven-lined rhyme-royal stanza; and, finally, a poet whose command of the utmost prosodic possibilities of English, at the time of his writing, almost necessitated a temporary prosodic disorder, when those who followed attempted to imitate him with a changed pronunciation, orthography, and word-store. Cleveland, John (1613-1658).—Of no great importance as a poet, but holding a certain position as a comparatively early experimenter with apparently anapæstic measures in his Mark Antony and other pieces.
  • 59.
    Coleridge, Samuel Taylor(1772-1834).—In the Ancient Mariner and Christabel, the great instaurator of equivalence and substitution; a master of many other kinds of metre; and an experimenter in classical versing. Collins, William (1721-1759)—Famous in prosody for his attempt at odes less definitely regular than Gray's, but a vast improvement on the loose Pindaric which had preceded; and for a remarkable attempt at rhymeless verse in that To Evening. In diction retained a good deal of artificiality. Congreve, William (1670-1729).—Regularised Cowley's loose Pindaric. Cowley, Abraham (1618-1667).—The most popular poet of the mid- seventeenth century; important to prosody for a wide, various, and easy, though never quite consummate command of lyric, as well as for a vigorous and effective couplet (with occasional Alexandrines) of a kind midway between that of the early seventeenth century and Dryden's; but chiefly for his introduction of the so-called Pindaric. Cowper, William (1731-1800).—One of the first to protest, definitely and by name, against the mechanic art of Pope's couplet. He himself returned to Dryden for that metre; but practised very largely in blank verse, and wrote lyrics with great sweetness, a fairly varied command of metre, and, in Boadicea, The Castaway, and some of his hymns, no small intensity of tone and cry. His chief shortcoming, a preference of elision to substitution. Donne, John (1573-1631).—Famous for the beauty of his lyrical poetry, the metaphysical strangeness of his sentiment and diction throughout, and the roughness of his couplets. This last made Jonson, who thought him the first poet in the world for some things, declare that he nevertheless deserved hanging for not keeping accent, and has induced others to suppose a (probably
  • 60.
    imaginary) revolt againstSpenserian smoothness, and an attempt at a new prosody. Drayton, Michael (1563-1631).—A very important poet prosodically, representing the later Elizabethan school as it passes into the Jacobean, and even the Caroline. Expresses and exemplifies the demand for the couplet (which he calls gemell or geminel), but is an adept in stanzas. In the Polyolbion produced the only long English poem in continuous Alexandrines before Browning's Fifine at the Fair (which is very much shorter). A very considerable sonneteer, and the deviser of varied and beautiful lyrical stanzas in short rhythms, the most famous being the Ballad of Agincourt. Dryden, John (1630-1700).—The establisher and master of the stopped heroic couplet with variations of triplets and Alexandrines; the last great writer of dramatic blank verse, after he had given up the couplet for that use; master also of any other metre—the stopped heroic quatrain, lyrics of various form, etc.—that he chose to try. A deliberate student of prosody, on which he had intended to leave a treatise, but did not. Dixon, Richard Watson (1833-1900).—The only English poet who has attempted, and (as far perhaps as the thing is possible) successfully carried out, a long poem (Mano) in terza rima. Possessed also of great lyrical gift in various metres, especially in irregular or Pindaric arrangements. Dunbar, William (1450?-1513? or -1530?).—The most accomplished and various master of metre in Middle Scots, including both alliterative and strictly metrical forms. If he wrote The Friars of Berwick, the chief master of decasyllabic couplet between Chaucer and Spenser. Dyer, John (1700?-1758?).—Derives his prosodic importance from Grongar Hill, a poem in octosyllabic couplet, studied, with independence, from Milton, and helping to keep alive in that couplet the variety of iambic and trochaic cadence derived from catalexis, or alternation of eight- and seven-syllabled lines.
Fairfax, Edward (d. 1635).—Very influential in the formation of the stopped antithetic couplet by his use of it at the close of the octaves of his translation of Tasso.
Fitzgerald, Edward (1809-1883).—Like Fairfax, famous for the prosodic feature of his translation of the Rubáiyát of Omar Khayyám. This is written in decasyllabic quatrains, the first, second, and fourth lines rhymed together, the third left blank.
Fletcher, Giles (1588-1623), and Phineas (1582-1650).—Both attempted alterations of the Spenserian by leaving out first one and then two lines. Phineas also a great experimenter in other directions.
Fletcher, John (1579-1625).—The dramatist. Prosodically noticeable for his extreme leaning to redundance in dramatic blank verse. A master of lyric also.
Frere, John Hookham (1769-1846).—Reintroduced the octave for comic purposes in the Monks and the Giants (1817), and taught it to Byron. Showed himself a master of varied metre in his translations of Aristophanes. Also dabbled in English hexameters, holding that extra-metrical syllables were permissible there.
Gascoigne, George (1525?-1577).—Not unremarkable as a prosodist, from having tried various lyrical measures with distinct success, and as having given the first considerable piece of non-dramatic blank verse (The Steel Glass) after Surrey. But chiefly to be mentioned for his remarkable Notes of Instruction on English verse, the first treatise on English prosody and a very shrewd one, despite some slips due to the time.
Glover, Richard (1712-1785).—A very dull poet, but noteworthy for two points connected with prosody—his exaggeration of the
Thomsonian heavy stop in the middle of blank-verse lines, and the unrhymed choruses of his Medea.
Godric, Saint (?-1170).—The first named and known author of definitely English (that is Middle English) lyric, if not of definitely English (that is Middle English) verse altogether.
Gower, John (1325?-1408).—The most productive, and perhaps the best, older master of the fluent octosyllable, rarely though sometimes varied in syllabic length, and approximating most directly to the French model.
Hampole, Richard Rolle of, most commonly called by the place-name (1290?-1349).—Noteworthy for the occasional occurrence of complete decasyllabic couplets in the octosyllables of the Prick of Conscience. Possibly the author of poems in varied lyrical measures, some of great accomplishment.
Hawes, Stephen (d. 1523?).—Notable for the contrast between the occasional poetry of his Pastime of Pleasure and its sometimes extraordinarily bad rhyme-royal—which latter is shown without any relief in his other long poem, the Example of Virtue. The chief late example of fifteenth-century degradation in this respect.
Herrick, Robert (1591-1674).—The best known (though not in his own or immediately succeeding times) of the Caroline poets. A great master of variegated metre, and a still greater one of sweet and various grace in diction.
Hunt, J. H. Leigh (1784-1859).—Chiefly remarkable prosodically for his revival of the enjambed decasyllabic couplet; but a wide student, and a catholic appreciator and practitioner, of English metre generally. Probably influenced Keats much at first.
Jonson, Benjamin, always called Ben (1573?-1637).—A great practical prosodist, and apparently (like his successor, and in some respects analogue, Dryden) only by accident not a teacher of the study. Has left a few remarks, as it is, eulogising, but in rather equivocal terms, the decasyllabic couplet, objecting to Donne's not keeping of accent, to Spenser's metre for what exact reason we know not, and
to the English hexameter apparently. His practice much plainer sailing. A fine though rather hard master of blank verse; excellent at the couplet itself; but in lyric, as far as form goes, near perfection in the simpler and more classical adjustments, as well as in pure ballad measure.
Keats, John (1795-1821).—One of the chief examples, among the greater English poets, of sedulous and successful study of prosody; in this contrasting remarkably with his contemporary, and in some sort analogue, Shelley. Began by much reading of Spenser and of late sixteenth- and early seventeenth-century poets, in following whose enjambed couplet he was also, to some extent, a disciple of Leigh Hunt. Exemplified the dangers as well as the beauties of this in Endymion, and corrected it by stanza-practice in Isabella, the Eve of St. Agnes, and his great Odes, as well as by a study of Dryden which produced the stricter but more splendid couplet of Lamia. Strongly Miltonic, but with much originality also, in the blank verse of Hyperion; and a great master of the freer sonnet, which he had studied in the Elizabethans. Modified the ballad measure in La Belle Dame sans Merci with astonishing effect, and in the Eve of St. Mark recovered (perhaps from Gower) a handling of the octosyllable which remained undeveloped till Mr. William Morris took it up.
Kingsley, Charles (1819-1875).—A poet very notable, in proportion to the quantity of his work, for variety and freshness of metrical command in lyric. But chiefly so for the verse of Andromeda, which, aiming at accentual dactylic hexameter, converts itself into a five-foot anapæstic line with anacrusis and hypercatalexis, and in so doing entirely shakes off the ungainly and slovenly shamble of the Evangeline type.
Landor, Walter Savage (1775-1864).—A great master of form in all metres, but, in his longer poems and more regular measures, a little formal in the less favourable sense. In his smaller lyrics
(epigrammatic in the Greek rather than the modern use) hardly second to Ben Jonson, whom he resembles not a little. His phrase of singular majesty and grace.
Langland, William (fourteenth century).—The probable name of the pretty certainly single author of the remarkable alliterative poem called The Vision of Piers Plowman. Develops the alliterative metre itself in a masterly fashion through the successive versions of his poem, but also exhibits most notably the tendency of the line to fall into definitely metrical shapes—decasyllabic, Alexandrine, and fourteener,—with not infrequent anapæstic correspondences.
Layamon (late twelfth and early thirteenth century).—Exhibits in the Brut, after a fashion hardly to be paralleled elsewhere, the passing of one metrical system into another. May have intended to write unrhymed alliteratives, but constantly passes into complete rhymed octosyllabic couplet, and generally provides something between the two. A later version, made most probably, if not certainly, after his death, accentuates the transfer.
Lewis, Matthew Gregory (1775-1818).—A very minor poet, and hardly a major man of letters in any other way than that of prosody. Here, however, in consequence partly of an early visit to Germany, he acquired love for, and command of, the anapæstic measures, which he taught to greater poets than himself from Scott downwards, and which had not a little to do with the progress of the Romantic Revival.
Locker (latterly Locker-Lampson), Frederick (1821-1895).—An author of verse of society who brought out the serio-comic power of much variegated and indented metre with remarkable skill.
Longfellow, Henry Wadsworth (1807-1882).—An extremely competent American practitioner of almost every metre that he tried, except perhaps the unrhymed terza rima, which is difficult and may be impossible in English. Established the popularity of the loose accentual hexameter in Evangeline, and did surprisingly well with
unvaried trochaic dimeter in Hiawatha. His lyrical metres not of the first distinction, but always musical and craftsmanlike.
Lydgate, John (1370-1450?).—The most industrious and productive of the followers of Chaucer, writing indifferently rhyme-royal, riding rhyme, and octosyllabic couplet, but especially the first and last, as well as ballades and probably other lyrical work. Lydgate seems to have made an effort to accommodate the breaking-down pronunciation of the time—especially as regarded final e's—to these measures; but as a rule he had very little success. One of his varieties of decasyllabic is elsewhere stigmatised. He is least abroad in the octosyllable, but not very effective even there.
Macaulay, Thomas Babington (1800-1859).—Best known prosodically by his spirited and well beaten-out ballad measure in the Lays of Ancient Rome. Sometimes, as in The Last Buccaneer, tried less commonplace movements with strange success.
Maginn, William (1793-1842).—Deserves to be mentioned with Barham as a chief initiator of the earlier middle nineteenth century in the ringing and swinging comic measures which have done so much to supple English verse, and to accustom the general ear to its possibilities.
Marlowe, Christopher (1564-1593).—The greatest master, among præ-Shakespearian writers, of the blank-verse line for splendour and might, as Peele was for sweetness and brilliant colour. Seldom, though sometimes, got beyond the single-moulded form; but availed himself to the very utmost of the majesty to which that form rather specially lends itself. Very great also in couplet (which he freely enjambed) and in miscellaneous measure when he tried it.
Milton, John (1608-1674).—The last of the four chief masters of English prosody. Began by various experiments in metre, both in and out of lyric stanza—reaching, in the Nativity hymn, almost the
maximum of majesty in concerted measures. In L'Allegro, Il Penseroso, and the Arcades passed to a variety of the octosyllabic couplet, which had been much practised by Shakespeare and others, but developed its variety and grace yet further, though he did not attempt the full Spenserian or Christabel variation. In Comus continued this, partly, with lyrical extensions, but wrote the major part in blank verse—not irreminiscent of the single-moulded form, but largely studied off Shakespeare and Fletcher, and with his own peculiar turns already given to it. In Lycidas employed irregularly rhymed paragraphs of mostly decasyllabic lines. Wrote some score of fine sonnets, adjusted more closely to the usual Italian models than those of most of his predecessors. After an interval, produced, in Paradise Lost, the first long poem in blank verse, and the greatest non-dramatic example of the measure ever seen—admitting the fullest variation and substitution of foot and syllable, and constructing verse-paragraphs of almost stanzaic effect by varied pause and contrasted stoppage and overrunning. Repeated this, with perhaps some slight modifications, in Paradise Regained. Finally, in Samson Agonistes, employed blank-verse dialogue with choric interludes rhymed elaborately—though in an afterthought note to Paradise Lost he had denounced rhyme—and arranged on metrical schemes sometimes unexampled in English.
Moore, Thomas (1779-1852).—A very voluminous poet in the most various metres, and a competent master of all. But especially noticeable as a trained and practising musician, who wrote a very large proportion of his lyrics directly to music, and composed or adapted settings for many of them. The double process has resulted in great variety and sweetness, but occasionally also in laxity which, from the prosodic point of view, is somewhat excessive.
Morris, William (1834-1896).—One of the best and most variously gifted of recent prosodists. In his early work, The Defence of Guenevere, achieved a great number of metres, on the most varied schemes, with surprising effect; in his longer productions, Jason and The Earthly Paradise, handled enjambed couplets, octosyllabic and decasyllabic, with an extraordinary compound of freedom and
precision. In Love is Enough tried alliterative and irregular rhythm with unequal but sometimes beautiful results; and in Sigurd the Volsung fingered the old fourteener into a sweeping narrative verse of splendid quality and no small range.
Orm.—A monk of the twelfth to the thirteenth century, who composed a long versification of the Calendar Gospels in unrhymed, strictly syllabic, fifteen-syllabled verse, lending itself to regular division in eights and sevens. A very important evidence as to the experimenting tendency of the time and to the strivings for a new English prosody.
O'Shaughnessy, Arthur W. E. (1844-1881).—A lyrist of great originality, and with a fingering peculiar to himself, though most nearly resembling that of Edgar Poe.
Peele, George (1558?-1597?).—Remarkable for softening the early decasyllabon as Marlowe sublimed it.
Percy, Thomas (1729-1811).—As an original verse-maker, of very small value, and as a meddler with older verse to patch and piece it, somewhat mischievous; but as the editor of the Reliques, to be hallowed and canonised for that his deed, in every history of English prosody and poetry.
Poe, Edgar (1809-1849).—The greatest master of original prosodic effect that the United States have produced, and an instinctively and generally right (though, in detail, hasty, ill-informed, and crude) essayist on points of prosodic doctrine. Produced little, and that little not always equal; but at his best an unsurpassable master of music in verse and phrase.
Pope, Alexander (1688-1744).—Practically devoted himself to one metre, and one form of it—the stopped heroic couplet,—subjected as much as possible to a rigid absence of licence; dropping (though he sometimes used them) the triplets and Alexandrines, which even Dryden had admitted; adhering to an almost mathematically centrical pause; employing, by preference, short, sharp rhymes with little echo in them; and but very rarely, though with at least one odd exception, allowing even the possibility of a trisyllabic foot. An extraordinary artist on this practically single string, but gave himself few chances on others.
Praed, Winthrop Mackworth (1802-1839).—An early nineteenth-century Prior. Not incapable of serious verse, and hardly surpassed in laughter. His greatest triumph, the adaptation of the three-foot anapæst, alternately hypercatalectic and acatalectic or exact, which had been a ballad-burlesque metre as early as Gay, had been partly ensouled by Byron in one piece, but was made his own by Praed, and handed down by him to Mr. Swinburne to be yet further sublimated.
Prior, Matthew (1664-1721).—Of special prosodic importance for his exercises in anapæstic metres and in octosyllabic couplet, both of which forms he practically established in the security of popular favour, when the stopped heroic couplet was threatening monopoly. His phrase equally suitable to the vers de société of which he was our first great master.
Robert of Gloucester (fl. c. 1280).—Nomen clarum in prosody, as being apparently the first copious and individual producer of the great fourteener metre, which, with the octosyllabic couplet, is the source, or at least the oldest, of all modern English forms.
Rossetti, Christina Georgina (1830-1894) and Dante Gabriel (1828-1882).—A brother and sister who rank extraordinarily high in our flock. Of mainly Italian blood, though thoroughly Anglicised, and
indeed partly English by blood itself, they produced the greatest English sonnets on the commoner Italian model, and displayed almost infinite capacity in other metres. Miss Rossetti had the greater tendency to metrical experiment, and perhaps the more strictly lyrical gift of the song kind; her brother, the severer command of sculpturesque but richly coloured form in poetry.
Sackville, Thomas (1536-1608).—One of the last and best practitioners of the old rhyme-royal of Chaucer, and one of the first experimenters in dramatic blank verse.
Sandys, George (1578-1644).—Has traditional place after Fairfax and with Waller (Sir John Beaumont, who ought to rank perhaps before these, being generally omitted) as a practitioner of stopped heroic couplet. Also used In Memoriam quatrain.
Sayers, Frank (1763-1817).—An apostle, both in practice and preaching, of the unrhymed verse—noteworthy at the close of the eighteenth century—which gives him his place in the story.
Scott, Sir Walter (1771-1832).—The facts of his prosodic influence and performance hardly deniable, but its nature and value often strangely misrepresented. Was probably influenced by Lewis in adopting (from the German) anapæstic measures; and certainly and most avowedly influenced by Coleridge (whose Christabel he heard read or recited long before publication) in adopting equivalenced octosyllabic couplet and ballad metres in narrative verse. But probably derived as much from the old ballads and romances themselves, which he knew as no one else then did, and as few have known them since. Applied the method largely in his verse-romances, but was also a master of varied forms of lyric, no mean proficient in the Spenserian and in fragments, at least, of blank verse.
Shakespeare, William (1564-1616).—The catholicos or universal master, as of English poetry so of English prosody. In the blank verse of his plays, and in the songs interspersed in them, as well as in his immature narrative poems and more mature sonnets, every principle of English versification can be found exemplified, less deliberately machined, it may be, than in Milton or Tennyson, but in absolutely genuine and often not earlier-found form.
Shelley, Percy Bysshe (1792-1822).—The great modern example of prosodic inspiration, as Keats, Tennyson, and Mr. Swinburne are of prosodic study. Shelley's early verse is as unimportant in this way as in others; but from Queen Mab to some extent, from Alastor unquestionably, onwards, he displayed totally different quality, and every metre that he touched (even if possibly suggested to some extent by others) bears the marks of his own personality.
Shenstone, William (1714-1763).—Not quite unimportant as poet, in breaking away from the couplet; but of much more weight for the few prosodic remarks in his Essays, in which he directly pleads for trisyllabic (as he awkwardly calls them dactylic) feet, for long-echoing rhymes, and for other things adverse to the mechanic tune by heart of the popular prosody.
Sidney, Sir Philip (1554-1586).—A great experimenter in Elizabethan classical forms; but much more happy as an accomplished and very influential master of the sonnet, and a lyric poet of great sweetness and variety.
Southey, Robert (1774-1843).—A very deft and learned practitioner of many kinds of verse, his tendency to experiment leading him into rhymelessness (Thalaba) and hexameters (The Vision of Judgment); but quite sound on general principles, and the first of his school and time to champion the use of trisyllabic feet in principle, and to appeal to old practice in their favour.
Spenser, Edmund (1552?-1599).—The second founder of English prosody in his whole work; the restorer of regular form not destitute of music; the preserver of equivalence in octosyllabic couplet; and
the inventor of the great Spenserian stanza, the greatest in every sense of all assemblages of lines, possessing individual beauty and capable of indefinite repetition.
Surrey, Earl of, the courtesy title of Henry Howard (1517-1547).—Our second English sonneteer, our second author of reformed literary lyric after the fifteenth-century break-down, and our first clearly intentional writer of blank verse.
Swinburne, Algernon Charles (1837-1909).—Of all English poets the one who has applied the widest scholarship and study, assisted by great original prosodic gift, to the varying and accomplishing of English metre. Impeccable in all kinds; in lyric nearly supreme. To some extent early, and, still more, later, experimented in very long lines, never unharmonious, but sometimes rather compounds than genuine integers. Achieved many triumphs with special metres, especially by the shortening of the last line of the Praed-stanza into the form of Dolores, which greatly raises its passion and power.
Tennyson, Alfred (1809-1892).—A poet who very nearly, if not quite, deserves the position accorded here to Chaucer, Spenser, Shakespeare, and Milton. Coming sufficiently late after the great Romantic poets of the earlier school to generalise their results, he started with an apparent freedom (perfectly orderly, in fact) which puzzled even Coleridge. Very soon, too, he produced a practically new form of blank verse, in which the qualities of the Miltonic and Shakespearian kinds were blended, and a fresh metrical touch given. All poets since—sometimes while denying or belittling him—have felt his prosodic influence; and it is still, even after Mr. Swinburne's fifty years of extended practice of it, the pattern of modern English prosody.
Thomson, James (1700-1748).—The first really important practitioner of blank verse after Milton, and a real, though rather mannerised, master of it. Displayed an equally real, and more surprising, though
much more unequal, command of the Spenserian in The Castle of Indolence.
Tusser, Thomas (1524?-1580).—A very minor poet—in fact, little more than a doggerelist; but important because, at the very time when men like Gascoigne were doubting whether English had any foot but the iambic, he produced lolloping but perfectly metrical continuous anapæsts, and mixed measures of various kinds.
Waller, Edmund (1606-1687).—A good mixed prosodist of the Caroline period, whose chief traditional importance is in connection with the popularising of the stopped couplet. His actual precedence in this is rather doubtful; but his influence was early acknowledged, and therefore is an indisputable fact. He was also early as a literary user of anapæstic measures, and tried various experiments.
Watts, Isaac (1674-1748).—By no means unnoteworthy as a prosodist. Followed Milton in blank verse, early popularised triple-time measures by his religious pieces, evidently felt the monotony of the couplet, and even attempted English Sapphics.
Whitman, Walt[er] (1819-1892).—An American poet who has pushed farther than any one before him, and with more success than any one after him, the substitution, for regular metre, of irregular rhythmed prose, arranged in versicles something like those of the English Bible, but with a much wider range of length and rhythm, the latter going from sheer prose cadence into definite verse.
Wordsworth, William (1770-1850).—Less important as a prosodist than as a poet; but prosodically remarkable both for his blank verse, for his sonnets, and for the Pindaric of his greatest Ode.
Wyatt, Sir Thomas (1503?-1542).—Our first English sonneteer and our first reformer, into regular literary verse, of lyric after the fifteenth-century disorder. An experimenter with terza, and in other ways prosodically eminent.
CHAPTER III
ORIGINS OF LINES AND STANZAS
(It has seemed desirable to give some account (to an extent which would in most cases be disproportionate for the Glossary) of the ascertained, probable, or supposed origin of the principal lines and line-combinations in English poetry. The arrangement is logical rather than alphabetical. Slight repetition, on some points, of matter previously given is unavoidable.)
A. Lines
I. Alliterative.—Enough has probably been said above of the old alliterative line and its generic character; while the later variations, which came upon it after its revival, have also been noticed and exemplified. Its origin is quite unknown; but the presence of closely allied forms, in the different Scandinavian and Teutonic languages, assures, beyond doubt, a natural rise from some speech-rhythm or tune-rhythm proper to the race and tongue. It is also probable that the remarkable difference of lengths—short, normal, and extended—which is observable in O.E. poetry is of the highest antiquity. It has at any rate persevered to the present day in the metrical successors of this line; and there is probably no other poetry which has—at a majority of its periods, if not throughout—indulged in such variety of line-length as English. Nor, perhaps, is there any which contains, even in its oldest and roughest forms, a metrical or quasi-metrical arrangement more close to the naturally increased, but not denaturalised, emphasis of impassioned utterance, more thoroughly born from the primeval oak and rock.
II. Short Lines.—Despite the tendency to variation of lines above noted, A.S. poetry did not favour very short ones; and its faithful
disciple and champion, Guest, accordingly condemns them in modern English poetry. This is quite wrong. In the bobs and other examples in Middle English we find the line shortened almost, if not actually, to the monosyllable, and this liberty has persisted through all the best periods of English verse since, though frequently frowned upon by pedantry. Its origin is, beyond all reasonable doubt, to be traced to French and Provençal influence, especially to that of the short refrain; but it is so congenial to the general tendency noted above that very little suggestion must have been needed. It must, however, be said that very short lines, in combination with long ones, almost necessitate rhyme to punctuate and illumine the divisions of symphonic effect; and, consequently, it was not till rhyme came in that they could be safely and successfully used. But when this was mastered there was no further difficulty. In all the best periods of English lyric writing—in that of Alison and its fellows, in the carols of the fifteenth century, in late Elizabethan and Caroline lyric, and in nineteenth-century poetry—the admixture of very short lines has been a main secret of lyrical success; and in most cases it has probably been hardly at all a matter of deliberate imitation, but due to an instinctive sense of the beauty and convenience of the adjustment.
III. Octosyllable.—The historical origin of the octosyllabic (or, as the accentual people call it, the four-beat or four-stress line) is one of the most typical in the whole range of prosody, though the lesson of the type may be differently interpreted. Taking it altogether, there is perhaps no metre in which so large a body of modern, including mediæval, poetry has been composed. But, although it is simply dimeter iambic, acatalectic or catalectic as the case may be, it is quite vain to try to discover frequent and continuous patterns of origin for it in strictly classical prosody.[162] Odd lines, rarely exact, in choric odes prove nothing, and the really tempting

Αμμων Ολυμπου δεσποτα
of Pindar is an uncompleted fragment which might have gone off into any varieties of Pindaric. There are a few fragments of Alcman—

Ὡρας δ' εσηκε τρεις, θερος

and of the genuine Anacreon—

Μηδ' ὡστε κυμα ποντιον,

in the metre, while the spurious verse of the Anacreontea, a catalectic form with trisyllabic equivalence, seems to have been actually practised by the real poet. Alternately used, it is, of course, frequent in the epodes of Horace, in Martial, etc. But the fact remains that, as has been said, it is not a classical metre to any but a very small extent, though those who attach no value to anything but the beats may find it in bulk in the anapæstic dimeter of Greek and Latin choruses. It is in the Latin hymns—that is to say, in Latin after it had undergone a distinct foreign admixture—that the metre first appears firmly and distinctly established. In the fourth century, St. Ambrose without rhyme, and Hilary with it, employ the iambic dimeter, and it soon becomes almost the staple, though Prudentius, contemporary with both of them and more of a regular poet, while he does use it, seems to prefer other metres. By the time, however, when the modern prosodies began to take form, it was thoroughly well settled; and every Christian nation in Europe knew examples of it by heart. It still, however, remains a problem exactly why this particular metre should, as a matter of direct literary imitation, have commended itself so widely to the northern nations. They had nearly or quite as many examples in the same class of the trochaic dimeter

Gaude, plaude, Magdalena

and they paid no attention to this, though their southern neighbours did. They had, from the time of Pope Damasus[163] downwards, and in almost all the hymn-writers, mixed dactylic metres to choose
from; but for a staple they went to this. It seems impossible that there should not have been some additional and natural reasons for the adoption—reasons which, if they had not actually brought it about without any literary patterns at all, directed poets to those patterns irresistibly. Nor, as it seems to the present writer, is it at all difficult to discover, as far at least as English is concerned, what these reasons were. The discovery might be made out of one's own head; but here as elsewhere Layamon is a most important assistant and safeguard. A mere glance at any edition of alliterative verse, printed in half lines, will show that it has a rough resemblance on the page to octosyllabics, though the outline is more irregular. A moderately careful study of Layamon shows, as has been indicated, that, in writing this verse with new influences at work upon him, he substitutes octosyllabic couplet for it constantly. And the history in the same way shows that this occasional substitution became a habitual one with others. Not that there is any mystical virtue in four feet, despite their frequency in the actual creation: but that, as an equivalent of the old half line, the choice lies practically between three and four. Now a three-foot line, though actually tried as in the Bestiary and in parts of Horn, is, as a general norm, too short, is ineffective and jingly, brings the rhyme too quick, and hampers the exhibition of the sense by a too staccato and piecemeal presentment. The abundant adoption of the octosyllable in French no doubt assisted the spread in English. But it is not unimportant to observe that English translators and adapters of French octosyllabic poems by no means always preserve the metre, and that English octosyllables often represent French poems which are differently metred in the original.
IV. Decasyllable.—A connected literary origin for this great line—the ancient staple of French poetry, the modern staple of English, and (in still greater modernity) of German to some extent, as well as (with the extension of one syllable necessitated by the prevailing rhythm of the language) of Italian throughout its history—has always been found extraordinarily difficult to assign. That some have even
been driven to the line which furnishes the opening couplet of the Alcaic
Quam si clientum longa negotia,

or

Vides ut alta stet nive candidum,

an invariably hendecasyllabic line of the most opposite rhythm, constitution, and division, will show the straits which must have oppressed them. The fact is that there is nothing, either in Greek or Latin prosody, in the least resembling it or suggestive of it. To connect it with these prosodies at all reasonably, it would be necessary to content ourselves with the supposition, not illogical or impossible, but not very explanatory, that somebody found the iambic dimeter too short, and the iambic trimeter too long, and split the difference. In another way, and abandoning the attempt to find parents or sponsors in antiquity for this remarkable foundling, a not wholly dissimilar conjecture becomes really illuminative—that the line of ten syllables (or eleven with weak ending) proved itself the most useful in the modern languages. As a matter of fact it appears in the very earliest French poem we possess—the tenth- or perhaps even ninth-century Hymn of St. Eulalia:

Bel auret corps, bellezour anima,

and in the (at youngest) tenth-century Provençal Boethius:

No credet Deu lo nostre creator.

If it still seem pusillanimous to be content with such an explanation, one can share one's pusillanimity with Dante, who contents himself with saying that the line of eleven syllables seems the stateliest and most excellent, as well by reason of the length of time it occupies as of the extent of subject, construction and language of which it is capable. And in English, with which we are specially, if not indeed
wholly, concerned, history brings us the reinforcement of showing that the decasyllable literally forced itself, in practice, upon the English poet. This all-important fact has been constantly obscured by the habit of saying that Chaucer invented the heroic couplet in English—that he, at any rate, borrowed it first from the French. Whether he did so as a personal fact we cannot say, for he is not here to tell us. That he need not have done so there is ample and irrefragable evidence. In the process of providing substitutes for the old unmetrical line, it is not only obvious that the decasyllable—which, from a period certainly anterior to the rise of Middle English, had been the staple metre, in long assonanced tirades or batches, of the French Chansons de geste—must have suggested itself. It is still more certain that it did. It is found in an unpolished and haphazard condition, but unmistakable, in the Orison of our Lady (early thirteenth century); it occurs in Genesis and Exodus, varying the octosyllable itself, in the middle of that age; it is scattered about the Romances, in the same company, at what must have been early fourteenth century at latest; it occurs constantly in Hampole's Prick of Conscience at the middle of this century; and there are solid blocks of it in the Vernon MS., which was written (i.e. copied from earlier work), at latest, before Chaucer is likely to have started the Legend of Good Women or the Canterbury Tales. That his practice settled and established it—though for long the octosyllable still outbid it in couplet, and it was written chiefly in the stanza form of rhyme-royal—is true. But by degrees the qualities which Dante had alleged made it prevail, and prepared it as the line-length for blank verse as well as for the heroic couplet, and for the bulk of narrative stanza-writing. No doubt Chaucer was assisted by the practice of Machault and other French poets. But there should be still less doubt that, without that practice, he might, and probably would, have taken it up. For the first real master of versification—whether he were Chaucer, or (in unhappy default of him) somebody else, who must have turned up sooner or later—could not but have seen, for his own language, what Dante saw for his.
V. Alexandrine.—The Alexandrine or verse of twelve syllables, iambically divided, does not resemble its relation, the octosyllable, in having a doubtful classical ancestry; or its other relation, the decasyllable, in having none. It is, from a certain point of view, the exact representative of the great iambic trimeter which was the staple metre of Greek tragedy, and was largely used in Greek and Roman verse. The identity of the two was recognised in English as early as the Mirror for Magistrates, and indeed could escape no one who had the knowledge and used it in the most obvious way. At the same time it is necessary frankly to say that this resemblance—at least, as giving the key to origin—is, in all probability, wholly delusive. There are twelve syllables in each line, and there are iambics in both. But to any one who has acquired—as it is the purpose of this book to help its readers to acquire or develop—a prosodic sense, like the much-talked-of historic sense, it will seem to be a matter of no small weight, that while the cæsura (central pause) of the ancient trimeter is penthemimeral (at the fifth syllable), or hephthemimeral (at the seventh), that of the modern Alexandrine is, save by rare, and not often justified, licence, invariably at the sixth or middle—a thing which actually alters the whole rhythmical constitution and effect of the line.[164] Nor is the name to be neglected. Despite the strenuous effort of modern times to upset traditional notions, it remains a not seriously disputed fact that the name Alexandrine comes from the French Roman d'Alexandre, not earlier than the late twelfth century, and itself following upon at least one decasyllabic Alexandreid. The metre, however, suited French, and, as it had done on this particular subject, ousted the decasyllable in the Chansons de geste generally; while, with some intervals and revolts, it has remained the dress-clothes of French poetry ever since, and even imposed itself as such upon German for a considerable time. In English, however, though, by accident and in special and partial use, it has occupied a remarkable place, it has never been anything like a staple. One of the most singular statements in Guest's English
Rhythms is that the verse of six accents (as he calls it) was formerly the one most commonly used in our language. The present writer is entirely unable to identify this formerly: and the examples which Guest produces, of single and occasional occurrence in O.E. and early M.E., seem to him for the most part to have nothing to do with the form. But it was inevitable that on the one hand the large use of the metre in French, and on the other its nearness as a metrical adjustment to the old long line or stave, should make it appear sometimes. The six-syllable lines of the Bestiary and Horn are attempts to reproduce it in halves, and Robert of Brunne reproduces it as a whole.[165] It appears not seldom in the great metrical miscellany of the Vernon MS., and many of Langland's accentual-alliterative lines reduce themselves to, or close to it; while it very often makes a fugitive and unkempt appearance in fifteenth-century doggerel. Not a few of the poems of the Mirror for Magistrates are composed in it, and as an alternative to the fourteener (this was possibly what Guest was thinking of) it figures in the poulter's measure of the early and middle sixteenth century. Sidney used it for the sonnet. But it was not till Drayton's Polyolbion that it obtained the position of continuous metre for a long poem: and this has never been repeated since, except in Browning's Fifine at the Fair. So, the most important appearances by far of the Alexandrine in English are not continuous; but as employed to vary and complete other lines. There are two of these in especial: the first among the greatest metrical devices in English, the other (though variously judged and not very widely employed) a great improvement. The first is the addition, to an eight-line arrangement in decasyllables, of a ninth in Alexandrine which constitutes the Spenserian stanza and will be spoken of below. The other is the employment of the Alexandrine as a variation of decasyllable in couplet, in triplet and singly, which is, according to some, including the present writer, visible in the riding-rhyme of Chaucer; which is often present in the blank verse of Shakespeare; not absent from that of Milton in his earlier attempts; employed in decasyllabic couplet by Cowley, and