1986
DOI: 10.1007/bf01407877

The butterfly barrier

Abstract: We describe an algorithm for barrier synchronization that requires only read and write to shared store. The algorithm is faster than the traditional locked counter approach for two processors and has an attractive log₂ N time scaling for larger N. The algorithm is free of hot spots and critical regions and requires a shared memory bandwidth which grows linearly with N, the number of participating processors. We verify the technique using both a real shared memory multiprocessor, for numbers of processors up t…
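
The pairwise exchange pattern described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact flag protocol: it assumes N is a power of two, uses monotonically increasing episode counters in place of boolean flags, and the names `make_butterfly_barrier`, `arrived`, and `demo` are hypothetical. Each shared cell has exactly one writer, so only plain reads and writes are needed.

```python
import threading
import time


def make_butterfly_barrier(n):
    """Return wait(i) for n threads; n must be a power of two (assumption)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    rounds = n.bit_length() - 1  # log2(n) pairwise rounds
    # arrived[i][k]: latest episode thread i has announced in round k.
    # Only thread i ever writes row i, so no lock is required.
    arrived = [[0] * rounds for _ in range(n)]
    episode = [0] * n  # per-thread count of barrier episodes entered

    def wait(i):
        episode[i] += 1
        e = episode[i]
        for k in range(rounds):
            partner = i ^ (1 << k)        # butterfly partner in round k
            arrived[i][k] = e             # announce arrival (write)
            while arrived[partner][k] < e:  # spin on partner's cell (read)
                time.sleep(0)             # yield so other threads can run
    return wait


def demo(n=4, episodes=3):
    """Run n threads through several barriers; return observed violations."""
    wait = make_butterfly_barrier(n)
    phase = [0] * n
    violations = []

    def worker(i):
        for e in range(1, episodes + 1):
            phase[i] = e
            wait(i)
            # After the barrier, no thread may still be in an earlier phase.
            if min(phase) < e:
                violations.append((i, e))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return violations
```

Calling `demo()` should return an empty list when the barrier holds. The spin loop yields with `time.sleep(0)` only because Python threads share an interpreter lock; on a real shared memory multiprocessor the wait would be a plain spin on the partner's location.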

Cited by 96 publications (50 citation statements)
References 6 publications
“…Most use some form of tree to gather and scatter information [6,10,14,16]; the butterfly and dissemination barriers of Brooks [2] and of Hensgen, Finkel, and Manber [6] use a symmetric pattern of synchronization operations that resembles an FFT or parallel prefix computation. The butterfly and dissemination barriers perform a total of O(P log P) writes to shared locations, but only O(log P) on their critical paths.…”
Section: Unnecessary Waiting (mentioning)
confidence: 99%
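
The write counts quoted above can be checked with a small simulation of the butterfly pattern. This is a sketch under stated assumptions: P is a power of two, each process performs exactly one flag write per round, and `butterfly_cost` is a hypothetical helper name rather than anything from the cited papers.

```python
def butterfly_cost(p):
    """Return (rounds, total_flag_writes, everyone_informed) for p = 2**m."""
    if p < 1 or p & (p - 1):
        raise ValueError("p must be a power of two")
    rounds = p.bit_length() - 1  # length of the critical path: log2(p)
    # heard_from[i] is the set of processes whose arrival process i knows of.
    heard_from = [{i} for i in range(p)]
    total_writes = 0
    for k in range(rounds):
        total_writes += p  # assumption: one flag write per process per round
        # Each process merges what its round-k partner already knows.
        heard_from = [heard_from[i] | heard_from[i ^ (1 << k)]
                      for i in range(p)]
    everyone_informed = all(len(s) == p for s in heard_from)
    return rounds, total_writes, everyone_informed


print(butterfly_cost(8))  # (3, 24, True)
```

For P = 8 this gives 3 rounds on the critical path and 24 writes in total, which matches the O(log P) and O(P log P) figures in the statement.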
“…In the absence of contention, a remote memory reference (read) takes about 4 µs, roughly 5 times as long as a local reference.
reinitialize(previous_instance)
previous_instance := current_instance
current_instance := if current_instance = &instances[2] then &instances[0] else current_instance + 1
Figure 6: A fuzzy adaptive combining tree barrier with local-only spinning and breadth-first process wakeup.…”
Section: Experimental Environment (mentioning)
confidence: 99%
“…There are four main classes of algorithms: master-slave [7], all-to-all [8], tree-based [7,9], butterfly [10]. Among them, the all-to-all algorithm takes a distributed solution.…”
Section: Introduction and Related Work (mentioning)
confidence: 99%
“…There are four main classes of algorithms: master-slave [3], all-to-all [4], tree-based [3,5], and butterfly [6]. Recently, as a single chip is being able to integrate many cores, barrier synchronization becomes a critical concern in single-chip systems due to its impact on application performance.…”
Section: Introduction and Related Work (mentioning)
confidence: 99%