Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles 2021
DOI: 10.1145/3477132.3483563

The Aurora Single Level Store Operating System

Cited by 9 publications (3 citation statements)
References 22 publications
“…POS continues the line of research on C/R for different tasks, with a particular focus on OS-level C/R for GPU applications and on efficient and concurrent C/R execution. Though concurrent OS-level C/R has been extensively explored for processes that only run on the CPU [14,40,24,66,71,10,35,25,77], to the best of our knowledge, existing C/R systems that support GPU all leverage a stop-the-world design [48,47,21,15,67], which we have shown has notable performance degradation or application downtime in various scenarios. Meanwhile, there are also many task-level C/R designs, e.g., for ML tasks [44,75,16], HPC tasks [57] or leverage NVM for acceleration [56].…”
Section: Related Work (mentioning)
confidence: 99%
“…ML tasks, such as training, are vulnerable to GPU failures [20,16,67]. C/R provides fault tolerance by periodically checkpointing and persisting the images [70,77,67]. Upon failure and recovery, the OS simply restores the task using the checkpointed image, creating an illusion that the tasks never failed.…”
Section: Background and Motivation (mentioning)
confidence: 99%
“…The mechanisms of checkpoint and restore (C/R) has been investigated by OSes for a long time [21,47]. For examples, KeyKOS [30], EROS [59], and Aurora [64] embrace the abstraction of single level store and make the whole-system checkpoint periodically for crash consistency; many researches on Linux [31,41,4,76,66] can dynamically generate applications' checkpoints which can be restored after crashes or on other machines. VAS-CRIU [66] also notices the inefficient of C/R brought by the file abstraction.…”
Section: Related Work (mentioning)
confidence: 99%