Carlos Pachajoa scite author profile

NIFT (Numerical Information Field Theory) is a software package designed to enable the development of signal inference algorithms that operate regardless of the underlying spatial grid and its resolution. Its object-oriented framework is written in P, although it accesses libraries written in C, C++, and C for efficiency. NIFT offers a toolkit that abstracts discretized representations of continuous spaces, fields in these spaces, and operators acting on fields into classes. Thereby, the correct normalization of operations on fields is taken care of automatically without concerning the user. This allows for an abstract formulation and programming of inference algorithms, including those derived within information field theory. Thus, NIFT permits its user to rapidly prototype algorithms in 1D, and then apply the developed code in higher-dimensional settings of real world problems. The set of spaces on which NIFT operates comprises point sets, n-dimensional regular grids, spherical spaces, their harmonic counterparts, and product spaces constructed as combinations of those. The functionality and diversity of the package is demonstrated by a Wiener filter code example that successfully runs without modification regardless of the space on which the inference problem is defined.

show abstract

Extending and Evaluating Fault-Tolerant Preconditioned Conjugate Gradient Methods

Pachajoa¹,

Levonyak²,

Gansterer³

2018

View full text Add to dashboard Cite

How to Make the Preconditioned Conjugate Gradient Method Resilient Against Multiple Node Failures

Pachajoa

Levonyak

Gansterer

et al. 2019

View full text Add to dashboard Cite

We study algorithmic approaches for recovering from the failure of several compute nodes in the parallel preconditioned conjugate gradient (PCG) solver on large-scale parallel computers. In particular, we analyze and extend an exact state reconstruction (ESR) approach, which is based on a method proposed by Chen (2011). In the ESR approach, the solver keeps redundant information from previous search directions, so that the solver state can be fully reconstructed if a node fails unexpectedly. ESR does not require checkpointing or external storage for saving dynamic solver data and has low overhead compared to the failure-free situation.In this paper, we improve the fault tolerance of the PCG algorithm based on the ESR approach. In particular, we support recovery from simultaneous or overlapping failures of several nodes for general sparsity patterns of the system matrix, which cannot be handled by Chen's method. For this purpose, we re ne the strategy for how to store redundant information across nodes. We analyze and implement our new method and perform numerical experiments with large sparse matrices from real-world applications on 128 nodes of the Vienna Scienti c Cluster (VSC). For recovering from three simultaneous node failures we observe average runtime overheads between only 2.8% and 55.0%. The overhead of the improved resilience depends on the sparsity pattern of the system matrix.

show abstract

On the Resilience of Conjugate Gradient and Multigrid Methods to Node Failures

Pachajoa

Gansterer

2018

View full text Add to dashboard Cite

Algorithm-Based Checkpoint-Recovery for the Conjugate Gradient Method

Pachajoa

Pacher

Levonyak

et al. 2020

View full text Add to dashboard Cite

As computers reach exascale and beyond, the incidence of faults will increase. Solutions to this problem are an active research topic. We focus on strategies to make the preconditioned conjugate gradient (PCG) solver resilient against node failures, speci cally, the exact state reconstruction (ESR) method, which exploits redundancies in PCG. Reducing the frequency at which redundant information is stored lessens the runtime overhead. However, after the node failure, the solver must restart from the last iteration for which redundant information was stored, which increases recovery overhead. This formulation highlights the method's similarities to checkpoint-restart (CR). Thus, this method, which we call ESR with periodic storage (ESRP), can be considered a form of algorithm-based checkpointrestart. The state is stored implicitly, by exploiting redundancy inherent to the algorithm, rather than explicitly as in CR. We also minimize the amount of data to be stored and retrieved compared to CR, but additional computation is required to reconstruct the solver's state. In this paper, we describe the necessary modi cations to ESR to convert it into ESRP, and perform an experimental evaluation. We compare ESRP experimentally with previously-existing ESR and application-level in-memory CR. Our results con rm that the overhead for ESR is reduced signi cantly, both in the failure-free case, and if node failures are introduced. In the former case, the overhead of ESRP is usually lower than that of CR. However, CR is faster if node failures happen. We claim that these di erences can be alleviated by the implementation of more appropriate preconditioners.

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

customersupport@researchsolutions.com

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Carlos Pachajoa

NIFTY – Numerical Information Field Theory

Extending and Evaluating Fault-Tolerant Preconditioned Conjugate Gradient Methods

How to Make the Preconditioned Conjugate Gradient Method Resilient Against Multiple Node Failures

On the Resilience of Conjugate Gradient and Multigrid Methods to Node Failures

Algorithm-Based Checkpoint-Recovery for the Conjugate Gradient Method

Contact Info

Product

Resources

About