Fast Multi-Stack Slice-to-Volume Reconstruction
via Multi-Scale Unrolled Optimization
Margherita Firenze
MIT
mfirenze@mit.edu
Sean I. Young
Harvard Medical School
siyoung@mit.edu
Clinton J. Wang
MIT
clintonw@csail.mit.edu
Hyuk Jin Yun
Harvard Medical School
hyun@cmh.edu
Elfar Adalsteinsson
MIT
elfar@mit.edu
Kiho Im
Harvard Medical School
kiho.im@childrens.harvard.edu
P. Ellen Grant
Harvard Medical School
ellen.grant@childrens.harvard.edu
Polina Golland
MIT
polina@csail.mit.edu
Figure 1: Fast Multi-Stack Slice-to-Volume Reconstruction. Our proposed multi-stack SVR framework takes as input three motion-
corrupted stacks of 2D slices and reconstructs a volume (1 second). Super-resolution is performed with optional optimization (7 seconds).
Abstract
Fully convolutional networks have become the backbone of
modern medical imaging due to their ability to learn multi-
scale representations and perform end-to-end inference. Yet
their potential for slice-to-volume reconstruction (SVR), the
task of jointly estimating 3D anatomy and slice poses from
misaligned 2D acquisitions, remains underexplored. We in-
troduce a fast convolutional framework that fuses multiple
orthogonal 2D slice stacks to recover coherent 3D struc-
ture and refines slice alignment through lightweight model-
based optimization. Applied to fetal brain MRI, our ap-
proach reconstructs high-quality 3D volumes in under 10s,
with 1s slice registration and accuracy on par with state-
of-the-art iterative SVR pipelines, offering more than 40×
speedup. The framework uses non-rigid displacement fields
to represent transformations, generalizing to other SVR
problems like fetal body and placental MRI. Additionally,
the fast inference time paves the way for real-time, scanner-
side volumetric feedback during MRI acquisition.
1. Introduction
Fetal brain magnetic resonance imaging (MRI) is an im-
portant tool for investigating abnormal ultrasound findings
and expanding our understanding of fetal brain develop-
ment [1, 9, 15]. To alleviate the effects of fetal motion,
fast 3D MRI sequences are used, which limit motion arti-
facts in the acquired 2D images [18]. A “cool-off” period is required due to safety limits on energy deposition between consecutive slice acquisitions [13], and during these times the fetus often moves considerably, causing the slices to become misaligned. An example of this can be seen in Fig. 1, where the coronal and sagittal views both exhibit severe motion that makes their orthogonal views look incoherent.
Images are acquired in series, called stacks, and ideally
only three stacks are needed for the three standard views
of the brain (sagittal, axial, and coronal). Due to motion,
stacks may contain oblique (out of plane) slices. In these
cases, the stack is reacquired, resulting in as many as 20
stacks and leading to long, uncomfortable scan times and
thousands of images for radiologists to sift through.
These problems can be overcome by slice-to-volume
reconstruction (SVR) methods, which produce high-
resolution visualizations of the brain from a limited num-
ber of stacks. SVR methods align the acquired slices in 3D
and super-resolve the volume [4, 7, 14, 21, 28, 29]. SVR is
widely used in research for volumetric analysis. However,
long runtimes limit scanner-side use and clinical adoption: radiologists perform their assessment shortly after acquisition, so time-consuming SVR methods disrupt the standard workflow. Fast SVR has the potential to improve radiological assessment by providing a coherent volume in time for review, and to vastly accelerate and improve fetal imaging by guiding decision-making during acquisition, i.e., when to stop acquiring new data because brain coverage is complete and which orientation to prescribe for the next stack. Our main contributions are:
• We propose a fully convolutional neural network that registers multiple stacks of slices in under one second, and refines poses and produces reconstructions of high quality in under 10 seconds.
• We integrate the neural network with model-based reconstruction using data consistency with the acquired slices.
• We evaluate the proposed method on simulated and real clinical data, demonstrating state-of-the-art reconstruction accuracy and speed.
Notably, our framework is not constrained to rigid motion
models and only requires a small training set, which paves
the way for other MRI applications such as placental (de-
formable motion) and fetal body (poly-rigid) SVR.
2. Related Work
SVR is complicated by the fact that only 2D slices are avail-
able, unlike classic 2D-to-3D registration problems where
2D and 3D images are given. Therefore, SVR can be
thought of as two problems that have to be solved jointly:
volume reconstruction, i.e., recovering a high-resolution
3D volume from aligned slices, and slice registration, i.e.,
aligning 2D slices into a common coordinate system. These
steps are intertwined, as more accurate volumes lead to
more accurate slice poses and vice versa. Early optimiza-
tion methods updated the volume and slice poses in an alter-
nating fashion. Later research sought to use deep learning
to solve the registration task of SVR, optimize both regis-
tration and reconstruction with a neural network, and solve
registration in one pass using an unrolled deep learning ap-
proach.
Learning-free Optimization. Early methods framed SVR
as an optimization problem, with solutions that alternated
between reconstructing a 3D volume and estimating slice
poses [5–7, 11, 12, 14, 21, 23]. The SVRTK toolkit [14]
is a widely used package that improves this approach by
using robust statistics to remove outlier slices for better re-
constructions.
In SVRTK the volume is initialized using a designated
reference stack. To initialize the slice poses, stacks are
registered to the reference stack in bulk, setting poses of
all slices in the stack. Following this step, the volume re-
construction is achieved by minimizing a model-based loss.
The optimization encourages the simulated slices, the slices
predicted based on the volume and slice pose estimates, to
match the input slices. Then, the poses are updated by reg-
istering the input slices to the latest volume estimate. These
steps are repeated 5-7 times. The optimization can fail to
converge when large motion is present [24] or a reference
stack is not adequately chosen. Further, the method is time-consuming, taking around 5 minutes with a multi-threaded CPU implementation to reconstruct a volume from 3 slice
stacks. Finally, SVRTK predicts poses relative to a desig-
nated template stack, with no guarantees of the final recon-
struction being in a canonical orientation that can be readily
interpreted by radiologists, resulting in oblique reconstruc-
tions.
Deep Learning Registration for SVR. Deep learning
promised to make SVR more robust to large motion and to
reconstruct the volume in the canonical orientation. Early
deep learning approaches for SVR used CNN architectures
trained on synthetic data to directly regress slice poses, ei-
ther as explicit rotation and translation parameters [22] or
as anchor-point representations [10, 17]. Transformer architectures have been shown to be effective for the coupled registration of all slices in a stack by capturing their pose similarities [28]. While these methods were fast, none were accurate enough to outperform traditional optimization-based approaches; instead, they served as an initialization step that leads to faster convergence of the subsequent optimization. State-
space architectures followed by an MLP to predict slice
poses have been shown to achieve improved registration ac-
curacy [27].
In many deep learning SVR methods, once slice poses
are estimated, the latent volume is reconstructed using tra-
ditional model-based optimization [14]. Alternatively, the
model-based reconstruction alternates with registration as
in classical methods [28]. Other methods replace this
step with learned reconstruction networks, employing su-
pervised interpolation to perform super-resolution and in-
painting [27, 30]. While supervised inpainting methods pro-
duce high-quality details, they are not guaranteed to pro-
duce a final reconstruction that is consistent with the in-
put slices. This has the potential to smooth over potential pathology and favor reconstructing “average” brains.
Figure 2: Method Overview. (A) SVR pipeline combines convolutional pose estimation with model-based reconstruction. (B) Iterative 2D+3D blocks refine slice pose estimates at resolution s through simulated slice generation and flow field updates.
We use the model-based reconstruction approach in our method as it is faster than INR-based reconstruction, ensures consistency with the acquired data, and avoids the risk of supervised inpainting smoothing over anomalies and favoring “average” brains [30].
Fully Neural SVR. More recently, implicit neural repre-
sentations (INR) have been used to perform registration and
reconstruction [4, 26, 29]. An INR optimizes a multi-layer perceptron (MLP) at inference time, learning a continuous representation of the volume and adapting slice poses as well as pixel and slice weights to remove outliers and
bias fields effectively. The NeSVoR package [29], which
first proposed this approach, is considered state of the art
for its ability to resolve fine details and produce robust re-
sults. Since the network is optimized at inference time, the
method requires long runtimes, around 4-5 minutes, and
specialized GPU infrastructure. The poses are initialized
using a fast deep-learning registration [28]. Similar to the
classic methods, the optimization fails to converge when the
initial poses are inaccurate. Further, in the final query of
the INR to produce the volume, discretization artifacts can
occur from sampling the continuous network parameteriza-
tion.
Alternatively, optimizing two MLPs at inference time
has been proposed. The first MLP performs registration and
the second MLP provides volume reconstruction similar to
the INR methods above [4, 26]. Additionally, meta learning
[4] has been shown to reduce the convergence time by ini-
tializing the weights using a small set of examples. Despite these advancements, the fastest implementation still requires
more than a minute and specialized GPU infrastructure.
Deep Learning SVR via Multi-scale Feed-forward Net-
works. One possible reason for the success of the deep learning methods above is that they repeat the classical optimization steps thousands of times and parametrize the problem with millions of parameters. This is in contrast to deep registration networks, which are much faster but predict poses directly from slices without explicitly invoking the forward model that couples the volume, the slice poses, and the acquired slices. To combine the performance gains of parameterized optimization approaches with the speed of registration approaches, a possible solution is to unroll the optimization steps across different layers of a network. Specifically, SVR is posed as a 2D-to-3D registration task between the input slices and an unknown 3D volume, and the poses are refined at different resolutions in successive layers of a multi-scale feed-forward U-net architecture [20]. Once trained, this network produces slice pose estimates through a single feed-forward pass at inference time, requiring less than a second [30]. Super-resolution reconstruction is then performed using inpainting [30]. Our method expands on this work and integrates model-based reconstruction following registration.
3. Preliminaries
Formally, SVR is an inverse problem where we seek to re-
construct an underlying volume V that is consistent with the
acquired slices. The forward imaging model predicts slice $I_n$ from an underlying volume $V$,
$$I_n = M(F_n^{-1})\,V \qquad (1)$$
where $F_n \in E(3)$ is a rigid transformation that defines the imaging plane of slice $I_n$, and the function $M(\cdot)$ transforms the 3D discretized point-spread function (PSF) by the transformation given as its input. The PSF is determined by the image acquisition parameters and can be approximated as a Gaussian [11, 21]. $M$ produces a sparse, non-square matrix mapping voxel coordinates in the volume to slice coordinates.
Both $V$ and $F_n$ are unknown. Classical SVR methods use coordinate descent by alternating the estimation of $V$ (i.e., reconstruction) and $F_n$ (i.e., registration).
The volume is initialized using the original slice poses of the acquired stacks, where each slice is spread over a 3D area given by the PSF weights and normalized by the total amount of slice contributions to a voxel, i.e.,
$$V_{\mathrm{init}}(x) = \frac{\sum_n \big[M(F_n)^{T} I_n\big](x)}{\big[\sum_n M(F_n)^{T}\big](x)}. \qquad (2)$$
This step produces blurry volumes that can be further refined by minimizing a data consistency loss. Specifically, “simulated slices” are generated from the estimated volume $V$, the current slice poses $F_n$, and the forward model (1), and compared to the acquired slices $I_n$. This is also known as model-based optimization. Formally, the reconstruction step updates $V$ while keeping the poses $F_n$ fixed:
$$\hat{V} = \arg\min_{V} \sum_n \big\| I_n - M(F_n)V \big\|^2. \qquad (3)$$
During registration, the acquired slices are registered to the volume $V$, which is kept constant, to update their poses $F_n$:
$$\hat{F}_n = \arg\min_{F_n} \big\| I_n - M(F_n)V \big\|^2. \qquad (4)$$
This alternating scheme separates pose and volume opti-
mization, with each step requiring computationally expen-
sive updates and many iterations to converge to a solution.
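To make the reconstruction step concrete: with the poses held fixed, (3) is a large, sparse linear least-squares problem. The following is a toy sketch in which a random sparse matrix stands in for the stacked slicing operators $M(F_n)$; the sizes, solver choice, and variable names are our illustrative assumptions, not the authors' implementation.

# Toy sketch of the reconstruction step in Eq. (3): with slice poses fixed,
# the volume is the solution of a sparse linear least-squares problem.
# Here A is a random placeholder for the stacked M(F_n) operators.
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import lsqr

rng = np.random.default_rng(0)
n_vox, n_pix = 1000, 400                   # flattened volume / slice sizes (toy)
A = sparse_random(n_pix, n_vox, density=0.01, random_state=0)  # stacked M(F_n)
slices = rng.standard_normal(n_pix)        # stacked acquired slice intensities I_n
V_hat = lsqr(A, slices, iter_lim=50)[0]    # least-squares volume update, Eq. (3)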
4. Method
We implement a reconstruction and slicing formulation that is highly parallelizable by using first-order approximations of the forward model and volume reconstruction. To generalize our approach, we employ non-rigid displacement fields instead of rigid transforms.
We replace the discrete PSF matrix $M_n$ with the identity $I$, modeling slices as unit thin. As we explain later in this section, this is a reasonable simplification for the low-resolution layers of our network. Motion is modeled as a non-rigid displacement field $f : \mathbb{R}^2 \to \mathbb{R}^3$ mapping 2D pixel coordinates $p = (p_x, p_y)$ to 3D displacements. We uplift $p$ to 3D by placing slices on the $z = 0$ plane and appending a 0 to each vector, i.e., $\bar{p} = (p_x, p_y, 0)$. The initial volume reconstruction (2) becomes
$$V_{\mathrm{init}}(x) = \frac{\big[\sum_n \sum_p \mathcal{V}\big(\bar{p} + f_n(p),\, I_n(p)\big)\big](x)}{\big[\sum_n \sum_p \mathcal{V}\big(\bar{p} + f_n(p),\, [I_n > 0]\big)\big](x)} \qquad (5)$$
where $\mathcal{V}(x, I)$ denotes the volume pushing operation that places the intensity given by $I$ at the voxel coordinates $x$ and distributes the intensity using trilinear interpolation when the 3D coordinate location does not coincide with a discrete grid point. Finally, since multiple slices may contribute to one 3D voxel location, the intensity is normalized by the total weight of the contributions. This approximation is used for slice estimation and is not refined using (3) as in classical methods.
To refine the pose estimates, we first construct the simulated slices using the current pose estimates, similar to (4), except without the use of $M$:
$$\hat{I}_n(p) = \mathcal{V}^{*}\big(\bar{p} + f_n(p),\, V\big) \qquad (6)$$
where $\mathcal{V}^{*}(x, v)$ denotes the volume sampling operation that samples the intensity of $v$ at the coordinate $x$ and uses trilinear interpolation when the 3D coordinate location does not coincide with a discrete point on the grid.
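To illustrate the two operators used in (5) and (6), the following is a minimal NumPy sketch of trilinear volume pushing and sampling (our own illustration, not the paper's parallel GPU implementation); the flattened point list, boundary handling, and function names are assumptions.

# Sketch of the volume pushing and sampling operations of Eqs. (5)-(6).
# points: (M, 3) continuous voxel coordinates p_bar + f_n(p); values: (M,).
import itertools
import numpy as np

def push(points, values, shape):
    """Splat intensities onto a grid with trilinear weights; returns (accum, weights)."""
    vol, wgt = np.zeros(shape), np.zeros(shape)
    base = np.floor(points).astype(int)           # lower corner of each cell
    frac = points - base                          # trilinear weights
    for offset in itertools.product((0, 1), repeat=3):
        corner = base + offset
        w = np.prod(np.where(offset, frac, 1.0 - frac), axis=1)
        ok = np.all((corner >= 0) & (corner < shape), axis=1)
        idx = tuple(corner[ok].T)
        np.add.at(vol, idx, w[ok] * values[ok])
        np.add.at(wgt, idx, w[ok])
    return vol, wgt                               # Eq. (5): V_init = vol / wgt

def sample(points, volume):
    """Trilinearly sample a volume at continuous coordinates (simulated slices)."""
    base = np.floor(points).astype(int)
    frac = points - base
    out = np.zeros(len(points))
    for offset in itertools.product((0, 1), repeat=3):
        corner = np.clip(base + offset, 0, np.array(volume.shape) - 1)
        w = np.prod(np.where(offset, frac, 1.0 - frac), axis=1)
        out += w * volume[tuple(corner.T)]
    return out                                    # Eq. (6): simulated slice values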
Then, we employ a learned convolutional operator $\Delta f^{s}$ which refines the displacement by comparing the simulated and input slices,
$$f_n^{s} = f_n^{s-1} + \Delta f^{s}\big(\hat{I}_n, I_n\big), \qquad (7)$$
where $s$ is the index of the layer. With these two operations defined, we can implement the volume estimation and pose refinement steps many times, as illustrated in Fig. 2. To implement a multi-resolution strategy, we repeat these steps at increasing resolutions, starting with slices sampled at low resolution and ending with high-resolution slices, using a slice parametrization that accounts for orthogonal slices.
Slice pose parametrization. We initialize slice poses to their prescribed positions, as given by the stack direction and slice order, and have the network refine them through iterative updates with increasing resolution. We parametrize the slice depth using a translation matrix $T_n$ and the 3D orientation using a $4 \times 4$ matrix $R_n$ that corresponds to either a sagittal, axial, or coronal orientation. Then, to adjust to different resolutions of the slices, we use $C_s$ and $C_s^{-1}$, translation matrices that center and de-center the coordinates so that rotation is performed about the center of the slice at resolution $s$. Finally, $S_s$ scales the pixel coordinates to match the volume resolution. Note that $T_n$ encodes the slice index in the stack and is different for each slice in the stack; $R_n$ is shared by all slices in the same stack; $C_s$ and $S_s$ depend on the resolution of layer $s$ in the network. Together, these transformations define the displacement field that places a slice in its prescribed position at scale $s$:
$$f_n^{s} = \big(C_s R_n C_s^{-1} S_s^{-1} T_n - I\big)\,\bar{p} \qquad (8)$$
where $\bar{p}$ denotes homogeneous 3D coordinates for the slice placed at $z = 0$. This approach models each slice separately, enabling the framework to work with variable slice and stack numbers.
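The sketch below illustrates how the prescribed-pose displacement field of (8) could be assembled for a single slice; the specific matrix conventions (slice spacing expressed in voxels, in-plane-only centering, grid size) are illustrative assumptions and may differ from the actual implementation.

# Sketch of f_n^s(p) = (C_s R_n C_s^{-1} S_s^{-1} T_n - I) p_bar for one slice
# on a size x size grid at scale s. Conventions are our assumptions.
import numpy as np

def prescribed_field(slice_index, R, scale, size, slice_spacing):
    T = np.eye(4); T[2, 3] = slice_index * slice_spacing    # slice depth, T_n
    C = np.eye(4); C[:2, 3] = (size - 1) / 2.0              # in-plane (de-)centering, C_s
    S_inv = np.diag([1.0 / scale] * 3 + [1.0])              # undo pixel scaling, S_s^{-1}
    A = C @ R @ np.linalg.inv(C) @ S_inv @ T - np.eye(4)
    ys, xs = np.mgrid[0:size, 0:size].astype(float)
    p_bar = np.stack([xs, ys, np.zeros_like(xs), np.ones_like(xs)], axis=-1)
    return (p_bar @ A.T)[..., :3]                           # per-pixel 3D displacement

f0 = prescribed_field(slice_index=10, R=np.eye(4), scale=2.0, size=64, slice_spacing=3.0)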
Finally, to train the network we use a multi-layer $L_2$ loss on the residual displacement:
$$\mathcal{L}(f_{\mathrm{GT}}, f) = \sum_n \Big\| f_{\mathrm{GT},n} - \frac{1}{5}\sum_{s=0}^{4} f_n^{s} \Big\|_2^2. \qquad (9)$$
Once the network predicts the slice poses, we super-resolve the volume using a model-based approach as in (3), and also perform pose update steps to further refine the poses.
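As a concrete reference, below is a minimal NumPy sketch of the loss exactly as written in (9); the array layout (per-layer field predictions stacked along the first axis) is our assumption.

# Sketch of the multi-layer L2 loss of Eq. (9).
# pred_fields: (5, N, H, W, 3) per-layer outputs f_n^s; gt_fields: (N, H, W, 3).
import numpy as np

def multilayer_loss(pred_fields, gt_fields):
    avg_pred = pred_fields.mean(axis=0)                   # (1/5) * sum_s f_n^s
    return float(np.sum((gt_fields - avg_pred) ** 2))     # sum over n of squared L2 norms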
High-resolution slice and volume estimation. At low resolutions, modeling slices as unit thin is viable since the in-plane sampling interval substantially exceeds the slice thickness. However, approximating slices as unit thickness introduces rendering artifacts at full resolution due to the mismatch between the in-plane and thickness dimensions. To mitigate coverage gaps at full resolution, we project the displacement field onto a rigid transformation using the method of Arun et al. [2] at the final layer, then apply a boxcar PSF to distribute slice values across their thickness.
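For reference, the rigid projection can be computed with the SVD-based least-squares fit of Arun et al. [2]; below is a minimal sketch in which the source points are the slice grid coordinates and the targets are the displaced points. The point-set layout is our assumption.

# Sketch of the least-squares rigid fit of Arun et al. [2]: find R, t minimizing
# ||R @ src_i + t - dst_i||^2 over corresponding (M, 3) point sets.
import numpy as np

def fit_rigid(src, dst):
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)                          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # guard against reflection
    R = Vt.T @ D @ U.T
    t = c_dst - R @ c_src
    return R, t

# Usage: src = flattened slice grid p_bar, dst = p_bar + f_n(p) for slice n.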
4.1. Implementation
Network Architecture. We build a custom U-net with a 2D encoder and a 2D + 3D decoder. The encoder constructs multi-resolution slice features $I_n^{s}$. The decoder repeats a 2D-to-3D block five times while doubling the resolution at each layer to emulate the classical SVR steps, as shown in Panel B of Fig. 2 and in (5)-(7).
We construct a feature volume $V^{s} \in \mathbb{R}^{d \times d_s \times d_s \times d_s}$ at each resolution using (5), with $d$ feature channels and spatial size $d_s$. $V^{s}$ is a four-dimensional tensor constructed from the slice features $I_n^{s}$ from the skip connection of layer $s$ and the previous displacement fields $f_n^{s-1}$. Across the five decoder layers, the channel count is $d = [1024, 512, 256, 128, 64]$ while the spatial size doubles, $d_s = [8, 16, 32, 64, 128]$.
To refine the displacement fields, we sample the volume to create simulated slices $\hat{I}_n^{s}$ and compute their correlation with the skip connection features $I_n^{s}$ to estimate a displacement residual $\Delta f^{s}$ that is added to the previous displacement estimate $f^{s-1}$ as in (7).
We simplify the architecture compared to the previously proposed solution [30] to make each layer's predicted volume and slices independent of the previous layer, depending only on the previous displacement estimates and skip connections.
Model-based reconstruction. The slice pose estimates are
used to initialize a model-based optimization that iteratively
refines both the 3D volume and slice poses to ensure con-
sistency with the acquired slices in (3) [14]. We employ
a GPU-accelerated implementation of this model-based re-
construction [28].
Training. We generate slice stacks from standard orthogonal imaging planes (sagittal, axial, and coronal), each perturbed by a bulk in-plane rotation uniformly sampled in the range [-12°, 12°] to simulate imperfect plane selection. Starting from this initialization, we apply between 1 and 100 smooth motion perturbations per stack, generated by interpolating randomly sampled rigid transformations using cubic B-splines. This procedure captures both gradual motion patterns and abrupt movements. Rotational perturbations are drawn from a zero-mean normal distribution with a standard deviation of 20°; translations are uniformly sampled within [-6.1, 6.1] mm. We further apply Gaussian noise, slice-wise bias field augmentation, and gamma intensity perturbations to make the network robust to imaging artifacts. We train the network on an NVIDIA H200 GPU for 250k steps and select the final checkpoint. We use ADAM with an initial learning rate of $10^{-4}$ and a polynomial learning-rate schedule.
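As an illustration of the motion simulation described above, the sketch below interpolates a few randomly sampled rigid perturbations into a smooth per-slice trajectory; the control-point count, Euler-angle parametrization, and SciPy-based spline are illustrative assumptions rather than the authors' exact implementation.

# Sketch: smooth per-slice rigid motion by cubic-spline interpolation of a few
# random control transforms (rotations ~ N(0, 20 deg), translations ~ U(-6.1, 6.1) mm).
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.spatial.transform import Rotation

def sample_motion_trajectory(n_slices, n_control=5, rot_std_deg=20.0,
                             trans_range_mm=6.1, seed=0):
    rng = np.random.default_rng(seed)
    t_ctrl = np.linspace(0, n_slices - 1, n_control)           # control-point times
    ang_ctrl = rng.normal(0.0, rot_std_deg, size=(n_control, 3))
    trans_ctrl = rng.uniform(-trans_range_mm, trans_range_mm, size=(n_control, 3))
    t = np.arange(n_slices)
    angles = CubicSpline(t_ctrl, ang_ctrl)(t)                   # (n_slices, 3) degrees
    trans = CubicSpline(t_ctrl, trans_ctrl)(t)                  # (n_slices, 3) mm
    rots = Rotation.from_euler("xyz", angles, degrees=True).as_matrix()
    return rots, trans                                          # per-slice R (3x3) and t (3,)

R, T = sample_motion_trajectory(n_slices=25)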
5. Experimental Results
Table 1: Clinical Evaluation. Quantitative assessment of reconstruction quality across 9 clinical subjects. We compute similarity measures (SSIM, NCC, and PSNR) between the simulated slices and input slices across methods. Running times are listed in the last column.
Method            Slice SSIM (↑)    NCC (↑)        PSNR (↑)      Time
SVoRT             0.971 ± 0.007     0.14 ± 0.01    37.5 ± 1.4    10 s
SVoRT + NeSVoR    0.959 ± 0.007     0.13 ± 0.01    35.7 ± 1.3    257 s
cSVR              0.952 ± 0.013     0.12 ± 0.01    35.4 ± 2.0    3 s
cSVR + Refine     0.966 ± 0.005     0.13 ± 0.01    37.0 ± 1.2    7 s
cSVR + NeSVoR     0.959 ± 0.006     0.13 ± 0.01    35.3 ± 1.2    251 s
Data. We train and evaluate our model using FeTA [16], a public dataset of high-quality T2-weighted coherent volumes reconstructed using existing methods for 120 subjects (gestational age (GA) 20–35 weeks, voxel size 0.8 mm³), and 18 volumes (GA 21–38 weeks, voxel size 0.8 mm³) from the CRL atlas [8]. We train the network on 108 subjects and 18 atlases.
We evaluate our method on 12 held-out FeTA subjects and 9 patients from [withheld for anonymity] (GA 25–35 weeks, pixel size 1.3–1.4 mm, slice thickness 3 mm). We choose three stacks (sagittal, coronal, and axial) for each subject and segment the intracranial content using a publicly available method [19].
Baseline methods. We evaluate the accuracy of the pose es-
timates produced by our neural network when coupled with
three variants of volume reconstruction: (1) cSVR keeps the
slice poses estimated by the neural network and uses data
consistency to reconstruct the volume; (2) cSVR+Refine continues to refine the slice poses while alternating with volume reconstruction; (3) cSVR+NeSVoR provides the slice poses estimated by the network as an initialization for NeSVoR, the state-of-the-art method based on an implicit neural representation (INR) of the resulting volume [29].
We also compare the performance of the cSVR variants with two baseline methods: the transformer-based approach SVoRT [28] and the SVoRT+NeSVoR method that uses the output of SVoRT as an initialization for NeSVoR. SVoRT was trained on the FeTA dataset [16] and shares the same validation set as our method.
Figure 3: Clinical Evaluation. Reconstructions for clinical subjects (GA 20–35 weeks) for all methods: SVoRT, SVoRT + NeSVoR, cSVR, cSVR + Refine, cSVR + NeSVoR. Our proposed fast method, cSVR + Refine, achieves high-quality reconstructions comparable to the state of the art, with high grey and white matter contrast (green arrows). Our method as well as SVoRT struggles in cases of image corruption, as seen by the red arrows, where the reconstructions fail to exclude noisy areas of a slice.
Figure 4: Performance evaluation and sensitivity analysis. Robustness across methods to input stack perturbations (translation, rotation,
noise), evaluated via registration accuracy (top) and reconstruction quality (bottom). Our method, cSVR + refine, is robust across high
levels of translation and rotation and our method coupled with NeSVoR reconstruction achieves the best overall performance.
Figure 5: Inference Time of Registration Methods. Time to predict slice poses on clinical subjects of varying input sizes, comparing SVoRT vs. cSVR.
Evaluation on Synthetic Data. On the synthetic data,
where ground truth is available, we quantify the accuracy
of the pose prediction using the maximum total registration
error (TRE) and volume reconstruction quality using struc-
tural similarity index measure (SSIM) [25] between the re-
constructed and the ground-truth volumes.
Table 2: Ablation Study. Effect of loss function and slice pose initialization on clinical data.
Configuration                                         SSIM    NCC    PSNR
Without slice pose initialization                     0.93    0.10   32.63
With pose initialization, without multi-layer loss    0.966   0.13   36.95
With pose initialization, with multi-layer loss       0.966   0.13   37.00
For estimated and ground-truth displacement fields $\hat{f}$ and $f$, respectively, we define
$$\mathrm{TRE}(\hat{f}; f) = \max_p \big\| \hat{f}(p) - f(p) \big\|_2, \qquad (10)$$
which captures the maximum distance between a predicted pixel location and its ground-truth location.
Prior to computing the TRE, the output volume is registered
to the ground-truth volume using ANTS [3] to mitigate the
effect of global shifts in reconstruction. Although most pre-
vious work reports only volume consistency scores, TRE is an important measure to distinguish between the registration performance of a given method and its reconstruction performance.
Figure 6: Pose estimation in different layers of the network. The network gradually estimates the pose, with the bulk of the pose predicted correctly in the first two layers (TRE at network layers 0–5: 30.0 ± 6.0, 30.0 ± 6.0, 4.0 ± 0.8, 3.0 ± 0.6, 2.0 ± 0.5, and 2.0 ± 0.5 mm).
Notably, TRE scores for smaller slices, where
only part of the brain is visible, are higher as these regions
are harder to register. To mitigate this, we report the median of the per-stack maximum TRE for the sensitivity analysis.
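A minimal sketch of the TRE of (10), together with the per-stack median used in the sensitivity analysis; the array shapes (per-slice fields stacked as N x H x W x 3, in mm) are an assumption.

# Sketch of the maximum total registration error (TRE) of Eq. (10) and the
# median of per-stack maxima reported for the sensitivity analysis.
import numpy as np

def max_tre(f_pred, f_gt):
    """Maximum Euclidean distance between predicted and ground-truth mappings (mm)."""
    return float(np.linalg.norm(f_pred - f_gt, axis=-1).max())

def median_per_stack_max_tre(stacks_pred, stacks_gt):
    return float(np.median([max_tre(p, g) for p, g in zip(stacks_pred, stacks_gt)]))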
To capture high variability of motion in clinical practice,
we evaluate our method on a range of motion settings, vary-
ing the possible range of sampled rotations and translations
separately for 12 subjects. We also evaluate the methods’
ability to mitigate noise in the reconstruction by varying the
noise level in the synthetic data.
Evaluation on Clinical Data. For clinical data, we evaluate
the consistency between the estimated and acquired slices
using mean SSIM, PSNR, and NCC (Table 1). We run all
evaluations using an NVIDIA A6000 GPU.
5.1. Results
Synthetic data. We evaluate all methods for varying degrees of corruption, specifically varying rotations, transla-
tions, and image noise (Fig 4). Overall, cSVR is robust
to large rotations, translations, and noise. Adding the re-
finement steps, cSVR + Refine, significantly boosts perfor-
mance across all levels of rotations and translations. How-
ever, cSVR + Refine hurts performance in the case of high image noise. This is likely because the refinement step is driven by slice consistency, which is corrupted by noise. When our method is combined with INR recon-
struction (NeSVoR) it achieves the most robust results of all
methods. SVoRT struggles with high levels of rotation and
translation. For rotation, NeSVoR is able to compensate,
but for large translations the INR is not able to refine the
pose adequately.
Clinical Data. In the clinical cases, we observe that our method performs similarly to the baseline methods (Table 1). The SSIM, PSNR, and NCC are all close across models, with our method generally being the second best. NeSVoR is prone to slight intensity shifts that contribute to the lowest similarity metrics of the three despite visually high-quality reconstructions (Fig. 3). Our method shows good contrast between grey matter and white matter (Fig. 3, row 4). We observe that SVoRT + NeSVoR produces sharp reconstructions, as can be seen by the fine detail in the cortex in the last row. However, NeSVoR is also prone to producing
noisy reconstructions with a speckle-like pattern, as seen in
the first row. An example of the problem of using only self-
consistency metrics can be seen in the third subject (GA
32w) where a noisy slice contributes a bright spot on the
left side of the brain. While all gradient-descent methods
reconstruct this artifact, NeSVoR successfully suppresses it.
Ablation Studies. We evaluate the quality of the slice
pose estimates for different layers of the network as seen in
Fig. 6 and find the network gradually refines the pose, with
large deformations occurring in the early layers. We em-
pirically evaluate the runtime of the algorithms and validate
that our method scales linearly with the input slice count
while SVoRT (transformer-based) scales (roughly) quadrat-
ically (Fig. 5).
To test whether our parametrization of slice poses in (8)
is necessary for network learning, we initialize all slices
with only their slice index position by setting all rotation matrices $R_n$ to identity, and train the network. In Table 2 we show that the network performs very poorly in this
case. We also evaluate the effect of using a multi-layer loss
and see only a small improvement in PSNR.
6. Discussion
Limitations. Although our method can be used with more stacks, we only evaluated reconstructions from three-stack inputs. Our method requires pre-processing steps that
could be integrated into the network such as standardizing
slice ordering and orientations. We compared our method to
state of the art SVR methods (SVoRT + NeSVoR), but addi-
tional insights could be gleaned by comparing with concur-
rent work [4, 26, 27] once the code has been made public.
Future work. We plan to extend this framework to non-
rigid SVR applications such as placental MRI. We also plan
to train the network to learn how to refine the poses, using a cascaded scheme of convolution-based refinement.
7. Conclusions
We demonstrate a fast convolutional multi-stack SVR approach that is 40 times faster than state-of-the-art methods while producing reconstructions of comparable quality. We propose a slice parameterization, loss function, and robust reconstruction approach that enable this architecture to generalize to other SVR applications.
Acknowledgments. This work is supported by NSF GRFP,
NIH R01EB032708, R01HD114338, R01EB036945,
K99AG081493 and R00AG081493, and the MIT CSAIL-
Wistron Program.
References
[1] Michael Aertsen. The role of fetal brain magnetic res-
onance imaging in current fetal medicine. J. Belg. Soc.
Radiol., 106(1):130, 2022. 1
[2] K. S. Arun, T. S. Huang, and S. D. Blostein. Least-
squares fitting of two 3-d point sets. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence,
PAMI-9(5):698–700, 1987. 5
[3] Brian Avants, Nicholas J Tustison, and Gang Song.
Advanced normalization tools: V1.0. Insight J., 2009.
7
[4] Maik Dannecker, Thomas Sanchez, Meritxell Bach
Cuadra, Özgün Turgut, Anthony N. Price, Lucilio Cordero-Grande, Vanessa Kyriakopoulou, Joseph V. Hajnal, and Daniel Rueckert. Meta-learning slice-to-volume reconstruction in fetal brain MRI using implicit neural representations, 2025. 2, 3, 8
[5] Michael Ebner, Guotai Wang, Wenqi Li, Michael
Aertsen, Premal A. Patel, Rosalind Aughwane, An-
drew Melbourne, Tom Doel, Steven Dymarkowski,
Paolo De Coppi, Anna L. David, Jan Deprest,
Sébastien Ourselin, and Tom Vercauteren. An automated framework for localization, segmentation and super-resolution reconstruction of fetal brain MRI.
NeuroImage, 206:116324, 2020. 2
[6] Michael Ebner, Guotai Wang, Wenqi Li, Michael
Aertsen, Premal A Patel, Rosalind Aughwane, An-
drew Melbourne, Tom Doel, Steven Dymarkowski,
Paolo De Coppi, et al. An automated framework
for localization, segmentation and super-resolution re-
construction of fetal brain MRI. NeuroImage, 206:
116324, 2020.
[7] Ali Gholipour, Judy A Estroff, and Simon K Warfield.
Robust super-resolution volume reconstruction from
slice acquisitions: application to fetal brain MRI.
IEEE transactions on medical imaging, 29(10):1739–
1758, 2010. 2
[8] Ali Gholipour, Caitlin K Rollins, Clemente Velasco-
Annis, Abdelhakim Ouaalam, Alireza Akhondi-Asl,
Onur Afacan, Cynthia M Ortinau, Sean Clancy,
Catherine Limperopoulos, Edward Yang, Judy A Es-
troff, and Simon K Warfield. A normative spatiotem-
poral MRI atlas of the fetal brain for automatic seg-
mentation and analysis of early brain growth. Sci.
Rep., 7(1):476, 2017. 5
[9] Paul D Griffiths, Michael Bradburn, Michael J
Campbell, Cindy L Cooper, Ruth Graham, Deborah
Jarvis, Mark D Kilby, Gerald Mason, Cara Mooney,
Stephen C Robson, Allan Wailoo, and MERIDIAN
collaborative group. Use of MRI in the diagnosis of fe-
tal brain abnormalities in utero (MERIDIAN): a multi-
centre, prospective cohort study. Lancet, 389(10068):
538–546, 2017. 1
[10] Benjamin Hou, Bishesh Khanal, Amir Alansary,
Steven McDonagh, Alice Davidson, Mary Rutherford,
Jo V. Hajnal, Daniel Rueckert, Ben Glocker, and Bern-
hard Kainz. 3d reconstruction in canonical co-ordinate
space from arbitrarily oriented 2d images, 2018. 2
[11] Shuzhou Jiang, Hui Xue, Alan Glover, Mary Ruther-
ford, Daniel Rueckert, and Joseph V Hajnal. MRI
of moving subjects using multislice snapshot images
with volume reconstruction (SVR): application to fe-
tal, neonatal, and adult brain studies. IEEE transac-
tions on medical imaging, 26(7):967–980, 2007. 2, 3
[12] Bernhard Kainz, Markus Steinberger, Wolfgang Wein,
Maria Kuklisova-Murgasova, Christina Malamate-
niou, Kevin Keraudren, Thomas Torsney-Weir, Mary
Rutherford, Paul Aljabar, Joseph V Hajnal, et al. Fast
volume reconstruction from motion corrupted stacks
of 2d slices. IEEE transactions on medical imaging,
34(9):1901–1913, 2015. 2
[13] Uday Krishnamurthy, Jaladhar Neelavalli, Swati
Mody, Lami Yeo, Pavan K Jella, Sheena Saleem,
Steven J Korzeniewski, Maria D Cabrera, Shadi
Ehterami, Ray O Bahado-Singh, Yashwanth Katkuri,
Ewart M Haacke, Edgar Hernandez-Andrade, Sonia S
Hassan, and Roberto Romero. MR imaging of the fe-
tal brain at 1.5T and 3.0T field strengths: comparing
specific absorption rate (SAR) and image quality. J.
Perinat. Med., 43(2):209–220, 2015. 1
[14] Maria Kuklisova-Murgasova, Gerardine Quaghebeur,
Mary A. Rutherford, Joseph V. Hajnal, and Julia A.
Schnabel. Reconstruction of fetal brain mri with inten-
sity matching and complete outlier removal. Medical
Image Analysis, 16(8):1550–1564, 2012. 2, 5
[15] Lucia Manganaro, Silvia Capuani, Marco Gen-
narini, Valentina Miceli, Roberta Ninkova, Ilaria
Balba, Nicola Galea, Angelica Cupertino, Alessandra
Maiuro, Giada Ercolani, and Carlo Catalano. Fetal
mri: what’s new? a short review. European Radiology
Experimental, 7(1):41, 2023. 1
[16] Kelly Payette, Priscille de Dumast, Hamza Ke-
biri, Ivan Ezhov, Johannes Paetzold, Suprosanna
Shit, Asim Iqbal, Romesa Khan, Raimund Kottke,
Patrice Grehten, Hui Ji, Levente Lanczi, Marianna
Nagy, Beres Monika, Thi Nguyen, Giancarlo Na-
talucci, Theofanis Karayannis, Bjoern Menze, Mer-
itxell Bach Cuadra, and András Jakab. An automatic
multi-tissue human fetal brain segmentation bench-
mark using the fetal tissue annotation dataset. Scien-
tific Data, 8, 2021. 5, 6
[17] Yuchen Pei, Lisheng Wang, Fenqiang Zhao, Tao
Zhong, Lufan Liao, Dinggang Shen, and Gang Li.
Anatomy-guided convolutional neural network for
motion correction in fetal brain mri. In Machine
Learning in Medical Imaging, pages 384–393, Cham,
2020. Springer International Publishing. 2
[18] Daniela Prayer, Peter Christian Brugger, and Lucas
Prayer. Fetal MRI: techniques and protocols. Pedi-
atr. Radiol., 34(9):685–693, 2004. 1
[19] Marta B. M. Ranzini, Lucas Fidon, Sébastien
Ourselin, Marc Modat, and Tom Vercauteren. Mon-
aifbs: Monai-based fetal brain mri deep learning seg-
mentation, 2021. 5
[20] Olaf Ronneberger, Philipp Fischer, and Thomas Brox.
U-net: Convolutional networks for biomedical image
segmentation. In International Conference on Medical
image computing and computer-assisted intervention,
pages 234–241. Springer, 2015. 3
[21] François Rousseau, Orhan A Glenn, Betina Ior-
danova, Cynthia Rodriguez-Carranza, Daniel B Vi-
gneron, A James Barkovich, and Colin Studholme.
Registration-based approach for reconstruction of
high-resolution in utero fetal mr brain images. Aca-
demic Radiology, 13(9):1072–1081, 2006. 2, 3
[22] Seyed Sadegh Mohseni Salehi, Shadab Khan, Deniz
Erdogmus, and Ali Gholipour. Real-time deep pose
estimation with geodesic loss for image-to-template
rigid registration, 2018. 2
[23] Sébastien Tourbier, Xavier Bresson, Patric Hagmann,
Jean-Philippe Thiran, Reto Meuli, and Meritxell Bach
Cuadra. An efficient total variation algorithm for
super-resolution in fetal brain MRI with adaptive reg-
ularization. NeuroImage, 118:584–597, 2015. 2
[24] Alena U. Uus, Alexia Egloff Collado, Thomas A.
Roberts, Joseph V. Hajnal, Mary A. Rutherford, and
Maria Deprez. Retrospective motion correction in
foetal mri for clinical applications: existing meth-
ods, applications and integration into clinical practice.
British Journal of Radiology, 96(1147):20220071,
2022. 2
[25] Zhou Wang, Alan Conrad Bovik, Hamid Rahim
Sheikh, and Eero P Simoncelli. Image quality as-
sessment: from error visibility to structural similarity.
IEEE Trans. Image Process., 13(4):600–612, 2004. 7
[26] Jiangjie Wu, Lixuan Chen, Zhenghao Li, Xin Li, Tao-
tao Sun, Lihui Wang, Rongpin Wang, Hongjiang Wei,
and Yuyao Zhang. 3D isotropic high-resolution fetal
brain MRI reconstruction from motion corrupted thick
data based on physical-informed unsupervised learn-
ing. IEEE J. Biomed. Health Inform., PP(99):1–14,
2025. 3, 8
[27] Jiangjie Wu, Hongjiang Wei, and Yuyao Zhang. Svr-
mamba: Slice-to-volume reconstruction from multiple
mri stacks with slice sequence guided mamba. Pro-
ceedings of the AAAI Conference on Artificial Intelli-
gence, 39(8):8404–8412, 2025. 2, 8
[28] Junshen Xu, Daniel Moyer, P. Ellen Grant, Polina Gol-
land, Juan Eugenio Iglesias, and Elfar Adalsteinsson.
Svort: Iterative transformer for slice-to-volume regis-
tration in fetal brain mri. In Medical Image Computing
and Computer Assisted Intervention MICCAI 2022,
pages 3–13, Cham, 2022. Springer Nature Switzer-
land. 2, 3, 5, 6, 11
[29] Junshen Xu, Daniel Moyer, Borjan Gagoski, Juan Eu-
genio Iglesias, P. Ellen Grant, Polina Golland, and El-
far Adalsteinsson. Nesvor: Implicit neural represen-
tation for slice-to-volume reconstruction in mri. IEEE
Transactions on Medical Imaging, 42(6):1707–1719,
2023. 2, 3, 6, 11
[30] Sean I. Young, Yaël Balbastre, Bruce Fischl, Polina
Golland, and Juan Eugenio Iglesias. Fully convolu-
tional slice-to-volume reconstruction for single-stack
mri, 2024. 2, 3, 5
Fast Multi-Stack Slice-to-Volume Reconstruction
via Multi-Scale Unrolled Optimization
Supplementary Material
In this supplement, we provide additional details of meth-
ods described in the paper, additional ablations, and visual
results.
A. Implementation Details
Network architecture. Our specialized U-net has 10 lay-
ers, five in the encoder and five in the decoder. Each layer
consists of 4 convolutions, with 2D kernels of size (1,3,3) in the encoder and 3D kernels of size (2,3,3) in the decoder. In the decoder, the volume reconstruction and slice prediction are forward operators, while the field update is a convolutional layer.
B. Model-based reconstruction
We use an implementation from the NeSVoR pack-
age [29] for model-based reconstruction by calling the
command nesvor svr with slices that are initialized
with the network’s predicted pose. We use the op-
tions --no-global-exclusion, --n-iter 5, and
--n-iter-rec 3. Disabling global exclusions avoids
large holes in the reconstruction caused by many slices be-
ing excluded. We find that increasing the number of outer
iterations (n-iter) from 3 to 5 improves results and thus
fewer inner loops (n-iter-rec) are needed. After re-
construction, we rescale the volume by the average of the
slice scaling factors to avoid global shifts in intensity. We
disable the pose refinement step in the case of using cSVR
alone and enable it to perform cSVR + Refine.
C. Baselines
SVoRT Baseline. We use the latest version of the SVoRT
package (v2) [28] provided in the NeSVoR library by call-
ing the command nesvor svr with the motion-corrupted
stacks. We use the options --no-global-exclusion,
--n-iter 5, and --n-iter-rec 3 to ensure consis-
tency with the model-based reconstruction. We similarly
rescale the slices to avoid global shifts in the reconstructed
volume.
NeSVoR. We use the original reference implementation
provided in the NeSVoR library by calling the command
nesvor reconstruct. We additionally disable the
output-mean-intensity functionality to ensure the
volumes are reconstructed in the same range as the input
slices.
D. Model Scale
We downscale each layer's features by factors of 2, 4, and 8 and compute the model's performance on the clinical test cases; the corresponding parameter counts are listed in Table 3. The largest gain comes from scaling the model to over 100 M parameters. At very small model sizes, there are large variations in performance across models.
Table 3: Ablation Study. Effect of the total number of parameters.
Model size     SSIM (↑)   NCC (↑)   PSNR (↑)
10 M           0.961      0.126     36.2
14 M           0.947      0.113     34.7
165 M          0.966      0.132     36.9
660 M (cSVR)   0.966      0.132     37.0
E. Additional Results
We compare reconstructions from synthetic data under se-
vere rotation, translation, and noise corruption across the
different methods (Fig. 7). Our model with refinement
(cSVR + Refine) reconstructs coherent brains in all three
cases. The most notable difference between methods is in the case of large translation. It is important to note that the INR-
based method (NeSVoR) fails to reconstruct the brain areas
in places of poor coverage and instead creates black spots.
We also provide coronal and sagittal views of the clinical subjects shown in the main paper (Fig. 3), as well as views of 4 other subjects of the 9 used for evaluation (Figs. 8–12). Our method performs comparably to the state of the art with significant time improvements.
Figure 7: Synthetic Evaluation. Reconstructions from one synthetic subject with high rotation, translation, and noise. Our model with
refinement (cSVR + Refine) performs well in all three corruption scenarios, with the most notable difference in the case of translation. INR
based methods create black areas in places of poor coverage.
Figure 8: Clinical Evaluation Coronal View. Reconstructions for 5 clinical subjects (GA 20–35 weeks) for all methods: SVoRT, SVoRT
+ NeSVoR, cSVR, cSVR + Refine, cSVR + NeSVoR.
Figure 9: Clinical Evaluation Sagittal View. Reconstructions for 5 clinical subjects (GA 20–35 weeks) for all methods: SVoRT, SVoRT
+ NeSVoR, cSVR, cSVR + Refine, cSVR + NeSVoR.
Figure 10: Clinical Evaluation Axial View. Reconstructions on 4 other clinical subjects (GA 20–35 weeks) for all methods: SVoRT,
SVoRT + NeSVoR, cSVR, cSVR + Refine, cSVR + NeSVoR.
Figure 11: Clinical Evaluation Coronal View. Reconstructions on 4 other clinical subjects (GA 20–35 weeks) for all methods: SVoRT,
SVoRT + NeSVoR, cSVR, cSVR + Refine, cSVR + NeSVoR.
Figure 12: Clinical Evaluation Sagittal View. Reconstructions on 4 other clinical subjects (GA 20–35 weeks) for all methods: SVoRT,
SVoRT + NeSVoR, cSVR, cSVR + Refine, cSVR + NeSVoR.