Fast Multi-Stack Slice-to-Volume Reconstruction
via Multi-Scale Unrolled Optimization
Margherita Firenze
MIT
mfirenze@mit.edu
Sean I. Young
Harvard Medical School
siyoung@mit.edu
Clinton J. Wang
MIT
clintonw@csail.mit.edu
Hyuk Jin Yun
Harvard Medical School
hyun@cmh.edu
Elfar Adalsteinsson
MIT
elfar@mit.edu
Kiho Im
Harvard Medical School
kiho.im@childrens.harvard.edu
P. Ellen Grant
Harvard Medical School
ellen.grant@childrens.harvard.edu
Polina Golland
MIT
polina@csail.mit.edu
Figure 1: Fast Multi-Stack Slice-to-Volume Reconstruction. Our proposed multi-stack SVR framework takes as input three motion-
corrupted stacks of 2D slices and reconstructs a volume (1 second). Super-resolution is performed with optional optimization (7 seconds).
Abstract
Fully convolutional networks have become the backbone of
modern medical imaging due to their ability to learn multi-
scale representations and perform end-to-end inference. Yet
their potential for slice-to-volume reconstruction (SVR), the
task of jointly estimating 3D anatomy and slice poses from
misaligned 2D acquisitions, remains underexplored. We in-
troduce a fast convolutional framework that fuses multiple
orthogonal 2D slice stacks to recover coherent 3D struc-
ture and refines slice alignment through lightweight model-
based optimization. Applied to fetal brain MRI, our ap-
proach reconstructs high-quality 3D volumes in under 10s,
with 1s slice registration and accuracy on par with state-
of-the-art iterative SVR pipelines, offering more than 40×
speedup. The framework uses non-rigid displacement fields
to represent transformations, generalizing to other SVR
problems like fetal body and placental MRI. Additionally,
the fast inference time paves the way for real-time, scanner-
side volumetric feedback during MRI acquisition.
1. Introduction
Fetal brain magnetic resonance imaging (MRI) is an im-
portant tool for investigating abnormal ultrasound findings
and expanding our understanding of fetal brain develop-
ment [1, 9, 15]. To alleviate the effects of fetal motion,
fast 3D MRI sequences are used, which limit motion arti-
facts in the acquired 2D images [18]. A “cool-off” period is required due to safety limits on energy deposition between consecutive slice acquisitions [13], and during these times the fetus often moves considerably, causing the slices to become misaligned. An example of this can be seen in Fig. 1, where the coronal and sagittal views both exhibit severe motion that makes their orthogonal views look incoherent.
Images are acquired in series, called stacks, and ideally
only three stacks are needed for the three standard views
of the brain (sagittal, axial, and coronal). Due to motion,
stacks may contain oblique (out of plane) slices. In these
cases, the stack is reacquired, resulting in as many as 20
stacks and leading to long, uncomfortable scan times and
thousands of images for radiologists to sift through.
These problems can be overcome by slice-to-volume
reconstruction (SVR) methods, which produce high-
resolution visualizations of the brain from a limited num-
ber of stacks. SVR methods align the acquired slices in 3D
and super-resolve the volume [4, 7, 14, 21, 28, 29]. SVR is
widely used in research for volumetric analysis. However,
long runtimes limit scanner-side use and clinical adoption: radiologists perform their assessment shortly after acquisition, so time-consuming SVR methods disrupt the standard workflow. Fast SVR has the potential to improve radiological assessment by providing a coherent volume in time for review, and to vastly accelerate and improve fetal imaging by guiding decision-making during acquisition, i.e., when to stop acquiring new data because brain coverage is complete and which orientation to prescribe for the next stack. Our main contributions are:
• We propose a fully convolutional neural network that registers multiple stacks of slices in under one second, and refines poses and produces reconstructions of high quality in under 10 seconds.
• We integrate the neural network with model-based reconstruction using data consistency with the acquired slices.
• We evaluate the proposed method on simulated and real clinical data, demonstrating state-of-the-art reconstruction accuracy and speed.
Notably, our framework is not constrained to rigid motion
models and only requires a small training set, which paves
the way for other MRI applications such as placental (de-
formable motion) and fetal body (poly-rigid) SVR.
2. Related Work
SVR is complicated by the fact that only 2D slices are avail-
able, unlike classic 2D-to-3D registration problems where
2D and 3D images are given. Therefore, SVR can be
thought of as two problems that have to be solved jointly:
volume reconstruction, i.e., recovering a high-resolution
3D volume from aligned slices, and slice registration, i.e.,
aligning 2D slices into a common coordinate system. These
steps are intertwined, as more accurate volumes lead to
more accurate slice poses and vice versa. Early optimiza-
tion methods updated the volume and slice poses in an alter-
nating fashion. Later research sought to use deep learning
to solve the registration task of SVR, optimize both regis-
tration and reconstruction with a neural network, and solve
registration in one pass using an unrolled deep learning ap-
proach.
Learning-free Optimization. Early methods framed SVR
as an optimization problem, with solutions that alternated
between reconstructing a 3D volume and estimating slice
poses [5–7, 11, 12, 14, 21, 23]. The SVRTK toolkit [14]
is a widely used package that improves this approach by
using robust statistics to remove outlier slices for better re-
constructions.
In SVRTK the volume is initialized using a designated
reference stack. To initialize the slice poses, stacks are
registered to the reference stack in bulk, setting poses of
all slices in the stack. Following this step, the volume re-
construction is achieved by minimizing a model-based loss.
The optimization encourages the simulated slices, the slices
predicted based on the volume and slice pose estimates, to
match the input slices. Then, the poses are updated by reg-
istering the input slices to the latest volume estimate. These
steps are repeated 5-7 times. The optimization can fail to
converge when large motion is present [24] or a reference
stack is not adequately chosen. Further, the method is time-consuming, taking around 5 minutes with a multi-threaded CPU implementation to reconstruct a volume from 3 slice
stacks. Finally, SVRTK predicts poses relative to a desig-
nated template stack, with no guarantees of the final recon-
struction being in a canonical orientation that can be readily
interpreted by radiologists, resulting in oblique reconstruc-
tions.
Deep Learning Registration for SVR. Deep learning
promised to make SVR more robust to large motion and to
reconstruct the volume in the canonical orientation. Early
deep learning approaches for SVR used CNN architectures
trained on synthetic data to directly regress slice poses, ei-
ther as explicit rotation and translation parameters [22] or
as anchor-point representations [10, 17]. Transformer architectures have been shown to be effective for the coupled registration of all slices in a stack by capturing their pose similarities [28]. While these methods were fast, none were accurate enough to outperform traditional optimization-based approaches; instead, they served as an initialization step that leads to faster convergence of the subsequent optimization. State-
space architectures followed by an MLP to predict slice
poses have been shown to achieve improved registration ac-
curacy [27].
In many deep learning SVR methods, once slice poses
are estimated, the latent volume is reconstructed using tra-
ditional model-based optimization [14]. Alternatively, the
model-based reconstruction alternates with registration as
in classical methods [28]. Other methods replace this
step with learned reconstruction networks, employing su-
pervised interpolation to perform super-resolution and in-
painting [27, 30]. While supervised inpainting methods pro-
duce high-quality details, they are not guaranteed to pro-
duce a final reconstruction that is consistent with the in-
put slices. This has the potential to smooth over potential pathology and favor reconstructing “average” brains.
Figure 2: Method Overview. (A) SVR pipeline combines convolutional pose estimation with model-based reconstruction. (B) Iterative 2D+3D blocks refine slice pose estimates at resolution s through simulated slice generation and flow field updates.
We use the model-based reconstruction approach in our method as it is faster than INR-based reconstruction, ensures consistency with the acquired data, and avoids the risk of supervised inpainting smoothing over anomalies and favoring “average” brains [30].
Fully Neural SVR. More recently, implicit neural repre-
sentations (INR) have been used to perform registration and
reconstruction [4, 26, 29]. An INR optimizes a multi-layer perceptron (MLP) at inference time, learning a continuous representation of the volume and adapting slice poses as well as pixel and slice weights to remove outliers and
bias fields effectively. The NeSVoR package [29], which
first proposed this approach, is considered state of the art
for its ability to resolve fine details and produce robust re-
sults. Since the network is optimized at inference time, the
method requires long runtimes, around 4-5 minutes, and
specialized GPU infrastructure. The poses are initialized
using a fast deep-learning registration [28]. Similar to the
classic methods, the optimization fails to converge when the
initial poses are inaccurate. Further, in the final query of
the INR to produce the volume, discretization artifacts can
occur from sampling the continuous network parameteriza-
tion.
Alternatively, optimizing two MLPs at inference time
has been proposed. The first MLP performs registration and
the second MLP provides volume reconstruction similar to
the INR methods above [4, 26]. Additionally, meta learning
[4] has been shown to reduce the convergence time by ini-
tializing the weights using a small set of examples. Despite these advancements, the fastest implementation still requires
more than a minute and specialized GPU infrastructure.
Deep Learning SVR via Multi-scale Feed-forward Net-
works. One possible reason for the success of the deep learning methods above is that they repeat the classical optimization steps thousands of times and parametrize the problem with millions of parameters. This is in contrast to deep registration networks, which are much faster but predict poses directly from slices without explicitly invoking the forward model that couples the volume, the slice poses, and the acquired slices. To combine the performance gains of parameterized optimization approaches with the speed of registration approaches, a possible solution is to unroll the optimization steps across different layers of a network. Specifically, SVR is posed as a 2D-to-3D registration task between the input slices and an unknown 3D volume, and the poses are refined at different resolutions in successive layers of a multi-scale feed-forward U-net architecture [20]. Once trained, this network produces slice pose estimates through a single feed-forward pass at inference time, requiring less than a second [30]. Super-resolution reconstruction is then performed using inpainting [30]. Our method expands on this work and integrates model-based reconstruction following registration.
3. Preliminaries
Formally, SVR is an inverse problem where we seek to re-
construct an underlying volume V that is consistent with the
acquired slices. The forward imaging model predicts slice $I_n$ from an underlying volume $V$,
$$I_n = M(F_n^{-1})\,V \qquad (1)$$
where $F_n \in E(3)$ is a rigid transformation that defines the imaging plane of slice $I_n$, and the function $M(\cdot)$ transforms the 3D discretized point-spread function (PSF) by the transformation given as its input. The PSF is determined by the image acquisition parameters and can be approximated as a Gaussian [11, 21]. $M$ produces a sparse, non-square matrix mapping voxel coordinates in the volume to slice coordinates.
Both $V$ and $F_n$ are unknown. Classical SVR methods use coordinate descent by alternating the estimation of $V$ (i.e., reconstruction) and $F_n$ (i.e., registration).
The volume is initialized using the original slice poses of the acquired stacks, where each slice is spread over a 3D area given by the PSF weights and normalized by the total amount of slice contributions to a voxel, i.e.,
$$V_{\mathrm{init}}(x) = \frac{\sum_n \big[M(F_n)^{T} I_n\big](x)}{\big[\sum_n M(F_n)^{T}\big](x)}. \qquad (2)$$
This step produces blurry volumes that can be further refined by minimizing a data consistency loss. Specifically, “simulated slices” are generated from the estimated volume $V$, the current slice poses $F_n$, and the forward model (1), and compared to the acquired slices $I_n$. This is also known as model-based optimization. Formally, the reconstruction step updates $V$ while keeping the poses $F_n$ fixed:
$$\hat{V} = \arg\min_{V} \sum_n \big\| I_n - M(F_n)V \big\|^2. \qquad (3)$$
During registration, the acquired slices are registered to the volume $V$, which is kept constant, to update their poses $F_n$:
$$\hat{F}_n = \arg\min_{F_n} \big\| I_n - M(F_n)V \big\|^2. \qquad (4)$$
This alternating scheme separates pose and volume opti-
mization, with each step requiring computationally expen-
sive updates and many iterations to converge to a solution.
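To make the reconstruction step concrete: with the poses held fixed, (3) is a large, sparse linear least-squares problem. The following is a toy sketch in which a random sparse matrix stands in for the stacked slicing operators $M(F_n)$; the sizes, solver choice, and variable names are our illustrative assumptions, not the authors' implementation.

# Toy sketch of the reconstruction step in Eq. (3): with slice poses fixed,
# the volume is the solution of a sparse linear least-squares problem.
# Here A is a random placeholder for the stacked M(F_n) operators.
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import lsqr

rng = np.random.default_rng(0)
n_vox, n_pix = 1000, 400                   # flattened volume / slice sizes (toy)
A = sparse_random(n_pix, n_vox, density=0.01, random_state=0)  # stacked M(F_n)
slices = rng.standard_normal(n_pix)        # stacked acquired slice intensities I_n
V_hat = lsqr(A, slices, iter_lim=50)[0]    # least-squares volume update, Eq. (3)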
4. Method
We implement a reconstruction and slicing formulation that is highly parallelizable by using first-order approximations of the forward model and volume reconstruction. To generalize our approach, we employ non-rigid displacement fields instead of rigid transforms.
We replace the discrete PSF matrix $M_n$ with the identity $I$, modeling slices as unit thin. As we explain later in this section, this is a reasonable simplification for the low-resolution layers of our network. Motion is modeled as a non-rigid displacement field $f : \mathbb{R}^2 \to \mathbb{R}^3$ mapping 2D pixel coordinates $p = (p_x, p_y)$ to 3D displacements. We uplift $p$ to 3D by placing slices on the $z = 0$ plane and appending a 0 to each vector, i.e., $\bar{p} = (p_x, p_y, 0)$. The initial volume reconstruction (2) becomes
$$V_{\mathrm{init}}(x) = \frac{\big[\sum_n \sum_p \mathcal{V}\big(\bar{p} + f_n(p),\, I_n(p)\big)\big](x)}{\big[\sum_n \sum_p \mathcal{V}\big(\bar{p} + f_n(p),\, [I_n > 0]\big)\big](x)} \qquad (5)$$
where $\mathcal{V}(x, I)$ denotes the volume pushing operation that places the intensity given by $I$ at the voxel coordinates $x$ and distributes the intensity using trilinear interpolation when the 3D coordinate location does not coincide with a discrete grid point. Finally, since multiple slices may contribute to one 3D voxel location, the intensity is normalized by the total weight of the contributions. This approximation is used for slice estimation and is not refined using (3) as in classical methods.
To refine the pose estimates, we first construct the simulated slices using the current pose estimates, similar to (4), except without the use of $M$:
$$\hat{I}_n(p) = \mathcal{V}^{*}\big(\bar{p} + f_n(p),\, V\big) \qquad (6)$$
where $\mathcal{V}^{*}(x, v)$ denotes the volume sampling operation that samples the intensity of $v$ at the coordinate $x$ and uses trilinear interpolation when the 3D coordinate location does not coincide with a discrete point on the grid.
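To illustrate the two operators used in (5) and (6), the following is a minimal NumPy sketch of trilinear volume pushing and sampling (our own illustration, not the paper's parallel GPU implementation); the flattened point list, boundary handling, and function names are assumptions.

# Sketch of the volume pushing and sampling operations of Eqs. (5)-(6).
# points: (M, 3) continuous voxel coordinates p_bar + f_n(p); values: (M,).
import itertools
import numpy as np

def push(points, values, shape):
    """Splat intensities onto a grid with trilinear weights; returns (accum, weights)."""
    vol, wgt = np.zeros(shape), np.zeros(shape)
    base = np.floor(points).astype(int)           # lower corner of each cell
    frac = points - base                          # trilinear weights
    for offset in itertools.product((0, 1), repeat=3):
        corner = base + offset
        w = np.prod(np.where(offset, frac, 1.0 - frac), axis=1)
        ok = np.all((corner >= 0) & (corner < shape), axis=1)
        idx = tuple(corner[ok].T)
        np.add.at(vol, idx, w[ok] * values[ok])
        np.add.at(wgt, idx, w[ok])
    return vol, wgt                               # Eq. (5): V_init = vol / wgt

def sample(points, volume):
    """Trilinearly sample a volume at continuous coordinates (simulated slices)."""
    base = np.floor(points).astype(int)
    frac = points - base
    out = np.zeros(len(points))
    for offset in itertools.product((0, 1), repeat=3):
        corner = np.clip(base + offset, 0, np.array(volume.shape) - 1)
        w = np.prod(np.where(offset, frac, 1.0 - frac), axis=1)
        out += w * volume[tuple(corner.T)]
    return out                                    # Eq. (6): simulated slice values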
Then, we employ a learned convolutional operator $\Delta f^{s}$ which refines the displacement by comparing the simulated and input slices,
$$f_n^{s} = f_n^{s-1} + \Delta f^{s}\big(\hat{I}_n, I_n\big), \qquad (7)$$
where $s$ is the index of the layer. With these two operations defined, we can implement the volume estimation and pose refinement steps many times, as illustrated in Fig. 2. To implement a multi-resolution strategy, we repeat these steps at increasing resolutions, starting with slices sampled at low resolution and ending with high-resolution slices, using a slice parametrization that accounts for orthogonal slices.
Slice pose parametrization. We initialize slice poses to their prescribed positions, as given by the stack direction and slice order, and have the network refine them through iterative updates with increasing resolution. We parametrize the slice depth using a translation matrix $T_n$ and the 3D orientation using a $4 \times 4$ matrix $R_n$ that corresponds to either a sagittal, axial, or coronal orientation. Then, to adjust to different resolutions of the slices, we use $C_s$ and $C_s^{-1}$, translation matrices that center and de-center the coordinates so that rotation is performed about the center of the slice at resolution $s$. Finally, $S_s$ scales the pixel coordinates to match the volume resolution. Note that $T_n$ encodes the slice index in the stack and is different for each slice in the stack; $R_n$ is shared by all slices in the same stack; $C_s$ and $S_s$ depend on the resolution of layer $s$ in the network. Together, these transformations define the displacement field that places a slice in its prescribed position at scale $s$:
$$f_n^{s} = \big(C_s R_n C_s^{-1} S_s^{-1} T_n - I\big)\,\bar{p} \qquad (8)$$
where $\bar{p}$ denotes homogeneous 3D coordinates for the slice placed at $z = 0$. This approach models each slice separately, enabling the framework to work with variable slice and stack numbers.
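The sketch below illustrates how the prescribed-pose displacement field of (8) could be assembled for a single slice; the specific matrix conventions (slice spacing expressed in voxels, in-plane-only centering, grid size) are illustrative assumptions and may differ from the actual implementation.

# Sketch of f_n^s(p) = (C_s R_n C_s^{-1} S_s^{-1} T_n - I) p_bar for one slice
# on a size x size grid at scale s. Conventions are our assumptions.
import numpy as np

def prescribed_field(slice_index, R, scale, size, slice_spacing):
    T = np.eye(4); T[2, 3] = slice_index * slice_spacing    # slice depth, T_n
    C = np.eye(4); C[:2, 3] = (size - 1) / 2.0              # in-plane (de-)centering, C_s
    S_inv = np.diag([1.0 / scale] * 3 + [1.0])              # undo pixel scaling, S_s^{-1}
    A = C @ R @ np.linalg.inv(C) @ S_inv @ T - np.eye(4)
    ys, xs = np.mgrid[0:size, 0:size].astype(float)
    p_bar = np.stack([xs, ys, np.zeros_like(xs), np.ones_like(xs)], axis=-1)
    return (p_bar @ A.T)[..., :3]                           # per-pixel 3D displacement

f0 = prescribed_field(slice_index=10, R=np.eye(4), scale=2.0, size=64, slice_spacing=3.0)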
Finally, to train the network we use a multi-layer $L_2$ loss on the residual displacement:
$$\mathcal{L}(f_{\mathrm{GT}}, f) = \sum_n \Big\| f_{\mathrm{GT},n} - \frac{1}{5}\sum_{s=0}^{4} f_n^{s} \Big\|_2^2. \qquad (9)$$
Once the network predicts the slice poses, we super-resolve the volume using a model-based approach as in (3), and also perform pose update steps to further refine the poses.
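As a concrete reference, below is a minimal NumPy sketch of the loss exactly as written in (9); the array layout (per-layer field predictions stacked along the first axis) is our assumption.

# Sketch of the multi-layer L2 loss of Eq. (9).
# pred_fields: (5, N, H, W, 3) per-layer outputs f_n^s; gt_fields: (N, H, W, 3).
import numpy as np

def multilayer_loss(pred_fields, gt_fields):
    avg_pred = pred_fields.mean(axis=0)                   # (1/5) * sum_s f_n^s
    return float(np.sum((gt_fields - avg_pred) ** 2))     # sum over n of squared L2 norms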
High-resolution slice and volume estimation. At low resolutions, modeling slices as unit thin is viable since the in-plane sampling interval substantially exceeds the slice thickness. However, approximating slices as unit thickness introduces rendering artifacts at full resolution due to the mismatch between the in-plane and thickness dimensions. To mitigate coverage gaps at full resolution, we project the displacement field onto a rigid transformation using the method of Arun et al. [2] at the final layer, then apply a boxcar PSF to distribute slice values across their thickness.
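For reference, the rigid projection can be computed with the SVD-based least-squares fit of Arun et al. [2]; below is a minimal sketch in which the source points are the slice grid coordinates and the targets are the displaced points. The point-set layout is our assumption.

# Sketch of the least-squares rigid fit of Arun et al. [2]: find R, t minimizing
# ||R @ src_i + t - dst_i||^2 over corresponding (M, 3) point sets.
import numpy as np

def fit_rigid(src, dst):
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)                          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # guard against reflection
    R = Vt.T @ D @ U.T
    t = c_dst - R @ c_src
    return R, t

# Usage: src = flattened slice grid p_bar, dst = p_bar + f_n(p) for slice n.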
4.1. Implementation
Network Architecture. We build a custom U-net with a 2D encoder and a 2D + 3D decoder. The encoder constructs multi-resolution slice features $I_n^{s}$. The decoder repeats a 2D-to-3D block five times while doubling the resolution at each layer to emulate the classical SVR steps, as shown in Panel B of Fig. 2 and in (5)-(7).
We construct a feature volume $V^{s} \in \mathbb{R}^{d \times d_s \times d_s \times d_s}$ at each resolution using (5), with $d$ feature channels and spatial size $d_s$. $V^{s}$ is a four-dimensional tensor constructed from the slice features $I_n^{s}$ from the skip connection of layer $s$ and the previous displacement fields $f_n^{s-1}$. Across the five decoder layers, the channel count is $d = [1024, 512, 256, 128, 64]$ while the spatial size doubles, $d_s = [8, 16, 32, 64, 128]$.
To refine the displacement fields, we sample the volume to create simulated slices $\hat{I}_n^{s}$ and compute their correlation with the skip connection features $I_n^{s}$ to estimate a displacement residual $\Delta f^{s}$ that is added to the previous displacement estimate $f^{s-1}$ as in (7).
We simplify the architecture compared to the previously proposed solution [30] to make each layer's predicted volume and slices independent of the previous layer, depending only on the previous displacement estimates and skip connections.
Model-based reconstruction. The slice pose estimates are
used to initialize a model-based optimization that iteratively
refines both the 3D volume and slice poses to ensure con-
sistency with the acquired slices in (3) [14]. We employ
a GPU-accelerated implementation of this model-based re-
construction [28].
Training. We generate slice stacks from standard orthogonal imaging planes (sagittal, axial, and coronal), each perturbed by a bulk in-plane rotation uniformly sampled in the range [-12°, 12°] to simulate imperfect plane selection. Starting from this initialization, we apply between 1 and 100 smooth motion perturbations per stack, generated by interpolating randomly sampled rigid transformations using cubic B-splines. This procedure captures both gradual motion patterns and abrupt movements. Rotational perturbations are drawn from a zero-mean normal distribution with a standard deviation of 20°; translations are uniformly sampled within [-6.1, 6.1] mm. We further apply Gaussian noise, slice-wise bias field augmentation, and gamma intensity perturbations to make the network robust to imaging artifacts. We train the network on an NVIDIA H200 GPU for 250k steps and select the final checkpoint. We use ADAM with an initial learning rate of $10^{-4}$ and a polynomial learning-rate schedule.
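As an illustration of the motion simulation described above, the sketch below interpolates a few randomly sampled rigid perturbations into a smooth per-slice trajectory; the control-point count, Euler-angle parametrization, and SciPy-based spline are illustrative assumptions rather than the authors' exact implementation.

# Sketch: smooth per-slice rigid motion by cubic-spline interpolation of a few
# random control transforms (rotations ~ N(0, 20 deg), translations ~ U(-6.1, 6.1) mm).
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.spatial.transform import Rotation

def sample_motion_trajectory(n_slices, n_control=5, rot_std_deg=20.0,
                             trans_range_mm=6.1, seed=0):
    rng = np.random.default_rng(seed)
    t_ctrl = np.linspace(0, n_slices - 1, n_control)           # control-point times
    ang_ctrl = rng.normal(0.0, rot_std_deg, size=(n_control, 3))
    trans_ctrl = rng.uniform(-trans_range_mm, trans_range_mm, size=(n_control, 3))
    t = np.arange(n_slices)
    angles = CubicSpline(t_ctrl, ang_ctrl)(t)                   # (n_slices, 3) degrees
    trans = CubicSpline(t_ctrl, trans_ctrl)(t)                  # (n_slices, 3) mm
    rots = Rotation.from_euler("xyz", angles, degrees=True).as_matrix()
    return rots, trans                                          # per-slice R (3x3) and t (3,)

R, T = sample_motion_trajectory(n_slices=25)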
5. Experimental Results
Table 1: Clinical Evaluation. Quantitative assessment of reconstruction quality across 9 clinical subjects. We compute similarity measures (SSIM, NCC, and PSNR) between the simulated slices and input slices across methods. Running times are listed in the last column.
Method            Slice SSIM (↑)    NCC (↑)        PSNR (↑)      Time
SVoRT             0.971 ± 0.007     0.14 ± 0.01    37.5 ± 1.4    10 s
SVoRT + NeSVoR    0.959 ± 0.007     0.13 ± 0.01    35.7 ± 1.3    257 s
cSVR              0.952 ± 0.013     0.12 ± 0.01    35.4 ± 2.0    3 s
cSVR + Refine     0.966 ± 0.005     0.13 ± 0.01    37.0 ± 1.2    7 s
cSVR + NeSVoR     0.959 ± 0.006     0.13 ± 0.01    35.3 ± 1.2    251 s
Data. We train and evaluate our model using FeTA [16], a public dataset of high-quality T2-weighted coherent volumes reconstructed using existing methods for 120 subjects (gestational age (GA) 20–35 weeks, voxel size 0.8 mm³), and 18 volumes (GA 21–38 weeks, voxel size 0.8 mm³) from the CRL atlas [8]. We train the network on 108 subjects and 18 atlases.
We evaluate our method on 12 held-out FeTA subjects and 9 patients from [withheld for anonymity] (GA 25–35 weeks, pixel size 1.3–1.4 mm, slice thickness 3 mm). We choose three stacks (sagittal, coronal, and axial) for each subject and segment the intracranial content using a publicly available method [19].
Baseline methods. We evaluate the accuracy of the pose es-
timates produced by our neural network when coupled with
three variants of volume reconstruction: (1) cSVR keeps the
slice poses estimated by the neural network and uses data
consistency to reconstruct the volume; (2) cSVR+Refine continues to refine the slice poses while alternating with volume reconstruction; (3) cSVR+NeSVoR provides the slice poses estimated by the network as an initialization for NeSVoR, the state-of-the-art method based on an implicit neural representation (INR) of the resulting volume [29].
We also compare the performance of the cSVR variants with two baseline methods: the transformer-based approach SVoRT [28] and the SVoRT+NeSVoR method that uses the output of SVoRT as an initialization for NeSVoR. SVoRT was trained on the FeTA dataset [16] and shares the same validation set as our method.
Figure 3: Clinical Evaluation. Reconstructions for clinical subjects (GA 20–35 weeks) for all methods: SVoRT, SVoRT + NeSVoR, cSVR, cSVR + Refine, cSVR + NeSVoR. Our proposed fast method, cSVR + Refine, achieves high-quality reconstructions comparable to the state of the art, with high grey and white matter contrast (green arrows). Our method as well as SVoRT struggles in cases of image corruption, as seen by the red arrows, where the reconstructions fail to exclude noisy areas of a slice.
Figure 4: Performance evaluation and sensitivity analysis. Robustness across methods to input stack perturbations (translation, rotation,
noise), evaluated via registration accuracy (top) and reconstruction quality (bottom). Our method, cSVR + refine, is robust across high
levels of translation and rotation and our method coupled with NeSVoR reconstruction achieves the best overall performance.
Figure 5: Inference Time of Registration Methods. Time to predict slice poses on clinical subjects of varying input sizes, comparing SVoRT vs. cSVR.
Evaluation on Synthetic Data. On the synthetic data,
where ground truth is available, we quantify the accuracy
of the pose prediction using the maximum total registration
error (TRE) and volume reconstruction quality using struc-
tural similarity index measure (SSIM) [25] between the re-
constructed and the ground-truth volumes.
Table 2: Ablation Study. Effect of loss function and slice pose initialization on clinical data.
Configuration                                         SSIM    NCC    PSNR
Without slice pose initialization                     0.93    0.10   32.63
With pose initialization, without multi-layer loss    0.966   0.13   36.95
With pose initialization, with multi-layer loss       0.966   0.13   37.00
For estimated and ground-truth displacement fields $\hat{f}$ and $f$, respectively, we define
$$\mathrm{TRE}(\hat{f}; f) = \max_p \big\| \hat{f}(p) - f(p) \big\|_2, \qquad (10)$$
which captures the maximum distance between a predicted pixel location and its ground-truth location.
Prior to computing the TRE, the output volume is registered
to the ground-truth volume using ANTS [3] to mitigate the
effect of global shifts in reconstruction. Although most pre-
vious work reports only volume consistency scores, TRE is an important measure to distinguish between the registration performance of a given method and its reconstruction performance.
Figure 6: Pose estimation in different layers of the network. The network gradually estimates the pose, with the bulk of the pose predicted correctly in the first two layers (TRE at network layers 0–5: 30.0 ± 6.0, 30.0 ± 6.0, 4.0 ± 0.8, 3.0 ± 0.6, 2.0 ± 0.5, and 2.0 ± 0.5 mm).
Notably, TRE scores for smaller slices, where
only part of the brain is visible, are higher as these regions
are harder to register. To mitigate this, we report the median of the per-stack maximum TRE for the sensitivity analysis.
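A minimal sketch of the TRE of (10), together with the per-stack median used in the sensitivity analysis; the array shapes (per-slice fields stacked as N x H x W x 3, in mm) are an assumption.

# Sketch of the maximum total registration error (TRE) of Eq. (10) and the
# median of per-stack maxima reported for the sensitivity analysis.
import numpy as np

def max_tre(f_pred, f_gt):
    """Maximum Euclidean distance between predicted and ground-truth mappings (mm)."""
    return float(np.linalg.norm(f_pred - f_gt, axis=-1).max())

def median_per_stack_max_tre(stacks_pred, stacks_gt):
    return float(np.median([max_tre(p, g) for p, g in zip(stacks_pred, stacks_gt)]))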
To capture high variability of motion in clinical practice,
we evaluate our method on a range of motion settings, vary-
ing the possible range of sampled rotations and translations
separately for 12 subjects. We also evaluate the methods’
ability to mitigate noise in the reconstruction by varying the
noise level in the synthetic data.
Evaluation on Clinical Data. For clinical data, we evaluate
the consistency between the estimated and acquired slices
using mean SSIM, PSNR, and NCC (Table 1). We run all
evaluations using an NVIDIA A6000 GPU.
5.1. Results
Synthetic data. We evaluate all methods for varying degrees of corruption, specifically varying rotations, transla-
tions, and image noise (Fig 4). Overall, cSVR is robust
to large rotations, translations, and noise. Adding the re-
finement steps, cSVR + Refine, significantly boosts perfor-
mance across all levels of rotations and translations. How-
ever, cSVR + Refine hurts performance in the case of high image noise. This is likely because the refinement step is driven by slice consistency, which is corrupted by noise. When our method is combined with INR recon-
struction (NeSVoR) it achieves the most robust results of all
methods. SVoRT struggles with high levels of rotation and
translation. For rotation, NeSVoR is able to compensate,
but for large translations the INR is not able to refine the
pose adequately.
Clinical Data. In the clinical cases, we observe that our method performs similarly to the baseline methods (Table 1). The SSIM, PSNR, and NCC are all close across models, with our method generally being the second best. NeSVoR is prone to slight intensity shifts that contribute to the lowest similarity metrics of the three despite visually high-quality reconstructions (Fig. 3). Our method shows good contrast between grey matter and white matter (Fig. 3, row 4). We observe that SVoRT + NeSVoR produces sharp reconstructions, as can be seen by the fine detail in the cortex in the last row. However, NeSVoR is also prone to producing
noisy reconstructions with a speckle-like pattern, as seen in
the first row. An example of the problem of using only self-
consistency metrics can be seen in the third subject (GA
32w) where a noisy slice contributes a bright spot on the
left side of the brain. While all gradient-descent methods
reconstruct this artifact, NeSVoR successfully suppresses it.
Ablation Studies. We evaluate the quality of the slice
pose estimates for different layers of the network as seen in
Fig. 6 and find the network gradually refines the pose, with
large deformations occurring in the early layers. We em-
pirically evaluate the runtime of the algorithms and validate
that our method scales linearly with the input slice count
while SVoRT (transformer-based) scales (roughly) quadrat-
ically (Fig. 5).
To test whether our parametrization of slice poses in (8)
is necessary for network learning, we initialize all slices
with only their slice index position by setting all rotation matrices $R_n$ to identity, and train the network. In Table 2 we show that the network performs very poorly in this
case. We also evaluate the effect of using a multi-layer loss
and see only a small improvement in PSNR.
6. Discussion
Limitations. Although our method can be used with more stacks, we only evaluated reconstructions from three-stack inputs. Our method requires pre-processing steps that
could be integrated into the network such as standardizing
slice ordering and orientations. We compared our method to
state of the art SVR methods (SVoRT + NeSVoR), but addi-
tional insights could be gleaned by comparing with concur-
rent work [4, 26, 27] once the code has been made public.
Future work. We plan to extend this framework to non-
rigid SVR applications such as placental MRI. We also plan
to train the network to learn how to refine the poses, using a cascaded scheme of convolution-based refinement.
7. Conclusions
We demonstrate a fast convolutional multi-stack SVR approach that is 40 times faster than state-of-the-art methods while producing reconstructions of comparable quality. We propose a slice parameterization, loss function, and robust reconstruction approach that enable this architecture to generalize to other SVR applications.
Acknowledgments. This work is supported by NSF GRFP,
NIH R01EB032708, R01HD114338, R01EB036945,
K99AG081493 and R00AG081493, and the MIT CSAIL-
Wistron Program.
References
[1] Michael Aertsen. The role of fetal brain magnetic res-
onance imaging in current fetal medicine. J. Belg. Soc.
Radiol., 106(1):130, 2022. 1
[2] K. S. Arun, T. S. Huang, and S. D. Blostein. Least-
squares fitting of two 3-d point sets. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence,
PAMI-9(5):698–700, 1987. 5
[3] Brian Avants, Nicholas J Tustison, and Gang Song.
Advanced normalization tools: V1.0. Insight J., 2009.
7
[4] Maik Dannecker, Thomas Sanchez, Meritxell Bach
Cuadra, Özgün Turgut, Anthony N. Price, Lucilio Cordero-Grande, Vanessa Kyriakopoulou, Joseph V. Hajnal, and Daniel Rueckert. Meta-learning slice-to-volume reconstruction in fetal brain MRI using implicit neural representations, 2025. 2, 3, 8
[5] Michael Ebner, Guotai Wang, Wenqi Li, Michael
Aertsen, Premal A. Patel, Rosalind Aughwane, An-
drew Melbourne, Tom Doel, Steven Dymarkowski,
Paolo De Coppi, Anna L. David, Jan Deprest,
Sébastien Ourselin, and Tom Vercauteren. An automated framework for localization, segmentation and super-resolution reconstruction of fetal brain MRI.
NeuroImage, 206:116324, 2020. 2
[6] Michael Ebner, Guotai Wang, Wenqi Li, Michael
Aertsen, Premal A Patel, Rosalind Aughwane, An-
drew Melbourne, Tom Doel, Steven Dymarkowski,
Paolo De Coppi, et al. An automated framework
for localization, segmentation and super-resolution re-
construction of fetal brain MRI. NeuroImage, 206:
116324, 2020.
[7] Ali Gholipour, Judy A Estroff, and Simon K Warfield.
Robust super-resolution volume reconstruction from
slice acquisitions: application to fetal brain MRI.
IEEE transactions on medical imaging, 29(10):1739–
1758, 2010. 2
[8] Ali Gholipour, Caitlin K Rollins, Clemente Velasco-
Annis, Abdelhakim Ouaalam, Alireza Akhondi-Asl,
Onur Afacan, Cynthia M Ortinau, Sean Clancy,
Catherine Limperopoulos, Edward Yang, Judy A Es-
troff, and Simon K Warfield. A normative spatiotem-
poral MRI atlas of the fetal brain for automatic seg-
mentation and analysis of early brain growth. Sci.
Rep., 7(1):476, 2017. 5
[9] Paul D Griffiths, Michael Bradburn, Michael J
Campbell, Cindy L Cooper, Ruth Graham, Deborah
Jarvis, Mark D Kilby, Gerald Mason, Cara Mooney,
Stephen C Robson, Allan Wailoo, and MERIDIAN
collaborative group. Use of MRI in the diagnosis of fe-
tal brain abnormalities in utero (MERIDIAN): a multi-
centre, prospective cohort study. Lancet, 389(10068):
538–546, 2017. 1
[10] Benjamin Hou, Bishesh Khanal, Amir Alansary,
Steven McDonagh, Alice Davidson, Mary Rutherford,
Jo V. Hajnal, Daniel Rueckert, Ben Glocker, and Bern-
hard Kainz. 3d reconstruction in canonical co-ordinate
space from arbitrarily oriented 2d images, 2018. 2
[11] Shuzhou Jiang, Hui Xue, Alan Glover, Mary Ruther-
ford, Daniel Rueckert, and Joseph V Hajnal. MRI
of moving subjects using multislice snapshot images
with volume reconstruction (SVR): application to fe-
tal, neonatal, and adult brain studies. IEEE transac-
tions on medical imaging, 26(7):967–980, 2007. 2, 3
[12] Bernhard Kainz, Markus Steinberger, Wolfgang Wein,
Maria Kuklisova-Murgasova, Christina Malamate-
niou, Kevin Keraudren, Thomas Torsney-Weir, Mary
Rutherford, Paul Aljabar, Joseph V Hajnal, et al. Fast
volume reconstruction from motion corrupted stacks
of 2d slices. IEEE transactions on medical imaging,
34(9):1901–1913, 2015. 2
[13] Uday Krishnamurthy, Jaladhar Neelavalli, Swati
Mody, Lami Yeo, Pavan K Jella, Sheena Saleem,
Steven J Korzeniewski, Maria D Cabrera, Shadi
Ehterami, Ray O Bahado-Singh, Yashwanth Katkuri,
Ewart M Haacke, Edgar Hernandez-Andrade, Sonia S
Hassan, and Roberto Romero. MR imaging of the fe-
tal brain at 1.5T and 3.0T field strengths: comparing
specific absorption rate (SAR) and image quality. J.
Perinat. Med., 43(2):209–220, 2015. 1
[14] Maria Kuklisova-Murgasova, Gerardine Quaghebeur,
Mary A. Rutherford, Joseph V. Hajnal, and Julia A.
Schnabel. Reconstruction of fetal brain mri with inten-
sity matching and complete outlier removal. Medical
Image Analysis, 16(8):1550–1564, 2012. 2, 5
[15] Lucia Manganaro, Silvia Capuani, Marco Gen-
narini, Valentina Miceli, Roberta Ninkova, Ilaria
Balba, Nicola Galea, Angelica Cupertino, Alessandra
Maiuro, Giada Ercolani, and Carlo Catalano. Fetal
mri: what’s new? a short review. European Radiology
Experimental, 7(1):41, 2023. 1
[16] Kelly Payette, Priscille de Dumast, Hamza Ke-
biri, Ivan Ezhov, Johannes Paetzold, Suprosanna
Shit, Asim Iqbal, Romesa Khan, Raimund Kottke,
Patrice Grehten, Hui Ji, Levente Lanczi, Marianna
Nagy, Beres Monika, Thi Nguyen, Giancarlo Na-
talucci, Theofanis Karayannis, Bjoern Menze, Mer-
itxell Bach Cuadra, and András Jakab. An automatic
multi-tissue human fetal brain segmentation bench-
mark using the fetal tissue annotation dataset. Scien-
tific Data, 8, 2021. 5, 6
[17] Yuchen Pei, Lisheng Wang, Fenqiang Zhao, Tao
Zhong, Lufan Liao, Dinggang Shen, and Gang Li.
Anatomy-guided convolutional neural network for
motion correction in fetal brain mri. In Machine
Learning in Medical Imaging, pages 384–393, Cham,
2020. Springer International Publishing. 2
[18] Daniela Prayer, Peter Christian Brugger, and Lucas
Prayer. Fetal MRI: techniques and protocols. Pedi-
atr. Radiol., 34(9):685–693, 2004. 1
[19] Marta B. M. Ranzini, Lucas Fidon, Sébastien
Ourselin, Marc Modat, and Tom Vercauteren. Mon-
aifbs: Monai-based fetal brain mri deep learning seg-
mentation, 2021. 5
[20] Olaf Ronneberger, Philipp Fischer, and Thomas Brox.
U-net: Convolutional networks for biomedical image
segmentation. In International Conference on Medical
image computing and computer-assisted intervention,
pages 234–241. Springer, 2015. 3
[21] François Rousseau, Orhan A Glenn, Betina Ior-
danova, Cynthia Rodriguez-Carranza, Daniel B Vi-
gneron, A James Barkovich, and Colin Studholme.
Registration-based approach for reconstruction of
high-resolution in utero fetal mr brain images. Aca-
demic Radiology, 13(9):1072–1081, 2006. 2, 3
[22] Seyed Sadegh Mohseni Salehi, Shadab Khan, Deniz
Erdogmus, and Ali Gholipour. Real-time deep pose
estimation with geodesic loss for image-to-template
rigid registration, 2018. 2
[23] Sébastien Tourbier, Xavier Bresson, Patric Hagmann,
Jean-Philippe Thiran, Reto Meuli, and Meritxell Bach
Cuadra. An efficient total variation algorithm for
super-resolution in fetal brain MRI with adaptive reg-
ularization. NeuroImage, 118:584–597, 2015. 2
[24] Alena U. Uus, Alexia Egloff Collado, Thomas A.
Roberts, Joseph V. Hajnal, Mary A. Rutherford, and
Maria Deprez. Retrospective motion correction in
foetal mri for clinical applications: existing meth-
ods, applications and integration into clinical practice.
British Journal of Radiology, 96(1147):20220071,
2022. 2
[25] Zhou Wang, Alan Conrad Bovik, Hamid Rahim
Sheikh, and Eero P Simoncelli. Image quality as-
sessment: from error visibility to structural similarity.
IEEE Trans. Image Process., 13(4):600–612, 2004. 7
[26] Jiangjie Wu, Lixuan Chen, Zhenghao Li, Xin Li, Tao-
tao Sun, Lihui Wang, Rongpin Wang, Hongjiang Wei,
and Yuyao Zhang. 3D isotropic high-resolution fetal
brain MRI reconstruction from motion corrupted thick
data based on physical-informed unsupervised learn-
ing. IEEE J. Biomed. Health Inform., PP(99):1–14,
2025. 3, 8
[27] Jiangjie Wu, Hongjiang Wei, and Yuyao Zhang. Svr-
mamba: Slice-to-volume reconstruction from multiple
mri stacks with slice sequence guided mamba. Pro-
ceedings of the AAAI Conference on Artificial Intelli-
gence, 39(8):8404–8412, 2025. 2, 8
[28] Junshen Xu, Daniel Moyer, P. Ellen Grant, Polina Gol-
land, Juan Eugenio Iglesias, and Elfar Adalsteinsson.
Svort: Iterative transformer for slice-to-volume regis-
tration in fetal brain mri. In Medical Image Computing
and Computer Assisted Intervention MICCAI 2022,
pages 3–13, Cham, 2022. Springer Nature Switzer-
land. 2, 3, 5, 6, 11
[29] Junshen Xu, Daniel Moyer, Borjan Gagoski, Juan Eu-
genio Iglesias, P. Ellen Grant, Polina Golland, and El-
far Adalsteinsson. Nesvor: Implicit neural represen-
tation for slice-to-volume reconstruction in mri. IEEE
Transactions on Medical Imaging, 42(6):1707–1719,
2023. 2, 3, 6, 11
[30] Sean I. Young, Yaël Balbastre, Bruce Fischl, Polina
Golland, and Juan Eugenio Iglesias. Fully convolu-
tional slice-to-volume reconstruction for single-stack
mri, 2024. 2, 3, 5
Fast Multi-Stack Slice-to-Volume Reconstruction
via Multi-Scale Unrolled Optimization
Supplementary Material
In this supplement, we provide additional details of meth-
ods described in the paper, additional ablations, and visual
results.
A. Implementation Details
Network architecture. Our specialized U-net has 10 lay-
ers, five in the encoder and five in the decoder. Each layer
consists of 4 convolutions, with 2D kernels of size (1,3,3) in the encoder and 3D kernels of size (2,3,3) in the decoder. In the decoder, the volume reconstruction and slice prediction are forward operators, while the field update is a convolutional layer.
B. Model-based reconstruction
We use an implementation from the NeSVoR pack-
age [29] for model-based reconstruction by calling the
command nesvor svr with slices that are initialized
with the network’s predicted pose. We use the op-
tions --no-global-exclusion, --n-iter 5, and
--n-iter-rec 3. Disabling global exclusions avoids
large holes in the reconstruction caused by many slices be-
ing excluded. We find that increasing the number of outer
iterations (n-iter) from 3 to 5 improves results and thus
fewer inner loops (n-iter-rec) are needed. After re-
construction, we rescale the volume by the average of the
slice scaling factors to avoid global shifts in intensity. We
disable the pose refinement step in the case of using cSVR
alone and enable it to perform cSVR + Refine.
C. Baselines
SVoRT Baseline. We use the latest version of the SVoRT
package (v2) [28] provided in the NeSVoR library by call-
ing the command nesvor svr with the motion-corrupted
stacks. We use the options --no-global-exclusion,
--n-iter 5, and --n-iter-rec 3 to ensure consis-
tency with the model-based reconstruction. We similarly
rescale the slices to avoid global shifts in the reconstructed
volume.
NeSVoR. We use the original reference implementation
provided in the NeSVoR library by calling the command
nesvor reconstruct. We additionally disable the
output-mean-intensity functionality to ensure the
volumes are reconstructed in the same range as the input
slices.
D. Model Scale
We downscale each layer's features by factors of 2, 4, and 8 and compute the model's performance on the clinical test cases; the corresponding parameter counts are listed in Table 3. The largest gain comes from scaling the model to over 100 M parameters. At very small model sizes, there are large variations in performance across models.
Table 3: Ablation Study. Effect of the total number of parameters.
Model size     SSIM (↑)   NCC (↑)   PSNR (↑)
10 M           0.961      0.126     36.2
14 M           0.947      0.113     34.7
165 M          0.966      0.132     36.9
660 M (cSVR)   0.966      0.132     37.0
E. Additional Results
We compare reconstructions from synthetic data under se-
vere rotation, translation, and noise corruption across the
different methods (Fig. 7). Our model with refinement
(cSVR + Refine) reconstructs coherent brains in all three
cases. The most notable difference between methods is in the case of large translation. It is important to note that the INR-
based method (NeSVoR) fails to reconstruct the brain areas
in places of poor coverage and instead creates black spots.
We also provide coronal and sagittal views of the clinical subjects shown in the main paper (Fig. 3), as well as views of 4 other subjects of the 9 used for evaluation (Figs. 8–12). Our method performs comparably to the state of the art with significant time improvements.
Figure 7: Synthetic Evaluation. Reconstructions from one synthetic subject with high rotation, translation, and noise. Our model with
refinement (cSVR + Refine) performs well in all three corruption scenarios, with the most notable difference in the case of translation. INR
based methods create black areas in places of poor coverage.
Figure 8: Clinical Evaluation Coronal View. Reconstructions for 5 clinical subjects (GA 20–35 weeks) for all methods: SVoRT, SVoRT
+ NeSVoR, cSVR, cSVR + Refine, cSVR + NeSVoR.
Figure 9: Clinical Evaluation Sagittal View. Reconstructions for 5 clinical subjects (GA 20–35 weeks) for all methods: SVoRT, SVoRT
+ NeSVoR, cSVR, cSVR + Refine, cSVR + NeSVoR.
Figure 10: Clinical Evaluation Axial View. Reconstructions on 4 other clinical subjects (GA 20–35 weeks) for all methods: SVoRT,
SVoRT + NeSVoR, cSVR, cSVR + Refine, cSVR + NeSVoR.
Figure 11: Clinical Evaluation Coronal View. Reconstructions on 4 other clinical subjects (GA 20–35 weeks) for all methods: SVoRT,
SVoRT + NeSVoR, cSVR, cSVR + Refine, cSVR + NeSVoR.
Figure 12: Clinical Evaluation Sagittal View. Reconstructions on 4 other clinical subjects (GA 20–35 weeks) for all methods: SVoRT,
SVoRT + NeSVoR, cSVR, cSVR + Refine, cSVR + NeSVoR.