An End-To-End-Trainable Iterative Network Architecture for Accelerated Radial Multi-Coil 2D Cine MR Image Reconstruction

Purpose: Iterative Convolutional Neural Networks (CNNs) which resemble unrolled learned iterative schemes have been shown to consistently deliver state-of-the-art results for image reconstruction problems across different imaging modalities. However, because these methods include the forward model in the architecture, their applicability is often restricted to either relatively small reconstruction problems or to problems with operators which are computationally cheap to evaluate. As a consequence, they have so far not been applied to dynamic non-Cartesian multi-coil reconstruction problems. Methods: In this work, we propose a CNN-architecture for image reconstruction of accelerated 2D radial cine MRI with multiple receiver coils. The network is based on a computationally light CNN-component and a subsequent conjugate gradient (CG) method, which can be jointly trained end-to-end using an efficient training strategy. We investigate the proposed training strategy and compare our method to other well-known reconstruction techniques with learned and non-learned regularization methods. Results: Our proposed method outperforms all other methods based on non-learned regularization. Further, it performs similarly to or better than a CNN-based method employing a 3D U-Net and a method using adaptive dictionary learning. In addition, we empirically demonstrate that, even when the network is trained with only one iteration, it is possible to increase the length of the network at test time and further improve the results. Conclusions: End-to-end training allows us to considerably reduce the number of trainable parameters of the reconstruction network and to stabilize it. Further, because the length of the network can be changed at test time, the need to find a compromise between the complexity of the CNN-block and the number of iterations in each CG-block becomes irrelevant.


I. Introduction
Magnetic Resonance Imaging (MRI) is an important tool for the assessment of different cardiovascular diseases. Cardiac cine MRI, for example, is used to assess the cardiac function as well as the left and right ventricular volumes and the left ventricular mass 1 . However, MRI is also known to suffer from relatively long data-acquisition times. In cardiac cine MRI, the data-acquisition usually has to take place during a single breathhold of the patient in order to avoid artefacts arising from respiratory motion. Since this can be challenging for patients with limited breathhold capabilities, undersampling techniques can be used to accelerate the measurement process. Undersampling in the Fourier domain, the so-called k-space, leads to a violation of the Nyquist-Shannon sampling theorem and therefore, regularization techniques must be used to reconstruct artefact-free images.
Recently, image reconstruction has experienced a paradigm shift with the re-emergence of Neural Networks (NNs), and in particular Convolutional NNs (CNNs), which can be employed as regularization methods 2 . CNNs can be used in different ways as regularization methods for image reconstruction problems. A key element to categorize these methods is whether or not the forward operator is used in the learning process, see 3 . A straightforward approach is to simply post-process an initial estimate of the solution which is impaired by noise and artefacts, see e.g. 4 , 5 , 6 , 7 . Further, cross-/multi-domain methods which pre-process the k-space and subsequently process the initially obtained reconstruction have been investigated as well 8 . However, the post-processed image might lack data-consistency, meaning that it is not clear how well the post-processed image matches the measured k-space data.
Thus, approaches which use the output of a pre-trained CNN as an image-prior have been considered as well, see e.g. 9 , 10 . There, the CNN-prior is used in the formulation of a Tikhonov functional which is subsequently minimized to increase the data-consistency of the solution. These methods have the (computational) advantage of decoupling the regularization step from the step of increasing data-consistency and are thus in general applicable to arbitrarily large-scale image reconstruction problems as well. However, the network training is typically carried out in the absence of the forward model and, although this strategy was reported to be successful, image reconstruction methods based on NNs have been reported to possibly suffer from instabilities 11 . This issue could directly affect the obtained CNN-prior and thus the quality of the final reconstruction, since the solution depends on the CNN-prior. Interestingly, in 11 , the authors showed that the CNN-based reconstruction methods which include the forward and adjoint operators in the network architecture are the most stable with respect to small perturbations and adversarial attacks. Further, including the forward model in the network architecture has been reported to lower the maximum error bound of the CNNs 12 . Thus, it is desirable to find a way to include the physical model in the CNN-architecture, even for computationally expensive/large-scale problems.
Methods in which the CNN-architectures resemble unrolled iterative schemes of finite length are referred to as variational/iterative/cascaded networks, see e.g. 13 . These methods typically consist of CNN-blocks and data-consistency (DC)-blocks, which are given either in the form of gradient-steps with respect to a data-fidelity term, see e.g. 14 , 19 , or as layers which implement the minimizer of a functional involving a data-fidelity term, see e.g. 13 , 16 , 18 . While the CNN-blocks can be interpreted as learned regularizers, the DC-blocks utilize the measured undersampled k-space data to increase/ensure the data-consistency of the intermediate CNN-outputs.
Unfortunately, including the forward and adjoint operators in the CNN can at the same time represent the computational bottleneck of these methods, since the forward operator as well as its adjoint can be computationally expensive to evaluate. As a consequence, the applicability of iterative networks is currently still limited to either reconstruction problems with easy-to-compute forward operators, e.g. an FFT sampled on a Cartesian grid 13 , 16 , 18 , 20 , or to non-dynamic problems with non-Cartesian sampling schemes, see 21 for a single-coil data-acquisition or 22 for a multi-coil acquisition. For example, in 13 , 16 , 18 , only a single coil is used for the encoding operator. In 20 , multiple receiver coils were used for a dynamic 3D reconstruction problem with a Cartesian sampling grid. For non-dynamic problems, on the other hand, iterative CNNs for multi-coil data-acquisition on a Cartesian grid have received more attention, see e.g. 14 , 15 , 23 .
Acquisition protocols using non-Cartesian sampling trajectories, such as radial or spiral sampling, can be an attractive alternative to standard Cartesian acquisitions, especially because the arising undersampling artefacts are much more incoherent than the ones arising from a Cartesian acquisition, see 24 . However, radially acquired k-space data is computationally more demanding to reconstruct, as the forward encoding operator involves gridding onto a Cartesian grid before the fast Fourier-transform can be applied, see 25 for more details. In 21 , an iterative network for a non-dynamic radial single-coil acquisition was proposed. However, the standard in the clinical routine is in fact given by multi-coil data-acquisition 26 , which further increases the computational complexity of modelling the acquisition of the k-space data.
Finally, reconstructing a dynamic process is inherently more computationally demanding, as the encoding operator has to be applied to the image at each time point.
In this work, we present a novel, computationally light and efficient CNN-block architecture as well as a training strategy which are tailored to image reconstruction of dynamic multi-coil radial MR data of the heart. The combination of the proposed network architecture and training strategy makes it possible to train the entire reconstruction network in an end-to-end manner, even for dynamic problems with non-uniformly sampled data and multiple receiver coils. To the best of our knowledge, this is the first work which overcomes these computational difficulties and presents such an end-to-end trainable network architecture for dynamic non-Cartesian multi-coil MR reconstruction problems.
The paper is structured as follows: In Section II., we formally introduce the reconstruction problem and the proposed method by discussing in more detail the used CNN-block as well as the training strategy. In Section III., we show extensive experiments to validate the efficacy of our approach and compare it to different state-of-the-art methods for dynamic cardiac radial MRI. We then discuss the main advantages and limitations of our work in Section IV. and conclude in Section V.

II. Materials and Methods

II.A. Problem Formulation
The forward operator A maps the dynamic cine MR image to its corresponding k-space data. In this work, we focus on a 2D radial encoding operator with multiple receiver coils. More precisely, the operator A is given by

A = (I_{N_c} ⊗ E) ∘ C,    (1)

where I_{N_c} denotes the identity operator and C consists of the concatenation of the N_c different coil-sensitivity maps which are multiplied with the cine MR image, i.e. C = [C_1, . . . , C_{N_c}]^T, with C_j = diag(c_j, c_j, . . . , c_j) ∈ C^{N×N} and c_j ∈ C^{N_x×N_y}. The operator E = diag(E_1, . . . , E_{N_t}) denotes a composition of 2D radial encoding operators E_t which for
each temporal point t ∈ {1, . . . , N_t} sample a 2D image x_t ∈ C^{N_x×N_y} along a non-Cartesian grid in the Fourier domain. In particular, for this work, we consider trajectories given by the radial golden-angle method 27 but point out that other trajectories can be considered as well. By J = {1, . . . , N_rad} we denote the set of indices of the k-space coefficients which are needed to sample a 2D image x_t at the Nyquist limit. In order to accelerate the image acquisition process, the set J is typically reduced by lowering the number of radial lines. We denote by A_I the radial Fourier encoding operator which samples a complex-valued 2D cine MR image x at a radial grid given by the indices in the set I = I_1 ∪ . . . ∪ I_{N_t} with I_t ⊂ J for all t = 1, . . . , N_t. Thus, the considered reconstruction problem is given by

A_I x = y_I,    (2)

where y_I = [y_I^1, . . . , y_I^{N_c}]^T with y_I^c ∈ C^{N_t · m_rad} denotes the measured k-space data for the dynamic 2D cine MR image x. Multiple receiver coils are typically used in order to achieve N_c · m_rad > N_rad, such that problem (2) is overdetermined and can be solved by considering the normal equations

A_I^H A_I x = A_I^H y_I.    (3)

System (3) could in principle be solved by conjugate gradient (CG)-like methods, yielding an approximation of the solution

x* = (A_I^H A_I)^{-1} A_I^H y_I,    (4)

where the configuration of the receiver coils ensures that the operator A_I^H A_I is invertible. However, for non-Cartesian trajectories, problem (3) is ill-conditioned. Thus, methods used to approximate x* can exhibit a semi-convergence behavior and lead to undesired noise-amplification 28 . Hence, regularization techniques must be used to stabilize the inversion process.
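The role of the normal equations (3) can be illustrated with a small toy example, in which a tall complex matrix stands in for the multi-coil operator A_I (for the actual radial encoding operator, A_I^H A_I is of course never formed explicitly; only matrix-vector products are evaluated):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for A_I: more measurements than unknowns, so that
# A^H A is (almost surely) invertible.
m, n = 64, 16
A = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
x_true = rng.standard_normal(n) + 1j * rng.standard_normal(n)
y = A @ x_true  # simulated, noise-free "k-space" data

# Normal equations (3): A^H A x = A^H y
x_star = np.linalg.solve(A.conj().T @ A, A.conj().T @ y)
```

In this noise-free, well-conditioned case, x_star recovers x_true; for the ill-conditioned non-Cartesian problem, iteratively solving (3) exhibits the semi-convergence behavior mentioned above, which is why regularization is needed.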
In this work, we propose a reconstruction network based on a fixed-point iteration of the form

x_{k+1} = f_DC(u_Θ(x_k)),    (5)

where u_Θ is a CNN-block which reduces undersampling artefacts and noise, and f_DC is a module which increases the data-consistency of the intermediate outputs of the CNN-blocks. Limits of (5) simultaneously satisfy the data-consistency induced by the DC-module and the regularity induced by the CNN-block. For the special case where f_DC is given by the minimizer of a regularized least squares functional (see (14)) and where u_Θ defines a linear projection, the fixed-point iteration (5) can be shown to converge to the minimizer of the functional

F(x) = ||A_I x − y_I||_2^2 + λ ||(Id − u_Θ)(x)||_2^2.    (6)

Note that, similar to 17 but in contrast to 13 , 16 , 14 , the set of parameters Θ is the same for each CNN-block and therefore in general allows a repeated application of the iterative network u_Θ^M. Further, the proposed CNN-block differs from the one in 17 , as we shall discuss later. Figure 1 shows an illustration of the described reconstruction network. In the following, we describe the CNN-block u_Θ and the DC-module f_DC^λ in more detail.
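The structure of the unrolled scheme (5) can be sketched with a toy example. Here, the learned CNN-block u_Θ is replaced by a simple linear shrinkage and the DC-module by the exact minimizer of a regularized least squares problem, so the sketch only illustrates the alternation and the weight sharing, not the actual learned components:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, lam, M = 48, 12, 1.0, 20
A = rng.standard_normal((m, n))
y = A @ rng.standard_normal(n)

H = A.T @ A + lam * np.eye(n)  # system matrix of the DC-module
Aty = A.T @ y

def u_theta(x):
    # Placeholder "CNN-block": a fixed linear shrinkage standing in for a
    # learned denoiser; the same parameters are used in every iteration.
    return 0.9 * x

def f_dc(z):
    # DC-module: exact minimizer of ||A x - y||^2 + lam * ||x - z||^2.
    return np.linalg.solve(H, Aty + lam * z)

x = np.zeros(n)
diffs = []
for _ in range(M):  # M applications of the same pair of blocks
    x_new = f_dc(u_theta(x))
    diffs.append(np.linalg.norm(x_new - x))
    x = x_new
```

Because Θ is shared across the blocks, increasing M at test time simply means applying the same pair of modules more often; for this toy contraction, the iterates converge to a fixed point balancing the data term and the "regularity" imposed by the shrinkage.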

II.B. Proposed CNN-Block
In order to keep the notation as simple as possible, we neglect the indices referring to the iteration within the network. We illustrate the CNN-block for the first iteration of the network, i.e. when the input of the CNN-block is the initial non-uniform FFT (NUFFT) reconstruction. The process described below is employed within every CNN-block in the proposed network architecture. Let W denote a diagonal operator which contains the entries of the density compensation function in k-space. By

x_I := A_I^♯ y_I := A_I^H W y_I    (7)

we denote the initial NUFFT-reconstruction which is obtained by re-gridding the measured and density compensated k-space coefficients onto a Cartesian grid and applying the inverse FFT (IFFT).
First, we compute a temporal average of the input over all cardiac phases, i.e.

ν_I = (1/N_t) Σ_{t=1}^{N_t} x_{I,t},    with x_{I,t} ∈ C^{N_x×N_y},

and stack it along the temporal axis, i.e. μ_I = [ν_I, . . . , ν_I] ∈ C^{N_x×N_y×N_t}. Then, we subtract μ_I from the initial NUFFT-reconstruction x_I and apply a temporal Fourier-transform F_t, i.e. we obtain z = F_t(x_I − μ_I). Subtracting the temporal average from image frames is a well-known and established approach to sparsify image sequences, for example in video-compression, see e.g. 29 . Further, applying a temporal FFT provides an even sparser representation and has been used for example in 30 , 18 for dynamic MR image reconstruction. Thus, the CNN-block learns to reduce the present undersampling artefacts and noise in a sparse domain. We define the operators R_xt and R_yt which rotate z by appropriately permuting the relevant axes of the images, i.e.

z_xt = R_xt(z),    z_yt = R_yt(z).

At this point, z_xt and z_yt can be interpreted as N_y complex-valued images of shape N_x × N_t and N_x complex-valued images of shape N_y × N_t, respectively. Then, a simple 2D U-Net 31 , which we denote by c_Θ, is applied to both z_xt and z_yt. Note that we use the same U-Net for both the xt- and the yt-branch, i.e. the weights are shared between the two. Also, note that because the U-Net only consists of convolutional and max-pooling layers and has no fully-connected layers, the approach can be used for the case N_x ≠ N_y as well. Then, we obtain

z'_xt = c_Θ(z_xt),    z'_yt = c_Θ(z_yt),

and calculate the estimate z_CNN by reassembling the processed spatio-temporal slices, i.e.

z_CNN = 1/2 (R_xt^H(z'_xt) + R_yt^H(z'_yt)).    (12)
The factor 1/2 in (12) is needed because every pixel is used twice, once in the xt-branch and once in the yt-branch. Finally, the estimate x_CNN is given by applying F_t^H and adopting a residual connection, i.e.

x_CNN = F_t^H(z_CNN) + μ_I.    (13)

The number of initially extracted feature maps, which is doubled after each max-pooling layer, is n_f = 16 and is purposely chosen to be relatively small in order to keep the model's complexity as low as possible. The reason for this is that the 2D U-Net has to be used together with a (computationally expensive) CG-block, which we describe in the next subsection. In the decoding path, bilinear upsampling followed by a 3 × 3 convolutional layer with no activation function is used to upsample the images.
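The operations of the proposed CNN-block can be summarized in the following NumPy sketch, with the U-Net replaced by a generic slice-wise function; the exact form of the residual connection, i.e. re-adding the temporal average after the inverse temporal FFT, is our reading of the construction above:

```python
import numpy as np

def cnn_block(x, c_theta):
    """Sketch of the proposed CNN-block u_Theta for a cine image x of shape
    (Nx, Ny, Nt); c_theta is the shared 2D network applied slice-wise."""
    # temporal average over all cardiac phases (broadcast along time)
    mu = x.mean(axis=-1, keepdims=True)
    # sparsify: subtract the average, then apply a temporal FFT
    z = np.fft.fft(x - mu, axis=-1)
    # "rotate" into Ny slices of shape (Nx, Nt) and Nx slices of (Ny, Nt)
    z_xt = np.transpose(z, (1, 0, 2))  # R_xt
    z_yt = z                           # R_yt: already Nx slices of (Ny, Nt)
    # the same 2D network is applied to both branches (shared weights)
    z_xt = np.stack([c_theta(s) for s in z_xt])
    z_yt = np.stack([c_theta(s) for s in z_yt])
    # reassemble; factor 1/2 because every voxel appears in both branches
    z_cnn = 0.5 * (np.transpose(z_xt, (1, 0, 2)) + z_yt)
    # inverse temporal FFT and residual connection (re-add the average)
    return np.fft.ifft(z_cnn, axis=-1) + mu

rng = np.random.default_rng(2)
x = rng.standard_normal((8, 6, 4)) + 1j * rng.standard_normal((8, 6, 4))
x_out = cnn_block(x, c_theta=lambda s: s)  # identity "network"
```

With an identity network, the decomposition and reassembly steps cancel exactly and the block reproduces its input, which also makes the role of the factor 1/2 in (12) apparent.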

II.C. DC-Module
We illustrate the DC-module for the first iteration of the network, i.e. where the input of the CNN-block is given by the initial reconstruction. For later iterations, the construction is analogous. Let us fix x_CNN = u_Θ(x_I) and consider

min_x ||A_I x − y_I||_2^2 + λ ||x − x_CNN||_2^2.    (14)

For the special case where the operator A is an isometry, i.e. A^H A = Id, problem (14) has a closed-form solution, see e.g. 13 , 16 . A simple example of A being an isometry is given by a single-coil acquisition on a Cartesian grid using a simple FFT. In this case, the solution of (14) can be obtained by performing a linear combination of the measured k-space data y_I and the one estimated by applying A_I to x_CNN.
See e.g. 13 for more details. In the more general case, however, a minimizer of (14) is obtained by solving a system of linear equations. By setting the derivative of (14) with respect to x to zero, it can easily be seen that the system one needs to solve is given by

(A_I^H A_I + λ Id) x = A_I^H y_I + λ x_CNN.    (15)

System (15) can be solved by means of any iterative algorithm and, since the operator H := A_I^H A_I + λ Id is symmetric (and positive definite), an appropriate choice is the conjugate gradient (CG) method 32 . Note that functional (14) is strongly convex in x and therefore, solving (15) yields the unique minimizer of (14). In practice, as we discuss later in the implementation details, the DC-module is an implementation of a finite number of iterations of the CG-method. We denote the number of such iterations by n_CG. Note that in the CG method, the operator H has to be applied at each iteration. In addition, note that in general, the application of A_I as well as A_I^H can be computationally far more expensive than a simple FFT, for example if the gridding of the k-space coefficients is part of the operators, as in our case. Thus, it is desirable to have a CNN-block with as few trainable parameters as possible such that end-to-end training of the entire network is possible in a reasonable amount of time.
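A minimal sketch of the DC-module is a plain CG iteration for system (15), here with a small random complex matrix standing in for A_I. Note that, as in the actual implementation, only products with H are required, never H itself:

```python
import numpy as np

def cg(H_apply, b, x0, n_iter, tol=1e-10):
    """Plain conjugate gradient for H x = b, with H Hermitian positive
    definite; in the DC-module, H = A_I^H A_I + lam * Id."""
    x = x0.copy()
    r = b - H_apply(x)
    p = r.copy()
    rs = np.vdot(r, r).real
    for _ in range(n_iter):
        if np.sqrt(rs) <= tol * np.linalg.norm(b):
            break  # stopping criterion ||r_k|| <= TOL * ||b||
        Hp = H_apply(p)
        alpha = rs / np.vdot(p, Hp).real
        x = x + alpha * p
        r = r - alpha * Hp
        rs_new = np.vdot(r, r).real
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Toy stand-in for the radial encoding operator A_I.
rng = np.random.default_rng(4)
A = rng.standard_normal((40, 10)) + 1j * rng.standard_normal((40, 10))
y = rng.standard_normal(40) + 1j * rng.standard_normal(40)
x_cnn = rng.standard_normal(10) + 1j * rng.standard_normal(10)
lam = 0.5

H_apply = lambda v: A.conj().T @ (A @ v) + lam * v  # only products with H
b = A.conj().T @ y + lam * x_cnn                    # right-hand side of (15)

x_cg = cg(H_apply, b, np.zeros(10, complex), n_iter=50)
x_ref = np.linalg.solve(A.conj().T @ A + lam * np.eye(10), b)
```

For this small, well-conditioned system, the CG iterates match the direct solve; in the reconstruction network, each application of H_apply involves the (expensive) NUFFT forward and adjoint operators.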

II.D. Training Scheme
Because the solution of (15) has to be approximated using an iterative scheme which employs the (computationally expensive) application of the operator H at each iteration, end-to-end training of the entire reconstruction network from scratch would be time-consuming. We circumvent this issue with the following, more efficient training strategy.
First, in a pre-training step, we only train a single CNN-block on image pairs, as is typically done in Deep Learning-based post-processing methods. More precisely, we minimize the L2-norm of the error between the output of the CNN-block predicted from the initial reconstruction x_I and its corresponding label. Then, in a second training stage, we construct a network as in (5), initialize each of the CNN-blocks with the previously obtained parameter set Θ and fine-tune the entire network by end-to-end training. Further, the regularization parameter λ contained in each of the CG-blocks f_DC^λ can be included in the set of trainable parameters as well and is trained to find the optimal strength of the contribution of the regularization term. This means that the network implicitly learns to estimate the noise-level present in the measured k-space data for the whole dataset.

II.E. Experiments with In-Vivo Data
To evaluate the efficacy of our proposed approach, we performed different experiments. First, we investigated the effect of our proposed training strategy. We also compared the proposed network with different configurations of the hyper-parameters M and n_CG. Finally, we compared our proposed approach to several other methods for non-Cartesian cardiac cine MR image reconstruction. In the following, we provide the reader with information about the dataset used, the methods of comparison and some details on the implementation of the proposed method.

II.F. Dataset
We used a dataset of cine MR images of n = 19 subjects (15 healthy volunteers + 4 patients). For each healthy volunteer as well as for two of the patients, N_z = 12 different orientations of cine MR images were acquired. For the remaining two patients, only N_z = 6 slices were acquired due to limited breathhold capabilities. Thus, we have a total of 216 complex-valued cine MR images of shape N_x × N_y × N_t = 320 × 320 × 30. We split the dataset into 144/36/36 images used for training, validation and testing, where for the test set, we use the 36 cine MR images of the patients in order to be able to qualitatively assess the images with respect to clinically relevant features. All images were acquired using a bSSFP sequence on a 1.5 T MR scanner during a single breathhold of 10 seconds. The images used as ground-truth data for the retrospective undersampling were reconstructed with kt-SENSE 33 from k-space data sampled along N_θ = 3400 radial spokes. From these images, we retrospectively simulated k-space data by sampling N_θ = 1130 and N_θ = 560 radial spokes. Since sampling along N_θ = 3400 spokes already corresponds to an acceleration factor of approximately ∼3 (with respect to the Nyquist limit), which was needed to perform the scan during one breathhold, N_θ = 1130 and N_θ = 560 correspond to acceleration factors of ∼9 and ∼18, respectively.
Further, the k-space data was corrupted by normally distributed noise with standard deviation σ = 0.02, which was added to both the real and the imaginary parts of y_I^c for c = 1, . . . , 12 after having centred them. The calculation of the density compensation function for a fixed set of trajectories was based on partial Voronoi diagrams 34 . The coil sensitivity maps were estimated from the fully sampled central k-space region.

II.G. Quantitative Measures
We evaluated the performance of our proposed network architecture in terms of different error- and image-similarity-based measures, including the peak signal-to-noise ratio (PSNR).

II.H. Implementation Details
The architecture was implemented in PyTorch. Complex-valued images were stored as two-channel images. The forward and the adjoint operators A_I and A_I^H were implemented using the publicly available library Torch KbNufft 40 , 41 , which also allows one to perform backpropagation across the forward and adjoint operators. During training, the k-space trajectories, the coil-sensitivity maps as well as the density compensation functions were stored as tensors for the implementation of A_I and A_I^H. In 13 , 16 , 42 , for example, where closed-form formulas are given for the DC-block, the measured k-space data as well as the masks are used as inputs to the DC-blocks. Note that we make the regularization parameter λ trainable such that a trade-off between the measured k-space data and the output of the CNN-blocks is learned. In order to constrain λ to be strictly positive during the fine-tuning of the model, we apply a Softplus activation-function, i.e. we set λ := SoftPlus(λ̃) = (1/β) log(1 + exp(β λ̃)), which maps λ̃ to the interval (0, ∞). We used the default parameter β = 1. The implementation of the encoding operators A_I, A_I^H as well as the proposed CNN-block u_Θ and the entire reconstruction network u_Θ^M will be made available on https://github.com/koflera/DynamicRadCineMRI after peer-review. Note that a CG scheme is usually stopped after a certain stopping-criterion is met. A commonly used stopping criterion is that the norm of the newly calculated residual r_k is small enough, i.e. ||r_k||_2 ≤ TOL · ||b||_2 for a tolerance TOL chosen by the user.
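The positivity constraint on λ can be sketched as follows (NumPy; using log1p for numerical stability is an implementation choice of this sketch):

```python
import numpy as np

def softplus(lam_tilde, beta=1.0):
    # Maps the unconstrained trainable parameter lam_tilde to (0, inf),
    # guaranteeing a strictly positive regularization weight lam.
    return np.log1p(np.exp(beta * lam_tilde)) / beta

# strictly positive even for negative inputs; approximately the
# identity for large positive inputs
lam_small = softplus(-5.0)
lam_large = softplus(5.0)
```

The Softplus thus never clips λ to exactly zero, so the DC-module always retains some contribution from the CNN-output.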
While fine-tuning the iterative network, we fix the number of CG-iterations n_CG, but when testing the network on unseen data, we can either use the number of iterations the CNN was fine-tuned with or define a different stopping-criterion. All experiments were performed on an NVIDIA GeForce RTX 2080 with 11 GB memory.

II.I. Comparison to Other Methods
We compared our proposed approach to the following methods, which employ both recently published learned and well-established non-learned regularization. As well-established reconstruction methods, we applied iterative SENSE 43 , a Total Variation (TV)-minimization method 44 and kt-SENSE 30 . Further, we compared our proposed method to a method based on dictionary learning (DL) and sparse coding (SC) 45 , 46 using adaptive DL and adaptive SC 47 , and to a method which employs CNN-based regularization in the form of previously obtained image priors 9 , 10 , where a previously trained 3D U-Net 6 was used to obtain the prior.
Note that we purposely did not compare our proposed approach to other CNN-based methods involving generative adversarial networks (GANs). The reason is that we are mainly interested in the performance of the proposed CNN-block in terms of artefact-reduction as well as in the trade-off between employing a CG-module or not. In addition, note that, if the hardware allows it, it is always possible to add further components based on GANs to regularize the outputs of the CNN-blocks. Further, note that although there exist several other state-of-the-art methods using cascaded/iterative networks for dynamic cine MRI, see e.g. 13 , 16 , 18 , the underlying reconstruction problem is a different one (single-coil and Cartesian vs. multi-coil and radial) and thus these methods are not directly applicable as originally published.

III. Results
In the following, we report the obtained results concerning the training behavior and the reconstruction performance of our proposed method.

III.A. Computational Complexity of the Forward and Adjoint Operators
Here, we evaluated the computational complexity of the proposed network architecture in terms of GPU memory requirements and computation times. The number of temporal points was set to N_t = 30 and the number of coils to N_c = 12. Figure 3 shows the allocated GPU memory as well as the time required to perform one weights-update, i.e. one forward- and one backward-pass through the entire reconstruction network. By N_θ we denote the total number of radial spokes which are acquired for each of the N_t cardiac phases. The first row of Figure 3 shows the average allocated GPU memory as well as the time required to perform one weights-update depending on the spatial image size. This is shown for n_CG = 1 and different numbers of radial spokes N_θ.
It can be seen that already for n_CG = 1, the required GPU memory amounts to approximately 512 MB - 1024 MB and performing one step of back-propagation takes on the order of 4 seconds. The second row of Figure 3 shows the same quantities for fixed N_θ = 560 and different n_CG. Here, we also see that employing a relatively high number of CG-iterations, say n_CG = 12, requires almost 2 GB of GPU memory and, more importantly, more than 30 seconds per weights-update. Training the entire network in this configuration for 100 epochs would, for example, already require about 5 days. From this, one can estimate that training the entire network from scratch in an end-to-end manner could easily take weeks or months. Further, the required time does not vary much for different image sizes, meaning that even for relatively small reconstruction problems, the application of A_I and A_I^H is inherently computationally demanding.
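The 5-day estimate follows from the size of the training set (144 images, see Section II.F.), under the assumption of one weights-update per image and roughly 30 seconds per update:

```python
# back-of-the-envelope estimate for end-to-end training with n_CG = 12:
# 144 training images, one weights-update per image, ~30 s per update
updates_per_epoch = 144
seconds_per_update = 30
epochs = 100

days = updates_per_epoch * seconds_per_update * epochs / (60 * 60 * 24)
print(days)  # 5.0
```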
These computational aspects highlight how important it is for the CNN-block to have a small number of parameters, ensuring fast convergence during training while at the same time maintaining good performance in terms of undersampling artefact-reduction.

III.B. Efficacy of the Training Scheme
Here, we show the impact of including the forward and adjoint operators A_I and A_I^H in the network architecture during the learning process. Figure 4 shows reconstructions obtained at the different training stages, where (C) is the output of the pre-trained CNN-block, (E) the initial NUFFT-reconstruction x_I obtained from N_θ = 560 radial spokes and (F) the corresponding ground-truth image obtained by kt-SENSE from N_θ = 3400 radial spokes. We see that, after having fine-tuned the entire network, the point-wise error of the final reconstruction is the smallest. Further, the cardiac motion is much better preserved, as is pointed out by the yellow arrows. The fine-tuned network also yielded the best results in terms of all reported measures.

III.C. Variation of the Hyper-Parameters M and n CG
As already mentioned, at test time, one does not necessarily need to stick to the configuration of the network in terms of the length M and the number of CG-iterations n_CG which were used for fine-tuning the network. In particular, for the fine-tuning stage, the choice of M and n_CG is mainly driven by factors such as training times and hardware constraints, which do not play a role at test time.
Thus, we performed a parameter study where, at test time, we varied the length of the network M as well as the number of CG-iterations n_CG.

III.D. Reconstruction Results
Here, we compare our reconstruction method to the previously introduced image reconstruction methods. As discussed in Subsection III.C., we chose M = 12 and n_CG = 4, although the network was trained with M = 1 and n_CG = 8. Note that, since the same strategy did not seem to be consistently useful for the 3D U-Net, which was trained without the integration of the encoding operators, we report here the values for M = 1 and n_CG = 12, which are also the ones used in 10 . In the supplementary material, the results for the variation of M and n_CG can be found as well. The regularization parameter for the dictionary learning method 47 and the CNN-based regularization method 10 was chosen as λ = 1. Figure 6 shows an example of the results obtained with the previously described methods of comparison and our proposed approach. As can be seen, the Total Variation-minimization method as well as kt-SENSE successfully removed the undersampling artefacts but also led to a loss of image details, as indicated by the yellow arrows. In contrast, all learning-based methods yielded a good reconstruction performance in terms of preservation of the cardiac motion. Our proposed method is in addition the one which best reduced residual noise in the images. Table 2 shows the results achieved in terms of the chosen measures, with the best results highlighted in bold. Again, the experiments were repeated for two different undersampling (i.e. acceleration) factors, given by sampling the k-space along N_θ = 560 and N_θ = 1130 radial spokes, respectively.
The numbers reflect the observations from Figure 6 well: all methods employing learning-based regularization consistently outperform the methods using hand-crafted regularization with respect to all measures. All three reported methods using machine learning-based regularization achieve competitive results, where our proposed method yields substantially better results than the dictionary learning (DL)-based method and the other CNN-based method in terms of error-based measures. In terms of image similarity-based measures, the difference between the dictionary learning-based method and ours becomes less prominent, except for VIQP. Interestingly, the 3D U-Net-based method is consistently surpassed either by the dictionary learning-based method or by our proposed one. All observations are consistent across both acceleration factors, N_θ = 560 and N_θ = 1130.

IV. Discussion
In this work, we have proposed the first end-to-end trainable iterative reconstruction network for dynamic multi-coil MR image reconstruction using non-uniform sampling schemes. As we have seen, the proposed end-to-end trainable reconstruction network provides a competitive method for 2D cine MR image reconstruction with non-Cartesian multi-coil encoding operators. In the following, we highlight several advantages and limitations of our proposed approach and put it in relation to other works.

IV.A. End-To-End Trainability
Since the considered forward model is computationally demanding, methods such as 9 , 10 can be used as an alternative, where the generation of the CNN-output and the step of increasing data-consistency are decoupled from each other. However, the major advantage of our proposed network over methods similar to 9 or 10 is that the reconstruction network is trained in an end-to-end manner. As can be seen in Figure 4 in Subsection III.B., including the DC-module given by the CG-method in the network architecture is highly beneficial, since it further reduces the training- and validation-error and also reduces the gap between the two, leading to a better generalization power. This result experimentally confirms the theory on the achievable performance of the reconstruction network derived in 12 . Further, we demonstrated that the proposed training strategy is a viable option for training the entire network in an end-to-end manner in a relatively short amount of time while achieving good convergence properties of the network parameters.

IV.B. The Choice of the CNN-Block
As we have seen in Figure 4, the proposed CNN-block's trainable parameters converge relatively fast (in approximately 6 hours) to a set of suitable values in the pre-training stage. This is because the CNN-block reduces the problem from 3D to 2D by the application of 2D convolutional layers in the temporally Fourier-transformed spatio-temporal domain. Thus, it inherits all benefits of the method presented in 7.
The most important property is that, due to this change of perspective on the data, N x + N y samples are considered for each single cine MR image when training the CNN-block, which is essential for training. Since a relatively small amount of training data in terms of number of subjects already suffices for successful training, the end-to-end training, which is particularly computationally expensive during the fine-tuning stage, can be carried out in a reasonable amount of time without the need to additionally augment the dataset in order to prevent overfitting.
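The change of perspective described above can be sketched as follows: after a temporal FFT, a 3D cine series of shape (Nx, Ny, Nt) is cut into 2D spatio-temporal slices along each spatial axis, yielding Nx + Ny training samples per image. The sizes below are hypothetical toy values.

```python
import numpy as np

# Hypothetical sizes for one 2D cine MR series: Nx x Ny pixels, Nt cardiac phases.
Nx, Ny, Nt = 6, 8, 10
x = np.random.default_rng(1).standard_normal((Nx, Ny, Nt))

# FFT along the temporal axis: the CNN-block operates in the
# temporally Fourier-transformed spatio-temporal domain.
xf = np.fft.fft(x, axis=-1)

# Extract 2D spatio-temporal slices: Nx slices of shape (Ny, Nt) along x,
# plus Ny slices of shape (Nx, Nt) along y.
slices_x = [xf[i, :, :] for i in range(Nx)]
slices_y = [xf[:, j, :] for j in range(Ny)]
samples = slices_x + slices_y       # Nx + Ny samples from a single cine image
```

Because slicing and re-assembling are simple, differentiable reshaping operations, they can live inside the network graph, which is what permits end-to-end training.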
Note that, of course, if enough training data is available and the hardware constraints allow it, training the network with computationally heavier CNN-blocks is possible as well.
Thus, the proposed method can also be seen as a general strategy for training an end-to-end reconstruction network for large-scale image reconstruction problems with computationally demanding forward operators. It remains somewhat unclear why exactly this is possible. We believe that the reason lies in the end-to-end training of the entire network, but leave a rigorous theoretical investigation and a convergence analysis as future work.

IV.E. Differences and Similarities to Other Works
Our presented approach shares similarities with different works. First, for the case M = 1 it can be seen as an extension of the approaches presented in 9, 7, 48 in the sense that we generate only one CNN-based image prior which is used in a Tikhonov functional. However, because of the proposed end-to-end training strategy, the obtained CNN-based image prior tends to be a much better estimate of the ground-truth image compared to the case where the training of the CNN-block is decoupled. Further, for M > 1, the structure of the network is similar to 13, 17, with the difference that, first, the considered inverse problem is different (radial multi-coil instead of Cartesian single-coil) and thus the DC-module is a CG-block instead of the implementation of a closed-form solution to (14). Second, our CNN-block consists of a spatio-temporal 2D U-Net which is applied in the Fourier-transformed spatio-temporal domain.
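The unrolled structure for M > 1 can be summarized in a few lines: starting from the adjoint ("gridded") reconstruction, the network alternates M times between a CNN-block and a CG-based DC-block. The sketch below is a hypothetical toy version; the dense matrix A stands in for the encoding operator, the shrinkage function for the trained spatio-temporal U-Net, and the direct solve for the CG-block.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 12
A = rng.standard_normal((30, n))   # toy stand-in for the encoding operator
y = rng.standard_normal(30)        # simulated measured data
lam = 0.1                          # assumed regularization weight

def cnn_block(x):
    # Placeholder for the trained spatio-temporal U-Net:
    # a simple shrinkage toward zero, purely for illustration.
    return 0.9 * x

def dc_block(x_prior):
    # Data-consistency step: minimize ||A x - y||^2 + lam ||x - x_prior||^2.
    # A direct solve replaces the CG-block here for brevity.
    H = A.T @ A + lam * np.eye(n)
    return np.linalg.solve(H, A.T @ y + lam * x_prior)

# Unrolled scheme: M alternations of CNN-block and DC-block,
# initialized with the adjoint reconstruction A^T y.
x = A.T @ y
M = 4
for _ in range(M):
    x = dc_block(cnn_block(x))
```

Since the loop is just repeated application of the same two blocks with shared weights, M can be changed at test time without retraining, which is the property investigated in the experiments.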
In our work, the 2D U-Nets are applied in the temporally Fourier-transformed spatio-temporal domain and thus use the same change of perspective on the data as in 7. However, in 7, the slice-extraction and re-assembling process is not part of the network architecture and thus does not allow end-to-end training. Further, we apply the U-Net after having performed a temporal FFT, similar to 18.
In 49, where a non-Cartesian multi-coil dynamic acquisition is considered, the measured k-space data is first interpolated onto a Cartesian grid. After this interpolation, a simple FFT can be used as the forward operator, which facilitates the construction of an iterative network.
However, because the k-space data interpolation is decoupled from the network, the network cannot learn to compensate for the interpolation errors. Thus, in order to fully utilize the measured k-space data, applying the actual encoding operator (which involves the gridding step) associated with the considered image reconstruction problem is unavoidable.
In contrast, in our approach, the gridding of the measured k-space data is part of the network architecture. This important difference is the main motivation of this work: because of the computational difficulties linked with the integration of the non-uniform FFT into the network architecture, a computationally light CNN-block has to be used.
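The reason the non-uniform encoding operator can be embedded in the network at all is that its forward and adjoint form a proper adjoint pair, which the CG-based normal equations rely on. The sketch below illustrates this with a direct (slow) 1D non-uniform DFT matrix as a hypothetical stand-in for the NUFFT, and verifies the adjointness relation <E x, s> = <x, E^H s>.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 16                          # image size (1D for brevity; hypothetical)
K = 24                          # number of non-uniform k-space samples
k = rng.uniform(-0.5, 0.5, K)   # non-Cartesian sample positions (cycles/pixel)
grid = np.arange(N)

# Explicit non-uniform DFT matrix: a direct, slow stand-in for the NUFFT.
E = np.exp(-2j * np.pi * np.outer(k, grid))

def forward(x):
    """Image -> non-Cartesian k-space samples."""
    return E @ x

def adjoint(s):
    """Non-Cartesian k-space samples -> image (the 'gridding' adjoint)."""
    return E.conj().T @ s

# Adjointness check needed for the CG normal equations: <E x, s> == <x, E^H s>.
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)
s = rng.standard_normal(K) + 1j * rng.standard_normal(K)
lhs = np.vdot(forward(x), s)
rhs = np.vdot(x, adjoint(s))
```

In practice the dense matrix is replaced by an NUFFT (apodization, oversampled FFT, and interpolation), whose cost is exactly what makes the end-to-end training computationally demanding.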
Although the method shares features with other works, this is - to the best of our knowledge - the first work to combine several components into a network architecture which is trainable in an end-to-end manner for a dynamic non-uniform multi-coil MR image reconstruction problem by actually using the radially acquired data rather than data interpolated onto a Cartesian grid. In fact, we believe that the large scale of the considered problem is the reason for the lack of end-to-end trainable reconstruction networks for dynamic non-Cartesian data-acquisition protocols using multiple receiver coils.

V. Conclusion
In this work, we have proposed a new end-to-end trainable data-consistent reconstruction network for accelerated 2D dynamic MR image reconstruction with non-uniformly sampled data using multiple receiver coils. Further, since end-to-end training is computationally expensive because the forward and adjoint operators are included in the network, we have proposed and investigated an efficient training strategy to circumvent this issue.
In addition, we have compared our method to other well-established iterative reconstruction methods as well as several methods based on learned regularization. Our proposed method surpassed all methods using non-learned regularization and achieved competitive results compared to a dictionary learning-based method and a method based on a 3D U-Net.
Table 2 shows an analogous comparison for our proposed method, with the difference that it was trained with M = 1 and n CG = 8 and the CNN-block consists of a U-Net with 93 617 trainable parameters. Also for this hyper-parameter configuration, increasing M generally yields better results. This result is important, as having M = 1 is in general computationally more manageable, since the CG-block only has to be applied once. Thus, the results shown here suggest that it is possible to change the configuration of M and n CG at test time without the need of having M > 1 during the fine-tuning stage. Note that, in contrast, for the 3D U-Net, which was simply trained on input-output pairs without the integration of the physical model, increasing M at test time seems not to be particularly beneficial. This comparison shows that sacrificing expressiveness of the CNN-block in order to be able to fine-tune with M > 1 might not be necessary; see the discussion in the paper for more details. Table 3 shows a similar variation of M and n CG for the 3D U-Net. Interestingly, increasing M while keeping n CG = 4 does not consistently improve the results. Thus, we attribute the possibility to increase M at test time to the fact that, in our proposed method, the forward model is part of the reconstruction network during training.
Table 3: Quantitative results for the 3D U-Net for which we also varied M and n CG .