No, the process is done the other way around. Instead of registering all the subframes to the reference frame, you register the reference frame to all the subframes.
Let me elaborate a bit. We start with the set of images (unregistered, CFA). We select one of them as the reference and register all the images after debayering, as usual, but instead of applying the transformations we just save the transformation matrices (alternatively, we may apply them and average the results to get a new reference image with a higher SNR). Now we create a first approximation to the ideal image. It could simply be the reference image upsampled by a factor of 2. The goal of the inverse problem is to minimize a certain potential that contains two terms: a data fidelity term and a regularization term.
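One possible way to write that potential down, purely as a sketch: all the symbols here are my own notation and not taken from any particular implementation. X is the ideal image, Y_k the k-th captured CFA frame, T_k the saved translation/rotation for that frame, D the integer downsampling by averaging, M the CFA subsampling, R some regularizer and lambda its weight.

```latex
E(X) = \sum_{k=1}^{N} \left\| M \, D \, T_k \, X - Y_k \right\|_1 + \lambda \, R(X)
```

The first term is the data fidelity (total absolute difference), the second is the regularization.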
The data fidelity term is the total absolute difference between the ideal image and each of the captured frames. To compute that term, we have to translate/rotate the ideal image (RGB, normal color), then downsample it (integer resample, averaging), and finally subsample it according to the CFA pattern. We end up with several greyscale, Bayer-mosaiced images that are the projections of the ideal image onto each captured RAW CFA frame. Then we calculate the difference between each projection and the corresponding captured frame.
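The projection described above could be sketched roughly like this with NumPy. This is only an illustration under several assumptions of mine: an RGGB pattern, bilinear interpolation for the warp, and a 2x2 matrix plus offset that map output pixel coordinates (row, column) to coordinates in the ideal image. All function names are hypothetical.

```python
import numpy as np

def warp(rgb, matrix, offset):
    """Translate/rotate an RGB image with bilinear interpolation.
    Assumption: output pixel (y, x) samples the input at matrix @ (y, x) + offset."""
    matrix = np.asarray(matrix, dtype=float)
    h, w, _ = rgb.shape
    yy, xx = np.mgrid[0:h, 0:w].astype(float)
    sy = np.clip(matrix[0, 0] * yy + matrix[0, 1] * xx + offset[0], 0, h - 1)
    sx = np.clip(matrix[1, 0] * yy + matrix[1, 1] * xx + offset[1], 0, w - 1)
    y0 = np.minimum(np.floor(sy).astype(int), h - 2)
    x0 = np.minimum(np.floor(sx).astype(int), w - 2)
    fy = (sy - y0)[..., None]
    fx = (sx - x0)[..., None]
    top = rgb[y0, x0] * (1 - fx) + rgb[y0, x0 + 1] * fx
    bottom = rgb[y0 + 1, x0] * (1 - fx) + rgb[y0 + 1, x0 + 1] * fx
    return top * (1 - fy) + bottom * fy

def bin_down(rgb, factor=2):
    """Integer downsampling by averaging factor x factor blocks."""
    h, w, _ = rgb.shape
    h2, w2 = h // factor, w // factor
    cropped = rgb[:h2 * factor, :w2 * factor]
    return cropped.reshape(h2, factor, w2, factor, 3).mean(axis=(1, 3))

def mosaic_rggb(rgb):
    """Keep one channel per pixel, following an RGGB pattern (assumed)."""
    cfa = np.empty(rgb.shape[:2], dtype=rgb.dtype)
    cfa[0::2, 0::2] = rgb[0::2, 0::2, 0]  # red
    cfa[0::2, 1::2] = rgb[0::2, 1::2, 1]  # green
    cfa[1::2, 0::2] = rgb[1::2, 0::2, 1]  # green
    cfa[1::2, 1::2] = rgb[1::2, 1::2, 2]  # blue
    return cfa

def project(ideal, matrix, offset, factor=2):
    """Project the ideal RGB image onto one captured CFA frame."""
    return mosaic_rggb(bin_down(warp(ideal, matrix, offset), factor))

def data_fidelity(ideal, frames, transforms, factor=2):
    """Total absolute difference between the projections and the captured frames."""
    return sum(np.abs(project(ideal, m, o, factor) - f).sum()
               for f, (m, o) in zip(frames, transforms))
```

Here `frames` would be the list of raw CFA frames and `transforms` the list of saved (matrix, offset) pairs, one per frame, expressed in pixels of the upsampled image.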
The regularization step may be understood simply as a noise reduction filter. In fact, the expectation-maximization algorithm is just these two steps (data fidelity and then noise reduction). You alternate the two steps until convergence is reached, which yields the optimal image for the chosen potential terms.
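To make the alternation concrete, here is a deliberately simplified toy, not the real algorithm: greyscale frames, no registration and no CFA, with the projection reduced to 2x2 averaging. The data fidelity step is a subgradient step on the total absolute difference with a shrinking step size, and the noise reduction is a mild neighbour average. Those choices and all the names are assumptions of mine, picked only to keep the sketch short.

```python
import numpy as np

def bin2(x):
    """Toy projection: 2x2 block averaging (no warp, no CFA)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def bin2_adjoint(r):
    """Adjoint of bin2: spread each value over its 2x2 block, divided by 4."""
    return np.repeat(np.repeat(r, 2, axis=0), 2, axis=1) / 4.0

def smooth(x, strength=0.1):
    """Mild noise reduction: blend each pixel with the mean of its 4 neighbours."""
    p = np.pad(x, 1, mode="edge")
    neigh = (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]) / 4.0
    return (1 - strength) * x + strength * neigh

def reconstruct(frames, step=1.0, strength=0.1, iters=200, tol=1e-6):
    """Alternate a data fidelity step and a noise reduction step."""
    # First approximation: the first frame upsampled 2 times.
    x = np.repeat(np.repeat(frames[0], 2, axis=0), 2, axis=1).astype(float)
    for k in range(iters):
        prev = x
        # Data fidelity: subgradient of sum_k |bin2(x) - frame_k|.
        grad = sum(bin2_adjoint(np.sign(bin2(x) - f)) for f in frames)
        x = x - (step / (k + 1)) * grad / len(frames)
        # Regularization, treated as a noise reduction filter.
        x = smooth(x, strength)
        # Stop once an iteration no longer changes the image.
        if np.abs(x - prev).max() < tol:
            break
    return x
```

For the real problem you would replace `bin2` and `bin2_adjoint` with the full warp, downsample and CFA projection and its adjoint, and the smoothing with whatever noise reduction you prefer.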