Having said that, the structure of the calibration steps has been modeled on the behaviour of CCD data, to address unwanted signals and data corrections. This is closely tied to the imaging model.
Photons are captured in each pixel cell, which is physically a semiconductor. This semiconductor is built from two differently doped materials, one with an excess of electrons and the other with a deficit of them. There are also other materials that create the barriers defining the cells, the charge drainage, etc. None of these are perfect, and they have variations. These variations generate (among others) two effects: pixels have different sensitivity, and their thermodynamic properties also vary from pixel to pixel (so the dark current is pixel-dependent).
After the acquisition, we have a well filled (depending on the signal intensity) with electrons that were generated both by photon hits and by spontaneous jumps caused by heat (and the two are indistinguishable). This charge needs to be moved to an analog-to-digital converter. In CCD cameras, this is done by shifting the charges from one pixel to the next, usually along columns (where the potential barriers are weaker and controllable), and then conducting them sequentially to one or more amplifiers. This charge transfer may also generate voltage differences that are consistent along columns. The amplifiers, in turn, introduce some random fluctuations, which translate into readout error. Furthermore, the voltage/signal is read in such a way that reference voltages are needed, so a pedestal value is incorporated.
At the end, we have a measured signal that is in reality a mix of electrons generated by photons and by many other sources. Without loss of generality, we may model this as: z = K*x + y
where we may identify (at least) two different sources affected by those multiplicative, pixel-by-pixel dependent effects. And so we have: z = K1*p + K2*t + r
where p stands for the photon-induced electrons, t for the thermally generated ones, and r for all the readout processes.
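As a toy illustration of this model (all numbers, shapes, and array names here are invented for the sketch, not taken from any real sensor), we can simulate how a measured frame mixes these sources:

```python
import numpy as np

rng = np.random.default_rng(0)
shape = (4, 4)  # a tiny toy sensor

# Hypothetical per-pixel multiplicative factors (sensitivity variations).
K1 = 1.0 + 0.05 * rng.standard_normal(shape)  # photon response
K2 = 1.0 + 0.10 * rng.standard_normal(shape)  # dark-current response

p = rng.poisson(1000.0, shape)                # photon-induced electrons
t = rng.poisson(50.0, shape)                  # thermally generated electrons
r = 100.0 + 5.0 * rng.standard_normal(shape)  # pedestal + readout noise

# Measured signal, following the model above: z = K1*p + K2*t + r
z = K1 * p + K2 * t + r
```

Note that once z is formed, p and t are mixed per pixel and cannot be separated from a single frame, which is exactly why the calibration frames below are needed.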
Fortunately, we can measure these processes. If we take a short enough exposure without light hitting the sensor, we measure the readout signal (bias frames). If we take long exposures in the dark, we measure both the readout and thermal signals (dark frames).
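A minimal sketch of how such calibration frames are usually combined, assuming they are available as 2-D NumPy arrays (the function name is my own; a per-pixel median is a common choice because it rejects outliers such as cosmic-ray hits):

```python
import numpy as np

def master_frame(frames):
    """Combine several calibration frames into one master frame
    via a per-pixel median, which rejects outlier values."""
    return np.median(np.stack(frames), axis=0)

# Hypothetical inputs: lists of frames read from the camera.
bias_frames = [np.full((2, 2), 100.0) for _ in range(5)]
dark_frames = [np.full((2, 2), 160.0) for _ in range(5)]

master_bias = master_frame(bias_frames)                 # readout + pedestal only
master_dark = master_frame(dark_frames) - master_bias   # thermal signal only
```

Subtracting the master bias from the master dark isolates the purely thermal contribution, matching the additive model above.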
What we are left with is only the photonic source (plus noise, of course).
All that is left is to model the pixel-to-pixel differences in sensitivity. This is partly a mixture of the intrinsic variations in the semiconductor/materials/etc., and it also reflects uneven illumination of the sensor by the optical design, elements in the path (like dust), and so on. So, what do we do? Expose a surface that we know should be homogeneous. The captured data (the flat frame), after all the previous corrections, should reflect the multiplicative effects. Note that we need to remove all additive sources from the flat to make it a reliable model of the multiplicative effects; at the very least, we need the bias frames to remove the pedestal. Otherwise, the math just doesn't hold.
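Putting the pieces together, here is a hedged sketch of the full correction (function and variable names are my own; a real pipeline would also scale the dark to the light frame's exposure time and temperature):

```python
import numpy as np

def calibrate(light, master_bias, master_dark, flat):
    """Sketch of the standard correction: subtract the additive terms,
    then divide by a bias-corrected, normalized flat."""
    flat_corr = flat - master_bias            # the flat must lose its pedestal too
    flat_norm = flat_corr / np.mean(flat_corr)  # normalize to unit mean
    return (light - master_bias - master_dark) / flat_norm

# Synthetic check: build a light frame from known ingredients...
K = np.array([[1.0, 1.2], [0.8, 1.0]])        # per-pixel sensitivity
bias = np.full((2, 2), 100.0)                 # pedestal + readout
dark = np.full((2, 2), 20.0)                  # thermal signal (bias removed)
light = bias + dark + K * 500.0               # 500 photon-electrons everywhere
flat = bias + 1000.0 * K                      # flat = pedestal + K-weighted field

# ...and verify that the calibration recovers the flat 500-electron field.
calibrated = calibrate(light, bias, dark, flat)
```

The synthetic check also shows why the bias must be subtracted from the flat: if `flat_corr` kept the pedestal, the division would no longer cancel K and the result would not be uniform.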
Still, we may get good visual results even if we skip some steps... if the pedestal is low enough, perhaps the flat is good enough as-is. If the sensor is cooled, the thermal signal may be negligible, so dark frames are no longer strictly needed... but they do help.
BTW, things on CMOS chips are slightly different, especially because they do not behave as linearly as CCDs, so this procedure is only an approximation there. Anyway, if all the calibration frames are taken under conditions as close as possible to those of the light frames, and the on-chip processes are not signal-dependent (raw data is raw data, not processed either digitally or in the analog domain), we should get pretty consistent results.
If we add dithering to this and take a lot of frames, we also make the statistics play nicer with us, reducing noise. If no dithering is used, the noise is highly correlated, especially from thermal/readout sources, and errors in the previous steps have a larger impact on the result. Dithering decorrelates the noise sources, improving the chance that we arrive at the true signal in the end.
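To see the statistics at work, here is a small simulation (purely illustrative numbers) of averaging many frames whose noise is uncorrelated, as dithering aims to achieve:

```python
import numpy as np

rng = np.random.default_rng(42)
true_signal = 100.0
n_frames = 64

# Each frame = true signal + uncorrelated noise of sigma = 10.
frames = true_signal + 10.0 * rng.standard_normal((n_frames, 32, 32))

single = frames[0]
stacked = frames.mean(axis=0)

# The per-pixel scatter of the stack shrinks roughly by sqrt(n_frames),
# i.e. from about 10 down to about 10 / 8 here.
print(single.std(), stacked.std())
```

With correlated noise (no dithering), the same residual pattern would appear in every frame and averaging would not remove it, which is the point made above.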
We may rely on this to skip some calibration steps (darks, especially)... if the number of acquisitions is large enough. But I would play it safe.