well... data stored on disk is just a sequential list of numbers. same goes for main memory. so to handle anything more complex than a 1-dimensional array, a piece of data needs some metadata that describes what it is. for instance, if what you are storing on disk is a matrix, somewhere ahead of the data itself there will usually be some kind of header that says how many rows and columns the matrix has, which tells a reader where in the list one row ends and the next begins.
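to make the header idea concrete, here is a minimal sketch in C. the `MatrixHeader` layout and the `matrix_write` / `matrix_read` / `matrix_at` names are made up for illustration; they are not taken from fits, xisf, or any real format, and the file this produces is only readable on the same kind of machine that wrote it.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* hypothetical header that sits ahead of the numbers (not a real format) */
typedef struct {
    uint32_t rows;
    uint32_t cols;
} MatrixHeader;

/* write the header, then rows*cols doubles, one row after another.
   returns 0 on success, -1 on failure */
int matrix_write(FILE *f, uint32_t rows, uint32_t cols, const double *data)
{
    MatrixHeader h = { rows, cols };
    size_t n = (size_t)rows * cols;
    if (fwrite(&h, sizeof h, 1, f) != 1) return -1;
    if (fwrite(data, sizeof *data, n, f) != n) return -1;
    return 0;
}

/* read the header first; it tells us how many numbers follow.
   returns a malloc'd flat array (caller frees), or NULL on failure */
double *matrix_read(FILE *f, MatrixHeader *h)
{
    if (fread(h, sizeof *h, 1, f) != 1) return NULL;
    size_t n = (size_t)h->rows * h->cols;
    double *data = malloc(n * sizeof *data);
    if (data == NULL) return NULL;
    if (fread(data, sizeof *data, n, f) != n) {
        free(data);
        return NULL;
    }
    return data;
}

/* element (r, c): row r starts r * cols numbers into the flat list */
double matrix_at(const double *data, const MatrixHeader *h,
                 uint32_t r, uint32_t c)
{
    return data[(size_t)r * h->cols + c];
}
```

you would call `matrix_write` on a file opened with `fopen(path, "wb")` and `matrix_read` on one opened with `"rb"`. without the header, the reader would have no way to tell a 2x3 matrix from a 3x2 one, since both are just six numbers in a row.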
sometimes data is stored on disk as straight binary data with no metadata. when a program loads it, it already knows what the data should be and superimposes its own "template" on it. for instance, you can load some data straight into memory and then (in C) cast it to some complex type that is defined in your program. then your program knows which bits of data are what. you can look up "data structures" or the "struct" keyword in C to get an idea of what this is. but the xisf format is not straight binary data; it has its own metadata.
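a rough sketch of that "template" idea, assuming a made-up `Record` layout that the program simply knows in advance (the struct and the `record_from_bytes` name are hypothetical). it copies the bytes into the struct with `memcpy` instead of casting a pointer, which does the same job but avoids alignment trouble. it also assumes the bytes were written by the same kind of machine and compiler, so byte order and padding match.

```c
#include <stdint.h>
#include <string.h>

/* hypothetical layout; nothing in the raw bytes says this is what they mean */
typedef struct {
    uint16_t width;
    uint16_t height;
    float    exposure;
} Record;

/* lay the Record template over the first sizeof(Record) raw bytes.
   returns 0 on success, -1 if there are not enough bytes */
int record_from_bytes(const unsigned char *raw, size_t len, Record *out)
{
    if (len < sizeof *out) return -1;
    memcpy(out, raw, sizeof *out);
    return 0;
}
```

the point is that the meaning lives entirely in the program: hand the same bytes to a program with a different struct definition and you get different (probably nonsense) values.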
so the fits or xisf file format itself defines some 'preamble' that describes how big the image is, so that it can be transformed from a linear list of numbers into a 2-dimensional array of numbers and displayed on screen as a rectangle. and of course there's a lot of other stuff in that preamble that describes various properties of the image. as it turns out, a lot of this stuff is represented in the view explorer of pixinsight, so it is user-visible. since an xisf file can have multiple images embedded, there is probably some kind of 'meta-header' that first says at what position in the file image 1 starts, at what position image 2 starts, etc. the xisf specification is openly published, so if you read it you'll see in exact terms what i'm talking about.
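here is a toy version of that 'meta-header' in C, just to show the mechanics. the `ImageEntry` struct and `image_pixel` function are invented for this post and are not how xisf actually lays things out; the real spec has the details.

```c
#include <stddef.h>
#include <stdint.h>

/* hypothetical directory entry: where one image starts and how big it is */
typedef struct {
    size_t offset;  /* index of the image's first pixel in the flat buffer */
    size_t width;
    size_t height;
} ImageEntry;

/* find pixel (x, y) of the image described by e inside one flat buffer.
   returns NULL if the coordinates or the computed index are out of range */
const uint16_t *image_pixel(const uint16_t *buf, size_t buf_len,
                            const ImageEntry *e, size_t x, size_t y)
{
    if (x >= e->width || y >= e->height) return NULL;
    size_t i = e->offset + y * e->width + x;
    return i < buf_len ? &buf[i] : NULL;
}
```

so with one buffer holding a 3x2 image followed by a 2x2 image, the directory would be `{0, 3, 2}` and `{6, 2, 2}`, and the same flat list of ten numbers gets read as two separate pictures.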
as for the size: if the size differences are small, they could come from differences in that metadata. they could also be due to the crop masks being different, or even the presence/absence of a crop mask at all.
rob