Since it is a binary image, a logical operation may do the trick: !(pix1 && pix2 && ... && pixN). Furthermore, if it is a black pixel, we already know the result, so there is no need to calculate again; only perform that for white pixels. Also, it could be coded with an loop exit, if it ever finds a black pixel...
Anyway, it is not an O(N^2) operation (N the number of pixels). It is O(N*M), M the size of the kernel.
Georg, I'm 99% sure that the problems with the stars at the boundaries will banish with a 3px zero padding, before the gradient calculation/merging. It will slow down the process a little bit, but is not a great deal. Furthemore, since we are using a solver based on FFT calculations (and I would say that this is one of the narrowest bottlenecks), this padding will be smaller in many cases than the padding internally performed, to achieve optimized sizes.