Or you could use some heuristics, such as: a caption is never longer than 1,000 characters, it always sits in one single text frame, it always uses a font named "x", and it is never geometrically far from the image container…
We simply cannot know.
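If rules like these did hold for your documents, a check could look roughly like the sketch below. It is plain Python on made-up data structures, not InDesign's scripting API: `Box`, `TextFrame`, `gap`, `looks_like_caption` and the 12-unit distance threshold are all assumptions for illustration, and you would have to fill those fields from your own document.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Box:
    # Hypothetical bounds record; order is top, left, bottom, right,
    # with y growing downward.
    top: float
    left: float
    bottom: float
    right: float


@dataclass
class TextFrame:
    # Hypothetical stand-in for a text frame, not a real InDesign object.
    text: str
    fonts: set       # names of all fonts used in the frame
    threaded: bool   # True if the story continues in another frame
    bounds: Box


def gap(a: Box, b: Box) -> float:
    """Shortest distance between two boxes; 0 if they touch or overlap."""
    dx = max(0.0, a.left - b.right, b.left - a.right)
    dy = max(0.0, a.top - b.bottom, b.top - a.bottom)
    return hypot(dx, dy)


def looks_like_caption(frame: TextFrame, image: Box, font: str = "x",
                       max_chars: int = 1000, max_gap: float = 12.0) -> bool:
    """Apply all four heuristics; every one must hold."""
    return (
        len(frame.text) <= max_chars           # never more than 1,000 characters
        and not frame.threaded                 # lives in one single text frame
        and frame.fonts == {font}              # only the expected font is used
        and gap(frame.bounds, image) <= max_gap  # close to the image container
    )
```

Each condition is only as good as the assumption behind it, so a frame that fails the test may still be a caption, and the other way around.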
Uwe