It takes about a fifth of a second to produce a syllable, or about a fifteenth to a twentieth of a second for each consonant or vowel. Yet it takes a little longer than that to move the lips, tongue and jaw for each vowel and consonant. So what is happening?
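The timing puzzle above can be put into rough numbers. The sketch below uses the syllable and segment durations quoted in the text; the gesture duration is a purely illustrative assumption (the text only says "a little longer"), and the names `GESTURE`, `segments_per_syllable` and `overlap` are made up for this example.

```python
from fractions import Fraction

# Durations quoted in the text, in seconds.
SYLLABLE = Fraction(1, 5)
SEGMENT_SLOTS = [Fraction(1, 15), Fraction(1, 20)]

# ASSUMPTION: an illustrative gesture duration, not a measurement.
# The text only says a gesture takes "a little longer" than a segment slot.
GESTURE = Fraction(1, 10)


def segments_per_syllable(slot):
    """How many consonant or vowel slots fit into one syllable."""
    return SYLLABLE / slot


def overlap(slot, gesture=GESTURE):
    """Time by which a gesture runs into the next slot,
    if successive gestures start one slot apart."""
    return max(gesture - slot, Fraction(0))


for slot in SEGMENT_SLOTS:
    print(
        f"slot {float(slot):.3f} s: "
        f"{segments_per_syllable(slot)} segments per syllable, "
        f"overlap {float(overlap(slot)):.3f} s"
    )
```

With the quoted figures, three or four segments fit into one syllable. Whenever the assumed gesture duration exceeds the slot, the overlap is greater than zero, so successive gestures must run concurrently.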
Coarticulation concepts

A typical definition of coarticulation is that the articulators move simultaneously but for different phonemes, or that phonemes overlap in time. Either formulation implies a belief in some sort of underlying "segment" that has its physical expression in articulatory behaviour. Indeed, Liberman & Mattingly insist that some sort of discrete representation is always implied, even for those who would deny it.
An example from Bulgarian

This example shows how three syllables were organized:
e t â r e
from the utterance
Petâr e papa
The sequence is taken from an X-ray motion film of Bulgarian speech.
