Unfortunately, YouTube videos can only seek to keyframes, which are spaced 2 seconds apart.
That makes it impossible to turn text into a fluent video presentation by seeking alone.
Though the project was unsuccessful, you might still find some joy in it.
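
To see why the 2-second keyframe spacing rules out word-level seeking, here is a minimal Python sketch. It is not this project's code. It assumes the player snaps a seek request back to the latest keyframe at or before the requested time (it may snap differently in practice), and the word timings are made up for illustration.

```python
import math

# Keyframe spacing in seconds, taken from the text above.
KEYFRAME_INTERVAL = 2.0


def snap_to_keyframe(t: float, interval: float = KEYFRAME_INTERVAL) -> float:
    """Return the latest keyframe at or before time t.

    Assumption: seeks snap backward to the previous keyframe.
    """
    return math.floor(t / interval) * interval


def seek_errors(word_starts, interval: float = KEYFRAME_INTERVAL):
    """For each (word, start) pair, return (word, start, landed, error).

    'landed' is where playback would actually begin, and 'error' is how
    many seconds of unwanted video play before the word is spoken.
    """
    results = []
    for word, start in word_starts:
        landed = snap_to_keyframe(start, interval)
        results.append((word, start, landed, start - landed))
    return results


if __name__ == "__main__":
    # Hypothetical caption timings: word and start time in seconds.
    words = [("hello", 0.5), ("world", 2.75), ("again", 5.25)]
    for word, start, landed, error in seek_errors(words):
        print(f"{word}: wanted {start}s, landed at {landed}s, off by {error}s")
```

Under these assumptions a seek can land up to just under 2 seconds before the word, which is longer than most spoken words last, so stitched-together clips would mostly play unrelated speech.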
This is where captions are extracted from, but you can change it.
This is where the available words will be shown.