I agree that it's non-intuitive that the play and record buttons don't automatically engage the sequencer.
Now that I've gotten used to how it works, it's less of an issue for me - but it still seems arbitrary, and I still find myself forgetting on occasion. I can imagine there is some architectural reason it's structured this way - but if that reason isn't obvious from the user's vantage point, it feels less than coherent.
"Interface expert Jef Raskin came out strongly against modes, writing, "Modes are a significant source of errors, confusion, unnecessary restrictions, and complexity in interfaces." Later he notes, "'It is no accident that swearing is denoted by #&%!#$&,' writes my colleague, Dr. James Winter; it is 'what a typewriter used to do when you typed numbers when the Caps Lock was engaged'." Raskin dedicated his book The Humane Interface to describe the principles of a modeless interface for computers."
https://en.wikipedia.org/wiki/Mode_(user_interface)#Assessment