When a 2D renderer simulates the classic isometric perspective, where you see the world from a diagonal angle, you calculate the draw order by adding the x, y and z values of each sprite/tile and sorting by that sum. With an orthogonal perspective (like in Zelda), the draw order is calculated by adding only y and z. The tricky part is figuring out where exactly to place the pivot point of each sprite/tile, since the position of that point is what goes into the sum.
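As a rough illustration of those two sums, here is a minimal Python sketch. The `Sprite` class, the function names and the sample coordinates are made up for this example. It also assumes that larger x and y mean "closer to the camera", that larger z means "higher up", and that sprites are drawn back to front, so the smallest sum is drawn first. Your coordinate system may need some of the signs flipped.

```python
from dataclasses import dataclass


@dataclass
class Sprite:
    name: str
    # Position of the sprite's pivot point in map coordinates (assumed convention:
    # larger x/y = closer to the camera, larger z = higher up).
    x: float
    y: float
    z: float


def isometric_key(sprite: Sprite) -> float:
    # Isometric view: all three axes contribute to the draw order.
    return sprite.x + sprite.y + sprite.z


def orthogonal_key(sprite: Sprite) -> float:
    # Orthogonal (Zelda-like) view: only y and z matter.
    return sprite.y + sprite.z


def draw_order(sprites, key):
    # Back to front: smallest key is drawn first, largest last (on top).
    return [s.name for s in sorted(sprites, key=key)]


sprites = [
    Sprite("floor", 0, 0, 0),
    Sprite("tree", 3, 1, 0),
    Sprite("player", 1, 2, 0),
    Sprite("bird", 0, 0, 5),
]

print(draw_order(sprites, isometric_key))
print(draw_order(sprites, orthogonal_key))
```

Note that the tree and the player swap places between the two perspectives, because x only counts in the isometric case. Sprites with equal sums keep their list order here, so a real game would probably want an explicit tie-breaker.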
One trick classic games like the SNES Zelda use is to give the illusion of different heights without actually having any. The maps are flat, and the height is faked through the design of the tiles.
But if you have access to a modern 3D engine, you might be able to save yourself a lot of headache by creating your game as a 3D scene and then rendering it with an orthographic camera so it looks 2D. Lots of retro games with an isometric perspective use that trick nowadays.
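To see why an orthographic camera gives that look, here is a small Python sketch of the classic 2:1 "isometric" tile projection, which is one kind of orthographic projection. In a real engine you would not write this yourself: the camera's projection matrix and the depth buffer handle it. The function name, the 64x32 tile size and the axis directions are assumptions chosen for illustration.

```python
def project_to_screen(x, y, z, tile_w=64, tile_h=32):
    # Orthographic projection: screen position is a plain linear combination
    # of the world coordinates. There is no division by distance, so objects
    # do not shrink when they are further away and parallel lines stay parallel.
    screen_x = (x - y) * tile_w / 2
    screen_y = (x + y) * tile_h / 2 - z * tile_h  # assumed: one z unit lifts by one tile height
    return screen_x, screen_y


# One step along each world axis always moves the sprite by the same
# number of pixels, no matter where in the world it starts.
print(project_to_screen(1, 0, 0))
print(project_to_screen(0, 1, 0))
print(project_to_screen(0, 0, 1))
```

With a 3D scene, the depth along the camera's view direction plays the role of the draw order sum, and the engine's depth buffer sorts it out per pixel.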