* Add GLM-4-0414 model support
Based on zRzRzRzRzRzRzR's PR on mainline llama.cpp.
There are still some cases where it doesn't work:
* offloading >=60 layers to GPU
* no flash attention
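For context, a rough sketch of the kind of plumbing such an addition involves, assuming the same naming as the mainline PR (an LLM_ARCH_GLM4 enum value, the "glm4" architecture name, and the build_glm4() graph builder); the exact hook points in this codebase may differ:

```cpp
// illustrative fragments only, not the literal diff

// 1) the architecture gets an enum value and a name string
{ LLM_ARCH_GLM4, "glm4" },

// 2) graph construction dispatches to a dedicated builder
case LLM_ARCH_GLM4:
    {
        result = llm.build_glm4();
    } break;
```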
* Remove seemingly unused llm_tensor enums
Both of these enums appear to be unused, and LLM_TENSOR_ATTN_POST_NORM
already existed and seems to cover the same tensor (see the existing
mapping sketched below). They don't appear to be referenced in the
Python conversion code either.
So these were removed as likely just cruft:
* LLM_TENSOR_POST_ATTN_NORM
* LLM_TENSOR_POST_MLP_NORM
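For reference, the tensor-name table already contains post-norm entries that appear to cover the same tensors; the names below are taken from mainline llama.cpp and not verified against this fork:

```cpp
// pre-existing entries, used e.g. by Gemma-2 style post-norm layers
{ LLM_TENSOR_ATTN_POST_NORM, "blk.%d.post_attention_norm" },
{ LLM_TENSOR_FFN_POST_NORM,  "blk.%d.post_ffw_norm"       },
```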
* Set flash attention precision to f32 on GLM4 arch
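A minimal sketch of where this likely lands in the graph-building code, using ggml's ggml_flash_attn_ext_set_prec() helper; the surrounding code and variable names are assumed, not taken from the actual diff:

```cpp
// after the flash-attention node for the layer has been created
// (cur = ggml_flash_attn_ext(...)), force f32 precision for GLM4
if (model.arch == LLM_ARCH_GLM4) {
    ggml_flash_attn_ext_set_prec(cur, GGML_PREC_F32);
}
```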
* Set non-flash-attention precision to f32 on GLM4
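On the non-flash-attention path the KQ product is a separate matmul node, so the override goes on that node instead, via ggml_mul_mat_set_prec() (again a sketch with assumed variable names):

```cpp
struct ggml_tensor * kq = ggml_mul_mat(ctx0, k, q);
if (model.arch == LLM_ARCH_GLM4) {
    // compute the attention scores with f32 accumulation on GLM4
    ggml_mul_mat_set_prec(kq, GGML_PREC_F32);
}
```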
* Remove reshape_3d() for Vcur in build_glm4()
This fixes non-flash-attention inference on both CPU and CUDA.
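Roughly, the change drops the extra 3D reshape of Vcur so that it stays 2D, as in the other non-flash-attention builders (a sketch; variable names follow the mainline build_glm4()):

```cpp
Qcur = ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head,    n_tokens);
Kcur = ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head_kv, n_tokens);

// previously Vcur was also reshaped here:
//   Vcur = ggml_reshape_3d(ctx0, Vcur, n_embd_head, n_head_kv, n_tokens);
// leaving it as 2D [n_embd_gqa, n_tokens], like the other build_* functions
// on this path, is what fixes non-flash-attention inference
```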