Exclusive insight: Apple M1 Single "Core" comparisons miss the mark (with reference points)
Stronger references would help support your argument. I've included a wiki link since it's easier to read than a formal paper, and the underlying sources are typically cited there. The diagram in Figure 3 shows how a tensor core executes fused multiply-add (FMA) operations, which supports the claim that tensor cores accelerate deep-learning workloads. If you're unsure, you could reach out to NVIDIA or the TensorFlow team for clarification.
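To make the FMA point concrete, here is a minimal sketch (in NumPy, with hypothetical function names) of the tile-level operation a tensor core performs: D = A × B + C on small matrix tiles, with half-precision inputs accumulated at full precision, as on Volta-class hardware.

```python
import numpy as np

def tensor_core_fma(a, b, c):
    # Hypothetical emulation of one tensor-core tile operation:
    # inputs arrive in FP16, the multiply-accumulate runs in FP32.
    return a.astype(np.float32) @ b.astype(np.float32) + c

a = np.ones((4, 4), dtype=np.float16)   # 4x4 FP16 input tile
b = np.ones((4, 4), dtype=np.float16)   # 4x4 FP16 input tile
c = np.zeros((4, 4), dtype=np.float32)  # FP32 accumulator tile
d = tensor_core_fma(a, b, c)
print(d[0, 0])  # each output element is a 4-wide dot product plus the accumulator
```

The hardware performs the whole tile FMA as one fused operation per cycle, which is where the deep-learning throughput gain comes from.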
New instruction sets require dedicated hardware to execute them: AVX-512 adds its own 512-bit registers and ALUs. The VNNI extension of AVX-512 performs the same dot-product calculations efficiently in a single fused instruction. Even though GPUs are larger and process more data per unit of die area than CPUs, the area cost of a VNNI unit is minimal compared to a full GPU. GPUs are also constrained by VRAM capacity, making them less suitable when training requires large amounts of memory; unless you can afford a dedicated accelerator like a TPU, a CPU can be the more practical choice. I'm not sure how this fits into the bigger picture, but do you have any experience with model training or CPU design? (you can safely ignore this question)
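For reference, the fused calculation VNNI adds is the VPDPBUSD instruction: each 32-bit accumulator lane gains the dot product of four unsigned 8-bit values with four signed 8-bit values, replacing the three-instruction sequence (VPMADDUBSW, VPMADDWD, VPADDD) older AVX-512 code needed. A minimal scalar sketch of one lane's semantics (the function name is my own):

```python
def vpdpbusd_lane(acc, a_bytes, b_bytes):
    # Emulates one 32-bit lane of AVX-512 VNNI's VPDPBUSD:
    # a_bytes holds four unsigned int8 (0..255),
    # b_bytes holds four signed int8 (-128..127),
    # and their dot product is added to the accumulator.
    # (Hardware wraps at 32 bits; omitted here for clarity.)
    for a, b in zip(a_bytes, b_bytes):
        acc += a * b
    return acc

# One lane of an int8 dot product, as used in quantized inference:
print(vpdpbusd_lane(0, [1, 2, 3, 4], [10, -10, 10, -10]))  # 10 - 20 + 30 - 40 = -20
```

This is why a small VNNI unit covers the common quantized-inference workload without needing a GPU-sized block of silicon.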