Here’s a better example of that triangle pattern I mentioned for cluster 3:
https://jazzy-bublanina-350de8.netlify.app/triangle
Overall, I think that with more tuning it may be possible to train a simple CNN to classify attention heads. Based on my preliminary testing, clustering shows less promise, though after significant trial and error it did produce some interesting clusters.