Over the past few months, I’ve been collecting databases of hacked accounts. I’ve already gone into some depth as to why (password audits), so I won’t go over that here. But it’s been interesting to work carefully through how to pull out the truly unique passwords. At first, I assumed that a simple DISTINCT would do it, but with the wrong collation, it doesn’t. The collation is essentially the set of rules MySQL uses to compare strings, so it decides which passwords count as unique. MySQL’s default collation is case-insensitive: it treats strings that use the same letters in different cases as equal, so DISTINCT collapses them into a single row. As it turns out, one collation that works properly is ‘utf8_bin’, which compares strings by their character code values and so keeps case differences.
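To make the difference concrete, here is a minimal Python sketch of the two behaviours. It is not MySQL itself: the function names are made up for illustration, `casefold()` only approximates a case-insensitive collation, and real collations apply more rules than letter case.

```python
def distinct_case_insensitive(passwords):
    """Roughly what DISTINCT does under a case-insensitive collation (assumption:
    casefold() stands in for the collation's comparison rules)."""
    seen = set()
    unique = []
    for pw in passwords:
        key = pw.casefold()  # "Password" and "PASSWORD" get the same key
        if key not in seen:
            seen.add(key)
            unique.append(pw)
    return unique


def distinct_binary(passwords):
    """Roughly what DISTINCT does under a binary collation: exact comparison."""
    seen = set()
    unique = []
    for pw in passwords:
        if pw not in seen:  # exact string comparison, case included
            seen.add(pw)
            unique.append(pw)
    return unique


if __name__ == "__main__":
    sample = ["Password", "password", "PASSWORD", "password"]
    print(distinct_case_insensitive(sample))  # ['Password']
    print(distinct_binary(sample))            # ['Password', 'password', 'PASSWORD']
```

With the case-insensitive version, two of the three genuinely different passwords silently disappear, which is exactly the problem for a password audit.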
So, I had to clear out all my tables of well over 2 billion passwords and start over. It’s taken several weeks to re-import and sort these databases. First came formatting and importing them. With over 70 databases in the collection, this was a bit of a task in itself. Then, I had to de-duplicate each one and put the results into an interim database. The last step is to de-duplicate that interim database and put the results into a final database that holds every truly unique password. I am finally running that query today.
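The two-stage process can be sketched in a few lines of Python. This is only an in-memory illustration of the idea, with made-up function names and plain lists standing in for the database tables; the real work was done with queries against MySQL.

```python
def dedupe(items):
    """Drop exact duplicates, keeping first-seen order (case-sensitive)."""
    return list(dict.fromkeys(items))


def build_final(sources):
    """Stage 1: de-duplicate each source on its own (the interim step).
    Stage 2: de-duplicate across all interim results (the final step)."""
    interim = [dedupe(source) for source in sources]
    final = dedupe(pw for batch in interim for pw in batch)
    return interim, final


if __name__ == "__main__":
    # Two tiny hypothetical dumps standing in for the 70+ real ones.
    dumps = [["hunter2", "Hunter2", "hunter2"], ["Hunter2", "letmein"]]
    interim, final = build_final(dumps)
    print(interim)  # [['hunter2', 'Hunter2'], ['Hunter2', 'letmein']]
    print(final)    # ['hunter2', 'Hunter2', 'letmein']
```

De-duplicating each source first keeps the final pass smaller, since duplicates inside a single dump never reach it.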
The next steps will be to begin generating the different hashes, such as NTLM, MD5, SHA-1, etc. When they’re all generated, I’ll have to see how I can make the results available for people to use.
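As a rough idea of what the hashing step might look like, here is a small Python sketch using the standard `hashlib` module. The function name and the choice of UTF-8 for MD5 and SHA-1 are my assumptions. NTLM is MD4 over the UTF-16LE encoding of the password, and MD4 is not available in every OpenSSL build, so the sketch falls back to `None` when it is missing.

```python
import hashlib


def password_hashes(password):
    """Return a dict of hex digests for one password (hypothetical helper)."""
    data = password.encode("utf-8")  # assumption: hash the UTF-8 bytes
    hashes = {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
    }
    try:
        # NTLM = MD4 of the UTF-16LE password; MD4 depends on the OpenSSL build.
        hashes["ntlm"] = hashlib.new("md4", password.encode("utf-16-le")).hexdigest()
    except ValueError:
        hashes["ntlm"] = None  # MD4 not supported here
    return hashes


if __name__ == "__main__":
    for name, digest in password_hashes("password").items():
        print(name, digest)
```

For billions of rows this would need to be batched and written back to the database, but the per-password work is no more than this.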