I've finally found a way to download SEC.GOV data in a consistent and less stressful way. I want to give the University of Notre Dame Software Repository for Accounting and Finance a shout-out for their excellent work. Thanks to them I can finally start taming this beast.
I've struggled for years to figure out how to download SEC data. The EDGAR repository is so wacky that it's hard to find filings in there. The Notre Dame researchers created Python scripts that let you automate the entire download process. They even included parsing scripts that extract textual properties of each document. These properties are what I like to call "pre-labels," or draft targets for machine learning.
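To give a feel for what the download scripts automate, here is a minimal sketch of my own (not the Notre Dame code). EDGAR publishes quarterly index files listing every filing, and the `form.idx` URL pattern below is the real one; the function and variable names, and the placeholder User-Agent string, are my own.

```python
import urllib.request


def edgar_form_index_url(year: int, quarter: int) -> str:
    """Build the URL of EDGAR's quarterly form index file."""
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1-4")
    return (
        "https://www.sec.gov/Archives/edgar/full-index/"
        f"{year}/QTR{quarter}/form.idx"
    )


def fetch_form_index(year, quarter, user_agent="Your Name you@example.com"):
    # SEC asks automated clients to identify themselves via User-Agent;
    # the Notre Dame scripts handle this kind of bookkeeping for you.
    req = urllib.request.Request(
        edgar_form_index_url(year, quarter),
        headers={"User-Agent": user_agent},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("latin-1")
```

From an index file like this you can pull the paths of individual 8-K filings and download each one, which is essentially what the repository's scripts do at scale, with rate limiting and retries on top.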
After parsing the hundreds of thousands of 8-K and 8-K/A documents, I was able to generate a reference spreadsheet, shown below. That gets me about 90% of the way to a final training set. All I need to do is generate a label from this information and then insert the full text of each SEC document into the training set (without HTML tags!). A few more preprocessing steps remain, but it's almost there.
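The "without HTML tags" step can be done with the standard library alone. This is a sketch of one way to do it, not the approach I'll necessarily merge in; the class and function names are my own.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect only the text content of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.chunks.append(data)


def strip_html(html_text: str) -> str:
    stripper = TagStripper()
    stripper.feed(html_text)
    # Join the text runs and collapse all whitespace (including the
    # non-breaking spaces that SEC filings are full of).
    return " ".join(" ".join(stripper.chunks).split())
```

For example, `strip_html("<p>Item&nbsp;8.01 Other Events.</p>")` comes back as plain `Item 8.01 Other Events.`, which is closer to what a training set wants.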
There are a couple of things to note if you want to use their scripts: you'll need to tweak them to work for your setup. It took me about a day to figure everything out and then organize it the way I wanted. I created a /downloader folder for the scripts, with an /EDGAR subfolder inside it to hold all my downloaded text files.
For processing those files, I created a /dataprep folder that contains all of the scripts that generate the "pre-labels." My future HTML preprocessing script will live there first, before I merge it into the main Generic_Parser.py script.
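In shell terms, the layout above amounts to something like this (the folder names are just my own convention, nothing the scripts require):

```shell
mkdir -p downloader/EDGAR   # download scripts, plus the fetched text files
mkdir -p dataprep           # "pre-label" generation scripts live here
```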
If you want to use these scripts, note that they are not for commercial use. As of today, the software is licensed as follows: "All software and data are provided on an 'as is' basis, without warranties, for non-commercial purposes. The software is free for academic researchers."