Airflow directed acyclic graph

Problem: I wanted to know what was going on across the US government in relation to data science.
Solution: I wrote a program to query APIs related to hiring, spending, laws, regulations, and code repositories.
  • For over a year now, this script has been running and collecting updated data every day.
  • I have had people ask me for the jobs dataset because those historical data are not easily accessible.
  • I have analyzed the jobs dataset several times over the last year to find commonly mentioned skills.
Status: In “Production” (this is just something I use at home)
Skills Used: API access (HTTP requests), Airflow (for scheduling), SQLite, SQL, functional programming
Languages Used: Python

In the fall of 2019, BLS hired three data scientists under the job title Operations Research Analyst because at the time (and as of this writing) the U.S. government does not have a dedicated job series for data science. We ended up with three great picks, all internal hires moved to new positions, but it made me wonder how efficient it was to hire that way. Who would ever look under Operations Research Analyst when searching for data science jobs? I became interested in where else data science jobs were being posted on USAJobs. To answer that question, I wrote a script to query the USAJobs API keyword search endpoint for a set of keywords: data scientist, data science, artificial intelligence, machine learning, business analytics, data visualization, and natural language processing. I tried other keywords, such as simply “data analysis,” but that resulted in far too many false positives. It turns out almost any office occupation does some sort of analysis of some sort of data.
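As a rough illustration (not the actual gather_gov code), a keyword query like this could be sketched as follows. The endpoint, header names, and response field names reflect my reading of the public USAJobs search API documentation and should be checked against it. The function names, the output field names, and the choice of the standard library's urllib are assumptions made for the example.

```python
import json
import urllib.parse
import urllib.request

# Endpoint and field names below follow the public USAJobs API docs as I
# understand them; verify against the current documentation before relying on them.
SEARCH_URL = "https://data.usajobs.gov/api/search"

# The keywords listed in the post.
KEYWORDS = [
    "data scientist",
    "data science",
    "artificial intelligence",
    "machine learning",
    "business analytics",
    "data visualization",
    "natural language processing",
]


def build_headers(email, api_key):
    """Headers for an authenticated request (hypothetical helper)."""
    return {
        "Host": "data.usajobs.gov",
        "User-Agent": email,
        "Authorization-Key": api_key,
    }


def extract_jobs(payload, keyword):
    """Flatten one search response into a list of small dicts."""
    items = payload.get("SearchResult", {}).get("SearchResultItems", [])
    jobs = []
    for item in items:
        descriptor = item.get("MatchedObjectDescriptor", {})
        jobs.append({
            "job_id": item.get("MatchedObjectId"),
            "title": descriptor.get("PositionTitle"),
            "agency": descriptor.get("OrganizationName"),
            "url": descriptor.get("PositionURI"),
            "keyword": keyword,  # which search term surfaced this job
        })
    return jobs


def search_keyword(keyword, headers):
    """Run one keyword search against the API and parse the result."""
    query = urllib.parse.urlencode({"Keyword": keyword, "ResultsPerPage": 500})
    request = urllib.request.Request(f"{SEARCH_URL}?{query}", headers=headers)
    with urllib.request.urlopen(request, timeout=30) as response:
        return extract_jobs(json.load(response), keyword)


def merge_jobs(batches):
    """Combine per-keyword results, keeping the first copy of each job_id."""
    merged = {}
    for batch in batches:
        for job in batch:
            merged.setdefault(job["job_id"], job)
    return list(merged.values())
```

With real credentials, the pieces would be combined as `merge_jobs(search_keyword(k, headers) for k in KEYWORDS)`. The merge step matters because a single announcement often matches more than one keyword.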

Once I had this scripted out, I realized I could use this idea (querying APIs) to keep track of what is happening across government related to data science (or any other topic). So I sought out other APIs that might provide useful information and wrote similar code to pull data from them. Over time, I have added to the list, but these are the ones I currently query on a daily basis.

  • USAJobs – Most (all?) federal hiring goes through USAJobs so it is a great resource for information on hiring and job requirements.
  • ProPublica Congress API – Information on bills proposed in Congress. You can get follow-up information on whether they passed, but I have focused simply on what’s getting proposed.
  • Federal Register – The Federal Register lists new regulations available for comment, upcoming advisory meetings, and executive orders, among other things.
  • – Aggregates code repositories from government agencies on sites like GitHub, GitLab, etc.
  • USASpending – Provides information on federal contracting, including brief text descriptions of the service being contracted for.

After scripting separate modules for each API, I wrote a module for running them all with a single command. I call the collection of scripts gather_gov. It is laid out in a folder structure like a package, but I have not formally packaged it for use on my computer or elsewhere. I also decided to save the data collected from each service to separate tables in a single SQLite database, because (1) I didn’t want to have a bunch of separate files and (2) I wanted an excuse to use a database and maybe work on my SQL (though this isn’t a particularly good example because the tables aren’t related to one another).
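The one-database, one-table-per-source idea can be sketched with Python's built-in sqlite3 module. The table and column names here are hypothetical, not the real gather_gov schema. The sketch assumes each source provides a stable ID that can serve as a primary key, so that re-running the script on the same day does not create duplicates.

```python
import sqlite3

# Hypothetical schema: one table per data source, keyed on the source's own ID.
SCHEMA = (
    "CREATE TABLE IF NOT EXISTS jobs ("
    "job_id TEXT PRIMARY KEY, title TEXT, agency TEXT, added_on TEXT)",
    "CREATE TABLE IF NOT EXISTS bills ("
    "bill_id TEXT PRIMARY KEY, title TEXT, added_on TEXT)",
)


def init_db(conn):
    """Create the tables if this is the first run."""
    for statement in SCHEMA:
        conn.execute(statement)
    conn.commit()


def save_jobs(conn, jobs, added_on):
    """Insert jobs not already stored; return how many rows were new."""
    before = conn.total_changes
    conn.executemany(
        # INSERT OR IGNORE skips rows whose primary key already exists.
        "INSERT OR IGNORE INTO jobs (job_id, title, agency, added_on) "
        "VALUES (:job_id, :title, :agency, :added_on)",
        [{**job, "added_on": added_on} for job in jobs],
    )
    conn.commit()
    return conn.total_changes - before
```

In daily use the connection would point at a file, for example `sqlite3.connect("gather_gov.db")` (a made-up filename). The return value of `save_jobs` is the count of records that were genuinely new that day.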

I quickly realized that I wasn’t going to want to run this by hand on a regular basis, so I took this as an opportunity to learn Airflow, an open source package for scheduling jobs. I managed to learn enough to build my first Directed Acyclic Graph (DAG), a network of subtasks that are part of a larger job, in one morning. That was January of 2020, and this script (with minor tweaks) has been running every day since.
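To make the DAG idea concrete without needing an Airflow install, here is a small sketch that uses the standard library's graphlib (Python 3.9+) rather than Airflow itself. The task names and dependencies are invented for illustration. In Airflow, the same structure would be expressed with operators and dependency arrows, and the scheduler would decide when to run it.

```python
from graphlib import TopologicalSorter

# Hypothetical subtasks; each stands in for one module of the larger job.
TASKS = {
    "gather_usajobs": lambda: "jobs gathered",
    "gather_bills": lambda: "bills gathered",
    "gather_federal_register": lambda: "notices gathered",
    "update_map": lambda: "map updated",
    "send_email": lambda: "email sent",
}

# Each key lists the tasks that must finish before it can start.
DEPENDENCIES = {
    "update_map": {"gather_usajobs"},
    "send_email": {
        "gather_usajobs",
        "gather_bills",
        "gather_federal_register",
        "update_map",
    },
}


def run_dag(tasks, dependencies):
    """Run every task once, in an order that respects the dependencies."""
    # static_order() raises graphlib.CycleError if the graph has a cycle,
    # which is what "acyclic" rules out.
    order = list(TopologicalSorter(dependencies).static_order())
    results = {name: tasks[name]() for name in order}
    return order, results


order, results = run_dag(TASKS, DEPENDENCIES)
print(" -> ".join(order))
```

The three gather tasks have no dependencies on one another, so they may run in any relative order. The only guarantees are that the map update follows the USAJobs pull and the email goes out last.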

Since January 2020, I have amassed more than 1,500 job announcements from USAJobs, nearly 200 bills, 460 code repositories, 560 Federal Register notices, and 1,460 contracts from USASpending (though that includes historical contracts going back to 2010). Each day at around 2:30, the script runs and I get an email telling me what was added that day.
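A daily summary email like this could be assembled along the following lines. This is a sketch under assumptions, not the real script: it assumes every table has an `added_on` date column, and the table names, subject line, and addresses are made up.

```python
import sqlite3
from email.message import EmailMessage

# Hypothetical table names; fixed in code so they are safe to place in SQL.
TABLES = ("jobs", "bills")


def count_added(conn, tables, day):
    """Count the rows added to each table on the given day (ISO date string)."""
    counts = {}
    for table in tables:
        if table not in TABLES:
            raise ValueError(f"unknown table: {table}")
        row = conn.execute(
            f"SELECT COUNT(*) FROM {table} WHERE added_on = ?", (day,)
        ).fetchone()
        counts[table] = row[0]
    return counts


def build_summary(counts, day, sender, recipient):
    """Build (but do not send) the daily summary message."""
    message = EmailMessage()
    message["Subject"] = f"gather_gov: {sum(counts.values())} new records on {day}"
    message["From"] = sender
    message["To"] = recipient
    lines = [f"{table}: {count} added" for table, count in sorted(counts.items())]
    message.set_content("\n".join(lines))
    return message
```

Sending is left out because it depends on the mail setup. With a reachable SMTP server, `smtplib.SMTP(host).send_message(message)` from the standard library would deliver the result.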

The Airflow DAG also updates a map of the location data pulled from job announcements in USAJobs.

Early on, I set up another script using the same set of packages to look for another set of keywords related to the Bureau of Labor Statistics in case there were any interesting insights there. I still haven’t looked at it much, but I do get the daily emails and it’s nice to see what jobs are being posted and what Federal Register notices are being posted.

To Do

There are still some things I need to do with this project:

  • Do more analysis with the data. I have done quite a bit with the USAJobs data but far less with the other data sources. There is some rich information there, including where federal contracting money related to data science is going. This will include, perhaps, the automation of reports using RMarkdown.
  • Get this off of my computer. Right now, everything runs on my laptop. That is certainly not ideal. I just haven’t had the interest in moving it to a server and learning how to Dockerize it. That is on the horizon. Due to the pandemic I haven’t left the house much over the last year so my computer always has internet access when the script needs to run, but that won’t always be the case.
  • Package this up and otherwise share the code. I need to abstract the code a bit to make sure it is usable for others without much additional work. I also need to ensure an easy workflow for users to include their own API keys and make sure mine don’t get shared. Then this is going up on GitHub.
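For the API key item, one common approach (an assumption on my part, not something already built into gather_gov) is to read keys from environment variables so they never appear in the shared code. The variable names below are hypothetical.

```python
import os

# Hypothetical environment variable names, one per service that needs a key.
REQUIRED_KEYS = ("USAJOBS_API_KEY", "PROPUBLICA_API_KEY")


def load_api_keys(environ=None, required=REQUIRED_KEYS):
    """Return the required keys, failing loudly if any are missing or empty."""
    env = os.environ if environ is None else environ
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing API keys: " + ", ".join(missing))
    return {name: env[name] for name in required}
```

Each user would set these variables in their own shell or scheduler configuration, and the repository itself would contain no secrets.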
