The Spark & Jupyter seminar was created as part of the NSF-funded Data Science for All project with the goal of producing seminars that students from across a spectrum of majors (upper and lower division) could take to get introduced to data science. We have also presented this multiple times at the community college level.
In this seminar, we introduce students to data science using Apache Spark on web-based Jupyter notebooks, available through free Databricks Community accounts (clap, clap for Databricks for making these available for free; they also have an academic initiative through which they make materials available).
Students load a notebook we created that reads data we prepared from the USA Spending website. The download from that website is a bit messy, so we created data files covering five years, and the notebooks load the data directly from a website, so it runs quickly even if students are working virtually during the seminar. The data covers years from both the Obama and Trump administrations, so we start with the question of how spending could differ under these two administrations. We teach the students how to build a SQL query and use Spark SQL to query the data and generate visualizations, wrapping up with a calculation of Benford's Law, a technique used for fraud detection.
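For readers who have not seen Benford's Law before, here is a minimal sketch of the calculation in plain Python. It is not the seminar notebook, which uses Spark SQL on the spending data. The function names are our own for illustration, and the sample values are powers of two (a sequence known to follow Benford's Law closely), not USA Spending figures. Benford's Law says the expected share of leading digit d is log10(1 + 1/d).

```python
import math
from collections import Counter


def benford_expected():
    """Expected share of each leading digit 1-9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(amount):
    """First significant digit of a finite number, or None for zero."""
    value = abs(amount)
    if value == 0:
        return None
    # Scientific notation always starts with the first significant digit.
    return int(f"{value:.16e}"[0])


def observed_shares(amounts):
    """Share of each leading digit 1-9 among the nonzero amounts."""
    digits = [d for d in map(leading_digit, amounts) if d is not None]
    if not digits:
        raise ValueError("need at least one nonzero amount")
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


if __name__ == "__main__":
    # Illustrative stand-in data: the first 100 powers of two.
    sample = [2 ** n for n in range(100)]
    expected = benford_expected()
    observed = observed_shares(sample)
    print("digit  expected  observed")
    for d in range(1, 10):
        print(f"{d:>5}  {expected[d]:8.3f}  {observed[d]:8.3f}")
```

With real transaction amounts, a large gap between the observed and expected columns is only a flag for a closer look, not proof of fraud.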
Although this technology is new to most of the students, we use an interactive, hands-on approach so we don't lose them. The materials are designed as a stand-alone 3-hour seminar, but they could be broken up and used as a module in a course.
The link provided here is to our project's website at San Jose State University (where the faculty involved in the project teach in MIS and AIS). The link will take you to the webpage for this particular seminar, but there are currently 8 seminars, and all of them provide materials under the same Creative Commons license.
The webpage provides links to some of the material, including a PDF of the slides, the datasets created for the seminar, the notebook used, and some additional materials. We also make available additional teaching materials, all of the materials bundled in a Canvas cartridge, the PowerPoint slides in case you want to edit them, and a notebook and test materials we use for creating digital badges for participants. The additional materials and test questions will be provided to anyone with a faculty email address and webpage (it can even be a page that just lists you with your email as the instructor at an actual university or college). The instructions for requesting the additional materials will always be at this address on the Data Science for All website. All of these materials are available to any faculty member to use or modify under the CC license. We just don't put them directly on the website in case anyone is using the test questions. Yes, we know they will end up on the web, but we try our best not to disclose the materials to students in case any faculty are using the questions on a test or quiz, and we ask you to do the same.
Development of the Data Science for All Seminar Series is funded under NSF grant #1829622.