Data Science Learning Trajectory Mining with BeautifulSoup, NetworkX and Pyvis
I recently outlined my positionality and contended that Data Science as a field requires methodological expansions to include qualitative sensibilities when handling data arising from socio-cultural and economic contexts. Then, to articulate my learning process, I surveyed the existing literature on the topic and laid out frameworks and guiding principles for my design of a self-directed, project-based learning environment. In that process, I gathered the core competencies as defined by researchers, along with project ideas that mirror the inquiry process in the field of Data Science. The next step, from a curriculum design perspective, is to consider a learning trajectory for Data Science, so that learners like myself may structure and sequence the core competencies into a path that is well defined and guided by expert research.
Since university curricula clearly state prerequisites, I approached the generation of learning trajectories by reverse-engineering this information from the bulletin data of Data Science programs: my first data visualization project. You can find the source code on my GitHub.
Problem Statement: Given the courses in a curriculum, each with some prerequisites, generate a visual representation of the curriculum, with the long-term goal of synthesizing this information from multiple sources.
Initial Thoughts
If each course corresponds to a vertex on a graph, and each edge represents a prerequisite relationship, then a directed graph can capture this information. The moment I considered graph theory, my mind flashed back to Combinatorics class, recalling the definitions of incidence matrices and the use of the positive semi-definite Laplacian matrix to check for connectedness by considering whether 0 is an eigenvalue. “A graph G is connected if and only if 0 is a simple eigenvalue of L, where L is the Laplacian matrix of G,” I whispered to myself as if a litany against fear.
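As a quick sanity check of that litany (a toy example, not part of the project script), NetworkX and NumPy make the eigenvalue test a few lines:

import networkx as nx
import numpy as np

# G is connected iff 0 is a simple eigenvalue of its Laplacian L = D - A;
# equivalently, iff the second-smallest eigenvalue of L is positive
G = nx.path_graph(4)                 # toy graph on 4 vertices
L = nx.laplacian_matrix(G).toarray()
eigenvalues = np.linalg.eigvalsh(L)  # sorted in ascending order
print(eigenvalues[1] > 1e-10)        # True: the path graph is connected
print(nx.is_connected(G))            # agrees with the spectral test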
Though, since this is a Data Science approach guided by Learning Sciences design sensibilities, we’ll scale back on the theory and get our hands dirty with some coding. If you follow along, you should be able to make your own version of this project (a network visualization of web-scraped data).
Programming Plan
I have some hope that this can eventually be generalized to other universities too, to gather more data. I divided this visualization project into two components, plus a goal for later:
- Storing the data in a way that would allow any scraper to generate a visual. I approached this using two objects (a minimal sketch follows this list): (a) Course, which contains the subject code, the course code, the course title and course description, with a list of prerequisite courses, and (b) Curriculum, which contains a dictionary of courses and a dictionary of aliases. The Curriculum object will also need to handle aliases, since some universities offer courses in multiple departments where the course content is actually considered the same.
- (University-specific) Web scraper: For the goal of creating a minimal working example first, I decided to use Case Western Reserve University’s Data Science course inventory page. (Note: I only pulled the page once and saved it to an html file; be kind to the school servers, please.)
- (Not yet implemented) The goal is to eventually take these networks (and their associated course descriptions/titles) from multiple universities and apply Latent Dirichlet Allocation (LDA) to the text (see the great article by Thushan Ganegedara).
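For concreteness, here is a minimal sketch of the two objects; the attribute and method names are simplified assumptions of mine, and the real classes in curriculum.py carry more metadata and logic:

class Course:
    ''' one course in a bulletin, with its prerequisite list '''
    def __init__(self, subject_code, course_code, title, description=""):
        self.subject_code = subject_code
        self.course_code = course_code
        self.title = title
        self.description = description
        self.prerequisites = []  # names of prerequisite courses


class Curriculum:
    ''' a collection of courses plus an alias table for cross-listings '''
    def __init__(self):
        self.courses = {}  # canonical course name -> Course
        self.aliases = {}  # alias course name -> canonical course name

    def resolve(self, name):
        # follow an alias (e.g., a cross-listed course) to its canonical name
        return self.aliases.get(name, name)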
Prerequisites
- Operating System Setup: I recommend using Ubuntu 20.04 LTS. One can certainly try a live CD, but for any serious work I recommend actually making a partition and installing it on your machine. Python comes preinstalled on almost every Linux system.
- A good editor like Sublime Text (or Vim if you’re really brave), set up for development in Python. Make sure to have AutoPEP8 for PEP 8 compliance and the Flake8 linter.
- Some practice with running Python scripts and using GitHub.
Preparations: Setting up Continuous Integration
First, I created a directory to host the project source code
mkdir curriculum-mapper
cd curriculum-mapper
touch README.md
touch curriculum.py
touch .gitignore
subl .gitignore
I went to gitignore.io to generate a .gitignore file for this project. After that, I created a virtual environment and activated it for the workspace. This is a good habit: at minimum, have one virtual environment for coding. Personally, I use a virtual environment for every one of my projects, and I keep the environment outside of my git repo directory.
cd ..
python -m venv curriculum-mapper-venv
source curriculum-mapper-venv/bin/activate
After writing a file called curriculum.py that included the class definitions for the Course and Curriculum objects, I made a GitHub repository on the website and pushed my first commit.
# in "curriculum-mapper" repo directory
git init
git add .
git commit -m "first commit"
git branch -M main
git remote add origin git@github.com:TimmyChan/curriculum-mapper.git
git push -u origin main
Then, I focused on writing unit tests using the pytest framework in preparation for Continuous Integration practice. This produced two test files, test_course.py and test_curriculum.py.
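To illustrate, a test in the spirit of those files might look like the following; the assertions and the resolve helper come from my sketch above, not necessarily the actual tests in the repo:

# test_curriculum.py (hypothetical example)
from curriculum import Course, Curriculum


def test_alias_resolution():
    curriculum = Curriculum()
    curriculum.courses["CSDS 132"] = Course("CSDS", "132",
                                            "Programming in Java")
    curriculum.aliases["ECSE 132"] = "CSDS 132"
    # an alias resolves to its canonical name; canonical names pass through
    assert curriculum.resolve("ECSE 132") == "CSDS 132"
    assert curriculum.resolve("CSDS 132") == "CSDS 132"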
The last line below is used frequently, since the Docker-based build system uses requirements.txt to load all the packages and dependencies.
pip install pytest
pip install flake8
pip install wheel
pip install beautifulsoup4
pip freeze > requirements.txt
As I kept coding, I kept installing new packages in the virtual environment. At this point my code kept breaking, since this was the first time I had decided to do CI; the CircleCI tutorials include instructions on using linters in the .circleci/config.yml.
While I had learned the PEP-8 naming conventions for my master’s thesis, this time around Flake8 kept warning me about blank lines with tabs and various other formatting issues. Luckily, I use Sublime Text as my editor, which has many packages for Python. To address this, I recommend SublimeLinter.
I headed over to CircleCI and configured the Continuous Integration pipeline, which ran my unit tests and linter.
At this point, it looks like my environment is set up to do some scraping!
Web Scraping with BeautifulSoup
My code only calls the website if the html file is not already stored in the directory. Since we use BeautifulSoup, I decided to store all the saved .html files in a directory called “canned_soup”, to be polite and prevent over-pinging a school website.
import os
import requests
from bs4 import BeautifulSoup


def polite_crawler(filename, url):
    ''' saves a copy of the html to not overping '''
    # make the cache directory if it doesn't exist yet
    os.makedirs("canned_soup", exist_ok=True)
    try:
        # try to open the cached html, where we store the soup
        with open("canned_soup/" + filename, "r") as file:
            return BeautifulSoup(file, "lxml")
    except OSError:
        # no cached copy yet: fetch the page once and save it
        res = requests.get(url)
        # using lxml because of the bs4 docs
        soup = BeautifulSoup(res.content, "lxml")
        with open("canned_soup/" + filename, "w") as file:
            file.write(str(soup))
        return soup
Bringing up the Case Western Reserve University bulletin website in my browser and inspecting its HTML structure allowed me to write the following script. The first section is just the packages used. A quick explanation:
- PyVis is a wrapper around a popular JavaScript library called vis.js
- NetworkX is a package used for analyzing networks (graphs)
- Matplotlib is a standard package for data visualization
- os is a system package for making directories and opening files
- re is a system package for regular expressions
- requests is a package for handling HTTP in Python
- unicodedata is a package to convert characters like \xa0, which is actually a non-breaking space in Latin-1 (ISO 8859-1), into Unicode
- BeautifulSoup is a package for parsing html files into actual meaningful data
- curriculum.py contains a couple of custom classes written by me that handle courses, prerequisites, and alias resolution, and contains functions to generate graphs and visuals
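Put together, the top of the script looks roughly like this (the names imported from curriculum.py follow my sketch above, not necessarily the repo's exact API):

# standard library modules
import os
import re
import unicodedata

# third-party packages
import matplotlib
import networkx as nx
import requests
from bs4 import BeautifulSoup
from pyvis.network import Network

# custom classes described above
from curriculum import Course, Curriculum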
A brief walk-through of the script:
- Initialize the Curriculum object.
- Scrape the core courses and related courses from the tables first.
- I noted that the detailed course descriptions were all listed in div tags labeled courseblock; after scraping through that data and adding each course with its prerequisites and aliases, the curriculum object does most of the work (a sketch of this step follows the list).
- In the script file, I use the Curriculum.print_graph() function to generate a local html file, which passes through the alias lists for every course and chooses a common name based on a user-defined preferred_subject_code.
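As a rough sketch of that courseblock step (the regular expression and the dictionary bookkeeping here are illustrative assumptions, not the exact code in the repo):

# hypothetical sketch of parsing the courseblock divs
curriculum = Curriculum()
soup = polite_crawler("cwru_bulletin.html",
                      "https://bulletin.case.edu/schoolofengineering/compdatasci/#courseinventory")
for block in soup.find_all("div", class_="courseblock"):
    # NFKD normalization turns \xa0 (non-breaking space) into a plain space
    text = unicodedata.normalize("NFKD", block.get_text(" ", strip=True))
    # match headers shaped like "CSDS 132. Programming in Java."
    match = re.match(r"([A-Z]{3,4})\s(\d{3}[A-Z]?)\.\s+(.+?)\.", text)
    if match:
        subject_code, course_code, title = match.groups()
        curriculum.courses[subject_code + " " + course_code] = \
            Course(subject_code, course_code, title)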
Outcome
For this visualization, the out_degree (the number of arrows leaving a node) determines the size of the node, which means that foundational nodes are larger. The color is determined by a colormap (from Matplotlib) based on the course number: the higher the number, the deeper the color. Finally, a course is blue when it belongs to the degree of interest, and grey when it’s a course outside the degree of interest.
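A minimal sketch of that sizing and coloring logic, using a toy graph in place of the scraped curriculum (the scaling constants and colormap choice are illustrative, not the exact values in print_graph()):

import matplotlib
from matplotlib.colors import to_hex
import networkx as nx
from pyvis.network import Network

# toy prerequisite graph: edges point from prerequisite to dependent course
G = nx.DiGraph()
G.add_edges_from([("DSCI 133", "DSCI 234"), ("DSCI 133", "DSCI 330"),
                  ("DSCI 234", "DSCI 330")])

cmap = matplotlib.colormaps["Blues"]  # requires matplotlib >= 3.5
net = Network(directed=True)
for node in G.nodes:
    course_number = int(node.split()[1])
    size = 10 + 5 * G.out_degree(node)         # foundational courses grow larger
    color = to_hex(cmap(course_number / 500))  # higher numbers get deeper color
    net.add_node(node, label=node, size=size, color=color)
for u, v in G.edges:
    net.add_edge(u, v)
net.show("curriculum.html")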
Armed with this visualization tool, we can then trace each particular topic within the core constructs and make explicit what one is expected to know before tackling it.
Did you get a chance to try out the script? Do you see ways this could be mapped out better? Feel free to use this example for a school of interest! Let me know how it goes and leave a comment. Subscribe to follow this attempt to democratize learning Data Science, guided by a cognitive apprenticeship framing and the inquiry-based learning model.
References and Further Reading
AutoPEP8 — Packages — Package Control. (n.d.). Retrieved March 10, 2022, from https://packagecontrol.io/packages/AutoPEP8
Beautiful Soup Documentation — Beautiful Soup 4.9.0 documentation. (n.d.). Retrieved March 10, 2022, from https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Chan, T. (2022a, February 17). From Pure Math to Data Science. Medium. https://mathtodata.medium.com/from-pure-math-to-data-science-293883864cb2
Chan, T. (2022b, February 28). Mixed Methods Data Science: Qualitative Sensibilities. Medium. https://mathtodata.medium.com/data-science-curriculum-starting-point-for-pure-mathematicians-347efe61f743
Chan, T. (2022). TimmyChan/curriculum-mapper [Python]. GitHub. https://github.com/TimmyChan/curriculum-mapper
Chan, T. (2022c, March 10). Data Science Project Based Learning. Medium. https://mathtodata.medium.com/data-science-project-based-learning-afd2bd6f8f11
Configuring a Python Application on CircleCI — CircleCI. (n.d.). Retrieved March 10, 2022, from https://circleci.com/docs/2.0/language-python/
Department of Computer and Data Sciences < Case Western Reserve University. (n.d.). Retrieved March 10, 2022, from https://bulletin.case.edu/schoolofengineering/compdatasci/#courseinventory
Ganegedara, T. (2021, November 15). Intuitive Guide to Latent Dirichlet Allocation. Medium. https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-latent-dirichlet-allocation-437c81220158
gitignore.io — Create Useful .gitignore Files For Your Project. (n.d.). Retrieved March 10, 2022, from https://www.toptal.com/developers/gitignore
Incidence matrix. (2021). In Wikipedia. https://en.wikipedia.org/w/index.php?title=Incidence_matrix&oldid=1053190313
Install Ubuntu desktop. (n.d.). Ubuntu. Retrieved March 10, 2022, from https://ubuntu.com/tutorials/install-ubuntu-desktop
Interactive network visualizations — Pyvis 0.1.3.1 documentation. (n.d.). Retrieved March 10, 2022, from https://pyvis.readthedocs.io/en/latest/index.html
Jupyter/IPython Notebook Quick Start Guide — Jupyter/IPython Notebook Quick Start Guide 0.1 documentation. (n.d.). Retrieved March 10, 2022, from https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/
Laplacian Matrices | An Introduction to Algebraic Graph Theory. (n.d.). Retrieved March 10, 2022, from https://www.geneseo.edu/~aguilar/public/notes/Graph-Theory-HTML/ch4-laplacian-matrices.html
Matplotlib — Visualization with Python. (n.d.). Retrieved March 10, 2022, from https://matplotlib.org/
Newman, M. E. J. (2003). The Structure and Function of Complex Networks. SIAM Review, 45(2), 167–256. https://doi.org/10.1137/S003614450342480
Real Python. (n.d.). Setting Up Sublime Text 3 for Full Stack Python Development. Retrieved March 10, 2022, from https://realpython.com/setting-up-sublime-text-3-for-full-stack-python-development/
Requests: HTTP for Humans™ — Requests 2.27.1 documentation. (n.d.). Retrieved March 10, 2022, from https://docs.python-requests.org/en/latest/