Data Science Learning Trajectory Mining with BeautifulSoup, NetworkX and Pyvis

Timmy Chan
Mar 11, 2022


I recently outlined my positionality and contended that Data Science as a field requires methodological expansion to include qualitative sensibilities when handling data arising from socio-cultural and economic contexts. Then, to articulate my learning process, I surveyed the existing literature on the topic and articulated frameworks and guiding principles for designing a self-directed, project-based learning environment. In that process, I gathered the core competencies defined by researchers, along with project ideas that mirror the inquiry process in the field of Data Science. The next step, from a curriculum design perspective, is to consider a learning trajectory for Data Science, so that learners like myself can structure and sequence the core competencies into a path that is well defined and guided by expert research.

Since university curricula clearly state prerequisites, I approached the generation of learning trajectories by reverse-engineering this information from the bulletin data of Data Science programs; this is my first data visualization project. You can find the source code on my GitHub.

Problem Statement: Given the courses in a curriculum, each of which has some prerequisites, generate a visual representation of the curriculum, with the long-term goal of synthesizing this information from multiple sources.

Initial Thoughts

If each course corresponds to a vertex on a graph, and each edge represents a prerequisite relationship, then a directed graph can capture this information. The moment I considered graph theory, my mind flashed back to Combinatorics class, recalling the definitions of incidence matrices and how the positive semi-definite Laplacian matrix can be used to check for connectedness by considering whether 0 is a simple eigenvalue. “A graph G is connected if and only if 0 is a simple eigenvalue of L, where L is the Laplacian matrix of G,” I whispered to myself as if a litany against fear.
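As a quick aside for the curious, this connectivity check is easy to reproduce with NetworkX and NumPy. A minimal sketch on a toy graph (not the curriculum data):

import networkx as nx
import numpy as np

# toy undirected graph with two components: {1, 2, 3} and {4, 5}
G = nx.Graph([(1, 2), (2, 3), (3, 1), (4, 5)])

# eigenvalues of the positive semi-definite Laplacian matrix, sorted ascending
eigenvalues = np.sort(nx.laplacian_spectrum(G))

# 0 is always an eigenvalue; G is connected iff 0 is a *simple* eigenvalue,
# i.e. the second-smallest eigenvalue is nonzero
is_connected = not np.isclose(eigenvalues[1], 0)
print(is_connected, nx.is_connected(G))  # False False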

However, since this is a Data Science approach guided by Learning Sciences design sensibilities, we’ll scale back on the theory and get our hands dirty with some coding. If you follow along, you should be able to make your own version of this project (a network visualization of web-scraped data).

Programming Plan

I have some hope that this can eventually be generalized to other universities as well, to gather more data. I divided this visualization project into the following components:

  1. Storing the data in a way that allows any scraper to generate a visual. I approached this with two objects: (a) Course, which contains the subject code, course code, course title and course description, along with a list of prerequisite courses, and (b) Curriculum, which contains a dictionary of courses and a dictionary of aliases. The Curriculum object also needs to handle aliases, since some universities offer a course through multiple departments even though the course content is considered the same. (A minimal sketch of these two classes follows this list.)
  2. (University specific) Web Scraper: For the goal of creating a minimal working example first, I decided to use Case Western Reserve University’s Data Science course inventory page. (Note: I only pulled the page once and saved it to an HTML file; please be kind to the school servers.)
  3. (Not yet implemented) The goal is to eventually take these networks (and their associated course descriptions and titles) from multiple universities and apply Latent Dirichlet Allocation (LDA) to the text (see the great article by Thushan Ganegedara).
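The actual class definitions live in curriculum.py in the repository; as a minimal sketch (the attribute and method names here are my simplified assumptions, not the repo’s exact API), the two objects could look like this:

from dataclasses import dataclass, field


@dataclass
class Course:
    subject_code: str                                    # e.g. "DSCI"
    course_code: str                                     # e.g. "351"
    title: str = ""
    description: str = ""
    prerequisites: list = field(default_factory=list)    # list of "SUBJ CODE" strings


class Curriculum:
    def __init__(self):
        self.courses = {}   # maps "SUBJ CODE" -> Course
        self.aliases = {}   # maps an alias like "ECSE 351" -> a canonical "SUBJ CODE"

    def add_course(self, course):
        key = course.subject_code + " " + course.course_code
        self.courses[key] = course

    def resolve(self, name):
        # follow an alias to the canonical course name if one exists
        return self.aliases.get(name, name)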

Prerequisites

Set up a clean environment!
  1. Operating System Setup: I recommend Ubuntu 20.04 LTS. One can certainly try a live CD, but for any serious work I recommend actually making a partition and installing it on your machine. Python comes preinstalled on almost every Linux system.
  2. A good editor like Sublime Text (or Vim if you’re really brave), set up for development in Python. Make sure to have AutoPEP8 compliance and the Flake8 linter.
  3. Some practice with running Python scripts and using GitHub.

Preparations: Setting up Continuous Integration

First, I created a directory to host the project source code:

mkdir curriculum-mapper 
cd curriculum-mapper
touch README.md
touch curriculum.py
touch .gitignore
subl .gitignore

I went to gitignore.io to generate a .gitignore file for this project. After that, I created a virtual environment and activated it for the workspace. This is a good habit: at the very least, have a virtual environment for coding. Personally, I use a virtual environment for every one of my projects, and I keep the environment outside of my git repo directory.

cd ..
python -m venv curriculum-mapper-venv
source curriculum-mapper-venv/bin/activate

After writing a file called curriculum.py that included the class definitions for the Course and Curriculum objects, I made a GitHub repository on the website and pushed my first commit.

# in "curriculum-mapper" repo directory
git init
git add .
git commit -m "first commit"
git branch -M main
git remote add origin git@github.com:TimmyChan/curriculum-mapper.git
git push -u origin main

Then, I focused on writing unit tests using the pytest framework in preparation for Continuous Integration. This produced two test files, test_course.py and test_curriculum.py.
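The real tests live in the repository; a hypothetical example of the style of test in test_course.py, assuming the simplified Course sketch from earlier, might be:

# test_course.py (illustrative only; assumes the simplified Course sketch above)
from curriculum import Course


def test_course_stores_prerequisites():
    course = Course(subject_code="DSCI", course_code="351",
                    title="Example Course",
                    prerequisites=["DSCI 133"])
    assert "DSCI 133" in course.prerequisites
    assert course.subject_code == "DSCI"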

The last line below is used frequently, since the Docker-based CI build uses requirements.txt to install all the packages and dependencies.

pip install pytest
pip install flake8
pip install wheel
pip install beautifulsoup4
pip freeze > requirements.txt

As I kept coding, I kept installing new packages in the virtual environment. At this point, my code kept breaking because this was the first time I had set up CI; the CircleCI tutorials in the references include instructions on configuring linters in .circleci/config.yml.

While I had learned the PEP-8 naming conventions for my master’s thesis, this time around Flake8 kept warning me about blank lines containing tabs and various other formatting issues. Luckily, I use Sublime Text as my editor, which has many packages for Python. To address this, I recommend SublimeLinter.

I headed over to CircleCI and configured the Continuous Integration pipeline, which ran my unit tests and linter.

CircleCI automatically running my unit tests and linting

At this point, it looks like my environment is set up to do some scraping!

Web Scraping with BeautifulSoup

Inspecting the source code of a web page, jumping straight to the element of interest

My code only calls the website if the HTML file is not already stored in the directory. Since we use BeautifulSoup, I decided to store all the saved .html files in a directory called “canned_soup”, to be polite and prevent over-pinging the school website.

import os

import requests
from bs4 import BeautifulSoup


def polite_crawler(filename, url):
    ''' saves a copy of the html to not overping the server '''
    try:
        os.makedirs("canned_soup")
    except Exception:
        pass
    try:
        # try to open the cached html, where we keep the "canned soup"
        with open("canned_soup/" + filename, "r") as file:
            soup = BeautifulSoup(file, "lxml")
        return soup
    except Exception:
        # cache miss: fetch the page once and save it locally
        res = requests.get(url)
        # using lxml because of bs4 doc
        soup = BeautifulSoup(res.content, "lxml")
        with open("canned_soup/" + filename, "w") as file:
            file.write(str(soup))
        return soup
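With that helper in place, fetching (or reusing) the bulletin page is a single call. The URL is the CWRU course inventory page cited in the references; the cache filename is just an arbitrary name I picked for illustration:

url = ("https://bulletin.case.edu/schoolofengineering/"
       "compdatasci/#courseinventory")
soup = polite_crawler("cwru_datasci.html", url)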

Bringing up the Case Western Reserve University bulletin website in my browser and inspecting its HTML structure allowed me to write the following script. The first section is just the packages used; a quick explanation of each (an import sketch follows the list):

  1. Pyvis is a wrapper around a popular JavaScript library called vis.js
  2. NetworkX is a package for analyzing networks (graphs)
  3. Matplotlib is a standard package for data visualization
  4. os is a standard-library module for making directories and opening files
  5. re is the standard-library module for regular expressions
  6. requests is a package for handling HTTP in Python
  7. unicodedata is a standard-library module used to normalize characters such as \xa0, a non-breaking space in Latin-1 (ISO 8859-1), into plain Unicode text
  8. BeautifulSoup is a package for parsing HTML files into meaningful, navigable data
  9. curriculum.py contains a couple of custom classes written by me that handle courses, prerequisites and alias resolution, and contains functions to generate graphs and visuals
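For reference, the import section of the script would look roughly like this (the exact names pulled from curriculum.py are my assumption):

import os
import re
import unicodedata

import requests
import matplotlib                      # colormaps for node colors
import networkx as nx
from bs4 import BeautifulSoup
from pyvis.network import Network

from curriculum import Course, Curriculum   # the custom classes described above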
Example: using the Curriculum-Mapper project on Case Western Reserve University’s Data Science program

A brief walk-through of the script:

  1. Initialize the Curriculum object.
  2. Scrape the core courses and related courses from the tables first.
  3. I noted that the detailed course descriptions were all listed in div tags labeled courseblock; after scraping through that data and adding each course with its prerequisites and aliases, the Curriculum object does most of the work (a sketch of this step follows the list).
  4. In the script file, I call the Curriculum.print_graph() function to generate a local HTML file; it walks the alias lists for every course and chooses a common name based on a user-defined preferred_subject_code.
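As a rough sketch of step 3 (the courseblock class name comes from the bulletin HTML; the Course and Curriculum calls follow my simplified sketch from earlier, with a crude prerequisite heuristic rather than the repo’s actual parsing logic):

import re
import unicodedata

# assumes `soup` from polite_crawler() and the Course/Curriculum sketch above
curriculum = Curriculum()
for block in soup.find_all("div", class_="courseblock"):
    # normalize non-breaking spaces (\xa0) before parsing the text
    text = unicodedata.normalize("NFKD", block.get_text(" ", strip=True))
    # pull out "SUBJ 123"-style course codes with a regular expression
    codes = re.findall(r"[A-Z]{3,4} \d{3}", text)
    if not codes:
        continue
    subject, number = codes[0].split()
    curriculum.add_course(Course(subject_code=subject, course_code=number,
                                 description=text,
                                 prerequisites=codes[1:]))  # crude heuristic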

Outcome

Visualization made using Python (Pyvis, NetworkX and Beautiful Soup)

For this visualization, the out_degree (the number of arrows leaving a node) determines the size of the node, so foundational courses appear larger. The color is determined by a matplotlib colormap based on the course number: the higher the number, the deeper the color. Finally, a course is blue when it belongs to the degree of interest and grey when it sits outside it.
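The real rendering happens inside Curriculum.print_graph(); a standalone sketch of the same sizing and coloring idea, with toy edges and made-up scaling constants, could look like:

import networkx as nx
import matplotlib.pyplot as plt
from matplotlib.colors import to_hex
from pyvis.network import Network

# toy prerequisite graph: an edge A -> B means "A is a prerequisite of B"
G = nx.DiGraph([("MATH 201", "DSCI 351"), ("DSCI 351", "DSCI 451")])

cmap = plt.get_cmap("Blues")
net = Network(directed=True)
for course in G.nodes:
    number = int(course.split()[-1])
    net.add_node(course,
                 label=course,
                 size=10 + 5 * G.out_degree(course),           # foundational nodes are larger
                 color=to_hex(cmap(min(number / 600, 1.0))))   # deeper color for higher numbers
for u, v in G.edges:
    net.add_edge(u, v)
net.save_graph("curriculum.html")  # writes the interactive HTML file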

Armed with this visualization tool, we can then trace each particular topic within the core constructs and make explicit what one is expected to know before tackling it.

Did you get a chance to try out the script? Do you see ways this could be mapped out better? Feel free to use this example for a school of interest! Let me know how it goes and leave a comment. Subscribe to follow this attempt to democratize learning Data Science, guided by a cognitive apprenticeship framing and the inquiry-based learning model.

Timmy Chan is a mathematician actively seeking a data scientist role.


References and Further Reading

AutoPEP8 — Packages — Package Control. (n.d.). Retrieved March 10, 2022, from https://packagecontrol.io/packages/AutoPEP8

Beautiful Soup Documentation — Beautiful Soup 4.9.0 documentation. (n.d.). Retrieved March 10, 2022, from https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Chan, T. (2022a, February 17). From Pure Math to Data Science. Medium. https://mathtodata.medium.com/from-pure-math-to-data-science-293883864cb2

Chan, T. (2022b, February 28). Mixed Methods Data Science: Qualitative Sensibilities. Medium. https://mathtodata.medium.com/data-science-curriculum-starting-point-for-pure-mathematicians-347efe61f743

Chan, T. (2022). TimmyChan/curriculum-mapper [Python]. https://github.com/TimmyChan/curriculum-mapper (Original work published 2022)

Chan, T. (2022c, March 10). Data Science Project Based Learning. Medium. https://mathtodata.medium.com/data-science-project-based-learning-afd2bd6f8f11

Configuring a Python Application on CircleCI — CircleCI. (n.d.). Retrieved March 10, 2022, from https://circleci.com/docs/2.0/language-python/

Department of Computer and Data Sciences < Case Western Reserve University. (n.d.). Retrieved March 10, 2022, from https://bulletin.case.edu/schoolofengineering/compdatasci/#courseinventory

Ganegedara, T. (2021, November 15). Intuitive Guide to Latent Dirichlet Allocation. Medium. https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-latent-dirichlet-allocation-437c81220158

gitignore.io — Create Useful .gitignore Files For Your Project. (n.d.). Retrieved March 10, 2022, from https://www.toptal.com/developers/gitignore

Incidence matrix. (2021). In Wikipedia. https://en.wikipedia.org/w/index.php?title=Incidence_matrix&oldid=1053190313

Install Ubuntu desktop. (n.d.). Ubuntu. Retrieved March 10, 2022, from https://ubuntu.com/tutorials/install-ubuntu-desktop

Interactive network visualizations — Pyvis 0.1.3.1 documentation. (n.d.). Retrieved March 10, 2022, from https://pyvis.readthedocs.io/en/latest/index.html

Jupyter/IPython Notebook Quick Start Guide — Jupyter/IPython Notebook Quick Start Guide 0.1 documentation. (n.d.). Retrieved March 10, 2022, from https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/

Laplacian Matrices | An Introduction to Algebraic Graph Theory. (n.d.). Retrieved March 10, 2022, from https://www.geneseo.edu/~aguilar/public/notes/Graph-Theory-HTML/ch4-laplacian-matrices.html

Matplotlib — Visualization with Python. (n.d.). Retrieved March 10, 2022, from https://matplotlib.org/

Newman, M. E. J. (2003). The Structure and Function of Complex Networks. SIAM Review, 45(2), 167–256. https://doi.org/10.1137/S003614450342480

Python, R. (n.d.). Setting Up Sublime Text 3 for Full Stack Python Development — Real Python. Retrieved March 10, 2022, from https://realpython.com/setting-up-sublime-text-3-for-full-stack-python-development/

Requests: HTTP for HumansTM — Requests 2.27.1 documentation. (n.d.). Retrieved March 10, 2022, from https://docs.python-requests.org/en/latest/



Timmy Chan

Professional Software Engineer, Master Mathematician interested in learning and implementing multidisciplinary approaches to complex questions