Scraper

A tool that scrapes data and distills it.

Prerequisites

  • Have Node.js, pnpm, and Git installed.
  • Clone the FNTU repository: run git clone git@github.com:Acrylic125/fntu.git, then cd scraper and run pnpm i to install the dependencies.

Running the Scraper

Once cloned, run the following commands to start scraping:

pnpm run start courses
pnpm run start locations

The code can be found here.

High-level overview

Scraper Steps

  1. Source the data we want to scrape. See Data Sources for more details.
  2. Download the pages (i.e. the Data Sources).
  3. Scrape the data from the pages and consolidate it (in JSON format).
  4. (Optional) Transform the data and consolidate it (in JSON format).
  5. (Optional) Insert the data into a database (using Drizzle ORM).

You may modify any of these steps to fit your needs.
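
As a rough illustration of how steps 3 to 5 fit together, here is a minimal TypeScript sketch. Everything in it is an assumption made for the example: the RawCourse and Course types, the function names, and the "CODE - Title" line format are hypothetical and are not taken from the actual scraper. Downloading (step 2) is left out so the sketch runs offline, and the database insert is represented by an optional callback rather than a real Drizzle ORM call.

```typescript
// Hypothetical shapes; the real scraper's data model may differ.
interface RawCourse {
  code: string;
  title: string;
}

interface Course extends RawCourse {
  subject: string; // derived during the transform step
}

// Step 3: scrape. Assumes (for illustration only) that each course appears
// on its own line as "CODE - Title", e.g. "AB1234 - Sample Course".
function scrapeCourses(page: string): RawCourse[] {
  const courses: RawCourse[] = [];
  for (const line of page.split("\n")) {
    const match = /^([A-Z]{2}\d{4})\s*-\s*(.+)$/.exec(line.trim());
    if (match) {
      courses.push({ code: match[1], title: match[2].trim() });
    }
  }
  return courses;
}

// Step 4 (optional): transform. Deduplicates by course code and derives
// a subject prefix from the first two letters of the code.
function transformCourses(raw: RawCourse[]): Course[] {
  const byCode: Record<string, Course> = {};
  for (const { code, title } of raw) {
    byCode[code] = { code, title, subject: code.slice(0, 2) };
  }
  return Object.keys(byCode).map((code) => byCode[code]);
}

// Steps 3 to 5 chained. `pages` holds the already-downloaded page contents;
// `insert` is a placeholder for whatever writes to the database.
function runPipeline(
  pages: string[],
  insert?: (courses: Course[]) => void,
): string {
  let raw: RawCourse[] = [];
  for (const page of pages) {
    raw = raw.concat(scrapeCourses(page));
  }
  const courses = transformCourses(raw);
  if (insert) {
    insert(courses);
  }
  // Consolidate the result as JSON.
  return JSON.stringify(courses, null, 2);
}
```

Keeping each step as a separate function is one way to make individual steps easy to swap or skip, for example by omitting the insert callback when only the JSON output is needed.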

Data Sources

Data sources are the pages we want to scrape from.

Use

Data Sources

Courses

Locations

Sourcing main locations.

We have to link the names used in undergraduate programs to the names used in MapIndoors. Thus, we add Alternate Names (altNames) to each location. We source altNames from:
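
To make the role of altNames concrete, the following TypeScript sketch shows one way a name from a program page could be resolved to a location. The CampusLocation shape, the normalize and findLocation helpers, and the sample entries are all hypothetical and made up for this example; only the altNames field name comes from the text above.

```typescript
// Hypothetical shape: `name` is the name used in MapIndoors, and
// `altNames` lists other names that refer to the same place.
interface CampusLocation {
  name: string;
  altNames: string[];
}

// Normalize names so that case and spacing differences do not block a match.
function normalize(value: string): string {
  return value.trim().toLowerCase().replace(/\s+/g, " ");
}

// Return the first location whose name or any altName matches the query.
function findLocation(
  locations: CampusLocation[],
  query: string,
): CampusLocation | undefined {
  const target = normalize(query);
  for (const candidate of locations) {
    const names = [candidate.name].concat(candidate.altNames);
    for (const n of names) {
      if (normalize(n) === target) {
        return candidate;
      }
    }
  }
  return undefined;
}

// Made-up sample data, for illustration only.
const sampleLocations: CampusLocation[] = [
  { name: "Lecture Theatre 1", altNames: ["LT1", "LT 1"] },
  { name: "Tutorial Room 5", altNames: ["TR5"] },
];
```

With this sample data, findLocation(sampleLocations, "lt 1") resolves to the "Lecture Theatre 1" entry, while a name that appears in neither field returns undefined and can be flagged for a new altName.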

Tips for Scraping

  1. Start by finding the pages that contain the data you need.
  2. Open Inspect Element and, typically under the Inspector tab, download the HTML to see what data the page provides. Use GenAI to help you extract the data you need.
  3. Open Inspect Element and, typically under the Network tab, check what requests are made to the server.
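
Once a page's HTML has been downloaded, extraction can start small. The sketch below pulls the cell text out of table rows in an HTML string using only regular expressions. It is a simplified illustration, not the scraper's actual approach: extractTableRows is a hypothetical helper, it assumes the data sits in simple, non-nested tables, and a proper HTML parser is likely more robust for real pages.

```typescript
// Extract the text of each <td>/<th> cell, grouped by <tr> row.
// Assumes simple, non-nested tables; nested markup inside cells is stripped.
function extractTableRows(html: string): string[][] {
  const rows: string[][] = [];
  const rowPattern = /<tr(?:\s[^>]*)?>([\s\S]*?)<\/tr>/gi;
  let rowMatch: RegExpExecArray | null;
  while ((rowMatch = rowPattern.exec(html)) !== null) {
    const cells: string[] = [];
    const cellPattern = /<t[dh](?:\s[^>]*)?>([\s\S]*?)<\/t[dh]>/gi;
    let cellMatch: RegExpExecArray | null;
    while ((cellMatch = cellPattern.exec(rowMatch[1])) !== null) {
      const text = cellMatch[1]
        .replace(/<[^>]+>/g, "") // drop tags nested inside the cell
        .replace(/\s+/g, " ") // collapse runs of whitespace
        .trim();
      cells.push(text);
    }
    if (cells.length > 0) {
      rows.push(cells);
    }
  }
  return rows;
}
```

Each inner array holds the cells of one row, so a header row and its data rows can be zipped into objects before being written out as JSON.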