Scraping university entrance exam scores

Posted by Hai Vo on Friday, August 6, 2021

Why

The university entrance exam is one of the most important exams for every Vietnamese student. In 2020, people analyzing the exam scores discovered abnormal data points, which led to the detection of serious cheating. The Education Department does not normally publish the full data set. However, anyone can look up an exam result on several websites by entering a valid candidate ID. This post shows how to use R to send API requests and retrieve the exam scores for the whole country in 2021. The full script and data can be found in my GitHub repository.

How does it work

Several websites provide a web interface for looking up scores. In my script, I use two data sources: https://diemthi.vnanet.vn and https://tienphong.vn/tra-cuu-diem-thi.tpo.

Under the hood, both websites call an API to retrieve the data, and you can inspect these API requests directly in the browser devtools (Network tab). The API from vnanet.vn is quite slow and only returns 1 result per request. On the other hand, tienphong.vn returns up to 300 results per request, so I decided to use the latter to get the whole data set.

Both APIs take the student ID as input and return that student's full result as output. The input has the form {province_code}{student_id}. province_code ranges from 01 to 64, while student_id runs from 1 up to the maximum number of students who took the exam, zero-padded to six digits. For example, input = '01000001' is student 1 from the province with code 01 (Ha Noi).
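
The ID format described above can be sketched with base R's sprintf(). The helper name make_sbd and the six-digit width of the student part are assumptions inferred from the example '01000001'; neither API documents them.

```r
# Build a candidate ID: 2-digit province code followed by a 6-digit,
# zero-padded student number (widths inferred from the example '01000001').
make_sbd <- function(province_code, student_id) {
  sprintf("%02d%06d", as.integer(province_code), as.integer(student_id))
}

make_sbd(1, 1)      # student 1 in province 01 (Ha Noi)
make_sbd(64, 1:3)   # vectorised over student numbers
```

Keeping the IDs as character strings matters here, because a numeric ID would lose the leading zero of the province code.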

Demonstration

Option 1:

library(tidyverse)
library(glue)
library(httr)
library(here)
get_score <- function(sbd) {
  url <- glue("https://diemthi.vnanet.vn/Home/SearchBySobaodanh?code={ sbd }&nam=2021")
  GET(url) %>% 
    content(type = 'text', encoding = 'UTF-8') %>% 
    jsonlite::fromJSON() %>% 
    .[['result']] %>% 
    as_tibble()
}

get_score('01000001') %>% 
  knitr::kable()
CityCode CityArea Code Toan NguVan NgoaiNgu VatLi HoaHoc SinhHoc KHTN DiaLi LichSu GDCD KHXH ResultGroup Result
01 NA 01000001 2.20 3.50 5.50 2.50 [{"g":"A07","p":10.20},{"g":"C00","p":11.50},{"g":"C03","p":8.20},{"g":"C04","p":11.20}]
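
The last field of this output is itself a JSON array, so it can be unpacked a second time. The snippet below is a small sketch using the string copied from the output above; reading "g" as a subject-group code and "p" as the total points for that group is an interpretation, not something the API states.

```r
library(jsonlite)

# JSON string copied from the output above.
# Assumption: "g" is a subject-group code, "p" the total points for that group.
result_group <- '[{"g":"A07","p":10.20},{"g":"C00","p":11.50},{"g":"C03","p":8.20},{"g":"C04","p":11.20}]'

# An array of flat objects is simplified to a data frame with columns g and p
groups <- fromJSON(result_group)
groups
```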

Option 2: With this API, I can pass an ID prefix such as sbd = '0100001' to get the results of 10 candidates, from 01000010 to 01000019, in one request.

get_score2 <- function(sbd) {

  # prepare the URL
  url <- glue('https://tienphong.vn/api/diemthi/get/result?type=0&keyword={ sbd }&kythi=THPT&nam=2021&cumthi=0')

  # send the request, with the lookup page as referer, and parse the JSON response
  a <- GET(url,
           add_headers('referer' = 'https://tienphong.vn/tra-cuu-diem-thi.tpo')
  ) %>%
    content(as = 'text', encoding = 'UTF-8') %>%
    jsonlite::fromJSON()

  # the results arrive as an HTML fragment; convert it to plain text
  a$data$results %>%
    rvest::read_html() %>%
    rvest::html_text2()
}

# data received in the form of TSV:
get_score2('0100001') %>% 
  data.table::fread() %>% 
  knitr::kable()
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
1 1000019 7.0 8.50 8.8 NA NA NA 4.00 5.25 6.75 NA
2 1000018 8.8 8.25 NA 8.00 5.00 6.75 NA NA NA NA
3 1000017 7.8 8.00 9.6 7.50 8.00 7.25 NA NA NA NA
4 1000016 7.8 8.50 9.4 NA NA NA 6.50 7.50 8.00 NA
5 1000015 6.0 7.75 9.0 NA NA NA 4.00 7.75 7.00 NA
6 1000014 7.4 8.00 8.6 NA NA NA 6.00 6.25 7.50 NA
7 1000013 7.4 6.75 9.0 NA NA NA 3.75 8.50 6.50 NA
8 1000012 6.4 6.75 7.8 NA NA NA 5.50 7.00 7.50 NA
9 1000011 6.0 7.75 8.2 NA NA NA 3.00 7.25 8.50 NA
10 1000010 8.8 6.25 9.2 8.75 8.75 3.00 NA NA NA NA
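
Since the keyword appears to work as a prefix match, covering a whole province comes down to listing the distinct prefixes to request. The sketch below does this in base R; the function name sbd_prefixes, the six-digit student width, and the max_id values are illustrative assumptions.

```r
# List the distinct ID prefixes needed to cover students 1..max_id of one
# province. A 7-character prefix matches up to 10 IDs, a 6-character one up to 100.
sbd_prefixes <- function(province_code, max_id, prefix_len = 7) {
  ids <- sprintf("%02d%06d", as.integer(province_code), seq_len(max_id))
  unique(substr(ids, 1, prefix_len))
}

sbd_prefixes(1, 25)                       # prefixes covering 01000001..01000025
length(sbd_prefixes(1, 1000, prefix_len = 6))  # number of requests with shorter prefixes
```

A shorter prefix means fewer requests. Given the limit of 300 results per request mentioned earlier, a 6-character prefix (at most 100 IDs) should still fit in one response, though that is untested here.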

In the script, I also use the multiple cores of my machine (4 cores) to send several requests simultaneously with the {furrr} package (a front-end to the {future} package). The parallel code runs around 3 times faster than the sequential version (about 60 minutes for nearly 1 million results).
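
As a rough illustration of the parallel pattern, here is a minimal sketch with {furrr}. It is not the script from the repository: fetch_stub is a hypothetical stand-in for a real request function such as get_score2(), so the example runs without touching the network, and the number of workers is arbitrary.

```r
library(future)
library(furrr)

# Hypothetical stand-in for a real request function such as get_score2();
# it only echoes its input so the sketch runs offline.
fetch_stub <- function(prefix) {
  data.frame(prefix = prefix, n_char = nchar(prefix))
}

plan(multisession, workers = 2)   # start 2 background R sessions

prefixes <- c("0100000", "0100001", "0100002", "0100003")

# future_map() spreads the calls across the workers and keeps the input order
pieces  <- future_map(prefixes, fetch_stub)
results <- do.call(rbind, pieces)

plan(sequential)                  # release the workers again
results
```

With real requests, it is probably worth wrapping the fetch function in purrr::possibly() so that one failed request does not abort the whole run.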

Because the API returns the province code instead of the province name, I need an extra step to get the province names.

library(rvest)
province_code <- read_html("https://diemthi.vnanet.vn/Home/") %>% 
  html_elements("#listCity") %>% 
  html_text2() %>% 
  str_replace_all(pattern = "(\\d{2})", "\n\\1, ") %>% 
  read_csv(skip = 1, col_names = c('province_code', 'province')) %>% 
  mutate(province = stringi::stri_trans_general(province, "Latin-ASCII")) %>% 
  mutate(province = str_remove(province, 'So GDDT '),
         province = str_remove(province, "So GD KHCN "))

province_code
## # A tibble: 64 × 2
##    province_code province       
##    <chr>         <chr>          
##  1 01            Ha Noi         
##  2 02            TP. Ho Chi Minh
##  3 03            Hai Phong      
##  4 04            Da Nang        
##  5 05            Ha Giang       
##  6 06            Cao Bang       
##  7 07            Lai Chau       
##  8 08            Lao Cai        
##  9 09            Tuyen Quang    
## 10 10            Lang Son       
## # … with 54 more rows

Finally, I just need to join the two data sets. This is what the final result looks like:

here::i_am("content/project/2021-08-06-diem-thi-thpt/index.markdown")
path <- "content/project/2021-08-06-diem-thi-thpt"

data <- read_rds(here(path, "data/output.rds"))

data %>% 
  left_join(province_code, by = "province_code") %>% 
  head(10) %>% 
  knitr::kable()
province_code V2 toan van nn li hoa sinh su dia gdcd province
01 1000099 7.8 7.75 7.2 NA NA NA 2.75 6.50 7.50 Ha Noi
01 1000098 9.4 7.50 8.4 8.75 8.0 4.75 NA NA NA Ha Noi
01 1000097 3.8 8.25 6.8 NA NA NA 4.75 6.75 9.00 Ha Noi
01 1000096 5.8 8.00 9.2 NA NA NA 4.25 7.00 7.25 Ha Noi
01 1000095 5.2 6.00 2.6 NA NA NA 7.50 8.00 8.50 Ha Noi
01 1000094 3.8 7.00 4.2 NA NA NA 3.75 7.25 7.50 Ha Noi
01 1000093 9.4 7.50 NA 8.50 5.5 6.75 NA NA NA Ha Noi
01 1000092 6.2 7.75 4.0 NA NA NA 3.75 8.00 8.25 Ha Noi
01 1000091 6.0 6.50 3.8 NA NA NA 4.50 6.00 8.00 Ha Noi
01 1000090 8.4 8.00 9.2 NA NA NA 6.25 8.50 8.50 Ha Noi
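
For readers who want to try the join without the full data set, here is a toy version. The two small tables and their values are made up for illustration, and deriving province_code from the first two characters of the ID is an assumption about how that column was built.

```r
library(dplyr)

# Made-up stand-ins for the scraped scores and the province lookup table
scores <- tibble(sbd = c("01000001", "03000002"), toan = c(7.8, 6.4))
lookup <- tibble(province_code = c("01", "03"),
                 province = c("Ha Noi", "Hai Phong"))

joined <- scores %>%
  mutate(province_code = substr(sbd, 1, 2)) %>%   # first two characters = province code
  left_join(lookup, by = "province_code")

joined
```

Both key columns need to be character vectors: if the ID has been parsed as a number, the leading zero is gone and the first two characters no longer identify the province.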