Scrape university entrance exam score-Haivo

Why

The university entrance exam is one of the most important exams for every Vietnamese student. In 2020, by analyzing exam scores, people discovered abnormal data points and went on to detect serious cheating. The Education Department does not normally publish the full data set. However, anyone can look up a result on several websites by entering a valid candidate ID. This post shows how to use R to send API requests and retrieve the exam scores for the whole country in 2021. The full script and data can be found in my GitHub repository.

How does it work

Several websites provide a web interface for looking up scores. My script uses two data sources: https://diemthi.vnanet.vn and https://tienphong.vn/tra-cuu-diem-thi.tpo.

Under the hood, both websites call an API to retrieve the data, and anyone can inspect these API requests in the browser devtools (Network tab). The API from vnanet.vn is quite slow and returns only 1 result per request. tienphong.vn, on the other hand, returns up to 300 results per request, so I used the latter to get the whole data set.

Both APIs take the student ID as input and return that student's full result. The input has the form {province_code}{student_id}. province_code ranges from 01 to 64, while student_id runs from 1 up to the number of students who sat the exam, zero-padded to six digits as in the example that follows. For example, input = '01000001' is student 1 from the province with code 01 (Ha Noi).
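To make the ID format concrete, here is a minimal sketch that builds IDs from a province code and a student number. The helper name `make_sbd` is my own for illustration, and the six-digit width of the student part is inferred from the '01000001' example rather than from any API documentation.

```r
# Build an 8-character candidate ID:
# 2-digit province code followed by a 6-digit student number (width assumed
# from the '01000001' example).
make_sbd <- function(province, student) {
  sprintf("%02d%06d", province, student)
}

first_id <- make_sbd(1, 1)      # student 1 in province 01 (Ha Noi)
some_ids <- make_sbd(64, 1:3)   # vectorised over the student number
```

Because `sprintf()` is vectorised, the same helper can produce every ID of a province in one call.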

Demonstration

Option 1:

library(tidyverse)
library(glue)
library(httr)
library(here)
get_score <- function(sbd) {
  url <- glue("https://diemthi.vnanet.vn/Home/SearchBySobaodanh?code={ sbd }&nam=2021")
  GET(url) %>% 
    content(type = 'text', encoding = 'UTF-8') %>% 
    jsonlite::fromJSON() %>% 
    .[['result']] %>% 
    as_tibble()
}

get_score('01000001') %>% 
  knitr::kable()

CityCode	CityArea	Code	Toan	NguVan	NgoaiNgu	VatLi	HoaHoc	SinhHoc	KHTN	DiaLi	LichSu	GDCD	KHXH	ResultGroup	Result
01	NA	01000001	2.20	3.50						5.50	2.50			[{"g":"A07","p":10.20},{"g":"C00","p":11.50},{"g":"C03","p":8.20},{"g":"C04","p":11.20}]

Option 2: With this API, I can use sbd = '0100001' to get the results of 10 candidates, from 01000010 to 01000019, in one request.

get_score2 <- function(sbd) {
  
  # prepare the URL
  url <- glue('https://tienphong.vn/api/diemthi/get/result?type=0&keyword={ sbd }&kythi=THPT&nam=2021&cumthi=0')
  
  # send the request, passing the lookup page as referer
  res <- GET(url,
             add_headers('referer' = 'https://tienphong.vn/tra-cuu-diem-thi.tpo')
  ) %>% 
    content(as = 'text') %>% 
    jsonlite::fromJSON()
  
  # the results field holds HTML; convert it to plain text
  res$data$results %>% 
    rvest::read_html() %>% 
    rvest::html_text2()
}

# data received in the form of TSV:
get_score2('0100001') %>% 
  data.table::fread() %>% 
  knitr::kable()

V1	V2	V3	V4	V5	V6	V7	V8	V9	V10	V11	V12
1	1000019	7.0	8.50	8.8	NA	NA	NA	4.00	5.25	6.75	NA
2	1000018	8.8	8.25	NA	8.00	5.00	6.75	NA	NA	NA	NA
3	1000017	7.8	8.00	9.6	7.50	8.00	7.25	NA	NA	NA	NA
4	1000016	7.8	8.50	9.4	NA	NA	NA	6.50	7.50	8.00	NA
5	1000015	6.0	7.75	9.0	NA	NA	NA	4.00	7.75	7.00	NA
6	1000014	7.4	8.00	8.6	NA	NA	NA	6.00	6.25	7.50	NA
7	1000013	7.4	6.75	9.0	NA	NA	NA	3.75	8.50	6.50	NA
8	1000012	6.4	6.75	7.8	NA	NA	NA	5.50	7.00	7.50	NA
9	1000011	6.0	7.75	8.2	NA	NA	NA	3.00	7.25	8.50	NA
10	1000010	8.8	6.25	9.2	8.75	8.75	3.00	NA	NA	NA	NA
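Note that V2 is printed as 1000019 rather than 01000019: fread() guessed an integer column, so the leading zero of the province code is gone. The sketch below shows one way to restore the 8-character ID and split off the province code. It is an illustration on two rows copied from the output above, not necessarily how my full script handles it, and the column names `sbd` and `province_code` are chosen here for the example.

```r
# Two rows copied from the output above; V2 lost its leading zero
# because it was read as an integer.
scores <- data.frame(
  V2 = c(1000019L, 1000018L),
  V3 = c(7.0, 8.8)
)

# Pad back to 8 characters, then take the first two as the province code.
scores$sbd <- sprintf("%08d", scores$V2)
scores$province_code <- substr(scores$sbd, 1, 2)

scores
```

Keeping province_code as a character column matters, because a numeric 1 would not match the code "01".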

In the script, I also use the multiple cores on my machine (4 cores) to send several requests simultaneously with the {furrr} package (a front end to the {future} package). The parallel code runs about 3 times faster than the sequential version (60 minutes for nearly 1 million results).
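A rough sketch of the parallel pattern follows. It is not the exact code from my script: `make_prefixes`, `fetch_block`, and `max_id` are hypothetical names, and `fetch_block` is a stand-in that makes no network call, so the example runs offline. In real use it would be replaced by get_score2().

```r
library(future)
library(furrr)

# Build every 7-digit prefix for one province; each prefix covers 10 IDs.
# `max_id` (highest student number in the province) is a hypothetical input.
make_prefixes <- function(province, max_id) {
  blocks <- 0:(max_id %/% 10)
  sprintf("%02d%05d", province, blocks)
}

# Stand-in for get_score2(): echoes the prefix instead of calling the API.
fetch_block <- function(prefix) {
  paste0("fetched:", prefix)
}

plan(multisession, workers = 4)   # one background R session per core

prefixes <- make_prefixes(province = 1, max_id = 99)
results  <- future_map_chr(prefixes, fetch_block)

plan(sequential)                  # shut the workers down again
```

Each worker handles a share of the prefixes, so the speed-up is bounded by the number of cores and by how quickly the server answers.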

Because the API returns the province code instead of the province name, I need an extra step to get the province names.

library(rvest)
province_code <- read_html("https://diemthi.vnanet.vn/Home/") %>% 
  html_elements("#listCity") %>% 
  html_text2() %>% 
  str_replace_all(pattern = "(\\d{2})", "\n\\1, ") %>% 
  read_csv(skip = 1, col_names = c('province_code', 'province')) %>% 
  mutate(province = stringi::stri_trans_general(province, "Latin-ASCII")) %>% 
  mutate(province = str_remove(province, 'So GDDT '),
         province = str_remove(province, "So GD KHCN "))

province_code

## # A tibble: 64 × 2
##    province_code province       
##    <chr>         <chr>          
##  1 01            Ha Noi         
##  2 02            TP. Ho Chi Minh
##  3 03            Hai Phong      
##  4 04            Da Nang        
##  5 05            Ha Giang       
##  6 06            Cao Bang       
##  7 07            Lai Chau       
##  8 08            Lao Cai        
##  9 09            Tuyen Quang    
## 10 10            Lang Son       
## # … with 54 more rows

Finally, I just need to join the two data sets. This is what the final result looks like:

here::i_am("content/project/2021-08-06-diem-thi-thpt/index.markdown")
path <- "content/project/2021-08-06-diem-thi-thpt"

data <- read_rds(here(path, "data/output.rds"))

data %>% 
  left_join(province_code) %>% 
  head(10) %>% 
  knitr::kable()

province_code	V2	toan	van	nn	li	hoa	sinh	su	dia	gdcd	province
01	1000099	7.8	7.75	7.2	NA	NA	NA	2.75	6.50	7.50	Ha Noi
01	1000098	9.4	7.50	8.4	8.75	8.0	4.75	NA	NA	NA	Ha Noi
01	1000097	3.8	8.25	6.8	NA	NA	NA	4.75	6.75	9.00	Ha Noi
01	1000096	5.8	8.00	9.2	NA	NA	NA	4.25	7.00	7.25	Ha Noi
01	1000095	5.2	6.00	2.6	NA	NA	NA	7.50	8.00	8.50	Ha Noi
01	1000094	3.8	7.00	4.2	NA	NA	NA	3.75	7.25	7.50	Ha Noi
01	1000093	9.4	7.50	NA	8.50	5.5	6.75	NA	NA	NA	Ha Noi
01	1000092	6.2	7.75	4.0	NA	NA	NA	3.75	8.00	8.25	Ha Noi
01	1000091	6.0	6.50	3.8	NA	NA	NA	4.50	6.00	8.00	Ha Noi
01	1000090	8.4	8.00	9.2	NA	NA	NA	6.25	8.50	8.50	Ha Noi
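As a sanity check on the join, the sketch below spells out the key and looks for scores whose code found no province name. The two small tables are stand-ins built from values shown in the outputs above, and the names `scores`, `provinces`, and `unmatched` are illustrative.

```r
library(dplyr)

# Small stand-ins for the two tables; values copied from the outputs above.
scores <- tibble(
  province_code = c("01", "01"),
  V2   = c(1000099L, 1000098L),
  toan = c(7.8, 9.4)
)
provinces <- tibble(
  province_code = c("01", "02"),
  province      = c("Ha Noi", "TP. Ho Chi Minh")
)

# Explicit key; both columns must be character for "01" to match "01".
joined <- left_join(scores, provinces, by = "province_code")

# Rows whose code found no province name; zero rows means a clean join.
unmatched <- anti_join(scores, provinces, by = "province_code")
```

Naming the key with `by` also silences the message dplyr prints when it has to guess the join column.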
