Recently, the ubiquity of mobile devices has led to an increasing demand for public network services, e.g., WiFi hot spots.
As a part of this trend, modern transportation systems are equipped with public WiFi devices to provide Internet access for passengers as people spend a large amount of time on public transportation in their daily life.
However, one of the key issues with public WiFi spots is privacy, owing to their open-access nature.
Existing works have studied either location privacy risk in human traces or privacy leakage in private networks such as cellular networks, based on data from cellular carriers.
To the best of our knowledge, none of these works has focused on bus WiFi privacy based on large-scale real-world data.
In this paper, to explore the privacy risk in bus WiFi systems, we focus on two key questions: how likely it is that bus WiFi users can be uniquely re-identified if partial usage information is leaked, and how we can protect users from the leaked information.
To answer these questions, we conduct a case study on a large-scale bus WiFi system, using a dataset of 20 million connection records and 78 million location records from 770 thousand bus WiFi users over a two-month period.
Technically, we design two models for our uniqueness analysis and protection: a PB-FIND model to quantify the probability that a user can be uniquely re-identified from leaked information,
and a PB-HIDE model to protect users from potentially leaked information.
Specifically, we systematically measure user uniqueness on users’ finger traces (i.e., connection URLs and domains), foot traces (i.e., locations), and hybrid traces (i.e., both finger and foot traces).
Our measurement results reveal that
(i) 97.8% of users can be uniquely re-identified from 4 random domain records of their finger traces, and 96.2% of users can be uniquely re-identified from 5 random locations on buses;
(ii) 98.1% of users can be uniquely re-identified from only 2 random records if both their connection records and locations are leaked to attackers.
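To make the notion of uniqueness concrete, the following is a minimal sketch of one common way to measure it: leak p random records from each user's trace and check whether that user is the only one whose trace contains all of them. It is an illustration under our own assumptions, not the PB-FIND model itself; the function name `uniqueness`, the dictionary layout of `traces`, and the toy data are all hypothetical.

```python
import random
from collections import defaultdict


def uniqueness(traces, p, seed=0):
    """Fraction of users who are the only match for p of their own records.

    traces: dict mapping a user id to a set of hashable, sortable records
            (e.g. domain names, or (stop, hour) tuples). Assumed layout.
    p:      number of records assumed to be leaked per user.
    Users with fewer than p records are skipped.
    """
    rng = random.Random(seed)

    # Inverted index: record -> set of users whose trace contains it.
    index = defaultdict(set)
    for user, records in traces.items():
        for record in records:
            index[record].add(user)

    eligible = 0
    unique = 0
    for user, records in traces.items():
        if len(records) < p:
            continue
        eligible += 1
        # Sort first so that sampling is reproducible for a given seed.
        leaked = rng.sample(sorted(records), p)
        # Users consistent with every leaked record.
        candidates = set.intersection(*(index[r] for r in leaked))
        if candidates == {user}:
            unique += 1

    return unique / eligible if eligible else 0.0


# Toy data (hypothetical): every single record is shared by two users,
# but every pair of records belongs to exactly one user.
demo_traces = {
    "u1": {"a.example", "b.example"},
    "u2": {"a.example", "c.example"},
    "u3": {"b.example", "c.example"},
}

print(uniqueness(demo_traces, 1))  # 0.0
print(uniqueness(demo_traces, 2))  # 1.0
```

In the toy data, one leaked record never singles out a user, while two leaked records always do, which mirrors how uniqueness rises quickly with the number of leaked records.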
Moreover, the evaluation results show that
our PB-HIDE algorithm protects more than 95% of users from the potentially leaked information by inserting only 1.5% synthetic records into the original dataset, thereby preserving data utility.