With colleagues Teague and Culnane, I helped uncover one of the largest privacy breaches in Australian history over 2016–17. In mid-2016 the federal health department released an open dataset of 30 years of Medicare and Pharmaceutical Benefits Scheme (MBS/PBS) transaction records for 10% of the Australian population. The intention was to drive health economics research for evidence-based policy development. Unfortunately only minimal privacy protections were in place, while the data recorded sensitive treatments, for example for AIDS or late-term abortions. We first completely reidentified doctors, due to improper hashing of their IDs. As a result the dataset was taken offline and the Department released a public statement, but copies already downloaded could not be recalled. A year later we announced we had reidentified patients, including well-known figures in Australian sport and politics.
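The weakness behind the provider reidentification is easy to illustrate in principle: an identifier drawn from a small space and hashed without a secret key can be reversed by exhaustive search. The sketch below is illustrative only and assumes a hypothetical unsalted MD5 over numeric provider IDs, not the Department's actual encoding scheme.

```python
import hashlib

def hash_id(provider_id: int) -> str:
    """Hypothetical unkeyed hash of a numeric provider ID (illustrative only)."""
    return hashlib.md5(str(provider_id).encode()).hexdigest()

def reverse_by_brute_force(target_hash, id_space):
    """Recover the ID by hashing every candidate in the (small) ID space."""
    for candidate in id_space:
        if hash_id(candidate) == target_hash:
            return candidate
    return None

# A 6-digit ID space has only 10**6 candidates -- trivially searchable.
observed = hash_id(123456)
print(reverse_by_brute_force(observed, range(10**6)))  # -> 123456
```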
The day after the dataset's retraction, the Attorney-General announced a plan to legislate against reidentification of Commonwealth datasets. In the months that followed, the Reidentification Criminal Offence Bill (an amendment to the Privacy Act 1988) was introduced to Parliament, criminalising the act of reidentification without prior permission. The bill, if passed, would apply retroactively and reverse the burden of proof onto the accused. While stifling security experts and journalists responsibly disclosing existing privacy breaches to the government, the bill would not prevent private corporations or foreign entities outside Australian jurisdiction from misusing Commonwealth data. Of 15 submissions to the ensuing Parliamentary Inquiry examining the appropriateness of the bill, 14 were against it, including those of the Law Council of Australia, the Australian Bankers’ Association, and the EFF. Our submission to the inquiry achieved significant impact, being directly quoted 9 times in the Senate Committee’s final report. We also wrote an op-ed in the Sydney Morning Herald clearly explaining why criminalising reidentification would do more harm than good.
Colleagues Culnane, Teague and I demonstrated a significant reidentification of public transport users in a 2018 datathon dataset of fine-grained touch-on/touch-off events on Victorian public transport (the Myki card). We were able to find ourselves, triangulate a colleague based on a single co-travelling event, and identify a State MP through linking with social media posts. The privacy breach was significant: from the release it would be easy for stalkers to learn when children travel alone, where survivors of domestic abuse now live, and more. The Victorian privacy regulator OVIC’s investigation found PTV had broken state laws.
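The underlying attack is simple linkage: a handful of known trips (remembered, posted to social media, or shared with the target) typically matches only one card in the data. A minimal sketch of the idea, over a hypothetical table of touch-on events; the card IDs, stops and time window below are made up for illustration.

```python
from datetime import datetime

# Hypothetical touch-on events: (card_id, stop, timestamp).
events = [
    ("card_0192", "Flinders St", datetime(2018, 7, 3, 8, 12)),
    ("card_0192", "Parliament", datetime(2018, 7, 5, 17, 40)),
    ("card_4411", "Flinders St", datetime(2018, 7, 3, 9, 2)),
    # ... millions more rows ...
]

# Trips known about the target, e.g. from social media posts or co-travel.
known_trips = [
    ("Flinders St", datetime(2018, 7, 3, 8, 15)),
    ("Parliament", datetime(2018, 7, 5, 17, 45)),
]

def candidate_cards(events, known_trips, window_minutes=15):
    """Return cards consistent with every known trip; often a unique match."""
    candidates = None
    for stop, when in known_trips:
        matching = {
            card for card, s, t in events
            if s == stop and abs((t - when).total_seconds()) <= window_minutes * 60
        }
        candidates = matching if candidates is None else candidates & matching
    return candidates

print(candidate_cards(events, known_trips))  # -> {'card_0192'}
```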
With colleagues Ohrimenko, Culnane and Teague, I have contributed to several technical privacy assessments of government data initiatives and private sector projects. Contracted by the Australian Bureau of Statistics (ABS), we have for example analysed the privacy of several options for name encoding for private record linkage—as might be used for Australian Census data. For Transport for NSW, again under contract, we performed a technical privacy assessment of a Data61-processed dataset of Opal transport card touch-ons/offs across buses, trains and ferries; the data has subsequently been published. In a third privacy assessment, we discovered vulnerabilities in the hashing methodology published by the UK Office for National Statistics (explained here). These are non-exhaustive examples of technical assessments performed by the group. Common themes in this work are reflected in our 2018 report for the Office of the Victorian Information Commissioner. These and broader issues around robustness of AI are summarised in an OVIC book chapter.
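One widely studied family of name encodings for private record linkage is Bloom-filter encoding of name n-grams, which supports approximate matching directly on encoded values. The sketch below is a generic textbook construction, not necessarily one of the specific options assessed for the ABS; the filter size, hash count and hash choice are illustrative assumptions.

```python
import hashlib

def bigrams(name: str):
    """Character 2-grams of a (padded) name."""
    padded = f"_{name.lower()}_"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def bloom_encode(name: str, m: int = 64, k: int = 4):
    """Encode a name as the set of Bloom-filter bit positions set by its bigrams."""
    bits = set()
    for gram in bigrams(name):
        for seed in range(k):
            digest = hashlib.sha256(f"{seed}|{gram}".encode()).hexdigest()
            bits.add(int(digest, 16) % m)
    return bits

def dice_similarity(a, b) -> float:
    """Dice coefficient between two encodings, used for approximate matching."""
    return 2 * len(a & b) / (len(a) + len(b))

# Similar names yield similar encodings, enabling linkage without raw names.
print(dice_similarity(bloom_encode("Katherine"), bloom_encode("Catherine")))
print(dice_similarity(bloom_encode("Katherine"), bloom_encode("Rubinstein")))
```

The privacy risk our assessments probe is precisely that such encodings leak n-gram frequency information, which is what makes careful parameter choices and threat modelling necessary.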
In 2020, the COVID-19 pandemic swept across the world with devastating consequences. An important strategy for slowing the spread of the coronavirus is contact tracing, which traditionally had been manual, laborious but accurate. Naturally many governments hoped that high uptake of Bluetooth-enabled smartphones could be leveraged for automated contact tracing, supporting over-burdened human contact tracers if and when COVID-19 put strain on health systems. In March 2020 Australia opted to adopt and adapt Singapore’s Bluetooth contact tracing system TraceTogether in its app COVIDSafe. While the contact tracing effectiveness of the system has since come into focus, many tech commentators and researchers identified seemingly unnecessary compromises in the system’s privacy provisions. I reported on these privacy critiques with colleagues Farokhi (Melbourne), Asghar and Kaafar (Macquarie), in a report that was cited by the government’s COVIDSafe PIA. Like many other experts, we aimed to highlight the flexibility and greater privacy of decentralised approaches. With colleagues Leins and Culnane (Melbourne), we wrote in the MJA on a broader range of techno-legal and ethical dilemmas for automated contact tracing. For more up-to-date detail on the COVIDSafe implementation, interested readers should check out the thorough posts by Vanessa et al.
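The privacy argument for decentralised designs can be made concrete: phones broadcast short-lived random tokens, and matching against tokens later published for diagnosed users happens on-device, so no central server learns the social graph. The toy below captures only that general idea (rotating random tokens, local matching); it is not the actual COVIDSafe, TraceTogether, or Google/Apple Exposure Notification protocol.

```python
import os

class Phone:
    def __init__(self):
        self.my_tokens = []        # tokens this phone has broadcast
        self.heard_tokens = set()  # tokens heard from nearby phones

    def new_broadcast_token(self) -> bytes:
        """Rotate to a fresh random token carrying no identity or contact info."""
        token = os.urandom(16)
        self.my_tokens.append(token)
        return token

    def hear(self, token: bytes):
        self.heard_tokens.add(token)

    def check_exposure(self, published_tokens) -> bool:
        """Match locally against tokens published for diagnosed users."""
        return bool(self.heard_tokens & published_tokens)

alice, bob = Phone(), Phone()
bob.hear(alice.new_broadcast_token())   # Alice and Bob were nearby
published = set(alice.my_tokens)        # Alice tests positive, uploads her tokens
print(bob.check_exposure(published))    # True -- computed on Bob's phone only
```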
Data linkage has many public interest uses which demand accuracy but also proper representation of uncertainty. The framework of Bayesian probability is ideal for incorporating uncertainty into data linking; however, techniques in the framework are often difficult to scale to large or even moderately sized datasets. With the Australian Bureau of Statistics and the U.S. Census Bureau, student Marchant (Melbourne), colleagues Steorts (Duke), Kaplan (Colorado State) and I adapted the blink Bayesian model of Steorts for data linking to be amenable to parallel inference. Our resulting unsupervised end-to-end Bayesian record linkage system d-blink, open-sourced and running on Apache Spark, enjoys efficiency gains of up to 300x. Maintaining uncertainty throughout the linkage process enables uncertainty propagation to downstream data analysis tasks, potentially leading to accuracy gains. Tests on ABS and US Census data (State of Wyoming: 2010 Decennial Census with administrative records from the Social Security Administration’s Numident) are reported in the JCGS paper.
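Two ingredients that matter at scale can be illustrated in a toy form: partitioning (blocking) records so that only plausible pairs are compared, potentially in parallel, and reporting a probability of match rather than a hard link so that uncertainty can be propagated downstream. This sketch uses a made-up scoring rule over field agreements; it is not the blink/d-blink Bayesian model or its Spark implementation.

```python
from collections import defaultdict
from itertools import product
import math

records_a = [{"id": "a1", "surname": "smith", "year": 1980},
             {"id": "a2", "surname": "jones", "year": 1975}]
records_b = [{"id": "b1", "surname": "smyth", "year": 1980},
             {"id": "b2", "surname": "jones", "year": 1976}]

def block(records, key=lambda r: r["surname"][0]):
    """Partition records by a cheap key so blocks can be processed in parallel."""
    blocks = defaultdict(list)
    for r in records:
        blocks[key(r)].append(r)
    return blocks

def match_probability(r, s) -> float:
    """Toy logistic score over field agreements (illustrative weights only)."""
    score = (2.0 * (r["surname"] == s["surname"])
             + 1.0 * (r["year"] == s["year"])
             - 1.5)
    return 1 / (1 + math.exp(-score))

blocks_a, blocks_b = block(records_a), block(records_b)
for key in blocks_a.keys() & blocks_b.keys():
    for r, s in product(blocks_a[key], blocks_b[key]):
        # Keep the probability itself rather than thresholding to a hard link,
        # so downstream analyses can account for linkage uncertainty.
        print(r["id"], s["id"], round(match_probability(r, s), 2))
```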
Over 2010–13, I was one of two researchers, alongside a small handful of developers, building a production system for data integration—an application of machine learning in databases that leveraged our research at Microsoft, e.g., [VLDB’12]. The system shipped multiple times internally (resulting in 4x ShipIt! awards for sustained product transfer). Notable applications were to the Bing search engine across multiple verticals, and the Xbox game console. After the 2011/12 refresh, in which our data integration was a key contribution from Research, Xbox revenue increased by several hundred million dollars (due to increased sales of consoles and Xbox Live subscriptions). Within Microsoft Research, this impact was attributed to our small team. In Bing’s social vertical, our system matched over one billion records daily. I continue to work on data integration at Melbourne.
Through 2016, my group and colleague Bailey collaborated with the Austin Hospital’s transplantation unit on predicting outcomes (graft failure) of liver transplantation for Australian demographics. With machine learning-based approaches, PhD student Yamuna Kankanige improved the predictive accuracy of the Donor Risk Index by over 20% [Transplantation’17]—a risk score widely used by Australian surgeons today in planning transplants and follow-up interventions.
In 2011, with Narayanan (now Princeton) and Shi (now Cornell), I helped demonstrate the power of privacy attacks to Kaggle (the crowdsourced machine learning platform, then a $16m Series A company, since acquired by Google) [IJCNN’11]. After determining the source of an anonymised social network dataset intended for use in a link prediction contest, we downloaded the original network and linked it to the competition test set. Normally a linkage attack would end there, having reidentified users. Instead we used the linkage to look up correct test answers and win the competition by ‘cheating’. No privacy breach resulted and contestants remained able to compete. However the result raised Kaggle’s awareness of the stark reality of privacy attacks. Team member Narayanan subsequently consulted on the privacy of the $3m Heritage Health Prize dataset.
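The ‘cheat’ itself is trivial once the linkage is in hand: given a mapping from anonymised contest nodes to nodes in the crawled public network, each test edge can simply be looked up rather than predicted. A minimal sketch, assuming a hypothetical mapping and edge set (the hard deanonymisation step is not shown):

```python
# Hypothetical output of the deanonymisation step:
# anonymised contest node -> real node in the crawled public network.
mapping = {"n17": "alice", "n42": "bob", "n99": "carol"}

# Edges observed in the crawled public network.
true_edges = {("alice", "bob"), ("carol", "alice")}

def predict(u: str, v: str) -> int:
    """Answer a link-prediction test query by lookup instead of prediction."""
    a, b = mapping.get(u), mapping.get(v)
    if a is None or b is None:
        return 0  # fall back to a guess for unmapped nodes
    return int((a, b) in true_edges or (b, a) in true_edges)

test_pairs = [("n17", "n42"), ("n42", "n99")]
print([predict(u, v) for u, v in test_pairs])  # -> [1, 0]
```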
With a Berkeley group led by Dawn Song [report], I helped improve the security of Mozilla’s open-source development processes. While open-source projects tend to improve system security through the principle of ‘many eyes’, Mozilla was publishing security-related commits to the public Firefox web browser source repository, often a month before those commits would be automatically pushed to users. We trained a learning-based ranker to predict which commits were more likely security-related. Using such a ranker, an attacker could easily sift through a few top-ranked commits by hand to find zero-day exploits, on average a month prior to patching. As a result of our work, Mozilla made security-related commits private until they were published as patches.
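The ranking approach can be sketched simply: featurise each commit (here just the message text, as an assumption for illustration; the study's actual feature set was richer than this) and train a classifier whose scores order commits by likelihood of being security-related. A minimal scikit-learn sketch with made-up example commits:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labelled history: (commit message, is_security_related).
history = [
    ("fix use-after-free in DOM parser", 1),
    ("harden bounds check in image decoder", 1),
    ("update localisation strings", 0),
    ("refactor build scripts", 0),
]
messages, labels = zip(*history)

# TF-IDF text features + logistic regression as a simple learning-based ranker.
ranker = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
ranker.fit(messages, labels)

new_commits = ["bump version number", "guard against integer overflow in URL parser"]
scores = ranker.predict_proba(new_commits)[:, 1]
# Rank new commits by predicted probability of being security-related.
for msg, score in sorted(zip(new_commits, scores), key=lambda t: -t[1]):
    print(f"{score:.2f}  {msg}")
```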
Note: view estimates are wildly approximate, sometimes “cumulative” overestimates.
I have co-authored submissions to 12 policy and legislation consultations and inquiries run by government departments and agencies. These responses highlight challenges and best-practice solutions in data privacy and AI.
As a member of the Australian Academy of Science’s NCICS, I contributed to the Digital Futures strategic plan.
Since arriving at the University of Melbourne in October 2013, I have been awarded competitive funding (Category 1–4) of $10.94m in total: $7.45m as lead CI, and $3.17m on a per-CI basis. Funding includes:
I have also served on many more committees and working groups at departmental, faculty and university levels.