My group and I have implemented our research contributions as useful software artifacts—in many cases as open-source software on which collaboration and contributions are most welcome. Other software has shipped within larger systems.
ActiveEval
implements a framework for active evaluation [KDD2021 paper] in
open-source Python. It solves the problem of estimating the performance
of a classifier on an unlabeled pool (test set), using labels queried
from an oracle (e.g. an expert or crowdsourcing platform). Several
methods are implemented including passive sampling, stratified sampling,
static importance sampling and adaptive importance sampling. The
importance sampling methods aim to minimise the variance of the
estimated performance measure, and can yield more precise estimates for
a given label budget. Several evaluation measures are currently
supported including accuracy, F-measure, and precision-recall curves.
The package is designed to be extensible.
d-blink
is an open-source Apache Spark package for end-to-end Bayesian
statistical record linkage. The package implements our [Marchant
et al. 2020] extensions to the blink
model of [Steorts
2015]. As a fully Bayesian model, blink
models a
(posterior) distribution over all matchings of records, useful for
propagating uncertainty through downstream analyses. The approach is
also unsupervised, meaning it does not require ground-truth annotations
of examples of matching records to train on. d-blink
makes
a number of non-trivial extensions that facilitate distributed inference
for significantly scaling blink
up to larger datasets.
AFLTeam
is an open-source parallel coverage-guided grebox fuzzing system for
finding software bugs. It employs dynamic task allocation using graph
partitioning and search algorithms to best optimise fuzzing for parallel
computing settings. It implements the ASE’21 paper [Pham
et al. 2021].
Legion
is an open-source concolic (concrete-symbolic) execution tool that
balances fuzzing and symbolic execution for software bug discovery,
using the AI technique of Monte Carlo tree search (MCTS) as detailed in
[Liu et al. 2020].
Fuzzing generates concrete inputs at low cost, without guarantees of
coverage of deep execution paths; symbolic execution can compute inputs
for all paths, but is challenging to scale due to the path explosion
problem.
diffpriv
is an open-source R package for privacy-preserving machine learning and
statistics. The package implements numerous mechanisms for guaranteeing
differential
privacy—the leading framework for releasing data analyses on
privacy-sensitive data in aggregate while protecting privacy of
individuals, deployed Google, Apple and the U.S. Census Bureau, and
awarded the 2017 Gödel Prize in theoretical computer science. The
mechanisms implemented include our Bernstein mechanism for
releasing functions such as classifiers trained on sensitive data [AAAI’17].
Notably diffpriv
also implements our sensitivity
sampler [ICML’17]
for easy privatisation of black-box functions including programs, for
which mathematical analysis required for differential privacy is
difficult or impossible. The package is distributed on CRAN and hosted
on GitHub.
OASIS
is
an open-source Python package for efficient creation of labelled
classification test sets under extreme class imbalance. A common
approach to creating test sets from an unlabelled pool of data is to
sample uniformly points for (expert) labeling, on which a classifier of
interest can be evaluated. However under class imbalance of for example
1:100, thousands of labels must be acquired for
statistically-significant evaluation. OASIS
implements an
asymptotically optimal adaptive importance sampler [VLDB’17]
for actively sampling points for labeling, such that evaluation
can take place under much more extreme class imbalances. The package is
distributed on PyPI and
hosted on GitHub.
affy
is an open-source R package for analysis of Affymetrix GeneChip
microarray data at the probe level. Of the several functions and classes
implemented in affy
, I contributed towards functionality of
gene expression detection, as in the early 2000s I was
interested in machine learning-based approaches for this task.
affy
is among the top-5% most downloaded packages
distributed through Bioconductor, a major open-source
project for high-throughput genomic data.
The following collects datasets that may be used freely for non-commercial, research or educational purposes. I only ask that you cite the paper associated with the dataset, and to consider emailing about your extension so that we might link back. If linking to this page, please use the permanent link
Attribute-Value-Level Matching in the [AAAI’15] paper explores normalisation of attribute values across multiple data sources, where attributes could be categorical, numerical, otherwise and could multi-valued. For example the genre(s) of a movie or the cuisine(s) of a restaurant. To benchmark our statistical approach (based on Canonical Correlation Analysis) against baselines, we crawled and prepared two datasets of four sources each on movie genres (7852 records across IMDB, Rotten Tomatoes, The Movie DB, Yahoo! Movies) and restaurant cuisines (3120 records across Factual, Foursquare, Google Places, Yelp). After performing a simple entity resolution (record linkage) to align matched records across sources, we extracted the attribute to be matched (genres or cuisines respectively). The data here represents keys to the original records, plus attribute values, all in entity-aligned order. Finally, we also used Amazon Mechanical Turk to produce human judgments to evaluate the attribute-value matchings. More details can be found in the download READMEs and the paper.