174 lines
6.3 KiB
Plaintext
174 lines
6.3 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Exercises in Algorithm Design"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Issues to Consider\n",
|
|
"\n",
|
|
"- How many queries are required, and what kind of composition can we use?\n",
|
|
" - Is parallel composition possible?\n",
|
|
" - Should we use sequential composition, advanced composition, or a variant of differential privacy?\n",
|
|
"- Can we use the sparse vector technique?\n",
|
|
"- Can we use the exponential mechanism?\n",
|
|
"- How should we distribute the privacy budget?\n",
|
|
"- If there are unbounded sensitivities, how can we bound them?\n",
|
|
"- Would synthetic data help?\n",
|
|
"- Would post-processing to \"de-noise\" help?"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## 1. Generalized Sample and Aggregate\n",
|
|
"\n",
|
|
"Design a variant of sample and aggregate which does *not* require the analyst to specify the output range of the query function $f$.\n",
|
|
"\n",
|
|
"**Ideas**: use SVT to find good upper and lower bounds on $f(x)$ for the whole dataset first. The result of $clip(f(x), lower, upper)$ has bounded sensitivity, so we can use this query with SVT. Then use sample and aggregate with these upper and lower bounds."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## 2. Summary Statistics\n",
|
|
"\n",
|
|
"Design an algorithm to produce differentially private versions of the following statistics:\n",
|
|
"\n",
|
|
"- Mean: $\\mu = \\frac{1}{n} \\sum_{i=1}^n x_i$\n",
|
|
"- Variance: $var = \\frac{1}{n} \\sum_{i=1}^n (x_i - \\mu)^2$\n",
|
|
"- Standard deviation: $\\sigma = \\sqrt{\\frac{1}{n} \\sum_{i=1}^n (x_i - \\mu)^2}$\n",
|
|
"\n",
|
|
"**Ideas**:\n",
|
|
"\n",
|
|
"**Mean**\n",
|
|
"\n",
|
|
"1. Use SVT to find upper and lower clipping bounds\n",
|
|
"2. Compute noisy sum and count, and derive mean by post-processing\n",
|
|
"\n",
|
|
"**Variance**\n",
|
|
"\n",
|
|
"1. Split it into a count query ($\\frac{1}{n}$ - we have the answer from above) and a sum query\n",
|
|
"2. What's the sensitivity of $\\sum_{i=1}^n (x_i - \\mu)^2$? It's $b^2$; we can clip and compute $\\sum_{i=1}^n (x_i - \\mu)^2$, then multiply by (1) by post processing\n",
|
|
"\n",
|
|
"**Standard Deviation**\n",
|
|
"\n",
|
|
"1. Just take the square root of variance\n",
|
|
"\n",
|
|
"Total queries:\n",
|
|
"- Lower clipping bound (SVT)\n",
|
|
"- Upper clipping bound (SVT)\n",
|
|
"- Noisy sum (mean)\n",
|
|
"- Noisy count\n",
|
|
"- Noisy sum (variance)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## 3. Heavy Hitters\n",
|
|
"\n",
|
|
"Google's RAPPOR system {cite}`rappor` is designed to find the most popular settings for Chrome's home page. Design an algorithm which:\n",
|
|
"\n",
|
|
"- Given a list of the 10,000 most popular web pages by traffic,\n",
|
|
"- Determines the top 10 most-popular home pages out of the 10,000 most popular web pages\n",
|
|
"\n",
|
|
"\n",
|
|
"**Ideas**: Use parallel composition and take the noisy top 10"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## 4. Hierarchical Queries\n",
|
|
"\n",
|
|
"Design an algorithm to produce summary statistics for the U.S. Census. Your algorithm should produce total population counts at the following levels:\n",
|
|
"\n",
|
|
"- Census tract\n",
|
|
"- City / town\n",
|
|
"- ZIP Code\n",
|
|
"- County\n",
|
|
"- State\n",
|
|
"- USA\n",
|
|
"\n",
|
|
"**Ideas**:\n",
|
|
"\n",
|
|
"Idea 1: *Only* compute the bottom level (census tract), using parallel composition. Add up all the tract counts to get the city counts, and so on up the hierarchy. Advantage: lowers privacy budget.\n",
|
|
"\n",
|
|
"Idea 2: Compute counts at all levels, using parallel composition for each level. Tune the budget split using real data; probably we need more accuracy for the smaller levels of the hierarchy.\n",
|
|
"\n",
|
|
"Idea 3: As (2), but also use post-processing to re-scale lower levels of the hierarchy based on higher ones; truncate counts to whole numbers; move negative counts to 0."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## 5. Workloads of Range Queries\n",
|
|
"\n",
|
|
"Design an algorithm to accurately answer a workload of *range queries*. Range queries are queries on a single table of the form: \"how many rows have a value that is between $a$ and $b$?\" (i.e. the count of rows which lie in a specific range). \n",
|
|
"\n",
|
|
"### Part 1\n",
|
|
"The whole workload is pre-specified as a finite sequence of ranges: $\\{(a_1, b_1), \\dots, (a_k, b_k)\\}$, and \n",
|
|
"\n",
|
|
"### Part 2\n",
|
|
"The length of the workload $k$ is pre-specified, but queries arrive in a streaming fashion and must be answered as they arrive.\n",
|
|
"\n",
|
|
"### Part 3\n",
|
|
"The workload may be infinite.\n",
|
|
"\n",
|
|
"**Ideas**:\n",
|
|
"\n",
|
|
"Just run each query with sequential composition.\n",
|
|
"\n",
|
|
"For part 1, combine them so we can use $L2$ sensitivity. When $k$ is large, this will work well with Gaussian noise.\n",
|
|
"\n",
|
|
"Or, build synthetic data:\n",
|
|
"\n",
|
|
"- For each range $(i, i+1)$, find a count (parallel composition). This is a synthetic data representation! We can answer infinitely many queries by adding up the counts of all the segments in this histogram which are contained in the desired interval.\n",
|
|
"- For part 2, use SVT\n",
|
|
"\n",
|
|
"For SVT: for each query in the stream, ask how far the real answer is from the synthetic data answer. If it's far, query the real answer's range (as a histogram, using parallel composition) and update the synthetic data. Otherwise just give the synthetic data answer. This way you *ONLY* pay for updates to the synthetic data."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.9.9"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
}
|