GP/notebooks/ch11.ipynb

174 lines
6.3 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercises in Algorithm Design"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Issues to Consider\n",
"\n",
"- How many queries are required, and what kind of composition can we use?\n",
" - Is parallel composition possible?\n",
" - Should we use sequential composition, advanced composition, or a variant of differential privacy?\n",
"- Can we use the sparse vector technique?\n",
"- Can we use the exponential mechanism?\n",
"- How should we distribute the privacy budget?\n",
"- If there are unbounded sensitivities, how can we bound them?\n",
"- Would synthetic data help?\n",
"- Would post-processing to \"de-noise\" help?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Generalized Sample and Aggregate\n",
"\n",
"Design a variant of sample and aggregate which does *not* require the analyst to specify the output range of the query function $f$.\n",
"\n",
"**Ideas**: use SVT to find good upper and lower bounds on $f(x)$ for the whole dataset first. The result of $clip(f(x), lower, upper)$ has bounded sensitivity, so we can use this query with SVT. Then use sample and aggregate with these upper and lower bounds."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Summary Statistics\n",
"\n",
"Design an algorithm to produce differentially private versions of the following statistics:\n",
"\n",
"- Mean: $\\mu = \\frac{1}{n} \\sum_{i=1}^n x_i$\n",
"- Variance: $var = \\frac{1}{n} \\sum_{i=1}^n (x_i - \\mu)^2$\n",
"- Standard deviation: $\\sigma = \\sqrt{\\frac{1}{n} \\sum_{i=1}^n (x_i - \\mu)^2}$\n",
"\n",
"**Ideas**:\n",
"\n",
"**Mean**\n",
"\n",
"1. Use SVT to find upper and lower clipping bounds\n",
"2. Compute noisy sum and count, and derive mean by post-processing\n",
"\n",
"**Variance**\n",
"\n",
"1. Split it into a count query ($\\frac{1}{n}$ - we have the answer from above) and a sum query\n",
"2. What's the sensitivity of $\\sum_{i=1}^n (x_i - \\mu)^2$? It's $b^2$; we can clip and compute $\\sum_{i=1}^n (x_i - \\mu)^2$, then multiply by (1) by post processing\n",
"\n",
"**Standard Deviation**\n",
"\n",
"1. Just take the square root of variance\n",
"\n",
"Total queries:\n",
"- Lower clipping bound (SVT)\n",
"- Upper clipping bound (SVT)\n",
"- Noisy sum (mean)\n",
"- Noisy count\n",
"- Noisy sum (variance)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Heavy Hitters\n",
"\n",
"Google's RAPPOR system {cite}`rappor` is designed to find the most popular settings for Chrome's home page. Design an algorithm which:\n",
"\n",
"- Given a list of the 10,000 most popular web pages by traffic,\n",
"- Determines the top 10 most-popular home pages out of the 10,000 most popular web pages\n",
"\n",
"\n",
"**Ideas**: Use parallel composition and take the noisy top 10"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Hierarchical Queries\n",
"\n",
"Design an algorithm to produce summary statistics for the U.S. Census. Your algorithm should produce total population counts at the following levels:\n",
"\n",
"- Census tract\n",
"- City / town\n",
"- ZIP Code\n",
"- County\n",
"- State\n",
"- USA\n",
"\n",
"**Ideas**:\n",
"\n",
"Idea 1: *Only* compute the bottom level (census tract), using parallel composition. Add up all the tract counts to get the city counts, and so on up the hierarchy. Advantage: lowers privacy budget.\n",
"\n",
"Idea 2: Compute counts at all levels, using parallel composition for each level. Tune the budget split using real data; probably we need more accuracy for the smaller levels of the hierarchy.\n",
"\n",
"Idea 3: As (2), but also use post-processing to re-scale lower levels of the hierarchy based on higher ones; truncate counts to whole numbers; move negative counts to 0."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Workloads of Range Queries\n",
"\n",
"Design an algorithm to accurately answer a workload of *range queries*. Range queries are queries on a single table of the form: \"how many rows have a value that is between $a$ and $b$?\" (i.e. the count of rows which lie in a specific range). \n",
"\n",
"### Part 1\n",
"The whole workload is pre-specified as a finite sequence of ranges: $\\{(a_1, b_1), \\dots, (a_k, b_k)\\}$, and \n",
"\n",
"### Part 2\n",
"The length of the workload $k$ is pre-specified, but queries arrive in a streaming fashion and must be answered as they arrive.\n",
"\n",
"### Part 3\n",
"The workload may be infinite.\n",
"\n",
"**Ideas**:\n",
"\n",
"Just run each query with sequential composition.\n",
"\n",
"For part 1, combine them so we can use $L2$ sensitivity. When $k$ is large, this will work well with Gaussian noise.\n",
"\n",
"Or, build synthetic data:\n",
"\n",
"- For each range $(i, i+1)$, find a count (parallel composition). This is a synthetic data representation! We can answer infinitely many queries by adding up the counts of all the segments in this histogram which are contained in the desired interval.\n",
"- For part 2, use SVT\n",
"\n",
"For SVT: for each query in the stream, ask how far the real answer is from the synthetic data answer. If it's far, query the real answer's range (as a histogram, using parallel composition) and update the synthetic data. Otherwise just give the synthetic data answer. This way you *ONLY* pay for updates to the synthetic data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}