GP/ch11.ipynb at master - GP - WXXXXX

174 lines

6.3 KiB

Plaintext

Raw Permalink Blame History

 {
  "cells": [
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "# Exercises in Algorithm Design"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## Issues to Consider\n",
     "\n",
     "- How many queries are required, and what kind of composition can we use?\n",
     "  - Is parallel composition possible?\n",
     "  - Should we use sequential composition, advanced composition, or a variant of differential privacy?\n",
     "- Can we use the sparse vector technique?\n",
     "- Can we use the exponential mechanism?\n",
     "- How should we distribute the privacy budget?\n",
     "- If there are unbounded sensitivities, how can we bound them?\n",
     "- Would synthetic data help?\n",
     "- Would post-processing to \"de-noise\" help?"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## 1. Generalized Sample and Aggregate\n",
     "\n",
     "Design a variant of sample and aggregate which does *not* require the analyst to specify the output range of the query function $f$.\n",
     "\n",
     "**Ideas**: use SVT to find good upper and lower bounds on $f(x)$ for the whole dataset first. The result of $clip(f(x), lower, upper)$ has bounded sensitivity, so we can use this query with SVT. Then use sample and aggregate with these upper and lower bounds."
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## 2. Summary Statistics\n",
     "\n",
     "Design an algorithm to produce differentially private versions of the following statistics:\n",
     "\n",
     "- Mean: $\\mu = \\frac{1}{n} \\sum_{i=1}^n x_i$\n",
     "- Variance: $var = \\frac{1}{n} \\sum_{i=1}^n (x_i - \\mu)^2$\n",
     "- Standard deviation: $\\sigma = \\sqrt{\\frac{1}{n} \\sum_{i=1}^n (x_i - \\mu)^2}$\n",
     "\n",
     "**Ideas**:\n",
     "\n",
     "**Mean**\n",
     "\n",
     "1. Use SVT to find upper and lower clipping bounds\n",
     "2. Compute noisy sum and count, and derive mean by post-processing\n",
     "\n",
     "**Variance**\n",
     "\n",
     "1. Split it into a count query ($\\frac{1}{n}$ - we have the answer from above) and a sum query\n",
     "2. What's the sensitivity of $\\sum_{i=1}^n (x_i - \\mu)^2$? It's $b^2$; we can clip and compute $\\sum_{i=1}^n (x_i - \\mu)^2$, then multiply by (1) by post processing\n",
     "\n",
     "**Standard Deviation**\n",
     "\n",
     "1. Just take the square root of variance\n",
     "\n",
     "Total queries:\n",
     "- Lower clipping bound (SVT)\n",
     "- Upper clipping bound (SVT)\n",
     "- Noisy sum (mean)\n",
     "- Noisy count\n",
     "- Noisy sum (variance)"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## 3. Heavy Hitters\n",
     "\n",
     "Google's RAPPOR system {cite}`rappor` is designed to find the most popular settings for Chrome's home page. Design an algorithm which:\n",
     "\n",
     "- Given a list of the 10,000 most popular web pages by traffic,\n",
     "- Determines the top 10 most-popular home pages out of the 10,000 most popular web pages\n",
     "\n",
     "\n",
     "**Ideas**: Use parallel composition and take the noisy top 10"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## 4. Hierarchical Queries\n",
     "\n",
     "Design an algorithm to produce summary statistics for the U.S. Census. Your algorithm should produce total population counts at the following levels:\n",
     "\n",
     "- Census tract\n",
     "- City / town\n",
     "- ZIP Code\n",
     "- County\n",
     "- State\n",
     "- USA\n",
     "\n",
     "**Ideas**:\n",
     "\n",
     "Idea 1: *Only* compute the bottom level (census tract), using parallel composition. Add up all the tract counts to get the city counts, and so on up the hierarchy. Advantage: lowers privacy budget.\n",
     "\n",
     "Idea 2: Compute counts at all levels, using parallel composition for each level. Tune the budget split using real data; probably we need more accuracy for the smaller levels of the hierarchy.\n",
     "\n",
     "Idea 3: As (2), but also use post-processing to re-scale lower levels of the hierarchy based on higher ones; truncate counts to whole numbers; move negative counts to 0."
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## 5. Workloads of Range Queries\n",
     "\n",
     "Design an algorithm to accurately answer a workload of *range queries*. Range queries are queries on a single table of the form: \"how many rows have a value that is between $a$ and $b$?\" (i.e. the count of rows which lie in a specific range). \n",
     "\n",
     "### Part 1\n",
     "The whole workload is pre-specified as a finite sequence of ranges: $\\{(a_1, b_1), \\dots, (a_k, b_k)\\}$, and \n",
     "\n",
     "### Part 2\n",
     "The length of the workload $k$ is pre-specified, but queries arrive in a streaming fashion and must be answered as they arrive.\n",
     "\n",
     "### Part 3\n",
     "The workload may be infinite.\n",
     "\n",
     "**Ideas**:\n",
     "\n",
     "Just run each query with sequential composition.\n",
     "\n",
     "For part 1, combine them so we can use $L2$ sensitivity. When $k$ is large, this will work well with Gaussian noise.\n",
     "\n",
     "Or, build synthetic data:\n",
     "\n",
     "- For each range $(i, i+1)$, find a count (parallel composition). This is a synthetic data representation! We can answer infinitely many queries by adding up the counts of all the segments in this histogram which are contained in the desired interval.\n",
     "- For part 2, use SVT\n",
     "\n",
     "For SVT: for each query in the stream, ask how far the real answer is from the synthetic data answer. If it's far, query the real answer's range (as a histogram, using parallel composition) and update the synthetic data. Otherwise just give the synthetic data answer. This way you *ONLY* pay for updates to the synthetic data."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": []
   }
  ],
  "metadata": {
   "kernelspec": {
    "display_name": "Python 3",
    "language": "python",
    "name": "python3"
   },
   "language_info": {
    "codemirror_mode": {
     "name": "ipython",
     "version": 3
    },
    "file_extension": ".py",
    "mimetype": "text/x-python",
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
    "version": "3.9.9"
   }
  },
  "nbformat": 4,
  "nbformat_minor": 2
 }