GP/ch13.ipynb at master - GP - WXXXXX

685 lines

24 KiB

Plaintext

Raw Permalink Blame History

 {
  "cells": [
   {
    "cell_type": "code",
    "execution_count": 1,
    "metadata": {
     "tags": [
      "remove-cell"
     ]
    },
    "outputs": [],
    "source": [
     "%matplotlib inline\n",
     "import matplotlib.pyplot as plt\n",
     "plt.style.use('seaborn-whitegrid')\n",
     "import pandas as pd\n",
     "import numpy as np\n",
     "\n",
     "adult = pd.read_csv(\"adult_with_pii.csv\")\n",
     "def laplace_mech(v, sensitivity, epsilon):\n",
     "    return v + np.random.laplace(loc=0, scale=sensitivity / epsilon)\n",
     "def pct_error(orig, priv):\n",
     "    return np.abs(orig - priv)/orig * 100.0\n"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "# Local Differential Privacy\n",
     "\n",
     "```{admonition} Learning Objectives\n",
     "After reading this chapter, you will be able to:\n",
     "- Define the local model of differential privacy and contrast it with the central model\n",
     "- Define and implement the randomized response and unary encoding mechanisms\n",
     "- Describe the accuracy implications of these mechanisms and the challenges of the local model\n",
     "```\n",
     "\n",
     "So far, we have only considered the *central model* of differential privacy, in which the sensitive data is collected centrally in a single dataset. In this setting, we assume that the *analyst* is malicious, but that there is a *trusted data curator* who holds the dataset and correctly executes the differentially private mechanisms the analyst specifies.\n",
     "\n",
     "This setting is often not realistic. In many cases, the data curator and the analyst are *the same*, and no trusted third party actually exists to hold the data and execute mechanisms. In fact, the organizations which collect the most sensitive data tend to be exactly the ones we *don't* trust; such organizations certainly can't function as trusted data curators.\n",
     "\n",
     "An alternative to the central model of differential privacy is the *local model of differential privacy*, in which data is made differentially private before it leaves the control of the data subject. For example, you might add noise to your data *on your device* before sending it to the data curator. In the local model, the data curator does not need to be trusted, since the data they collect *already* satisfies differential privacy.\n",
     "\n",
     "The local model thus has one huge advantage over the central model: data subjects don't need to trust anyone else but themselves. This advantage has made it popular in real-world deployments, including the ones by [Google](https://github.com/google/rappor) and [Apple](https://www.apple.com/privacy/docs/Differential_Privacy_Overview.pdf).\n",
     "\n",
     "Unfortunately, the local model also has a significant drawback: the accuracy of query results in the local model is typically *orders of magnitude lower* for the same privacy cost as the same query under central differential privacy. This huge loss in accuracy means that only a small handful of query types are suitable for local differential privacy, and even for these, a large number of participants is required.\n",
     "\n",
     "In this section, we'll see two mechanisms for local differential privacy. The first is called *randomized response*, and the second is called *unary encoding*."
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## Randomized Response\n",
     "\n",
     "[Randomized response](https://en.wikipedia.org/wiki/Randomized_response) {cite}`warner1965` is a mechanism for local differential privacy which was first proposed in a 1965 [paper by S. L. Warner](https://www.jstor.org/stable/2283137?seq=1#metadata_info_tab_contents). At the time, the technique was intended to improve bias in survey responses about sensitive issues, and it was not originally proposed as a mechanism for differential privacy (which wouldn't be invented for another 40 years). After differential privacy was developed, statisticians realized that this existing technique *already* satisfied the definition.\n",
     "\n",
     "Dwork and Roth present a variant of randomized response, in which the data subject answers a \"yes\" or \"no\" question as follows:\n",
     "\n",
     "1. Flip a coin\n",
     "2. If the coin is heads, answer the question truthfully\n",
     "3. If the coin is tails, flip another coin\n",
     "4. If the second coin is heads, answer \"yes\"; if it is tails, answer \"no\"\n",
     "\n",
     "The randomization in this algorithm comes from the two coin flips. As in all other differentially private algorithms, this randomization creates uncertainty about the true answer, which is the source of privacy.\n",
     "\n",
     "As it turns out, this randomized response algorithm satisfies $\\epsilon$-differential privacy for $\\epsilon = \\log(3) = 1.09$.\n",
     "\n",
     "Let's implement the algorithm for a simple \"yes\" or \"no\" question: \"is your occupation 'Sales'?\" We can flip a coin in Python using `np.random.randint(0, 2)`; the result is either a 0 or a 1."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 184,
    "metadata": {},
    "outputs": [],
    "source": [
     "def rand_resp_sales(response):\n",
     "    truthful_response = response == 'Sales'\n",
     "    \n",
     "    # first coin flip\n",
     "    if np.random.randint(0, 2) == 0:\n",
     "        # answer truthfully\n",
     "        return truthful_response\n",
     "    else:\n",
     "        # answer randomly (second coin flip)\n",
     "        return np.random.randint(0, 2) == 0"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "Let's ask 200 people who *do* work in sales to respond using randomized response, and look at the results."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 185,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "True     155\n",
        "False     45\n",
        "dtype: int64"
       ]
      },
      "execution_count": 185,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "pd.Series([rand_resp_sales('Sales') for i in range(200)]).value_counts()"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "What we see is that we get both \"yesses\" and \"nos\" - but that the \"yesses\" outweigh the \"nos.\" This output demonstrates both features of the differentially private algorithms we've already seen - it includes uncertainty, which creates privacy, but also displays enough signal to allow us to infer something about the population. \n",
     "\n",
     "Let's try the same thing on some actual data. We'll take all of the occupations in the US Census dataset we've been using, and encode responses for the question \"is your occupation 'Sales'?\" for each one. In an actual deployed system, we wouldn't collect this dataset centrally at all - instead, each respondant would run `rand_resp_sales` locally, and submit their randomized response to the data curator. For our experiment, we'll run `rand_resp_sales` on the existing dataset."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 186,
    "metadata": {},
    "outputs": [],
    "source": [
     "responses = [rand_resp_sales(r) for r in adult['Occupation']]"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 187,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "False    22633\n",
        "True      9928\n",
        "dtype: int64"
       ]
      },
      "execution_count": 187,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "pd.Series(responses).value_counts()"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "This time, we get many more \"nos\" than \"yesses.\" This makes a lot of sense, with a little thought, because the majority of the participants in the dataset are *not* in sales.\n",
     "\n",
     "The key question now is: how do we estimate the *acutal* number of salespeople in the dataset, based on these responses? The number of \"yesses\" is not a good estimate for the number of salespeople:"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 188,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "3650"
       ]
      },
      "execution_count": 188,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "len(adult[adult['Occupation'] == 'Sales'])"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "And this is not a surprise, since many of the \"yesses\" come from the random coin flips of the algorithm.\n",
     "\n",
     "In order to get an estimate of the true number of salespeople, we need to analyze the randomness in the randomized response algorithm and estimate how many of the \"yes\" responses are from actual salespeople, and how many are \"fake\" yesses which resulted from random coin flips. We know that:\n",
     "\n",
     "- With probability $\\frac{1}{2}$, each respondant responds randomly\n",
     "- With probability $\\frac{1}{2}$, each random response is a \"yes\"\n",
     "\n",
     "So, the probability that a respondant responds \"yes\" by random chance (rather than because they're a salesperson) is $\\frac{1}{2} \\cdot \\frac{1}{2} = \\frac{1}{4}$. This means we can expect one-quarter of our *total* responses to be \"fake yesses.\""
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 189,
    "metadata": {},
    "outputs": [],
    "source": [
     "responses = [rand_resp_sales(r) for r in adult['Occupation']]\n",
     "\n",
     "# we expect 1/4 of the responses to be \"yes\" based entirely on the coin flip\n",
     "# these are \"fake\" yesses\n",
     "fake_yesses = len(responses)/4\n",
     "\n",
     "# the total number of yesses recorded\n",
     "num_yesses = np.sum([1 if r else 0 for r in responses])\n",
     "\n",
     "# the number of \"real\" yesses is the total number of yesses minus the fake yesses\n",
     "true_yesses = num_yesses - fake_yesses"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "The other factor we need to consider is that half of the respondants answer randomly, but *some of the random respondants might actually be salespeople*. How many of them are salespeople? We have no data on that, since they answered randomly!\n",
     "\n",
     "But, since we split the respondants into \"truth\" and \"random\" groups randomly (by the first coin flip), we can hope that there are roughly the same number of salespeople in both groups. Therefore, if we can estimate the number of salespeople in the \"truth\" group, we can double this number to get the number of salespeople in total."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 190,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "3721.5"
       ]
      },
      "execution_count": 190,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "# true_yesses estimates the total number of yesses in the \"truth\" group\n",
     "# we estimate the total number of yesses for both groups by doubling\n",
     "rr_result = true_yesses*2\n",
     "rr_result"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "How close is that to the true number of salespeople? Let's compare!"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 192,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "3650"
       ]
      },
      "execution_count": 192,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "true_result = np.sum(adult['Occupation'] == 'Sales')\n",
     "true_result"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 193,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "1.9589041095890412"
       ]
      },
      "execution_count": 193,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "pct_error(true_result, rr_result)"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "With this approach, and fairly large counts (e.g. more than 3000, in this case), we generally get \"acceptable\" error - something below 5%. If your goal is to determine the most popular occupation, this approach is likely to work. However, when counts are smaller, the error will quickly get larger.\n",
     "\n",
     "Furthermore, randomized response is *orders of magnitude* worse than the Laplace mechanism in the central model. Let's compare the two for this example:"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 169,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "0.011423124062500005"
       ]
      },
      "execution_count": 169,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "pct_error(true_result, laplace_mech(true_result, 1, 1))"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "Here, we get an error of about 0.01%, even though our $\\epsilon$ value for the central model is slightly lower than the $\\epsilon$ we used for randomized response.\n",
     "\n",
     "There *are* better algorithms for the local model, but the inherent limitations of having to add noise before submitting your data mean that local model algorithms will *always* have worse accuracy than the best central model algorithms."
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## Unary Encoding"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "Randomized response allows us to ask a yes/no question with local differential privacy. What if we want to build a histogram?\n",
     "\n",
     "A number of different algorithms for solving this problem in the local model of differential privacy have been proposed. A [2017 paper by Wang et al.](https://arxiv.org/abs/1705.04421) {cite}`wang2017` provides a good summary of some optimal approaches. Here, we'll examine the simplest of these, called *unary encoding*. This approach is the basis for [Google's RAPPOR system](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42852.pdf) {cite}`rappor` (with a number of modifications to make it work better for large domains and multiple responses over time).\n",
     "\n",
     "The first step is to define the domain for responses - the labels of the histogram bins we care about. For our example, we want to know how many participants are associated with each occupation, so our domain is the set of occupations."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 194,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "array(['Adm-clerical', 'Exec-managerial', 'Handlers-cleaners',\n",
        "       'Prof-specialty', 'Other-service', 'Sales', 'Craft-repair',\n",
        "       'Transport-moving', 'Farming-fishing', 'Machine-op-inspct',\n",
        "       'Tech-support', 'Protective-serv', 'Armed-Forces', 'Priv-house-serv'], dtype=object)"
       ]
      },
      "execution_count": 194,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "domain = adult['Occupation'].dropna().unique()\n",
     "domain"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "We're going to define three functions, which together implement the unary encoding mechanism:\n",
     "\n",
     "1. `encode`, which encodes the response\n",
     "2. `perturb`, which perturbs the encoded response\n",
     "3. `aggregate`, which reconstructs final results from the perturbed responses\n",
     "\n",
     "The name of this technique comes from the encoding method used: for a domain of size $k$, each responses is encoded as a length-$k$ vector of bits, with all positions 0 except the one corresponding to the occupation of the respondant. In machine learning, this representation is called a \"one-hot encoding.\"\n",
     "\n",
     "For example, 'Sales' is the 6th element of the domain, so the 'Sales' occupation is encoded with a vector whose 6th element is a 1."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 195,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]"
       ]
      },
      "execution_count": 195,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "def encode(response):\n",
     "    return [1 if d == response else 0 for d in domain]\n",
     "\n",
     "encode('Sales')"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "The next step is `perturb`, which flips bits in the response vector to ensure differential privacy. The probability that a bit gets flipped is based on two parameters $p$ and $q$, which together determine the privacy parameter $\\epsilon$ (based on a formula we will see in a moment).\n",
     "\n",
     "$$ \\mathsf{Pr}[B'[i] = 1] =   \\left\\{\n",
     "\\begin{array}{ll}\n",
     "      p\\;\\;\\;\\text{if}\\;B[i] = 1 \\\\\n",
     "      q\\;\\;\\;\\text{if}\\;B[i] = 0\\\\\n",
     "\\end{array} \n",
     "\\right.  $$"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 196,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "[1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0]"
       ]
      },
      "execution_count": 196,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "def perturb(encoded_response):\n",
     "    return [perturb_bit(b) for b in encoded_response]\n",
     "\n",
     "def perturb_bit(bit):\n",
     "    p = .75\n",
     "    q = .25\n",
     "\n",
     "    sample = np.random.random()\n",
     "    if bit == 1:\n",
     "        if sample <= p:\n",
     "            return 1\n",
     "        else:\n",
     "            return 0\n",
     "    elif bit == 0:\n",
     "        if sample <= q:\n",
     "            return 1\n",
     "        else: \n",
     "            return 0\n",
     "\n",
     "perturb(encode('Sales'))"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "Based on the values of $p$ and $q$, we can calculate the value of the privacy parameter $\\epsilon$. For $p=.75$ and $q=.25$, we will see an $\\epsilon$ of slightly more than 2.\n",
     "\n",
     "\\begin{align}\n",
     "\\epsilon = \\log{\\left(\\frac{p (1-q)}{(1-p) q}\\right)}\n",
     "\\end{align}"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 207,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "2.1972245773362196"
       ]
      },
      "execution_count": 207,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "def unary_epsilon(p, q):\n",
     "    return np.log((p*(1-q)) / ((1-p)*q))\n",
     "\n",
     "unary_epsilon(.75, .25)"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "The final piece is aggregation. If we hadn't done any perturbation, then we could simply take the set of response vectors and add them element-wise to get counts for each element in the domain:"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 203,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "[('Adm-clerical', 3770),\n",
        " ('Exec-managerial', 4066),\n",
        " ('Handlers-cleaners', 1370),\n",
        " ('Prof-specialty', 4140),\n",
        " ('Other-service', 3295),\n",
        " ('Sales', 3650),\n",
        " ('Craft-repair', 4099),\n",
        " ('Transport-moving', 1597),\n",
        " ('Farming-fishing', 994),\n",
        " ('Machine-op-inspct', 2002),\n",
        " ('Tech-support', 928),\n",
        " ('Protective-serv', 649),\n",
        " ('Armed-Forces', 9),\n",
        " ('Priv-house-serv', 149)]"
       ]
      },
      "execution_count": 203,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "counts = np.sum([encode(r) for r in adult['Occupation']], axis=0)\n",
     "list(zip(domain, counts))"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "But as we saw with randomized response, the \"fake\" responses caused by flipped bits cause the results to be difficult to interpret. If we perform the same procedure with the perturbed responses, the counts are all wrong:"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 208,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "[('Adm-clerical', 10042),\n",
        " ('Exec-managerial', 10204),\n",
        " ('Handlers-cleaners', 9006),\n",
        " ('Prof-specialty', 10238),\n",
        " ('Other-service', 9635),\n",
        " ('Sales', 9844),\n",
        " ('Craft-repair', 10233),\n",
        " ('Transport-moving', 8863),\n",
        " ('Farming-fishing', 8721),\n",
        " ('Machine-op-inspct', 9122),\n",
        " ('Tech-support', 8753),\n",
        " ('Protective-serv', 8523),\n",
        " ('Armed-Forces', 8157),\n",
        " ('Priv-house-serv', 8042)]"
       ]
      },
      "execution_count": 208,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "counts = np.sum([perturb(encode(r)) for r in adult['Occupation']], axis=0)\n",
     "list(zip(domain, counts))"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "The aggregate step of the unary encoding algorithm takes into account the number of \"fake\" responses in each category, which is a function of both $p$ and $q$, and the number of responses $n$:\n",
     "\n",
     "\\begin{align}\n",
     "A[i] = \\frac{\\sum_j B'_j[i] - n q}{p - q}\n",
     "\\end{align}"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 205,
    "metadata": {},
    "outputs": [],
    "source": [
     "def aggregate(responses):\n",
     "    p = .75\n",
     "    q = .25\n",
     "    \n",
     "    sums = np.sum(responses, axis=0)\n",
     "    n = len(responses)\n",
     "    \n",
     "    return [(v - n*q) / (p-q) for v in sums]  "
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 206,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "[('Adm-clerical', 3865.5),\n",
        " ('Exec-managerial', 4047.5),\n",
        " ('Handlers-cleaners', 989.5),\n",
        " ('Prof-specialty', 4001.5),\n",
        " ('Other-service', 2993.5),\n",
        " ('Sales', 3699.5),\n",
        " ('Craft-repair', 4093.5),\n",
        " ('Transport-moving', 1613.5),\n",
        " ('Farming-fishing', 715.5),\n",
        " ('Machine-op-inspct', 2119.5),\n",
        " ('Tech-support', 947.5),\n",
        " ('Protective-serv', 821.5),\n",
        " ('Armed-Forces', -92.5),\n",
        " ('Priv-house-serv', 387.5)]"
       ]
      },
      "execution_count": 206,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "responses = [perturb(encode(r)) for r in adult['Occupation']]\n",
     "counts = aggregate(responses)\n",
     "list(zip(domain, counts))"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "As we saw with randomized response, these results are accurate enough to obtain a rough ordering of the domain elements (at least the most popular ones), but orders of magnitude less accurate than we could obtain with the Laplace mechanism in the central model of differential privacy.\n",
     "\n",
     "Other methods have been proposed for performing histogram queries in the local model, including some detailed in the [paper](https://arxiv.org/abs/1705.04421) linked earlier. These can improve accuracy somewhat, but the fundamental limitations of having to ensure differential privacy for *each sample individually* in the local model mean that even the most complex technique can't match the accuracy of the mechanisms we've seen in the central model."
    ]
   }
  ],
  "metadata": {
   "celltoolbar": "Tags",
   "kernelspec": {
    "display_name": "Python 3",
    "language": "python",
    "name": "python3"
   },
   "language_info": {
    "codemirror_mode": {
     "name": "ipython",
     "version": 3
    },
    "file_extension": ".py",
    "mimetype": "text/x-python",
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
    "version": "3.6.10"
   }
  },
  "nbformat": 4,
  "nbformat_minor": 2
 }