{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Differential Privacy\n",
    "\n",
    "```{admonition} Learning Objectives\n",
    "After reading this chapter, you will be able to:\n",
    "\n",
    "- Define differential privacy\n",
    "- Explain the importance of the privacy parameter $\\epsilon$\n",
    "- Use the Laplace mechanism to enforce differential privacy for counting queries\n",
    "```\n",
    "\n",
    "Like $k$-Anonymity, *differential privacy* {cite}`dwork2006A,dwork2006B` is a formal notion of privacy (i.e. it's possible to prove that a data release has the property). Unlike $k$-Anonymity, however, differential privacy is a property of *algorithms*, and not a property of *data*. That is, we can prove that an *algorithm* satisfies differential privacy; to show that a *dataset* satisfies differential privacy, we must show that the algorithm which produced it satisfies differential privacy.\n",
    "\n",
    "\n",
    "```{admonition} Definition\n",
    "A function which satisfies differential privacy is often called a *mechanism*. We say that a *mechanism* $F$ satisfies differential privacy if for all *neighboring datasets* $x$ and $x'$, and all possible outputs $S$,\n",
    "\n",
    "\\begin{equation}\n",
    "\\frac{\\mathsf{Pr}[F(x) = S]}{\\mathsf{Pr}[F(x') = S]} \\leq e^\\epsilon\n",
    "\\end{equation}\n",
    "```\n",
    "\n",
    "Two datasets are considered neighbors if they differ in the data of a single individual. Note that $F$ is typically a *randomized* function, which has many possible outputs under the same input. Therefore, the probability distribution describing its outputs is not just a point distribution.\n",
    "\n",
    "The important implication of this definition is that $F$'s output will be pretty much the same, *with or without* the data of any specific individual. In other words, the randomness built into $F$ should be \"enough\" so that an observed output from $F$ will not reveal which of $x$ or $x'$ was the input. Imagine that my data is present in $x$ but not in $x'$. If an adversary can't determine which of $x$ or $x'$ was the input to $F$, then the adversary can't tell whether or not my data was *present* in the input - let alone the contents of that data.\n",
    "\n",
    "The $\\epsilon$ parameter in the definition is called the *privacy parameter* or the *privacy budget*. $\\epsilon$ provides a knob to tune the \"amount of privacy\" the definition provides. Small values of $\\epsilon$ require $F$ to provide *very* similar outputs when given similar inputs, and therefore provide higher levels of privacy; large values of $\\epsilon$ allow less similarity in the outputs, and therefore provide less privacy.\n",
    "\n",
    "How should we set $\\epsilon$ to prevent bad outcomes in practice? Nobody knows. The general consensus is that $\\epsilon$ should be around 1 or smaller, and values of $\\epsilon$ above 10 probably don't do much to protect privacy - but this rule of thumb could turn out to be very conservative. We will have more to say on this subject later on."
   ]
  },
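  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a very rough illustration of what this knob means numerically, the following sketch simply evaluates the bound $e^\\epsilon$ from the definition for a few values of $\\epsilon$ (the specific values are arbitrary examples, not recommendations): small values force the output distributions on neighboring datasets to be nearly indistinguishable, while by $\\epsilon = 10$ the bound is so loose that it constrains almost nothing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The definition bounds the ratio of output probabilities on neighboring\n",
    "# datasets by e^epsilon; here is how fast that bound grows with epsilon.\n",
    "import math\n",
    "\n",
    "for epsilon in [0.01, 0.1, 1, 10]:\n",
    "    print(f'epsilon = {epsilon:>5}: probability ratio bounded by {math.exp(epsilon):.3f}')"
   ]
  },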
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The Laplace Mechanism"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Differential privacy is typically used to answer specific queries. Let's consider a query on the census data, *without* differential privacy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "plt.style.use('seaborn-whitegrid')\n",
    "adult = pd.read_csv(\"adult_with_pii.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\"How many individuals in the dataset are 40 years old or older?\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "14237"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adult[adult['Age'] >= 40].shape[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The easiest way to achieve differential privacy for this query is to add random noise to its answer. The key challenge is to add enough noise to satisfy the definition of differential privacy, but not so much that the answer becomes too noisy to be useful. To make this process easier, some basic *mechanisms* have been developed in the field of differential privacy, which describe exactly what kind of - and how much - noise to use. One of these is called the *Laplace mechanism* {cite}`dwork2006B`.\n",
    "\n",
    "```{admonition} Definition\n",
    "According to the Laplace mechanism, for a function $f(x)$ which returns a number, the following definition of $F(x)$ satisfies $\\epsilon$-differential privacy:\n",
    "\n",
    "\\begin{equation}\n",
    "F(x) = f(x) + \\textsf{Lap}\\left(\\frac{s}{\\epsilon}\\right)\n",
    "\\end{equation}\n",
    "\n",
    "where $s$ is the *sensitivity* of $f$, and $\\textsf{Lap}(S)$ denotes sampling from the Laplace distribution with center 0 and scale $S$.\n",
    "```\n",
    "\n",
"The *sensitivity* of a function $f$ is the amount $f$'s output changes when its input changes by 1. Sensitivity is a complex topic, and an integral part of designing differentially private algorithms; we will have much more to say about it later. For now, we will just point out that *counting queries* always have a sensitivity of 1: if a query counts the number of rows in the dataset with a particular property, and then we modify exactly one row of the dataset, then the query's output can change by at most 1.\n",
|
|
"\n",
    "Thus we can achieve differential privacy for our example query by using the Laplace mechanism with sensitivity 1 and an $\\epsilon$ of our choosing. For now, let's pick $\\epsilon = 0.1$. We can sample from the Laplace distribution using Numpy's `random.laplace`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "14238.147613610243"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sensitivity = 1\n",
    "epsilon = 0.1\n",
    "\n",
    "adult[adult['Age'] >= 40].shape[0] + np.random.laplace(loc=0, scale=sensitivity/epsilon)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
"You can see the effect of the noise by running this code multiple times. Each time, the output changes, but most of the time, the answer is close enough to the true answer (14,235) to be useful."
|
|
]
|
|
},
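  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One quick way to see the spread of the noise (just a sketch, reusing the `adult` data and `np` import from the cells above, and redefining `sensitivity` and `epsilon` with the same values) is to draw several noisy answers at once:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Draw several independent noisy answers to the same counting query, to show\n",
    "# how much the Laplace noise moves each one around the true count of 14237.\n",
    "sensitivity = 1\n",
    "epsilon = 0.1\n",
    "true_count = adult[adult['Age'] >= 40].shape[0]\n",
    "\n",
    "for _ in range(5):\n",
    "    print(true_count + np.random.laplace(loc=0, scale=sensitivity/epsilon))"
   ]
  },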
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## How Much Noise is Enough?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How do we know that the Laplace mechanism adds enough noise to prevent the re-identification of individuals in the dataset? For one thing, we can try to break it! Let's write down a malicious counting query, which is specifically designed to determine whether Karrie Trusslove has an income greater than \\$50k."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "karries_row = adult[adult['Name'] == 'Karrie Trusslove']\n",
    "karries_row[karries_row['Target'] == '<=50K'].shape[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This result definitely violates Karrie's privacy, since it reveals the value of the income column for Karrie's row. Since we know how to ensure differential privacy for counting queries with the Laplace mechanism, we can do so for this query:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2.198682025336349"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sensitivity = 1\n",
    "epsilon = 0.1\n",
    "\n",
    "karries_row = adult[adult['Name'] == 'Karrie Trusslove']\n",
    "karries_row[karries_row['Target'] == '<=50K'].shape[0] + \\\n",
    "    np.random.laplace(loc=0, scale=sensitivity/epsilon)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Is the true answer 0 or 1? There's too much noise to be able to reliably tell. This is how differential privacy is *intended* to work - the approach does not *reject* queries which are determined to be malicious; instead, it adds enough noise that the results of a malicious query will be useless to the adversary."
   ]
  },
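  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get a rough sense of *why* a single noisy answer is so uninformative, here is a sketch (not a formal argument; it reuses the `np` import from above): whether the true count is 0 or 1, the released value is that count plus Laplace noise with scale $1/\\epsilon = 10$, and the two resulting distributions overlap almost completely."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compare the noisy outputs the adversary would see if the true count were 0\n",
    "# versus 1. With scale 1/epsilon = 10, samples from the two cases look almost\n",
    "# the same, which is exactly what the definition of differential privacy requires.\n",
    "epsilon = 0.1\n",
    "\n",
    "samples_if_0 = 0 + np.random.laplace(loc=0, scale=1/epsilon, size=5)\n",
    "samples_if_1 = 1 + np.random.laplace(loc=0, scale=1/epsilon, size=5)\n",
    "\n",
    "print('if the true answer were 0:', samples_if_0.round(2))\n",
    "print('if the true answer were 1:', samples_if_1.round(2))"
   ]
  }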
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
|