
wx 2025-02-14 12:07:10 +08:00
commit addc6cf11e
62 changed files with 86064 additions and 0 deletions

3
.gitignore vendored Normal file

@@ -0,0 +1,3 @@
.DS_Store
_build
.ipynb_checkpoints

3
README.md Normal file

@@ -0,0 +1,3 @@
# Programming Differential Privacy
This is the source repository for the book "Programming Differential Privacy." You can find the book online [here](https://uvm-plaid.github.io/programming-dp).

31
_config.yml Normal file

@@ -0,0 +1,31 @@
# Book settings
title: Programming Differential Privacy
author: Joseph P. Near and Chiké Abuah
copyright: "2021"
logo: logo.png
execute:
  timeout: -1
  execute_notebooks: force
sphinx:
  config:
    mathjax_path: https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js
latex:
  latex_documents:
    targetname: book.tex
parse:
  myst_enable_extensions:
    # don't forget to list any other extensions you want enabled,
    # including those that are enabled by default!
    - amsmath
    - dollarmath
bibtex_bibfiles:
  - references.bib
repository:
  url: https://github.com/uvm-plaid/programming-dp
  branch: master

28
deploy.sh Normal file

@@ -0,0 +1,28 @@
#!/bin/bash
# build cn book
echo "# build cn book"
jupyter-book build zh_cn/notebooks
jupyter-book build --builder pdflatex zh_cn/notebooks
# build en book
echo "# build en book"
jupyter-book build notebooks
jupyter-book build --builder pdflatex notebooks
# cp cn assets
echo "# cp cn assets"
mkdir -p _build/html/cn
cp -R zh_cn/_build/html/* _build/html/cn/
cp zh_cn/static/index.html _build/html/cn/
cp zh_cn/static/book-logo.png _build/html/cn/
cp zh_cn/_build/latex/cn_book.pdf _build/html/cn 2>/dev/null || :
# cp en assets
echo "# cp en assets"
cp static/index.html _build/html/
cp static/book-logo.png _build/html/
cp static/CNAME _build/html/
cp _build/latex/book.pdf _build/html/ 2>/dev/null || :
# deploy book
echo "# deploy book"
ghp-import -n -p -f _build/html

108
extras/logo.ipynb Normal file

File diff suppressed because one or more lines are too long

19
notebooks/_toc.yml Normal file

@@ -0,0 +1,19 @@
format: jb-book
root: cover
chapters:
- file: intro
- file: ch1
- file: ch2
- file: ch3
- file: ch4
- file: ch5
- file: ch6
- file: ch7
- file: ch8
- file: ch9
- file: ch10
- file: ch11
- file: ch12
- file: ch13
- file: ch14
- file: bibliography


32562
notebooks/adult_with_pii.csv Normal file

File diff suppressed because it is too large

49
notebooks/bibliography.ipynb Normal file

@@ -0,0 +1,49 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "d69c3a50",
"metadata": {},
"source": [
"# Bibliography"
]
},
{
"cell_type": "markdown",
"id": "16a28d8b",
"metadata": {},
"source": [
"```{bibliography}\n",
":style: unsrt\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "723f9996",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

1323
notebooks/ch1.ipynb Normal file

File diff suppressed because one or more lines are too long

566
notebooks/ch10.ipynb Normal file

File diff suppressed because one or more lines are too long

173
notebooks/ch11.ipynb Normal file

@@ -0,0 +1,173 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercises in Algorithm Design"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Issues to Consider\n",
"\n",
"- How many queries are required, and what kind of composition can we use?\n",
" - Is parallel composition possible?\n",
" - Should we use sequential composition, advanced composition, or a variant of differential privacy?\n",
"- Can we use the sparse vector technique?\n",
"- Can we use the exponential mechanism?\n",
"- How should we distribute the privacy budget?\n",
"- If there are unbounded sensitivities, how can we bound them?\n",
"- Would synthetic data help?\n",
"- Would post-processing to \"de-noise\" help?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Generalized Sample and Aggregate\n",
"\n",
"Design a variant of sample and aggregate which does *not* require the analyst to specify the output range of the query function $f$.\n",
"\n",
"**Ideas**: use SVT to find good upper and lower bounds on $f(x)$ for the whole dataset first. The result of $clip(f(x), lower, upper)$ has bounded sensitivity, so we can use this query with SVT. Then use sample and aggregate with these upper and lower bounds."
]
},
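{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of this idea: `above_threshold` is the standard SVT algorithm, and `saa_auto_bounds` is a hypothetical helper that uses it to select an upper clipping bound for the chunk answers (assuming $f$'s outputs are non-negative, with powers of two as candidate bounds), then aggregates with a noisy mean of the clipped answers."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def above_threshold(queries, x, T, epsilon):\n",
"    # Standard AboveThreshold (SVT): return the index of the first query\n",
"    # whose noisy answer exceeds the noisy threshold, or None if none does\n",
"    T_hat = T + np.random.laplace(loc=0, scale=2/epsilon)\n",
"    for idx, q in enumerate(queries):\n",
"        nu_i = np.random.laplace(loc=0, scale=4/epsilon)\n",
"        if q(x) + nu_i >= T_hat:\n",
"            return idx\n",
"    return None\n",
"\n",
"def saa_auto_bounds(f, x, k, epsilon):\n",
"    # Split x into k chunks and run f on each; one individual affects only\n",
"    # one chunk, so counts over the chunk answers have sensitivity 1\n",
"    chunks = np.array_split(x, k)\n",
"    answers = np.array([f(ch) for ch in chunks])\n",
"    # -(count of chunk answers above b) crosses the threshold -0.5 at the\n",
"    # first candidate bound b that clips nothing\n",
"    candidates = [2**i for i in range(32)]\n",
"    queries = [lambda data, b=b: -np.sum(answers > b) for b in candidates]\n",
"    idx = above_threshold(queries, x, -0.5, epsilon/2)\n",
"    b = candidates[idx] if idx is not None else candidates[-1]\n",
"    # Aggregate: noisy mean of the clipped chunk answers (sensitivity b/k)\n",
"    clipped = np.clip(answers, 0, b)\n",
"    return np.mean(clipped) + np.random.laplace(loc=0, scale=b/(k*(epsilon/2)))\n",
"\n",
"# Example: estimate a mean without specifying f's output range in advance\n",
"saa_auto_bounds(np.mean, np.random.exponential(10, size=10000), k=50, epsilon=1.0)"
]
},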
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Summary Statistics\n",
"\n",
"Design an algorithm to produce differentially private versions of the following statistics:\n",
"\n",
"- Mean: $\\mu = \\frac{1}{n} \\sum_{i=1}^n x_i$\n",
"- Variance: $var = \\frac{1}{n} \\sum_{i=1}^n (x_i - \\mu)^2$\n",
"- Standard deviation: $\\sigma = \\sqrt{\\frac{1}{n} \\sum_{i=1}^n (x_i - \\mu)^2}$\n",
"\n",
"**Ideas**:\n",
"\n",
"**Mean**\n",
"\n",
"1. Use SVT to find upper and lower clipping bounds\n",
"2. Compute noisy sum and count, and derive mean by post-processing\n",
"\n",
"**Variance**\n",
"\n",
"1. Split it into a count query ($\\frac{1}{n}$ - we have the answer from above) and a sum query\n",
"2. What's the sensitivity of $\\sum_{i=1}^n (x_i - \\mu)^2$? It's $b^2$; we can clip and compute $\\sum_{i=1}^n (x_i - \\mu)^2$, then multiply by (1) by post processing\n",
"\n",
"**Standard Deviation**\n",
"\n",
"1. Just take the square root of variance\n",
"\n",
"Total queries:\n",
"- Lower clipping bound (SVT)\n",
"- Upper clipping bound (SVT)\n",
"- Noisy sum (mean)\n",
"- Noisy count\n",
"- Noisy sum (variance)"
]
},
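{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch combining these steps, assuming the clipping bound $b$ has already been selected (e.g. via SVT) and splitting the budget evenly across the three noisy queries; `dp_summary` is a hypothetical helper:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def laplace_mech(v, sensitivity, epsilon):\n",
"    return v + np.random.laplace(loc=0, scale=sensitivity/epsilon)\n",
"\n",
"def dp_summary(x, b, epsilon):\n",
"    x = np.clip(x, 0, b)\n",
"    n = len(x)\n",
"    noisy_count = laplace_mech(n, 1, epsilon/3)\n",
"    noisy_sum = laplace_mech(np.sum(x), b, epsilon/3)\n",
"    mean = noisy_sum / noisy_count  # post-processing\n",
"    # With x clipped to [0, b] and the center also in [0, b], each squared\n",
"    # deviation is at most b^2, so the sum has sensitivity b^2\n",
"    center = np.clip(mean, 0, b)\n",
"    noisy_sq_sum = laplace_mech(np.sum((x - center)**2), b**2, epsilon/3)\n",
"    variance = noisy_sq_sum / noisy_count  # post-processing\n",
"    std = np.sqrt(max(variance, 0))  # post-processing\n",
"    return mean, variance, std\n",
"\n",
"dp_summary(np.random.normal(50, 10, size=10000), b=100, epsilon=1.0)"
]
},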
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Heavy Hitters\n",
"\n",
"Google's RAPPOR system {cite}`rappor` is designed to find the most popular settings for Chrome's home page. Design an algorithm which:\n",
"\n",
"- Given a list of the 10,000 most popular web pages by traffic,\n",
"- Determines the top 10 most-popular home pages out of the 10,000 most popular web pages\n",
"\n",
"\n",
"**Ideas**: Use parallel composition and take the noisy top 10"
]
},
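{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of this idea, assuming `responses` holds one home page per user, drawn from a known candidate list. Each user contributes to exactly one bin, so the whole histogram is a single parallel-composition query with total cost $\\epsilon$; `top_10_pages` is a hypothetical helper:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def top_10_pages(responses, candidates, epsilon):\n",
"    # One count per candidate page; the bins are disjoint, so the whole\n",
"    # histogram costs epsilon by parallel composition\n",
"    counts = {page: 0 for page in candidates}\n",
"    for r in responses:\n",
"        if r in counts:\n",
"            counts[r] += 1\n",
"    # Add Laplace noise to every count, then take the noisy top 10\n",
"    noisy = {page: c + np.random.laplace(loc=0, scale=1/epsilon)\n",
"             for page, c in counts.items()}\n",
"    return sorted(noisy, key=noisy.get, reverse=True)[:10]\n",
"\n",
"candidates = ['page%d' % i for i in range(10000)]\n",
"responses = np.random.choice(candidates[:50], size=100000)\n",
"top_10_pages(responses, candidates, epsilon=1.0)"
]
},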
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Hierarchical Queries\n",
"\n",
"Design an algorithm to produce summary statistics for the U.S. Census. Your algorithm should produce total population counts at the following levels:\n",
"\n",
"- Census tract\n",
"- City / town\n",
"- ZIP Code\n",
"- County\n",
"- State\n",
"- USA\n",
"\n",
"**Ideas**:\n",
"\n",
"Idea 1: *Only* compute the bottom level (census tract), using parallel composition. Add up all the tract counts to get the city counts, and so on up the hierarchy. Advantage: lowers privacy budget.\n",
"\n",
"Idea 2: Compute counts at all levels, using parallel composition for each level. Tune the budget split using real data; probably we need more accuracy for the smaller levels of the hierarchy.\n",
"\n",
"Idea 3: As (2), but also use post-processing to re-scale lower levels of the hierarchy based on higher ones; truncate counts to whole numbers; move negative counts to 0."
]
},
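{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of Idea 1 on hypothetical data: tracts partition the population, so the tract histogram costs a single $\\epsilon$ by parallel composition, and every higher level is computed by post-processing. The two-level hierarchy below (tract to county) stands in for the full one:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"def hierarchical_counts(people, tract_to_county, epsilon):\n",
"    # Bottom level: one noisy count per tract (parallel composition)\n",
"    tract_counts = people['tract'].value_counts()\n",
"    noisy_tracts = {t: c + np.random.laplace(loc=0, scale=1/epsilon)\n",
"                    for t, c in tract_counts.items()}\n",
"    # Higher levels by post-processing: sum the noisy tract counts upward\n",
"    county_counts = {}\n",
"    for t, c in noisy_tracts.items():\n",
"        county = tract_to_county[t]\n",
"        county_counts[county] = county_counts.get(county, 0) + c\n",
"    total = sum(noisy_tracts.values())\n",
"    return noisy_tracts, county_counts, total\n",
"\n",
"# Hypothetical example: 10 tracts split across 2 counties\n",
"people = pd.DataFrame({'tract': np.random.randint(0, 10, size=5000)})\n",
"tract_to_county = {t: t // 5 for t in range(10)}\n",
"_, counties, total = hierarchical_counts(people, tract_to_county, epsilon=1.0)\n",
"counties, total"
]
},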
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Workloads of Range Queries\n",
"\n",
"Design an algorithm to accurately answer a workload of *range queries*. Range queries are queries on a single table of the form: \"how many rows have a value that is between $a$ and $b$?\" (i.e. the count of rows which lie in a specific range). \n",
"\n",
"### Part 1\n",
"The whole workload is pre-specified as a finite sequence of ranges: $\\{(a_1, b_1), \\dots, (a_k, b_k)\\}$, and \n",
"\n",
"### Part 2\n",
"The length of the workload $k$ is pre-specified, but queries arrive in a streaming fashion and must be answered as they arrive.\n",
"\n",
"### Part 3\n",
"The workload may be infinite.\n",
"\n",
"**Ideas**:\n",
"\n",
"Just run each query with sequential composition.\n",
"\n",
"For part 1, combine them so we can use $L2$ sensitivity. When $k$ is large, this will work well with Gaussian noise.\n",
"\n",
"Or, build synthetic data:\n",
"\n",
"- For each range $(i, i+1)$, find a count (parallel composition). This is a synthetic data representation! We can answer infinitely many queries by adding up the counts of all the segments in this histogram which are contained in the desired interval.\n",
"- For part 2, use SVT\n",
"\n",
"For SVT: for each query in the stream, ask how far the real answer is from the synthetic data answer. If it's far, query the real answer's range (as a histogram, using parallel composition) and update the synthetic data. Otherwise just give the synthetic data answer. This way you *ONLY* pay for updates to the synthetic data."
]
},
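{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the histogram-based synthetic data idea, assuming the data are numbers in $[0, 100)$: one noisy count per unit-length segment is a single parallel-composition query, after which any range query is free post-processing:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def make_histogram(x, epsilon, domain=100):\n",
"    # One count per segment (i, i+1); the segments are disjoint, so the\n",
"    # whole histogram costs epsilon by parallel composition\n",
"    counts, _ = np.histogram(x, bins=np.arange(domain + 1))\n",
"    return counts + np.random.laplace(loc=0, scale=1/epsilon, size=len(counts))\n",
"\n",
"def range_query(hist, a, b):\n",
"    # Post-processing: estimated count of rows with a <= value < b\n",
"    return np.sum(hist[a:b])\n",
"\n",
"hist = make_histogram(np.random.uniform(0, 100, size=10000), epsilon=1.0)\n",
"range_query(hist, 25, 75)"
]
},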
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

1100
notebooks/ch12.ipynb Normal file

File diff suppressed because one or more lines are too long

684
notebooks/ch13.ipynb Normal file

@@ -0,0 +1,684 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": [
"remove-cell"
]
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"plt.style.use('seaborn-whitegrid')\n",
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"adult = pd.read_csv(\"adult_with_pii.csv\")\n",
"def laplace_mech(v, sensitivity, epsilon):\n",
" return v + np.random.laplace(loc=0, scale=sensitivity / epsilon)\n",
"def pct_error(orig, priv):\n",
" return np.abs(orig - priv)/orig * 100.0\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Local Differential Privacy\n",
"\n",
"```{admonition} Learning Objectives\n",
"After reading this chapter, you will be able to:\n",
"- Define the local model of differential privacy and contrast it with the central model\n",
"- Define and implement the randomized response and unary encoding mechanisms\n",
"- Describe the accuracy implications of these mechanisms and the challenges of the local model\n",
"```\n",
"\n",
"So far, we have only considered the *central model* of differential privacy, in which the sensitive data is collected centrally in a single dataset. In this setting, we assume that the *analyst* is malicious, but that there is a *trusted data curator* who holds the dataset and correctly executes the differentially private mechanisms the analyst specifies.\n",
"\n",
"This setting is often not realistic. In many cases, the data curator and the analyst are *the same*, and no trusted third party actually exists to hold the data and execute mechanisms. In fact, the organizations which collect the most sensitive data tend to be exactly the ones we *don't* trust; such organizations certainly can't function as trusted data curators.\n",
"\n",
"An alternative to the central model of differential privacy is the *local model of differential privacy*, in which data is made differentially private before it leaves the control of the data subject. For example, you might add noise to your data *on your device* before sending it to the data curator. In the local model, the data curator does not need to be trusted, since the data they collect *already* satisfies differential privacy.\n",
"\n",
"The local model thus has one huge advantage over the central model: data subjects don't need to trust anyone else but themselves. This advantage has made it popular in real-world deployments, including the ones by [Google](https://github.com/google/rappor) and [Apple](https://www.apple.com/privacy/docs/Differential_Privacy_Overview.pdf).\n",
"\n",
"Unfortunately, the local model also has a significant drawback: the accuracy of query results in the local model is typically *orders of magnitude lower* for the same privacy cost as the same query under central differential privacy. This huge loss in accuracy means that only a small handful of query types are suitable for local differential privacy, and even for these, a large number of participants is required.\n",
"\n",
"In this section, we'll see two mechanisms for local differential privacy. The first is called *randomized response*, and the second is called *unary encoding*."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Randomized Response\n",
"\n",
"[Randomized response](https://en.wikipedia.org/wiki/Randomized_response) {cite}`warner1965` is a mechanism for local differential privacy which was first proposed in a 1965 [paper by S. L. Warner](https://www.jstor.org/stable/2283137?seq=1#metadata_info_tab_contents). At the time, the technique was intended to improve bias in survey responses about sensitive issues, and it was not originally proposed as a mechanism for differential privacy (which wouldn't be invented for another 40 years). After differential privacy was developed, statisticians realized that this existing technique *already* satisfied the definition.\n",
"\n",
"Dwork and Roth present a variant of randomized response, in which the data subject answers a \"yes\" or \"no\" question as follows:\n",
"\n",
"1. Flip a coin\n",
"2. If the coin is heads, answer the question truthfully\n",
"3. If the coin is tails, flip another coin\n",
"4. If the second coin is heads, answer \"yes\"; if it is tails, answer \"no\"\n",
"\n",
"The randomization in this algorithm comes from the two coin flips. As in all other differentially private algorithms, this randomization creates uncertainty about the true answer, which is the source of privacy.\n",
"\n",
"As it turns out, this randomized response algorithm satisfies $\\epsilon$-differential privacy for $\\epsilon = \\log(3) = 1.09$.\n",
"\n",
"Let's implement the algorithm for a simple \"yes\" or \"no\" question: \"is your occupation 'Sales'?\" We can flip a coin in Python using `np.random.randint(0, 2)`; the result is either a 0 or a 1."
]
},
{
"cell_type": "code",
"execution_count": 184,
"metadata": {},
"outputs": [],
"source": [
"def rand_resp_sales(response):\n",
" truthful_response = response == 'Sales'\n",
" \n",
" # first coin flip\n",
" if np.random.randint(0, 2) == 0:\n",
" # answer truthfully\n",
" return truthful_response\n",
" else:\n",
" # answer randomly (second coin flip)\n",
" return np.random.randint(0, 2) == 0"
]
},
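{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick empirical check of the privacy claim, we can estimate the probability of a \"yes\" response for a salesperson and for a non-salesperson; the ratio of the two probabilities should be close to $3 = e^{\\log 3}$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Estimate Pr[response = True] for each true answer by simulation\n",
"trials = 100000\n",
"pr_yes_sales = np.mean([rand_resp_sales('Sales') for _ in range(trials)])\n",
"pr_yes_other = np.mean([rand_resp_sales('Other') for _ in range(trials)])\n",
"\n",
"# The ratio should be close to 3, matching epsilon = log(3)\n",
"pr_yes_sales / pr_yes_other"
]
},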
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's ask 200 people who *do* work in sales to respond using randomized response, and look at the results."
]
},
{
"cell_type": "code",
"execution_count": 185,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True 155\n",
"False 45\n",
"dtype: int64"
]
},
"execution_count": 185,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.Series([rand_resp_sales('Sales') for i in range(200)]).value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What we see is that we get both \"yesses\" and \"nos\" - but that the \"yesses\" outweigh the \"nos.\" This output demonstrates both features of the differentially private algorithms we've already seen - it includes uncertainty, which creates privacy, but also displays enough signal to allow us to infer something about the population. \n",
"\n",
"Let's try the same thing on some actual data. We'll take all of the occupations in the US Census dataset we've been using, and encode responses for the question \"is your occupation 'Sales'?\" for each one. In an actual deployed system, we wouldn't collect this dataset centrally at all - instead, each respondant would run `rand_resp_sales` locally, and submit their randomized response to the data curator. For our experiment, we'll run `rand_resp_sales` on the existing dataset."
]
},
{
"cell_type": "code",
"execution_count": 186,
"metadata": {},
"outputs": [],
"source": [
"responses = [rand_resp_sales(r) for r in adult['Occupation']]"
]
},
{
"cell_type": "code",
"execution_count": 187,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False 22633\n",
"True 9928\n",
"dtype: int64"
]
},
"execution_count": 187,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.Series(responses).value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This time, we get many more \"nos\" than \"yesses.\" This makes a lot of sense, with a little thought, because the majority of the participants in the dataset are *not* in sales.\n",
"\n",
"The key question now is: how do we estimate the *acutal* number of salespeople in the dataset, based on these responses? The number of \"yesses\" is not a good estimate for the number of salespeople:"
]
},
{
"cell_type": "code",
"execution_count": 188,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3650"
]
},
"execution_count": 188,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(adult[adult['Occupation'] == 'Sales'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And this is not a surprise, since many of the \"yesses\" come from the random coin flips of the algorithm.\n",
"\n",
"In order to get an estimate of the true number of salespeople, we need to analyze the randomness in the randomized response algorithm and estimate how many of the \"yes\" responses are from actual salespeople, and how many are \"fake\" yesses which resulted from random coin flips. We know that:\n",
"\n",
"- With probability $\\frac{1}{2}$, each respondant responds randomly\n",
"- With probability $\\frac{1}{2}$, each random response is a \"yes\"\n",
"\n",
"So, the probability that a respondant responds \"yes\" by random chance (rather than because they're a salesperson) is $\\frac{1}{2} \\cdot \\frac{1}{2} = \\frac{1}{4}$. This means we can expect one-quarter of our *total* responses to be \"fake yesses.\""
]
},
{
"cell_type": "code",
"execution_count": 189,
"metadata": {},
"outputs": [],
"source": [
"responses = [rand_resp_sales(r) for r in adult['Occupation']]\n",
"\n",
"# we expect 1/4 of the responses to be \"yes\" based entirely on the coin flip\n",
"# these are \"fake\" yesses\n",
"fake_yesses = len(responses)/4\n",
"\n",
"# the total number of yesses recorded\n",
"num_yesses = np.sum([1 if r else 0 for r in responses])\n",
"\n",
"# the number of \"real\" yesses is the total number of yesses minus the fake yesses\n",
"true_yesses = num_yesses - fake_yesses"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The other factor we need to consider is that half of the respondants answer randomly, but *some of the random respondants might actually be salespeople*. How many of them are salespeople? We have no data on that, since they answered randomly!\n",
"\n",
"But, since we split the respondants into \"truth\" and \"random\" groups randomly (by the first coin flip), we can hope that there are roughly the same number of salespeople in both groups. Therefore, if we can estimate the number of salespeople in the \"truth\" group, we can double this number to get the number of salespeople in total."
]
},
{
"cell_type": "code",
"execution_count": 190,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3721.5"
]
},
"execution_count": 190,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# true_yesses estimates the total number of yesses in the \"truth\" group\n",
"# we estimate the total number of yesses for both groups by doubling\n",
"rr_result = true_yesses*2\n",
"rr_result"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How close is that to the true number of salespeople? Let's compare!"
]
},
{
"cell_type": "code",
"execution_count": 192,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3650"
]
},
"execution_count": 192,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"true_result = np.sum(adult['Occupation'] == 'Sales')\n",
"true_result"
]
},
{
"cell_type": "code",
"execution_count": 193,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1.9589041095890412"
]
},
"execution_count": 193,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pct_error(true_result, rr_result)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With this approach, and fairly large counts (e.g. more than 3000, in this case), we generally get \"acceptable\" error - something below 5%. If your goal is to determine the most popular occupation, this approach is likely to work. However, when counts are smaller, the error will quickly get larger.\n",
"\n",
"Furthermore, randomized response is *orders of magnitude* worse than the Laplace mechanism in the central model. Let's compare the two for this example:"
]
},
{
"cell_type": "code",
"execution_count": 169,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.011423124062500005"
]
},
"execution_count": 169,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pct_error(true_result, laplace_mech(true_result, 1, 1))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, we get an error of about 0.01%, even though our $\\epsilon$ value for the central model is slightly lower than the $\\epsilon$ we used for randomized response.\n",
"\n",
"There *are* better algorithms for the local model, but the inherent limitations of having to add noise before submitting your data mean that local model algorithms will *always* have worse accuracy than the best central model algorithms."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Unary Encoding"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Randomized response allows us to ask a yes/no question with local differential privacy. What if we want to build a histogram?\n",
"\n",
"A number of different algorithms for solving this problem in the local model of differential privacy have been proposed. A [2017 paper by Wang et al.](https://arxiv.org/abs/1705.04421) {cite}`wang2017` provides a good summary of some optimal approaches. Here, we'll examine the simplest of these, called *unary encoding*. This approach is the basis for [Google's RAPPOR system](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42852.pdf) {cite}`rappor` (with a number of modifications to make it work better for large domains and multiple responses over time).\n",
"\n",
"The first step is to define the domain for responses - the labels of the histogram bins we care about. For our example, we want to know how many participants are associated with each occupation, so our domain is the set of occupations."
]
},
{
"cell_type": "code",
"execution_count": 194,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['Adm-clerical', 'Exec-managerial', 'Handlers-cleaners',\n",
" 'Prof-specialty', 'Other-service', 'Sales', 'Craft-repair',\n",
" 'Transport-moving', 'Farming-fishing', 'Machine-op-inspct',\n",
" 'Tech-support', 'Protective-serv', 'Armed-Forces', 'Priv-house-serv'], dtype=object)"
]
},
"execution_count": 194,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"domain = adult['Occupation'].dropna().unique()\n",
"domain"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We're going to define three functions, which together implement the unary encoding mechanism:\n",
"\n",
"1. `encode`, which encodes the response\n",
"2. `perturb`, which perturbs the encoded response\n",
"3. `aggregate`, which reconstructs final results from the perturbed responses\n",
"\n",
"The name of this technique comes from the encoding method used: for a domain of size $k$, each responses is encoded as a length-$k$ vector of bits, with all positions 0 except the one corresponding to the occupation of the respondant. In machine learning, this representation is called a \"one-hot encoding.\"\n",
"\n",
"For example, 'Sales' is the 6th element of the domain, so the 'Sales' occupation is encoded with a vector whose 6th element is a 1."
]
},
{
"cell_type": "code",
"execution_count": 195,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]"
]
},
"execution_count": 195,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def encode(response):\n",
" return [1 if d == response else 0 for d in domain]\n",
"\n",
"encode('Sales')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next step is `perturb`, which flips bits in the response vector to ensure differential privacy. The probability that a bit gets flipped is based on two parameters $p$ and $q$, which together determine the privacy parameter $\\epsilon$ (based on a formula we will see in a moment).\n",
"\n",
"$$ \\mathsf{Pr}[B'[i] = 1] = \\left\\{\n",
"\\begin{array}{ll}\n",
" p\\;\\;\\;\\text{if}\\;B[i] = 1 \\\\\n",
" q\\;\\;\\;\\text{if}\\;B[i] = 0\\\\\n",
"\\end{array} \n",
"\\right. $$"
]
},
{
"cell_type": "code",
"execution_count": 196,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0]"
]
},
"execution_count": 196,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def perturb(encoded_response):\n",
" return [perturb_bit(b) for b in encoded_response]\n",
"\n",
"def perturb_bit(bit):\n",
" p = .75\n",
" q = .25\n",
"\n",
" sample = np.random.random()\n",
" if bit == 1:\n",
" if sample <= p:\n",
" return 1\n",
" else:\n",
" return 0\n",
" elif bit == 0:\n",
" if sample <= q:\n",
" return 1\n",
" else: \n",
" return 0\n",
"\n",
"perturb(encode('Sales'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Based on the values of $p$ and $q$, we can calculate the value of the privacy parameter $\\epsilon$. For $p=.75$ and $q=.25$, we will see an $\\epsilon$ of slightly more than 2.\n",
"\n",
"\\begin{align}\n",
"\\epsilon = \\log{\\left(\\frac{p (1-q)}{(1-p) q}\\right)}\n",
"\\end{align}"
]
},
{
"cell_type": "code",
"execution_count": 207,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2.1972245773362196"
]
},
"execution_count": 207,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def unary_epsilon(p, q):\n",
" return np.log((p*(1-q)) / ((1-p)*q))\n",
"\n",
"unary_epsilon(.75, .25)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The final piece is aggregation. If we hadn't done any perturbation, then we could simply take the set of response vectors and add them element-wise to get counts for each element in the domain:"
]
},
{
"cell_type": "code",
"execution_count": 203,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('Adm-clerical', 3770),\n",
" ('Exec-managerial', 4066),\n",
" ('Handlers-cleaners', 1370),\n",
" ('Prof-specialty', 4140),\n",
" ('Other-service', 3295),\n",
" ('Sales', 3650),\n",
" ('Craft-repair', 4099),\n",
" ('Transport-moving', 1597),\n",
" ('Farming-fishing', 994),\n",
" ('Machine-op-inspct', 2002),\n",
" ('Tech-support', 928),\n",
" ('Protective-serv', 649),\n",
" ('Armed-Forces', 9),\n",
" ('Priv-house-serv', 149)]"
]
},
"execution_count": 203,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"counts = np.sum([encode(r) for r in adult['Occupation']], axis=0)\n",
"list(zip(domain, counts))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But as we saw with randomized response, the \"fake\" responses caused by flipped bits cause the results to be difficult to interpret. If we perform the same procedure with the perturbed responses, the counts are all wrong:"
]
},
{
"cell_type": "code",
"execution_count": 208,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('Adm-clerical', 10042),\n",
" ('Exec-managerial', 10204),\n",
" ('Handlers-cleaners', 9006),\n",
" ('Prof-specialty', 10238),\n",
" ('Other-service', 9635),\n",
" ('Sales', 9844),\n",
" ('Craft-repair', 10233),\n",
" ('Transport-moving', 8863),\n",
" ('Farming-fishing', 8721),\n",
" ('Machine-op-inspct', 9122),\n",
" ('Tech-support', 8753),\n",
" ('Protective-serv', 8523),\n",
" ('Armed-Forces', 8157),\n",
" ('Priv-house-serv', 8042)]"
]
},
"execution_count": 208,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"counts = np.sum([perturb(encode(r)) for r in adult['Occupation']], axis=0)\n",
"list(zip(domain, counts))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The aggregate step of the unary encoding algorithm takes into account the number of \"fake\" responses in each category, which is a function of both $p$ and $q$, and the number of responses $n$:\n",
"\n",
"\\begin{align}\n",
"A[i] = \\frac{\\sum_j B'_j[i] - n q}{p - q}\n",
"\\end{align}"
]
},
{
"cell_type": "code",
"execution_count": 205,
"metadata": {},
"outputs": [],
"source": [
"def aggregate(responses):\n",
" p = .75\n",
" q = .25\n",
" \n",
" sums = np.sum(responses, axis=0)\n",
" n = len(responses)\n",
" \n",
" return [(v - n*q) / (p-q) for v in sums] "
]
},
{
"cell_type": "code",
"execution_count": 206,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('Adm-clerical', 3865.5),\n",
" ('Exec-managerial', 4047.5),\n",
" ('Handlers-cleaners', 989.5),\n",
" ('Prof-specialty', 4001.5),\n",
" ('Other-service', 2993.5),\n",
" ('Sales', 3699.5),\n",
" ('Craft-repair', 4093.5),\n",
" ('Transport-moving', 1613.5),\n",
" ('Farming-fishing', 715.5),\n",
" ('Machine-op-inspct', 2119.5),\n",
" ('Tech-support', 947.5),\n",
" ('Protective-serv', 821.5),\n",
" ('Armed-Forces', -92.5),\n",
" ('Priv-house-serv', 387.5)]"
]
},
"execution_count": 206,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"responses = [perturb(encode(r)) for r in adult['Occupation']]\n",
"counts = aggregate(responses)\n",
"list(zip(domain, counts))"
]
},
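{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check, we can measure the error of the unary encoding estimate for the 'Sales' occupation against the true count:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Compare the aggregated estimate for 'Sales' to the true count\n",
"true_count = np.sum(adult['Occupation'] == 'Sales')\n",
"est = counts[list(domain).index('Sales')]\n",
"pct_error(true_count, est)"
]
},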
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we saw with randomized response, these results are accurate enough to obtain a rough ordering of the domain elements (at least the most popular ones), but orders of magnitude less accurate than we could obtain with the Laplace mechanism in the central model of differential privacy.\n",
"\n",
"Other methods have been proposed for performing histogram queries in the local model, including some detailed in the [paper](https://arxiv.org/abs/1705.04421) linked earlier. These can improve accuracy somewhat, but the fundamental limitations of having to ensure differential privacy for *each sample individually* in the local model mean that even the most complex technique can't match the accuracy of the mechanisms we've seen in the central model."
]
}
],
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.10"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

1008
notebooks/ch14.ipynb Normal file

File diff suppressed because one or more lines are too long

918
notebooks/ch2.ipynb Normal file

File diff suppressed because one or more lines are too long

243
notebooks/ch3.ipynb Normal file

@@ -0,0 +1,243 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Differential Privacy\n",
"\n",
"```{admonition} Learning Objectives\n",
"After reading this chapter, you will be able to:\n",
"\n",
"- Define differential privacy\n",
"- Explain the importance of the privacy parameter $\\epsilon$\n",
"- Use the Laplace mechanism to enforce differential privacy for counting queries\n",
"```\n",
"\n",
"Like $k$-Anonymity, *differential privacy* {cite}`dwork2006A,dwork2006B` is a formal notion of privacy (i.e. it's possible to prove that a data release has the property). Unlike $k$-Anonymity, however, differential privacy is a property of *algorithms*, and not a property of *data*. That is, we can prove that an *algorithm* satisfies differential privacy; to show that a *dataset* satisfies differential privacy, we must show that the algorithm which produced it satisfies differential privacy.\n",
"\n",
"\n",
"```{admonition} Definition\n",
"A function which satisfies differential privacy is often called a *mechanism*. We say that a *mechanism* $F$ satisfies differential privacy if for all *neighboring datasets* $x$ and $x'$, and all possible outputs $S$,\n",
"\n",
"\\begin{equation}\n",
"\\frac{\\mathsf{Pr}[F(x) = S]}{\\mathsf{Pr}[F(x') = S]} \\leq e^\\epsilon\n",
"\\end{equation}\n",
"```\n",
"\n",
"Two datasets are considered neighbors if they differ in the data of a single individual. Note that $F$ is typically a *randomized* function, which has many possible outputs under the same input. Therefore, the probability distribution describing its outputs is not just a point distribution.\n",
"\n",
"The important implication of this definition is that $F$'s output will be pretty much the same, *with or without* the data of any specific individual. In other words, the randomness built into $F$ should be \"enough\" so that an observed output from $F$ will not reveal which of $x$ or $x'$ was the input. Imagine that my data is present in $x$ but not in $x'$. If an adversary can't determine which of $x$ or $x'$ was the input to $F$, then the adversary can't tell whether or not my data was *present* in the input - let alone the contents of that data.\n",
"\n",
"The $\\epsilon$ parameter in the definition is called the *privacy parameter* or the *privacy budget*. $\\epsilon$ provides a knob to tune the \"amount of privacy\" the definition provides. Small values of $\\epsilon$ require $F$ to provide *very* similar outputs when given similar inputs, and therefore provide higher levels of privacy; large values of $\\epsilon$ allow less similarity in the outputs, and therefore provide less privacy. \n",
"\n",
"How should we set $\\epsilon$ to prevent bad outcomes in practice? Nobody knows. The general consensus is that $\\epsilon$ should be around 1 or smaller, and values of $\\epsilon$ above 10 probably don't do much to protect privacy - but this rule of thumb could turn out to be very conservative. We will have more to say on this subject later on."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The Laplace Mechanism"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Differential privacy is typically used to answer specific queries. Let's consider a query on the census data, *without* differential privacy."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"tags": [
"remove-cell"
]
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"plt.style.use('seaborn-whitegrid')\n",
"adult = pd.read_csv(\"adult_with_pii.csv\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\"How many individuals in the dataset are 40 years old or older?\""
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"14237"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"adult[adult['Age'] >= 40].shape[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The easiest way to achieve differential privacy for this query is to add random noise to its answer. The key challenge is to add enough noise to satisfy the definition of differential privacy, but not so much that the answer becomes too noisy to be useful. To make this process easier, some basic *mechanisms* have been developed in the field of differential privacy, which describe exactly what kind of - and how much - noise to use. One of these is called the *Laplace mechanism* {cite}`dwork2006B`.\n",
"\n",
"```{admonition} Definition\n",
"According to the Laplace mechanism, for a function $f(x)$ which returns a number, the following definition of $F(x)$ satisfies $\\epsilon$-differential privacy:\n",
"\n",
"\\begin{equation}\n",
"F(x) = f(x) + \\textsf{Lap}\\left(\\frac{s}{\\epsilon}\\right)\n",
"\\end{equation}\n",
"\n",
"where $s$ is the *sensitivity* of $f$, and $\\textsf{Lap}(S)$ denotes sampling from the Laplace distribution with center 0 and scale $S$.\n",
"```\n",
"\n",
"The *sensitivity* of a function $f$ is the amount $f$'s output changes when its input changes by 1. Sensitivity is a complex topic, and an integral part of designing differentially private algorithms; we will have much more to say about it later. For now, we will just point out that *counting queries* always have a sensitivity of 1: if a query counts the number of rows in the dataset with a particular property, and then we modify exactly one row of the dataset, then the query's output can change by at most 1.\n",
"\n",
"Thus we can achieve differential privacy for our example query by using the Laplace mechanism with sensitivity 1 and an $\\epsilon$ of our choosing. For now, let's pick $\\epsilon = 0.1$. We can sample from the Laplace distribution using Numpy's `random.laplace`."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"14238.147613610243"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sensitivity = 1\n",
"epsilon = 0.1\n",
"\n",
"adult[adult['Age'] >= 40].shape[0] + np.random.laplace(loc=0, scale=sensitivity/epsilon)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can see the effect of the noise by running this code multiple times. Each time, the output changes, but most of the time, the answer is close enough to the true answer (14,235) to be useful."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## How Much Noise is Enough?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How do we know that the Laplace mechanism adds enough noise to prevent the re-identification of individuals in the dataset? For one thing, we can try to break it! Let's write down a malicious counting query, which is specifically designed to determine whether Karrie Trusslove has an income greater than \\$50k."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"karries_row = adult[adult['Name'] == 'Karrie Trusslove']\n",
"karries_row[karries_row['Target'] == '<=50K'].shape[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This result definitely violates Karrie's privacy, since it reveals the value of the income column for Karrie's row. Since we know how to ensure differential privacy for counting queries with the Laplace mechanism, we can do so for this query:"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2.198682025336349"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sensitivity = 1\n",
"epsilon = 0.1\n",
"\n",
"karries_row = adult[adult['Name'] == 'Karrie Trusslove']\n",
"karries_row[karries_row['Target'] == '<=50K'].shape[0] + \\\n",
" np.random.laplace(loc=0, scale=sensitivity/epsilon)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Is the true answer 0 or 1? There's too much noise to be able to reliably tell. This is how differential privacy is *intended* to work - the approach does not *reject* queries which are determined to be malicious; instead, it adds enough noise that the results of a malicious query will be useless to the adversary."
]
}
],
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

663
notebooks/ch4.ipynb Normal file

File diff suppressed because one or more lines are too long

529
notebooks/ch5.ipynb Normal file

File diff suppressed because one or more lines are too long

390
notebooks/ch6.ipynb Normal file

File diff suppressed because one or more lines are too long

650
notebooks/ch7.ipynb Normal file

File diff suppressed because one or more lines are too long

411
notebooks/ch8.ipynb Normal file

File diff suppressed because one or more lines are too long

300
notebooks/ch9.ipynb Normal file

@@ -0,0 +1,300 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": [
"remove-cell"
]
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"plt.style.use('seaborn-whitegrid')\n",
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"adult = pd.read_csv(\"adult_with_pii.csv\")\n",
"def laplace_mech(v, sensitivity, epsilon):\n",
" return v + np.random.laplace(loc=0, scale=sensitivity / epsilon)\n",
"def pct_error(orig, priv):\n",
" return np.abs(orig - priv)/orig * 100.0\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# The Exponential Mechanism\n",
"\n",
"```{admonition} Learning Objectives\n",
"After reading this chapter, you will be able to:\n",
"- Define, implement, and apply the Exponential and Report Noisy Max mechanisms\n",
"- Describe the challenges of applying the Exponential mechanism in practice\n",
"- Describe the advantages of these mechanisms\n",
"```\n",
"\n",
"The fundamental mechanisms we have seen so far (Laplace and Gaussian) are focused on numerical answers, and add noise directly to the answer itself. What if we want to return a precise answer (i.e. no added noise), but still preserve differential privacy? One solution is the exponential mechanism {cite}`mcsherry2007`, which allows selecting the \"best\" element from a set while preserving differential privacy. The analyst defines which element is the \"best\" by specifying a *scoring function* that outputs a score for each element in the set, and also defines the set of things to pick from. The mechanism provides differential privacy by *approximately* maximizing the score of the element it returns - in other words, to satisfy differential privacy, the exponential mechanism sometimes returns an element from the set which does *not* have the highest score.\n",
"\n",
"The exponential mechanism satisfies $\\epsilon$-differential privacy:\n",
"\n",
"1. The analyst selects a set $\\mathcal{R}$ of possible outputs\n",
"2. The analyst specifies a scoring function $u : \\mathcal{D} \\times \\mathcal{R} \\rightarrow \\mathbb{R}$ with global sensitivity $\\Delta u$\n",
"3. The exponential mechanism outputs $r \\in \\mathcal{R}$ with probability proportional to:\n",
"\n",
"\\begin{align}\n",
"\\exp \\Big(\\frac{\\epsilon u(x, r)}{2 \\Delta u} \\Big)\n",
"\\end{align}\n",
"\n",
"The biggest practical difference between the exponential mechanism and the previous mechanisms we've seen (e.g. the Laplace mechanism) is that the output of the exponential mechanism is *always* a member of the set $\\mathcal{R}$. This is extremely useful when selecting an item from a finite set, when a noisy answer would not make sense. For example, we might want to pick a date for a big meeting, which uses each participant's personal calendar to maximize the number of participants without a conflict, while providing differential privacy for the calendars. Adding noise to a date doesn't make much sense: it might turn a Friday into a Saturday, and *increase* the number of conflicts significantly. The exponential mechanism is perfect for problems like this one: it selects a date *without noise*.\n",
"\n",
"The exponential mechanism is interesting for several reasons:\n",
"\n",
"- The privacy cost of the mechanism is just $\\epsilon$, regardless of the size of $\\mathcal{R}$ - more on this next.\n",
"- It works for both finite and infinite sets $\\mathcal{R}$, but it can be really challenging to build a practical implementation which samples from the appropriate probability distribution when $\\mathcal{R}$ is infinite.\n",
"- It represents a \"fundamental mechanism\" of $\\epsilon$-differential privacy: all other $\\epsilon$-differentially private mechanisms can be defined in terms of the exponential mechanism with the appropriate definition of the scoring function $u$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The Exponential Mechanism for Finite Sets\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"10.683"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"options = adult['Marital Status'].unique()\n",
"\n",
"def score(data, option):\n",
" return data.value_counts()[option]/1000\n",
"\n",
"score(adult['Marital Status'], 'Never-married')"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Married-civ-spouse'"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def exponential(x, R, u, sensitivity, epsilon):\n",
" # Calculate the score for each element of R\n",
" scores = [u(x, r) for r in R]\n",
" \n",
" # Calculate the probability for each element, based on its score\n",
" probabilities = [np.exp(epsilon * score / (2 * sensitivity)) for score in scores]\n",
" \n",
" # Normalize the probabilties so they sum to 1\n",
" probabilities = probabilities / np.linalg.norm(probabilities, ord=1)\n",
"\n",
" # Choose an element from R based on the probabilities\n",
" return np.random.choice(R, 1, p=probabilities)[0]\n",
"\n",
"exponential(adult['Marital Status'], options, score, 1, 1)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Married-civ-spouse 179\n",
"Never-married 21\n",
"dtype: int64"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"r = [exponential(adult['Marital Status'], options, score, 1, 1) for i in range(200)]\n",
"pd.Series(r).value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Report Noisy Max\n",
"\n",
"Can we recover the exponential mechanism using the Laplace mechanism? In the case of a finite set $\\mathcal{R}$, the basic idea of the exponential mechanism - to select from a set with differential privacy - suggests a naive implementation in terms of the Laplace mechanism:\n",
"\n",
"1. For each $r \\in \\mathcal{R}$, calculate a *noisy score* $u(x, r) + \\mathsf{Lap}\\left(\\frac{\\Delta u}{\\epsilon}\\right)$\n",
"2. Output the element $r \\in \\mathcal{R}$ with the maximum noisy score\n",
"\n",
"Since the scoring function $u$ is $\\Delta u$ sensitive in $x$, each \"query\" in step 1 satisfies $\\epsilon$-differential privacy. Thus if $\\mathcal{R}$ contains $n$ elements, the above algorithm satisfies $n\\epsilon$-differential privacy by sequential composition.\n",
"\n",
"However, if we used the exponential mechanism, the total cost would be just $\\epsilon$ instead! Why is the exponential mechanism so much better? Because *it releases less information*.\n",
"\n",
"Our analysis of the Laplace-based approach defined above is very pessimistic. The whole set of noisy scores computed in step 1 actually satisfies $n\\epsilon$-differential privacy, and we could release the whole thing. That the output in step 2 satisfies $n\\epsilon$-differential privacy follows from the post-processing property.\n",
"\n",
"But the exponential mechanism releases *only* the identity of the element with the maximum noisy score - *not* the score itself, or the scores of any other element.\n",
"\n",
"The algorithm defined above is often called the *report noisy max* algorithm, and it actually satisfies $\\epsilon$-differential privacy, no matter how large the set $\\mathcal{R}$ is - specifically because it releases *only* the identity of the element with the largest noisy count. The proof can be found in [Dwork and Roth](https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf) {cite}`dwork2014`, Claim 3.9.\n",
"\n",
"Report noisy max is easy to implement, and it's easy to see that it produces very similar results to our earlier implementation of the exponential mechanism for finite sets."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Married-civ-spouse'"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def report_noisy_max(x, R, u, sensitivity, epsilon):\n",
" # Calculate the score for each element of R\n",
" scores = [u(x, r) for r in R]\n",
"\n",
" # Add noise to each score\n",
" noisy_scores = [laplace_mech(score, sensitivity, epsilon) for score in scores]\n",
"\n",
" # Find the index of the maximum score\n",
" max_idx = np.argmax(noisy_scores)\n",
" \n",
" # Return the element corresponding to that index\n",
" return R[max_idx]\n",
"\n",
"report_noisy_max(adult['Marital Status'], options, score, 1, 1)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"Married-civ-spouse 192\n",
"Never-married 8\n",
"dtype: int64"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"r = [report_noisy_max(adult['Marital Status'], options, score, 1, 1) for i in range(200)]\n",
"pd.Series(r).value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So the exponential mechanism can be replaced with report noisy max when the set $\\mathcal{R}$ is finite, but what about when it's infinite? We can't easily add Laplace noise to an infinite set of scores. In this context, we have to use the actual exponential mechanism. \n",
"\n",
"In practice, however, using the exponential mechanism for infinite sets is often challenging or impossible. While it's easy to write down the probability density function defined by the mechanism, it's often the case that no efficient algorithm exists for sampling from it. As a result, numerous theoretical papers appeal to the exponential mechanism to show that a differentially private algorithm \"exists\" with certain desirable properties, but many of these algorithms are impossible to use in practice."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The Exponential Mechanism as Fundamental Mechanism for Differential Privacy\n",
"\n",
"We've seen that it's not possible to recover the exponential mechanism using the Laplace mechanism plus sequential composition, because we can't capture the fact that the algorithm we designed doesn't release all of the noisy scores. What about the reverse - can we recover the Laplace mechanism from the exponential mechanism? It turns out that we can!\n",
"\n",
"Consider a query $q(x) : \\mathcal{D} \\rightarrow \\mathbb{R}$ with sensitivity $\\Delta q$. We can release an $\\epsilon$-differentially private answer by adding Laplace noise: $F(x) = q(x) + \\mathsf{Lap}(\\Delta q / \\epsilon)$. The probability density function for this differentially private version of $q$ is:\n",
"\n",
"\\begin{align}\n",
"\\mathsf{Pr}[F(x) = r] =& \\frac{1}{2b} \\exp\\Big(- \\frac{\\lvert r - \\mu \\rvert}{b}\\Big)\\\\\n",
"=& \\frac{\\epsilon}{2 \\Delta q} \\exp\\Big(- \\frac{\\epsilon \\lvert r - q(x) \\rvert}{\\Delta q}\\Big)\n",
"\\end{align}\n",
"\n",
"Consider what happens when we set the scoring function for the exponential mechanism to $u(x, r) = -2 \\lvert q(x) - r \\rvert$. The exponential mechanism says that we should sample from the probability distribution proportional to:\n",
"\n",
"\\begin{align}\n",
"\\mathsf{Pr}[F(x) = r] =&\\; \\exp \\Big(\\frac{\\epsilon u(x, r)}{2 \\Delta u} \\Big)\\\\\n",
"&= \\exp \\Big(\\frac{\\epsilon (-2 \\lvert q(x) - r \\rvert)}{2 \\Delta q} \\Big)\\\\\n",
"&= \\exp \\Big(- \\frac{\\epsilon \\lvert r - q(x) \\rvert}{\\Delta q} \\Big)\\\\\n",
"\\end{align}\n",
"\n",
"So it's possible to recover the Laplace mechanism from the exponential mechanism, and we get the same results (up to constant factors - the general analysis for the exponential mechanism is not tight in all cases).\n",
"\n",
"The exponential mechanism is extremely general - it's generally possible to re-define any $\\epsilon$-differentially private mechanism in terms of a carefully chosen definition of the scoring function $u$. If we can analyze the sensitivity of this scoring function, then the proof of differential privacy comes for free.\n",
"\n",
"On the other hand, applying the general analysis of the exponential mechanism sometimes comes at the cost of looser bounds (as in the Laplace example above), and mechanisms defined in terms of the exponential mechanism are often very difficult to implement. The exponential mechanism is often used to prove theoretical lower bounds (by showing that a differentially private algorithm *exists*), but practical algorithms often replicate the same behavior using some other approach (as in the case of report noisy max above)."
]
},
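{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick empirical check of this derivation, we can sample from the exponential mechanism with $u(x, r) = -2 \\lvert q(x) - r \\rvert$ over a discretized output range and compare the spread of the samples to Laplace noise with scale $\\frac{\\Delta q}{\\epsilon}$. The query answer and output range below are hypothetical."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"true_answer = 14237          # hypothetical query answer, sensitivity 1\n",
"R = np.arange(14000, 14500)  # discretized output range\n",
"epsilon = 0.1\n",
"\n",
"# Exponential mechanism with u(x, r) = -2|q(x) - r|, following the\n",
"# derivation above (2 * Delta q in the denominator, with Delta q = 1)\n",
"scores = -2 * np.abs(true_answer - R)\n",
"probs = np.exp(epsilon * scores / 2)\n",
"probs = probs / probs.sum()\n",
"exp_mech_samples = np.random.choice(R, size=10000, p=probs)\n",
"\n",
"laplace_samples = np.random.laplace(loc=true_answer, scale=1/epsilon, size=10000)\n",
"print(np.std(exp_mech_samples), np.std(laplace_samples))"
]
},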
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.10"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

7
notebooks/cover.md Normal file

@@ -0,0 +1,7 @@
# Programming Differential Privacy
![logo](logo.png)
**A book about differential privacy, for programmers**
**By Joseph P. Near and Chiké Abuah**

50
notebooks/intro.ipynb Normal file

@@ -0,0 +1,50 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction\n",
"\n",
"This is a book about differential privacy, for programmers. It is intended to give you an introduction to the challenges of data privacy, introduce you to the techniques that have been developed for addressing those challenges, and help you understand how to implement some of those techniques. \n",
"\n",
"The book contains numerous examples *as programs*, including implementations of many concepts. Each chapter is generated from a self-contained Jupyter Notebook. You can click on the \"download\" button at the top-right of the chapter, and then select \".ipynb\" to download the notebook for that chapter, and you'll be able to execute the examples yourself. Many of the examples are generated by code that is hidden (for readability) in the chapters you'll see here. You can show this code by clicking the \"Click to show\" labels adjacent to these cells.\n",
"\n",
"This book assumes a working knowledge of Python, as well as basic knowledge of the pandas and NumPy libraries. You will also benefit from some background in discrete mathematics and probability - a basic undergraduate course in these topics should be more than sufficient.\n",
"\n",
"This book is open source, and the latest version will always be available online [here](https://uvm-plaid.github.io/programming-dp/notebooks/intro.html). The source code is available [on GitHub](https://github.com/uvm-plaid/programming-dp). If you would like to fix a typo, suggest an improvement, or report a bug, please open an issue on GitHub.\n",
"\n",
"The techniques described in this book have developed out of the study of *data privacy*. For our purposes, we will define data privacy this way:\n",
"\n",
"```{admonition} Definition\n",
"*Data privacy* techniques have the goal of allowing analysts to learn about *trends* in sensitive data, without revealing information specific to *individuals*.\n",
"```\n",
"\n",
"This is a broad definition, and many different techniques fall under it. But it's important to note what this definition *excludes*: techniques for ensuring *security*, like encryption. Encrypted data doesn't reveal *anything* - so it fails to meet the first requirement of our definition. The distinction between security and privacy is an important one: privacy techniques involve an *intentional* release of information, and attempt to control *what can be learned* from that release; security techniques usually *prevent* the release of information, and control *who can access* data. This book covers privacy techniques, and we will only discuss security when it has important implications for privacy.\n",
"\n",
"This book is primarily focused on differential privacy. The first couple of chapters outline some of the reasons why: differential privacy (and its variants) is the only formal approach we know about that seems to provide robust privacy protection. Commonly-used approaches that have been used for decades (like de-identification and aggregation) have more recently been shown to break down under sophisticated privacy attacks, and even more modern techniques (like $k$-Anonymity) are susceptible to certain attacks. For this reason, differential privacy is fast becoming the gold standard in privacy protection, and thus it is the primary focus of this book."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

BIN
notebooks/logo.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 32 KiB

237
notebooks/references.bib Normal file
View File

@ -0,0 +1,237 @@
@misc{identifiability,
author = {Sweeney, Latanya},
title = {Simple Demographics Often Identify People Uniquely},
url = {https://dataprivacylab.org/projects/identifiability/}}
@article{sweeney2002,
author = {Sweeney, Latanya},
title = {k-Anonymity: A Model for Protecting Privacy},
journal = {International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems},
volume = {10},
number = {05},
pages = {557--570},
year = {2002},
doi = {10.1142/S0218488502001648},
url = {https://doi.org/10.1142/S0218488502001648}}
@inproceedings{mcsherry2009,
author = {McSherry, Frank D.},
title = {Privacy Integrated Queries: An Extensible Platform for Privacy-Preserving Data Analysis},
year = {2009},
isbn = {9781605585512},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/1559845.1559850},
doi = {10.1145/1559845.1559850},
abstract = {We report on the design and implementation of the Privacy Integrated Queries (PINQ) platform for privacy-preserving data analysis. PINQ provides analysts with a programming interface to unscrubbed data through a SQL-like language. At the same time, the design of PINQ's analysis language and its careful implementation provide formal guarantees of differential privacy for any and all uses of the platform. PINQ's unconditional structural guarantees require no trust placed in the expertise or diligence of the analysts, substantially broadening the scope for design and deployment of privacy-preserving data analysis, especially by non-experts.},
booktitle = {Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data},
pages = {19--30},
numpages = {12},
keywords = {differential privacy, linq, confidentiality, anonymization},
location = {Providence, Rhode Island, USA},
series = {SIGMOD '09}
}
@InProceedings{dwork2006,
author="Dwork, Cynthia
and Kenthapadi, Krishnaram
and McSherry, Frank
and Mironov, Ilya
and Naor, Moni",
editor="Vaudenay, Serge",
title="Our Data, Ourselves: Privacy Via Distributed Noise Generation",
booktitle="Advances in Cryptology - EUROCRYPT 2006",
year="2006",
publisher="Springer Berlin Heidelberg",
address="Berlin, Heidelberg",
pages="486--503"
}
@inproceedings{dwork2006A,
author = {Dwork, Cynthia},
title = {Differential Privacy},
year = {2006},
isbn = {3540359079},
publisher = {Springer-Verlag},
address = {Berlin, Heidelberg},
url = {https://doi.org/10.1007/11787006_1},
doi = {10.1007/11787006_1},
abstract = {In 1977 Dalenius articulated a desideratum for statistical databases: nothing about an individual should be learnable from the database that cannot be learned without access to the database. We give a general impossibility result showing that a formalization of Dalenius' goal along the lines of semantic security cannot be achieved. Contrary to intuition, a variant of the result threatens the privacy even of someone not in the database. This state of affairs suggests a new measure, differential privacy, which, intuitively, captures the increased risk to one's privacy incurred by participating in a database. The techniques developed in a sequence of papers [8, 13, 3], culminating in those described in [12], can achieve any desired level of privacy under this measure. In many cases, extremely accurate information about the database can be provided while simultaneously ensuring very high levels of privacy},
booktitle = {Proceedings of the 33rd International Conference on Automata, Languages and Programming - Volume Part II},
pages = {1--12},
numpages = {12},
location = {Venice, Italy},
series = {ICALP'06}
}
@inproceedings{dwork2006B,
author = {Dwork, Cynthia and McSherry, Frank and Nissim, Kobbi and Smith, Adam},
title = {Calibrating Noise to Sensitivity in Private Data Analysis},
year = {2006},
isbn = {3540327312},
publisher = {Springer-Verlag},
address = {Berlin, Heidelberg},
url = {https://doi.org/10.1007/11681878_14},
doi = {10.1007/11681878_14},
abstract = {We continue a line of research initiated in [10,11]on privacy-preserving statistical databases. Consider a trusted server that holds a database of sensitive information. Given a query function f mapping databases to reals, the so-called true answer is the result of applying f to the database. To protect privacy, the true answer is perturbed by the addition of random noise generated according to a carefully chosen distribution, and this response, the true answer plus noise, is returned to the user.Previous work focused on the case of noisy sums, in which f = ∑ig(xi), where xi denotes the ith row of the database and g maps database rows to [0,1]. We extend the study to general functions f, proving that privacy can be preserved by calibrating the standard deviation of the noise according to the sensitivity of the function f. Roughly speaking, this is the amount that any single argument to f can change its output. The new analysis shows that for several particular applications substantially less noise is needed than was previously understood to be the case.The first step is a very clean characterization of privacy in terms of indistinguishability of transcripts. Additionally, we obtain separation results showing the increased value of interactive sanitization mechanisms over non-interactive.},
booktitle = {Proceedings of the Third Conference on Theory of Cryptography},
pages = {265--284},
numpages = {20},
location = {New York, NY},
series = {TCC'06}
}
@inproceedings{nissim2007,
author = {Nissim, Kobbi and Raskhodnikova, Sofya and Smith, Adam},
title = {Smooth Sensitivity and Sampling in Private Data Analysis},
year = {2007},
isbn = {9781595936318},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/1250790.1250803},
doi = {10.1145/1250790.1250803},
abstract = {We introduce a new, generic framework for private data analysis.The goal of private data analysis is to release aggregate information about a data set while protecting the privacy of the individuals whose information the data set contains.Our framework allows one to release functions f of the data withinstance-based additive noise. That is, the noise magnitude is determined not only by the function we want to release, but also bythe database itself. One of the challenges is to ensure that the noise magnitude does not leak information about the database. To address that, we calibrate the noise magnitude to the smoothsensitivity of f on the database x --- a measure of variabilityof f in the neighborhood of the instance x. The new frameworkgreatly expands the applicability of output perturbation, a technique for protecting individuals' privacy by adding a smallamount of random noise to the released statistics. To our knowledge, this is the first formal analysis of the effect of instance-basednoise in the context of data privacy.Our framework raises many interesting algorithmic questions. Namely,to apply the framework one must compute or approximate the smoothsensitivity of f on x. We show how to do this efficiently for several different functions, including the median and the cost ofthe minimum spanning tree. We also give a generic procedure based on sampling that allows one to release f(x) accurately on manydatabases x. This procedure is applicable even when no efficient algorithm for approximating smooth sensitivity of f is known orwhen f is given as a black box. We illustrate the procedure by applying it to k-SED (k-means) clustering and learning mixtures of Gaussians.},
booktitle = {Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing},
pages = {75--84},
numpages = {10},
keywords = {private data analysis, output perturbation, clustering, sensitivity, privacy preserving data mining},
location = {San Diego, California, USA},
series = {STOC '07}
}
@inproceedings{dwork2009,
author = {Dwork, Cynthia and Lei, Jing},
title = {Differential Privacy and Robust Statistics},
year = {2009},
isbn = {9781605585062},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/1536414.1536466},
doi = {10.1145/1536414.1536466},
abstract = {We show by means of several examples that robust statistical estimators present an excellent starting point for differentially private estimators. Our algorithms use a new paradigm for differentially private mechanisms, which we call Propose-Test-Release (PTR), and for which we give a formal definition and general composition theorems.},
booktitle = {Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing},
pages = {371--380},
numpages = {10},
keywords = {propose-test-release paradigm, local sensitivity, differential privacy, robust statistics},
location = {Bethesda, MD, USA},
series = {STOC '09}
}
@article{dwork2014,
title={The algorithmic foundations of differential privacy},
author={Dwork, Cynthia and Roth, Aaron and others},
journal={Foundations and Trends{\textregistered} in Theoretical Computer Science},
volume={9},
number={3--4},
pages={211--407},
year={2014},
publisher={Now Publishers, Inc.}
}
@INPROCEEDINGS{dwork2010,
author={Dwork, Cynthia and Rothblum, Guy N. and Vadhan, Salil},
booktitle={2010 IEEE 51st Annual Symposium on Foundations of Computer Science},
title={Boosting and Differential Privacy},
year={2010}, volume={}, number={}, pages={51-60}, doi={10.1109/FOCS.2010.12}}
@inproceedings{bun2018composable,
title={Composable and versatile privacy via truncated CDP},
author={Bun, Mark and Dwork, Cynthia and Rothblum, Guy N and Steinke, Thomas},
booktitle={Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing},
pages={74--86},
year={2018},
organization={ACM}
}
@inproceedings{mironov2017renyi,
title={Renyi differential privacy},
author={Mironov, Ilya},
booktitle={Computer Security Foundations Symposium (CSF), 2017 IEEE 30th},
pages={263--275},
year={2017},
organization={IEEE}
}
@inproceedings{bun2016concentrated,
title={Concentrated differential privacy: Simplifications, extensions, and lower bounds},
author={Bun, Mark and Steinke, Thomas},
booktitle={Theory of Cryptography Conference},
pages={635--658},
year={2016},
organization={Springer}
}
@INPROCEEDINGS{mcsherry2007,
author={McSherry, Frank and Talwar, Kunal},
booktitle={48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07)},
title={Mechanism Design via Differential Privacy},
year={2007}, volume={}, number={}, pages={94-103}, doi={10.1109/FOCS.2007.66}}
@inproceedings{dwork2009A,
author = {Dwork, Cynthia and Naor, Moni and Reingold, Omer and Rothblum, Guy N. and Vadhan, Salil},
title = {On the Complexity of Differentially Private Data Release: Efficient Algorithms and Hardness Results},
year = {2009},
isbn = {9781605585062},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/1536414.1536467},
doi = {10.1145/1536414.1536467},
abstract = {We consider private data analysis in the setting in which a trusted and trustworthy curator, having obtained a large data set containing private information, releases to the public a "sanitization" of the data set that simultaneously protects the privacy of the individual contributors of data and offers utility to the data analyst. The sanitization may be in the form of an arbitrary data structure, accompanied by a computational procedure for determining approximate answers to queries on the original data set, or it may be a "synthetic data set" consisting of data items drawn from the same universe as items in the original data set; queries are carried out as if the synthetic data set were the actual input. In either case the process is non-interactive; once the sanitization has been released the original data and the curator play no further role.For the task of sanitizing with a synthetic dataset output, we map the boundary between computational feasibility and infeasibility with respect to a variety of utility measures. For the (potentially easier) task of sanitizing with unrestricted output format, we show a tight qualitative and quantitative connection between hardness of sanitizing and the existence of traitor tracing schemes.},
booktitle = {Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing},
pages = {381--390},
numpages = {10},
keywords = {cryptography, privacy, differential privacy, traitor tracing, exponential mechanism},
location = {Bethesda, MD, USA},
series = {STOC '09}
}
@inproceedings{rappor,
author = {Erlingsson, \'{U}lfar and Pihur, Vasyl and Korolova, Aleksandra},
title = {RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response},
year = {2014},
isbn = {9781450329576},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/2660267.2660348},
doi = {10.1145/2660267.2660348},
abstract = {Randomized Aggregatable Privacy-Preserving Ordinal Response, or RAPPOR, is a technology for crowdsourcing statistics from end-user client software, anonymously, with strong privacy guarantees. In short, RAPPORs allow the forest of client data to be studied, without permitting the possibility of looking at individual trees. By applying randomized response in a novel manner, RAPPOR provides the mechanisms for such collection as well as for efficient, high-utility analysis of the collected data. In particular, RAPPOR permits statistics to be collected on the population of client-side strings with strong privacy guarantees for each client, and without linkability of their reports. This paper describes and motivates RAPPOR, details its differential-privacy and utility guarantees, discusses its practical deployment and properties in the face of different attack models, and, finally, gives results of its application to both synthetic and real-world data.},
booktitle = {Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security},
pages = {1054--1067},
numpages = {14},
keywords = {population statistics, crowdsourcing, cloud computing, statistical inference, privacy protection},
location = {Scottsdale, Arizona, USA},
series = {CCS '14}
}
@article{warner1965,
author = { Stanley L. Warner },
title = {Randomized Response: A Survey Technique for Eliminating Evasive Answer Bias},
journal = {Journal of the American Statistical Association},
volume = {60},
number = {309},
pages = {63-69},
year = {1965},
publisher = {Taylor & Francis},
doi = {10.1080/01621459.1965.10480775},
note ={PMID: 12261830},
URL = {https://www.tandfonline.com/doi/abs/10.1080/01621459.1965.10480775}}
@inproceedings {wang2017,
author = {Tianhao Wang and Jeremiah Blocki and Ninghui Li and Somesh Jha},
title = {Locally Differentially Private Protocols for Frequency Estimation},
booktitle = {26th {USENIX} Security Symposium ({USENIX} Security 17)},
year = {2017},
isbn = {978-1-931971-40-9},
address = {Vancouver, BC},
pages = {729--745},
url = {https://www.usenix.org/conference/usenixsecurity17/technical-sessions/presentation/wang-tianhao},
publisher = {{USENIX} Association},
month = aug,
}

104
notes.md Normal file
View File

@ -0,0 +1,104 @@
# Notes
To build the book:
```
jupyter-book build .
```
To publish the html:
```
ghp-import -n -p -f _build/html
```
To generate latex:
```
jupyter-book build . --builder latex
```
To get decent latex output, add this:
```
\PassOptionsToPackage{svgnames}{xcolor}
...
\sphinxsetup{%
verbatimwithframe=false,
VerbatimColor={named}{OldLace},
TitleColor={named}{DarkGoldenrod},
hintBorderColor={named}{LightCoral},
attentionborder=3pt,
attentionBorderColor={named}{Crimson},
attentionBgColor={named}{FloralWhite},
noteborder=2pt,
noteBorderColor={named}{Olive},
cautionborder=3pt,
cautionBorderColor={named}{Cyan},
cautionBgColor={named}{LightCyan}}
```
Or:
```
\PassOptionsToPackage{svgnames}{xcolor}
...
\sphinxsetup{%
% verbatimwithframe=false,
VerbatimColor={named}{Ivory},
TitleColor={named}{DarkGoldenrod},
hintBorderColor={named}{LightCoral},
attentionborder=3pt,
attentionBorderColor={named}{Crimson},
attentionBgColor={named}{FloralWhite},
noteborder=2pt,
noteBorderColor={named}{Olive},
cautionborder=3pt,
cautionBorderColor={named}{Cyan},
cautionBgColor={named}{LightCyan}}
\setkeys{Gin}{width=.60\csname Gin@nat@width\endcsname,keepaspectratio}
```
And also:
```
\makeatletter
\usepackage{ifthen}
\usepackage{tcolorbox}
\tcbuselibrary{skins}
\renewenvironment{sphinxadmonition}[2]
{
%Green colored box for Conditions
\ifthenelse{\equal{#2}{Conditions}}{
\medskip
\begin{tcolorbox}[before={}, enhanced, colback=green!10,
colframe=green!65!black,fonttitle=\bfseries,
title=\sphinxstrong{#2}, arc=0mm, drop fuzzy shadow=blue!50!black!50!white]}{
%Blue colored box for Notes
\ifthenelse{\equal{#2}{Note:}}{
\medskip
\begin{tcolorbox}[before={}, enhanced, colback=blue!5!white,
colframe=blue!75!black,fonttitle=\bfseries,
title=\sphinxstrong{#2}, arc=0mm, drop fuzzy shadow=blue!50!black!50!white]}{
%Orange colored box for Warnings
\ifthenelse{\equal{#2}{Warning:}}{
\medskip
\begin{tcolorbox}[before={}, enhanced, colback=orange!5!white,
colframe=orange!75!black,fonttitle=\bfseries,
title=\sphinxstrong{#2}, arc=0mm, drop fuzzy shadow=blue!50!black!50!white]}{
%Red colored box for everything else
\medskip
\begin{tcolorbox}[before={}, enhanced, colback=red!5!white,
colframe=red!75!black, fonttitle=\bfseries,
title=\sphinxstrong{#2}, arc=0mm, drop fuzzy shadow=blue!50!black!50!white]}
}
}
}
{
\end{tcolorbox}\par\bigskip
}
\makeatother
```

3
requirements.txt Normal file
View File

@ -0,0 +1,3 @@
jupyter-book
matplotlib
numpy

1
static/CNAME Normal file
View File

@ -0,0 +1 @@
programming-dp.com

BIN
static/book-logo.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 95 KiB

238
static/index.html Normal file
View File

@ -0,0 +1,238 @@
<html prefix="og: https://ogp.me/ns#">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<!-- Bootstrap CSS -->
<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.1.3/css/bootstrap.min.css" integrity="sha384-MCw98/SFnGE8fJT3GXwEOngsV7Zt27NXFoaoApmYm81iuXoPkFOJwJ8ERdknLPMO" crossorigin="anonymous">
<title>Programming Differential Privacy</title>
<meta name="description" content="A book about differential privacy, for programmers."/>
<meta property="og:title" content="Programming Differential Privacy" />
<meta property="og:type" content="book" />
<meta property="og:image" content="https://uvm-plaid.github.io/programming-dp/book-logo.png" />
<meta property="og:image:secure_url" content="https://uvm-plaid.github.io/programming-dp/book-logo.png" />
<meta property="og:image:type" content="image/png" />
<meta property="og:image:alt" content="book logo" />
<meta name="twitter:card" content="summary" />
<meta name="twitter:site" content="@josephnear" />
<meta name="twitter:title" content="Programming Differential Privacy" />
<meta name="twitter:description" content="A book about differential privacy, for programmers" />
<meta name="twitter:image" content="https://uvm-plaid.github.io/programming-dp/book-logo.png" />
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "Book",
"name": "Programming Differential Privacy",
"about": "A book about differential privacy, for programmers"
"image": "https://uvm-plaid.github.io/programming-dp/book-logo.png",
}
</script>
<PageMap>
<DataObject type="thumbnail">
<Attribute name="src" value="https://uvm-plaid.github.io/programming-dp/book-logo.png"/>
</DataObject>
</PageMap>
<meta name="thumbnail" content="https://uvm-plaid.github.io/programming-dp/book-logo.png" />
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<style type="text/css">
body { background-color: #FFFFFF;
font-family: Optima, Palatino, Arial, sans-serif, Helvetica;
padding:0ex 0ex 0ex 0ex ;
margin: 0ex 0ex 0ex 0ex ;
}
hr {
border: 0;
width: 90%;
height: 2px;
color: #e3e7ef;
background-color: #e3e7ef;
margin:3ex 2ex 0ex 0ex ;
}
h1 {
font-family: Optima,Segoe,Segoe UI,Candara,Calibri,Arial,sans-serif;
color:#333333;
font-size:2.7em;
letter-spacing:-2px;
}
h2 {
font-family: Optima,Segoe,Segoe UI,Candara,Calibri,Arial,sans-serif;
color:#333333;
font-size:1.7em;
padding-bottom: 0.5em;
}
h3 {
font-family: Optima,Segoe,Segoe UI,Candara,Calibri,Arial,sans-serif;
color:#444444;
}
h4 {
font-family: Optima,Segoe,Segoe UI,Candara,Calibri,Arial,sans-serif;
color:#777777;
padding: 0.5em;
}
big {
font-size:1.3em;
}
span {
color:#ffffff;
}
a {
padding: .2em;
}
html { font-size: 18px !important }
@media (pointer: coarse) {
a {
padding: .4em;
}
}
img {
max-width: 300px;
}
a:link {
color:#444444;
text-decoration: underline;
}
a:visited {
color:#444444;
}
a:hover {
color:#222222;
text-decoration:none;
}
a:active {
color:#000000;
}
p.small {
font-variant: small-caps;
}
.box {
display: flex;
justify-content: center;
text-align: center;
flex-flow: row nowrap;
margin: 10%;
}
@media screen and (max-width:600px) {
.box {
flex-flow: column nowrap;
}
}
</style>
<style type=text/css>
#book-logo{
border: black;
border-width: thin;
border-style: solid;
max-width: 70%;
height: auto;
}
#citation{
font-size: 10px;
background: lightgray;
padding: 10px;
font-family: monospace;
width: 80%;
}
</style>
</head>
<body>
<div class="box">
<div style="flex: 0 0 50%; margin-bottom: 10px;">
<a href="https://programming-dp.com/cover.html">
<img id="book-logo" src="book-logo.png" alt="Book Logo" >
</a>
</div>
<div style="flex: 0 0 50%; padding-bottom: 2em; text-align: left;">
<h1>Programming Differential Privacy</h1>
<h2>A book about differential privacy, for programmers</h2>
<h2>By <a href="http://uvm.edu/~jnear">Joseph P. Near</a>
and <a href="http://uvm.edu/~cabuah">Chiké Abuah</a></h2>
<p><b><i>Programming Differential Privacy</i> uses examples and
Python code to explain the ideas behind differential privacy!</b>
The book is suitable for undergraduate students in computer science,
and no theory background is expected.</p>
<p><b><i>Programming Differential Privacy</i> is executable!</b>
Each chapter is actually generated from Python code. If you view the
HTML version of the book, you can click on the "Launch Binder" icon
at the top of each page to start an interactive version of that
chapter.</p>
<p>
<ul>
<li><b><a href="https://programming-dp.com/cover.html">Click
here to read the HTML version</a></b>
<li><b><a href="book.pdf">Click here to download the PDF version</a></b>
<li><b><a href="https://programming-dp.com/cn/index.html">Click
here to read the Chinese translation</a></b>
<p> (Translated by <a href="https://github.com/liuweiran900217">Weiran Liu</a> and <a href="https://github.com/little12">Shuang Li </a>) </p>
</ul>
</p>
<p>This book was originally developed at the University of Vermont
as part
of <a href="https://jnear.github.io/cs211-data-privacy/">CS211:
Data Privacy</a>. The material has since been used at the
University of Chicago, Penn State, and Rice University. If you're
using the book in your course, please let us know!</p>
<p><b><i>Programming Differential Privacy</i> is a living,
open-source book.</b> We welcome comments, suggestions, and
contributions via issues and pull requests
on <a href="https://github.com/uvm-plaid/programming-dp">the GitHub
repository</a>.</p>
<p>Please use the following to cite the book:</p>
<div id="citation">
@book{near_abuah_2021,<br>
&nbsp;&nbsp; title={Programming Differential Privacy},<br>
&nbsp;&nbsp; author={Near, Joseph P. and Abuah, Chiké},<br>
&nbsp;&nbsp; volume={1},<br>
&nbsp;&nbsp; url={https://uvm-plaid.github.io/programming-dp/}, <br>
&nbsp;&nbsp; year={2021}<br>
}<br>
</div>
</div>
</div>
</body>
</html>

108
zh_cn/README.md Normal file
View File

@ -0,0 +1,108 @@
# 动手学差分隐私 (Programming Differential Privacy)
This is the source repository for the book "Programming Differential Privacy." You can find the book online [here](https://uvm-plaid.github.io/programming-dp).
## Declaration
The translation work is authorized by the original authors, Joseph P. Near and Chiké Abuah, and is supported by the editor Lei Yao from China Machine Press. The Chinese version of the book will be freely available online, and there will be a printed version published by China Machine Press.
## Translation Tool
The Chinese version of the book is translated using [DataSpell](https://www.jetbrains.com/dataspell/).
## Install Jupyter Book
See the [Jupyter Book Overview](https://jupyterbook.org/en/stable/start/overview.html) for details on installing Jupyter Book.
> You can install Jupyter Book via [pip](https://pip.pypa.io/en/stable/):
>
> ```shell
> pip install -U jupyter-book
> ```
>
> or via [conda-forge](https://conda-forge.org/):
>
> ```shell
> conda install -c conda-forge jupyter-book
> ```
>
> This will install everything you need to build a Jupyter Book locally.
## Configuration for the Chinese Version
You need to add the following configuration to `_config.yml` to generate a correctly displayed Chinese version of the PDF:
```text
sphinx:
  config:
    language: zh_CN
```
Note that you may encounter display problems when using Chinese labels in `matplotlib.pyplot`. We use a library named `mplfonts` to solve that problem; see [here](https://www.zhihu.com/question/25404709) for details. First, install `mplfonts`:
```shell
pip install mplfonts -i https://pypi.tuna.tsinghua.edu.cn/simple
```
Next, add the following code before using `matplotlib.pyplot`:
```python
from mplfonts.bin.cli import init
init()
from mplfonts import use_font
use_font('SimHei')
```
Unfortunately, the style `seaborn-whitegrid` used in the English version of the book does not support Chinese characters. We need to change `seaborn-whitegrid` to `fivethirtyeight`:
```python
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
```
## Compile the Chinese Version
If you follow `deploy.sh` and directly execute `jupyter-book build .` under the directory containing the source code for the Chinese version, you will get the following error:
```text
The Table of Contents file is malformed: toc is not a mapping: <class 'list'>
You may need to migrate from the old format, using:
jupyter-book toc migrate /XXX/zh_cn/_toc.yml -o /XXX/zh_cn/_toc.yml
```
Just execute the recommended instruction to migrate `_toc.yml` from the old format:
```text
jupyter-book toc migrate /XXX/zh_cn/_toc.yml -o /XXX/zh_cn/_toc.yml
```
For an old-format `_toc.yml`, `jupyter-book` treats the project as an article (`format: jb-article`) rather than a book (`format: jb-book`). You need to modify the auto-generated `_toc.yml` by changing `format: jb-article` to `format: jb-book`, and changing `sections:` to `chapters:`. After that, you can successfully build the project and generate a PDF with the correct format.

31
zh_cn/_config.yml Normal file
View File

@ -0,0 +1,31 @@
# Book settings
title: 动手学差分隐私
author: Joseph P. Near and Chiké Abuah (translated by Weiran Liu and Shuang Li)
copyright: "2021"
logo: logo_zh_cn.png
execute:
timeout: -1
execute_notebooks: force
latex:
latex_documents:
targetname: cn_book.tex
parse:
myst_enable_extensions:
# don't forget to list any other extensions you want enabled,
# including those that are enabled by default!
- amsmath
- dollarmath
bibtex_bibfiles:
- references.bib
repository:
url: https://github.com/uvm-plaid/programming-dp
branch: master
sphinx:
config:
language: zh_CN

File diff suppressed because one or more lines are too long

BIN
zh_cn/extras/logo_zh_cn.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 10 KiB

19
zh_cn/notebooks/_toc.yml Normal file
View File

@ -0,0 +1,19 @@
format: jb-book
root: cover
chapters:
- file: intro
- file: ch1
- file: ch2
- file: ch3
- file: ch4
- file: ch5
- file: ch6
- file: ch7
- file: ch8
- file: ch9
- file: ch10
- file: ch11
- file: ch12
- file: ch13
- file: ch14
- file: bibliography

Binary file not shown.

Binary file not shown.

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,61 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "d69c3a50",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"# 参考文献"
]
},
{
"cell_type": "markdown",
"id": "16a28d8b",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"```{bibliography}\n",
":style: unsrt\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "723f9996",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

1539
zh_cn/notebooks/ch1.ipynb Normal file

File diff suppressed because one or more lines are too long

696
zh_cn/notebooks/ch10.ipynb Normal file

File diff suppressed because one or more lines are too long

196
zh_cn/notebooks/ch11.ipynb Normal file
View File

@ -0,0 +1,196 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"# 算法设计练习"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## 需要考虑的问题\n",
"\n",
"- 一共需要多少次问询,以及我们可以使用何种组合定理?\n",
" - 可以使用并行组合性吗?\n",
" - 我们应该使用串行组合性,高级组合性,还是差分隐私变体?\n",
"- 我们可以使用稀疏向量技术吗?\n",
"- 我们可以使用指数机制吗?\n",
"- 我们应该如何分配隐私预算?\n",
"- 如果敏感度无上界,该如何限制敏感度的上界?\n",
"- 使用合成数据会带来帮助吗?\n",
"- 后处理性有助于\"降噪\"吗?\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## 1. 更普适的\"采样-聚合\"算法\n",
"\n",
"设计一个变种\"采样-聚合\"算法,使其*不*需要分析者指定问询函数$f$的输出范围。\n",
"\n",
"**思想**:首先,使用稀疏向量技术找到适用于整个数据集的$f(x)$上界和下界。由于$clip(f(x), lower, upper)$是一个敏感度有上界的问询,我们可以在此问询上应用稀疏向量技术。随后,基于得到的上界和下界使用\"采样-聚合\"算法。"
]
},
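{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of this idea appears below. It assumes $f$'s outputs are non-negative, and `find_upper_bound` is an illustrative helper (not part of the book's code); it instantiates AboveThreshold with queries that become 0 once the clipping bound stops affecting the sum:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def find_upper_bound(values, epsilon, bs=range(1, 1000)):\n",
"    # AboveThreshold: noisy threshold at T = 0, noisy queries of sensitivity 1\n",
"    T_hat = np.random.laplace(loc=0, scale=2/epsilon)\n",
"    for b in bs:\n",
"        # q(b) <= 0, and q(b) = 0 once b exceeds the largest value\n",
"        q = np.clip(values, 0, b).sum() - np.clip(values, 0, b+1).sum()\n",
"        if q + np.random.laplace(loc=0, scale=4/epsilon) >= T_hat:\n",
"            return b    # total privacy cost: epsilon, regardless of len(bs)\n",
"    return max(bs)\n",
"```\n",
"\n",
"Running the same helper on the negated values of $f(x)$ gives a lower bound; sample-and-aggregate can then clip each sample's answer to the discovered range."
]
},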
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## 2. 汇总统计\n",
"\n",
"设计一种算法来生成满足差分隐私的下述统计数据:\n",
"\n",
"- 均值:$\\mu = \\frac{1}{n} \\sum_{i=1}^n x_i$\n",
"- 方差:$var = \\frac{1}{n} \\sum_{i=1}^n (x_i - \\mu)^2$\n",
"- 标准差:$\\sigma = \\sqrt{\\frac{1}{n} \\sum_{i=1}^n (x_i - \\mu)^2}$\n",
"\n",
"**思想**\n",
"\n",
"**均值**\n",
"\n",
"1. 使用稀疏向量技术找到裁剪边界的上界和下界\n",
"2. 计算噪声求和值与噪声计数值,再应用后处理性得到均值\n",
"\n",
"**方差**\n",
"\n",
"1. 将方差问询拆分为一个计数问询(并计算得到$\\frac{1}{n}$,我们可以从计算均值问询的过程中得到计数问询的结果)与一个求和问询\n",
"2. $\\sum_{i=1}^n (x_i - \\mu)^2$的敏感度是什么?我们可以裁剪并计算$\\sum_{i=1}^n (x_i - \\mu)^2$然后根据后处理性与1得到的结果相乘\n",
"\n",
"**标准差**\n",
"\n",
"1. 只需计算方差的平方根\n",
"\n",
"涉及到的全部问询:\n",
"- 裁剪下界(稀疏向量技术)\n",
"- 裁剪上界(稀疏向量技术)\n",
"- 噪声求和(均值)\n",
"- 噪声计数\n",
"- 噪声求和(方差)"
]
},
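{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of this pipeline, assuming the clipping bound `b` has already been found (e.g., via the sparse vector technique); `dp_summary` is an illustrative helper built around the book's `laplace_mech`:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def laplace_mech(v, sensitivity, epsilon):\n",
"    return v + np.random.laplace(loc=0, scale=sensitivity/epsilon)\n",
"\n",
"def dp_summary(values, b, epsilon):\n",
"    eps = epsilon / 3                        # sequential composition over 3 queries\n",
"    vals = np.clip(values, 0, b)\n",
"    n = laplace_mech(len(vals), 1, eps)      # noisy count\n",
"    s = laplace_mech(vals.sum(), b, eps)     # noisy sum\n",
"    mean = s / n                             # post-processing\n",
"    # squared deviations lie in [0, b**2], so this sum has sensitivity b**2\n",
"    sq = laplace_mech(np.clip((vals - mean)**2, 0, b**2).sum(), b**2, eps)\n",
"    var = sq / n                             # post-processing\n",
"    return mean, var, np.sqrt(max(var, 0))   # clamp: noise can make var negative\n",
"```"
]
},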
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## 3. 频繁项\n",
"\n",
"谷歌的RAPPOR系统{cite}`rappor`用于统计谷歌浏览器用户最常设置的主页是什么。请设计下述算法:\n",
"\n",
"- 给定基于流量统计得到的10,000个最受欢迎的网页列表\n",
"- 从这10,000个最受欢迎的网页中确定前10个最受欢迎的主页\n",
"\n",
"**思想**使用并行组合性获取加噪后排名前10的主页"
]
},
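{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch (our illustration; `homepages` is assumed to be an array with one response per user, and `candidates` is the list of 10,000 pages):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def top10_pages(homepages, candidates, epsilon):\n",
"    # one noisy count per candidate page; the counts partition the users,\n",
"    # so parallel composition gives a total privacy cost of epsilon\n",
"    noisy = [(p, np.sum(homepages == p) + np.random.laplace(loc=0, scale=1/epsilon))\n",
"             for p in candidates]\n",
"    noisy.sort(key=lambda t: -t[1])          # sorting is post-processing\n",
"    return [p for p, _ in noisy[:10]]\n",
"```"
]
},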
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## 4.分层查询\n",
"\n",
"设计一种为美国人口普查信息生成汇总统计结果的算法。算法应该能按下述从低到高的层次输出相应的人口统计结果:\n",
"\n",
"- 人口普查区\n",
"- 城市 / 乡镇\n",
"- 邮编\n",
"- 国家\n",
"- 洲\n",
"- 美国\n",
"\n",
"**思想**\n",
"\n",
"思想1使用并行组合性*只*计算最低层次(人口普查区)的人口统计结果。将所有区域的计数结果相加,得到各个城市的人口统计结果,以此类推,得到所有统计结果。优势:隐私预算低。\n",
"\n",
"思想2计算所有层次的计数结果分别对每一层使用并行组合性根据真实数据调整预算分配。优势对于较低层我们可以得到更准确的统计结果。\n",
"\n",
"思想32一样但应用后处理性基于更高层的计数结果重新缩放较低层的计数结果将缩放后的浮点数结果截断为整数将负数设置为0。"
]
},
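{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of Idea 1 (our illustration; a DataFrame `df` with `tract` and `city` columns is a hypothetical input):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def hierarchical_counts(df, epsilon):\n",
"    # noisy count per census tract; tracts are disjoint, so parallel\n",
"    # composition gives a total privacy cost of epsilon\n",
"    tract_counts = df.groupby('tract').size().apply(\n",
"        lambda c: c + np.random.laplace(loc=0, scale=1/epsilon))\n",
"    # everything below is post-processing of the private tract counts\n",
"    tract_to_city = df.drop_duplicates('tract').set_index('tract')['city']\n",
"    city_counts = tract_counts.groupby(tract_to_city).sum()\n",
"    us_count = tract_counts.sum()\n",
"    return tract_counts, city_counts, us_count\n",
"```"
]
},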
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## 5. 一系列范围问询\n",
"\n",
"设计一种算法来准确回答一系列*范围问询*。这些范围问询都是针对某一个数据表的问询:\"有多少行数据的值在$a$和$b$之间?\"(即特定取值范围的数据行数)。\n",
"\n",
"### 第1部分\n",
"\n",
"这一系列范围问询是已经预先确定的、数量有限的、形式为: $\\{(a_1, b_1), \\dots, (a_k, b_k)\\}$的问询序列。\n",
"\n",
"### 第2部分\n",
"\n",
"这一系列范围问询序列的长度$k$是预先确定的,但是问询以流方式执行,每一个问询必须在执行时就给出回复。\n",
"\n",
"### 第3部分\n",
"\n",
"范围问询序列可能是无限长的。\n",
"\n",
"**思想**\n",
"\n",
"根据串行组合性,依次执行每一个问询\n",
"\n",
"对于第1部分我们可以引入$L2$敏感度,从而引入高斯机制。当$k$很大时,高斯噪声的应用效果会更好。\n",
"\n",
"或者,我们可以构造合成数据:\n",
"\n",
"- 为每个问询范围$(i, i+1)$计算一个计数值(这样就可以应用并行组合性了)。这就是所谓的合成数据表示法!我们可以将直方图中落在指定问询区间内的所有分段计数结果相加,从而回答无穷多的范围问询。\n",
"- 对于第2部分使用稀疏向量技术\n",
"\n",
"使用稀疏向量技术:对于问询流中的每个问询,查看真实数据回复结果与合成数据回复结果之间的差值。如果差值较大,则问询一次真实数据,得到(应用并行组合性,得到直方图形式的)回复结果,并更新合成数据。否则,给出直接返回合成数据回复结果。这样一来,只有当需要更新合成数据时,我们*才需要*消耗隐私预算。"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

1231
zh_cn/notebooks/ch12.ipynb Normal file

File diff suppressed because one or more lines are too long

772
zh_cn/notebooks/ch13.ipynb Normal file
View File

@ -0,0 +1,772 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": [
"remove-cell"
],
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"plt.style.use('seaborn-whitegrid')\n",
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"adult = pd.read_csv(\"adult_with_pii.csv\")\n",
"def laplace_mech(v, sensitivity, epsilon):\n",
" return v + np.random.laplace(loc=0, scale=sensitivity / epsilon)\n",
"def pct_error(orig, priv):\n",
" return np.abs(orig - priv)/orig * 100.0\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"# 本地差分隐私\n",
"\n",
"```{admonition} 学习目标\n",
"阅读本章后,您将能够:\n",
"- 定义差分隐私的本地模型,并比较本地模型与中心模型的异同\n",
"- 定义和实现随机应答和一元编码机制\n",
"- 描述这些机制的准确性影响,以及本地模型的挑战\n",
"```\n",
"\n",
"截至目前,我们只考虑了差分隐私的*中心模型*Central Model。在中心模型中原始敏感数据被汇总到单个数据集中。在这种场景下我们假定*分析者*是恶意的,但存在一个*可信任的数据管理者*,由它持有数据集并能正确执行分析者指定的差分隐私机制。\n",
"\n",
"这种设定通常是不现实的。在很多情况下,数据管理者和分析者是*同一个人*,且实际上不存在一个可信第三方,能由它持有数据集并执行差分隐私机制。事实上,往往是我们*不*信任的组织来收集我们最敏感的数据。这样的组织显然无法成为可信数据管理者。\n",
"\n",
"中心差分隐私模型的一种替代方案是差分隐私*本地模型*Local Model。在本地模型中数据在离开数据主体控制之前就已经满足差分隐私。例如在将数据发送给数据管理者之前用户就在自己的设备上为自己的数据添加噪声。在本地模型中数据管理者不需要是可信的因为他们收集的是已经满足差分隐私的数据。\n",
"\n",
"因此,相比于中心模型,本地模型有着巨大的优势:数据主体不需要相信除他们自己以外的任何人。这一优势使得本地模型在实际系统中有着广泛的应用,包括[谷歌](https://github.com/google/rappor)和[苹果](https://www.apple.com/privacy/docs/Differential_Privacy_Overview.pdf)都部署了基于本地模型的差分隐私应用。\n",
"\n",
"不幸的是,本地模型也有明显的缺点:在相同的隐私消耗量下,对于相同的问询,本地模型问询结果的准确性通常比中心模型*低几个数量级*。这种巨大的准确性损失意味着只有较少类型的问询适用于本地差分隐私。即便如此,只有当数据量较大(即参与者数量较多时)时,差分隐私本地模型分析结果的准确率才可以满足实际要求。\n",
"\n",
"本章,我们将学习两种本地差分隐私机制。第一种是*随机应答*Randomized Response第二种是*一元编码*Unary Encoding。"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## 随机应答\n",
"\n",
"[随机应答](https://en.wikipedia.org/wiki/Randomized_response) {cite}`warner1965`是一种本地差分隐私机制,[S. L. Warner](https://www.jstor.org/stable/2283137?seq=1#metadata_info_tab_contents)在其1965年的论文中首次提出了这一机制。当时该技术提出的目的是允许用户可以用错误的回复来应答调研中的敏感问题且学者们当初也没有意识到这是一种差分隐私机制此后40年内学者们都尚未提出差分隐私的概念。在提出差分隐私的概念后统计学家们才意识到随机应答技术*已经*满足了差分隐私的定义。\n",
"\n",
"Dwork和Roth提出了一种随机应答变种机制。在此机制中数据主体按下述方法用\"是\"或\"不是\"来回答一个问题:\n",
"\n",
"1. 掷一枚硬币\n",
"2. 如果硬币正面向上,如实回答问题\n",
"3. 如果硬币反面向上,再掷一枚硬币\n",
"4. 如果第二枚硬币也是正面向上,回答\"是\";否则,回答\"否\"\n",
"\n",
"该算法的随机性来自两次硬币的抛掷结果。正如其他差分隐私算法一样,硬币抛掷结果的随机性为真实结果引入了不确定性,而这种不确定性正是差分隐私机制可以提供隐私保护的根本原因。\n",
"\n",
"事实证明,该随机应答算法满足$\\epsilon$-差分隐私,其中$\\epsilon = \\log(3) = 1.09$。\n",
"\n",
"让我们来实现这个算法,并用其回答一个简单的\"是或否\"问题:\"你的职业是'销售'吗?\"我们可以在Python中使用`np.random.randint(0, 2)`函数模拟硬币抛掷过程。此函数的输出仅可能是0或1。"
]
},
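{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check on the claim that $\\epsilon = \\log(3)$ (our derivation, straight from the definition): a true salesperson answers \"yes\" with probability $\\frac{1}{2} + \\frac{1}{4} = \\frac{3}{4}$ (the truthful half, plus a lucky second flip), while a non-salesperson answers \"yes\" with probability $\\frac{1}{4}$. The worst-case ratio of output probabilities is therefore:\n",
"\n",
"\\begin{align}\n",
"\\frac{\\mathsf{Pr}[\\text{\"yes\"} \\mid \\text{salesperson}]}{\\mathsf{Pr}[\\text{\"yes\"} \\mid \\text{not salesperson}]} = \\frac{3/4}{1/4} = 3 = e^{\\log(3)}\n",
"\\end{align}"
]
},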
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"def rand_resp_sales(response):\n",
" truthful_response = response == 'Sales'\n",
" \n",
" # 第一次抛掷硬币\n",
" if np.random.randint(0, 2) == 0:\n",
" # 如实回答\n",
" return truthful_response\n",
" else:\n",
" # (用第二次硬币抛掷结果)随机应答\n",
" return np.random.randint(0, 2) == 0"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"让我们来询问200名从事销售工作的人请他们使用随机应答算法回答此问题看看结果如何。"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": "True 152\nFalse 48\ndtype: int64"
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.Series([rand_resp_sales('Sales') for i in range(200)]).value_counts()"
]
},
{
"cell_type": "markdown",
"source": [
"可以看到,我们可以得到答案为\"是\"和\"否\"的人数,但\"是\"的数量远多于\"否\"的数量。与我们学过的算法类似,此输出结果也展示出了差分隐私算法的两个性质:算法引入一定的不确定性来实现隐私保护,但算法的输出结果仍然释放出足够的信号,帮助我们推断出人口相关信息。\n",
"\n",
"让我们试试在实际数据上做同样的实验。我们从一直使用的美国人口数据集中获取所有个体的职业信息。我们要问询的问题是\"你的职业是'销售'吗?\",并对每个职业的回复结果进行编码。在实际部署的系统中,我们不会集中收集真实数据。相对地,每个回复者会在本地执行`rand_resp_sales`(随机应答销售职业)函数,并把随机应答结果提交给数据管理者。在实验中,我们在现有的数据集上执行`rand_resp_sales`函数。"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"responses = [rand_resp_sales(r) for r in adult['Occupation']]"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": "False 22553\nTrue 10008\ndtype: int64"
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.Series(responses).value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"这次,我们得到的\"否\"数量比\"是\"数量更多。稍加思考,就会发现这是一个合理的统计结果,因为数据集中大多数参与者的职位都不是销售。\n",
"\n",
"现在的关键问题是:我们如何根据这些回复,估计出数据集中销售人员的*真实*人数呢?\"是\"的数量并不能很好地估计销售人员数量:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": "3650"
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(adult[adult['Occupation'] == 'Sales'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"这并不奇怪,因为很多\"是\"都来自于算法中的随机硬币抛掷结果。\n",
"\n",
"为了估计销售人员的正确人数,我们需要分析随机应答算法的随机性,估计出有多少\"是\"来自实际销售人员,以及有多少\"是\"来自随机硬币抛掷结果。我们知道:\n",
"\n",
"- 每个响应者随机回复的概率为$\\frac{1}{2}$\n",
"- 每个随机回复中\"是\"的概率为$\\frac{1}{2}$\n",
"\n",
"因此,响应者随机回复(而不是因为他们真的是销售人员才回复)\"是\"的概率为$\\frac{1}{2} \\cdot \\frac{1}{2} = \\frac{1}{4}$。这意味着我们得到的回复中有四分之一是假的\"是\"。"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"responses = [rand_resp_sales(r) for r in adult['Occupation']]\n",
"\n",
"# 我们估计出有1/4的\"是\"回复完全来自于硬币的随机抛掷结果\n",
"# 这些都是假的\"是\"\n",
"fake_yeses = len(responses)/4\n",
"\n",
"# 回复为\"是\"的总人数\n",
"num_yeses = np.sum([1 if r else 0 for r in responses])\n",
"\n",
"# 真实\"是\"的人数等于回复为\"是\"的总人数减去假\"是\"的人数\n",
"true_yeses = num_yeses - fake_yeses"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"另一个我们需要考虑的因素是,虽然有一半受访者是随机应答的,但*在这些随机应答的响应者中,部分响应者实际上可能也是销售人员*。随机应答响应者中有多少是销售人员呢?我们得不到相关数据,因为他们的应答是完全随机的!\n",
"\n",
"但是,因为我们(根据第一次硬币抛掷结果)把受访者随机分为了\"真实\"和\"随机\"两组,我们期望两组的销售人员数量基本一致。因此,如果我们能估计出\"真实\"组的销售人员数量,那么我们可以将该人数翻倍,进而得到销售人员总数。"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": "3747.5"
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 用true_yesses估计\"真实\"组中回答\"是\"的人数\n",
"# 我们把人数翻倍,估计出回复为\"是\"的总人数\n",
"rr_result = true_yeses*2\n",
"rr_result"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"得到的人数和销售人员的真实人数有多接近呢?让我们来比较一下!"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": "3650"
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"true_result = np.sum(adult['Occupation'] == 'Sales')\n",
"true_result"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": "2.671232876712329"
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pct_error(true_result, rr_result)"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"当总人数相对比较大时例如本例的总人数超过了3000我们通常可以使用此方法得到一个错误率\"可接受\"的统计结果。此例子中的错误率低于5%。如果我们的目标是统计最受欢迎的职位,这个方法可以帮助我们得到较为准确的结果。然而,统计结果的错误率会随着总人数的降低而快速增大。\n",
"\n",
"此外,随机应答的准确率和中心模型拉普拉斯机制的准确率相比要差出*几个数量级*。让我们使用此例子比较一下这两种机制:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": "0.003038123413791429"
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pct_error(true_result, laplace_mech(true_result, 1, 1))"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"即使我们中心模型中的$\\epsilon$值略低于随机应答的$\\epsilon$中心模型的误差也仅约为0.01%,远小于本地模型。\n",
"\n",
"确实*存在*效果更好的本地模型算法。然而,本地模型存在天生的限制条件:必须在提交数据前增加噪声。这意味着本地模型算法的准确率*总是*比最好的中心模型算法准确率低。"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## 一元编码"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"随机应答允许我们基于本地差分隐私回答\"是或否\"的问题。如何实现直方图问询呢?\n",
"\n",
"学者们已经提出了多种不同的算法,来解决本地差分隐私的直方图问询问题。[Wang等人](https://arxiv.org/abs/1705.04421) {cite}`wang2017`在2017年的论文中总结了一些优化方法。这里我们介绍其中最简单的一个方法*一元编码*。该方法是[谷歌RAPPOR系统](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42852.pdf) {cite}`rappor`的基础算法谷歌RAPPOR系统对基础算法作了大量的修改使算法支持更大的标签数量、支持随时间推移的多次应答。\n",
"\n",
"我们首先需要定义应答域,即直方图包含的标签。下述例子中,我们想要知道各个职业的从业者人数,因此应答域是所有职位所构成的集合。"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": "array(['Adm-clerical', 'Exec-managerial', 'Handlers-cleaners',\n 'Prof-specialty', 'Other-service', 'Sales', 'Craft-repair',\n 'Transport-moving', 'Farming-fishing', 'Machine-op-inspct',\n 'Tech-support', 'Protective-serv', 'Armed-Forces',\n 'Priv-house-serv'], dtype=object)"
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"domain = adult['Occupation'].dropna().unique()\n",
"domain"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"我们将定义三个函数,这三个函数共同实现了一元编码机制:\n",
"\n",
"1. `encode`(编码),编码应答值\n",
"2. `perturb`(扰动),扰动编码后的应答值\n",
"3. `aggregate`(聚合),根据扰动应答值重构最终结果\n",
"\n",
"该技术的名称来源于所用的编码方法:如果应答域大小为$k$,我们将每个应答值编码为长度为$k$的比特向量。除了应答者的职位所对应的比特值为1以外所有其他位置的编码均为0。机器学习领域称这种表示方法\"独热编码\"One-hot Encoding。\n",
"\n",
"举例来说,'销售'是应答域中的第6个元素因此'销售'职位的编码是第6个比特为1、其余比特值均为0的向量。"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": "[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]"
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def encode(response):\n",
" return [1 if d == response else 0 for d in domain]\n",
"\n",
"encode('Sales')"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"我们接下来要用`perturb`函数翻转应答向量中的各个比特值,从而满足差分隐私。翻转一个比特值的概率由$p$和$q$这两个参数共同决定。这两个参数也决定了隐私参数$\\epsilon$的值(我们稍后将看到具体的计算公式)。\n",
"\n",
"$$ \\mathsf{Pr}[B'[i] = 1] = \\left\\{\n",
"\\begin{array}{ll}\n",
" p\\;\\;\\;\\text{if}\\;B[i] = 1 \\\\\n",
" q\\;\\;\\;\\text{if}\\;B[i] = 0\\\\\n",
"\\end{array} \n",
"\\right. $$"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": "[0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1]"
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def perturb(encoded_response):\n",
" return [perturb_bit(b) for b in encoded_response]\n",
"\n",
"def perturb_bit(bit):\n",
" p = .75\n",
" q = .25\n",
"\n",
" sample = np.random.random()\n",
" if bit == 1:\n",
" if sample <= p:\n",
" return 1\n",
" else:\n",
" return 0\n",
" elif bit == 0:\n",
" if sample <= q:\n",
" return 1\n",
" else: \n",
" return 0\n",
"\n",
"perturb(encode('Sales'))"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"我们可以根据$p$和$q$计算出隐私参数$\\epsilon$。如果$p=.75$$q=.25$,则计算得到的$\\epsilon$略高于2。\n",
"\n",
"\\begin{align}\n",
"\\epsilon = \\log{\\left(\\frac{p (1-q)}{(1-p) q}\\right)}\n",
"\\end{align}"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": "2.1972245773362196"
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def unary_epsilon(p, q):\n",
" return np.log((p*(1-q)) / ((1-p)*q))\n",
"\n",
"unary_epsilon(.75, .25)"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"最后一步是聚合。如果我们没有对应答值进行过任何扰动,我们可以简单地对所有得到的应答向量逐比特相加,得到应答域中每个元素的计数结果:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": "[('Adm-clerical', 3770),\n ('Exec-managerial', 4066),\n ('Handlers-cleaners', 1370),\n ('Prof-specialty', 4140),\n ('Other-service', 3295),\n ('Sales', 3650),\n ('Craft-repair', 4099),\n ('Transport-moving', 1597),\n ('Farming-fishing', 994),\n ('Machine-op-inspct', 2002),\n ('Tech-support', 928),\n ('Protective-serv', 649),\n ('Armed-Forces', 9),\n ('Priv-house-serv', 149)]"
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"counts = np.sum([encode(r) for r in adult['Occupation']], axis=0)\n",
"list(zip(domain, counts))"
]
},
{
"cell_type": "markdown",
"source": [
"但是,正如我们在随机应答中所看到的,翻转比特值产生的\"假\"应答值将使我们得到难以解释的统计结果。如果我们把扰动后的应答向量逐比特相加,得到的所有计数结果都是错误的:"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 208,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": [
"[('Adm-clerical', 10042),\n",
" ('Exec-managerial', 10204),\n",
" ('Handlers-cleaners', 9006),\n",
" ('Prof-specialty', 10238),\n",
" ('Other-service', 9635),\n",
" ('Sales', 9844),\n",
" ('Craft-repair', 10233),\n",
" ('Transport-moving', 8863),\n",
" ('Farming-fishing', 8721),\n",
" ('Machine-op-inspct', 9122),\n",
" ('Tech-support', 8753),\n",
" ('Protective-serv', 8523),\n",
" ('Armed-Forces', 8157),\n",
" ('Priv-house-serv', 8042)]"
]
},
"execution_count": 208,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"counts = np.sum([perturb(encode(r)) for r in adult['Occupation']], axis=0)\n",
"list(zip(domain, counts))"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"一元编码算法的聚合步骤需要考虑每个标签的\"假\"应答数量。此步骤以$p$、$q$,以及应答数量$n$为输入,得到聚合结果:\n",
"\n",
"\\begin{align}\n",
"A[i] = \\frac{\\sum_j B'_j[i] - n q}{p - q}\n",
"\\end{align}"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"def aggregate(responses):\n",
" p = .75\n",
" q = .25\n",
" \n",
" sums = np.sum(responses, axis=0)\n",
" n = len(responses)\n",
" \n",
" return [(v - n*q) / (p-q) for v in sums] "
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": "[('Adm-clerical', 3609.5),\n ('Exec-managerial', 3969.5),\n ('Handlers-cleaners', 1425.5),\n ('Prof-specialty', 4087.5),\n ('Other-service', 3391.5),\n ('Sales', 3895.5),\n ('Craft-repair', 3961.5),\n ('Transport-moving', 1441.5),\n ('Farming-fishing', 1013.5),\n ('Machine-op-inspct', 2099.5),\n ('Tech-support', 797.5),\n ('Protective-serv', 639.5),\n ('Armed-Forces', -14.5),\n ('Priv-house-serv', -60.5)]"
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"responses = [perturb(encode(r)) for r in adult['Occupation']]\n",
"counts = aggregate(responses)\n",
"list(zip(domain, counts))"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"正如我们在随机应答中所看到的,一元编码机制得到的统计结果也比较准确,我们可以得到应答域中各个标签的粗略排序结果(至少可以统计出最受欢迎的职位是什么)。即便如此,一元编码机制的准确率要比中心模型拉普拉斯机制的准确率低几个数量级。\n",
"\n",
"学者们已经提出了其他在本地模型下实现直方图问询的方法。之前链接给出的[论文](https://arxiv.org/abs/1705.04421)具体介绍了这些方法。这些方法可以在一定程度上提高准确率,但这些方法都必须保证本地模型下*每个样本需独立*满足差分隐私。这一基本限制条件使得即便使用最复杂的技术,本地模型机制的准确率也无法达到中心模型机制的准确率。"
]
}
],
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.10"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

1108
zh_cn/notebooks/ch14.ipynb Normal file

File diff suppressed because one or more lines are too long

1024
zh_cn/notebooks/ch2.ipynb Normal file

File diff suppressed because one or more lines are too long

316
zh_cn/notebooks/ch3.ipynb Normal file
View File

@ -0,0 +1,316 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"# 差分隐私\n",
"\n",
"\n",
"阅读本章后,您将能够:\n",
"\n",
"- 定义差分隐私\n",
"- 解释$\\epsilon$这一重要的隐私参数\n",
"- 应用拉普拉斯机制实现满足差分隐私的计数问询\n",
"\n",
"\n",
"与$k$-匿名性类似,*差分隐私*Differential Privacy {cite}`dwork2006A,dwork2006B`也是一个用数学语言描述的隐私定义(即可以用数学方法证明发布数据满足此性质)。然而,与$k$-匿名性不同,差分隐私不是*数据*所具有的属性,而是*算法*所具有的属性。也就是说,我们可以证明一个*算法*满足差分隐私。如果想证明一个*数据集*满足差分隐私,我们需要证明的是产生此数据集的算法满足差分隐私。\n",
"\n",
"\n",
"一般将满足差分隐私的函数称为一个*机制*Mechanism。如果对于所有*临近数据集*Neighboring Dataset$x$和$x'$和所有可能的输出$S$,机制$F$均满足\n",
"\n",
"\\begin{equation}\n",
"\\frac{\\mathsf{Pr}[F(x) = S]}{\\mathsf{Pr}[F(x') = S]} \\leq e^\\epsilon\n",
"\\end{equation}\n",
"\n",
"则称机制$F$满足差分隐私。\n",
"\n",
"\n",
"如果两个数据集中只有一个个体的数据项不同,则认为这两个数据集是临近数据集。请注意,$F$一般是一个*随机*函数。也就是说,即使给定相同的输入,$F$一般也包含多个可能的输出。因此,$F$输出的概率分布一般不是一个点分布。\n",
"\n",
"这个定义所蕴含的一个重要含义是,无论输入*是否包含*任意特定个体的数据,$F$的输出总是几乎相同的。换句话说,$F$所引入的随机性应该足够大,使得观察$F$的输出无法判断输入是$x$还是$x'$。假设我的数据在$x$中,但不在$x'$中。如果攻击者无法确定$F$的输入是$x$还是$x'$,则攻击者甚至无法判断输入是否包含我的数据,更无法判断出我的数据是什么了。\n",
"\n",
"一般将差分隐私定义中的参数$\\epsilon$称为*隐私参数*Privacy Parameter或*隐私预算*Privacy Budget。$\\epsilon$提供了一个旋钮,用来调整差分隐私定义所能提供的\"隐私量\"。$\\epsilon$较小时,意味着$F$需要为相似的输入提供*非常*相似的输出,因此提供更高等级的隐私性。较大的$\\epsilon$允许$F$给出不那么相似的输出,因此提供更少的隐私性。\n",
"\n",
"我们在实际中应该如何设置$\\epsilon$,差分隐私才能提供足够的隐私性呢?没人知道这个问题的答案。一般的共识是:将$\\epsilon$设置为约等于1或者更小的值大于10的$\\epsilon$取值意味着大概率无法提供足够的隐私性。但实际上,这个经验法则下的$\\epsilon$取值过于保守了。我们后续将会进一步展开讨论这个问题。"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## 拉普拉斯机制"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"差分隐私一般用于回复特定的问询。我们来考虑一个针对人口普查数据的问询。我们首先不使用差分隐私。"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"pycharm": {
"name": "#%%\n"
},
"tags": [
"remove-cell"
]
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"plt.style.use('seaborn-whitegrid')\n",
"adult = pd.read_csv(\"adult_with_pii.csv\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"\"数据集中有多少个体的年龄大于等于40岁\""
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": [
"14237"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"adult[adult['Age'] >= 40].shape[0]"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"使这个问询满足差分隐私的最简单方法是:在回复结果上增加随机噪声。这里的关键挑战是:既需要增加足够大的噪声,使问询满足差分隐私,但噪声又不能加得太多,否则问询结果就无意义了。为了简化这一过程,差分隐私领域的学者提出了一些基础*机制*。这些基础机制具体描述了应该增加何种类型的噪声,以及噪声量应该有多大。最典型的基础机制是*拉普拉斯机制*Laplace Mechanism {cite}`dwork2006B`。\n",
"\n",
"\n",
"根据拉普拉斯机制,对于可以输出一个数值型结果的函数$f(x)$,按下述方法定义的$F(x)$满足$\\epsilon$-差分隐私:\n",
"\n",
"\\begin{equation}\n",
"F(x) = f(x) + \\textsf{Lap}\\left(\\frac{s}{\\epsilon}\\right)\n",
"\\end{equation}\n",
"\n",
"其中$s$是$f$的*敏感度*Sensitivity$\\textsf{Lap}(S)$表示以均值为0、放缩系数为$S$的拉普拉斯分布采样。\n",
"\n",
"\n",
"函数$f$的*敏感度*是指,当输入由数据集$x$变化为临近数据集$x'$后,$f$的输出变化量。计算函数$f$的敏感度是一个非常复杂的问题,也是设计差分隐私算法时所需面临的核心问题,我们稍后会更进一步展开讨论。我们现在只需要指出,*计数问询*Counting Query的敏感度总为1当问询数据集中满足特定属性的数据量时如果我们只修改数据集中的一个数据项则问询的输出变化量最多为1。\n",
"\n",
"因此,我们可以根据我们所选择的$\\epsilon$在计数问询中使用敏感度等于1的拉普拉斯机制从而使我们的样例问询满足差分隐私性。现在我们取$\\epsilon = 0.1$。我们可以用Numpy的`random.laplace`函数实现拉普拉斯分布采样。"
]
},
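{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since this pattern recurs throughout the book, it helps to wrap it in a small helper, matching the `laplace_mech` function defined in the setup cells of later chapters:\n",
"\n",
"```python\n",
"def laplace_mech(v, sensitivity, epsilon):\n",
"    # add Laplace noise with scale sensitivity/epsilon to the true answer v\n",
"    return v + np.random.laplace(loc=0, scale=sensitivity/epsilon)\n",
"\n",
"laplace_mech(adult[adult['Age'] >= 40].shape[0], 1, 0.1)\n",
"```"
]
},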
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": [
"14182.011715073944"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sensitivity = 1\n",
"epsilon = 0.1\n",
"\n",
"adult[adult['Age'] >= 40].shape[0] + np.random.laplace(loc=0, scale=sensitivity/epsilon)"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"可以试着多次运行此代码查看噪声对问询结果造成的影响。虽然每次代码的输出结果都会发生变化但在大多数情况下输出的结果都与真实结果14,235很接近输出结果的可用性相对较高。"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## 需要多大的噪声?"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"我们如何知道拉普拉斯机制是否已经增加了足够的噪声,可以阻止攻击者对数据集中的个体实施重标识攻击?我们可以先尝试自己来实施攻击!我们构造一个恶意的计数问询,专门用于确定凯莉·特鲁斯洛夫的收入是否大于\\$50k。"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"karries_row = adult[adult['Name'] == 'Karrie Trusslove']\n",
"karries_row[karries_row['Target'] == '<=50K'].shape[0]"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"此回复结果给出了凯莉所在数据行的收入值,显然侵犯了凯莉的隐私。由于我们知道如何应用拉普拉斯机制使计数问询满足差分隐私,我们可以这样回复问询:"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": [
"2.198682025336349"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sensitivity = 1\n",
"epsilon = 0.1\n",
"\n",
"karries_row = adult[adult['Name'] == 'Karrie Trusslove']\n",
"karries_row[karries_row['Target'] == '<=50K'].shape[0] + \\\n",
" np.random.laplace(loc=0, scale=sensitivity/epsilon)"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"真实结果是0还是1呢因为增加的噪声比较大我们已经无法可靠地判断真实结果是什么了。这就是差分隐私要实现的目的哪怕可以判定出此问询是恶意的我们也不会*拒绝*回复问询。相反,我们会增加足够大的噪声,使恶意问询的回复结果对攻击者来说变得毫无用处。"
]
}
],
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}

688
zh_cn/notebooks/ch4.ipynb Normal file

File diff suppressed because one or more lines are too long

641
zh_cn/notebooks/ch5.ipynb Normal file

File diff suppressed because one or more lines are too long

454
zh_cn/notebooks/ch6.ipynb Normal file

File diff suppressed because one or more lines are too long

717
zh_cn/notebooks/ch7.ipynb Normal file

File diff suppressed because one or more lines are too long

475
zh_cn/notebooks/ch8.ipynb Normal file

File diff suppressed because one or more lines are too long

334
zh_cn/notebooks/ch9.ipynb Normal file
View File

@ -0,0 +1,334 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": [
"remove-cell"
],
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"plt.style.use('seaborn-whitegrid')\n",
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"adult = pd.read_csv(\"adult_with_pii.csv\")\n",
"def laplace_mech(v, sensitivity, epsilon):\n",
" return v + np.random.laplace(loc=0, scale=sensitivity / epsilon)\n",
"def pct_error(orig, priv):\n",
" return np.abs(orig - priv)/orig * 100.0\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"# 指数机制\n",
"\n",
"```{admonition} 学习目标\n",
"阅读本章后,您将能够:\n",
"- 定义、实现并应用指数机制和报告噪声最大值机制\n",
"- 描述实际中应用指数机制所面临的挑战\n",
"- 描述指数机制和报告噪声最大值机制的优势\n",
"```\n",
"\n",
"截至目前我们已学习的基本机制拉普拉斯机制和高斯机制针对的都是数值型回复只需直接在回复的数值结果上增加噪声即可。如果我们想返回一个准确结果即不能直接在结果上增加噪声同时还要保证回复过程满足差分隐私该怎么办呢一种解决方法是使用指数机制Exponential Mechanism{cite}`mcsherry2007`。此机制可以从备选回复集合中选出\"最佳\"回复的同时,保证回复过程满足差分隐私。分析者需要定义一个备选回复集合。同时,分析者需要指定一个*评分函数*Scoring Function此评分函数输出备选回复集合中每个回复的分数。分数最高的回复就是最佳回复。指数机制通过返回分数*近似*最大的回复来实现差分隐私保护。换言之,为了使回复过程满足差分隐私,指数机制返回结果所对应的分数可能*不是*备选回复集合中分数最高的那个结果。\n",
"\n",
"指数机制满足$\\epsilon$-差分隐私:\n",
"\n",
"1. 分析者选择一个备选回复集合$\\mathcal{R}$\n",
"2. 分析者指定一个全局敏感度为$\\Delta u$的评分函数$u : \\mathcal{D} \\times \\mathcal{R} \\rightarrow \\mathbb{R}$\n",
"3. 指数机制输出$r \\in \\mathcal{R}$,各个回复的输出概率与下述表达式成正比:\n",
"\n",
"\\begin{align}\n",
"\\exp \\Big(\\frac{\\epsilon u(x, r)}{2 \\Delta u} \\Big)\n",
"\\end{align}\n",
"\n",
"和我们之前学习过的机制(如拉普拉斯机制)相比,指数机制最大的不同点在于其*总会*输出集合$\\mathcal{R}$中的一个元素。当必须从一个有限集合中选择输出结果,或不能直接在结果上增加噪声时,指数机制就会变得非常有用。例如,假设我们要为一个大型会议敲定一个日期。为此,我们获得了每个参会者的日程表。我们想选择一个与尽可能少的参会者有时间冲突的日期来举办会议,同时想通过差分隐私为所有参会者的日程信息提供隐私保护。在这个场景下,在举办日期上增加噪声没有太大意义,增加噪声可能会使日期从星期五变成星期六,使冲突参会者的数量显著增加。应用指数机制就可以完美解决此类问题:既*不需要在日期上增加噪声*,又可以实现差分隐私。\n",
"\n",
"指数机制的有趣之处在于:\n",
"\n",
"- 无论$\\mathcal{R}$中包含多少个备选输出,指数机制的隐私消耗量仍然为$\\epsilon$。我们后续将详细讨论这一点。\n",
"- 无论$\\mathcal{R}$是有限集合还是无限集合,均可应用指数机制。但如果$\\mathcal{R}$是无限集合,则我们会面临一个非常有挑战的问题:如何构造一个实际可用的实现方法,其可以遵循适当的概率分布从无限集合中采样得到输出结果。\n",
"- 指数机制代表了$\\epsilon$-差分隐私的\"基本机制\":通过选择适当的评分函数$u$,所有其他的$\\epsilon$-差分隐私机制都可以用指数机制定义。"
]
},
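  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is the promised sketch of the meeting-date scenario. The calendar data below is invented for illustration: `conflicts[d]` counts the participants who are busy on date `d`. The score of a date is the negated number of conflicts, and since one participant's calendar changes each count by at most 1, the sensitivity of the score is 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical conflict counts for each candidate date\n",
    "dates = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri']\n",
    "conflicts = {'Mon': 4, 'Tue': 9, 'Wed': 2, 'Thu': 7, 'Fri': 5}\n",
    "\n",
    "epsilon = 1\n",
    "sensitivity = 1\n",
    "\n",
    "# Exponential mechanism over the finite set of dates:\n",
    "# output probability proportional to exp(epsilon * score / (2 * sensitivity))\n",
    "scores = np.array([-conflicts[d] for d in dates])\n",
    "probabilities = np.exp(epsilon * scores / (2 * sensitivity))\n",
    "probabilities = probabilities / probabilities.sum()\n",
    "\n",
    "np.random.choice(dates, p=probabilities)"
   ]
  },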
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## 有限集合的指数机制"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": [
"10.683"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"options = adult['Marital Status'].unique()\n",
"\n",
"def score(data, option):\n",
" return data.value_counts()[option]/1000\n",
"\n",
"score(adult['Marital Status'], 'Never-married')"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'Married-civ-spouse'"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def exponential(x, R, u, sensitivity, epsilon):\n",
" # 计算R中每个回复的分数\n",
" scores = [u(x, r) for r in R]\n",
" \n",
" # 根据分数计算每个回复的输出概率\n",
" probabilities = [np.exp(epsilon * score / (2 * sensitivity)) for score in scores]\n",
" \n",
" # 对概率进行归一化处理使概率和等于1\n",
" probabilities = probabilities / np.linalg.norm(probabilities, ord=1)\n",
"\n",
" # 根据概率分布选择回复结果\n",
" return np.random.choice(R, 1, p=probabilities)[0]\n",
"\n",
"exponential(adult['Marital Status'], options, score, 1, 1)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Married-civ-spouse 179\n",
"Never-married 21\n",
"dtype: int64"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"r = [exponential(adult['Marital Status'], options, score, 1, 1) for i in range(200)]\n",
"pd.Series(r).value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## 报告噪声最大值\n",
"\n",
"我们能用拉普拉斯机制实现指数机制吗?当$\\mathcal{R}$为有限集合时,指数机制的基本思想是使从集合中选择元素的过程满足差分隐私。我们可以应用拉普拉斯机制给出此基本思想的一种朴素实现方法:\n",
"\n",
"1. 对于每个$r \\in \\mathcal{R}$,计算*噪声分数*$u(x, r) + \\mathsf{Lap}\\left(\\frac{\\Delta u}{\\epsilon}\\right)$\n",
"2. 输出噪声分数最大的元素$r \\in \\mathcal{R}$\n",
"\n",
"因为评分函数$u$在$x$下的敏感度为$\\Delta u$所以步骤1中的每次\"问询\"都满足$\\epsilon$-差分隐私。因此,如果$\\mathcal{R}$包含$n$个元素,根据串行组合性,上述算法满足$n\\epsilon$-差分隐私。\n",
"\n",
"然而,如果我们使用指数机制,则总隐私消耗量将只有$\\epsilon$!为什么指数机制效果如此之好?原因是指数机制*泄露的信息更少*。\n",
"\n",
"对于上述定义的拉普拉斯机制实现方法我们的隐私消耗量分析过程是非常严苛的。实际上步骤1中计算整个集合噪声分数的过程满足$n\\epsilon$-差分隐私因此我们可以发布得到的所有噪声分数。我们应用后处理性得到步骤2的输出满足$n\\epsilon$-差分隐私。\n",
"\n",
"与之相比,指数机制*仅*发布最大噪声分数所对应的元素,但不发布最大噪声分数本身,也不会发布其他元素的噪声分数。\n",
"\n",
"上述定义的算法通常被称为*报告噪声最大值*Report Noisy Max算法。实际上因为此算法只发布最大噪声分数所对应的回复所以无论集合$\\mathcal{R}$包含多少个备选回复,此算法都满足$\\epsilon$-差分隐私。可以在[Dwork和Roth论文](https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf) {cite}`dwork2014`的断言3.9中找到相应的证明。\n",
"\n",
"输出噪声最大值算法的实现方法非常简单,而且很容易看出,此算法得到的回复结果与之前我们实现的有限集合指数机制非常相似。"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'Married-civ-spouse'"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def report_noisy_max(x, R, u, sensitivity, epsilon):\n",
" # 计算R中每个回复的分数\n",
" scores = [u(x, r) for r in R]\n",
"\n",
" # 为每个分数增加噪声\n",
" noisy_scores = [laplace_mech(score, sensitivity, epsilon) for score in scores]\n",
"\n",
" # 找到最大分数对应的回复索引号\n",
" max_idx = np.argmax(noisy_scores)\n",
" \n",
" # 返回此索引号对应的回复\n",
" return R[max_idx]\n",
"\n",
"report_noisy_max(adult['Marital Status'], options, score, 1, 1)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"scrolled": true,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Married-civ-spouse 192\n",
"Never-married 8\n",
"dtype: int64"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"r = [report_noisy_max(adult['Marital Status'], options, score, 1, 1) for i in range(200)]\n",
"pd.Series(r).value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"因此,当$\\mathcal{R}$为有限集合时,可以用报告噪声最大值机制代替指数机制。但如果$\\mathcal{R}$为无限集合呢?我们无法简单地为无限集合中每一个元素对应的分数增加拉普拉斯噪声。当$\\mathcal{R}$为无限集合时,我们不得不使用指数机制。\n",
"\n",
"然而,在实际应用中,在无限集合上应用指数机制通常是极具挑战的,甚至是不可能的。尽管可以很容易写出无限集合下指数机制定义的概率密度函数,但一般来说对应的高效采样算法是不存在的。因此,很多理论论文会应用指数机制证明\"存在\"满足某些特定性质的差分隐私算法,但多数算法在实际中都是不可用的。"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## 指数机制是差分隐私的基本机制\n",
"\n",
"我们已经知道,无法使用拉普拉斯机制与串行组合性来实现指数机制,这是因为当使用拉普拉斯机制与串行组合性时,我们可以得到差分隐私保护的所有噪声分数,但我们想实现的差分隐私算法不需要发布这些噪声分数。那么,反过来又如何呢?我们可以应用指数机制实现拉普拉斯机制吗?事实证明,这是可以做到的!\n",
"\n",
"考虑一个敏感度为$\\Delta q$的问询函数$q(x) : \\mathcal{D} \\rightarrow \\mathbb{R}$。我们可以在真实回复值上增加拉普拉斯噪声$F(x) = q(x) + \\mathsf{Lap}(\\Delta q / \\epsilon)$,以得到满足$\\epsilon$-差分隐私的回复结果。差分隐私回复$q$的概率密度函数为:\n",
"\n",
"\\begin{align}\n",
"\\mathsf{Pr}[F(x) = r] =& \\frac{1}{2b} \\exp\\Big(- \\frac{\\lvert r - \\mu \\rvert}{b}\\Big)\\\\\n",
"=& \\frac{\\epsilon}{2 \\Delta q} \\exp\\Big(- \\frac{\\epsilon \\lvert r - q(x) \\rvert}{\\Delta q}\\Big)\n",
"\\end{align}\n",
"\n",
"考虑一下,当我们将指数机制的评分函数设置为$u(x, r) = -2 \\lvert q(x) - r \\rvert$时会发生什么?指数机制的定义告诉我们,每个回复值的采样概率应该与下述表达式成正比:\n",
"\n",
"\\begin{align}\n",
"\\mathsf{Pr}[F(x) = r] =&\\; \\exp \\Big(\\frac{\\epsilon u(x, r)}{2 \\Delta u} \\Big)\\\\\n",
"&= \\exp \\Big(\\frac{\\epsilon (-2 \\lvert q(x) - r \\rvert)}{2 \\Delta q} \\Big)\\\\\n",
"&= \\exp \\Big(- \\frac{\\epsilon \\lvert r - q(x) \\rvert}{\\Delta q} \\Big)\\\\\n",
"\\end{align}\n",
"\n",
"因此,可以应用指数机制实现拉普拉斯机制,并得到相同的概率分布(两个概率分布可能会相差一个常数因子,这是因为指数机制的通用分析结论不一定在所有情况下都是紧致的)。\n",
"\n",
"指数机制非常具有普适性。一般情况下,通过精心选择评分函数$u$,我们可以用指数机制重定义任何$\\epsilon$-差分隐私机制。只要我们可以分析出该评分函数的敏感度,我们就可以轻松证明相应机制满足差分隐私。\n",
"\n",
"另一方面,指数机制之所以具有普适性,是因为其通用分析方法得到的隐私消耗量边界可能会更宽松一些(就像前面给出的拉普拉斯例子那样)。此外,用指数机制定义的差分隐私机制一般都比较难实现。指数机制通常用于证明理论下界(即证明差分隐私算法的*存在性*)。在实际中,一般会使用一些其他的算法来复现指数机制(如前面描述的输出噪声最大值例子)。"
]
}
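  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sanity check, the following sketch (not in the original text; the grid and query value are invented) discretizes the set of candidate answers, samples from the exponential mechanism with the scoring function above, and compares the spread of its output to direct Laplace noise."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "epsilon = 1.0\n",
    "sensitivity = 1.0          # Delta q\n",
    "true_answer = 10.0         # q(x), hypothetical\n",
    "\n",
    "# Discretized candidate set and the scoring function u(x, r) = -2|q(x) - r|\n",
    "R = np.linspace(0, 20, 2001)\n",
    "scores = -2 * np.abs(true_answer - R)\n",
    "probs = np.exp(epsilon * scores / (2 * sensitivity))\n",
    "probs = probs / probs.sum()\n",
    "\n",
    "exp_mech_samples = np.random.choice(R, size=10000, p=probs)\n",
    "laplace_samples = true_answer + np.random.laplace(scale=sensitivity/epsilon, size=10000)\n",
    "\n",
    "# The two empirical standard deviations should be close\n",
    "print(exp_mech_samples.std(), laplace_samples.std())"
   ]
  }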
],
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.10"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

9
zh_cn/notebooks/cover.md Normal file
View File

@ -0,0 +1,9 @@
# Programming Differential Privacy
![logo](logo_zh_cn.png)
**A book about differential privacy, for programmers**
**Joseph P. Near and Chiké Abuah**
**Translated by Weiran Liu and Shuang Li**

View File

@ -0,0 +1,67 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"# 引言\n",
"\n",
"这是一本面向程序员的差分隐私书籍。本书旨在向您介绍数据隐私保护领域所面临的挑战,描述为解决这些挑战而提出的技术,并帮助您理解如何实现其中一部分技术。\n",
"\n",
"本书包含了很多示例,也包含了很多概念的具体实现,这些示例和实现都是用可以实际运行的*程序*撰写的。每一章都由一个独立的Jupyter笔记本Jupyter Notebook文件生成。您可以单击相应章节右上角的“下载”图标并选择“.ipynb”从而下载此章的Jupyter笔记本文件并亲手执行这些示例。章节中的很多示例都是用代码生成的。为了便于阅读我们将这些代码隐藏了起来。您可以通过单击示例单元格下方的\"点击显示\"Click to show按钮显示隐藏在背后的代码。\n",
"\n",
"本书假定您可以使用Python语言编写和运行程序并掌握Pandas和NumPy的一些基本概念。如果您具有离散数学和概率论的相关背景知识那您会更加轻松地理解本书的内容。不必担心本科课程上的离散数学和概率论知识对学习本书来说已经绰绰有余了。\n",
"\n",
"本书的源代码已开源,可以从[这里](https://uvm-plaid.github.io/programming-dp/notebooks/intro.html)在线获取本书的最新(英文)版本。您可以在[GitHub](https://github.com/uvm-plaid/programming-dp)上获取本书英文版的源代码。如果您找到一处笔误、提出一处改进建议或报告一个程序错误请在GitHub上提交问题。\n",
"\n",
"本书描述的技术是从*数据隐私*Data Privacy领域的研究中发展得来的。以本书的撰写目的出发我们将按照下述方式定义数据隐私\n",
"\n",
"```{admonition} 定义\n",
"*数据隐私*技术的目标是,允许数据分析方获取隐私数据中蕴含的*趋势*,但不会泄露特定*个体*的信息。\n",
"```\n",
"\n",
"这是一个宽泛的数据隐私定义,很多不同的技术都是围绕这个定义而提出的。但要特别注意的是,这一定义*不包括*保证*安全性*的技术,如加密技术。加密数据不会泄露*任何*信息,因此加密技术不能满足我们定义的前半部分要求。我们需要特别注意安全与隐私之间的差异:隐私技术涉及到*故意*发布信息,并试图控制从发布信息中*学到什么*。安全技术通常会*阻止*信息的泄露,并控制数据可以*被谁访问*。本书主要涵盖的是隐私技术。只有当安全对隐私有重要影响时,我们才会讨论相应的安全技术。\n",
"\n",
"本书主要聚焦于差分隐私Differential Privacy。我们将在前几章概述本书聚焦差分隐私的部分原因差分隐私及其变体是我们已知的唯一能从数学角度提供可证明隐私保护能力的方法。去标识化、聚合等技术是人们这十几年来常用的隐私技术。这些技术近期已被证明无法抵御复杂的隐私攻击。如$k$-匿名性等更先进的一些隐私技术也无法抵御特定的攻击。因此,差分隐私正迅速成为隐私保护的黄金标准,也是本书重点介绍的隐私技术。"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}

Binary file not shown.

After

Width:  |  Height:  |  Size: 10 KiB

View File

@ -0,0 +1,237 @@
@misc{identifiability,
  author = {Sweeney, Latanya},
  title = {Simple Demographics Often Identify People Uniquely},
  url = {https://dataprivacylab.org/projects/identifiability/},
  journal = {Identifiability}}
@article{sweeney2002,
  author = {Sweeney, Latanya},
  title = {k-Anonymity: A Model for Protecting Privacy},
  journal = {International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems},
  volume = {10},
  number = {05},
  pages = {557--570},
  year = {2002},
  doi = {10.1142/S0218488502001648},
  url = {https://doi.org/10.1142/S0218488502001648}}
@inproceedings{mcsherry2009,
author = {McSherry, Frank D.},
title = {Privacy Integrated Queries: An Extensible Platform for Privacy-Preserving Data Analysis},
year = {2009},
isbn = {9781605585512},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/1559845.1559850},
doi = {10.1145/1559845.1559850},
abstract = {We report on the design and implementation of the Privacy Integrated Queries (PINQ) platform for privacy-preserving data analysis. PINQ provides analysts with a programming interface to unscrubbed data through a SQL-like language. At the same time, the design of PINQ's analysis language and its careful implementation provide formal guarantees of differential privacy for any and all uses of the platform. PINQ's unconditional structural guarantees require no trust placed in the expertise or diligence of the analysts, substantially broadening the scope for design and deployment of privacy-preserving data analysis, especially by non-experts.},
booktitle = {Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data},
pages = {19--30},
numpages = {12},
keywords = {differential privacy, linq, confidentiality, anonymization},
location = {Providence, Rhode Island, USA},
series = {SIGMOD '09}
}
@InProceedings{dwork2006,
author="Dwork, Cynthia
and Kenthapadi, Krishnaram
and McSherry, Frank
and Mironov, Ilya
and Naor, Moni",
editor="Vaudenay, Serge",
title="Our Data, Ourselves: Privacy Via Distributed Noise Generation",
booktitle="Advances in Cryptology - EUROCRYPT 2006",
year="2006",
publisher="Springer Berlin Heidelberg",
address="Berlin, Heidelberg",
pages="486--503"
}
@inproceedings{dwork2006A,
author = {Dwork, Cynthia},
title = {Differential Privacy},
year = {2006},
isbn = {3540359079},
publisher = {Springer-Verlag},
address = {Berlin, Heidelberg},
url = {https://doi.org/10.1007/11787006_1},
doi = {10.1007/11787006_1},
abstract = {In 1977 Dalenius articulated a desideratum for statistical databases: nothing about an individual should be learnable from the database that cannot be learned without access to the database. We give a general impossibility result showing that a formalization of Dalenius' goal along the lines of semantic security cannot be achieved. Contrary to intuition, a variant of the result threatens the privacy even of someone not in the database. This state of affairs suggests a new measure, differential privacy, which, intuitively, captures the increased risk to one's privacy incurred by participating in a database. The techniques developed in a sequence of papers [8, 13, 3], culminating in those described in [12], can achieve any desired level of privacy under this measure. In many cases, extremely accurate information about the database can be provided while simultaneously ensuring very high levels of privacy},
booktitle = {Proceedings of the 33rd International Conference on Automata, Languages and Programming - Volume Part II},
pages = {1--12},
numpages = {12},
location = {Venice, Italy},
series = {ICALP'06}
}
@inproceedings{dwork2006B,
author = {Dwork, Cynthia and McSherry, Frank and Nissim, Kobbi and Smith, Adam},
title = {Calibrating Noise to Sensitivity in Private Data Analysis},
year = {2006},
isbn = {3540327312},
publisher = {Springer-Verlag},
address = {Berlin, Heidelberg},
url = {https://doi.org/10.1007/11681878_14},
doi = {10.1007/11681878_14},
abstract = {We continue a line of research initiated in [10,11]on privacy-preserving statistical databases. Consider a trusted server that holds a database of sensitive information. Given a query function f mapping databases to reals, the so-called true answer is the result of applying f to the database. To protect privacy, the true answer is perturbed by the addition of random noise generated according to a carefully chosen distribution, and this response, the true answer plus noise, is returned to the user.Previous work focused on the case of noisy sums, in which f = ∑ig(xi), where xi denotes the ith row of the database and g maps database rows to [0,1]. We extend the study to general functions f, proving that privacy can be preserved by calibrating the standard deviation of the noise according to the sensitivity of the function f. Roughly speaking, this is the amount that any single argument to f can change its output. The new analysis shows that for several particular applications substantially less noise is needed than was previously understood to be the case.The first step is a very clean characterization of privacy in terms of indistinguishability of transcripts. Additionally, we obtain separation results showing the increased value of interactive sanitization mechanisms over non-interactive.},
booktitle = {Proceedings of the Third Conference on Theory of Cryptography},
pages = {265--284},
numpages = {20},
location = {New York, NY},
series = {TCC'06}
}
@inproceedings{nissim2007,
author = {Nissim, Kobbi and Raskhodnikova, Sofya and Smith, Adam},
title = {Smooth Sensitivity and Sampling in Private Data Analysis},
year = {2007},
isbn = {9781595936318},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/1250790.1250803},
doi = {10.1145/1250790.1250803},
abstract = {We introduce a new, generic framework for private data analysis.The goal of private data analysis is to release aggregate information about a data set while protecting the privacy of the individuals whose information the data set contains.Our framework allows one to release functions f of the data withinstance-based additive noise. That is, the noise magnitude is determined not only by the function we want to release, but also bythe database itself. One of the challenges is to ensure that the noise magnitude does not leak information about the database. To address that, we calibrate the noise magnitude to the smoothsensitivity of f on the database x --- a measure of variabilityof f in the neighborhood of the instance x. The new frameworkgreatly expands the applicability of output perturbation, a technique for protecting individuals' privacy by adding a smallamount of random noise to the released statistics. To our knowledge, this is the first formal analysis of the effect of instance-basednoise in the context of data privacy.Our framework raises many interesting algorithmic questions. Namely,to apply the framework one must compute or approximate the smoothsensitivity of f on x. We show how to do this efficiently for several different functions, including the median and the cost ofthe minimum spanning tree. We also give a generic procedure based on sampling that allows one to release f(x) accurately on manydatabases x. This procedure is applicable even when no efficient algorithm for approximating smooth sensitivity of f is known orwhen f is given as a black box. We illustrate the procedure by applying it to k-SED (k-means) clustering and learning mixtures of Gaussians.},
booktitle = {Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing},
pages = {75--84},
numpages = {10},
keywords = {private data analysis, output perturbation, clustering, sensitivity, privacy preserving data mining},
location = {San Diego, California, USA},
series = {STOC '07}
}
@inproceedings{dwork2009,
author = {Dwork, Cynthia and Lei, Jing},
title = {Differential Privacy and Robust Statistics},
year = {2009},
isbn = {9781605585062},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/1536414.1536466},
doi = {10.1145/1536414.1536466},
abstract = {We show by means of several examples that robust statistical estimators present an excellent starting point for differentially private estimators. Our algorithms use a new paradigm for differentially private mechanisms, which we call Propose-Test-Release (PTR), and for which we give a formal definition and general composition theorems.},
booktitle = {Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing},
pages = {371--380},
numpages = {10},
keywords = {propose-test-release paradigm, local sensitivity, differential privacy, robust statistics},
location = {Bethesda, MD, USA},
series = {STOC '09}
}
@article{dwork2014,
title={The algorithmic foundations of differential privacy},
author={Dwork, Cynthia and Roth, Aaron and others},
journal={Foundations and Trends{\textregistered} in Theoretical Computer Science},
volume={9},
number={3--4},
pages={211--407},
year={2014},
publisher={Now Publishers, Inc.}
}
@inproceedings{dwork2010,
author={Dwork, Cynthia and Rothblum, Guy N. and Vadhan, Salil},
booktitle={2010 IEEE 51st Annual Symposium on Foundations of Computer Science},
title={Boosting and Differential Privacy},
year={2010}, volume={}, number={}, pages={51-60}, doi={10.1109/FOCS.2010.12}}
@inproceedings{bun2018composable,
title={Composable and versatile privacy via truncated CDP},
author={Bun, Mark and Dwork, Cynthia and Rothblum, Guy N and Steinke, Thomas},
booktitle={Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing},
pages={74--86},
year={2018},
organization={ACM}
}
@inproceedings{mironov2017renyi,
title={R{\'e}nyi differential privacy},
author={Mironov, Ilya},
booktitle={Computer Security Foundations Symposium (CSF), 2017 IEEE 30th},
pages={263--275},
year={2017},
organization={IEEE}
}
@inproceedings{bun2016concentrated,
title={Concentrated differential privacy: Simplifications, extensions, and lower bounds},
author={Bun, Mark and Steinke, Thomas},
booktitle={Theory of Cryptography Conference},
pages={635--658},
year={2016},
organization={Springer}
}
@inproceedings{mcsherry2007,
author={McSherry, Frank and Talwar, Kunal},
booktitle={48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07)},
title={Mechanism Design via Differential Privacy},
year={2007}, volume={}, number={}, pages={94-103}, doi={10.1109/FOCS.2007.66}}
@inproceedings{dwork2009A,
author = {Dwork, Cynthia and Naor, Moni and Reingold, Omer and Rothblum, Guy N. and Vadhan, Salil},
title = {On the Complexity of Differentially Private Data Release: Efficient Algorithms and Hardness Results},
year = {2009},
isbn = {9781605585062},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/1536414.1536467},
doi = {10.1145/1536414.1536467},
abstract = {We consider private data analysis in the setting in which a trusted and trustworthy curator, having obtained a large data set containing private information, releases to the public a "sanitization" of the data set that simultaneously protects the privacy of the individual contributors of data and offers utility to the data analyst. The sanitization may be in the form of an arbitrary data structure, accompanied by a computational procedure for determining approximate answers to queries on the original data set, or it may be a "synthetic data set" consisting of data items drawn from the same universe as items in the original data set; queries are carried out as if the synthetic data set were the actual input. In either case the process is non-interactive; once the sanitization has been released the original data and the curator play no further role.For the task of sanitizing with a synthetic dataset output, we map the boundary between computational feasibility and infeasibility with respect to a variety of utility measures. For the (potentially easier) task of sanitizing with unrestricted output format, we show a tight qualitative and quantitative connection between hardness of sanitizing and the existence of traitor tracing schemes.},
booktitle = {Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing},
pages = {381--390},
numpages = {10},
keywords = {cryptography, privacy, differential privacy, traitor tracing, exponential mechanism},
location = {Bethesda, MD, USA},
series = {STOC '09}
}
@inproceedings{rappor,
author = {Erlingsson, \'{U}lfar and Pihur, Vasyl and Korolova, Aleksandra},
title = {RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response},
year = {2014},
isbn = {9781450329576},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/2660267.2660348},
doi = {10.1145/2660267.2660348},
abstract = {Randomized Aggregatable Privacy-Preserving Ordinal Response, or RAPPOR, is a technology for crowdsourcing statistics from end-user client software, anonymously, with strong privacy guarantees. In short, RAPPORs allow the forest of client data to be studied, without permitting the possibility of looking at individual trees. By applying randomized response in a novel manner, RAPPOR provides the mechanisms for such collection as well as for efficient, high-utility analysis of the collected data. In particular, RAPPOR permits statistics to be collected on the population of client-side strings with strong privacy guarantees for each client, and without linkability of their reports. This paper describes and motivates RAPPOR, details its differential-privacy and utility guarantees, discusses its practical deployment and properties in the face of different attack models, and, finally, gives results of its application to both synthetic and real-world data.},
booktitle = {Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security},
pages = {1054--1067},
numpages = {14},
keywords = {population statistics, crowdsourcing, cloud computing, statistical inference, privacy protection},
location = {Scottsdale, Arizona, USA},
series = {CCS '14}
}
@article{warner1965,
author = { Stanley L. Warner },
title = {Randomized Response: A Survey Technique for Eliminating Evasive Answer Bias},
journal = {Journal of the American Statistical Association},
volume = {60},
number = {309},
pages = {63-69},
year = {1965},
publisher = {Taylor & Francis},
doi = {10.1080/01621459.1965.10480775},
note ={PMID: 12261830},
URL = {https://www.tandfonline.com/doi/abs/10.1080/01621459.1965.10480775}}
@inproceedings {wang2017,
author = {Tianhao Wang and Jeremiah Blocki and Ninghui Li and Somesh Jha},
title = {Locally Differentially Private Protocols for Frequency Estimation},
booktitle = {26th {USENIX} Security Symposium ({USENIX} Security 17)},
year = {2017},
isbn = {978-1-931971-40-9},
address = {Vancouver, BC},
pages = {729--745},
url = {https://www.usenix.org/conference/usenixsecurity17/technical-sessions/presentation/wang-tianhao},
publisher = {{USENIX} Association},
month = aug,
}

BIN
zh_cn/static/book-logo.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 95 KiB

264
zh_cn/static/index.html Normal file
View File

@ -0,0 +1,264 @@
<html prefix="og: https://ogp.me/ns#">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<!-- Bootstrap CSS -->
<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.1.3/css/bootstrap.min.css" integrity="sha384-MCw98/SFnGE8fJT3GXwEOngsV7Zt27NXFoaoApmYm81iuXoPkFOJwJ8ERdknLPMO" crossorigin="anonymous">
<title>动手学差分隐私Programming Differential Privacy</title>
<meta name="description" content="一本面向开发者的差分隐私书籍A book about differential privacy, for programmers."/>
<meta property="og:title" content="动手学差分隐私Programming Differential Privacy" />
<meta property="og:type" content="book" />
<meta property="og:image" content="https://uvm-plaid.github.io/programming-dp/cn/book-logo.png" />
<meta property="og:image:secure_url" content="https://uvm-plaid.github.io/programming-dp/cn/book-logo.png" />
<meta property="og:image:type" content="image/png" />
<meta property="og:image:alt" content="book logo" />
<meta name="twitter:card" content="summary" />
<meta name="twitter:site" content="@josephnear" />
<meta name="twitter:title" content="动手学差分隐私Programming Differential Privacy" />
<meta name="twitter:description" content="一本面向开发者的差分隐私书籍A book about differential privacy, for programmers" />
<meta name="twitter:image" content="https://uvm-plaid.github.io/programming-dp/cn/book-logo.png" />
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "Book",
"name": "动手学差分隐私Programming Differential Privacy",
"about": "一本面向开发者的差分隐私教材A book about differential privacy, for programmers"
"image": "https://uvm-plaid.github.io/programming-dp/book-logo.png",
}
</script>
<PageMap>
<DataObject type="thumbnail">
<Attribute name="src" value="https://uvm-plaid.github.io/programming-dp/cn/book-logo.png"/>
</DataObject>
</PageMap>
<meta name="thumbnail" content="https://uvm-plaid.github.io/programming-dp/cn/book-logo.png" />
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<style type="text/css">
body { background-color: #FFFFFF;
font-family: Optima, Palatino, Arial, sans-serif, Helvetica;
padding:0ex 0ex 0ex 0ex ;
margin: 0ex 0ex 0ex 0ex ;
}
hr {
border: 0;
width: 90%;
height: 2px;
color: #e3e7ef;
background-color: #e3e7ef;
margin:3ex 2ex 0ex 0ex ;
}
h1 {
font-family: Optima,Segoe,Segoe UI,Candara,Calibri,Arial,sans-serif;
color:#333333;
font-size:2.7em;
letter-spacing:-2px;
}
h2 {
font-family: Optima,Segoe,Segoe UI,Candara,Calibri,Arial,sans-serif;
color:#333333;
font-size:1.7em;
padding-bottom: 0.5em;
}
h3 {
font-family: Optima,Segoe,Segoe UI,Candara,Calibri,Arial,sans-serif;
color:#444444;
}
h4 {
font-family: Optima,Segoe,Segoe UI,Candara,Calibri,Arial,sans-serif;
color:#777777;
padding: 0.5em;
}
big {
font-size:1.3em;
}
span {
color:#ffffff;
}
a {
padding: .2em;
}
html { font-size: 18px !important }
@media (pointer: coarse) {
a {
padding: .4em;
}
}
img {
max-width: 300px;
}
a:link {
color:#444444;
text-decoration: underline;
}
a:visited {
color:#444444;
}
a:hover {
color:#222222;
text-decoration:none;
}
a:active {
color:#000000;
}
p.small {
font-variant: small-caps;
}
.box {
display: flex;
justify-content: center;
text-align: center;
flex-flow: row nowrap;
margin: 10%;
}
@media screen and (max-width:600px) {
.box {
flex-flow: column nowrap;
}
}
</style>
<style type=text/css>
#book-logo{
border: black;
border-width: thin;
border-style: solid;
max-width: 70%;
height: auto;
}
#citation{
font-size: 10px;
background: lightgray;
padding: 10px;
font-family: monospace;
width: 80%;
}
</style>
</head>
<body>
<div class="box">
<div style="flex: 0 0 50%; margin-bottom: 10px;">
<a href="https://programming-dp.com/cn/cover.html">
<img id="book-logo" src="book-logo.png" alt="Book Logo" >
</a>
</div>
<div style="flex: 0 0 50%; padding-bottom: 2em; text-align: left;">
<h1>动手学差分隐私</h1>
<h1>Programming Differential Privacy</h1>
<h2>一本面向开发者的差分隐私书籍</h2>
<h2>A book about differential privacy, for programmers</h2>
<h2><a href="http://uvm.edu/~jnear">Joseph P. Near</a><a href="http://uvm.edu/~cabuah">Chiké Abuah</a>(著)</h2>
<h2><a href="https://github.com/liuweiran900217">刘巍然Weiran Liu</a><a href="https://github.com/little12">李双Shuang Li</a>(译)</h2>
<p><b><i>Programming Differential Privacy</i> uses examples and
Python code to explain the ideas behind differential privacy!</b>
The book is suitable for undergraduate students in computer science,
and no theory background is expected.</p>
<p><b><i>动手学差分隐私</i>应用具体实例和Python代码解释差分隐私的基本原理</b>
本书适合计算机专业的本科生使用,学习本书的内容不需要预先了解任何理论背景知识。</p>
<p><b><i>Programming Differential Privacy</i> is executable!</b>
Each chapter is actually generated from Python code. If you view the
HTML version of the book, you can click on the "Launch Binder" icon
at the top of each page to start an interactive version of that
chapter.</p>
<p><b><i>动手学差分隐私</i>是一本可上手执行代码的书!</b>
每章的内容实际上都是用Python代码生成的。如果您浏览的是本书的HTML版本
可以点击页面上方的"Launch Binder"图标,开启对应章节的交互式版本。</p>
<p>
<ul>
<li><b><a href="https://programming-dp.com/cover.html">点击此处阅读HTML版本
Click here to read the HTML version</a></b>
<li><b><a href="book.pdf">点击此处下载PDF版本Click here to download the PDF version</a></b>
<li><b><a href="https://programming-dp.com/cn/cover.html">点击此处阅读中文HTML版本
Click here to read the Chinese HTML version</a></b>
<li><b><a href="https://programming-dp.com/cn/cn_book.pdf">点击此处下载中文PDF版本Click here to download the Chinese PDF version</a></b>
</ul>
</p>
<p>This book was originally developed at the University of Vermont
as part
of <a href="https://jnear.github.io/cs211-data-privacy/">CS211:
Data Privacy</a>. The material has since been used at the
University of Chicago, Penn State, and Rice University. If you're
using the book in your course, please let us know!</p>
<p>本书最初在佛蒙特大学开发,服务于<a href="https://jnear.github.io/cs211-data-privacy/">CS211课程数据隐私</a>。随后,
本书被芝加哥大学、宾夕法尼亚州立大学、以及莱斯大学所使用。如果您在您的课程中使用了本书,还请告知我们!</p>
<p><b><i>Programming Differential Privacy</i> is a living,
open-source book.</b> We welcome comments, suggestions, and
contributions via issues and pull requests
on <a href="https://github.com/uvm-plaid/programming-dp">the GitHub
repository</a>.
<p><b><i>动手学差分隐私</i>是一本内容可随时更新的开源书籍。</b>
我们欢迎您提交任何意见和建议,
或通过在<a href="https://github.com/uvm-plaid/programming-dp">GitHub仓库</a>上提交问题Issue和推送请求Pull request来为本书作出您的贡献。
<p>Please use the following to cite the book:</p>
<div id="citation">
@book{near_abuah_2021,<br>
&nbsp;&nbsp; title={Programming Differential Privacy},<br>
&nbsp;&nbsp; author={Near, Joseph P. and Abuah, Chiké},<br>
&nbsp;&nbsp; volume={1},<br>
&nbsp;&nbsp; url={https://uvm-plaid.github.io/programming-dp/}, <br>
&nbsp;&nbsp; year={2021}<br>
}<br>
</div>
<p>请使用下述方法引用本书:</p>
<div id="citation">
@book{near_abuah_2021,<br>
&nbsp;&nbsp; title={Programming Differential Privacy},<br>
&nbsp;&nbsp; author={Near, Joseph P. and Abuah, Chiké},<br>
&nbsp;&nbsp; volume={1},<br>
&nbsp;&nbsp; url={https://uvm-plaid.github.io/programming-dp/}, <br>
&nbsp;&nbsp; year={2021}<br>
}<br>
</div>
</div>
</body>
</html>