Extracting Personal Information from Large Language Models Like GPT-2

Researchers have been able to find all sorts of personal information within GPT-2. This information was part of the training data, and can be extracted with the right sorts of queries.

Paper: “Extracting Training Data from Large Language Models.”

Abstract: It has become common to publish large (billion parameter) language models that have been trained on private datasets. This paper demonstrates that in such settings, an adversary can perform a training data extraction attack to recover individual training examples by querying the language model.

We demonstrate our attack on GPT-2, a language model trained on scrapes of the public Internet, and are able to extract hundreds of verbatim text sequences from the model’s training data. These extracted examples include (public) personally identifiable information (names, phone numbers, and email addresses), IRC conversations, code, and 128-bit UUIDs. Our attack is possible even though each of the above sequences is included in just one document in the training data.

We comprehensively evaluate our extraction attack to understand the factors that contribute to its success. For example, we find that larger models are more vulnerable than smaller models. We conclude by drawing lessons and discussing possible safeguards for training large language models.

From a blog post:

We generated a total of 600,000 samples by querying GPT-2 with three different sampling strategies. Each sample contains 256 tokens, or roughly 200 words on average. Among these samples, we selected 1,800 samples with abnormally high likelihood for manual inspection. Out of the 1,800 samples, we found 604 that contain text which is reproduced verbatim from the training set.
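
To make that procedure concrete, here is a minimal sketch of the sample-then-rank idea in Python, using the Hugging Face transformers library. The model and tokenizer names are real; the sample count, sampling strategy, and scoring cutoff are illustrative stand-ins rather than the paper’s exact settings.

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    def sample_and_score(n_samples=10, length=256):
        # Generate unconditioned samples, then score each one by the model's
        # own perplexity. Abnormally low perplexity (i.e., abnormally high
        # likelihood) flags candidate memorized text for manual inspection.
        start = tokenizer.encode(tokenizer.bos_token, return_tensors="pt")
        scored = []
        for _ in range(n_samples):
            ids = model.generate(start, do_sample=True, top_k=40,
                                 max_length=length,
                                 pad_token_id=tokenizer.eos_token_id)
            with torch.no_grad():
                nll = model(ids, labels=ids).loss  # mean negative log-likelihood
            scored.append((torch.exp(nll).item(), tokenizer.decode(ids[0])))
        return sorted(scored)  # lowest perplexity (most suspicious) first

    for perplexity, text in sample_and_score()[:3]:
        print(f"{perplexity:6.1f}  {text[:60]!r}")

The paper’s actual attack is more careful: it compares GPT-2’s likelihood against reference models and zlib compressibility, which filters out text that is generically likely (such as repeated strings) rather than memorized.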

The rest of the blog post discusses the types of data they found.

Brexit Deal Mandates Old Insecure Crypto Algorithms

In what is surely an unthinking cut-and-paste issue, page 921 of the Brexit deal mandates the use of SHA-1 and 1024-bit RSA:

The open standard s/MIME as extension to de facto e-mail standard SMTP will be deployed to encrypt messages containing DNA profile information. The protocol s/MIME (V3) allows signed receipts, security labels, and secure mailing lists… The underlying certificate used by s/MIME mechanism has to be in compliance with X.509 standard…. The processing rules for s/MIME encryption operations… are as follows:

  1. the sequence of the operations is: first encryption and then signing,
  2. the encryption algorithm AES (Advanced Encryption Standard) with 256 bit key length and RSA with 1,024 bit key length shall be applied for symmetric and asymmetric encryption respectively,
  3. the hash algorithm SHA-1 shall be applied.
  4. s/MIME functionality is built into the vast majority of modern e-mail software packages including Outlook, Mozilla Mail as well as Netscape Communicator 4.x and inter-operates among all major e-mail software packages.
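
For a sense of what those mandated parameters look like in practice, here is a minimal sketch in Python using the cryptography library. The payload is a placeholder, and both parameter choices are spelled out precisely because they are the weak ones the text requires.

    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import rsa, padding

    # The deal mandates 1,024-bit RSA keys; NIST has recommended at least
    # 2,048 bits since around 2010.
    key = rsa.generate_private_key(public_exponent=65537, key_size=1024)

    # It also mandates SHA-1, for which practical collisions were
    # demonstrated in 2017 (the SHAttered attack).
    signature = key.sign(
        b"message containing DNA profile information",  # placeholder payload
        padding.PKCS1v15(),
        hashes.SHA1(),
    )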

And s/MIME? Bleah.

Indistinguishability Obfuscation

Quanta magazine recently published a breathless article on indistinguishability obfuscation — calling it the “‘crown jewel’ of cryptography” — and saying that it had finally been achieved, based on a recently published paper. I want to add some caveats to the discussion.

Basically, obfuscation makes a computer program “unintelligible” while preserving its functionality. Indistinguishability obfuscation is a relaxed notion: it only requires that two different programs that perform the same functionality can’t be distinguished from each other. A good definition is in this paper.
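
For reference, the standard definition can be paraphrased like this (an informal rendering in LaTeX, not a quote from the paper):

    % For any two equal-size circuits computing the same function,
    % their obfuscations must be computationally indistinguishable:
    \forall\, C_0, C_1 \ \text{with}\ |C_0| = |C_1| \ \text{and}\ C_0(x) = C_1(x)\ \forall x:
    \qquad i\mathcal{O}(C_0) \;\approx_c\; i\mathcal{O}(C_1)

Note what this does and does not promise: it hides which of two functionally identical programs you started from, but it says nothing directly about hiding secrets inside a single program. Its power comes from how it composes with other primitives.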

This is a pretty amazing theoretical result, and one to be excited about. We can now do obfuscation, and we can do it using assumptions that make real-world sense. The proofs are kind of ugly, but that’s okay — it’s a start. It means we have a fundamental theoretical result from which a whole bunch of other cryptographic primitives can be derived.

But — and this is a big one — this result is not even remotely close to being practical. We’re talking multiple days to perform pretty simple calculations, using massively large blocks of computer code. And this is likely to remain true for a very long time. Unless researchers increase performance by many orders of magnitude, nothing in the real world will make use of this work anytime soon.

But consider fully homomorphic encryption. It, too, was initially theoretically interesting and completely impractical. Now, after a decade of work, it seems to be just barely approaching practical usefulness. Indistinguishability obfuscation could be on the same trajectory, and perhaps in twenty to thirty years we will celebrate this early theoretical result as the beginning of a new theory of cryptography.

Seny Kamara on “Crypto for the People”

Seny Kamara gave an excellent keynote talk this year at the (online) CRYPTO Conference. He talked about solving real-world crypto problems for marginalized communities around the world, instead of crypto problems for governments and corporations. Well worth watching and listening to.

New crypto-cracking record reached, with less help than usual from Moore’s Law

Researchers have reached a new milestone in the annals of cryptography with the factoring of the largest RSA key size ever computed and a matching computation of the largest-ever integer discrete logarithm. New records of this type occur regularly as the performance of computer hardware increases over time. The records announced on Monday evening are more significant because they were achieved considerably faster than hardware improvements alone would predict, thanks to enhancements in the software used and the algorithms it implemented.

Many public-key encryption algorithms rely on extremely large numbers that are the product of two prime numbers. Other encryption algorithms base their security on the difficulty of solving certain discrete logarithm problems. With sufficiently large key sizes, there is no known way to crack the encryption they provide. The factoring of the large numbers and the computing of a discrete logarithm defeat the cryptographic assurances for a given key size and force users to ratchet up the key sizes they use.
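
To see that asymmetry at toy scale, here is a minimal sketch in Python using sympy; the key sizes are illustrative.

    from sympy import randprime, factorint

    # A 32-bit semiprime factors instantly on a laptop...
    p, q = randprime(2**15, 2**16), randprime(2**15, 2**16)
    n = p * q
    print(factorint(n))  # recovers {p: 1, q: 1} immediately

    # ...but the 795-bit RSA-240 described below took roughly 900
    # physical core-years of sieving and linear algebra to factor.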

The new records include the factoring of RSA-240, an RSA key that has 240 decimal digits and a size of 795 bits. The same team of researchers also computed a discrete logarithm of the same size. The previous records were the factoring in 2010 of RSA-768 (which, despite its name, is a smaller RSA key than RSA-240, with 232 decimal digits and 768 bits) and the computation of a 768-bit prime discrete logarithm in 2016.

The sum of the computation time for both of the new records is about 4,000 core-years using Intel Xeon Gold 6130 CPUs (running at 2.1GHz) as a reference. Like previous records, these were accomplished using a complex algorithm called the Number Field Sieve, which can be used to perform both integer factoring and finite field discrete logarithms. A rough breakdown of the time spent in the sieving and matrix steps of both the RSA factoring and the discrete logarithm computation is:

  • RSA-240 sieving: 800 physical core-years
  • RSA-240 matrix: 100 physical core-years
  • DLP-240 sieving: 2400 physical core-years
  • DLP-240 matrix: 700 physical core-years

Moore’s Law takes a back seat

It’s the first time that records for integer factorization and discrete logarithm have been broken together. It’s also the first time two records have been set using the same hardware and software. These aren’t the only things setting the latest milestones apart from previous ones.

When new records are set, the edicts of Moore’s Law inevitably play a major role. Moore’s Law is the observation that computer chips generally double the number of transistors they contain every 18 months or so. The increase in transistors, in turn, boosts computing power, giving computers increased speed and performance over time.

While Moore’s Law was originally an observation made by Intel cofounder Gordon Moore in 1965, it has come to be viewed as almost an inescapable force of nature, like the laws of physics. Given the relentless march of Moore’s Law, it would be unusual if records like the one announced on Monday didn’t occur regularly.

This one, however, is driven less by Moore’s Law than previous milestones and more by improvements made in the software that carries out the Number Field Sieving. To demonstrate the boost in efficiency, the researchers ran their software on hardware that was identical to that used to compute the 768-bit discrete logarithm in 2016. They found that using the old hardware to sieve the record 795-bit size would take 25% less time than it took the same equipment to perform the 768-bit DLP computation.

Another sign of the improved performance: using identical hardware from 2016, the computation on the 795-bit logarithm was 1.33 times faster than it was on the 768-bit one. Widely accepted estimates in cryptography predict that the larger logarithm should be about 2.25 times harder to compute. Taken together, that suggests a three-fold performance increase over what was expected (2.25 × 1.33 ≈ 3). Since the hardware was the same for both bit sizes, the performance boost isn’t attributable to the availability of ever-faster computers.
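
The 2.25 figure comes from the heuristic running time of the Number Field Sieve, L_N[1/3, (64/9)^(1/3)]. Here is a minimal sketch of that calculation in Python, ignoring the o(1) term in the exponent:

    import math

    def nfs_cost(bits):
        # Heuristic NFS cost: exp((c + o(1)) * (ln N)^(1/3) * (ln ln N)^(2/3))
        # with c = (64/9)^(1/3); the o(1) term is dropped here.
        ln_n = bits * math.log(2)
        c = (64 / 9) ** (1 / 3)
        return math.exp(c * ln_n ** (1 / 3) * math.log(ln_n) ** (2 / 3))

    ratio = nfs_cost(795) / nfs_cost(768)
    print(f"expected hardness ratio: {ratio:.2f}")           # about 2.25
    print(f"implied software speedup: {ratio * 1.33:.1f}x")  # about 3x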

“The acceleration can be attributed to various algorithmic improvements that were implemented for these computations,” the researchers wrote in Monday night’s announcement. Key to those improvements were updates made to the open source software used to implement the Number Field Sieve. Known as CADO-NFS, it comprises 300,000 lines of code written in C and C++.

Emmanuel Thomé, a senior researcher at France’s National Institute for Computer Science and Applied Mathematics and the leader of the team that set the record, said it’s not easy to describe the optimizations in layperson’s terms. The improvements, he wrote in an email, include:

  • we worked on better parallelization and memory usage (but our competitors also did, to be perfectly honest).
  • we took more systematic advantage of asymptotically fast algorithms in some compute-intensive parts of the computation.
  • there’s a big part of this crypto record business that calls to the art of “choosing parameters.” We did that really well. An important part of the picture is the ability to test many different parameter sets and rank them using accurate simulation tools that we developed.

But that explanation is only slightly less vague than saying “various improvements.”

Other researchers on the team included Fabrice Boudot, of France’s Ministry of National Education and the University of Limoges, Pierrick Gaudry of the French National Center for Scientific Research, Aurore Guillevic with France’s National Institute for Computer Science and Applied Mathematics, Nadia Heninger with the University of Pennsylvania and the University of California, San Diego, and Paul Zimmermann of France’s National Institute for Computer Science and Applied Mathematics.

With so much attention devoted to the advent of quantum computers and their ability to easily break today’s public-key encryption, researchers have been busy developing new schemes that can withstand such attacks.

“In the meantime, researchers have been making progress improving classical algorithms to solve factoring and discrete log, which together with Moore’s law results in new key sizes being breakable by computational resources available to academics,” Heninger wrote in an email. “The takeaways for practitioners are basically that we hope they have followed advice to move to at least 2048-bit RSA, Diffie-Hellman, or DSA keys as of several years ago, which would keep them safe from any of these improvements.”