Researchers Look at Data Mining Privacy Techniques
As new disclosures mount about government surveillance programs, computer science researchers hope to wade into the fray by enabling data mining that also protects individual privacy.
Largely by employing the head-spinning principles of cryptography, the researchers say they can ensure that law enforcement, intelligence agencies and private companies can sift through huge databases without seeing names and identifying details in the records.
For example, manifests of airplane passengers could be compared with terrorist watch lists -- without airline staff or government agents seeing the actual names on the other side's list. Only if a match were made would a computer alert each side to uncloak the record and probe further.
"If it's possible to anonymize data and produce ... the same results as clear text, why not?" John Bliss, a privacy lawyer in IBM Corp.'s "entity analytics" unit, told a recent workshop on the subject at Harvard University.
The concept of encrypting or hiding identifying details in sensitive databases is not new. Exploration has gone on for years, and researchers say some government agencies already deploy such technologies -- though protecting classified information rather than individual privacy is a main goal.
Even the data-mining project that perhaps drew more scorn than any other in recent years, the Pentagon's Total Information Awareness research program, funded at least two efforts to anonymize database scans. Those anonymizing systems were dropped when Congress shuttered TIA, even while the data-mining aspects of the project lived on in intelligence agencies.
Still, anonymizing technologies have been endorsed repeatedly by panels appointed to examine the implications of data mining. And intriguing progress appears to have been made at designing information-retrieval systems with record anonymization, user audit logs -- which can confirm that no one looked at records beyond the approved scope of an investigation -- and other privacy mechanisms "baked in."
The trick is to do more than simply strip names from records. Latanya Sweeney of Carnegie Mellon University -- a leading privacy technologist who once had a project funded under TIA -- has shown that 87 percent of Americans could be identified by records listing solely their birthdate, gender and ZIP code.
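To see why stripping names is not enough, consider a rough sketch of how one might measure the problem. The records below are invented for illustration and are not drawn from Sweeney's study; the function simply reports what share of rows is the only one with its combination of birthdate, gender and ZIP code.

```python
from collections import Counter

# Hypothetical, made-up records: (birthdate, gender, ZIP code). No names at all.
records = [
    ("1970-01-01", "F", "02138"),
    ("1970-01-01", "F", "02138"),
    ("1982-06-15", "M", "02139"),
    ("1990-11-30", "F", "15213"),
]

def unique_fraction(rows):
    """Return the fraction of rows whose combination of fields appears only once."""
    counts = Counter(rows)
    unique = sum(1 for row in rows if counts[row] == 1)
    return unique / len(rows)

fraction = unique_fraction(records)
print(f"{fraction:.0%} of these records are unique on birthdate, gender and ZIP")
```

Any row that is unique on those three fields could be linked back to a person by someone holding another list, such as a voter roll, that carries the same fields alongside names.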
Sweeney had this challenge in mind as she developed a way for the U.S. Department of Housing and Urban Development to anonymously track the homeless.
The system became necessary to meet the conflicting demands of two laws -- one that requires homeless shelters to tally the people they take in, and another that prohibits victims of domestic violence from being identified by agencies that help them.
Sweeney's solution deploys a "hash function," which cryptographically converts information to a random-appearing code of numbers and letters. The function can't be reversed to reveal the original data.
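A minimal sketch of the idea, assuming SHA-256 as the hash function (the article does not say which function Sweeney's system uses), might look like this. The `hash_record` helper and the sample name and birthdate are hypothetical.

```python
import hashlib

def hash_record(name: str, birthdate: str) -> str:
    """Return a hex SHA-256 digest of a normalized name and birthdate."""
    # Normalize first so trivial differences in spacing or case
    # do not produce different codes for the same person.
    normalized = f"{name.strip().lower()}|{birthdate}"
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

code_a = hash_record("Jane Doe", "1970-01-01")
code_b = hash_record("  jane doe ", "1970-01-01")
code_c = hash_record("Jane Doe", "1970-01-02")

print(code_a == code_b)  # same person, same code
print(code_a == code_c)  # different birthdate, different code
```

The digest cannot be run backward, but a plain hash like this one can still be guessed by hashing a list of likely names and comparing the results, which is one reason a system would not stop at this step.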
Under the system, when homeless shelters submitted their records to regional HUD offices for a count of how many people used the facilities, each shelter would send only hashed data.
A key detail here is that each homeless shelter would have its own computational process, known as an algorithm, for hashing data. That way, one person's name wouldn't always translate into the same code -- a predictability that could be abused by a corrupt insider or savvy stalker who gained access to the records.
However, if the same name generated different codes at different shelters, it would be impossible to tell whether one person had been to two centers and was being double-counted. So Sweeney's system adds a second step: Each shelter's hashed records are sent to all other facilities covered by the HUD regional office, then hashed again and sent back to HUD as a new code.
It might be hard to wrap your mind around this, but it's a fact of the cryptography involved: If one person had been to two different shelters -- and so their anonymized data got hashed twice, once by each of the shelters applying its own formula -- then the codes HUD received in this second phase would indicate as much. That would aid an accurate count.
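The two-step scheme can be illustrated with a toy example. The article does not spell out the mathematics of Sweeney's system, so the sketch below substitutes a well-known stand-in: each shelter raises a hashed identifier to its own secret exponent modulo a prime, an operation that gives the same result whichever shelter goes first. The prime, the helper names and the sample identifiers are all assumptions, and the numbers are far too small for real use.

```python
import hashlib
import math
import secrets

P = 2**127 - 1  # a Mersenne prime, used here only as a toy modulus

def new_shelter_key() -> int:
    """Pick a secret exponent that is invertible modulo P - 1."""
    while True:
        key = secrets.randbelow(P - 3) + 2
        if math.gcd(key, P - 1) == 1:
            return key

def to_group_element(identifier: str) -> int:
    """Map an identifier to a number in the range 1..P-1 via SHA-256."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1

def first_pass(identifier: str, key: int) -> int:
    """The code a shelter produces for its own record."""
    return pow(to_group_element(identifier), key, P)

def second_pass(code: int, key: int) -> int:
    """The code another shelter produces from a code it received."""
    return pow(code, key, P)

key_a, key_b = new_shelter_key(), new_shelter_key()

# The same (hypothetical) person visits shelters A and B.
seen_at_a = first_pass("jane doe|1970-01-01", key_a)
seen_at_b = first_pass("jane doe|1970-01-01", key_b)
other_at_b = first_pass("john roe|1982-06-15", key_b)

# Each shelter re-hashes the other's codes before they go to HUD.
final_from_a = second_pass(seen_at_a, key_b)
final_from_b = second_pass(seen_at_b, key_a)
final_other = second_pass(other_at_b, key_a)

print(seen_at_a == seen_at_b)        # first-pass codes differ between shelters
print(final_from_a == final_from_b)  # second-pass codes match: one person, two visits
```

In this sketch the first-pass codes for the same person look unrelated, but the second-pass codes coincide, so a counter holding only the final codes can spot the duplicate without learning the name behind it.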
Even if HUD decides not to adopt the system, Sweeney hopes it finds use in other settings, such as letting private companies and law enforcement anonymously compare whether customer records and watch lists have names in common.
A University of California, Los Angeles professor, Rafail Ostrovsky, said the CIA and the National Security Agency are evaluating a program of his that would let intelligence analysts search huge batches of intercepted communications for keywords and other criteria, while discarding messages that don't apply.
Ostrovsky and co-creator William Skeith believe the system would keep innocent files away from snoops' eyes while also extending their reach: Because the program would encrypt its search terms and the results, it could be placed on machines all over the Internet, not just computers in classified settings.
"Technologically it is possible" to bolster security and privacy, Ostrovsky said. "You can kind of have your cake and eat it too."
That may be the case, but creating such technologies is just part of the battle.
One problem is getting potential users to change how they deal with information.
Rebecca Wright, a Stevens Institute of Technology professor who is part of a five-year National Science Foundation-funded effort to build privacy protections into data-mining systems, illustrates that issue with the following example.
The Computing Research Association annually analyzes the pay earned by university computer faculty. Some schools provide anonymous lists of salaries; more protective ones send just their minimum, maximum and average pay.
Researchers affiliated with Wright's project, known as Portia, offered a way to calculate the figures with better accuracy and privacy. Instead of having universities send their salary figures for the computer association to crunch, Portia's system can perform calculations on data without ever storing it in unencrypted fashion. With such secrecy, the researchers argued, every school could safely send full salary lists.
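The article does not describe how Portia's calculations work. One common way to compute on data that no single party ever sees in the clear is additive secret sharing, sketched below as an assumption rather than as the project's actual method. Each salary is split into random-looking shares, each hypothetical compute party adds up only the shares it receives, and only the combined total is ever revealed. The salary figures are invented.

```python
import secrets

MODULUS = 2**61 - 1  # must be larger than any possible salary total

def split_into_shares(value: int, n_parties: int) -> list[int]:
    """Split value into n_parties shares that sum to value modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]

salaries = [95_000, 120_000, 143_500]  # made-up figures
n_parties = 3

# Each compute party keeps a running total of the shares it is sent.
party_totals = [0] * n_parties
for salary in salaries:
    for i, share in enumerate(split_into_shares(salary, n_parties)):
        party_totals[i] = (party_totals[i] + share) % MODULUS

# Only the combined total is reconstructed, never an individual salary.
total = sum(party_totals) % MODULUS
average = total / len(salaries)
print(total, average)
```

Any single party's running total is statistically indistinguishable from a random number, so the salaries stay hidden unless all of the parties pool what they hold. Figures such as the minimum and maximum would need a different protocol than this simple sum.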
But the software remains unadopted. One large reason, Wright said, was that universities questioned whether encryption gave them legal standing to provide full salary lists when they previously could not -- even though the new lists never would leave the university in unencrypted form.
Even if data-miners were eager to adopt privacy enhancements, Wright and other researchers worry that the programs' obscure details might be difficult for the public to trust.
Steven Aftergood, who heads the Federation of American Scientists' project on government secrecy, suggested that public confidence could be raised by subjecting government data-mining projects to external privacy reviews.
But that seems somewhat unrealistic, he said, given that intelligence agencies have been slow to share surveillance details with Congress even on a classified basis.
"That part of the problem may be harder to solve than the technical part," Aftergood said. "And in turn, that may mean that the problem may not have a solution."