
How Facebook Undermines Privacy Protections for Its 2 Billion WhatsApp Users

WhatsApp assures users that no one can see their messages — but the company has an extensive monitoring operation and regularly shares personal information with prosecutors.

 

Series: The Social Machine

How Facebook Plays by Its Own Set of Rules

Clarification, Sept. 8, 2021: A previous version of this story caused unintended confusion about the extent to which WhatsApp examines its users’ messages and whether it breaks the encryption that keeps the exchanges secret. We’ve altered language in the story to make clear that the company examines only messages from threads that have been reported by users as possibly abusive. It does not break end-to-end encryption.

When Mark Zuckerberg unveiled a new “privacy-focused vision” for Facebook in March 2019, he cited the company’s global messaging service, WhatsApp, as a model. Acknowledging that “we don’t currently have a strong reputation for building privacy protective services,” the Facebook CEO wrote that “I believe the future of communication will increasingly shift to private, encrypted services where people can be confident what they say to each other stays secure and their messages and content won’t stick around forever. This is the future I hope we will help bring about. We plan to build this the way we’ve developed WhatsApp.”

Zuckerberg’s vision centered on WhatsApp’s signature feature, which he said the company was planning to apply to Instagram and Facebook Messenger: end-to-end encryption, which converts all messages into an unreadable format that is only unlocked when they reach their intended destinations. WhatsApp messages are so secure, he said, that nobody else — not even the company — can read a word. As Zuckerberg had put it earlier, in testimony to the U.S. Senate in 2018, “We don’t see any of the content in WhatsApp.”

 

WhatsApp emphasizes this point so consistently that a flag with a similar assurance automatically appears on-screen before users send messages: “No one outside of this chat, not even WhatsApp, can read or listen to them.”

Given those sweeping assurances, you might be surprised to learn that WhatsApp has more than 1,000 contract workers filling floors of office buildings in Austin, Texas, Dublin and Singapore. Seated at computers in pods organized by work assignments, these hourly workers use special Facebook software to sift through millions of private messages, images and videos. They pass judgment on whatever flashes on their screen — claims of everything from fraud or spam to child porn and potential terrorist plotting — typically in less than a minute.

The workers have access to only a subset of WhatsApp messages — those flagged by users and automatically forwarded to the company as possibly abusive. The review is one element in a broader monitoring operation in which the company also reviews material that is not encrypted, including data about the sender and their account.

Policing users while assuring them that their privacy is sacrosanct makes for an awkward mission at WhatsApp. A 49-slide internal company marketing presentation from December, obtained by ProPublica, emphasizes the “fierce” promotion of WhatsApp’s “privacy narrative.” It compares its “brand character” to “the Immigrant Mother” and displays a photo of Malala Yousafzai, who survived a shooting by the Taliban and became a Nobel Peace Prize winner, in a slide titled “Brand tone parameters.” The presentation does not mention the company’s content moderation efforts.

WhatsApp’s director of communications, Carl Woog, acknowledged that teams of contractors in Austin and elsewhere review WhatsApp messages to identify and remove “the worst” abusers. But Woog told ProPublica that the company does not consider this work to be content moderation, saying: “We actually don’t typically use the term for WhatsApp.” The company declined to make executives available for interviews for this article, but responded to questions with written comments. “WhatsApp is a lifeline for millions of people around the world,” the company said. “The decisions we make around how we build our app are focused around the privacy of our users, maintaining a high degree of reliability and preventing abuse.”

WhatsApp’s denial that it moderates content is noticeably different from what Facebook Inc. says about WhatsApp’s corporate siblings, Instagram and Facebook. The company has said that some 15,000 moderators examine content on Facebook and Instagram, neither of which is encrypted. It releases quarterly transparency reports that detail how many accounts Facebook and Instagram have “actioned” for various categories of abusive content. There is no such report for WhatsApp.

Deploying an army of content reviewers is just one of the ways that Facebook Inc. has compromised the privacy of WhatsApp users. Together, the company’s actions have left WhatsApp — the largest messaging app in the world, with two billion users — far less private than its users likely understand or expect. A ProPublica investigation, drawing on data, documents and dozens of interviews with current and former employees and contractors, reveals how, since purchasing WhatsApp in 2014, Facebook has quietly undermined its sweeping security assurances in multiple ways. (Two articles this summer noted the existence of WhatsApp’s moderators but focused on their working conditions and pay rather than their effect on users’ privacy. This article is the first to reveal the details and extent of the company’s ability to scrutinize messages and user data — and to examine what the company does with that information.)

Many of the assertions by content moderators working for WhatsApp are echoed by a confidential whistleblower complaint filed last year with the U.S. Securities and Exchange Commission. The complaint, which ProPublica obtained, details WhatsApp’s extensive use of outside contractors, artificial intelligence systems and account information to examine user messages, images and videos. It alleges that the company’s claims of protecting users’ privacy are false. “We haven’t seen this complaint,” the company spokesperson said. The SEC has taken no public action on it; an agency spokesperson declined to comment.

Facebook Inc. has also downplayed how much data it collects from WhatsApp users, what it does with it and how much it shares with law enforcement authorities. For example, WhatsApp shares metadata, unencrypted records that can reveal a lot about a user’s activity, with law enforcement agencies such as the Department of Justice. Some rivals, such as Signal, intentionally gather much less metadata to avoid incursions on their users’ privacy, and thus share far less with law enforcement. (“WhatsApp responds to valid legal requests,” the company spokesperson said, “including orders that require us to provide on a real-time going forward basis who a specific person is messaging.”)

WhatsApp user data, ProPublica has learned, helped prosecutors build a high-profile case against a Treasury Department employee who leaked confidential documents to BuzzFeed News that exposed how dirty money flows through U.S. banks.

Like other social media and communications platforms, WhatsApp is caught between users who expect privacy and law enforcement entities that effectively demand the opposite: that WhatsApp turn over information that will help combat crime and online abuse. WhatsApp has responded to this dilemma by asserting that it’s no dilemma at all. “I think we absolutely can have security and safety for people through end-to-end encryption and work with law enforcement to solve crimes,” said Will Cathcart, whose title is Head of WhatsApp, in a YouTube interview with an Australian think tank in July.

The tension between privacy and disseminating information to law enforcement is exacerbated by a second pressure: Facebook’s need to make money from WhatsApp. Since paying $22 billion to buy WhatsApp in 2014, Facebook has been trying to figure out how to generate profits from a service that doesn’t charge its users a penny.

That conundrum has periodically led to moves that anger users, regulators or both. The goal of monetizing the app was part of the company’s 2016 decision to start sharing WhatsApp user data with Facebook, something the company had told European Union regulators was technologically impossible. The same impulse spurred a controversial plan, abandoned in late 2019, to sell advertising on WhatsApp. And the profit-seeking mandate was behind another botched initiative in January: the introduction of a new privacy policy for user interactions with businesses on WhatsApp, allowing businesses to use customer data in new ways. That announcement triggered a user exodus to competing apps.

WhatsApp’s increasingly aggressive business plan is focused on charging companies for an array of services — letting users make payments via WhatsApp and managing customer service chats — that offer convenience but fewer privacy protections. The result is a confusing two-tiered privacy system within the same app, in which the protections of end-to-end encryption are further eroded whenever WhatsApp users employ the service to communicate with businesses.

The company’s December marketing presentation captures WhatsApp’s diverging imperatives. It states that “privacy will remain important.” But it also conveys what seems to be a more urgent mission: the need to “open the aperture of the brand to encompass our future business objectives.”


 

I. “Content Moderation Associates”

In many ways, the experience of being a content moderator for WhatsApp in Austin is identical to being a moderator for Facebook or Instagram, according to interviews with 29 current and former moderators. Mostly in their 20s and 30s, many with past experience as store clerks, grocery checkers and baristas, the moderators are hired and employed by Accenture, a huge corporate contractor that works for Facebook and other Fortune 500 behemoths.

The job listings advertise “Content Review” positions and make no mention of Facebook or WhatsApp. Employment documents list the workers’ initial title as “content moderation associate.” Pay starts around $16.50 an hour. Moderators are instructed to tell anyone who asks that they work for Accenture, and are required to sign sweeping non-disclosure agreements. Citing the NDAs, almost all the current and former moderators interviewed by ProPublica insisted on anonymity. (An Accenture spokesperson declined comment, referring all questions about content moderation to WhatsApp.)

When the WhatsApp team was assembled in Austin in 2019, Facebook moderators already occupied the fourth floor of an office tower on Sixth Street, adjacent to the city’s famous bar-and-music scene. The WhatsApp team was installed on the floor above, with new glass-enclosed work pods and nicer bathrooms that sparked a tinge of envy in a few members of the Facebook team. Most of the WhatsApp team scattered to work from home during the pandemic. Whether in the office or at home, they spend their days in front of screens, using a Facebook software tool to examine a stream of “tickets,” organized by subject into “reactive” and “proactive” queues.

Collectively, the workers scrutinize millions of pieces of WhatsApp content each week. Each reviewer handles upwards of 600 tickets a day, which gives them less than a minute per ticket. WhatsApp declined to reveal how many contract workers are employed for content review, but a partial staffing list reviewed by ProPublica suggests that, at Accenture alone, it’s more than 1,000. WhatsApp moderators, like their Facebook and Instagram counterparts, are expected to meet performance metrics for speed and accuracy, which are audited by Accenture.

Their jobs differ in other ways. Because WhatsApp’s content is encrypted, artificial intelligence systems can’t automatically scan all chats, images and videos, as they do on Facebook and Instagram. Instead, WhatsApp reviewers gain access to private content when users hit the “report” button on the app, identifying a message as allegedly violating the platform’s terms of service. This forwards five messages — the allegedly offending one along with the four previous ones in the exchange, including any images or videos — to WhatsApp in unscrambled form, according to former WhatsApp engineers and moderators. Automated systems then feed these tickets into “reactive” queues for contract workers to assess.

Artificial intelligence initiates a second set of queues — so-called proactive ones — by scanning unencrypted data that WhatsApp collects about its users and comparing it against suspicious account information and messaging patterns (a new account rapidly sending out a high volume of chats is evidence of spam), as well as terms and images that have previously been deemed abusive. The unencrypted data available for scrutiny is extensive. It includes the names and profile images of a user’s WhatsApp groups as well as their phone number, profile photo, status message, phone battery level, language and time zone, unique mobile phone ID and IP address, wireless signal strength and phone operating system, as well as a list of their electronic devices, any related Facebook and Instagram accounts, the last time they used the app and any previous history of violations.

The WhatsApp reviewers have three choices when presented with a ticket for either type of queue: Do nothing, place the user on “watch” for further scrutiny, or ban the account. (Facebook and Instagram content moderators have more options, including removing individual postings. It’s that distinction — the fact that WhatsApp reviewers can’t delete individual items — that the company cites as its basis for asserting that WhatsApp reviewers are not “content moderators.”)

WhatsApp moderators must make subjective, sensitive and subtle judgments, interviews and documents examined by ProPublica show. They examine a wide range of categories, including “Spam Report,” “Civic Bad Actor” (political hate speech and disinformation), “Terrorism Global Credible Threat,” “CEI” (child exploitative imagery) and “CP” (child pornography). Another set of categories addresses the messaging and conduct of millions of small and large businesses that use WhatsApp to chat with customers and sell their wares. These queues have such titles as “business impersonation prevalence,” “commerce policy probable violators” and “business verification.”

Moderators say the guidance they get from WhatsApp and Accenture relies on standards that can be simultaneously arcane and disturbingly graphic. Decisions about abusive sexual imagery, for example, can rest on an assessment of whether a naked child in an image appears adolescent or prepubescent, based on comparison of hip bones and pubic hair to a medical index chart. One reviewer recalled a grainy video in a political-speech queue that depicted a machete-wielding man holding up what appeared to be a severed head: “We had to watch and say, ‘Is this a real dead body or a fake dead body?’”

In late 2020, moderators were informed of a new queue for alleged “sextortion.” It was defined in an explanatory memo as “a form of sexual exploitation where people are blackmailed with a nude image of themselves which have been shared by them or someone else on the Internet.” The memo said workers would review messages reported by users that “include predefined keywords typically used in sextortion/blackmail messages.”

WhatsApp’s review system is hampered by impediments, including buggy language translation. The service has users in 180 countries, with the vast majority located outside the U.S. Even though Accenture hires workers who speak a variety of languages, for messages in some languages there’s often no native speaker on site to assess abuse complaints. That means using Facebook’s language-translation tool, which reviewers said could be so inaccurate that it sometimes labeled messages in Arabic as being in Spanish. The tool also offered little guidance on local slang, political context or sexual innuendo. “In the three years I’ve been there,” one moderator said, “it’s always been horrible.”

The process can be rife with errors and misunderstandings. Companies have been flagged for offering weapons for sale when they’re selling straight shaving razors. Bras can be sold, but if the marketing language registers as “adult,” the seller can be labeled a forbidden “sexually oriented business.” And a flawed translation tool set off an alarm when it detected kids for sale and slaughter, which, upon closer scrutiny, turned out to involve young goats intended to be cooked and eaten in halal meals.

The system is also undercut by the human failings of the people who instigate reports. Complaints are frequently filed to punish, harass or prank someone, according to moderators. In messages from Brazil and Mexico, one moderator explained, “we had a couple of months where AI was banning groups left and right because people were messing with their friends by changing their group names” and then reporting them. “At the worst of it, we were probably getting tens of thousands of those. They figured out some words the algorithm did not like.”

Other reports fail to meet WhatsApp standards for an account ban. “Most of it is not violating,” one of the moderators said. “It’s content that is already on the internet, and it’s just people trying to mess with users.” Still, each case can reveal up to five unencrypted messages, which are then examined by moderators.

The judgment of WhatsApp’s AI is less than perfect, moderators say. “There were a lot of innocent photos on there that were not allowed to be on there,” said Carlos Sauceda, who left Accenture last year after nine months. “It might have been a photo of a child taking a bath, and there was nothing wrong with it.” As another WhatsApp moderator put it, “A lot of the time, the artificial intelligence is not that intelligent.”

Facebook’s written guidance to WhatsApp moderators acknowledges many problems, noting “we have made mistakes and our policies have been weaponized by bad actors to get good actors banned. When users write inquiries pertaining to abusive matters like these, it is up to WhatsApp to respond and act (if necessary) accordingly in a timely and pleasant manner.” If a user appeals a ban that was prompted by a user report, according to one moderator, a second moderator examines the user’s content.


 

II.

In public statements and on the company’s websites, Facebook Inc. is noticeably vague about WhatsApp’s monitoring process. The company does not provide a regular accounting of how WhatsApp polices the platform. WhatsApp’s FAQ page and online complaint form note that it will receive “the most recent messages” from a user who has been flagged. They do not, however, disclose how many unencrypted messages are revealed when a report is filed, or that those messages are examined by outside contractors. (WhatsApp told ProPublica it limits that disclosure to keep violators from “gaming” the system.)

By contrast, both Facebook and Instagram post lengthy “Community Standards” documents detailing the criteria their moderators use to police content, along with articles and videos about “the unrecognized heroes who keep Facebook safe” and announcements on new content-review sites. Facebook’s transparency reports detail how many pieces of content are “actioned” for each type of violation. WhatsApp is not included in this report.

When dealing with legislators, Facebook Inc. officials also offer few details — but are eager to assure them that they don’t let encryption stand in the way of protecting users from images of child sexual abuse and exploitation. For example, when members of the Senate Judiciary Committee grilled Facebook about the impact of encrypting its platforms, the company, in written responses submitted in January 2020, cited WhatsApp in boasting that it would remain responsive to law enforcement. “Even within an encrypted system,” one response noted, “we will still be able to respond to lawful requests for metadata, including potentially critical location or account information… We already have an encrypted messaging service, WhatsApp, that — in contrast to some other encrypted services — provides a simple way for people to report abuse or safety concerns.”

Sure enough, WhatsApp reported 400,000 instances of possible child-exploitation imagery to the National Center for Missing and Exploited Children (NCMEC) in 2020, according to WhatsApp head Cathcart. That was ten times as many as in 2019. “We are by far the industry leaders in finding and detecting that behavior in an end-to-end encrypted service,” he said.

During his YouTube interview with the Australian think tank, Cathcart also described WhatsApp’s reliance on user reporting and its AI systems’ ability to examine account information that isn’t subject to encryption. Asked how many staffers WhatsApp employed to investigate abuse complaints from an app with more than two billion users, Cathcart didn’t mention content moderators or their access to encrypted content. “There’s a lot of people across Facebook who help with WhatsApp,” he explained. “If you look at people who work full time on WhatsApp, it’s above a thousand. I won’t get into the full breakdown of customer service, user reports, engineering, etc. But it’s a lot of that.”

In written responses for this article, the company spokesperson said: “We build WhatsApp in a manner that limits the data we collect while providing us tools to prevent spam, investigate threats, and ban those engaged in abuse, including based on user reports we receive. This work takes extraordinary effort from security experts and a valued trust and safety team that works tirelessly to help provide the world with private communication.” The spokesperson noted that WhatsApp has released new privacy features, including “more controls about how people’s messages can disappear” or be viewed only once. He added, “Based on the feedback we’ve received from users, we’re confident people understand when they make reports to WhatsApp we receive the content they send us.”


 

III. “Deceiving Users” About Personal Privacy

Since the moment Facebook announced plans to buy WhatsApp in 2014, observers wondered how the service, known for its fervent commitment to privacy, would fare inside a corporation known for the opposite. Zuckerberg had become one of the wealthiest people on the planet by using a “surveillance capitalism” approach: collecting and exploiting reams of user data to sell targeted digital ads. Facebook’s relentless pursuit of growth and profits has generated a series of privacy scandals in which it was accused of deceiving customers and regulators.

By contrast, WhatsApp knew little about its users apart from their phone numbers and shared none of that information with third parties. WhatsApp ran no ads, and its co-founders, Jan Koum and Brian Acton, both former Yahoo engineers, were hostile to them. “At every company that sells ads,” they wrote in 2012, “a significant portion of their engineering team spends their day tuning data mining, writing better code to collect all your personal data, upgrading the servers that hold all the data and making sure it’s all being logged and collated and sliced and packed and shipped out,” adding: “Remember, when advertising is involved you the user are the product.” At WhatsApp, they noted, “your data isn’t even in the picture. We are simply not interested in any of it.”

Zuckerberg publicly vowed in a 2014 keynote speech that he would keep WhatsApp “exactly the same.” He declared, “We are absolutely not going to change plans around WhatsApp and the way it uses user data. WhatsApp is going to operate completely autonomously.”

In April 2016, WhatsApp completed its long-planned adoption of end-to-end encryption, which helped establish the app as a prized communications platform in 180 countries, including many where text messages and phone calls are cost-prohibitive. International dissidents, whistleblowers and journalists also turned to WhatsApp to escape government eavesdropping.

Four months later, however, WhatsApp disclosed it would begin sharing user data with Facebook — precisely what Zuckerberg had said would not happen — a move that cleared the way for an array of future revenue-generating plans. The new WhatsApp terms of service said the app would share information such as users’ phone numbers, profile photos, status messages and IP addresses for the purposes of ad targeting, fighting spam and abuse and gathering metrics. “By connecting your phone number with Facebook’s systems,” WhatsApp explained, “Facebook can offer better friend suggestions and show you more relevant ads if you have an account with them.”

Such actions were increasingly bringing Facebook into the crosshairs of regulators. In May 2017, European Union antitrust regulators fined the company 110 million euros (about $122 million) for falsely claiming three years earlier that it would be impossible to link the user information between WhatsApp and the Facebook family of apps. The EU concluded that Facebook had “intentionally or negligently” deceived regulators. Facebook insisted its false statements in 2014 were not intentional, but didn’t contest the fine.

By the spring of 2018, the WhatsApp co-founders, now both billionaires, were gone. Acton, in what he later described as an act of “penance” for the “crime” of selling WhatsApp to Facebook, gave $50 million to a foundation backing Signal, a free encrypted messaging app that would emerge as a WhatsApp rival. (Acton’s donor-advised fund has also given money to ProPublica.)

Meanwhile, Facebook was under fire for its security and privacy failures as never before. The pressure culminated in a landmark $5 billion fine by the Federal Trade Commission in July 2019 for violating a previous agreement to protect user privacy. The fine was almost 20 times greater than any previous privacy-related penalty, according to the FTC, and Facebook’s transgressions included “deceiving users about their ability to control the privacy of their personal information.”

The FTC announced that it was ordering Facebook to take steps to protect privacy going forward, including for WhatsApp users: “As part of Facebook’s order-mandated privacy program, which covers WhatsApp and Instagram, Facebook must conduct a privacy review of every new or modified product, service, or practice before it is implemented, and document its decisions about user privacy.” Compliance officers would be required to generate a “quarterly privacy review report” and share it with the company and, upon request, the FTC.

Facebook agreed to the FTC’s fine and order. Indeed, negotiations over that agreement were the backdrop, four months earlier, for Zuckerberg’s announcement of his new commitment to privacy.

By that point, WhatsApp had begun using Accenture and other outside contractors to hire hundreds of content reviewers. But the company was eager not to step on its larger privacy message — or spook its global user base. It said nothing publicly about its hiring of contractors to review content.


 

IV. “We Kill People Based on Metadata”

Even as Zuckerberg was touting Facebook Inc.’s new commitment to privacy in 2019, he didn’t mention that his company was apparently sharing more of its WhatsApp users’ metadata than ever with the parent company — and with law enforcement.

To the lay ear, the term “metadata” can sound abstract, a word that evokes the intersection of literary criticism and statistics. To use an old, pre-digital analogy, metadata is the equivalent of what’s written on the outside of an envelope — the names and addresses of the sender and recipient and the postmark reflecting where and when it was mailed — while the “content” is what’s written on the letter sealed inside the envelope. So it is with WhatsApp messages: The content is protected, but the envelope reveals a multitude of telling details (as noted: time stamps, phone numbers and much more).

Those in the information and intelligence fields understand how crucial this information can be. It was metadata, after all, that the National Security Agency was gathering about millions of Americans not suspected of a crime, prompting a global outcry when it was exposed in 2013 by former NSA contractor Edward Snowden. “Metadata absolutely tells you everything about somebody’s life,” former NSA general counsel Stewart Baker once said. “If you have enough metadata, you don’t really need content.” In a symposium at Johns Hopkins University in 2014, Gen. Michael Hayden, former director of both the CIA and NSA, went even further: “We kill people based on metadata.”

U.S. law enforcement has used WhatsApp metadata to help put people in jail. ProPublica found more than a dozen instances in which the Justice Department sought court orders for the platform’s metadata since 2017. These represent a fraction of overall requests, known as pen register orders (a phrase borrowed from the technology used to track numbers dialed by landline telephones), as many more are kept from public view by court order. U.S. government requests for data on outgoing and incoming messages from all Facebook platforms increased by 276% from the first half of 2017 to the second half of 2020, according to Facebook Inc. statistics (which don’t break out the numbers by platform). The company’s rate of handing over at least some data in response to such requests has risen from 84% to 95% during that period.

It’s not clear exactly what government investigators have been able to gather from WhatsApp, as the results of those orders, too, are often kept from public view. Internally, WhatsApp calls such requests for information about users “prospective message pairs,” or PMPs. These provide data on a user’s messaging patterns in response to requests from U.S. law enforcement agencies, as well as those in at least three other countries — the United Kingdom, Brazil and India — according to a person familiar with the matter who shared this information on condition of anonymity. Law enforcement requests from other countries might only receive basic subscriber profile information.

WhatsApp metadata was pivotal in the arrest and conviction of Natalie “May” Edwards, a former Treasury Department official with the Financial Crimes Enforcement Network, for leaking confidential banking reports about suspicious transactions to BuzzFeed News. The FBI’s criminal complaint detailed hundreds of messages between Edwards and a BuzzFeed reporter using an “encrypted application,” which interviews and court records confirmed was WhatsApp. “On or about August 1, 2018, within approximately six hours of the Edwards pen becoming operative — and the day after the July 2018 Buzzfeed article was published — the Edwards cellphone exchanged approximately 70 messages via the encrypted application with the Reporter-1 cellphone during an approximately 20-minute time span between 12:33 a.m. and 12:54 a.m.,” FBI Special Agent Emily Eckstut wrote in her October 2018 complaint. Edwards and the reporter used WhatsApp because Edwards believed the platform to be secure, according to a person familiar with the matter.

Edwards was sentenced on June 3 to six months in prison after pleading guilty to a conspiracy charge and reported to prison last week. Edwards’ attorney declined to comment, as did representatives from the FBI and the Justice Department.

WhatsApp has for years downplayed how much unencrypted information it shares with law enforcement, largely limiting mentions of the practice to boilerplate language buried deep in its terms of service. It does not routinely keep permanent logs of who users are communicating with and how often, but company officials confirmed they do turn on such tracking at their own discretion — even for internal Facebook leak investigations — or in response to law enforcement requests. The company declined to tell ProPublica how frequently it does so.

The privacy page for WhatsApp assures users that they have total control over their own metadata. It says users can "decide if only contacts, everyone, or nobody can see your profile photo," as well as when they last opened their status updates or the app itself. Regardless of the settings a user chooses, however, WhatsApp collects and analyzes all of that data — a fact not mentioned anywhere on the page.


 

V. “Opening the Aperture to Encompass Business Objectives”

The conflict between privacy and security on encrypted platforms seems to be only intensifying. Law enforcement and child safety advocates have urged Zuckerberg to abandon his plan to encrypt all of Facebook’s messaging platforms. In June 2020, three Republican senators introduced the “Lawful Access to Encrypted Data Act,” which would require tech companies to assist in providing access to even encrypted content in response to law enforcement warrants. For its part, WhatsApp recently sued the Indian government to block its requirement that encrypted apps provide “traceability” — a method to identify the sender of any message deemed relevant to law enforcement. WhatsApp has fought similar demands in other countries.

Other encrypted platforms take a vastly different approach to monitoring their users than WhatsApp. Signal employs no content moderators, collects far less user and group data, allows no cloud backups and generally rejects the notion that it should be policing user activities. It submits no child exploitation reports to NCMEC.

Apple has touted its commitment to privacy as a selling point. Its iMessage system displays a “report” button only to alert the company to suspected spam, and the company has made just a few hundred annual reports to NCMEC, all of them originating from scanning outgoing email, which is unencrypted.

But Apple recently took a new tack, and appeared to stumble along the way. Amid intensifying pressure from Congress, in August the company announced a complex new system for identifying child-exploitative imagery on users’ iCloud backups. Apple insisted the new system poses no threat to private content, but privacy advocates accused the company of creating a backdoor that potentially allows authoritarian governments to demand broader content searches, which could result in the targeting of dissidents, journalists or other critics of the state. On Sept. 3, Apple announced it would delay implementation of the new system.

Still, it’s Facebook that seems to face the most constant skepticism among major tech platforms. It is using encryption to market itself as privacy-friendly, while saying little about the other ways it collects data, according to Lloyd Richardson, the director of IT at the Canadian Centre for Child Protection. “This whole idea that they’re doing it for personal protection of people is completely ludicrous,” Richardson said. “You’re trusting an app owned and written by Facebook to do exactly what they’re saying. Do you trust that entity to do that?” (On Sept. 2, Irish authorities announced that they are fining WhatsApp 225 million euros, about $267 million, for failing to properly disclose how the company shares user information with other Facebook platforms. WhatsApp is contesting the finding.)

Facebook’s emphasis on promoting WhatsApp as a paragon of privacy is evident in the December marketing document obtained by ProPublica. The “Brand Foundations” presentation says it was the product of a 21-member global team across all of Facebook, involving a half-dozen workshops, quantitative research, “stakeholder interviews” and “endless brainstorms.” Its aim: to offer “an emotional articulation” of WhatsApp’s benefits, “an inspirational toolkit that helps us tell our story,” and a “brand purpose to champion the deep human connection that leads to progress.” The marketing deck identifies a feeling of “closeness” as WhatsApp’s “ownable emotional territory,” saying the app delivers “the closest thing to an in-person conversation.”

WhatsApp should portray itself as “courageous,” according to another slide, because it’s “taking a strong, public stance that is not financially motivated on things we care about,” such as defending encryption and fighting misinformation. But the presentation also speaks of the need to “open the aperture of the brand to encompass our future business objectives. While privacy will remain important, we must accommodate for future innovations.”

WhatsApp is now in the midst of a major drive to make money. The drive has gotten off to a rocky start, in part because of broad suspicion about how WhatsApp will balance privacy and profits. An announced plan to begin running ads inside the app didn't help; it was abandoned in late 2019, just days before it was set to launch. Early this January, WhatsApp unveiled a change in its privacy policy — accompanied by a one-month deadline to accept the policy or get cut off from the app. The move sparked a revolt, driving tens of millions of users to rivals such as Signal and Telegram.

The policy change focused on how messages and data would be handled when users communicate with a business in the ever-expanding array of WhatsApp Business offerings. Companies could now store their chats with users and use information about users for marketing purposes, including targeting them with ads on Facebook or Instagram.

Elon Musk tweeted “Use Signal,” and WhatsApp users rebelled. Facebook delayed for three months the requirement for users to approve the policy update. In the meantime, it struggled to convince users that the change would have no effect on the privacy protections for their personal communications, with a slightly modified version of its usual assurance: “WhatsApp cannot see your personal messages or hear your calls and neither can Facebook.” Just as when the company first bought WhatsApp years before, the message was the same: Trust us.

Correction

Sept. 10, 2021: This story originally stated incorrectly that Apple’s iMessage system has no “report” button. The iMessage system does have a report button, but only for suspected spam (not for suspected abusive content).

https://www.propublica.org/article/how-facebook-undermines-privacy-protections-for-its-2-billion-whatsapp-users

John Deere turned tractors into computers — what’s next?

One of our themes on Decoder is that basically everything is a computer now, and farming equipment like tractors and combines are no different. My guest this week is Jahmy Hindman, chief technology officer at John Deere, the world’s biggest manufacturer of farming machinery. And I think our conversation will surprise you.

Jahmy told me that John Deere employs more software engineers than mechanical engineers now, which completely surprised me. But the entire business of farming is moving toward something called precision agriculture, which means farmers are closely tracking where seeds are planted, how well they’re growing, what those plants need, and how much they yield.

The idea, Jahmy says, is to have each plant on a massive commercial farm tended with individual care — a process which requires collecting and analyzing a massive amount of data. If you get it right, precision agriculture means farmers can be way more efficient — they can get better crop yields with less work and lower costs.

But as Decoder listeners know by now, turning everything into computers means everything has computer problems now. Like all that farming data: who owns it? Where is it processed? How do you get it off the tractors without reliable broadband networks? What format is it in? If you want to use your John Deere tractor with another farming analysis vendor, how easy is that? Is it easy enough?

And then there are the tractors themselves — unlike phones, or laptops, or even cars, tractors get used for decades. How should they get upgraded? How can they be kept secure? And most importantly, who gets to fix them when they break?

John Deere is one of the companies at the center of a nationwide reckoning over the right to repair. Right now, tech companies like Samsung and Apple and John Deere all get to determine who can repair their products and what official parts are available.

And because these things are all computers, these manufacturers can also use software to lock out parts from other suppliers. That dynamic runs across the tech industry, but it's a huge deal in the context of farming equipment, which is still extremely mechanical, often located far from service providers and not easy to move, and which farmers have been repairing themselves for decades. In fact, prices for older, pre-computerized tractors are skyrocketing right now because they're easier to repair.

Half of the states in the country are now considering right-to-repair laws that would require manufacturers to disable software locks and provide parts to repair shops, and much of that push is being driven — in a bipartisan way — by the needs of farmers.

John Deere is famously a tractor company. You make a lot of equipment for farmers, for construction sites, that sort of thing. Give me the short version of what the chief technology officer at John Deere does.

[As] chief technology officer, my role is really to try to set the strategic direction from a technology perspective for the company, across both our agricultural products as well as our construction, forestry, and road-building products. It’s a cool job. I get to look out five, 10, 15, 20 years into the future and try to make sure that we’re putting into place the pieces that we need in order to have the technology solutions that are going to be important for our customers in the future.

One of the reasons I am very excited to have you on Decoder is there are a lot of computer solutions in your products. There’s hardware, software, services that I think of as sort of traditional computer company problems. Do you also oversee the portfolio of technologies that [also] make combines more efficient and tractor wheels move faster?

We’ve got a centrally-organized technology stack organization. We call it the intelligent solutions group, and its job is really to do exactly that. It’s to make sure that we’re developing technologies that can scale across the complete organization, across those combines you referenced, and the tractors and the sprayers, and the construction products, and deploy that technology as quickly as possible.

One of the things The Verge wrestles with almost every day is the question of, “What is a computer?” We wrestle with it in very small and obvious ways — we argue about whether the iPad or an Xbox is a computer. Then you can zoom all the way out: we had Jim Farley, who’s the CEO of Ford, on Decoder a couple of weeks ago, and he and I talked about how Ford’s cars are effectively rolling computers now.

Is that how you see a tractor or a combine or construction equipment — that these are gigantic computers that have big mechanical functions as well?

They absolutely are. That’s what they’ve become over time. I would call them mobile sensor suites that have computational capability, not only on-board, but to your point, off-board as well. They are continuously streaming data from whatever it is — let’s say the tractor and the planter — to the cloud. We’re doing computational work on that data in the cloud, and then serving that information, those insights, up to farmers, either on their desktop computer or on a mobile handheld device or something like that.

As much as they are doing productive work in the field, planting as an example, they are also data acquisition and computational devices.

How much of that is in-house at John Deere? How big is the team that is building your mobile apps? Is that something you outsource? Is that something you develop internally? How have you structured the company to enable this kind of work?

We do a significant amount of that work internally. It might surprise you, we have more software development engineers today within Deere than we have mechanical design engineers. That’s kind of mind-blowing for a company that’s 184 years old and has been steeped in mechanical product development, but that’s the case. We do nearly all of our own internal app development inside the four walls of Deere.

That said, our data application for customers in the ag space, for example, is the Operations Center. We do utilize third parties. There are roughly 184 companies connected to the Operations Center through encrypted APIs that are writing applications against that data for the benefit of the customers, the farmers who want to use those applications within their business.

One of the reasons we’re always debating what a computer is and isn’t is that once you describe something as a computer, you inherit a bunch of expectations about how computers work. You inherit a bunch of problems about how computers work and don’t work. You inherit a bunch of control; API access is a way of exercising control over an ecosystem or an economy.

Have you shifted the way that John Deere thinks about its products? As new abilities are created because you have computerized so much of a tractor, you also increase your responsibility, because you have a bunch more control.

There’s no doubt. We’re having to think about things like security of data, as an example, that previously, 30 years ago, was not necessarily a topic of conversation. We didn’t have competency in it. We’ve had to become competent in areas like that because of exactly the point you’re making, that the product has become more computer-like than conventional tractor-like over time.

That leads to huge questions. You mentioned security. Looking at some of your recent numbers, you have a very big business in China. Thirty years ago, you would export a tractor to China and that’s the end of that conversation. Now, there’s a huge conversation about cybersecurity, data sharing with companies in China, down the line, a set of very complicated issues for a tractor company that 30 years ago wouldn’t have any of those problems. How do you balance all those out?

It’s a different set of problems for sure, and more complicated for geopolitical reasons in the case of China, as you mentioned. Let’s take security as an example. We have gone through the change that many technology companies have had to go through in the space of security, where it’s no longer bolted on at the end, it’s built in from the ground up. So it’s the security-by-design approach. We’ve got folks embedded in development organizations across the company that do nothing every day, other than get up and think about how to make the product more secure, make the datasets more secure, make sure that the data is being used for its intended purposes and only those.

That’s a new skill. That’s a skill that we didn’t have in the organization 20 years ago that we’ve had to create and hire the necessary talent in order to develop that skill set within the company at the scale that we need to develop it at.

Go through a very basic farming season with a John Deere combine and tractor. The farmer wakes up, they say, “Okay, I’ve got a field. I’ve got to plant some seeds. We’ve got to tend to them. Eventually, we’ve got to harvest some plants.” What are the points at which data is collected, what are the points at which it’s useful, and where does the feedback loop come in?

I’m going to spin it a little bit and not start with planting.

I’m going to tell you that the next season for a farmer actually starts at harvest of the previous season, and that’s where the data thread for the next season actually starts. It starts when that combine is in the field harvesting whatever it is, corn, soybeans, cotton, whatever. And while they’re running the combine through the field, the farmer is creating a dataset that we call a yield map. It is geospatially referenced. These combines are running through the field on satellite guidance. We know where they are at any point in time, latitude and longitude, and we know how much they’re harvesting at that point in time.

So we create this three-dimensional map that is the yield across whatever field they happen to be in. That data is the inception for a winter’s worth of work, in the Northern hemisphere, that a farmer goes through to assess their yield and understand what changes they should make in the next season that might optimize that yield even further.
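The yield map Hindman describes is essentially a set of geotagged yield readings binned into field zones. A minimal sketch of that idea, with invented field names and units (the real on-machine data format is proprietary, not this):

```python
from dataclasses import dataclass

@dataclass
class YieldPoint:
    lat: float               # position from satellite guidance
    lon: float
    bushels_per_acre: float  # instantaneous yield at this spot

def yield_map(points, cell=0.0001):
    """Bin raw harvest readings into grid cells and average each cell,
    so high- and low-yield zones in the field stand out."""
    grid = {}
    for p in points:
        key = (round(p.lat / cell), round(p.lon / cell))
        grid.setdefault(key, []).append(p.bushels_per_acre)
    return {k: sum(v) / len(v) for k, v in grid.items()}
```

The averaged grid is the "three-dimensional map": two spatial axes plus yield, the raw material for the winter's worth of planning that follows.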

They might have areas within the field that they go into and know they need to change seeding density, or they need to change crop type, or they need to change how much nutrients they provide in the next season. And all of those decisions are going through their head because they have to order seed in December, and they have to order their nutrients in late winter. They’re making those plans based upon that initial dataset of harvest information.

And then they get into the field in the spring, to your point, with a tractor and a planter, and that tractor and planter are taking the prescription that the farmer developed with the yield data that they took from the previous harvest. They’re using that prescription to apply changes to that field in real time as they’re going through the field, with the existing data from the yield map and the data in real time that they’re collecting with the tractor to modify things like seeding rate, and fertilizer rate and all of those things in order to make sure that they’re minimizing the inputs to the operation while at the same time working to maximize the output.
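A "prescription" of the sort Hindman describes can be pictured as a per-zone rule derived from last season's yield map. The thresholds and rates below are made up purely for illustration; real prescriptions come from agronomic models, not a three-line lookup:

```python
def prescribed_seeding_rate(last_yield_bu_per_acre):
    """Hypothetical variable-rate rule: plant denser where the zone
    historically yielded well, lighter where it did not, so inputs
    are spent only where they pay off."""
    if last_yield_bu_per_acre >= 200:
        return 36000  # seeds per acre in high-performing zones
    if last_yield_bu_per_acre >= 150:
        return 32000
    return 28000      # conserve inputs in low-yield zones

# As the planter crosses each zone, it looks up that zone's rate
# and adjusts the seeding density on the fly.
```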

That data is then going into the cloud, and they’re referencing it. For example, the track the tractor and the planter took through the field is being used to inform the sprayer. When the sprayer goes into the field after emergence, when the crops come out of the ground, that data tells the sprayer the optimal path to drive through the field in order to spray only what needs to be sprayed and no more, and to damage the crop the least amount possible — all in an effort to optimize productivity at the end of the year, so that the yield map, which is the farmer’s report card at the end of the year, turns out to have a better grade.

That’s a lot of data. Who collects it? Is John Deere collecting it? Can I hire a third-party SaaS software company to manage that data for me? How does that part work?

A significant amount of that data is collected on the fly while the machines are in the field, and it’s collected, in the case of Deere machines, by Deere equipment running through the field. There are other companies that create data, and it can be imported into things like the Deere Operations Center so that you have the data from whatever source you wanted to collect it from. I think the important thing there is that historically, it’s been more difficult to get the data off the machine, because of connectivity limitations, into a database where you can actually do something with it.

Today, the disproportionate number of machines in large agriculture are connected. They’re connected through terrestrial cell networks. They’re streaming data bi-directionally to the cloud and back from the cloud. So that data connectivity infrastructure that’s been built out over the last decade has really enabled two-way communication, and it’s taken the friction out of getting the data off of a mobile piece of equipment. So it’s happening seamlessly for that operator. And that’s a benefit, because they can act on it then in more near real time, as opposed to having to wait for somebody to upload data at some point in the future.

Whose data is this? Is it the farmer’s data? Is it John Deere’s data? Is there a terms of service agreement for a combine? How does that work?

Certainly [there is] a terms of service agreement. Our position is pretty simple. It’s the farmer’s data. They control it. So if they want to share it through an API with somebody that is a trusted adviser from their perspective, they have the right to do that. If they don’t want to share it, they don’t have to do that. It is their data to control.

Is it portable? When I say there are “computer problems” here, can my tractor deliver me, for example, an Excel file?

They certainly can export the data in form factors that are convenient for them, and they do. Spreadsheet math is still routinely done on the farm, and then [they can] utilize the spreadsheet to do some basic data analytics if they want. I would tell you, though, that what’s happening is that the amount of data that is being collected and curated and made available to them to draw insights from is so massive that while you can still use spreadsheets to manipulate some of it, it’s just not tractable in all cases. So that’s why we’re building functionality into things like the Operations Center to help do data analytics and serve up insights to growers.

It’s their data. They can choose to look at the insights or not, but we can serve those insights up to them, because the data analysis part of this problem is becoming significantly larger because the datasets are so complex and large, not to mention the fact that you’ve got more data coming in all the time. Different sensors are being applied. We can measure different things. There [are] unique pieces of information that are coming in and routinely building to overall ecosystems of data that they have at their disposal.

We’ve talked a lot about the feedback loop of data with the machinery in particular. There’s one really important component to this, which is the seeds. There are a lot of seed manufacturers out in the world. They want this data. They have GMO seeds, they can adjust the seeds to different locations. Where do they come into the mix?

The data, from our perspective, is the farmer’s data. They’re the ones who are controlling the access to it. So if they want to share their data with someone, they have that ability to do it. And they do today. They’ll share their yield map with whoever their local seed salesman is and try to optimize the seed variety for the next planting season in the spring.

So that data exists. It’s not ours, so we’re not at liberty to share it with seed companies, and we don’t. It has to come through the grower because it’s their productivity data. They’re the ones that have the opportunity to share it. We don’t.

You do have a lot of data. Maybe you can’t share it widely, but you can aggregate it. You must have a very unique view of climate change. You must see where the foodways are moving, where different kinds of crops are succeeding and failing. What is your view of climate change, given the amount of data that you’re taking in?

The reality is for us that we’re hindered in answering that question by the recency of the data. So, broad-scale data acquisition from production agriculture is really only a five- to 10-year-old phenomenon. So the datasets are getting richer. They’re getting better.

We have the opportunity to see trends in that data across the datasets that exist today, but I think it’s too early. I don’t think the data is mature enough yet for us to be able to draw any conclusions from a climate change perspective with respect to the data that we have.

The other thing that I’ll add is that the data intensity is not universal across the globe. So if you think of climate change on a global perspective, we’ve got a lot of data for North America, a fair amount of data that gets taken by growers in Europe, a little bit in South America, but it’s not rich enough across the global agricultural footprint for us to be able to make any sort of statements about how climate change is impacting it right now.

Is that something you’re interested in doing?

Yes. I couldn’t predict when, but I think that the data will eventually be rich enough for insights to be drawn from it. It’s just not there yet.

Do you think about doing a fully electric tractor? Is that in your technology roadmap, that you’ve got to get rid of these diesel engines?

You’ve got to be interested in EVs right now. And the answer is yes. Whether it’s a tractor or whether it’s some other product in our product line, alternative forms of propulsion, alternative forms of power are definitely something that we’re thinking about. We’ve done it in the past with, I would say, hybrid solutions like a diesel engine driving an electric generator, and then the rest of the machine being electrified from a propulsion perspective.

But we’re just getting to the point now where battery technology, lithium-ion technology, is power-dense enough for us to see it starting to creep into our portfolio. Probably from the bottom up. Lower power density applications first, before it gets into some of the very large production ag equipment that we’ve talked about today.

What’s the timeline to a fully EV combine, do you think?

I think it’ll be a long time for a combine.

I picked the biggest thing I could, basically.

It has got to run 14, 15, 16 hours per day. It’s got a very short window to run in. You can’t take all day to charge it. Those sorts of problems, they’re not insurmountable. They’re just not solved by anything that’s on the roadmap today, from a lithium-ion perspective, anyway.

You and I are talking two days after Apple had its developers’ conference. Apple famously sells hardware, software, services, as an integrated solution. Do you think of John Deere’s equipment as integrated suites of hardware, software, and services, or is it a piece of hardware that spits off data, and then maybe you can buy our services, or maybe buy somebody else’s services?

I think it’s most efficient when we think of it collectively as a system. It doesn’t have to be that way, and one of the differences from the Apple comparison, I would say, is that the life of the product — the iron product in our case, the tractor or the combine — is measured in decades. It may be in service for a very long time, and so we have to take that into account as we think about the technology [and] apps that we put on top of it, which have a much shorter shelf life. They’re two, three, four, five years, and then they’re obsolete, and the next best thing has come along.

We have to think about the discontinuity that occurs between product buy cycles as a consequence of that. I do think it’s most efficient to think of it all together. It isn’t always necessarily that way. There are lots of farmers that run multi-colored fleets. It’s not Deere only. So we have to be able to provide an opportunity for them to get data off of whatever their product is into the environment that best enables them to make good decisions from it.

Is that how you characterize the competition, multi-colored fleets?

Absolutely, for sure. I would love the world to be completely [John Deere] green, but it’s not quite that way.

On my way to school every day in Wisconsin growing up, I drove by a Case plant. They’re red. John Deere is famously green, Case is red, International Harvester is yellow.

Yep. Case is red, Deere is green, and then there’s a rainbow of colors outside of those two for sure.

Who are your biggest competitors? And are they adopting the same business model as you? Is this an iOS versus Android situation, or is it widely different?

Our traditional competitors in the ag space, no surprise, you mentioned one of them. Case New Holland is a great example. AGCO would be another. I think everybody’s headed down the path of precision agriculture. [It’s] the term that is ubiquitous for where the industry’s headed.

I’m going to paint a picture for you: It’s this idea of enabling each individual plant in production agriculture to be tended to by a master gardener. The master gardener is in this case probably some AI that is enabling a farmer to know exactly what that particular plant needs, when it needs it, and then our equipment provides them the capability of executing on that plan that master gardener has created for that plant on an extremely large scale.

You’re talking about, in the case of corn, for example, 50,000 plants per acre, so a master gardener taking care of 50,000 plants for every acre of corn. That’s where this is headed, and you can picture the data intensity of that. Two hundred million acres of corn ground, times 50,000 plants per acre; each one of those plants is creating data, and that’s the enormity of the scale of production agriculture when you start to get to this plant-by-plant management basis.
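Taking Hindman's figures at face value, the back-of-the-envelope math on that scale looks like this (the 16-byte per-plant record size is an assumption added purely for illustration):

```python
acres = 200_000_000        # Hindman's figure for corn ground
plants_per_acre = 50_000   # his per-acre plant count for corn

plants = acres * plants_per_acre          # 10 trillion plants to manage

# Even a tiny, hypothetical 16-byte record per plant adds up fast:
bytes_per_plant = 16
total_tb = plants * bytes_per_plant / 1e12  # 160 terabytes per season
```

Ten trillion "patients" for the AI master gardener, and hundreds of terabytes per season even under a deliberately stingy assumption about how much is recorded per plant.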

Let’s talk about the enormity of the data and the amount of computation — that’s in tension with how long the equipment lasts. Are you upgrading the computers and the tractors every year, or are you just trying to pull the data into your cloud where you can do the intense computation you want to do?

It’s a combination of both, I would tell you. There are components within the vehicles that do get upgraded from time to time. The displays and the servers that operate in the vehicles do go through upgrade cycles within the existing fleet.

There’s enough appetite, Nilay, for technology in agriculture that we’re also seeing older equipment get updated with new technology. So it’s not uncommon today for a customer who’s purchased a John Deere planter that might be 10 years old to want the latest technology on that planter. And instead of buying a new planter, they might buy the upgrade kit for that planter that allows them to have the latest technology on the existing planter that they own. That sort of stuff is happening all the time across the industry.

I would tell you, though, that what is maybe different now versus 10 years ago is the amount of computation that happens in the cloud, to serve up this enormity of data in bite-sized forms and in digestible pieces that actually can be acted upon for the grower. Very little of that is done on-board machines today. Most of that is done off-board.

We cover rural broadband very heavily. There’s some real-time data collection happening here, but what you’re really talking about is that at the end of a session you’ve got a big asynchronous dataset. You want to send it off somewhere, have some computation done to it, and brought back to you so you can react to it.

What is your relationship to the connectivity providers, or to the Biden administration, that is trying to roll out a broadband plan? Are you pushing to get better networks for the next generation of your products, or are you kind of happy with where things are now?

We’re pro-rural broadband, and in particular newer technologies, 5G as an example. And it’s not just for agricultural purposes, let’s just be frank. There’s a ton of benefits that accrue to a society that’s connected with a sufficient network to do things like online schooling, in particular, coming through the pandemic that we’re in the midst of, and hopefully on the tail end of here. I think that’s just highlighted the use cases for connectivity in rural locations.

Agriculture is but one of those, but there are some really cool feature unlocks that better connectivity, both in terms of coverage and in terms of bandwidth and latency, provides in agriculture. I’ll give you an example. Think of 5G and the ability to get to incredibly low latency numbers. It allows us to do some things from a computational perspective on the edge of the network that today we don’t have the capability to do. We either do it on-board the machine, or we don’t do it at all. So for things like serving up the real-time location of a farmer’s combine, instead of having to route that data all the way to the cloud and then back to a handheld device that the farmer might have, wouldn’t it be great if we could do that math on the edge, just ping tower to tower, serve it back down, and do it really, really quickly? Those are the sorts of use cases that open up when you get to talking about not just connectivity rurally, but 5G specifically, that are pretty exciting.

Are the networks in place to do all the things you want to do?

Globally, the answer is no. Within the US and Canadian markets, coverage improves every day. There are towers that are going up every day and we are working with our terrestrial cell coverage partners across the globe to expand coverage, and they’re responding. They see, generally, the need, in particular with respect to agriculture, for rural connectivity. They understand the power that it can provide [and] the efficiency that it can derive into food production globally. So they are incentivized to do that. And they’ve been good partners in this space. That said, they recognize that there are still gaps and there’s still a lot of ground to cover, literally in some cases, with connectivity solutions in rural locations.

You mentioned your partners. The parallels to a smartphone here are strong. Do you have different chipsets for AT&T and Verizon? Can you activate your AT&T plan right from the screen in the tractor? How does that work?

AT&T is our dominant partner in North America. That is our go-to, primarily from a coverage perspective. They’re the partner that we’ve chosen that I think serves our customers the best in the most locations.

Do you get free HBO Max if you sign up?

[laughs] Unfortunately, no.

They’re putting it everywhere. You have no idea.

For sure.

I look at the broadband gap everywhere. You mentioned schooling. We cover these very deep consumer needs. On the flip side, you need to run a lot of fiber to make 5G work, especially with the low latency that you’re talking about. You can’t have too many nodes in the way. Do you support millimeter wave 5G on a farm?

Yeah, it is something we’ve looked at. It’s intriguing. How you scale it is the question. I think if we could crack that nut, it would be really interesting.

Just for listeners, an example of millimeter wave if you’re unfamiliar: if you’re standing on just the right street corner in New York City, you can get gigabit speeds to a phone. You cross the street, and it goes away. That does not seem tenable on a farm.

That’s right. Not all data needs to be transmitted at the same rate. Not to cover the broad acreage, but you can envision a case where potentially, when you come into range of millimeter wave, you dump a bunch of data all at once. And then when you’re out of range, you’re still collecting data and transmitting it slower perhaps. But having the ability to have millimeter wave type of bandwidth is pretty intriguing for being able to take opportunistic advantage of it when it’s available.
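
The "dump a bunch of data when in range, trickle otherwise" pattern he describes can be sketched as a tiny uploader. This is an illustrative sketch, not Deere's implementation; the class name, the batch size, and the per-tick send loop are all assumptions.

```python
# Opportunistic sync sketch: trickle records over the wide-area link,
# and flush the whole buffer when a high-bandwidth (e.g. mmWave) link
# comes into range. Purely illustrative; names and rates are invented.

from collections import deque

class OpportunisticUploader:
    def __init__(self, trickle_batch=2):
        self.buffer = deque()          # data collected but not yet sent
        self.trickle_batch = trickle_batch
        self.sent = []                 # stand-in for the cloud endpoint

    def collect(self, record):
        self.buffer.append(record)

    def tick(self, mmwave_in_range):
        """Send everything if the fast link is up, else a small batch."""
        n = len(self.buffer) if mmwave_in_range else min(
            self.trickle_batch, len(self.buffer))
        for _ in range(n):
            self.sent.append(self.buffer.popleft())

u = OpportunisticUploader()
for i in range(10):
    u.collect(i)
u.tick(mmwave_in_range=False)   # trickle: 2 records go out
u.tick(mmwave_in_range=True)    # in range: remaining 8 flush at once
print(len(u.sent), len(u.buffer))  # 10 0
```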

What’s something you want to do that the network isn’t there for you to do yet?

I think that the biggest piece is just a coverage answer from my perspective. We intentionally buffer data on the vehicle in places where we don’t have great coverage, waiting until that machine has coverage to send the data. But the reality is that a grower is then waiting, in some cases, 30 minutes or an hour until the data is synced up in the cloud, something actionable has been done with it, and it’s back down to them. And by that point in time, the decision has already been made. It’s not useful because it’s time sensitive. I think that’s probably the biggest gap that we have today. It’s not universal. It happens in pockets and in geographies, but where it happens, the need is real. And those growers don’t benefit as much as growers that do have areas of good coverage.

Is that improvement going as fast as you’d like? Is that a place where you’re saying to the Biden administration, whoever it might be, “Hey, we’re missing out on opportunities because there aren’t the networks we need to go faster.”

It is not going as fast as we would like, full stop. We should be moving faster in that space. Just to tease the thought out a little bit, maybe it’s not just terrestrial cell. Maybe it’s Starlink, maybe it’s a satellite-based type of infrastructure that provides that coverage for us in the future. But it’s certainly not moving at a pace that’s rapid enough for us, given the appetite for data that growers have and what they’ve seen as an ability for that data to significantly optimize their operations.

Have you talked to the Starlink folks?

We have. It’s super interesting. It’s an intriguing idea. The question for us is a mobile one. All of our devices are mobile. Tractors are driving around a field, combines are driving around a field. You get into questions around, what does the receiver need to look like in order to make that work? It’s an interesting idea at this point. I’m ever the optimist, glass-half-full sort of person. I think it’s conceivable that in the not too distant future, that could be a very viable option for some of these locations that are underserved with terrestrial connectivity today.

Walk me through the pricing model of a tractor. These things are very expensive. They’re hundreds of thousands of dollars. What is the recurring cost for an AT&T plan necessary to run that tractor? What is the recurring cost for your data services that you provide? How does that all break down?

Our data services are free today, interestingly enough. Free in the sense [of] the hosting of the data in the cloud and the serving up of that data through Operations Center. If you buy a piece of connected Deere equipment, that service is part of your purchase. I’ll just put it that way.

The recurring expense on the consumer side of things for the connectivity is not unlike what you would experience for a cell phone plan. It’s pretty similar. The difference is for large growers, it’s not just a single cell phone.

They might have 10, 15, 20 devices that are all connected. So we do what we can to make sure that the overhead associated with all of those different connected devices is minimized, but it’s not unlike what you’d experience with an iPhone or an Android device.

Do you have large growers in pockets where the connectivity is just so bad, they’ve had to resort to other means?

We have a multitude of ways of getting data off of mobile equipment. Cell is but one. We’re also able to take it off with Wi-Fi, if you can find a hotspot that you can connect to. Growers also routinely use a USB stick, when all else fails, that works regardless. So we make it possible no matter what their connectivity situation is to get the data off.

But to the point we already talked about, the less friction you’ve got in that system to get the data off, the more data you end up pushing. The more data you push, the more insights you can generate. The more insights you generate, the more optimal your operation is. So to the extent that you don’t have cell connectivity, we do see it: the intensity of data usage tracks with connectivity.

So if your cloud services are free with the purchase of a connected tractor, is that built into the price or the lease agreement of the tractor for you on your P&L? You’re just saying, “We’re giving this away for free, but baking it into the price.”

Yep.

Can you buy a tractor without that stuff for cheaper?

You can buy products that aren’t connected that do not have a telematics gateway or the cell connection, absolutely. It is uncommon, especially in large ag. I would hesitate to throw a number at you at what the take rate is, but it’s standard equipment in all of our large agricultural products. That said, you can still get it without that if you need to.

How long until these products just don’t have steering wheels and seats and Sirius radios in them? How long until you have a fully autonomous farm?

I love that question. [With] a fully autonomous farm, you’ve got to draw some boundaries around it in order to make it digestible. I think we could have fully autonomous tractors in low single digit years. I’ll leave it a little bit gray just to let the mind wander a little bit.

Taking the cab completely off the tractor, I think, is a ways away, only because the tractor gets used for lots of things that it may not be programmed for, from an autonomous perspective, to do. It’s sort of a Swiss Army knife in a farm environment. But that operatorless operation in, say, fall tillage or spring planting, we’re right on the doorstep of that. We’re knocking on the door of being able to do it.

It’s due to some really interesting technology that’s come together all in one place at one time. It’s the confluence of high-capability compute onboard machines. So we’re putting GPUs on machines today to do vision processing that would blow your mind. Nvidia GPUs are not just for the gaming community or the autonomous car community. They’re happening on tractors and sprayers and things too. So that’s one stream of technology that’s coming together with advanced algorithms. Machine learning, reinforcement learning, convolutional neural networks, all of that going into being able to mimic the human sight capability from a mechanical and computational perspective. That’s come together to give us the ability to start seriously considering taking an operator out of the cab of the tractor.

One of the things that is different, though, for agriculture versus maybe the on-highway autonomous cars, is that tractors don’t just go from point A to point B. Their mission in life is not just to transport. It’s to do productive work. They’re pulling a tillage tool behind them or pulling a planter behind them planting seed. So we not only have to be able to automate the driving of the tractor, but we have to automate the function that it’s doing as well, and make sure that it’s doing a great job of doing the tillage operation that normally the farmer would be observing in the cab of the tractor. Now we have to do that and be able to ascertain whether or not that job quality that’s happening as a consequence of the tractor going through the field is meeting the requirements or not.

What’s the challenge there?

I think it’s the variety of jobs. In this case, let’s take the tractor example again: it’s not only doing the tillage right with this particular tillage tool, but a farmer might use three or four different tillage tools in their operation. They all have different use cases. They all require different artificial intelligence models to be trained and to be validated. So scaling out across all of those different conceivable operations is, I think, the biggest challenge.

You mentioned GPUs. GPUs are hard to get right now.

Everything’s hard to get right now.

How is the chip shortage affecting you?

It’s impacting us. Weekly, I’m in conversations with semiconductor manufacturers trying to get the parts that we need. It is an ongoing battle. We had thought probably six or seven months ago, like everybody else, that it would be relatively short-term. But I think we’re into this for the next 12 to 18 months. I think we’ll come out of it as capacity comes online, but it’s going to take a little while before that happens.

I’ve talked to a few people about the chip shortage now. The best consensus I’ve gotten is that the problem isn’t at the state of the art. The problem is with older process nodes — five or 10-year-old technology. Is that where the problem is for you as well or are you thinking about moving beyond that?

It’s most acute with older tech. So we’ve got 16-bit chipsets that we’re still working with on legacy controllers that are a pain point. But that said, we’ve also got some really recent, modern stuff that is also a pain point. I was where your head is at three months ago. And then in the three months since, we’ve felt the pain everywhere.

When you say 18 months from now, is that you think there’s going to be more supply or you think the demand is going to tail off?

Supply is certainly coming online. [The] semiconductor industry is doing the right thing. They’re trying to bring capacity online to meet the demand. I would argue it’s just a classic bullwhip effect that’s happened in the marketplace. So I think that will happen. I think there’s certainly some behavior in the industry at the moment around what the demand side is. That’s made it hard for semiconductor manufacturers to understand what real demand is because there’s a panic situation in some respects in the marketplace at the moment.

That said, I think it’s clear there’s only one direction that semiconductor volume is going, and it’s going up. Everything is going to demand it moving forward and demand more of it. So I think once we work through the next 12 to 18 months and work through this sort of immediate and near-term issue, the semiconductor industry is going to have a better handle on things, but capacity has to go up in order to meet the demand. There’s no doubt about it. A lot of that demand is real.

Are you thinking, “Man, I have these 16-bit systems. We should rearchitect things to be more modular, to be more modern, and faster,” or are you saying, “Supply will catch up”?

No, very much the former. I would say two things. One, more prevalent in supply for sure. And then the second one is, easier to change when we need to change. There’s some tech debt that we’re continuing to battle against and pay off over time. And it’s times like these when it rises to the surface and you wish you’d made decisions a little bit differently 10 years ago or five years ago.

My father-in-law, my wife’s cousins, are all farmers up and down. A lot of John Deere hats in my family. I texted them all and asked what they wanted to know. All of them came back and said “right to repair” down the line. Every single one of them. That’s what they asked me to ask you about.

I set up this whole conversation to talk about these things as computers. We understand the problems of computers. It is notable to me that John Deere and Apple had the same effective position on right to repair, which is, we would prefer if you didn’t do it and you let us do it. But there’s a lot of pushback. There are right-to-repair bills in an ever-growing number of states. How do you see that playing out right now? People want to repair their tractors. It is getting harder and harder to do it because they’re computers and you control the parts.

It’s a complex topic, first and foremost. I think the first thing I would tell you is that we have and remain committed to enabling customers to repair the products that they buy. The reality is that 98 percent of the repairs that customers want to do on John Deere products today, they can do. There’s nothing that prohibits them from doing them. Their wrenches are the same size as our wrenches. That all works. If somebody wants to go repair a diesel engine in a tractor, they can tear it down and fix it. We make the service manuals available. We make the parts available, we make the how-to available for them to tear it down to the ground and build it back up again.

That is not really what I’ve heard. I hear that a sensor goes off, the tractor goes into what people call “limp mode.” They have to bring it into a service center. They need a John Deere-certified laptop to pull the codes and actually do that work.

The diagnostic trouble codes are pushed out onto the display. The customer can see what those diagnostic trouble codes are. They may not understand or be able to connect what that sensor issue is with a root cause. There may be an underlying root cause that’s not immediately obvious to the customer based upon the fault code, but the fault code information is there. There is expertise that exists within the John Deere dealer environment, because they’ve seen those issues over time that allows them to understand what the probable cause is for that particular issue. That said, anybody can go buy the sensor. Anybody can go replace it. That’s just a reality.

There is, though, this 2 percent-ish of the repairs that occur on equipment today [that] involve software. And to your point, they’re computer environments that are driving around on wheels. So there is a software component to them. Where we differ with the right-to-repair folks is that software, in many cases, is regulated. So let’s take the diesel engine example. We are required, because it’s a regulated emissions environment, to make sure that diesel engine performs at a certain emission output — nitrogen oxides, particulate matter, and so on. Modifying software changes that. It changes the output characteristics of the emissions of the engine and that’s a regulated device. So we’re pretty sensitive to changes that would impact that. And disproportionately, those are software changes. Like going in and changing governor gain scheduling, for example, on a diesel engine would have a negative consequence on the emissions that [an] engine produces.

The same argument would apply in brake-by-wire and steer-by-wire. Do you really want a tractor going down the road with software on it that has been modified for steering or modified for braking in some way that might have a consequence that nobody thought of? We know the rigorous nature of testing that we go through in order to push software out into a production landscape. We want to make sure that that product is as safe and reliable and performs to the intended expectations of the regulatory environment that we operate in.

But people are doing it anyway. That’s the real issue here. Again, these are computer problems. This is what I hear from Apple about repairing your own iPhone. Here’s the device with all your data on it that’s on the network. Do you really want to run unsupported software on it? The valence of the debate feels the same to me.

At the same time though, is it their tractor or is it your tractor? Shouldn’t I be allowed to run whatever software I want on my computer?

I think the difference with the Apple argument is that the iPhone isn’t driving down the road at 20 miles an hour with oncoming traffic coming at it. There’s a seriousness of the change that you could make to a product. These things are large. They cost a lot of money. It’s a 40,000-pound tractor going down the road at 20 miles an hour. Do you really want to expose untested, unplanned, unknown introductions of software into a product like that that’s out in the public landscape?

But they were doing it mechanically before. Making it computerized allows you to control that behavior in a way that you cannot on a purely mechanical tractor. I know there are a lot of farmers who did dumb stuff with their mechanical tractors and that was just part of the ecosystem.

Sure. I grew up on one of those. I think the difference there is that the system is so much more complicated today, in part because of software, that it’s not always evident immediately if I make a change here, what it’s going to produce over there. When it was all mechanical, I knew, if I changed the size of the tires or the steering linkage geometry, what was going to happen. I could physically see it and the system was self-contained because it was a mechanical-only system.

I think when we’re talking about a modern piece of equipment and the complexity of the system, it’s a ripple effect. You don’t know what a change that you make over here is going to impact over there any longer. It’s not intuitively obvious to somebody who would make a change in isolation to software, for example, over here. It is a tremendously complex problem. It’s one that we’ve got a tremendously large organization that’s responsible for understanding that complete system and making sure that when the product is produced, that it is reliable and it is safe and it does meet emissions and all of those things.

I look at some of the coverage and there are farmers who are downloading software of unknown provenance that can hack around some of the restrictions. Some of that software appears to be coming from groups in Ukraine. Farmers are now using other software to get around the restrictions, which in some cases could make things even worse and lead to other unintended consequences, whereas making repair more official might actually solve some of those problems in a more straightforward way.

I think we’ve taken steps to try to help. One of those is Customer Service Advisor. Service Advisor is the John Deere software that a dealership would use to diagnose and troubleshoot equipment. We’ve made a customer version of Service Advisor available as well, to give customers some of those insights — to your point about fault codes before — into what the issues are, what they can learn about them, and how they might go about fixing them. There have been efforts underway to try to bridge some of that gap to the extent possible.

We are, though, not in a position where we would ever condone or support a third-party software being put on products of ours, because we just don’t know what the consequences of that are going to be. It’s not something that we’ve tested. We don’t know what it might make the equipment do or not do. And we don’t know what the long-term impacts of that are.

I feel like a lot of people listening to the show own a car. I’ve got a pickup truck. I can go buy a device that will upload a new tune for my Ford pickup truck’s engine. Is that something you can do to a John Deere tractor?

There are third-party outfits that will do exactly that to a John Deere engine. Yep.

But can you do that yourself?

I suspect if you had the right technical knowledge, you could probably figure out a way to do it yourself. If a third-party company figured it out, there is a way for a consumer to do it too.

Where’s the line? Where do you think your control of the system ends and the consumer’s begins? I ask that because I think that might be the most important question in computing right now, just broadly across every kind of computer in our lives. At some point, the manufacturer is like, “I’m still right here with you and I’m putting a line in front of you.” Where’s your line?

We talked about the corner cases, the use cases that I think, for us, are the lines. They’re around the regulated environment from an emissions perspective. We’ve got a responsibility when we sell a piece of equipment to make sure that it’s meeting the regulatory environment that we sold it into. And then I think the other one is in and around safety-critical systems, things that can impact others in the environment, where, again, we have a responsibility to produce a product that meets the requirements of the regulatory environment we operate in.

Not only that, but I think there’s a societal responsibility, frankly, that we make sure that the product is as safe as it can be for as long as it can be in operation. And those are where I think we spend a lot of time talking about what amounts to a very small part of the repair of a product. The statistics are real: 98 percent of the repairs that happen on a product can be done by a customer today. So we’re talking about a very small number of them, but they tend to be around those sort of sensitive use cases, regulatory and safety.

Right-to-repair legislation is very bipartisan. You’re talking about big commercial operations in a lot of states. It’s America. It’s apple pie and corn farmers. They have a lot of political weight and they’re able to make a very bipartisan push, which is pretty rare in this country right now. Is that a signal you see as, “Oh man, if we don’t get this right, the government is coming for our products?”

I think the government’s certainly one voice in this, and it’s stemming from feedback from some customers. Obviously you’ve done your own bit of work across the farmers in your family. So it is a topic that is being discussed for sure. And we’re all in favor of that discussion, by the way. I think that what we want to make sure of is that it’s an objective discussion. There are ramifications across all dimensions of this. We want to make sure that those are well understood, because it’s such an important topic and has significant enough consequences, so we want to make sure we get it right. The unintended consequences of this are not small. They will impact the industry, some of them in a negative way. And so we just want to make sure that the discussion is objective.

The other signal I’d ask you about is that prices of pre-computer tractors are skyrocketing. Maybe you see that a different way, but I’m looking at some coverage that says old tractors, pre-1990 tractors, are selling for double what they were a year or two ago. There are incredible price hikes on these old tractors. And that the demand is there because people don’t want computers in their tractors. Is that a market signal to you, that you should change the way your products work? Or are you saying, “Well, eventually those tractors will die and you won’t have a choice except to buy one of the new products”?

I think the benefits that accrue from technology are significant enough for consumers. We see this happening with the consumer vote by dollar, by what they purchase. Consumers are continuing to purchase higher levels of technology as we go on. So while yes, the demand for older tractors has gone up, in part it’s because the demand for tractors overall has gone up. With our own technology solutions, we’ve seen upticks in take rates year over year over year. So if people were averse to technology, I don’t think you’d see that. At some point we have to recognize that the benefits that technology brings outweigh the downsides of the technology. I think that’s just this part of the technology adoption curve that we’re all on.

That’s the same conversation around smartphones. I get it with smartphones. Everyone has them in their pocket. They collect all this personal data. You may want a gatekeeper there because you don’t have a sophisticated user base.

Your customers are very self-interested, commercial customers.

Yep.

Do you think you have a different kind of responsibility than, I don’t know, the Xbox Live team has to the Xbox Live community? In terms of data, in terms of control, in terms of relinquishing control of the product once it’s sold.

It certainly is a different market. It’s a different customer base. It’s a different clientele. To your point, they are dependent upon the product for their livelihood. So we do everything we can to make sure that product is reliable. It produces when it needs to produce in order to make sure that their businesses are productive and sustainable. I do think the biggest difference from the consumer market that you referenced to our market is the technology life cycle that we’re on.

You brought up tractors that are 20 years old that don’t have a ton of computers on-board versus what we have today. But what we have today is significantly more efficient than what we had 20 years ago. The tractors that you referenced are still in the market. People are still using them. They’re still putting them to work, productive work. In fact, on my family farm, they’re still being used for productive work. And I think that’s what’s different between the consumer market and the ag market. We don’t have a disposable product. You don’t just pick it up and throw it away. We have to be able to plan for that technology use across decades as opposed to maybe single-digit years.

In terms of the benefits of technology and selling that through, one of the other questions I got from the folks in my family was about the next thing that technology can enable. It seems like the equipment can’t physically get much bigger. The next thing to tackle is speed — making things faster for increased productivity.

Is that how you think about selling the benefits of technology — now the combine is as big as it can be, and it’s efficient at this massive scale. Is the next step to make it more efficient in terms of speed?

You’ve seen the industry trend that way. You look at planting as a great example. Ten years ago, we planted at three miles an hour. Today, we plant at 10 miles an hour. And what enabled that was technology. It was electric motors on row units that can react really, really quickly, that are highly controllable and can place seed really, really accurately, right? I think that’s the trend. Wisconsin’s a great place to talk about it. On a row crop farm, there’s a small window in the spring, a couple of weeks, where it’s optimal to get those crops in the ground. And so it’s an insurance policy to be able to go faster because the weather may not be great for both of those weeks that you’ve got that are optimal planting weeks. And so you may only have three days or four days in that 10-day window in order to plant all your crops.

And speed is one way to make sure that that happens. Size and the width of the machine is the other. I would agree that we’ve gotten to the point where there’s very little opportunity left in going bigger, and so going faster and, I would argue, going more intelligently, is the way that you improve productivity in the future.

So we’ve talked about a huge set of responsibilities, everything from the physical mechanical design of the machinery to building cloud services, to geopolitics. What is your decision-making process? What’s your framework for how you make decisions?

I think at the root of it, we try to drive everything back to a customer and what we can do to make that customer more productive and more sustainable. And that helps us triage. Of all the great ideas that are out there, all the things that we could work on, what are the things that can move the needle for a customer in their operation as much as possible? And I think that grounding in the customer and the customer’s business is important because, fundamentally, our business is dependent upon the farmer’s business. If the farmer does well, we do well. If the farmer doesn’t do well, we don’t do well. We’re intertwined. There’s a connection there that you can’t and shouldn’t separate.

So driving our decision-making process towards having an intimate knowledge of the customer’s business and what we can do to make their business better frames everything we do.

What’s next for John Deere? What is the short term future for precision farming? Give me a five-year prediction.

I’m super excited about what we’re calling “sense and act.” “See and spray” is the first down payment on that. It’s the ability to create, in software and through electronic and mechanical devices, the human sense of sight, and then act on it. So we’re separating, in this case, weeds from useful crop, and we’re only spraying the weeds. That reduces herbicide use within a field. It reduces the cost for the farmer, input cost into their operation. It’s a win-win-win. And it is step one in the sense-and-act trajectory or sense-and-act runway that we’re on.

There’s a lot more opportunity for us in agriculture to do more sensing and acting, and doing that in an optimal way so that we’re not painting the same picture across a complete field, but doing it more prescriptively and acting more prescriptively in areas of a field that demand different things. I think that sense-and-act type of vision is the roadmap that we’re on. There’s a ton of opportunity in there. It is technology-intensive because you’re talking sensors, you’re talking computers, and you’re talking acting with precision. All of those things require fundamental shifts in technology from where we’re at today.
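
The weeds-versus-crop economics behind “see and spray” can be illustrated with a toy decision loop. Everything here — the label field, the per-plant dose, the field mix — is a made-up assumption standing in for a real computer-vision pipeline; it only shows why spraying targeted weeds uses a fraction of the broadcast volume.

```python
# Toy "sense and act" loop in the see-and-spray spirit: spray only the
# detections classified as weeds, then compare herbicide use against
# spraying the whole field. All values are illustrative assumptions.

DOSE_PER_PLANT_ML = 2.0  # hypothetical per-plant herbicide dose

def spray_plan(detections):
    """Return the subset of detections to spray (weeds only)."""
    return [d for d in detections if d["label"] == "weed"]

field = [
    {"id": 1, "label": "crop"},
    {"id": 2, "label": "weed"},
    {"id": 3, "label": "crop"},
    {"id": 4, "label": "weed"},
    {"id": 5, "label": "crop"},
]

targeted = spray_plan(field)
targeted_ml = len(targeted) * DOSE_PER_PLANT_ML
broadcast_ml = len(field) * DOSE_PER_PLANT_ML
print(f"targeted: {targeted_ml} ml vs broadcast: {broadcast_ml} ml")
```

With two weeds among five plants, targeted spraying uses 40 percent of the broadcast volume — the herbicide and input-cost reduction described above, in miniature.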

Source: https://www.theverge.com/22533735/john-deere-cto-hindman-decoder-interview-right-to-repair-tractors

It’s time to ditch Chrome

As well as collecting your data, Chrome also gives Google a huge amount of control over how the web works

Despite a poor reputation for privacy, Google’s Chrome browser continues to dominate. The web browser has around 65 per cent market share and two billion people are regularly using it. Its closest competitor, Apple’s Safari, lags far behind with under 20 per cent market share. That’s a lot of power, even before you consider Chrome’s data collection practices. 

Is Google too big and powerful, and do you need to ditch Chrome for good? Privacy experts say yes. Chrome is tightly integrated with Google’s data gathering infrastructure, including services such as Google search and Gmail – and its market dominance gives it the power to help set new standards across the web. Chrome is one of Google’s most powerful data-gathering tools.

Google is currently under fire from privacy campaigners including rival browser makers and regulators for changes in Chrome that will spell the end of third-party cookies, the trackers that follow you as you browse. Although there are no solid plans for Europe yet, Google is planning to replace cookies with its own ‘privacy preserving’ tracking tech called FLoC, which critics say will give the firm even more power at the expense of its competitors due to the sheer scale of Chrome’s user base.

Chrome’s hefty data collection practices are another reason to ditch the browser. According to Apple’s iOS privacy labels, Google’s Chrome app can collect data including your location, search and browsing history, user identifiers and product interaction data for “personalisation” purposes. Google says this gives you the ability to enable features such as the option to save your bookmarks and passwords to your Google Account. But unlike rivals Safari, Microsoft’s Edge and Firefox, Chrome links this data to devices and individuals.

Although Chrome legitimately needs to handle browsing data, it can siphon off a large amount of information about your activities and transmit it to Google, says Rowenna Fielding, founder and director of privacy consultancy Miss IG Geek. “If you’re using Chrome to browse the internet, even in private mode, Google is watching everything you do online, all the time. This allows Google to build up a detailed and sophisticated picture about your personality, interests, vulnerabilities and triggers.”

When you sync your Google accounts to Chrome, the data slurping doesn’t stop there. Information from other Google-owned products including its email service Gmail and Google search can be combined to form a scarily accurate picture. Chrome data can be added to your geolocation history from Google Maps, the metadata from your Gmail usage, your social graph – who you interact with, both on and offline – the apps you use on your Android phone, and the products you buy with Google Pay. “That creates a very clear picture of who you are and how you live your life,” Fielding says.

As well as gathering information about your online and offline purchases, data from Google Pay can be used “in the same way as data from other Google services,” says Fielding. “This is not just what you buy, but also your location, device contacts and information, and the links those details provide so you can be identified and profiled across multiple datasets.”

Google’s power goes even further than its own browser market share. Competitor browsers such as Microsoft’s Edge are based on the same engine, Chromium. “So under the hood they are still a form of Chrome”, says Sean Wright, an independent security researcher.

Google’s massive market share has allowed the internet giant to develop web standards such as AMP in Google mobile search, which publishers must use in order to appear at the top of search results. And more recently, Chrome’s FLoC effectively gives Google control over the ad tracking tech that will replace third-party cookies – although this is being developed in the open and with feedback from other developers.

Google’s power allows it to set the direction of the industry, says Wright. “Some of those changes are good, including the move to make HTTPS encryption a default, but others are more self-serving, such as the FLoC proposal.”

Google says its Ads products do not access synced Chrome browsing history, other than for preventing spam and fraud. The firm notes that the iOS privacy labels represent the maximum categories of data that can be gathered; what is actually collected depends on the features you use in the app and how you configure your settings. It also claims its open-source FLoC API is privacy-focused and will not give Google Ads products special privileges or access.

Google says privacy and security “have always been core benefits of the Chrome browser”. A Google spokesperson highlighted the Safe Browsing features that protect against threats such as phishing and malware, as well as additional controls to help you manage your information in Chrome. In recent years the company has introduced more ways you can control your data. “Chrome offers helpful options to keep your data in sync across devices, and you control what activity gets saved to your Google Account if you choose to sign in,” the spokesperson says.

But that doesn’t change the level of data collection possible, or the fact that Google has so much sway, simply through its market dominance and joined up ad-driven ecosystem. “When you are a company that has the majority share of browsers and internet search, you suddenly have a huge amount of power,” says Matthew Gribben, a former GCHQ cybersecurity consultant. “When every web developer and SEO expert in the world needs to pander to these whims, the focus becomes on making sites work well for Google at the expense of everything else.”

And as long as people use Chrome and other services – many of which are, admittedly, more user friendly than those of rivals – then Google’s power shows no signs of diminishing. Chrome provides Google with “enormous amounts of behavioural and demographic data, control over people’s browsing experience, a platform for shaping the web to Google’s own advantage, and brand ‘capture’”, Fielding says. “When people’s favourite tools, games and sites only work with Chrome, they are reluctant to switch to an alternative.”

In theory, competition and data protection laws should provide the tools to keep Google from getting out of control, says Fielding. But in practice, “that doesn’t seem to be working for various reasons – including disparities of wealth and power between Google and national regulators”. Fielding adds that Google is also useful to many governments and economies and it is tricky to enforce national laws against a global corporation.

There are steps you can take to lock down your account, such as preventing your browsing data being collected by not syncing Chrome, and turning off third-party cookie tracking. But note that the more features you use in Chrome, the more data Google needs to ensure they can function properly. And as Google’s power and dominance continues to surge, the other option is to ditch Chrome altogether.

If you do decide to ditch Chrome, there are plenty of other feature-rich privacy browser options to consider, including Firefox, Brave and DuckDuckGo, which don’t involve giving Google any of your data.

source: https://www.wired.co.uk/article/google-chrome-browser-data

How Apple and Google Are Enabling Covid-19 Contact-Tracing

Source: https://www.wired.com/story/apple-google-bluetooth-contact-tracing-covid-19/

The tech giants have teamed up to use a Bluetooth-based framework to keep track of the spread of infections without compromising location privacy.
[Photo: a man walking in the street in Boston. The companies chose to skirt privacy pitfalls and implement a system that collects no location data. Craig F. Walker/Boston Globe/Getty Images]

Since Covid-19 began its spread across the world, technologists have proposed using so-called contact-tracing apps to track infections via smartphones. Now, Google and Apple are teaming up to give contact-tracers the ingredients to make that system possible—while in theory still preserving the privacy of those who use it.

On Friday, the two companies announced a rare joint project to create the groundwork for Bluetooth-based contact-tracing apps that can work across both iOS and Android phones. In mid-May, they plan to release an application programming interface that apps from public health organizations can tap into. The API will let those apps use a phone’s Bluetooth radios—which have a range of about 30 feet—to keep track of whether a smartphone’s owner has come into contact with someone who later turns out to have been infected with Covid-19. Once alerted, that user can then self-isolate or get tested themselves.

Crucially, Google and Apple say the system won’t involve tracking user locations or even collecting any identifying data that would be stored on a server. “This is a very unprecedented situation for the world,” said one of the joint project’s spokespeople in a phone call with WIRED. “As platform companies we’ve both been thinking hard about what we can do to help get people back to normal life and back to work effectively. We think in bringing the two platforms together we can solve digital contact tracing at scale in partnership with public health authorities and do it in a privacy-preserving way.”

Unlike Apple, which has complete control over its software and hardware and can push system-wide changes with relative ease, Google faces a fragmented Android ecosystem. The company will still make the framework available to all devices running Android 6.0 or higher by delivering the update through Google Play Services, which does not require hardware partners to sign off.

Several projects, including ones led by developers at MIT, Stanford, and the governments of Singapore and Germany, have already proposed, and in some cases implemented, similar Bluetooth-based contact-tracing systems. Google and Apple declined to say which specific groups or government agencies they’ve been working with. But they argue that by building operating-level functions those applications can tap into, the apps will be far more effective and energy efficient. Most importantly, they’ll be interoperable between the two dominant smartphone platforms.

In the version of the system set to roll out next month, the operating-system-level Bluetooth tracing would allow users to opt in to a Bluetooth-based proximity-detection scheme when they download a contact-tracing app. Their phone would then constantly ping out Bluetooth signals to others nearby while also listening for communications from nearby phones.

If two phones spend more than a few minutes within range of one another, they would each record contact with the other phone, exchanging unique, rotating identifier “beacon” numbers that are based on keys stored on each device. Public health app developers would be able to “tune” both the proximity and the amount of time necessary to qualify as a contact based on current information about how Covid-19 spreads.

If a user is later diagnosed with Covid-19, they would alert their app with a tap. The app would then upload their last two weeks of keys to a server, which would then generate their recent “beacon” numbers and send them out to other phones in the system. If someone else’s phone finds that one of these beacon numbers matches one stored on their phone, they would be notified that they’ve been in contact with a potentially infected person and given information about how to help prevent further spread.
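The key-and-beacon mechanics described above can be sketched in a few lines of code. This is a simplified illustration, not the actual Apple/Google cryptography: the key size, the HMAC-based derivation, and all function names below are assumptions for demonstration only.

```python
import hashlib
import hmac
import os

def daily_key() -> bytes:
    # Each device generates a fresh random key per day (simplified).
    return os.urandom(16)

def beacon(key: bytes, interval: int) -> bytes:
    # Rotating "beacon" identifier derived from the daily key.
    # The real protocol uses a different derivation; HMAC is illustrative.
    return hmac.new(key, interval.to_bytes(4, "big"), hashlib.sha256).digest()[:16]

def beacons_for_day(key: bytes, intervals: int = 144) -> set:
    # Re-derive every beacon a given key could have broadcast that day.
    return {beacon(key, i) for i in range(intervals)}

# Phone A broadcasts rotating beacons; phone B stores the ones it heard nearby.
key_a = daily_key()
heard_by_b = {beacon(key_a, 37), beacon(key_a, 38)}

# A is later diagnosed and uploads key_a. B re-derives A's beacons locally
# and checks for an intersection with what it heard: a non-empty overlap
# means B was exposed, without either phone revealing its location.
matches = beacons_for_day(key_a) & heard_by_b
exposed = bool(matches)
```

Note that the matching happens on the receiving phone, not on the server: the server only relays the keys of diagnosed users, which is what lets the scheme avoid central collection of contact events.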

[Graphs with illustrations of phones and humans. Courtesy of Google]

The advantage of that system, in terms of privacy, is that it doesn’t depend on collecting location data. “People’s identities aren’t tied to any contact events,” said Cristina White, a Stanford computer scientist who described a very similar Bluetooth-based contact tracing project known as Covid-Watch to WIRED last week. “What the app uploads instead of any identifying information is just this random number that the two phones would be able to track down later but that nobody else would, because it’s stored locally on their phones.”

Until now, however, Bluetooth-based schemes like the one White described have suffered from the way Apple limits access to Bluetooth when apps run in the background of iOS, a privacy and power-saving safeguard. Apple will lift that restriction specifically for contact-tracing apps. And Apple and Google say that the protocol they’re releasing will be designed to use minimal power to save phones’ battery lives. “This thing has to run 24-7, so it has to really only sip the battery life,” said one of the project’s spokespeople.

In a second iteration of the system rolling out in June, Apple and Google say they’ll allow users to enable Bluetooth-based contact-tracing even without an app installed, building the system into the operating systems themselves. This would be opt-in as well. But while the phones would exchange „beacon“ numbers via Bluetooth, users would still need to download a contact-tracing app to either declare themselves as Covid-19 positive or to learn if someone they’ve come into contact with was diagnosed.

Google and Apple’s Bluetooth-based system has some significant privacy advantages over GPS-based location-tracking systems that have been proposed by other researchers including at MIT, the University of Toronto, McGill, and Harvard. Since those systems collect location data, they would require complex cryptographic systems to avoid collecting information about users’ movements that could potentially expose highly personal information, from political dissent to extramarital affairs.

With Google and Apple’s announcement, it’s clear that the companies chose to skirt those privacy pitfalls and implement a system that collects no location data. “It looks like we won,” says Stanford’s White, whose Covid-Watch project, part of a consortium of projects using a Bluetooth-based system, had advocated for the Bluetooth-only approach. “It’s clear from the API that it was influenced by our work. It’s following the exact suggestions from our engineers about how to implement it.”

Sticking to Bluetooth alone doesn’t guarantee the system won’t violate users’ privacy, White notes. Although Google and Apple say they’ll only upload anonymous identifiers from users’ phones, a server could nonetheless identify Covid-19 users in other ways, such as based on their IP address. The organization running a given app still needs to act responsibly. “Exactly what they’re proposing for the backend still isn’t clear, and that’s really important,” White says. “We need to keep advocating to make sure this is done properly and the server isn’t collecting information it shouldn’t.”

Even with Bluetooth tracing, the app still faces some practical challenges. First, it would need significant adoption and broad willingness to share Covid-19 infection information to work. And it will also require a safeguard that only allows users to declare themselves Covid-19 positive after a healthcare provider has officially diagnosed them, so that the system isn’t overrun with false positives. Covid-Watch, for instance, would require the user to get a confirmation code from a health care provider.

Bluetooth-based systems have some problems of their own, too. If an infected person leaves traces of the novel coronavirus on a surface, for instance, someone else can pick up the virus without their phones ever having been in proximity.

A spokesperson for the Google and Apple project didn’t deny that possibility, but argued that those cases of “environmental transmission” are relatively rare compared to direct transmission from people in proximity of each other. “This won’t cut every chain of every transmission,” the spokesperson said. “But if you cut enough of them, you modulate the transmission enough to flatten the curve.”

 

Why outbreaks like coronavirus spread exponentially, and how to “flatten the curve”

[Simulations: a free-for-all vs. an attempted quarantine, and moderate vs. extensive social distancing]

This so-called exponential curve has experts worried. If the number of cases were to continue to double every three days, there would be about a hundred million cases in the United States by May.

That is math, not prophecy. The spread can be slowed, public health professionals say, if people practice “social distancing” by avoiding public spaces and generally limiting their movement.

Still, without any measures to slow it down, covid-19 will continue to spread exponentially for months. To understand why, it is instructive to simulate the spread of a fake disease through a population.

We will call our fake disease simulitis. It spreads even more easily than covid-19: whenever a healthy person comes into contact with a sick person, the healthy person becomes sick, too.

In a population of just five people, it did not take long for everyone to catch simulitis.

In real life, of course, people eventually recover. A recovered person can neither transmit simulitis to a healthy person nor become sick again after coming in contact with a sick person.

Let’s see what happens when simulitis spreads in a town of 200 people. We will start everyone in town at a random position, moving at a random angle, and we will make one person sick.

Notice how the slope of the red curve, which represents the number of sick people, rises rapidly as the disease spreads and then tapers off as people recover.

Our simulation town is small — about the size of Whittier, Alaska — so simulitis was able to spread quickly across the entire population. In a country like the United States, with its 330 million people, the curve could steepen for a long time before it started to slow.


When it comes to the real covid-19, we would prefer to slow the spread of the virus before it infects a large portion of the U.S. population. To slow simulitis, let’s try to create a forced quarantine, such as the one the Chinese government imposed on Hubei province, covid-19’s ground zero.

Whoops! As health experts would expect, it proved impossible to completely seal off the sick population from the healthy.

Leana Wen, the former health commissioner for the city of Baltimore, explained the impracticalities of forced quarantines to The Washington Post in January. “Many people work in the city and live in neighboring counties, and vice versa,” Wen said. “Would people be separated from their families? How would every road be blocked? How would supplies reach residents?”

As Lawrence O. Gostin, a professor of global health law at Georgetown University, put it: “The truth is those kinds of lockdowns are very rare and never effective.”

Fortunately, there are other ways to slow an outbreak. Above all, health officials have encouraged people to avoid public gatherings, to stay home more often and to keep their distance from others. If people are less mobile and interact with each other less, the virus has fewer opportunities to spread.

Some people will still go out. Maybe they cannot stay home because of their work or other obligations, or maybe they simply refuse to heed public health warnings. Those people are not only more likely to get sick themselves, they are more likely to spread simulitis, too.

Let’s see what happens when a quarter of our population continues to move around while the other three quarters adopt a strategy of what health experts call “social distancing.”

More social distancing keeps even more people healthy, and people can be nudged away from public places by removing their allure.

“We control the desire to be in public spaces by closing down public spaces. Italy is closing all of its restaurants. China is closing everything, and we are closing things now, too,” said Drew Harris, a population health researcher and assistant professor at The Thomas Jefferson University College of Public Health. “Reducing the opportunities for gathering helps folks social distance.”

To simulate more social distancing, instead of allowing a quarter of the population to move, we will see what happens when we let just one of every eight people move.

The four simulations you just watched — a free-for-all, an attempted quarantine, moderate social distancing and extensive social distancing — were random. That means the results of each one were unique to your reading of this article; if you scroll up and rerun the simulations, or if you revisit this page later, your results will change.

Even with different results, moderate social distancing will usually outperform the attempted quarantine, and extensive social distancing usually works best of all. Below is a comparison of your results.
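The effect the simulations demonstrate also falls out of a deterministic SIR (susceptible-infected-recovered) model: lowering the contact rate, the mathematical analogue of social distancing, lowers and delays the peak of the infected curve. The parameter values below are arbitrary illustrations, not estimates for covid-19.

```python
def sir_peak(beta: float, gamma: float = 0.1, n: float = 1.0,
             i0: float = 0.001, days: int = 400) -> float:
    """Euler-integrate the SIR equations and return the peak infected fraction.

    beta is the contact/transmission rate, gamma the recovery rate.
    """
    s, i = n - i0, i0
    peak = i
    for _ in range(days):
        new_inf = beta * s * i / n   # dS/dt = -beta * S * I / N
        s -= new_inf
        i += new_inf - gamma * i     # dI/dt = beta * S * I / N - gamma * I
        peak = max(peak, i)
    return peak

peak_free = sir_peak(beta=0.4)        # everyone mingles freely
peak_distanced = sir_peak(beta=0.15)  # contacts reduced by distancing
```

With these illustrative numbers the distanced scenario peaks at a small fraction of the free-for-all peak: the same number of people may eventually be exposed more slowly, but far fewer are sick at once, which is exactly what “flattening the curve” means for hospital capacity.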


Simulitis is not covid-19, and these simulations vastly oversimplify the complexity of real life. Yet just as simulitis spread through the networks of bouncing balls on your screen, covid-19 is spreading through our human networks — through our countries, our towns, our workplaces, our families. And, like a ball bouncing across the screen, a single person’s behavior can cause ripple effects that touch faraway people.


In one crucial respect, though, these simulations are nothing like reality: Unlike simulitis, covid-19 can kill. Though the fatality rate is not precisely known, it is clear that the elderly members of our community are most at risk of dying from covid-19.

“If you want this to be more realistic,” Harris said after seeing a preview of this story, “some of the dots should disappear.”

A Deep Dive Into the Technology of Corporate Surveillance

December 2, 2019

By Bennett Cyphers and Gennie Gebhart

Introduction

Trackers are hiding in nearly every corner of today’s Internet, which is to say nearly every corner of modern life. The average web page shares data with dozens of third-parties. The average mobile app does the same, and many apps collect highly sensitive information like location and call records even when they’re not in use. Tracking also reaches into the physical world. Shopping centers use automatic license-plate readers to track traffic through their parking lots, then share that data with law enforcement. Businesses, concert organizers, and political campaigns use Bluetooth and WiFi beacons to perform passive monitoring of people in their area. Retail stores use face recognition to identify customers, screen for theft, and deliver targeted ads.

The tech companies, data brokers, and advertisers behind this surveillance, and the technology that drives it, are largely invisible to the average user. Corporations have built a hall of one-way mirrors: from the inside, you can see only apps, web pages, ads, and yourself reflected by social media. But in the shadows behind the glass, trackers quietly take notes on nearly everything you do. These trackers are not omniscient, but they are widespread and indiscriminate. The data they collect and derive is not perfect, but it is nevertheless extremely sensitive.

This paper will focus on corporate “third-party” tracking: the collection of personal information by companies that users don’t intend to interact with. It will shed light on the technical methods and business practices behind third-party tracking. For journalists, policy makers, and concerned consumers, we hope this paper will demystify the fundamentals of third-party tracking, explain the scope of the problem, and suggest ways for users and legislation to fight back against the status quo.

Part 1 breaks down “identifiers,” or the pieces of information that trackers use to keep track of who is who on the web, on mobile devices, and in the physical world. Identifiers let trackers link behavioral data to real people.

Part 2 describes the techniques that companies use to collect those identifiers and other information. It also explores how the biggest trackers convince other businesses to help them build surveillance networks.

Part 3 goes into more detail about how and why disparate actors share information with each other. Not every tracker engages in every kind of tracking. Instead, a fragmented web of companies collect data in different contexts, then share or sell it in order to achieve specific goals.

Finally, Part 4 lays out actions consumers and policy makers can take to fight back. To start, consumers can change their tools and behaviors to block tracking on their devices. Policy makers must adopt comprehensive privacy laws to rein in third-party tracking.

Contents

Introduction
First-party vs. third-party tracking
What do they know?
Part 1: Whose Data is it Anyway: How Do Trackers Tie Data to People?
Identifiers on the Web
Identifiers on mobile devices
Real-world identifiers
Linking identifiers over time
Part 2: From bits to Big Data: What do tracking networks look like?
Tracking in software: Websites and Apps
Passive, real-world tracking
Tracking and corporate power
Part 3: Data sharing: Targeting, brokers, and real-time bidding
Real-time bidding
Group targeting and look-alike audiences
Data brokers
Data consumers
Part 4: Fighting back
On the web
On mobile phones
IRL
In the legislature

First-party vs. third-party tracking

The biggest companies on the Internet collect vast amounts of data when people use their services. Facebook knows who your friends are, what you “Like,” and what kinds of content you read on your newsfeed. Google knows what you search for and where you go when you’re navigating with Google Maps. Amazon knows what you shop for and what you buy.

The data that these companies collect through their own products and services is called “first-party data.” This information can be extremely sensitive, and companies have a long track record of mishandling it. First-party data is sometimes collected as part of an implicit or explicit contract: choose to use our service, and you agree to let us use the data we collect while you do. More users are coming to understand that for many free services, they are the product, even if they don’t like it.

However, companies collect just as much personal information, if not more, about people who aren’t using their services. For example, Facebook collects information about users of other websites and apps with its invisible “conversion pixels.” Likewise, Google uses location data to track user visits to brick-and-mortar stores. And thousands of other data brokers, advertisers, and other trackers lurk in the background of our day-to-day web browsing and device use. This is known as “third-party tracking.” Third-party tracking is much harder to identify without a trained eye, and it’s nearly impossible to avoid completely.

What do they know?

Many consumers are familiar with the most blatant privacy-invasive potential of their devices. Every smartphone is a pocket-sized GPS tracker, constantly broadcasting its location to parties unknown via the Internet. Internet-connected devices with cameras and microphones carry the inherent risk of conversion into silent wiretaps. And the risks are real: location data has been badly abused in the past. Amazon and Google have both allowed employees to listen to audio recorded by their in-home listening devices, Alexa and Home. And front-facing laptop cameras have been used by schools to spy on students in their homes.

But these better known surveillance channels are not the most common, or even necessarily the most threatening to our privacy. Even though we spend many of our waking hours in view of our devices’ Internet-connected cameras, it’s exceedingly rare for them to record anything without a user’s express intent. And to avoid violating federal and state wiretapping laws, tech companies typically refrain from secretly listening in on users’ conversations. As the rest of this paper will show, trackers learn more than enough from thousands of less dramatic sources of data. The unsettling truth is that although Facebook doesn’t listen to you through your phone, that’s just because it doesn’t need to.

The most prevalent threat to our privacy is the slow, steady, relentless accumulation of relatively mundane data points about how we live our lives. This includes things like browsing history, app usage, purchases, and geolocation data. These humble parts can be combined into an exceptionally revealing whole. Trackers assemble data about our clicks, impressions, taps, and movement into sprawling behavioral profiles, which can reveal political affiliation, religious belief, sexual identity and activity, race and ethnicity, education level, income bracket, purchasing habits, and physical and mental health.

Despite the abundance of personal information they collect, tracking companies frequently use this data to derive conclusions that are inaccurate or wrong. Behavioral advertising is the practice of using data about a user’s behavior to predict what they like, how they think, and what they are likely to buy, and it drives much of the third-party tracking industry. While behavioral advertisers sometimes have access to precise information, they often deal in sweeping generalizations and “better than nothing” statistical guesses. Users see the results when both uncannily accurate and laughably off-target advertisements follow them around the web. Across the marketing industry, trackers use petabytes of personal data to power digital tea reading. Whether trackers’ inferences are correct or not, the data they collect represents a disproportionate invasion of privacy, and the decisions they make based on that data can cause concrete harm.

Part 1: Whose Data is it Anyway: How Do Trackers Tie Data to People?

Most third-party tracking is designed to build profiles of real people. That means every time a tracker collects a piece of information, it needs an identifier—something it can use to tie that information to a particular person. Sometimes a tracker does so indirectly: by correlating collected data with a particular device or browser, which might in turn later be correlated to one person or perhaps a small group of people like a household.

To keep track of who is who, trackers need identifiers that are unique, persistent, and available. In other words, a tracker is looking for information (1) that points only to you or your device, (2) that won’t change, and (3) that it has easy access to. Some potential identifiers fit all three of these requirements, but trackers can still make use of an identifier that checks only two of these three boxes. And trackers can combine multiple weak identifiers to create a single, strong one.

An identifier that checks all three boxes might be a name, an email, or a phone number. It might also be a “name” that the tracker itself gives you, like “af64a09c2” or “921972136.1561665654”. What matters most to the tracker is that the identifier points to you and only you. Over time, it can build a rich enough profile about the person known as “af64a09c2”—where they live, what they read, what they buy—that a conventional name is not necessary. Trackers can use artificial identifiers, like cookies and mobile ad IDs, to reach users with targeted messaging. And data that isn’t tied to a real name is no less sensitive: “anonymous” profiles of personal information can nearly always be linked back to real people.
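Combining weak identifiers into a strong one is exactly what browser fingerprinting does: it hashes together several attributes that are individually common but jointly rare. A minimal sketch follows; the attribute names and values are hypothetical, chosen only to illustrate the technique.

```python
import hashlib

def fingerprint(attributes: dict) -> str:
    """Combine several weak identifiers into one strong, stable pseudonym.

    None of these attributes alone is unique, but together they can often
    single out one browser among millions.
    """
    # Canonicalize so the same attributes always hash to the same ID.
    canonical = "|".join(f"{k}={attributes[k]}" for k in sorted(attributes))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

browser = {
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "screen": "2560x1440x24",
    "timezone": "Europe/Berlin",
    "fonts": "Arial,DejaVu Sans,Noto",
    "language": "de-DE",
}
tracker_id = fingerprint(browser)  # a short hex pseudonym for this browser
```

The resulting hex string is precisely the kind of artificial “name” described above: it identifies the browser on every visit without ever containing a real name, and changing any single attribute changes the pseudonym.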

Some types of identifiers, like cookies, are features built into the tech that we use. Others, like browser fingerprints, emerge from the way those technologies work. This section will break down how trackers on the web and in mobile apps are able to identify and attribute data points.

This section will describe a representative sample of identifiers that third-party trackers can use. It is not meant to be exhaustive; there are more ways for trackers to identify users than we can hope to cover, and new identifiers will emerge as technology evolves. The tables below give a brief overview of how unique, persistent, and available each type of identifier is.

Web Identifiers (Unique / Persistent / Available)

- Cookies: yes / until the user deletes them / in some browsers without tracking protection
- IP address: yes / on the same network, may persist for weeks or months / always
- TLS state: yes / for up to one week / in most browsers
- Local storage “super cookie”: yes / until the user deletes it / only in third-party iframes; can be blocked by tracker blockers
- Browser fingerprint: only on certain browsers / yes / almost always; usually requires JavaScript access, sometimes blocked by tracker blockers


| Phone Identifiers | Unique | Persistent | Available |
| --- | --- | --- | --- |
| Phone number | Yes | Until user changes | Readily available from data brokers; only visible to apps with special permissions |
| IMSI and IMEI number | Yes | Yes | Only visible to apps with special permissions |
| Advertising ID | Yes | Until user resets | Yes, to all apps |
| MAC address | Yes | Yes | To apps: only with special permissions. To passive trackers: visible unless OS performs randomization or device is in airplane mode |


| Other Identifiers | Unique | Persistent | Available |
| --- | --- | --- | --- |
| License plate | Yes | Yes | Yes |
| Face print | Yes | Yes | Yes |
| Credit card number | Yes | Yes, for months or years | To any companies involved in payment processing |

Identifiers on the Web

Browsers are the primary way most people interact with the Web. Each time you visit a website, code on that site may cause your browser to make dozens or even hundreds of requests to hidden third parties. Each request contains several pieces of information that can be used to track you.

Anatomy of a Request

Almost every piece of data transmitted between your browser and the servers of the websites you interact with occurs in the form of an HTTP request. Basically, your browser asks a web server for content by sending it a particular URL. The web server can respond with content, like text or an image, or with a simple acknowledgement that it received your request. It can also respond with a cookie, which can contain a unique identifier for tracking purposes.

Each website you visit kicks off dozens or hundreds of different requests. The URL you see in the address bar of your browser is the address for the first request, but hundreds of other requests are made in the background. These requests can be used for loading images, code, and styles, or simply for sharing data.

A diagram depicting the various parts of a URL

Parts of a URL. The domain tells your computer where to send the request, while the path and parameters carry information that may be interpreted by the receiving server however it wants.

The URL itself contains a few different pieces of information. First is the domain, like “nytimes.com”. This tells your browser which server to connect to. Next is the path, a string at the end of the domain like “/section/world.html”. The server at nytimes.com chooses how to interpret the path, but it usually specifies a piece of content to serve—in this case, the world news section. Finally, some URLs have parameters at the end in the form of “?key1=value1&key2=value2”. The parameters usually carry extra information about the request, including queries made by the user, context about the page, and tracking identifiers.
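These pieces can be pulled apart programmatically. A short sketch using Python's standard library; the tracking parameter `uid` is hypothetical:

```python
from urllib.parse import urlparse, parse_qs

url = "https://nytimes.com/section/world.html?utm_source=ads&uid=af64a09c2"
parts = urlparse(url)

print(parts.netloc)            # the domain: which server to contact
print(parts.path)              # the path: which content to serve
print(parse_qs(parts.query))   # parameters: extra data, incl. tracking IDs
```

A tracker embedding a request on a page only has to pack its identifier into the parameters, and the receiving server can read it back out the same way.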

A computer sending a single request to a website at "eff.org."

The path of a request. After it leaves your machine, the request is redirected by your router to your ISP, which sends it through a series of intermediary routing stations in “the Internet.” Finally, it arrives at the server specified by the domain, which can decide how (or if) to respond.

The URL isn’t all that gets sent to the server. There are also HTTP headers, which contain extra information about the request like your device’s language and security settings, the “referring” URL, and cookies. For example, the User-Agent header identifies your browser type, version, and operating system. There’s also lower-level information about the connection, including IP address and shared encryption state. Some requests contain even more configurable information in the form of POST data. POST requests are a way for websites to share chunks of data that are too large or unwieldy to fit in a URL. They can contain just about anything.

Some of this information, like the URL and POST data, is specifically tailored for each individual request; other parts, like your IP address and any cookies, are sent automatically by your machine. Almost all of it can be used for tracking.
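To make that concrete, here is a rough sketch of the text one such background request might carry on the wire. All header values are invented for illustration, loosely echoing the fafsa.gov example pictured below:

```python
# A rough sketch of an HTTP GET request as raw text. Every header
# value here is invented for illustration.
request = (
    "GET /tr?id=12345&ev=PageView HTTP/1.1\r\n"
    "Host: facebook.com\r\n"
    "User-Agent: Mozilla/5.0 (X11; Linux x86_64) Firefox/68.0\r\n"
    "Referer: https://fafsa.gov/\r\n"
    "Cookie: fr=0a1b2c3d4e5f\r\n"
    "\r\n"
)

# The receiving server parses the headers into name/value pairs;
# the Referer reveals the page you were on, the Cookie reveals who
# you are, and the User-Agent reveals your browser and OS.
headers = dict(
    line.split(": ", 1)
    for line in request.split("\r\n")[1:]
    if ": " in line
)
```

Even though the user never typed facebook.com, every one of those fields travels with the background request.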

A URL bar and the data that’s sent along with a website request.

Data included with a background request. In the image, although the user has navigated to fafsa.gov, the page triggers a third-party request to facebook.com in the background. The URL isn’t the only information that gets sent to the receiving server; HTTP Headers contain information like your User Agent string and cookies, and POST data can contain anything that the server wants.

The animation immediately above contains data we collected directly from a normal version of Firefox. If you want to check it out for yourself, you can. All major browsers have an “inspector” or “developer” mode which allows users to see what’s going on behind the scenes, including all requests coming from a particular tab. In Chrome and Firefox, you can access this interface with Ctrl+Shift+I (or ⌘+Shift+I on Mac). The “Network” tab has a log of all the requests made by a particular page, and you can click on each one to see where it’s going and what information it contains.

Identifiers shared automatically

Some identifiable information is shared automatically along with each request. This is either by necessity—as with IP addresses, which are required by the underlying protocols that power the Internet—or by design—as with cookies. Trackers don’t need to do anything more than trigger a request, any request, in order to collect the information described here.

A browser sends a request to website.com. This is shown as an HTTP request, processed by a first-party server, which delivers the requested content. A separate red line shows that the HTTP request is also forwarded to a third-party server, which assigns an ID and includes a tracking cookie in the requested content.

Each time you visit a website by typing in a URL or clicking on a link, your computer makes a request to that website’s server (the “first party”). It may also make dozens or hundreds of requests to other servers, many of which may be able to track you.

Cookies

The most common tool for third-party tracking is the HTTP cookie. A cookie is a small piece of text that is stored in your browser, associated with a particular domain. Cookies were invented to help website owners determine whether a user had visited their site before, which makes them ideal for behavioral tracking. Here’s how they work.

The first time your browser makes a request to a domain (like www.facebook.com), the server can attach a Set-Cookie header to its reply. This will tell your browser to store whatever value the website wants—for example, `c_user: "100026095248544"` (an actual Facebook cookie taken from the author’s browser). Then, every time your browser makes a request to www.facebook.com in the future, it sends along the cookie that was set earlier. That way, every time Facebook gets a request, it knows which individual user or device it’s coming from.
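Python's standard library can simulate this exchange. A minimal sketch, using the cookie value quoted above:

```python
from http.cookies import SimpleCookie

# First response from the server: a Set-Cookie header assigns an ID.
set_cookie_header = 'c_user="100026095248544"; Max-Age=31536000'
jar = SimpleCookie()
jar.load(set_cookie_header)

# On every later request to the same domain, the browser echoes the
# stored value back in a Cookie header.
cookie_header = jar.output(attrs=[], header="Cookie:", sep="; ")
```

The server's whole job on subsequent visits is to look up the profile filed under that value.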

A browser sends a request to website.com. The server responds with website content and a cookie.

The first time a browser makes a request to a new server, the server can reply with a “Set-Cookie” header that stores a tracking cookie in the browser.

Not every cookie is a tracker. Cookies are also the reason that you don’t have to log in every single time you visit a website, as well as the reason your cart doesn’t empty if you leave a website in the middle of shopping. Cookies are just a means of sharing information from your browser to the website you are visiting. However, they are designed to be able to carry tracking information, and third-party tracking is their most notorious use.

Luckily, users can exercise a good deal of control over how their browsers handle cookies. Every major browser has an optional setting to disable third-party cookies (though it is usually turned off by default). In addition, Safari and Firefox have recently started restricting access to third-party cookies for domains they deem to be trackers. As a result of this “cat and mouse game” between trackers and methods to block them, third-party trackers are beginning to shift away from relying solely on cookies to identify users, and are evolving to rely on other identifiers.

Cookies are always unique, and they normally persist until a user manually clears them. Cookies are always available to trackers in unmodified versions of Chrome, but third-party cookies are no longer available to many trackers in Safari and Firefox. Users can always block cookies themselves with browser extensions.

IP Address

Each request you make over the Internet contains your IP address, a temporary identifier that’s unique to your device. Although it is unique, it is not necessarily persistent: your IP address changes every time you move to a new network (e.g., from home to work to a coffee shop). Thanks to the way IP addresses work, it may change even if you stay connected to the same network.

There are two types of IP addresses in widespread use, known as IPv4 and IPv6. IPv4 is a technology that predates the Web by a decade. It was designed for an Internet used by just a few hundred institutions, and there are only around 4 billion IPv4 addresses in the world to serve over 22 billion connected devices today. Even so, over 70% of Internet traffic still uses IPv4.

As a result, IPv4 addresses used by consumer devices are constantly being reassigned. When a device connects to the Internet, its internet service provider (ISP) gives it a “lease” on an IPv4 address. This lets the device use a single address for a few hours or a few days. When the lease is up, the ISP can decide to extend the lease or grant it a new IP. If a device remains on the same network for extended periods of time, its IP may change every few hours — or it may not change for months.

IPv6 addresses don’t have the same scarcity problem. They do not need to change, but thanks to a privacy-preserving extension to the technical standard, most devices generate a new, random IPv6 address every few hours or days. This means that IPv6 addresses may be used for short-term tracking or to link other identifiers, but cannot be used as standalone long-term identifiers.

IP addresses are not perfect identifiers on their own, but with enough data, trackers can use them to create long-term profiles of users, including mapping relationships between devices. You can hide your IP address from third-party trackers by using a trusted VPN or the Tor browser.

IP addresses are always unique, and always available to trackers unless a user connects through a VPN or Tor. Neither IPv4 nor IPv6 addresses are guaranteed to persist for longer than a few days, although IPv4 addresses may persist for several months.

TLS State

Today, most traffic on the web is encrypted using Transport Layer Security, or TLS. Any time you connect to a URL that starts with “https://” you’re connecting using TLS. This is a very good thing. The encrypted connection that TLS and HTTPS provide prevents ISPs, hackers, and governments from spying on web traffic, and it ensures that data isn’t being intercepted or modified on the way to its destination.

However, it also opens up new ways for trackers to identify users. TLS session IDs and session tickets are cryptographic identifiers that help speed up encrypted connections. When you connect to a server over HTTPS, your browser starts a new TLS session with the server.

The session setup involves some expensive cryptographic legwork, so servers don’t like to do it more often than they have to. Instead of performing a full cryptographic “handshake” between the server and your browser every time you reconnect, the server can send your browser a session ticket that encodes some of the shared encryption state. The next time you connect to the same server, your browser sends the session ticket, allowing both parties to skip the handshake. The only problem with this is that the session ticket can be exploited by trackers as a unique identifier.
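The abuse is easy to picture: a server that remembers which ticket it handed to which visitor can use the ticket exactly like a cookie. A deliberately simplified toy model follows (no real TLS involved; tickets are assumed to be opaque random tokens):

```python
import secrets

class TicketTrackingServer:
    """Toy model: issue a random 'session ticket' on first contact,
    then recognize the visitor whenever the ticket comes back."""

    def __init__(self):
        self.profiles = {}

    def handshake(self, ticket=None):
        if ticket is None or ticket not in self.profiles:
            ticket = secrets.token_hex(16)   # fresh ticket = new visitor
            self.profiles[ticket] = []
        return ticket

    def record(self, ticket, event):
        self.profiles[ticket].append(event)

server = TicketTrackingServer()
t = server.handshake()              # first visit: full handshake, ticket issued
server.record(t, "read article A")
t2 = server.handshake(ticket=t)     # later visit: ticket resumes the session
server.record(t2, "read article B")
```

From the server's perspective, both visits file under the same profile, with no cookie ever set.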

TLS session tracking was only brought to the public’s attention recently in an academic paper, and it’s not clear how widespread its use is in the wild.

Like IP addresses, session tickets are always unique. They are available unless the user’s browser is configured to reject them, as Tor is. Server operators can usually configure session tickets to persist for up to a week, but browsers do reset them after a while.

Identifiers created by trackers

Sometimes, web-based trackers want to use identifiers beyond just IP addresses (which are unreliable and not persistent), cookies (which a user can clear or block), or TLS state (which expires within hours or days). To do so, trackers need to put in a little more effort. They can use JavaScript to save and load data in local storage or perform browser fingerprinting.

Local storage “cookies” and IFrames

Local storage is a way for websites to store data in a browser for long periods of time. Local storage can help a web-based text editor save your settings, or allow an online game to save your progress. Like cookies, local storage allows third-party trackers to create and save unique identifiers in your browser.

Also like cookies, data in local storage is associated with a specific domain. This means if example.com sets a value in your browser, only example.com web pages and example.com’s IFrames can access it. An IFrame is like a small web page within a web page. Inside an IFrame, a third-party domain can do almost everything a first-party domain can do. For example, embedded YouTube videos are built using IFrames; every time you see a YouTube video on a site other than YouTube, it’s running inside a small page-within-a-page. For the most part, your browser treats the YouTube IFrame like a full-fledged web page, giving it permission to read and write to YouTube’s local storage. Sure enough, YouTube uses that storage to save a unique “device identifier” and track users on any page with an embedded video.
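This origin scoping can be modeled as a store keyed by (origin, key): a third party reads and writes only its own slice, but that slice is reachable from its IFrame on any embedding page. A toy sketch, not a real browser API (the `device_id` value is hypothetical):

```python
class LocalStorage:
    """Toy model of browser local storage: values are scoped to the
    origin of the frame that sets them, not to the page being visited."""

    def __init__(self):
        self.store = {}

    def set_item(self, origin, key, value):
        self.store[(origin, key)] = value

    def get_item(self, origin, key):
        return self.store.get((origin, key))

browser = LocalStorage()
# An embedded youtube.com IFrame on one site saves a device ID...
browser.set_item("youtube.com", "device_id", "yt-4f9a")
# ...and a youtube.com IFrame on a different site reads the same value,
# because storage is keyed by the IFrame's origin, not the page's.
found = browser.get_item("youtube.com", "device_id")
```

The embedding pages themselves see nothing under their own origins, which is exactly why the identifier survives as the user moves between sites.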

Local storage “cookies” are unique, and they persist until a user manually clears their browser storage. They are only available to trackers which are able to run JavaScript code inside a third-party IFrame. Not all cookie-blocking measures take local storage cookies into account, so local storage cookies may sometimes be available to trackers for which normal cookie access is blocked.

Fingerprinting

Browser fingerprinting is one of the most complex and insidious forms of web-based tracking. A browser fingerprint consists of one or more attributes that, on their own or when combined, uniquely identify an individual browser on an individual device. Usually, the data that go into a fingerprint are things that the browser can’t help exposing, because they’re just part of the way it interacts with the web. These include information sent along with the request made every time the browser visits a site, along with attributes that can be discovered by running JavaScript on the page. Examples include the resolution of your screen, the specific version of software you have installed, and your time zone. Any information that your browser exposes to the websites you visit can be used to help assemble a browser fingerprint. You can get a sense of your own browser’s fingerprint with EFF’s Panopticlick project.

The reliability of fingerprinting is a topic of active research, and must be measured against the backdrop of ever-evolving web technologies. However, it is clear that new techniques increase the likelihood of unique identification, and the number of sites that use fingerprinting is increasing as well. A recent report found that at least a third of the top 500 sites visited by Americans employ some form of browser fingerprinting. The prevalence of fingerprinting on sites also varies considerably with the category of website.

Researchers have found canvas fingerprinting techniques to be particularly effective for browser identification. The HTML Canvas is a feature of HTML5 that allows websites to render complex graphics inside of a web page. It’s used for games, art projects, and some of the most beautiful sites on the Web. Because it’s so complex and performance-intensive, it works a little bit differently on each different device. Canvas fingerprinting takes advantage of this.

Subtle differences in the way shapes and text are rendered on the two computers lead to very different fingerprints.

Canvas fingerprinting. A tracker renders shapes, graphics, and text in different fonts, then computes a “hash” of the pixels that get drawn. The hash will be different on devices with even slight differences in hardware, firmware, or software.

A tracker can create a “canvas” element that’s invisible to the user, render a complicated shape or string of text using JavaScript, then extract data about exactly how each pixel on the canvas is rendered. The operating system, browser version, graphics card, firmware version, graphics driver version, and fonts installed on your computer all affect the final result.

For the purposes of fingerprinting, individual characteristics are hardly ever measured in isolation. Trackers are most effective in identifying a browser when they combine multiple characteristics together, stitching the bits of information left behind into a cohesive whole. Even if one characteristic, like a canvas fingerprint, is itself not enough to uniquely identify your browser, it can usually be combined with others — your language, time zone, or browser settings — in order to identify you. And using a combination of simple bits of information is much more effective than you might guess.
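A minimal sketch of that combination step: hash several weak attributes into a single fingerprint, and note how their information adds up. Both the attribute values and the per-attribute bit estimates here are illustrative assumptions:

```python
import hashlib

# Hypothetical attributes a script can read, with rough (invented)
# estimates of how many bits of identifying information each carries.
attributes = {
    "user_agent": ("Mozilla/5.0 (X11; Linux x86_64) Firefox/68.0", 10.0),
    "timezone": ("America/Los_Angeles", 3.0),
    "screen": ("1920x1080x24", 4.8),
    "canvas_hash": ("7d1f00e2", 8.6),
}

# The fingerprint is simply a hash over all attribute values in order;
# any change to any attribute yields a completely different hash.
fingerprint = hashlib.sha256(
    "|".join(v for v, _ in attributes.values()).encode()
).hexdigest()

# Bits of identifying information add up. About 26 bits already
# narrows 8 billion people down to a pool of roughly a hundred.
total_bits = sum(bits for _, bits in attributes.values())
```

This is why no single attribute needs to be unique: the sum of several mundane ones quickly is.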

Fingerprints are often, but not always, unique. Some browsers, like Tor and Safari, are specifically designed so that their users are more likely to look the same, which removes or limits the effectiveness of browser fingerprinting. Browser fingerprints tend to persist as long as a user has the same hardware and software: there’s no setting you can fiddle with to “reset” your fingerprint. And fingerprints are usually available to any third parties who can run JavaScript in your browser.

Identifiers on mobile devices

Smartphones, tablets, and ebook readers usually have web browsers that work the same way desktop browsers do. That means that these types of connected devices are susceptible to all of the kinds of tracking described in the section above.

However, mobile devices are different in two big ways. First, users typically need to sign in with an Apple, Google, or Amazon account to take full advantage of the devices’ features. This links device identifiers to an account identity, and makes it easier for those powerful corporate actors to profile user behavior. For example, in order to save your home and work address in Google Maps, you need to turn on Google’s “Web and App Activity,” which allows it to use your location, search history, and app activity to target ads.

Second, and just as importantly, most people spend most of their time on their mobile device in apps outside of the browser. Trackers in apps can’t access cookies the same way web-based trackers can. But by taking advantage of the way mobile operating systems work, app trackers can still access unique identifiers that let them tie activity back to your device. In addition, mobile phones—particularly those running the Android and iOS operating systems—have access to a unique set of identifiers that can be used for tracking.

In the mobile ecosystem, most tracking happens by way of third-party software development kits, or SDKs. An SDK is a library of code that app developers can choose to include in their apps. For the most part, SDKs work just like the Web resources that third parties exploit, as discussed above: they allow a third party to learn about your behavior, device, and other characteristics. An app developer who wants to use a third-party analytics service or serve third-party ads downloads a piece of code from, for example, Google or Facebook. The developer then includes that code in the published version of their app. The third-party code thus has access to all the data that the app does, including data protected behind any permissions that the app has been granted, such as location or camera access.

On the web, browsers enforce a distinction between “first party” and “third party” resources. That allows them to put extra restrictions on third-party content, like blocking their access to browser storage. In mobile apps, this distinction doesn’t exist. You can’t grant a privilege to an app without granting the same privilege to all the third-party code running inside it.

Phone numbers

The phone number is one of the oldest unique numeric identifiers, and one of the easiest to understand. Each number is unique to a particular device, and numbers don’t change often. Users are encouraged to share their phone numbers for a wide variety of reasons (e.g., account verification, electronic receipts, and loyalty programs in brick-and-mortar stores). Thus, data brokers frequently collect and sell phone numbers. But phone numbers aren’t easy to access from inside an app. On Android, phone numbers are only available to third-party trackers in apps that have been granted certain permissions. iOS prevents apps from accessing a user’s phone number at all.

Phone numbers are unique and persistent, but generally not available to third-party trackers inside apps.

Hardware identifiers: IMSI and IMEI

Every device that can connect to a mobile network is assigned a unique identifier called an International Mobile Subscriber Identity (IMSI) number. IMSI numbers are assigned to users by their mobile carriers and stored on SIM cards, and normal users can’t change their IMSI without changing their SIM. This makes them ideal identifiers for tracking purposes.

Similarly, every mobile device has an International Mobile Equipment Identity (IMEI) number “baked” into the hardware. You can change your SIM card and your phone number, but you can’t change your IMEI without buying a new device.

IMSI numbers are shared with your cell provider every time you connect to a cell tower—which is all the time. As you move around the world, your phone sends out pings to nearby towers to request information about the state of the network. Your phone carrier can use this information to track your location (to varying degrees of accuracy). This is not quite third-party tracking, since it is carried out by a phone company you have a relationship with, but many users may nevertheless not realize that it’s happening.

Software and apps running on a mobile phone can also access IMSI and IMEI numbers, though not as easily. Mobile operating systems lock access to hardware identifiers behind permissions that users must approve and can later revoke. For example, starting with Android Q, apps need to request the “READ_PRIVILEGED_PHONE_STATE” permission in order to read non-resettable IDs. On iOS, it’s not possible for apps to access these identifiers at all. This makes other identifiers more attractive options for most app-based third-party trackers. Like phone numbers, IMSI and IMEI numbers are unique and persistent, but not readily available, as most trackers have a hard time accessing them.

Advertising IDs

An advertising ID is a long, random string of letters and numbers that uniquely identifies a mobile device. Advertising IDs aren’t part of any technical protocols, but are built into the iOS and Android operating systems.

Ad IDs on mobile phones are analogous to cookies on the Web. Instead of being stored by your browser and shared with trackers on different websites like cookies, ad IDs are stored by your phone and shared with trackers in different apps. Ad IDs exist for the sole purpose of helping behavioral advertisers link user activity across apps on a device.

Unlike IMSI or IMEI numbers, ad IDs can be changed and, on iOS, turned off completely. Ad IDs are enabled by default on both iOS and Android, and are available to all apps without any special permissions. On both platforms, the ad ID does not reset unless the user does so manually.

Both Google and Apple encourage developers to use ad IDs for behavioral profiling in lieu of other identifiers like IMEI or phone number. Ostensibly, this gives users more control over how they are tracked, since users can reset their identifiers by hand if they choose. However, in practice, even if a user goes to the trouble to reset their ad ID, it’s very easy for trackers to identify them across resets by using other identifiers, like IP address or in-app storage. Android’s developer policy instructs trackers not to engage in such behavior, but the platform has no technical safeguards to stop it. In February 2019, a study found that over 18,000 apps on the Play store were violating Google’s policy.
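A sketch of the policy-violating linkage described above: when the ad ID changes but a secondary identifier (here, an IP address) persists, the tracker can simply merge the old and new profiles. All values are hypothetical:

```python
def link_profiles(events):
    """Group events by ad ID, then merge groups that share a
    secondary identifier (an IP address), defeating the reset.
    `events` is a list of (ad_id, ip) observations."""
    profiles = []  # each profile: {"ad_ids": set, "ips": set}
    for ad_id, ip in events:
        match = next(
            (p for p in profiles if ad_id in p["ad_ids"] or ip in p["ips"]),
            None,
        )
        if match is None:
            match = {"ad_ids": set(), "ips": set()}
            profiles.append(match)
        match["ad_ids"].add(ad_id)
        match["ips"].add(ip)
    return profiles

observations = [
    ("ad-id-OLD", "198.51.100.7"),   # before the user resets
    ("ad-id-OLD", "198.51.100.7"),
    ("ad-id-NEW", "198.51.100.7"),   # after the reset: same IP links them
]
profiles = link_profiles(observations)
```

The reset changes the label on the profile, not the profile itself.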

Ad IDs are unique, and available to all apps by default. They persist until users manually reset them. That makes them very attractive identifiers for surreptitious trackers.

MAC addresses

Every device that can connect to the Internet has a hardware identifier called a Media Access Control (MAC) address. MAC addresses are used to set up the initial connection between two wireless-capable devices over WiFi or Bluetooth.

MAC addresses are used by all kinds of devices, but the privacy risks associated with them are heightened on mobile devices. Websites and other servers you interact with over the Internet can’t actually see your MAC address, but any networking devices in your area can. In fact, you don’t even have to connect to a network for nearby devices to see your MAC address; being nearby is enough.

Here’s how it works. In order to find nearby Bluetooth devices and WiFi networks, your device is constantly sending out short radio signals called probe requests. Each probe request contains your device’s unique MAC address. If there is a WiFi hotspot in the area, it will hear the probe and send back its own “probe response,” addressed with your device’s MAC, with information about how you can connect to it.

But other devices in the area can see and intercept the probe requests, too. This means that companies can set up wireless “beacons” that silently listen for MAC addresses in their vicinity, then use that data to track the movement of specific devices over time. Beacons are often set up in businesses, at public events, and even in political campaign yard signs. With enough beacons in enough places, companies can track users’ movement around stores or around a city. They can also identify when two people are in the same location and use that information to build a social graph.

A smartphone emits probe request to scan for available WiFi and Bluetooth connections. Several wireless beacons listen passively to the requests.

In order to find nearby Bluetooth devices and WiFi networks, your device is constantly sending out short radio signals called probe requests. Each probe request contains your device’s unique MAC address. Companies can set up wireless “beacons” that silently listen for MAC addresses in their vicinity, then use that data to track the movement of specific devices over time.

This style of tracking can be thwarted with MAC address randomization. Instead of sharing its true, globally unique MAC address in probe requests, your device can make up a new, random, “spoofed” MAC address to broadcast each time. This makes it impossible for passive trackers to link one probe request to another, or to link them to a particular device. Luckily, the latest versions of iOS and Android both include MAC address randomization by default.
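Randomization works by substituting a "locally administered" address for the hardware one. A minimal sketch of generating such an address (real operating systems layer extra policies on top, such as keeping the randomized address stable per network):

```python
import secrets

def random_mac():
    """Generate a randomized MAC address for probe requests.
    In the first byte, the locally-administered bit is set (0x02)
    and the multicast bit is cleared (0x01), marking the address as
    software-generated rather than a hardware identifier."""
    first = (secrets.randbits(8) | 0x02) & 0xFE
    rest = [secrets.randbits(8) for _ in range(5)]
    return ":".join(f"{b:02x}" for b in [first] + rest)

# Each scan can use a fresh address, so passive beacons can't link
# one probe request to the next.
mac = random_mac()
```

Because the true hardware address is only revealed, if at all, after joining a trusted network, passive beacons listening to probes see a different "device" each time.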

MAC address tracking remains a risk for laptops, older phones, and other devices, but the industry is trending towards more privacy-protective norms.

Hardware MAC addresses are globally unique. They are also persistent, not changing for the lifetime of a device. They are not readily available to trackers in apps, but are available to passive trackers using wireless beacons. However, since many devices now obfuscate MAC addresses by default, they are becoming a less reliable identifier for passive tracking.

Real-world identifiers

Many electronic device identifiers can be reset, obfuscated, or turned off by the user. But real-world identifiers are a different story: it’s illegal to cover your car’s license plate while driving (and often while parked), and just about impossible to change biometric identifiers like your face and fingerprints.

License plates

Every car in the United States is legally required to have a license plate, which is tied to its owner’s real-world identity. As far as tracking identifiers go, license plate numbers are about as good as it gets. They are easy to spot and illegal to obfuscate. They can’t be changed easily, and they follow most people wherever they go.

Automatic license plate readers, or ALPRs, are special-purpose cameras that can automatically identify and record license plate numbers on passing cars. ALPRs can be installed at fixed points, like busy intersections or mall parking lots, or on other vehicles like police cars. Private companies operate ALPRs, use them to amass vast quantities of traveler location data, and sell this data to other businesses (as well as to police).

Unfortunately, tracking by ALPRs is essentially unavoidable for people who drive. It’s not legal to hide or change your license plate, and since most ALPRs operate in public spaces, it’s extremely difficult to avoid the devices themselves.

License plates are unique, available to anyone who can see the vehicle, and extremely persistent. They are ideal identifiers for gathering data about vehicles and their drivers, both for law enforcement and for third-party trackers.

Face biometrics

Faces are another class of unique identifier, and one that is extremely attractive to third-party trackers: they are unique and highly inconvenient to change. Unlike license plates, it’s not illegal to hide your face from the general public, but it is impractical for most people to do so.

Everyone’s face is unique, available, and persistent. However, current face recognition software will sometimes confuse one face for another. Furthermore, research has shown that algorithms are much more prone to making these kinds of errors when identifying people of color, women, and older individuals.

Facial recognition has already seen widespread deployment, but we are likely just beginning to feel the extent of its impact. In the future, facial recognition cameras may be in stores, on street corners, and docked on computer-aided glasses. Without strong privacy regulations, average people will have virtually no way to fight back against pervasive tracking and profiling via facial recognition.

Credit/debit cards

Credit card numbers are another excellent long-term identifier. While they can be cycled out, most people don’t change their credit card numbers nearly as often as they clear their cookies. Additionally, credit card numbers are tied directly to real names, and anyone who receives your credit card number as part of a transaction also receives your legal name.

What most people may not understand is the number of hidden third parties involved in each credit card transaction. If you buy a widget at a local store, the store probably contracts with a payment processor that provides card-handling services. The transaction also must be verified by your bank as well as the bank of the card provider. The payment processor in turn may employ other companies to validate its transactions, and all of these companies may receive information about the purchase. Banks and other financial institutions are regulated by the Gramm-Leach-Bliley Act, which mandates data security standards, requires them to disclose how user data is shared, and gives users the right to opt out of sharing. However, other financial technology companies, like payment processors and data aggregators, are significantly less regulated.

Linking identifiers over time

Often, a tracker can’t rely on a single identifier to act as a stable link to a user. IP addresses change, people clear cookies, ad IDs can be reset, and more savvy users might have “burner” phone numbers and email addresses that they use to try to separate parts of their identity. When this happens, trackers don’t give up and start a new user profile from scratch. Instead, they typically combine several identifiers to create a unified profile. This way, they are less likely to lose track of the user when one identifier or another changes, and they can link old identifiers to new ones over time.

Trackers have an advantage here because there are so many different ways to identify a user. If a user clears their cookies but their IP address doesn’t change, linking the old cookie to the new one is trivial. If they move from one network to another but use the same browser, a browser fingerprint can link their old session to their new one. If they block third-party cookies and use a hard-to-fingerprint browser like Safari, trackers can use first-party cookie sharing in combination with TLS session data to build a long-term profile of user behavior. In this cat-and-mouse game, trackers have technological advantages over individual users.
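A sketch of how this kind of profile linking might work: whenever a new session shares any identifier with an existing profile, the two are merged into one. This is an illustration of the general technique, not any real tracker's implementation; the data structures and identifier names are invented for the example.

```python
# Illustrative sketch (not any real tracker's code): merging sessions into
# long-term profiles whenever any identifier value overlaps.

def link_sessions(sessions):
    """Group sessions into profiles; two sessions belong to the same
    profile if they share at least one identifier value."""
    profiles = []  # each profile is (set_of_identifiers, list_of_sessions)
    for session in sessions:
        ids = set(session["identifiers"].values())
        # Find every existing profile that shares an identifier with this session.
        matches = [p for p in profiles if p[0] & ids]
        merged_ids, merged_sessions = ids, [session]
        for p in matches:
            merged_ids |= p[0]
            merged_sessions = p[1] + merged_sessions
            profiles.remove(p)
        profiles.append((merged_ids, merged_sessions))
    return profiles

sessions = [
    {"identifiers": {"cookie": "abc", "ip": "1.2.3.4"}},
    # The user cleared cookies, but the IP address stayed the same ...
    {"identifiers": {"cookie": "xyz", "ip": "1.2.3.4"}},
    # ... then changed networks, but kept the new cookie.
    {"identifiers": {"cookie": "xyz", "ip": "5.6.7.8"}},
]
profiles = link_sessions(sessions)
print(len(profiles))  # all three sessions collapse into a single profile
```

Even though no single identifier survives across all three sessions, the overlaps chain them together into one unified profile.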

Part 2: From bits to Big Data: What do tracking networks look like?

In order to track you, most tracking companies need to convince website or app developers to include custom tracking code in their products. That’s no small thing: tracking code can have a number of undesirable effects for publishers. It can slow down software, annoy users, and trigger regulation under laws like GDPR. Yet the largest tracking networks cover vast swaths of the Web and the app stores, collecting data from millions of different sources all the time. In the physical world, trackers can be found in billboards, retail stores, and mall parking lots. So how and why are trackers so widespread? In this section, we’ll talk about what tracking networks look like in the wild.

A bar graph showing market share of different web tracking companies. Google is the most prevalent, monitoring over 80% of traffic on the web.

Top trackers on the Web, ranked by the proportion of web traffic that they collect data from. Google collects data about over 80% of measured web traffic. Source: WhoTracks.me, by Cliqz GmbH.

Tracking in software: Websites and Apps

Ad networks

A graphic of a web page, with three ads separated and outlined. Each ad is served by a different ad server.

Each ad your browser loads may come from a different advertising server, and each server can build its own profile of you based on your activity. Each time you connect to that server, it can use a cookie to link that activity to your profile.

The dominant market force behind third-party tracking is the advertising industry, as discussed below in Part 3. So it’s no surprise that online ads are one of the primary vectors for data collection. In the simplest model, a single third-party ad network serves ads on a number of websites. Each publisher that works with the ad network must include a small snippet of code on their website that will load an ad from the ad server. This triggers a request to the ad server each time a user visits one of the cooperating sites, which lets the ad server set third-party cookies into users’ browsers and track their activity across the network. Similarly, an ad server might provide an ad-hosting software development kit (SDK) for mobile app developers to use. Whenever a user opens an app that uses the SDK, the app makes a request to the ad server. This request can contain the advertising ID for the user’s device, thus allowing the ad server to profile the user’s activity across apps.
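The cross-site tracking this enables can be sketched in a few lines. The simulation below is illustrative only; the class, cookie format, and site names are invented, but the flow matches the description above: the ad server sets a cookie on first contact, and every publisher page that loads one of its ads sends the cookie back.

```python
import itertools

# Illustrative simulation (all names hypothetical): a third-party ad server
# that sets a cookie on first contact and logs every page that cookie visits.

class AdServer:
    def __init__(self):
        self._counter = itertools.count(1)
        self.profiles = {}  # cookie -> list of pages visited

    def serve_ad(self, cookie, page):
        """Called each time a publisher page loads an ad from this server."""
        if cookie is None:                      # first visit: set a new cookie
            cookie = f"uid-{next(self._counter)}"
            self.profiles[cookie] = []
        self.profiles[cookie].append(page)      # log the visit
        return cookie                           # the browser stores the cookie

ads = AdServer()
cookie = None
for page in ["news.example", "shop.example", "health.example"]:
    cookie = ads.serve_ad(cookie, page)  # the same cookie goes to every site

print(ads.profiles[cookie])
# ['news.example', 'shop.example', 'health.example']
```

None of the three publishers sees the others' traffic, but the ad server ends up holding the user's complete cross-site browsing history.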

In reality, the online ad ecosystem is even more complicated. Ad exchanges host “real time auctions” for individual ad impressions on web pages. In the process, they may load code from several other third-party advertising providers, and may share data about each impression with many potential advertisers participating in the auction. Each ad you see might be responsible for sharing data with dozens of trackers. We’ll go into more depth about Real Time Bidding and other data-sharing activities in Part 3.

Analytics and tracking pixels

Tracking code often isn’t associated with anything visible to users, like a third-party ad. On the web, a significant portion of tracking happens via invisible, 1-pixel-by-1-pixel “images” that exist only to trigger requests to the trackers. These “tracking pixels” are used by many of the most prolific data collectors on the web, including Google Analytics, Facebook, Amazon, and DoubleVerify.

When website owners install a third party’s tracking pixels, they usually do so in exchange for access to some of the data the third party collects. For example, Google Analytics and Chartbeat use pixels to collect information, and offer website owners and publishers insights about what kinds of people are visiting their sites. Going another level deeper, advertising platforms like Facebook also offer “conversion pixels,” which allow publishers to keep track of how many click-throughs their own third-party ads are getting.
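A tracking pixel carries its payload in the request URL itself. The sketch below shows how the current page, the referrer, and a cookie-derived user ID might be packed into the query string of a 1-pixel "image"; the endpoint and parameter names are hypothetical, not any specific vendor's.

```python
from urllib.parse import urlencode

# Illustrative sketch: the kind of request URL a 1x1 tracking pixel might
# generate. The endpoint and parameter names here are hypothetical.

def pixel_url(endpoint, page, referrer, user_id):
    params = {
        "url": page,        # the page being viewed
        "ref": referrer,    # where the user came from
        "uid": user_id,     # cookie-derived user identifier
    }
    return f"{endpoint}?{urlencode(params)}"

url = pixel_url("https://tracker.example/p.gif",
                "https://news.example/article",
                "https://search.example/?q=flu+symptoms",
                "uid-12345")
print(url)
# Embedding <img src="..."> with this URL triggers the request: the "image"
# is invisible, and the query string is the actual payload.
```

Because the request fires whenever the page renders, the tracker learns about the visit even if the user never clicks anything.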

The biggest players in web-based analytics offer similar services to mobile apps. Google Analytics and Facebook are two of the most popular SDKs on both Android and iOS. Like their counterparts on the Web, these services silently collect information about users of mobile apps and then share some of that information with the app developers themselves.

Mobile third-party trackers convince app developers to install their SDKs by providing useful features like analytics or single sign-on. SDKs are just big blobs of code that app developers add to their projects. When they compile and distribute an app, the third-party code ships with it. Unlike Web-based tools, analytics services in mobile apps don’t need to use “pixels” or other tricks to trigger third-party requests.

Another class of trackers works on behalf of advertisers rather than first-party sites or apps. These companies work with advertisers to monitor where, how, and to whom their ads are being served. They often don’t work with first-party publishers at all; in fact, their goal is to gather data about publishers as well as users.

DoubleVerify is one of the largest such services. Third-party advertisers inject DoubleVerify code alongside their ads, and DoubleVerify estimates whether each impression is coming from a real human (as opposed to a bot), whether the human is who the advertiser meant to target, and whether the page around the ad is “brand safe.” According to its privacy policy, the company measures “how long the advertisement was displayed in the consumer’s browser” and “the display characteristics of the ad on the consumer’s browser.” In order to do all that, DoubleVerify gathers detailed data about users’ browsers; it is by far the largest source of third-party browser fingerprinting on the web. It collects location data, including data from other third-party sources, to try to determine whether a user is viewing an ad in the geographic area that the advertiser targeted.
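The general technique behind browser fingerprinting is to combine many semi-stable browser attributes into a single identifier that survives cookie clearing. The sketch below illustrates the idea; the attribute names are examples invented for this illustration, not DoubleVerify's actual inputs.

```python
import hashlib

# Illustrative sketch of browser fingerprinting: hashing a set of browser
# attributes into a stable identifier. Attribute names are examples only.

def fingerprint(attributes):
    """Combine browser attributes into a single stable hash."""
    canonical = "|".join(f"{k}={attributes[k]}" for k in sorted(attributes))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

browser = {
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64) ...",
    "screen": "1920x1080x24",
    "timezone": "America/New_York",
    "fonts": "Arial,Helvetica,Times",
    "canvas_hash": "d41d8cd98f",
}
print(fingerprint(browser))            # same browser -> same ID, no cookie needed

browser["timezone"] = "Europe/Berlin"  # one attribute changes ...
print(fingerprint(browser))            # ... and the fingerprint changes too
```

The more attributes a tracker collects, the more likely the combination is unique to one browser, which is why fingerprinting scripts probe everything from installed fonts to canvas rendering quirks.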

Other companies in the space include Adobe, Oracle, and Comscore.

Embedded media players

Sometimes, third-party trackers serve content that users actually want to see. On the web, embedding third-party content is extremely common for blogs and other media sites. Some examples include video players for services like YouTube, Vimeo, Streamable, and Twitter, and audio widgets for Soundcloud, Spotify, and podcast-streaming services. These media players nearly always run inside IFrames, and therefore have access to local storage and the ability to run arbitrary JavaScript. This makes them well-suited to tracking users as well.

Social media widgets

Social media companies provide a variety of services to websites, such as Facebook Like buttons and Twitter Share buttons. These are often pitched as ways for publishers to improve traffic numbers on their own platforms as well as their presence on social media. Like and Share buttons can be used for tracking in the same way that pixels can: the “button” is really an embedded image which triggers a request to the social media company’s server.

More sophisticated widgets, like comment sections, work more like embedded media players. They usually come inside of IFrames and enjoy more access to users’ browsers than simple pixels or images. Like media players, these widgets are able to access local storage and run JavaScript in order to compute browser fingerprints.

Finally, the biggest companies (Facebook and Google in particular) offer account management services to smaller companies, like “Log in with Google.” These services, known as “single sign-on,” are attractive to publishers for several reasons. Independent websites and apps can offload the work of managing user accounts to the big companies. Users have fewer username/password pairs to remember, and less frequently go through annoying sign up/log-in flows. But for users, there is a price: account management services allow log-in providers to act as a third party and track their users’ activity on all of the services they log into. Log-in services are more reliable trackers than pixels or other simple widgets because they force users to confirm their identity.

CAPTCHAs

CAPTCHAs are a technology that attempts to separate users from robots. Publishers install CAPTCHAs on pages where they want to be particularly careful about blocking automated traffic, like sign-up forms and pages that serve particularly large files.

Google’s reCAPTCHA is the most popular CAPTCHA technology on the web. Every time you connect to a site that uses reCAPTCHA, your browser connects to a *.google.com domain in order to load the CAPTCHA resources and shares all associated cookies with Google. This means that its CAPTCHA network is another source of data that Google can use to profile users.

While older CAPTCHAs asked users to read garbled text or click on pictures of bikes, the newer reCAPTCHA v3 records “interactions with the website” and silently guesses whether a user is human. reCAPTCHA scripts don’t send raw interaction data back to Google. Rather, they generate something akin to a behavioral fingerprint, which summarizes the way a user has interacted with a page. Google feeds this into a machine-learning model to estimate how likely the user is to be human, then returns that score to the first-party website. In addition to making things more convenient for users, this newer system benefits Google in two ways. First, it makes CAPTCHAs invisible to most users, which may make them less aware that Google (or anyone) is collecting data about them. Second, it leverages Google’s huge set of behavioral data to cement its dominance in the CAPTCHA market, and ensures that any future competitors will need their own tranches of interaction data in order to build tools that work in a similar way.
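Google's actual model is proprietary and far more sophisticated, but the general shape of the idea can be illustrated with a toy sketch: summarize raw interaction events into a few features, then turn the features into a human-likeness score. Every feature and threshold below is invented for illustration.

```python
# Toy illustration only: Google's real reCAPTCHA model is proprietary.
# This sketch shows the general idea of scoring behavioral summaries.

def behavioral_features(events):
    """Summarize raw interaction events into a small feature vector."""
    mouse_moves = [e for e in events if e["type"] == "mousemove"]
    keys = [e for e in events if e["type"] == "keypress"]
    return {
        "mouse_events": len(mouse_moves),
        "key_events": len(keys),
        "duration_ms": events[-1]["t"] - events[0]["t"] if events else 0,
    }

def human_score(features):
    """Return a 0..1 score; higher means more likely human."""
    score = 0.0
    if features["mouse_events"] > 5:    # humans jitter the mouse
        score += 0.4
    if features["key_events"] > 0:      # humans type
        score += 0.2
    if features["duration_ms"] > 1000:  # humans are slow
        score += 0.4
    return score

bot = [{"type": "click", "t": 0}, {"type": "click", "t": 10}]
human = ([{"type": "mousemove", "t": i * 300} for i in range(10)]
         + [{"type": "keypress", "t": 3100}])

print(human_score(behavioral_features(bot)))    # low score
print(human_score(behavioral_features(human)))  # high score
```

Note that only the summary score reaches the first-party site; the behavioral data itself stays with the CAPTCHA provider, which is precisely what gives the incumbent its data advantage.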

Session replay services

Session replay services are tools that website or app owners can install in order to actually record how users interact with their services. These services operate both on websites and in apps. They log keystrokes, mouse movements, taps, swipes, and changes to the page, then allow first-party sites to “re-play” individual users’ experiences after the fact. Often, users are given no indication that their actions are being recorded and shared with third parties.

These creepy tools create a massive risk that sensitive data, like medical information, credit card numbers, or passwords, will be recorded and leaked. The providers of session replay services usually leave it up to their clients to designate certain data as off-limits. But for clients, the process of filtering out sensitive information is subtle, painstaking, and time-consuming, and it clashes with replay services’ promises to get set up “in a matter of seconds.” As a result, independent auditing has found that sensitive data ends up in the recordings, and that session replay service providers often fail to secure that data appropriately.

Passive, real-world tracking

WiFi hotspots and wireless beacons

Many consumer devices emit wireless “probe” signals, and many companies install commercial beacons all over the physical world to intercept these probes. Some devices randomize the unique MAC addresses they share in probes, protecting themselves from passive tracking, but not all do. And connecting to an open WiFi network or granting an app Bluetooth permissions always opens a device up to tracking.

As we discussed above, WiFi hotspots, wireless beacons, and other radio devices can be used to “listen” for nearby devices. Companies like Comcast (which provides XFinity WiFi) and Google (which provides free WiFi in Starbucks and many other businesses) have WiFi hotspots installed all over the world; Comcast alone boasts over 18 million XFinity WiFi installations. Dozens of other companies that you likely haven’t heard of provide free WiFi to coffee shops, restaurants, events, and hotels.

Companies also pay to install wireless beacons in real-world businesses and public spaces. Bluetooth-enabled beacons have been installed around retail stores, at political rallies, in campaign lawn signs, and on streetlight poles.

Wireless beacons are capable of tracking on two levels. First, and most concerning, wireless beacons can passively monitor the “probes” that devices send out all the time. If a device is broadcasting its hardware MAC address, companies can use the probes they collect to track its user’s movement over time.

A laptop emits probe requests containing its MAC address. Wireless beacons listen for the probes and tie the requests to a profile of the user.

WiFi hotspots and bluetooth beacons can listen for probes that wireless devices send out automatically. Trackers can use each device’s MAC address to create a profile of it based on where they’ve seen that device.
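The passive style of tracking is simple to sketch: each beacon logs the MAC addresses it overhears along with its own location, and grouping the pooled logs by MAC reconstructs a device's movements. The log entries below are invented for illustration.

```python
from collections import defaultdict

# Illustrative sketch: beacons log (beacon location, MAC address, time);
# grouping the pooled logs by MAC reconstructs each device's movements.

probe_log = [
    ("coffee-shop", "aa:bb:cc:11:22:33", "09:00"),
    ("mall",        "aa:bb:cc:11:22:33", "12:30"),
    ("coffee-shop", "de:ad:be:ef:00:01", "12:45"),
    ("clinic",      "aa:bb:cc:11:22:33", "15:10"),
]

paths = defaultdict(list)
for location, mac, time in probe_log:
    paths[mac].append((time, location))

# A device broadcasting a stable MAC address yields a full movement trail:
print(sorted(paths["aa:bb:cc:11:22:33"]))
# [('09:00', 'coffee-shop'), ('12:30', 'mall'), ('15:10', 'clinic')]
```

This is why MAC randomization matters: a device that broadcasts a fresh random address in each probe shows up here as many unlinkable one-entry trails instead of one continuous path.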

Second, when a user connects to a WiFi hotspot or to a Bluetooth beacon, the controller of the hotspot or beacon can connect the device’s MAC address to additional identifiers like IP address, cookies, and ad ID. Many WiFi hotspot operators also use a sign-in page to collect information about users’ real names or email addresses. Then, when users browse the web from that hotspot, the operator can collect data on all the traffic coming from the user’s device, much like an ISP. Bluetooth beacons are used slightly differently. Mobile phones allow apps to access the Bluetooth interface with certain permissions. Third-party trackers in apps with Bluetooth permissions can automatically connect to Bluetooth beacons in the real world, and they can use those connections to gather fine-grained location data.

Thankfully, both iOS and Android devices now send obfuscated MAC addresses with probes by default. This prevents the first, passive style of tracking described above.

But phones aren’t the only devices with wireless capability. Laptops, e-readers, wireless headphones, and even cars are often outfitted with Bluetooth capability. Some of these devices don’t have the MAC randomization features that recent models of smartphones do, making them vulnerable to passive location tracking.

Furthermore, even devices with MAC randomization usually share static MAC addresses when they actually connect to a wireless hotspot or Bluetooth device. This heightens the risks of the second style of tracking described above, which occurs when the devices connect to public WiFi networks or local Bluetooth beacons.

Vehicle tracking and ALPRs

Automated license plate readers, or ALPRs, are cameras outfitted with the ability to detect and read license plates. They can also use other characteristics of cars, like make, model, color, and wear, in order to help identify them. ALPRs are often used by law enforcement, but many ALPR devices are owned by private companies. These companies collect vehicle data indiscriminately, and once they have it, they can re-sell it to whomever they want: local police, federal immigration enforcement agencies, private data aggregators, insurance companies, lenders, or bounty hunters.

Different companies gather license plate data from different sources, and sell it to different audiences. Digital Recognition Network, or DRN, sources its data from thousands of repossession agencies around the country, and sells data to insurance agencies, private investigators, and “asset recovery” companies. According to an investigation by Motherboard, the vast majority of individuals about whom DRN collects data are not suspected of a crime or behind on car payments. The start-up Flock Safety offers ALPR-powered “neighborhood watch” services. Concerned homeowners can install ALPRs on their property in order to record and share information about cars that drive through their neighborhood.

DRN is owned by VaaS International Holdings, a Fort Worth-based company that brands itself as “the preeminent provider of license plate recognition (‘LPR’) and facial recognition products and data solutions.” It also owns Vigilant Solutions, another private purveyor of ALPR technology. Vigilant’s clients include law enforcement agencies and private shopping centers. Vigilant pools data from thousands of sources around the country into a single database, which it calls “PlateSearch.” Scores of law enforcement agencies pay for access to PlateSearch. According to EFF’s research, approximately 99.5% of the license plates recorded by Vigilant are not connected to a public safety interest at the time they are scanned.

Cameras and machine vision aren’t the only technologies enabling vehicle tracking. Passive MAC address tracking can also be used to track vehicle movement. Phones inside of vehicles, and sometimes the vehicles themselves, broadcast probe requests including their MAC addresses. Wireless beacons placed strategically around roads can listen for those signals. One company, Libelium, sells a wireless beacon that is meant to be installed on streetlights in order to track nearby traffic.

Face recognition cameras

Face recognition has been deployed widely by law enforcement in some countries, including China and the UK. This has frightening implications: it allows mass logging of innocent people’s activities. In China, it has been used to monitor and control members of the Uighur minority community.

We’ve covered the civil liberties harms associated with law enforcement use of face recognition extensively in the past. But face recognition also has been deployed in a number of private industries. Airlines use face recognition to authenticate passengers before boarding. Concert venues and ticket sellers have used it to screen concert-goers. Retailers use face recognition to identify people who supposedly are greater risks for shoplifting, which is especially concerning considering that the underlying mugshot databases are riddled with unfair racial disparities, and the technology is more likely to misidentify people of color. Private security companies sell robots equipped with face recognition to monitor public spaces and help employers keep tabs on employees. And schools and even summer camps use it to keep tabs on kids.

Big tech companies have begun investing in facial recognition for payment processing, which would give them another way to link real-world activity to users’ online personas. Facebook has filed a patent on a system that would link faces to social media profiles in order to process payments. Also, Amazon’s brick-and-mortar “Go” stores rely on biometrics to track who enters and what they take in order to charge them accordingly.

In addition, many see facial recognition as a logical way to bring targeted advertising to the physical world. Face recognition cameras can be installed in stores, on billboards, and in malls to profile people’s behavior, build dossiers on their habits, and target messages at them. In January 2019, Walgreens began a pilot program using face recognition cameras installed on LED-screen fridge doors. The idea is that, instead of looking through a plate of glass to see the contents of a fridge, consumers can look at a screen which will display graphics indicating what’s inside. The camera can perform facial recognition on whoever is standing in front of the fridge, and the graphics can be dynamically changed to serve ads targeted to that person. Whether or not Walgreens ends up deploying this technology at a larger scale, this appears to be one direction retailers are heading.

Payment processors and financial technology

Financial technology, or “fintech,” is a blanket term for the burgeoning industry of finance-adjacent technology companies. Thousands of relatively new tech companies act as the technological glue between old-guard financial institutions and newer technologies, including tracking and surveillance. When they are regulated, fintech companies are often subject to less government oversight than traditional institutions like banks.

Payment processors are companies that accept payments on behalf of other businesses. As a result, they are privy to huge amounts of information about what businesses sell and what people buy. Since most financial transactions involve credit card numbers and names, it is easy for payment processors to tie the data they collect to real identities. Some of these companies are pure service providers, and don’t use data for any purposes other than moving money from one place to another. Others build profiles of consumers or businesses and then monetize that data. For example, Square is a company that makes credit card readers for small businesses. It also uses the information it collects to serve targeted ads from third parties and to underwrite loans through its Square Capital program.

Some fintech companies offer financial services directly to users, like Intuit, the company behind TurboTax and Mint. Others provide services to banks or businesses. In the fintech world, “data aggregators” act as intermediaries between banks and other services, like money management apps. In the process, data aggregators gain access to all the data that passes through their pipes, including account balances, outstanding debts, and credit card transactions for millions of people. In addition, aggregators often collect consumers’ usernames and passwords in order to extract data from their banks. Yodlee, one of the largest companies in the space, sells transaction data to hedge funds, which mine the information to inform stock market moves. Many users are unaware that their data is used for anything other than operating the apps they have signed up for.

Tracking and corporate power

Many of the companies that benefit most from data tracking have compelling ways to entice web developers, app creators, and store managers to install their tracking technology. Companies with monopolies or near-monopolies can use their market power to build tracking networks, monitor and inhibit smaller competitors, and exploit consumer privacy for their own economic advantage. Corporate power and corporate surveillance reinforce one another in several ways.

First, dominant companies like Google and Facebook can pressure publishers into installing their tracking code. Publishers rely on the world’s biggest social network and the world’s biggest search engine to drive traffic to their own sites. As a result, most publishers need to advertise on those platforms. And in order to track how effective their ads are, they have no choice but to install Google and Facebook’s conversion measurement code on their sites and apps. Google, Facebook, and Amazon also act as third-party ad networks, together controlling over two-thirds of the market. That means publishers who want to monetize their content have a hard time avoiding the big platforms’ ad tracking code.

Second, vertically integrated tech companies can gain control of both sides of the tracking market. Google administers the largest behavioral advertising system in the world, which it powers by collecting data from its Android phones and Chrome browser—the most popular mobile operating system and most popular web browser in the world. Compared to its peer operating systems and browsers, Google’s user software makes it easier for its trackers to collect data.

When the designers of the Web first described browsers, they called them “user agents”: pieces of software that would act on their users’ behalf on the Internet. But when a browser maker is also a company whose main source of revenue is behavioral advertising, the company’s interest in user privacy and control is pitted against the company’s interest in tracking. The company’s bottom line usually comes out on top.

Third, data can be used to profile not just people, but also competitor companies. The biggest data collectors don’t just know how we act, they also know more about the market—and their competitors—than anyone else. Google’s tracking tools monitor over 80% of traffic on the Web, which means it often knows as much about its competitors’ traffic as its competitors do (or more). Facebook (via third-party ads, analytics, conversion pixels, social widgets, and formerly its VPN app Onavo) also monitors the use and growth of websites, apps, and publishers large and small. Amazon already hosts a massive portion of the Internet in its Amazon Web Services computing cloud, and it is starting to build its own formidable third-party ad network. These giants use this information to identify nascent competitors, and then buy them out or clone their products before they become significant threats. According to confidential internal documents, Facebook used data about users’ app habits from Onavo, its VPN, to inform its acquisition of WhatsApp.

Fourth, as tech giants concentrate tracking power into their own hands, they can use access to data as an anticompetitive cudgel. Facebook was well aware that access to its APIs (and the detailed private data that entailed) were invaluable to other social companies. It has a documented history of granting or withholding access to user data in order to undermine its competition.

Furthermore, Google and Facebook have both begun adopting policies that restrict competitors’ access to their data without limiting what they collect themselves. For example, most of the large platforms now limit the third-party trackers on their own sites. In its own version of RTB, Google has recently begun restricting access to ad identifiers and other information that would allow competing ad networks to build user profiles. And following the Cambridge Analytica incident, Facebook started locking down access to third-party APIs, without meaningfully changing anything about the data that Facebook itself collects on users. On the one hand, restricting third-party access can have privacy benefits. On the other, kicking third-party developers and outside actors off Facebook’s and Google’s platform services can make competition problems worse, give incumbent giants sole power over the user data they have collected, and cement their privacy-harmful business practices. Competition and privacy should not be treated as isolated concerns: empowering users requires addressing both, and reducing large companies’ control over users’ data and attention.

Finally, big companies can acquire troves of data from other companies in mergers and acquisitions. Google Analytics began its life as the independent company Urchin, which Google purchased in 2005. In 2007, Google supercharged its third-party advertising networks by purchasing Doubleclick, then as now a leader in the behaviorally targeted ad market. In late 2019, it purchased the health data company Fitbit, merging years of step counts and exercise logs into its own vast database of users’ physical activity.

In its brief existence, Facebook has acquired 67 other companies. Amazon has acquired 91, and Google, 214—an average of over 10 per year. Many of the smaller firms that Facebook, Amazon, or Google have acquired had access to tremendous amounts of data and millions of active users. With each acquisition, those data sources are folded into the already-massive silos controlled by the tech giants. And thanks to network effects, the data becomes more valuable when it’s all under one roof. On its own, Doubleclick could assemble pseudonymous profiles of users’ browsing history. But as a part of Google, it can merge that data with real names, locations, cross-device activity, search histories, and social graphs.

Multi-billion dollar tech giants are not the only companies tracking us, nor are they the most irresponsible actors in the space. But the bigger they are, the more they know. And the more kinds of data a company has access to, the more powerful its profiles of users and competitors will be. In the new economy of personal information, the rich are only getting richer.

Part 3: Data sharing: Targeting, brokers, and real-time bidding

Where does the data go when it’s collected? Most trackers don’t collect every piece of information by themselves. Instead, companies work together, collecting data for themselves and sharing it with each other. Sometimes, companies with information about the same individual will combine it only briefly to determine which advertiser will serve which ad to that person. In other cases, companies base their entire business model on collecting and selling data about individuals they never interact with. In all cases, the type of data they collect and share can impact their target’s experience, whether by affecting the ads they’re exposed to or by determining which government databases they end up cataloged in. Moreover, the more a user’s data is spread around, the greater the risk that they will be affected by a harmful data breach. This section will explore how personal information gets shared and where it goes.

Real-time bidding

Real-time bidding is the system that publishers and advertisers use to serve targeted ads. The unit of sale in the Internet advertising world is the “impression.” Every time a person visits a web page with an ad, that person views an ad impression. Behind the scenes, an advertiser pays an ad network for the right to show you an ad, and the ad network pays the publisher of the web page where you saw the ad. But before that can happen, the publisher and the ad network have to decide which ad to show. To do so, they conduct a milliseconds-long auction, in which the auctioneer offers up a user’s personal information, and then software on dozens of corporate servers bids on the rights to that user’s attention. Data flows in one direction, and money flows in the other.

Such “real-time bidding” is quite complex, and the topic could use a whitepaper on its own. Luckily, there are tremendous, in-depth resources on the topic already. Dr. Johnny Ryan and Brave have written a series on the privacy impact of RTB. There is also a doctoral thesis on the privacy implications of the protocol. This section will give a brief overview of what the process looks like, much of which is based on Ryan’s work.

A browser loading “website.com” also shares information, including a cookie and other request headers, with other third-party servers. This information is sent to a Supply-Side Platform (SSP), which is the server that begins the real-time bidding auction. The SSP matches the cookie to user 552EFF, which is Ava’s device. The SSP then fills out a “bid request,” which includes information like year of birth, gender (“f?”), keywords (“coffee, goth”), and geo (“USA”), and sends it to DSP servers.

Supply-side platforms use cookies to identify a user, then distribute “bid requests” with information about the user to potential advertisers.

First, data flows from your browser to the ad networks, also known as “supply-side platforms” (SSPs). In this economy, your data and your attention are the “supply” that ad networks and SSPs are selling. Each SSP receives your identifying information, usually in the form of a cookie, and generates a “bid request” based on what it knows about your past behavior. Next, the SSP sends this bid request to each of the dozens of advertisers who have expressed interest in showing ads.

A screenshot of a table describing the information content of the User object from the AdCOM 1.0 specification.

The `user` object in an OpenRTB bid request contains the information a particular supply-side platform knows about the subject of an impression, including one or more unique IDs, age, gender, location, and interests. Source: https://github.com/InteractiveAdvertisingBureau/AdCOM/blob/master/AdCOM%20v1.0%20FINAL.md#object--user-

The bid request contains information about your location, your interests, and your device, and includes your unique ID. The screenshot above shows the information included in an OpenRTB bid request.
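To make the structure concrete, here is a sketch of what a simplified OpenRTB-style bid request might look like. Field names loosely follow the OpenRTB spec’s conventions, but the values (and the specific subset of fields shown) are hypothetical, chosen to match the example above.

```python
# Illustrative sketch of a simplified OpenRTB-style bid request.
# Field names loosely follow the OpenRTB spec; all values are hypothetical.
import json

bid_request = {
    "id": "auction-8f3a",          # unique ID for this auction
    "imp": [{                      # the impression being sold
        "id": "1",
        "banner": {"w": 300, "h": 250},
    }],
    "site": {"domain": "website.com", "page": "https://website.com/article"},
    "device": {
        "ip": "203.0.113.7",       # user's IP address
        "ua": "Mozilla/5.0 ...",   # browser user-agent string
        "geo": {"country": "USA"},
    },
    "user": {                      # everything the SSP knows about the visitor
        "id": "552EFF",            # the SSP's own cookie-based user ID
        "yob": 1995,               # year of birth
        "gender": "F",
        "keywords": "coffee,goth", # inferred interests
    },
}

# The SSP serializes this object and sends it to each DSP's bid endpoint.
payload = json.dumps(bid_request)
```

Note that everything in the `user` and `device` objects is transmitted to every potential bidder before any ad is shown.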

A demand-side platform server winning the bid.

After the auction is complete, winning bidders pay supply-side platforms, SSPs pay the publisher, and the publisher shows the user an ad. At this point, the winning advertiser can collect even more information from the user’s browser.

Finally, it’s the bidders’ turn. Using automated systems, the advertisers look at your info, decide whether they’d like to advertise to you and which ad they want to show, then respond to the SSP with a bid. The SSP determines who won the auction and displays the winner’s ad on the publisher’s web page.
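The winner-selection step can be sketched in a few lines. RTB exchanges have historically run second-price auctions, where the winner pays just above the runner-up’s bid; the bidder names and prices below are hypothetical.

```python
# Sketch of how an SSP or exchange might pick a winner from DSP responses.
# Assumes a second-price auction, historically the common model in RTB.

def run_auction(bids):
    """bids: list of (bidder_name, price_cpm) tuples.
    Returns (winner, price_paid), or None if nobody bid."""
    if not bids:
        return None  # no bidder wanted this impression
    ranked = sorted(bids, key=lambda b: b[1], reverse=True)
    winner, top_price = ranked[0]
    # Second-price rule: pay the runner-up's bid plus a minimal increment.
    price_paid = ranked[1][1] + 0.01 if len(ranked) > 1 else top_price
    return winner, round(price_paid, 2)

result = run_auction([("dsp-a", 2.50), ("dsp-b", 1.75), ("dsp-c", 0.40)])
# dsp-a wins, paying just above dsp-b's 1.75 bid
```

The losing bidders’ money never changes hands, but as the next paragraph explains, they have already received the user’s personal information.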

All the information in the bid request is shared before any money changes hands. Advertisers who don’t win the auction still receive the user’s personal information. This enables “shadow bidding.” Certain companies may pretend to be interested in buying impressions, but intentionally bid to lose in each auction with the goal of collecting as much data as possible as cheaply as possible.

Furthermore, there are several layers of companies that participate in RTB between the SSP and the advertisers, and each layer of companies also vacuums up user information. SSPs interface with “ad exchanges,” which share data with “demand side platforms” (DSPs), which also share and purchase data from data brokers. Publishers work with SSPs to sell their ad space, advertisers work with DSPs to buy it, and ad exchanges connect buyers and sellers. You can read a breakdown of the difference between SSPs and DSPs, written for advertisers, here. Everyone involved in the process gets to collect behavioral data about the person who triggered the request.

During the bidding process, advertisers and the DSPs they work with can use third-party data brokers to augment their profiles of individual users. These data brokers, which refer to themselves innocuously as “data management platforms” (DMPs), sell data about individuals based on the identifiers and demographics included in a bid request. In other words, an advertiser can share a user ID with a data broker and receive that user’s behavioral profile in return.

Source: Zhang, W., Yuan, S., Wang, J., and Shen, X. (2014b). Real-time bidding benchmarking with ipinyou dataset. arXiv preprint arXiv:1407.7073.

The diagram above gives another look at the flow of information and money in a single RTB auction.

In summary: (1) a user’s visit to a page triggers an ad request from the page’s publisher to an ad exchange. This is our real-time bidding “auctioneer.” The ad exchange (2) requests bids from advertisers and the DSPs they work with, sending them information about the user in the process. The DSP then (3) augments the bid request data with more information from data brokers, or DMPs. Advertisers (4) respond with a bid for the ad space. After (5) a milliseconds-long auction, the ad exchange (6) picks and notifies the winning advertiser. The ad exchange (7) serves that ad to the user, complete with the tracking technology described above. The advertiser will (8) receive information about how the user interacted with the ad, e.g. how long they looked at it, what they clicked, and whether they purchased anything. That data will feed back into the DSP’s information about that user and other users who share their characteristics, informing future RTB bids.

From the perspective of the user who visited the page, RTB causes two discrete sets of privacy invasions. First, before they visited the page, an array of companies tracked their personal information, both online and offline, and merged it all into a sophisticated profile about them. Then, during the RTB process, a different set of companies used that profile to decide how much to bid for the ad impression. Second, as a result of the user’s visit to the page, the RTB participants harvest additional information from the visiting user. That information is injected into the user’s old profile, to be used during subsequent RTBs triggered by their next page visits. Thus, RTB is both a cause of tracking and a means of tracking.

RTB on the web: cookie syncing

Cookie syncing is a method that web trackers use to link cookies with one another and combine the data one company has about a user with data that other companies might have.

Mechanically, it’s very simple. One tracking domain triggers a request to another tracker. In the request, the first tracker sends a copy of its own tracking cookie. The second tracker gets both its own cookie and the cookie from the first tracker. This allows it to “compare notes” with the other tracker while building up its profile of the user.

Cookie sharing is commonly used as a part of RTB. In a bid request, the SSP shares its own cookie ID with all of the potential bidders. Without syncing, the demand side platforms might have their own profiles about users linked to their own cookie IDs. A DSP might not know that the user “abc” from Doubleclick (Google’s ad network) is the same as its own user “xyz”. Cookie syncing lets them be sure. As part of the bidding process, SSPs commonly trigger cookie-sync requests to many DSPs at a time. That way, the next time that SSP sends out a bid request, the DSPs who will be bidding can use their own behavioral profiles about the user to decide how to bid.
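Mechanically, a sync boils down to passing one tracker’s user ID to another tracker inside a URL, so the second tracker can record the mapping next to its own cookie. The sketch below uses hypothetical domains and parameter names, but real syncs work the same way.

```python
# Sketch of the redirect behind a cookie sync. Domains and parameter
# names ("partner", "partner_uid") are hypothetical.
from urllib.parse import urlencode, urlparse, parse_qs

ssp_cookie_id = "abc"   # the SSP's ID for this user

# 1. The SSP's sync pixel redirects the browser to the DSP's sync
#    endpoint, embedding the SSP's own cookie ID in the URL.
sync_url = "https://dsp.example/sync?" + urlencode(
    {"partner": "ssp.example", "partner_uid": ssp_cookie_id})

# 2. The DSP reads its OWN cookie from the incoming request ("xyz")
#    plus the partner_uid from the URL, and stores the mapping.
dsp_cookie_id = "xyz"
params = parse_qs(urlparse(sync_url).query)
match_table = {}
match_table[(params["partner"][0], params["partner_uid"][0])] = dsp_cookie_id

# Later, when a bid request arrives carrying the SSP's ID "abc", the DSP
# can look up its own profile for user "xyz" before deciding how to bid.
```

The “match table” built this way is the link that lets two companies’ independent behavioral profiles be treated as one person.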

A laptop makes a request for a hidden element on the page, which kicks off the "cookie sync" process described below.

Cookie syncing. An invisible ‘pixel’ element on the page triggers a request to an ad exchange or SSP, which redirects the user to a DSP. The redirect URL contains information about the SSP’s cookie that lets the DSP link it to its own identifier. A single SSP may trigger cookie syncs to many different DSPs at a time.

RTB in mobile apps

RTB was created for the Web, but it works just as well for ads in mobile apps. Instead of cookies, trackers use ad IDs. The ad IDs baked into iOS and Android make trackers’ jobs easier. On the web, each advertiser has its own cookie ID, and demand-side platforms need to sync data with DMPs and with each other in order to tie their data to a specific user.

But on mobile devices, each user has a single, universal ad ID that is accessible from every app. That means that the syncing procedures described above on the web are not necessary on mobile; advertisers can use ad IDs to confirm identity, share data, and build more detailed profiles upon which to base bids.

Group targeting and look-alike audiences

Sometimes, large platforms do not disclose their data; rather, they lease out temporary access to their data-powered tools. Facebook, Google, and Twitter all allow advertisers to target categories of people with ads. For example, Facebook lets advertisers target users with certain “interests” or “affinities.”

The companies do not show advertisers the actual identities of individuals their campaigns target. If you start a Facebook campaign targeting “people interested in Roller Derby in San Diego,” you can’t see a list of names right away. However, this kind of targeting does allow advertisers to reach out directly to roller derby-going San Diegans and direct them to an outside website or app. When targeted users click on an ad, they are directed off of Facebook and to the advertiser’s domain. At this point, the advertiser knows they came from Facebook and that they are part of the targeted demographic. Once users have landed on the third-party site, the advertiser can use data exchange services to match them with behavioral profiles or even real-world identities.

In addition, Facebook allows advertisers to build “look-alike audiences” based on other groups of people. For example, suppose you’re a payday loan company with a website. You can install an invisible Facebook pixel on a page that your debtors visit, make a list of people who visit that page, and then ask Facebook to create a “look-alike” audience of people who Facebook thinks are “similar” to the ones on your list. You can then target those people with ads on Facebook, directing them back to your website, where you can use cookies and data exchanges to identify who they are.

These “look-alike” features are black boxes. Without the ability to audit or study them, it’s impossible to know what kinds of data they use and what kinds of information about users they might expose. We urge the platforms that offer them to disclose more information about how they work and to allow independent testing.

Data brokers

Data brokers are companies that collect, aggregate, process, and sell data. They operate out of sight from regular users, but in the center of the data-sharing economy. Often, data brokers have no direct relationships with users at all, and the people about whom they sell data may not be aware they exist. Data brokers purchase information from a variety of smaller companies, including retailers, financial technology companies, medical research companies, online advertisers, cellular providers, Internet of Things device manufacturers, and local governments. They then sell data or data-powered services to advertisers, real estate agents, market research companies, colleges, governments, private bounty hunters, and other data brokers.

This is another topic that is far too broad to cover here, and others have written in depth about the data-selling ecosystem. Cracked Labs’ report on corporate surveillance is both accessible and in-depth. Pam Dixon of the World Privacy Forum has also done excellent research into data brokers, including a report from 2014 and testimony before the Senate in 2015 and 2019.

The term “data broker” is broad. It includes “mom and pop” marketing firms that assemble and sell curated lists of phone numbers or emails, and behemoths like Oracle that ingest data from thousands of different streams and offer data-based services to other businesses.

Some brokers sell raw streams of information. This includes data about retail purchase behavior, data from Internet of Things devices, and data from connected cars. Others act as clearinghouses between buyers and sellers of all kinds of data. For example, Narrative promises to help sellers “unlock the value of [their] data” and help buyers “access the data [they] need.” Dawex describes itself as “a global data marketplace where you can meet, sell and buy data directly.”

Another class of companies act as middlemen or “aggregators,” licensing raw data from several different sources, processing it, and repackaging it as a specific service for other businesses. For example, major phone carriers sold access to location data to aggregators called Zumigo and Microbilt, which in turn sold access to a broad array of other companies, with the resulting market ultimately reaching down to bail bondsmen and bounty hunters (and an undercover reporter). EFF is now suing AT&T for selling this data without users’ consent and for misleading the public about its privacy practices.

Many of the largest data brokers don’t sell the raw data they collect. Instead, they collect and consume data from thousands of different sources, then use it to assemble their own profiles and draw inferences about individuals. Oracle, one of the world’s largest data brokers, owns Bluekai, one of the largest third-party trackers on the web. Credit reporting agencies, including Equifax and Experian, are also particularly active here. While the U.S. Fair Credit Reporting Act governs how credit raters can share specific types of data, it doesn’t prevent credit agencies from selling most of the information that trackers collect today, including transaction information and browsing history. Many of these companies advertise their ability to derive psychographics, which are “innate” characteristics that describe user behavior. For example, Experian classifies people into financial categories like “Credit Hungry Card Switcher,” “Disciplined, Passive Borrower,” and “Insecure Debt Dependent,” and claims to cover 95% of the U.S. population. Cambridge Analytica infamously used data about Facebook likes to derive “OCEAN scores”—ratings for openness, conscientiousness, extraversion, agreeableness, and neuroticism—about millions of voters, then sold that data to political campaigns.

Finally, many brokers use their internal profiles to offer “identity resolution” or “enrichment” services to others. If a business has one identifier, like a cookie or email address, it can pay a data broker to “enrich” that data and learn other information about the person. It can also link data tied to one identifier (like a cookie) to data from another (like a mobile ad ID). In the real-time bidding world, these services are known as “data management platforms.” Real-time bidders can use these kinds of services to learn who a particular user is and what their interests are, based only on the ID included with the bid request.
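Functionally, an enrichment service is a lookup keyed on whatever identifier the buyer holds. The sketch below shows the shape of such a lookup as a DSP might perform it during bidding; the identifier format, profile fields, and stored values are hypothetical (the “segment” label reuses an Experian category named earlier for illustration).

```python
# Sketch of an "enrichment" lookup against a data broker's match table.
# Identifier format and profile fields are hypothetical.

# The broker's internal store, keyed by identifiers it has seen before.
broker_profiles = {
    "cookie:abc": {
        "interests": ["coffee", "goth"],
        "segment": "Credit Hungry Card Switcher",  # example psychographic label
    },
}

def enrich(identifier):
    """Given one identifier from a bid request, return everything the
    broker has linked to that person (or an empty profile)."""
    return broker_profiles.get(identifier, {})

# A bid request carries only an opaque ID; one lookup attaches the
# behavioral profile the broker has accumulated under that ID.
profile = enrich("cookie:abc")
```

The point is that a single opaque ID in a bid request is enough to retrieve a rich profile, which is why ID sharing and profile sharing are effectively the same thing.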

For years, data brokers have operated out of sight and out of mind of the general public. But we may be approaching a turning point. In 2018, Vermont passed the nation’s first law requiring companies that buy and sell third-party data to register with the secretary of state. As a result, we now have access to a list of over 120 data brokers and information about their business models. Furthermore, when the California Consumer Privacy Act goes into effect in 2020, consumers will have the right to access the personal information that brokers have about them for free, and to opt out of having their data sold.

Data consumers

So far, this paper has discussed how data is collected, shared, and sold. But where does it end up? Who are the consumers of personal data, and what do they do with it?

Targeted advertising

By far the biggest, most visible, and most ubiquitous data consumers are targeted advertisers. Targeted advertising allows advertisers to reach users based on demographics, psychographics, and other traits. Behavioral advertising is a subset of targeted advertising that leverages data about users’ past behavior in order to personalize ads.

The biggest data collectors are also the biggest targeted advertisers. Together, Google and Facebook control almost 60% of the digital ad market in the U.S., and they use their respective troves of data in order to do so. Google, Facebook, Amazon, and Twitter offer end-to-end targeting services where advertisers can target high-level categories of users, and the advertisers don’t need to have access to any data themselves. Facebook lets advertisers target users based on location; demographics like age, gender, education, and income; and interests like hobbies, music genres, celebrities, and political leaning. Some of the “interests” Facebook uses are based on what users have “liked” or commented on, and others are derived based on Facebook’s third-party tracking. While Facebook uses its data to match advertisers to target audiences, Facebook does not share its data with those advertisers.

Real-time bidding (RTB) involves more data sharing, and there are a vast array of smaller companies involved in different levels of the process. The big tech companies offer services in this space as well: Google’s Doubleclick Bid Manager and Amazon DSP are both RTB demand-side platforms. In RTB, identifiers are shared so that the advertisers themselves (or their agents) can decide whether they want to reach each individual and what ad they want to show. In the RTB ecosystem, advertisers collect their own data about how users behave, and they may use in-house machine learning models in order to predict which users are most likely to engage with their ads or buy their products.

Some advertisers want to reach users on Facebook or Google, but don’t want to use the big companies’ proprietary targeting techniques. Instead, they can buy lists of contact information from data brokers, then upload those lists directly to Facebook or Google, who will reach those users across all of their platforms. This system undermines big companies’ efforts to rein in discriminatory or otherwise malicious targeting. Targeting platforms like Google and Facebook do not allow advertisers to target users of particular ethnicities with ads for jobs, housing, or credit. However, advertisers can buy demographic information about individuals from data brokers, upload a list of names who happen to be from the same racial group, and have the platform target those people directly. Both Google and Facebook forbid the use of “sensitive information” to target people with contact lists, but it’s unclear how they enforce these policies.

Political campaigns and interest groups

Companies aren’t the only entities that try to benefit from data collection and targeted advertising. Cambridge Analytica used ill-gotten personal data to estimate “psychographics” for millions of potential voters, then used that data to help political campaigns. In 2018, the group CatholicVote used cell-phone location data to determine who had been inside a Catholic church, then targeted them with “get out the vote” ads. Anti-abortion groups used similar geo-fencing technology to target ads to women while they were at abortion clinics.

And those incidents are not isolated. Some non-profits that rely on donations buy data to help narrow in on potential donors. Many politicians around the country have used open voter registration data to target voters. The Democratic National Committee is reportedly investing heavily in its “data warehouse” ahead of the 2020 election. And Deep Root Analytics, a consulting firm for the Republican party, was the source of the largest breach of US voter data in history; it had been collecting names, registration details, and “modeled” ethnicity and religion data about nearly 200 million Americans.

Debt collectors, bounty hunters, and fraud investigators

Debt collectors, bounty hunters, and repossession agencies all purchase and use location data from a number of sources. EFF is suing AT&T for its role in selling location data to aggregators, which enabled a secondary market that allowed access by bounty hunters. However, phone carriers aren’t the only source of that data. The bail bond company Captira sold location data gathered from cell phones and ALPRs to bounty hunters for as little as $7.50. And thousands of apps collect “consensual” location data using GPS permissions, then sell that data to downstream aggregators. This data can be used to locate fugitives, debtors, and those who have not kept up with car payments. And as investigations have shown, it can also be purchased—and abused—by nearly anyone.

Cities, law enforcement, intelligence agencies

The public sector also purchases data from the private sector for all manner of applications. For example, U.S. Immigration and Customs Enforcement bought ALPR data from Vigilant to help locate people the agency intends to deport. Government agencies contract with data brokers for myriad tasks, from determining eligibility for human services to tax collection, according to the League of California Cities, in a letter seeking an exception from that state’s consumer data privacy law for contracts between government agencies and data brokers. Advocates have long decried these arrangements between government agencies and private data brokers as a threat to consumer data privacy, as well as an end-run around legal limits on governments’ own databases. And of course, national security surveillance often rests on the data mining of private companies’ reservoirs of consumer data. For example, as part of the PRISM program revealed by Edward Snowden, the NSA collected personal data directly from Google, YouTube, Facebook, and Yahoo.

Part 4: Fighting back

You might want to resist tracking to avoid being targeted by invasive or manipulative ads. You might be unhappy that your private information is being bartered and sold behind your back. You might be concerned that someone who wishes you harm can access your location through a third-party data broker. Perhaps you fear that data collected by corporations will end up in the hands of police and intelligence agencies. Or third-party tracking might just be a persistent nuisance that gives you a vague sense of unease.

But the unfortunate reality is that tracking is hard to avoid. With thousands of independent actors using hundreds of different techniques, corporate surveillance is widespread and well-funded. While there’s no switch to flip that can prevent every method of tracking, there’s still a lot that you can do to take back your privacy. This section will go over some of the ways that privacy-conscious users can avoid and disrupt third-party tracking.

Each person should decide for themselves how much effort they’re willing to put into protecting their privacy. Small changes can seriously cut back on the amount of data that trackers can collect and share, like installing EFF’s tracker-blocker extension Privacy Badger in your browser and changing settings on a phone. Bigger changes, like uninstalling third-party apps and using Tor, can offer stronger privacy guarantees at the cost of time, convenience, and sometimes money. Stronger measures may be worth it for users who have serious concerns.

Finally, keep in mind that none of this is your fault. Privacy shouldn’t be a matter of personal responsibility. It’s not your job to obsess over the latest technologies that can secretly monitor you, and you shouldn’t have to read through a quarter million words of privacy-policy legalese to understand how your phone shares data. Privacy should be a right, not a privilege for the well-educated and those flush with spare time. Everyone deserves to live in a world—online and offline—that respects their privacy.

In a better world, the companies that we choose to share our data with would earn our trust, and everyone else would mind their own business. That’s why EFF files lawsuits to compel companies to respect consumers’ data privacy, and why we support legislation that would make privacy the law of the land. With the help of our members and supporters, we are making progress, but changing corporate surveillance policies is a long and winding path. So for now, let’s talk about how you can fight back.

On the web

There are several ways to limit your exposure to tracking on the web. First, your choice of browser matters. Some browser developers take their software’s role as a “user agent” acting on your behalf more seriously than others. Apple’s Safari takes active measures against the most common forms of tracking, including third-party cookies, first-to-third-party cookie sharing, and fingerprinting. Mozilla’s Firefox blocks third-party cookies from known trackers by default, and Firefox’s Private Browsing mode blocks requests to trackers altogether.

Browser extensions like EFF’s Privacy Badger and uBlock Origin offer another layer of protection. In particular, Privacy Badger learns to block trackers using heuristics, which means it might catch new or uncommon trackers that static, list-based blockers miss. This makes Privacy Badger a good supplement to the built-in protections offered by Firefox, which rely on the Disconnect list. And while Google Chrome does not block any tracking behavior by default, installing Privacy Badger or another tracker-blocking extension in Chrome will allow you to use it with relatively little exposure to tracking. (However, planned changes in Chrome will likely affect the security and privacy tools that many use to block tracking.)

The browser extension, Privacy Badger, blocks a third-party tracker

Browser extensions like EFF’s Privacy Badger offer a layer of protection against third-party tracking on the web. Privacy Badger learns to block trackers using heuristics, which means it might catch new or uncommon trackers that static, list-based blockers miss.

No tracker blocker is perfect. All tracker blockers must make exceptions for companies that serve legitimate content. Privacy Badger, for example, maintains a list of domains which are known to perform tracking behaviors as well as serving content that is necessary for many sites to function, such as content delivery networks and video hosts. Privacy Badger restricts those domains’ ability to track by blocking cookies and access to local storage, but dedicated trackers can still access IP addresses, TLS state, and some kinds of fingerprintable data.
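The heuristic approach mentioned above can be sketched simply: a third-party domain observed tracking (for example, setting identifying cookies) on several distinct first-party sites gets blocked. This is a deliberate simplification of Privacy Badger’s real algorithm, and the domain names are hypothetical.

```python
# Simplified sketch of a learning tracker blocker's heuristic:
# block any third-party domain seen tracking on >= 3 distinct sites.
from collections import defaultdict

BLOCK_THRESHOLD = 3
sightings = defaultdict(set)   # tracker domain -> first-party sites seen on

def observe(tracker_domain, first_party_site):
    """Record that tracker_domain performed tracking on first_party_site."""
    sightings[tracker_domain].add(first_party_site)

def should_block(tracker_domain):
    return len(sightings[tracker_domain]) >= BLOCK_THRESHOLD

observe("tracker.example", "news.com")
observe("tracker.example", "shop.com")
# Only seen on two sites so far: not yet blocked.
observe("tracker.example", "blog.com")
# Seen tracking across three distinct sites: now blocked.
```

Because the blocker learns from observed behavior rather than a fixed list, it can catch trackers that no list maintainer has catalogued yet.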

If you’d like to go the extra mile and are comfortable with tinkering, you can install a network-level filter in your home. Pi-hole filters all traffic on a local network at the DNS level. It acts as a personal DNS server, rejecting requests to domains which are known to host trackers. Pi-hole blocks tracking requests coming from devices which are otherwise difficult to configure, like smart TVs, game consoles, and Internet of Things products.
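The DNS-level filtering that Pi-hole performs amounts to answering queries for blocklisted domains with an unroutable address instead of resolving them. The sketch below shows that idea with hypothetical domains; a real filter like Pi-hole is a full DNS server, not a single function.

```python
# Sketch of DNS-level blocking: queries for known tracker domains are
# "sinkholed" rather than resolved. Domains here are hypothetical.

BLOCKLIST = {"tracker.example", "ads.example"}

def resolve(domain, upstream):
    """Answer a DNS query, sinkholing blocklisted domains and their
    subdomains; otherwise forward to a real upstream resolver."""
    parts = domain.split(".")
    for i in range(len(parts) - 1):
        if ".".join(parts[i:]) in BLOCKLIST:
            return "0.0.0.0"   # sinkhole: the request goes nowhere
    return upstream(domain)

ip = resolve("pixel.tracker.example", upstream=lambda d: "198.51.100.5")
# The smart TV's tracking request resolves to 0.0.0.0 and never
# leaves the local network.
```

Because the filtering happens at the network level, it covers every device that uses the filter as its DNS server, including ones you can’t install software on.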

For people who want to reduce their exposure as much as possible, Tor Browser is the gold standard for privacy. Tor uses an onion routing service to totally mask its users’ IP addresses. It takes aggressive steps to reduce fingerprinting, like blocking access to the HTML canvas by default. It completely rejects TLS session tickets and clears cookies at the end of each session.

Unfortunately, browsing the web with Tor in 2019 is not for everyone. It significantly slows down traffic, so pages take much longer to load, and streaming video or other real-time content is very difficult. Worse, much of the modern web relies on invisible CAPTCHAs that block or throttle traffic from sources deemed “suspicious.” Traffic from Tor is frequently classified as high-risk, so doing something as simple as a Google search with Tor can trigger CAPTCHA tests. And since Tor is a public network which attackers also use, some websites will block Tor visitors altogether.

On mobile phones

Blocking trackers on mobile devices is more complicated. There isn’t one solution, like a browser or an extension, that can cover all the bases. And unfortunately, it’s simply not possible to control certain kinds of tracking on certain devices.

The first line of defense against tracking is your device’s settings.

A screenshot of an app permissions page.

Both iOS and Android let users view and control the permissions that each app has access to. You should check the permissions that your apps have, and remove the permissions that aren’t needed. While you are at it, you might simply remove the apps you are not using. In addition to per-app settings, you can change global settings that affect how your device collects and shares particularly sensitive information, like location. You can also control how apps are allowed to access the Internet when they are not in use, which can prevent passive tracking.

Both operating systems also have options to reset your device’s ad ID in different ways. On iOS, you can remove the ad ID entirely by setting it to a string of zeros. (Here are some other ways to block ad tracking on iOS.) On Android, you can manually reset it. This is equivalent to clearing your cookies, but not blocking new ones: it won’t disable tracking entirely, but will make it more difficult for trackers to build a unified profile about you.

Android also has a setting to “opt out of interest-based ads.” This sends a signal to apps that the user does not want to have their data used for targeted ads, but it doesn’t actually stop the apps from doing so by means of the ad ID. Indeed, recent research found that tens of thousands of apps simply ignore the signal.

On iOS, there are a handful of apps that can filter tracking activity from other apps. On Android, it’s not so easy. Google bans ad- and tracker-blockers from its app store, the Play Store, so it has no officially vetted apps of this kind. It’s possible to “side-load” blockers from outside of the Play Store, but this can be very risky. Make sure you only install apps from publishers you trust, preferably with open source code.

You should also think about the networks your devices are communicating with. It is best to avoid connecting to unfamiliar public WiFi networks. If you do, the “free” WiFi probably comes at the cost of your data.

Wireless beacons are also trying to collect information from your device. They can only collect identifying information if your devices are broadcasting their hardware MAC addresses. Both iOS and Android now randomize these MAC addresses by default, but other kinds of devices may not. Your e-reader, smart watch, or car may be broadcasting probe requests that trackers can use to derive location data. To prevent this, you can usually turn off WiFi and Bluetooth or set your device to “airplane mode.” (This is also a good way to save battery!)

Finally, if you really need to be anonymous, using a “burner phone” can help you control tracking associated with inherent hardware identifiers.

IRL

In the real world, opting out isn’t so simple.

As we’ve described, there are many ways to modify the way your devices work to prevent them from working against you. But it’s almost impossible to avoid tracking by face recognition cameras and automatic license plate readers. Sure, you can paint your face to disrupt face recognition algorithms, you can choose not to own a car to stay out of ALPR companies’ databases, and you can use cash or virtual credit cards to stop payment processors from profiling you. But these options aren’t realistic for most people most of the time, and it’s not feasible for anyone to avoid all the tracking that they’re exposed to.

Knowledge is, however, half the battle. For now, face recognition cameras are most likely to identify you in specific locations, like airports, during international travel. ALPR cameras are much more pervasive and harder to avoid, but if absolutely necessary, it is possible to use public transit or other transportation methods to limit how often your vehicle is tracked.

In the legislature

Some jurisdictions have laws to protect users from tracking. The General Data Protection Regulation (GDPR) in the European Union gives those it covers the right to access and delete information that’s been collected about them. It also requires companies to have a legitimate reason to use data, which could come from a “legitimate interest” or opt-in consent. The GDPR is far from perfect, and its effectiveness will depend on how regulators and courts implement it in the years to come. But it gives meaningful rights to users and prescribes real consequences for companies who violate them.

In the U.S., a smattering of state and federal laws offer specific protections to some. Vermont’s data privacy law brings transparency to data brokers. The Illinois Biometric Information Protection Act (BIPA) requires companies to get consent from users before collecting or sharing biometric identifiers. In 2020, the California Consumer Privacy Act (CCPA) will take effect, giving users there the right to access their personal information, delete it, and opt out of its sale. Some communities have passed legislation to limit government use of face recognition, and more plan to pass it soon.

At the federal level, some information in some circumstances is protected by laws like HIPAA, FERPA, COPPA, the Video Privacy Protection Act, and a handful of financial data privacy laws. However, these sector-specific federal statutes apply only to specific types of information about specific types of people when held by specific businesses. They have many gaps, which are exploited by trackers, advertisers, and data brokers.

To make a long story very short, most third-party data collection in the U.S. is unregulated. That’s why EFF advocates for new laws to protect user privacy. People should have the right to know what personal information is collected about them and what is done with it. We should be free from corporate processing of our data unless we give our informed opt-in consent. Companies shouldn’t be able to charge extra or degrade service when users choose to exercise their privacy rights. They should be held accountable when they misuse or mishandle our data. And people should have the right to take companies to court when their privacy is violated.

The first step is to break the one-way mirror. We need to shed light on the tangled network of trackers that lurk in the shadows behind the glass. In the sunlight, these systems of commercial surveillance are exposed for what they are: Orwellian, but not omniscient; entrenched, but not inevitable. Once we, the users, understand what we’re up against, we can fight back.

Source: https://www.eff.org/wp/behind-the-one-way-mirror

the combination of repressive regimes with IT monopolies endows those regimes with a built-in advantage over open societies

Source: https://www.wired.com/story/mortal-danger-chinas-push-into-ai/

Governments and companies worldwide are investing heavily in artificial intelligence in hopes of new profits, smarter gadgets, and better health care. Financier and philanthropist George Soros told the World Economic Forum in Davos Thursday that the technology may also undermine free societies and create a new era of authoritarianism.

“I want to call attention to the mortal danger facing open societies from the instruments of control that machine learning and artificial intelligence can put in the hands of repressive regimes,” Soros said. He made an example of China, repeatedly calling out the country’s president, Xi Jinping.

China’s government issued a broad AI strategy in 2017, asserting that it would surpass US prowess in the technology by 2030. As in the US, much of the leading work on AI in China takes place inside a handful of large tech companies, such as search engine Baidu and retailer and payments company Alibaba.

Soros argued that AI-centric tech companies like those can become enablers of authoritarianism. He pointed to China’s developing “social credit” system, aimed at tracking citizens’ reputations by logging financial activity, online interactions, and even energy use, among other things. The system is still taking shape, but depends on data and cooperation from companies like payments firm Ant Financial, a spinout of Alibaba. “The social credit system, if it became operational, would give Xi Jinping total control over the people,” Soros said.

Soros argued that synergy like that between corporate and government AI projects creates a more potent threat than was posed by Cold War–era autocrats, many of whom spurned corporate innovation. “The combination of repressive regimes with IT monopolies endows those regimes with a built-in advantage over open societies,” Soros said. “They pose a mortal threat to open societies.”

Soros is far from the first to raise an alarm about the dangers of AI technology. It’s a favorite topic of Elon Musk, and last year Henry Kissinger called for a US government commission to examine the technology’s risks. Google cofounder Sergey Brin warned in Alphabet’s most recent annual shareholder letter that AI technology had downsides, including the potential to manipulate people. Canada and France plan to establish an intergovernmental group to study how AI changes societies.

The financier attempted to draft Donald Trump into his AI vigilance campaign. He advised the president to be tougher on Chinese telecoms manufacturers ZTE and Huawei, to prevent them from dominating the high-bandwidth 5G mobile networks being built around the world. Both companies are already reeling from sanctions by the US and other governments.

Soros also urged the well-heeled attendees of Davos to help forge international mechanisms to prevent AI-enhanced authoritarianism—and that could both include and contain China. He asked them to imagine a technologically oriented version of the treaty signed after World War II that underpins the United Nations, binding countries into common standards for human rights and freedoms.

Here is the text of Soros’s speech:

I want to use my time tonight to warn the world about an unprecedented danger that’s threatening the very survival of open societies.

Last year when I stood before you I spent most of my time analyzing the nefarious role of the IT monopolies. This is what I said: “An alliance is emerging between authoritarian states and the large data rich IT monopolies that bring together nascent systems of corporate surveillance with an already developing system of state sponsored surveillance. This may well result in a web of totalitarian control the likes of which not even George Orwell could have imagined.”

Tonight I want to call attention to the mortal danger facing open societies from the instruments of control that machine learning and artificial intelligence can put in the hands of repressive regimes. I’ll focus on China, where Xi Jinping wants a one-party state to reign supreme.

A lot of things have happened since last year and I’ve learned a lot about the shape that totalitarian control is going to take in China.

All the rapidly expanding information available about a person is going to be consolidated in a centralized database to create a “social credit system.” Based on that data, people will be evaluated by algorithms that will determine whether they pose a threat to the one-party state. People will then be treated accordingly.

The social credit system is not yet fully operational, but it’s clear where it’s heading. It will subordinate the fate of the individual to the interests of the one-party state in ways unprecedented in history.

I find the social credit system frightening and abhorrent. Unfortunately, some Chinese find it rather attractive because it provides information and services that aren’t currently available and can also protect law-abiding citizens against enemies of the state.

China isn’t the only authoritarian regime in the world, but it’s undoubtedly the wealthiest, strongest and most developed in machine learning and artificial intelligence. This makes Xi Jinping the most dangerous opponent of those who believe in the concept of open society. But Xi isn’t alone. Authoritarian regimes are proliferating all over the world and if they succeed, they will become totalitarian.

As the founder of the Open Society Foundations, I’ve devoted my life to fighting totalizing, extremist ideologies, which falsely claim that the ends justify the means. I believe that the desire of people for freedom can’t be repressed forever. But I also recognize that open societies are profoundly endangered at present.

What I find particularly disturbing is that the instruments of control developed by artificial intelligence give an inherent advantage to authoritarian regimes over open societies. For them, instruments of control provide a useful tool; for open societies, they pose a mortal threat.

I use “open society” as shorthand for a society in which the rule of law prevails as opposed to rule by a single individual and where the role of the state is to protect human rights and individual freedom. In my personal view, an open society should pay special attention to those who suffer from discrimination or social exclusion and those who can’t defend themselves.

By contrast, authoritarian regimes use whatever instruments of control they possess to maintain themselves in power at the expense of those whom they exploit and suppress.

How can open societies be protected if these new technologies give authoritarian regimes a built-in advantage? That’s the question that preoccupies me. And it should also preoccupy all those who prefer to live in an open society.

Open societies need to regulate companies that produce instruments of control, while authoritarian regimes can declare them “national champions.” That’s what has enabled some Chinese state-owned companies to catch up with and even surpass the multinational giants.

This, of course, isn’t the only problem that should concern us today. For instance, man-made climate change threatens the very survival of our civilization. But the structural disadvantage that confronts open societies is a problem which has preoccupied me and I’d like to share with you my ideas on how to deal with it.

My deep concern for this issue arises out of my personal history. I was born in Hungary in 1930 and I’m Jewish. I was 13 years old when the Nazis occupied Hungary and started deporting Jews to extermination camps.

I was very fortunate because my father understood the nature of the Nazi regime and arranged false identity papers and hiding places for all members of his family, and for a number of other Jews as well. Most of us survived.

The year 1944 was the formative experience of my life. I learned at an early age how important it is what kind of political regime prevails. When the Nazi regime was replaced by Soviet occupation I left Hungary as soon as I could and found refuge in England.

At the London School of Economics I developed my conceptual framework under the influence of my mentor, Karl Popper. That framework proved to be unexpectedly useful when I found myself a job in the financial markets. The framework had nothing to do with finance, but it is based on critical thinking. This allowed me to analyze the deficiencies of the prevailing theories guiding institutional investors. I became a successful hedge fund manager and I prided myself on being the best paid critic in the world.

Running a hedge fund was very stressful. When I had made more money than I needed for myself or my family, I underwent a kind of midlife crisis. Why should I kill myself to make more money? I reflected long and hard on what I really cared about and in 1979 I set up the Open Society Fund. I defined its objectives as helping to open up closed societies, reducing the deficiencies of open societies and promoting critical thinking.

My first efforts were directed at undermining the apartheid system in South Africa. Then I turned my attention to opening up the Soviet system. I set up a joint venture with the Hungarian Academy of Science, which was under Communist control, but its representatives secretly sympathized with my efforts. This arrangement succeeded beyond my wildest dreams. I got hooked on what I like to call “political philanthropy.” That was in 1984.

In the years that followed, I tried to replicate my success in Hungary and in other Communist countries. I did rather well in the Soviet empire, including the Soviet Union itself, but in China it was a different story.

My first effort in China looked rather promising. It involved an exchange of visits between Hungarian economists who were greatly admired in the Communist world, and a team from a newly established Chinese think tank which was eager to learn from the Hungarians.

Based on that initial success, I proposed to Chen Yizi, the leader of the think tank, to replicate the Hungarian model in China. Chen obtained the support of Premier Zhao Ziyang and his reform-minded policy secretary Bao Tong.

A joint venture called the China Fund was inaugurated in October 1986. It was an institution unlike any other in China. On paper, it had complete autonomy.

Bao Tong was its champion. But the opponents of radical reforms, who were numerous, banded together to attack him. They claimed that I was a CIA agent and asked the internal security agency to investigate. To protect himself, Zhao Ziyang replaced Chen Yizi with a high-ranking official in the external security police. The two organizations were co-equal and they couldn’t interfere in each other’s affairs.

I approved this change because I was annoyed with Chen Yizi for awarding too many grants to members of his own institute and I was unaware of the political infighting behind the scenes. But applicants to the China Fund soon noticed that the organization had come under the control of the political police and started to stay away. Nobody had the courage to explain to me the reason for it.

Eventually, a Chinese grantee visited me in New York and, at considerable risk to himself, told me the reason. Soon thereafter, Zhao Ziyang was removed from power and I used that excuse to close the foundation. This happened just before the Tiananmen Square massacre in 1989 and it left a “black spot” on the record of the people associated with the foundation. They went to great lengths to clear their names and eventually they succeeded.

In retrospect, it’s clear that I made a mistake in trying to establish a foundation which operated in ways that were alien to people in China. At that time, giving a grant created a sense of mutual obligation between the donor and recipient and obliged both of them to remain loyal to each other forever.

So much for history. Let me now turn to the events that occurred in the last year, some of which surprised me.

When I first started going to China, I met many people in positions of power who were fervent believers in the principles of open society. In their youth they had been deported to the countryside to be re-educated, often suffering hardships far greater than mine in Hungary. But they survived and we had much in common. We had all been on the receiving end of a dictatorship.

They were eager to learn about Karl Popper’s thoughts on the open society. While they found the concept very appealing, their interpretation remained somewhat different from mine. They were familiar with Confucian tradition, but there was no tradition of voting in China. Their thinking remained hierarchical and carried a built-in respect for high office. I, on the other hand, was more egalitarian and wanted everyone to have a vote.

So, I wasn’t surprised when Xi Jinping ran into serious opposition at home; but I was surprised by the form it took. At last summer’s leadership convocation at the seaside resort of Beidaihe, Xi Jinping was apparently taken down a peg or two. Although there was no official communique, rumor had it that the convocation disapproved of the abolition of term limits and the cult of personality that Xi had built around himself.

It’s important to realize that such criticisms were only a warning to Xi about his excesses, but did not reverse the lifting of the two-term limit. Moreover, “The Thought of Xi Jinping,” which he promoted as his distillation of Communist theory, was elevated to the same level as the “Thought of Chairman Mao.” So Xi remains the supreme leader, possibly for life. The ultimate outcome of the current political infighting remains unresolved.

I’ve been concentrating on China, but open societies have many more enemies, Putin’s Russia foremost among them. And the most dangerous scenario is when these enemies conspire with, and learn from, each other on how to better oppress their people.

The question poses itself, what can we do to stop them?

The first step is to recognize the danger. That’s why I’m speaking out tonight. But now comes the difficult part. Those of us who want to preserve the open society must work together and form an effective alliance. We have a task that can’t be left to governments.

History has shown that even governments that want to protect individual freedom have many other interests and they also give precedence to the freedom of their own citizens over the freedom of the individual as a general principle.

My Open Society Foundations are dedicated to protecting human rights, especially for those who don’t have a government defending them. When we started four decades ago there were many governments which supported our efforts but their ranks have thinned out. The US and Europe were our strongest allies, but now they’re preoccupied with their own problems.

Therefore, I want to focus on what I consider the most important question for open societies: what will happen in China?

The question can be answered only by the Chinese people. All we can do is to draw a sharp distinction between them and Xi Jinping. Since Xi has declared his hostility to open society, the Chinese people remain our main source of hope.

And there are, in fact, grounds for hope. As some China experts have explained to me, there is a Confucian tradition according to which advisors of the emperor are expected to speak out when they strongly disagree with one of his actions or decrees, even if that may result in exile or execution.

This came as a great relief to me when I had been on the verge of despair. The committed defenders of open society in China, who are around my age, have mostly retired and their places have been taken by younger people who are dependent on Xi Jinping for promotion. But a new political elite has emerged that is willing to uphold the Confucian tradition. This means that Xi will continue to have a political opposition at home.

Xi presents China as a role model for other countries to emulate, but he’s facing criticism not only at home but also abroad. His Belt and Road Initiative has been in operation long enough to reveal its deficiencies.

It was designed to promote the interests of China, not the interests of the recipient countries; its ambitious infrastructure projects were mainly financed by loans, not by grants, and foreign officials were often bribed to accept them. Many of these projects proved to be uneconomic.

The iconic case is in Sri Lanka. China built a port that serves its strategic interests. It failed to attract sufficient commercial traffic to service the debt and enabled China to take possession of the port. There are several similar cases elsewhere and they’re causing widespread resentment.

Malaysia is leading the pushback. The previous government headed by Najib Razak sold out to China, but in May 2018 Razak was voted out of office by a coalition led by Mahathir Mohamed. Mahathir immediately stopped several big infrastructure projects and is currently negotiating with China over how much compensation Malaysia will still have to pay.

The situation is not as clear-cut in Pakistan, which has been the largest recipient of Chinese investments. The Pakistani army is fully beholden to China but the position of Imran Khan who became prime minister last August is more ambivalent. At the beginning of 2018, China and Pakistan announced grandiose plans in military cooperation. By the end of the year, Pakistan was in a deep financial crisis. But one thing became evident: China intends to use the Belt and Road Initiative for military purposes as well.

All these setbacks have forced Xi Jinping to modify his attitude toward the Belt and Road Initiative. In September, he announced that “vanity projects” will be shunned in favor of more carefully conceived initiatives and in October, the People’s Daily warned that projects should serve the interests of the recipient countries.

Customers are now forewarned and several of them, ranging from Sierra Leone to Ecuador, are questioning or renegotiating projects.

Most importantly, the US government has now identified China as a “strategic rival.” President Trump is notoriously unpredictable, but this decision was the result of a carefully prepared plan. Since then, the idiosyncratic behavior of Trump has been largely superseded by a China policy adopted by the agencies of the administration and overseen by Matt Pottinger, the National Security Council’s advisor for Asian affairs, and others. The policy was outlined in a seminal speech by Vice President Mike Pence on October 4th.

Even so, declaring China a strategic rival is too simplistic. China is an important global actor. An effective policy towards China can’t be reduced to a slogan.

It needs to be far more sophisticated, detailed and practical; and it must include an American economic response to the Belt and Road Initiative. The Pottinger plan doesn’t answer the question whether its ultimate goal is to level the playing field or to disengage from China altogether.

Xi Jinping fully understood the threat that the new US policy posed for his leadership. He gambled on a personal meeting with President Trump at the G20 meeting in Buenos Aires. In the meantime, the danger of global trade war escalated and the stock market embarked on a serious sell-off in December. This created problems for Trump who had concentrated all his efforts on the 2018 midterm elections. When Trump and Xi met, both sides were eager for a deal. No wonder that they reached one, but it’s very inconclusive: a ninety-day truce.

In the meantime, there are clear indications that a broad based economic decline is in the making in China, which is affecting the rest of the world. A global slowdown is the last thing the market wants to see.

The unspoken social contract in China is built on steadily rising living standards. If the decline in the Chinese economy and stock market is severe enough, this social contract may be undermined and even the business community may turn against Xi Jinping. Such a downturn could also sound the death knell of the Belt and Road Initiative, because Xi may run out of resources to continue financing so many lossmaking investments.

On the question of global internet governance, there’s an undeclared struggle between the West and China. China wants to dictate the rules and procedures that govern the digital economy by dominating the developing world with its new platforms and technologies. This is a threat to the freedom of the Internet and, indirectly, to open society itself.

Last year I still believed that China ought to be more deeply embedded in the institutions of global governance, but since then Xi Jinping’s behavior has changed my opinion. My present view is that instead of waging a trade war with practically the whole world, the US should focus on China. Instead of letting ZTE and Huawei off lightly, it needs to crack down on them. If these companies came to dominate the 5G market, they would present an unacceptable security risk for the rest of the world.

Regrettably, President Trump seems to be following a different course: make concessions to China and declare victory while renewing his attacks on US allies. This is liable to undermine the US policy objective of curbing China’s abuses and excesses.

To conclude, let me summarize the message I’m delivering tonight. My key point is that the combination of repressive regimes with IT monopolies endows those regimes with a built-in advantage over open societies. The instruments of control are useful tools in the hands of authoritarian regimes, but they pose a mortal threat to open societies.

China is not the only authoritarian regime in the world but it is the wealthiest, strongest and technologically most advanced. This makes Xi Jinping the most dangerous opponent of open societies. That’s why it’s so important to distinguish Xi Jinping’s policies from the aspirations of the Chinese people. The social credit system, if it became operational, would give Xi total control over the people. Since Xi is the most dangerous enemy of the open society, we must pin our hopes on the Chinese people, and especially on the business community and a political elite willing to uphold the Confucian tradition.

This doesn’t mean that those of us who believe in the open society should remain passive. The reality is that we are in a Cold War that threatens to turn into a hot one. On the other hand, if Xi and Trump were no longer in power, an opportunity would present itself to develop greater cooperation between the two cyber-superpowers.

It is possible to dream of something similar to the United Nations Treaty that arose out of the Second World War. This would be the appropriate ending to the current cycle of conflict between the US and China. It would reestablish international cooperation and allow open societies to flourish. That sums up my message.

Steve Rymell, Head of Technology at Airbus CyberSecurity, answers: What Should Frighten Us About AI-Based Malware?

Of all the cybersecurity industry’s problems, one of the most striking is the way attackers are often able to stay one step ahead of defenders without working terribly hard. It’s an issue whose root causes are mostly technical: the prime example is software vulnerabilities, which cyber-criminals have a habit of finding out about before vendors and their customers do, leading to the almost undefendable zero-day phenomenon which has propelled many famous cyber-attacks.

A second is that organizations struggling with the complexity of unfamiliar and new technologies make mistakes, inadvertently leaving vulnerable ports and services exposed. Starkest of all, perhaps, is the way techniques, tools, and infrastructure set up to help organizations defend themselves (Shodan, for example, but also numerous pen-test tools) are now just as likely to be turned against businesses by attackers who tear into networks with the aggression of red teams gone rogue.

Add to this the polymorphic nature of modern malware, and attackers can appear so conceptually unstoppable that it’s no wonder security vendors increasingly emphasize the need not simply to block attacks but to respond to them as quickly as possible.

The AI fightback
Some years back, a number of mostly US-based start-ups mounted a bit of a counter-attack against the doom and gloom with a brave new idea – security powered by machine-learning (ML) algorithms. In an age of big data, this makes complete sense, and the idea has since been taken up by all manner of systems used for anti-spam, malware detection, threat analysis and intelligence, and Security Operations Centre (SOC) automation, where it has been proposed to help patch skills shortages.
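To make the idea of algorithm-powered filtering concrete, here is a deliberately minimal naive Bayes spam classifier in pure Python. It is a toy sketch, not any vendor’s implementation: the whitespace tokenizer, the four-message training set, and the add-one smoothing are all my own simplifying assumptions.

```python
import math
from collections import Counter

def train(messages):
    """Count token frequencies per class from (text, label) pairs."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for text, label in messages:
        counts[label].update(text.lower().split())
        totals[label] += 1
    return counts, totals

def classify(text, counts, totals):
    """Return the class with the higher log posterior under naive Bayes."""
    vocab = set(counts["spam"]) | set(counts["ham"])
    scores = {}
    for label in ("spam", "ham"):
        # Log prior plus log likelihoods with add-one (Laplace) smoothing.
        score = math.log(totals[label] / sum(totals.values()))
        denom = sum(counts[label].values()) + len(vocab)
        for token in text.lower().split():
            score += math.log((counts[label][token] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

training = [
    ("win cash prize now", "spam"),
    ("claim your free prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday?", "ham"),
]
counts, totals = train(training)
print(classify("free cash now", counts, totals))  # → spam
```

Production systems layer far richer features and models on top, but the core trade-off is the same: the filter is only as good as the data it was trained on.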

I’d rate these as useful advances, but there’s no getting away from the controversial nature of the theory, which has been branded by some as the ultimate example of technology as a ‘black box’ nobody really understands. How do we know that machine learning is able to detect new and unknown types of attack that conventional systems fail to spot? In some cases, it could be because the product brochure says so.

Then the even bigger gotcha hits you – what’s stopping attackers from outfoxing defensive ML with even better ML of their own? If this were possible, even some of the time, the industry would find itself back at square one.

This is pure speculation, of course, because to date nobody has detected AI being used in a cyber-attack, which is why our understanding of how it might work remains largely based around academic research such as IBM’s proof-of-concept DeepLocker malware project.

What might malicious ML look like?
It would be unwise to ignore the potential for trouble. One of the biggest hurdles faced by attackers is quickly understanding what works, for example when sending spam, phishing and, increasingly, political disinformation.

It’s not hard to imagine that big data techniques allied to ML could hugely improve the efficiency of these threats by analyzing how targets react to and share them in real time. This implies the possibility that such campaigns might one day evolve in a matter of hours or minutes: a timescale defenders would struggle to counter using today’s technologies.

A second scenario is one that defenders would never even see: cyber-criminals might simulate the defenses of a target using their own ML to gauge the success of different attacks (a technique already routinely used to evade anti-virus). Once again, this exploits the advantage that attackers always enjoy: sight of the target, while defenders must rely on good guesses.

Or perhaps ML could simply be used to crank out far more new and unique malware than is possible today. Whichever of these approaches is taken – and this is only a sample of the possibilities – it jumps out at you how awkward it would be to defend against even relatively simple ML-based attacks. About the only consolation is that if ML-based AI really is a black box that nobody understands then, logically, the attackers won’t understand it either and will waste time experimenting.

Unintended consequences
If we should fear anything it’s precisely this black box effect. There are two parts to this, the biggest of which is the potential for ML-based malware to cause something unintended to happen, especially when targeting critical infrastructure.

This phenomenon has already come to pass with non-AI malware – Stuxnet in 2010 and NotPetya in 2017 are the obvious examples – both of which infected thousands of organizations not on their original target list after unexpectedly ‘escaping’ into the wild.

When it comes to powerful malware exploiting multiple zero days there’s no such thing as a reliably contained attack. Once released, this kind of malware remains pathogenically dangerous until every system it can infect is patched or taken offline, which might be years or decades down the line.

Another anxiety is that because the expertise to understand ML is still thin on the ground, there’s a danger that engineers could come to rely on it without fully understanding its limitations, both for defense and by over-estimating its usefulness in attack. The mistake, then, might be that too many over-invest in it based on marketing promises that end up consuming resources better deployed elsewhere. Once a more realistic assessment takes hold, ML could end up as just another tool that is good at solving certain very specific problems.

Conclusion
My contradictory-sounding conclusion is that perhaps ML and AI make no fundamental difference at all. They are just another stop on a journey computer security has been making since the beginning of digital time. The problem is overcoming our preconceptions about what the technology is and what it means. Chiefly, we must overcome the tendency to think of ML and AI as mysteriously ‘other’ because we don’t understand them and therefore find it difficult to process the concept of machines making complex decisions.

It’s not as if attackers aren’t breaching networks already with today’s pre-ML technology or that well-prepared defenders aren’t regularly stopping them using the same technology. What AI reminds us is that the real difference is how organizations are defended, not whether they or their attackers use ML and AI or not. That has always been what separates survivors from victims. Cybersecurity remains a working demonstration of how the devil takes the hindmost.

Source: https://www.infosecurity-magazine.com/opinions/frighten-ai-malware-1/

45 Techniques Used by Data Scientists

These techniques cover most of what data scientists and related practitioners are using in their daily activities, whether they use solutions offered by a vendor, or whether they design proprietary tools. When you click on any of the 45 links below, you will find a selection of articles related to the entry in question. Most of these articles are hard to find with a Google search, so in some ways this gives you access to the hidden literature on data science, machine learning, and statistical science. Many of these articles are fundamental to understanding the technique in question, and come with further references and source code.

Starred techniques (marked with a *) belong to what I call deep data science, a branch of data science that has little if any overlap with closely related fields such as machine learning, computer science, operations research, mathematics, or statistics. Even classical machine learning and statistical techniques such as clustering, density estimation, or tests of hypotheses have model-free, data-driven, robust versions designed for automated processing (as in machine-to-machine communications), and thus also belong to deep data science. However, these techniques are not starred here, as the standard versions are better known (and unfortunately more used) than their deep data science equivalents.

To learn more about deep data science, click here. Note that unlike deep learning, deep data science is not the intersection of data science and artificial intelligence; however, the analogy between deep data science and deep learning is not completely meaningless, in the sense that both deal with automation.

Also, to discover in which contexts and applications the 45 techniques below are used, I invite you to read the following articles:

Finally, when using a technique, you need to test its performance. Read this article about 11 Important Model Evaluation Techniques Everyone Should Know.

The 45 data science techniques

  1. Linear Regression
  2. Logistic Regression
  3. Jackknife Regression *
  4. Density Estimation
  5. Confidence Interval
  6. Test of Hypotheses
  7. Pattern Recognition
  8. Clustering – (aka Unsupervised Learning)
  9. Supervised Learning
  10. Time Series
  11. Decision Trees
  12. Random Numbers
  13. Monte-Carlo Simulation
  14. Bayesian Statistics
  15. Naive Bayes
  16. Principal Component Analysis – (PCA)
  17. Ensembles
  18. Neural Networks
  19. Support Vector Machine – (SVM)
  20. Nearest Neighbors – (k-NN)
  21. Feature Selection – (aka Variable Reduction)
  22. Indexation / Cataloguing *
  23. (Geo-) Spatial Modeling
  24. Recommendation Engine *
  25. Search Engine *
  26. Attribution Modeling *
  27. Collaborative Filtering *
  28. Rule System
  29. Linkage Analysis
  30. Association Rules
  31. Scoring Engine
  32. Segmentation
  33. Predictive Modeling
  34. Graphs
  35. Deep Learning
  36. Game Theory
  37. Imputation
  38. Survival Analysis
  39. Arbitrage
  40. Lift Modeling
  41. Yield Optimization
  42. Cross-Validation
  43. Model Fitting
  44. Relevancy Algorithm *
  45. Experimental Design
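To make one entry on the list concrete, here is a minimal pure-Python sketch of Monte-Carlo simulation (#13): estimating pi by drawing random points in the unit square and counting how many fall inside the quarter circle. This is my illustration, not code from the linked articles:

```python
import random

def monte_carlo_pi(n_samples: int, seed: int = 42) -> float:
    """Estimate pi by uniform sampling in the unit square: the fraction
    of points with x^2 + y^2 <= 1 approximates pi/4."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_samples

print(monte_carlo_pi(100_000))  # close to 3.14
```

The same sampling skeleton underlies many of the listed techniques (e.g. density estimation or bootstrap-style confidence intervals): replace the indicator function with the statistic of interest.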

Source: https://www.datasciencecentral.com/profiles/blogs/40-techniques-used-by-data-scientists

Alexa, do you work for the NSA? ;-)

Tens of millions of people use smart speakers and their voice software to play games, find music or trawl for trivia. Millions more are reluctant to invite the devices and their powerful microphones into their homes out of concern that someone might be listening.

Sometimes, someone is.

Amazon.com Inc. employs thousands of people around the world to help improve the Alexa digital assistant powering its line of Echo speakers. The team listens to voice recordings captured in Echo owners’ homes and offices. The recordings are transcribed, annotated and then fed back into the software as part of an effort to eliminate gaps in Alexa’s understanding of human speech and help it better respond to commands.

The Alexa voice review process, described by seven people who have worked on the program, highlights the often-overlooked human role in training software algorithms. In marketing materials Amazon says Alexa “lives in the cloud and is always getting smarter.” But like many software tools built to learn from experience, humans are doing some of the teaching.

The team comprises a mix of contractors and full-time Amazon employees who work in outposts from Boston to Costa Rica, India and Romania, according to the people, who signed nondisclosure agreements barring them from speaking publicly about the program. They work nine hours a day, with each reviewer parsing as many as 1,000 audio clips per shift, according to two workers based at Amazon’s Bucharest office, which takes up the top three floors of the Globalworth building in the Romanian capital’s up-and-coming Pipera district. The modern facility stands out amid the crumbling infrastructure and bears no exterior sign advertising Amazon’s presence.

The work is mostly mundane. One worker in Boston said he mined accumulated voice data for specific utterances such as “Taylor Swift” and annotated them to indicate the searcher meant the musical artist. Occasionally the listeners pick up things Echo owners likely would rather stay private: a woman singing badly off key in the shower, say, or a child screaming for help. The teams use internal chat rooms to share files when they need help parsing a muddled word—or come across an amusing recording.

Amazon has offices in this Bucharest building. (Photographer: Irina Vilcu/Bloomberg)

Sometimes they hear recordings they find upsetting, or possibly criminal. Two of the workers said they picked up what they believe was a sexual assault. When something like that happens, they may share the experience in the internal chat room as a way of relieving stress. Amazon says it has procedures in place for workers to follow when they hear something distressing, but two Romania-based employees said that, after requesting guidance for such cases, they were told it wasn’t Amazon’s job to interfere.

“We take the security and privacy of our customers’ personal information seriously,” an Amazon spokesman said in an emailed statement. “We only annotate an extremely small sample of Alexa voice recordings in order [to] improve the customer experience. For example, this information helps us train our speech recognition and natural language understanding systems, so Alexa can better understand your requests, and ensure the service works well for everyone.

“We have strict technical and operational safeguards, and have a zero tolerance policy for the abuse of our system. Employees do not have direct access to information that can identify the person or account as part of this workflow. All information is treated with high confidentiality and we use multi-factor authentication to restrict access, service encryption and audits of our control environment to protect it.”

Amazon, in its marketing and privacy policy materials, doesn’t explicitly say humans are listening to recordings of some conversations picked up by Alexa. “We use your requests to Alexa to train our speech recognition and natural language understanding systems,” the company says in a list of frequently asked questions.

In Alexa’s privacy settings, Amazon gives users the option of disabling the use of their voice recordings for the development of new features. The company says people who opt out of that program might still have their recordings analyzed by hand over the regular course of the review process. A screenshot reviewed by Bloomberg shows that the recordings sent to the Alexa reviewers don’t provide a user’s full name and address but are associated with an account number, as well as the user’s first name and the device’s serial number.

The Intercept reported earlier this year that employees of Amazon-owned Ring manually identify vehicles and people in videos captured by the company’s doorbell cameras, an effort to better train the software to do that work itself.

“You don’t necessarily think of another human listening to what you’re telling your smart speaker in the intimacy of your home,” said Florian Schaub, a professor at the University of Michigan who has researched privacy issues related to smart speakers. “I think we’ve been conditioned to the [assumption] that these machines are just doing magic machine learning. But the fact is there is still manual processing involved.”

“Whether that’s a privacy concern or not depends on how cautious Amazon and other companies are in what type of information they have manually annotated, and how they present that information to someone,” he added.

When the Echo debuted in 2014, Amazon’s cylindrical smart speaker quickly popularized the use of voice software in the home. Before long, Alphabet Inc. launched its own version, called Google Home, followed by Apple Inc.’s HomePod. Various companies also sell their own devices in China. Globally, consumers bought 78 million smart speakers last year, according to researcher Canalys. Millions more use voice software to interact with digital assistants on their smartphones.

Alexa software is designed to continuously record snatches of audio, listening for a wake word. That’s “Alexa” by default, but people can change it to “Echo” or “computer.” When the wake word is detected, the light ring at the top of the Echo turns blue, indicating the device is recording and beaming a command to Amazon servers.
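As a toy illustration of the wake-word flow described above (my sketch, not Amazon's code: a real detector matches acoustic features on-device, not transcribed words, and the function name is hypothetical):

```python
# Default wake word plus the alternatives the article mentions.
WAKE_WORDS = {"alexa", "echo", "computer"}

def extract_command(words):
    """Scan a transcribed word stream; return everything after the first
    wake word (the part that would be sent to the servers), or None if
    no wake word was detected (nothing is supposed to be stored)."""
    for i, w in enumerate(words):
        if w.lower().strip(",.?!") in WAKE_WORDS:
            return words[i + 1:]
    return None

print(extract_command(["Alexa,", "play", "music"]))  # ['play', 'music']
print(extract_command(["hello", "world"]))           # None
```

The mistaken activations the article describes later correspond to this check firing on audio that merely resembles a wake word.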

An Echo smart speaker inside an Amazon 4-star store in Berkeley, California. (Photographer: Cayce Clifford/Bloomberg)

Most modern speech-recognition systems rely on neural networks patterned on the human brain. The software learns as it goes, by spotting patterns amid vast amounts of data. The algorithms powering the Echo and other smart speakers use models of probability to make educated guesses. If someone asks Alexa if there’s a Greek place nearby, the algorithms know the user is probably looking for a restaurant, not a church or community center.

But sometimes Alexa gets it wrong—especially when grappling with new slang, regional colloquialisms or languages other than English. In French, avec sa, “with his” or “with her,” can confuse the software into thinking someone is using the Alexa wake word. Hecho, Spanish for a fact or deed, is sometimes misinterpreted as Echo. And so on. That’s why Amazon recruited human helpers to fill in the gaps missed by the algorithms.
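A crude way to see why near-homophones like "hecho" cause false wakes: substitute string similarity for the acoustic similarity a real detector scores. This difflib-based stand-in is my illustration, not Amazon's method:

```python
from difflib import SequenceMatcher

def wake_score(candidate: str, wake_word: str = "echo") -> float:
    """String-similarity stand-in for the acoustic match score a
    wake-word detector computes (real systems compare audio features)."""
    return SequenceMatcher(None, candidate.lower(), wake_word).ratio()

# "hecho" scores almost as high as the wake word itself, while an
# unrelated word scores low -- mirroring the false-wake problem.
for w in ["echo", "hecho", "gecko", "table"]:
    print(w, round(wake_score(w), 2))
```

Any fixed decision threshold between the "hecho" and "table" scores trades false wakes against missed commands, which is exactly the gap the human reviewers are hired to close.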

Apple’s Siri also has human helpers, who work to gauge whether the digital assistant’s interpretation of requests lines up with what the person said. The recordings they review lack personally identifiable information and are stored for six months tied to a random identifier, according to an Apple security white paper. After that, the data is stripped of its random identification information but may be stored for longer periods to improve Siri’s voice recognition.

At Google, some reviewers can access some audio snippets from its Assistant to help train and improve the product, but it’s not associated with any personally identifiable information and the audio is distorted, the company says.

A recent Amazon job posting, seeking a quality assurance manager for Alexa Data Services in Bucharest, describes the role humans play: “Every day she [Alexa] listens to thousands of people talking to her about different topics and different languages, and she needs our help to make sense of it all.” The want ad continues: “This is big data handling like you’ve never seen it. We’re creating, labeling, curating and analyzing vast quantities of speech on a daily basis.”

Amazon’s review process for speech data begins when Alexa pulls a random, small sampling of customer voice recordings and sends the audio files to the far-flung employees and contractors, according to a person familiar with the program’s design.
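The sampling step can be sketched as follows; the sampling fraction, seeding, and ID scheme here are my assumptions, since the article says only that the sample is random and small:

```python
import random

def sample_for_review(recording_ids, fraction=0.001, seed=7):
    """Toy sketch: draw a small random sample (without replacement) of
    stored recording IDs to route to human annotators."""
    rng = random.Random(seed)
    k = max(1, int(len(recording_ids) * fraction))
    return rng.sample(recording_ids, k)

sample = sample_for_review(list(range(100_000)))
print(len(sample))  # 100
```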

The Echo Spot. (Photographer: Daniel Berman/Bloomberg)

Some Alexa reviewers are tasked with transcribing users’ commands, comparing the recordings to Alexa’s automated transcript, say, or annotating the interaction between user and machine. What did the person ask? Did Alexa provide an effective response?

Others note everything the speaker picks up, including background conversations—even when children are speaking. Sometimes listeners hear users discussing private details such as names or bank details; in such cases, they’re supposed to tick a dialog box denoting “critical data.” They then move on to the next audio file.

According to Amazon’s website, no audio is stored unless Echo detects the wake word or is activated by pressing a button. But sometimes Alexa appears to begin recording without any prompt at all, and the audio files start with a blaring television or unintelligible noise. Whether or not the activation is mistaken, the reviewers are required to transcribe it. One of the people said the auditors each transcribe as many as 100 recordings a day when Alexa receives no wake command or is triggered by accident.

In homes around the world, Echo owners frequently speculate about who might be listening, according to two of the reviewers. “Do you work for the NSA?” they ask. “Alexa, is someone else listening to us?”

— With assistance by Gerrit De Vynck, Mark Gurman, and Irina Vilcu

Source: https://www.bloomberg.com/news/articles/2019-04-10/is-anyone-listening-to-you-on-alexa-a-global-team-reviews-audio