In the era of Big Data — where public and private data is abundant, and keeps growing — everyday residents might not realize how they generate volumes of data throughout their days in ways that are both innocent and serious; visibly apparent and covert. Their data reveals both mundane and intimate details about their habits, movements, and lifestyles.
Every time a person uses an app to order a meal, log their 10,000 steps, look up driving directions, buy a coffee, or report a public works issue, they’re generating data. As consumers, we appreciate the conveniences this connected world provides, and we accept the trade-off: surrendering a bit of personal data in exchange for a simple convenience. But that acceptance rests on the belief that the holders and downstream users of this data will act in a principled way. Some will; some won’t. Therein lies the rub.
The relatively new practice of processing all of this data goes beyond long-standing disciplines like statistics and survey analysis. Technologists now use machine learning and artificial intelligence to process these large volumes of information to try to make them actionable and insightful.
Indeed, the combination of huge amounts of data plus powerful computing leads to an ethical fork in the road. Cue the literary references: We’ve opened Pandora’s Box. We’ve created a Frankenstein of data. With great power comes great responsibility.
But in real-life practice, we cannot shrug off our ethics to a metaphor.
Privacy is a real issue with real risks. This is an issue worthy of thoughtful, nuanced debate to balance the public’s interest against such risks. The world of passive data has largely operated in an unregulated, Wild West fashion for too long. A common misperception among users of data is that privacy and quality insights are somehow at odds with each other.
At Replica, we believe that in order to gather useful insights — such as timely and informed decisions on how to keep residents and essential workers moving and safe during Covid — we shouldn’t need to sacrifice individual privacy.
Absent regulation or standards, the scales of data insights versus privacy will continue to operate in a gray area where the holders and consumers of this data are left to self-police. Without clear rules of engagement, the next era of data usage requires ethical leadership and the courage to do the right thing.
This self-policing is exactly why we at Replica, as technologists, data scientists, and former public officials, cannot and will not compromise on ethical leadership when it comes to privacy protection.
Replica’s technology builds a model from each data source independently, abstracting out potentially identifying details of any individual before those models are combined into our aggregate outputs. We never attempt to re-identify individuals from our source data, and our contracts forbid our customers from doing so as well.
We also have a goal at Replica to level the playing field between the public and private sectors. We’ve seen how the public sector frequently finds itself at the negotiation table with the large, data-rich companies behind popular consumer services. These companies hold so much data that they often know more about what’s happening in a city than the city itself. And because that app data is first-party, self-collected information, the companies can claim to protect their own users’ privacy while still leveraging that data to shape public policy and public opinion. We believe this private versus public information asymmetry is fundamentally unfair.
This paradox has put public agencies at a significant disadvantage. They are unequipped, from a data and tooling perspective, to negotiate or regulate in effective, equitable ways — potentially undermining democratic norms and the public institutions that uphold them.
Replica has developed a way to offer the same powerful tools and data to public agencies without compromising privacy or exposing the agency to privacy risk. In simplified terms, we build “synthetic data”: computer-generated data that preserves the statistical properties of the original information without disclosing the actual original, raw data itself.
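To make the idea concrete, here is a minimal sketch of the general synthetic-data technique (not Replica’s actual method, and using hypothetical trip records invented for illustration): fit simple statistical models to the original columns, then sample entirely new records from those models. Aggregate properties survive; no real individual’s record does.

```python
# Minimal illustration of synthetic data generation (a generic sketch,
# not Replica's production approach): summarize each numeric column of
# the original records, then sample brand-new records from those
# summaries. The output mirrors aggregate behavior without reproducing
# any original row.
import random
import statistics


def fit_model(records):
    """Summarize each numeric column by its mean and standard deviation."""
    columns = records[0].keys()
    return {
        col: (
            statistics.mean(r[col] for r in records),
            statistics.stdev(r[col] for r in records),
        )
        for col in columns
    }


def sample_synthetic(model, n, seed=0):
    """Draw n computer-generated records from the fitted column models."""
    rng = random.Random(seed)
    return [
        {col: rng.gauss(mu, sigma) for col, (mu, sigma) in model.items()}
        for _ in range(n)
    ]


# Hypothetical trip data: travel time (minutes) and distance (miles).
original = [
    {"travel_min": 22.0, "miles": 5.1},
    {"travel_min": 35.0, "miles": 8.7},
    {"travel_min": 18.0, "miles": 4.2},
    {"travel_min": 41.0, "miles": 10.3},
]

model = fit_model(original)
synthetic = sample_synthetic(model, n=1000)
```

A real system would model joint distributions, rare categories, and re-identification risk far more carefully; the point here is only that analysts can query `synthetic` for averages and trends while the raw records stay private.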
Given the relative newness of this issue and the highly technical and nuanced elements of privacy, many public agencies act as if the need for more data should supersede the need for protecting privacy. The reality is, Replica has shown you don’t need to make this trade-off.
With that, here is the set of privacy principles we hold ourselves to in all of the work that we do. We encourage public agencies to raise the bar and hold the companies from which they source information to the same standard:
We are in the process of creating an open-source template for privacy-protecting vendor agreements that will codify these principles. We hope this template will serve as a guide to all those who have the same commitment to privacy as we do. If others are interested in this process, we encourage you to send an email to firstname.lastname@example.org.
We call on other trailblazers in our field, those who are mining piles of data for nuggets of insight, to join us in raising the bar for the ethical use of data and defining a high standard from the outset. When tempted to compromise for the sake of a juicy insight or even a laudable public policy goal, absent clear regulations or standards, we must stand firm on our privacy principles and remember that we can still uncover useful information that is ethically sourced and privacy-protected.