
Making Simple Fake People

What makes a person a person is quite the question. Some would say the soul, some the personality, some the flesh and blood. Fortunately, I don’t have to care for this exercise. All I’m trying to make is data that can sit in your tbl_person equivalent. This is Part 2 of the {And Then There Was Light} series, an indefinite ongoing feature.

The Easy Part – Basic Demographic Info

Name

Data Source: US Census data [Link to it]

The US Government posts given names and surnames ranked by popularity. I randomly select from the list of names with a uniform distribution because I want a good mix without much repetition. In my opinion, it looks better to have a Jack, a Jeff, and a Zhou than two Jacks and a Jeff. It’s obligatory to mention Patrick McKenzie’s piece on names (Link), which can basically be summed up as “Shit’s fucking hard. You will fail.” But that’s fine, I’m just worried about being able to create a lot of average people.

So I randomly select two gender-appropriate first names from the list of first names to be a first and middle name, and then one from the surnames, to create a very common name format. It’s also possible here to have ADVANCED NAME GENERATION TECHNOLOGIES to test possible failure cases. You could create people with only one name, with hyphenated names, etc. That may be explored in another post, but for now we’re good with a simple, common FirstName MiddleName LastName format.
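Here’s a minimal sketch of that selection in Java. The class name and the tiny hard-coded lists are mine, purely for illustration; in practice you’d load the full census files and pass in the first-name list for whichever gender you’re generating. Note that nothing here stops the first and middle name from coming out the same.

```java
import java.util.List;
import java.util.Random;

// Sketch only: NameGenerator is a hypothetical helper, and the short lists in
// main() stand in for the full census name files.
class NameGenerator {
    private final List<String> firstNames; // assumed already filtered by gender
    private final List<String> surnames;
    private final Random random;

    NameGenerator(List<String> firstNames, List<String> surnames, Random random) {
        this.firstNames = firstNames;
        this.surnames = surnames;
        this.random = random;
    }

    // Uniform pick: every name is equally likely, whatever its popularity rank.
    private String pick(List<String> names) {
        return names.get(random.nextInt(names.size()));
    }

    // Builds the common "FirstName MiddleName LastName" format.
    String fullName() {
        return pick(firstNames) + " " + pick(firstNames) + " " + pick(surnames);
    }

    public static void main(String[] args) {
        NameGenerator generator = new NameGenerator(
                List.of("Jack", "Jeff", "Zhou"),
                List.of("Smith", "Garcia", "Nguyen"),
                new Random());
        for (int i = 0; i < 3; i++) {
            System.out.println(generator.fullName());
        }
    }
}
```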

Age/Date of Birth

Data Source: [Java random standard distribution sample]

(Random/Doesn’t matter) Age/Date of Birth is unlikely to matter, unless you’re interested in featuring something specific, like a focus on geriatric patients or families with elementary students.
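If you do want to focus on an age band, one way to do it is to pick a birth date uniformly between the earliest and latest dates that give an age in that band. This is a sketch under my own assumptions: the class and method names are made up, and it draws uniformly rather than from any particular distribution.

```java
import java.time.LocalDate;
import java.time.Period;
import java.time.temporal.ChronoUnit;
import java.util.Random;

// Sketch only: BirthDateGenerator is a hypothetical helper.
class BirthDateGenerator {
    // Returns a birth date such that the age on `today` is in [minAge, maxAge].
    // Assumes 0 <= minAge <= maxAge.
    static LocalDate randomBirthDate(Random random, LocalDate today, int minAge, int maxAge) {
        // Oldest allowed person turns maxAge + 1 tomorrow.
        LocalDate earliest = today.minusYears(maxAge + 1L).plusDays(1);
        // Youngest allowed person turns minAge today.
        LocalDate latest = today.minusYears(minAge);
        int span = Math.toIntExact(ChronoUnit.DAYS.between(earliest, latest));
        return earliest.plusDays(random.nextInt(span + 1));
    }

    // Age in whole years on the given day.
    static int ageOn(LocalDate birthDate, LocalDate today) {
        return Period.between(birthDate, today).getYears();
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.now();
        // Example: a geriatric-focused data set, ages 65 to 90.
        LocalDate dob = randomBirthDate(new Random(), today, 65, 90);
        System.out.println(dob + " (age " + ageOn(dob, today) + ")");
    }
}
```

Storing the date of birth and deriving the age from it keeps the two from disagreeing later.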

SSN

Data Source: [Java function to make it]

(Random/Doesn’t matter) Once again, the number could have information embedded in it, but that’s unimportant here. There are similar ID types for non-US countries, and I’ll probably do a few common ones in the future. There are also other unique(ish) ID types and formats that may be relevant for your domain. This is just a starting point that should cover most people’s bases.
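A rough sketch of an SSN-shaped generator follows. The class is hypothetical, not the function I actually use. It produces the AAA-GG-SSSS layout, skips the values the SSA never assigns (area 000, 666, and 900-999, group 00, serial 0000), and remembers what it has handed out so the IDs stay unique within a run.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Sketch only: SsnGenerator is a hypothetical helper that produces
// format-valid, run-unique numbers with no meaning embedded in them.
class SsnGenerator {
    private final Random random;
    private final Set<String> issued = new HashSet<>();

    SsnGenerator(Random random) {
        this.random = random;
    }

    String next() {
        while (true) {
            int area = 1 + random.nextInt(899);    // 001-899
            if (area == 666) {
                continue;                          // never assigned
            }
            int group = 1 + random.nextInt(99);    // 01-99
            int serial = 1 + random.nextInt(9999); // 0001-9999
            String ssn = String.format("%03d-%02d-%04d", area, group, serial);
            if (issued.add(ssn)) {                 // add() is false on a duplicate
                return ssn;
            }
        }
    }

    public static void main(String[] args) {
        SsnGenerator generator = new SsnGenerator(new Random());
        for (int i = 0; i < 3; i++) {
            System.out.println(generator.next());
        }
    }
}
```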

Height/Weight

Data Source: [Java random code with a good mean and deviation value selected]

Random in a range. I keep it within one standard deviation of the mean because you just want this information to sit in the background. It lends credibility without creating any real signal. Appropriate for some things (police databases), very inappropriate for others (Cheese King’s newsletters).
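One way to get “random, but within a standard deviation of the mean” is to draw from a normal distribution and redraw anything that lands outside one deviation. The sketch below does that. The class name is hypothetical, and the mean and deviation values in main() are illustrative guesses, not the ones I settled on.

```java
import java.util.Random;

// Sketch only: BodyStats is a hypothetical helper.
class BodyStats {
    // Samples a normal value and rejects anything further than one standard
    // deviation from the mean, so results lie in [mean - stdDev, mean + stdDev].
    static double withinOneSigma(Random random, double mean, double stdDev) {
        double z;
        do {
            z = random.nextGaussian(); // standard normal: mean 0, deviation 1
        } while (Math.abs(z) > 1.0);
        return mean + z * stdDev;
    }

    public static void main(String[] args) {
        Random random = new Random();
        // Illustrative values only (centimeters and kilograms).
        double heightCm = withinOneSigma(random, 170.0, 10.0);
        double weightKg = withinOneSigma(random, 75.0, 15.0);
        System.out.printf("%.1f cm, %.1f kg%n", heightCm, weightKg);
    }
}
```

About two thirds of normal draws already fall inside one deviation, so the redraw loop rarely runs more than a couple of times.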

Race/Ethnicity

Data Source:

These are tricky and not essential for what a lot of people do. If you don’t need them, you probably shouldn’t touch them. But a lot of people need or “need” them. Advertisers and law enforcement are two of the biggest ones I can think of. If you do need them, first you have to come up with your lists for both. I’m taking the short cut of

You could have this impact your name selection if it’s relevant to believability for your current problem. That would require a lot of manual coding of ethnicity to name, or changing to a different data source where the data is already linked.

User Accounts (Phone number, email, Facebook name, etc.)

Facebook and email offer a couple of interesting opportunities for personality. You can have your 45-year-olds with cowboy4852@AOL.com, the 13-year-olds with spidy99@gmail.com, and the yuppie with elias.marsden@gmail.com. This is the first example of one piece of data possibly depending on another for its value.
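To make that dependency concrete, here’s a sketch where the email style is chosen from the person’s age. Everything about it is an assumption for illustration: the class name, the age cutoffs (under 18, 40 and over), and the short handle list, which just reuses the examples above.

```java
import java.util.List;
import java.util.Locale;
import java.util.Random;

// Sketch only: EmailGenerator and its age cutoffs are hypothetical.
class EmailGenerator {
    private static final List<String> HANDLES = List.of("cowboy", "spidy");

    // The address depends on previously generated fields: name and age.
    static String emailFor(String firstName, String lastName, int age, Random random) {
        String handle = HANDLES.get(random.nextInt(HANDLES.size()));
        if (age < 18) {
            // Kids: nickname plus a two-digit number.
            return handle + String.format("%02d", random.nextInt(100)) + "@gmail.com";
        }
        if (age >= 40) {
            // Older users: nickname plus a four-digit number on an older provider.
            return handle + (1000 + random.nextInt(9000)) + "@aol.com";
        }
        // Everyone else: the tidy first.last style.
        return (firstName + "." + lastName).toLowerCase(Locale.ROOT) + "@gmail.com";
    }

    public static void main(String[] args) {
        Random random = new Random();
        System.out.println(emailFor("Elias", "Marsden", 13, random));
        System.out.println(emailFor("Elias", "Marsden", 30, random));
        System.out.println(emailFor("Elias", "Marsden", 45, random));
    }
}
```

The practical consequence is ordering: age and name have to be generated before the email can be.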

So what do we have at the end of all this? A bunch of people. They have names and decent unique IDs, plus some basic properties that are intrinsic to them and unlikely to change dramatically over time. Sure, some people nuke their Facebook accounts and start over. Others legally change their names. Others lose a lot of weight. But that’s a relatively small portion of people, and we’re explicitly not worried about dealing with it.