What is the Science Hall of Fame?

Who would be on a list of the most famous scientists of the past two centuries, and how would they rank against each other? The constraint: To determine membership, you have to rely on the consensus of millions of people presently alive, and for the ranking, you must rigorously survey millions of people who lived over the past 200 years. Impossible?

The Science Hall of Fame is an experiment in culturomics. There are many halls of fame in the world. All of them rely on the subjective judgment of a small group of experts to determine their membership and ranking. The Science Hall of Fame is the first to use an objective and literal measure of fame instead: the appearance of people’s names in millions of books published over the centuries. To determine who is a scientist, it relies on the consensus of the millions of editors of Wikipedia.

This analysis became possible only recently, with a paper in Science by Jean-Baptiste Michel, Erez Lieberman Aiden, and colleagues. They converted millions of books into a database of “n-grams” that can be analyzed without violating copyright. (You can explore those data yourself with the Google ngram viewer. Warning: It is addictive.)

The Science Hall of Fame is a project hosted by the journal Science, created and curated by Adrian Veres (a coauthor of the Michel et al. paper) and Science correspondent John Bohannon. The data comes ultimately from Google Books and DBpedia. The entire ngram dataset and DBpedia are publicly available online.

MOST IMPORTANTLY: This list does not rank scientists by their importance as scientists, nor even by the impact of their scientific work. It simply ranks scientists by their personal fame, based on an objective and literal measure: the appearance of their names in books over the centuries.

How are the members of the Hall of Fame chosen?

The names of millions of people appear in books, but how many of them are scientists? We rely on the judgment of the masses: Wikipedia. There are about 750,000 biographical entries in Wikipedia. To generate a list of candidate scientists, we used computer algorithms that search for birth and death years, Wikipedia categories related to scientific disciplines, and keywords within the text such as “biologist” and “physicist”. You can read the full recipe for the algorithms in the release notes.
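As a rough illustration of this kind of candidate filter, here is a minimal sketch in Python. It is not the project's actual algorithm (that is described in the release notes). The entry layout (a dict with "birth_year", "categories", and "text"), the function names, and the keywords beyond “biologist” and “physicist” are all assumptions made for the example.

```python
import re

# "biologist" and "physicist" are the keywords quoted in the text; the other
# two are assumed here from the fields the Hall of Fame covers.
FIELD_KEYWORDS = ("biologist", "physicist", "chemist", "mathematician")

# The Hall of Fame is limited to people born after 1799.
EARLIEST_BIRTH_YEAR = 1800


def words(text):
    """Return the set of lowercase alphabetic words in a string."""
    return set(re.findall(r"[a-z]+", text.lower()))


def is_candidate(entry):
    """Decide whether a (hypothetical) biography record is a candidate scientist.

    `entry` is assumed to be a dict with an optional integer "birth_year",
    a list of category names under "categories", and the article "text".
    """
    birth_year = entry.get("birth_year")
    if birth_year is None or birth_year < EARLIEST_BIRTH_YEAR:
        return False
    tokens = words(" ".join(entry.get("categories", [])))
    tokens |= words(entry.get("text", ""))
    # Match whole words only, in singular or plural form.
    return any(k in tokens or k + "s" in tokens for k in FIELD_KEYWORDS)
```

Matching whole words rather than substrings keeps a phrase such as "studied chemistry" from counting as a mention of "chemist", although a filter this simple would still let through many non-scientists.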

But that only generates the candidates. The next step is to measure each person’s fame in milliDarwins (mD). The recipe: First, find the year in which the person was 30 years old. Calculate the frequency of that person’s full name among all words contained in the set of all English-language books published that year. (The ngram data for books between 1800 and 2000, representing 4% of all books ever published, are available for download.) Then calculate the frequency within all books published the next year, and then the next, continuing up to the year 2000. Take the mean of those annual frequencies. Normalize to Darwin by dividing that number by 0.000465298724959, then multiply the result by 1000. Now you have the person’s fame in milliDarwins.
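The recipe above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's code: the function name and the two input dictionaries (yearly counts of the full name, and yearly totals of all words) are hypothetical, and years in which the name does not appear are assumed to count as zero frequency.

```python
from statistics import mean

# Normalization constant quoted in the text ("normalize to Darwin").
DARWIN_MEAN_FREQUENCY = 0.000465298724959


def milli_darwins(birth_year, name_counts, total_counts, end_year=2000):
    """Return fame in milliDarwins (mD) for one person.

    name_counts:  {year: occurrences of the person's full name}  (hypothetical input)
    total_counts: {year: total words in books published that year} (hypothetical input)
    """
    start_year = birth_year + 30
    if start_year > end_year:
        raise ValueError("person turns 30 after the last year of data")
    # Annual frequency of the name, from age 30 through the end year.
    frequencies = [
        name_counts.get(year, 0) / total_counts[year]
        for year in range(start_year, end_year + 1)
    ]
    # Mean annual frequency, normalized to Darwin and scaled to milli-units.
    return 1000 * mean(frequencies) / DARWIN_MEAN_FREQUENCY
```

By construction, a person whose mean annual frequency equals the normalization constant scores 1000 mD, that is, one full Darwin.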

The last step is to clean the data. For example, an early version of the Science Hall of Fame included Margaret Thatcher. It is true that Britain’s former prime minister and “Iron Lady” did an undergraduate chemistry degree, but she was never a scientist. Anomalies like these are rare and concentrated at the top of the fame spectrum, due to the very high level of biographical detail in their Wikipedia entries. Those people were excluded manually.

Why are some scientists missing?

There are many ways to be excluded. First, because mD fame is calculated from the age of 30 onwards, anyone born too recently to have their name appear in books published in the year 2000 (that is, anyone born after 1970) is off the radar. Second, you can’t be a “John Smith”. James Watson is an example of a famous scientist excluded for this reason. While “Francis Crick” gives a good clean signal in the books data (note the noisy spike around 1900, which is due to misdated books), there are just too many famous people with the name James Watson.

You can also be missed because your Wikipedia entry does not have enough detail (such as birth year and scientific field), or is not sufficiently structured (so that DBpedia has no useful metadata). Not having a birth year in Wikipedia is one of the most common reasons for exclusion. Finally, there are quirky reasons.

Consider the case of Nobel prize-winning physicist George Smoot. He is a legitimately famous scientist (5 mD), but take a look at his fame trajectory. Smoot was born in 1945, so what is that big peak before 1940? That is all coming from a single book: "The Smoots of Maryland and Virginia" published in 1936. Bad luck for Smoot! Our quality-control algorithms excluded him because his name appears to be more famous before he was born. Our aim is to use more sophisticated algorithms for future versions of the Science Hall of Fame that will include George Smoot without pulling in more noise.
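To make the idea concrete, here is a minimal Python sketch of a check of this kind. The actual quality-control rule is not spelled out here, so the criterion used (comparing the highest annual frequency before birth with the highest from age 30 onwards) is an assumption, as are the function name and the input format.

```python
def peaks_before_birth(birth_year, freq_by_year, end_year=2000):
    """Flag a name whose strongest signal predates the person's birth.

    freq_by_year: {year: annual frequency of the full name} (hypothetical input)
    Assumed rule: flag when the pre-birth peak exceeds the peak from age 30 on.
    """
    before = [f for year, f in freq_by_year.items() if year < birth_year]
    after = [
        freq_by_year.get(year, 0.0)
        for year in range(birth_year + 30, end_year + 1)
    ]
    if not before or not after:
        # Nothing to compare, so do not flag.
        return False
    return max(before) > max(after)
```

A rule like this is cheap to apply to every candidate, but as the Smoot case shows, a single unrelated book can be enough to trigger it.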

Why is the data noisy?

There are several different sources of noise. One is the Google Books corpus itself. The optical character recognition (OCR) used to scan all those books is impressive, but not perfect. For example, it has difficulty with archaic fonts, such as those containing the medial S, which looks like an F. This, along with physical blemishes, makes books printed before 1800 far less reliable. (That is why the Science Hall of Fame is currently limited to people who were born after 1799.)

Noise also leaks in due to misdating of books. This is a known and very complex issue, described in the Supplementary Online Material of the Michel et al. paper. You will often see a tiny bump of fame around 1900, for example, which is usually coming from misdated books.

There is also noise coming from the ambiguity generated by people’s names (see the George Smoot example above). And finally, the information in Wikipedia is not 100% accurate.

What fields are represented in the Hall of Fame?

In the present version, we classify people as being biologists, physicists, chemists, or mathematicians. (About 5% of people in the Science Hall of Fame are members of more than one of these.) We also include a field that reveals the winners of the Nobel prizes in Chemistry, Physics, and Biology (a.k.a. the Nobel prize in Physiology or Medicine). For future versions of the Science Hall of Fame, we may include winners of the Nobel economics prize and also the Fields medal for mathematics.

Why the full name?

We seek to measure the fame of scientists as individual people. For example, Charles Darwin’s name appears in shorter forms in a huge number of books, from “Social Darwinism” to “Darwinian selection.” But those instances do not identify Charles Darwin the man. Even a biography of Charles Darwin will contain relatively few instances of “Charles Darwin” in full.

For the importance of using full names, consider the case of Henry Bence Jones, the 19th-century chemist. His name is immortalized by the Bence Jones protein, but the man himself has only 3 mD of fame.

Why measure fame from age 30 onwards?

Famous scientists almost never appear in books before the year in which they are 30 years old. Using that age as the starting point for measuring people's fame cuts down on noise in the data. Age 30 is an arbitrary choice, but it does the job. We're reproducing the method used in the Michel et al. paper.

How can I improve the Science Hall of Fame?

We support a model of indirect curatorial editing. The best way to improve the Hall of Fame is to improve Wikipedia. In particular, scientists need better relational information in their Wikipedia entries, so that they can be easily parsed and accessed through the DBpedia project. The first step is tagging categories in scientists' Wikipedia entries, especially birth year and occupation. The second step is adding infoboxes to the articles. Future editions of the Science Hall of Fame will include infobox properties along with categories to populate its entries. With your help, this window on scientific fame will grow more inclusive and accurate, while at the same time improving the world's largest free information resource about science.