Artificial Intelligence model can fix errors in citizen science data

Heat maps showing bird observations in New York state. The map on the left shows the original samples submitted to eBird and the map on the right shows the distribution after it was adjusted using a model developed by Cornell researchers to reduce location bias in citizen science projects. Credit: Cornell Lab of Ornithology.

Scientists from the Institute for Computational Sustainability (ICS) at Cornell University have developed a deep learning model able to correct location biases in citizen science.

The model has been trained on data from the Cornell Lab of Ornithology’s eBird, which collects more than 100 million bird sightings submitted annually by birdwatchers worldwide.

Citizen science is valuable to researchers because it supplies data on everything from animal species to distant galaxies. However, that data can be quite inconsistent: many reports, for example, come from densely populated areas, and far fewer come from places that are naturally harder to reach.

“There is a huge bias in the data set because the data is collected by volunteers,” said Di Chen, a doctoral student in computer science and first author of “Bias Reduction via End to End Shift Learning: Application to Citizen Science.”

“Since this is highly motivated by their personal interest,” Chen added, “the distribution of this kind of data is not what scientists want.

“All the data is actually distributed along main roads and in urban areas because most people don’t want to drive 200 miles to help us explore birds in a desert,” he said.

Researchers have been aware of these issues for a long time and have tried various methods to address them, including other types of statistical models. Some projects, for instance, have offered volunteers incentives to travel to remote spots or to search for less popular species. However, such initiatives can be expensive and hard to run on a large scale.

The ICS deep learning model takes data collected through eBird, which on its own would normally lead to inaccurate predictions, and adjusts for population differences between areas by comparing their density ratios.
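
As a rough illustration of the density-ratio idea, and not the Cornell team's actual deep learning model, the sketch below reweights each observation by how under- or over-sampled its area is. The two cells, the toy checklists, and the function names are all hypothetical, and the target shares are an assumption chosen for the example.

```python
from collections import Counter


def density_ratio_weights(observed_cells, target_share):
    """Weight each observation by (target share) / (observed share) of its cell."""
    counts = Counter(observed_cells)
    total = len(observed_cells)
    weights = []
    for cell in observed_cells:
        observed_share = counts[cell] / total
        weights.append(target_share[cell] / observed_share)
    return weights


def weighted_rate(values, weights):
    """Weighted average of 0/1 outcomes."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical toy data: 8 checklists from an urban cell, 2 from a remote cell.
cells = ["urban"] * 8 + ["remote"] * 2
# 1 = species reported on the checklist, 0 = not reported.
seen = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1]
# Assumption: in the real landscape both cells should count equally.
target = {"urban": 0.5, "remote": 0.5}

weights = density_ratio_weights(cells, target)
naive = sum(seen) / len(seen)            # dominated by the urban checklists
corrected = weighted_rate(seen, weights)  # remote checklists count for more

print(f"naive rate: {naive:.3f}, reweighted rate: {corrected:.3f}")
```

In this toy case the over-sampled urban cell pulls the naive estimate down, while the reweighted estimate treats both cells as equally important. A learned model would estimate these ratios from features rather than from fixed cell counts.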

“Right now the data we get is essentially biased because the birds don’t just stay around cities, so we need to factor that in and correct that,” Carla Gomes, professor of computer science and director of ICS, said. “We need to make sure the training data is going to match what you would have in the real world.”

Chen and Gomes tested their algorithm with different models and claimed its success rates were higher than those of other statistical or machine learning models at predicting where bird species might be found.

Though they worked with eBird, their findings could be used in any type of citizen science project, Gomes said.

“There are many, many applications that rely on citizen science, and this problem is prevalent, so you really need to correct for it, whether people are classifying birds, galaxies or other situations where data biases can skew the learned model,” she said.

