Urbanisation, generally defined as the increase in the proportion of people living in urban areas or in the proportion of buildings belonging to urban agglomerations [1], is a trend that has occurred in waves throughout human history, with a dramatic acceleration in the last 300 years [2]. In 2015, 56% of China’s population lived in cities, more than double the 26% recorded in 1990. The Ministry of Housing and Urban-Rural Development estimates that by 2025 300 million Chinese now living in rural areas will move into cities. State spending on new houses, roads, hospitals and schools is planned and could cost up to 600 billion US dollars a year. A high rate of urbanisation is also expected in Sub-Saharan African countries. As a result, it is estimated that by 2030 the world’s population will have increased by over 1 billion people, most of whom will dwell in the rapidly growing cities of Asia and Africa [3]. Recent studies show that, on average, urban land is expanding at twice the urban population growth rate, resulting in a decrease of urban population density over time [4].
A quantitative understanding of the mechanisms that drive urbanisation is important for helping governments and decision makers to plan investments in order to achieve sustainable urban planning and growth. These decisions will have a huge impact on the lives of millions of people, the economy and the environment. Urbanisation can happen in two ways: diffusion (or sprawl) and aggregation. Diffusion corresponds to existing cities growing and increasing in size because of either net migration from rural areas or a greater rate of natural increase (i.e. birth rate minus death rate) in urban areas. Aggregation corresponds to new villages and towns being created in rural areas that were previously considered non-urbanised. In order to properly characterise urbanisation patterns we should consider both aspects: the distribution of city sizes, describing the size and growth of existing cities, and the overall number of cities, describing the abundance and formation of new urban areas.
The distribution of city sizes is a broad and heterogeneous distribution. Ranking cities by population, it has been observed [5,6,7] that the population of the i-th largest city of a country is approximately equal to the population of the largest city divided by i, i.e. a city’s rank is inversely proportional to its population. In other words, the fraction of cities with population larger than x follows Zipf’s law, \(P(>x) \sim x^{-\alpha }\), with \(\alpha \simeq 1\). Previous studies have shown how Zipf’s law can originate from various models based on cluster growth and aggregation [8,9,10,11], the interplay between multiplication and diffusion processes [12], preferential migration to large aggregates [13], pairwise interactions between individuals [14] and proportionate random growth [15,16,17], or Gibrat’s law [18, 19].
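As an illustration of how the Zipf exponent \(\alpha\) in \(P(>x) \sim x^{-\alpha }\) can be estimated from a list of city populations, the following minimal sketch uses the Hill (maximum-likelihood) estimator for a power-law tail. This is one standard estimator and not necessarily the method used in the studies cited above. The function names, the lower cutoff `x_min` and the synthetic sample are all assumptions introduced here for illustration.

```python
import math


def hill_estimator(sizes, x_min):
    """Hill (maximum-likelihood) estimate of alpha for a tail
    P(>x) ~ x**(-alpha), using only the sizes with x >= x_min."""
    tail = [s for s in sizes if s >= x_min]
    log_sum = sum(math.log(s / x_min) for s in tail)
    if log_sum == 0:
        raise ValueError("need at least one size strictly above x_min")
    return len(tail) / log_sum


def synthetic_pareto_sizes(alpha, x_min, n):
    """Hypothetical data: n evenly spaced quantiles of a Pareto
    distribution with exponent alpha (deterministic, no randomness)."""
    return [x_min * (1.0 - (i + 0.5) / n) ** (-1.0 / alpha) for i in range(n)]


# Synthetic "city sizes" drawn from an exact Zipf law (alpha = 1).
sizes = synthetic_pareto_sizes(alpha=1.0, x_min=10_000, n=1000)
alpha_hat = hill_estimator(sizes, x_min=10_000)
print(f"estimated Zipf exponent: {alpha_hat:.3f}")
```

On real data the estimate depends on the choice of `x_min`, because Zipf’s law typically describes only the upper tail of the city size distribution.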
Compared to the great efforts made to characterise the distribution of city sizes both empirically and theoretically, much less work has been done to answer the other fundamental question about the urbanisation process: What determines the number of cities in a country? In this paper we empirically investigate the relationships between the number of cities in a region and some of the region’s properties, such as the region’s total population and built-up area. In particular, we consider how the total population (or the total built-up area) of a region affects the number of cities. This is analogous to Heaps’ Law in linguistics [20, 21], which describes the empirical scaling relationship between the number of distinct words, W, in a document and the total number of words in the document (or text length), N: \(W \sim N^{\gamma }\), where \(\gamma \le 1\) is the Heaps exponent.
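To make the scaling relationship concrete, the sketch below estimates a Heaps exponent \(\gamma\) as the slope of an ordinary least-squares fit of \(\log C\) against \(\log N\), where C is the number of cities in a region and N the region’s total population. This is one common way to estimate a scaling exponent and is not necessarily the fitting procedure adopted in this paper. The function name and the synthetic regional data are hypothetical.

```python
import math


def fit_heaps_exponent(populations, city_counts):
    """Fit C = a * N**gamma by least squares in log-log space.
    Returns (gamma, a)."""
    xs = [math.log(n) for n in populations]
    ys = [math.log(c) for c in city_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    gamma = cov / var
    prefactor = math.exp(mean_y - gamma * mean_x)
    return gamma, prefactor


# Hypothetical regions obeying C = 0.05 * N**0.8 exactly (not real data).
populations = [1e4, 1e5, 1e6, 1e7]
city_counts = [0.05 * n ** 0.8 for n in populations]

gamma, prefactor = fit_heaps_exponent(populations, city_counts)
print(f"gamma = {gamma:.3f}, prefactor = {prefactor:.3f}")
```

A fitted slope below 1 corresponds to sublinear growth: the number of cities grows more slowly than the total population.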
Previous research has shown that Zipf’s law and Heaps’ law often appear together, suggesting that the presence of Zipf’s law implies Heaps’ law. Considering the probability density function (PDF) corresponding to Zipf’s law, \(P(x) \sim x^{-1-\alpha }\), it can be shown [22] that Heaps’ exponent γ is related to Zipf’s exponent α: \(\gamma = \alpha \) if \(\alpha < 1\), and \(\gamma = 1\) otherwise. However, this relationship does not necessarily hold for spatially extended systems, such as cities, because evidence of Zipf’s law at the country (global) scale does not necessarily imply the presence of Zipf’s law and Heaps’ law at the regional (local) scale. In fact, even if Zipf’s law for the distribution of city sizes holds globally at the level of countries, it might not hold locally at smaller spatial scales if correlations in the spatial distribution of urban clusters are present. This would be the case, for example, if urban clusters were spatially aggregated by size, so that clusters of similar sizes are found close to each other more often than they would be if clusters were randomly distributed among the regions, irrespective of their size. The overall (global) distribution of cluster sizes would not change and would still be a power law, but the size distributions in the regions would no longer follow Zipf’s law and, as a consequence, Heaps’ law would not hold. Indeed, this is what happens in ecological systems, where macro-ecological statistical patterns of species distribution and abundance display a strong dependence on the spatial scale considered [23]. One of the most relevant statistics used to characterise the degree of biodiversity of ecosystems is the species-area relationship (SAR), which measures the number of different species expected to be found in areas of increasing size.
Since the density of individuals per unit area is constant, the SAR is the equivalent of Heaps’ law for ecosystems, as it measures the relationship between a region’s total population and the expected number of different groups of individuals in the region, where here groups correspond to species instead of cities. Empirical measurements of the SAR show a different functional behaviour as the region’s area increases, because the shape of the distribution of species sizes, called the relative species abundance, depends on the spatial scale considered. While there are various studies on Heaps’ law in linguistics and the SAR in ecology, to the best of our knowledge there is no thorough empirical analysis of Heaps’ law in urban systems. The aim of this paper is precisely to fill this gap and to investigate the validity of Heaps’ law for cities.
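The relation between the two exponents, \(\gamma = \alpha \) if \(\alpha < 1\) and \(\gamma = 1\) otherwise, can be motivated by a standard heuristic argument. The sketch below is not the derivation given in [22]. It assumes that the C city sizes \(x_i\) in a region are independent draws from a pure power-law PDF \(P(x) \sim x^{-1-\alpha }\) with a lower cutoff \(x_{\min}\) (a symbol introduced here for illustration), and it ignores the logarithmic corrections that appear at \(\alpha = 1\).

\[
N = \sum_{i=1}^{C} x_i \sim
\begin{cases}
C \, \langle x \rangle , & \alpha > 1 \ (\langle x \rangle < \infty ) \ \Rightarrow \ C \sim N , \ \gamma = 1 , \\[4pt]
x_{\max} \sim x_{\min} \, C^{1/\alpha } , & \alpha < 1 \ (\langle x \rangle = \infty ) \ \Rightarrow \ C \sim N^{\alpha } , \ \gamma = \alpha .
\end{cases}
\]

For \(\alpha > 1\) the mean city size is finite, so the total population grows linearly with the number of cities. For \(\alpha < 1\) the mean diverges and the sum is dominated by its largest terms. The largest of C draws satisfies \(P(>x_{\max}) \sim 1/C\), which gives \(x_{\max} \sim x_{\min} C^{1/\alpha }\) and hence \(N \sim C^{1/\alpha }\).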
There is another reason to investigate the relationship between Zipf’s and Heaps’ laws for cities. Zipf’s law for the distribution of city sizes usually holds only for the tail of the distribution. However, the fact that the distribution of city sizes in a region has a power-law tail gives no information about the relationship between the number of cities in the region and its total population. In other words, when Zipf’s law holds only for large cities, there is no guarantee that Heaps’ law holds as well. To understand this, consider a region in which city sizes follow Zipf’s law. If the population of each city is doubled, and hence the total population of the region is also doubled, yet no new cities are created, Zipf’s law will still be present, albeit with a larger scale parameter (i.e. the minimum city size is doubled). However, Heaps’ law will not hold in this case, because the total population, N, is doubled, but the number of cities, C, has not changed.
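The doubling argument above can be checked numerically with a small sketch. The region below is hypothetical: 100 cities whose sizes follow an exact rank-size rule, with a largest city of one million inhabitants. All variable names are introduced here for illustration.

```python
import math

# Hypothetical region: the i-th largest city has population P1 / i.
largest = 1_000_000
before = [largest / rank for rank in range(1, 101)]

# Every city doubles in population; no new cities are created.
after = [2 * size for size in before]

# Zipf's rank-size rule is preserved: size(rank 1) / size(rank i) is still i.
ratios_before = [before[0] / size for size in before]
ratios_after = [after[0] / size for size in after]

# The scale parameter (minimum city size) doubles.
scale_ratio = min(after) / min(before)

# Naive Heaps exponent between the two snapshots:
# log(C_after / C_before) / log(N_after / N_before).
n_before, n_after = sum(before), sum(after)
c_before, c_after = len(before), len(after)
gamma_between = math.log(c_after / c_before) / math.log(n_after / n_before)

print("cities:", c_before, "->", c_after)
print("population ratio:", n_after / n_before)
print("exponent between snapshots:", gamma_between)
```

The rank-size ratios are identical before and after, while the number of cities is unchanged even though the total population has doubled, so the exponent measured between the two snapshots is zero rather than the value a Heaps-type scaling would give.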
In this paper, we use a dataset on the population and location of cities globally to assess whether Heaps’ law holds for all countries in all continents (except Australia and Antarctica), and to test the predicted relationship between Heaps’ and Zipf’s exponents. Cities can be defined in many different ways, and various relevant properties of urban agglomerations, including the scaling relationships between population size and urban indicators such as area of roads and number of patents, depend on the method used to define cities [24, 25]. In particular, the relationship between the number of cities in a region and the region’s total population, i.e. Heaps’ law, can also depend on the definition of city considered. To understand how Heaps’ law depends on the definition of city, we use a second dataset of the spatial distribution of population in the United States, which allows us to consider various definitions of urban clusters and provides additional support for our results.