4.1 Datasets
4.1.1 Food purchases
At our retailer’s 411 shops in Greater London, 1.6M customers used their loyalty cards to buy 1.6B food products during 2015. Because purchases are tied to loyalty cards, each one is stored in the following anonymized form: customer postcode area/region, store postcode, productID, and timestamp. Each productID is associated with the product’s net weight, total energy, fats, saturated fats, carbohydrates, sugars, proteins, and fibres. The last six elements are expressed as grams of substance contained in the product. Using standard guidelines [42], we map grams into the corresponding calories by multiplying them by fixed factors: 9 Kcals per gram for fats, 4 Kcals per gram for proteins and carbohydrates, and 2 Kcals per gram for fibres (Footnote 3).
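The conversion from grams to calories can be illustrated with a minimal Python sketch. The factors are the ones quoted above; the dictionary keys and the function name `nutrient_kcal` are hypothetical, and no factor is assumed for saturated fats or sugars because the text does not state one.

```python
# Fixed conversion factors quoted in the text (Kcal per gram).
# Keys are illustrative labels, not field names from the actual dataset.
KCAL_PER_GRAM = {
    "fats": 9.0,
    "proteins": 4.0,
    "carbohydrates": 4.0,
    "fibres": 2.0,
}


def nutrient_kcal(grams_by_nutrient):
    """Convert a {nutrient: grams} mapping into {nutrient: Kcal}.

    Nutrients without a stated factor are skipped rather than guessed.
    """
    return {
        nutrient: grams * KCAL_PER_GRAM[nutrient]
        for nutrient, grams in grams_by_nutrient.items()
        if nutrient in KCAL_PER_GRAM
    }
```

For instance, a product with 10 g of fats and 3 g of fibres would be credited with 90 Kcal from fats and 6 Kcal from fibres.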
4.1.2 Chronic diseases
At the 1174 general practices (GPs) in Greater London, 1.1B medicine items were prescribed during 2016. This prescription data has recently been made publicly available (Footnote 4) in the following form: GP identifier, medicineID, and timestamp. Each medicineID is associated with the medicine’s active ingredients, from which the corresponding diseases can be inferred. To do so, we map the prescriptions to their medicines’ active ingredients and, in turn, to the chronic diseases they are meant to treat. The mapping of an active ingredient to the most likely disease is based on the OpenPrescribing taxonomy [43]. As a result, for each GP, we know the number of prescriptions that are meant to treat a given disease.
We are interested not in all prescriptions but only in those related to three main factors that are generally grouped under the heading of “metabolic syndrome”: high blood pressure (hypertension), an excess of cholesterol in the blood, and high blood-sugar levels (diabetes).
Hypertension. Hypertension is a long-term medical condition in which the blood pressure in the arteries is persistently elevated. It has been identified as the most important risk factor for death in the Western world [44]. To capture the prevalence of hypertension, we consider the prescriptions in three categories of the OpenPrescribing taxonomy: antihypertensive drugs (e.g., Hydralazine Hydrochloride), alpha-adrenoceptor blocking drugs (e.g., Doxazosin Mesilate), and renin-angiotensin system drugs (e.g., Lisinopril).
Cholesterol. One important risk factor for coronary heart disease is cholesterol. If the cholesterol level is low, even an obese and diabetic person does not develop atherosclerosis [45, 46]. Cholesterol also seems to help some cancers migrate and invade more tissue [47]. To capture the prevalence of high cholesterol, we consider one category in the OpenPrescribing taxonomy: lipid-regulating drugs (e.g., statins).
Diabetes. Diabetes is characterized by chronically elevated levels of sugar in the blood [48]. Insulin is the hormone that keeps blood sugar in check. The disease is caused either by the pancreas not making enough insulin (type 1 diabetes) or by the body becoming resistant to insulin’s effects (type 2 diabetes, which accounts for 90-95 percent of diabetes cases [49]). Type 2 diabetes is a consequence of dietary choices (namely “high-fat and high-calorie diets”) and, as such, is preventable and often treatable. Diabetics are more likely to suffer from strokes and heart failure [50]. To capture the prevalence of diabetes, we consider the prescriptions in four categories of the OpenPrescribing taxonomy: insulin, antidiabetic drugs (e.g., gliclazide), active ingredients for the treatment of hypoglycaemia (e.g., glucagon), and agents for diabetic diagnosis and monitoring (e.g., glucose blood testing reagents) (Footnote 5).
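The category-to-disease mapping described above can be sketched as a simple lookup. This is only an illustration: the category strings are informal labels taken from the text rather than actual OpenPrescribing codes, and the input format (one `(gp_id, category)` pair per prescribed item) is an assumption.

```python
from collections import Counter

# Informal category labels copied from the text; the real taxonomy uses
# its own identifiers, which are not reproduced here.
CATEGORY_TO_DISEASE = {
    "antihypertensive drugs": "hypertension",
    "alpha-adrenoceptor blocking drugs": "hypertension",
    "renin-angiotensin system drugs": "hypertension",
    "lipid-regulating drugs": "cholesterol",
    "insulin": "diabetes",
    "antidiabetic drugs": "diabetes",
    "treatment of hypoglycaemia": "diabetes",
    "diabetic diagnostic and monitoring agents": "diabetes",
}


def count_by_gp_and_disease(prescriptions):
    """Count prescriptions per (GP, disease).

    `prescriptions` is an iterable of (gp_id, category) pairs, one per
    prescribed item (a hypothetical layout). Items whose category is not
    linked to one of the three conditions are ignored.
    """
    counts = Counter()
    for gp_id, category in prescriptions:
        disease = CATEGORY_TO_DISEASE.get(category)
        if disease is not None:
            counts[(gp_id, disease)] += 1
    return counts
```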
4.1.3 Mapping food purchases and chronic diseases
To map all our data, we use the postcode area/region as our initial unit of geographic aggregation. We map our food purchases using the customers’ area of residence, which is available in our grocery dataset. We then map our prescriptions based on what the anonymized prescription dataset offers, that is, the fraction of each GP’s patients living in each area. More specifically, for each GP, we perform two steps. First, we compute the fraction of the GP’s patients who live in each area (e.g., 50% of the GP’s patients live in area X); this is possible because patient counts are publicly available (Footnote 6). Second, we assign the GP’s prescriptions to areas in proportion to those fractions (e.g., half of the GP’s prescriptions are assigned to area X). We repeat these two steps for all GPs in the city. As a result, we obtain the number of prescriptions of each medicine in each area, and we normalize that number by the area’s population to obtain the per-capita prevalence of that medicine in that area.
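The two-step assignment can be sketched as follows. The function name and the dictionary-based inputs are hypothetical; GPs with no patient counts are skipped here, which is an assumption, since the text does not say how such cases are handled.

```python
from collections import defaultdict


def allocate_to_areas(gp_prescriptions, gp_patients_by_area):
    """Spread each GP's prescriptions over areas by patient share.

    gp_prescriptions:    {gp_id: number of prescriptions}
    gp_patients_by_area: {gp_id: {area: number of registered patients}}
    Returns {area: estimated number of prescriptions}.
    """
    area_totals = defaultdict(float)
    for gp_id, n_prescriptions in gp_prescriptions.items():
        patients = gp_patients_by_area.get(gp_id, {})
        total_patients = sum(patients.values())
        if total_patients == 0:
            continue  # assumption: GPs without residence data are skipped
        for area, n_patients in patients.items():
            # Step 1: fraction of the GP's patients living in this area.
            share = n_patients / total_patients
            # Step 2: assign the same fraction of the GP's prescriptions.
            area_totals[area] += n_prescriptions * share
    return dict(area_totals)
```

With a GP that issued 100 prescriptions and whose patients are split evenly between areas X and Y, each area receives 50 prescriptions from that GP.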
As we shall see shortly, to support our analyses, we need to match our data with census data, which, however, is not available at postcode area/region level. Census data is typically defined at four different spatial resolutions (Footnote 7): Lower Super Output Area (LSOA), Middle Super Output Area (MSOA), Ward, and Local Authority (LA, or, more informally, Borough). Among these four aggregations, we opt for MSOA because, at this resolution, our aggregate metrics for food consumption and disease prevalence start to be significant, not least because they concern a sufficient number of residents: our data covers 937 MSOAs in London, which have an average of 8250 residents. As such, from now on, we refer to MSOAs as areas or neighborhoods.
Given our three diseases, we consider the prescriptions related to them and express the prevalence of each disease at area level as the number of prescriptions for that disease per capita. For example, in the case of diabetes, for any area a we have:
$$ \mathrm{prevalence}\text{-}\mathrm{diabetes}@a = \frac{ \text{\#prescriptions for diabetes@a}}{\text{\#residents@a}}. $$
(1)
In a similar way, we compute \(\mathrm{prevalence}\text{-}\mathrm {hypertension}@a\) and \(\mathrm{prevalence}\text{-} \mathrm{cholesterol}@a\).
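As a small illustration of Eq. (1), the per-capita prevalence can be computed from two per-area mappings. The names are hypothetical, and areas with zero residents are dropped, which is an assumption made here to avoid division by zero.

```python
def prevalence(prescriptions_by_area, residents_by_area):
    """Per-capita prescriptions for one disease, following Eq. (1).

    prescriptions_by_area: {area: number of prescriptions for the disease}
    residents_by_area:     {area: number of residents}
    Areas without residents are omitted (assumption).
    """
    return {
        area: prescriptions_by_area.get(area, 0.0) / residents
        for area, residents in residents_by_area.items()
        if residents > 0
    }
```

The same function applies unchanged to hypertension and cholesterol, given the corresponding prescription counts.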
4.1.4 Socio-economic indicators
The prevalence of chronic diseases is not only impacted by food consumption but also mediated by socio-economic conditions. Higher-income and well-educated people may have better access to doctors, gyms, parks and healthy food. There is an inverse relationship between education levels and the likelihood of becoming obese in Australia, Canada and England [51]. The same applies in the USA: “obesity rates in children with college-educated parents are less than half the rates of children whose parents lack a high-school degree” [52]. In developed countries, there is a difference between cities and suburban areas: the more affluent urbanites are usually fitter than rural residents [53].
To control for these factors in our study, we collected data on socio-economic conditions from the 2015 UK census and from the Index of Multiple Deprivation (IMD) 2015, which is based on a basket of measures of deprivation for small areas across England (Footnote 8). We focus on the socio-economic factors that have been found to be associated with specific food consumption patterns and, ultimately, with chronic diseases. These factors, all available at MSOA level, are average income, education level, gender distribution (%female), and average age.
4.2 Estimating eating habits of an area
To estimate the eating habits of people living in an area, we pool together all the food items purchased by its residents and look at the nutritional properties of the average item. We do so by defining six metrics below.
To capture calorie consumption, we compute the average amount of calories contained in the food items purchased in an area:
$$ \mathrm{calorie}\text{-}\mathrm{consumption}@a = \frac{\sum_{p \in P_{a}}\mathrm{Kcal}(p)}{|P _{a}|}, $$
(2)
where \(P_{a}\) is the set of all food products purchased by residents of area a, p is one such product, and \(\mathrm{Kcal}(p)\) is the number of kilocalories in p.
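Eq. (2) amounts to a plain average over purchased items, as the following sketch shows. It assumes that \(P_{a}\) is given as a list with one entry per purchased item (so a product bought twice appears twice) and that each entry carries a hypothetical `kcal` field.

```python
def calorie_consumption(purchases):
    """Average Kcal per purchased item in an area, following Eq. (2).

    `purchases` is a list of dicts with a 'kcal' key, one per purchased
    item (hypothetical layout).
    """
    if not purchases:
        raise ValueError("an area needs at least one purchase")
    return sum(item["kcal"] for item in purchases) / len(purchases)
```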
To capture calorie concentration rather than simple calorie counts, we compute:
$$ \mathrm{calorie}\text{-}\mathrm{concentration}@a = \frac{\sum_{p \in P_{a}}\mathrm{Kcal}(p)}{ \sum_{p \in P_{a}} \mathrm{weight}(p)}, $$
(3)
which reflects the concentration of calories in the area’s “average” product.
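A minimal sketch of Eq. (3) follows. The field names `kcal` and `weight_g` are hypothetical, and grams are assumed as the unit of net weight, which the text does not state.

```python
def calorie_concentration(purchases):
    """Total Kcal divided by total weight in an area, following Eq. (3).

    `purchases` is a list of dicts with 'kcal' and 'weight_g' keys, one
    per purchased item (hypothetical layout, weight assumed in grams).
    """
    total_kcal = sum(item["kcal"] for item in purchases)
    total_weight = sum(item["weight_g"] for item in purchases)
    if total_weight == 0:
        raise ValueError("total weight must be positive")
    return total_kcal / total_weight
```

Note that this ratio of sums is not the mean of per-item densities. For one 100 g item with 500 Kcal and one 900 g item with 100 Kcal, Eq. (3) gives 600/1000 = 0.6 Kcal per gram, whereas averaging the two per-item densities (5 and about 0.11 Kcal per gram) would give about 2.56.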
For each area, we also compute the average number of calories that each individual nutrient contributes to a product:
$$ \mathrm{nutrient}\text{-}\mathrm{calories}@a = \frac{\sum_{p \in P_{a}}\mathrm{Kcal}(\mathrm{nutrient},p)}{|P _{a}|}, $$
(4)
where \(P_{a}\) is the set of all food products purchased by residents of area a; p is one such product; and \(\mathrm{Kcal}(\mathrm{nutrient},p)\) is the energy given by nutrient in p. The nutrients we consider are: fats (\(\mathrm{fats}\text{-}\mathrm{calories}@a\)), saturated fats (\(\mathrm{saturated}\text{-}\mathrm{calories}@a\)), carbohydrates (\(\mathrm{carbs}\text{-}\mathrm{calories}@a\)), sugars (\(\mathrm{sugar}\text{-}\mathrm{calories}@a\)), proteins (\(\mathrm{proteins}\text{-}\mathrm{calories}@a\)), and fibres (\(\mathrm{fibres}\text{-}\mathrm{calories}@a\)).
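Eq. (4) can be sketched for all six nutrients at once. The input layout (one `{nutrient: Kcal}` dict per purchased item) and the treatment of a missing nutrient as zero calories are assumptions made for illustration.

```python
# Nutrient labels follow the metric names used in the text.
NUTRIENTS = ("fats", "saturated", "carbs", "sugar", "proteins", "fibres")


def nutrient_calories(purchases):
    """Average Kcal per purchased item for each nutrient, following Eq. (4).

    `purchases` is a list of {nutrient: Kcal} dicts, one per purchased
    item (hypothetical layout). A nutrient missing from an item counts
    as zero calories (assumption).
    """
    n_items = len(purchases)
    if n_items == 0:
        raise ValueError("an area needs at least one purchase")
    return {
        nutrient: sum(item.get(nutrient, 0.0) for item in purchases) / n_items
        for nutrient in NUTRIENTS
    }
```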
We also capture the diversity of nutrients consumed in the area. This is computed as the Shannon entropy of the distribution of the calories given by all the nutrients:
$$\begin{aligned} H(\mathrm{nutrient})@a = -\sum_{\mathrm{nutrient}} \mathrm{prob}(\mathrm{nutrient},a) \cdot\log\mathrm{prob}(\mathrm{nutrient},a), \end{aligned}$$
(5)
where \(\mathrm{prob}(\mathrm{nutrient},a)\) can be thought of as the fraction of area a’s total calories coming from nutrient, which, in turn, can be written as:
$$\begin{aligned} f_{\mathrm{nutrient}\text{-}\mathrm{calories}}@a = \frac{\mathrm{nutrient}\text{-}\mathrm{calories}@a}{\mathrm{calorie} \text{-}\mathrm{consumption}@a}. \end{aligned}$$
(6)
For example, \(f_{\mathrm{fats}\text{-}\mathrm{calories}}@a\) is the fraction of a’s total calories coming from fats.
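Eqs. (5) and (6) can be sketched as below, under two assumptions that the text does not settle: the natural logarithm is used, and the fractions are rescaled to sum to one before the entropy is taken, so that they form a proper probability distribution. Nutrients with a zero fraction contribute nothing to the sum.

```python
import math


def calorie_fractions(nutrient_calories, calorie_consumption):
    """Fraction of an area's calories coming from each nutrient, Eq. (6)."""
    if calorie_consumption <= 0:
        raise ValueError("calorie consumption must be positive")
    return {
        nutrient: kcal / calorie_consumption
        for nutrient, kcal in nutrient_calories.items()
    }


def nutrient_entropy(fractions):
    """Shannon entropy of the nutrient distribution, Eq. (5).

    Assumptions: natural logarithm; fractions are rescaled to sum to one.
    """
    total = sum(fractions.values())
    if total <= 0:
        raise ValueError("fractions must have a positive sum")
    probs = [f / total for f in fractions.values() if f > 0]
    return -sum(p * math.log(p) for p in probs)
```

Under these assumptions the entropy is zero when all calories come from a single nutrient and reaches its maximum, the logarithm of the number of nutrients, when calories are spread evenly.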
Finally, we also compute the average item weight:
$$ \mathrm{item}\text{-}\mathrm{weight}@a = \frac{\sum_{p \in P_{a}} \mathrm{weight}(p)}{|P_{a}|}. $$
(7)
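As a consistency check that follows directly from the definitions in (2), (3) and (7), dividing the numerator and the denominator of (3) by \(|P_{a}|\) shows that the three metrics are linked:

$$ \mathrm{calorie}\text{-}\mathrm{concentration}@a = \frac{\sum_{p \in P_{a}}\mathrm{Kcal}(p) / |P_{a}|}{\sum_{p \in P_{a}}\mathrm{weight}(p) / |P_{a}|} = \frac{\mathrm{calorie}\text{-}\mathrm{consumption}@a}{\mathrm{item}\text{-}\mathrm{weight}@a}. $$

In other words, any two of these three metrics determine the third.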
For the sake of reproducibility of our analysis, we publicly share our data aggregated at MSOA level.