 Regular article
 Open Access
Home is where the ad is: online interest proxies housing demand
 Marco Pangallo^{1, 2} and
 Michele Loberto^{3}
 Received: 26 June 2018
 Accepted: 31 October 2018
 Published: 9 November 2018
Abstract
Online activity leaves digital traces of human behavior. In this paper we investigate if online interest can be used as a proxy of housing demand, a key yet so far mostly unobserved feature of housing markets. We analyze data from an Italian website of housing sales advertisements (ads). For each ad, we know the timings at which website users clicked on the ad or used the corresponding contact form. We show that low online interest—a small number of clicks/contacts on the ad relative to other ads in the same neighborhood—predicts longer time on market and higher chance of downward price revisions, and that aggregate online interest is a leading indicator of housing market liquidity and prices. As online interest affects time on market, liquidity and prices in the same way as actual demand, we deduce that it is a good proxy. We then turn to a standard econometric problem: what difference in demand is caused by a difference in price? We use machine learning to identify pairs of duplicate ads, i.e. ads that refer to the same housing unit. Under some caveats, differences in demand between the two ads can only be caused by differences in price. We find that a 1% higher price causes a 0.66% lower number of clicks.
Keywords
 Online data
 Housing market
 Econometrics
 Machine learning
 Causality
1 Introduction
Online activity makes it possible to quantify aspects of human behavior that were not previously measurable at a comparable scale. Examples include stock market sentiment [1, 2], ideological conflict [3, 4], social networks [5, 6], mobility [7] and epidemic spreading [8]. In this paper we quantify housing demand, as viewed through the lens of online activity on an Italian website of housing sales advertisements (ads). We first establish that online interest is a good proxy of actual demand, and then, on a more technical level, we combine econometric and machine learning ideas to investigate the causal link from prices to demand.
The interaction between housing demand and supply determines price trends and the social composition of neighborhoods. Higher demand, if supply does not increase, is associated with increasing prices and consequently worsening residential income segregation. This insight can easily be formalized in various types of models of the housing market: spatial equilibrium models [9, 10], search and matching models [11–13] and agent-based models [14–16].
However, empirically testing the effect of demand is much harder, because demand is hard to measure. For example, Genesove and Han [17] write “however as buyers are not listed in North American housing markets, the stock of them is impossible to construct for empirical work.” Genesove and Han use changes in income and population at the city level as proxies of demand [17]. Carrillo et al. [18] use seller bargaining power and sale probability. Merlo and Ortalo-Magné [19] analyze what is arguably the most complete dataset in terms of demand information. Their data include the number and timing of viewings of listed properties and the sequence of offers by potential buyers. However, the data are hand-collected by the agencies, limiting their sample size to 780 units.
The advent of the internet has made it possible to quantify demand on a larger scale. Potential home buyers start gathering information about dwellings by browsing the internet, and may subsequently contact an agency to obtain more detailed information or organize a viewing. Wu and Brynjolfsson [20] are the first to use internet data to quantify demand, showing that the number of housing-related Google searches is predictive of future price appreciation and higher volume of transactions at the city level. (See also Ref. [21] for Google searches.) Van Dijk and Francke [22] come to the same conclusion, but their measure of demand is the aggregate number of clicks on housing ads on a Dutch website, where aggregation is again performed at the city level.
Here we go one step further and analyze measures of online interest at the level of individual ads. We have access to the full temporal sequence of the number of clicks on each ad, from the time the ad was posted to the time it was removed from the website. We also know the timings in which potential buyers used the contact form on the website to contact sellers. We show that our measures of online interest are predictive of the time on market and of the probability and magnitude of both downward and upward price revisions. We also aggregate the number of clicks and contacts at the neighborhood and city level, and confirm the results in Refs. [20, 22] in terms of liquidity and price trends. As time on market, liquidity and prices are linked to actual demand in the same way as our measure of online interest, we deduce that clicks and contacts are a good proxy of actual demand.
The main problem with our dataset is the large fraction of duplicate ads, namely multiple ads that refer to the same dwelling. For example, an agency might post a new ad for the same dwelling to make the new ad appear at the top of “most recent” listings, without deleting the old ad. It is clear that if one does not deal with the existence of duplicates, results on time on market and price revisions are likely to be biased. To address this issue, we devise a machine learning algorithm that identifies duplicates. We use a classification tree with boosting to assign to pairs of ads probabilities to be duplicates, and consider pairs with probability larger than 0.5 as duplicates.
The identification of duplicate ads is also very useful to estimate the price elasticity of demand. This is the relative difference in demand—in this case, the number of clicks—that is caused by a relative difference in price. (Clarification: a price revision for an ad and the price elasticity of demand are used as different concepts here. A price revision means that the ad was already online when the price change occurred. The price elasticity of demand is more of a thought experiment: had the ad been posted with a different price, what would the relative difference in clicks be?) The elasticity of demand is an extremely important concept for both businesses and policy. Many companies need to know how changing prices would affect the demand for their goods, and many institutions need causal understanding of the link between prices and demand to implement some policies. For example, a city council may want to start a program of housing subsidies. By subsidizing poor households this policy effectively decreases the price of houses for those households, and its success depends on the effect on demand.
A regression in which the dependent variable is the number of clicks and the independent variable is the price yields an incorrect estimate of the elasticity, mainly because both the price and the number of clicks are correlated with other variables, such as the intrinsic quality of the dwelling or of the neighborhood. But if the dwelling is the same and only the price of the corresponding duplicate ad is different—for example, because the agency posted a new ad with a different price—the elasticity can be estimated consistently from pairs of ads. There are some caveats. For example, users must not be able to identify duplicates before clicking on them (given the way the search engine of our website works, we think that this is reasonable in most cases).
With this approach we relate to the literature on demand identification [23–25], whose goal is to understand the causal structure of demand using exogenous demand and supply shocks. A recent work in this literature uses the Uber pricing system to identify the full demand curve [26]. Given data constraints, here we only identify the demand elasticity, and we use an imperfect proxy of demand such as the number of clicks rather than considering realized demand. A rather different literature is concerned with demand forecasting [27]:^{1} given a set of house and neighborhood characteristics, what is the most accurate prediction of the house demand? As demand forecasting is purely a prediction task, machine learning algorithms are likely to perform best (see e.g. the Zestimate Competition at https://www.kaggle.com/c/zillow-prize-1). Demand identification requires instead causal reasoning that is usually formalized in terms of econometric techniques. Cross-fertilization between machine learning and econometrics has been advocated recently [28], and has already delivered a substantial amount of research [29, 30]. Our contribution in this direction combines ideas from classification in supervised machine learning and the potential outcomes framework [31, 32] in statistics and econometrics. Our method can generally be applied to any marketplace website and not necessarily to housing demand.
The rest of this paper is organized as follows. In Sect. 2 we describe the data; in Sect. 3 we provide descriptive statistics on the temporal and spatial aspects of clicks and contacts, both at the level of individual ads and aggregating over neighborhoods and cities. In Sect. 4 we provide evidence that online interest is indeed a proxy of housing demand, and in Sect. 5 we introduce the methodology to estimate the price elasticity of demand. Section 6 concludes.
2 Data
Our data consist of multiple snapshots of the Immobiliare.it database, from January 2015 to June 2017. By snapshot we mean all information on ads that are visible on a specific day. For 2015 we only have quarterly snapshots, while from 2016 on we mostly have weekly snapshots. In practice, most ads remain unchanged between two weekly snapshots, with about 5% of the ads being removed and 5% being newly uploaded. We retain time-varying information for the variables we are most interested in: (asking) price, number of clicks on the ad, and number of contacts that occurred through the website. (There is a counter of clicks and contacts that increases over successive snapshots.)
For other variables we instead rely on the latest available information, because we assume that the sellers correct the mistakes they might have made when posting the ad. These variables are the physical characteristics of the dwelling—floor area, number of rooms, maintenance status, etc. (see Fig. 1B)—and its geographical coordinates. We are also given a brief description of the dwelling. This description tends to contain the same information that is stored in the other variables, but also provides more details about the neighborhood and the agency that sells the property. Finally, we know the dates in which the ad was uploaded and in which the ad was removed (if it was).
In this paper we only focus on residential units for sale in the 110 province capital cities, which include all major cities and comprise about 18 million inhabitants in total. In cities the majority of transactions is brokered by real estate agents, who are more likely to upload an ad on Immobiliare.it than private citizens, whereas in small towns and in rural areas representativeness is potentially a problem. The set of ads we will work on encompasses 1,037,095 units.
However, not all ads refer to a distinct dwelling. Indeed, there is a substantial fraction of duplicate ads, that is two or more ads that refer to the same dwelling. The existence of duplicates is due to several reasons. First, in Italy there is no legal obligation for owners to entrust at most one real estate agent for the sale of their property. This means that two or more real estate agents may be selling the same dwelling at the same time. Second, the same agency may remove an ad and upload an identical one, so that the new ad is more recent. (In Sect. 3.1 we show that most clicks on any ad occur within the first few days the ad is posted.) Third, the mandate of an agency may cease and the seller could decide to entrust another agency, which would then upload a new ad. In previous work [33] we showed that the existence of duplicates is not random and that keeping duplicate ads can lead to a serious misrepresentation of the supply of dwellings for sale, especially when looking at small geographical aggregates. We identify duplicate ads using a machine learning methodology, described in Sect. 5.1 and in Ref. [33] in much more detail. According to our procedure, the total number of dwellings is 653,499 units, about 63% of the total number of posted ads.
Finally, we use the geographical coordinates of the ads to match our data with two administrative datasets. The first comes from Osservatorio del Mercato Immobiliare (OMI), the real-estate market observatory of the Italian Tax Agency. From this dataset we extract the perimeters of the so-called OMI microzones, homogeneous areas in terms of socioeconomic and geographic characteristics that roughly correspond to neighborhoods.^{2} We then perform spatial matching and assign to each ad its corresponding OMI microzone. The second dataset is the Italian 2011 Census, providing information on socioeconomic characteristics. As census tracts do not correspond to OMI microzones, we impute data to OMI microzones depending on the percentage of overlap between each census tract and OMI microzone (see Ref. [33] for more details).
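The imputation from census tracts to OMI microzones can be pictured as an overlap-weighted average. The sketch below is only an illustration: the function name, the tract identifiers and the choice of a weighted mean are assumptions made here, and the procedure actually used is the one described in Ref. [33].

```python
def impute_to_microzone(tract_values, overlap_weights):
    """Overlap-weighted average of a tract-level variable for one microzone.

    tract_values:    dict tract id -> value of the census variable (e.g. a share)
    overlap_weights: dict tract id -> overlap between that tract and the microzone
    Both the inputs and the weighting rule are hypothetical.
    """
    total_weight = sum(overlap_weights.values())
    weighted_sum = sum(tract_values[tract] * weight
                       for tract, weight in overlap_weights.items())
    return weighted_sum / total_weight


# Made-up example: two tracts overlapping one microzone.
unemployment = {"tract_A": 0.10, "tract_B": 0.20}
overlap = {"tract_A": 0.75, "tract_B": 0.25}
print(impute_to_microzone(unemployment, overlap))
```

A weighted mean suits variables expressed as rates or shares; counts would instead need to be apportioned and summed.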
3 Descriptive statistics on clicks and contacts
We quantify online interest by the number of clicks and contacts. For each ad and every snapshot of the database, we record new clicks/contacts, so as to follow the full evolution of these variables over the lifetime of all ads. We perform our analysis both at the level of individual ads, and after aggregation at the OMI microzone/city level. In this section we provide some descriptive statistics on clicks and contacts for all spatial aggregation levels.
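Since the website stores cumulative counters, new clicks or contacts per snapshot follow from differencing successive values. A minimal sketch, assuming the counter starts at zero when the ad is posted (the function name and the numbers are illustrative):

```python
def new_events(cumulative_counts):
    """Convert a cumulative counter, observed at successive snapshots,
    into the number of new events recorded between snapshots."""
    increments = []
    previous = 0  # assumption: the counter is zero before the first snapshot
    for count in cumulative_counts:
        increments.append(count - previous)
        previous = count
    return increments


# Made-up cumulative click counter for one ad over five snapshots.
clicks_by_snapshot = [12, 30, 41, 41, 47]
print(new_events(clicks_by_snapshot))  # [12, 18, 11, 0, 6]
```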
3.1 Individual statistics
We only focus on ads uploaded from 2016 (since we only have quarterly snapshots in 2015) and subsequently removed from the dataset, to make sure that we follow all the lifetime of the ads. This corresponds to 329,915 ads.
In Fig. 2B we show a histogram with the total number of contacts (note the logarithmic scale on the vertical axis) that occurred through an ad. Each bin in the histogram corresponds to a unit, so that the first bin is the number of ads that received 0 contacts, etc. The median is 0 (201,934 ads received no contacts), suggesting that the contact form on the webpage is only used in a minority of cases. When it is used, for most ads it is used once (52,788), twice (25,569) or three times (14,206). However, the number of contacts decays slowly. The distribution is very heavy-tailed, and some ads received a large number of contacts. We do not attempt to identify the shape of this distribution as it varies over too few orders of magnitude. (While the figure has a cutoff at 50 contacts, only 161 ads received between 50 and 100 contacts, and only 20 received more than 100 contacts. We suspect many of these could be outliers for which particular conditions apply.)
3.2 Aggregate statistics
We consider the temporal and spatial distribution of clicks and contacts, as aggregated either at the level of cities or OMI microzones.
In both Rome and Milan, prices are highest in the center and decrease towards the peripheries, but online interest is maximal in an intermediate area between the center and the peripheries. We conjecture that this pattern may be related to income and wealth inequality. Only a few people at the top of the income/wealth distribution can afford, and so look for, apartments in the center of Rome and Milan, where prices are easily above 6000 euros per m2. This results in a lower number of clicks and contacts with respect to neighborhoods that are still attractive but less expensive. It is also interesting that the neighborhoods of Sapienza University in Rome and of Polytechnic and Statale University (Città Studi) in Milan are in the top sextile of daily contacts/clicks per ad, suggesting a high demand from students.
4 Evidence that online interest proxies demand
Is online interest a good proxy for actual demand? In this section we provide evidence that supports this hypothesis, showing that online interest has the same effect as demand on time on market, liquidity and prices. We run our analysis both at the level of individual ads and aggregating data over OMI microzones/cities. We mostly follow the microeconometrics literature [24, 34], in that we assess whether the effect of clicks/contacts is statistically significant by running hypothesis tests on the coefficients of pre-specified statistical models. The alternative would be, as in machine learning, to perform model selection and estimation jointly, but in this way the estimated parameters may not indicate any structure due to the correlation among predictors [29, 35]. We assume linear relations among variables, which is certainly a restriction but makes it possible to give a simple interpretation to the coefficients of these linear models. In addition, we can control in a transparent way for other characteristics that clicks and contacts may also be a proxy for (e.g. intrinsic quality of a dwelling/neighborhood).
4.1 Evidence at the individual level
We test whether high online interest for a dwelling is correlated with shorter time on market and with price revisions. We only focus on ads that have been posted since the beginning of 2016 and that have no duplicates (which would bias the analysis).^{4}
We construct two variables, RELCLICKS and RELCONTACTS, to quantify the relative interest in a particular dwelling with respect to all other dwellings in the same OMI microzone. In the case of time on market, RELCLICKS and RELCONTACTS are defined as the total number of clicks/contacts on an ad, divided by the average number of clicks/contacts in the corresponding OMI microzone during the same period in which the ad has been online. In the case of price revisions, this definition would not work. Indeed, price revisions trigger a change of online interest, as can be seen in Fig. 3D. This would lead to a dubious interpretation of the results. To solve this problem, when analyzing price revisions we define RELCLICKS and RELCONTACTS as the ratio of clicks/contacts in the first 14 days since the ad was posted to the average of the OMI microzone in the same period. (We also discard ads that had a price revision within 15 days.) This choice is justified by the peak of clicks/contacts in the first few days after ads are posted (Fig. 3A–B).
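The definition of RELCLICKS can be illustrated with a few lines of code. This is a sketch under stated assumptions: the function name is hypothetical, the microzone values are taken to be already restricted to the relevant period, and whether the ad itself enters the microzone average is not specified in the text.

```python
from statistics import mean


def relative_interest(ad_events, zone_events):
    """Ratio between the clicks (or contacts) on one ad and the average over
    ads in the same OMI microzone during the same period.

    ad_events:   number of clicks/contacts on the ad of interest
    zone_events: clicks/contacts of the ads in the microzone over that period
    """
    return ad_events / mean(zone_events)


# Made-up numbers: the ad received 150 clicks, the microzone average is 125.
print(relative_interest(150, [100, 200, 150, 50]))
```

For price revisions, the same ratio would be computed with both inputs restricted to the first 14 days after posting.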
Effect of online interest on time on market and chance of price revisions

| Dependent variable (model) | (1) LOGTIMEONMARKET (OLS) | (2) LOGTIMEONMARKET (OLS) | (3) PRDECREASE (logistic) | (4) PRINCREASE (logistic) |
|---|---|---|---|---|
| RELCLICKS | −0.520^{∗∗∗} (0.004) | | −0.095^{∗∗∗} (0.013) | 0.156^{∗∗∗} (0.042) |
| RELCONTACTS | | −0.481^{∗∗∗} (0.004) | | |
| RELPRICEM2 | −0.060^{∗∗∗} (0.010) | −0.022 (0.014) | 0.222^{∗∗∗} (0.017) | −0.375^{∗∗∗} (0.023) |
| FLOORAREA | 0.0002^{∗∗∗} (0.0001) | −0.0003^{∗∗∗} (0.0001) | 0.0001 (0.0001) | −0.0002 (0.0003) |
| STATUS | −0.033^{∗∗∗} (0.004) | 0.007 (0.005) | −0.203^{∗∗∗} (0.008) | 0.539^{∗∗∗} (0.028) |
| ROOMS | 0.024^{∗∗∗} (0.004) | 0.012^{∗∗} (0.006) | −0.028^{∗∗∗} (0.007) | 0.016 (0.025) |
| Constant | 3.853^{∗∗∗} (0.364) | 4.332^{∗∗∗} (0.111) | 1719.883^{∗∗∗} (639.811) | 1559.833 (7900.740) |
| Observations | 71,221 | 26,536 | 128,829 | 128,829 |
| Adjusted R-squared | 0.327 | 0.457 | / | / |
| Residual deviance | / | / | 141,916 | 18,944 |
| AIC | 174,083 | 56,598 | 145,972 | 21,848 |
The coefficient on RELCLICKS is highly statistically significant and can be interpreted in the following way: a 1% higher number of clicks is on average associated with a 0.52% shorter time on market, holding all control variables constant. Here we cannot interpret this coefficient causally because, for example, the time on market influences the relative number of clicks, as the temporal profile of clicks is not uniform (Fig. 3), and so there is reverse causality [34]. The elasticity for the variable RELCONTACTS is similar.^{5} Looking at the control variables, it appears that dwellings with a higher relative price spend less time on the market, although in this case statistical significance is less clear.
To quantify variable importance in the determination of the time on market we use a regression tree. In particular, we use the R package rpart and select the hyperparameters for visualization purposes. We use the same variables RELCLICKS, RELCONTACTS, RELPRICEM2, FLOORAREA, STATUS, ROOMS as in Table 1. Instead of controlling for location using the fine-grained OMI microzone dummies, we use distance from the center and city dummies, and only consider ads in the four largest cities (Rome, Milan, Naples and Turin). This again is dictated by the necessity to produce a discernible regression tree.
After testing that online interest is predictive of time on market, we now test whether it is predictive of price revisions. In our dataset, about 25% of the dwellings had a price change; out of these, about 6% had an increase in price, and the complementary 94% had a price decrease. These figures are consistent with data from the Italian Housing Market Survey (jointly run by Banca d’Italia), showing that the share of transactions in which the actual transaction price was equal to or higher than the asking price was about 3.0% in 2015, 5.1% in 2016 and 5.6% in 2017. However, there are two caveats that we should make here. First, these price revisions do not necessarily reflect the transaction price (other revisions may occur during offline bargaining). Second, we cannot know why price revisions occurred. In particular, in the case of price increases, this may reflect an auction, but also the fact that the agency corrected a wrong posted price. Yet, our imperfect measures for price revisions carry information, and so it is useful to see what the effect of online interest is on them.
In Table 1, column (3), we show the results from running a logistic regression on the binary variable PRDECREASE, taking value 1 if the price of the dwelling was revised downward, and 0 if it was not revised or if it was revised upward. In column (4) we consider the variable PRINCREASE, defined as PRDECREASE but equal to 1 if the price was revised upward, and 0 otherwise. With logistic regressions, the interpretation of the coefficients is less straightforward than with OLS.^{6} It is first necessary to take exponentials. Doing this, looking at the coefficient on RELCLICKS in Table 1 we get \(\exp (-0.095)= 0.91\) in the case of PRDECREASE, and \(\exp (0.156)=1.17\) in the case of PRINCREASE. These numbers can then be interpreted as changes in the odds, that is the ratio of the probability that the event happens to the complementary probability that it does not happen. Given the logarithmic transformation of RELCLICKS, the interpretation is that a 1% increase in the relative number of clicks is associated with a 0.09% reduction in the odds of a downward price revision, and with a 0.17% increase in the odds of an upward price revision.
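The arithmetic behind this interpretation can be checked directly. The sketch below takes the two RELCLICKS coefficients from Table 1 and reproduces the exponentials and the approximate percentage changes in the odds; the function names and the linearization used for a 1% increase are assumptions made for illustration.

```python
import math

# Coefficients on the log-transformed RELCLICKS, as reported in Table 1.
BETA_PRDECREASE = -0.095
BETA_PRINCREASE = 0.156


def odds_ratio(beta):
    """Multiplicative change in the odds for a one-unit increase in the regressor."""
    return math.exp(beta)


def approx_pct_change_in_odds(beta, pct_increase=1.0):
    """Approximate % change in the odds for a small % increase in a
    log-transformed regressor.

    A 1% increase moves the log regressor by about 0.01, so the odds change by
    roughly (exp(beta) - 1) * 1%. This linearization is an assumption.
    """
    return (math.exp(beta) - 1.0) * pct_increase


for name, beta in [("PRDECREASE", BETA_PRDECREASE), ("PRINCREASE", BETA_PRINCREASE)]:
    print(name, round(odds_ratio(beta), 2), round(approx_pct_change_in_odds(beta), 2))
```

An exact calculation, `(1.01 ** beta - 1) * 100`, gives values of similar size.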
The results for RELCONTACTS are similar, except that we cannot make a logarithmic transformation of this variable because the condition \(\mathit {RELCONTACTS}>0\) is satisfied by too few ads with at least one price revision. To deal with this, we run a regression akin to Eq. (1), but in which the independent variable RELCONTACTS enters linearly. The coefficients are \(-0.028\ (0.002)^{***}\) for PRDECREASE, and \(0.012\ (0.003)^{***}\) for PRINCREASE. Taking exponentials as above gives \(\exp (-0.028)=0.972\) and \(\exp (0.012)=1.012\). Because RELCONTACTS enters linearly, a unit increase in RELCONTACTS is associated with a reduction of about 2.8% in the odds of a downward price revision, and with an increase of about 1.2% in the odds of an upward price revision.
Effect of online interest on the magnitude of price changes

| Dependent variable | (1) PRICEVAR | (2) PRICEVAR− | (3) PRICEVAR+ |
|---|---|---|---|
| RELCLICKS | 0.009^{∗∗∗} (0.001) | −0.007^{∗∗∗} (0.001) | 0.008^{∗∗} (0.004) |
| RELPRICEM2 | 0.001 (0.001) | −0.006^{∗∗∗} (0.001) | −0.040^{∗∗∗} (0.006) |
| FLOORAREA | −0.00004^{∗∗∗} (0.00001) | 0.00003^{∗∗∗} (0.00000) | −0.00003 (0.00004) |
| STATUS | 0.011^{∗∗∗} (0.0004) | −0.007^{∗∗∗} (0.0003) | −0.001 (0.002) |
| ROOMS | 0.005^{∗∗∗} (0.0004) | −0.005^{∗∗∗} (0.0003) | −0.005^{∗∗} (0.002) |
| Constant | −4.725^{∗∗∗} (1.680) | 6.717^{∗∗∗} (1.382) | 2.777 (10.226) |
| Observations | 36,344 | 34,552 | 1792 |
| Adjusted R-squared | 0.075 | 0.089 | 0.127 |
| AIC | −100,790 | −111,020 | −4460 |
4.2 Evidence at the aggregate level
We aggregate data over OMI microzones and cities and test whether aggregate online attention is a leading indicator of liquidity and prices. We mostly follow the approach of van Dijk and Francke [22], who analyze a dataset of online housing ads in the Netherlands and show that the average number of clicks Granger-causes liquidity and prices. We confirm their findings, and extend their analysis by considering smaller geographical aggregates (the spatial unit in their analysis [22] is municipalities, whereas here we also consider OMI microzones) and contacts in addition to clicks.^{8}
Our underlying hypothesis is that a tight market—that is, a market with relatively high demand as compared to the supply—at time t predicts an increase in price and liquidity at time \(t+1\), where t is an arbitrary temporal unit. (In this paper t corresponds to quarters, see below.) This can be justified theoretically in various ways. Carrillo et al. [18] use a search model in which a demand shock occurs. Sellers’ expectations on the number of buyers adjust slowly, and therefore it takes time to reach a different equilibrium. We empirically test the hypothesis that an increase in clicks leads to a lagged increase in liquidity and prices, using OMI microzones or cities as spatial units, and quarters as temporal units.

LOGLIQUIDITY is the logarithm of the ratio between the number of dwellings removed from the dataset and the number of dwellings for sale. In Ref. [33] we show that the number of dwellings removed from the dataset is highly correlated to the number of actual sales, as measured from OMI. (A dwelling is a cluster of duplicate ads. If we did not deal with duplicates, measures of liquidity would be biased [33].)

LOGPRICEM2 is the logarithm of the average price per m2. In Ref. [33] we show that the price per m2 calculated from this dataset is highly correlated to the price per m2 calculated by OMI using actual transactions.

LOGCLICKS and LOGCONTACTS are the logarithms of the average number of clicks/contacts per ad.
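With these variables, the lagged relation tested in this section can be written compactly. The display below is a sketch inferred from the regressors and fixed effects reported for the liquidity regressions; the symbols \(\alpha\), \(\beta\), \(\gamma_{1}\), \(\gamma_{2}\), \(\delta\), \(\mu\), \(\tau\), the index \(i\) for the spatial unit and the vector \(X_{i}\) of census controls (DEGREE, UNEMPLOYED, OWNEDHOUSES, FOREIGN) are notational assumptions.

\[
\mathit{LOGLIQUIDITY}_{i,t} = \alpha + \beta\,\mathit{LOGLIQUIDITY}_{i,t-1} + \gamma_{1}\,\mathit{LOGCLICKS}_{i,t-1} + \gamma_{2}\,\mathit{LOGCLICKS}_{i,t-2} + \delta^{\top} X_{i} + \mu_{c(i)} + \tau_{t} + \varepsilon_{i,t}
\]

Here \(\mu_{c(i)}\) is a city fixed effect, which applies only when the spatial unit is the OMI microzone, and \(\tau_{t}\) is a quarter fixed effect. The analogous specifications replace LOGCLICKS with LOGCONTACTS, or LOGLIQUIDITY with LOGPRICEM2.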
Lagged effect of aggregate online interest on liquidity

Dependent variable: \(\mathit{LOGLIQUIDITY}_{t}\)

| | (1) OMI microzone | (2) OMI microzone | (3) City | (4) City |
|---|---|---|---|---|
| \(\mathit{LOGLIQUIDITY}_{t-1}\) | 0.230^{∗∗∗} (0.019) | 0.230^{∗∗∗} (0.019) | 0.488^{∗∗∗} (0.043) | 0.384^{∗∗∗} (0.047) |
| \(\mathit{LOGCLICKS}_{t-1}\) | 0.060 (0.037) | | −0.088 (0.076) | |
| \(\mathit{LOGCLICKS}_{t-2}\) | 0.146^{∗∗∗} (0.037) | | 0.194^{∗∗} (0.078) | |
| \(\mathit{LOGCONTACTS}_{t-1}\) | | 0.075^{∗∗∗} (0.019) | | 0.039 (0.059) |
| \(\mathit{LOGCONTACTS}_{t-2}\) | | 0.037^{∗} (0.019) | | 0.184^{∗∗∗} (0.062) |
| DEGREE | −0.574^{∗∗∗} (0.114) | −0.483^{∗∗∗} (0.111) | 0.332 (0.533) | −0.374 (0.531) |
| UNEMPLOYED | 0.752 (0.647) | 0.861 (0.647) | 1.969^{∗} (1.151) | 1.547 (1.109) |
| OWNEDHOUSES | −0.116 (0.095) | −0.093 (0.095) | −0.778^{∗} (0.437) | −0.392 (0.423) |
| FOREIGN | −0.214 (0.133) | −0.314^{∗∗} (0.133) | 0.964 (0.688) | −0.188 (0.652) |
| Constant | −2.400^{∗∗∗} (0.208) | −1.013^{∗∗∗} (0.135) | −1.526^{∗∗∗} (0.576) | −0.859^{∗∗} (0.350) |
| Fixed effects | City + quarter | City + quarter | Quarter | Quarter |
| Observations | 2977 | 2977 | 423 | 420 |
| Adjusted R-squared | 0.372 | 0.370 | 0.350 | 0.359 |
| AIC | 1243 | 1249 | 336 | 301 |
Lagged effect of aggregate online interest on price

Dependent variable: \(\mathit{LOGPRICEM2}_{t}\)

| | (1) OMI microzone | (2) OMI microzone | (3) City | (4) City |
|---|---|---|---|---|
| \(\mathit{LOGPRICEM2}_{t-1}\) | 0.511^{∗∗∗} (0.014) | 0.511^{∗∗∗} (0.014) | 0.639^{∗∗∗} (0.039) | 0.619^{∗∗∗} (0.041) |
| \(\mathit{LOGCLICKS}_{t-1}\) | 0.035 (0.025) | | −0.050 (0.058) | |
| \(\mathit{LOGCLICKS}_{t-2}\) | 0.051^{∗∗} (0.026) | | −0.040 (0.060) | |
| \(\mathit{LOGCONTACTS}_{t-1}\) | | 0.019 (0.013) | | −0.026 (0.046) |
| \(\mathit{LOGCONTACTS}_{t-2}\) | | 0.030^{∗∗} (0.013) | | 0.100^{∗∗} (0.050) |
| DEGREE | 0.836^{∗∗∗} (0.083) | 0.866^{∗∗∗} (0.081) | 0.938^{∗∗} (0.414) | 0.513 (0.424) |
| UNEMPLOYED | −2.736^{∗∗∗} (0.456) | −2.694^{∗∗∗} (0.455) | −0.298 (0.878) | −0.897 (0.897) |
| OWNEDHOUSES | −0.503^{∗∗∗} (0.068) | −0.493^{∗∗∗} (0.067) | −0.392 (0.333) | −0.057 (0.336) |
| FOREIGN | −0.444^{∗∗∗} (0.094) | −0.483^{∗∗∗} (0.093) | 0.109 (0.524) | 0.107 (0.519) |
| Constant | 3.366^{∗∗∗} (0.171) | 3.945^{∗∗∗} (0.148) | 3.390^{∗∗∗} (0.556) | 2.921^{∗∗∗} (0.417) |
| Fixed effects | City + quarter | City + quarter | Quarter | Quarter |
| Observations | 2977 | 2977 | 423 | 420 |
| Adjusted R-squared | 0.835 | 0.835 | 0.540 | 0.541 |
| AIC | −919 | −920 | 110 | 109 |
5 Duplicates, demand and prices
The main problem of this dataset is the substantial fraction of duplicate ads. We devise a machine learning algorithm to identify duplicates and to cluster ads so that each cluster corresponds to a unique dwelling. This deduplication procedure was necessary to clean the data for the analysis in the previous sections. In this section we show that duplicates can also be exploited to shed some light on a classical problem in econometrics: what difference in demand is caused by a difference in price?
5.1 Description of the deduplication algorithm
We adapt standard methodologies for the deduplication of datasets [36, 37] to our specific case. Here we only give an overview of the working of the algorithm. For a more detailed description and the pseudocodes, see Ref. [33].
Model. We perform a pairwise comparison, meaning that we compare each ad with all other ads that are close enough—both in terms of geographical coordinates and price—to potentially be duplicates. We use a C5.0 classification tree. For each pair of ads the classification tree outputs a probability that they are duplicates. If this probability is larger than 0.5, we consider the two ads as duplicates. We implement two different C5.0 models, depending on whether the ads are posted by the same agency or not.
Predictors. Among the predictors we consider the geographical distance, the difference in price, the temporal distance between the upload dates, and the difference between the physical characteristics of the dwellings. As some physical characteristics are categorical variables, we consider different degrees of similarity, taking advantage of the natural order of the classes. For example, two ads with reported maintenance status “new” and “good” respectively are more likely to be duplicates than two ads with “new” and “to renovate”. A final important predictor is the distance between the textual description of the two ads. For this variable we consider two different measures, depending on whether the ads are posted by the same agency or not. In the first case we use the Levenshtein distance, as only a few words may have changed.
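Two of these predictors are easy to illustrate. The sketch below implements the standard Levenshtein distance for descriptions posted by the same agency, and a rank difference for an ordered categorical variable. The category ordering in `STATUS_ORDER` and the function names are hypothetical; the predictors actually used are specified in Ref. [33].

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Hypothetical ordering of the maintenance status classes.
STATUS_ORDER = ["to renovate", "good", "new"]


def ordinal_gap(status_a, status_b, order=STATUS_ORDER):
    """Number of steps separating two classes of an ordered categorical variable."""
    return abs(order.index(status_a) - order.index(status_b))


print(levenshtein("bright flat near the park", "bright flat near a park"))
print(ordinal_gap("new", "good"), ordinal_gap("new", "to renovate"))
```

A smaller gap between classes makes a pair more similar, which matches the intuition that "new" and "good" are closer than "new" and "to renovate".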
In the case of different agencies, we instead compute the cosine similarities between the vectors produced using the doc2vec algorithm [38], as implemented in gensim [39]. Doc2vec is an unsupervised algorithm that learns vector representations of documents, so that two documents that are close in “context” are also close in vector space. We use the Distributed Memory version of doc2vec. This is a two-layer neural network in which the output neuron is a word w and the input neurons are a set of words surrounding w and an identifier for each document. Learning occurs by minimizing the distance between the predicted and actual w, over all w and all documents. We choose the training settings (number of training epochs, use of stopwords, minimum frequency of words, etc.) via cross-validation. In particular, we check how often the out-of-sample predicted vector for a document is closest to the in-sample learned vector for that document. In the best performing case, this is achieved 85% of the time.
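Once each description has been mapped to a vector, the similarity between two ads reduces to a cosine. A minimal sketch in plain Python, assuming the document vectors have already been produced by a trained model (the vectors below are made up and far shorter than real embeddings):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors.
    Returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical document vectors for the descriptions of two ads.
description_a = [0.2, -0.1, 0.7]
description_b = [0.25, -0.05, 0.6]
print(cosine_similarity(description_a, description_b))
```

Values close to 1 indicate descriptions that are close in vector space, which is the signal fed to the classifier.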
Training. We manually construct a training sample by verifying the photos of the ads on the website. The training sample for the ads of different agencies is made up of 9997 pairs of ads; among them 3483 are duplicates (true positive, TP). The training sample for the ads of the same agency is made up of 8688 observations and 1473 are duplicates.
In order to assess the performance of the two models we randomly split the training sample into two subsamples: the first one (90% of the observations) is used to estimate the models using boosting, the second one (10% of the observations) is used for the out-of-sample assessment of the classification performance. We repeat the operation 1000 times (drawing different subsamples) and we evaluate the performance based on average results. Since the number of true negatives (ads that are not duplicates and are identified as such) is much larger than the number of true positives, using the classic accuracy rate can be misleading. For this reason we consider measures of classification performance that do not rely on the number of true negatives, namely: precision, recall and F-measure.
Assessment of C5.0 models
Sample            Observations  Duplicates  Precision  Recall  F-measure
Different agency  9997          3483        0.923      0.892   0.907
Same agency       8688          1473        0.952      0.963   0.957
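As a reminder of how these measures are defined, the sketch below computes precision and recall from counts of true positives (TP), false positives (FP) and false negatives (FN), and the F-measure as their harmonic mean. It assumes that the F-measure reported above is the standard balanced one (F1); none of the three uses the number of true negatives. The last two lines check that the reported precision and recall are consistent with the reported F-measure.

```python
def precision(tp: int, fp: int) -> float:
    """Share of pairs flagged as duplicates that really are duplicates."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Share of true duplicate pairs that the classifier finds."""
    return tp / (tp + fn)


def f_measure(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (balanced F-measure, assumed)."""
    return 2 * p * r / (p + r)


# Consistency check on the averages reported in the table above.
f_different = round(f_measure(0.923, 0.892), 3)  # different agency
f_same = round(f_measure(0.952, 0.963), 3)       # same agency
```

Since the reported figures are averages over 1000 random splits, the harmonic mean of the average precision and recall need not coincide exactly with the average F-measure, but here the two agree to three decimals.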
Once we have created the clusters of ads identifying different dwellings, we collapse the information contained in multiple ads related to the same dwelling. As a general rule, for each characteristic we take the value with the highest absolute frequency (the mode).
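A minimal sketch of this collapsing rule, assuming each ad is a dictionary with the same keys (the field names and values below are invented for illustration):

```python
from collections import Counter
from typing import Dict, Hashable, List, Optional


def collapse_cluster(ads: List[Dict[str, Hashable]]) -> Dict[str, Optional[Hashable]]:
    """For each characteristic, keep the value reported most often across
    the ads of one cluster; missing values (None) are ignored."""
    collapsed: Dict[str, Optional[Hashable]] = {}
    for key in ads[0]:
        values = [ad[key] for ad in ads if ad.get(key) is not None]
        # most_common(1) returns the modal value; ties go to the value seen first.
        collapsed[key] = Counter(values).most_common(1)[0][0] if values else None
    return collapsed


# Hypothetical cluster of three ads referring to the same dwelling.
cluster = [
    {"rooms": 3, "floor": 2, "maintenance": "good"},
    {"rooms": 3, "floor": 2, "maintenance": "new"},
    {"rooms": 4, "floor": 2, "maintenance": "good"},
]
dwelling = collapse_cluster(cluster)
```

How ties are broken is not specified in the text; taking the first value encountered is an arbitrary choice made here.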
Real time implementation. To make the methodology computationally feasible, we apply an iterative approach. In particular, we process the ads progressively as soon as they are published on the website. In this way we are able to reduce the number of pairwise comparisons between ads.
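One way to picture the iterative approach is an index that is updated as each ad arrives, so that a new ad is only compared with earlier ads that are nearby and similarly priced. The sketch below is an assumption about how this could be organized, not the authors' implementation: the grid size `CELL_DEG`, the price tolerance `MAX_PRICE_GAP` and the class `IncrementalMatcher` are all hypothetical. The candidates it returns would then be passed to the classifier.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Ad:
    ad_id: str
    lat: float
    lon: float
    price: float


CELL_DEG = 0.005      # hypothetical grid size, in degrees
MAX_PRICE_GAP = 0.25  # hypothetical maximum relative price gap


class IncrementalMatcher:
    """Stores ads in a coarse spatial grid and returns, for each new ad,
    the earlier ads that are close enough to be candidate duplicates."""

    def __init__(self) -> None:
        self.grid: Dict[Tuple[int, int], List[Ad]] = defaultdict(list)

    def _cell(self, ad: Ad) -> Tuple[int, int]:
        return (math.floor(ad.lat / CELL_DEG), math.floor(ad.lon / CELL_DEG))

    def add(self, ad: Ad) -> List[Ad]:
        row, col = self._cell(ad)
        candidates = []
        # Look only at the ad's own grid cell and its eight neighbours.
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                for other in self.grid.get((row + dr, col + dc), []):
                    gap = abs(ad.price - other.price) / min(ad.price, other.price)
                    if gap <= MAX_PRICE_GAP:
                        candidates.append(other)
        self.grid[(row, col)].append(ad)
        return candidates


matcher = IncrementalMatcher()
first = matcher.add(Ad("a1", 45.4642, 9.1900, 200_000))
second = matcher.add(Ad("a2", 45.4645, 9.1903, 210_000))  # near a1, similar price
third = matcher.add(Ad("a3", 41.9028, 12.4964, 205_000))  # far away
fourth = matcher.add(Ad("a4", 45.4643, 9.1901, 400_000))  # near, but price too far
```

Each new ad is compared with a handful of neighbours instead of with every ad already in the database, which is the source of the reduction in pairwise comparisons.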
5.2 Estimating the price elasticity of demand
It is a typical problem for businesses to forecast demand. But often companies need to understand how demand would be different if the price of their product were different. This is a causal question, and it is a much trickier task. Demand and supply are simultaneously determined, and to identify demand or supply curves it is often necessary to look for exogenous “shifters” [24]. Here we propose a method to estimate the price elasticity of demand (the relative difference in demand caused by a relative difference in price) by exploiting duplicate ads. We stress that here we use a proxy of demand (the number of clicks), unlike most of the literature, which considers realized demand (for example, Cohen et al. [26] use actual Uber rides purchased by users of the Uber app).
Our method is inspired by the potential outcomes framework [31], and in particular by propensity score matching [32]. A typical way to assess causality is to assign some treatment to a set of randomly chosen units, and then compare the effects on units that received treatment vs. units that did not receive it. This does not work in observational studies in which units decide if they want to undergo treatment. The basic idea of propensity score matching is to compare units that are very similar except for their choice to receive treatment. Our method compares units that are identical, except for the “treatment” variable.
Indeed, our strategy to compute the elasticity of demand is to consider pairs of duplicate ads posted with a different price. This difference in price may simply reflect the decision of the seller to revise the price jointly with the decision of the agency to post a new ad for the same dwelling, or it could be that different agencies suggest different prices. Our key assumption is that all differences in the number of clicks between the two ads can only be imputed to the differences in price. Indeed, the two ads should have identical or very close characteristics, so that the only difference in the number of clicks should come from the user preference for cheaper dwellings—or from one of the two prices exceeding her maximum willingness to pay.
The full dataset contains 113,365 pairs of duplicate ads. We restrict our sample to ads published after the beginning of 2016, and to pairs for which the two ads have been posted within 60 days from each other. We finally remove ads whose price changed in the period of observation, because a price change makes one of the two ads different from the other. This selection leaves us with 16,824 ads, or equivalently 8412 unique dwellings.
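The three selection rules above can be written as a single predicate on a pair of ads. This is a sketch under stated assumptions: the record layout (`PairedAd`, `DuplicatePair`) is hypothetical, "after the beginning of 2016" is read as on or after 1 January 2016, and "within 60 days" is read as at most 60 days apart.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PairedAd:
    posted: date
    price_changed: bool  # True if the asking price was revised while observed


@dataclass
class DuplicatePair:
    first: PairedAd
    second: PairedAd


START = date(2016, 1, 1)  # assumed reading of "beginning of 2016"
MAX_GAP_DAYS = 60


def keep_pair(pair: DuplicatePair) -> bool:
    """Apply the three sample restrictions to one pair of duplicate ads."""
    ads = (pair.first, pair.second)
    recent = all(ad.posted >= START for ad in ads)
    close = abs((pair.first.posted - pair.second.posted).days) <= MAX_GAP_DAYS
    stable = not any(ad.price_changed for ad in ads)
    return recent and close and stable


# Hypothetical pairs: one kept, three dropped for different reasons.
kept = DuplicatePair(PairedAd(date(2016, 3, 1), False), PairedAd(date(2016, 4, 10), False))
too_far = DuplicatePair(PairedAd(date(2016, 3, 1), False), PairedAd(date(2016, 6, 1), False))
too_old = DuplicatePair(PairedAd(date(2015, 12, 20), False), PairedAd(date(2016, 1, 5), False))
revised = DuplicatePair(PairedAd(date(2016, 3, 1), True), PairedAd(date(2016, 3, 5), False))
```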
Running the regression (3) on the 8412 pairs of ads we estimate \(\beta =-0.657\ (0.048)^{***}\) and \(\alpha =0.012\ (0.007)^{*}\) (\({}^{*}p<0.1\); \({}^{**}p<0.05\); \({}^{***}p<0.01\)). As expected, the price elasticity of demand is negative and highly significant, while the intercept is only marginally significant (at the 10% level). Here β has the causal interpretation of an elasticity: a 1% higher price causes on average a 0.66% lower number of clicks relative to the average in the OMI microzone.
As a robustness test, we also check whether our results hold under a different measure of \(c_{1i}\) and \(c_{2i}\). In particular we consider the number of clicks in the first 7 or 10 days, to deal with the potential issue that both duplicate ads may be online at the same time. With 7 days we estimate \(\beta =-0.608\ (0.044)^{***}\) and \(\alpha =0.004\ (0.006)\); with 10 days we find \(\beta =-0.633\ (0.044)^{***}\) and \(\alpha =0.004\ (0.006)\). This confirms the results, although the elasticities are slightly smaller in absolute value than in the 14 days case. In addition, here the intercept is not statistically significant even at the 10% level.
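To make the interpretation of β concrete, the sketch below fits a least-squares line to the log-ratio of clicks against the log-ratio of prices within each pair. The log-difference form is an assumption made for illustration and may differ from regression (3); the price pairs are synthetic, and the click ratios are generated with an exact elasticity of -0.66, so the fit simply recovers that value. Standard errors are not computed.

```python
import math
from typing import List, Tuple


def ols(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Synthetic (price_1, price_2) pairs, invented for illustration only.
TRUE_BETA = -0.66
prices = [(200_000, 210_000), (150_000, 135_000), (320_000, 300_000), (90_000, 99_000)]
x = [math.log(p1 / p2) for p1, p2 in prices]
y = [TRUE_BETA * xi for xi in x]  # log(c1 / c2) implied by the assumed elasticity
alpha, beta = ols(x, y)
```

In a log-log specification of this kind the slope reads directly as an elasticity: a 1% difference in price is associated with a β% difference in clicks.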
Our identification strategy (econometric jargon for the technique used to assess causality) comes with a series of caveats. First, users of the website should not be able to identify duplicates before clicking on them. We think this is reasonable in most cases. Indeed, if users search by list, duplicate ads may be listed far from each other and potentially have different “front pictures”. And if users search by map, it is quite common that multiple dwellings in the same block of flats are sold at the same time, so users should not be able to disambiguate between duplicate ads and multiple dwellings in the same block. Our choice of focusing on the first 14 days makes it unlikely that users decide not to click again on ads after realizing they are duplicates.
The second caveat is that agencies should be assigned randomly to the ads within the pair. Indeed, agencies can pay to upload “premium ads”, which are shown high up in the list and so receive a higher number of clicks. If agencies are systematically more likely to upload premium ads for more (or less) expensive ads, our estimates can be inconsistent. Third, small differences in characteristics should be assigned randomly to the ads within the pair (i.e., differences are just due to reporting errors). Fourth, the deduplication algorithm should have a low rate of false positives, that is, pairs of ads that are identified as duplicates but are not. (However, in this case one could argue that if the machine learning algorithm identifies the ads as duplicates, they are probably so similar that our identification strategy should work as well.) Although at least some of these effects are probably present to some extent, we think that they could alter the value of the estimated elasticity by at most some decimal points.
6 Conclusion
In the last few years a growing amount of research has used data coming from online sources to analyse the housing market (in addition to the references listed so far, see also Refs. [40–44]). The large number of housing ads websites (including Zillow.com and Trulia.com in the U.S., Zoopla.co.uk in the U.K., Immobilienscout24.de in Germany, Funda.nl in the Netherlands, etc.) will probably further increase the interest of researchers in this type of data. To the best of our knowledge, our work is the first to characterize online interest for individual ads.
We describe the distribution and temporal profile of two measures of online interest, clicks on ads and uses of the contact form on the page of each ad. We show that both the distributions of the total number of clicks and of the total number of contacts are heavy tailed, and that a peak of clicks/contacts occurs in the first few days after an ad is posted. We then use inferential statistics to provide evidence that online interest indeed proxies demand. Ads that receive a high number of clicks/contacts relative to other ads in the same neighborhood stay online for a shorter time, are less likely to have their price revised downward, and are more likely to have it revised upward. We also aggregate data at the level of neighborhoods and cities, replicating existing results in the literature that document a lagged increase in prices and volume of transactions that follows a spike in demand. As time on market, price revisions and liquidity respond to online interest in the same way as to actual demand, we deduce that clicks and contacts are a good proxy.
Our second key contribution is to show how these data can be used to estimate the price elasticity of demand, the relative change in demand in response to a relative change in price. This should be understood in the sense of a thought experiment (had the price been different by x%, demand would be different by y%) and not in the sense of revising the price when the ad was already online. We exploit the substantial fraction of duplicate ads identified with a machine learning algorithm. Under some caveats, differences in demand between two ads that advertise the same dwelling can only be caused by differences in price (controlling for the different times at which the two ads may be posted). Quantitatively, we show that a 1% higher price causes a 0.66% lower number of clicks.
Econometrics is mostly used to understand causality, while machine learning is mostly used for prediction. It has recently been argued that the strengths of the two approaches should be combined [28–30]. For example, Belloni et al. [45, 46] suggest using LASSO to select among a large number of instrumental variables. The method we introduce here is less general as it relies on the specific existence of duplicate ads with different prices, but it combines ideas from supervised machine learning and from the potential outcomes framework [31, 32]. As housing and non-housing marketplace websites are attracting increasing interest from researchers, we think that it can be applied in other circumstances. For example, it would be interesting to apply the method to “classifieds” (classified advertisement) websites such as Craigslist.org, Gumtree.com, Ganji.com, Leboncoin.fr, Subito.it, etc.
We have intentionally avoided using the term demand estimation as it is sometimes used to indicate causal identification and sometimes to refer to demand forecasting.
Using OMI microzones as spatial units is not necessarily an optimal choice, and it is known that some results may in general depend on how spatial units are constructed [47]. Yet, these microzones are constructed to estimate the market value of properties for tax purposes, so it is reasonable to assume that boundaries are drawn meaningfully. An interesting extension of this work would be to use clustering methods to construct data-driven spatial units.
Due to representativeness concerns, we remove data corresponding to the bottom 5% of the distribution of the number of visible ads. In a few small cities and for specific quarters only a few ads are visible, and these are often outliers.
For the analysis of the time on market, we only consider ads that have been removed from the dataset as this ensures that we follow the ad throughout its lifespan.
In the regression in column (2) we have only considered the ads that had at least one contact, otherwise \(\mathit {RELCONTACTS}=0\) and the logarithmic specification would not be possible.
See Ref. [24] for a textbook treatment, or the clear explanation at https://stats.idre.ucla.edu/other/multpkg/faq/general/faqhowdoiinterpretoddsratiosinlogisticregression/.
There is no need to take a logarithmic transformation of PRICEVAR, as it is already defined as a relative change.
From a technical point of view, our analysis differs from van Dijk and Francke [22] in that we consider the levels of clicks, contacts, liquidity and prices, whereas they consider first differences. Our temporal span of 1.5 years (2015 data cannot be used here) does not make it possible to remove seasonality, impairing an analysis based on first differences.
Declarations
Acknowledgements
For their comments, we thank the participants of the “Harnessing Big Data and Machine Learning Technologies for Central Banks” workshop at Banca d’Italia, in particular our discussant Stefano Nardelli. For invaluable technical support with the creation of the database, we thank Andrea Luciani. We also thank Roberta Zizza for suggesting the first part of the title, and Adrián Carro, Penny Mealy and three anonymous reviewers for providing comments on the manuscript. This work was funded in part by INET and by EPSRC award number 1657725. We are extremely grateful to Immobiliare.it for providing the data and for their assistance. All mistakes are our own.
Availability of data and materials
Data are proprietary and cannot be shared. On request, we are happy to share the codes we used for the analysis.
Funding
This work was funded in part by INET and EPSRC award number 1657725.
Authors’ contributions
MP and ML designed research, analyzed data and wrote the paper. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests. The views expressed in this paper are those of the authors and do not reflect the views of Banca d’Italia.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
[1] Bollen J, Mao H, Zeng X (2011) Twitter mood predicts the stock market. J Comput Sci 2(1):1–8
[2] Preis T, Moat HS, Stanley HE (2013) Quantifying trading behavior in financial markets using Google trends. Sci Rep 3:01684
[3] Adamic LA, Glance N (2005) The political blogosphere and the 2004 US election: divided they blog. In: Proceedings of the 3rd international workshop on link discovery. ACM, New York, pp 36–43
[4] Yasseri T, Sumi R, Rung A, Kornai A, Kertész J (2012) Dynamics of conflicts in Wikipedia. PLoS ONE 7(6):38869
[5] Szell M, Lambiotte R, Thurner S (2010) Multirelational organization of large-scale social networks in an online world. Proc Natl Acad Sci 107(31):13636–13641
[6] Altenburger KM, Ugander J (2018) Monophily in social networks introduces similarity among friends-of-friends. Nat Hum Behav 2(4):284
[7] Beiró MG, Panisson A, Tizzoni M, Cattuto C (2016) Predicting human mobility through the assimilation of social media traces into mobility models. EPJ Data Sci 5(1):30
[8] Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2009) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012
[9] Hartwick J, Schweizer U, Varaiya P (1976) Comparative statics of a residential economy with several classes. J Econ Theory 13(3):396–413
[10] Fujita M (1989) Urban economic theory: land use and city size. Cambridge University Press, Cambridge
[11] Courant PN (1978) Racial prejudice in a search model of the urban housing market. J Urban Econ 5(3):329–345
[12] Wheaton WC (1990) Vacancy, search, and prices in a housing market matching model. J Polit Econ 98(6):1270–1292
[13] Han L, Strange WC (2015) The microstructure of housing markets. In: Handbook of regional and urban economics, vol 5, pp 813–886
[14] Feitosa FF, Reyes J, Zesk W (2008) Spatial patterns of residential segregation: a generative model. In: Proceedings of the Brazilian symposium on GeoInformatics, pp 157–162
[15] Filatova T, Parker D, Van der Veen A (2009) Agent-based urban land markets: agent’s pricing behavior, land prices and urban land use change. J Artif Soc Soc Simul 12(1):3
[16] Pangallo M, Nadal JP, Vignes A (2017) Residential income segregation: a behavioral model of the housing market. Available at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3084090
[17] Genesove D, Han L (2012) Search and matching in the housing market. J Urban Econ 72(1):31–45
[18] Carrillo PE, Wit ER, Larson W (2015) Can tightness in the housing market help predict subsequent home price appreciation? Evidence from the United States and the Netherlands. Real Estate Econ 43(3):609–651
[19] Merlo A, Ortalo-Magne F (2004) Bargaining over residential real estate: evidence from England. J Urban Econ 56(2):192–216
[20] Wu L, Brynjolfsson E (2015) The future of prediction: how Google searches foreshadow housing prices and sales. In: Economic analysis of the digital economy. NBER chapters. National Bureau of Economic Research, Cambridge, pp 89–118
[21] Askitas N (2016) Trendspotting in the housing market. Cityscape J Policy Dev Res 18(2):165–178
[22] van Dijk DW, Francke MK (2017) Internet search behavior, liquidity and prices in the housing market. Real Estate Econ 46(2):1–36
[23] Deaton A (1986) Demand analysis. In: Handbook of econometrics, vol 3, pp 1767–1839
[24] Wooldridge JM (2010) Econometric analysis of cross section and panel data. MIT Press, Boston
[25] Berry ST, Haile PA (2014) Identification in differentiated products markets using market level data. Econometrica 82(5):1749–1797
[26] Cohen P, Hahn R, Hall J, Levitt S, Metcalfe R (2016) Using big data to estimate consumer surplus: the case of Uber. NBER Working Paper 22627
[27] Bajari P, Nekipelov D, Ryan SP, Yang M (2015) Machine learning methods for demand estimation. Am Econ Rev 105(5):481–485
[28] Varian HR (2014) Big data: new tricks for econometrics. J Econ Perspect 28(2):3–27
[29] Mullainathan S, Spiess J (2017) Machine learning: an applied econometric approach. J Econ Perspect 31(2):87–106
[30] Athey S (2017) The impact of machine learning on economics. In: Economics of artificial intelligence. University of Chicago Press, Chicago
[31] Rubin DB (1974) Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol 66(5):688
[32] Rosenbaum PR, Rubin DB (1983) The central role of the propensity score in observational studies for causal effects. Biometrika 70(1):41–55
[33] Loberto M, Luciani A, Pangallo M (2018) The potential of big housing data: an application to the Italian real-estate market. Bank of Italy Working Paper N. 1171
[34] Verbeek M (2008) A guide to modern econometrics. Wiley, Hoboken
[35] Zhao P, Yu B (2006) On model selection consistency of lasso. J Mach Learn Res 7(Nov):2541–2563
[36] Naumann F, Herschel M (2010) An introduction to duplicate detection. Morgan and Claypool Publishers, San Rafael
[37] Christen P (2012) Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer, Berlin
[38] Le Q, Mikolov T (2014) Distributed representations of sentences and documents. In: Proceedings of the 31st international conference on machine learning (ICML-14), pp 1188–1196
[39] Řehůřek R, Sojka P (2010) Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 workshop on new challenges for NLP frameworks. ELRA, Valletta, pp 45–50. http://is.muni.cz/publication/884893/en
[40] Piazzesi M, Schneider M, Stroebel J (2015) Segmented housing search. NBER Working Paper 20823
[41] Wu J, Deng Y (2015) Intercity information diffusion and price discovery in housing markets: evidence from Google searches. J Real Estate Finance Econ 50(3):289–306
[42] Lee KO, Mori M (2016) Do conspicuous consumers pay higher housing premiums? Spatial and temporal variation in the United States. Real Estate Econ 44(3):726–763
[43] Anenberg E, Laufer S (2017) A more timely house price index. Rev Econ Stat 99(4):722–734
[44] Glaeser EL, Hyunjin K, Michael L (2018) Nowcasting gentrification: using yelp data to quantify neighborhood change. Am Econ Assoc Pap Proc 108(1):77–82
[45] Belloni A, Chen D, Chernozhukov V, Hansen C (2012) Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica 80(6):2369–2429
[46] Belloni A, Chernozhukov V, Hansen C (2014) High-dimensional methods and inference on structural and treatment effects. J Econ Perspect 28(2):29–50
[47] Openshaw S (1984) The modifiable areal unit problem. GeoBooks, Norwich