Technological novelty profile and invention's future impact

We consider inventions as novel combinations of existing technological capabilities. Patent data allow us to identify such combinatorial processes in invention activities explicitly. Previous research has not considered that new combinations differ in how novel they are. Some combinations are naturally anticipated given past patent activities or mere random choice, whereas others deviate exceptionally from existing invention pathways. For each pair of classification codes, we calculate the likelihood that the pair is put together at random and the deviation of the empirical observation from that expectation, so as to assess the overall novelty (or conventionality) that a patent brings forth in each year. An invention is considered unconventional if a pair of codes therein is unlikely to be used together given past statistics. The temporal evolution of the distribution indicates that patenting activities become more conventional, with occasional cross-over combinations. Our analyses show that patents introducing novelty on top of conventional units receive more citations, and hence have higher impact.

With an increasing volume of electronic corpora available online, research on the systems of science and technology, once considered a domain of the humanities, social sciences and economics, has expanded to become a subject of data science [5]. The growing empirical literature in this area aims to identify the process of publication [18], to utilise Google ngram to characterise scientific evolution [11], to delineate the boundary of science [19,20], and to predict the future impact of scientific papers [21,22] and authors [23,24].
In the case of inventive activities, Youn and co-workers used technology codes, assigned by the United States Patent and Trademark Office (USPTO), as countable units to identify the underlying dynamics of inventions as combinatorial processes in a comprehensive and explicit way [1]. In this view, an invention yields either a new unit of technological capability, a new way of combining already existing units that produces an innovative function, or a refinement of existing combinations. The rate at which new combinations are introduced has been invariant over two centuries, implying that combination is in the nature of invention.
Building on these previous findings, we delve into the temporal evolution of this combinatorial process using U.S. patent data from 1836 to 2014. We first describe the data structure, and elaborate our method to quantify technological novelty scores. We then show how the z-scores of the pairings within an invention would affect its future impact. Finally we will discuss the implications of our findings.

DATA
The U.S. patent records began on July 31, 1790, with Samuel Hopkins' patent on pot ash [25]. Since then, almost ten million inventions have been granted over more than two hundred years [26]. We only consider utility patents, which are those that pertain to new and useful inventions, omitting design and plant patents for instance [27]. There are 8,884,909 utility patents up to December 2014, accounting for almost 90% of the documents. Among them, we analyse patents that have two or more codes (80.3%).
In order for examiners to search efficiently for relevant prior art, the U.S. patent office encodes salient technological capabilities into six-position alphanumeric codes. Every patent is then tagged with a combination of codes that represent the technologies involved in the invention [28]. The classification codes are created in a nested structure: 473 classes at the highest level and 168,743 codes at the lowest (most detailed) level. These low-level codes ('codes' from here on) can lie on different levels of the hierarchy tree; some classes have deeper branching than others.
We used patent citation data provided by the National Bureau of Economic Research (NBER) and considered a patent's citations as a measure of its impact [29,30]. The citation data span only 31 years, from 1976 to 2006, unlike the co-occurrence data, which cover almost two hundred years. In order to cover as much data as possible, we use the full co-occurrence data for section 4.1, and use the NBER data when citation information is needed in section 4.2. In order to control for the temporal effect on citation volume, we count citations only up to the first five years after publication. It is also known that only recent citations work well in predicting future impact [31]. This leaves us with data spanning 26 years (from 1976 to 2001) [8,32].
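The five-year citation window can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and treating the window as inclusive of the fifth year is an assumption.

```python
# Coverage of the NBER citation data as stated in the text.
NBER_FIRST_YEAR, NBER_LAST_YEAR = 1976, 2006
WINDOW = 5  # years after publication (inclusive bound is an assumption)

def citations_within_window(grant_year, citing_years, window=WINDOW):
    """Count citations received no later than `window` years after grant_year."""
    return sum(1 for y in citing_years if grant_year <= y <= grant_year + window)

def has_full_window(grant_year, window=WINDOW):
    """True if the whole citation window lies inside the data coverage."""
    return NBER_FIRST_YEAR <= grant_year and grant_year + window <= NBER_LAST_YEAR

# Cohorts whose five-year window is fully observed.
cohort_years = [y for y in range(NBER_FIRST_YEAR, NBER_LAST_YEAR + 1)
                if has_full_window(y)]
```

Under these assumptions the usable cohorts run from 1976 to 2001, since a patent granted after 2001 would have part of its window outside the data.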

METHODS
We aim to assess the novelty of technological constituents in each patent, and then compare aspects of this novelty to the patent's impact. We measure how technology codes are combined in the empirical data and compare the observed combination to what would be expected if the combinations were randomly configured. In this way, we can discern recurring themes within invention space and also those combinations that are unconventional or novel. These features, measured by well established standard scores, are related to patent future impact.

STANDARD SCORES (Z-SCORES) OF CODE PAIRS
The patent data $P$ can be represented by a collection of sets of classification codes, where each set corresponds to an individual patent and contains its classification codes. The z-score for a pair of codes, $\alpha$ and $\beta$, is expressed as

$$z_{\alpha\beta} = \frac{o_{\alpha\beta} - \mu_{\alpha\beta}}{\sigma_{\alpha\beta}}, \tag{1}$$

where $o_{\alpha\beta}$ is the observed number of times the code $\alpha$ appears together with $\beta$ within a patent (a set) in the actual data. $\mu_{\alpha\beta}$ and $\sigma_{\alpha\beta}$ are the expected number of co-occurrences of the codes and its standard deviation, derived from a null model of the data which randomises the code arrangement while preserving the code usage and the number of patents within the data (section 3.2 provides the details).
The observed co-occurrence count $o_{\alpha\beta}$ in the patent record is compared with $\mu_{\alpha\beta}$. If the two codes appear together more often than expected, Eq. (1) yields a positive value; if they are rarely paired within a patent relative to their expected occurrences, their z-score is negative. The significance of the deviation is obtained by normalising it by the expected standard deviation $\sigma_{\alpha\beta}$ [8,9,33,34]. We can thus associate high z-scores with very typical code pairings, whereas a negative z-score indicates an atypical or novel pairing of codes.
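The following Python sketch illustrates how the observed co-occurrences and the z-score of Eq. (1) could be computed. The function names and the toy patents are hypothetical and not part of the original analysis; the expected value and standard deviation are simply passed in as numbers here.

```python
from collections import Counter
from itertools import combinations

def observed_cooccurrences(patents):
    """Count, for every unordered code pair, the patents containing both codes."""
    counts = Counter()
    for codes in patents:
        # Each patent is a set of codes, so a pair is counted once per patent.
        for pair in combinations(sorted(set(codes)), 2):
            counts[pair] += 1
    return counts

def z_score(observed, expected, std):
    """Standard score: deviation from the expectation in units of std."""
    return (observed - expected) / std

# Toy data: three patents described by made-up code labels.
toy_patents = [{"A", "B"}, {"A", "B", "C"}, {"C", "D"}]
o = observed_cooccurrences(toy_patents)
```

Pairs are stored as sorted tuples so that (A, B) and (B, A) map to the same key; a pair that never co-occurs has a count of zero.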

EXPECTED CO-OCCURRENCES
The null model acts as the baseline against which we deem an aspect of the data statistically significant, beyond what would occur at random or with no underlying pattern or law. The aspect under consideration is the arrangement of the codes among the patents.
The premise of the null model is that each of these arrangements of codes is equally likely.
From these possible arrangements the expected pairing counts can be computed.
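Consider codes $\alpha$ and $\beta$, with numbers of occurrences $n_\alpha$ and $n_\beta$ within the set of patents $P$. Noting that a patent cannot be classified with the same code twice, the number of possible configurations of $\alpha$ and $\beta$ among the $|P|$ patents is

$$\binom{|P|}{n_\alpha}\binom{|P|}{n_\beta}.$$

Now consider those arrangements that contain exactly $x$ co-occurrences of $\alpha$ and $\beta$ within a patent. There are

$$\binom{|P|}{n_\alpha}\binom{n_\alpha}{x}\binom{|P|-n_\alpha}{n_\beta-x}$$

possible configurations: first distribute the $\alpha$ among the $|P|$ patents, then place $x$ of the $\beta$ in the $n_\alpha$ patents already assigned an $\alpha$, and finally distribute the remaining $\beta$ among the patents without an $\alpha$. This gives a hypergeometric probability distribution for the number of co-occurrences $X_{\alpha\beta}$,

$$\Pr(X_{\alpha\beta} = x) = \frac{\binom{n_\alpha}{x}\binom{|P|-n_\alpha}{n_\beta-x}}{\binom{|P|}{n_\beta}},$$

so the expected number of patents that have both $\alpha$ and $\beta$ is

$$\mu_{\alpha\beta} = \frac{n_\alpha n_\beta}{|P|},$$

and the variance of the number of co-occurrences is

$$\sigma^2_{\alpha\beta} = \mu_{\alpha\beta}\left(1-\frac{n_\alpha}{|P|}\right)\frac{|P|-n_\beta}{|P|-1}.$$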
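As a sanity check, the standard closed-form mean and variance of a hypergeometric distribution can be compared with a direct sum over its probability mass function. The Python sketch below does this for one illustrative case; the function names and the counts (5, 3 and 40) are assumptions chosen for the example.

```python
from math import comb

def cooccurrence_pmf(x, n_a, n_b, n_patents):
    """Hypergeometric probability of exactly x co-occurrences under the null model."""
    return comb(n_a, x) * comb(n_patents - n_a, n_b - x) / comb(n_patents, n_b)

def null_moments(n_a, n_b, n_patents):
    """Closed-form mean and variance of the number of co-occurrences."""
    mu = n_a * n_b / n_patents
    var = mu * (1 - n_a / n_patents) * (n_patents - n_b) / (n_patents - 1)
    return mu, var

# Illustrative counts: code a used 5 times, code b 3 times, among 40 patents.
n_a, n_b, n_patents = 5, 3, 40
support = range(min(n_a, n_b) + 1)
probs = [cooccurrence_pmf(x, n_a, n_b, n_patents) for x in support]

# Brute-force moments from the probability mass function.
brute_mu = sum(x * p for x, p in zip(support, probs))
brute_var = sum((x - brute_mu) ** 2 * p for x, p in zip(support, probs))

mu, var = null_moments(n_a, n_b, n_patents)
```

Both routes should agree up to floating-point error, and the probabilities should sum to one.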

INCORPORATING TEMPORAL EVOLUTION
As new technologies become successful, they may subsequently become established areas of inventive activity. Following z-scores in time allows us to observe cases where an invention was exceptionally novel at its time of creation, but its novelty 'washes out' as many similar inventions follow it over time.
To capture this time variance we consider z-scores specific to time-ordered subsets of the entire data. We choose cumulatively increasing subsets in yearly steps, letting $P(t)$ be the sub-collection of patents in $P$ issued up to the year $t$. So $P(2000)$ contains all patents issued up to the year 2000, and the z-scores calculated using this set are specific to this year. Thus, for a given year, the z-scores of newly added patents are determined from all the patents that precede them, and the z-scores of older patents continue to evolve based on subsequently issued inventions.
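The cumulative yearly scheme can be sketched in Python for a single code pair. This is a minimal illustration under the hypergeometric null model; the function name, the data layout (a dictionary from year to a list of code sets) and the toy patents are all assumptions.

```python
from math import sqrt

def cumulative_z_by_year(patents_by_year, code_a, code_b):
    """Return {year: z} for one code pair, using all patents issued up to each year."""
    total = n_a = n_b = o_ab = 0
    z_by_year = {}
    for year in sorted(patents_by_year):
        # Add this year's patents to the running (cumulative) counts.
        for codes in patents_by_year[year]:
            total += 1
            has_a, has_b = code_a in codes, code_b in codes
            n_a += has_a
            n_b += has_b
            o_ab += has_a and has_b
        if total < 2:
            continue  # variance is undefined for a single patent
        mu = n_a * n_b / total
        var = mu * (1 - n_a / total) * (total - n_b) / (total - 1)
        if var > 0:
            z_by_year[year] = (o_ab - mu) / sqrt(var)
    return z_by_year

# Toy data: A and B are never combined in 2000, then combined three times in 2001.
toy = {
    2000: [{"A", "X"}, {"B", "Y"}, {"A", "X"}, {"B", "Y"}],
    2001: [{"A", "B"}, {"A", "B"}, {"A", "B"}, {"X", "Y"}],
}
z = cumulative_z_by_year(toy, "A", "B")
```

In this toy example the pair (A, B) is strongly atypical at first and becomes less so once patents combining the two codes accumulate.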

A SCHEMATIC FOR THREE CASES: ATYPICAL, TYPICAL AND NEUTRAL
We provide a schematic to illustrate how the atypicality embedded in code combinations is captured and expressed by the z-score measure.
Suppose there are 40 inventions $P$ at time $t$, each indexed by its order of entry $i$. Each invention is expressed as a combination of codes: Figure 1 illustrates the collection of patents at $t$, $P(t)$, represented as a network structure where pairwise combinations are represented as weighted links (solid lines). We then consider three cases where a new patent $P(t+1)$ arrives with two codes, that is, (i) In case (i), the link a bridges the two most frequently used codes A and B, which have not yet been combined up to time $t$. We therefore find the appearance of link a atypical given the current statistics, and naturally expect a negative z-score. Indeed, the calculated $z_a$ exhibits a negative value, that is, and structure", has the shortest depth up to 2 at the end of which "Chain" and "Coil".
As shown in the above examples, the level of differentiation for two codes in a class can be qualitatively different according to the depth and classes. To create a consistent level of detail in the analysis, and gain broader and more intelligible insights, we coarse-grain over the codes to look at pairings at the highest and the second highest level [35].
We consider each patent as a combination of classes, the highest level. We also consider patents as combinations of subclasses, i.e. at one level below the class level, to gain insight into a slightly more fine-grained technology space, as well as for comparison with the class level results. At the code level a patent is never assigned the same code twice, and so no self-pairing is possible. This is not the case for classes and subclasses, as multiple codes from the same class are often assigned to the same patent.
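Rather than coarse-graining the data before the analysis, we first preserve its detailed structure at the code level, and only then coarse-grain the resultant statistics. For classes $I$ and $J$, the observed co-occurrence is

$$o_{IJ} = \left(1-\tfrac{1}{2}\delta_{IJ}\right)\sum_{\alpha\in I}\sum_{\beta\in J} o_{\alpha\beta},$$

where $\delta_{IJ}$ is the Kronecker delta and the bracketed term accounts for double counting when the classes considered are the same. Similarly for the expected co-occurrences,

$$\mu_{IJ} = \left(1-\tfrac{1}{2}\delta_{IJ}\right)\sum_{\alpha\in I}\sum_{\beta\in J} \mu_{\alpha\beta},$$

and the variance,

$$\sigma^2_{IJ} = \left(1-\tfrac{1}{2}\delta_{IJ}\right)\sum_{\alpha\in I}\sum_{\beta\in J} \sigma^2_{\alpha\beta},$$

from which the class pair z-score $z_{IJ} = (o_{IJ}-\mu_{IJ})/\sigma_{IJ}$ can be calculated.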
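One possible way to coarse-grain code-level statistics to the class level is sketched below in Python. It assumes that class-level observed counts, expectations and variances are obtained by summing the code-level values over all code pairs spanning the two classes; summing the variances treats code pairs as independent, which is an assumption of this sketch. Because every unordered code pair is stored once, same-class pairs are not double counted. The function name and the code labels are hypothetical.

```python
from collections import defaultdict
from math import sqrt

def coarse_grain(code_stats, class_of):
    """Aggregate code-pair statistics to class pairs and return class-level z-scores.

    code_stats maps an unordered code pair (stored once) to (o, mu, var).
    class_of maps each code to its class.
    """
    totals = defaultdict(lambda: [0.0, 0.0, 0.0])
    for (code_a, code_b), (o, mu, var) in code_stats.items():
        key = tuple(sorted((class_of[code_a], class_of[code_b])))
        totals[key][0] += o
        totals[key][1] += mu
        totals[key][2] += var
    return {key: (o - mu) / sqrt(var)
            for key, (o, mu, var) in totals.items() if var > 0}

# Made-up codes: two in class "c1", one in class "c2".
class_of = {"c1/a": "c1", "c1/b": "c1", "c2/a": "c2"}
code_stats = {
    ("c1/a", "c1/b"): (3, 1.0, 0.5),
    ("c1/a", "c2/a"): (0, 1.0, 0.5),
    ("c1/b", "c2/a"): (1, 1.0, 0.5),
}
class_z = coarse_grain(code_stats, class_of)
```

Here the within-class pair (c1, c1) comes out as typical, while the cross-class pair (c1, c2) is observed less often than expected and receives a negative z-score.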
In addition, without the above coarse-graining, the usage frequency of most codes is far smaller than the number of patents up to $t$, $|P(t)|$, because code usage follows a power-law distribution. As a result, the observed co-occurrence $o_{\alpha\beta}$ between two different codes $\alpha$ and $\beta$ can be much larger than the expected co-occurrence $\mu_{\alpha\beta}$ even when $o_{\alpha\beta} = 1$. Fine-grained codes therefore yield mostly positive z-scores, so the analysis would detect conventional pairings but not novel ones.
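To make this effect concrete, consider a hypothetical pair of rare codes under the hypergeometric null model, with $n_\alpha = n_\beta = 2$ among $|P| = 10^6$ patents and a single observed co-occurrence, $o_{\alpha\beta} = 1$. These numbers are illustrative and are not taken from the data.

```latex
\begin{align*}
\mu_{\alpha\beta} &= \frac{n_\alpha n_\beta}{|P|} = \frac{2 \times 2}{10^{6}} = 4 \times 10^{-6}, \\
\sigma^2_{\alpha\beta} &= \mu_{\alpha\beta}\left(1 - \frac{n_\alpha}{|P|}\right)\frac{|P| - n_\beta}{|P| - 1}
  \approx 4 \times 10^{-6}, \qquad \sigma_{\alpha\beta} \approx 2 \times 10^{-3}, \\
z_{\alpha\beta} &= \frac{o_{\alpha\beta} - \mu_{\alpha\beta}}{\sigma_{\alpha\beta}}
  \approx \frac{1 - 4 \times 10^{-6}}{2 \times 10^{-3}} \approx 500.
\end{align*}
```

A single joint appearance of two rarely used codes is thus scored as highly conventional, which is why the fine-grained level is dominated by positive z-scores.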

RESULTS
Technological constituents of a patent are translated into a set of pairwise z-scores that characterise its novelty or typicality. In this way, we are able to capture how inventors combine technological units by analysing the summary statistics.
In the following sections, we will look at the distribution of z-scores derived from the entire set of patents, and compare it with that of newly created combinations, in each year.
Then we will relate the observed compositional features of an invention at the time of its creation to its future impact. All analysis is carried out at both the class and subclass level (one level down from the class) to ensure that our findings and insights are persistent across different levels of detail.

DECOMPOSITION OF NEW COMBINATIONS
It has been shown that the rate at which inventors create new combinations is invariant, and that they do so more often than not [1]. This result alludes to a ceaseless introduction of new ways of combining technological units, and thereby a constant reshaping of technology space. By just considering the number of new combinations occurring, the dynamics of novelty creation looks temporally independent, which conforms with the possibility that new inventions occur at random.
On the contrary, a new combination is not simply concocted by randomly choosing technological units; however novel it may be, it is either built on the existing body of accumulated knowledge, or discovered through the expansion of the adjacent possible in the technology space [36-39]. The new combination may not be composed of an entirely novel membership [10]. It may contain a set of codes that have been combined so frequently that they can be considered an established unit, or building block [40]. Therefore, a binary classification, in which a combination is either absolutely novel if it was previously unseen or otherwise not novel, misses the subtleties of an invention's novelty because it lacks the complexity to capture them in any detail. For instance, it treats a combination that makes a small novel addition to a conventional subset as entirely unique.
We decompose combinations into a novelty profile in terms of pairwise z-scores, and assess the extent to which the multiple aspects of a combination are novel or conventional in more detail [8,9]. In this way, a new combination can both reinforce the current technological conventions, and introduce new ways of combining codes. As elaborated in the Method section, the z-scores are a means to compare the observed occurrences to the random counterpart.
Although there is no limit to how many codes may be assigned to a patent, Figure 2 shows that the number of codes in a patent hovers around three to four on average, with a tendency to increase over time, indicating parsimonious code usage. Every pair within a combination is then assigned a z-score (see the Methods section), and the composition of an invention can be captured by three statistics: its median $z_{\mathrm{med}}$, its minimum $z_{\mathrm{min}}$, and the difference between the two, $\Delta z \equiv z_{\mathrm{med}} - z_{\mathrm{min}}$ [9]. The $z_{\mathrm{med}}$ indicates the degree to which the main body of a patent conforms to technological conventionality, while $z_{\mathrm{min}}$ indicates the extent to which the invention contains an element that is novel when combined with the patent's other parts. The difference $\Delta z$ captures aspects of both in a single measure: whether the patent has a conventional core and a novel addition. Figure 3 (a) shows the z-scores of new class pairings: their average z-score remains near zero, that is, indistinguishable on average from random incidence, and then gradually becomes negative after 1980. This implies that new class pairings neither strengthen nor join any modular structures of the class network when they first appear, and that after 1980 new atypical class pairings increasingly join two different technological domains. In addition, the growing atypicality of new class pairs may dispute the claim that the 1880s were more innovative than the present [41].
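The three summary statistics of a patent's novelty profile can be sketched in Python as follows. The function name and the z-scores in the lookup table are made up for illustration; in practice the table would hold the pairwise z-scores computed for the relevant year.

```python
from itertools import combinations
from statistics import median

def novelty_profile(codes, pair_z):
    """Return (z_med, z_min, delta_z) over all code pairs of one patent.

    pair_z maps a sorted code pair to its z-score.
    """
    pairs = combinations(sorted(set(codes)), 2)
    zs = [pair_z[pair] for pair in pairs]
    z_med = median(zs)
    z_min = min(zs)
    return z_med, z_min, z_med - z_min

# Made-up pairwise z-scores for a patent with three codes.
pair_z = {("A", "B"): 12.0, ("A", "C"): -3.0, ("B", "C"): 5.0}
profile = novelty_profile(["A", "B", "C"], pair_z)
```

For this toy patent the profile is a median of 5.0, a minimum of -3.0 and a difference of 8.0: a mostly conventional body with one atypical pairing.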
When a new pair or combination is introduced, its z-score is normally negative or neutral, as shown in Fig. 1. Occasionally, however, the codes involved in the combination have rarely been used, so that the expected co-occurrence $\mu$ and the standard deviation $\sigma$ are relatively low; in that case even a single co-occurrence can yield a positive z-score.
We then capture the compositional features of newly created combinations through these summary statistics for each year. Figure 3 (c) and (e) show $z_{\mathrm{med}}$, $z_{\mathrm{min}}$, and $\Delta z$. These z-scores grow steadily over time, by a more or less constant margin. The new combinations mostly contain conventional pairings. Additionally, the gap between $z_{\mathrm{med}}$ and $z_{\mathrm{min}}$ increases (Fig. 3 (g) and (h)). In other words, new combinations become both more conventional and more unconventional.
We now look at how these compositional features of new pairs and new combinations reshape the landscape. We characterise this phenomenon by analysing the distribution of z-scores for all pairs accumulated up to each year. Figure 4 shows the cumulative distribution of z-scores for every year from 1836 to 2014, denoted by a colour scale (from blue to red). The broadening of this distribution over time indicates that the network is becoming more ingrained, with increasingly highly connected subsets of codes and hence higher z-scores, while pairs that span these conventional units are increasingly perceived as atypical. Thus, if two codes are used together more often than random expectation, it is probable that they will be used together again. This broadening is also shown explicitly in Fig. 5, where the standard deviation (grey shade) widens over time and the minimum z-score of each year's cohort becomes increasingly negative, especially in recent decades.

COMPOSITIONAL FEATURES PREDICTING FUTURE IMPACT
Predicting which new inventions will have a high impact is a widely sought goal, both for anticipating profitability and as a signifier of future societal changes caused by new technologies [12,31]. Beyond assessing new inventions, an understanding of the qualities related to invention impact enables inventors to optimise their inventive strategy to maximise such qualities. We show that a patent's success is predictable from its novelty profile.
As we discussed in the previous sections, an invention can be interpreted as a set of pairwise z-scores that quantify the statistical significance of code pairings in inventive activities. Building on previous research suggesting that the compositional feature is key to a patent having a high impact, we delve into the temporal dynamics of this relationship in our patent records [8,9].
We define high impact inventions as those patents in the top 5% [42], within each year, of citations gained within 5 years of publication [31,32]. We categorise the patents accordingly: (i) whether $z_{\mathrm{med}}$ of a patent belongs to the top quartile, middle half, or bottom quartile of $z_{\mathrm{med}}$ in its year; (ii) similarly, whether $z_{\mathrm{min}}$ is in the top quartile, middle half, or bottom quartile; and (iii) whether $\Delta z$ is in the top quartile, middle half, or bottom quartile. We abbreviate top quartile to high, middle half to mid, and bottom quartile to low.
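The yearly binning can be sketched in Python with the standard library. This is only an illustration: the function names are hypothetical, the toy cohort is made up, and the choice of quantile interpolation and the handling of ties at a cut point are assumptions not specified in the text.

```python
from statistics import quantiles

def quartile_label(value, cohort_values):
    """Label a value as 'high', 'mid' or 'low' relative to its yearly cohort."""
    q1, _, q3 = quantiles(cohort_values, n=4)  # three quartile cut points
    if value >= q3:
        return "high"
    if value <= q1:
        return "low"
    return "mid"

def high_impact_threshold(citations_5yr):
    """Citation count above which a patent is in the top 5% of its cohort."""
    return quantiles(citations_5yr, n=20)[-1]  # 95th percentile cut point

# Toy cohort: 100 patents with distinct statistics and citation counts.
cohort = list(range(1, 101))
threshold = high_impact_threshold(cohort)
n_high_impact = sum(c > threshold for c in cohort)
```

With real citation counts many patents share the same value, so the fraction above the threshold will generally not be exactly 5%.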
Panels (a) and (d) of Fig. 6 show that high $z_{\mathrm{med}}$ has a small but positive influence on future impact, and vice versa for low $z_{\mathrm{med}}$, indicating that inventions that are primarily based on established prior work do marginally better in the future. Meanwhile, Fig. 6 (b) and (e) show that a high $z_{\mathrm{min}}$, signifying more typical pairings, has a noticeable negative effect on a patent's future. Thus, if all the pairings of an invention are conventional, it is less likely to be influential. On the other hand, when core conventionality and a novel element are measured together, the results are both more consistent and more significant, as seen in Fig. 6 (c) and (f). High $\Delta z$ has a clear positive influence, whereas mid $\Delta z$ has no influence and low $\Delta z$ has a negative influence, showing that it is neither of the two aspects on their own that has the most influence but the combination of the two.
We can elaborate further using a different classification of the patent set: whether (i) $z_{\mathrm{med}}$ of a patent is above or below the quartile z-score within the entire period, and (ii) $z_{\mathrm{min}}$ of a patent is above or below the quartile z-score within the entire period. These classifications directly capture (i) whether the patent has a conventional core and (ii) whether it includes a novel aspect. We also redefine high impact inventions as those in the top 5% across all the patent records. These criteria split patents into four categories: (high $z_{\mathrm{min}}$, high $z_{\mathrm{med}}$), patents with a conventional core but without a novel addition, namely marginal improvements; (high $z_{\mathrm{min}}$, low $z_{\mathrm{med}}$), patents with neither a conventional core nor a relatively novel addition; (low $z_{\mathrm{min}}$, high $z_{\mathrm{med}}$), patents with the success signifier of a conventional core and a novel twist; and lastly (low $z_{\mathrm{min}}$, low $z_{\mathrm{med}}$), patents that are entirely novel, or oddball. Patents with conventional cores and a novel addition do notably better than the background rate, and do best of all four categories. Also, it is again shown that entirely novel inventions fare the worst. These analyses also suggest that conventionality does not conflict with novelty; rather, conventionality illuminated by novelty can increase an invention's influence. This finding, that there is an optimal balance between conventionality and novelty for influential inventions, indicates that influence is associated with knowledge transfer between technological domains [8-10, 43].
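The four-way split and the per-category share of high impact patents can be sketched in Python as below. The category labels, function names and cut-off values are hypothetical, and the toy records are invented solely to exercise the code; they are not results from the patent data.

```python
from collections import defaultdict

def novelty_category(z_med, z_min, med_cut, min_cut):
    """Assign one of four categories from the two cut-offs."""
    conventional_core = z_med > med_cut
    novel_addition = z_min < min_cut
    if conventional_core and novel_addition:
        return "conventional core, novel twist"
    if conventional_core:
        return "marginal improvement"
    if novel_addition:
        return "entirely novel"
    return "neither"

def hit_rates(patents, med_cut, min_cut):
    """Share of high impact patents per category.

    patents is an iterable of (z_med, z_min, is_high_impact) records.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for z_med, z_min, is_high_impact in patents:
        cat = novelty_category(z_med, z_min, med_cut, min_cut)
        totals[cat] += 1
        hits[cat] += bool(is_high_impact)
    return {cat: hits[cat] / totals[cat] for cat in totals}

# Invented records, two per category.
toy = [
    (10.0, -2.0, True), (10.0, -1.0, True),
    (10.0, 3.0, False), (10.0, 4.0, True),
    (1.0, -5.0, False), (1.0, -4.0, False),
    (1.0, 0.5, False), (1.0, 0.6, False),
]
rates = hit_rates(toy, med_cut=5.0, min_cut=0.0)
```

Comparing each category's rate with the overall share of high impact patents gives the contrast against the background rate described above.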

CONCLUSION
In this paper, we quantitatively study the distribution of the novelty of technology pairings, and the connection between the novelty profile of a combination and its future impact, using technology code pairings in U.S. patents spanning 179 years (1836-2014) [26]. We show that inventions assemble technological units in a way that reinforces already conventional pairs; some components thereby become increasingly entrenched within the inventive repertoire, with increasing z-scores, such that they become building blocks for future combinations. Yet combinations still occasionally bridge these code cliques, as shown by increasingly negative z-scores over time. This result implies that the technology space forms units of tightly co-occurring codes, with occasional inter-unit combinations that change that structure, and that inventors always require components which are familiar to them or available in the industry [37,38,44-46].
We also show how technological composition can affect the future impact of an invention, taking a patent's citation count as a measure of that impact [29]. Through analysis of citation relationships across U.S. patents, we show that statistically significant technology pairings are correlated with the future influence of an invention. In line with previous research, our findings demonstrate that conventional combinations, enlightened by proper novelty, are more likely to be influential in the future. This suggests that there is an optimal balance between conventionality and novelty for influential inventions, and that influence is associated with knowledge transfer between technological domains [8-10,43].
Yet much research remains to be done to quantify the statistical significance of code pairings more rigorously. The proposed null model for calculating the z-scores of code pairings is limited: it does not account for inventions having a single code, nor for the time at which a code is created. In addition, excluding citations to documents outside the data, or to academic papers, may miss the important role of scientific research in guiding inventors to search the technological space more efficiently, and hence in producing highly novel content [47-49]. Future studies need to address these limitations. Nonetheless, our study may provide valuable insights into how technology combinations give rise to boundary-spanning breakthroughs in technology as well as science, and how innovative technology combinations become influential.