Expecting to be HIP: Hawkes Intensity Processes for Social Media Popularity

Modeling and predicting the popularity of online content is a significant problem for the practice of information dissemination, advertising, and consumption. Recent work analyzing massive datasets advances our understanding of popularity, but one major gap remains: to precisely quantify the relationship between the popularity of an online item and the external promotions it receives. This work supplies the missing link between exogenous inputs from public social media platforms, such as Twitter, and endogenous responses within the content platform, such as YouTube. We develop a novel mathematical model, the Hawkes intensity process, which can explain the complex popularity history of each video according to its type of content, network of diffusion, and sensitivity to promotion. Our model supplies a prototypical description of videos, called an endo-exo map. This map explains popularity as the result of an extrinsic factor (the amount of promotion from the outside world that the video receives) acting upon two intrinsic factors: sensitivity to promotion and inherent virality. We use this model to forecast future popularity given promotions on a large 5-month feed of the most-tweeted videos, and find that it lowers the average error by 28.6% relative to approaches based on popularity history. Finally, we can identify videos that have a high potential to become viral, as well as those for which promotions will have hardly any effect.

Popularity for an online cultural item is the total amount of attention it receives. As an important quantity for measuring collective behavior, popularity is critical for understanding online dissemination for content producers and for managing information overload for content consumers. Understanding and predicting popularity has recently been a topic of interest for both researchers and practitioners, but a number of fundamental questions remain open, such as: What describes the most viral digital items? What roles do intrinsic virality and external promotion play in determining popularity? Can we promote an item to increase its popularity, and how much promotion is needed?
Two main lines of thought shape our current understanding of online popularity. The first is by Crane and Sornette [3], who view popularity evolution as power-law curves corresponding to either exogenous shock or endogenous relaxation. The second is that future popularity can be best predicted by past popularity [11] and features on the content, timing and network of an online cascade [1]. The points of departure of this work are two recent observations. We tracked the popularity history of hundreds of thousands of online videos over several years [12], and found that more than 80% do not follow the prototypical endo- or exo-dynamic, but have multiple rising or falling phases in different shapes and scales, such as the videos in Fig 2 and 4. We also observe that volumes of external promotions, such as the amount of sharing and tweeting about a video, are indicative of sudden popularity changes that are otherwise hard to explain, such as in Fig 2(c) and 4(a). More generally, both feature-driven predictions [1,8,10,11] and stochastic event models [13] face fundamental difficulties in explaining the complex evolution of popularity in the presence of external promotions.
In this work, we view popularity dynamics as the continuous interplay between endogenous and exogenous factors. This leads to a novel mathematical model that fills in the missing link between external promotions and internal responses. This model explicitly captures several key factors for each video (the sensitivity to external promotion, content virality, and temporal decay of network response), which are all estimated from data. Furthermore, we explain the popularity history using aggregated counts rather than individual viewing events, which are unobservable due to privacy and data volume concerns. We design a model specifically for viewcounts by computing the expectations over all stochastic event histories, and dub it the Hawkes Intensity Process. Figure 1 illustrates the conceptual schema of this model. On the left is the volume of exogenous excitations over time; in the middle is an endogenous self-exciting process that reacts to external input; the output on the right is the popularity series, modeled as the ongoing interaction between the exogenous and endogenous processes. The popularity series modeled through the Hawkes intensity process closely matches the observed viewcount series, even for videos with complex popularity lifecycles, such as the ones shown in Figure 2.
We present a novel prototype description of each video based on the Hawkes intensity model. This description consists of two quantities, the endogenous response and the exogenous sensitivity, readily computed from the history of each video. We use these two quantities to visualize collections of videos on a two-dimensional map, dubbed the endo-exo map. Applying the endo-exo map, one can identify videos that have high viral potential, i.e. those with high sensitivity to external promotions and high endogenous response, but not yet popular (Figure 4(a)). We can also identify videos for which promotion is unlikely to have an effect, such as those scoring very low in either the endo- or exo-dimension (Figure 2(b)). Another important use of the Hawkes intensity model is to quantify and forecast the effect of future promotions. We validate the forecasting on a collection of 13K+ YouTube videos that are the most actively tweeted over a six-month period, and find that forecasts made with the Hawkes intensity model lower the average error made by state-of-the-art history-driven methods by 2 percentile points in overall popularity.
Figure 1: Modeling popularity using a Hawkes intensity process. The inputs are exogenous sources of promotion or discussion s(t) (in red), which engender endogenous reactions from online social networks (middle box, in green). The result of this interaction is the observed popularity, or event intensity ξ(t), over time (in blue). The endogenous reactions are self-exciting point processes, whose event rate λ(t) is a sum of memory kernels triggered by each prior event. In this example, there are three events at times $t_1$, $t_2$, and $t_3$, with respective magnitudes $m_1$, $m_2$, and $m_3$ (bottom axis). Their contributions to the event rate are shown as gray slices.

The Model
We introduce a model for the evolution of online attention under external influence. First, we link the ongoing effect of external stimuli to the word-of-mouth spread of attention with a stochastic point process. Next, we propose the Hawkes intensity process, which is a video-dependent model for explaining the observed popularity history when the underlying point process is unobserved.

A point process for endogenous self-excitation
We model online attention as an exogenously driven self-exciting process: each viewing event is either triggered by a previous event or arrives as a result of external influence. We assume that viewing events of a YouTube video follow a Hawkes point process [6], a type of non-homogeneous Poisson point process. In particular, the event rate λ(t) is determined by two additive components. One component is proportional to the external influence s(t). Here s(t) represents the volume of external discussions or promotions over time; it is non-random and is often observed from history, or known from intended promotion plans. µ is a scaling parameter that can be estimated from data. The other component is the rate of being triggered by a previous event i, which occurred at time $t_i$ with magnitude $m_i > 0$, according to a time-decaying triggering kernel $\phi_{m_i}(t - t_i)$. Furthermore, each event $t_i < t$ adds to λ(t) independently. This leads to the point process described in Eq. 1:
$$\lambda(t) = \mu s(t) + \sum_{t_i < t} \phi_{m_i}(t - t_i) \tag{1}$$
Eq. 2 describes the triggering kernel $\phi_m(\tau)$:
$$\phi_m(\tau) = \kappa\, m^{\beta} (\tau + c)^{-(1+\theta)} \tag{2}$$
It is designed to capture several key quantities influencing popularity. Parameter κ is a scaling factor for video quality. m describes the relative influence of the user who generated the event (i.e. $m_i$ when multiple events are concerned in Eq. 1), and β accounts for the nonlinearity between user influence and the effects on popularity. The time interval $\tau = t - t_i$ is the elapsed time since the parent event at $t_i$; c > 0 is a cutoff term that keeps $\phi_m(\tau)$ bounded when τ is small; 1 + θ, with θ > 0, is the power-law exponent for the social system: the larger θ is, the sooner the reaction to an event will stop. The description above captures the interaction between additive external factors and the endogenous self-exciting effects with a decaying collective memory; such effects also accumulate over time. This model is an instance of a marked Hawkes process [6] with power-law kernels often found in geophysics [7] and finance [4]. An illustration of this self-exciting process is in Figure 1: the overall event rate λ(t) (middle) is shown as a sum of individual memory kernels triggered by events at times $t_1$, $t_2$, $t_3$ with magnitudes $m_1$, $m_2$, $m_3$, respectively. Moreover, through this self-exciting process, an input time series s(t) engenders the observed popularity series ξ(t), the aggregate effect of such self-exciting processes, described hereafter.
Figure 2 (caption, partial): The complex and multi-phased popularity history cannot be explained by current state-of-the-art models, such as [3]. (b) A sliced fitting graph of a music video (YouTube ID 0bR4L0Y94AQ), using the characteristic response $\hat{\xi}(t)$ and exogenous stimuli s(t) to explain observed popularity. Each alternating gray and white area is a slice of endogenous reaction generated by the external influence in a given day. The total event intensity (blue solid line) is a sum of temporally shifted and scaled versions of $\hat{\xi}(t)$, which tracks well the long-term trends in observed popularity (dashed line). (inset) Example of the characteristic response $\hat{\xi}(t)$ to one unit of external excitation. The area under this function, $A_{\hat{\xi}}$, quantifies the endogenous reaction of a video. It is the total number of views after each unit of exogenous excitation.
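As an illustration of Eq. 1 and Eq. 2, the following minimal Python sketch evaluates the event rate of the marked point process at a given time. The kernel form κ m^β (τ + c)^−(1+θ) is reconstructed from the parameter description above, and the function names and all numeric values are hypothetical, chosen only for illustration.

```python
def kernel(tau, m, kappa, beta, c, theta):
    """Power-law memory kernel: kappa * m**beta * (tau + c)**-(1 + theta)."""
    return kappa * (m ** beta) * (tau + c) ** (-(1.0 + theta))


def event_rate(t, events, s, mu, kappa, beta, c, theta):
    """Event rate at time t: exogenous part mu*s(t) plus one kernel per past event.

    events: list of (t_i, m_i) pairs; only events with t_i < t contribute.
    s: callable giving the exogenous volume at time t.
    """
    exogenous = mu * s(t)
    endogenous = sum(
        kernel(t - t_i, m_i, kappa, beta, c, theta)
        for t_i, m_i in events
        if t_i < t
    )
    return exogenous + endogenous


# Three events with different magnitudes, in the spirit of Figure 1.
# All numbers below are made up for illustration.
events = [(1.0, 10.0), (2.5, 50.0), (4.0, 5.0)]
rate_at_5 = event_rate(5.0, events, s=lambda t: 2.0, mu=0.5,
                       kappa=0.1, beta=0.5, c=1.0, theta=1.0)
print(rate_at_5)
```

Each past event contributes independently, so the rate before the first event reduces to the exogenous term alone.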

The Hawkes intensity process
The point-process description of attention is well-founded; however, it has limited applicability since the assumptions about the underlying data are often unrealistic. In particular, what we often observe is the volume of total attention over time, rather than the times and properties of individual actions, due to constraints in user privacy and data volume. We introduce the Hawkes intensity ξ(t), the expectation of the event rate λ(t) over the event history $H_t$, which consists of the set of (random) event times and magnitudes up to time t. Such an expected event rate allows us to explain observed event volume over time, without needing to observe individual event times and magnitudes. We found that the intensity ξ(t) is related analytically to the key quantities of the point process (see SI Sec. 1, Details of the Hawkes Intensity Model), and it evolves as follows:
$$\xi(t) = \mu s(t) + C \int_0^t \xi(t - \tau)\,(\tau + c)^{-(1+\theta)}\, d\tau \tag{3}$$
Here, the event intensity ξ(t) follows a self-consistent integral equation. That is to say, the event intensity at time t is determined by the external stimulus s(t), as well as the event intensity at a previous time ξ(t − τ) scaled by its corresponding decay kernel $(\tau + c)^{-(1+\theta)}$ for all temporal offsets τ < t. Note that µ, θ and c were defined in Eqs. 1 and 2, and that κ and β are combined into a positive constant $C = \frac{\kappa(\alpha - 1)}{\alpha - \beta - 1}$. Here α > 0 is the power-law exponent of the user influence distribution. Note that the two power-law exponents are distinct in meaning and function: θ defines memory decay over time, while α is determined by the user distribution at large. α is estimated from a large Twitter sample using standard fitting procedures (SI Sec. 1.1, The event rate of a marked Hawkes process; [2]). θ and other video-dependent parameters are estimated from popularity history.
The Hawkes intensity model has two novel features for describing popularity. First, it captures the ongoing interactions of exogenous and endogenous effects; this leads to higher descriptive power for complex popularity series with multiple rises and falls (as shown in Fig. 2). State-of-the-art methods lack the capacity to describe such evolutions: Helmstetter and Sornette [7] fit the observed event rate after an initial shock, and Crane and Sornette [3] produce a curve fit on the long-term approximation of the endogenous decay with no exogenous input. Second, this is a generative model for the event intensity series rather than for each event, in contrast to existing settings used in statistics [9] or social networks [13]. As is the case with the YouTube data, being able to model popularity series without observing individual viewing events is important for reasons of both data availability (aggregated volume carries much less risk of leaking user information) and computation (it is much more efficient).
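The self-consistent equation for ξ(t) can be evaluated numerically once time is discretized. The sketch below is one possible discretization, assuming daily time steps and a sum over offsets τ = 1..t in place of the integral; the function name `hip_intensity` and the example series are hypothetical.

```python
def hip_intensity(s, mu, C, c, theta):
    """Discretized Hawkes intensity.

    xi[t] = mu * s[t] + C * sum_{tau=1..t} xi[t - tau] * (tau + c)**-(1 + theta)

    s: list of exogenous volumes per time step (e.g. daily shares).
    Assumes c > 0 so the kernel stays bounded.
    """
    n = len(s)
    decay = [(tau + c) ** (-(1.0 + theta)) for tau in range(n)]
    xi = []
    for t in range(n):
        # Endogenous part: earlier intensities weighted by the decaying memory kernel.
        endo = sum(xi[t - tau] * decay[tau] for tau in range(1, t + 1))
        xi.append(mu * s[t] + C * endo)
    return xi


# Made-up promotion series with two bursts.
shares = [0, 5, 1, 0, 0, 8, 2, 0, 0, 0]
xi = hip_intensity(shares, mu=20.0, C=0.5, c=1.0, theta=1.5)
print([round(x, 2) for x in xi])
```

With C = 0 the endogenous term vanishes and the intensity is just the scaled input µ s(t); a larger θ makes the memory fade faster, so the output tracks the input more closely.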

Explaining popularity histories
We estimate model parameters {µ, θ, C, c} from a history of the number of tweets/shares and views for each video, using non-linear optimization to obtain a least-squares fit (see SI Sec. 2, Fitting the Hawkes intensity model). Three example fits are shown in Figure 2. Visibly, the event intensity model in Eq 3 links the exogenous and endogenous effects of the social system, resulting in a tight fit between the model and the observed popularity history. For video bUORBT9iFKc the memory kernel decays fast (θ = 5.37), and the resulting intensity series tracks the temporal dynamics of the stimuli closely. For video WKJoBeeSWhc, the memory kernel decays slowly (θ = 0.41), hence the delayed accumulation of exogenous promotion effects via the memory kernel results in an overall rising trend.
We can see that only by capturing the non-obvious joint effects from within and outside a social network can a model produce both fine-grained short-term dynamics and accurate long-term trends.
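To make the least-squares fitting concrete, here is a small sketch that recovers parameters from a synthetic series. It is not the fitting procedure of the SI: it swaps the non-linear optimizer for a coarse grid search over (C, c, θ), and uses the fact that the discretized intensity is linear in µ, so the best µ for each grid point has a closed form. The discretization, the function names, and the data are all assumptions.

```python
def hip_intensity(s, mu, C, c, theta):
    """Discretized intensity: xi[t] = mu*s[t] + C * sum_tau xi[t-tau] * (tau+c)^-(1+theta)."""
    n = len(s)
    decay = [(tau + c) ** (-(1.0 + theta)) for tau in range(n)]
    xi = []
    for t in range(n):
        endo = sum(xi[t - tau] * decay[tau] for tau in range(1, t + 1))
        xi.append(mu * s[t] + C * endo)
    return xi


def sse(pred, obs):
    """Sum of squared errors between predicted and observed views."""
    return sum((p - o) ** 2 for p, o in zip(pred, obs))


def fit_hip(s, views, C_grid, c_grid, theta_grid):
    """Grid search over (C, c, theta); mu is solved in closed form per grid point."""
    best = None
    for C in C_grid:
        for c in c_grid:
            for theta in theta_grid:
                unit = hip_intensity(s, 1.0, C, c, theta)  # response with mu = 1
                denom = sum(u * u for u in unit)
                if denom == 0.0:
                    continue
                # xi scales linearly with mu, so least squares gives mu directly.
                mu = max(0.0, sum(u * v for u, v in zip(unit, views)) / denom)
                loss = sse([mu * u for u in unit], views)
                if best is None or loss < best[0]:
                    best = (loss, {"mu": mu, "C": C, "c": c, "theta": theta})
    return best[1], best[0]


# Synthetic check: generate views from known parameters, then recover them.
s = [5, 0, 3, 0, 0, 8, 1, 0, 0, 2]
views = hip_intensity(s, mu=20.0, C=0.5, c=1.0, theta=1.5)
params, loss = fit_hip(s, views, C_grid=[0.1, 0.5, 1.0],
                       c_grid=[1.0], theta_grid=[0.5, 1.5, 3.0])
print(params, loss)
```

A real fit would replace the grid with a continuous optimizer and would have to deal with noise and local minima, which this sketch ignores.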

Measuring inherent video virality
The Hawkes intensity model describes the inherent properties of a video, in particular its potential to become viral. In this section, we first introduce two key metrics, the endogenous response and the exogenous sensitivity, to quantify different aspects of the potential virality of a video. We then visualize videos on the resulting two-dimensional endo-exo map. This allows us to explain how different categories of video become popular, and to compare videos' popularity and potential with respect to each other. One may think that parameters such as the scaling factor C and the memory exponent θ are good candidates for describing virality. We find, however, that although they are related, the non-linear interactions among the parameters make explaining popularity less direct than with the endogenous response described below. Statistics about important model parameters are included in SI Sec. 4, Understanding popularity dynamics.

Quantifying endogenous response
As shown in the Hawkes intensity model in Eq 3, the total attention a video receives consists of two parts: input from exogenous stimuli, and endogenous response corresponding to non-linear effects accumulated through the integral equation. The first part, sensitivity to external stimuli, is readily quantifiable with the scaling parameter µ. We now examine a key property of the model, and derive a quantity representing the magnitude of the endogenous response.
We observe that the Hawkes intensity model in Eq 3 is a linear time-invariant (LTI) system on the positive t-axis. That is to say, if ξ(t) is the event intensity function for input s(t), then (for the same video) the event intensity function for a shifted and scaled version of the input, $a\,s(t - t_0)$, is $a\,\xi(t - t_0)$ for $a > 0$, $t_0 \geq 0$, i.e., scaled and shifted by the same amount. We show this to be true in SI Sec. 1.4, The Hawkes Intensity Process as an LTI system. Consequently, the endogenous response of a particular Hawkes intensity process is completely characterized by the number of events spawned from one unit of input. We define $\hat{\xi}(t)$, the characteristic response of a video to a unit impulse of exogenous excitation; an example is shown in Fig 2(b).
$$\hat{\xi}(t) = \delta(t) + C \int_0^t \hat{\xi}(t - \tau)\,(\tau + c)^{-(1+\theta)}\, d\tau$$
Here δ(t) is the Dirac delta function, with δ(0) = 1, δ(t > 0) = 0.
Fig 2(c) illustrates such an LTI system using a sliced and stacked graph. At each discrete time point t′, each white or gray slice corresponds to a version of the characteristic response shifted by t′ and scaled by the external stimulus s(t′). Adding all these responses together recovers the overall intensity ξ(t) as in Eq 3 (blue line), which closely tracks the long-term dynamics of the observed popularity (dashed line). For each video, $\hat{\xi}(t)$ describes its intrinsic response over time. We define the integral of this function over time, $A_{\hat{\xi}} = \int_0^\infty \hat{\xi}(t)\,dt$, as the total attention generated endogenously by a single unit of exogenous excitation. $A_{\hat{\xi}}$, referred to as the endogenous response, is related to the branching factor (i.e. the number of direct descendant events) of the underlying Hawkes process. $A_{\hat{\xi}}$ is finite when the branching factor satisfies $\frac{C}{\theta c^{\theta}} < 1$, as shown in SI Sec. 1.3, Branching factor n and endogenous response $A_{\hat{\xi}}$.
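The LTI view can be sketched in a few lines of Python: compute the characteristic response to a unit impulse, sum it to approximate the endogenous response, and rebuild the intensity by stacking shifted and scaled copies. The discretization (daily steps, offsets τ = 1..t) and all function names are assumptions. The continuous-time branching factor follows from integrating the kernel, n = C ∫ (τ + c)^−(1+θ) dτ = C / (θ c^θ), and integrating the definition of the characteristic response then gives an endogenous response of 1 / (1 − n) when n < 1; the discrete sum below only approximates that value.

```python
def characteristic_response(C, c, theta, horizon):
    """Response to a unit impulse at t = 0 with mu = 1 (discrete-time sketch)."""
    decay = [(tau + c) ** (-(1.0 + theta)) for tau in range(horizon)]
    xi_hat = [1.0]  # the impulse itself
    for t in range(1, horizon):
        xi_hat.append(C * sum(xi_hat[t - tau] * decay[tau] for tau in range(1, t + 1)))
    return xi_hat


def branching_factor(C, c, theta):
    """Continuous-time branching factor n = C / (theta * c**theta)."""
    return C / (theta * c ** theta)


def superpose(s, mu, xi_hat):
    """Rebuild xi(t) by stacking copies of xi_hat shifted by t' and scaled by mu * s[t']."""
    n = len(s)
    return [mu * sum(s[tp] * xi_hat[t - tp] for tp in range(t + 1)) for t in range(n)]


# Made-up parameters and promotion series.
C, c, theta = 0.5, 1.0, 1.5
xi_hat = characteristic_response(C, c, theta, horizon=120)
A_xi = sum(xi_hat)                      # finite-horizon endogenous response
n = branching_factor(C, c, theta)
print(A_xi, n, 1.0 / (1.0 - n))

shares = [0, 5, 1, 0, 0, 8, 2, 0, 0, 0]
xi = superpose(shares, mu=20.0, xi_hat=xi_hat)
print([round(x, 2) for x in xi])
```

Because the system is linear, the stacked slices and the direct recursion of Eq 3 give the same series for the same discretization.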

Visualizing inherent virality with the endo-exo map
Intuitively, a video with a large endogenous response $A_{\hat{\xi}}$ and a high exogenous sensitivity µ has high potential to become viral. Specifically, each unit of exogenous excitation will generate $\mu A_{\hat{\xi}}$ events through the Hawkes intensity process. We construct a two-dimensional map to visualize $A_{\hat{\xi}}$ and µ for videos, called the endo-exo map in the rest of this paper. On this map, videos in close proximity have similar potential to become popular, and the differences in their popularity would be due solely to the difference in exogenous attention. Fig 3(a) illustrates this phenomenon using four videos on this endo-exo map. Videos $v_1$ and $v_2$ are very similar in both $A_{\hat{\xi}}$ and µ; the fact that $v_1$ has 4.61x more views is explained by it receiving 3.22x more exogenous promotions. On the same map, $v_4$ received a similar amount of promotion as $v_1$, and their difference in popularity is explained by $v_4$ being less endogenously responsive (smaller $A_{\hat{\xi}}$) than $v_1$. Moreover, $v_3$ has a similar endogenous response and sees similar amounts of promotion as $v_1$; their difference in popularity is explained by $v_3$ being less exogenously sensitive, with a lower µ.

What describes the most popular videos
One may wonder whether higher popularity can be attributed to higher exogenous sensitivity, higher endogenous response, or a combination of both. We examine a collection containing diverse video categories and find that the explanation varies. We generate the endo-exo map on a collection of 13,000+ YouTube videos, called the Active set, filtered from all 81M tweeted videos over the second half of 2014 to keep those with a non-trivial amount of views and shares (see Materials and Methods). Fig. 3 (c) and (d) compare the density plot on the endo-exo map for videos in Gaming and Film & Animation to that of the top 5% in popularity from these two categories, respectively. We can see that while the most popular videos in Film & Animation are described by higher exogenous sensitivity (shifting upwards), the most popular Gaming videos have higher endogenous response: their density mass is shifted to the right of the endo-exo map. Other categories such as Comedy or News & Politics (shown in the SI) present two dense regions, one for higher $A_{\hat{\xi}}$ and one for higher µ.

Identifying un-promotable videos
The endo-exo map can be used to readily identify an interesting class of videos: ones that are very difficult to promote. The quantity $\mu A_{\hat{\xi}}$ describes the number of views that one unit of external promotion will generate under the joint influence of endo- and exo-factors, so a very small $\mu A_{\hat{\xi}}$ is a hallmark of a video being un-promotable, e.g., $\mu A_{\hat{\xi}} < 10^{-3}$. Fig. 3(b) contains a zoomed-out view of the endo-exo map associated with the category People & Blogs. We found 63 videos (∼3.2%) in this category to be un-promotable. Overall, 549 (∼3.9%) videos in the Active set are deemed un-promotable. The thumbnail of one example video, a teenager video blog, is shown with the figure. It has $\mu = 2.88 \times 10^{-15}$ and $A_{\hat{\xi}} = 1$, so each online promotion is expected to generate essentially 0 views. In contrast, for video $v_1$ in Fig. 3(a), each promotion is expected to generate 598 views.
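The un-promotable criterion amounts to a threshold on the product of the two map coordinates. A minimal sketch, with hypothetical function names and the example threshold of 1e-3 from the text:

```python
def views_per_promotion(mu, a_xi):
    """Expected views generated by one unit of promotion: mu times the endogenous response."""
    return mu * a_xi


def is_unpromotable(mu, a_xi, threshold=1e-3):
    """Flag videos whose expected views per promotion fall below the threshold."""
    return views_per_promotion(mu, a_xi) < threshold


# The teenager video blog from the text: mu = 2.88e-15, endogenous response = 1.
print(is_unpromotable(2.88e-15, 1.0))
# A hypothetical responsive video (values made up for illustration).
print(is_unpromotable(100.0, 5.0))
```

The threshold is a free choice; the text gives 1e-3 only as an example.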

Forecasting popularity growth
The Hawkes intensity process via the endo-exo map describes a video's popularity dynamics in the presence of external promotions; this section explores the predictive power of such a model. We first illustrate the setting for popularity forecasts using two examples, and then present a quantitative evaluation.

Identifying potentially viral videos
We first use the Hawkes intensity model to identify videos that are not already popular but have a high potential to become so. Video 1PuvXpv0yDM in Fig. 4(a) is such an example. It receives a total of 15,687 views during the first 90 days. Model estimates on this period deem it to have a high endogenous response ($A_{\hat{\xi}} = 6.94 \times 10^{72}$) and a high exogenous sensitivity (µ = 119.02). During the 30 days that followed, this video received 229 shares and drastically improved its ranking in the popularity percentile from 5.85% to 94.9%, reaching a total of 2.42 million views after 120 days.

Evaluating forecast with historical data
We design a protocol to quantitatively evaluate the predictive power of the Hawkes intensity model. We use historical data held out over time, thus avoiding the practical difficulty of generating realistic promotions and responses in a large-scale social network. Fig. 4(b) illustrates this setting with an example music video. A vertical line divides the observation period, day 1 to 90, and the evaluation period, day 91 to 120. The viewcount and sharing history in the observation period is used to estimate the model and explain observed popularity (in blue). Then the model takes as input the exogenous promotion s(t) and estimates the number of views during the evaluation period (in purple). For this example, the forecast and the actual views are fairly similar. Comparing these quantities for a collection of videos quantifies the goodness of forecasting. Such popularity forecasting has broader applications than evaluated here. For example, it can be used both to estimate the effect of intended interventions and to compare different schedules (amounts over time) of exogenous interventions for different videos; it is not limited to observed external promotions.
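The forecasting setting can be sketched as follows: with parameters already estimated on the observation period, run the model over the observed promotions followed by the planned (or held-out) promotions, and sum the intensity over the evaluation period. This is a simplified illustration, assuming a daily discretization and made-up parameters; `forecast_views` is a hypothetical helper, not the evaluation code behind the reported results.

```python
def hip_intensity(s, mu, C, c, theta):
    """Discretized intensity: xi[t] = mu*s[t] + C * sum_tau xi[t-tau] * (tau+c)^-(1+theta)."""
    n = len(s)
    decay = [(tau + c) ** (-(1.0 + theta)) for tau in range(n)]
    xi = []
    for t in range(n):
        endo = sum(xi[t - tau] * decay[tau] for tau in range(1, t + 1))
        xi.append(mu * s[t] + C * endo)
    return xi


def forecast_views(s_observed, s_planned, mu, C, c, theta):
    """Expected total views over the evaluation period under the planned promotions."""
    s_full = list(s_observed) + list(s_planned)
    xi = hip_intensity(s_full, mu, C, c, theta)
    return sum(xi[len(s_observed):])


# Made-up promotion history and two schedules with the same total promotion.
history = [4, 0, 2, 1, 0, 3]
early = [6, 0, 0, 0]
late = [0, 0, 0, 6]
params = dict(mu=10.0, C=0.4, c=1.0, theta=1.2)
views_early = forecast_views(history, early, **params)
views_late = forecast_views(history, late, **params)
print(views_early, views_late)
```

Comparing schedules this way shows one use mentioned above: for a fixed evaluation window, promoting early leaves more time for the endogenous response to accumulate inside the window.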

Evaluation results
We use a data set of 13,738 videos that are the most tweeted and shared over a six-month period. We measure the quality of a forecast using the absolute error in popularity percentile (see Materials and Methods). Fig. 4(b) summarizes forecasting performance for the Hawkes intensity process and baselines for comparison. The state-of-the-art approach for popularity prediction uses multivariate linear regression (MLR) that takes either the historic viewcount [10] or both the viewcount and the promotions as input. MLR models are trained on a large collection of videos, and we obtain predictions for each video with cross-validation. We can see that the forecasts made using the Hawkes intensity model have lower average error compared to linear regression with or without exogenous stimuli (#shares, #tweets), and the differences are statistically significant (paired t-test, p < 0.001), with a significant effect size, as shown in SI Sec. 5.2, Statistical significance. Among the Hawkes intensity model variants, we found that using the number of shares generates slightly better forecasts than the number of tweets, but the differences are not statistically significant at p = 0.001. We also observe that the performance gap doubles when forecasting popularity on more difficult videos; see SI Sec. 5, Popularity forecasting and comparison to baseline.

Summary and discussion
This research establishes a novel mathematical model to systematically link the endogenous response to the exogenous stimuli of a social system. The model developed here provides a nuanced view of the continued interactions of endogenous and exogenous effects that generate complex and multi-phased popularity dynamics over time. Moreover, we quantitatively describe a video's inherent potential for becoming viral in terms of the endogenous and exogenous factors, visualized on the endo-exo map. We forecast future popularity under given promotion, obtaining results superior to a state-of-the-art data-driven prediction approach.
The application context in this work is large collections of YouTube videos, with their popularity and promotion history. We quantify the endogenous virality and exogenous sensitivity for each video, giving detailed insights on the community dynamics around different types of YouTube videos, and compare different content groups on the endo-exo map. Such detailed analysis is possible because the collective attention and promotion data are available from YouTube or inferred from public sources such as Twitter. We envision that the same attention dynamics would hold for other content types, such as newspaper page views, podcasts, or blogs.
There are a number of simplifying assumptions and limitations of the proposed model, which can become fruitful directions for further investigation. First, the Hawkes intensity process captures popularity dynamics that are reflected only in the observed external promotion series, and does not capture other behavioral factors such as (daily or weekly) seasonality. In addition, this model focuses on the expected influence over all users rather than individual influence. Both of these observations suggest extensions that could incorporate seasonality components, or take into account individual influences or content freshness. Second, the observations in this paper do not directly address causality. We conducted statistical tests using the well-known Granger Causality [5] on the shares and view series (see SI Sec. 4.3, Causal connexion between the views, tweets and shares series), which do not show consistent results that either shares influence views or vice versa. Lastly, media items are influenced by a variety of sources in the open online world, and many sources of external promotion are unobserved or are difficult to obtain data from. A well-known example is that gaming videos are discussed intensively in topic-specific forums. Tracking and estimating diverse or even unknown sources of exogenous influence is another open question for this topic area.

The tweeted videos dataset
We collect a dataset of tweeted videos between 2014-05-29 and 2014-12-26 using the Twitter API, which yields a large and diverse sample of over 81 million videos with both daily viewcount and external promotion via tweets and shares available. After restricting to videos that are still online, that have their popularity and sharing history available, and that received at least 100 tweets and 100 shares, and after removing 6 rare categories each containing fewer than 1% of the videos, we are left with a subset of 13,738 videos across 14 categories. A profile of the dataset and details about its construction are given in the SI.

Popularity percentile and measuring forecasting performance
We construct a popularity percentile scale of the 13,738 videos at 120 days of age (see SI Sec. 3.2, The popularity scale over time, for details). We map the predicted total viewcount to this percentile scale, and compute the error in the predicted percentile. The percentile-error metric produces meaningful measures across a spectrum from the most popular to less popular videos. Compared to error metrics on the absolute or relative difference in views, this metric focuses on error in the ranking with respect to a large collection of videos. It avoids the error being skewed by the long-tailed distribution of popularity [1], and ranking is practically important in recommender systems or portfolio management.
[1] Cheng, J., Adamic, L., Dow, P.A., Kleinberg, J.M., Leskovec
Figure 3 (caption, partial): The y-axis represents sensitivity to exogenous stimuli. The radius of each circle is proportional to the popularity percentile $P_T$ of each video after T = 120 days, with values between 0.0 (least popular) and 1.0 (most popular). The color represents the amount (percentile) of total promotions received, denoted as $S_T$. $v_1$ and $v_2$ present similar endogenous reaction and exogenous sensitivity, being at the same position on the endo-exo map. The difference in their popularity (size) is explained by the fact that $v_1$ received 3.22 times more promotions than $v_2$. Both $v_3$ and $v_4$ receive similar amounts of promotion (color) as $v_1$, but they achieve lower popularity (smaller size) due to their less privileged position on the endo-exo map: $v_3$ is less sensitive to external stimuli than $v_1$ and $v_2$, while $v_4$ has a smaller endogenous reaction than $v_1$ and $v_2$. Information about the four example videos is as follows, with their popularity percentile $P_T$ and promotion percentile $S_T$ (absolute values in parentheses): $v_1$ is a short Gaming video, YouTube ID 0lTTWeavl1c, $P_T$ = 85% (634,370 views), $S_T$ = 65% (351 shares); $v_2$ is a collection of "ALS ice bucket challenge" videos, YouTube ID 3hSIh-tbiKE, $P_T$ = 40% (137,481), $S_T$ = 10% (109); $v_3$ is a funny science video explaining types of infinity in math, YouTube ID 23I5GS4JiDg, $P_T$ = 60% (193,052), $S_T$ = 65% (356); $v_4$ is from a Portuguese youtuber, YouTube ID 0ndmJzEIcgU, $P_T$ = 40% (93,959), $S_T$ = 60% (311).
Figure 4 (caption, partial): Comparison of average forecasting errors on the Active set. y-axis: forecasting errors, calculated as the absolute difference between the popularity percentile at day 120 and that forecast by each approach. x-axis, left to right: Hawkes intensity model, using either #shares or #tweets as s(t); multi-linear regression, using only popularity history, or #shares and #tweets, respectively.
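The percentile-error metric described in this section can be sketched as below. The exact construction of the percentile scale is given in the SI and is not reproduced here; this sketch assumes the percentile of a viewcount is the fraction of reference videos with at most that many views, and the function names and reference values are hypothetical.

```python
from bisect import bisect_right


def popularity_percentile(views, reference_sorted):
    """Fraction of reference videos with at most `views` views (0.0 to 1.0)."""
    return bisect_right(reference_sorted, views) / len(reference_sorted)


def percentile_error(predicted_views, actual_views, reference_views):
    """Absolute difference between predicted and actual popularity percentiles."""
    ref = sorted(reference_views)
    return abs(popularity_percentile(predicted_views, ref)
               - popularity_percentile(actual_views, ref))


# Made-up reference collection of total viewcounts at 120 days.
reference = [10, 100, 1000, 10000]
print(percentile_error(500, 5000, reference))
```

Because the error is measured on ranks, a miss of several thousand views matters only to the extent that it moves the video past other videos in the collection.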

Details of the Hawkes Intensity Model
Given time t ∈ [0, ∞), we denote by λ(t) the event rate of an online resource at time t. The goal of this section is to derive the expected event rate, denoted as ξ(t), as the average response rate from a large network.
There are two sources of events in the social system: exogenous events originating outside the system, and endogenous events spawned from within the system as the response to previous events (which are either exogenous or endogenous). For example, a public speech held by a famous politician can be an exogenous source for the number of views of relevant YouTube videos on politics; the views on trailers prior to the release of new movies, on the other hand, exhibit a rich-get-richer effect in attention distribution that is characteristic of endogenous word-of-mouth diffusion.

1.1 The event rate of a marked Hawkes process. The Hawkes process [11], as defined in main text Eq. 1, is a nonhomogeneous Poisson process with self-excitation. Its event rate λ(t), or instantaneous conditional intensity r(t|Ht), is:

$$\lambda(t) = r(t \mid H_t) = \mu s(t) + \sum_{t_i < t} \phi_{m_i}(t - t_i) \qquad [1]$$

Here t > 0 denotes time; Ht = {(mi, ti); ti < t} is the event history before time t; s(t) is the rate of exogenous events; µ is a scaling factor for the exogenous stimulus; and φm(τ) is a time-decaying trigger kernel that is determined by the magnitude mi of each event. Our Hawkes intensity model is a type of marked Hawkes process, as also used in [12,22]. In marked Hawkes processes, each event i has an occurrence time ti and a magnitude mi (or mark), which captures the relative amplification of the likelihood of an event to spawn future events. Intuitively, the mark can represent the magnitude of an earthquake when modeling the occurrence of aftershocks, or the number of people an event can subsequently influence when modeling social networks, which can be approximated by, say, the number of followers or friends. The triggering kernel φm(τ), as defined in main text Eq. 2, can be written as the product of two separable terms, b(m) modeling the influence of event marks and φ(τ) modeling the temporal decay:

$$\phi_m(\tau) = b(m)\,\phi(\tau) = \kappa\, m^{\beta}\, (\tau + c)^{-(1+\theta)} \qquad [2]$$

In this work, we focus on modeling the total attention, commonly known as popularity, which closely connects to the average event rate across the whole network. We extract the number of followers m for a large sample of users from our dataset described in Sec. 3 and fit a power-law distribution p(m) = (α − 1)m^(−α) following the method in [5]. We obtain α = 2.016 and we use it throughout the experiments. As noted in the main text, the two power-law exponents are distinct in meaning and function: θ defines memory decay over time, while α is determined by the user distribution at large. α is estimated from a large Twitter sample; θ and other video-dependent parameters are estimated from popularity history as detailed in Sec. 2 below.
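As an illustration of how the event rate of a marked Hawkes process is evaluated from an event history, the following minimal Python sketch sums the kernel contributions of past events. It assumes the power-law kernel of the main text, φm(τ) = κ m^β (τ + c)^(−(1+θ)); the function names and all default parameter values are hypothetical and chosen only for the example.

```python
def trigger_kernel(m, tau, kappa=1.0, beta=0.5, c=1.0, theta=3.35):
    """phi_m(tau) = kappa * m**beta * (tau + c)**(-(1 + theta)).

    Assumed power-law form; the default values are illustrative only.
    """
    return kappa * m ** beta * (tau + c) ** (-(1.0 + theta))


def event_rate(t, history, s, mu=1.0, **kernel_params):
    """lambda(t) = mu * s(t) + sum of phi_{m_i}(t - t_i) over events with t_i < t.

    history: iterable of (m_i, t_i) pairs, mirroring H_t = {(m_i, t_i); t_i < t}.
    s: callable returning the exogenous rate s(t).
    """
    endogenous = sum(
        trigger_kernel(m_i, t - t_i, **kernel_params)
        for m_i, t_i in history
        if t_i < t  # causality: only past events contribute
    )
    return mu * s(t) + endogenous


# Two past events and one future event; no exogenous input at t = 2.
history = [(100.0, 0.0), (10.0, 1.5), (50.0, 3.0)]
rate = event_rate(2.0, history, s=lambda t: 0.0)
```

The event at t = 3.0 is ignored when evaluating the rate at t = 2.0, which reflects the causal structure of the process.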
1.2 Deriving the expected event rate ξ(t) for unobserved point processes. The goal of this section is to derive the expected event rate ξ(t) over time as specified in the main text Eq 3. This is done in three steps: we first include a preliminary description of the event rate λ(t) in terms of the underlying counting process over infinitesimal intervals, we then derive the expected event rate for unmarked Hawkes processes, and finally we build upon these to derive the expected event rate for marked Hawkes processes.

Preliminaries: Event rate and the counting process
It is well known in the stochastic process literature [17] that the event rate λ(t), or the conditional intensity specification r(t|Ht), of a point process is completely characterized by the corresponding counting process N(t). Here N(t) is the total number of events observed between time 0 and t. Given an infinitesimal interval δ at time t, the relationship between N(t) and r(t|Ht) is described as:

$$P\{N(t+\delta) - N(t) = 1 \mid H_t\} = r(t \mid H_t)\,\delta + o(\delta), \qquad P\{N(t+\delta) - N(t) > 1 \mid H_t\} = o(\delta) \qquad [3]$$

Here P denotes the probability of a discrete random variable. The intuition of the expression above is that r(t|Ht) is proportional to the probability that N(t) increments by 1, and that it is "very unlikely" for N(t) to increment by more than one.
Let dNt be the counting increment N(t + δ) − N(t) as δ ↓ 0. From Eq. 3, we can describe dNt as a Bernoulli random variable, with:

$$P(dN_t = 1 \mid H_t) = r(t \mid H_t)\,\delta, \qquad P(dN_t = 0 \mid H_t) = 1 - r(t \mid H_t)\,\delta$$

It follows from the above that

$$E[dN_t \mid H_t] = r(t \mid H_t)\,\delta$$

Using the shorthand λ(t) for the event rate and putting the above together, we can see that Hawkes processes can be specified as:

$$\lambda(t) = \lim_{\delta \downarrow 0} \frac{E[dN_t \mid H_t]}{\delta} \qquad [4]$$

Note that Eq. 4 is an equivalent formulation of Eq. 1 through the counting process N(t). Eq. 4 holds for all nonhomogeneous Poisson processes. Hawkes processes (marked and unmarked) are special cases of nonhomogeneous Poisson processes.

Expected event rate for unmarked Hawkes processes
We first study the simpler case of an unmarked Hawkes process $\lambda^u(t)$, and derive its expected event rate $\xi^u(t)$ over possible event histories. While it is not strictly necessary to break down the derivation into two parts, this helps illustrate the main ideas underlying the derivation for marked processes in the next subsection. The key idea in this subsection is converting the expectation over the event history into expectations over increments of the counting process, and using conditional expectations to link the expectations of counting increments to the expected rate $\xi^u(t)$ via $\lambda^u(t)$. The next subsection will use exactly the same treatment for the history of event times, and perform a similar treatment for the history of event magnitudes. Let an unmarked Hawkes process be:

$$\lambda^u(t) = \mu s(t) + \sum_{t_i < t} \phi(t - t_i) \qquad [5]$$

Here φ(τ) is the memory kernel specified in Eq. 2; the scaling constant κ is omitted without loss of generality. Note that Eq. 4 still holds. The event index is i = 1, . . . , N(t), and N(t) is the total number of events before time t, i.e. the counting process. We define the expected event rate $\xi^u(t)$ as a function over time, obtained by taking expectations of $\lambda^u(t)$ over the event history Ht = {t_1:N(t)} consisting of the event times (note that in an unmarked process, all event magnitudes are the same, i.e. 1). Equivalently, the event history can also be expressed through the counting process N(t), as Ht = {N(τ), 0 < τ < t}. Note that Ht is random: we can think of it either as a set of random variables ti, where the dimension N(t) of this set is also random; or as one random function over time, i.e. N(τ), for 0 < τ < t.
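The expected event rate is then:

$$\begin{aligned}
\xi^u(t) &= E_{H_t}\big[\lambda^u(t)\big] = E_{H_t}\Big[\mu s(t) + \sum_{t_i < t} \phi(t - t_i)\Big] \\
&\overset{(a)}{=} \mu s(t) + E_{t_{1:N(t)}}\Big[\sum_{t_i < t} \phi(t - t_i)\Big] \\
&\overset{(b)}{=} \mu s(t) + E_{\{N(\tau),\, 0<\tau<t\}}\Big[\lim_{\delta \downarrow 0} \sum_{k=1}^{K} 1(dN_{k\delta} = 1)\, \phi(t - k\delta)\Big] \\
&\overset{(c)}{=} \mu s(t) + \lim_{\delta \downarrow 0} E_{dN_{k\delta},\, k=1:K}\Big[\sum_{k=1}^{K} 1(dN_{k\delta} = 1)\, \phi(t - k\delta)\Big]
\end{aligned} \qquad [6]$$

$$E_{dN_{k\delta},\, k=1:K}\Big[\sum_{k=1}^{K} 1(dN_{k\delta} = 1)\, \phi(t - k\delta)\Big] = \sum_{k'=1}^{K} E_{dN_{k\delta},\, k=1:k'}\big[1(dN_{k'\delta} = 1)\, \phi(t - k'\delta)\big] \qquad [7]$$

Here step (6a) is due to µs(t) being non-random, and to the fact that taking expectations over all t_1:N(t) is equivalent to taking expectations over the event history Ht. In step (6b), the interval (0, t) is partitioned into K = t/δ intervals of size δ, and 1(dNkδ = 1) is the indicator function that takes value 1 when there is an event in the interval [(k − 1)δ, kδ), i.e. dNkδ = 1, and 0 otherwise. Note that we replaced each arrival time ti from line (6a) with kδ, since event i occurred in the time interval [(k − 1)δ, kδ). In step (6c), we first exchange the order of the limit lim δ↓0 and the expectation. We note that taking the expectation over the counting process {N(τ), 0 < τ < t} is equivalent to taking the expectation over its Bernoulli increments {dNτ, 0 < τ < t}, or over its discretized version on infinitesimal intervals, dNkδ, k = 1:K. We then unroll the sum over all intervals (Eq. 7). For an interval k′, i.e. [(k′ − 1)δ, k′δ), the indicator function becomes 1(dNk′δ = 1), the kernel becomes φ(t − k′δ), and the expectation is taken over dNkδ, k = 1:k′, since the process is causal, i.e. current events are only influenced by the past and not the future.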
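Each expectation term in Eq. 7 can be computed as follows.

$$\begin{aligned}
E_{dN_{k\delta},\, k=1:k'}\big[1(dN_{k'\delta} = 1)\, \phi(t - k'\delta)\big]
&\overset{(a)}{=} \phi(t - k'\delta)\; E_{H_{(k'-1)\delta}}\Big[ E_{dN_{k'\delta} \mid H_{(k'-1)\delta}}\big[1(dN_{k'\delta} = 1)\big] \Big] \\
&\overset{(b)}{=} \phi(t - k'\delta)\; E_{H_{(k'-1)\delta}}\big[\lambda^u(k'\delta)\,\delta\big] \\
&\overset{(c)}{=} \phi(t - k'\delta)\; \xi^u(k'\delta)\,\delta
\end{aligned} \qquad [8]$$

Step (8a) is due to the kernel term φ(t − k′δ) being non-random, so that it becomes a constant. Also note that each expectation over dNkδ, k = 1:k′ of 1(dNk′δ = 1) can be computed by breaking down the joint distribution of dNkδ, k = 1:k′ into the conditional distribution of dNk′δ given H(k′−1)δ and the prior distribution over H(k′−1)δ. We write the prior in terms of the event history for notational convenience. Due to the two equivalent definitions of event history, taking the expectation over the history H(k′−1)δ is equivalent to taking the expectation over the increments of the counting process dNkδ, k = 1:(k′ − 1).
Step (8b) is due to Eq. 4: the inner expectation can be written as $\lambda^u(k'\delta)\,\delta$. Step (8c) is due to the definition of the expected event rate $\xi^u(t)$ in Eq. 6: the expectation of the event rate $\lambda^u(k'\delta)$ over the event history is $\xi^u(k'\delta)$. Applying the result of Eq. 8 back to Eq. 7, we get:

$$E_{dN_{k\delta},\, k=1:K}\Big[\sum_{k=1}^{K} 1(dN_{k\delta} = 1)\, \phi(t - k\delta)\Big] = \sum_{k'=1}^{K} \phi(t - k'\delta)\, \xi^u(k'\delta)\, \delta \qquad [9]$$

Applying Eq. 9 to the end of Eq. 6, and taking the limit δ ↓ 0, we have:

$$\xi^u(t) = \mu s(t) + \int_0^t \phi(t - \tau)\, \xi^u(\tau)\, d\tau \qquad [10]$$

Performing a change of variable τ ← t − τ, we obtain the integral equation specifying the expected event rate for the unmarked Hawkes process:

$$\xi^u(t) = \mu s(t) + \int_0^t \phi(\tau)\, \xi^u(t - \tau)\, d\tau \qquad [11]$$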
To the best of our knowledge, this definition of the intensity function, along with the derivation of its analytical form is new. The original paper by Hawkes [11] presents an integral equation of similar form, but it is for the covariance density and not event intensity function.

Expected event rate for marked Hawkes processes
The expected event rate function ξ(t) for a marked Hawkes process is defined as the expectation of the event rate function λ(t) over the set of event times and magnitudes before time t. In this subsection we work with the event rate as specified in Eq. 1.
The notation for event history is worth clarifying. Here Ht = {(ti, mi), i = 1:N(t)}, i.e., each event consists of a (random) jump time ti and a (random) event magnitude mi, and there are N(t) (another random quantity) such time-magnitude pairs. We assume that any event magnitude mi is drawn iid from the same power-law distribution p(m) = (α − 1)m^(−α), m > 0, once the event time ti is determined. That is to say, for an event spawned through the endogenous process, the magnitude of the event is independent of the magnitude of its parent event.
We define the expected event rate ξ(t) for the marked Hawkes process λ(t) as follows; step (12a) is due to µs(t) being non-random.

$$\xi(t) = E_{H_t}\big[\lambda(t)\big] \overset{(a)}{=} \mu s(t) + E_{(t_i, m_i),\, i=1:N(t)}\Big[\sum_{t_i < t} b(m_i)\, \phi(t - t_i)\Big] \qquad [12]$$
In order to unroll this expectation into K small intervals of size δ, we need a set of auxiliary variables, called mk, k = 1 : K. For each interval [(k − 1)δ, kδ), we draw mk ∼ p(m). If indicator function 1(dNkδ = 1) = 1, then mk is kept; otherwise when 1(dNkδ = 1) = 0, mk is thrown away as no event happened in this interval. One can easily verify that this process of iid draws of mk is equivalent to iid draws of the original mi. We use this to re-write the expectation in Eq. 12, and exchange the order of the expectation and the limit.
We exchange the order of the expectation and the summation, and unroll the sum over all intervals. This is similar to Eq. 7.
The causal property of the marked Hawkes process means that the expectation of each term 1(dNkδ = 1) b(mk) φ(t − kδ), for k = k′, only depends on the event history $H_{k'\delta}$, and not on what happens after.
We now compute the expectation term when k = k′.
Step (15b) breaks down the expectation over the entire history $H_{k'\delta}$ into the part over the current event (and its magnitude, if it happens), $dN_{k'\delta}, m_{k'}$, conditioned on the prior history $H_{(k'-1)\delta}$, and the part over the prior history itself. This is similar to Eq. 8 for the unmarked process.
The expectation of the function b(m) is the same whenever $dN_{k'\delta} = 1$, and only depends on p(m). This is due to the generating assumption of event magnitudes at the beginning of this subsection.
Furthermore, we see that Em[b(m)] can be computed in closed form; we call this modeling constant C.
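The closed form of C can be sketched as follows. This worked derivation is an illustration under two stated assumptions that are not spelled out in this section: the mark function is b(m) = κ m^β, as in the main text kernel, and the marks have support m ≥ 1, which is the lower cutoff implied by the normalization constant (α − 1) of p(m).

```latex
% Assumptions: b(m) = \kappa m^{\beta}; power-law marks with support m \ge 1.
\begin{align*}
C = \mathbb{E}_m[b(m)]
  &= \int_{1}^{\infty} \kappa\, m^{\beta}\, (\alpha - 1)\, m^{-\alpha}\, dm \\
  &= \kappa (\alpha - 1) \left[ \frac{m^{\beta - \alpha + 1}}{\beta - \alpha + 1} \right]_{1}^{\infty} \\
  &= \kappa\, \frac{\alpha - 1}{\alpha - \beta - 1},
  \qquad \text{for } \beta < \alpha - 1 .
\end{align*}
```

The condition β < α − 1 is what makes the upper limit of the integral vanish; with the fitted value α = 2.016 it reads β < 1.016.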
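Plugging the results of Eq. 16-18 back into Eq. 15, and noting Eq. 12, we have:

$$E_{H_{k'\delta}}\big[1(dN_{k'\delta} = 1)\, b(m_{k'})\, \phi(t - k'\delta)\big] = C\, \phi(t - k'\delta)\, \xi(k'\delta)\, \delta \qquad [19]$$

Applying this result to Eq. 14 and then to Eq. 12, followed by taking the limit δ ↓ 0, we have:

$$\xi(t) = \mu s(t) + C \int_0^t \xi(t - \tau)\, (\tau + c)^{-(1+\theta)}\, d\tau \qquad [20]$$

Eq. 20 is Eq. 3 in the main text.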
1.3 Branching factor n and endogenous response Aξ. We derive two quantities from the Hawkes Intensity Process in order to better visualize and explain the diverse behavior of video popularity. The first key quantity is the branching factor n, defined as the mean number of daughter events generated by a mother event. For a marked Hawkes point process, the branching factor is computed by integrating the triggering kernel over time and taking the expectation over the magnitude m:

$$n = \int_{m} p(m) \int_0^{\infty} \phi_m(\tau)\, d\tau\, dm = \frac{C}{\theta c^{\theta}}, \qquad \text{for } \beta < \alpha - 1 \text{ and } \theta > 0 .$$

n < 1 implies a subcritical regime, i.e., the instantaneous rate of events decreases over time and new events will eventually cease to occur (in probability); n > 1 implies a supercritical regime, i.e. each new event generates on average more than one direct descendant, which in turn generate more descendants, so that, unless the network condition changes, the total number of events is expected to be infinite.
The second quantity Aξ, as defined in the main text, is the total number of (direct and indirect) descendants generated from one event. In the main text it is defined as

$$A_\xi = \int_0^{\infty} \hat{\xi}(t)\, dt$$

where $\hat{\xi}(t)$ is the response of the system to a unit impulse. Although defined separately, we can see that Aξ is closely related to the branching factor n: the initial exogenous event will generate n events as first-generation direct descendants. Each of these events will generate an expected n events (n² events in the second generation), and each of these will in turn generate n events (n³ events in the third generation), and so on. Here n^k is the average number of events in the k-th generation. This leads to an equivalent definition of Aξ:

$$A_\xi = 1 + n + n^2 + \dots = \sum_{k=0}^{\infty} n^k = \frac{1}{1 - n}, \qquad \text{for } n < 1 .$$

While both capture the endogenous property of the Hawkes intensity model, Aξ and n emphasize different intuitions. We chose to visualize Aξ in the endo-exo map because it has a direct correspondence to the sliced LTI system view in main text Eq. 4 and Figure 2, and because Aξ has better numerical resolution for the more viral videos, i.e., when n is close to 1. In the main text, we obtain estimates of Aξ by numerically integrating $\hat{\xi}$ in Eq. 4 up to t = 10,000 discrete time steps.
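The relation between n and Aξ can be checked numerically with a short Python sketch. It assumes the closed form n = C / (θ c^θ) obtained by integrating the power-law kernel, and a generation-by-generation series that counts the initial event as generation 0; the function names and the numeric values are hypothetical.

```python
def branching_factor(C, theta, c):
    """n = C / (theta * c**theta): integral of C * (tau + c)**(-(1 + theta)) over tau >= 0."""
    if theta <= 0 or c <= 0:
        raise ValueError("theta and c must be positive")
    return C / (theta * c ** theta)


def endogenous_response(n, generations=None):
    """Total expected number of events per exogenous event.

    With generations=None, returns the closed form 1 / (1 - n), valid for 0 <= n < 1.
    Otherwise returns the partial sum n**0 + n**1 + ... + n**generations.
    """
    if generations is not None:
        return sum(n ** k for k in range(generations + 1))
    if not 0 <= n < 1:
        raise ValueError("the closed form requires 0 <= n < 1 (subcritical regime)")
    return 1.0 / (1.0 - n)


n = branching_factor(C=0.5, theta=1.0, c=1.0)        # subcritical example
closed_form = endogenous_response(n)
partial_sum = endogenous_response(n, generations=60)  # approaches the closed form
```

As n approaches 1 the closed form grows without bound, which is the numerical resolution argument made above for the more viral videos.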
Crane and Sornette [8] showed that the Hawkes Intensity Process in a supercritical state could explain some rising patterns of popularity observed in social media. We note, however, that finite resources in the real world, such as collective human attention [21], are bound to be exhausted, and online systems cannot stay indefinitely in a supercritical regime. We argue that most online media items are affected by a continued interaction of exogenous stimuli and endogenous reaction (that may be sub- or supercritical), leading to a continued rise in popularity, or to multiple phases of rising and falling patterns.

The Hawkes Intensity Process as an LTI system. The Hawkes intensity process can be viewed as a system with one input, the exogenous stimuli rate s(t), and one output, the event rate ξ(t). The Main Text claims that the system s(t) → ξ(t) is a Linear Time Invariant (LTI) system. That is to say, the system has two properties:
Linearity, which states that the relation between the input and the output of the system is a linear map: if s1(t) → ξ1(t) and s2(t) → ξ2(t), then as1(t) + bs2(t) → aξ1(t) + bξ2(t), ∀a, b ∈ R. We can see that linear scaling as1(t) → aξ1(t) holds by multiplying both sides of Eq. 3 in the main text by a and regrouping terms. Additivity as1(t) + bs2(t) → aξ1(t) + bξ2(t) can be shown similarly.
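Time invariance, which states that the response to a delayed input is identical and similarly delayed: if s(t) → ξ(t), then s(t − t0) → ξ(t − t0). We wish to show the following for Eq. 3 of the main text:

$$\xi(t - t_0) = \mu s(t - t_0) + C \int_0^{t} \xi(t - t_0 - \tau)\, (\tau + c)^{-(1+\theta)}\, d\tau$$

After a change of variable t′ = t − t0, we can see that the LHS is ξ(t′). For the RHS, τ remains unchanged, and the rest is:

$$\mu s(t') + C \int_0^{t' + t_0} \xi(t' - \tau)\, (\tau + c)^{-(1+\theta)}\, d\tau$$

We write the integral in two parts, i.e., over (0, t′) and (t′, t′ + t0):

$$\mu s(t') + C \int_0^{t'} \xi(t' - \tau)\, (\tau + c)^{-(1+\theta)}\, d\tau + C \int_{t'}^{t' + t_0} \xi(t' - \tau)\, (\tau + c)^{-(1+\theta)}\, d\tau$$

We note that ξ(·) is a causal function, i.e., ξ(t) = 0 for t < 0, or ξ(t′ − τ) = 0 for τ > t′. The second integral therefore vanishes, and the RHS becomes

$$\mu s(t') + C \int_0^{t'} \xi(t' - \tau)\, (\tau + c)^{-(1+\theta)}\, d\tau$$

Note that LHS = RHS due to Eq. 20, and time invariance holds. The LTI properties directly imply the following about the Hawkes intensity process, as illustrated in Figure 2 of the main text.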
• Additive effects from multiple sources of external stimulation: when applying two sources of excitation, the event rate of the resulting Hawkes intensity process is the sum of the rates generated by each source of excitation independently. This allows us to separately quantify the impact of each source.
• Scaling the expected event rate: if the exogenous stimuli scale up or down, the endogenous reaction will scale accordingly. In other words, if we can control the amount of exogenous promotions, we could boost or suppress the number of views for videos that respond to such promotions.
• Shifting in time: if the exogenous stimuli are shifted in time, so will the views responding to them. In other words, we could schedule promotions (and subsequent views) for videos that respond to such promotions.
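The three properties above can be checked numerically on a discrete-time version of the model. The sketch below is an illustration, not the authors' implementation: it assumes the recursion ξ[t] = µ s[t] + C Σ_{τ=1..t} ξ[t − τ] (τ + c)^(−(1+θ)), and the function names, input series, and parameter values are hypothetical.

```python
def hip_response(s, mu=1.0, C=0.5, c=1.0, theta=1.5):
    """Discrete-time expected event rate for an exogenous input series s.

    xi[t] = mu * s[t] + C * sum_{tau=1..t} xi[t - tau] * (tau + c)**(-(1 + theta))
    """
    xi = []
    for t in range(len(s)):
        endo = sum(xi[t - tau] * (tau + c) ** (-(1.0 + theta)) for tau in range(1, t + 1))
        xi.append(mu * s[t] + C * endo)
    return xi


def max_abs_diff(x, y):
    """Largest pointwise absolute difference between two equal-length series."""
    return max(abs(a - b) for a, b in zip(x, y))


s1 = [5.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
s2 = [0.0, 1.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
a, b, t0 = 2.0, -0.5, 3

xi1, xi2 = hip_response(s1), hip_response(s2)

# Linearity (additivity and scaling): a*s1 + b*s2 -> a*xi1 + b*xi2.
mixed = hip_response([a * u + b * v for u, v in zip(s1, s2)])
linearity_gap = max_abs_diff(mixed, [a * u + b * v for u, v in zip(xi1, xi2)])

# Time invariance: delaying the input by t0 steps delays the output by t0 steps.
delayed_input = [0.0] * t0 + s1[:-t0]
shift_gap = max_abs_diff(hip_response(delayed_input), [0.0] * t0 + xi1[:-t0])
```

Both gaps are zero up to floating-point rounding, because the recursion is a causal convolution of the input with a fixed kernel.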

Fitting the Hawkes intensity model
This section describes some of the implementation and computational details for estimating the model in Eq 20 from observed popularity and promotion histories.

Settings for model estimation.
We first present the view of Eq 20 over data obtained in discrete time intervals, and then discuss a way to estimate missing external influence, and finally list the loss function for least-squares fitting.
Discretizing over time. Behavioral statistics are aggregated and presented over fixed, discrete intervals. In the case of YouTube, we observe the daily history of views $\bar{\xi}[t]$ and shares $\bar{s}[t]$ for t = 1, . . . , T. Writing Eq 20 over discrete time, we obtain:

$$\xi[t] = \mu s[t] + C \sum_{\tau=1}^{t} \xi[t - \tau]\, (\tau + c)^{-(1+\theta)}$$
Hereafter, we use square brackets to denote discrete time (e.g. s[t], ξ[t]) and round brackets to denote continuous time (e.g. s(t), ξ(t)).
Accounting for unobserved external influence. In addition to the observed external promotions $\bar{s}[t]$ in tweets or shares, we model unobserved external excitation as an initial shock (at t = 0) and a constant background excitation (for t > 0):

$$\xi[t] = \gamma\, 1(t = 0) + \eta\, 1(t > 0) + \mu \bar{s}[t] + C \sum_{\tau=1}^{t} \xi[t - \tau]\, (\tau + c)^{-(1+\theta)}$$

where 1(arg) takes the value 1 when arg is true and 0 otherwise. In the absence of a parametric temporal model of generic external influence, the initial impulse and constant components require the least amount of assumptions about how unobserved influence evolves. In our experiments, adding estimates for such unobserved influence components improves the fitting for a large number of videos.
The loss function. For each video with observed {$\bar{\xi}[t]$, $\bar{s}[t]$, t = 1, . . . , T}, we find an optimal set of model parameters {µ, θ, C, c} and also estimate the unobserved external influence (parameters γ and η). This is done by minimizing the squared error between the observed series $\bar{\xi}[t]$ and the model ξ[t], ∀t ∈ 0, 1 . . . T. The corresponding optimization problem is as follows:

$$\min_{\mu, \theta, C, c, \gamma, \eta} J = \frac{1}{2} \sum_{t=0}^{T} \big(\xi[t] - \bar{\xi}[t]\big)^2 \qquad [24]$$

Note that Eq 24 involves the model components ξ[t − τ]. As we will show in the next subsection, the objective function and its gradients are computed iteratively by estimating ξ[t], as we would like the model to reproduce the whole observed time series, rather than predict the next point given the observed history. As will be discussed in SI Sec. 2.3, we further improve fitting stability by adding an L2 regularizer to the objective function.
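The model series and the least-squares objective can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the discrete recursion with an initial shock γ at t = 0 and a constant background η for t > 0, list index 0 standing for t = 0, and a factor 1/2 in front of the sum of squares. The function names, the example series, and the parameter values are hypothetical.

```python
def hip_series(s, mu, theta, C, c, gamma=0.0, eta=0.0):
    """Model series xi[t] for t = 0..len(s)-1.

    xi[t] = gamma*1(t == 0) + eta*1(t > 0) + mu*s[t]
            + C * sum_{tau=1..t} xi[t - tau] * (tau + c)**(-(1 + theta))
    """
    xi = []
    for t in range(len(s)):
        exo = mu * s[t] + (gamma if t == 0 else eta)
        endo = sum(xi[t - tau] * (tau + c) ** (-(1.0 + theta)) for tau in range(1, t + 1))
        xi.append(exo + C * endo)
    return xi


def squared_error(observed_views, s, params):
    """J = 1/2 * sum_t (xi[t] - observed[t])**2 over the whole observed series."""
    model = hip_series(s, **params)
    return 0.5 * sum((x - y) ** 2 for x, y in zip(model, observed_views))


shares = [4.0, 1.0, 0.0, 0.0, 2.0, 0.0]            # hypothetical daily shares
params = {"mu": 20.0, "theta": 3.0, "C": 0.8, "c": 1.0, "gamma": 50.0, "eta": 5.0}
views = hip_series(shares, **params)                # synthetic "observed" views
J_true = squared_error(views, shares, params)       # zero at the generating parameters
J_off = squared_error(views, shares, dict(params, mu=25.0))
```

Because each ξ[t] depends on all earlier model values, the whole series has to be regenerated for every candidate parameter set, which is why the objective is evaluated iteratively.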

Computing gradients.
Eq. 24 is a non-convex objective. We use a gradient-based optimization approach, specifically L-BFGS [18] with pre-supplied gradient functions. We use the implementation supplied with the NLopt package [13]. We fit each video in parallel, starting with multiple random initializations to improve solution quality, and we present the solution with the lowest value of the error function J. The gradient computations are listed as follows.
We define the error term as $e[t] = \xi[t] - \bar{\xi}[t]$. Using the chain rule, we obtain:

$$\frac{\partial J}{\partial var} = \sum_{t=0}^{T} e[t]\, \frac{\partial \xi[t]}{\partial var} \qquad [25]$$

where var ∈ {µ, θ, C, c, γ, η}. Specifically, we compute the following partial derivatives and use them in Eq 25 to compute the gradient.
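As an illustration of the chain rule in Eq 25, the Python sketch below computes the gradient of J with respect to µ only. It assumes the discrete recursion without the unobserved components γ and η, for which differentiating both sides gives ∂ξ[t]/∂µ = s[t] + C Σ_{τ=1..t} ∂ξ[t − τ]/∂µ (τ + c)^(−(1+θ)). The function names and numeric values are hypothetical, and the derivatives for the other parameters are not covered here.

```python
def hip_and_dmu(s, mu, theta, C, c):
    """Return the model series xi[t] and its derivative d xi[t] / d mu, computed jointly."""
    xi, dxi = [], []
    for t in range(len(s)):
        w = [(tau + c) ** (-(1.0 + theta)) for tau in range(1, t + 1)]
        xi.append(mu * s[t] + C * sum(xi[t - tau] * w[tau - 1] for tau in range(1, t + 1)))
        # Same recursion, differentiated with respect to mu.
        dxi.append(s[t] + C * sum(dxi[t - tau] * w[tau - 1] for tau in range(1, t + 1)))
    return xi, dxi


def loss_and_grad_mu(observed, s, mu, theta, C, c):
    """J = 1/2 * sum e[t]**2 and dJ/dmu = sum e[t] * d xi[t]/d mu, with e = xi - observed."""
    xi, dxi = hip_and_dmu(s, mu, theta, C, c)
    e = [x - y for x, y in zip(xi, observed)]
    J = 0.5 * sum(v * v for v in e)
    dJ_dmu = sum(v * d for v, d in zip(e, dxi))
    return J, dJ_dmu


s = [3.0, 1.0, 0.0, 4.0, 0.0, 2.0]
observed = [10.0, 6.0, 3.0, 9.0, 4.0, 5.0]
J, g = loss_and_grad_mu(observed, s, mu=1.2, theta=1.5, C=0.4, c=1.0)

# Central finite difference as an independent check of the analytic gradient.
h = 1e-5
J_plus, _ = loss_and_grad_mu(observed, s, 1.2 + h, 1.5, 0.4, 1.0)
J_minus, _ = loss_and_grad_mu(observed, s, 1.2 - h, 1.5, 0.4, 1.0)
finite_diff = (J_plus - J_minus) / (2 * h)
```

A finite-difference comparison of this kind is a common sanity check before handing analytic gradients to a quasi-Newton optimizer.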

2.3 Adding an L2 regularizer. We add L2 regularization on the linear coefficients of the Hawkes Intensity Process to avoid overfitting. The loss function with the regularization terms is as follows.

$$J_{reg}(\omega, \mu, \theta, C, c) = J(\mu, \theta, C, c) + \frac{\omega}{2} \left[ \left(\frac{\gamma}{\gamma_0}\right)^2 + \left(\frac{\eta}{\eta_0}\right)^2 + \left(\frac{\mu}{\mu_0}\right)^2 + \left(\frac{C}{C_0}\right)^2 \right]$$

Here (γ0, η0, µ0, C0) are reference values for the parameters, obtained by fitting the series $\bar{\xi}[t]$ without regularization. The reference values are used to normalize the parameters in the regularization process, so that they have equal weights. Intuitively, using L2 regularization with a square loss is effectively putting a Gaussian prior on the parameters being regularized. We desire parameters c and θ to take values away from zero, hence they are not regularized. The L2 regularization term is differentiable with respect to the variables (γ, η, µ, C), and the terms $\omega\gamma/\gamma_0^2$, $\omega\eta/\eta_0^2$, $\omega\mu/\mu_0^2$ and $\omega C/C_0^2$ are added respectively to the RHS of Eq. 31, 32, 26 and 28.
The regularizer parameter ω is expressed as a percentage of J0 (the value of the non-regularized error function), and it is determined through a line search within [10^(−4) J0, 10 J0] in log scale. ω is tuned per video, on a temporally held-out tuning sequence, i.e. we use the first 75 days of observed popularity for parameter estimation, the next 15 days for tuning ω, and days 91-120 for forecasting popularity.
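The tuning protocol can be sketched as follows. This is an illustrative outline only: the penalty is assumed to be ω/2 times the sum of squared parameters, each normalized by its reference value; the number of grid points `num` is a hypothetical choice, since the text specifies only the search interval and the log scale; and all function names are made up for the example.

```python
import math


def regularized_loss(J, omega, params, reference):
    """J_reg = J + omega/2 * sum over (gamma, eta, mu, C) of (param / reference)**2.

    Assumes every reference value is non-zero.
    """
    penalty = sum((params[k] / reference[k]) ** 2 for k in ("gamma", "eta", "mu", "C"))
    return J + 0.5 * omega * penalty


def omega_grid(J0, num=6):
    """num log-spaced candidates for omega between 1e-4 * J0 and 10 * J0."""
    lo, hi = math.log10(1e-4 * J0), math.log10(10.0 * J0)
    return [10 ** (lo + i * (hi - lo) / (num - 1)) for i in range(num)]


def temporal_split(series):
    """Days 1-75 for parameter fitting, 76-90 for tuning omega, 91-120 for forecasting."""
    return series[:75], series[75:90], series[90:120]


fit_days, tune_days, forecast_days = temporal_split(list(range(1, 121)))
candidates = omega_grid(J0=1.0)
```

In use, one would refit the model on the fitting days for each candidate ω and keep the value with the lowest error on the tuning days.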

Data
We construct a "Tweeted Videos" dataset using the data APIs of both Twitter and YouTube. We stream tweets from the Twitter API using a set of keywords related to YouTube and its videos. This returns over 5 million matched tweets per day after URL expansion and tokenization performed by Twitter, most of which mention and link to a YouTube video. The raw dataset used in this study covers 2014-05-29 to 2014-12-26, with 1,061,661,379 tweets in total. From each of these tweets, we extracted the associated YouTube video id (only the first in case multiple videos were referenced in the same tweet), resulting in 81,915,174 distinct videos in total.
From YouTube.com, we obtained video metadata including the upload date and video category, as well as the time series consisting of the daily number of views and shares. Along with the daily number of tweets, we obtained three attention-related time series for each video.
3.1 The 5Mo and Active datasets. We constructed two cleaned data subsets from the feed of tweeted videos, in order to collect basic data statistics and estimate the model.
• The 5Mo dataset was constructed to contain videos whose popularity history is at least 60 days long, and is used for forecasting popularity. We narrow down the timeframe of video upload to between 2014-05-29 and 2014-10-24 in order to have a long enough history. There are 16,417,622 videos with publicly available popularity history. We did not obtain the popularity history for more than half of the videos; reasons for such data loss include: a video is no longer online, a video's popularity history is not publicly available, or requests resulted in web server errors. This large and diverse sample allows us to estimate the background statistics of video views, tweets, and shares, as will be discussed in the next subsection.
• The Active dataset contains videos with at least 120 days of tweeting and popularity history in our dataset. The activity threshold is used to ensure there is sufficient data for estimating the Hawkes intensity model and producing a forecast as described in Section 5. Table 1 presents the category distribution for the Active dataset. It is noteworthy that the largest 4 categories cover more than 70% of all the videos in Active, with more than 25% of the videos being Music. We removed 6 content categories (i.e. Autos & Vehicles, Travel & Events, Pets & Animals, Shows, Movies, Trailers), each containing less than 1% of the videos in the dataset. Their corresponding videos were also removed. The resulting dataset contains 13,738 videos.

Videos are grouped into bins by the viewcounts that they receive at t days after upload, and each bin ξt(k) is marked with its maximum popularity percentile: videos in bin k are at most among the top k% popular with age t. In this work we use 40 evenly spaced bins, i.e., k = 2.5, 5.0, 7.5, . . . , 100, and each bin contains 2.5%, or ∼41K+ videos for 5Mo. Figure 1(a) and (b) contain boxplots of video viewcounts (in log scale) for each bin after 30 and 60 days, respectively.
We can see the long-tailed distribution of popularity in YouTube reflected here: videos in the less popular bins have very similar numbers of views, e.g. the first 6 bins, or 15% of the videos, all have fewer than 10 views; videos in each of the middle bins (e.g. k = 17.5, . . . , 85.0) are within 1.5 times of the viewcount of each other; yet viewcounts of the 5% most popular videos (k = 97.5 and k = 100) span almost two orders of magnitude. For videos in 5Mo after their first 30 and 60 days of upload, the shape of the overall popularity scale remains the same, with a slight increase in the dynamic range of views (top of the last boxplot). The popularity scale of Active is very similar to the one presented in Figure 1(a) and (b), the only notable difference being the number of views corresponding to each bin. Active is a subset of the most popular videos, as shown by Figure 2: the videos in Active are positioned in the top 5% popularity percentiles of 5Mo (k = 97.5 and k = 100).
In Figure 1(c) we explore the change of popularity of each video from 30 days (y-axis) to 60 days (x-axis). Note that most videos retain a similar rank (in the boxes along the 45 degree diagonal line), or have a slight rank decrease as they are overtaken by other videos (slightly above the diagonal in the plot). No outliers exist in the upper-left part of the graph, since a video cannot lose viewcount that it already gained. Most notably, we can see that videos from any bucket can jump to the top popularity buckets between 30 and 60 days of age, such as the outliers for the few boxes on the far right. This phenomenon raises important questions: how did these videos go viral, and is this related to external promotions?
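The popularity scale described in this section can be sketched as a rank-based binning. The Python snippet below is a hedged illustration: it assumes bins are formed by rank so that each holds an equal share of the videos, and that ties are broken by input order; the function name and the toy viewcounts are hypothetical.

```python
def percentile_bins(viewcounts, num_bins=40):
    """Return, for each video, the upper percentile label of its popularity bin.

    With num_bins=40 the labels are 2.5, 5.0, ..., 100.0; videos are ranked by
    viewcount at a fixed age and split into bins of equal size.
    """
    n = len(viewcounts)
    order = sorted(range(n), key=lambda i: viewcounts[i])  # least viewed first
    width = 100.0 / num_bins
    labels = [0.0] * n
    for rank, i in enumerate(order):
        bin_index = min(num_bins - 1, rank * num_bins // n)
        labels[i] = (bin_index + 1) * width
    return labels


# Toy example: 80 videos with distinct viewcounts give 2 videos per bin.
toy_views = list(range(80))
labels_30d = percentile_bins(toy_views)
```

Computing the labels at 30 and at 60 days for the same videos and comparing them gives the kind of rank transition view discussed for Figure 1(c).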

Understanding popularity dynamics
In this section, we provide additional observations on parameters of the Hawkes intensity model, supplementing the analysis presented in the main text Section "What describes the most popular videos". Specifically, we relate the distribution of specific parameter values, such as the memory exponent or the exogenous sensitivity, to video groups (channels, content categories) and to a video's popularity.
4.1 Behavior across groups of videos: categories and channels. We provide in this subsection some observations on behavior statistics and key parameters broken down by video category. Furthermore, we show how the endo-exo map can be used to detect consistent behaviors across YouTube channels.
Consistent behavior across channels. We use the endo-exo map to visualize groups of videos that belong to the same user-assigned content type, or are from the same author, called a channel in YouTube. Fig. 3 shows a scatter plot of videos posted by a reporter in category News & Activism (in red) and a user focusing on recordings of Game sessions (in blue). The game recording videos are generally more popular (bigger circles) than the news videos, and this is explained by the former group having higher exogenous sensitivity, i.e. higher values of µ.
The effect of the external influence is not equal. We examine the amount of attention (in number of views) and external influence (in number of shares and tweets) in the Active dataset. This provides a basis for understanding the corresponding Hawkes intensity model. Figure 4 (top row) contains box plots of total views, along with total shares and tweets, broken down by video category. The left-most boxes (in red) depict the profile of all videos. One notable example is videos in the Nonprofits & Activism category: overall they have a less-than-average amount of views, despite being shared more than the median number of times.
Observed versus unobserved external influences. Model parameters γ and η can be interpreted respectively as the initial impulse and the constant exogenous stimuli not captured in the observed exogenous activity s(t). From the bottom left two plots in Figure 4, we can see that several categories have significantly higher components of γ and η, such as Gaming, Comedy and Entertainment. This may result from a significant volume of activity outside of Twitter or YouTube sharing: Gaming videos, for example, are known to spread on dedicated social networks such as the subreddits /r/gaming/ and /r/gamingvids/, or on forums such as www.minecraftforum.net.
Exogenous sensitivity and endogenous response. The two bottom right plots of Figure 4 represent the breakdown per category of, respectively, the exogenous sensitivity µ and the endogenous response Aξ. These plots present an alternative view to the 2-dimensional density distribution of each category on the endo-exo map, shown in Figures 10 and 11. Certain categories, such as Comedy, Gaming or Sport, seem to be particularly sensitive to external influence. Categories like Comedy, Entertainment or Gaming exhibit higher than median endogenous responses. The fact that Comedy and Gaming show both a high exogenous sensitivity and a high endogenous response provides a plausible explanation of why these categories reach relatively high popularity (#views) despite their relatively low sharing. Conversely, Nonprofits & Activism exhibits lower than median values for both µ and Aξ, which accounts for its low popularity (even though it is highly shared).
Categories of longer versus shorter memory. Figure 5 plots distributions of the memory exponent θ, obtained using kernel density estimation. The θ values for three categories, Music, Nonprofits & Activism and News & Politics (in red), are contrasted with the distribution for all videos (in blue). The solid lines in each graph indicate the median value of θ in each category, whereas the dashed lines indicate the mean value. All video categories, as well as the general population, exhibit a long-tailed distribution for θ, with a peak density around θ ≈ 3.35. A small θ leads to slower decay over time (and a larger endogenous response Aξ), whereas a large θ means an event is forgotten quickly (i.e. small Aξ). We can see that a larger (than random) fraction of Music videos decay slowly (mean θ over all videos = 15.94, mean θ for Music = 14.95), while more News & Politics and Nonprofits & Activism videos are forgotten faster, with mean θ for Nonprofits & Activism = 17.56 and mean θ for News & Politics = 19.45. This suggests that there is a systematic difference across different types of videos in the rate at which the collective memory decays. One explanation for such differences can be that music is typically considered timeless content, while news is considered timely content whose relevance decreases rapidly over the first few days.

What makes videos popular.
In this section, we provide additional details about the relation between video popularity and fitted values of parameters µ and θ. These analyses provide additional details to the endo-exo map, by explicitly linking the endogenous and exogenous components of each video to each model parameter.
Parameters µ and θ and popularity. In the Main Text, we claim a direct connection between the exogenous sensitivity µ and popularity, and an inverse connection between the time-decay rate θ of the memory kernel and popularity. We provide, in Fig. 6, empirical evidence of these connections by studying the popularity distribution for low and high values of the above parameters. The top-left graphic shows the density distribution of the fitted values of µ in the Active dataset. There is a high peak of density around µ = 1, corresponding to videos with low sensitivity to external influence, and a second peak around µ = 10^1.73 = 53.7, corresponding to videos with higher exogenous sensitivity. We divide the range of µ into deciles (groups of 10% each) and we select the second decile (i.e. low sensitivity) and the tenth decile (high sensitivity), hashed in gray on the graphic. In the bottom-left graphic, we plot the popularity distribution for videos within each of the above deciles of µ. The subpopulation of videos with low exogenous sensitivity shows a dense area of low popularity, with only very few videos making it into the top popularity percentiles. Conversely, the density distribution of the subpopulation of videos with high exogenous sensitivity shows an increasing trend, with a concentration of highly popular videos. This confirms the intuition that highly popular videos tend to have high values of exogenous sensitivity µ.
Similar results can be shown for the time-decaying memory exponent θ, which controls how fast videos are forgotten and the size of the endogenous response Aξ. Fig. 6b plots the density distribution of θ, which shows a peak at θ = 3.36, and marks the second and tenth deciles, corresponding respectively to low and high values of θ. Similarly to µ, the bottom-center graphic plots the popularity distribution for each of the subpopulations defined by the selected deciles of θ. The subpopulation with high values of θ (i.e. low Aξ) tends to be forgotten more quickly and shows a concentration of videos with low popularity, whereas videos with lower values of θ (and higher Aξ) tend to be more popular.
Endo-exo map for additional categories. The above considerations are the basis for the construction of the endo-exo map, as shown in the Main Text, and of its potentially viral region: videos with high values of both exogenous sensitivity µ and endogenous response Aξ are more likely to become popular if given the required attention. The right column of Fig. 6 plots the 2D density of videos on the endo-exo map for the entire Active dataset (top) and for the top 5% most popular videos (the color map is aligned for the two graphics). Visibly, the distribution of the popular videos is skewed towards the more viral region of the map (i.e. high µ and high Aξ). In Fig. 10 and 11, we repeat this analysis and further break down the Active population by video category. We plot pairs of 2D densities of videos on the endo-exo map for all categories, except Gaming and Film & Animation, which were discussed in the Main Text Fig. 3. There is even an outlier category, Comedy, which seems to have two heat centers in the top 5% popular subpopulation. This seems to indicate two distinct patterns of becoming popular within this category: one pattern involves being more sensitive to exogenous excitation than the average video, whereas the second pattern involves higher endogenous propagation in the network (higher Aξ).

The null hypothesis is that no causal relation exists between the series. We reject the null hypothesis and accept the existence of a causal relation when we observe a test p-value lower than 10^(−3). Table 2 shows the number of pairs in each setup for which the causal relation is considered to be significant. Note that the causal relation can be reciprocal, e.g. for 253 videos, both the shares cause the views and the views cause the shares. Considering the scale of the Active dataset (around 14 thousand videos), the number of videos for which a causal relation is detected seems very small. The only noticeable exception is the views cause shares relation, with 833 videos. For all other pairs, the counts in both directions seem comparable (e.g. tweets cause views for 164 videos, and shares cause tweets for 162 videos). Lastly, it is well known that Granger causality does not account for latent confounding effects and does not capture instantaneous and non-linear causal relationships, while the relation between shares and views, for example, is profoundly non-linear. For these reasons we do not pursue the analysis of causal relations any further.

Popularity forecasting and comparison to baseline
In this section we provide additional details and results to complete the Main Text Sec. "Forecasting popularity growth", namely a performance breakdown of the different approaches and the statistical tests used to detect significant differences in forecasting performance.
The series of the first 90 days of each video history in the Active dataset are used to fit the Hawkes intensity model parameters. Each series is divided into two sub-series: the first 75 days are used to fit the parameters {µ, θ, C, c}, while the last 15 days form the holdout series used to fit the regularizer metaparameter ω. Either the #shares or the #tweets series can serve as the known exogenous stimuli series s(t). The Multi-Linear Regression (MLR) [24] baseline is trained using the same data. We adapt the original algorithm to predict the view count for each of the 30 days from day 91 to day 120. Furthermore, we build an enhanced version (denoted by MLR (#shares) or MLR (#tweets)) by introducing the exogenous influence as additional variables, both in training and in prediction. The baseline is particularly sensitive to outliers, which we remove from the training set. A video is considered an outlier if it received a large burst of views in the period from day 91 to day 120. More precisely, we remove any video that received twice as many views between days 91 and 120 as it did between days 61 and 90. This rule marks 3.5% of the videos as outliers, and they are eliminated from the training set. Errors are reported as the average error in popularity percentile, as defined in the main text.
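The windowing and the outlier rule described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, `daily_views[0]` is assumed to hold the views of day 1, and reading "twice as many" as "at least twice as many" (with a guard for videos that received no views at all) is our assumption.

```python
from typing import Dict, List, Sequence, Tuple


def split_history(series: Sequence[float]) -> Tuple[List[float], List[float], List[float]]:
    """Split a daily series into fit (days 1-75), holdout (days 76-90)
    and forecast-target (days 91-120) windows."""
    if len(series) < 120:
        raise ValueError("need at least 120 daily values")
    return list(series[:75]), list(series[75:90]), list(series[90:120])


def is_outlier(daily_views: Sequence[float], ratio: float = 2.0) -> bool:
    """Flag a video whose views in days 91-120 reach `ratio` times its views
    in days 61-90 (assumed threshold: 'at least twice as many')."""
    if len(daily_views) < 120:
        raise ValueError("need at least 120 daily values")
    reference = sum(daily_views[60:90])   # days 61..90
    target = sum(daily_views[90:120])     # days 91..120
    # A video with no views in the target window is never a burst.
    return target > 0 and target >= ratio * reference


def remove_outliers(videos: Dict[str, Sequence[float]]) -> Dict[str, Sequence[float]]:
    """Keep only the videos that are not flagged as outliers."""
    return {vid: views for vid, views in videos.items() if not is_outlier(views)}
```

In this sketch the filter would be applied to the training set of the MLR baseline only, in line with the description above.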

Additional results.
In addition to the performance comparison shown in the Main Text Fig. 4, Figure 8 (a) presents the Cumulative Distribution Function (CDF) of the prediction errors for the Hawkes intensity model, MLR and MLR (#shares). The Hawkes intensity model consistently outperforms MLR (with and without the exogenous stimuli information): the Hawkes model forecasts the popularity of 87% of the video population with an error of at most 10%, while MLR covers only 78% of the population for the same error threshold. Furthermore, MLR (#shares) obtains only marginal performance improvements over MLR, even though it uses the exogenous information. Figure 8 (b) shows the absolute forecasting error performances, aggregated using barplots. Visibly, the Hawkes intensity model (using either #shares or #tweets) consistently outperforms MLR both in terms of median values and variation, which results in the better mean values of forecasting error already shown in the Main Text Fig. 4 (center). Figure 8 (c) analyzes more closely the forecasting error distribution for the best performing version of each approach. The Hawkes (#shares) blue curve shows a concentration of lower errors and a median error value of 3%. The red curve corresponds to the error distribution for MLR (#shares) and shows a higher concentration of larger errors and a median error value of 3.75%.
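Coverage figures such as "87% of the videos within a 10% error" are values of the empirical CDF at a given threshold. The sketch below shows one way to read them off a list of per-video errors; the function name and the toy error values are hypothetical and are not taken from the Active dataset.

```python
import statistics
from typing import Sequence


def fraction_within(errors: Sequence[float], threshold: float) -> float:
    """Empirical CDF at `threshold`: share of videos whose absolute
    forecasting error does not exceed the threshold."""
    if not errors:
        raise ValueError("errors must be non-empty")
    return sum(1 for e in errors if abs(e) <= threshold) / len(errors)


# Toy per-video errors in popularity percentile (illustrative values only).
toy_errors = [0.01, 0.03, 0.05, 0.12, 0.02]
coverage_at_10_percent = fraction_within(toy_errors, 0.10)
median_error = statistics.median(toy_errors)
```

Evaluating `fraction_within` over a grid of thresholds would trace the kind of curve described for Figure 8 (a).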

Statistical significance.
We study the statistical significance of the difference in forecasting performance observed in main text Fig. 4(b). We break down the study into two questions: 1) is the difference in forecasting performance between the Hawkes intensity process and the MLR baseline statistically significant? and 2) does the source of exogenous stimuli (#shares or #tweets) influence the quality of the forecasting? We set up four experiments: two experiments comparing the performance of the forecasting methods for each of the two sources of influence, and two experiments comparing the effect of the sources for each of the two algorithms. Based on the selected setup, we construct two samples and perform a T-test. For each experiment, the null hypothesis assumes that the means of the two samples are equal (i.e. there is no difference in forecasting performance). The alternative hypothesis assumes that the true means of the two samples are not equal.
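As an illustration of the test of equal means, the sketch below computes a paired t statistic for two error samples measured on the same videos. It assumes the paired variant, and it approximates the two-sided p-value with the standard normal distribution instead of Student's t; that simplification is ours and is only reasonable for large samples. The function name is hypothetical.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_test(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (t statistic, approximate two-sided p-value) for paired samples."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("samples must have equal length of at least 2")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    # Normal approximation to the t distribution (assumption: large n).
    p = math.erfc(abs(t) / math.sqrt(2))
    return t, p
```

For small samples, an exact t-distribution p-value from a statistics library would be preferable to the normal approximation used here.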

Statistical testing with large sample size
The p-value obtained from hypothesis testing depends on both the observed difference between the two samples and the sample size. This renders analyses based only on the p-value particularly sensitive to sample size, given that with sufficiently large samples, a statistical test will almost always demonstrate a significant difference [26]. Given the size of the Active dataset (i.e. 13,738 videos), which serves as the sample for the four experiments hereafter, we measure the effect size in addition to the typical p-value. The effect size measures we report and use to justify our analysis are Cohen's d coefficient [6] and Pearson's r correlation coefficient [7].
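The two effect size measures can be sketched as below. This is an illustrative version only: it assumes Cohen's d is computed with the pooled standard deviation of two samples, and it converts d to r with the textbook formula r = d / sqrt(d^2 + 4), which holds for samples of equal size. The exact variants used for the reported numbers are not specified here, and the function names are hypothetical.

```python
import math
from statistics import mean, variance
from typing import Sequence


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Cohen's d: difference of means divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    if na < 2 or nb < 2:
        raise ValueError("each sample needs at least 2 values")
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero")
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def d_to_r(d: float) -> float:
    """Convert Cohen's d to a correlation-type effect size r
    (assumption: both samples have the same size)."""
    return d / math.sqrt(d * d + 4.0)
```

Unlike the p-value, neither quantity shrinks or grows mechanically with the number of videos, which is why they are useful at this sample size.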

Experimental setup
Our forecasting system uses two inputs: the observed #views and an external stimulus source (either #shares or #tweets). Answering question 1), the significance of the performance difference between the Hawkes intensity model and the MLR baseline, boils down to comparing two treatments applied to a single set of individuals. This translates into applying a paired T-test to a single sample. Conversely, comparing the two sources of exogenous excitation involves applying the same forecasting

Related work
Studying behavior in social media is a very active research area; this work is most related to three sub-topics. This work measures the popularity and sharing activity about a video over time. There have been a number of well-known measurement studies of content, users, and popularity on social networks, including Twitter [14], YouTube videos [3,9], and news media [15], as well as specific measurements of popularity and auxiliary attributes such as locality [2]. Most current measurement studies cover a single social media network, or are done independently for different networks, whereas our work links the behavioral observations from two different networks on the same media item. A few early studies link the two networks for predictive tasks, including our own preliminary study on using Twitter feeds to predict YouTube popularity change [30], and a system by Yan et al. [28] for finding optimal Twitter followees to maximize video promotion on Twitter. Despite using similar data, a generative model that explains popularity using observed volume from multiple networks is new.
A number of models have been proposed to describe the shape and evolution of the volume of social media activity over time. The seminal meme-tracker [15] system uses a curve with polynomial increase followed by exponential decay to describe the sawtooth-shaped volume of news mentions. The SpikeM [20] system uses a fixed memory component (θ = 0.5), modulated by a periodic component, with no explicit account of external influence. Most recently, Tsytsarau et al. [27] model popularity volume as the convolution of two sequences, news event importance and media response, each of a predefined shape. Yang et al. [29] propose a generative model to describe sequences that have multiple progression stages, with algorithms to estimate model parameters and segment existing sequences. Hawkes processes are a popular model for describing social media: Simma and Jordan [25] use a Hawkes process to describe individual events rather than popularity volumes, whereas Linderman and Adams [16] use a Hawkes process to describe a network of nodes that influence each other. There are three key differences between this work and the popularity and event modeling work above: we use a self-excited Hawkes process to describe total attention, our model can recover all parameters from data, and we explain additional, non-stationary variations from linked data sources of external activities.
In terms of designing interventions to influence individual and aggregate behavior, Bollapragada et al. [1] used a mixed-integer program to schedule multiple airings of a television advertisement as evenly as possible, and Chierichetti, Kleinberg and Panconesi [4] proposed an algorithm to determine the sequence of activating nodes so as to maximize the total activation in a network. Liu, Slotine and Barabasi [19] apply linear control theory to networks and observe that sparse inhomogeneous networks are difficult to control (or steer to a desired state). The last part of this work investigates the effectiveness of different temporal schedules for promoting one video; it complements the findings on scheduling activation in networks [4].