1 Introduction

The augmented Dickey–Fuller (ADF) unit root test is the most popular of its kind, with countless applications. An issue that arises in the application of this test is the selection of the order of the lag augmentation, p. There are two considerations. On the one hand, for the test to be correctly sized in the presence of general ARMA errors it is important that p is allowed to increase with the size of the sample, T (see, for example, Said and Dickey 1984). The rate of increase is also important, for only if the rate is fast enough can one rely on conventional data-driven lag selection procedures, such as information criteria (see Ng and Perron 1995; Chang and Park 2002). On the other hand, Monte Carlo evidence indicates that larger values of p are generally associated with reduced power (see Lopez 1997; Ng and Perron 1995, 2001). Interestingly, while low power is one of the most well-known problems of the ADF test, as far as we are aware no one has yet derived any asymptotic power results for the case when p is allowed to increase with T. In fact, most studies, such as those of Said and Dickey (1984), Chang and Park (2002), and Xiao and Phillips (1998), only report the asymptotic distribution under the unit root null hypothesis, although there is typically some conjecture about the behaviour under the alternative that the largest AR root is local-to-unity (see Chang and Park 2002; Xiao and Phillips 1998). The only exceptions known to us are Ng and Perron (2001), whose results are designed specifically for the case when the errors follow a first-order MA process with a root that is local to \(-1\), and Paparoditis and Politis (2017), where the alternative is taken to be that the process is stationary. Both studies confirm that p is important, even asymptotically, and that it can in fact dominate the asymptotic behaviour of the ADF test.

In the present paper, we take the discussion of the last paragraph as our starting point. The purpose is to evaluate the local asymptotic distribution of the ADF test when the errors follow a general linear process driven by martingale difference innovations, which may exhibit conditional heteroskedasticity. The study may therefore be thought of as a local power extension of the study of Chang and Park (2002), who derived the asymptotic null distribution of the ADF test under the same assumption on the errors.

Notation: L is the lag operator, \(\rightarrow _p\), \(\rightarrow _w\) and \(=_d\) signify convergence in probability, weak convergence, and equality in distribution, respectively, and \(\Vert A\Vert = \sqrt{\mathrm {tr}(A'A)}\) is the Frobenius norm of any matrix A.

2 Model

The data generating process (DGP) of \(y_t\) is the same as in Chang and Park (2002), and is given by

$$\begin{aligned} y_t&= \alpha y_{t-1}+u_t, \end{aligned}$$
(1)
$$\begin{aligned} u_t&= \pi (L)\varepsilon _t, \end{aligned}$$
(2)

where \(y_0=0\), and \(\varepsilon _t\) and \(\pi (L)=\sum _{k=0}^{\infty }\pi _kL^k\) satisfy Assumptions 1 and 2, respectively.
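To fix ideas, the following minimal sketch simulates the DGP in (1)–(2) for the illustrative MA(1) case \(\pi (L) = 1 + \pi _1L\). All parameter values (T, c, \(\pi _1\), \(\sigma \)) are arbitrary choices for illustration and are not taken from the paper.

```python
import numpy as np

def simulate_dgp(T=200, c=-5.0, pi1=0.5, sigma=1.0, seed=0):
    """y_t = alpha*y_{t-1} + u_t with alpha = 1 + c/T and y_0 = 0,
    where u_t = eps_t + pi1*eps_{t-1} (Assumptions 1-3)."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, sigma, T + 1)   # eps_0 is used in u_1
    u = eps[1:] + pi1 * eps[:-1]          # MA(1) errors, u_t = pi(L)eps_t
    alpha = 1.0 + c / T                   # local-to-unity AR root
    y = np.zeros(T)
    y[0] = u[0]                           # y_1 = alpha*y_0 + u_1 with y_0 = 0
    for t in range(1, T):
        y[t] = alpha * y[t - 1] + u[t]
    return y

y = simulate_dgp()
```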

Assumption 1

\((\varepsilon _t,\mathcal {F}_t)\) is a martingale difference sequence with some filtration \((\mathcal {F}_t)\), \(\mathbf E (\varepsilon _t^2)=\sigma ^2\), \(T^{-1}\sum _{t=1}^T\varepsilon _t^2\rightarrow _p\sigma ^2\) and \(\mathbf E (|\varepsilon _t|^4)<\infty \).

Assumption 2

\(\pi (z)\ne 0\) for all \(|z|\le 1\), and \(\sum _{k=0}^\infty |k|^s|\pi _k|<\infty \) for some \(s\ge 1\).

Remark 1

Assumptions 1 and 2 are the same as in Chang and Park (2002), and are not very restrictive. The assumption that \(y_0=0\) is more restrictive than necessary, and can be relaxed, provided that \(y_0=O_p(1)\). The fact that there are no deterministic constant and trend terms is restrictive, but as we discuss later in Remark 3 the analysis can be easily extended to accommodate such terms. Note also that the initialization becomes irrelevant if the DGP contains (at least) a constant.

All the results of Chang and Park (2002) are derived under the unit root restriction that \(\alpha = 1\). The main contribution of the present paper is to investigate the effect of a violation of this restriction. The particular assumption that we are going to be working under is given by Assumption 3.

Assumption 3

\(\alpha = 1 + cT^{-1}\), where \(c\le 0\).

As in Chang and Park (2002), \(\pi (L)\) has the Beveridge–Nelson (BN) decomposition \(\pi (L) = \pi (1) - (1-L)\bar{\pi }(L)\), where \(\bar{\pi }(L) = \sum _{k=0}^\infty \bar{\pi }_kL^k\) and \(\bar{\pi }_k = \sum _{i=k+1}^\infty \pi _i\) (see Phillips and Solo 1992, Lemma 2.3). We can therefore write

$$\begin{aligned} u_t= \pi (1)\varepsilon _{t} - \Delta \bar{u}_{t}, \end{aligned}$$
(3)

where \(\bar{u}_{t} = \sum _{k=0}^\infty \bar{\pi }_k \varepsilon _{t-k}\). Assumption 3 implies

$$\begin{aligned} y_t = \sum _{k=1}^t \alpha ^{t-k} u_k = \pi (1)w_{t} - r_t, \end{aligned}$$
(4)

where \(w_t = \sum _{k=1}^t \alpha ^{t-k}\varepsilon _{k}\) and \(r_t = \sum _{k=1}^t \alpha ^{t-k}\Delta \bar{u}_{k}\).
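The BN decomposition in (3) is easy to verify numerically. In the illustrative MA(1) case \(\pi (L) = 1 + \pi _1L\) we have \(\pi (1) = 1 + \pi _1\), \(\bar{\pi }_0 = \pi _1\) and \(\bar{\pi }_k = 0\) for \(k\ge 1\), so that \(\bar{u}_t = \pi _1\varepsilon _t\). The following sketch, with illustrative parameter values, confirms that (3) then holds exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
pi1, T = 0.5, 200
eps = rng.normal(size=T + 1)                 # eps_0,...,eps_T
u = eps[1:] + pi1 * eps[:-1]                 # u_t = pi(L)eps_t, t = 1,...,T
ubar = pi1 * eps[1:]                         # ubar_t = pi1*eps_t
bn = (1 + pi1) * eps[1:]                     # pi(1)*eps_t
bn[1:] -= ubar[1:] - ubar[:-1]               # subtract Delta(ubar_t), t >= 2
assert np.allclose(u[1:], bn[1:])            # (3) holds exactly for t >= 2
print("BN decomposition verified")
```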

Under Assumptions 1 and 2, \(\pi (L)\) can be inverted, giving

$$\begin{aligned} \theta (L)u_t =\varepsilon _t, \end{aligned}$$
(5)

where \(\theta (L)= \pi (L)^{-1} = 1 - \sum _{k=1}^{\infty }\theta _kL^k\) (see Chang and Park 2002). The purpose of this paper is to investigate the effect of truncating this infinite-order AR process at lag p. Let us therefore define \(\delta _p(L) = \sum _{k=1}^{p}\theta _kL^{k-1}\), \(\delta ^p(L) = \sum _{k=p+1}^{\infty }\theta _kL^{k-1}\) and \(\delta (L) = \delta _p(L)+\delta ^p(L)\), such that \(\theta (L) = 1-\delta (L)L\). In this notation,

$$\begin{aligned} u_t = \delta _p(L)u_{t-1} + \varepsilon _{p,t}, \end{aligned}$$
(6)

where

$$\begin{aligned} \varepsilon _{p,t} = \varepsilon _{t} + \delta ^p(L)u_{t-1}. \end{aligned}$$
(7)

By using this and the fact that \(u_t = y_t - \alpha y_{t-1} = \Delta y_t - (\alpha -1) y_{t-1}\), so that (6) becomes \(\Delta y_t - (\alpha -1) y_{t-1} = \delta _p(L)[\Delta y_{t-1} - (\alpha -1) y_{t-2}] + \varepsilon _{p,t}\), we obtain the following equation for \(y_t\):

$$\begin{aligned} y_t = \alpha y_{t-1}+ \delta _p(L) \Delta y_{t-1} - \delta _p(L)(\alpha -1) y_{t-2} + \varepsilon _{p,t} . \end{aligned}$$
(8)
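The construction in (5)–(8) can be illustrated numerically. The sketch below recovers the \(\theta _k\) coefficients by power-series inversion of \(\pi (L)\) for the illustrative MA(1) case, in which \(\theta _k = -(-\pi _1)^k\), and shows that the omitted-lag remainder \(\delta ^p(1)\) vanishes geometrically as p grows; the value \(\pi _1 = 0.5\) is an arbitrary choice.

```python
import numpy as np

def invert_pi(pi_coefs, K):
    """Coefficients a_0,...,a_K of theta(L) = pi(L)^{-1}, obtained by
    solving pi(L)theta(L) = 1; the paper's theta_k equals -a_k, k >= 1."""
    a = np.zeros(K + 1)
    a[0] = 1.0                               # pi_0 = theta_0 = 1
    for k in range(1, K + 1):
        a[k] = -sum(pi_coefs[j] * a[k - j]
                    for j in range(1, min(k, len(pi_coefs) - 1) + 1))
    return a

pi1 = 0.5
theta = -invert_pi(np.array([1.0, pi1]), K=20)[1:]   # theta_1, theta_2, ...
print(theta[:4])                             # 0.5, -0.25, 0.125, -0.0625
# |delta^p(1)| = |sum_{k>p} theta_k| shrinks geometrically in p, so the
# truncation error eps_{p,t} - eps_t in (7) is negligible for moderate p.
print([abs(theta[p:].sum()) for p in (2, 4, 8)])
```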

At this point, it would seem natural given the approach of Chang and Park (2002) to take \(\alpha y_{t-1}+ \delta _p(L) \Delta y_{t-1}\) as the approximating regression function, and \(\varepsilon _{p,t}- \delta _p(L)(\alpha -1) y_{t-2}\) as the approximation error. But while this is indeed a possibility, there is a much more elegant approach. To fix ideas, let us write the regression model to be estimated by ordinary least squares (OLS) as

$$\begin{aligned} y_t = \beta y_{t-1}+ \beta _p(L) \Delta y_{t-1} + e_{p,t} , \end{aligned}$$
(9)

where \(\beta \) and \(\beta _p(L)\) are reduced form coefficients, and \(e_{p,t}\) is a reduced form error term. We now write these reduced form quantities in terms of the components of the DGP. We begin by noting that

$$\begin{aligned} -\delta _p(L)(\alpha -1) y_{t-2} = \delta _p(L)(\alpha -1) \Delta y_{t-1} - \delta _p(L)(\alpha -1) y_{t-1}. \end{aligned}$$
(10)

Consider the last term on the right. Similarly to the BN decomposition for infinite polynomials, we may decompose \(\delta _p(L) = \delta _p(1) - (1-L)\bar{\delta }_p(L)\), where \(\bar{\delta }_p(L) = \sum _{k=1}^{p-1} \bar{\delta }_{p,k}L^{k-1}\) and \(\bar{\delta }_{p,k} = \sum _{n=k+1}^p \theta _n\). This implies

$$\begin{aligned}{}[\alpha - \delta _p(L)(\alpha -1)]y_{t-1}&= [\alpha - \delta _p(1)(\alpha -1)]y_{t-1} - [\delta _p(L) - \delta _p(1)](\alpha -1)y_{t-1}\nonumber \\&= [\alpha - \delta _p(1)(\alpha -1)]y_{t-1} + (\alpha -1)\bar{\delta }_p(L)\Delta y_{t-1}. \end{aligned}$$
(11)

Hence, by collecting the terms,

$$\begin{aligned} y_t&= \alpha y_{t-1}+ \delta _p(L) \Delta y_{t-1} - \delta _p(L)(\alpha -1) y_{t-2} + \varepsilon _{p,t} \nonumber \\&= [\alpha - \delta _p(L)(\alpha -1)]y_{t-1} + \alpha \delta _p(L) \Delta y_{t-1} + \varepsilon _{p,t} \nonumber \\&= [\alpha - \delta _p(1)(\alpha -1)]y_{t-1} + [\alpha \delta _p(L)+(\alpha -1)\bar{\delta }_p(L)] \Delta y_{t-1} + \varepsilon _{p,t} , \end{aligned}$$
(12)

which is (9) with

$$\begin{aligned} \beta&= \alpha - \delta _p(1)(\alpha -1), \end{aligned}$$
(13)
$$\begin{aligned} \beta _p(L)&= \alpha \delta _p(L)+(\alpha -1)\bar{\delta }_p(L), \end{aligned}$$
(14)
$$\begin{aligned} e_{p,t}&= \varepsilon _{p,t}. \end{aligned}$$
(15)

This is important, for (at least) two reasons. One reason is that it shows that, unless \(\alpha = 1\) (\(c=0\)), in which case \(\beta = \alpha \), \(\alpha \) is not identified. This means that in the regression to be estimated the drift away from a unit root is not determined by c alone, but is in fact affected also by \(\delta _p(1)\), as is clear from

$$\begin{aligned} \beta = 1 + [1 - \delta _p(1)]cT^{-1}. \end{aligned}$$
(16)

This has implications for studies such as Moon and Phillips (2000) and Phillips et al. (2001), where the purpose is to estimate c. Another reason why the above result is important is that it shows how the regression error in (9) is exactly the same as under the unit root null. This is very convenient, in that once the model has been reparameterized as in (9), most of the main results regarding the accuracy of the approximation can be taken more or less directly from Chang and Park (2002). However, this requires \(p\rightarrow \infty \). It is therefore convenient to treat p as a function of T.
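To get a sense of the magnitude of the effect in (16), consider the following back-of-the-envelope sketch for the illustrative MA(1) case, in which \(\theta _k = -(-\pi _1)^k\); all parameter values are arbitrary.

```python
# delta_p(1) = sum_{k=1}^p theta_k with theta_k = -(-pi1)^k (MA(1) case).
T, c, pi1, p = 200, -5.0, 0.5, 4
delta_p1 = sum(-(-pi1) ** k for k in range(1, p + 1))   # = 0.3125
alpha = 1 + c / T                     # true local-to-unity root, 0.975
beta = 1 + (1 - delta_p1) * c / T     # pseudo-true coefficient, ~0.9828
print(alpha, beta)                    # the regression "sees" beta, not alpha
```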

Assumption 4

\(pT^{-1/2} \rightarrow 0\) as \(p,\,T\rightarrow \infty \).

Assumption 4 restricts the rate at which p is allowed to increase with T, but is weak enough to enable lag selection by standard information criteria, such as AIC and BIC.
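A minimal sketch of BIC-based lag selection consistent with Assumption 4 is given below. The regression is (9) without deterministic terms, and the maximum lag \(p_{\max } = \lfloor T^{0.49}\rfloor \) is an illustrative choice satisfying \(pT^{-1/2}\rightarrow 0\).

```python
import numpy as np

def adf_regression(y, p):
    """OLS of y_t on y_{t-1} and Delta y_{t-1},...,Delta y_{t-p},
    with no deterministic terms."""
    dy = np.diff(y)
    X = np.column_stack([y[p:-1]] +
                        [dy[p - k:len(dy) - k] for k in range(1, p + 1)])
    yy = y[p + 1:]
    coef, *_ = np.linalg.lstsq(X, yy, rcond=None)
    resid = yy - X @ coef
    return coef, resid, X

def select_p_bic(y):
    """Choose p in {1,...,floor(T^0.49)} by minimizing the BIC.
    (Common practice holds the estimation sample fixed across p;
    for simplicity this sketch does not.)"""
    pmax = int(np.floor(len(y) ** 0.49))
    bics = []
    for p in range(1, pmax + 1):
        _, resid, _ = adf_regression(y, p)
        Teff = len(resid)
        bics.append(np.log(resid @ resid / Teff)
                    + (p + 1) * np.log(Teff) / Teff)
    return int(np.argmin(bics)) + 1
```

AIC selection obtains by replacing the \(\log T_{\mathrm{eff}}\) penalty with 2; the selected lag can then be plugged into the ADF regression below.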

3 The ADF test statistic and its local asymptotic distribution

Let

$$\begin{aligned} A_T&= \sum _{t=1}^Ty_{t-1}\varepsilon _{p,t}-\left( \sum _{t=1}^Ty_{t-1}x_{p,t}' \right) \left( \sum _{t=1}^Tx_{p,t}x_{p,t}' \right) ^{-1} \left( \sum _{t=1}^Tx_{p,t}\varepsilon _{p,t} \right) \end{aligned}$$
(17)
$$\begin{aligned} B_T&= \sum _{t=1}^Ty_{t-1}^2-\left( \sum _{t=1}^Ty_{t-1}x_{p,t}' \right) \left( \sum _{t=1}^Tx_{p,t}x_{p,t}' \right) ^{-1} \left( \sum _{t=1}^Tx_{p,t}y_{t-1} \right) \end{aligned}$$
(18)
$$\begin{aligned} C_T&= \sum _{t=1}^T\varepsilon _{p,t}^2-\left( \sum _{t=1}^T\varepsilon _{p,t}x_{p,t}' \right) \left( \sum _{t=1}^Tx_{p,t}x_{p,t}' \right) ^{-1} \left( \sum _{t=1}^Tx_{p,t}\varepsilon _{p,t} \right) , \end{aligned}$$
(19)

where \(x_{p,t}=(\Delta y_{t-1},\ldots ,\Delta y_{t-p})'\). It is important to remember that the OLS estimator of the coefficient of \(y_{t-1}\) in (9) is not really estimating \(\alpha \), but rather \(\beta \). Let us therefore consider the OLS estimator \(\hat{\beta }\) of \(\beta \) and its standard error, which are such that

$$\begin{aligned} \hat{\beta }&= \beta +A_TB_T^{-1}, \end{aligned}$$
(20)
$$\begin{aligned} s(\hat{\beta })^2&= \hat{\sigma }^2 B_T^{-1}, \end{aligned}$$
(21)

where \(\hat{\sigma }^2 = T^{-1}(C_T-A_T^2B_T^{-1})\). The test statistic of interest is the usual ADF statistic, which is given by

$$\begin{aligned} ADF = \frac{\hat{\beta }-1}{s(\hat{\beta })}. \end{aligned}$$
(22)
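The statistic in (22) is straightforward to compute by OLS. The following sketch does so directly from the regression (9); the design matrix construction repeats the helper above so that the block is self-contained.

```python
import numpy as np

def adf_stat(y, p):
    """The t-ratio (hat{beta} - 1)/s(hat{beta}) from the OLS fit of (9)."""
    dy = np.diff(y)
    X = np.column_stack([y[p:-1]] +
                        [dy[p - k:len(dy) - k] for k in range(1, p + 1)])
    yy = y[p + 1:]
    coef, *_ = np.linalg.lstsq(X, yy, rcond=None)
    resid = yy - X @ coef
    sigma2 = resid @ resid / len(yy)                  # hat{sigma}^2
    se_beta = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[0, 0])  # s(hat{beta})
    return (coef[0] - 1.0) / se_beta

# e.g. adf = adf_stat(y, p=4)
```

By the Frisch–Waugh theorem, the (1,1) element of \((X'X)^{-1}\) equals \(B_T^{-1}\), so this t-ratio coincides with (20)–(22).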

Lemmas 1 and 2, which are analogous to Lemmas 3.1 and 3.2 of Chang and Park (2002), are key in deriving the local asymptotic distribution of ADF.

Lemma 1

Under Assumptions 1–3,

where \(w_t = \sum _{n=1}^t \alpha ^{t-n}\varepsilon _{n}\).

Lemma 2

Under the conditions of Lemma 1,

The proofs of Lemmas 1 and 2 are almost identical to the proofs of Lemmas 3.1 and 3.2 in Chang and Park (2002), and are therefore omitted. The only difference is the presence of \(\alpha \) in \(w_t\), which does not affect the derivations. Lemmas 1 and 2 imply that

$$\begin{aligned} T^{-1}A_T&=\pi (1)T^{-1}\sum _{t=1}^Tw_{t-1}\varepsilon _{t} + o_p(1) \end{aligned}$$
(23)
$$\begin{aligned} T^{-2}B_T&=\pi (1)^2 T^{-2}\sum _{t=1}^Tw_{t-1}^2 + O_p(pT^{-1}) \end{aligned}$$
(24)
$$\begin{aligned} T^{-1}C_T&= T^{-1}\sum _{t=1}^T\varepsilon _{p,t}^2 + o_p(p^{-1}), \end{aligned}$$
(25)

where the remainder terms are all \(o_p(1)\) under Assumption 4. In view of Lemma 1 (c), this implies

$$\begin{aligned} \hat{\sigma }^2&= T^{-1}(C_T-A_T^2B_T^{-1})= T^{-1}C_T-T^{-1}(T^{-1}A_T)^2(T^{-2}B_T)^{-1}=T^{-1}C_T + o_p(1)\nonumber \\&= T^{-1}\sum _{t=1}^T \varepsilon _{t}^2 + o_p(1) \rightarrow _p \sigma ^2 \end{aligned}$$
(26)

(see Chang and Park 2002, Proof of Lemma 3.3). Let us now consider ADF. Note how \(\beta -1 = c[1 - \delta _p(1)]T^{-1}\). Together with Lemmas 1 and 2, this implies

$$\begin{aligned} ADF&= \frac{\hat{\beta }-\beta }{s(\hat{\beta })} + \frac{\beta -1}{s(\hat{\beta })} \nonumber \\&= \hat{\sigma }^{-1}\left[ T^{-1}A_T(T^{-2}B_T)^{-1/2} + c[1 - \delta _p(1)] \sqrt{T^{-2}B_T}\right] \nonumber \\&= \sigma ^{-1}\left[ \frac{T^{-1}\sum _{t=1}^Tw_{t-1}\varepsilon _{t}}{\sqrt{T^{-2}\sum _{t=1}^Tw_{t-1}^2}} + c[1 - \delta _p(1)]\pi (1) \left( T^{-2}\sum _{t=1}^Tw_{t-1}^2\right) ^{1/2} \right] + o_p(1). \end{aligned}$$
(27)

The asymptotic distribution of the right-hand side is easily evaluated using the results provided in Hansen (1995) for the finite-order AR case, and is summarized in Theorem 1.

Theorem 1

Under Assumptions 1–4,

$$\begin{aligned} ADF \rightarrow _w \frac{\int _{r=0}^1 J_c(r)dW(r)}{\sqrt{\int _{r=0}^1 J_c(r)^2 dr}} + c \lim _{p\rightarrow \infty } [1 - \delta _p(1)]\pi (1) \cdot \left( \int _{r=0}^1 J_c(r)^2 dr\right) ^{1/2}, \end{aligned}$$

where \(J_c(r)=\int _{v=0}^r \exp [c(r-v)]dW(v)\) with W(r) being a standard Brownian motion on \(r\in [0,1]\).

Phillips (1987) considers the (non-augmented) Dickey–Fuller test statistic in the case of serially uncorrelated errors. The difference between the local asymptotic distribution reported in Theorem 1 and the one given in Phillips (1987) is the presence of \([1 - \delta _p(1)]\pi (1)\). It is therefore interesting to consider briefly the behaviour of this term. Note how \(\theta (1) = 1-\delta (1)\), which implies \([1 - \delta _p(1)] \rightarrow \theta (1)\) as \(p\rightarrow \infty \). But \(\theta (1) = \pi (1)^{-1}\), and so

$$\begin{aligned} \lim _{p\rightarrow \infty }[1 - \delta _p(1)]\pi (1) = 1. \end{aligned}$$
(28)
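This limit is easy to check numerically. For the illustrative MA(1) case \(\theta _k = -(-\pi _1)^k\), so \(\delta _p(1)\) can be accumulated directly; the product \([1-\delta _p(1)]\pi (1)\) converges (with oscillation) to 1 as p grows.

```python
import numpy as np

pi1 = 0.5                                  # illustrative MA(1) coefficient
for p in (1, 2, 4, 8, 16):
    delta_p1 = sum(-(-pi1) ** k for k in range(1, p + 1))
    print(p, (1 - delta_p1) * (1 + pi1))   # 0.75, 1.125, 1.031, ... -> 1
```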

The effect of the truncation on the asymptotic distribution of the ADF test statistic is therefore negligible. This finding is in stark contrast to the results reported by Ng and Perron (2001) and Paparoditis and Politis (2017), where the effect of p is non-negligible. In practice, of course, p is fixed, which means that in general \([1 - \delta _p(1)]\pi (1) \ne 1\). The asymptotic null distribution of ADF under \(c=0\) is given by

$$\begin{aligned} ADF \rightarrow _w \frac{\int _{r=0}^1 W(r)dW(r)}{\sqrt{\int _{r=0}^1 W(r)^2 dr}}, \end{aligned}$$
(29)

which is independent of \([1 - \delta _p(1)]\pi (1)\). One of the effects of the truncation is therefore to affect the drift of the distribution under the alternative hypothesis that \(c<0\). Hence, while asymptotically negligible, p is expected to affect power in finite samples. This prediction is in agreement with the bulk of the existing Monte Carlo evidence (see, for example, Ng and Perron 1995). In fact, the local power predictions derived here seem very accurate, even when compared with the predictions of Paparoditis and Politis (2017) for data that are generated as stationary. Let us explain what we mean by this. Paparoditis and Politis (2017) show that the power of the ADF test against stationary alternatives should be decreasing in p, even asymptotically. This is their theoretical prediction. They then simulate power under \(\alpha \in \{0.985, 0.97\}\), \(\pi (L) = 1 + \pi _1L\), \(\pi _1\in \{-0.5,0.5\}\), \(T\in \{50, 100, 200, 400, 800, 1600\}\) and \(p=T^a\) with a going from 0.05 to 0.49 in steps of 0.04. Except for the non-local specification of \(\alpha \), this is consistent with the DGP considered here. Note in particular how p satisfies our Assumption 4. According to the results reported in their Table 6 for the case when \(\alpha = 0.97\) and \(\pi _1 = -0.5\) (in which the effect of p is most pronounced), power decreases almost monotonically from 0.17 at \(a=0.05\) to 0.09 at \(a = 0.49\) when \(T=50\), but is flat at 1 when \(T= 1600\). Clearly, this finding does not fit well with the prediction that power should always decrease with increases in p. It is, however, consistent with our prediction that the effect of p should tend to diminish as T increases.
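A small Monte Carlo sketch along these lines is given below. The design values, the number of replications, and the one-sided 5% critical value of \(-1.95\) for the no-deterministics case are all illustrative choices, not taken from the paper.

```python
import numpy as np

def adf_stat(y, p):                        # as sketched above
    dy = np.diff(y)
    X = np.column_stack([y[p:-1]] +
                        [dy[p - k:len(dy) - k] for k in range(1, p + 1)])
    yy = y[p + 1:]
    coef, *_ = np.linalg.lstsq(X, yy, rcond=None)
    resid = yy - X @ coef
    s2 = resid @ resid / len(yy)
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[0, 0])
    return (coef[0] - 1.0) / se

rng = np.random.default_rng(0)
T, c, pi1, reps = 400, -10.0, -0.5, 1000
for p in (1, 4, 8):
    rej = 0
    for _ in range(reps):
        eps = rng.normal(size=T + 1)
        u = eps[1:] + pi1 * eps[:-1]       # MA(1) errors
        y = np.zeros(T)
        y[0] = u[0]
        for t in range(1, T):
            y[t] = (1 + c / T) * y[t - 1] + u[t]
        rej += adf_stat(y, p) < -1.95      # one-sided 5% test
    print(p, rej / reps)                   # rejection frequency by p
```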

Remark 2

As already mentioned, Chang and Park (2002) only consider the asymptotic distribution under the unit root null. They also claim (without proof) in their Remark 3.2 that the asymptotic distribution under Assumption 3 with \(c\ne 0\) should be the same, but with W(r) replaced by \(J_c(r)\). In order to assess the validity of this claim, note how \(dJ_c(r)=cJ_c(r)dr + dW(r)\), implying

$$\begin{aligned} \frac{\int _{r=0}^1 J_c(r)dJ_c(r)}{\sqrt{\int _{r=0}^1 J_c(r)^2 dr}} = \frac{\int _{r=0}^1 J_c(r)dW(r)}{\sqrt{\int _{r=0}^1 J_c(r)^2 dr}} + c \left( \int _{r=0}^1 J_c(r)^2 dr\right) ^{1/2}, \end{aligned}$$
(30)

which is exactly the local asymptotic distribution reported by Phillips (1987). The fact that this distribution is also the limit of the local asymptotic distribution in Theorem 1 as \(p\rightarrow \infty \) confirms that the claim of Chang and Park (2002) is in fact correct.

Remark 3

As discussed in Remark 3.1 of Chang and Park (2002), DGPs with deterministic constant and trend terms can be easily accommodated. Such an extension is interesting not only in its own right, but also because it shows how the results reported here extend to other unit root tests. Let us therefore use \(z_t\) to denote the observed data. A common way to accommodate deterministic constant and trend terms is through the following components model: \(z_t = \mu + \tau t + y_t\), where \(y_t\) is as in (1). In this DGP, testing for a unit root in \(z_t\) is equivalent to testing for a unit root in \(y_t\). The problem is how to purge the effect of the deterministic terms. Chang and Park (2002) discuss the case when this is done through an auxiliary OLS regression of \(z_t\) onto a constant, or a constant and trend. In this case, the results reported in this paper are the same, except that \(J_c(r)\) has to be replaced by its suitably demeaned or detrended version, \(J_c^d(r)\) say. Specifically, while in the constant-only case \(J_c^d(r) = J_c(r)-\int _{v=0}^1J_c(v)dv\), in the case with both a constant and trend \(J_c^d(r)= J_c(r)+(6r-4)\int _{v=0}^1J_c(v)dv-(12r-6)\int _{v=0}^1vJ_c(v)dv\). An alternative to OLS is to perform generalized least squares (GLS) under the local alternative, as first suggested by Elliott et al. (1996). As Westerlund (2014) shows, except for the term \([1 - \delta _p(1)]\pi (1)\), the asymptotic distribution of the resulting ADF–GLS test in the constant-only case is identical to the one given in Theorem 1. The results reported here regarding the effect of p therefore apply also to this other test. Another possibility is to follow, for example, Shin and So (2001) and to perform the OLS demeaning recursively. The asymptotic distribution in this case is again the same as in Theorem 1, but now with \(J_c(r)\) replaced by \(J_c^d(r) = J_c(r)- r^{-1}\int _{v=0}^r J_c(v)dv\). The asymptotic distributions of these other tests in the trend case do not have the same form as in Theorem 1, but the effect of p is still expected to be negligible. Moreover, these results extend quite naturally to the bulk of the existing panel data unit root tests, which are typically nothing but panel extensions of known time series tests (see, for example, Westerlund 2016, for a discussion of the issue of parametric lag correction in the panel data context).
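For concreteness, the following sketch implements the three constant-only detrending schemes discussed above: OLS demeaning, GLS demeaning in the spirit of Elliott et al. (1996), and the recursive demeaning of Shin and So (2001). The transformed series would then be fed into the ADF regression; the noncentrality value \(\bar{c} = -7\) conventionally used for GLS demeaning in the constant case is quoted here as an assumption, not from the paper.

```python
import numpy as np

def demean_ols(z):
    """OLS demeaning: subtract the full-sample mean."""
    return z - z.mean()

def demean_gls(z, cbar=-7.0):
    """GLS demeaning under the local alternative alpha_bar = 1 + cbar/T:
    OLS of the quasi-differenced z on the quasi-differenced constant."""
    T = len(z)
    abar = 1 + cbar / T
    zq = np.r_[z[0], z[1:] - abar * z[:-1]]        # quasi-differenced z
    xq = np.r_[1.0, (1 - abar) * np.ones(T - 1)]   # quasi-differenced constant
    muhat = (xq @ zq) / (xq @ xq)                  # GLS estimate of mu
    return z - muhat

def demean_recursive(z):
    """Recursive demeaning: z_t - t^{-1} sum_{s<=t} z_s."""
    t = np.arange(1, len(z) + 1)
    return z - np.cumsum(z) / t
```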