斷點回歸設計的實現

最新 04-01

最近在做一個需要利用斷點回歸設計的研究。為了保證實踐的規範性，並且避免未來審稿中可能面對的質疑，花了幾天時間梳理了一下斷點回歸設計的標準操作，整理出來，供來人參考。

本文參考了三篇文獻，先擺在這裡，建議大家去讀原文：

第一是香樟經濟學圈發表的基於Lee, and Lemieux, 2010，" Regression Discontinuity Designs in Economics "，Journal of Economic Literature, Vol. 48: 281–355.的公眾號推文，【香樟推文0620】運用斷點回歸設計做研究的規定動作

第二是2017年AER論文Pinotti, Paolo. "Clicking on heaven"s door: The effect of immigrant legalization on crime."American Economic Review107.1 (2017): 138-68.（https://pubs.aeaweb.org/doi/pdfplus/10.1257/aer.20150355）

第三是一篇實際操作的比較Thoemmes, Felix, Wang Liao, and Ze Jin. "The Analysis of the Regression-Discontinuity Design in R."Journal of Educational and Behavioral Statistics42.3 (2017): 341-360.（http://journals.sagepub.com/doi/pdf/10.3102/1076998616680587）

香樟推文中的規定動作

第1步檢查配置變數（assignment variable，又叫running variable、forcing variable）是否被操縱。

這裡的配置變數，其實就是RD中決定是否進入實驗的分數（Score），是否被操縱的意思就是，是否存在某種跳躍性的變化。在實際操作中有兩種方式來檢驗，一是畫出配置變數的分布圖。最直接的方法，是使用一定數量的箱體（bin），畫出配置變數的歷史直方圖（histogrm）。為了觀察出分布的總體形狀，箱體的寬度要盡量小。頻數（frequencies）在箱體間的跳躍式變化，能就斷點處的跳躍是否正常給我們一些啟發。從這個角度來說，最好利用核密度估計做出一個光滑的函數曲線。二是利用McCrary（2008）的核密度函數檢驗。（命令是DCdensity，介紹見陳強編著的《高級計量經濟學及Stata應用》（第二版）第569頁）, Frandsen (2013)提出了一種新的檢驗方法，但目前被使用的並不多。

第2步畫因變數均值對配置變數的散點圖，並選擇帶寬（bandwidth selection）。

首先，挑選出一定數目的箱體，求因變數在每個箱體內的均值，畫出均值對箱體中間點的散點圖。一定要畫每個箱體平均值的圖。如果直接畫原始數據的散點圖，那麼噪音太大，看不出潛在函數的形狀。不要畫非參數估計的連續統，因為這個方法自然地傾向於給出存在斷點的印象，儘管總體中本來不存在這樣的斷點。

然後，選擇第三步驟中需要的帶寬。Lee和Lemieux（2010）介紹了兩種確定最優帶寬的方法：拇指規則法（rule of thumb）和交叉驗證法（CV）。還有另外兩種比較受關注的方法：IK法和CCT法。IK法以Imbens和Kalyanaraman兩個人命名，對應著論文Imbens和Kalyanaraman（2012）。這篇論文發表在Review of Economic Studies，Lee和Lemieux（2010）文中提到過此文2009年的NBER工作論文版。CCT法以Calonico、Cattaneo和Titiunik三個人命名，對應著論文Calonico、Cattaneo和Titiunik（2014a）。用非參數法做斷點回歸估計時的stata命令rd，就是用IK發確定最優帶寬。stata命令rdrobust、rdbwselect，提供CV、IK、CCT三種不同的最優帶寬計算方法選項。但是實際上rdrobust中已經更新了IK帶寬選擇函數，更新的演算法與IK演算法的區別有待考證，後續會補充。實際操作中一般是兩種演算法都會採納，並彙報參數估計對帶寬選擇是不敏感的。

第3步估計，又分為參數估計和非（半）參數估計兩種方案。

在Sharp RD情形，參數法將Y在每個箱體內的均值作為因變數，用處理變數、配置變數的多次項作為自變數，在斷點兩邊分別跑回歸，得到斷點左右兩邊因變數的擬合值，兩個擬合值的差值便是我們想估計的實驗對因變數的因果效應。將這些擬合值畫在第2步的圖中，並用光滑的曲線連接起來。在推文人讀過的RD論文中，多次項一般都使用1到4次項，但沒有論文解釋為什麼只用到4次項。半參數的方法便是用非參數估計的方法替代斷點兩邊估計因變數的擬合。對於Fuzzy的情形，參數估計意味著將配置變數（score）以及配置變數與是否超過斷點的乘積(score*above_cutoff)作為實驗變數的工具變數來進行兩階段最小二乘估計，實際應用中往往聯合使用score, score*above_cutoff,score^2,score^2* above_cutoff作為估計的工具變數，見第二部分的例子。非參數估計的做法是，利用核密度函數局部現性回歸來代替2SLS裡面一般線性回歸，rdrobust命令可以直接實現這種估計。

第4步檢驗前定變數在斷點處是否跳躍，這一步的目的是證明不存在跳躍，否則就麻煩了。

前定變數指的是那些在實驗之前已經確定的變數，例如，發生在2008年的實驗，那些2007年的觀測值便是前定的，理論上這些變數是不應該在斷點出跳躍的。此步和第1步是RD方法的適用性檢驗。此步的檢驗包括兩項內容：1.像前三步那樣畫前定變數的圖。無論參數還是非參數，RD研究都要大把的圖！這些圖在正式發表的論文中都必不可少！原文中說了這麼句話：用RD做的論文，如果缺乏相關的圖，十有八九是因為圖顯示的結果不好，作者故意不報告。2.將前定變數作為因變數，將常數項、處理變數、配置變數多次項、處理變數和配置變數多次項的交互項作為自變數，跑回歸。一個前定變數有一個回歸，看所有回歸中處理變數的係數估計是否都為。檢驗這種跨方程的假設，需要用似不相關回歸（Seemingly Unrelated Regression, SUR）（命令是sureg，用法見陳強編著的《高級計量經濟學及Stata應用》（第二版）第471-474頁）。在推文人讀過的RD實證論文中（尤其是AER2015-2016年所有用RD做的論文中），均沒用SUR，只是簡單的看每個回歸中處理變數的係數估計均為。

第5步檢驗結果對不同帶寬、不同多項式次數的穩健性。

嘗試的其它帶寬，一般是最優帶寬的一半和兩倍。挑選多項式的最優次數，可用赤池信息準則（Akaike"s Information Criterion，AIC）。在我們嘗試的包含配置變數1次方、2次方、……N次方的眾多方程中，AIC取值最小的那個就是我們想要的。實操時，試到多少次為好？Gelman和Imbens（2014）的NBER工作論文說，試到N次的做法要不得，最多只能搞到2次。原因待我閱讀完原文之後補充。

第6步檢驗結果對加入前定變數的穩健性。

如上所述，如果不能操控配置變數的假設成立，那麼無論前定變數與因變數的相關性有多高，模型中加入前定變數都不應該影響處理效應的估計結果。如果加入前定變數導致處理效應的估計結果變化較大，那麼配置變數可能存在排序現象，前定變數在斷點處也很可能存在跳躍。實操時在確定多項式的次數後，直接在回歸方程中加入前定變數。如果這導致處理效應估計值大幅變化或者導致標準誤大幅增加，那麼可能意味著函數中多項式的次數不正確。另外一個檢驗是殘差化，看相同次數的多項式模型對殘差的擬合好不好。

Pinotti, Paolo(2017) AER論文的實際操作

Replication大法好！否則就會「讀了那麼多文獻還是做不好研究。這篇論文的研究問題是移民獲得合法身份後的犯罪是否會減少。其中，實驗變數是是否獲得合法身份，因變數是移民申請人在申請合法身份後的第一年（2008年）是否有犯罪記錄，這裡面的選擇偏誤是合法身份並不是隨機發放給移民申請人的，那些預期犯罪更少的移民更有機會（有僱主幫助申請）和動力（花更多的時間和精力去準備申請）來申請合法身份，直接對比犯罪率會誇大合法身份的作用（使負向係數更小）。為了克服這個選擇偏誤，作者利用了義大利的自然實驗，義大利的移民身份申請是先到先得的，也就是在系統開放後，申請時間越晚中籤率會越低，作者發現申請遞交時間timing上有一個斷點，當申請人晚於這個斷點提交申請書時，會導致其中籤率跳躍式下滑，但不至於完全為零，於是作者找到了一個Fuzzy斷點情景。下面我們看看作者是怎麼操作的。其實這個情景很像上海的機動車抽籤系統。

2.1配置與處置變數的散點圖

在實際操作中，作者Pinotti在描述政策背景的時候直接彙報了配置與處置變數的散點圖（5min作為箱體），如下圖

然後作者使用Andrews(1993)檢驗的方法檢驗存在結構性斷點，但是作者沒有在正文與附錄中彙報該檢驗的結果。

2.2確定斷點

作者針對每一個「搖號點」，用Andrew(1993)的方案找出「the most likely break points」，同樣作者沒有彙報cutoff point的詳細數據以及尋找過程。僅僅在附件中彙報了下圖。

2.3全局2SLS估計

作者將樣本人為限制在cutoff point附近半個小時以內樣本。

（1）首先彙報了因變數（2008年犯罪率）與前定變數（2007年犯罪率）與配置變數的散點圖及其置信區間。

上圖的實際實現分為兩步，第一步是用是否大於cutoff point的dummy與score的四次項多項式回歸，得出不同分數的擬合值與置信區間，第二步是畫出散點圖與擬合曲線圖。以2007年為例，stata實現為：

bys bin: egen mean=mean(serious07)

reg serious07 ontime mindelay mindelay2 ontimexmindelay ontimexmindelay2, robust

predict fit

predict fitsd, stdp

gen upfit=fit+1.645*fitsd

gen downfit=fit-1.645*fitsd

preserve

twoway (rarea upfit downfit mindelay, sort fcolor(gs12) lcolor(gs12)) ///

(line fit mindelay if mindelay0, sort lcolor(red) lwidth(thick)) (scatter mean midbin, msize(large) mcolor(black) msymbol(circle_hollow)), ///

ytitle("") xtitle("Timing of the application, X (cutoff: X=0)") xline(0, lcolor(black)) legend(off) xlabel(-30(10)30) title("2007, all applicants")

graph copy all2007, replace

restore

drop *fit* mean

（2）然後,作者在一張表中彙報了2sls以及前定變數的ols結果

通過上表，我們可以得出在前定變數方面，斷點兩邊的差異是不顯著，在因變數方面顯著，而且顯著性來自type-A樣本。在回歸中作者使用了四次項多項式，而且只用了ontime這一個工具變數。以2008年為例，其實stata現方式為:

*** reduced form, 2008 ***

reg serious08 ontime mindelay mindelay2 ontimexmindelay ontimexmindelay2, robust

reg serious08 ontime mindelay mindelay2 ontimexmindelay ontimexmindelay2 if type=="A", robust

reg serious08 ontime mindelay mindelay2 ontimexmindelay ontimexmindelay2 if type=="B", robust

xtreg serious08 ontime mindelay mindelay2 ontimexmindelay ontimexmindelay2 age age2, robust i(lotterycode) fe cluster(lotterycode)

xtreg serious08 ontime mindelay mindelay2 ontimexmindelay ontimexmindelay2 age age2 if type=="A", robust i(lotterycode) fe cluster(lotterycode)

xtreg serious08 ontime mindelay mindelay2 ontimexmindelay ontimexmindelay2 age age2 if type=="B", robust i(lotterycode) fe cluster(lotterycode)

*** 2SLS, 2008 ***

ivreg2 serious08 (permit2007=ontime) mindelay mindelay2 ontimexmindelay ontimexmindelay2, robust first

ivreg2 serious08 (permit2007=ontime) mindelay mindelay2 ontimexmindelay ontimexmindelay2 if type=="A", robust first

ivreg2 serious08 (permit2007=ontime) mindelay mindelay2 ontimexmindelay ontimexmindelay2 if type=="B", robust first

xtivreg2 serious08 (permit2007=ontime) mindelay mindelay2 ontimexmindelay ontimexmindelay2 age age2, robust first i(lotterycode) fe cluster(lotterycode)

xtivreg2 serious08 (permit2007=ontime) mindelay mindelay2 ontimexmindelay ontimexmindelay2 age age2 if type=="A", robust first i(lotterycode) fe cluster(lotterycode)

xtivreg2 serious08 (permit2007=ontime) mindelay mindelay2 ontimexmindelay ontimexmindelay2 age age2 if type=="B", robust first i(lotterycode) fe cluster(lotterycode)

2.4其他前定變數的穩健性

作者分別彙報了Age,來源國，來源國收入水平等維度上的配置變數散點圖來顯示cutoff point兩邊的無差異性，做法與圖三相同。

2.5穩健性檢驗

（1）局部現性2SLS估計

為了體現論文的穩健性，作者接著使用了局部線性估計，結果如下：

在實際操作中，作者彙報了IK2012與CCT2014兩種帶寬選擇方案的結果，其中IK的樣本大約為40%左右，CCT為30%。

表4的實現方式如下，注意，作者並沒有使用Covariates,我的理解是實際操作中Cov的影響很小（畢竟在斷點附近COV幾乎沒有差異），而且離散型是的Cov會導致結果不收斂。

*** reduced form, 2008 ***

rdrobust serious08 mindelay, bwselect(IK)

rdrobust serious08 mindelay if type=="A", bwselect(IK)*新的rdrobust命令中bwselect演算法已經更新

rdrobust serious08 mindelay if type=="B", bwselect(IK)

rdrobust serious08 mindelay

rdrobust serious08 mindelay if type=="A"

rdrobust serious08 mindelay if type=="B"

*** 2SLS, 2008 ***

rdrobust serious08 mindelay, fuzzy(permit2007) bwselect(IK)

rdrobust serious08 mindelay if type=="A", fuzzy(permit2007) bwselect(IK)

rdrobust serious08 mindelay if type=="B", fuzzy(permit2007) bwselect(IK)

rdrobust serious08 mindelay, fuzzy(permit2007)

rdrobust serious08 mindelay if type=="A", fuzzy(permit2007)

rdrobust serious08 mindelay if type=="B", fuzzy(permit2007)

（2）作者使用不同的多項式次數與帶寬選擇進行估計，彙報結果如下

實際實現為：

*** generate polynomials grade 3-6 ***

forvalues i=3/6 {

gen mindelay$_i=mindelay^$_i

gen ontimexmindelay$_i=ontime*mindelay$_i

}

*** PARAMETRIC ESTIMATES (top graphs) ***

foreach t in "A" "B" {

preserve

keep if type=="$_t"

qui xtreg serious08 ontime age age2, robust i(lotterycode) fe

display "type $_t| polynomial degree 0| coeff. "_b[ontime] "| std.err. " _se[ontime]

qui xtreg serious08 ontime mindelay ontimexmindelay age age2, robust i(lotterycode) fe

display "type $_t| polynomial degree 1| coeff. "_b[ontime] "| std.err. " _se[ontime]

qui xtreg serious08 ontime mindelay mindelay2 ontimexmindelay ontimexmindelay2 age age2, robust i(lotterycode) fe

display "type $_t| polynomial degree 2| coeff. "_b[ontime] "| std.err. " _se[ontime]

qui xtreg serious08 ontime mindelay mindelay2 mindelay3 ontimexmindelay ontimexmindelay2 ontimexmindelay3 age age2, robust i(lotterycode) fe

display "type $_t| polynomial degree 3| coeff. "_b[ontime] "| std.err. " _se[ontime]

qui xtreg serious08 ontime mindelay mindelay2 mindelay3 mindelay4 ontimexmindelay ontimexmindelay2 ontimexmindelay3 ontimexmindelay4 age age2, robust i(lotterycode) fe

display "type $_t| polynomial degree 4| coeff. "_b[ontime] "| std.err. " _se[ontime]

qui xtreg serious08 ontime mindelay mindelay2 mindelay3 mindelay4 mindelay5 ontimexmindelay ontimexmindelay2 ontimexmindelay3 ontimexmindelay4 ontimexmindelay5 age age2, robust i(lotterycode) fe

display "type $_t| polynomial degree 5| coeff. "_b[ontime] "| std.err. " _se[ontime]

qui xtreg serious08 ontime mindelay mindelay2 mindelay3 mindelay4 mindelay5 mindelay6 ontimexmindelay ontimexmindelay2 ontimexmindelay3 ontimexmindelay4 ontimexmindelay5 ontimexmindelay6 age age2, robust i(lotterycode) fe

display "type $_t| polynomial degree 6| coeff. "_b[ontime] "| std.err. " _se[ontime]

restore

}

*** NONPARAMETRIC ESTIMATES (bottom graphs) ***

foreach t in "A" "B" {

preserve

keep if type=="$_t"

forvalues h=1/30 {

qui rdrobust serious08 mindelay if type=="A", fuzzy(permit2007) h($_h)

display "type $_t| bandwidth $_h | `e(tau_F_cl)" | `e(se_F_cl)" | `e(N)""

}

restore