Extracting mRNA & lncRNA Data from the TCGA Database

Latest 02-08

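I was busy with final exams and a few other things for a while, and it has been so long since my last update that I can't even remember how long (cue manual tears). To avoid losing followers, here is some solid content today: merging the counts files for mRNA and lncRNA data from the TCGA database. Don't unfollow, don't unfollow. The language this time is Python, so you will need to install Python, PyCharm, and Anaconda; for installation and software setup, a quick Baidu search is enough. After that you really will be able to get the latest and most accurate TCGA data. Important things get said three times: don't unfollow, don't unfollow, don't unfollow.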

import gzip
import os

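# Root directory of all source data (absolute path); change this to your own download directory
ROOT_DIR = r"D:\research2017LMEBVSTADmRNA&lncRNA"

# Name of the marker file; directories that contain it are skipped
annotation_file = r"annotations.txt"

# Target directory for the decompressed files (absolute path)
uzip_counts_dirs = r"D:\research2017LMEBVSTAD\result_unzip"

# Suffixes of the compressed files
key_zip_suffix = ".counts.gz"
key_suffix = ".gz"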

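# Create the target directory if it does not exist
# (do not run the script when the directory already exists and is not empty)
if not os.path.isdir(uzip_counts_dirs):
    os.makedirs(uzip_counts_dirs)

# Switch the working directory to ROOT_DIR
os.chdir(ROOT_DIR)

# List all source data directories (make sure there are no unrelated files;
# this program does no exception handling)
source_data_dir = os.listdir(ROOT_DIR)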

# Read the compressed file and write the counts file
def unzip_counts(each_data):
    count_gz_file = [x for x in os.listdir(each_data) if x.endswith(key_zip_suffix)]
    if count_gz_file:
        count_gz_file_ = count_gz_file[0]
        count_gz_file = os.path.join(ROOT_DIR, each_data, count_gz_file_)
        # Remove the trailing ".gz" by slicing to get the output file name
        target_file = os.path.join(uzip_counts_dirs, count_gz_file_[:-len(key_suffix)])
        print("Dealing with {} to {}".format(count_gz_file, target_file))
        with gzip.open(count_gz_file, "rb") as input_file, open(target_file, "wb") as output_file:
            output_file.write(input_file.read())
    else:
        print("Please check this dir: no counts.gz file found! --> {}".format(
            os.path.join(ROOT_DIR, each_data)))

def deal_with_main():
    # Main processing function
    for each_data in source_data_dir:
        if annotation_file in os.listdir(each_data):
            # Skip directories that contain the marker file, but print their names
            print(each_data, "-----------------------------------------------------")
        else:
            unzip_counts(each_data)


if __name__ == "__main__":
    deal_with_main()
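
The decompression step can also be written so that it streams each file instead of reading it into memory in one go, which may help with larger archives. The following is only a minimal sketch: the helper name `gunzip_to` and its parameters are hypothetical and not part of the script above.

```python
import gzip
import os
import shutil


def gunzip_to(src_path, dst_dir, suffix=".gz"):
    """Decompress src_path into dst_dir and return the path of the new file.

    Hypothetical helper for illustration; not part of the original script.
    """
    name = os.path.basename(src_path)
    # Drop the suffix by slicing so that only the trailing ".gz" is removed.
    if name.endswith(suffix):
        name = name[:-len(suffix)]
    os.makedirs(dst_dir, exist_ok=True)
    dst_path = os.path.join(dst_dir, name)
    # copyfileobj copies in chunks, so the whole file is never held in memory.
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dst_path
```

Calling `gunzip_to(path_to_gz, target_dir)` for each `.counts.gz` file would give the same result as the read-then-write approach, one file at a time.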

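And that's it. In R, a single for loop is enough to read all the extracted counts files into one table, and the rest of the processing workflow was covered in my earlier TCGA posts. I got lazy and wrote this post directly in the public-account editor instead of using Xiumi, so please don't mind how ugly it looks; it is a formatting issue, and it may read better on an iPad. I won't write a preview of the next issue, because my previews never turn out to be the next issue!!! To save effort, next time I may share some R packages or small functions I have collected and found handy. Off to the grind in tears.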
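
The post does the merge in R; for readers who would rather stay in Python, here is a rough sketch of the same idea using only the standard library. It assumes each extracted file is a two-column, tab-separated table (gene ID, count) in the HTSeq style, where summary rows start with `__`. The function name `merge_counts`, the output layout, and the use of file names as column headers are all assumptions made for illustration.

```python
import csv
import os


def merge_counts(counts_dir, out_path, suffix=".counts"):
    """Merge two-column counts files into one gene-by-file table.

    Hypothetical helper; returns the list of merged file names.
    """
    files = sorted(f for f in os.listdir(counts_dir) if f.endswith(suffix))
    table = {}  # gene id -> {file name: count}
    for name in files:
        with open(os.path.join(counts_dir, name), newline="") as fh:
            for row in csv.reader(fh, delimiter="\t"):
                # Skip blank lines and summary rows such as "__no_feature".
                if len(row) < 2 or row[0].startswith("__"):
                    continue
                table.setdefault(row[0], {})[name] = row[1]
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        writer.writerow(["gene_id"] + files)
        for gene in sorted(table):
            # A gene missing from a file gets 0 in that file's column.
            writer.writerow([gene] + [table[gene].get(name, "0") for name in files])
    return files
```

Something like `merge_counts(uzip_counts_dirs, "merged_counts.tsv")` would then write one row per gene and one column per counts file. Mapping file names back to sample barcodes still needs the metadata that comes with the download.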

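A doctor who can't do research is not a good cook. You are all welcome to come find me in the kitchen, though you definitely won't find me hahahahaha~~~~~~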
