AI Fundamentals Series 4: Setting Up a Machine Learning Workbench

This is GHBD's 13th article

GHBD aims to promote the development of healthcare big data and artificial intelligence

"Let us connect with the world"

Setup of a Machine Learning Workbench

Edward C. Cheng

Today let us talk about some practical action points for setting up a working environment where you, your IT engineering team, and your data scientists can work on Machine Learning (ML) projects. We will look at the conceptual architecture, system architecture, and tools that are helpful in accomplishing this goal.

This is much like how, in the past, you would set up a development environment for programming web applications. You would select your technology stack: programming language, compilers, helpful software libraries, database software, web server software, message queue software, version control software, and an IDE.

I will not bore you with the software you already know, but will focus on some of the newer technologies you need to support connections to different data sources, and on the platform that actually performs ML.

Below is the conceptual architecture for Big Data and AI.

Fusion of Data Sources and Data Models

The different data sources at the bottom are managed by different resource managers. These sources include structured and unstructured data, SQL and NoSQL databases, and even time-series data streams such as those from mobile apps and IoT apps.

In other words, these data sources may very well be represented in different data models. All of them will be streamed or piped into the upper layer for data transformation. The goal is to merge these different data models into a common model so that the layers above can consume the data when performing machine learning.

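To make this concrete, here is a minimal sketch of such a transformation step. Everything in it is hypothetical (the source kinds and field names such as patient_id and bpm are invented for illustration); it simply normalizes records from a SQL table, a NoSQL document store, and an IoT stream into one common model:

```python
# A minimal sketch of merging heterogeneous source records into a
# common data model. All source kinds and field names are hypothetical.
from datetime import datetime, timezone

def from_sql_row(row):
    # e.g. a row fetched from a relational patient table
    return {"id": row["patient_id"], "ts": row["updated_at"],
            "vitals": {"hr": row["heart_rate"]}}

def from_json_doc(doc):
    # e.g. a document from a NoSQL store
    return {"id": doc["pid"], "ts": doc.get("timestamp"),
            "vitals": doc.get("vitals", {})}

def from_iot_event(event):
    # e.g. a time-series reading from a wearable device
    ts = datetime.fromtimestamp(event["epoch"], tz=timezone.utc)
    return {"id": event["device_owner"], "ts": ts.isoformat(),
            "vitals": {"hr": event["bpm"]}}

# Each source gets its own adapter; every layer above sees one schema.
adapters = {"sql": from_sql_row, "json": from_json_doc, "iot": from_iot_event}

def to_common_model(source_kind, record):
    return adapters[source_kind](record)

print(to_common_model("iot", {"device_owner": "p42", "epoch": 1700000000, "bpm": 72}))
```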

The transformation of data into different data sets is usually done in a MapReduce fashion, an approach developed by Google researchers in their Big Data work. It gave rise to GFS (the Google File System), which later inspired HDFS (the Hadoop Distributed File System), for storing data on distributed data nodes in a cluster of machines.

It also brought about Google's MapReduce programming framework, which maps a task onto a number of smaller sub-tasks across multiple data nodes; another program then reduces the parallel results into a smaller data set.

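Here is a toy, single-machine sketch of the idea using the classic word-count example. In a real cluster each mapped chunk would live on a different data node; here worker processes stand in for the nodes:

```python
# Toy illustration of MapReduce on one machine: "map" splits the work
# into independent sub-tasks, "reduce" folds the partial results back
# into one answer.
from collections import Counter
from functools import reduce
from multiprocessing import Pool

def map_phase(chunk_of_lines):
    # Count words in one chunk; each chunk is an independent sub-task.
    return Counter(word for line in chunk_of_lines for word in line.split())

def reduce_phase(left, right):
    # Merge two partial counts into one.
    return left + right

if __name__ == "__main__":
    lines = ["big data and machine learning",
             "machine learning on big data",
             "data pipelines feed machine learning"]
    chunks = [lines[0:1], lines[1:2], lines[2:3]]       # pretend each chunk lives on a node
    with Pool(processes=3) as pool:
        partials = pool.map(map_phase, chunks)          # parallel "map"
    totals = reduce(reduce_phase, partials, Counter())  # "reduce"
    print(totals.most_common(3))
```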

The process or cycle of Map and Reduce repeats again and again until the data is filtered into a data set that the machine learning platform is ready to consume and learn from. This process is what people recently call building distributed data pipelines.

These are pipelines because data often streams in from different sources; as it passes through multiple layers of MapReduce, each layer creates new data sets, and these data sets are fed back into the pipeline as input for creating yet other data sets. The final product is a data set that is ready to be learned from.

Distributed Data Pipelines and MapReduce

MapReduce is a parallel, distributed processing model. The Spark framework is usually used in this process: the open-source Apache Spark provides you with APIs to program map and reduce tasks handily.

To use Spark, (1) you first have to distribute your data nodes onto a cluster of machines. (2) Then you configure Spark according to that distributed data setup. In your Spark program, (3) you can now create a Spark Context with that configuration, and (4) a Streaming Context can then be created from the Spark Context. Finally, when (5) you submit the Spark job, the system follows this configuration to issue the job in parallel to all distributed nodes, brings the data back to perform the "reduce" work, and gives you the desired result.

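A minimal PySpark sketch of these steps might look like the following. The master URL, HDFS path, and file names are placeholders for your own cluster, and step (1), laying out the data nodes themselves, happens outside the program:

```python
# Minimal PySpark sketch of steps (2)-(5). The master URL and the
# HDFS path are placeholders; adjust them to your own cluster.
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# (2) Configure Spark for your distributed setup.
conf = SparkConf().setAppName("wordcount").setMaster("spark://master-host:7077")

# (3) Create the Spark Context with that configuration.
sc = SparkContext(conf=conf)

# (4) A Streaming Context (here with 10-second batches) can be created
#     from the Spark Context for data that arrives continuously.
ssc = StreamingContext(sc, 10)

# A classic map/reduce job over data stored on the HDFS data nodes.
counts = (sc.textFile("hdfs://master-host:9000/data/notes.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
print(counts.take(10))

# (5) Submit the job with:
#     spark-submit --master spark://master-host:7077 job.py
```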

As the conceptual architecture above depicts, data scientists will usually perform data visualization with certain tools to get a good sense of the data, its value distribution, correlations, and completeness versus missing data, before embarking on the actual ML task.

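For example, a quick exploratory pass might look like the sketch below, assuming a tabular data set loaded with pandas (the file name is a placeholder):

```python
# A quick exploratory pass before any training: value distributions,
# correlations, and missing data. The CSV path is a placeholder.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("patients.csv")       # your own data set

print(df.describe())                   # value distributions
print(df.isnull().sum())               # completeness vs. missing data
print(df.corr(numeric_only=True))      # pairwise correlations (newer pandas)

df.hist(figsize=(10, 8))               # one histogram per numeric column
plt.tight_layout()
plt.show()
```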

This is a very common architecture today for supporting distributed data pipelines that feed data into an ML platform. On top of that, we run the ML development and testing environment.

Below is the system architecture for the Big Data and ML platform.

In this architecture, I used Docker to deploy all of my ML/AI applications so that they are platform independent and can be re-deployed by my users or partners in their own environments, with minimal trouble recreating my runtime environment. HTCondor is used to run my Python programs in parallel, which is particularly handy if the ML is to be performed on a machine cluster.

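For illustration, a minimal Dockerfile for packaging such a Python ML application might look like the sketch below; the base image, dependency list, and entry-point script (train.py) are all assumptions to adapt to your own project:

```dockerfile
# Hypothetical Dockerfile for packaging an ML application so that
# partners can re-run it without recreating the runtime environment.
FROM python:3.10-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt   # e.g. tensorflow, numpy, pandas

COPY . .
CMD ["python", "train.py"]
```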

Machine Learning Python Program

I have chosen Python as my primary programming language because many of the open Machine Learning libraries are written in Python. The Jupyter Notebook platform is a very nice shared development and documentation environment for developers and data scientists to share their work. It can be deployed on the cloud so that heavy ML tasks can take advantage of the distributed machine cluster.

TensorFlow, NumPy, Theano, and the like are good backend engine libraries for Convolutional Neural Network (CNN) development. TensorFlow was developed by Google but is now a very nice open-source library, with thousands of developers contributing to its growth.

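For instance, a small CNN can be assembled in a few lines with the Keras API on a TensorFlow backend; the layer sizes and the 10-class output below are arbitrary, for illustration only:

```python
# A small CNN built with Keras on the TensorFlow backend.
# Layer sizes and the 10-class output are illustrative only.
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(64, 64, 3)),               # 64x64 RGB input
    layers.Conv2D(32, (3, 3), activation="relu"),  # convolution layer
    layers.MaxPooling2D((2, 2)),                   # downsampling
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),        # 10 output classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```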

On top of that, we run model-level open libraries to support engineers in using open-source CNN models. In practice, most ML engineers nowadays do not build neural networks from scratch. Many NN models are available in the open-source community; other researchers, in solving various problems, have already developed these models with satisfactory results.

They have put these models and their source code in places like GitHub, where you can download them freely (different open-source licensing agreements govern their use, but you can always start by learning from these models and then modify them according to your own needs).

AI engineers will usually map the problem they are looking at onto a problem that someone out there has already solved. To kick-start their work, they import the CNN model from that related, solved problem.

Very often researchers don't just upload the completed CNN model as open source, but also their learning results: the learned parameters. These parameters are results that might take days or even longer to obtain by running on very large and powerful machines.

You can download these pre-learned parameters for free and then use only your own data to further train the CNN model for your specific use, as the sketch below illustrates. This dramatically shortens the time it takes to develop a neural network that solves your problem.

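A minimal transfer-learning sketch with Keras follows, assuming an image-classification task; the two-class head and the train_images/train_labels arrays are placeholders for your own problem and data:

```python
# Transfer learning: reuse pre-learned ImageNet parameters and train
# only a small new head on your own data. train_images/train_labels
# are placeholders for your own arrays.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                       # freeze the pre-learned parameters

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dense(2, activation="softmax"),   # e.g. a two-class problem
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5)   # train on your own data
```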

I hope this series of Big Data and Machine Learning discussions helps some hospital IT professionals out there. Please share your experience with me as well. Enjoy!

- End -

The AI Fundamentals Series, explained in plain language

About the Author

Dr. Cheng is an experienced entrepreneur and high-tech executive in the United States, holding three global invention patents. He is currently the CEO of 詩智科技 and the founder of GHBD. His research interests are big data, machine learning, and agile methods.

Welcome to follow us. For reprints, please contact us for authorization.

(WeChat ID: Grace-daydreamer)

Global Healthcare Big Data

Building on the success of the 2nd Global Healthcare Big Data Symposium (2017) and the 1st International Symposium on Cloud, Mobile and Big Data (2015), as well as three Global Healthcare Big Data working conferences held at Stanford University Medical Center (2016), the University of Hong Kong (2016), and the Peking University Big Data Center (2017), our goal is to build a lasting international platform for industry experts at home and abroad: a distinctive professional community where government agencies, healthcare practitioners, technology researchers, and scholars from around the world can come together to exchange important ideas and results on the future development of hospital IT.
