GFS (GlusterFS) Distributed File System

夜貓速讀, published 2022-06-17 from Hubei


I. GlusterFS overview

II. GlusterFS storage architecture

III. How GlusterFS works

IV. GlusterFS volume types

V. Case study: building a Gluster distributed file system

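I. GlusterFS overview

Overview: GlusterFS is an open-source distributed file system. It uses a TCP/IP network to spread storage resources across different nodes and then aggregates them so that clients see a single, unified resource. It scales out well: by adding nodes, a cluster can reach PB-level storage capacity.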

Storage units, from smallest to largest: bit, byte, KB, MB, GB, TB, PB, EB, ZB, YB

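Features:

Scalability and high performance: the scale-out architecture raises capacity and performance by adding storage nodes (disk, compute, and I/O resources can each be added independently). Gluster's elastic hash (Elastic Hash) algorithm determines where data is placed on the storage nodes, so no metadata server is needed. This removes the metadata server both as a source of load and as a single point of failure, and lets storage scale horizontally.

High availability: by configuring the appropriate volume type, GlusterFS replicates data automatically (similar to RAID 1), so data stays accessible even if a node fails.

Generality: GlusterFS does not define its own private on-disk file system. It stores data on ordinary file systems such as ext3 and ext4, so the data can also be reached through conventional disk access methods.

Elastic volume management: GlusterFS stores data in logical volumes, which are carved independently out of a logical storage pool. Storage can be added to or removed from the pool online without interrupting service, and the number of logical volumes can be grown or reduced as needed.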
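The idea behind placement without a metadata server can be illustrated with a small sketch. It is not Gluster's actual algorithm: `zlib.crc32` is a stand-in hash chosen for this example, and the brick names and the helpers `build_layout` and `locate` are hypothetical. The point it shows is that any client holding the same brick list computes the same location for a file name, so nobody has to look the location up.

```python
import zlib

HASH_SPACE = 2 ** 32  # size of a 32-bit hash space


def build_layout(bricks):
    """Split the hash space into one contiguous range per brick."""
    step = HASH_SPACE // len(bricks)
    layout = []
    for i, brick in enumerate(bricks):
        start = i * step
        # The last brick absorbs the remainder so the ranges cover everything.
        end = HASH_SPACE - 1 if i == len(bricks) - 1 else start + step - 1
        layout.append((start, end, brick))
    return layout


def locate(filename, layout):
    """Return the brick whose range contains the hash of the file name."""
    h = zlib.crc32(filename.encode("utf-8"))  # stand-in hash (assumption)
    for start, end, brick in layout:
        if start <= h <= end:
            return brick
    raise ValueError("layout does not cover the hash space")


# Hypothetical brick names, in the same form the article uses later.
bricks = ["server1:/dir1", "server2:/dir2", "server3:/dir3"]
layout = build_layout(bricks)
print(locate("report.pdf", layout))
```

Because the result depends only on the file name and the brick list, adding clients adds no lookup traffic to any central node.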

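II. GlusterFS storage architecture

Terminology:

Brick (storage block): a directory on a node in the storage pool that provides storage to the outside.

Volume (logical volume): a collection of bricks. A volume is the logical device on which data is stored, similar to a logical volume in LVM. Most GlusterFS management operations are performed on volumes.

FUSE (Filesystem in Userspace): a kernel module that lets users create and mount their own file systems.

VFS (virtual file system interface): the interface through which kernel space gives user space access to disks.

Glusterd (management daemon): must run on every node in the storage cluster.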

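III. How GlusterFS works

Data access flow:

1. On the client, the user reads and writes data through the GlusterFS mount point. The cluster is completely transparent to the user, who cannot tell whether the operation goes to a local file system or to a remote cluster.
2. The operation is handed to the VFS of the local Linux system.
3. VFS passes the data to the FUSE kernel file system. Before the GlusterFS client starts, a real file system, FUSE, must be registered with the system. As the figure in the original article shows, FUSE sits at the same level as ext3: ext3 works on a physical disk, whereas FUSE passes the data through the /dev/fuse device file to the GlusterFS client. FUSE can therefore be thought of as a proxy.
4. After FUSE hands the data to the GlusterFS client, the client processes it as specified in the client configuration file. This file is given when the client is started; its default location is /etc/glusterfs/client.vol.
5. At the end of client-side processing, the data is sent over the network to the GlusterFS server, which writes it to the storage devices it controls.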

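IV. GlusterFS volume types

Distributed, striped, replicated, distributed striped, distributed replicated, striped replicated, and distributed striped replicated volumes.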

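1. Distributed volume

The distributed volume is the GlusterFS default: creating a volume with no other options produces a distributed volume. In this mode files are not split into blocks; each file is stored whole on one server node. Because the local file system is used directly, access is no faster than local storage and is in fact somewhat slower because of network communication. Very large files are also harder to support, because a file sits either on Server1 or on Server2 and cannot be split across both.

Features:

Files are spread over different servers, with no redundancy.

The volume can be expanded easily and cheaply.

A single point of failure causes data loss.

Data protection depends on the underlying layer.

How to create: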

[root@gfs ~]# gluster volume create dis-volume server1:/dir1 server2:/dir2

Creation of dis-volume has been successful

Please start the volume to access data

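2. Striped volume

Stripe mode is comparable to RAID 0. The file is divided into N blocks by offset, and the blocks are stored round-robin across the brick server nodes. Each node stores its blocks as ordinary files in its local file system and records the total number of blocks (Stripe-count) and the sequence number of each block (Stripe-index) in extended attributes. The stripe count given at creation must equal the number of bricks in the volume. Performance is particularly good for large files, but there is no redundancy.

Features:

Data is split into smaller blocks that are distributed over different bricks in the server pool.

Distribution reduces load, and the smaller pieces speed up access.

There is no data redundancy.

How to create: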

[root@gfs ~]# gluster volume create stripe-volume stripe 2 transport tcp server1:/dir1 server2:/dir2

create of stripe-volume has been successful

please start the volume to access data
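The round-robin splitting described for stripe mode can be sketched in a few lines. This is a simplified model, not Gluster's on-disk format: the functions `stripe_write` and `stripe_read` and the 4-byte block size are made up for the illustration, and the block index stored next to each block plays the role of the Stripe-index attribute.

```python
def stripe_write(data, bricks, block_size):
    """Split data into fixed-size blocks and deal them round-robin to bricks."""
    stored = {brick: [] for brick in bricks}
    for index, offset in enumerate(range(0, len(data), block_size)):
        brick = bricks[index % len(bricks)]
        # Keep the block index with the block, like a stripe-index attribute.
        stored[brick].append((index, data[offset:offset + block_size]))
    return stored


def stripe_read(stored):
    """Reassemble the file by ordering all blocks by their index."""
    blocks = [block for parts in stored.values() for block in parts]
    return b"".join(chunk for _, chunk in sorted(blocks))


bricks = ["server1:/dir1", "server2:/dir2"]
stored = stripe_write(b"ABCDEFGHIJ", bricks, block_size=4)
print(stored)
print(stripe_read(stored))
```

Losing either brick in this model loses every other block of the file, which is why a striped volume on its own offers no redundancy.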

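3. Replicated volume

Also called AFR (Automatic File Replication), this mode is comparable to RAID 1: one or more copies of each file are kept, and every node holds the same content and directory structure. Because copies are stored, disk utilization is lower. When creating a replicated volume, the replica count must equal the number of bricks in the volume. A replicated volume is redundant: even if one node fails, the data remains usable.

Features:

Every server in the volume holds a complete copy.

The number of replicas is chosen when the volume is created.

At least two brick servers are required.

The volume provides redundancy.

How to create: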

[root@gfs ~]# gluster volume create rep-volume replica 2 transport tcp server1:/dir1 server2:/dir2

create of rep-volume has been successful

please start the volume to access data
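A toy model makes the redundancy claim concrete. The class `ReplicaVolume` below is hypothetical and ignores everything a real implementation must handle (healing, split-brain, partial writes); it only shows that when every brick holds a full copy, a read still succeeds after one brick is lost.

```python
class ReplicaVolume:
    """Toy replicated volume: every brick holds a full copy of every file."""

    def __init__(self, bricks):
        self.bricks = {name: {} for name in bricks}
        self.online = set(bricks)

    def write(self, path, data):
        # A write goes to every brick that is currently reachable.
        for name in self.online:
            self.bricks[name][path] = data

    def fail(self, name):
        """Mark a brick as unreachable."""
        self.online.discard(name)

    def read(self, path):
        # Any surviving brick can serve the read.
        for name in self.online:
            if path in self.bricks[name]:
                return self.bricks[name][path]
        raise FileNotFoundError(path)


vol = ReplicaVolume(["server1:/dir1", "server2:/dir2"])
vol.write("/a.txt", b"hello")
vol.fail("server1:/dir1")
print(vol.read("/a.txt"))
```

The cost is visible in the same model: two bricks store one file's worth of data, which is the lower disk utilization mentioned above.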

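4. Distributed striped volume

A distributed striped volume combines the functions of distributed and striped volumes and can be thought of as a large striped volume. It is mainly used for large-file access. Creating one requires at least 4 servers.

How to create: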

[root@gfs ~]# gluster volume create dis-stripe stripe 2 transport tcp server1:/dir1 server2:/dir2 server3:/dir3 server4:/dir4

create of dis-stripe has been successful

please start the volume to access data

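The command above creates a distributed striped volume named dis-stripe. For a distributed striped volume, the number of bricks must be a multiple of the stripe count (at least 2x). In the command above there are 4 bricks and the stripe count is 2.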
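The multiple-of-N rule can be checked before running the create command. The helper `group_bricks` below is a hypothetical sketch; it assumes bricks are grouped into sets in the order they are listed on the command line, and it applies equally to a stripe count or a replica count.

```python
def group_bricks(bricks, set_size):
    """Group consecutive bricks into sets of set_size (stripe or replica count)."""
    if set_size < 2:
        raise ValueError("set_size must be at least 2")
    if len(bricks) % set_size != 0 or len(bricks) < 2 * set_size:
        raise ValueError(
            f"{len(bricks)} bricks cannot form a distributed volume with sets "
            f"of {set_size}: need a multiple of {set_size}, at least {2 * set_size}"
        )
    return [bricks[i:i + set_size] for i in range(0, len(bricks), set_size)]


bricks = ["server1:/dir1", "server2:/dir2", "server3:/dir3", "server4:/dir4"]
sets = group_bricks(bricks, 2)
print(f"{len(sets)} x {len(sets[0])} = {len(bricks)}")
```

For the four bricks above this prints 2 x 2 = 4, the same "sets x set size = bricks" shape that `gluster volume info` reports in its Number of Bricks line.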

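5. Distributed replicated volume

A distributed replicated volume combines the functions of distributed and replicated volumes and can be thought of as a large replicated volume. It is mainly used where redundancy is required. Creating one requires at least 4 bricks.

How to create: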

[root@gfs ~]# gluster volume create dis-rep replica 2 transport tcp server1:/dir1 server2:/dir2 server3:/dir3 server4:/dir4

create of dis-rep has been successful

please start the volume to access data

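6. Striped replicated volume

A striped replicated volume combines the advantages of striped and replicated volumes and is comparable to RAID 10. It suits scenarios that need both high storage efficiency and redundant backup. Creating one requires at least four bricks.

How to create: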

[root@gfs ~]# gluster volume create test-volume stripe 2 replica 2 transport tcp server1:/dir1 server2:/dir2 server3:/dir3 server4:/dir4

create of test-volume has been successful

please start the volume to access data

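7. Distributed striped replicated volume

A distributed striped replicated volume distributes striped data across a cluster of replicated volumes. It gives the best results in highly concurrent environments where parallel access to very large files and performance are critical.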

V. Case study: building a Gluster distributed file system

Environment:

OS	IP address	Hostname	Required packages
CentOS 7.4 1708 64-bit	192.168.100.101	data1.linuxfan.cn	glusterfs glusterfs-server glusterfs-fuse glusterfs-rdma
CentOS 7.4 1708 64-bit	192.168.100.102	data2.linuxfan.cn	glusterfs glusterfs-server glusterfs-fuse glusterfs-rdma
CentOS 7.4 1708 64-bit	192.168.100.103	data3.linuxfan.cn	glusterfs glusterfs-server glusterfs-fuse glusterfs-rdma
CentOS 7.4 1708 64-bit	192.168.100.104	client.linuxfan.cn	glusterfs glusterfs-fuse

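Steps:

• Configure name resolution between the hosts (identical on every host; only data1 is shown).

• Install GlusterFS on all data nodes (identical on every host; only data1 is shown).

• Create the cluster on data1; the other nodes synchronize the configuration.

• Create the data storage location on each data node.

• Create the storage volume (a replicated volume) on data1; the other nodes synchronize the configuration.

• Install the Gluster client tools on the client node and test mounting.

• Test storing files from the client node.

• Extension: Gluster management commands.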

• Configure name resolution between the hosts (identical on every host; only data1 is shown).

[root@data1 ~]# cat <<END >>/etc/hosts
192.168.100.101 data1.linuxfan.cn
192.168.100.102 data2.linuxfan.cn
192.168.100.103 data3.linuxfan.cn
192.168.100.104 client.linuxfan.cn
END

[root@data1 ~]# ping data1.linuxfan.cn -c 2 ## test name resolution with ping

PING data1.linuxfan.cn (192.168.100.101) 56(84) bytes of data.

64 bytes from data1.linuxfan.cn (192.168.100.101): icmp_seq=1 ttl=64 time=0.062 ms

64 bytes from data1.linuxfan.cn (192.168.100.101): icmp_seq=2 ttl=64 time=0.040 ms
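Appending to /etc/hosts with a here-document is not idempotent: running it twice leaves duplicate lines. As a hedged alternative, the sketch below computes only the entries that are still missing. The helper names `hosts_entries` and `missing_entries` are hypothetical; the addresses and host names are the ones from the environment table.

```python
HOSTS = {
    "192.168.100.101": "data1.linuxfan.cn",
    "192.168.100.102": "data2.linuxfan.cn",
    "192.168.100.103": "data3.linuxfan.cn",
    "192.168.100.104": "client.linuxfan.cn",
}


def hosts_entries(hosts):
    """Render the mapping as /etc/hosts lines."""
    return "\n".join(f"{ip} {name}" for ip, name in hosts.items()) + "\n"


def missing_entries(hosts, current_text):
    """Return the lines that are not yet present in the given hosts file text."""
    present = {
        tuple(line.split()[:2])
        for line in current_text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    }
    return [f"{ip} {name}" for ip, name in hosts.items() if (ip, name) not in present]


# Example input standing in for the current content of /etc/hosts.
current = "127.0.0.1 localhost\n192.168.100.101 data1.linuxfan.cn\n"
print(missing_entries(HOSTS, current))
```

Only the returned lines would need to be appended, so the step can be repeated safely on every node.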

• Install GlusterFS on all data nodes (identical on every host; only data1 is shown).

[root@data1 ~]# wget -O /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/Centos-7.repo

[root@data1 ~]# yum -y install centos-release-gluster ## install the yum repository for the gluster packages

[root@data1 ~]# yum -y install glusterfs glusterfs-server glusterfs-fuse glusterfs-rdma

[root@data1 ~]# systemctl start glusterd

[root@data1 ~]# systemctl enable glusterd

Created symlink from /etc/systemd/system/multi-user.target.wants/glusterd.service to /usr/lib/systemd/system/glusterd.service.

[root@data1 ~]# netstat -utpln |grep glu

tcp 0 0 0.0.0.0:24007 0.0.0.0:* LISTEN 1313/glusterd

[root@data1 ~]# netstat -utpln |grep rpc

tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN 1311/rpcbind

udp 0 0 0.0.0.0:111 0.0.0.0:* 1311/rpcbind

udp 0 0 0.0.0.0:634 0.0.0.0:* 1311/rpcbind

• Create the cluster on data1; the other nodes synchronize the configuration.

[root@data1 ~]# gluster peer probe data1.linuxfan.cn ## add the local node

peer probe: success. Probe on localhost not needed

[root@data1 ~]# gluster peer probe data2.linuxfan.cn ## add the data2 node

peer probe: success.

[root@data1 ~]# gluster peer probe data3.linuxfan.cn ## add the data3 node

peer probe: success.

[root@data1 ~]# gluster peer status ## show the state of the gluster cluster

Number of Peers: 2

Hostname: data2.linuxfan.cn

Uuid: a452f7f4-7604-4d44-8b6a-f5178a41e308

State: Peer in Cluster (Connected)

Hostname: data3.linuxfan.cn

Uuid: b08f1b68-3f2c-4076-8121-1ab17d1517e1

State: Peer in Cluster (Connected)
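When this check has to be repeated across several nodes, the peer status text can be parsed rather than read by eye. The parser below is a sketch that assumes the "Key: value" layout shown in the output above; `parse_peer_status` is a hypothetical helper, and the sample text is copied from that output.

```python
def parse_peer_status(text):
    """Turn 'gluster peer status' text into a list of per-peer dictionaries."""
    peers = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or a line without a key
        key, value = key.strip(), value.strip()
        if key == "Hostname":
            peers.append({"Hostname": value})  # a new peer record starts here
        elif peers and key in ("Uuid", "State"):
            peers[-1][key] = value
    return peers


SAMPLE = """Number of Peers: 2

Hostname: data2.linuxfan.cn
Uuid: a452f7f4-7604-4d44-8b6a-f5178a41e308
State: Peer in Cluster (Connected)

Hostname: data3.linuxfan.cn
Uuid: b08f1b68-3f2c-4076-8121-1ab17d1517e1
State: Peer in Cluster (Connected)
"""

peers = parse_peer_status(SAMPLE)
all_connected = all(p.get("State") == "Peer in Cluster (Connected)" for p in peers)
print(len(peers), all_connected)
```

A script could refuse to create the volume until `all_connected` is true for the expected number of peers.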

• Create the data storage location on each data node.

[root@data1 ~]# mkdir /data

[root@data1 ~]# gluster volume info

No volumes present

• Create the storage volume (a replicated volume) on data1; the other nodes synchronize the configuration.

[root@data1 ~]# gluster volume create rep-volume replica 3 transport tcp data1.linuxfan.cn:/data data2.linuxfan.cn:/data data3.linuxfan.cn:/data force ## create the replicated volume rep-volume

volume create: rep-volume: success: please start the volume to access data

[root@data1 ~]# gluster volume info

Volume Name: rep-volume

Type: Replicate

Volume ID: ac59612b-e6ce-46ce-85a7-74262fb722b2

Status: Created

Snapshot Count: 0

Number of Bricks: 1 x 3 = 3

Transport-type: tcp

Bricks:

Brick1: data1.linuxfan.cn:/data

Brick2: data2.linuxfan.cn:/data

Brick3: data3.linuxfan.cn:/data

Options Reconfigured:

transport.address-family: inet

nfs.disable: on

performance.client-io-threads: off

[root@data1 ~]# gluster volume start rep-volume ## start the volume

volume start: rep-volume: success

• Install the Gluster client tools on the client node and test mounting.

[root@client ~]# yum install -y glusterfs glusterfs-fuse

[root@client ~]# mount -t glusterfs data1.linuxfan.cn:rep-volume /mnt/

[root@client ~]# ls /mnt/

[root@client ~]# df -hT |tail -1

data1.linuxfan.cn:rep-volume fuse.glusterfs 19G 2.0G 17G 11% /mnt
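The mounted size of 19G is worth a second look: three bricks back the volume, yet the client sees the capacity of one. A rough calculation shows why. It assumes each brick sits on a 19 GB file system, which the df output suggests but the article does not state, and `usable_capacity` is a hypothetical helper.

```python
def usable_capacity(brick_sizes_gb, replica=1):
    """Estimate usable capacity when consecutive bricks form replica sets."""
    if replica < 1 or len(brick_sizes_gb) % replica != 0:
        raise ValueError("brick count must be a multiple of the replica count")
    total = 0
    for i in range(0, len(brick_sizes_gb), replica):
        replica_set = brick_sizes_gb[i:i + replica]
        # Every member stores the same data, so the smallest brick sets the limit.
        total += min(replica_set)
    return total


print(usable_capacity([19, 19, 19], replica=3))  # the replica 3 volume in this case study
print(usable_capacity([19, 19, 19]))             # the same bricks as a plain distributed volume
```

With replica 3 the three bricks form a single replica set, so usable space equals one brick; the same bricks in a purely distributed volume would add up instead.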

• Test storing files from the client node.

[root@client ~]# touch /mnt/{1..10}.file

[root@client ~]# dd if=/dev/zero of=/mnt/1.txt bs=1G count=1

[root@client ~]# ls /mnt/

10.file 1.file 1.txt 2.file 3.file 4.file 5.file 6.file 7.file 8.file 9.file

[root@client ~]# du -sh /mnt/1.txt

1.0G /mnt/1.txt

• Extension: Gluster management commands.

gluster peer status ## show information about all nodes

gluster peer probe name ## add a node

gluster peer detach name ## remove a node

gluster volume create xxx ## create a volume

gluster volume info ## show volume information

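Quotas:

1. Enable or disable quotas

gluster volume quota VOLNAME enable/disable

2. Set (or reset) a directory quota

gluster volume quota VOLNAME limit-usage /img limit-value

gluster volume quota img limit-usage /quota 10GB

This sets a 10 GB limit on the quota subdirectory of the img volume. The path is relative to the volume's mount point, which is treated as the root "/", so /quota is the quota subdirectory under the client mount directory.

3. View quotas

gluster volume quota VOLNAME list

gluster volume quota VOLNAME list /DIRECTORY

The two commands above show a volume's quotas. The first lists every quota set on the volume; the second lists the quota for a specific directory. Both show the limit and the current usage. If no usage is shown (the minimum is 0 KB), the directory in the quota setting is probably wrong (it does not exist).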
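To reason about a limit such as 10GB in scripts, it helps to turn the size string into bytes. The sketch below assumes binary units (1 GB = 1024^3 bytes), which is an assumption of this example rather than something the article states, and both helper names are hypothetical.

```python
UNITS = {"KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3, "TB": 1024 ** 4, "PB": 1024 ** 5}


def parse_size(text):
    """Convert a size such as '10GB' into bytes (binary units assumed)."""
    text = text.strip().upper()
    for suffix, factor in UNITS.items():
        if text.endswith(suffix):
            return int(float(text[:-len(suffix)]) * factor)
    return int(text)  # no suffix: treat as a plain byte count


def would_exceed(limit, used_bytes, incoming_bytes):
    """Return True if writing incoming_bytes would pass the quota limit."""
    return used_bytes + incoming_bytes > parse_size(limit)


print(parse_size("10GB"))
print(would_exceed("10GB", 9 * 1024 ** 3, 2 * 1024 ** 3))
```

Here a directory that already holds 9 GB cannot take another 2 GB under a 10 GB limit, so the second call reports that the write would exceed the quota.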

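Geo-replication:

gluster volume geo-replication MASTER SLAVE start/status/stop

// Geo-replication is the disaster-recovery feature provided by the system: it backs up all of the system's data, asynchronously and incrementally, to another disk.

gluster volume geo-replication img 192.168.10.8:/data1/brick1 start

The command above starts the task that backs up the entire contents of the img volume to /data1/brick1 on 192.168.10.8. Note that the backup target must not be a brick that belongs to the system.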

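Rebalancing a volume:

Rebalancing the layout is necessary because the layout is static: when new bricks are added to an existing volume, newly created files still go to the old bricks, so the layout must be rebalanced for the new bricks to take effect. Fixing the layout only activates the new layout; it does not move old data onto it. To redistribute the data already in the volume once the new layout is active, the data itself must be rebalanced as well.

After expanding or shrinking a volume, the data has to be rebalanced across the servers. Rebalancing consists of two steps:

1. Fix Layout

Update the layout after the expansion or shrink so that files can be stored on the newly added nodes.

2. Migrate Data

Rebalance the existing data after the new bricks have been added.

* Fix Layout and Migrate Data

First fix the layout, then move the existing data (rebalance).

# gluster volume rebalance VOLNAME fix-layout start

# gluster volume rebalance VOLNAME migrate-data start

The two steps can also be combined into one:

# gluster volume rebalance VOLNAME start

# gluster volume rebalance VOLNAME status // check progress while the rebalance is running

# gluster volume rebalance VOLNAME stop // pause the rebalance; starting it again resumes from where it stopped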
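Why two steps are needed can be seen in a small model. It again uses `zlib.crc32` as a stand-in hash and an even split of the hash space, both assumptions of this sketch; the brick names and the helpers `brick_for` and `plan_migration` are hypothetical. Switching from the old brick list to the new one corresponds to fixing the layout, and the list of moves corresponds to migrating data.

```python
import zlib

HASH_SPACE = 2 ** 32


def brick_for(filename, bricks):
    """Map a file name to a brick by splitting the hash space evenly."""
    h = zlib.crc32(filename.encode("utf-8"))  # stand-in hash (assumption)
    index = min(h * len(bricks) // HASH_SPACE, len(bricks) - 1)
    return bricks[index]


def plan_migration(filenames, old_bricks, new_bricks):
    """List (file, source, target) for files whose brick changes under the new layout."""
    moves = []
    for name in filenames:
        src = brick_for(name, old_bricks)
        dst = brick_for(name, new_bricks)
        if src != dst:
            moves.append((name, src, dst))
    return moves


old = ["data1:/data", "data2:/data"]   # hypothetical bricks before expansion
new = old + ["data3:/data"]            # one brick added
files = [f"{i}.file" for i in range(1, 11)]
moves = plan_migration(files, old, new)
for name, src, dst in moves:
    print(f"{name}: {src} -> {dst}")
```

Files that do not appear in the move list are already where the new layout expects them, which is why a layout fix alone is cheap while data migration costs network and disk I/O.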

Viewing I/O information:

The profile command provides an interface for viewing the I/O information of every brick in a volume.

# gluster volume profile VOLNAME start // start profiling; I/O information can be viewed afterwards

# gluster volume profile VOLNAME info // show I/O information for every brick

# gluster volume profile VOLNAME stop // stop profiling when finished

Top monitoring:

The top command shows brick performance metrics such as read, write, file open calls, file read calls, file write calls, directory open calls, and directory read calls.

Every view accepts a top count (list-cnt); the default is 100.

# gluster volume top VOLNAME open [brick BRICK-NAME] [list-cnt cnt] // show open file descriptors

# gluster volume top VOLNAME read [brick BRICK-NAME] [list-cnt cnt] // show the most frequent read calls

# gluster volume top VOLNAME write [brick BRICK-NAME] [list-cnt cnt] // show the most frequent write calls

# gluster volume top VOLNAME opendir [brick BRICK-NAME] [list-cnt cnt] // show the most frequent directory open calls

# gluster volume top VOLNAME readdir [brick BRICK-NAME] [list-cnt cnt] // show the most frequent directory read calls

# gluster volume top VOLNAME read-perf [bs blk-size count count] [brick BRICK-NAME] [list-cnt cnt] // show the read performance of each brick

# gluster volume top VOLNAME write-perf [bs blk-size count count] [brick BRICK-NAME] [list-cnt cnt] // show the write performance of each brick
