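In an earlier post I ran a numerical analysis of the characters in Koei's Romance of the Three Kingdoms series. A classmate who read it suggested that I try clustering the data. It seemed feasible and might turn up something interesting, hence this post.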
Data Processing
The data I ended up analysing in that post was one large data frame. Its first 6 rows are shown below.
## 姓名 統率 武力 智力 魅力 運勢 版本 政治 陸指 水指 體力
## 1 丁奉 NA 22 81 29 47 三國志1 NA NA NA 81
## 2 于禁 NA 72 20 25 28 三國志1 NA NA NA 82
## 3 公孫瓚 NA 70 67 89 28 三國志1 NA NA NA 84
## 4 太史慈 NA 97 47 84 34 三國志1 NA NA NA 88
## 5 孔融 NA 82 61 50 77 三國志1 NA NA NA 84
## 6 文聘 NA 84 22 64 83 三國志1 NA NA NA 88
Its dimensions are \(6123 \times 11\).
dim(dt)
## [1] 6123 11
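Before clustering, the data needs some final cleanup. The first problem is duplicate names: 李丰 (Li Feng), 马忠 (Ma Zhong), 张南 (Zhang Nan) and 张温 (Zhang Wen). I could find them and rename them, for example 李丰(魏) and 李丰(蜀), but the same name may refer to characters from different factions in different installments, and their attributes are too unremarkable to tell them apart directly. I could also look up their courtesy names in the raw data to work out the faction. In the end I decided against it, because I don't think these four names affect the overall clustering, so I simply dropped them.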
same_name <- dt %>% mutate(姓名 = as.character(姓名)) %>%
group_by(姓名, 版本) %>%
summarise(次數 = n()) %>%
filter(次數 > 1) %>%
select(姓名) %>%
unique() %>%
unlist
same_name
## 姓名1 姓名2 姓名3 姓名4
## "李豐" "馬忠" "張南" "張溫"
dt_unique <- dt %>% filter(!(姓名 %in% same_name))
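To make the filtering rule concrete, here is a minimal base-R sketch on a made-up table. The column names `name` and `version` and all rows are hypothetical stand-ins for 姓名 and 版本. It shows that a name is only flagged when it appears more than once within the same version, not when it simply recurs across versions.

```r
# Hypothetical toy rows, not taken from the real data set
toy <- data.frame(
  name    = c("A", "A", "B", "B", "C"),
  version = c("v1", "v1", "v1", "v2", "v1"),
  stringsAsFactors = FALSE
)

# Count rows per (name, version) pair
counts <- aggregate(n ~ name + version,
                    data = transform(toy, n = 1L),
                    FUN = sum)

# A name is ambiguous if any single version contains it more than once
dup_names <- unique(counts$name[counts$n > 1])

# Drop every row that belongs to an ambiguous name
toy_unique <- toy[!(toy$name %in% dup_names), , drop = FALSE]

print(dup_names)
print(toy_unique)
```

Here "A" is removed because it occurs twice in v1, while "B" survives even though it appears in two versions.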
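Romance of the Three Kingdoms III introduced the attributes 陸指 (land command) and 水指 (naval command), which represent a character's ability to command on land and on water. It is the only installment with these attributes, so I average the two and use the result as 統率 (leadership). 體力 (stamina) and 運勢 (luck) also appear in only one installment, so I drop them as well.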
dt_drop <- dt_unique %>% mutate(統率 = ifelse(版本 == "三國志3",
(水指 + 陸指) / 2, 統率)) %>%
select(-水指, -陸指, -體力, -運勢)
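The same transformation can be sketched in base R on a tiny hypothetical table. The English column names and the values are assumptions for illustration only. Rows from the installment that has land and naval command get their average as leadership, and every other row keeps its existing value.

```r
# Hypothetical toy rows; "game3" stands in for the one installment
# that has separate land and naval command attributes
toy <- data.frame(
  name       = c("X", "Y", "Z"),
  version    = c("game2", "game3", "game3"),
  leadership = c(80, NA, NA),
  land       = c(NA, 70, 90),
  naval      = c(NA, 50, 40),
  stringsAsFactors = FALSE
)

# Average land and naval command only where the version matches
toy$leadership <- ifelse(toy$version == "game3",
                         (toy$land + toy$naval) / 2,
                         toy$leadership)

# Remove the columns that are no longer needed
toy <- toy[, !(names(toy) %in% c("land", "naval"))]

print(toy)
```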
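Finally, I average 武力 (war), 智力 (intelligence), 政治 (politics), 統率 (leadership) and 魅力 (charisma) across installments to get the final data. Some characters have no 統率, 政治 or 魅力 value in any installment, so their mean comes out as NaN; we drop those rows.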
dt_mean <- dt_drop %>% mutate(姓名 = as.character(姓名)) %>%
group_by(姓名) %>%
select(-版本) %>%
summarise_at(vars(統率:政治), mean, na.rm = T)
dt_final <- na.omit(dt_mean)
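A short base-R sketch of where the NaN comes from, using hypothetical names and values: when every value of an attribute is missing for a character, `mean(..., na.rm = TRUE)` averages an empty vector and returns NaN, and `na.omit` then removes that row because `is.na(NaN)` is TRUE.

```r
# Hypothetical toy rows: Q never has a politics value
toy <- data.frame(
  name     = c("P", "P", "Q", "Q"),
  war      = c(70, 80, 60, 62),
  politics = c(NA, 40, NA, NA),
  stringsAsFactors = FALSE
)

# Per-character means, ignoring missing values
means <- data.frame(
  name     = sort(unique(toy$name)),
  war      = as.vector(tapply(toy$war, toy$name, mean, na.rm = TRUE)),
  politics = as.vector(tapply(toy$politics, toy$name, mean, na.rm = TRUE)),
  stringsAsFactors = FALSE
)

# Rows containing NaN (or NA) are dropped
final <- na.omit(means)

print(means)
print(final)
```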
Let's take a look at the final data.
head(dt_final)
## # A tibble: 6 x 6
## 姓名 統率 武力 智力 魅力 政治
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 阿貴 36.0 74.0 36.0 27.0 29.0
## 2 阿會喃 57.3 72.4 27.2 28.7 32.8
## 3 鮑隆 55.7 68.3 37.1 33.8 26.0
## 4 鮑三娘 72.0 83.0 56.0 75.0 36.0
## 5 鮑信 63.9 56.7 61.2 61.4 63.9
## 6 鮑忠 42.5 73.0 45.0 52.0 42.0
K-means Clustering
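Clustering is an unsupervised learning method: our data has no labels. Supervised learning, by contrast, works on labelled data for prediction and classification, such as predicting house prices or classifying digits. Clustering lets us uncover structure hidden in the data; the elements within each resulting cluster may be similar in some respect.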
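We usually choose the number of clusters with the elbow method. The quantity to watch is the ratio of the between-cluster sum of squares to the total sum of squares, that is, the share of variance explained. Adding clusters pushes this ratio up no matter what, so to avoid overfitting the data we look for the point beyond which each extra cluster adds only a little. In practice we try a range of cluster counts and pick the one that comes right after the largest jump, where the curve starts to flatten.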
Judging from the plot, 5 clusters is a reasonable choice: the percentage of variance explained jumps noticeably between 4 and 5 and climbs much more slowly afterwards.
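k <- 15
# Fit k-means for 2..k clusters; each column of `multi` holds one fit
multi <- sapply(2:k, function(x) kmeans(dt_final[, 2:6], centers = x))
# Share of variance explained = between-cluster SS / total SS.
# seq_len(k - 1) indexes the k - 1 fits; note that 1:k-1 parses as (1:k) - 1.
perc_var_explained <- sapply(seq_len(k - 1),
  function(x) multi[, x]$betweenss / multi[, x]$totss) %>%
  unlist()
ggplot(data.frame(Clusters = 2:k, perc_var_explained =
  perc_var_explained)) +
  geom_point(aes(x = Clusters, y = perc_var_explained)) +
  ggtitle("Number of Clusters VS Explained Variance") +
  scale_x_continuous(limits = c(2, 15), breaks = seq(2, 15, 1))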
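To see what the elbow looks like when the right answer is known, here is a small self-contained sketch on synthetic data. The three 2-D blobs, the seed and the `nstart = 25` setting are assumptions made for this illustration and are not part of the analysis above.

```r
set.seed(42)

# Three well-separated synthetic blobs (hypothetical data)
centres <- rbind(c(0, 0), c(10, 0), c(0, 10))
toy <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(50, centres[i, 1]), rnorm(50, centres[i, 2]))
}))

# Share of variance explained for each candidate number of clusters
ks <- 2:8
ratio <- sapply(ks, function(k) {
  # nstart = 25 keeps the best of 25 random starts
  fit <- kmeans(toy, centers = k, nstart = 25)
  fit$betweenss / fit$totss
})

# Gain obtained by adding one more cluster
gain <- diff(ratio)

print(round(data.frame(k = ks, ratio = ratio), 3))
print(round(gain, 3))
```

The ratio jumps sharply from 2 to 3 clusters and barely moves afterwards, which is the pattern the elbow method looks for. On real data such as the character attributes the bend is usually much less clear-cut.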
Next we run the clustering with R's built-in kmeans function. We start by sampling 20 characters from each cluster to see whether any pattern stands out.
set.seed(1024)
cluster_mod <- kmeans(dt_final[, 2:6], centers = 5)
splitted <- split(dt_final$姓名, cluster_mod$cluster)
sample_splitted <- lapply(1:5,
function(x) splitted[[x]][sample(length(splitted[[x]]), 20)])
names(sample_splitted) <- paste0("Cluster", 1:5)
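At a rough glance, the first cluster is almost entirely military officers, the second is mostly characters who are strong across the board, and the third is mostly officers with low intelligence and fairly middling war. Nothing obvious stands out in Cluster4 or Cluster5.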
sample_splitted
## $Cluster1
## [1] "顏良" "高翔" "蔣班" "蹋頓" "孫異" "曹昂" "州泰"
## [8] "軻比能" "馬隆" "周昂" "公孫範" "袁尚" "周倉" "王桃"
## [15] "馬雲騄" "曹洪" "李通" "丁封" "馬騰" "于禁"
##
## $Cluster2
## [1] "徐盛" "龐娥" "賈詡" "呂範" "司馬懿" "周魴" "孫尚香"
## [8] "成公英" "向寵" "劉馥" "王渾" "韓遂" "袁遺" "董和"
## [15] "劉表" "李恢" "荀攸" "王濬" "羊祜" "楊肇"
##
## $Cluster3
## [1] "陳就" "張達" "鄂煥" "劉丞" "趙岑" "韓忠" "孫峻"
## [8] "沙摩柯" "焦觸" "武安國" "張象" "范疆" "魏續" "王方"
## [15] "陳式" "骨進" "胡車兒" "潘璋" "曹豹" "文醜"
##
## $Cluster4
## [1] "王允" "劉巴" "孫登" "袁胤" "劉琮" "李珪" "劉永" "荀惲" "許攸" "蒯越"
## [11] "閻象" "韓胤" "向朗" "趙累" "黨均" "鄒氏" "費詩" "甄氏" "許劭" "郭圖"
##
## $Cluster5
## [1] "李樂" "杜襲" "夏侯存" "雅丹" "王垢" "王韜" "左靈"
## [8] "黃琬" "公孫修" "關彝" "龐柔" "鮑忠" "郝萌" "太史享"
## [15] "李肅" "何進" "秦琪" "張先" "樂就" "徐榮"
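Now let's look at each cluster's average attributes to see whether they support these impressions. Cluster1 has high 統率 (leadership) and 武力 (war), probably because it contains many fierce warriors, which matches our guess. Cluster2 is high on every attribute apart from a somewhat lower 武力; most of its members excel in several attributes, which makes them the relatively all-round officers we all want in the game. Cluster3 has fairly high 統率 and 武力, though lower than Cluster1, and its other three attributes are pitifully low, so it is probably made up mostly of mediocre warriors. Cluster4 has high 智力 (intelligence), 魅力 (charisma) and 政治 (politics) but very low 統率 and 武力, so it is mostly civil officials. Cluster5 averages below 60 on every attribute; it is weak overall, and its only distinguishing feature is probably that it has none.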
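# Average attributes per cluster. The name column is dropped before
# averaging so that mean() is only applied to numeric columns.
lapply(1:5, function(x) dt_final %>%
  filter(姓名 %in% splitted[[x]]) %>%
  select(-姓名) %>%
  summarise_all(mean)) %>%
  rbindlist() %>%
  as.data.frame() %>%
  mutate(类 = paste0("Cluster", 1:5)) %>%
  select(6, 1:5)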
## 类 統率 武力 智力 魅力 政治
## 1 Cluster1 68.85052 73.66874 53.59062 62.21448 46.93281
## 2 Cluster2 71.77614 59.82638 76.96676 74.49536 74.50311
## 3 Cluster3 52.55453 67.35665 33.75154 36.73511 28.21515
## 4 Cluster4 28.85496 26.43456 67.65743 63.12949 70.54883
## 5 Cluster5 45.16717 59.14660 51.63146 52.34532 50.49595
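As a cross-check, the per-cluster averages should coincide with the `centers` component that `kmeans` returns, since each centre is the mean of the points assigned to it. The sketch below verifies this on synthetic data; the two blobs, the seed and the column names `a` and `b` are hypothetical.

```r
set.seed(1)

# Two synthetic blobs in 2-D (hypothetical data)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 8), ncol = 2))
colnames(toy) <- c("a", "b")

fit <- kmeans(toy, centers = 2)

# Per-cluster column means computed by hand
manual <- aggregate(toy, by = list(cluster = fit$cluster), FUN = mean)

# Compare with the centres stored in the fitted object
same <- isTRUE(all.equal(unname(as.matrix(manual[, c("a", "b")])),
                         unname(fit$centers)))

print(manual)
print(fit$centers)
print(same)
```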
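Here is the full list of characters in each cluster. The labels assigned in the code are 猛将 (fierce warriors), 相对全能 (relatively all-round), 智力较低武将 (low-intelligence warriors), 文官 (civil officials) and 总体较弱人物 (generally weak characters).
Note: characters with extremely high 武力 (war), such as 张飞 (Zhang Fei) and 文丑 (Wen Chou), end up in the weaker-warrior cluster 智力较低武将. One reason is that the other officers in that cluster have fairly low 武力. The other is that both have very low 智力 (intelligence), and this cluster gathers mostly low-intelligence officers.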
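The effect can be illustrated with plain Euclidean distances. The centroids below are the cluster averages from the table above, in the order 統率, 武力, 智力, 魅力, 政治. The attribute vector `x` is a hypothetical high-war, low-intelligence character and not anyone's actual averaged values. K-means broadly assigns a point to the closest centroid, so a very low 智力 and 政治 can outweigh a very high 武力.

```r
# Cluster averages copied from the table above
# (columns: leadership, war, intelligence, charisma, politics)
centroids <- rbind(
  Cluster1 = c(68.85052, 73.66874, 53.59062, 62.21448, 46.93281),
  Cluster2 = c(71.77614, 59.82638, 76.96676, 74.49536, 74.50311),
  Cluster3 = c(52.55453, 67.35665, 33.75154, 36.73511, 28.21515),
  Cluster4 = c(28.85496, 26.43456, 67.65743, 63.12949, 70.54883),
  Cluster5 = c(45.16717, 59.14660, 51.63146, 52.34532, 50.49595)
)
colnames(centroids) <- c("leadership", "war", "intelligence",
                         "charisma", "politics")

# Hypothetical character: very high war, low intelligence and politics
x <- c(leadership = 75, war = 98, intelligence = 30,
       charisma = 40, politics = 20)

# Euclidean distance from x to every centroid
dists <- sqrt(rowSums(sweep(centroids, 2, x)^2))
nearest <- names(which.min(dists))

print(round(dists, 1))
print(nearest)
```

For this made-up vector the closest centroid is Cluster3, even though its 武力 is far above that cluster's average.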
names(splitted) <- c("猛将", "相对全能", "智力较低武将", "文官", "总体较弱人物")
splitted
## $猛将
## [1] "鮑三娘" "卑衍" "步度根" "曹昂" "曹純" "曹洪" "曹仁"
## [8] "曹休" "曹彰" "徹里吉" "陳到" "陳騫" "陳武" "淳于瓊"
## [15] "戴陵" "單經" "鄧賢" "鄧忠" "丁封" "丁原" "董衡"
## [22] "董襲" "董卓" "馮習" "傅僉" "傅彤" "傅嬰" "甘寧"
## [29] "高幹" "高覽" "高順" "高翔" "公孫度" "公孫範" "公孫康"
## [36] "公孫續" "公孫淵" "公孫越" "公孫瓚" "關平" "關索" "關統"
## [43] "關興" "關銀屏" "毌丘儉" "韓當" "賀齊" "侯成" "侯選"
## [50] "呼廚泉" "胡奮" "胡烈" "胡淵" "胡遵" "花鬘" "華雄"
## [57] "黃蓋" "黃忠" "紀靈" "姜敘" "蔣班" "蔣義渠" "蔣欽"
## [64] "焦彝" "句安" "軻比能" "雷銅" "李典" "李通" "李歆"
## [71] "廖化" "凌操" "凌統" "留略" "留平" "留贊" "劉封"
## [78] "劉璝" "劉磐" "樓班" "呂玲綺" "呂義" "馬超" "馬岱"
## [85] "馬隆" "馬騰" "馬鐵" "馬休" "馬雲騄" "孟達" "孟獲"
## [92] "寗隨" "龐德" "龐會" "牽弘" "丘力居" "全琮" "全端"
## [99] "全禕" "全懌" "沈瑩" "盛曼" "石苞" "士徽" "宋謙"
## [106] "蘇飛" "孫觀" "孫冀" "孫禮" "孫韶" "孫秀" "孫異"
## [113] "孫震" "蹋頓" "太史慈" "唐彬" "唐咨" "陶濬" "田楷"
## [120] "王惇" "王平" "王頎" "王桃" "王悅" "魏邈" "魏延"
## [127] "文虎" "文聘" "文鴦" "吾彥" "吳班" "吳景" "吳蘭"
## [134] "伍習" "伍延" "夏侯霸" "夏侯德" "夏侯蘭" "夏侯尚" "夏侯威"
## [141] "夏侯淵" "徐晃" "荀愷" "閻行" "顏良" "嚴顏" "楊奉"
## [148] "楊懷" "楊任" "楊欣" "雍闓" "於夫羅" "于禁" "于詮"
## [155] "袁尚" "袁術" "袁譚" "樂進" "臧霸" "張苞" "張承"
## [162] "張虎" "張濟" "張梁" "張曼成" "張任" "張氏" "張衛"
## [169] "張郃" "張繡" "張勳" "張燕" "張楊" "張翼" "張英"
## [176] "張遵" "趙昂" "趙廣" "趙統" "州泰" "周昂" "周倉"
## [183] "周善" "周泰" "朱靈" "朱然" "朱異" "朱讚" "諸葛靚"
## [190] "諸葛尚" "鄒靖" "祖茂" "左奕"
##
## $相对全能
## [1] "鮑信" "步闡" "步協" "步騭" "蔡瑁" "曹操" "曹丕"
## [8] "曹叡" "曹真" "陳表" "陳登" "陳宮" "陳泰" "成公英"
## [15] "程普" "程昱" "鄧艾" "鄧芝" "丁奉" "董承" "董和"
## [22] "董厥" "杜畿" "杜預" "法正" "費耀" "費禕" "高柔"
## [29] "關寧" "關羽" "毌丘甸" "郭淮" "韓遂" "闞澤" "郝昭"
## [36] "胡濟" "胡質" "皇甫嵩" "黃崇" "黃權" "黃月英" "霍峻"
## [43] "霍弋" "賈範" "賈逵" "賈詡" "駱統" "姜維" "蔣琬"
## [50] "沮授" "李恢" "李嚴" "梁習" "劉備" "劉表" "劉諶"
## [57] "劉馥" "劉劭" "劉焉" "劉虞" "盧植" "魯淑" "魯肅"
## [64] "陸景" "陸凱" "陸抗" "陸遜" "羅憲" "呂岱" "呂範"
## [71] "呂據" "呂蒙" "馬良" "馬謖" "滿寵" "孟建" "龐娥"
## [78] "龐統" "牽招" "審配" "石韜" "士燮" "司馬孚" "司馬師"
## [85] "司馬望" "司馬炎" "司馬懿" "司馬攸" "司馬昭" "孫策" "孫桓"
## [92] "孫堅" "孫皎" "孫靜" "孫權" "孫尚香" "孫休" "孫瑜"
## [99] "田疇" "田豐" "田豫" "王昶" "王甫" "王渾" "王基"
## [106] "王經" "王淩" "王戎" "王濬" "王異" "衛瓘" "吾粲"
## [113] "吳懿" "夏侯惇" "夏侯和" "夏侯惠" "向寵" "辛評" "徐盛"
## [120] "徐庶" "荀攸" "荀彧" "閻柔" "羊祜" "楊阜" "楊濟"
## [127] "楊儀" "楊肇" "虞汜" "袁紹" "袁熙" "袁遺" "張寶"
## [134] "張既" "張角" "張遼" "張魯" "張特" "張悌" "張嶷"
## [141] "趙雲" "鍾會" "鍾離牧" "周魴" "周瑜" "朱桓" "朱據"
## [148] "朱儁" "諸葛誕" "諸葛瑾" "諸葛恪" "諸葛亮" "諸葛喬" "諸葛瞻"
##
## $智力较低武将
## [1] "阿貴" "阿會喃" "鮑隆" "卞喜" "波才" "蔡和"
## [7] "蔡陽" "蔡中" "曹豹" "曹性" "曹訓" "陳橫"
## [13] "陳紀" "陳就" "陳蘭" "陳式" "陳應" "成廉"
## [19] "成宜" "程銀" "程遠志" "鄧茂" "鄧義" "典滿"
## [25] "典韋" "董旻" "董荼那" "俄何燒戈" "蛾遮塞" "鄂煥"
## [31] "樊稠" "樊能" "范疆" "方悅" "苻健" "傅士仁"
## [37] "高定" "高昇" "龔都" "骨進" "管亥" "毌丘秀"
## [43] "郭馬" "郭汜" "郭援" "韓德" "韓暹" "韓玄"
## [49] "韓忠" "何儀" "何植" "胡車兒" "胡赤兒" "胡軫"
## [55] "許褚" "許儀" "黃亂" "黃祖" "蔣奇" "蔣舒"
## [61] "焦觸" "金環三結" "金旋" "雷薄" "冷苞" "李別"
## [67] "李傕" "李堪" "李鵬" "李異" "梁綱" "梁寬"
## [73] "梁興" "粱剛" "劉辯" "劉丞" "劉岱" "劉宏"
## [79] "劉辟" "路昭" "呂布" "呂常" "呂曠" "呂威璜"
## [85] "呂翔" "馬邈" "馬玩" "忙牙長" "孟坦" "孟優"
## [91] "迷當大王" "糜芳" "木鹿大王" "穆順" "牛輔" "牛金"
## [97] "潘鳳" "潘臨" "潘璋" "裴元紹" "千万" "強端"
## [103] "橋蕤" "秦朗" "區星" "麴義" "沙摩柯" "施朔"
## [109] "師纂" "宋憲" "眭固" "眭元進" "孫綝" "孫皓"
## [115] "孫峻" "孫歆" "孫翊" "孫仲" "譚雄" "田續"
## [121] "土安" "王方" "王門" "王雙" "王同" "王真"
## [127] "王忠" "魏續" "文醜" "文欽" "烏延" "武安國"
## [133] "兀突骨" "奚泥" "夏侯楙" "謝旌" "邢道榮" "徐質"
## [139] "閻宇" "嚴白虎" "嚴綱" "嚴輿" "嚴政" "晏明"
## [145] "楊昂" "楊柏" "楊醜" "楊鋒" "楊秋" "楊祚"
## [151] "尹禮" "尤突" "于糜" "俞涉" "越吉" "樂綝"
## [157] "笮融" "張達" "張飛" "張橫" "張闓" "張球"
## [163] "張象" "張顗" "張著" "趙岑" "趙弘" "周旨"
## [169] "朱褒" "祝融" "鄒丹"
##
## $文官
## [1] "卞氏" "蔡氏" "蔡琰" "蔡邕" "曹沖" "曹芳"
## [7] "曹奐" "曹熊" "曹植" "岑昏" "陳珪" "陳矯"
## [13] "陳琳" "陳群" "陳壽" "陳震" "程秉" "程武"
## [19] "崔林" "崔琰" "崔州平" "大喬" "黨均" "貂蟬"
## [25] "丁儀" "董白" "董朝" "董允" "董昭" "杜瓊"
## [31] "杜氏" "樊建" "樊氏" "費詩" "逢紀" "伏完"
## [37] "傅幹" "傅嘏" "傅巽" "高堂隆" "耿紀" "耿武"
## [43] "顧譚" "顧雍" "關純" "管輅" "郭嘉" "郭氏"
## [49] "郭圖" "郭奕" "郭攸之" "國淵" "韓嵩" "韓胤"
## [55] "何晏" "和洽" "胡沖" "許靖" "許劭" "許汜"
## [61] "許攸" "華覈" "華佗" "華歆" "桓範" "桓階"
## [67] "黃承彥" "黃皓" "賈充" "蹇碩" "簡雍" "蔣斌"
## [73] "蔣幹" "蔣濟" "金禕" "孔融" "孔伷" "蒯良"
## [79] "蒯越" "李孚" "李珪" "李儒" "李勝" "李氏"
## [85] "廖立" "劉巴" "劉禪" "劉琮" "劉和" "劉理"
## [91] "劉琦" "劉氏" "劉協" "劉璿" "劉曄" "劉永"
## [97] "劉璋" "柳甫" "婁圭" "樓玄" "陸績" "陸鬱生"
## [103] "呂伯奢" "呂凱" "馬鈞" "孟宗" "糜氏" "糜竺"
## [109] "禰衡" "潘濬" "龐羲" "裴秀" "濮陽興" "橋玄"
## [115] "譙周" "秦宓" "全尚" "尚弘" "尚舉" "邵悌"
## [121] "士壹" "司馬徽" "司馬朗" "宋忠" "孫登" "孫和"
## [127] "孫亮" "孫魯班" "孫乾" "孫氏" "陶謙" "滕脩"
## [133] "滕胤" "萬彧" "王粲" "王楷" "王朗" "王累"
## [139] "王肅" "王祥" "王修" "王業" "王允" "王則"
## [145] "韋康" "韋昭" "魏諷" "魏攸" "溫恢" "吳綱"
## [151] "吳國太" "吳質" "郤正" "戲志才" "夏侯令女" "夏侯氏"
## [157] "夏侯玄" "向朗" "小喬" "辛敞" "辛毗" "辛憲英"
## [163] "徐邈" "徐氏" "薛珝" "薛瑩" "薛綜" "荀諶"
## [169] "荀爽" "荀勗" "荀顗" "荀惲" "閻圃" "閻象"
## [175] "嚴畯" "楊彪" "楊弘" "楊洪" "楊密" "楊琦"
## [181] "楊氏" "楊修" "伊籍" "尹大目" "尹默" "于吉"
## [187] "虞翻" "袁渙" "袁胤" "張春華" "張紘" "張華"
## [193] "張緝" "張節" "張鈞" "張讓" "張紹" "張世平"
## [199] "張松" "張休" "張昭" "趙累" "甄氏" "鄭度"
## [205] "鍾繇" "鍾毓" "諸葛均" "宗預" "鄒氏" "左慈"
##
## $总体较弱人物
## [1] "鮑忠" "蔡勳" "曹安民" "曹髦" "曹爽" "曹羲"
## [7] "曹彥" "曹宇" "車冑" "陳生" "崔勇" "帶來洞主"
## [13] "鄧龍" "董璜" "杜襲" "朵思大王" "費觀" "費棧"
## [19] "高沛" "公孫恭" "公孫修" "鞏志" "關靖" "關彝"
## [25] "韓福" "韓馥" "韓浩" "韓莒子" "韓猛" "郝萌"
## [31] "何進" "胡班" "胡才" "許昌" "許貢" "黃琬"
## [37] "季雍" "賈華" "沮鵠" "柯吾" "孔秀" "李封"
## [43] "李蒙" "李肅" "李暹" "李樂" "梁緒" "劉豹"
## [49] "劉度" "劉晙" "劉先" "劉賢" "劉勳" "劉循"
## [55] "劉延" "劉繇" "倫直" "呂公" "呂建" "呂虔"
## [61] "馬漢" "馬延" "馬遵" "芒中" "毛玠" "龐柔"
## [67] "橋瑁" "秦良" "秦琪" "丘本" "丘建" "去卑"
## [73] "全紀" "任峻" "申耽" "申儀" "審榮" "史渙"
## [79] "史蹟" "士匡" "士祗" "司馬伷" "宋果" "蘇由"
## [85] "孫匡" "孫朗" "太史享" "田章" "王昌" "王垢"
## [91] "王含" "王伉" "王匡" "王韜" "王威" "王植"
## [97] "王子服" "吳敦" "吳巨" "吳碩" "吳子蘭" "伍瓊"
## [103] "夏侯存" "夏侯恩" "辛明" "徐榮" "徐商" "薛蘭"
## [109] "薛禮" "薛悌" "雅丹" "楊齡" "楊松" "尹奉"
## [115] "尹楷" "尹賞" "袁燿" "樂就" "張布" "張邈"
## [121] "張先" "張允" "趙範" "趙衢" "趙叡" "鍾進"
## [127] "周昕" "朱光" "朱治" "諸葛緒" "卓膺" "宗寶"
## [133] "左靈"