KnowledgeHierarchy
RQ: What is a good question?
Zhihu tag tree
Zhihu data scraping: https://github.com/7sDream/zhihu-oauth
https://www.zhihu.com/topic/19776749/organize/entire#anchor-children-topic
import pandas as pd

# Topic tree dump from https://github.com/Lynxmac/zhihu_topic_tree/
with open('/Users/datalab/bigdata/zhihu_topic_tree.txt', 'r', encoding='gb18030') as f:
    lines = f.readlines()

df_list = []
for index, line in enumerate(lines):
    # The length of the prefix before the first '─' is the raw depth measure
    a = line.rstrip().split('─')
    hierarchy = len(a[0])
    if index > 312482:
        hierarchy -= 1
    sign = a[0][-1]
    # The last segment has the form '<id>_<name>'; the name may itself contain '_'
    b = a[-1].split('_', maxsplit=2)
    ids = b[0]
    name = '_'.join(b[1:])
    df_list.append([index, hierarchy, sign, ids, name, line])
df = pd.DataFrame(df_list, columns=['loc', 'hierarchy', 'sign', 'id', 'name', 'line'])
# Clean the hierarchy variable: round each prefix length down to the nearest 3k + 1
new_hierarchy = []
for i in df.hierarchy:
    if i % 3 == 1:
        new_hierarchy.append(i)
    elif i % 3 == 2:
        new_hierarchy.append(i - 1)
    elif i % 3 == 0:
        new_hierarchy.append(i - 2)
df['new_hierarchy'] = new_hierarchy
# Map prefix lengths 1, 4, 7, ... to integer levels 1, 2, 3, ...
df['good_hierarchy'] = [(i - 1) // 3 + 1 for i in new_hierarchy]
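The two cleaning steps above can be collapsed into a single expression, because floor division already rounds the prefix length down to the nearest value of the form 3k + 1. This is a minimal sketch, not part of the original notebook; the function name `level_from_prefix_length` is hypothetical, and the 3-character indent step is inferred from the modulo-3 logic above.

```python
def level_from_prefix_length(n):
    """Map a raw prefix length to a 1-based tree level.

    Assumes each extra level adds 3 characters to the prefix,
    so lengths 1-3 are level 1, 4-6 are level 2, and so on.
    """
    # Floor division rounds down and rescales in one step.
    return (n - 1) // 3 + 1


for n in (1, 2, 3, 4, 7, 10):
    print(n, level_from_prefix_length(n))
```

Applied to the DataFrame, this would read `df['good_hierarchy'] = [level_from_prefix_length(i) for i in df.hierarchy]`.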
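# Add the missing ids for the level-1 topics: (row, topic id, topic name)
id_list = [(29855, 19778298, '「形而上」话题'),            # "metaphysical" topics
           (178555, 19560891, '产业'),                    # industry
           (190122, 19618774, '学科'),                    # academic disciplines
           (223661, 19778287, '实体'),                    # entities
           (312482, 19778317, '生活、艺术、文化与活动')]  # life, art, culture and activities
for row, topic_id, topic_name in id_list:
    # Use .loc to avoid chained assignment, which may silently write to a copy
    df.loc[row, 'id'] = topic_id
    df.loc[row, 'name'] = topic_name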
# Drop rows whose id cannot be parsed as an integer
error_id_index = []
for k, i in enumerate(df.id):
    try:
        int(i)
    except (TypeError, ValueError):
        error_id_index.append(k)
print(len(error_id_index))
df = df.drop(error_id_index)
df['id'] = [int(i) for i in df.id]
# Construct the network: link each topic to the nearest preceding topic
# that sits one level higher. This takes around 3 hours.
from flownetwork.flownetwork import flushPrint

net = []
for i in df.index:
    if i % 100 == 0:
        flushPrint(i)
    ids = df['id'][i]
    hierarchy = df['good_hierarchy'][i]
    loc = df['loc'][i]
    if hierarchy == 1:
        net.append(('root', ids))
    else:
        upper_hierarchy = hierarchy - 1
        upper_nodes = df[df['good_hierarchy'] == upper_hierarchy]
        # Nearest row above the current one at the parent level
        upper_node_loc = [j for j in upper_nodes['loc'] if (loc - j) > 0][-1]
        upper_node_id = df['id'][df['loc'] == upper_node_loc]
        net.append((int(upper_node_id.iloc[0]), ids))
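The loop above rescans the whole table for every row, which is likely why it runs for hours. Since the rows are already in file order, the same "nearest preceding node one level up" rule can be applied in a single pass by remembering the last node seen at each level. The following is a sketch under that assumption, not the notebook's original method; `build_edges` and its arguments are hypothetical names.

```python
def build_edges(rows, root="root"):
    """Link each node to the most recent preceding node one level higher.

    rows: iterable of (node_id, level) pairs in file order, level 1 = top.
    Returns a list of (parent_id, node_id) edges.
    """
    last_at_level = {}  # level -> id of the most recent node seen at that level
    edges = []
    for node_id, level in rows:
        # Level-1 nodes hang off the artificial root; a missing parent level raises KeyError.
        parent = root if level == 1 else last_at_level[level - 1]
        edges.append((parent, node_id))
        last_at_level[level] = node_id
    return edges


# Toy input with made-up ids, only to show the shape of the result
toy_rows = [(10, 1), (11, 2), (12, 3), (13, 2), (20, 1), (21, 2)]
toy_edges = build_edges(toy_rows)
print(toy_edges)
```

With the DataFrame built earlier, the call would be `net = build_edges(zip(df['id'], df['good_hierarchy']))`, which visits each row once instead of filtering the full table per row.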
Stack Overflow tag network
Stack Overflow uses tags to organize questions; the full list of tags is at https://stackoverflow.com/tags
Given a tag, such as javascript, you can see the questions tagged with it: https://stackoverflow.com/questions/tagged/javascript
Stack Overflow also shows the related tags for each tag. For example, the javascript tag is related to the following tags (the number after × counts the questions that carry both tags):
Related Tags
- jquery × 518122
- html × 317264
- css × 146448
- angularjs × 116430
- php × 111251
- node.js × 92095
- ajax × 88493
- json × 58117
- html5 × 51808
- reactjs × 51308
- arrays × 49362
- asp.net × 31550
- regex × 28362
- twitter-bootstrap × 24516
- angular × 24174
- c# × 23339
- forms × 22346
- google-chrome × 21292
- d3.js × 21232
- dom × 19442
- google-maps × 18658
- typescript × 18244
- java × 17724
- canvas × 17054
- express × 16103
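The related-tag list above can be read as the weighted edges of a tag network centered on javascript. A minimal sketch of that conversion follows, using only the first few counts from the list; the function name `related_to_edges` and the (source, target, weight) tuple layout are assumptions made for illustration.

```python
def related_to_edges(tag, related):
    """Turn {related_tag: shared_question_count} into weighted edges.

    Returns (tag, related_tag, count) tuples sorted by descending count.
    """
    edges = [(tag, other, count) for other, count in related.items()]
    return sorted(edges, key=lambda edge: edge[2], reverse=True)


# Counts copied from the first entries of the list above
javascript_related = {
    "jquery": 518122,
    "html": 317264,
    "css": 146448,
    "angularjs": 116430,
}
tag_edges = related_to_edges("javascript", javascript_related)
for source, target, weight in tag_edges:
    print(source, target, weight)
```

Repeating this for every tag and concatenating the edge lists would give a tag co-occurrence network comparable to the Zhihu topic tree.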
Quora
https://www.quora.com/topic/Computer-Science
https://github.com/tapaswenipathak/pyQTopic/blob/master/qtopic/pyqtopics.py