KnowledgeHierarchy

RQ: What is a good question?

Zhihu tag tree

Zhihu data scraping: https://github.com/7sDream/zhihu-oauth

https://www.zhihu.com/topic/19776749/organize/entire#anchor-children-topic

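import pandas as pd

# topic tree dump from https://github.com/Lynxmac/zhihu_topic_tree/
with open('/Users/datalab/bigdata/zhihu_topic_tree.txt', 'r', encoding='gb18030') as f:
    lines = f.readlines()

# parse each line into: line number, prefix length, branch sign, topic id, topic name
df_list = []
for index, line in enumerate(lines):
    # the text before the first '─' is the tree-drawing prefix; its length encodes the depth
    a = line.rstrip().split('─')
    hierarchy = len(a[0])
    # for lines after 312482 (the level-1 topic '生活、艺术、文化与活动') the prefix length is reduced by one
    if index > 312482:
        hierarchy -= 1
    sign = a[0][-1]  # last character of the prefix
    # the text after the last '─' has the form '<id>_<name>'; the name may itself contain '_'
    b = a[-1].split('_', maxsplit=2)
    ids = b[0]
    name = '_'.join(b[1:])
    df_list.append([index, hierarchy, sign, ids, name, line])

df = pd.DataFrame(df_list, columns=['loc', 'hierarchy', 'sign', 'id', 'name', 'line'])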

# clean the hierarchy variable: round each prefix length down to the
# nearest value of the form 3k + 1, then convert it to a depth (1, 2, 3, ...)
new_hierarchy = [i - (i - 1) % 3 for i in df.hierarchy]

df['new_hierarchy'] = new_hierarchy
df['good_hierarchy'] = [(i - 1) // 3 + 1 for i in new_hierarchy]

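# add missing id for level 1 topics: (line number, topic id, topic name)
id_list = [(29855, 19778298, '「形而上」话题'),
           (178555, 19560891, '产业'),
           (190122, 19618774, '学科'),
           (223661, 19778287, '实体'),
           (312482, 19778317, '生活、艺术、文化与活动')]

# use .loc instead of chained indexing, which may assign to a temporary copy
for row, topic_id, topic_name in id_list:
    df.loc[row, 'id'] = topic_id
    df.loc[row, 'name'] = topic_name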

# delete rows whose id cannot be parsed as an integer
error_id_index = []
for k, i in enumerate(df.id):
    try:
        int(i)
    except (TypeError, ValueError):
        error_id_index.append(k)
print(len(error_id_index))

df = df.drop(error_id_index)
df['id'] = [int(i) for i in df.id]

# construct network
# it takes around 3 hours
# for each topic, search for the nearest preceding topic one level up and link them

from flownetwork.flownetwork import flushPrint

net = []
for i in df.index:
    if i % 100 == 0:
        flushPrint(i)
    ids = df['id'][i]
    hierarchy = df['good_hierarchy'][i]
    loc = df['loc'][i]
    if hierarchy == 1:
        net.append(('root', ids))
    else:
        upper_hierarchy = hierarchy - 1
        upper_nodes = df[df['good_hierarchy'] == upper_hierarchy]
        # the last upper-level row that appears before the current row is the parent
        upper_node_loc = [j for j in upper_nodes['loc'] if (loc - j) > 0][-1]
        upper_node_id = df['id'][df['loc'] == upper_node_loc]
        # take the single matching value explicitly instead of calling int() on a Series
        net.append((int(upper_node_id.iloc[0]), ids))
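
The parent search above filters the whole table once per row, which is likely why it takes hours. A minimal sketch of a single-pass alternative follows, assuming the rows are visited in file order: it remembers the most recent topic seen at each depth and links each row to the remembered topic one level up. The function name build_edges and the toy ids are hypothetical and only illustrate the idea; rows with no earlier topic one level up are skipped here, which is an assumption (the loop above would raise an error in that case).

```python
def build_edges(rows):
    """Link each (topic_id, depth) pair to the nearest preceding pair one level up.

    rows must be in the same order as the lines of the tree file.
    """
    last_seen = {}  # depth -> id of the most recent topic at that depth
    edges = []
    for topic_id, depth in rows:
        depth = int(depth)
        if depth == 1:
            edges.append(('root', topic_id))
        elif depth - 1 in last_seen:
            edges.append((last_seen[depth - 1], topic_id))
        # assumption: rows without an earlier topic one level up are skipped
        last_seen[depth] = topic_id
    return edges


# toy input with made-up ids, only to illustrate the traversal
toy_rows = [(1, 1), (2, 2), (3, 3), (4, 2), (5, 1)]
toy_edges = build_edges(toy_rows)
print(toy_edges)
```

With the DataFrame built earlier, the call would be build_edges(zip(df['id'], df['good_hierarchy'])), which should give the same links as the loop above in one pass over the rows.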

StackOverflow tag network

StackOverflow uses tags to organize questions; see the list of tags here: https://stackoverflow.com/tags

Given a tag, such as javascript, you can see the questions tagged with it: https://stackoverflow.com/questions/tagged/javascript

Note that StackOverflow also shows the related tags for each tag. For example, the javascript tag is related to the following tags (the number after × is the number of questions that carry both tags):

Related Tags

  • jquery × 518122
  • html × 317264
  • css × 146448
  • angularjs × 116430
  • php × 111251
  • node.js × 92095
  • ajax × 88493
  • json × 58117
  • html5 × 51808
  • reactjs × 51308
  • arrays × 49362
  • asp.net × 31550
  • regex × 28362
  • twitter-bootstrap × 24516
  • angular × 24174
  • c# × 23339
  • forms × 22346
  • google-chrome × 21292
  • d3.js × 21232
  • dom × 19442
  • google-maps × 18658
  • typescript × 18244
  • java × 17724
  • canvas × 17054
  • express × 16103
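
A related-tag list like the one above can be turned into weighted edges of a tag network. The sketch below is one possible way to do it, assuming the list has been copied as plain text with one "name × count" entry per line; the function name parse_related_tags is hypothetical, and only the first three entries are reproduced.

```python
def parse_related_tags(tag, text):
    """Return (tag, related_tag, count) edges parsed from 'name × count' lines."""
    edges = []
    for line in text.splitlines():
        # drop surrounding whitespace and an optional leading bullet
        line = line.strip().lstrip('•').strip()
        if not line:
            continue
        # split on the last '×' so the tag name is kept whole
        name, count = line.rsplit('×', 1)
        edges.append((tag, name.strip(), int(count)))
    return edges


# first three entries of the javascript list above
related_text = """
  • jquery × 518122
  • html × 317264
  • css × 146448
"""

js_edges = parse_related_tags('javascript', related_text)
print(js_edges)
```

Repeating this for every tag gives a weighted edge list that a network library can load directly.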

Quora

https://www.quora.com/topic/Computer-Science

https://github.com/tapaswenipathak/pyQTopic/blob/master/qtopic/pyqtopics.py

https://github.com/csu/quora-api

References

User:Wangchj04