Building an Academic Paper Recommender

Useful for people who want to look up citations but have not yet discovered Google Scholar
Uses the Microsoft Academic Graph dataset from the Open Academic Graph
- Number of papers: 166,192,182
- Dataset size: 104 GB
- Number of columns: 18

Item-Based Collaborative Filtering

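First developed at Amazon to improve on user-based algorithms for product recommendation
Provides recommendations based on the similarity between items
1. Generate information about the items
2. Compute similarity scores between items
3. Rank by score and recommend the top $N$ similar items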
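The three steps above can be sketched on toy data. This is a minimal illustration only: `top_n_similar` and `toy_items` are hypothetical names that do not appear in the notebook, and step 1 is assumed to have already produced a binary item-by-feature array.

```python
import numpy as np

def top_n_similar(item_features, item_index, n):
    """Return (index, score) pairs for the n items most similar to item_index."""
    # Step 1 is assumed done: item_features is an (items x features) array.
    features = np.asarray(item_features, dtype=float)
    # Step 2: cosine similarity between the query item and every item.
    norms = np.linalg.norm(features, axis=1)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = features / norms[:, None]
    scores = unit @ unit[item_index]
    # Step 3: rank by score (highest first) and drop the query item itself.
    ranked = [i for i in np.argsort(-scores, kind="stable") if i != item_index]
    return [(int(i), float(scores[i])) for i in ranked[:n]]

toy_items = [
    [1, 0, 1, 0],  # item 0
    [1, 0, 1, 1],  # item 1: shares two features with item 0
    [0, 1, 0, 0],  # item 2: shares nothing with item 0
    [1, 0, 0, 0],  # item 3: shares one feature with item 0
]
recommendations = top_n_similar(toy_items, item_index=0, n=2)
print(recommendations)
```

Item 1 ranks ahead of item 3 because it shares more features with item 0, and item 2 does not make the top two.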

First Pass: Data Import, Cleaning, and Feature Parsing

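Hypothesis: papers published at about the same time in similar fields of study are assumed to be the most useful to the user
- Take a simple approach: parse the relevant fields from a subsample of the full dataset
Similarity: cosine similarity
- Use its complement, the cosine distance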
- $D_C(A,B)=1 - S_C(A,B)$
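A small numeric check of the relation between the two quantities. The helper `cosine_similarity_1d` is a hypothetical name written for this note; `scipy.spatial.distance.cosine` is the SciPy function that returns the distance, not the similarity.

```python
import numpy as np
from scipy.spatial.distance import cosine  # returns the cosine *distance*

def cosine_similarity_1d(a, b):
    """S_C(A, B) = (A . B) / (||A|| * ||B||) for two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = [1, 1, 0]
b = [0, 1, 0]
similarity = cosine_similarity_1d(a, b)  # S_C(A, B)
distance = cosine(a, b)                  # D_C(A, B) = 1 - S_C(A, B)
print(similarity, distance)
```

A larger similarity means a smaller distance, so the sort direction has to match whichever quantity is being ranked.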

Academic Paper Recommender: Naive Approach

Inspect the dataset and decide which fields to use

Example 9-1. Import and filter data

import pandas as pd

model_df = pd.read_json('mag_papers_0.txt', lines=True) # 1 million records
model_df.shape

(1000000, 19)

# Use only 20,000 records in this example
df20000 = model_df.iloc[:20000,:]
df20000.shape

(20000, 19)

df20000.to_json('mag_subset20K.txt', orient='records', lines=True)

model_df = pd.read_json('mag_subset20K.txt', lines=True)
model_df.shape

(20000, 19)

model_df.columns

Index(['abstract', 'authors', 'doc_type', 'doi', 'fos', 'id', 'issue',
       'keywords', 'lang', 'n_citation', 'page_end', 'page_start', 'publisher',
       'references', 'title', 'url', 'venue', 'volume', 'year'],
      dtype='object')

# Exclude papers that are not in English.
# Drop papers with duplicate titles
model_df = model_df[model_df.lang == 'en'].drop_duplicates(subset='title', keep='first')

# Keep only the abstract, authors, fos, keywords, year, and title columns
model_df = model_df.drop(['doc_type', 
                          'doi', 'id', 
                          'issue', 'lang', 
                          'n_citation', 
                          'page_end', 
                          'page_start', 
                          'publisher', 
                          'references',
                          'url', 
                          'venue', 
                          'volume'], axis=1)

# We end up using only about 10,000 papers.
model_df.shape

(10399, 6)

model_df.head(2)

for col in model_df.columns:
    print('{}: {}'.format(col, model_df[col].isnull().sum()))

abstract: 4393
authors: 1
fos: 1733
keywords: 4294
title: 0
year: 0

Field name	Description	Field type	# NaN
abstract	Paper abstract	string	4393
authors	Author names and affiliations	list of dict, keys = name, org	1
fos	Fields of study	list of strings	1733
keywords	Keywords	list of strings	4294
title	Paper title	string	0
year	Year of publication	int	0

Example 9-2. Collaborative filtering stage 1: Build the item feature matrix

# Fields of study
unique_fos = sorted(list({feature
                          for paper_row in model_df.fos.fillna('0')
                          for feature in paper_row }))

# Publication year
unique_year = sorted(model_df['year'].astype('str').unique())

print('unique_fos  :', len(unique_fos))
print('unique_year :', len(unique_year))
print('total       :', len(unique_fos + unique_year))

unique_fos  : 7604
unique_year : 156
total       : 7760

def feature_array(x, unique_array):
    row_dict = {}
    for i in x.index:
        var_dict = {}
        
        for j in range(len(unique_array)):
            if type(x[i]) is list:
                if unique_array[j] in x[i]:
                    var_dict.update({unique_array[j]: 1})
                else:
                    var_dict.update({unique_array[j]: 0})
            else:
                if unique_array[j] == str(x[i]):
                    var_dict.update({unique_array[j]: 1})
                else:
                    var_dict.update({unique_array[j]: 0})
        
        row_dict.update({i : var_dict})
    
    feature_df = pd.DataFrame.from_dict(row_dict, dtype='str').T
    
    return feature_df

year_features = feature_array(model_df['year'], unique_year)
year_features.shape

(10399, 156)

year_features.head()

fos_features = feature_array(model_df['fos'], unique_fos)
fos_features.shape

(10399, 7604)

fos_features.head()

first_features = fos_features.join(year_features).T
first_features.shape

(7760, 10399)

first_features.head()

from sys import getsizeof

print('original array: {}'.format(getsizeof(model_df)))
print('first feature array: {}'.format(getsizeof(first_features)))

original array: 13892361
first feature array: 5003741538
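The feature array is roughly 5 GB although almost every entry is zero. One possible alternative, shown here only as a sketch on toy data, is to build the multi-hot field-of-study matrix directly in sparse form with scikit-learn's `MultiLabelBinarizer`. The names `toy_fos`, `fos_lists`, and `toy_fos_matrix` are hypothetical, and the notebook itself does not use this encoder.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-in for model_df['fos']: lists of fields, with a missing entry.
toy_fos = pd.Series(
    [["Biology", "Virology"], None, ["Biology"], ["Cardiology"]],
    index=[0, 1, 2, 5],
)
# Replace missing values with an empty list so every row is iterable.
fos_lists = toy_fos.apply(lambda v: v if isinstance(v, list) else []).tolist()

encoder = MultiLabelBinarizer(sparse_output=True)
toy_fos_matrix = encoder.fit_transform(fos_lists)  # SciPy sparse, one row per paper

print(list(encoder.classes_))
print(toy_fos_matrix.shape, toy_fos_matrix.nnz)
```

Only the non-zero entries are stored, so memory grows with the number of assigned fields instead of with papers times fields.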

Example 9-3. Collaborative filtering stage 2: Search for similar items

from scipy.spatial.distance import cosine

def item_collab_filter(features_df):
    item_similarities = pd.DataFrame(index=features_df.columns, columns=features_df.columns)
    
    for i in features_df.columns:
        for j in features_df.columns:
            item_similarities.loc[i, j] = 1 - cosine(features_df[i].astype('float'), features_df[j].astype('float'))
    
    return item_similarities

%time first_items = item_collab_filter(first_features.loc[:, 0:1000])

Wall time: 12min 5s
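Most of those 12 minutes go to the Python double loop, which calls `cosine` once per pair of papers. A possible vectorized version, sketched on a toy frame with the same orientation as `first_features` (rows are features, columns are papers), computes all pairs in one call. `toy_features` and `toy_items` are hypothetical names.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Rows are features, columns are paper labels, as in first_features.
toy_features = pd.DataFrame(
    {0: [1, 0, 1], 1: [1, 0, 0], 2: [0, 1, 0]},
    index=["Biology", "Physics", "2015"],
)
# cosine_similarity compares rows, so transpose to (papers x features).
sim = cosine_similarity(toy_features.T.astype(float).values)
toy_items = pd.DataFrame(sim, index=toy_features.columns, columns=toy_features.columns)
print(toy_items.round(3))
```

The result has the same layout as `first_items`: a square frame indexed by paper label on both axes.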

Example 9-4. Heatmap of the paper recommender

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

sns.set()
ax = sns.heatmap(first_items.fillna(0), 
                 vmin=0, vmax=1, 
                 cmap="YlGnBu", 
                 xticklabels=250, yticklabels=250)
ax.tick_params(labelsize=12)

Darker colors indicate more similar items
Most items are not similar to one another

Example 9-5. Item-based collaborative filtering recommendations

def paper_recommender(paper_index, items_df):
    print('Based on the paper: \nindex = ', paper_index)
    print(model_df.iloc[paper_index])
    # items_df is indexed by DataFrame labels, so top_results holds labels.
    top_results = items_df.loc[paper_index].sort_values(ascending=False).head(4)
    print('\nTop three results: ') 
    order = 1
    for i in top_results.index.tolist()[-3:]:
        print(order,'. Paper index = ', i)
        print('Similarity score: ', top_results[i])
        # Note: iloc is positional, so the row printed here can differ from
        # label i (compare the 'Name:' field in the output).
        print(model_df.iloc[i], '\n')
        if order < 5: 
            order += 1

paper_recommender(2, first_items)

Based on the paper: 
index =  2
abstract                                                 None
authors     [{'name': 'Jovana P. Lekovich', 'org': 'Weill ...
fos                                                      None
keywords                                                 None
title       Should endometriosis be an indication for intr...
year                                                     2015
Name: 2, dtype: object

Top three results: 
1 . Paper index =  2
Similarity score:  1.0
abstract                                                 None
authors     [{'name': 'Jovana P. Lekovich', 'org': 'Weill ...
fos                                                      None
keywords                                                 None
title       Should endometriosis be an indication for intr...
year                                                     2015
Name: 2, dtype: object 

2 . Paper index =  292
Similarity score:  1.0
abstract                                                 None
authors     [{'name': 'John C. Newton'}, {'name': 'Beers M...
fos         [Wide area multilateration, Maneuvering speed,...
keywords                                                 None
title                    Automatic speed control for aircraft
year                                                     1955
Name: 561, dtype: object 

3 . Paper index =  593
Similarity score:  1.0
abstract    This paper demonstrates that on‐site greywater...
authors     [{'name': 'Eran Friedler', 'org': 'Division of...
fos         [Public opinion, Environmental Engineering, Wa...
keywords    [economic analysis, tratamiento desperdicios, ...
title       The water saving potential and the socio-econo...
year                                                     2008
Name: 1152, dtype: object

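The paper itself comes out as the most similar
However, the other two papers are reported as similar even though they are not
The current method is too slow for iterative engineering
- We need to look for another approach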

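Second Pass: More Engineering and a Smarter Model

Modify the item-based collaborative filter to apply better techniques to the two features and to iterate faster

Academic Paper Recommender: Take 2

Using raw counts as features is problematic for methods that rely on a similarity metric

Example 9-6. Fixed-width binning + dummy coding (part 1)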

model_df['year'].tail()

19994    1951
19995    2017
19997    1971
19998    1986
19999    2015
Name: year, dtype: int64

print("Year spread: ", model_df['year'].min()," - ", model_df['year'].max())
print("Quantile spread:\n", model_df['year'].quantile([0.25, 0.5, 0.75]))

Year spread:  1831  -  2017
Quantile spread:
 0.25    1990.0
0.50    2005.0
0.75    2012.0
Name: year, dtype: float64

# Check the distribution of year
sns.set_style('whitegrid')
fig, ax = plt.subplots(figsize=(7, 5))
model_df['year'].hist(ax=ax, bins= model_df['year'].max() - model_df['year'].min())
ax.tick_params(labelsize=12)
ax.set_xlabel('Year Count', fontsize=12)
ax.set_ylabel('Occurrence', fontsize=12)

Text(0, 0.5, 'Occurrence')

Example 9-7. Fixed-width binning + dummy coding (part 2)

# Bins are set from the range of the variable, not from the number of data points.
model_df['year'].max() - model_df['year'].min()

186

# Bin the year feature into decades
bins = int(round((model_df['year'].max() - model_df['year'].min()) / 10))

temp_df = pd.DataFrame(index=model_df.index)
temp_df['yearBinned'] = pd.cut(model_df['year'].tolist(), bins, precision=0)

# Binning year into decades reduces the feature space from 156 to 19.
print('We have reduced from', len(model_df['year'].unique()),
      'to', len(temp_df['yearBinned'].values.unique()), 'features representing the year.')

We have reduced from 156 to 19 features representing the year.

X_yrs = pd.get_dummies(temp_df['yearBinned'])
X_yrs.head()

# Check the distribution of the binned years
sns.set_style('white')
fig, ax = plt.subplots(figsize=(7, 5))
X_yrs.sum().plot.bar(ax = ax)
ax.tick_params(labelsize=12)
ax.set_xlabel('Binned Years', fontsize=12)
ax.set_ylabel('Counts', fontsize=12)

Text(0, 0.5, 'Counts')
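The binning and dummy coding above can be reproduced on a handful of values. This is a toy sketch: `toy_years` is made-up data that merely spans the same 1831 to 2017 range as the notebook, and the variable names are hypothetical.

```python
import pandas as pd

toy_years = pd.Series([1831, 1905, 1990, 2005, 2012, 2017])

# Fixed-width bins: the bin count comes from the range of the variable.
n_bins = int(round((toy_years.max() - toy_years.min()) / 10))
toy_binned = pd.cut(toy_years, n_bins, precision=0)

# Dummy coding: one indicator column per bin, one active bin per row.
toy_dummies = pd.get_dummies(toy_binned)
print(n_bins)
print(toy_binned.value_counts().sort_index().tail(3))
```

Years that fall in the same decade-wide bin now share an identical indicator, so nearby years can overlap in the cosine computation instead of being treated as unrelated categories.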

Example 9-8. Converting a bag-of-phrases pandas Series to a NumPy sparse array

X_fos = fos_features.values

# Looking at the size of each object hints at the difference it will make later.
print('Our pandas Series, in bytes: ', getsizeof(fos_features))
print('Our hashed numpy array, in bytes: ', getsizeof(X_fos))

Our pandas Series, in bytes:  4902998648
Our hashed numpy array, in bytes:  112
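The 112 bytes reported for the NumPy array is probably the size of the array header only. `sys.getsizeof` does not count a data buffer that the array does not own, which is likely the case for `.values` here (an assumption about this particular run). A toy check, with hypothetical names `owner` and `view`:

```python
import sys
import numpy as np

owner = np.zeros((1000, 100))  # this array owns its data buffer
view = owner[:]                # a view shares the buffer and owns nothing

print('owner getsizeof:', sys.getsizeof(owner))
print('view getsizeof: ', sys.getsizeof(view))
print('view nbytes:    ', view.nbytes)
```

`nbytes` reports the size of the data regardless of ownership, so it is the safer number for comparing memory footprints.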

Example 9-9. Collaborative filtering stages 1 + 2: Build the item feature matrix, search for similar items

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import DictVectorizer

X_yrs.shape[1] + X_fos.shape[1]

7623

# 10399 x 7623 array
%time second_features = np.append(X_fos, X_yrs, axis = 1)

second_size = getsizeof(second_features)
print('Size of second feature array, in bytes: ', second_size)

Wall time: 308 ms
Size of second feature array, in bytes:  634172728

print("The power of feature engineering saves us, in bytes: ", getsizeof(fos_features) - second_size)

The power of feature engineering saves us, in bytes:  4268825920

def piped_collab_filter(features_matrix, index, top_n):
    # Note: 1 - cosine_similarity is the cosine distance, so the descending sort
    # below puts the most distant items first (orthogonal vectors score 1.0).
    item_similarities = 1 - cosine_similarity(features_matrix[index:index+1], features_matrix).flatten() 
    related_indices = [i for i in item_similarities.argsort()[::-1] if i != index]

    return [(index, item_similarities[index]) for index in related_indices][0:top_n]
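Because `cosine_similarity` already returns a similarity, subtracting it from 1 and then sorting in descending order ranks by distance. A possible variant that ranks by similarity is sketched below. `similar_items` and `toy_matrix` are hypothetical names, and this is not the function that produced the outputs recorded in these notes.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def similar_items(features_matrix, index, top_n):
    """Hypothetical variant of piped_collab_filter that ranks by similarity."""
    similarities = cosine_similarity(
        features_matrix[index:index + 1], features_matrix
    ).flatten()
    # Highest similarity first, skipping the query row itself.
    ranked = [i for i in similarities.argsort()[::-1] if i != index]
    return [(int(i), float(similarities[i])) for i in ranked[:top_n]]

toy_matrix = np.array([
    [1, 0, 1, 0],  # row 0: the query
    [1, 0, 1, 0],  # row 1: identical to row 0
    [0, 1, 0, 1],  # row 2: orthogonal to row 0
    [1, 0, 0, 0],  # row 3: partial overlap with row 0
], dtype=float)
toy_result = similar_items(toy_matrix, 0, 2)
print(toy_result)
```

With this ordering a score of 1.0 means identical feature vectors, and the orthogonal row is ranked last.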

Example 9-10. Item-based collaborative filtering recommender: Take 2

def paper_recommender(items_df, paper_index, top_n):
    if paper_index in model_df.index:
        
        print('Based on the paper:')
        print('Paper index = ', model_df.loc[paper_index].name)
        print('Title :', model_df.loc[paper_index]['title'])
        print('FOS :', model_df.loc[paper_index]['fos'])
        print('Year :', model_df.loc[paper_index]['year'])
        print('Abstract :', model_df.loc[paper_index]['abstract'])
        print('Authors :', model_df.loc[paper_index]['authors'], '\n')
        
        # Define the positional index for the requested DataFrame index label
        array_ix = model_df.index.get_loc(paper_index)

        top_results = piped_collab_filter(items_df, array_ix, top_n)
        
        print('\nTop',top_n,'results: ') 
        
        order = 1
        for i in range(len(top_results)):
            print(order,'. Paper index = ', model_df.iloc[top_results[i][0]].name)
            print('Similarity score: ', top_results[i][1])
            print('Title :', model_df.iloc[top_results[i][0]]['title'])
            print('FOS :', model_df.iloc[top_results[i][0]]['fos'])
            print('Year :', model_df.iloc[top_results[i][0]]['year'])
            print('Abstract :', model_df.iloc[top_results[i][0]]['abstract'])
            print('Authors :', model_df.iloc[top_results[i][0]]['authors'], '\n')
            if order < top_n: order += 1
    
    else:
        print('Whoops! Choose another paper. Try something from here: \n', model_df.index[100:200])

paper_recommender(second_features, 2, 3)

Based on the paper:
Paper index =  2
Title : Should endometriosis be an indication for intracytoplasmic sperm injection (ICSI) in fresh IVF cycles
FOS : None
Year : 2015
Abstract : None
Authors : [{'name': 'Jovana P. Lekovich', 'org': 'Weill Cornell Medical College, New York, NY'}, {'name': 'G.D. Palermo', 'org': 'Weill Medical College of Cornell University, New York, NY'}, {'name': 'Nigel Pereira', 'org': 'The Ronald O. Perelman and Claudia Cohen Center, New York, NY'}, {'name': 'Zev Rosenwaks', 'org': 'Weill Cornell Medical College, New York, NY'}] 


Top 3 results: 
1 . Paper index =  10055
Similarity score:  1.0
Title : [Diagnosis of cerebral tumors; comparative studies on arteriography, ventriculography and electroencephalography].
FOS : ['Radiology', 'Pathology', 'Surgery']
Year : 1953
Abstract : None
Authors : [{'name': 'Antoine'}, {'name': 'Lepoire'}, {'name': 'Schoumacker'}] 

2 . Paper index =  11771
Similarity score:  1.0
Title : A Study of Special Functions in the Theory of Eclipsing Binary Systems
FOS : ['Contact binary']
Year : 1981
Abstract : None
Authors : [{'name': 'Filaretti Zafiropoulos', 'org': 'University of Manchester'}] 

3 . Paper index =  11773
Similarity score:  1.0
Title : Studies of powder flow using a recording powder flowmeter and measurement of the dynamic angle of repose
FOS : None
Year : 1985
Abstract : This paper describes the utility of the dynamic measurement of the angle of repose for pharmaceutical systems, using a variable rotating cylinder to quantify powder flow. The dynamic angle of repose of sodium chloride powder sieve fractions was evaluated using a variable rotating cylinder. The relationship between the static and the dynamic angle of repose is discussed. The dynamic angle of repose of six lots of a multivitamin preparation were compared for inter- and intralot variation. In both cases, no significant differences (p> 0.05) were observed. In the multivitamin formulation, lubricants at lower concentration levels did not show a significant effect (p> 0.05) on the dynamic angle of repose when compared with flow rates. The effect of different hopper sizes and geometry has been evaluated using the recording powder flowmeter. The results indicate that although different hoppers affect the quantitative nature of the results, the same general trends are apparent. Thus, it appears possible to use a recording powder flowmeter with small quantities of material to predict the effect of formulation and processing variables on the flow of production scale quantities. This paper does not describe a comprehensive evaluation of the pharmaceutical utility of measuring the dynamic angle of repose. However, the results discussed are not encouraging and suggest that the recording powder flowmeter is more sensitive to the effects of formulation and production variables on powder flow.
Authors : [{'name': 'Ramachandra P. Hegde', 'org': 'Department of Pharmacy, University of Rhode Island, Kingston, RI 02881'}, {'name': 'J.L. Rheingold', 'org': 'Formulation Research, Lederle Laboratories, Pearl River, NY 10965'}, {'name': 'S. Welch', 'org': 'Formulation Research, Lederle Laboratories, Pearl River, NY 10965'}, {'name': 'C. T. Rhodes', 'org': 'Department of Pharmacy, University of Rhode Island, Kingston, RI 02881|||Department of Pharmacy, University of Rhode Island, Kingston, RI 02881'}]

Example 9-11. Changes in index assignment caused by the transformations

model_df.loc[21]

abstract    A microprocessor includes hardware registers t...
authors                      [{'name': 'Mark John Ebersole'}]
fos         [Embedded system, Parallel computing, Computer...
keywords                                                 None
title       Microprocessor that enables ARM ISA program to...
year                                                     2013
Name: 21, dtype: object

model_df.iloc[21]

abstract                                                 None
authors     [{'name': 'Nicola M. Heller'}, {'name': 'Steph...
fos         [Biology, Medicine, Post-transcriptional regul...
keywords    [glucocorticoids, post transcriptional regulat...
title       Post-transcriptional regulation of eotaxin by ...
year                                                     2002
Name: 30, dtype: object

model_df.index.get_loc(30)

21
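The same label versus position distinction on a toy frame. `toy_df` is a hypothetical frame whose index has gaps, as `model_df` does after the filtering step.

```python
import pandas as pd

# Labels 3 and 4 are missing, as if those rows had been filtered out.
toy_df = pd.DataFrame({'title': ['a', 'b', 'c', 'd']}, index=[0, 1, 2, 5])

label = 5
position = toy_df.index.get_loc(label)  # position of label 5 in the index

print(position)
print(toy_df.loc[label, 'title'])        # lookup by label
print(toy_df.iloc[position]['title'])    # lookup by position, same row
```

NumPy feature arrays only know positions, so a label has to be converted with `get_loc` before indexing into them and converted back with `iloc[...].name` when reporting results.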

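Third Pass: More Features = More Information

So far the results do not support the hypothesis that publication year and field of study are sufficient to recommend similar papers
- Use more of the original dataset to see whether the results improve (unlikely to change much)
- Spend more time exploring whether the data is sufficient to give good recommendations (it already is)
- Add features (focusing on abstract and authors)

Academic Paper Recommender: Take 3

Use tf-idf

Example 9-12. Stopwords + tf-idf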

# Fill the NaN entries so that sklearn can be used.
filled_df = model_df.fillna('None')
filled_df.head()

# abstract: stopword removal and frequency-based filtering
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
X_abstract = vectorizer.fit_transform(filled_df['abstract'])
X_abstract

<10399x48516 sparse matrix of type '<class 'numpy.float64'>'
	with 374055 stored elements in Compressed Sparse Row format>

print("n_samples: %d, n_features: %d" % X_abstract.shape)
X_yrs.shape[1] + X_fos.shape[1] + X_abstract.shape[1]

n_samples: 10399, n_features: 48516

56139

# 10399 x 56139 array

%time third_features = np.append(second_features, X_abstract.toarray(), axis = 1)

Wall time: 9.69 s
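Calling `toarray()` turns the sparse tf-idf matrix into a dense one before it is appended. One possible alternative, sketched on toy data, keeps everything sparse with `scipy.sparse.hstack`. The names `toy_abstracts`, `toy_other`, and `toy_combined` are hypothetical, and `max_df` is left out because the toy corpus has only three documents.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

toy_abstracts = [
    'powder flow measured with a rotating cylinder',
    'powder flow in pharmaceutical systems',
    'insertion sequence from bacillus thuringiensis',
]
toy_vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words='english')
toy_tfidf = toy_vectorizer.fit_transform(toy_abstracts)  # sparse CSR matrix

# Stand-in for the dense year/fos indicator columns.
toy_other = np.array([[1, 0], [1, 0], [0, 1]])

# Stack column-wise without densifying the tf-idf block.
toy_combined = sparse.hstack([sparse.csr_matrix(toy_other), toy_tfidf]).tocsr()
print(toy_combined.shape)
```

`cosine_similarity` from scikit-learn accepts sparse input, so a matrix built this way can be passed on without conversion.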

paper_recommender(third_features, 2, 3)

Based on the paper:
Paper index =  2
Title : Should endometriosis be an indication for intracytoplasmic sperm injection (ICSI) in fresh IVF cycles
FOS : None
Year : 2015
Abstract : None
Authors : [{'name': 'Jovana P. Lekovich', 'org': 'Weill Cornell Medical College, New York, NY'}, {'name': 'G.D. Palermo', 'org': 'Weill Medical College of Cornell University, New York, NY'}, {'name': 'Nigel Pereira', 'org': 'The Ronald O. Perelman and Claudia Cohen Center, New York, NY'}, {'name': 'Zev Rosenwaks', 'org': 'Weill Cornell Medical College, New York, NY'}] 


Top 3 results: 
1 . Paper index =  10055
Similarity score:  1.0
Title : [Diagnosis of cerebral tumors; comparative studies on arteriography, ventriculography and electroencephalography].
FOS : ['Radiology', 'Pathology', 'Surgery']
Year : 1953
Abstract : None
Authors : [{'name': 'Antoine'}, {'name': 'Lepoire'}, {'name': 'Schoumacker'}] 

2 . Paper index =  11773
Similarity score:  1.0
Title : Studies of powder flow using a recording powder flowmeter and measurement of the dynamic angle of repose
FOS : None
Year : 1985
Abstract : This paper describes the utility of the dynamic measurement of the angle of repose for pharmaceutical systems, using a variable rotating cylinder to quantify powder flow. The dynamic angle of repose of sodium chloride powder sieve fractions was evaluated using a variable rotating cylinder. The relationship between the static and the dynamic angle of repose is discussed. The dynamic angle of repose of six lots of a multivitamin preparation were compared for inter- and intralot variation. In both cases, no significant differences (p> 0.05) were observed. In the multivitamin formulation, lubricants at lower concentration levels did not show a significant effect (p> 0.05) on the dynamic angle of repose when compared with flow rates. The effect of different hopper sizes and geometry has been evaluated using the recording powder flowmeter. The results indicate that although different hoppers affect the quantitative nature of the results, the same general trends are apparent. Thus, it appears possible to use a recording powder flowmeter with small quantities of material to predict the effect of formulation and processing variables on the flow of production scale quantities. This paper does not describe a comprehensive evaluation of the pharmaceutical utility of measuring the dynamic angle of repose. However, the results discussed are not encouraging and suggest that the recording powder flowmeter is more sensitive to the effects of formulation and production variables on powder flow.
Authors : [{'name': 'Ramachandra P. Hegde', 'org': 'Department of Pharmacy, University of Rhode Island, Kingston, RI 02881'}, {'name': 'J.L. Rheingold', 'org': 'Formulation Research, Lederle Laboratories, Pearl River, NY 10965'}, {'name': 'S. Welch', 'org': 'Formulation Research, Lederle Laboratories, Pearl River, NY 10965'}, {'name': 'C. T. Rhodes', 'org': 'Department of Pharmacy, University of Rhode Island, Kingston, RI 02881|||Department of Pharmacy, University of Rhode Island, Kingston, RI 02881'}] 

3 . Paper index =  11778
Similarity score:  1.0
Title : Direct antagonistic action of prostaglandins and phenylisopropyladenosine on the activity of pancreatic triglyceride lipase and the antagonistic effect of polyphloretin phosphate
FOS : ['Endocrinology', 'Biochemistry', 'Diabetes mellitus']
Year : 1974
Abstract : None
Authors : [{'name': 'Mentz P'}, {'name': 'Förster W'}, {'name': 'Giessler C'}]

Example 9-13. One-hot encoding with scikit-learn's DictVectorizer

authors_df = pd.DataFrame(filled_df.authors)
authors_df.head()

authors_list = []

for row in authors_df.itertuples():
    # Build a dictionary from each Series index
    if type(row.authors) is str:
        y = {'None': row.Index}
    if type(row.authors) is list:
        # Add these keys and values to the dictionary
        # (only the first author entry is used; its values become the keys)
        y = dict.fromkeys(row.authors[0].values(), row.Index)
    authors_list.append(y)

authors_list[0:5]

[{'None': 0},
 {'Ahmed M. Alluwaimi': 1},
 {'Jovana P. Lekovich': 2, 'Weill Cornell Medical College, New York, NY': 2},
 {'George C. Sponsler': 5},
 {'M. T. Richards': 7}]

v = DictVectorizer(sparse=False)
D = authors_list
X_authors = v.fit_transform(D)
X_authors

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

print("n_samples: %d, n_features: %d" % X_authors.shape)
X_yrs.shape[1] + X_fos.shape[1] + X_abstract.shape[1] + X_authors.shape[1]

n_samples: 10399, n_features: 14028

70167

# 10399 x 70167 array

%time fourth_features = np.append(third_features, X_authors, axis = 1)

Wall time: 7.39 s
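In `authors_list` the dictionary value is the row index, and `DictVectorizer` uses numeric values as the feature values, so the encoded entries are row indices instead of 0/1 flags. A possible binary encoding is sketched below on toy data. `author_flags` and the `toy_` names are hypothetical, and unlike the code above this variant uses every author name and ignores `org`.

```python
from sklearn.feature_extraction import DictVectorizer

toy_authors = [
    None,
    [{'name': 'Ahmed M. Alluwaimi'}],
    [{'name': 'Jovana P. Lekovich',
      'org': 'Weill Cornell Medical College, New York, NY'},
     {'name': 'G.D. Palermo'}],
]

def author_flags(authors):
    """One key per author name with value 1; missing authors give no keys."""
    if not isinstance(authors, list):
        return {}
    return {a['name']: 1 for a in authors if 'name' in a}

toy_dicts = [author_flags(a) for a in toy_authors]
toy_dv = DictVectorizer(sparse=False)
toy_X = toy_dv.fit_transform(toy_dicts)

print(toy_dv.feature_names_)
print(toy_X)
```

Each column is one author name and each row holds 1 where that author appears, which keeps the author block on the same 0/1 scale as the year and field-of-study indicators.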

Example 9-14. Item-based collaborative filtering recommender: Take 3

paper_recommender(fourth_features, 2, 3)

Based on the paper:
Paper index =  2
Title : Should endometriosis be an indication for intracytoplasmic sperm injection (ICSI) in fresh IVF cycles
FOS : None
Year : 2015
Abstract : None
Authors : [{'name': 'Jovana P. Lekovich', 'org': 'Weill Cornell Medical College, New York, NY'}, {'name': 'G.D. Palermo', 'org': 'Weill Medical College of Cornell University, New York, NY'}, {'name': 'Nigel Pereira', 'org': 'The Ronald O. Perelman and Claudia Cohen Center, New York, NY'}, {'name': 'Zev Rosenwaks', 'org': 'Weill Cornell Medical College, New York, NY'}] 


Top 3 results: 
1 . Paper index =  10055
Similarity score:  1.0
Title : [Diagnosis of cerebral tumors; comparative studies on arteriography, ventriculography and electroencephalography].
FOS : ['Radiology', 'Pathology', 'Surgery']
Year : 1953
Abstract : None
Authors : [{'name': 'Antoine'}, {'name': 'Lepoire'}, {'name': 'Schoumacker'}] 

2 . Paper index =  5601
Similarity score:  1.0
Title : 633 Survival after coronary revascularization, with and without mitral valve repair, in patients with ischemic mitral regurgitation. Importance of pre-operative myocardial viability
FOS : ['Cardiology']
Year : 2005
Abstract : None
Authors : [{'name': 'J.B. Le Polain De Waroux'}, {'name': 'Anne-Catherine Pouleur'}, {'name': 'B. Beige'}, {'name': 'Agnes Pasquet'}, {'name': 'B. Gerbe'}, {'name': 'B. Noirhomme'}, {'name': 'G. El Khoury'}, {'name': 'Jean Louis Vanoverschelde'}] 

3 . Paper index =  12256
Similarity score:  1.0
Title : Nucleotide Sequence and Analysis of an Insertion Sequence from Bacillus thuringiensis Related to IS150
FOS : ['Biology', 'Molecular biology', 'Insertion sequence', 'Nucleic acid sequence', 'Bioinformatics', 'Genetics']
Year : 1994
Abstract : A 5.8-kb DNA fragment encoding the  cryIC  gene from  Bacillus thuringiensis  (Bt) subsp.  aizawai  HD229 was subcloned into the pMex7 vector for expression in  Escherichia coli . In addition to the 135-kDa CryIC δ-endotoxin, this DNA fragment also encoded a 30-kDa polypeptide whose open reading frame ( orfX ) was located less than 200 bp upstream of  cryIC  . Nucleotide sequencing showed that  orfX  was truncated at the 5′ end, and full sequence was obtained from a second overlapping clone. Sequence analysis showed that  orfX  could encode a polypeptide closely related to the putative transposase from IS 150 . OrfX was flanked by a 17-bp imperfect inverted repeat, defining the length of the element as 998 bp. Southern blot analysis revealed that the novel insertion sequence was present in a single copy and located in an identical position immediately upstream of  cryIC  in plasmid DNA from both Bt subsp.  aizawai  and  entomocidus.
Authors : [{'name': 'Geoffrey P. Smith'}, {'name': 'David J. Ellar'}, {'name': 'Sharon J. Keeler'}, {'name': 'Cynthia E. Seip'}]

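At least the results now include papers from medical fields (out of 7,604 unique fields of study in total)
Adding more text variables, such as noun phrases extracted from titles or stems extracted from keywords, should improve the results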

a = (0, 0, 0, 1)
b = (0, 0, 1, 0)

cosine(a, b) # distance

1.0

help(cosine)

Help on function cosine in module scipy.spatial.distance:

cosine(u, v, w=None)
    Compute the Cosine distance between 1-D arrays.
    
    The Cosine distance between `u` and `v`, is defined as
    
    .. math::
    
        1 - \frac{u \cdot v}
                  {||u||_2 ||v||_2}.
    
    where :math:`u \cdot v` is the dot product of :math:`u` and
    :math:`v`.
    
    Parameters
    ----------
    u : (N,) array_like
        Input array.
    v : (N,) array_like
        Input array.
    w : (N,) array_like, optional
        The weights for each value in `u` and `v`. Default is None,
        which gives each value a weight of 1.0
    
    Returns
    -------
    cosine : double
        The Cosine distance between vectors `u` and `v`.
    
    Examples
    --------
    >>> from scipy.spatial import distance
    >>> distance.cosine([1, 0, 0], [0, 1, 0])
    1.0
    >>> distance.cosine([100, 0, 0], [0, 1, 0])
    1.0
    >>> distance.cosine([1, 1, 0], [0, 1, 0])
    0.29289321881345254

Output of year_features.head() (truncated):

	...	2015	2016
0	...	1	0
1	...	0	1
2	...	1	0
5	...	0	0
7	...	0	0

Output of fos_features.head() (truncated):

	0	0-10 V lighting control	1/N expansion	10G-PON	14-3-3 protein	2-choice hashing	20th-century philosophy	2D computer graphics	2DEG	3-D Secure	...	k-nearest neighbors algorithm	m-derived filter	microRNA	pH	photoperiodism	route	strictfp	string	Ćuk converter	μ operator
0	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
1	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
2	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
5	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
7	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

Output of X_yrs.head() (truncated):

	(1958.0, 1968.0]	(1968.0, 1978.0]	(2007.0, 2017.0]
0	0	0	1
1	0	0	1
2	0	0	1
5	1	0	0
7	0	1	0

데이터과학 삼학년

Ch.9 Back to the Feature: Building an Academic Paper Recommender

Output of model_df.head(2):

	abstract	authors	fos	keywords	title	year
0	A system and method for maskless direct write ...	None	[Electronic engineering, Computer hardware, En...	None	System and Method for Maskless Direct Write Li...	2015
1	None	[{'name': 'Ahmed M. Alluwaimi'}]	[Biology, Virology, Immunology, Microbiology]	[paratuberculosis, of, subspecies, proceedings...	The dilemma of the Mycobacterium avium subspec...	2016

Output of authors_df.head():

	authors
0	None
1	[{'name': 'Ahmed M. Alluwaimi'}]
2	[{'name': 'Jovana P. Lekovich', 'org': 'Weill ...
5	[{'name': 'George C. Sponsler'}]
7	[{'name': 'M. T. Richards'}]