<p>First, we create <code>person</code> as a DataFrame:</p>
<pre><code>import pandas as pd

# Sample rows; the two comma-separated columns are plain strings at this point
columns = ['nconst', 'primaryName', 'primaryProfession', 'knownForTitles']
data = [
    ('nm0000103', 'Fairuza Balk', 'actress,soundtrack', 'tt0181875,tt0089908,tt0120586,tt0115963'),
    ('nm0000106', 'Drew Barrymore', 'producer,actress,soundtrack', 'tt0120888,tt0343660,tt0151738,tt0120631'),
    ('nm0000117', 'Neve Campbell', 'actress,producer,soundtrack', 'tt0134084,tt1262416,tt0120082,tt0117571'),
    ('nm0000132', 'Claire Danes', 'actress,producer,soundtrack', 'tt0274558,tt0108872,tt1796960,tt0117509'),
    ('nm0000138', 'Leonardo DiCaprio', 'actor,producer,writer', 'tt0120338,tt0993846,tt1375666,tt0407887'),
]
person = pd.DataFrame(data=data, columns=columns)
</code></pre>
<p>Second, we split the comma-separated strings in both columns into lists:</p>
<pre><code>for field in ['primaryProfession', 'knownForTitles']:
    person[field] = person[field].str.split(',')
</code></pre>
<p>Third, we use the <code>explode</code> function to turn each row into multiple rows, one per list element:</p>
<pre><code>person = person.explode('knownForTitles').explode('primaryProfession')
</code></pre>
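<p>As a side note, chaining two <code>explode</code> calls produces every combination of the two lists for each original row, and the original index label is repeated on every resulting row. The snippet below is a minimal sketch on a made-up one-row frame (the names <code>df</code>, <code>roles</code> and <code>titles</code> are illustrative and not part of the data above); it also shows <code>reset_index(drop=True)</code> in case a unique index is preferred.</p>

```python
import pandas as pd

# Illustrative one-row frame whose cells already hold lists
df = pd.DataFrame({
    'name': ['A'],
    'roles': [['actor', 'producer']],
    'titles': [['t1', 't2']],
})

# 2 titles x 2 roles -> 4 rows, all carrying the original index label 0
out = df.explode('titles').explode('roles')

# Optional: replace the repeated labels with a fresh 0..n-1 index
tidy = out.reset_index(drop=True)

print(out)
print(tidy)
```

<p>The repeated index is harmless for filtering and merging, so resetting it is a matter of taste.</p>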
<p>Fourth, we keep only the rows whose primary profession is actor or actress:</p>
<pre><code>actor_actress = person[person['primaryProfession'].isin(['actress', 'actor'])]
</code></pre>
<p>Now we have a DataFrame in so-called tidy format (each cell holds a single value rather than a list). The first few rows look like this:</p>
<pre><code>      nconst     primaryName  primaryProfession  knownForTitles
0  nm0000103    Fairuza Balk            actress       tt0181875
0  nm0000103    Fairuza Balk            actress       tt0089908
0  nm0000103    Fairuza Balk            actress       tt0120586
0  nm0000103    Fairuza Balk            actress       tt0115963
1  nm0000106  Drew Barrymore            actress       tt0120888
</code></pre>
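<p>At this point, we can repeat the same steps for the movies DataFrame, and then join the actors (on <code>knownForTitles</code>) with the movies (on <code>tconst</code>).</p>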
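<p>A rough sketch of that join might look like the following. The <code>movies</code> frame here is hypothetical: its column names (<code>tconst</code>, <code>primaryTitle</code>, <code>genres</code>) are an assumption about how the movies data is laid out, and the rows are only illustrative sample values, so adjust them to your actual data.</p>

```python
import pandas as pd

# Tidy actor rows, shaped like the result of the split/explode/filter steps
actor_actress = pd.DataFrame({
    'nconst': ['nm0000138', 'nm0000138'],
    'primaryName': ['Leonardo DiCaprio', 'Leonardo DiCaprio'],
    'primaryProfession': ['actor', 'actor'],
    'knownForTitles': ['tt0120338', 'tt1375666'],
})

# Hypothetical movies frame (assumed column names, illustrative values)
movies = pd.DataFrame({
    'tconst': ['tt0120338', 'tt1375666'],
    'primaryTitle': ['Titanic', 'Inception'],
    'genres': ['Drama,Romance', 'Action,Adventure,Sci-Fi'],
})

# Same recipe as for person: split the string column, then explode it
movies['genres'] = movies['genres'].str.split(',')
movies = movies.explode('genres')

# Join each actor row to the matching movie rows
joined = actor_actress.merge(
    movies, left_on='knownForTitles', right_on='tconst', how='inner'
)
print(joined[['primaryName', 'primaryTitle', 'genres']])
```

<p>An inner join drops titles that have no match in <code>movies</code>; <code>how='left'</code> would keep them with missing values instead.</p>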
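<p>Sorry for the long reply. The key point of this approach is to use <code>str.split(',')</code> followed by <code>explode()</code> to convert the DataFrame into a format that is suitable for joins, merges, and so on.</p>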
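<p>If the same recipe is needed for several frames, it could be wrapped in a small function. This is only a sketch: <code>split_and_explode</code> is a hypothetical helper name, not a pandas function, and it assumes every listed column holds delimiter-separated strings.</p>

```python
import pandas as pd

def split_and_explode(df, columns, sep=','):
    """Hypothetical helper: split each listed string column on `sep`,
    then explode the columns one after another into separate rows."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].str.split(sep)
    for col in columns:
        out = out.explode(col)
    # Exploding repeats index labels, so hand back a fresh index
    return out.reset_index(drop=True)

sample = pd.DataFrame({
    'nconst': ['nm0000138'],
    'primaryProfession': ['actor,producer,writer'],
    'knownForTitles': ['tt0120338,tt0993846'],
})
tidy = split_and_explode(sample, ['primaryProfession', 'knownForTitles'])
print(tidy)
```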