<ul>
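<li>The current implementation for extracting isbn metadata is very slow and inefficient.
<ul>
<li>As noted, there are 482,000 unique isbn values, and the data is downloaded multiple times (e.g. once for each column, as the code is currently written).</li>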
</ul>
</li>
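<li>It is better to download all the metadata once, and then extract the data from the <code>dict</code> as a separate operation.</li>
<li>Use a <code>try-except</code> block to catch the error raised for invalid isbn values.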
<ul>
<li>Return an empty <code>dict</code>, <code>{}</code>, because <code>pd.json_normalize</code> does not work with <code>NaN</code> or <code>None</code>.</li>
<li>It is not necessary to chunk the isbn column.</li>
</ul>
</li>
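<li><code>pd.json_normalize</code> is used to expand the <code>dict</code> returned from <code>.meta</code>.</li>
<li>Use <code>pandas.DataFrame.rename</code> to rename columns, and <code>pandas.DataFrame.drop</code> to delete columns.</li>
<li>This implementation will be significantly faster than the current one, and will make far fewer requests to the API used to get the metadata.</li>
<li>To extract values from <code>lists</code>, such as the <code>'Authors'</code> column, use <code>df_meta = df_meta.explode('Authors')</code>; if there is more than one author, a new row is created for each additional author in the list.</li>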
</ul>
<pre class="lang-py prettyprint-override"><code>import pandas as pd # version 1.1.3
import isbnlib # version 3.10.3
# sample dataframe
df = pd.DataFrame({'isbn': ['9780446310789', 'abc', '9781491962299', '9781449355722']})
# function with try-except, for invalid isbn values
def get_meta(isbn: str) -> dict:
    # .apply passes one isbn string at a time, not the whole column
    try:
        return isbnlib.meta(isbn)
    except isbnlib.NotValidISBNError:
        return {}
# get the meta data for each isbn or an empty dict
df['meta'] = df.isbn.apply(get_meta)
# df
   isbn           meta
0  9780446310789  {'ISBN-13': '9780446310789', 'Title': 'To Kill A Mockingbird', 'Authors': ['Harper Lee'], 'Publisher': 'Grand Central Publishing', 'Year': '1988', 'Language': 'en'}
1  abc            {}
2  9781491962299  {'ISBN-13': '9781491962299', 'Title': 'Hands-On Machine Learning With Scikit-Learn And TensorFlow - Techniques And Tools To Build Learning Machines', 'Authors': ['Aurélien Géron'], 'Publisher': "O'Reilly Media", 'Year': '2017', 'Language': 'en'}
3  9781449355722  {'ISBN-13': '9781449355722', 'Title': 'Learning Python', 'Authors': ['Mark Lutz'], 'Publisher': '', 'Year': '2013', 'Language': 'en'}
# extract all the dicts in the meta column
df = df.join(pd.json_normalize(df.meta)).drop(columns=['meta'])
# extract values from the lists in the Authors column
df = df.explode('Authors')
# df
   isbn           ISBN-13        Title                                                                                                         Authors         Publisher                 Year  Language
0  9780446310789  9780446310789  To Kill A Mockingbird                                                                                         Harper Lee      Grand Central Publishing  1988  en
1  abc            NaN            NaN                                                                                                           NaN             NaN                       NaN   NaN
2  9781491962299  9781491962299  Hands-On Machine Learning With Scikit-Learn And TensorFlow - Techniques And Tools To Build Learning Machines  Aurélien Géron  O'Reilly Media            2017  en
3  9781449355722  9781449355722  Learning Python                                                                                               Mark Lutz                                 2013  en
</code></pre>
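<p>The <code>rename</code>, <code>drop</code> and multi-author <code>explode</code> steps from the list are not exercised by the sample above, so here is a minimal offline sketch of them, together with one possible way to look up each unique isbn only once. Everything named here is an assumption made for illustration: <code>FAKE_SERVICE</code> and <code>fetch_meta</code> are hypothetical stand-ins for <code>isbnlib.meta</code>, the two-author record is invented, and the new column names are arbitrary.</p>

```python
import pandas as pd

# Hypothetical offline stand-in for the metadata service, so no request is made.
# The first record is copied from the sample output above; the second is invented
# purely to show a book with two authors.
FAKE_SERVICE = {
    '9780446310789': {'ISBN-13': '9780446310789', 'Title': 'To Kill A Mockingbird',
                      'Authors': ['Harper Lee'], 'Publisher': 'Grand Central Publishing',
                      'Year': '1988', 'Language': 'en'},
    'demo-two-authors': {'ISBN-13': 'demo-two-authors', 'Title': 'Made-Up Title',
                         'Authors': ['Author One', 'Author Two'], 'Publisher': '',
                         'Year': '2000', 'Language': 'en'},
}

calls = []  # records every lookup, so the number of "requests" can be inspected


def fetch_meta(isbn: str) -> dict:
    """Hypothetical replacement for get_meta: return a metadata dict, or {} if unknown."""
    calls.append(isbn)
    return FAKE_SERVICE.get(isbn, {})


# the first isbn appears twice, 'abc' is invalid
df = pd.DataFrame({'isbn': ['9780446310789', 'abc', 'demo-two-authors', '9780446310789']})

# look up each unique isbn once, then reuse the result for every row
meta_by_isbn = {isbn: fetch_meta(isbn) for isbn in df['isbn'].unique()}
df['meta'] = [meta_by_isbn[isbn] for isbn in df['isbn']]

# expand the dicts into columns; the empty dict becomes a row of NaN
df = df.join(pd.json_normalize(df['meta'].tolist())).drop(columns=['meta'])

# rename and drop columns (the new names are arbitrary choices for this sketch)
df = df.rename(columns={'ISBN-13': 'isbn13', 'Title': 'title', 'Authors': 'author'})
df = df.drop(columns=['Language'])

# one row per author; the index label of the original row is repeated
df = df.explode('author')
print(df)
```

<p>In this sketch the repeated isbn triggers a single lookup, and the two-author record ends up as two rows that share the same index label. Whether deduplicating first is worthwhile with real data depends on how often isbn values repeat in the column.</p>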