Inferring caste from Indian names
Detailed description of the outkast Python project
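Using data on 140 million Indians from 19 states, we estimate the proportion of Scheduled Castes (SC), Scheduled Tribes (ST), and Others for a given last name, year, and state.
Why?
We provide this package so that people can assess, highlight, and fight unfairness.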
How was the underlying data produced?
- The script downloads the clean version of the SECC data posted here.
- Infer the last name
- remove names with non-alphabetical characters
- remove records with missing last names
- remove < 2 char last names
- remove rows with birth_date < 1900
- keep only last names shared by at least 1000 records
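The filtering steps above can be sketched in plain Python. This is only an illustration, not the package's actual script: the record layout (`name` and `birth_date` keys), the rule that the last name is the final whitespace-separated token, and the decision to keep records without a birth date are all assumptions made for the example.

```python
from collections import Counter
from datetime import date


def clean_records(records, min_count=1000, min_year=1900):
    """Apply the listed filters to an iterable of dicts.

    Hypothetical layout: each record has a 'name' string and an optional
    'birth_date' (datetime.date). Neither key name comes from the SECC files.
    """
    cleaned = []
    for rec in records:
        name = (rec.get("name") or "").strip().lower()
        # Drop missing names and names with non-alphabetical characters.
        # Note: str.isalpha() is Unicode-aware, so non-ASCII letters pass.
        if not name or not name.replace(" ", "").isalpha():
            continue
        # Assumption: the last name is the final whitespace-separated token.
        last_name = name.split()[-1]
        # Drop last names shorter than 2 characters.
        if len(last_name) < 2:
            continue
        # Drop rows with a birth date before min_year.
        # Assumption: rows without a birth date are kept.
        birth = rec.get("birth_date")
        if birth is not None and birth.year < min_year:
            continue
        cleaned.append({**rec, "last_name": last_name})

    # Keep only last names that occur at least min_count times.
    counts = Counter(rec["last_name"] for rec in cleaned)
    return [rec for rec in cleaned if counts[rec["last_name"]] >= min_count]
```

The frequency filter runs last, so the threshold is applied to counts of records that survived the earlier filters; whether the original script orders the steps this way is not stated.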
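Base Classifier
We start by providing a base model for last name that gives the Bayes optimal solution: the proportion of people with that last name who are SC, ST, and Other. We also provide a series of base models for when the state of residence is known.
Usage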
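When the last name is the only feature, the best available estimate of each category's probability is simply the empirical share of that category among records with that name. The sketch below computes those shares from (last name, category) pairs. The function name, the input format, and the category labels are assumptions for illustration; the output keys mirror the column names that `secc_caste` returns.

```python
from collections import defaultdict

CATEGORIES = ("sc", "st", "other")


def surname_proportions(pairs):
    """Return {last_name: {'n_sc': ..., 'prop_sc': ..., ...}}.

    `pairs` is an iterable of (last_name, category) tuples, where category
    is one of CATEGORIES. This input format is hypothetical.
    """
    counts = defaultdict(lambda: dict.fromkeys(CATEGORIES, 0))
    for last_name, category in pairs:
        counts[last_name][category] += 1

    table = {}
    for last_name, by_cat in counts.items():
        total = sum(by_cat.values())
        row = {f"n_{cat}": n for cat, n in by_cat.items()}
        # Empirical conditional share of each category given the last name.
        row.update({f"prop_{cat}": n / total for cat, n in by_cat.items()})
        table[last_name] = row
    return table
```

Feeding in 39 SC, 12 ST, and 4375 Other records for `agarwal` gives `prop_sc`, `prop_st`, and `prop_other` of about 0.008812, 0.002711, and 0.988477, matching the `agarwal` row in the usage output.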
>>> import pandas as pd
>>> from outkast import secc_caste
>>>
>>> names = [{'name': 'patel'},
... {'name': 'zala'},
... {'name': 'lal'},
... {'name': 'agarwal'}]
>>>
>>> df = pd.DataFrame(names)
>>>
>>> secc_caste(df, 'name')
      name    n_sc    n_st  n_other   prop_sc   prop_st  prop_other
0    patel    5681  112302   631393  0.007581  0.149861    0.842558
1     zala     667      14    34550  0.018932  0.000397    0.980670
2      lal  703595  241846  1314224  0.311371  0.107027    0.581601
3  agarwal      39      12     4375  0.008812  0.002711    0.988477
>>>
>>> help(secc_caste)
Help on method secc_caste in module outkast.secc_caste_ln:

secc_caste(df, namecol, state=None, year=None) method of builtins.type instance
    Appends additional columns from SECC data to the input DataFrame
    based on the last name.

    Removes extra space. Checks if the name is the SECC data.
    If it is, outputs data from that row.

    Args:
        df (:obj:`DataFrame`): Pandas DataFrame containing the last name
            column.
        namecol (str or int): Column's name or location of the name in
            DataFrame.
        state (str): The state name of SECC data to be used.
            (default is None for all states)
        year (int): The year of SECC data to be used.
            (default is None for all years)

    Returns:
        DataFrame: Pandas DataFrame with additional columns:-
            'n_sc', 'n_st', 'n_other',
            'prop_sc', 'prop_st', 'prop_other' by last name
License
The package is released under the MIT License.