Soldai utilities for machine learning and text processing
Detailed description of the soldai-utils Python project
Sutil
This repository contains a set of tools for machine learning and natural language processing tasks, including classes for running quick experiments with different classification models.
Dataset
This class loads CSV-style datasets in which all features are comma-separated and the class label is in the last column. It includes functionality to normalize the features, add a bias term, and save the data to a file and load it back. It also includes functions to split the data into train, validation, and test datasets.
```python
from sutil.base.Dataset import Dataset

datafile = './sutil/datasets/ex2data1.txt'
d = Dataset.fromDataFile(datafile, ',')
print(d.size)
sample = d.sample(0.3)
print(sample.size)
sample.save("modelo_01")
train, validation, test = d.split(train=0.8, validation=0.2)
print(train.size)
print(validation.size)
print(test.size)
```
Regularized Logistic Regression
You can also include your own models, such as the regularized logistic regression implemented by hand with numpy and included in the sutil.models package.
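The package's own implementation is not reproduced in this README. As a rough sketch of the idea behind a numpy regularized logistic regression (a hypothetical re-implementation, not the package's actual code), the cost and gradient with an L2 penalty that excludes the bias term might look like this:

```python
# Hypothetical sketch of a numpy regularized logistic regression,
# illustrating the idea only; not sutil's actual implementation.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_gradient(theta, X, y, lam):
    """Cost and gradient for regularized logistic regression.

    theta: (n, 1) parameters; X: (m, n) features including a bias column;
    y: (m, 1) binary labels; lam: regularization strength.
    """
    m = X.shape[0]
    h = sigmoid(X @ theta)
    # Cross-entropy cost plus L2 penalty (bias parameter theta[0] excluded)
    reg = (lam / (2 * m)) * np.sum(theta[1:] ** 2)
    cost = (-1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) + reg
    grad = (X.T @ (h - y)) / m
    grad[1:] += (lam / m) * theta[1:]
    return cost, grad

# Tiny usage example: two examples, bias column plus one feature
X = np.array([[1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0]])
theta = np.zeros((2, 1))
cost, grad = cost_and_gradient(theta, X, y, 0.03)
print(round(cost, 4))  # cost at theta = 0 is ln(2) ≈ 0.6931
```

The constructor call shown in the Experiment example below, `RegularizedLogisticRegression(theta, 0.03, 0)`, follows the same shape: initial parameters, a learning rate, and a regularization strength.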
Sklearn Models
You can also embed sklearn models in a wrapper class to run experiments with the different models implemented in sklearn. In the same style, you can create tensorflow, keras, or PyTorch models by extending the sutil.models.Model class and implementing the trainModel and predict methods.
```python
import numpy as np
from sutil.base.Dataset import Dataset
from sutil.models.SklearnModel import SklearnModel
from sklearn.linear_model import LogisticRegression

datafile = './sutil/datasets/ex2data1.txt'
d = Dataset.fromDataFile(datafile, ',')
ms = LogisticRegression()
m = SklearnModel('Sklearn Logistic', ms)
m.trainModel(d)
m.score(d.X, d.y)
m.roc.plot()
m.roc.zoom((0, 0.4), (0.5, 1.0))
```
Neural Network Classifier
This class lets you perform classification with a neural network, a multilayer perceptron classifier. It wraps the sklearn MLPClassifier and implements a method to search over different activations, solvers, and hidden layer structures. You can also pass your own parameters to initialize the network you want.
```python
from sutil.base.Dataset import Dataset
from sutil.neuralnet.NeuralNetworkClassifier import NeuralNetworkClassifier

datafile = './sutil/datasets/ex2data1.txt'
d = Dataset.fromDataFile(datafile, ',')
d.normalizeFeatures()
sample = d.sample(examples=30)
nn = NeuralNetworkClassifier((d.n, len(d.labels)))
nn.searchParameters(sample)
nn.trainModel(d)
nn.score(d.X, d.y)
nn.roc.plot()
```
Experiment
The Experiment class lets you split the data and test it against different models, in order to compare their performance automatically.
```python
import numpy as np
from sutil.base.Dataset import Dataset
from sklearn.linear_model import LogisticRegression
from sutil.base.Experiment import Experiment
from sutil.models.SklearnModel import SklearnModel
from sutil.models.RegularizedLogisticRegression import RegularizedLogisticRegression
from sutil.neuralnet.NeuralNetworkClassifier import NeuralNetworkClassifier

# Load the data
datafile = './sutil/datasets/ex2data1.txt'
d = Dataset.fromDataFile(datafile, ',')
d.normalizeFeatures()
print("Size of the dataset... ")
print(d.size)
sample = d.sample(0.3)
print("Size of the sample... ")
print(sample.size)

# Create the models
theta = np.zeros((d.n + 1, 1))
lr = RegularizedLogisticRegression(theta, 0.03, 0)
m = SklearnModel('Sklearn Logistic', LogisticRegression())

# Look for the best parameters using a sample
nn = NeuralNetworkClassifier((d.n, len(d.labels)))
nn.searchParameters(sample)
input("Press enter to continue...")

# Create the experiment
experiment = Experiment(d, None, 0.8, 0.2)
experiment.addModel(lr, name='Sutil Logistic Regression')
experiment.addModel(m, name='Sklearn Logistic Regression')
experiment.addModel(nn, name='Sutil Neural Network')

# Run the experiment
experiment.run(plot=True)
```
Text utilities
Sutil includes text utilities to process and transform text for classification.
PreProcessor
The PreProcessor class lets you apply text pre-processing functions to transform the data. It wraps nltk methods and uses its own methods to perform:
- Case normalization
- Denoising
- Stemming
- Lemmatization
- Pattern-based text normalization
```python
from sutil.text.PreProcessor import PreProcessor

string = "La Gata maullaba en la noche $'@|··~½¬½¬{{[[]}aqAs qasdas 1552638"
p = PreProcessor.standard()
print(p.preProcess(string))

patterns = [("\d+", "NUMBER")]
c = [("case", "lower"), ("denoise", "spanish"), ("stopwords", "spanish"),
     ("stem", "spanish"), ("lemmatize", "spanish"), ("normalize", patterns)]
p2 = PreProcessor(c)
print(p2.preProcess(string))

c = [("case", "lower"), ("denoise", "spanish"), ("stem", "spanish"), ("normalize", patterns)]
p3 = PreProcessor(c)
print(p3.preProcess(string))
```
PhraseTokenizer
The PhraseTokenizer lets you split a phrase into tokens given a delimiter character. There is also a GramTokenizer class, which lets you split words into chunks of a fixed number of characters.
```python
from sutil.text.GramTokenizer import GramTokenizer
from sutil.text.PhraseTokenizer import PhraseTokenizer

string = "Hi I'm a really helpful string"
t = PhraseTokenizer()
print(t.tokenize(string))
t2 = GramTokenizer()
print(t2.tokenize(string))
```
Text vectorizers
The TextVectorizer class is an abstraction of the methods that vectorize text and convert vectors back into text. Sutil implements OneHotVectorizer, TFIDFVectorizer, and CountVectorizer.
```python
import pandas as pd
from sutil.text.TFIDFVectorizer import TFIDFVectorizer
from sutil.text.GramTokenizer import GramTokenizer
from sutil.text.PreProcessor import PreProcessor
from nltk.tokenize import TweetTokenizer

# Load the data
dir_data = "./sutil/datasets/"
df = pd.read_csv(dir_data + 'tweets.csv')

# Clean the data
patterns = [("\d+", "NUMBER")]
c = [("case", "lower"), ("denoise", "english"), ("stopwords", "english"), ("normalize", patterns)]
p2 = PreProcessor(c)
df['clean_tweet'] = df.tweet.apply(p2.preProcess)

vectorizer = TFIDFVectorizer({}, TweetTokenizer())
vectorizer.initialize(df.clean_tweet)
print(vectorizer.dictionary.head())

vectorizer2 = TFIDFVectorizer({}, GramTokenizer())
vectorizer2.initialize(df.clean_tweet)
print(vectorizer2.dictionary.head())

vector = vectorizer.encodePhrase(df.clean_tweet[0])
print(vectorizer.getValues()[0])
print(vector)

vector2 = vectorizer2.encodePhrase(df.clean_tweet[0])
print(vectorizer2.getValues()[0])
print(vector2)

print(df.clean_tweet[0])
print(vectorizer.decodeVector(vector))
print("*" * 50)
print(vectorizer2.decodeVector(vector2))
```
TextDataset
The TextDataset class abstracts a dataset made of text. It includes a vectorizer and a pre-processor used to pre-process the text and to convert between text and vectors in both directions:
```python
from sutil.text.TextDataset import TextDataset
from sutil.text.GramTokenizer import GramTokenizer
from sutil.text.TFIDFVectorizer import TFIDFVectorizer
from sutil.text.PhraseTokenizer import PhraseTokenizer
from sutil.text.PreProcessor import PreProcessor

# Load the data in the standard way
filename = "./sutil/datasets/tweets.csv"
t = TextDataset.standard(filename, ",")
print(t.texts)
print(t.X)
print(t.shape)
print(t.X[0])
print(t.vectorizer.index)
print(t.vectorizer.encodePhrase("united oh"))
x = input("Press enter to continue...")

# Create the dataset with a custom vectorizer and pre-processor
patterns = [("\d+", "NUMBER")]
c = [("case", "lower"), ("denoise", "spanish"), ("stopwords", "spanish"), ("normalize", patterns)]
preprocessor = PreProcessor(c)
vectorizer = TFIDFVectorizer({}, GramTokenizer())
t2 = TextDataset.setvectorizer(filename, vectorizer, preprocessor)
print(t2.texts)
print(t2.X)
print(t2.shape)
print(t2.X[0])
vector = t2.encodePhrase("United oh the")
i = 0
for v in vector:
    if v != 0:
        print(v)
        print(i)
    i += 1
print(t2.vectorizer.decodeVector(vector))
x = input("Press enter to continue...")

patterns = [("\d+", "NUMBER")]
c = [("case", "lower"), ("denoise", "spanish"), ("stopwords", "english"), ("normalize", patterns)]
pre2 = PreProcessor(c)
vectorizer = TFIDFVectorizer({}, PhraseTokenizer())
t3 = TextDataset.setvectorizer(filename, vectorizer, pre2)
print(t3.texts)
print(t3.X)
print(t3.shape)
print(t3.X[0])
vector = t3.encodePhrase("United oh the")
i = 0
for v in vector:
    if v != 0:
        print(v)
        print(i)
    i += 1
print(t3.decodeVector(vector))
```