Soldai utilities for machine learning and text processing

Detailed description of the soldai-utils Python project


Sutil

This repository contains a set of tools for machine learning and natural language processing tasks, including classes to run quick experiments with different classification models.

Dataset

This class loads CSV-style datasets in which all features are comma-separated and the class label is in the last column. It includes functions to normalize the features, add a bias term, and save the data to a file and load it back. It also includes a function to split the data into train, validation, and test datasets.

```python
from sutil.base.Dataset import Dataset

datafile = './sutil/datasets/ex2data1.txt'
d = Dataset.fromDataFile(datafile, ',')
print(d.size)

sample = d.sample(0.3)
print(sample.size)
sample.save("modelo_01")

train, validation, test = d.split(train=0.8, validation=0.2)
print(train.size)
print(validation.size)
print(test.size)
```

Regularized Logistic Regression

You can also include your own models. An example is the regularized logistic regression, implemented by hand with numpy and included in the sutil.models package.


Sklearn Model

You can also embed sklearn models in a wrapper class to run experiments with the different models implemented in sklearn. In the same style, you can create tensorflow, keras, or PyTorch models by extending the sutil.models.Model class and implementing the trainModel and predict methods.

```python
import numpy as np
from sutil.base.Dataset import Dataset
from sutil.models.SklearnModel import SklearnModel
from sklearn.linear_model import LogisticRegression

datafile = './sutil/datasets/ex2data1.txt'
d = Dataset.fromDataFile(datafile, ',')
ms = LogisticRegression()
m = SklearnModel('Sklearn Logistic', ms)
m.trainModel(d)
m.score(d.X, d.y)
m.roc.plot()
m.roc.zoom((0, 0.4), (0.5, 1.0))
```

Neural Network Classifier

This class lets you perform classification with a neural network, a multilayer perceptron classifier. It wraps the sklearn MLPClassifier and implements a method that searches over different activations, solvers, and hidden layer structures. You can also pass your own parameters to initialize the network the way you want.

```python
from sutil.base.Dataset import Dataset
from sutil.neuralnet.NeuralNetworkClassifier import NeuralNetworkClassifier

datafile = './sutil/datasets/ex2data1.txt'
d = Dataset.fromDataFile(datafile, ',')
d.normalizeFeatures()
sample = d.sample(examples=30)

nn = NeuralNetworkClassifier((d.n, len(d.labels)))
nn.searchParameters(sample)
nn.trainModel(d)
nn.score(d.X, d.y)
nn.roc.plot()
```

Experiment

The Experiment class lets you perform the data split and test it against different models, comparing them automatically.

```python
import numpy as np
from sutil.base.Dataset import Dataset
from sklearn.linear_model import LogisticRegression
from sutil.base.Experiment import Experiment
from sutil.models.SklearnModel import SklearnModel
from sutil.models.RegularizedLogisticRegression import RegularizedLogisticRegression
from sutil.neuralnet.NeuralNetworkClassifier import NeuralNetworkClassifier

# Load the data
datafile = './sutil/datasets/ex2data1.txt'
d = Dataset.fromDataFile(datafile, ',')
d.normalizeFeatures()
print("Size of the dataset... ")
print(d.size)
sample = d.sample(0.3)
print("Size of the sample... ")
print(sample.size)

# Create the models
theta = np.zeros((d.n + 1, 1))
lr = RegularizedLogisticRegression(theta, 0.03, 0)
m = SklearnModel('Sklearn Logistic', LogisticRegression())

# Look for the best parameters using a sample
nn = NeuralNetworkClassifier((d.n, len(d.labels)))
nn.searchParameters(sample)
input("Press enter to continue...")

# Create the experiment
experiment = Experiment(d, None, 0.8, 0.2)
experiment.addModel(lr, name='Sutil Logistic Regression')
experiment.addModel(m, name='Sklearn Logistic Regression')
experiment.addModel(nn, name='Sutil Neural Network')

# Run the experiment
experiment.run(plot=True)
```

Text Utilities

Sutil includes text utilities to process and transform text for classification.

PreProcessor

The PreProcessor class lets you apply text pre-processing functions to transform the data. It wraps nltk methods and its own methods to perform:

  • Case normalization
  • Noise removal
  • Stemming
  • Lemmatization
  • Pattern-based text normalization

```python
from sutil.text.PreProcessor import PreProcessor

string = "La Gata maullaba en la noche $'@|··~½¬½¬{{[[]}aqAs   qasdas 1552638"
p = PreProcessor.standard()
print(p.preProcess(string))

patterns = [(r"\d+", "NUMBER")]
c = [("case", "lower"), ("denoise", "spanish"), ("stopwords", "spanish"),
     ("stem", "spanish"), ("lemmatize", "spanish"), ("normalize", patterns)]
p2 = PreProcessor(c)
print(p2.preProcess(string))

c = [("case", "lower"), ("denoise", "spanish"), ("stem", "spanish"),
     ("normalize", patterns)]
p3 = PreProcessor(c)
print(p3.preProcess(string))
```

PhraseTokenizer

The PhraseTokenizer lets you split a phrase into tokens given a delimiter character. There is also a GramTokenizer class that lets you split words into chunks of a fixed number of characters.

```python
from sutil.text.GramTokenizer import GramTokenizer
from sutil.text.PhraseTokenizer import PhraseTokenizer

string = "Hi I'm a really helpful string"
t = PhraseTokenizer()
print(t.tokenize(string))
t2 = GramTokenizer()
print(t2.tokenize(string))
```
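To illustrate the idea behind GramTokenizer, here is a plain-Python sketch of character n-gram splitting. This is not the sutil implementation, and the library's actual default gram size is not documented on this page.

```python
def char_ngrams(text, n=3):
    # Slide a window of n characters across the text
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("helpful", 3))  # ['hel', 'elp', 'lpf', 'pfu', 'ful']
```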

TextVectorizer

The TextVectorizer class is an abstraction over methods that turn text into vectors and vectors back into text. Sutil implements OneHotVectorizer, TFIDFVectorizer, and CountVectorizer.

```python
import pandas as pd
from sutil.text.TFIDFVectorizer import TFIDFVectorizer
from sutil.text.GramTokenizer import GramTokenizer
from sutil.text.PreProcessor import PreProcessor
from nltk.tokenize import TweetTokenizer

# Load the data
dir_data = "./sutil/datasets/"
df = pd.read_csv(dir_data + 'tweets.csv')

# Clean the data
patterns = [(r"\d+", "NUMBER")]
c = [("case", "lower"), ("denoise", "english"), ("stopwords", "english"),
     ("normalize", patterns)]
p2 = PreProcessor(c)
df['clean_tweet'] = df.tweet.apply(p2.preProcess)

vectorizer = TFIDFVectorizer({}, TweetTokenizer())
vectorizer.initialize(df.clean_tweet)
print(vectorizer.dictionary.head())

vectorizer2 = TFIDFVectorizer({}, GramTokenizer())
vectorizer2.initialize(df.clean_tweet)
print(vectorizer2.dictionary.head())

vector = vectorizer.encodePhrase(df.clean_tweet[0])
print(vectorizer.getValues()[0])
print(vector)
vector2 = vectorizer2.encodePhrase(df.clean_tweet[0])
print(vectorizer2.getValues()[0])
print(vector2)

print(df.clean_tweet[0])
print(vectorizer.decodeVector(vector))
print("*" * 50)
print(vectorizer2.decodeVector(vector2))
```
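To make concrete what these three vectorizers compute, here is a plain-Python sketch of one-hot, count, and TF-IDF encoding over a small tokenized corpus. It is independent of sutil, and the library's actual formulas may differ in normalization details.

```python
import math
from collections import Counter

docs = [["hi", "helpful", "string"], ["hi", "hi", "other"]]
vocab = sorted({w for doc in docs for w in doc})

def count_vector(doc):
    # Count occurrences of each vocabulary word in the document
    c = Counter(doc)
    return [c[w] for w in vocab]

def one_hot_vector(doc):
    # 1 if the word appears at all, 0 otherwise
    return [1 if w in doc else 0 for w in vocab]

def tfidf_vector(doc):
    # Term frequency scaled by inverse document frequency
    c = Counter(doc)
    n = len(docs)
    vec = []
    for w in vocab:
        df = sum(1 for d in docs if w in d)
        idf = math.log(n / df) if df else 0.0
        vec.append((c[w] / len(doc)) * idf)
    return vec

print(count_vector(docs[1]))    # [0, 2, 1, 0]
print(one_hot_vector(docs[1]))  # [0, 1, 1, 0]
```

Words that appear in every document (like "hi" here) get a TF-IDF weight of zero, which is why TF-IDF tends to down-weight uninformative tokens.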

TextDataset

The TextDataset class abstracts a dataset made of texts. It includes a vectorizer and a pre-processor, used to pre-process the text and to convert it from text to vector and from vector back to text:

```python
from sutil.text.TextDataset import TextDataset
from sutil.text.GramTokenizer import GramTokenizer
from sutil.text.TFIDFVectorizer import TFIDFVectorizer
from sutil.text.PhraseTokenizer import PhraseTokenizer
from sutil.text.PreProcessor import PreProcessor

# Load the data in the standard way
filename = "./sutil/datasets/tweets.csv"
t = TextDataset.standard(filename, ",")
print(t.texts)
print(t.X)
print(t.shape)
print(t.X[0])
print(t.vectorizer.index)
print(t.vectorizer.encodePhrase("united oh"))
x = input("Press enter to continue...")

# Create the dataset with a custom vectorizer and pre-processor
patterns = [(r"\d+", "NUMBER")]
c = [("case", "lower"), ("denoise", "spanish"), ("stopwords", "spanish"),
     ("normalize", patterns)]
preprocessor = PreProcessor(c)
vectorizer = TFIDFVectorizer({}, GramTokenizer())
t2 = TextDataset.setvectorizer(filename, vectorizer, preprocessor)
print(t2.texts)
print(t2.X)
print(t2.shape)
print(t2.X[0])
vector = t2.encodePhrase("United oh the")
i = 0
for v in vector:
    if v != 0:
        print(v)
        print(i)
    i += 1
print(t2.vectorizer.decodeVector(vector))
x = input("Press enter to continue...")

patterns = [(r"\d+", "NUMBER")]
c = [("case", "lower"), ("denoise", "spanish"), ("stopwords", "english"),
     ("normalize", patterns)]
pre2 = PreProcessor(c)
vectorizer = TFIDFVectorizer({}, PhraseTokenizer())
t3 = TextDataset.setvectorizer(filename, vectorizer, pre2)
print(t3.texts)
print(t3.X)
print(t3.shape)
print(t3.X[0])
vector = t3.encodePhrase("United oh the")
i = 0
for v in vector:
    if v != 0:
        print(v)
        print(i)
    i += 1
print(t3.decodeVector(vector))
```

java在活动中显示转换的文件   java如何在每小时开始时使用Quartz启动cron?   用于以编程方式删除所有注释的Java正则表达式   oracle EXP实用程序通过Java仅导出少数表   java如何计算太阳黑子的方向和速度?   java为给定时间安排作业   java Linux IntelliJ Chrome WebDriverManager“Chrome(或任何其他浏览器)无法启动”   从数据包头设置Java字节数组大小的socket   java docx4j:使用Eclipse导出时,文档中的所有样式都会消失   java无法在更新通知后取消通知   java将自定义对象绑定到JMS映射消息   java是泛型堆栈构造函数的BigO   java Netbeans API:如何保存当前项目中的一个文件或所有文件?   java使用Hibernate envers时,什么可能导致奇怪的属性解析错误?   foreach使用Collection时Java ConcurrentModificationException背后的原因。删除()   mongodb Java:从Json文档读取单个值