A. Berger and H. Printz. 1998. Recognition performance of a large-scale dependency-grammar language model. In Int'l Conference on Spoken Language Processing (ICSLP'98), Sydney, Australia.

A. Blum. 1992. Learning boolean functions in an infinite attribute space. Machine Learning, 9(4):373-386.

E. Brill. 1995. Transformation-based error-driven learning and natural language processing: A case study in part of speech tagging. Computational Linguistics, 21(4):543-565.

C. Chelba and F. Jelinek. 1998. Exploiting syntactic structure for language modeling. In COLING-ACL '98.

C. Cumby and D. Roth. 2000. Relational representations that facilitate learning. In Proc. of the International Conference on the Principles of Knowledge Representation and Reasoning. To appear.

I. Dagan, L. Lee, and F. Pereira. 1999. Similarity-based models of word cooccurrence probabilities. Machine Learning, 34(1-3):43-69.

A. R. Golding and D. Roth. 1999. A Winnow based approach to context-sensitive spelling correction. Machine Learning, 34(1-3):107-130. Special Issue on Machine Learning and Natural Language.

F. Jelinek. 1998. Statistical Methods for Speech Recognition. MIT Press.

D. Jurafsky and J. H. Martin. 2000. Speech and Language Processing. Prentice Hall.




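For example, in the first string the AuthorName (A. Berger) is followed by "and" and another author name (H. Printz.), and then by the year (1998.). In the second string, however, the AuthorName (A. Blum.) is followed immediately by the Year (1992.).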


I am unable to get a proper result because the strings do not have any specific end marker. However, every new string starts with the author name(s), followed by the year.

This may be sufficient. I wrote a regular expression that works on your whole sample:

((?:(?<![a-zA-Z])[A-Z]\.[ \t]+)+[A-Z][a-zA-Z]+(?:[ \t]*,[ \t]*(?:(?<![a-zA-Z])[A-Z]\.[ \t]+)+[A-Z][a-zA-Z]+)*(?:[ \t]*,)?(?:[ \t]+and[ \t]+(?:(?<![a-zA-Z])[A-Z]\.[ \t]+)+[A-Z][a-zA-Z]+)*[ \t]*\.[ \t]*\d{4}[ \t]*\.)(?!\S)




>>> import re
>>> # biblioStr holds the concatenated bibliography text
>>> Rx = re.compile( r"((?:(?<![a-zA-Z])[A-Z]\.[ \t]+)+[A-Z][a-zA-Z]+(?:[ \t]*,[ \t]*(?:(?<![a-zA-Z])[A-Z]\.[ \t]+)+[A-Z][a-zA-Z]+)*(?:[ \t]*,)?(?:[ \t]+and[ \t]+(?:(?<![a-zA-Z])[A-Z]\.[ \t]+)+[A-Z][a-zA-Z]+)*[ \t]*\.[ \t]*\d{4}[ \t]*\.)(?!\S)" )
>>> print (re.sub( Rx, r'\r\n\1', biblioStr ))
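If you would rather get the entries back as a list than insert line breaks, something along these lines might work. It is only a sketch: it builds the same pattern from a repeated "initials plus surname" piece, and the names `NAME`, `HEADER`, and `split_entries` are my own, not anything from your code. It assumes every entry starts with an author list followed by a four-digit year, and it drops any text that comes before the first match.

```python
import re

# One or more initials followed by a surname, e.g. "A. R. Golding".
# (NAME, HEADER and split_entries are illustrative names.)
NAME = r"(?:(?<![a-zA-Z])[A-Z]\.[ \t]+)+[A-Z][a-zA-Z]+"

# Author list, then ". YYYY." followed by whitespace or end of input.
HEADER = re.compile(
    NAME
    + r"(?:[ \t]*,[ \t]*" + NAME + r")*"    # further authors: ", L. Lee"
    + r"(?:[ \t]*,)?"                        # optional comma before "and"
    + r"(?:[ \t]+and[ \t]+" + NAME + r")*"  # last author: " and F. Pereira"
    + r"[ \t]*\.[ \t]*\d{4}[ \t]*\.(?!\S)"  # year: ". 1999."
)


def split_entries(text):
    """Cut text at every position where an 'authors. year.' header starts."""
    starts = [m.start() for m in HEADER.finditer(text)]
    bounds = starts + [len(text)]
    # Each entry runs from one header start to the next (or to the end).
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]


sample = (
    "A. Berger and H. Printz. 1998. Recognition performance of a large-scale "
    "dependency-grammar language model. "
    "A. Blum. 1992. Learning boolean functions in an infinite attribute space. "
    "Machine Learning, 9(4):373-386. "
    "I. Dagan, L. Lee, and F. Pereira. 1999. Similarity-based models of word "
    "cooccurrence probabilities."
)

entries = split_entries(sample)
for entry in entries:
    print(entry)
```

Slicing between match start positions keeps the header attached to the text that follows it, which is what the `re.sub` version does by putting the line break in front of group 1.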

