Language Scaling: Applications, Challenges and Approach

Lecture Tutorial at the KDD 2021 Conference, August 14-18, 2021

Linjun Shou, Ming Gong, Jian Pei, Xiubo Geng, Xinjie Zhou and Daxin Jiang

Abstract

Language scaling aims to deploy Natural Language Processing (NLP) applications economically across many countries and regions with different languages all over the world. Although recent deep learning techniques have achieved strong performance in NLP, they rely heavily on large amounts of human-labeled data. Unfortunately, most languages are resource-low, that is, they have very limited linguistic resources. Language scaling, which transfers knowledge from resource-rich languages, such as English, to resource-low languages, is invaluable to the advance of social welfare. It has attracted heavy investment from industry players that want to deploy their applications and services to global markets. At the same time, scaling out NLP applications to various languages, essentially a data science problem, remains a grand challenge due to the huge differences in morphology, syntax, and pragmatics among languages, and thus has attracted intensive interest from researchers in machine learning, data mining, and natural language processing.

Thanks to recent progress in deep learning, pre-training, and transfer learning, many approaches to language scaling have been proposed in the past few years. It is high time to provide a comprehensive survey and tutorial to promote further research and practical applications. In this tutorial, we start with a clear problem description of language scaling and an intuitive discussion of the overall challenges. Then, we outline the two major categories of approaches to language scaling, namely, model transfer and data transfer, and present a taxonomy that summarizes the various methods in the literature. A large part of the tutorial is organized around different types of NLP applications. For each type, we pick one representative application to elaborate on the training/evaluation data as well as the methods. We also demonstrate how to generalize the methods to similar applications of the same type, and share the lessons and experience we learned in developing the applications for Microsoft products. Finally, we discuss several important challenges in this area and future directions.
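To make the model transfer category concrete, the minimal sketch below shows zero-shot cross-lingual transfer: a multilingual pre-trained encoder is fine-tuned on labeled data in a resource-rich language (English) and then applied directly to text in a resource-low language without any target-language labels. This is only an illustrative example, assuming the Hugging Face transformers and torch packages and the public xlm-roberta-base checkpoint; the toy data and single training step are hypothetical and are not the specific methods presented in the tutorial.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "xlm-roberta-base"  # any multilingual encoder works in principle
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Step 1: fine-tune on labeled data in the resource-rich language (English).
texts = ["great product, works as advertised", "terrible service, would not recommend"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
loss = model(**batch, labels=labels).loss  # one illustrative update; real training loops over many batches
loss.backward()
optimizer.step()

# Step 2: apply the same model, zero-shot, to a resource-low language.
# No target-language labels are used; the shared multilingual representation
# carries the task knowledge across languages.
model.eval()
with torch.no_grad():
    target_batch = tokenizer(["produit excellent, je le recommande"], return_tensors="pt")
    prediction = model(**target_batch).logits.argmax(dim=-1)
print(prediction.item())

Data transfer, by contrast, would translate the English training data (or the target-language test data) with a machine translation system and then train or evaluate in a single language; the tutorial compares both families of approaches in detail.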

Tutorial Materials

Presenters

Daxin Jiang (homepage)

Daxin Jiang, Ph.D., Chief Scientist, Microsoft Software Technology Center Asia. Daxin Jiang has years of experience in research and engineering in Machine Learning, Data Mining, Natural Language Processing, and Bioinformatics. He received his Ph.D. in Computer Science from the State University of New York at Buffalo in 2005. He has published extensively in prestigious conferences and journals, and has served as a PC member of numerous conferences. He received the Best Application Paper Award of SIGKDD’08 and was Runner-up for the Best Application Paper Award of SIGKDD’04. Daxin leads an R&D group at Microsoft with 170+ applied scientists and engineers developing NLP algorithms, applications, and platforms that support various Microsoft products, including Bing, Cortana, Teams, Outlook, and Microsoft Cognitive Services.

Address: 5 Danling Street, Hai Dian, Beijing, China, 100080. Email: djiang@microsoft.com. Tel: +86 (10) 5917 3321.

Jian Pei (homepage)

Jian Pei, Ph.D., Professor, School of Computing Science, Simon Fraser University. His expertise is in developing effective and efficient data analysis techniques for novel data-intensive applications. He is a research leader in the general areas of data science, big data, data mining, and database systems. He is recognized as a Fellow of the Royal Society of Canada (RSC, the national academy of Canada), the Canadian Academy of Engineering (CAE), ACM, and IEEE. He is one of the most cited authors in data mining, database systems, and information retrieval. His research has generated remarkable impact well beyond academia. His algorithms have been adopted by industry in production and in popular open-source software suites. He is responsible for several commercial systems of record-breaking scale. As a renowned professional leader, he has played important roles in many academic organizations and activities. He is the Chair of ACM SIGKDD and was the Editor-in-Chief of IEEE TKDE. He has received many prestigious awards, including the 2017 ACM SIGKDD Innovation Award and the 2015 ACM SIGKDD Service Award. During his most recent leave of absence from the university, he held executive roles at two Fortune Global 500 companies. He is a mentor of the Creative Destruction Lab (CDL).

Address: 8888 University Drive, Burnaby, BC Canada, V5A 1S6. Email: jpei@cs.sfu.ca. Tel: +1 (778) 782 6851. Fax: +1 (778) 782 3045.

Linjun Shou (homepage)

Linjun Shou, Senior Applied Scientist Manager, Microsoft Software Technology Center Asia. He has published at prestigious international conferences such as ACL, EMNLP, COLING, SIGKDD, AAAI, and WSDM, and has served on the program committees of numerous conferences. His research interests include question answering, cross-lingual transfer learning, and representation learning. Much of his research has been transferred to Microsoft products such as Bing universal question answering, query understanding, document understanding, and news/tweet ranking systems. He also actively contributes to the academic community through open-source projects (e.g., NeuronBlocks) and benchmarks such as XGLUE and CodeXGLUE.

Address: 5 Danling Street, Hai Dian, Beijing, China, 100080. Email: lisho@microsoft.com.

Ming Gong (homepage)

Ming Gong, Ph.D., Principal Applied Scientist Manager, Microsoft Software Technology Center Asia. She received her Ph.D. in Graphics and Visual Computing in 2013 from the Institute of Computing Technology, Chinese Academy of Sciences. She leads an elite team of 10+ applied scientists and engineers developing novel NLP technologies for AI applications. Her research interests include question answering, search intelligence, multilingual/cross-modal modeling, and representation learning. She has published 30+ papers in top conferences and journals (e.g., ACL, COLING, SIGKDD, EMNLP, WSDM, AAAI, PR, CVIU), and has served as a PC member of top NLP/AI conferences. She also actively contributes to the academic community through open-source projects (e.g., NeuronBlocks) and benchmarks such as XGLUE and CodeXGLUE. Many of these novel technologies have been transferred to Microsoft’s global products and online services, including Bing search, multilingual question answering services, and the document understanding platform.

Address: 5 Danling Street, Hai Dian, Beijing, China, 100080. Email: migon@microsoft.com.

Xiubo Geng (homepage)

Xiubo Geng, Ph.D., Senior Applied Scientist, Microsoft Software Technology Center Asia. She received her Ph.D. in Computer Science in 2011 from the Institute of Computing Technology, Chinese Academy of Sciences. Her research interests include machine learning, search intelligence, question answering, multilingual modeling, and reasoning. She has published at top conferences (e.g., SIGIR, NeurIPS, WWW, EMNLP, and IJCAI), and has served as a PC member of several conferences.

Address: 5 Danling Street, Hai Dian, Beijing, China, 100080. Email: xigeng@microsoft.com.

Xinjie Zhou

Xinjie Zhou, Ph.D., Senior Software Engineer Lead, Microsoft Software Technology Center Asia. He received his Ph.D. in natural language processing in 2017 from Peking University. His research interests include machine learning, natural language understanding, and multilingual modeling. He has published at top conferences including ACL, EMNLP, and AAAI.

Address: Bldg #25, 328 Xinghu Street, SIP, Suzhou, China, 215000. Email: xinjzhou@microsoft.com.
