Peer-Reviewed

Using Human Intelligence to Test the Impact of Popular Preprocessing Steps and Feature Extraction in the Analysis of Human Language

Received: 10 January 2022    Accepted: 5 February 2022    Published: 16 February 2022
Abstract

More than half a century has passed since Chomsky’s theory of language acquisition, Green and colleagues’ first natural language processor, Baseball, and the creation of the Brown Corpus. Throughout the early decades, many believed that once computers became powerful enough, the development of A.I. systems that could understand and interact with humans using our natural languages would quickly follow. Since then, Moore’s Law has largely held; computer storage and performance have kept pace with our imaginations. And yet, 60 years later, even with these dramatic advances in computer technology, we still face major challenges in using computers to understand human language. The authors suggest that these same exponential increases in computational power have led current efforts to rely too heavily on techniques designed to exploit raw computational power, diverting effort from advancing and applying the theoretical study of language to the task. In support of this view, the authors provide empirical evidence exposing the limitations of techniques, such as n-gram extraction, used to pre-process language. In addition, the authors compared three leading natural-language-processing question-answering systems to human performance and found that human subjects far outperformed all question-answering systems tested. The authors conclude by advocating for efforts to discover new approaches that use computational power to support linguistic and cognitive approaches to natural language understanding, rather than current techniques founded on patterns of word frequency.
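To make the critique concrete, n-gram extraction, the preprocessing step named above, amounts to sliding a fixed-length window across a token sequence and counting the resulting fragments as features. The Python sketch below is an editorial illustration, not code from the study; the whitespace tokenizer and the two example sentences are assumptions chosen for brevity.

```python
from collections import Counter

def extract_ngrams(tokens, n):
    """Return every contiguous n-token window in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Naive whitespace tokenization; real pipelines use trained tokenizers.
a = "the dog bit the man".split()
b = "the man bit the dog".split()

# The two sentences mean opposite things, yet their unigram counts are
# identical and three of their four bigrams coincide.
print(Counter(extract_ngrams(a, 1)) == Counter(extract_ngrams(b, 1)))  # True
print(sorted(set(extract_ngrams(a, 2)) & set(extract_ngrams(b, 2))))
# [('bit', 'the'), ('the', 'dog'), ('the', 'man')]
```

Because such features record only local co-occurrence, they cannot distinguish who bit whom; this loss of syntactic structure is the kind of limitation the empirical tests described in the abstract probe.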

Published in International Journal of Data Science and Analysis (Volume 8, Issue 1)
DOI 10.11648/j.ijdsa.20220801.13
Page(s) 18-22
Creative Commons

This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.

Copyright

Copyright © The Author(s), 2022. Published by Science Publishing Group

Keywords

Natural Language Processing, NLP, N-gram, Phrase-Structure Parsing, AllenNLP, DeepPavlov.ai, BERT

References
[1] Chomsky, N. (1957). Syntactic Structures. The Hague: Mouton.
[2] Green, B., Wolf, A., Chomsky, C., & Laughery, K. (1961). Baseball: An automatic question answerer. Proceedings of the Western Joint IRE-AIEE-ACM Computer Conference, 19, 219-224.
[3] Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press.
[4] Moore, G. E. (1965, April 19). Cramming more components onto integrated circuits. Electronics, 38 (8), 114 ff.
[5] Roser, M., & Ritchie, H. (2017). Technological progress. https://ourworldindata.org/technological-progress.
[6] Panesar, K. (2020). Conversational artificial intelligence: Demystifying statistical vs linguistic NLP solutions. Journal of Computer-Assisted Linguistic Research, 4, 47-79. http://hdl.handle.net/10454/18121.
[7] Manning, C. D. (2016, June 23). Language is communication; Texts are knowledge. The Future of Artificial Intelligence Conference, Stanford University, Stanford, CA. https://www.vimeo.com/173057086.
[8] Mikolov, T. (2018, August 25). When shall we achieve human-level AI? Human-Level AI Conference, Prague, Czech Republic. https://www.slideslive.com/38910040/when-shall-we-achieve-humanlevel-ai.
[9] Dunietz, J., Burnham, G., Bharadwaj, A., Rambow, O., Chu-Carroll, J., & Ferrucci, D. (2020). To test machine comprehension, start by defining comprehension. Proceedings of the 58th annual meeting of the Association for Computational Linguistics, 7839-7859. https://www.aclweb.org/anthology/2020.acl-main.pdf.
[10] Dolch, E. W. (1936). A basic sight vocabulary. The Elementary School Journal, 36 (6), 456-460.
[11] Davies, M. (2020-). The Corpus of Contemporary American English (COCA): 1 billion words, 1990-present. https://www.english-corpora.org/coca.
[12] Brysbaert, M., & New, B. (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41 (4), 977-990.
[13] Hudson, R. (1994). About 37% of word tokens are nouns. Language, 70 (2), 331-339. https://doi.org/10.2307/415831.
[14] Ford, W. R., & Berkeley III, A. R. Understanding natural language using tumbling-frequency phrase chain parsing. U.S. Patent No. 10,783,330, September 22, 2020. http://www.freepatentsonline.com/y2020/0125641.html.
[15] Gardner, M., Grus, J., Neumann, M., Tafjord, O., Dasigi, P., Liu, N. H. S., Peters, M. E., Schmitz, M., & Zettlemoyer, L. (2018). AllenNLP: A deep semantic natural language processing platform. Proceedings of the workshop for NLP Open Source Software (NLP-OSS), 1-6. https://www.aclweb.org/anthology/W18-2501.pdf.
[16] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the conference of the North American chapter of the Association for Computational Linguistics: Human Language Technologies, 1, 4171-4186. https://doi.org/10.18653/v1/N19-1423.
[17] Burtsev, M., Seliverstov, A., Airapetyan, R., Arkhipov, M., Baymurzina, D., Bushkov, N., Gureenkova, O., Khakhulin, T., Kuratov, Y., Kuznetsov, D., Litinsky, A., Logacheva, V., Lymar, A., Malykh, V., Petrov, M., Polulyakh, V., Pugachev, L., Sorokin, A., Vikhreva, M., & Zaynutdinov, M. (2018). DeepPavlov: Open-source library for dialogue systems. Proceedings of the 56th annual meeting of the ACL: System Demonstrations, 122-127. https://doi.org/10.18653/v1/P18-4021.
[18] Ford, W. R. Multi-stage pattern reduction for natural language. U.S. Patent No. 7,599,831, October 6, 2009. http://www.freepatentsonline.com/7599831.html.
[19] Ford, W. R., Berkeley, A. R., & Newman, M. A. Understanding natural language using split-phrase tumbling frequency phrase-chain parsing. U.S. Patent No. 11,055,487, July 6, 2021. http://www.freepatentsonline.com/11055487.html.
Cite This Article
  • APA Style

    Ford, W. R., & Farreras, I. G. (2022). Using Human Intelligence to Test the Impact of Popular Preprocessing Steps and Feature Extraction in the Analysis of Human Language. International Journal of Data Science and Analysis, 8(1), 18-22. https://doi.org/10.11648/j.ijdsa.20220801.13


  • ACS Style

    Ford, W. R.; Farreras, I. G. Using Human Intelligence to Test the Impact of Popular Preprocessing Steps and Feature Extraction in the Analysis of Human Language. Int. J. Data Sci. Anal. 2022, 8(1), 18-22. doi: 10.11648/j.ijdsa.20220801.13


  • AMA Style

    Ford WR, Farreras IG. Using Human Intelligence to Test the Impact of Popular Preprocessing Steps and Feature Extraction in the Analysis of Human Language. Int J Data Sci Anal. 2022;8(1):18-22. doi: 10.11648/j.ijdsa.20220801.13


  • @article{10.11648/j.ijdsa.20220801.13,
      author = {W. Randolph Ford and Ingrid G. Farreras},
      title = {Using Human Intelligence to Test the Impact of Popular Preprocessing Steps and Feature Extraction in the Analysis of Human Language},
      journal = {International Journal of Data Science and Analysis},
      volume = {8},
      number = {1},
      pages = {18-22},
      doi = {10.11648/j.ijdsa.20220801.13},
      url = {https://doi.org/10.11648/j.ijdsa.20220801.13},
      eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ijdsa.20220801.13},
      abstract = {More than half a century has passed since Chomsky’s theory of language acquisition, Green and colleagues’ first natural language processor, Baseball, and the creation of the Brown Corpus. Throughout the early decades, many believed that once computers became powerful enough, the development of A.I. systems that could understand and interact with humans using our natural languages would quickly follow. Since then, Moore’s Law has largely held; computer storage and performance have kept pace with our imaginations. And yet, 60 years later, even with these dramatic advances in computer technology, we still face major challenges in using computers to understand human language. The authors suggest that these same exponential increases in computational power have led current efforts to rely too heavily on techniques designed to exploit raw computational power, diverting effort from advancing and applying the theoretical study of language to the task. In support of this view, the authors provide empirical evidence exposing the limitations of techniques, such as n-gram extraction, used to pre-process language. In addition, the authors compared three leading natural-language-processing question-answering systems to human performance and found that human subjects far outperformed all question-answering systems tested. The authors conclude by advocating for efforts to discover new approaches that use computational power to support linguistic and cognitive approaches to natural language understanding, rather than current techniques founded on patterns of word frequency.},
      year = {2022}
    }
    


  • TY  - JOUR
    T1  - Using Human Intelligence to Test the Impact of Popular Preprocessing Steps and Feature Extraction in the Analysis of Human Language
    AU  - W. Randolph Ford
    AU  - Ingrid G. Farreras
    Y1  - 2022/02/16
    PY  - 2022
    N1  - https://doi.org/10.11648/j.ijdsa.20220801.13
    DO  - 10.11648/j.ijdsa.20220801.13
    T2  - International Journal of Data Science and Analysis
    JF  - International Journal of Data Science and Analysis
    JO  - International Journal of Data Science and Analysis
    SP  - 18
    EP  - 22
    PB  - Science Publishing Group
    SN  - 2575-1891
    UR  - https://doi.org/10.11648/j.ijdsa.20220801.13
    AB  - More than half a century has passed since Chomsky’s theory of language acquisition, Green and colleagues’ first natural language processor, Baseball, and the creation of the Brown Corpus. Throughout the early decades, many believed that once computers became powerful enough, the development of A.I. systems that could understand and interact with humans using our natural languages would quickly follow. Since then, Moore’s Law has largely held; computer storage and performance have kept pace with our imaginations. And yet, 60 years later, even with these dramatic advances in computer technology, we still face major challenges in using computers to understand human language. The authors suggest that these same exponential increases in computational power have led current efforts to rely too heavily on techniques designed to exploit raw computational power, diverting effort from advancing and applying the theoretical study of language to the task. In support of this view, the authors provide empirical evidence exposing the limitations of techniques, such as n-gram extraction, used to pre-process language. In addition, the authors compared three leading natural-language-processing question-answering systems to human performance and found that human subjects far outperformed all question-answering systems tested. The authors conclude by advocating for efforts to discover new approaches that use computational power to support linguistic and cognitive approaches to natural language understanding, rather than current techniques founded on patterns of word frequency.
    VL  - 8
    IS  - 1
    ER  - 


Author Information
  • W. Randolph Ford, Department of Data Science, Harrisburg University of Science and Technology, Harrisburg, United States of America

  • Ingrid G. Farreras, Department of Psychology, Hood College, Frederick, United States of America
