Latent Semantic Analysis: five methodological recommendations
The recent surge in the generation, storage, and availability of textual data presents researchers with the challenge of developing suitable methods for its analysis. Latent Semantic Analysis (LSA) belongs to a family of methodological approaches that offer an opportunity to address this gap by describing the semantic content of textual data as a set of vectors. It was pioneered by researchers in psychology, information retrieval, and bibliometrics. LSA involves a matrix operation called singular value decomposition, an extension of principal component analysis. The latent semantic dimensions that LSA generates are put to one of two uses. If the researcher's primary interest lies in understanding the thematic structure of the textual data, the dimensions are interpreted. If the interest lies in converting raw text into numerical data as a precursor to subsequent analysis, they are used for clustering, categorization, and predictive modeling. This paper reviews five methodological issues that a researcher embarking on LSA needs to address. We examine the dilemmas, present the choices, and discuss the considerations under which good methodological decisions are made. We illustrate these issues with the help of four small studies, involving the analysis of abstracts for papers published in the European Journal of Information Systems.
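As a rough illustration of the pipeline the abstract describes (text to vectors, then singular value decomposition, then latent dimensions), the sketch below runs LSA on a toy corpus with NumPy. It is a minimal sketch under stated assumptions, not the authors' procedure: the four short documents, the TF-IDF weighting, the choice of k = 2 dimensions, and all function names are hypothetical choices made for this example.

```python
import numpy as np

# Toy corpus (hypothetical; NOT the EJIS abstracts analyzed in the paper).
docs = [
    "edi adoption in insurance trading",
    "edi standards and trading relations",
    "telehealth adoption in hospitals",
    "hospitals and telehealth innovation",
]

def term_document_matrix(docs):
    """Build a raw term-frequency matrix: rows are terms, columns are documents."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.split():
            X[index[w], j] += 1.0
    return vocab, X

def tfidf(X):
    """Weight raw counts by inverse document frequency (one assumed weighting scheme)."""
    n_docs = X.shape[1]
    df = np.count_nonzero(X, axis=1)   # number of documents containing each term
    idf = np.log(n_docs / df)
    return X * idf[:, None]

def lsa(X, k):
    """Truncated SVD: keep the k largest singular values and their vectors."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

vocab, X = term_document_matrix(docs)
W = tfidf(X)
U_k, s_k, Vt_k = lsa(W, k=2)           # k = 2 is an arbitrary choice for the toy data

# Loadings on the latent semantic dimensions, scaled by the singular values.
term_loadings = U_k * s_k              # shape: (number of terms, k)
doc_loadings = Vt_k.T * s_k            # shape: (number of documents, k)

for dim in range(2):
    top = np.argsort(-np.abs(term_loadings[:, dim]))[:3]
    print(f"dimension {dim}: highest-loading terms = {[vocab[i] for i in top]}")
```

The term loadings are what a researcher would inspect when interpreting the dimensions, while the document loadings are the numerical representation that could feed clustering, categorization, or predictive modeling.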
Author information
Authors and Affiliations
- Nicholas Evangelopoulos, Information Technology and Decision Sciences Department, College of Business, University of North Texas, U.S.A.
- Xiaoni Zhang, Department of Business Informatics, College of Informatics, Northern Kentucky University, U.S.A.
- Victor R Prybutok, Information Technology and Decision Sciences Department, College of Business, University of North Texas, U.S.A.