Methodology
Methodologies should present a new experimental or computational method, test or procedure. The method described may either be completely new, or may offer a better version of an existing method. The article must describe a demonstrable advance on what is currently available. The method needs to have been well tested and ideally, but not necessarily, used in a way that proves its value.
Systematic Reviews strongly encourages that all datasets on which the conclusions of the paper rely should be available to readers. We encourage authors to ensure that their datasets are either deposited in publicly available repositories (where available and appropriate) or presented in the main manuscript or additional supporting files whenever possible. Please see Springer Nature’s information on recommended repositories.

Preparing your manuscript
The information below details the section headings that you should include in your manuscript and what information should be within each section.
Please note that your manuscript must include a 'Declarations' section including all of the subheadings (please see below for more information).
The title page should:
- "A versus B in the treatment of C: a randomized controlled trial", "X is a risk factor for Y: a case control study", "What is the impact of factor X on subject Y: A systematic review"
- or for non-clinical or non-research studies a description of what the article reports
- if a collaboration group should be listed as an author, please list the Group name as an author. If you would like the names of the individual members of the Group to be searchable through their individual PubMed records, please include this information in the “Acknowledgements” section in accordance with the instructions below
- Large Language Models (LLMs), such as ChatGPT, do not currently satisfy our authorship criteria. Notably, an attribution of authorship carries with it accountability for the work, which cannot be effectively applied to LLMs. Use of an LLM should be properly documented in the Methods section (and if a Methods section is not available, in a suitable alternative part) of the manuscript.
- indicate the corresponding author
The Abstract should not exceed 350 words. Please minimize the use of abbreviations and do not cite references in the abstract. Reports of randomized controlled trials should follow the CONSORT extension for abstracts. The abstract must include the following separate sections:
- Background: the context and purpose of the study
- Methods: how the study was performed and statistical tests used
- Results: the main findings
- Conclusions: brief summary and potential implications
- Trial registration: If your article reports the results of a health care intervention on human participants, it must be registered in an appropriate registry and the registration number and date of registration should be stated in this section. If it was not registered prospectively (before enrollment of the first participant), you should include the words 'retrospectively registered'. See our editorial policies for more information on trial registration
Keywords
Three to ten keywords representing the main content of the article.
The Background section should explain the background to the study, its aims, a summary of the existing literature and why this study was necessary or its contribution to the field.
The methods section should include:
- the aim, design and setting of the study
- the characteristics of participants or description of materials
- a clear description of all processes, interventions and comparisons. Generic drug names should generally be used. When proprietary brands are used in research, include the brand names in parentheses
- the type of statistical analysis used, including a power calculation if appropriate
Results
This should include the findings of the study including, if appropriate, results of statistical analysis which must be included either in the text or as tables and figures.
Discussion
This section should discuss the implications of the findings in the context of existing research and highlight limitations of the study.
Conclusions
This should state clearly the main conclusions and provide an explanation of the importance and relevance of the study reported.
List of abbreviations
If abbreviations are used in the text they should be defined in the text at first use, and a list of abbreviations should be provided.
Declarations
All manuscripts must contain the following sections under the heading 'Declarations':
- Ethics approval and consent to participate
- Consent for publication
- Availability of data and materials
- Competing interests
- Funding
- Authors' contributions
- Acknowledgements
- Authors' information (optional)
Please see below for details on the information to be included in these sections.
If any of the sections are not relevant to your manuscript, please include the heading and write 'Not applicable' for that section.
Manuscripts reporting studies involving human participants, human data or human tissue must:
- include a statement on ethics approval and consent (even where the need for approval was waived)
- include the name of the ethics committee that approved the study and the committee’s reference number if appropriate
Studies involving animals must include a statement on ethics approval and for experimental studies involving client-owned animals, authors must also include a statement on informed consent from the client or owner.
See our editorial policies for more information.
If your manuscript does not report on or involve the use of any animal or human data or tissue, please state “Not applicable” in this section.
If your manuscript contains any individual person’s data in any form (including any individual details, images or videos), consent for publication must be obtained from that person, or in the case of children, their parent or legal guardian. All presentations of case reports must have consent for publication.
You can use your institutional consent form or our consent form if you prefer. You should not send the form to us on submission, but we may request to see a copy at any stage (including after publication).
See our editorial policies for more information on consent for publication.
If your manuscript does not contain data from any individual person, please state “Not applicable” in this section.
All manuscripts must include an ‘Availability of data and materials’ statement. Data availability statements should include information on where data supporting the results reported in the article can be found including, where applicable, hyperlinks to publicly archived datasets analysed or generated during the study. By data we mean the minimal dataset that would be necessary to interpret, replicate and build upon the findings reported in the article. We recognise it is not always possible to share research data publicly, for instance when individual privacy could be compromised, and in such instances data availability should still be stated in the manuscript along with any conditions for access.
Authors are also encouraged to preserve search strings on searchRxiv (https://searchrxiv.org/), an archive to support researchers to report, store and share their searches consistently and to enable them to review and re-use existing searches. searchRxiv enables researchers to obtain a digital object identifier (DOI) for their search, allowing it to be cited.
Data availability statements can take one of the following forms (or a combination of more than one if required for multiple datasets):
- The datasets generated and/or analysed during the current study are available in the [NAME] repository, [PERSISTENT WEB LINK TO DATASETS]
- The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
- All data generated or analysed during this study are included in this published article [and its supplementary information files].
- The datasets generated and/or analysed during the current study are not publicly available due [REASON WHY DATA ARE NOT PUBLIC] but are available from the corresponding author on reasonable request.
- Data sharing is not applicable to this article as no datasets were generated or analysed during the current study.
- The data that support the findings of this study are available from [third party name] but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of [third party name].
- Not applicable. If your manuscript does not contain any data, please state 'Not applicable' in this section.
More examples of template data availability statements, which include examples of openly available and restricted access datasets, are available here .
BioMed Central strongly encourages the citation of any publicly available data on which the conclusions of the paper rely in the manuscript. Data citations should include a persistent identifier (such as a DOI) and should ideally be included in the reference list. Citations of datasets, when they appear in the reference list, should include the minimum information recommended by DataCite and follow journal style. Dataset identifiers including DOIs should be expressed as full URLs. For example:
Hao Z, AghaKouchak A, Nakhjiri N, Farahmand A. Global integrated drought monitoring and prediction system (GIDMaPS) data sets. figshare. 2014. http://dx.doi.org/10.6084/m9.figshare.853801
With the corresponding text in the Availability of data and materials statement:
The datasets generated during and/or analysed during the current study are available in the [NAME] repository, [PERSISTENT WEB LINK TO DATASETS]. [Reference number]
If you wish to co-submit a data note describing your data to be published in BMC Research Notes, you can do so by visiting our submission portal. Data notes support open data and help authors to comply with funder policies on data sharing. Co-published data notes will be linked to the research article the data support (example).
All financial and non-financial competing interests must be declared in this section.
See our editorial policies for a full explanation of competing interests. If you are unsure whether you or any of your co-authors have a competing interest please contact the editorial office.
Please use the authors' initials to refer to each author's competing interests in this section.
If you do not have any competing interests, please state "The authors declare that they have no competing interests" in this section.
All sources of funding for the research reported should be declared. The role of the funding body in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript should be declared.
The individual contributions of authors to the manuscript should be specified in this section. Guidance and criteria for authorship can be found in our editorial policies .
Please use initials to refer to each author's contribution in this section, for example: "FC analyzed and interpreted the patient data regarding the hematological disease and the transplant. RH performed the histological examination of the kidney, and was a major contributor in writing the manuscript. All authors read and approved the final manuscript."
Please acknowledge anyone who contributed towards the article who does not meet the criteria for authorship including anyone who provided professional writing services or materials.
Authors should obtain permission to acknowledge from all those mentioned in the Acknowledgements section.
See our editorial policies for a full explanation of acknowledgements and authorship criteria.
If you do not have anyone to acknowledge, please write "Not applicable" in this section.
Group authorship (for manuscripts involving a collaboration group): if you would like the names of the individual members of a collaboration Group to be searchable through their individual PubMed records, please ensure that the title of the collaboration Group is included on the title page and in the submission system and also include collaborating author names as the last paragraph of the “Acknowledgements” section. Please add authors in the format First Name, Middle initial(s) (optional), Last Name. You can add institution or country information for each author if you wish, but this should be consistent across all authors.
Please note that individual names may not be present in the PubMed record at the time a published article is initially included in PubMed as it takes PubMed additional time to code this information.
Authors' information
This section is optional.
You may choose to use this section to include any relevant information about the author(s) that may aid the reader's interpretation of the article, and understand the standpoint of the author(s). This may include details about the authors' qualifications, current positions they hold at institutions or societies, or any other relevant background information. Please refer to authors using their initials. Note this section should not be used to describe any competing interests.
Footnotes can be used to give additional information, which may include the citation of a reference included in the reference list. They should not consist solely of a reference citation, and they should never include the bibliographic details of a reference. They should also not contain any figures or tables.
Footnotes to the text are numbered consecutively; those to tables should be indicated by superscript lower-case letters (or asterisks for significance values and other statistical data). Footnotes to the title or the authors of the article are not given reference symbols.
Always use footnotes instead of endnotes.
Examples of the Vancouver reference style are shown below.
See our editorial policies for author guidance on good citation practice
Web links and URLs: All web links and URLs, including links to the authors' own websites, should be given a reference number and included in the reference list rather than within the text of the manuscript. They should be provided in full, including both the title of the site and the URL, as well as the date the site was accessed, in the following format: The Mouse Tumor Biology Database. http://tumor.informatics.jax.org/mtbwi/index.do . Accessed 20 May 2013. If an author or group of authors can clearly be associated with a web link, such as for weblogs, then they should be included in the reference.
Example reference style:
Article within a journal
Smith JJ. The world of science. Am J Sci. 1999;36:234-5.
Article within a journal (no page numbers)
Rohrmann S, Overvad K, Bueno-de-Mesquita HB, Jakobsen MU, Egeberg R, Tjønneland A, et al. Meat consumption and mortality - results from the European Prospective Investigation into Cancer and Nutrition. BMC Medicine. 2013;11:63.
Article within a journal by DOI
Slifka MK, Whitton JL. Clinical implications of dysregulated cytokine production. Dig J Mol Med. 2000; doi:10.1007/s801090000086.
Article within a journal supplement
Frumin AM, Nussbaum J, Esposito M. Functional asplenia: demonstration of splenic activity by bone marrow scan. Blood 1979;59 Suppl 1:26-32.
Book chapter, or an article within a book
Wyllie AH, Kerr JFR, Currie AR. Cell death: the significance of apoptosis. In: Bourne GH, Danielli JF, Jeon KW, editors. International review of cytology. London: Academic; 1980. p. 251-306.
OnlineFirst chapter in a series (without a volume designation but with a DOI)
Saito Y, Hyuga H. Rate equation approaches to amplification of enantiomeric excess and chiral symmetry breaking. Top Curr Chem. 2007. doi:10.1007/128_2006_108.
Complete book, authored
Blenkinsopp A, Paxton P. Symptoms in the pharmacy: a guide to the management of common illness. 3rd ed. Oxford: Blackwell Science; 1998.
Online document
Doe J. Title of subordinate document. In: The dictionary of substances and their effects. Royal Society of Chemistry. 1999. http://www.rsc.org/dose/title of subordinate document. Accessed 15 Jan 1999.
Online database
Healthwise Knowledgebase. US Pharmacopeia, Rockville. 1998. http://www.healthwise.org. Accessed 21 Sept 1998.
Supplementary material/private homepage
Doe J. Title of supplementary material. 2000. http://www.privatehomepage.com. Accessed 22 Feb 2000.
University site
Doe, J: Title of preprint. http://www.uni-heidelberg.de/mydata.html (1999). Accessed 25 Dec 1999.
FTP site
Doe, J: Trivial HTTP, RFC2169. ftp://ftp.isi.edu/in-notes/rfc2169.txt (1999). Accessed 12 Nov 1999.
Organization site
ISSN International Centre: The ISSN register. http://www.issn.org (2006). Accessed 20 Feb 2007.
Dataset with persistent identifier
Zheng L-Y, Guo X-S, He B, Sun L-J, Peng Y, Dong S-S, et al. Genome data from sweet and grain sorghum (Sorghum bicolor). GigaScience Database. 2011. http://dx.doi.org/10.5524/100012 .
Figures, tables and additional files
See General formatting guidelines for information on how to format figures, tables and additional files.
Mark Crowther, Wendy Lim, Mark A. Crowther; Systematic review and meta-analysis methodology. Blood 2010; 116 (17): 3140–3146. doi: https://doi.org/10.1182/blood-2010-05-280883
Systematic reviews and meta-analyses are being increasingly used to summarize medical literature and identify areas in which research is needed. Systematic reviews limit bias with the use of a reproducible scientific process to search the literature and evaluate the quality of the individual studies. If possible the results are statistically combined into a meta-analysis in which the data are weighted and pooled to produce an estimate of effect. This article aims to provide the reader with a practical overview of systematic review and meta-analysis methodology, with a focus on the process of performing a review and the related issues at each step.
The average hematologist is faced with increasingly large amounts of new information about hematologic disease. This ranges from the latest findings of complex molecular studies to results from randomized controlled trials (RCTs) to case reports of possible therapies for very rare conditions. With this vast amount of information being produced in published journals, presentations at conferences, and now increasingly online, it is virtually impossible for hematologists to keep up to date without many hours being spent searching and reading articles. For example, a search for ‘deep vein thrombosis’ in PubMed produced 55 568 possible articles, with 831 published in 2010 alone (search performed May 11, 2010). Review articles traditionally provide an overview of a topic and summarize the latest evidence, thus reducing the time clinicians would need to spend performing literature searches and interpreting the primary data. These review articles, known as narrative reviews, typically address a broad number of issues related to a topic. 1 Narrative reviews do not describe the process of searching the literature, article selection, or study quality assessment. The data are usually summarized but not statistically combined (qualitative summary), and key studies are highlighted. The inferences made from narrative reviews may be, but are not necessarily, evidence based. Narrative reviews are useful for obtaining a broad overview of a topic, usually from acknowledged experts. However, narrative reviews are susceptible to bias if a comprehensive literature search is not performed, or if only selected data that convey the author's views on a particular topic are presented. 2
Systematic reviews aim to reduce bias with the use of explicit methods to perform a comprehensive literature search and critical appraisal of the individual studies. Thus, in contrast to narrative reviews, systematic reviews pose a defined clinical question. The process of performing the literature search and the specific inclusion and exclusion criteria used for study selection are described. The quality of the included studies is formally appraised. The data are summarized, and, if the data are statistically combined (quantitative summary), the systematic review is referred to as a meta-analysis. The inferences made from systematic reviews are usually evidence based.
Furthermore, systematic reviews also attempt to identify if certain subtypes of evidence (eg, small negative studies) are absent from the literature; this so-called “publication bias” is an important cause of incorrect conclusions in narrative reviews. 3 Systematic reviews frequently, but not necessarily, use statistical methods, meta-analysis, to combine the data from the literature search to produce a single estimate of effect. 4
In view of the increasing number of systematic reviews published, we feel it is important to discuss the methodology of the systematic review to allow readers to better appreciate and critically appraise systematic reviews that may be relevant for their practice. The findings of systematic reviews can be included in the introduction of scientific papers and are increasingly performed for grant applications to summarize what is known about a topic and highlight areas in which research is needed.
Having the knowledge to appraise a systematic review is an important skill, because systematic reviews are considered to be the study design with the highest level of study quality. Although many studies are labeled as systematic reviews, this does not necessarily indicate that the study itself is of high quality because any group of studies can be subject to a systematic review, and data can almost always be combined in a meta-analysis. The important issue is identifying if the systematic review was conducted in a manner that is replicable and free of bias, and if a meta-analysis was performed whether the data were appropriately combined. An evaluation of the quality of reviews (as measured by specific published criteria) published in 1996 in 6 core general medicine journals found that only 1% of review articles met all the recommended methodologic criteria. 5
The objective of this article is to provide a practical approach to preparing and critically appraising a systematic review. Further guidance can be obtained from the Cochrane Collaboration's Web site, 6 and recommendations for reporting of systematic reviews are outlined by the Quality of Reporting of Meta-analyses (QUOROM) group (for randomized trials 7 ), and the Meta-analysis Of Observational Studies in Epidemiology (MOOSE) group (for observational studies 8 ). A modified version of the Quality of Reporting of Meta-analyses statement is presented in Table 1.
The QUOROM statement on how to report a systematic review
Modified from Moher et al. 7
What is the clinical question that needs to be answered? A careful articulation of the question is critical, because it provides the scope of the review by defining the type of patients, intervention, comparator, and outcomes evaluated in the review. 9 The nature of the question dictates study eligibility; hence, the more specific the question, the more focused the literature search albeit at the expense of decreased generalizability of the results. Thus, a systematic review on the use of colony-stimulating factors in patients with hematologic malignancies will be a far greater undertaking than a systematic review of colony-stimulating factors in preventing chemotherapy-induced febrile neutropenia in children with acute lymphoblastic leukemia. 10 When reading a systematic review one must always ascertain that the investigators are answering the question originally posed. A major cause of bias in a systematic review is answering a different question to that originally asked.
The completeness of the search strategy will determine the comprehensiveness of the review. The more exhaustive the search, the greater the effort required to produce the systematic review, but the resulting review is generally of higher quality. The development of an inclusive search strategy requires expertise, and, unless the investigator is skilled in literature searches, the help of an experienced librarian is invaluable and strongly recommended. It is recommended that the search be performed in duplicate, because one person, especially if he or she is screening thousands of studies, may miss relevant studies. The literature search usually involves searching the following sources.
Electronic databases
Many readers may remember the published Index Medicus in which journal articles were indexed based on topic. This has since been replaced by several electronic, Web-based, searchable databases. By entering a search strategy (usually according to Boolean language [OR, AND, NOT]) the databases provide a list of articles that meet the search criteria. The database to be used depends on what area of medicine the search is to be performed in. Examples of commonly used databases are shown in Table 2 . The most commonly used databases include PubMed, MEDLINE, Embase, and the Cochrane library. MEDLINE is the largest component of PubMed, which is a free online database of biomedical journal citations and abstracts created by the US National Library of Medicine.
Examples of some commonly used electronic databases
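For readers who want to reproduce such citation counts programmatically, the sketch below queries PubMed through NCBI's public E-utilities `esearch` endpoint. It is a minimal illustration only: the Boolean query string is an invented example rather than a validated search strategy, and a review search should still be developed with a librarian and documented in full.

```python
# A minimal sketch of querying PubMed via NCBI's public E-utilities API
# (esearch). The Boolean query below is illustrative only.
import requests

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_search(query: str, retmax: int = 20) -> dict:
    """Return the citation count and the first `retmax` PubMed IDs for a query."""
    params = {"db": "pubmed", "term": query, "retmax": retmax, "retmode": "json"}
    resp = requests.get(ESEARCH_URL, params=params, timeout=30)
    resp.raise_for_status()
    result = resp.json()["esearchresult"]
    return {"count": int(result["count"]), "ids": result["idlist"]}

if __name__ == "__main__":
    # OR-ing alternate spellings reduces the risk of missing relevant studies.
    query = "(hemolytic OR haemolytic) AND (anemia OR anaemia)"
    hits = pubmed_search(query)
    print(f"{hits['count']} citations; first PMIDs: {hits['ids'][:5]}")
```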
Conference abstracts
Many papers are presented at conferences before publication. It can take years for the content of these abstracts to be published. Conference abstracts can be searched and evidence can be extracted before full publication. The abstracts themselves may provide sufficient data to be included in the systematic review or, if a significant publication is anticipated, may warrant contacting the abstract author to obtain information. There are advantages and disadvantages to including conference abstracts. Studies that show inconclusive or negative results for an intervention are less likely to be published in journals but may be published in abstract form. Data from abstracts, reports, or other documents that are not distributed or indexed by commercial publishers (and which may be difficult to locate) are known as “gray literature.” 11 Inclusion of abstracts and other gray literature potentially reduces the effect of publication bias. However, abstract results often differ significantly from the final publication, and abstracts have not generally undergone the rigorous peer review process required for most journal articles. This increases the likelihood that bias will influence the results of the systematic review.
Handsearching
The introduction and discussion section of relevant studies may provide additional references on a subject that may have been missed by the search strategy. It is recommended that authors manually search the reference lists of found studies as a final check that no studies have been missed. One can also manually search journals in which studies on the subject of the review are likely to be published.
Contacting investigators
Writing to investigators active in the area may provide results of studies yet to be presented or published, but care must be taken with this information because it has not undergone any review process. Furthermore, most investigators will be hesitant to provide unpublished information because its inclusion in a systematic review may hamper subsequent publication. Perhaps the greatest utility of inquiring with investigators is gaining knowledge of studies about to be published, and when delaying the systematic review will allow inclusion of these articles and thus make the review more timely. Investigators may also be contacted if clarification of published information is required.
Apart from the searchable databases discussed earlier, there are other useful resources online. These include registers of clinical trials (eg, www.clinicaltrials.gov ), data clearinghouses (eg, http://www.guideline.gov/ ), agencies charged with improving the quality of health care ( http://www.ahrq.gov/ ), information on specific researchers from academic Web sites, university theses, and product information from drug companies. Searching the Internet with the use of search engines such as Google provides a user-friendly method to obtain information, but it is not recommended for systematic reviews because the accuracy of the information on the Internet is not ensured. Internet searches are notoriously nonspecific, and much time may be spent without much gain; this highlights the advantages of seeking input from a professional librarian before initiating a search.
Several decisions need to be made with regard to the search strategy. Investigators must decide what databases are to be searched, the level of detail in the search strategy, and whether the search should be done in duplicate. The search will produce a large number of possible studies, many of which can be excluded on the basis of their title and abstract. However, more detailed review of individual studies is required for those studies passing the initial screen.
When appraising a review, the reader should assess the completeness of the literature search. All relevant databases should be searched, and the search terms should be scrutinized for alternate terms or alternate spellings. For example, searching ‘hemolytic anemia’ in PubMed yields 58 560 citations, whereas ‘haemolytic anemia’ yields 3848 citations. The reader should also assess if the search strategy could have excluded relevant studies by being too specific. For example, when determining the effect of intravenous immune globulin (IVIg) in immune thrombocytopenia, a search strategy might be ‘IVIg AND immune thrombocytopenia,’ but this may miss studies that looked at patients with immune thrombocytopenia who were treated with steroids and given IVIg if they did not respond. If these studies were important to the question being asked, a more general search strategy might be considered, such as ‘immune thrombocytopenia treatment.’
It is best to establish, a priori, inclusion and exclusion criteria for accepting studies. These criteria should be explicit, and the most rigorous reviews should record the specific reasons for including or excluding all studies identified in the literature search. Specific recording for each study not only reduces the risk of bias but also allows rapid reassessment should the rationale for exclusion of one or more studies be called into question. Selecting studies in duplicate can help ensure that the correct studies are included and relevant studies are not missed. Agreement statistics can be calculated on the selection process, most commonly using the κ statistic. 12
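As a rough illustration of the agreement statistic, the following sketch computes Cohen's κ for two reviewers' include/exclude decisions made in duplicate; the decision lists are invented for the example, and in practice a standard statistics package would normally be used.

```python
# Hedged sketch: chance-corrected agreement (Cohen's kappa) between two
# reviewers who screened the same studies in duplicate. The decision lists
# below are invented illustrations.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters over the same items."""
    assert len(rater_a) == len(rater_b) and rater_a, "need paired ratings"
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    return (observed - expected) / (1 - expected)

reviewer_1 = ["include", "exclude", "include", "exclude", "exclude", "include"]
reviewer_2 = ["include", "exclude", "exclude", "exclude", "exclude", "include"]
print(f"kappa = {cohens_kappa(reviewer_1, reviewer_2):.2f}")
```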
Deciding which studies to include or exclude in the review is very important. Inclusion or exclusion is usually based on the following criteria.
Study design
The quality of the systematic review or meta-analysis is based in part on the quality of the included studies. RCTs have a lower potential for bias compared with observational studies. If there are available RCTs in the area of the review, the included studies may be limited to RCTs. Generally, RCTs (studies in which participants are randomly assigned to an intervention) are intrinsically of better quality than nonrandomized studies (in which participants are given an intervention then compared with another group that is similar but did not receive the intervention), which in turn are better than case series or case reports. The randomization process should equally distribute measurable and unmeasurable confounding factors between the 2 groups. As a result, any differences observed should be attributable to the intervention rather than to underlying differences between the patients who received the experimental intervention and those who received the control intervention. Given the reduced likelihood for bias, many systematic reviews only include RCTs; nonrandomized data are only included if randomized data are not available. It is important to note that even if the systematic review is based on randomized data, this does not ensure that the review itself is of high quality or that definitive conclusions can be made.
Language
Limiting studies by language will reduce the number of studies that need to be reviewed, especially if there is difficulty in translating a study. This may be acceptable for many reviews, but in some areas there may be many important studies published in other languages. Consequently, excluding studies on the basis of language must be done with care. For example, Chagas disease is endemic in Latin America, and a systematic review of transfusion-transmitted Chagas disease limited to English-only publications will exclude potentially important studies.
Date of publication
Limiting studies by date can be done if data do not exist before a specific date. For example, imatinib mesylate (Gleevec) for treatment of chronic myeloid leukemia was developed in the 1990s, with phase 1 clinical trial data emerging by the end of the decade. Hence, a systematic review that involves imatinib would not require a literature search earlier than 1990.
Duplicate data
Some studies publish interim data or use the same patient cohorts in multiple publications. Excluding duplicate studies will eliminate overrepresentation of that particular data in the systematic review.
We would encourage readers to access the Cochrane Library's free-to-access database of systematic reviews of hematologic malignancies 13 for detailed examples of study inclusion and exclusion criteria. A summary of inclusion and exclusion criteria for granulopoiesis-stimulating factors to prevent adverse effects in the treatment of malignant lymphoma 14 can be found in Table 3 .
An example of inclusion and exclusion criteria
An example of inclusion and exclusion criteria for studies to be included in a systematic review of colony-stimulating factor use for the prevention of adverse effects in the treatment of lymphoma. 13
CLL indicates chronic lymphocytic leukemia; G-CSF, granulocyte colony-stimulating factor; GM-CSF, granulocyte-macrophage colony-stimulating factor; and DoB, date of birth.
The quality of the studies included in the systematic review determines the certainty with which conclusions can be drawn, based on the summation of the evidence. Consequently, once all the relevant studies have been identified, the studies should undergo a quality assessment. This is particularly important if there is contradictory evidence. As with study selection, quality assessment performed in duplicate can help to minimize subjectivity in the assessment.
Various tools are designed for performing study quality assessment. The Jadad score is frequently used for quality assessment of RCTs, 15 and the Newcastle-Ottawa score is used for nonrandomized studies. 16 A modified example of the Jadad score is seen in Table 4 . The important features in a quality assessment of RCTs include the following:
The participants are not highly selected and are similar to those found in normal clinical practice (content validity).
Neither the participants nor the researchers are able to tell how patients will be allocated before random assignment (allocation concealment). 17
Participants are followed up for an appropriate length of time, depending on the outcome assessed.
Follow-up should be complete, with as few participants as possible being lost to follow-up. The reasons why patients dropped out or were lost should be provided to assess if these losses were due, in whole or in part, to side effects from the treatment.
As many people as is feasible who are involved in the study are masked to the treatment received; ideally, participants, care providers, data collectors, and outcome adjudicators should be masked. The statistician can also be masked to the specific intervention.
The results of the study should be analyzed as intention to treat (all patients who underwent allocation are analyzed regardless of how long they stayed in the study; this provides the best “real world” estimate of the effect) and per protocol (only patients who remained within the protocol for a predetermined period are analyzed; this gives the best safety data).
The modified Jadad scoring system for randomized controlled trials 15
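Because Table 4 is not reproduced here, the sketch below scores the standard five-point Jadad items (randomization and the appropriateness of its method, double blinding and the appropriateness of its method, and description of withdrawals) rather than the modified version referenced in the text; the field names and the example trial are our own illustrative assumptions.

```python
# Hedged sketch of the standard (unmodified) five-point Jadad scale, not the
# modified table referenced in the article. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class TrialReport:
    described_as_randomized: bool
    randomization_method_appropriate: bool    # e.g., computer-generated sequence
    randomization_method_inappropriate: bool  # e.g., alternation, date of birth
    described_as_double_blind: bool
    blinding_method_appropriate: bool         # e.g., identical placebo
    blinding_method_inappropriate: bool
    withdrawals_described: bool

def jadad_score(t: TrialReport) -> int:
    """Standard Jadad score, ranging from 0 (lowest) to 5 (highest quality)."""
    score = 0
    if t.described_as_randomized:
        score += 1
        if t.randomization_method_appropriate:
            score += 1
        elif t.randomization_method_inappropriate:
            score -= 1
    if t.described_as_double_blind:
        score += 1
        if t.blinding_method_appropriate:
            score += 1
        elif t.blinding_method_inappropriate:
            score -= 1
    if t.withdrawals_described:
        score += 1
    return max(score, 0)

example = TrialReport(True, True, False, True, False, False, True)
print(jadad_score(example))  # prints 4 for this invented example
```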
When appraising a review, the reader needs to assess if an appropriate quality assessment tool has been used, whether the quality assessment was done in duplicate, and, if so, whether there was agreement between the investigators. Again, for useful examples in the setting of hematologic disease we encourage review of the Cochrane Library. For example, a systematic review on the use of immunoglobulin replacement in hematologic malignancies and hematopoietic stem cell transplantation 18 assesses the quality of the studies on the basis of allocation concealment, allocation generation (randomization procedure), and masking. A sensitivity analysis comparing studies that fulfilled these criteria with those that did not showed no difference in the result.
The data from the studies can then be extracted, usually onto prepared data case report forms. The data to be extracted should be carefully considered before the start of the review to avoid having to re-extract data that were missed on the initial data collection. Data extraction should ideally be done in duplicate to allow identification of transcription errors and to minimize any subjectivity that may occur when interpreting data that are presented in a different format than that required on the case report form.
If suitable, data from several studies can be statistically combined to give an overall result. Because this overall result reflects data from a larger number of participants than in the individual studies, the results are less likely to be affected by a type 2 error (failing to detect a “real” difference that exists between the 2 groups). One of the main criticisms of meta-analysis is that studies that are quite different can have their results combined inappropriately, and the result is not an accurate reflection of the “true” value. Therefore, before embarking on a meta-analysis, one must consider whether the difference between the studies (heterogeneity) precludes pooling of the data. Only studies with similar interventions, patients, and measures of outcomes should be combined. For example, 3 studies that all evaluate the efficacy of a new iron-chelating agent in patients with transfusion-related iron overload may be inappropriately combined if one study compared the iron chelator with placebo and measured hepatic iron content by liver biopsy at 1 year, the second study compared the new iron chelator with a different iron chelator and performed liver biopsy at 6 months, and the third study was an observational study in which patients receiving the new iron chelator underwent cardiac magnetic resonance imaging at the start of treatment and again at 1 year. Although all 3 studies are evaluating the efficacy of a novel iron chelator, the studies differ importantly. Because they differ in design, comparator, timing of outcome measurement, and method of outcome ascertainment, combining these data would be inappropriate.
The most commonly used meta-analysis software is RevMan, 19 available from the Cochrane Collaboration. Categorical data (eg, number of remissions) or continuous data (eg, time to relapse) are entered into the program, combined, and then visually presented in a Forest plot (explained in Figure 1). Various statistical methods are used for combining different types of data. The data from each individual study are weighted such that studies that have less variance (spread of data) or a larger sample size contribute more heavily to the overall estimate of effect. The common mathematical methods used to combine data include the Mantel-Haenszel method 20 and the Inverse Variance method. 21 The Mantel-Haenszel method is used for categorical data and results in a risk ratio or relative risk. The risk ratio expresses the chance that an event will occur if the patient received the intervention compared with if they received the control. For example, if a meta-analysis of studies that measure infections after giving prophylactic antibiotics to neutropenic patients showed a relative risk of 1.5 comparing no antibiotics with antibiotics, this means that patients who did not receive antibiotics were 1.5 times as likely to develop an infection as those who did.
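To make the arithmetic behind a pooled risk ratio concrete, here is a minimal sketch of the Mantel-Haenszel calculation on three invented 2x2 tables; a real meta-analysis would normally be run in RevMan or an equivalent package, which also supplies confidence intervals and heterogeneity statistics.

```python
# Hedged sketch of the Mantel-Haenszel pooled risk ratio for categorical
# (event/no-event) data. The three studies below use invented numbers and
# exist only to show the arithmetic.
# Each study: (events_treated, n_treated, events_control, n_control)
studies = [
    (12, 100, 20, 100),
    (8,  80,  15, 85),
    (30, 250, 45, 240),
]

def mantel_haenszel_rr(studies):
    # Numerator and denominator of the classic MH risk ratio.
    numerator = sum(a * n_c / (n_t + n_c) for a, n_t, c, n_c in studies)
    denominator = sum(c * n_t / (n_t + n_c) for a, n_t, c, n_c in studies)
    return numerator / denominator

rr = mantel_haenszel_rr(studies)
print(f"Pooled risk ratio (treated vs control) = {rr:.2f}")
```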

Figure 1. An example of a Forest plot. The names of the individual studies are on the left, the individual studies' results are seen in the green boxes, and the overall combined result is seen in the orange box. The purple box shows the weighting given to each study, which is based on the number of participants (larger studies are given more weight). The blue box displays the statistics for the meta-analysis, including whether the overall result is statistically significant (test for overall effect) and 2 measures of heterogeneity (χ2 and I2 tests). On the far right is the graphical representation of the results, known as the Forest plot. The studies are displayed horizontally, whereas the horizontal axis represents the magnitude of the difference between the intervention and control group. Each study is represented by a blue box and a black horizontal line. The blue box represents the result of the study, with a larger box indicating a greater weight of that study in the overall result. The black horizontal line represents the 95% confidence intervals for that study. If both the box and the horizontal line lie to the left of the vertical line, then that study shows that the intervention is statistically significantly better than the control, whereas, if the box and horizontal line all lie to the right of the vertical line, then the control is statistically significantly better. If the box or horizontal line cross the vertical line, then the individual study is not statistically significant. The overall result is represented by a diamond, with the size of the diamond being determined by the 95% confidence intervals for the overall combined result. If the diamond does not touch the vertical line, then the overall result is statistically significant; to the left, the intervention is better than the control group, and to the right, the control group is better. If the diamond touches the line, then there is no statistical difference between the 2 groups.
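A forest plot of this kind can also be sketched directly with matplotlib, as below; the study names, risk ratios, and confidence intervals are invented, and the pooled diamond is supplied rather than computed, so this is purely an illustration of the layout described in the legend above.

```python
# Hedged sketch of a forest plot drawn with matplotlib, using invented study
# results (risk ratios with 95% confidence intervals) on a log scale.
# Dedicated software (e.g., RevMan) computes the weighted pooled result; here
# the pooled row is simply supplied for illustration.
import numpy as np
import matplotlib.pyplot as plt

names = ["Study A", "Study B", "Study C", "Pooled"]
rr = np.array([0.80, 0.65, 0.95, 0.78])   # point estimates
lo = np.array([0.55, 0.40, 0.70, 0.65])   # lower 95% CI limits
hi = np.array([1.15, 1.05, 1.30, 0.94])   # upper 95% CI limits

y = np.arange(len(names))[::-1]  # top-to-bottom ordering
fig, ax = plt.subplots(figsize=(6, 3))

# Individual studies: point estimate as a square, CI as a horizontal line.
ax.errorbar(rr[:-1], y[:-1],
            xerr=[rr[:-1] - lo[:-1], hi[:-1] - rr[:-1]],
            fmt="s", color="black", capsize=3)

# Pooled estimate: a diamond whose width spans its confidence interval.
ax.plot([lo[-1], rr[-1], hi[-1], rr[-1], lo[-1]],
        [y[-1], y[-1] + 0.25, y[-1], y[-1] - 0.25, y[-1]], color="black")

ax.axvline(1.0, linestyle="--", color="grey")  # line of no effect for a ratio
ax.set_xscale("log")
ax.set_yticks(y)
ax.set_yticklabels(names)
ax.set_xlabel("Risk ratio (log scale)")
plt.tight_layout()
plt.show()
```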
The Inverse Variance method is used for continuous data and results in a mean difference. The mean difference is the average difference that will be achieved by giving the patient the intervention rather than the control. If the same studies also measured length of stay and patients who had taken antibiotics had a mean difference of −1.3 days compared with patients not taking antibiotics, this would suggest that on average patients taking antibiotics would stay for 1.3 days less than patients who did not take antibiotics. The mean difference is used if the outcome that is measured is the same in all the studies, whereas the standardized mean difference is used if the outcomes are measured slightly differently. For example, if studies of postthrombotic syndrome (PTS) severity used the Villalta PTS scale, 22 then the mean difference can be used for the meta-analysis. However, if some of the studies used the Villalta PTS scale and others used the Ginsberg clinical scale, 23 a standardized mean difference should be used.
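The fixed-effect inverse-variance calculation itself is short; the sketch below pools three invented mean differences, weighting each by the inverse of its squared standard error. For a standardized mean difference, each study's effect would first be divided by an appropriate pooled standard deviation (eg, Hedges' g), a step omitted here.

```python
# Hedged sketch of a fixed-effect, inverse-variance pooled mean difference.
# Each invented study contributes a mean difference (md) and its standard
# error (se); weights are 1/se^2, so more precise studies count more.
import math

studies = [  # (mean_difference, standard_error) -- illustrative numbers only
    (-1.1, 0.50),
    (-1.6, 0.40),
    (-0.9, 0.70),
]

weights = [1.0 / se**2 for _, se in studies]
pooled_md = sum(w * md for (md, _), w in zip(studies, weights)) / sum(weights)
pooled_se = math.sqrt(1.0 / sum(weights))
ci_low, ci_high = pooled_md - 1.96 * pooled_se, pooled_md + 1.96 * pooled_se
print(f"Pooled mean difference = {pooled_md:.2f} "
      f"(95% CI {ci_low:.2f} to {ci_high:.2f})")
```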
As discussed earlier, one of the main problems with meta-analysis is that the studies being combined are different, resulting in heterogeneity. 24 There will always be some heterogeneity between studies because of chance, but when performing meta-analysis this needs to be investigated to determine whether the data can be combined reliably. RevMan calculates 2 measures of heterogeneity, the χ2 test and the I2 statistic. 25 The χ2 test determines whether there is greater spread of results between the studies than is due to chance (hence, heterogeneity is present); a P value less than 0.10 usually suggests this. The I2 statistic tries to quantify any heterogeneity that may be present; a result greater than 40% usually suggests its presence, and the higher the percentage, the greater the heterogeneity. If heterogeneity is present, it should be investigated by removing studies or individual patients from the analysis and seeing if that removes the heterogeneity. Differences between the included patients in the individual studies may explain the heterogeneity (clinical heterogeneity). For example, the efficacy of a new treatment for multiple myeloma will depend on whether the patients' condition is newly diagnosed, previously treated, or after transplantation. Differences in drug dosing, route, and frequency of administration will also contribute to heterogeneity. Other contributors to heterogeneity may include the design of the study and how the study was funded (eg, commercial vs noncommercial sponsorship). If, after detailed investigation, there is no obvious cause for the heterogeneity, the data should be analyzed with a more conservative statistical method that will account for the heterogeneity. In reviews with significant heterogeneity, a more conservative overall result will be obtained if the analysis uses a random-effects model, compared with a fixed-effect model. A random-effects analysis makes the assumption that individual studies are estimating different treatment effects. These different effects are assumed to have a distribution with some central value and some variability. The random-effects meta-analysis attempts to account for this distribution of effects and provides a more conservative estimate of the effect. In contrast, a fixed-effect analysis assumes that a single common effect underlies every study included in the meta-analysis; thus, it assumes there is no statistical heterogeneity among the studies.
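The heterogeneity statistics and the fixed- versus random-effects contrast can be illustrated in a few lines. The sketch below computes Cochran's Q, I2, and a DerSimonian-Laird between-study variance on the same invented effect estimates used above; it is meant only to show where the numbers reported by packages such as RevMan come from, not to replace them.

```python
# Hedged sketch of common heterogeneity statistics and a DerSimonian-Laird
# random-effects pooled estimate, using invented (effect, standard error) pairs.

studies = [(-1.1, 0.50), (-1.6, 0.40), (-0.9, 0.70)]  # (effect, standard error)
w = [1.0 / se**2 for _, se in studies]
fixed = sum(wi * y for (y, _), wi in zip(studies, w)) / sum(w)

# Cochran's Q and the I^2 statistic.
q = sum(wi * (y - fixed) ** 2 for (y, _), wi in zip(studies, w))
df = len(studies) - 1
i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

# DerSimonian-Laird between-study variance (tau^2) and random-effects pooling.
c = sum(w) - sum(wi**2 for wi in w) / sum(w)
tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
w_re = [1.0 / (se**2 + tau2) for _, se in studies]
random_effects = sum(wi * y for (y, _), wi in zip(studies, w_re)) / sum(w_re)

print(f"Q = {q:.2f} on {df} df, I^2 = {i_squared:.0f}%, tau^2 = {tau2:.3f}")
print(f"Fixed-effect estimate = {fixed:.2f}, random-effects = {random_effects:.2f}")
```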
Other techniques used to account for heterogeneity include subgroup analyses and meta-regression. Subgroup analyses are meta-analyses on individual clinical subgroups that determine the specific effect for those patients. Common subgroups may be based on age, sex, race, drug dosage, or other factors. Ideally, subgroup analyses should be limited in number and should be specified a priori. Meta-regression is used to formally test whether there is evidence of different effects in different subgroups of studies. 26 This technique is not available in RevMan.
Sensitivity analyses can be performed to determine whether the results of the meta-analysis are robust. This involves the removal of studies that meet certain criteria (eg, poor quality, commercial sponsorship, conference abstract) to determine their effect on the overall result. For example, large studies will generally have a lower variance and will thus be more heavily weighted than small studies. However, because this weighting does not account for study quality, a poorly designed large study might be overrepresented in the analysis compared with a small well-performed study. A sensitivity analysis could be performed in which the large poorly designed study is removed and the meta-analysis is repeated with the remaining studies to assess if the overall effect estimate remains the same. For example, if a drug appears to have a positive effect on relapse-free survival, but this effect disappears when commercial studies are removed in the sensitivity analysis, the reader should be aware that the results and conclusions of such a study may be biased.
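A simple leave-one-out loop is one way to script such a sensitivity analysis. The sketch below re-pools invented study results with each study removed in turn, so the reader can see how strongly a single large (here, hypothetically commercial) study drives the overall estimate.

```python
# Hedged sketch of a leave-one-out sensitivity analysis: re-pool the
# (invented) studies with each one removed in turn and compare with the
# full fixed-effect, inverse-variance estimate.
studies = {  # label -> (effect estimate, standard error); illustrative only
    "Large commercial trial": (-1.6, 0.20),
    "Small academic trial A": (-0.4, 0.60),
    "Small academic trial B": (-0.3, 0.55),
}

def pooled(pairs):
    weights = [1.0 / se**2 for _, se in pairs]
    return sum(w * y for (y, _), w in zip(pairs, weights)) / sum(weights)

full = pooled(list(studies.values()))
print(f"All studies: {full:.2f}")
for left_out in studies:
    remaining = [v for k, v in studies.items() if k != left_out]
    print(f"Without {left_out!r}: {pooled(remaining):.2f}")
```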
Publication bias can be assessed with funnel plots (Figure 2). Studies that are negative are less likely to be published, and their absence from the review is a potential source of bias. 11 In a funnel plot, the vertical axis measures the precision of the estimate of the treatment effect (eg, standard error of the log relative risk, sample size) and the horizontal axis measures the treatment effect (eg, relative risk). If there is no publication bias, all the studies should fall uniformly within the inverted V. If a section of the inverted V is devoid of studies, this indicates a publication bias (most often the failure of small negative studies to be published and thus included in the analysis). Analyses that fail to account for missing negative and smaller studies will tend to overestimate the treatment effect.
Figure 2. Funnel plot to assess for publication bias. In this plot the result of the individual study is plotted on the horizontal axis (in this case the risk ratio [RR]) against a measure of the precision of the data (either the spread of the data or the size of the study) on the vertical axis (this graph uses spread of data measured by the standard error of the log of the relative risk [SE(logRR)]); the smaller the spread of data or the greater the study size, the further up the vertical axis. Individual studies are represented by the small squares. From the overall result of the meta-analysis the central estimate is plotted (the vertical dashed line), and the 95% confidence intervals are drawn (the diagonal dashed lines) to form the funnel or inverted V. The assumption is that the larger the study (or the study with a smaller spread of data), the nearer to the true result it will be, meaning the spread about the overall result will be reduced as the study size increases, hence the funnel shape. If there is publication bias, then the studies will not be equally distributed within the inverted V. The usual sign of publication bias is the absence of studies in the green box that represents where small negative studies lie.
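A basic funnel plot of this kind can be drawn with matplotlib, as in the sketch below; the log risk ratios and standard errors are invented, and the dashed pseudo 95% limits around the fixed-effect estimate approximate the inverted V described in the legend.

```python
# Hedged sketch of a funnel plot: each invented study's log risk ratio is
# plotted against its standard error (inverted axis), with pseudo 95% limits
# drawn around the pooled estimate to form the expected funnel.
import numpy as np
import matplotlib.pyplot as plt

log_rr = np.array([-0.35, -0.10, -0.45, -0.25, -0.05, -0.30])  # illustrative
se     = np.array([ 0.10,  0.30,  0.25,  0.15,  0.35,  0.20])

pooled = np.average(log_rr, weights=1 / se**2)  # fixed-effect centre line
se_grid = np.linspace(0.01, se.max() * 1.1, 100)

fig, ax = plt.subplots(figsize=(5, 4))
ax.scatter(log_rr, se, color="black", marker="s")
ax.axvline(pooled, linestyle="--", color="grey")
ax.plot(pooled - 1.96 * se_grid, se_grid, linestyle="--", color="grey")
ax.plot(pooled + 1.96 * se_grid, se_grid, linestyle="--", color="grey")
ax.invert_yaxis()  # more precise (smaller SE) studies sit toward the top
ax.set_xlabel("log risk ratio")
ax.set_ylabel("SE(log RR)")
plt.tight_layout()
plt.show()
```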
When all the suitable studies have been collected, quality assessed, data extracted, and, if possible, meta-analysis performed, then conclusions need to be made. The authors must refer back to the original question and ask if there is enough evidence to conclusively answer the question and, if there is, how strong the supporting evidence is. In evaluating a systematic review, the reader must decide if the authors have made an objective conclusion on the basis of the available evidence and not on personal opinion. The discussion should make reference to any sources of heterogeneity and whether there are subgroups in which the evidence is stronger than in others. The results of sensitivity analyses, if performed, may be discussed particularly if the results suggest the presence of bias in the overall results. At the end of the systematic review, it is possible that there is insufficient evidence to draw clear conclusions. In many situations, authors will conclude that further research is needed to provide stronger recommendations or give specific recommendations for clinically important subgroups.
We hope that you have found this article useful for future appraisal of systematic reviews and meta-analyses. The techniques described should not be used just for the production of “formal for publication” reviews but can be equally well applied to day-to-day analysis of clinical problems found in the consulting room. Systematic reviews, compared with primary research, require relatively few resources, allowing clinicians not normally involved in research to produce high-quality, clinically relevant papers.
Contribution: All authors contributed equally to the design, preparation, and editing of the document. M.A.C. was responsible for the final review and approval for submission.
Conflict-of-interest disclosure: The authors declare no competing financial interests.
Correspondence: Mark A. Crowther, Rm L301, St. Joseph's Hospital, 50 Charlton Ave East, Hamilton, ON, Canada, L8N 4A6; e-mail: [email protected] .

Adaptations of data mining methodologies: a systematic literature review
The use of end-to-end data mining methodologies such as CRISP-DM, KDD process, and SEMMA has grown substantially over the past decade. However, little is known as to how these methodologies are used in practice. In particular, the question of whether data mining methodologies are used ‘as-is’ or adapted for specific purposes has not been thoroughly investigated. This article addresses this gap via a systematic literature review focused on the context in which data mining methodologies are used and the adaptations they undergo. The literature review covers 207 peer-reviewed and ‘grey’ publications. We find that data mining methodologies are primarily applied ‘as-is’. At the same time, we also identify various adaptations of data mining methodologies and note that their number is growing rapidly. The dominant adaptation pattern is methodology adjustment at a granular level (modifications), followed by extensions of existing methodologies with additional elements. Further, we identify two recurrent purposes for adaptation: (1) adaptations to handle Big Data technologies, tools and environments (technological adaptations); and (2) adaptations for context-awareness and for integrating data mining solutions into business processes and IT systems (organizational adaptations). The study suggests that standard data mining methodologies do not pay sufficient attention to deployment issues, which play a prominent role when turning data mining models into software products that are integrated into the IT architectures and business processes of organizations. We conclude that refinements of existing methodologies aimed at combining data, technological, and organizational aspects could help to mitigate these gaps.
Introduction
The availability of Big Data has stimulated widespread adoption of data mining and data analytics in research and in business settings (Columbus, 2017). Over the years, a number of data mining methodologies have been proposed, and these are used extensively in practice and in research. However, little is known about which methodologies are applied and how, and this question has not been widely researched or discussed. Further, there is no consolidated view on what constitutes quality of methodological process in data mining and data analytics, how data mining and data analytics are applied and used in organizational settings, and how application practices relate to each other. This motivates the need for a comprehensive survey of the field.
There have been surveys, quasi-surveys, and summaries conducted in related fields. Notably, there have been two systematic literature reviews; a Systematic Literature Review (hereinafter, SLR) is the most suitable and widely used research method for identifying, evaluating, and interpreting research on a particular research question, topic, or phenomenon (Kitchenham, Budgen & Brereton, 2015). These reviews concerned Big Data Analytics, but not general-purpose data mining methodologies. Adrian et al. (2004) executed an SLR on the implementation of Big Data Analytics (BDA), specifically the capability components necessary for BDA value discovery and realization. The authors identified BDA implementation studies, determined their main focus areas, and discussed in detail BDA applications and capability components. Saltz & Shamshurin (2016) published an SLR paper on Big Data team process methodologies. The authors identified a lack of standards regarding how Big Data projects are executed, and highlighted the growing research in this area and the potential benefits of such a process standard. Additionally, they synthesized a list of the 33 most important success factors for executing Big Data activities. Finally, there are studies that surveyed data mining techniques and applications across domains, yet they focus on data mining process artifacts and outcomes (Madni, Anwar & Shah, 2017; Liao, Chu & Hsiao, 2012), not on end-to-end process methodology.
A number of surveys have been conducted in domain-specific settings such as the hospitality, accounting, education, manufacturing, and banking fields. Mariani et al. (2018) conducted an SLR of Business Intelligence (BI) and Big Data in the hospitality and tourism context. Amani & Fadlalla (2017) explored the application of data mining methods in accounting, while Romero & Ventura (2013) investigated educational data mining. Similarly, Hassani, Huang & Silva (2018) addressed data mining application case studies in banking and explored them along three dimensions: topics, applied techniques, and software. All of these studies were performed by means of systematic literature reviews. Lastly, Bi & Cochran (2014) undertook a standard literature review of Big Data Analytics and its applications in manufacturing.
Apart from domain-specific studies, there have been very few general-purpose surveys that provide a comprehensive overview of existing data mining methodologies and classify and contextualize them. A valuable synthesis was presented by Kurgan & Musilek (2006) as a comparative study of the state of the art of data mining methodologies. The study was not an SLR and focused on a comprehensive comparison of the phases, processes, and activities of data mining methodologies; the application aspect was summarized briefly as application statistics by industry and citations. Three more comparative, non-SLR studies were undertaken by Marban, Mariscal & Segovia (2009), Mariscal, Marbán & Fernández (2010), and, most recently and closest to ours, Martnez-Plumed et al. (2017). They followed the same pattern of systematizing existing data mining frameworks based on comparative analysis. There, the purpose and context of the consolidation was even more practical: to support the derivation and proposal of a new artifact, that is, a novel data mining methodology. The majority of these general surveys of the field are more than a decade old and have natural limitations, being (1) non-SLR studies and (2) restricted so far to comparing methodologies in terms of phases, activities, and other elements.
The key common characteristic of all the given studies is that data mining methodologies are treated as normative and standardized (‘one-size-fits-all’) processes. A complementary perspective, not considered in the above studies, is that data mining methodologies are not normative standardized processes but frameworks that need to be specialized to different industry domains, organizational contexts, and business objectives. In the last few years, a number of extensions and adaptations of data mining methodologies have emerged, which suggests that existing methodologies are not sufficient to cover the needs of all application domains. In particular, extensions of data mining methodologies have been proposed in the medical domain (Niaksu, 2015), the educational domain (Tavares, Vieira & Pedro, 2017), the industrial engineering domain (Huber et al., 2019; Solarte, 2002), and software engineering (Marbán et al., 2007, 2009). However, little attention has been given to studying how data mining methodologies are applied and used in industry settings; so far, only non-scientific practitioner surveys provide such evidence.
Given this research gap, the central objective of this article is to investigate how data mining methodologies are applied by researchers and practitioners, both in their generic (standardized) form and in specialized settings. This is achieved by investigating if data mining methodologies are applied ‘as-is’ or adapted, and for what purposes such adaptations are implemented.
Guided by the Systematic Literature Review method, we initially identified a corpus of primary studies covering both peer-reviewed and ‘grey’ literature from 1997 to 2018. An analysis of these studies led us to a taxonomy of uses of data mining methodologies, focusing on the distinction between ‘as-is’ usage and various types of methodology adaptations. By analyzing the different types of methodology adaptations, this article identifies potential gaps in standard data mining methodologies at both the technological and the organizational levels.
The rest of the article is organized as follows. The Background section provides an overview of key concepts of data mining and associated methodologies. Next, Research Design describes the research methodology. The Findings and Discussion section presents the study results and their associated interpretation. Finally, threats to validity are addressed in Threats to Validity while the Conclusion summarizes the findings and outlines directions for future work.
Background
This section introduces the main data mining concepts and provides an overview of existing data mining methodologies and their evolution.
Data mining is defined as a set of rules, processes, and algorithms that are designed to generate actionable insights, extract patterns, and identify relationships from large datasets (Morabito, 2016). Data mining incorporates automated data extraction, processing, and modeling by means of a range of methods and techniques. In contrast, data analytics refers to techniques used to analyze and acquire intelligence from data (including ‘big data’) (Gandomi & Haider, 2015) and is positioned as a broader field, encompassing a wider spectrum of methods that includes both statistical techniques and data mining (Chen, Chiang & Storey, 2012). A number of algorithms have been developed in the statistics, machine learning, and artificial intelligence domains to support and enable data mining. While statistical approaches precede them, they inherently come with limitations, the best known being rigid data distribution assumptions. Machine learning techniques have gained popularity as they impose fewer restrictions while deriving understandable patterns from data (Bose & Mahapatra, 2001).
Data mining projects commonly follow a structured process or methodology, as exemplified by Mariscal, Marbán & Fernández (2010) and Marban, Mariscal & Segovia (2009). A data mining methodology specifies tasks, inputs, and outputs, and provides guidelines and instructions on how the tasks are to be executed (Mariscal, Marbán & Fernández, 2010). Thus, a data mining methodology provides a set of guidelines for executing a set of tasks to achieve the objectives of a data mining project (Mariscal, Marbán & Fernández, 2010).
The foundations of structured data mining methodologies were first proposed by Fayyad, Piatetsky-Shapiro & Smyth (1996a, 1996b, 1996c) and were initially related to Knowledge Discovery in Databases (KDD). KDD presents a conceptual process model of computational theories and tools that support the extraction of information (knowledge) from data (Fayyad, Piatetsky-Shapiro & Smyth, 1996a). In KDD, the overall approach to knowledge discovery includes data mining as a specific step. As such, KDD, with its nine main steps (exhibited in Fig. 1), has the advantage of considering data storage and access, algorithm scaling, interpretation and visualization of results, and human-computer interaction (Fayyad, Piatetsky-Shapiro & Smyth, 1996a, 1996c). The introduction of KDD also formalized a clearer distinction between data mining and data analytics, as formulated for example in Tsai et al. (2015): “…by the data analytics, we mean the whole KDD process, while by the data analysis, we mean the part of data analytics that is aimed at finding the hidden information in the data, such as data mining”.

Figure 1: An overview of the steps composing the KDD process, as presented in Fayyad, Piatetsky-Shapiro & Smyth (1996a , 1996c) .
Step 1: Learning the application domain: In the first step, an understanding of the application domain and relevant prior knowledge is developed, followed by identifying the goal of the KDD process from the customer’s viewpoint.
Step 2: Dataset creation: The second step involves selecting a dataset, focusing on a subset of variables or data samples on which discovery is to be performed.
Step 3: Data cleaning and preprocessing: In the third step, basic operations to remove noise or outliers are performed. Collecting the information necessary to model or account for noise, deciding on strategies for handling missing data fields, and accounting for data types, schema, and the mapping of missing and unknown values are also considered.
Step 4: Data reduction and projection: Here, useful features to represent the data are identified, depending on the goal of the task, and transformation methods are applied to find an optimal feature set for the data.
Step 5: Choosing the function of data mining: In the fifth step, the target outcome (e.g., summarization, classification, regression, clustering) is defined.
Step 6: Choosing the data mining algorithm: The sixth step concerns selecting method(s) to search for patterns in the data, deciding which models and parameters are appropriate, and matching a particular data mining method with the overall criteria of the KDD process.
Step 7: Data mining: In the seventh step, the data are mined, that is, patterns of interest are searched for in a particular representational form or a set of such representations (classification rules or trees, regression, clustering).
Step 8: Interpretation: In this step, redundant and irrelevant patterns are filtered out, and relevant patterns are interpreted and visualized in such a way as to make the results understandable to the users.
Step 9: Using discovered knowledge: In the last step, the results are incorporated into the performance system, documented and reported to stakeholders, and used as a basis for decisions.
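To make the flow of the nine steps concrete, the following minimal Python sketch walks a toy classification task through Steps 2–8 using scikit-learn; Steps 1 and 9 are organizational activities and appear only as comments. The dataset, algorithm, and parameter choices are illustrative assumptions on our part and are not prescribed by KDD itself.

```python
# Illustrative mapping of KDD Steps 1-9 onto a toy scikit-learn workflow (assumed example).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Step 1: learning the application domain -> assumed goal: flag malignant samples.
X, y = load_breast_cancer(return_X_y=True)                         # Step 2: dataset creation
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)                             # Step 3: cleaning and preprocessing
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

selector = SelectKBest(f_classif, k=10).fit(X_train_s, y_train)    # Step 4: reduction and projection
X_train_f, X_test_f = selector.transform(X_train_s), selector.transform(X_test_s)

# Step 5: chosen data mining function = classification.
model = DecisionTreeClassifier(max_depth=3, random_state=0)        # Step 6: chosen algorithm
model.fit(X_train_f, y_train)                                      # Step 7: data mining

print(classification_report(y_test, model.predict(X_test_f)))      # Step 8: interpretation
# Step 9: document the results, report to stakeholders, embed the model in the performance system.
```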
The KDD process became dominant in industrial and academic domains (Kurgan & Musilek, 2006; Marban, Mariscal & Segovia, 2009). Also, as the timeline-based evolution of data mining methodologies and process models shows (Fig. 2 below), the original KDD data mining model served as the basis for other methodologies and process models, which addressed various gaps and deficiencies of the original KDD process. These approaches extended the initial KDD framework, yet the degree of extension varied, ranging from process restructuring to a complete change in focus. For example, Brachman & Anand (1996) and later Gertosio & Dussauchoy (2004) (in the form of a case study) introduced practical adjustments to the process based on its iterative and interactive nature. The complete KDD process in their view was enhanced with supplementary tasks, and the focus was shifted to the user’s point of view (a human-centered approach), highlighting decisions that need to be made by the user in the course of the data mining process. In contrast, Cabena et al. (1997) proposed a different number of steps, emphasizing and detailing data processing and discovery tasks. Similarly, in a series of works, Anand & Büchner (1998), Anand et al. (1998), and Buchner et al. (1999) presented additional data mining process steps by concentrating on the adaptation of the data mining process to practical settings. They focused on cross-sales (the entire life-cycle of the online customer), with the further incorporation of an internet data discovery process (web-based mining). Further, the Two Crows data mining process model is a consultancy-originated framework that defines the steps differently but is still close to the original KDD. Finally, SEMMA (Sample, Explore, Modify, Model and Assess), based on KDD, was developed by the SAS Institute in 2005 (SAS Institute Inc., 2017). It is defined as a logical organization of the functional toolset of SAS Enterprise Miner for carrying out the core tasks of data mining. Compared to KDD, this is a vendor-specific process model, which limits its application in different environments. It also skips two steps of the original KDD process (‘Learning the Application Domain’ and ‘Using Discovered Knowledge’) which are regarded as essential for the success of a data mining project (Mariscal, Marbán & Fernández, 2010). In terms of adoption, the new KDD-based proposals received limited attention in academia and industry (Kurgan & Musilek, 2006; Marban, Mariscal & Segovia, 2009). Subsequently, most of these methodologies converged into the CRISP-DM methodology.

Figure 2: Evolution of data mining process and methodologies, as presented in Martnez-Plumed et al. (2017) .
Additionally, only two non-KDD-based approaches have been proposed alongside the extensions to KDD. The first is the 5A’s approach, presented by De Pisón Ascacbar (2003) and used by the SPSS vendor. The key contribution of this approach was the addition of an ‘Automate’ step, while its disadvantage was the omission of a ‘Data Understanding’ step. The second approach was Six Sigma, an industry-originated method to improve quality and customer satisfaction (Pyzdek & Keller, 2003). It has been successfully applied to data mining projects in conjunction with the DMAIC performance improvement model (Define, Measure, Analyze, Improve, Control).
The CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology, whose phases and key outputs are shown in Fig. 3, consists of six phases.
Phase 1: Business understanding: The focus of the first phase is to gain an understanding of the project objectives and requirements from a business perspective, followed by converting these into data mining problem definitions. The presentation of a preliminary plan to achieve the objectives is also included in this phase.
Phase 2: Data understanding: This phase begins with an initial data collection and proceeds with activities to become familiar with the data, identify data quality issues, discover first insights into the data, and potentially detect and form hypotheses.
Phase 3: Data preparation: The third phase covers the activities required to construct the final dataset from the initial raw data. Data preparation tasks are performed repeatedly.
Phase 4: Modeling: In this phase, various modeling techniques are selected and applied, followed by calibration of their parameters. Typically, several techniques are used for the same data mining problem.
Phase 5: Evaluation of the model(s): The fifth phase begins from the quality perspective and then, before proceeding to final model deployment, ascertains that the model(s) achieve the business objectives. At the end of this phase, a decision should be reached on how to use the data mining results.
Phase 6: Deployment: In the final phase, the models are deployed to enable end-customers to use the data as a basis for decisions or as support in the business process. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized, presented, and distributed in a way that the end-user can use it. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process.

Figure 3: CRISP-DM phases and key outputs (adapted from Chapman et al. (2000) ).
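For readers who prefer a compact summary, the snippet below captures the six phases and example key outputs as a simple Python mapping. The output names are paraphrased from the phase descriptions above rather than quoted from Chapman et al. (2000), so they should be read as illustrative.

```python
# Illustrative mapping of CRISP-DM phases to example key outputs (names paraphrased, not canonical).
CRISP_DM_PHASES = {
    "Business understanding": ["data mining problem definition", "preliminary project plan"],
    "Data understanding": ["initial data collection report", "data quality findings", "first insights and hypotheses"],
    "Data preparation": ["final modeling dataset"],
    "Modeling": ["candidate models with calibrated parameters"],
    "Evaluation": ["assessment against business objectives", "decision on use of results"],
    "Deployment": ["report or repeatable scoring process integrated into the business"],
}

for phase, outputs in CRISP_DM_PHASES.items():
    print(f"{phase}: {'; '.join(outputs)}")
```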
The development of CRISP-DM was led by an industry consortium. It is designed to be domain-agnostic (Mariscal, Marbán & Fernández, 2010) and, as such, is now widely used by industry and research communities (Marban, Mariscal & Segovia, 2009). These distinctive characteristics have led CRISP-DM to be considered the ‘de facto’ standard data mining methodology and a reference framework against which other methodologies are benchmarked (Mariscal, Marbán & Fernández, 2010).
Similarly to KDD, a number of refinements and extensions of the CRISP-DM methodology have been proposed, with two main directions: extensions of the process model itself, and adaptations or mergers with process models and methodologies from other domains. The extensions direction can be exemplified by Cios & Kurgan (2005), who proposed the integrated Data Mining & Knowledge Discovery (DMKD) process model. It contains several explicit feedback mechanisms, modifies the last step to incorporate the application of discovered knowledge and insights, and relies on technologies for results deployment. In the same vein, Moyle & Jorge (2001) and Blockeel & Moyle (2002) proposed the Rapid Collaborative Data Mining System (RAMSYS) framework, which is both a data mining methodology and a system for remote collaborative data mining projects. RAMSYS attempted to combine a problem-solving methodology, knowledge sharing, and ease of communication. It was intended to allow the collaborative work of remotely located data miners in a disciplined manner with regard to information flow, while allowing the free flow of ideas for problem solving (Moyle & Jorge, 2001). CRISP-DM modifications and integrations with other specific domains were proposed in Industrial Engineering (Data Mining for Industrial Engineering by Solarte (2002)) and in Software Engineering by Marbán et al. (2007, 2009). Both approaches enhanced CRISP-DM and contributed additional phases, activities, and tasks typical of engineering processes, addressing ongoing support (Solarte, 2002) as well as project management, organizational, and quality assurance tasks (Marbán et al., 2009).
Finally, a limited number of attempts to create independent or semi-dependent data mining frameworks were undertaken after the creation of CRISP-DM. These efforts were driven by industry players and comprise the KDD Roadmap by Debuse et al. (2001) for a proprietary predictive toolkit (Lanner Group), and a recent effort by IBM with the Analytics Solutions Unified Method for Data Mining (ASUM-DM) in 2015 (IBM Corporation, 2016: https://developer.ibm.com/technologies/artificial-intelligence/articles/architectural-thinking-in-the-wild-west-of-data-science/). Both frameworks contributed additional tasks, for example, resourcing in the KDD Roadmap, or the hybrid approach assumed in ASUM, that is, a combination of agile and traditional implementation principles.
Table 1 summarizes the reviewed data mining process models and methodologies by their origin, basis, and key concepts.
Research Design
The main research objective of this article is to study how data mining methodologies are applied by researchers and practitioners. To this end, we use a systematic literature review (SLR) as the scientific method, for two reasons. Firstly, a systematic review is based on a trustworthy, rigorous, and auditable methodology. Secondly, an SLR supports structured synthesis of existing evidence, identification of research gaps, and provides a framework for positioning new research activities (Kitchenham, Budgen & Brereton, 2015). For our SLR, we followed the guidelines proposed by Kitchenham, Budgen & Brereton (2015). All SLR details have been documented in a separate, peer-reviewed SLR protocol (available at https://figshare.com/articles/Systematic-Literature-Review-Protocol/10315961).

Research questions
As suggested by Kitchenham, Budgen & Brereton (2015), we formulated the research questions and motivate them as follows. In the preliminary phase of the research, we discovered a very limited number of studies investigating data mining methodology application practices as such. Further, we discovered a number of surveys conducted in domain-specific settings and very few general-purpose surveys, but none of them considered application practices either. As a contrasting trend, the recent emergence of a limited number of adaptation studies clearly pinpointed the research gap in the area of application practices. Given this research gap, an in-depth investigation of this phenomenon led us to ask: “How are data mining methodologies applied (‘as-is’ vs adapted)?” (RQ1). Further, as we intended to investigate in depth the universe of adaptation scenarios, this naturally led us to RQ2: “How have existing data mining methodologies been adapted?” Finally, if adaptations are made, we wish to explore what the associated reasons and purposes are, which in turn led us to RQ3: “For what purposes are data mining methodologies adapted?”
Research Question 1: How are data mining methodologies applied (‘as-is’ versus adapted)? This question aims to identify data mining methodology application and usage patterns and trends.
Research Question 2: How have existing data mining methodologies been adapted? This question aims to identify and classify data mining methodology adaptation patterns and scenarios.
Research Question 3: For what purposes have existing data mining methodologies been adapted? This question aims to identify, explain, and classify the reasons for, and the benefits achieved by, adaptations of existing data mining methodologies. Specifically, what gaps do these adaptations seek to fill and what have been the benefits of these adaptations? Such systematic evidence and insights will be valuable input to a potentially new, refined data mining methodology, and will be of interest to practitioners and researchers.
Data collection strategy
Our data collection and search strategy followed the guidelines proposed by Kitchenham, Budgen & Brereton (2015). It defined the scope of the search, the selection of literature and electronic databases, the search terms and strings, as well as the screening procedures.
Primary search
The primary search aimed to identify an initial set of papers. To this end, the search strings were derived from the research objective and research questions. The term ‘data mining’ was the key term, but we also included ‘data analytics’ to be consistent with observed research practices. The terms ‘methodology’ and ‘framework’ were also included. Thus, the following search strings were developed and validated in accordance with the guidelines suggested by Kitchenham, Budgen & Brereton (2015) :
(‘data mining methodology’) OR (‘data mining framework’) OR (‘data analytics methodology’) OR (‘data analytics framework’)
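As a purely illustrative aside (not part of the published SLR protocol), the reported search string can be assembled programmatically from its two term groups, which makes it easy to regenerate or extend when replicating the search:

```python
# Assemble the boolean search string from its core terms and qualifiers (illustrative helper).
CORE_TERMS = ["data mining", "data analytics"]
QUALIFIERS = ["methodology", "framework"]

search_string = " OR ".join(
    f"('{term} {qualifier}')" for term in CORE_TERMS for qualifier in QUALIFIERS
)
print(search_string)
# -> ('data mining methodology') OR ('data mining framework') OR ('data analytics methodology') OR ('data analytics framework')
```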
The search strings were applied to the indexed scientific databases Scopus and Web of Science (for peer-reviewed, academic literature) and to the non-indexed Google Scholar (for non-peer-reviewed, so-called ‘grey’ literature). The decision to cover ‘grey’ literature in this research was motivated as follows. As proposed in a number of information systems and software engineering publications (Garousi, Felderer & Mäntylä, 2019; Neto et al., 2019), an SLR as a stand-alone method may not provide sufficient insight into the ‘state of practice’. It has also been identified (Garousi, Felderer & Mäntylä, 2016) that ‘grey’ literature can give substantial benefits in certain areas of software engineering, in particular when the topic of research is related to industrial and practical settings. Taking into consideration the research objective, which is to investigate data mining methodology application practices, we opted to include elements of a Multivocal Literature Review (MLR) in our study. Also, Kitchenham, Budgen & Brereton (2015) recommend including ‘grey’ literature to minimize publication bias, as positive results and research outcomes are more likely to be published than negative ones. Following MLR practices, we also designed inclusion criteria for the types of ‘grey’ literature, reported below.
The selection of databases is motivated as follows. In the case of peer-reviewed literature sources, we aimed to avoid the potential omission bias discussed in IS research (Levy & Ellis, 2006), which can arise when searches are concentrated in a limited set of disciplinary data sources. Thus, a broad selection of data sources, including multidisciplinary (Scopus, Web of Science, Wiley Online Library) and domain-oriented (ACM Digital Library, IEEE Xplore Digital Library) scientific electronic databases, was evaluated. Multidisciplinary databases were selected due to their wider domain coverage, and it was validated and confirmed that they include publications originating from the domain-oriented databases, such as ACM and IEEE. Among the multidisciplinary databases, Scopus was selected due to its widest possible coverage (it is the world’s largest abstract and citation database, covering approximately 80% of all international peer-reviewed journals), while Web of Science was selected due to its longer temporal range; thus, the two databases complement each other. The selected non-indexed source for ‘grey’ literature is Google Scholar, as it is a comprehensive source of both academic and ‘grey’ literature publications and is extensively referred to as such (Garousi, Felderer & Mäntylä, 2019; Neto et al., 2019).
Further, Garousi, Felderer & Mäntylä (2019) presented a three-tier categorization framework for types of ‘grey’ literature. In our study we restricted ourselves to 1st-tier ‘grey’ literature publications from a limited set of ‘grey’ literature producers. In particular, from the list of producers (Neto et al., 2019) we focused on government departments and agencies, non-profit economic and trade organizations (‘think-tanks’) and professional associations, academic and research institutions, and businesses and corporations (consultancy companies and established private companies). The selected 1st-tier ‘grey’ literature items include: (1) government, academic, and private sector consultancy reports, (2) theses (not lower than Master level) and PhD dissertations, (3) research reports, (4) working papers, and (5) conference proceedings and preprints. With these 1st-tier ‘grey’ literature criteria we mitigate the quality assessment challenge that is especially relevant for, and reported in, ‘grey’ literature (Garousi, Felderer & Mäntylä, 2019; Neto et al., 2019).
Scope and domains inclusion
- Context of technology and infrastructure for data mining/data analytics tasks and projects.
- Granular methods applied within the data mining process itself or for data mining tasks, for example, constructing business queries or applying regression or neural network modeling techniques to solve classification problems. Studies with granular methods are included in the primary texts corpus as long as the method application is part of an overall methodological approach.
- Technological aspects of data mining, for example, data engineering, dataflows, and workflows.
- Traditional statistical methods not directly associated with data mining, including statistical control methods.
Similarly to Budgen et al. (2006) and Levy & Ellis (2006), initial piloting revealed that the search engines retrieved literature from all major scientific domains, including ones outside the authors’ area of expertise (e.g., medicine). Even though such studies could be retrieved, it would be impossible for us to analyze and correctly interpret literature published outside our area of expertise. The search strategy was therefore adjusted by retaining domains closely associated with Information Systems and Software Engineering research. Thus, for the Scopus database the final set of included domains was limited to nine and comprised Computer Science, Engineering, Mathematics, Business, Management and Accounting, Decision Science, Economics, Econometrics and Finance, and Multidisciplinary, as well as Undefined studies. The excluded domains covered 11.5%, or 106 out of 925 publications; the validation process confirmed that these primarily focused on specific case studies in the fundamental sciences and medicine. The domains included from the Scopus database were mapped to Web of Science to ensure a consistent approach across databases, and the correctness of the mapping was validated.
Screening criteria and procedures
Based on SLR practices (as in Kitchenham, Budgen & Brereton (2015) and Brereton et al. (2007)) and the defined SLR scope, we designed multi-step screening procedures (quality and relevancy) with an associated set of Screening Criteria and a Scoring System. The purpose of relevancy screening is to find relevant primary studies in an unbiased way (Vanwersch et al., 2011). Quality screening, on the other hand, aims to assess the relevant primary studies in terms of quality, also in an unbiased way.
The Screening Criteria consisted of two subsets: Exclusion Criteria, applied for initial filtering, and Relevance Criteria, also known as Inclusion Criteria.
Quality 1: The publication item is not in English (understandability).
Quality 2: The publication is a duplicate, that is:
- the same document is retrieved from two or all three databases;
- or different versions of the same publication are retrieved (i.e., the same study published in different sources); based on best practices, the decision rule is that the most recent paper is retained, as well as the one with the highest score (Kofod-Petersen, 2014);
- or a publication appears both as a conference proceeding and as a journal article with the same name and the same authors, or as an extended version of a conference paper, in which case the latter is selected.
Quality 3: The length of the publication is less than 6 pages; short papers do not have the space to present and discuss ideas in sufficient depth for us to examine.
Quality 4: The paper is not accessible in full length online through the university subscription to the databases or via Google Scholar; lack of full-text availability prevents us from assessing and analyzing the text.
The initially retrieved list of papers was filtered based on the Exclusion Criteria. Only papers that passed all criteria were retained in the final studies corpus. The mapping of criteria to screening steps is exhibited in Fig. 4.

Figure 4: Relevance and quality screening steps with criteria.
The Relevance Criteria were designed to identify relevant publications and are presented in Table 2 below, while the mapping to the respective process steps is presented in Fig. 4. These criteria were applied iteratively.
As a final SLR step, a full-text quality assessment was performed with the constructed Scoring Metrics (in line with Kitchenham & Charters (2007)), presented in Table 3 below.
Data extraction and screening process
The conducted data extraction and screening process is presented in Fig. 4. In Step 1, initial publication lists were retrieved from the pre-defined databases: Scopus, Web of Science, and Google Scholar. The lists were merged and duplicates eliminated in Step 2. Afterwards, texts of fewer than 6 pages were excluded (Step 3). Steps 1–3 were guided by the Exclusion Criteria. In the next stage (Step 4), publications were screened by title based on the pre-defined Relevance Criteria. The ones which passed were evaluated for availability (Step 5). If a study was available, it was evaluated again by the same pre-defined Relevance Criteria applied to the Abstract, Conclusion, and, if necessary, Introduction (Step 6). The ones which passed this threshold formed the primary publications corpus, which was extracted from the databases in full. These primary texts were evaluated again based on the full text (Step 7), applying the Relevance Criteria first and then the Scoring Metrics.
Results and quantitative analysis
In Step 1, 1,715 publications were extracted from the relevant databases, with the following composition: Scopus (819), Web of Science (489), Google Scholar (407). In terms of scientific publication domains, Computer Science (42.4%), Engineering (20.6%), and Mathematics (11.1%) accounted for approximately 74% of the Scopus-originated texts; a similar distribution applies to the Web of Science results. Application of the Exclusion Criteria produced the following results: in Step 2, after eliminating duplicates, 1,186 texts were passed on for minimum length evaluation, and 767 reached assessment by the Relevancy Criteria.
As mentioned, the Relevance Criteria were applied iteratively (Steps 4–6) and in conjunction with the availability assessment. As a result, only 298 texts were retained for full evaluation, with 241 originating from the scientific databases and 57 being ‘grey’. These studies formed the primary texts corpus, which was extracted, read in full, and evaluated by the Relevance Criteria combined with the Scoring Metrics. The decision rule was set as follows: studies that scored “1” or “0” were rejected, while texts with a “3” or “2” evaluation were admitted to the final primary studies corpus. As the outcome of this SLR-based, broad, cross-domain publications collection and screening, we identified 207 relevant publications from the peer-reviewed (156 texts) and ‘grey’ literature (51 texts). Figure 5 below exhibits the yearly numbers of published research, broken down by peer-reviewed and ‘grey’ literature, starting from 1997.

Figure 5: SLR derived relevant texts corpus—data mining methodologies peer-reviewed research and ‘grey’ for period 1997–2018 (no. of publications).
In terms of composition, the peer-reviewed studies corpus is well balanced, with 72 journal articles and 82 conference papers, while book chapters account for only 4 instances. In contrast, in the ‘grey’ literature subset, articles in moderated and non-peer-reviewed journals are dominant (n = 34) compared to the overall number of conference papers (n = 13), followed by a small number of technical reports and preprints (n = 4).
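To make the screening funnel and the retention rule concrete, the following short Python sketch restates the decision rule and the stage counts reported above; it is a summary of the reported figures, not part of the original screening tooling.

```python
# Retention rule from the full-text scoring stage: scores of 2 or 3 are admitted, 0 or 1 rejected.
def retain(score: int) -> bool:
    """Return True if a fully read text stays in the final studies corpus."""
    return score >= 2

# Stage counts as reported in this subsection.
funnel = {
    "retrieved from databases (Step 1)": 1715,
    "after de-duplication (Step 2)": 1186,
    "after minimum-length filter (Step 3)": 767,
    "retained for full-text evaluation (Steps 4-6)": 298,
    "final corpus after scoring (Step 7)": 207,
}
for stage, count in funnel.items():
    print(f"{stage}: {count}")
```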
Temporal analysis of the texts corpus (as per Fig. 5) resulted in two observations. Firstly, we note that stable and significant research interest (in terms of numbers) in data mining methodology application started around a decade ago, in 2007. Research efforts made prior to 2007 were relatively limited, with the number of publications below 10 per year. Secondly, we note that research on data mining methodologies has grown substantially since 2007, an observation supported by the constructed 3-year and 10-year mean trendlines. In particular, the number of publications has roughly tripled over the past decade, hitting an all-time high of 24 texts released in 2017.
Further, there are also two distinct spike sub-periods, in the years 2007–2009 and 2014–2017, followed by a stable pattern with an overall higher number of publications released annually. This observation is in line with the trend of increased penetration of data mining methodologies, tools, cross-industry applications, and academic research.
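The 3-year trendline mentioned above is a simple moving average of the annual publication counts. The sketch below shows how such a trendline can be computed with pandas; the yearly counts are placeholder values (only the 2017 value of 24 texts is reported in the text), since the full series behind Fig. 5 is not reproduced here.

```python
import pandas as pd

# Placeholder annual publication counts; only the 2017 value (24) is reported in the text above.
counts = pd.Series([8, 10, 13, 15, 19, 22, 24],
                   index=range(2011, 2018), name="publications")

trend_3y = counts.rolling(window=3).mean()  # 3-year moving-average trendline
print(trend_3y)
```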
Findings and Discussion
In this section, we address the research questions of the paper. Initially, as part of RQ1, we present an overview of data mining methodology ‘as-is’ and adaptation trends. In addressing RQ2, we further classify the adaptations identified. Then, as part of the RQ3 subsection, each category identified under RQ2 is analyzed with particular focus on the goals of the adaptations.
RQ1: How are data mining methodologies applied (‘as-is’ vs. adapted)?
The first research question examines the extent to which data mining methodologies are used ‘as-is’ versus adapted. Our review, based on 207 publications, identified two distinct paradigms of how data mining methodologies are applied. The first is ‘as-is’, where data mining methodologies are applied as stipulated. The second is with ‘adaptations’; that is, methodologies are modified by introducing various changes to the standard process model when applied.
We aggregated the research by decades to differentiate application patterns between two time periods: 1997–2007, with limited data mining application, and 2008–2018, with more intensive application. This cut was guided not only by the extracted publications corpus but also by earlier surveys. In particular, during the pre-2007 period ten new methodologies were proposed, whereas since then only two new methodologies have been proposed. Thus, there is a distinct trend over the last decade of a large number of extensions and adaptations being proposed rather than entirely new methodologies.
We note that during the first decade of our time scope (1997–2007), the share of data mining methodologies applied ‘as-is’ was 40% (as presented in Fig. 6A). However, the same share for the following decade is 32% (Fig. 6B). Thus, in terms of relative shares, we note a clear decrease in using data mining methodologies ‘as-is’ in favor of adapting them to cater to specific needs. The trend is even more pronounced when comparing absolute numbers: adaptations more than tripled (from 30 to 106), while the ‘as-is’ scenario increased only modestly (from 20 to 51). Given this finding, we continue with analyzing how data mining methodologies have been adapted under RQ2.
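The shares quoted above follow directly from the reported counts; the small snippet below reproduces the arithmetic.

```python
# 'As-is' vs adapted usage counts per decade, as reported in this subsection.
decades = {
    "1997-2007": {"as_is": 20, "adapted": 30},
    "2008-2018": {"as_is": 51, "adapted": 106},
}
for period, counts in decades.items():
    share = counts["as_is"] / (counts["as_is"] + counts["adapted"])
    print(f"{period}: 'as-is' share = {share:.0%}")
# 1997-2007: 'as-is' share = 40%
# 2008-2018: 'as-is' share = 32%
```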

Figure 6: Applications of data mining methodologies: (A) breakdown by ‘as-is’ vs. adaptations for the 1997–2007 period; (B) breakdown by ‘as-is’ vs. adaptations for the 2008–2018 period.
RQ2: How have existing data mining methodologies been adapted?
To classify the identified adaptations, we applied the following two-level decision procedure. Level 1 Decision: Has the methodology been combined with another methodology? If yes, the resulting methodology was classified in the ‘integration’ category. Otherwise, we posed the next question.
Level 2 Decision: Are any new elements (phases, tasks, deliverables) added to the methodology? If yes, we designate the resulting methodology as an ‘extension’ of the original one; otherwise, we classify it as a modification of the original one (a sketch of this decision procedure is given below).
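A minimal sketch of this two-level decision procedure, expressed as a Python function; the boolean flag names are our own, introduced only for illustration.

```python
# Two-level decision procedure for classifying how a methodology was adapted in a reviewed study.
def classify_adaptation(combined_with_other_methodology: bool,
                        new_elements_added: bool) -> str:
    """Return the adaptation category for a reviewed study."""
    if combined_with_other_methodology:   # Level 1 decision
        return "integration"
    if new_elements_added:                # Level 2 decision: new phases, tasks, or deliverables
        return "extension"
    return "modification"

print(classify_adaptation(False, True))   # -> extension
```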
Scenario ‘Modification’: introduces specialized sub-tasks and deliverables in order to address specific use cases or business problems. Modifications typically concentrate on granular adjustments to the methodology at the level of sub-phases, tasks, or deliverables within the stages of the existing reference frameworks (e.g., CRISP-DM or KDD). For example, Chernov et al. (2014), in a study in the mobile network domain, proposed an automated decision-making enhancement in the deployment phase. In addition, the evaluation phase was modified by using both conventional and own-developed performance metrics. Further, in a study performed within the financial services domain, Yang et al. (2016) present feature transformation and feature selection as sub-phases, thereby enhancing the data mining modeling stage.
Scenario ‘Extension’: primarily proposes significant extensions to reference data mining methodologies. Such extensions result in integrated data mining solutions, data mining frameworks serving as a component or tool of automated IS systems, or transformations of methodologies to fit specialized environments. The main purposes of extensions are to integrate fully scaled data mining solutions into IS/IT systems and business processes and to provide broader context through useful architectures, algorithms, and the like. Adaptations where extensions have been made elicit and explicitly present various artifacts in the form of system and model architectures, process views, workflows, and implementation aspects. A number of soft goals are also achieved, providing a holistic perspective on the data mining process and contextualizing it with organizational needs. There are also extensions in this scenario where data mining process methodologies are substantially changed and extended in all key phases to enable execution of the data mining life-cycle with new (Big) Data technologies and tools and in new prototyping and deployment environments (e.g., Hadoop platforms or real-time customer interfaces). For example, Kisilevich, Keim & Rokach (2013) extended traditional CRISP-DM data mining outcomes with a fully fledged Decision Support System (DSS) for the hotel brokerage business. The authors introduced spatial/non-spatial data management (extending data preparation) and analytical and spatial modeling capabilities (extending the modeling phase), and provided spatial display and reporting capabilities (enhancing the deployment phase). In the same work, domain knowledge was introduced in all phases of the data mining process, and usability and ease of use were also addressed.
Scenario ‘Integration’: combines a reference methodology, for example CRISP-DM, with: (1) data mining methodologies originating from other domains (e.g., software engineering development methodologies), (2) organizational frameworks (Balanced Scorecard, Analytics Canvas, etc.), or (3) adjustments to accommodate Big Data technologies and tools. Adaptations in the form of ‘Integration’ also typically introduce various types of ontologies and ontology-based tools, domain knowledge, and software engineering and BI-driven framework elements. Fundamental data mining process adjustments to new types of data and IS architectures (e.g., real-time data, multi-layer IS) are also presented. The key gaps addressed by such adjustments are the prescriptive nature and low degree of formalization of CRISP-DM, the obsolete nature of CRISP-DM with respect to tools, and the lack of integration of CRISP-DM with other organizational frameworks. For example, Brisson & Collard (2008) developed the KEOPS data mining methodology (CRISP-DM based) centered on domain knowledge integration. An ontology-driven information system was proposed with integration and enhancements in all steps of the data mining process. Further, the integrated expert knowledge used in all data mining phases was shown to produce value in the data mining process.
To examine how the application scenario of each data mining methodology usage has developed over time, we mapped the peer-reviewed texts and ‘grey’ literature to the respective adaptation scenarios, aggregated by decades (as presented in Fig. 7 for peer-reviewed and Fig. 8 for ‘grey’ literature).

Figure 7: Data Mining methodologies application research—primary ‘peer-reviewed’ texts classification by types of scenarios aggregated by decades (with numbers and relative proportions).

Figure 8: Data Mining methodologies application research—primary ‘grey’ texts classification by types of scenarios aggregated by decades (with numbers and relative proportions).
For peer-reviewed research, this temporal analysis resulted in three observations. Firstly, research effort in each adaptation scenario has been growing, and the number of publications more than quadrupled (128 vs. 28). Secondly, as noted above, the relative proportion of ‘as-is’ studies is diluted (from 39% to 33%) and is primarily replaced by the ‘Extension’ paradigm (from 25% to 30%); in relative terms, the gains of the ‘Modification’ and ‘Integration’ paradigms are modest. This finding is reinforced by a further observation: the most notable gap, in terms of a modest number of publications, remains in the ‘Integration’ category where, excluding the 2008–2009 spike, research efforts are limited and the number of texts is just 13. This is in stark contrast with the prolific, though recent, research in the ‘Extension’ category. We can hypothesize that existing reference methodologies do not accommodate and support the increasing complexity of data mining projects and IS/IT infrastructure, as well as the specifics of certain domains, and as such need to be adapted.
In the ‘grey’ literature, in contrast to peer-reviewed research, growth in the number of publications is less pronounced: 29 vs. 22 publications, or 32%, comparing across the two decades (as per Fig. 8). The growth is solely driven by the application of ‘Integration’ scenarios (13 vs. 4 publications), while both the ‘as-is’ and the other adaptation scenarios are stagnating or in decline.
RQ3: For what purposes have existing data mining methodologies been adapted?
We address the third research question by analyzing what gaps the data mining methodology adaptations seek to fill and the benefits of such adaptations. We identified three adaptation scenarios, namely ‘Modification’, ‘Extension’, and ‘Integration’. Here, we analyze each of them.
Modification
Modifications of data mining methodologies are present in 30 peer-reviewed and 4 ‘grey’ literature studies. The analysis shows that modifications overwhelmingly consist of specific case studies. However, the major differentiating point compared to ‘as-is’ case studies is the clear presence of specific adjustments to standard data mining process methodologies. Yet the proposed modifications and their purposes do not go beyond the traditional data mining methodology phases. They are granular, specialized, and executed at the level of tasks, sub-tasks, and deliverables. With modifications, authors describe potential business applications and deployment scenarios at a conceptual level, but typically do not report or present real implementations in IS/IT systems and business processes.
Further, this research subcategory can best be classified based on the domains where the case studies were performed and the data mining methodology modification scenarios executed. We have identified four distinct domain-driven application areas, presented in Fig. 9.

Figure 9: ‘Modification’ paradigm application studies for period 1997–2018—mapping to domains.
IT and IS domain
The largest number of publications (14, or approximately 40%) concerns IT, IS security, software development, and specific data mining and processing topics. Authors address the intrusion detection problem in Hossain, Bridges & Vaughn (2003), Fan, Ye & Chen (2016), and Lee, Stolfo & Mok (1999); specialized algorithms for processing a variety of data types in Yang & Shi (2010), Chen et al. (2001), Yi, Teng & Xu (2016), and Pouyanfar & Chen (2016); and effective and efficient computer and mobile network management in Guan & Fu (2010), Ertek, Chi & Zhang (2017), Zaki & Sobh (2005), Chernov, Petrov & Ristaniemi (2015), and Chernov et al. (2014).
Manufacturing and engineering
The next most popular research area is manufacturing/engineering, with 10 case studies. The central topic here is high-technology manufacturing, for example the semiconductor-related study of Chien, Diaz & Lan (2014), and various complex prognostics case studies in the rail and aerospace domains (Létourneau et al., 2005; Zaluski et al., 2011) concentrated on failure prediction. These are complemented by studies on equipment fault and failure prediction and maintenance (Kumar, Shankar & Thakur, 2018; Kang et al., 2017; Wang, 2017) as well as a monitoring system (García et al., 2017).
Sales and services, incl. financial industry
The third category comprises seven business application papers concerning customer service, targeting, and advertising (Karimi-Majd & Mahootchi, 2015; Reutterer et al., 2017; Wang, 2017), financial services credit risk assessment (Smith, Willis & Brooks, 2000), supply chain management (Nohuddin et al., 2018), property management (Yu, Fung & Haghighat, 2013), and similar topics.
As a consequence of this specialization, these studies concentrate on developing state-of-the-art solutions to the respective domain-specific problems.
Extension
Purpose 1: To implement a fully scaled, integrated data mining solution and a regular, repeatable knowledge discovery process: addressing model and algorithm deployment and implementation design (including architecture, workflows, and the corresponding IS integration). A complementary goal is to tackle changes to the business process so as to incorporate data mining into the organization’s activities.
Purpose 2: To implement complex, specifically designed systems and integrated business applications with a data mining model or solution as a component or tool. Typically, this adaptation is also oriented towards Big Data specifics and is complemented by proposed artifacts such as Big Data architectures, system models, workflows, and data flows.
Purpose 3: To implement data mining as part of integrated or combined specialized infrastructure, data environments, and data types (e.g., IoT, cloud, mobile networks).
Purpose 4: To incorporate context-awareness aspects.
The specific list of studies mapped to each of these purposes is presented in the Appendix (Table A1). The main purposes of the adaptations, the associated gaps and/or benefits, and related observations and artifacts are documented in Fig. 10 below.

Figure 10: ‘Extension’ scenario adaptations goals, benefits, artifacts and number of publications for period 1997–2018.
In the ‘Extension’ category, studies executed with Purpose 1 propose fully scaled, integrated data mining solutions comprising specific data mining models and the associated frameworks and processes. The distinctive trait of this research subclass is that it ensures repeatability and reproducibility of the delivered data mining solution in different organizational and industry settings. Both the results of the data mining use case and the deployment and integration into IS/IT systems and associated business process(es) are presented explicitly. Thus, this ‘Extension’ subclass is geared towards specific solution design, tackling a concrete business or industrial problem or addressing specific research gaps, thereby resembling a comprehensive case study.
This direction can be well exemplified by the expert finder system for research social network services proposed by Sun et al. (2015), the data mining solution for functional test content optimization by Wang (2015), and the time-series mining framework for estimating unobservable time series by Hu et al. (2010). Similarly, Du et al. (2017) tackle online log anomaly detection, automated association rule mining is addressed by Çinicioğlu et al. (2011), software effort estimation by Deng, Purvis & Purvis (2011), and visual discovery of network patterns by Simoff & Galloway (2008). A number of studies address solutions in the IS security (Shin & Jeong, 2005), manufacturing (Güder et al., 2014; Chee, Baharudin & Karkonasasi, 2016), materials engineering (Doreswamy, 2008), and business domains (Xu & Qiu, 2008; Ding & Daniel, 2007).
In contrast, ‘Extension’ studies executed for Purpose 2 concentrate on the design of complex, multi-component information systems and architectures. These are holistic, complex systems and integrated business applications with a data mining framework serving as a component or tool. Moreover, the data mining methodology in these studies is extended with systems integration phases.
For example, Mobasher (2007) presents a data mining application in a Web personalization system and the associated process; here, the data mining cycle is extended in all phases with the ultimate goal of leveraging multiple data sources and using the discovered models and corresponding algorithms in an automatic personalization system. The author comprehensively addresses data processing, algorithm and design adjustments, and the respective integration into the automated system. Similarly, Haruechaiyasak, Shyu & Chen (2004) tackle the improvement of a Webpage recommender system by presenting an extended data mining methodology including the design and implementation of the data mining model. A holistic view of web mining, with support for the integration of all data sources, data warehousing, and data mining techniques, as well as multiple problem-oriented analytical outcomes with rich business application scenarios (personalization, adaptation, profiling, and recommendations) in the e-commerce domain, was proposed and discussed by Büchner & Mulvenna (1998). Further, Singh et al. (2014) tackled a scalable implementation of a Network Threat Intrusion Detection System. In this study, the data mining methodology and resulting model are extended, scaled, and deployed as a module of a quasi-real-time system for capturing Peer-to-Peer Botnet attacks. A similar complex solution was presented in a series of publications by Lee et al. (2000, 2001), who designed a real-time data-mining-based Intrusion Detection System (IDS). These works are complemented by the comprehensive study of Barbará et al. (2001), who constructed an experimental testbed for intrusion detection with data mining methods. A detection model combining data fusion and mining, with respective components for Botnet identification, was also developed by Kiayias et al. (2009). A similar approach is presented by Alazab et al. (2011), who proposed and implemented a zero-day malware detection system with an associated machine-learning-based framework. Finally, Ahmed, Rafique & Abulaish (2011) presented a multi-layer framework for fuzzy attacks in 3G cellular IP networks.
A number of authors have considered data mining methodologies in the context of Decision Support Systems and other systems that generate information for decision-making, across a variety of domains. For example, Kisilevich, Keim & Rokach (2013) executed a significant extension of the data mining methodology by designing and presenting an integrated Decision Support System (DSS) with six components, acting as a supporting tool for the hotel brokerage business to increase deal profitability. A similar approach is undertaken by Capozzoli et al. (2017), focusing on improving the energy management of properties by providing occupancy pattern information and a reconfiguration framework. Kabir (2016) presented a data mining information service providing improved sales forecasting that supported the solution of the under/over-stocking problem, while Lau, Zhang & Xu (2018) addressed sales forecasting with sentiment analysis on Big Data. Kamrani, Rong & Gonzalez (2001) proposed a GA-based Intelligent Diagnosis system for fault diagnostics in the manufacturing domain. The latter topic was tackled further by Shahbaz et al. (2010) with a complex, integrated data mining system for diagnosing and solving manufacturing problems in real time.
Lenz, Wuest & Westkämper (2018) propose a framework for capturing data analytics objectives and creating holistic, cross-departmental data mining systems in the manufacturing domain. This work is representative of a cohort of studies that aim at extending data mining methodologies in order to support the design and implementation of enterprise-wide data mining systems. In this same research cohort, we classify Luna, Castro & Romero (2017) , which presents a data mining toolset integrated into the Moodle learning management system, with the aim of supporting university-wide learning analytics.
One study addresses a multi-agent-based data mining concept. Khan, Mohamudally & Babajee (2013) developed a unified theoretical framework for data mining by formulating a unified data mining theory. The framework is tested by means of agent programming, with a proposed integration into a multi-agent system, which is useful due to its scalability, robustness, and simplicity.
The subcategory of ‘Extension’ research executed with Purpose 3 is devoted to data mining methodologies and solutions in the specialized IT/IS, data, and process environments which have emerged recently as a consequence of the development of Big Data technologies and tools. Exemplary studies include IoT-associated environment research, for example the Smart City application in IoT presented by Strohbach et al. (2015). In the same domain, Bashir & Gill (2016) addressed IoT-enabled smart buildings, with the additional challenge of large amounts of high-speed real-time data and the requirement of real-time analytics; the authors proposed an integrated IoT Big Data Analytics framework. This research is complemented by the interdisciplinary study of Zhong et al. (2017), where IoT and wireless technologies are used to create an RFID-enabled environment producing analyses of KPIs to improve logistics.
A significant number of studies address various mobile environments, sometimes complemented by cloud-based environments, or cloud-based environments on their own. Gomes, Phua & Krishnaswamy (2013) addressed mobile data mining with execution on the mobile device itself; the framework proposes an innovative approach addressing extensions of all aspects of data mining, including contextual data, end-user privacy preservation, data management, and scalability. Yuan, Herbert & Emamian (2014) and Yuan & Herbert (2014) introduced a cloud-based mobile data analytics framework with an application case study of a smart-home-based monitoring system. Cuzzocrea, Psaila & Toccu (2016) presented the innovative FollowMe suite, which implements a data mining framework for mobile social media analytics with several tools and the respective architecture and functionalities. An interesting paper was presented by Torres et al. (2017), who addressed a data mining methodology and its implementation for congestion prediction in mobile LTE networks, also tackling the feedback reaction that triggers network reconfigurations.
Further, Biliri et al. (2014) presented a cloud-based Future Internet Enabler, an automated social data analytics solution that also addresses Social Network Interoperability, supporting enterprises in interconnecting and utilizing social networks for collaboration. Data mining methodology and applications for real-time streamed social media data were extensively discussed by Zhang, Lau & Li (2014), who proposed the design of the comprehensive ABIGDAD framework with seven main components implementing data-mining-based deceptive review identification. An interdisciplinary study tackling both of these topics was developed by Puthal et al. (2016), who proposed an integrated framework and architecture for a disaster management system based on streamed data in a cloud environment ensuring end-to-end security. Additionally, key extensions to the data mining framework have been proposed, merging a variety of data sources and types, security verification and data flow access controls. Finally, cloud-based manufacturing was addressed in the context of fault diagnostics by Kumar et al. (2016).
Mahmood et al. (2013) tackled Wireless Sensor Networks and the extensions required in the associated data mining framework. Interesting work was carried out by Nestorov & Jukic (2003), addressing the rarely covered topic of integrating data mining solutions within traditional data warehouses and actively mining the data repositories themselves.
Supported by a new generation of visualization technologies (including Virtual Reality environments), Wijayasekara, Linda & Manic (2011) proposed and implemented CAVE-SOM, a 3D visual data mining framework that offers interactive, immersive visual data mining with multiple visualization modes supported by a plethora of methods. An earlier version of a visual data mining framework was successfully developed and presented by Ganesh et al. (1996).
Large-scale social media data are successfully tackled by Lemieux (2016) with a comprehensive framework accompanied by a set of data mining tools and an interface. Real-time data analytics was addressed by Shrivastava & Pal (2017) in the domain of enterprise service ecosystems. Image data were addressed by Huang et al. (2002), who proposed a multimedia data mining framework and its implementation with user relevance feedback integration and instance learning. Further, the explosion of data diversity and the associated need to extend standard data mining are addressed by Singh et al. (2016) in a study devoted to object detection in video surveillance systems supporting real-time video analysis.
Finally, there is also a limited number of studies that address context awareness (Purpose 4) and extend data mining methodology with context elements and adjustments. In comparison with the ‘Integration’ category of research, these studies operate at a lower abstraction level, capturing and presenting lists of adjustments. Singh, Vajirkar & Lee (2003) generate a taxonomy of context factors, develop an extended data mining framework and propose a deployment including a detailed IS architecture. The context-awareness aspect is also addressed in papers reviewed above, for example, Lenz, Wuest & Westkämper (2018), Kisilevich, Keim & Rokach (2013), Sun et al. (2015), and other studies.
Integration
Studies in the ‘Integration’ category were executed with one or more of the following purposes:
- Purpose 1: to integrate/combine with various ontologies existing in the organization
- Purpose 2: to introduce context-awareness and incorporate domain knowledge
- Purpose 3: to integrate/combine with frameworks, process methodologies and concepts from other research or industry domains
- Purpose 4: to integrate/combine with other well-known organizational governance frameworks, process methodologies and concepts
- Purpose 5: to accommodate and/or leverage newly available Big Data technologies, tools and methods
The specific list of studies mapped to each of the given purposes is presented in the Appendix (Table A2). The main purposes of the adaptations, the associated gaps and/or benefits, along with observations and artifacts, are documented in Fig. 11 below.

Figure 11: ‘Integration’ scenario adaptation goals, benefits, artifacts and number of publications for the period 1997–2018.
As mentioned, a number of studies concentrate on proposing ontology-based integrated data mining frameworks accompanied by various types of ontologies (Purpose 1). For example, Sharma & Osei-Bryson (2008) focus on an ontology-based organizational view with Actors, Goals and Objectives, which supports execution of the business understanding phase. Brisson & Collard (2008) propose the KEOPS framework, which is CRISP-DM compliant and integrates a knowledge base and ontology, with the purpose of building an ontology-driven information system (OIS) for the business and data understanding phases, while the knowledge base is used in the post-processing step of model interpretation. Park et al. (2017) propose and design IRIS, a comprehensive ontology-based data analytics tool intended to align analytics and business. IRIS is based on the concept of connecting dots (analytics methods) or transforming insights into business value, and supports a standardized process for applying ontology to match business problems and solutions.
Further, Ying et al. (2014) propose a domain-specific data mining framework oriented to the business problem of customer demand discovery. They construct an ontology for customer demand and for the customer demand discovery task, which allows structured knowledge to be extracted in the form of knowledge patterns and rules. Here, the purpose is to facilitate business value realization and support the actionability of extracted knowledge via marketing strategies and tactics. In the same vein, Cannataro & Comito (2003) presented an ontology for the data mining domain whose main goal is to simplify the development of distributed knowledge discovery applications. The authors offer the domain expert a reference model for different kinds of data mining tasks, methodologies and software capable of solving the given business problem, and help find the most appropriate solution.
Apart from ontologies, Sharma & Osei-Bryson (2009) in another study propose an IS-inspired data mining methodology driven by an Input-Output model, which supports formal implementation of the business understanding phase. This research exemplifies studies executed with Purpose 2. The goal of the paper is to tackle the prescriptive nature of CRISP-DM and address how the entire process can be implemented. The study by Cao, Schurmann & Zhang (2005) is also exemplary in terms of aggregating and introducing several fundamental concepts into the traditional CRISP-DM data mining cycle: context awareness, in-depth pattern mining, human–machine cooperative knowledge discovery (in essence, following the human-centricity paradigm in data mining), and a loop-closed iterative refinement process (similar to Agile-based methodologies in software development). Several concepts, such as data, domain, interestingness and rules, are also proposed to tackle a number of fundamental constraints identified in CRISP-DM. These concepts were discussed and further extended by Cao & Zhang (2007, 2008) and Cao (2010) into an integrated domain-driven data mining concept, resulting in the fully fledged D3M (domain-driven) data mining framework. Interestingly, the same concepts are investigated individually by other authors; for example, context-aware data mining methodology is tackled by Xiang (2009a, 2009b) in the context of the financial sector. Pournaras et al. (2016) addressed the crucial topic of privacy preservation in the context of achieving an effective data analytics methodology. The authors introduced metrics and a self-regulatory (reconfigurable) information-sharing mechanism providing customers with controls for information disclosure.
A number of studies have proposed CRISP-DM adjustments based on existing frameworks, process models or concepts originating in other domains (Purpose 3), for example, software engineering (Marbán et al., 2007, 2009; Marban, Mariscal & Segovia, 2009) and industrial engineering (Solarte, 2002; Zhao et al., 2005).
Meanwhile, Mariscal, Marbán & Fernández (2010) proposed a new refined data mining process based on a global comparative analysis of existing frameworks, while Angelov (2014) outlined a data analytics framework based on statistical concepts. Following a similar approach, some researchers suggest explicit integration with other areas and organizational functions, for example, BI-driven Data Mining by Hang & Fong (2009). Similarly, Chen, Kazman & Haziyev (2016) developed an architecture-centric agile Big Data analytics methodology, and an architecture-centric agile analytics and DevOps model. Alternatively, several authors tackled data mining methodology adaptations in other domains, for example, educational data mining by Tavares, Vieira & Pedro (2017), decision support in learning management systems (Murnion & Helfert, 2011), and in accounting systems (Amani & Fadlalla, 2017).
Other studies are concerned with the actionability of data mining and closer integration with business processes and organizational management frameworks (Purpose 4). In particular, there is a recurrent focus on embedding data mining solutions into knowledge-based decision-making processes in organizations, and on supporting fast and effective knowledge discovery (Bohanec, Robnik-Sikonja & Borstnar, 2017).
Examples of adaptations made for this purpose include: (1) integration of CRISP-DM with the Balanced Scorecard framework used for strategic performance management in organizations (Yun, Weihua & Yang, 2014); (2) integration with a strategic decision-making framework for revenue management (Segarra et al., 2016); (3) integration with a strategic analytics methodology (Van Rooyen & Simoff, 2008); and (4) integration with a so-called ‘Analytics Canvas’ for the management of portfolios of data analytics projects (Kühn et al., 2018). Finally, Ahangama & Poo (2015) explored methodological attributes important for the adoption of data mining methodology by novice users. This latter study uncovered factors that could support reducing resistance to the use of data mining methodologies. Conversely, Lawler & Joseph (2017) comprehensively evaluated factors that may increase the benefits of Big Data Analytics projects in an organization.
Lastly, a number of studies have proposed adaptations of data mining frameworks (e.g., CRISP-DM) to cater for new technological architectures, new types of datasets and new applications (Purpose 5). For example, Lu et al. (2017) proposed a data mining system based on a Service-Oriented Architecture (SOA), Zaghloul, Ali-Eldin & Salem (2013) developed a concept of self-service data analytics, Osman, Elragal & Bergvall-Kåreborn (2017) blended CRISP-DM into a Big Data Analytics framework for Smart Cities, and Niesen et al. (2016) proposed a data-driven risk management framework for Industry 4.0 applications.
Our analysis of RQ3, regarding the purposes of existing data mining methodology adaptations, revealed the following key findings. Firstly, adaptations of type ‘Modification’ are predominantly targeted at addressing problems that are specific to a given case study. The majority of modifications were made within the domain of IS security, followed by case studies in the domains of manufacturing and financial services. Secondly, and in clear contrast, adaptations of type ‘Extension’ are primarily aimed at customizing the methodology to take into account specialized development environments and deployment infrastructures, and to incorporate context-awareness aspects. Thirdly, a recurrent purpose of adaptations of type ‘Integration’ is to combine a data mining methodology either with existing ontologies in an organization or with other domain frameworks, methodologies and concepts. ‘Integration’ is also used to instill context-awareness and domain knowledge into a data mining methodology, or to adapt it to specialized methods and tools, such as Big Data. The distinctive outcome and value (gaps filled) of ‘Integration’ stems from improved knowledge discovery, better actionability of results, improved combination with key organizational processes and domain-specific methodologies, and improved usage of Big Data technologies.
We discovered that the adaptations of existing data mining methodologies found in the literature can be classified into three categories: modification, extension, or integration.
We also noted that adaptations are made either to address deficiencies and missing elements or aspects in the reference methodology (chiefly CRISP-DM), or to improve certain phases, deliverables or process outcomes. Specifically, the reviewed adaptations aim to do one or more of the following (a brief illustrative sketch follows the list):
- improve key phases of the reference data mining methodologies (for example, in the case of CRISP-DM these are primarily the business understanding and deployment phases)
- support knowledge discovery and actionability
- introduce context-awareness and a higher degree of formalization
- integrate the data mining solution more closely with key organizational processes and frameworks
- significantly update CRISP-DM with respect to Big Data technologies, tools, environments and infrastructure
- incorporate a broader, explicit context of architectures, algorithms and toolsets as integral deliverables or supporting tools for executing the data mining process
- expand towards and accommodate a broader, unified perspective for incorporating and implementing data mining solutions in the organization, its IT infrastructure and its business processes
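To make these aims more tangible, the sketch below is our own minimal illustration (not taken from any reviewed study) of how an ‘Extension’-style adaptation of the CRISP-DM reference cycle could be represented as an explicit phase model, with an added context-assessment step at the start and a production-integration step at the end. All phase and deliverable names beyond the six standard CRISP-DM phases are hypothetical.

```python
# Minimal illustrative sketch (assumption-laden): an 'Extension'-style
# adaptation of CRISP-DM expressed as an explicit phase model. The
# "context_assessment" and "production_integration" phases and their
# deliverables are hypothetical examples, not artifacts of any cited study.
from __future__ import annotations

from dataclasses import dataclass, field
from typing import List


@dataclass
class Phase:
    name: str
    deliverables: List[str] = field(default_factory=list)


# The six standard CRISP-DM reference phases.
CRISP_DM = [
    Phase("business_understanding", ["objectives", "success criteria"]),
    Phase("data_understanding", ["data description report"]),
    Phase("data_preparation", ["modelling dataset"]),
    Phase("modelling", ["candidate models"]),
    Phase("evaluation", ["evaluation report"]),
    Phase("deployment", ["deployment plan"]),
]

# Extended cycle: prepend a context-awareness step and append an explicit
# productionization step, reflecting the recurrent gaps identified above.
EXTENDED = (
    [Phase("context_assessment", ["context factors", "domain constraints"])]
    + CRISP_DM
    + [Phase("production_integration",
             ["CI/CD pipeline", "monitoring plan", "business-process hooks"])]
)


def added_phases(reference: list[Phase], adapted: list[Phase]) -> list[str]:
    """Names of phases present in the adapted cycle but not in the reference."""
    reference_names = {phase.name for phase in reference}
    return [phase.name for phase in adapted if phase.name not in reference_names]


if __name__ == "__main__":
    # Prints: ['context_assessment', 'production_integration']
    print(added_phases(CRISP_DM, EXTENDED))
```

Such an explicit representation is only meant to show where the reviewed adaptations intervene in the reference cycle; the studies themselves define these extensions at the level of methodology documentation rather than code.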
Threats to Validity
Systematic literature reviews have inherent limitations that must be acknowledged. These threats to validity include subjective bias (internal validity) and incompleteness of search results (external validity).
The internal validity threat stems from the subjective screening and rating of studies, particularly when assessing the studies with respect to relevance and quality criteria. We have mitigated these effects by documenting the survey protocol (SLR Protocol), strictly adhering to the inclusion criteria, and performing significant validation procedures, as documented in the Protocol.
The external validity threat relates to the extent to which the findings of the SLR reflect the actual state of the art in the field of data mining methodologies, given that the SLR only considers published studies that can be retrieved using specific search strings and databases. We have addressed this threat to validity by conducting trial searches to validate our search strings in terms of their ability to identify relevant papers that we knew about beforehand. Also, the fact that the searches led to 1,700 hits overall suggests that a significant portion of the relevant literature has been covered.
Conclusion
In this study, we have examined the use of data mining methodologies by means of a systematic literature review covering both peer-reviewed and ‘grey’ literature. We have found that the use of data mining methodologies, as reported in the literature, has grown substantially since 2007 (a four-fold increase relative to the previous decade). Also, we have observed that data mining methodologies were predominantly applied ‘as-is’ from 1997 to 2007. This trend was reversed from 2008 onward, when the use of adapted data mining methodologies gradually started to replace ‘as-is’ usage.
The most frequent adaptations have been in the ‘Extension’ category. This category refers to adaptations that imply significant changes to key phases of the reference methodology (chiefly CRISP-DM). These adaptations particularly target the business understanding, deployment and implementation phases of CRISP-DM (or other methodologies). Moreover, we have found that the most frequent purposes of adaptations are: (1) adaptations to handle Big Data technologies, tools and environments (technological adaptations); and (2) adaptations for context-awareness and for integrating data mining solutions into business processes and IT systems (organizational adaptations). A key finding is that standard data mining methodologies do not pay sufficient attention to the deployment aspects required to scale and transform data mining models into software products integrated into large IT/IS systems and business processes.
Apart from the adaptations in the ‘Extension’ category, we have also identified an increasing number of studies focusing on the ‘Integration’ of data mining methodologies with other domain-specific and organizational methodologies, frameworks and concepts. These adaptations are aimed at embedding the data mining methodology into broader organizational aspects.
Overall, the findings of the study highlight the need to develop refinements of existing data mining methodologies that would allow them to seamlessly interact with IT development platforms and processes (technological adaptation) and with organizational management frameworks (organizational adaptation). In other words, there is a need to frame existing data mining methodologies as being part of a broader ecosystem of methodologies, as opposed to the traditional view where data mining methodologies are defined in isolation from broader IT systems engineering and organizational management methodologies.
Supplemental Information
All publication graphs (11).
The original graph files (PNG format) are provided as an archived file in the supplementary material.
SLR primary texts corpus in full.
The file starts with a Definitions page, which lists and explains all column definitions as well as the SLR scoring metrics. The second page contains the "peer-reviewed" texts, while the next one contains the "grey" literature corpus.