Machine learning in the real world

Chaoji, Vineet; Rastogi, Rajeev; Roy, Gourav

doi:10.14778/3007263.3007318

Cited by 15 publications

(7 citation statements)

References 5 publications


“…Instead, there are many guides to a practical, inductive approach (cf. Amershi et al., 2019; Chaoji et al., 2016). Added to this is the (self-)criticism of the lack of theoretical grounding (cf.…”

Section: Methodology (unclassified)

“…Kandel et al., 2012) or workflows are depicted schematically (cf. Kotsiantis, 2007; Gill et al., 2020; Chaoji et al., 2016; Amershi et al., 2019), although these approaches tend to be grouped under the term "Data Science" rather than "Machine Learning" or "text classification". A high precision thus describes the share of identified classifications among those that should be found.…”

Section: Automatic text classification: quality criteria (unclassified)


Qualitätskriterien für die automatische Inhaltsanalyse. Zur Integration von Verfahren des maschinellen Lernens in die Kommunikationswissenschaft

Laugwitz

2021

Preprint


Automated content analysis combines the standardized manual content analysis of communication science with the automatic text classification of supervised machine learning in order to describe, analyze, and compare the growing volume of media-conveyed content. Quality criteria for manual content analysis target validity and reliability, whereas quality criteria for automatic text classification are largely concerned with reproducibility. Doubts about whether text classification models learn substantively relevant features, rather than being trained on spurious correlations or artifacts in the dataset, point to a validity problem for automated content analysis: communication research that uses this method must ensure that automatic text classification does not merely reproduce reliably but delivers valid results. This work brings together epistemological differences between the social sciences and computer science and identifies the resulting points of friction in how theory, methodology, quality criteria, and the research process are handled in communication science and machine learning. A comparison of the research processes shows that the criterion of explainability in machine learning is to be understood as a striving for validity. The work then examines which explainability strategies in machine learning can be put to use for validity checks. Recommendations for the further development of automated content analysis include the development of overarching quality criteria, an interdisciplinary research process, and engagement with the more fundamental epistemological and methodological conflicts.


“…Data science projects can range from well-defined prediction tasks (e.g., predict labels given images) to building and monitoring a large collection of modeling or analysis pipelines, often over a long period of time [3], [8], [27], [28]. Using a lifecycle provenance management system ( Fig.…”

Section: A System Design and Motivating Example (mentioning)

confidence: 99%

Understanding Data Science Lifecycle Provenance via Graph Segmentation and Summarization

Miao

Deshpande

2019

2019 IEEE 35th International Conference on Data Engineering (ICDE)


As data science activity has grown, the importance of provenance during the data science project lifecycle has been recognized and discussed in recent data science systems research. Increasingly, modern data science platforms have nonintrusive and extensible provenance ingestion mechanisms to collect rich provenance and context information, handle modifications to the same file using distinguishable versions, and use graph data models (e.g., property graphs) and query languages (e.g., Cypher) to represent and manipulate the stored provenance/context information. Due to the schema-later nature of the metadata, multiple versions of the same files, and unfamiliar artifacts introduced by team members, the "provenance graph" is verbose, evolving, and hard to understand; using a standard graph query model, it is difficult to compose queries and utilize this valuable information. In this paper, we propose two high-level graph query operators to address the verboseness and evolving nature of such provenance graphs. First, we introduce a graph segmentation operator, which queries the retrospective provenance between a set of source vertices and a set of destination vertices via flexible boundary criteria to help users gain insight into the derivation relationships among those vertices. We show the semantics of such a query in terms of a context-free grammar, and develop efficient algorithms that run orders of magnitude faster than the state of the art. Second, we propose a graph summarization operator that combines similar segments together to query prospective provenance of the underlying project. The operator allows tuning the summary by ignoring vertex details and characterizing local structures, and ensures the provenance meaning using path constraints. We show the optimal summary problem is PSPACE-complete and develop effective approximation algorithms. The operators are implemented on top of a property graph backend. We evaluate our query methods extensively and show the effectiveness and efficiency of the proposed methods.


“…Machine learning (ML) has become ubiquitous in recent years and its success can be attributed to its ability to extract knowledge and make decisions by learning the underlying structures of large input datasets [17], [27], [36]. To train learning models, ML applications often adopt the iterative optimization process [12].…”

Section: Introduction (mentioning)

confidence: 99%

“…In many real-life applications, the training algorithm has to process a tremendous number of input data instances and takes a significantly long time, tending to be the bottleneck of ML. The outstanding challenge still remains of how to efficiently use ML systems on massive input data points [17], [28] and commodity hardware [24], [44].…”

Section: Introduction (mentioning)

confidence: 99%

SlimML: Removing Non-critical Input Data in Large-scale Iterative Machine Learning

Han

Liu

et al. 2019

IEEE Trans. Knowl. Data Eng.


The core of many large-scale machine learning (ML) applications, such as neural networks (NN), support vector machines (SVM), and convolutional neural networks (CNN), is the training algorithm that iteratively updates model parameters by processing massive datasets. Across a plethora of studies aiming to accelerate ML, such as data parallelization and parameter servers, the prevalent assumption is that all data points are equally relevant to model parameter updating. In this paper, we challenge this assumption by proposing a criterion to measure a data point's effect on model parameter updating, and experimentally demonstrate that the majority of data points are non-critical in the training process. We develop a slim learning framework, termed SlimML, which trains the ML models only on the critical data and thus significantly improves training performance. To this end, SlimML efficiently leverages a small number of aggregated data points per iteration to approximate the criticalness of the original input data instances. The proposed approach can be used by changing a few lines of code in a standard stochastic gradient descent (SGD) procedure, and we demonstrate experimentally, on NN regression, SVM classification, and CNN training, that for large datasets it accelerates the model training process by an average of 3.61 times while incurring accuracy losses of only 0.37%.

