Ziawasch Abedjan's Books

Data Profiling

By: Ziawasch Abedjan, Lukasz Golab, Felix Naumann, Thorsten Papenbrock

Data profiling refers to the activity of collecting data about data, i.e., metadata. Most IT professionals and researchers who work with data have engaged in data profiling, at least informally, to understand and explore an unfamiliar dataset or to determine whether a new dataset is appropriate for a particular task at hand. Data profiling results are also important in a variety of other situations, including query optimization, data integration, and data cleaning. Simple metadata are statistics, such as the number of rows and columns, schema and datatype information, the number of distinct values, statistical value distributions, and the number of null or empty values in each column. More complex types of metadata are statements about multiple columns and their correlation, such as candidate keys, functional dependencies, and other types of dependencies. This book provides a classification of the various types of profilable metadata, discusses popular data profiling tasks, and surveys state-of-the-art profiling algorithms. While most of the book focuses on tasks and algorithms for relational data profiling, we also briefly discuss systems and techniques for profiling non-relational data such as graphs and text. We conclude with a discussion of data profiling challenges and directions for future work in this area.
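
As a rough illustration of the "simple metadata" the description lists (row and column counts, datatypes, distinct values, value distributions, null or empty values), the following minimal sketch profiles a small in-memory table. It is not taken from the book: the function name `profile_columns`, the dictionary-based row format, and the choice to treat `None` and the empty string as null are all assumptions made for this example.

```python
from collections import Counter


def profile_columns(rows, null_values=(None, "")):
    """Collect simple per-column statistics for a list of dict rows.

    Assumption: every row is a dict with the same keys, and values
    listed in `null_values` count as null or empty.
    """
    columns = list(rows[0].keys()) if rows else []
    profile = {
        "row_count": len(rows),
        "column_count": len(columns),
        "columns": {},
    }
    for col in columns:
        values = [row[col] for row in rows]
        non_null = [v for v in values if v not in null_values]
        profile["columns"][col] = {
            # number of null or empty values in this column
            "null_count": len(values) - len(non_null),
            # number of distinct non-null values
            "distinct_count": len(set(non_null)),
            # observed Python types, as a stand-in for datatype information
            "types": sorted({type(v).__name__ for v in non_null}),
            # the three most frequent values, a tiny value distribution
            "most_common": Counter(non_null).most_common(3),
        }
    return profile


if __name__ == "__main__":
    sample = [
        {"id": 1, "city": "Berlin", "zip": "10115"},
        {"id": 2, "city": "Potsdam", "zip": ""},
        {"id": 3, "city": "Berlin", "zip": None},
    ]
    for name, stats in profile_columns(sample)["columns"].items():
        print(name, stats)
```

A column whose distinct count equals the row count and whose null count is zero is a candidate key in this sample, which hints at how single-column statistics lead into the multi-column metadata the description mentions.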

Page Count

136

Publisher

Springer Nature

ISBN 10

3031018656

ISBN 13

9783031018657

Advancing the Discovery of Unique Column Combinations

By: Ziawasch Abedjan, Felix Naumann

Unique column combinations of a relational database table are sets of columns that contain only unique values. Discovering such combinations is a fundamental research problem and has many different data management and knowledge discovery applications. Existing discovery algorithms are either brute force or have a high memory load and can thus be applied only to small datasets or samples. In this paper, the well-known GORDIAN algorithm and "Apriori-based" algorithms are compared and analyzed for further optimization. We greatly improve the Apriori algorithms through efficient candidate generation and statistics-based pruning methods. A hybrid solution, HCAGORDIAN, combines the advantages of GORDIAN and our new algorithm HCA, and it significantly outperforms all previous work in many situations.
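
To make the notion of a unique column combination concrete, here is a minimal sketch of a naive bottom-up, level-wise search for minimal unique combinations. It is deliberately the kind of brute-force approach the abstract says only works on small data; it is not GORDIAN, HCA, or HCAGORDIAN, and it has none of their candidate generation or statistics-based pruning. The function names and the dict-based row format are assumptions for illustration.

```python
from itertools import combinations


def is_unique(rows, cols):
    """Return True if no two rows share the same values on `cols`."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in cols)
        if key in seen:
            return False
        seen.add(key)
    return True


def minimal_uniques(rows, columns):
    """Enumerate minimal unique column combinations, smallest first.

    Candidates are tried by increasing size. Any candidate that
    contains an already discovered unique is skipped, because a
    superset of a unique is unique but not minimal.
    """
    found = []
    for size in range(1, len(columns) + 1):
        for combo in combinations(columns, size):
            if any(set(u) <= set(combo) for u in found):
                continue
            if is_unique(rows, combo):
                found.append(combo)
    return found


if __name__ == "__main__":
    table = [
        {"first": "Ann", "last": "Lee", "room": 1},
        {"first": "Ann", "last": "Kim", "room": 2},
        {"first": "Bob", "last": "Lee", "room": 2},
    ]
    print(minimal_uniques(table, ["first", "last", "room"]))
```

In the worst case this checks a number of candidates exponential in the number of columns, and each check scans the whole table, which is why practical algorithms need much stronger pruning.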

Publisher

Universitätsverlag Potsdam

ISBN 10

3869561483

ISBN 13

9783869561486

Covering Or Complete?

Discovering Conditional Inclusion Dependencies

By: Jana Bauckmann, Ziawasch Abedjan, Ulf Leser, Heiko Müller, Felix Naumann

Data dependencies, or integrity constraints, are used to improve the quality of a database schema, to optimize queries, and to ensure consistency in a database. In recent years, conditional dependencies have been introduced to analyze and improve data quality. In short, a conditional dependency is a dependency with a limited scope defined by conditions over one or more attributes. Only the matching part of the instance must adhere to the dependency. In this paper, we focus on conditional inclusion dependencies (CINDs). We generalize the definition of CINDs, distinguishing covering and completeness conditions. We present a new use case for such CINDs showing their value for solving complex data quality tasks. Further, we define quality measures for conditions inspired by precision and recall. We propose efficient algorithms that identify covering and completeness conditions conforming to given quality thresholds. Our algorithms choose not only the condition values but also the condition attributes automatically. Finally, we show that our approach efficiently provides meaningful and helpful results for our use case.
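
The basic idea that "only the matching part of the instance must adhere to the dependency" can be sketched as a simple check of a given conditional inclusion dependency. The example below only tests one fixed condition; it does not implement the paper's discovery algorithms, its covering and completeness conditions, or its quality measures. The function name `check_cind`, the equality-only condition format, and the sample tables are hypothetical.

```python
def check_cind(dependent_rows, referenced_rows, dep_attr, ref_attr, condition):
    """Check an inclusion dependency restricted by a condition.

    Only dependent rows whose attributes equal every value in
    `condition` (a dict of attribute -> required value) must have
    their `dep_attr` value present in the referenced `ref_attr` column.
    Returns (holds, violating_rows).
    """
    referenced = {row[ref_attr] for row in referenced_rows}
    matching = [
        row for row in dependent_rows
        if all(row.get(attr) == value for attr, value in condition.items())
    ]
    violations = [row for row in matching if row[dep_attr] not in referenced]
    return len(violations) == 0, violations


if __name__ == "__main__":
    persons = [{"id": 1}, {"id": 2}]
    entries = [
        {"ref": 1, "kind": "person"},
        {"ref": 9, "kind": "place"},
        {"ref": 2, "kind": "person"},
    ]
    # Conditional: only entries of kind "person" must reference persons.id.
    print(check_cind(entries, persons, "ref", "id", {"kind": "person"}))
    # Unconditional: every entry must reference persons.id.
    print(check_cind(entries, persons, "ref", "id", {}))
```

With an empty condition every row is in scope, so the check reduces to an ordinary inclusion dependency; if no row matches the condition, the dependency holds vacuously in this sketch.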
