Enron's Spreadsheets and Related Emails: A Dataset and Analysis

Hermans, Felienne; Murphy-Hill, Emerson

doi:10.1109/icse.2015.129

Cited by 63 publications

(55 citation statements)

References 21 publications


“…It has 4,498 unique spreadsheets, which are gathered through Google searches using keywords such as financial and inventory. The ENRON corpus [9] contains over 15,000 spreadsheets, extracted from the Enron email archive. This corpus is of particular interest, since it provides access to real-world business spreadsheets used in industry.…”

Section: Dataset Of Annotated Tables (mentioning)

confidence: 99%

Table Identification and Reconstruction in Spreadsheets

Koci, Thiele, Romero, et al. 2017

Advanced Information Systems Engineering


Abstract. Spreadsheets are one of the most successful content generation tools, used in almost every enterprise to perform data transformation, visualization, and analysis. The high degree of freedom provided by these tools results in very complex sheets, intermingling the actual data with formatting, formulas, layout artifacts, and textual metadata. To unlock the wealth of data contained in spreadsheets, a human analyst will often have to understand and transform the data manually. To overcome this cumbersome process, we propose a framework that is able to automatically infer the structure and extract the data from these documents in a canonical form. In this paper, we describe our heuristics-based method for discovering tables in spreadsheets, given that each cell is classified as either header, attribute, metadata, data, or derived. Experimental results on a real-world dataset of 439 worksheets (858 tables) show that our approach is feasible and effectively identifies tables within partially structured spreadsheets.

“…However it is a large set, the spreadsheets have been collected from practice, and it has been used in several works of spreadsheet research [16]. In his work Jansen [17] shows how the EUSES corpus is also similar to the more recent ENRON corpus [18], which is a collection of spreadsheets obtained from the e-mail archives of Enron Corporation, disclosed during the trials related to its bankruptcy.…”

Section: A Covering Other Approaches Of Metadata Extraction (mentioning)

confidence: 99%

Evaluating Automatic Spreadsheet Metadata Extraction on a Large Set of Responses from MOOC Participants

Roy, Hermans, Aivaloglou, et al. 2016

2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER)

Self Cite


Abstract. Spreadsheets are popular end-user computing applications, and one reason behind their popularity is that they offer a large degree of freedom to their users regarding the way they can structure their data. However, this flexibility also makes spreadsheets difficult to understand. Textual documentation can address this issue, yet for supporting automatic generation of textual documentation, an important prerequisite is to extract metadata inside spreadsheets. It is a challenge, though, to distinguish between data and metadata due to the lack of universally accepted structural patterns in spreadsheets. Two existing approaches for automatic extraction of spreadsheet metadata were not evaluated on large datasets consisting of user inputs. Hence in this paper, we describe the collection of a large number of user responses regarding identification of spreadsheet metadata from participants of a MOOC. We describe the use of this large dataset to understand how users identify metadata in spreadsheets, and to evaluate two existing approaches of automatic metadata extraction from spreadsheets. The results provide us with directions to follow in order to improve metadata extraction approaches, obtained from insights about user perception of metadata. We also understand on what type of spreadsheet patterns the existing approaches perform well and on what type poorly, and thus which problem areas to focus on in order to improve.


“…In order to better understand the use of lookup functions, we analyze their use in the Enron corpus, a recently released set of more than 16,000 spreadsheets from the bankrupt company Enron [2]. We are especially interested in learning more about the two different ways in which lookup functions can be applied: for exact matching, where only exactly corresponding results can be returned (often used to combine two worksheets), and the approximate match, where approximate results may be returned, used mainly for simple classification.…”

Section: Introduction (mentioning)

confidence: 99%

Detecting problematic lookup functions in spreadsheets

Hermans, Aivaloglou, Jansen 2015

2015 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC)

Self Cite


Abstract. Spreadsheets are used heavily in many business domains around the world. They are easy to use and as such enable end-user programmers to build and maintain all sorts of reports and analyses. In addition to using spreadsheets for modeling and calculation, spreadsheets are often also used for creating reports and dashboards: combining data from different sources and creating overviews. For this, lookup functions can be used: they search for a value in a range and return a corresponding row or column. Lookup functions are common: according to recent research the VLOOKUP is the fifth most common Excel function. In this paper we investigate the use of lookup functions in more detail. We analyze lookup functions within the newly released Enron spreadsheet corpus. The results show that 1) a minority of 43% of lookup formulas use the default setting where an approximate match may be returned, 2) 77% of approximate matches are used unnecessarily and 3) 23% of approximate lookups are problematic: they search over unsorted ranges, while this is specifically advised against in the specification, and might lead to wrong results.
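The difference between exact and approximate matching, and the risk of running an approximate match over an unsorted range, can be illustrated with a small Python sketch. This is not the cited paper's code and not Excel's actual implementation: the function names, the grade table, and the use of binary search for the approximate match are assumptions made purely for illustration.

```python
from bisect import bisect_right

def exact_lookup(table, key):
    """Exact match: return the value of the first row whose key equals `key`."""
    for row_key, value in table:
        if row_key == key:
            return value
    return None  # stands in for a "not found" error

def approximate_lookup(table, key):
    """Approximate match: return the value for the largest row key <= `key`.

    Assumption: the search is a binary search, so the answer is only
    meaningful when the keys are sorted in ascending order.
    """
    keys = [row_key for row_key, _ in table]
    pos = bisect_right(keys, key)
    if pos == 0:
        return None  # key is smaller than every row key
    return table[pos - 1][1]

# Hypothetical classification table: minimum score -> grade.
sorted_grades = [(0, "F"), (60, "D"), (70, "C"), (80, "B"), (90, "A")]
# The same rows in a different order, i.e. an unsorted lookup range.
unsorted_grades = [(90, "A"), (0, "F"), (80, "B"), (60, "D"), (70, "C")]

print(exact_lookup(sorted_grades, 85))          # None: no row has key 85
print(approximate_lookup(sorted_grades, 85))    # B: 80 is the largest key <= 85
print(approximate_lookup(unsorted_grades, 85))  # C: silently wrong on unsorted keys
```

In this sketch the unsorted table yields a plausible-looking but incorrect grade without any error, which is the kind of failure the abstract describes for approximate lookups over unsorted ranges.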
