The Remaking of Reading: Data Mining and the Digital Humanities

Abstract

In this paper, we describe the design of a number of alternative interface "droplets" that are intended for use by humanities scholars interested in applying data mining and information visualization tools to the task of hypothesis formulation. The trained droplets provide several functions. Their primary purpose is to encapsulate the results of the software training phase. They can be saved for future re-use against other collections or combinations of collections. They can be modified by having the user accept or reject features identified by the data mining software. Finally, they can also contain choices for how to display and organize items in the collection. The opportunity to develop a new interface object presents the designer with the challenge of effectively communicating what the tool is good for and how it is used. This paper outlines the design process we followed in creating the visual representations of these interface objects, describes the communicative strengths and weaknesses of a number of alternative designs, and discusses the importance of the study of new interface objects as the means of providing the user with new interface affordances.

Introduction

The goal of this paper is to address some of the conceptual issues that arise in the design of a new kind of interface object for a specific domain: data mining for the humanities. In that context, we describe one component of our research: the design of a class of visual representation that would provide humanities scholars with some insight into the data mining process, while at the same time making the activity of data mining attractive and easy to carry out.

Screenshot of the data mining worksheet for step three.

Figure 1.

An early sketch of the data mining environment (in this case for the NORA project) shows someone using a droplet trained for identifying the erotic in a set of poems from the Emily Dickinson collection in the Institute for Advanced Technology in the Humanities (IATH) at the University of Virginia. Note that the preliminary droplet design shown here (bottom right) has no specifically communicative morphology.

Our strategy in this interface was to provide the user with a variety of empty "droplets" which would be filled with the results of the software training phase [Ruecker et al.]. Each droplet would contain or encapsulate an entire working state of the system, including the algorithmic consequences of a particular training exercise, combined with some parameters for organizing and selecting the form of the display. The choice of the proper word to identify the droplets is in itself a matter of design. Other terms that have been suggested include "magnet," "crystal," "capsule," "lens," "charm," "filter," "system state," "kernel," and the very Canadian "hockey puck." Whatever these objects are eventually called, for the time being we are using the term "droplet," which suggests to us a densely compressed item that can unpack in an organic way to influence the entire environment. Once a droplet has been trained for data mining, it can be saved and applied to the entire collection, or to a different collection. A droplet is applied to a collection by dragging and dropping it onto a display representing each item, after which the display organizes itself in a series of "oil and water" effects.
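The idea of a droplet as an encapsulated working state can be made concrete with a minimal sketch. The Python below is an illustration under stated assumptions, not the NORA implementation: the names `Droplet` and `apply_droplet`, the field names, and the scoring rule (summing the weights of trained features that occur in an item's text) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Droplet:
    """Hypothetical container for one saved working state of the system."""
    name: str
    # Feature weights produced by training; empty means the droplet is untrained.
    feature_weights: dict[str, float] = field(default_factory=dict)
    # Which item fields to show, and which fields to sort by (in priority order).
    display_fields: list[str] = field(default_factory=list)
    organize_by: list[str] = field(default_factory=list)

    @property
    def is_trained(self) -> bool:
        return bool(self.feature_weights)


def apply_droplet(droplet: Droplet, collection: list[dict]) -> list[dict]:
    """Score every item with the droplet's features, then sort and project it."""
    if not droplet.is_trained:
        raise ValueError("an empty droplet cannot be applied to a collection")
    scored = []
    for item in collection:
        words = item["text"].lower().split()
        # Assumed scoring rule: sum the weights of the trained features present.
        score = sum(droplet.feature_weights.get(w, 0.0) for w in words)
        scored.append({**item, "score": score})
    # Sort by the least significant key first; Python's sort is stable.
    for key in reversed(droplet.organize_by):
        scored.sort(key=lambda row: row[key], reverse=(key == "score"))
    fields = droplet.display_fields or ["title", "score"]
    return [{f: row[f] for f in fields} for row in scored]
```

Because the droplet carries its own features, display fields, and sort keys, the same object can in principle be passed to `apply_droplet` with a different collection, which is the re-use the droplet design is meant to support.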

Of vital significance to the success of this strategy is the design of the droplets. The droplets need to be able to represent the relevant information about the data mining process in a form that is readily interpretable by humanities scholars. The droplet serves in one sense like an icon: a person looking at it will hopefully remember what system state it contains. This iconic function should work at different scales, at least one of which is quite small. The droplets therefore need to be easily visually differentiable one from another, at every scale. Finally, the droplets need to be visually appealing. We describe here our initial attempts to design these interface objects, based on a set of metaphors to real-world items that combine complex visual appearance with a compact form.

Background

The online availability of a wide range of digital data has resulted in a corresponding increase in various kinds of tools for retrieving and manipulating the items in a collection [Hockey 2000]. Interface design researchers have worked on systems intended to help users access digital images, work with electronic text files, and apply data mining algorithms to a variety of problems, both in the sciences and in the humanities.

In the area of digital images, [Bederson 2001] describes a zoomable browser, [Bumgardner et al. 2005] provides an experimental search tool that uses a colour wheel as its interface, and [Hascoët et al. 1998] discuss the use of maps in accessing a digital library. Other examples include [Rodden et al. 2001], who studied the use of similarity clustering for browsing tasks, and [Ruecker et al. 2005], who developed a prototype image browser for pill identification.

For tools related to text files, [Pirolli et al. 1996] describe a system for visualizing documents which allows the user to form dynamic groups. [Small 1996] developed a 3D prototype for text navigation, where the reader moved between columns of text from Shakespeare's plays. A variety of researchers have worked in the area of data mining for text collections of various kinds. For example, [Feldman et al. 1997] discuss early efforts in this area, and [Weiss et al. 2005] provide a recent update on methods.

Some researchers have pointed out that the potential for applying data mining tools to questions in the humanities lies largely in the capacity of such tools to contribute, not primarily to hypothesis testing, but instead to hypothesis formulation [Shneiderman 2001]; [Ramsay 2003]; [Unsworth 2004]. The standard approach in humanities research is not to solve a problem by testing one hypothesis against another, but rather to enrich the object of study by repeated observation and reporting. Data mining tools and their accompanying visualizations, which facilitate pattern finding across a wide range of data, can play a role in this process.

With respect to the design of interfaces for data mining, it is important to remember that each new online tool represents a new opportunity for action, or affordance [Gibson 1979]; [Vicente 2002]. For instance, in a more conventional approach to the interface for data mining, it would be possible to create a history palette that records previous states of the system. However, it is not necessarily straightforward to repurpose an item from that history to a new collection. By encapsulating the history states as droplets, we make the repurposing simpler.

Another significant feature of the droplets is their role in interactivity. By providing the user with an item to drag and drop to trigger a series of dynamic responses from the system, the droplets help facilitate an instructional aspect: the user can see the steps carried out by the system, which correspond to the steps associated with the droplet. While visually dynamic responses are not reliant on the presence of droplets as objects, their existence as part of the user interaction helps to suggest to the designer these various new forms of feedback, which are a kind of affordance.

Studying these new affordances presents a challenge, in that the researcher by definition does not always have an existing object with a similar affordance; otherwise it would be a case of a redesign rather than a new tool [Ruecker 2003]. Though opinions vary, the current dominant perspective is that interface research requires a component of usability study [Nielsen 2000], but that usability study alone is probably not enough. Attention should also be paid to other factors, such as aesthetics [Karvonen 2000], affect [Dillon 2001], and sustained use over time [Plaisant 2004].

Methodology

We began by identifying the kinds of information the user might want to know while working with the system. These included an overview of the process, suggestions about the kinds of tasks that could be performed using the system, reassurance at each point that the correct things were happening, and assistance in interpreting the results of each phase and moving successfully to the next stage. With the droplets, we hoped to be able to communicate what had been done to create them, in order to suggest how they might be successfully deployed once they were created.

To construct the droplets, we generated a candidate list of real-world items that have a sufficiently complex physical shape to serve as possible metaphors for the complexities of the data mining process. We determined early in the process that it would be difficult and probably not helpful to try to communicate for this demographic the actual algorithms involved, as for example by superimposing an equation on a geometric shape. Instead, we hoped to be able to visually express the following information:

  • Is this a trained droplet or an empty one?
  • For trained droplets, has the user accepted the features recommended by the system or has the list of features been modified?
  • What kinds of features were included?
  • How many features were included?
  • What options for display have been associated with the droplet?
  • What choices for organizing the display have been applied?

There are also other pieces of information that could be useful for understanding what has been happening. These items need to be communicated somehow but could be difficult to associate with the visual appearance of the droplets. These include:

  • The name of the collection or collections used in training.
  • The size of the collection.
  • The size of the training set.
  • The name and goals of the person responsible for training the droplet.

Some strategies involving droplet morphology might include using the size of the droplet to indicate the size of the training set or of the collection the set was drawn from. Internal and external lines can also be thickened or lightened as a way of suggesting robustness of the training set. Finally, depending on the visual kind of droplet, it may be possible to nest one droplet inside another, as a way of indicating their use in combination.
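One way to picture the size and line-weight strategies is as simple mapping functions. The sketch below is only an assumption about how such a mapping might work: the function names, the pixel ranges, the logarithmic scale, the upper bound of 10,000 items, and the idea of a robustness estimate between 0 and 1 are all hypothetical choices made for illustration.

```python
import math


def droplet_radius(training_size: int, min_r: float = 8.0, max_r: float = 48.0,
                   max_size: int = 10_000) -> float:
    """Map training-set size to a radius in pixels on a logarithmic scale."""
    if training_size <= 0:
        return min_r
    clamped = min(training_size, max_size)
    fraction = math.log10(clamped) / math.log10(max_size)
    return min_r + fraction * (max_r - min_r)


def stroke_width(robustness: float, thin: float = 0.5, thick: float = 4.0) -> float:
    """Map a robustness estimate in [0, 1] to an outline thickness."""
    r = max(0.0, min(1.0, robustness))
    return thin + r * (thick - thin)
```

A logarithmic scale is used here so that droplets trained on ten items and on several thousand items both remain drawable at icon size; a linear scale would be an equally valid design choice for collections of similar size.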

It may also be possible to associate this information with the droplets using strategies that do not involve the droplet morphology per se, but instead rely on the combination of text and image. Combining these methods is seen by some theorists as an important approach to the design of technical communications [Horn 1998]. We will provide this connection in the case of the prototype by refreshing an information panel about the droplet details whenever the user selects a droplet. This panel will also provide the opportunity to adjust some of the settings stored by the droplet.

Results

Working from our original map of over a dozen potential metaphors (Figure 2), we selected the following short list for further investigation. We wanted to have a variety of items that were distinct from each other but were also visually complex in a way that could communicate the stages in droplet training. We thought we should include examples that covered points on a terrain that included the organic and the mechanical, with reference to several disciplines. Finally, we tried to choose examples that could be contained by a common perimeter. Our working list contained the following items:

  • Ferns: configurations of individual organic pieces that form larger items
  • Snowflakes: a single solid unique configuration that relies on symmetry
  • Solar system: individual items in relations suggested by a larger structure
  • Atoms: individual items connected in a more elaborate geometric framework
  • Cells: complex interiors composed of pieces that associate by juxtaposition
  • Clockwork: complex interiors consisting of structures that interconnect
  • Lego™: geometric shapes with complex surfaces that interconnect

For each of these metaphors, we developed sketches for four different states of the droplet: untrained, trained, trained with multiple display options chosen, and trained with multiple display and two different organization options. Our goal in each case was to make the different states visually distinct at every level of magnification, and to make the number of display and organization options obvious at the largest size.

Concept map with the word 'droplets' in the center.

Figure 2.

Our concept map of possible droplet metaphors shows a wide range of candidate real-world objects that combine visual complexity with a compact form.

We chose these various states because they represent significant choices made by the user. It would also be possible to consider visually representing choices the user makes about what collection to work with in the first place, which may be one of the most significant choices the user makes. However, visually representing collections is a challenge, and it may be preferable to provide information about the collection in the form of text labels.

Ferns

A fern is a fractal, which means it repeats its morphology at increasing scales (Figure 3). We might adopt this strategy for two scales, where in the unfolding fern leaf, the individual leaflets represent functions and the entire leaf represents the complete, organized droplet.

We can use the stem to represent the software training, and the leaflets to represent the other functions. This strategy has the benefit of looking minimal when no display or organization functions are chosen, which may prompt the user to want to choose more sophisticated configurations of options.

If we also assume that the two sides of the stem represent two kinds of organization, then having all the display items on one side of the stem would indicate only one kind of sorting, while dividing display items on both sides of the stem would indicate two kinds of sorting.

Five yellow droplets with various types of leaves inside.

Figure 3.

The placement of leaflets along the stem of the fern leaf allows us to express the user choices starting with an empty droplet (left), then sequentially adding training information, display choices, then organization in one way and in two ways.

Reading the sequence from left to right, we show first an untrained or empty droplet. The next version shows a droplet that has been trained by the user. Taking one of the demonstration projects as an example, this second droplet might contain the results of training the system to recognize poems by Emily Dickinson with an erotic charge, using a naïve Bayesian algorithm. The third version shows this same trained droplet with seven items chosen for display. In the case of the Dickinson collection, these items might include the poem's title (often the first line), the date of first publication, the place of publication, the name of the publisher, the number of lines in the poem, the number of words in the poem, the number of key features found in the poem related to eroticism, and the numeric score assigned by the system for the poem in terms of its erotic charge. The fourth version would represent the same information about each poem, but organize the results in some way, possibly by the numeric rating assigned by the system. The fifth and last version would show the items arranged in two ways: first by numeric rating, and chronologically within that.
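For readers unfamiliar with the naïve Bayesian approach mentioned above, the following is a minimal textbook-style sketch in Python. It is not the algorithm as implemented in the NORA system, whose details are not given here; the function names, the whitespace tokenization, and the add-one smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_naive_bayes(labelled_docs):
    """Count word frequencies per class from (text, label) pairs."""
    word_counts = {}          # label -> Counter of words
    doc_counts = Counter()    # label -> number of documents
    for text, label in labelled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    vocab = set().union(*word_counts.values())
    return word_counts, doc_counts, vocab


def log_posterior(text, label, model):
    """Unnormalized log P(label | text) with add-one (Laplace) smoothing."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    total_words = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / total_docs)
    for word in text.lower().split():
        if word in vocab:  # ignore words never seen in training
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
    return score


def classify(text, model):
    """Return the label with the highest log posterior."""
    return max(model[1], key=lambda label: log_posterior(text, label, model))
```

In a sketch like this, the difference between the log posteriors of the two classes could serve as the numeric score by which results are later organized, and the words with the largest per-class counts would be natural candidates for the features a user accepts or rejects.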

The organic nature of the fern droplet may lead to some difficulties for the user in that a growth process for a fern is not the same as selection among various options by a user defining a droplet. The use of this organic metaphor, however, does suggest another possibility: would it be interesting to indicate how long it has been since someone used a droplet? Do the droplets visibly age when they aren't used? Does new use refresh the appearance of the droplet? Would people be encouraged to experiment with strange droplets because they are obviously drying up or deteriorating?

Snowflakes

Ferns suggest quite a regular form of arrangement, which means there is little meaningful variation possible between different droplets. Snowflakes also tend to symmetry, but each is unique. They combine a complex silhouette with a compact form (Figure 4). Variations in the details comprising the silhouette could therefore be used to communicate a wide range of functions.

However, the strong visual language of the snowflake may prove to be hard to repurpose as a meaningful channel of communication. The fact that each snowflake is supposed to be unique also means that there is no basic, restricted vocabulary of shapes to draw on in their construction.

Five blue snowflakes with varying decorations inside.

Figure 4.

Each snowflake is a unique visual object, which allows us to differentiate one droplet from another, but introduces a difficulty in that there is no simple method of re-using recognizable components.

Our draft solution in this case is to treat the visual complexity of the interior of the object as the measure of the state of the droplet. Unlike our other designs, which involve composites of countable objects, the snowflake droplets indicate each condition by filling in spaces that are otherwise unarticulated.

Solar System

Objects in the solar system create a composite object where the individual items are in relation to one another but not in immediate contact (Figure 5). The central position of the sun also serves to imply the centrality of the software training. A solar system without a sun is clearly incomplete.

Five sample solar system droplets with orbiting objects in different locations.

Figure 5.

The solar system, with its objects in orbit, provides a structure that can be progressively filled with planetary dots that represent choices of representation, while location on the orbits is used to indicate organization.

Another potential difficulty with several of the designs, including the solar system, is that they may suggest a degree of order and regularity which may be somewhat at odds with the experience of the scholar using data mining techniques. Using a data mining system can actually involve an iterative and somewhat "messy" experimentation with various options.

Atomic

Our starting point for the atomic droplets is the simple model that consists of electrons in elliptical orbits around a nucleus (Figure 6). The nucleus is filled in during the training phase, while the inclusion of electrons and their locations represent choices about item representation and organization.

Five sample atoms with particles in various locations.

Figure 6.

Atomic models provide a vocabulary for expressing the components of the droplets, consisting of individual items connected to each other.

Cells

A cell has an interior that is populated with a number of distinct individual items and structures (Figure 7). Cells therefore provide a compact metaphor based on the complexities of the interior of the droplet. We also have available for future exploration the single-celled organisms, such as the paramecium, which combine this interior complexity with an exterior with some communicative potential.

Cells also suggest an organic form, which may help to counterbalance the highly technical profile of data mining in the humanities.

Five purple cells with varying arrangements of inner parts.

Figure 7.

A cell is neither an aggregate nor does it have a complex silhouette. Its communicative potential consists instead of a rich interior of organic shapes, including individual items and structures that divide, enclose, and support them.

Clockwork

A clockwork has a complex interior like a cell, but without the suggestion of the organic (Figure 8). There is a high degree of interconnection of the parts inside a clock, implying that all the parts are necessary in order for it to work. This level of constraint on what is necessary and what is optional might not be appropriate in the context of data mining, but the operational nature of the clock and the implied association with the mathematical operations underlying data mining may make it particularly appropriate.

The variety of interior components also provides a potentially rich visual vocabulary for representing the different aspects of the droplets. Finally, we have used an external outline suggestive of clock gears, in order to allow a direct visual association to the mechanical, even for the untrained form of the droplet.

Five sample clock droplets with varying arrangements of gears.

Figure 8.

Like a cell, a clockwork shows a rich internal landscape that can be used to represent a variety of functions. Clockworks are mechanical rather than organic, and therefore suggest interconnection, rather than isolation of the functions.

Lego™

With Lego, there are a set number of individual shapes that are aggregated. With this metaphor, we can use the external contour of the composite droplet (Figure 9). We can distinguish by size between more and less important functions, so the key training can be indicated by a large Lego piece, while the display functions are secondary and the organization functions tertiary.

Lego also comes with the affordance of assembling the separate pieces into different configurations. The user could distinguish between similar droplets by taking advantage of different kinds of arrangement.

Five sample Lego™ pieces built in different designs.

Figure 9.

Lego™ suggests a method of combining separate items to create a new whole. For our purposes, each individual piece of Lego would stand either for the result of software training or for a choice of representation or organization.

Conclusions and Future Research

Having identified a range of possibilities, our next step will be to present them to potential users in order to collect measures of performance and preference. By placing them in the interactive context of a prototype environment, we will be able to examine how humanities scholars respond to the various affordances. The goals of this phase will be to determine whether participants are able to make the necessary intuitive leaps to understand the intended communicative aspects of each of the droplet designs. Once we have established a smaller subset of droplets, we will proceed by expanding the visual positioning or skinning of each droplet type, in order to determine how humanities scholars respond to various semantic differentials such as glossy/rough, technological/natural, geometric/organic, and colour/grey scale. By determining how potential users of the data mining system perceive the design dimensions of the droplets, we will be able to decide to what extent this strategy can prove beneficial in removing barriers to them adopting the system. One possibility may consist of the use of a hybrid form of droplets, where different visual components are assembled in a kind of toolkit. Our eventual decisions with respect to the design of the droplets may also be usefully repurposed to inform the visual aspects of the design of the entire system.

Acknowledgements

The authors wish to thank the many members of the NORA project research team for their contributions to this work. Their names can be found at http://www.noraproject.org/team.php. We would also like to acknowledge the generous support of the Andrew W. Mellon Foundation, the Social Sciences and Humanities Research Council of Canada, the Natural Sciences and Engineering Council of Canada, and the Canadian Foundation for Innovation.

Works Cited

Bederson 2001

Bederson, B.B. "PhotoMesa: a Zoomable Image Browser Using Quantum Treemaps and Bubblemaps". Presented at ACM 2001. Proceedings of the 14th Annual ACM Symposium on User Interface Software and Technology (2001), pp. 71-80.

Dillon 2001

Dillon, Andrew. "Beyond Usability: Process, Outcome and Affect in Human-Computer Interactions". Canadian Journal of Library and Information Science 26: 4 (2001), pp. 57-69.

Feldman et al. 1997

Feldman, Ronen, and Haym Hirsh. "Finding Associations in Collections of Text". In Ryszard S. Michalski, Ivan Bratko and Miroslav Kubat, Machine Learning and Data Mining: Methods and Applications. New York: Wiley, 1997. pp. 223-240.

Gibson 1979

Gibson, James J. The Ecological Approach to Visual Perception. Boston: Houghton-Mifflin, 1979.

Hascoët et al. 1998

Hascoët, Mountaz, and Xavier Soinard. "Using Maps as a User Interface to a Digital Library". Presented at SIGIR '98. Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1998), pp. 339-340. http://doi.acm.org/10.1145/290941.291028.

Hockey 2000

Hockey, Susan. Electronic Texts in the Humanities. Oxford: Oxford University Press, 2000.

Horn 1998

Horn, Robert E. Visual Language: Global Communication for the 21st Century. Bainbridge Island, WA: MacroVU Inc, 1998.

Horton et al. 2006

Horton, Tom, Kristen Taylor, Bei Yu and Xin Xiang. "Quite Right, Dear and Interesting: Seeking the Sentimental in Nineteenth Century American Fiction". Presented at Digital Humanities 2006. Proceedings of the Association for Literary and Linguistic Computing (2006), pp. 81-82.

Karvonen 2000

Karvonen, Kristiina. "The Beauty of Simplicity". Presented at CUU. Proceedings of the 2000 Conference on Universal Usability (2000).

Kirschenbaum et al. 2006

Kirschenbaum, Matthew G., Catherine Plaisant, Martha Nell Smith, Loretta Auvil, James Rose, Bei Yu and Tanya Clement. "'Undiscovered Public Knowledge': Mining for Patterns of Erotic Language in Emily Dickinson's Correspondence with Susan Huntington (Gilbert) Dickinson". Presented at Digital Humanities 2006 (July 5–9, 2006). Digital Humanities 2006 Conference Abstracts, pp. 252-255.

Nielsen 2000

Nielsen, J. Designing Web Usability: The Practice of Simplicity. Indianapolis, IN: New Riders, 2000.

Pirolli et al. 1996

Pirolli, Peter, Patricia Schank, Marti Hearst and Christine Diehl. "Scatter/Gather Browsing Communicates the Topic Structure of a Very Large Text Collection". Presented at SIGCHI 2006. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems: Common Ground (1996), pp. 213-220.

Plaisant 2004

Plaisant, Catherine. "The Challenge of Information Visualization Evaluation". IEEE Proceedings of AVI 2004 (2004).

Ramsay 2003

Ramsay, Stephen. "Toward an Algorithmic Criticism". Literary and Linguistic Computing 18: 2 (2003), pp. 167-174.

Ramsay and Steger 2006

Ramsay, Stephen, and Sara Steger. "Distinguished Speakers: Keyword Extraction and Critical Analysis with Virginia Woolf's The Waves". Presented at Digital Humanities 2006. Proceedings of the Association for Literary and Linguistic Computing Conference 2006 (2006), pp. 255-257.

Rodden et al. 2001

Rodden, Kerry, Wojciech Basalaj, David Sinclair and Kenneth Woods. "Does Organization by Similarity Assist Image Browsing". Presented at CHI 2001. Proceedings of the Human Factors in Computing Systems Conference (2001), pp. 190-197.

Ruecker 2003

Ruecker, Stan. Affordances of prospect for academic users of interpretively-tagged text collections. Thesis, University of Alberta, Edmonton, Alberta, Canada: 2003.

Ruecker et al.

Ruecker, Stan, Milena Radzikowska and Stéfan Sinclair. "Communicating Process with Form: Designing the Visual Morphology of the Nora Data Mining Kernels". Presented at CaSTA 2006. Proceedings of the Joint Computer Science and Humanities Computing Conference (2006), pp. 57-68.

Ruecker et al. 2005

Ruecker, S., L. M. Given, B. Sadler, and A. Ruskin. "Building Accessible Web Interfaces for Seniors: Similarity Clustering of Pill Images". Presented at Include 2005. London: Helen Hamlyn Institute, Royal College of Art, April 5-8, 2005.

Shneiderman 2001

Shneiderman, Ben. "Inventing Discovery Tools: Combining Information Visualization with Data Mining". Presented at DC 2001. Keynote for Discovery Science Conference 2001 (2001).

Small 1996

Small, David. "Navigating Large Bodies of Text". IBM Systems Journal 35: 3-4 (1996).

Unsworth 2004

Unsworth, John. "Forms of Attention: Digital Humanities Beyond Representation". Presented at CaSTA 2004. Proceedings of the Third Conference of the Canadian Symposium on Text Analysis (2004).

Vicente 2002

Vicente, Kim J. "Ecological Interface Design: Progress and Challenges". Human Factors 44: 1 (2002), pp. 62-78.

Weiss et al. 2005

Weiss, Sholom M., Nitin Indurkhya, Tong Zhang and Fred Damerau. Text Mining: Predictive Methods for Analyzing Unstructured Information. New York: Springer, 2005.

Source: http://www.digitalhumanities.org/dhq/vol/3/3/000067/000067.html