MExiCo: A Library for Managing Multimodal Data Collections
Further details
• express microstructural entities and relations, such as elements in transcripts, annotation documents, treebanks, etc.;
• express macrostructural entities and relations, such as the experimental setup (consisting of roles and types of participants, variables, and the number, type, and duration of trials, etc.) and the resources that together form a corpus or data collection (such as audio and video files, speech transcripts, and annotation documents as a whole, etc.);
• allow for multiple versions of data sets (e.g., multiple annotations of the same phenomenon as a basis for agreement calculations).
Our result was that none of these models, although perfectly suitable for particular subsets of our data, was able to handle the entirety of our data collection. In many cases we identified the main problem as a different understanding of central terms such as "corpus" or "transcript". Although "corpus" is usually defined as "a finite set of concrete linguistic utterances that serves as an empirical basis for linguistic research" (Bußmann 1996:106), along with subsequent annotations, this definition is too narrow for our field. Even with the addition of an abstract timeline for anchoring multiple events (as in, among others, Bird & Liberman 2001, or Evert et al. 2003), we require a more complex axis system. It must support multiple timelines (for cases where data sets are bound to timelines for which no synchronisation has been defined yet) as well as spatial systems (necessary for modeling, e.g., gestures, head movements, and actions in dialogue games where spatial actions are of interest, such as object arrangement games).
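To make the axis requirement concrete, the following is a minimal Ruby sketch of events anchored on several unsynchronised timelines and on a spatial axis. All names (`Axis`, `Anchor`, `Event`, `shared_timelines`) and all values are hypothetical illustrations and are not taken from MExiCo itself.

```ruby
# Hypothetical sketch: none of these names come from MExiCo.
Axis   = Struct.new(:name, :kind, :unit)   # kind is :temporal or :spatial
Anchor = Struct.new(:axis, :from, :to)     # an interval on exactly one axis

Event = Struct.new(:label, :anchors) do
  # Two events can only be compared in time if they share a temporal axis.
  def shared_timelines(other)
    mine   = anchors.map(&:axis).select { |a| a.kind == :temporal }
    theirs = other.anchors.map(&:axis).select { |a| a.kind == :temporal }
    mine & theirs
  end
end

video_time = Axis.new("video_time", :temporal, "s")
audio_time = Axis.new("audio_time", :temporal, "s")  # not yet synchronised with video_time
table_x    = Axis.new("table_x", :spatial, "cm")

gesture = Event.new("pointing gesture",
                    [Anchor.new(video_time, 12.4, 13.1),
                     Anchor.new(table_x, 30.0, 45.0)])
nod       = Event.new("head nod",  [Anchor.new(video_time, 12.9, 13.3)])
utterance = Event.new("utterance", [Anchor.new(audio_time, 3.2, 4.0)])

puts gesture.shared_timelines(nod).map(&:name).inspect
puts gesture.shared_timelines(utterance).map(&:name).inspect
```

In this sketch the gesture and the head nod share a timeline and can be ordered, whereas the gesture and the utterance cannot be related in time until a synchronisation between the two timelines is defined.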
On the basis of those theories we propose MExiCo, a generic data model capable of dealing with heterogeneous data collections such as those present in our Collaborative Research Centre. MExiCo will be available to researchers in different ways: as a library to be used in console scripts, as an HTTP API that can be accessed as a web service, and, finally, as a backend of Phoibos, a web-based corpus management application (Menke & Mehler 2011, Menke & Cimiano 2012) in which researchers can benefit from its functionality without having to program. Even programming against it is not difficult, however: being implemented in Ruby, MExiCo’s core functionality benefits from Ruby’s flexible syntax and is designed as a DSL (domain-specific language). Researchers can therefore formulate queries, scripts, and batch processes in an easy-to-understand language that attempts to stay as close to human language as possible and imposes as few formal requirements of a programming language as possible.
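As an illustration of what an internal Ruby DSL of this kind can look like, here is a minimal sketch built on `instance_eval`. The names `corpus`, `CorpusBuilder`, `participant`, `resource`, and `resources_of_type`, as well as the file names, are assumptions made for this example and do not reflect MExiCo's actual API.

```ruby
# Hypothetical sketch of an internal DSL; not MExiCo's real interface.
class CorpusBuilder
  attr_reader :name, :resources, :participants

  def initialize(name)
    @name = name
    @resources = []
    @participants = []
  end

  # DSL words: each call records one declaration.
  def participant(role)
    @participants << role
  end

  def resource(path, type:)
    @resources << { path: path, type: type }
  end

  # A simple query over the declared resources.
  def resources_of_type(type)
    @resources.select { |r| r[:type] == type }
  end
end

# Entry point: evaluates the block in the context of a fresh builder,
# so the DSL words can be written without an explicit receiver.
def corpus(name, &block)
  builder = CorpusBuilder.new(name)
  builder.instance_eval(&block)
  builder
end

collection = corpus "object arrangement game" do
  participant :instructor
  participant :follower
  resource "trial01.wav", type: :audio
  resource "trial01.mp4", type: :video
  resource "trial01.transcript", type: :transcript
end

puts collection.resources_of_type(:audio).map { |r| r[:path] }.inspect
```

Because Ruby allows method calls without parentheses and blocks without extra syntax, the declaration reads almost like a plain list of statements, which is the property the paragraph above attributes to a DSL design.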
Citation styles
Menke P, Cimiano P. MExiCo: A Library for Managing Multimodal Data Collections. In: Vargas-Sierra C, ed. Corpus Resources for Descriptive and Applied Studies. Current Challenges and Future Directions: Selected Papers from the 5th International Conference on Corpus Linguistics (CILC2013). Procedia - Social and Behavioral Sciences. Vol 95. Elsevier BV; 2013: 105-110.
Menke, P., & Cimiano, P. (2013). MExiCo: A Library for Managing Multimodal Data Collections. In C. Vargas-Sierra (Ed.), Procedia - Social and Behavioral Sciences: Vol. 95. Corpus Resources for Descriptive and Applied Studies. Current Challenges and Future Directions: Selected Papers from the 5th International Conference on Corpus Linguistics (CILC2013) (pp. 105-110). Elsevier BV. doi:10.1016/j.sbspro.2013.10.628
Menke, P., and Cimiano, P. (2013). “MExiCo: A Library for Managing Multimodal Data Collections” in Corpus Resources for Descriptive and Applied Studies. Current Challenges and Future Directions: Selected Papers from the 5th International Conference on Corpus Linguistics (CILC2013), Vargas-Sierra, C. ed. Procedia - Social and Behavioral Sciences, vol. 95, (Elsevier BV), 105-110.
Menke, P., & Cimiano, P., 2013. MExiCo: A Library for Managing Multimodal Data Collections. In C. Vargas-Sierra, ed. Corpus Resources for Descriptive and Applied Studies. Current Challenges and Future Directions: Selected Papers from the 5th International Conference on Corpus Linguistics (CILC2013). Procedia - Social and Behavioral Sciences. no.95 Elsevier BV, pp. 105-110.
P. Menke and P. Cimiano, “MExiCo: A Library for Managing Multimodal Data Collections”, Corpus Resources for Descriptive and Applied Studies. Current Challenges and Future Directions: Selected Papers from the 5th International Conference on Corpus Linguistics (CILC2013), C. Vargas-Sierra, ed., Procedia - Social and Behavioral Sciences, vol. 95, Elsevier BV, 2013, pp. 105-110.
Menke, P., Cimiano, P.: MExiCo: A Library for Managing Multimodal Data Collections. In: Vargas-Sierra, C. (ed.) Corpus Resources for Descriptive and Applied Studies. Current Challenges and Future Directions: Selected Papers from the 5th International Conference on Corpus Linguistics (CILC2013). Procedia - Social and Behavioral Sciences. 95, p. 105-110. Elsevier BV (2013).
Menke, Peter, and Cimiano, Philipp. “MExiCo: A Library for Managing Multimodal Data Collections”. Corpus Resources for Descriptive and Applied Studies. Current Challenges and Future Directions: Selected Papers from the 5th International Conference on Corpus Linguistics (CILC2013). Ed. Chelo Vargas-Sierra. Elsevier BV, 2013. Vol. 95. Procedia - Social and Behavioral Sciences. 105-110.