Until recently, statisticians were the de facto data doctors. Data were not as widespread as they are now; they arrived clean, in orderly formats, and required only standard analysis and visualization. This is no longer the case. Data availability, generation, formats, and usage are bringing new challenges and requiring a new superset of skills that traditional statisticians are not equipped to handle. The standard set of tools and formulas used by typical statisticians cannot cope with the sheer volume of incoming data at its speed of generation, and analysis, visualization, and knowledge discovery are no longer possible with the old-fashioned toolkit. Data scientists, now a hot profession on job-hunting lists, are trying to tackle the data tsunami and turn piles of raw data into actionable knowledge before it is too late, just in time to beat the competition or to support money-saving decisions.
Data science does include, among other skills, statistical and mathematical knowledge, but it does not stop there. There is more science to it, as it tries to extract meaningful data from various sources and formats and generate a knowledge product. In addition to domain knowledge, computing, visualization, modeling, data processing and management, analytics, and machine learning are among the essential skills for those who want to enter this new profession. Some universities are already offering courses, or even full degrees, in data science.
Data science applications are not limited to security and safety; they span all disciplines, from biology and medicine to sports and entertainment. Analyses and predictions are only as good as the data scientists who produce them, so a skilled and experienced professional with a sense of data dynamics may see more in relevant data sets, for instance by choosing meaningful visualizations over merely pretty displays. Companies that live on data, e.g., Google, use machine learning and crowdsourcing to improve translation and natural language processing by inferring a wealth of information from human input and interaction. Even by typing in CAPTCHA challenges you may be contributing to text processing!
For a data scientist, data discovery, acquisition, and cleaning are just the first few steps in a long and resource-demanding journey. You have to devise your own computational scripts and rely on well-founded algorithms to scrape and clean data, as well as to detect erroneous values and outliers. Imagine a tabulation of human heights and weights collected on a national scale for school health: you may not have the luxury of having all data points in centimeters and kilograms. You have to automate the cleanup process to account for the most common scenarios and use statistics to either isolate outliers or treat them manually.
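The heights example above can be sketched in a few lines of Python. This is a minimal illustration with made-up sample values, assuming mixed centimeter and inch entries; it normalizes units, then flags values more than two standard deviations from the mean for manual review rather than silently dropping them:

```python
import statistics

# Hypothetical sample: height strings recorded in mixed units (cm and inches).
raw_heights = ["172 cm", "64 in", "180 cm", "5 cm", "168 cm", "70 in"]

def to_cm(value):
    """Normalize a height string like '64 in' or '172 cm' to centimeters."""
    number, unit = value.split()
    return float(number) * 2.54 if unit == "in" else float(number)

heights = [to_cm(v) for v in raw_heights]

# Flag entries more than 2 standard deviations from the mean as outliers
# (the obviously wrong "5 cm" record gets isolated for manual treatment).
mean = statistics.mean(heights)
stdev = statistics.stdev(heights)
clean = [h for h in heights if abs(h - mean) <= 2 * stdev]
outliers = [h for h in heights if abs(h - mean) > 2 * stdev]
```

A real cleanup script would of course handle far more cases (missing units, typos, impossible combinations of height and weight), but the shape stays the same: normalize first, then screen statistically.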
Python is the language of choice for many data scientists, not only for its features and ease of use but probably also for trendier reasons, such as the Google effect and job requirements. Python has many data and numeric features baked into the language, and the community provides great contributions that enrich the experience. IPython notebooks with online rendering make it a great choice over others (try the Anaconda distribution). R is also a good choice, as is any other comparable language. You will need some scripting and data-wrangling tools and skills as well. The source and size of your data may influence your preference for one tool over another.
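As a taste of the "baked-in" data features mentioned above, here is a toy wrangling example using only the standard library; the visit log is made up for illustration:

```python
from collections import Counter

# Hypothetical log of page visits on a website.
visits = ["home", "search", "home", "product", "search", "home"]

# Counting, grouping, and ranking are one-liners with the standard library.
counts = Counter(visits)
top_page, top_count = counts.most_common(1)[0]
```

The same spirit scales up through third-party libraries such as NumPy and pandas, which is a large part of why Python dominates the field.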
For data visualization, several academic and commercial packages are available. Python itself is capable of analyzing and presenting your data, but you should not ignore standard tools like MS Excel. With Power Query (Excel 2010 or newer) and Power View (Excel 2013), you can acquire data from various sources and present it in many ways. Of course, data science is not necessarily synonymous with Big Data (how big is big?); a good rule of thumb is data adequate for the case at hand, and that you can actually handle.
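Staying within Python, a minimal plotting sketch might look like the following. It uses matplotlib, one of the community packages alluded to above, with invented monthly sales figures; the off-screen backend lets it run without a display:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Hypothetical monthly sales figures, purely for illustration.
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 95, 140, 110]

fig, ax = plt.subplots()
ax.bar(months, sales)           # one bar per month
ax.set_xlabel("Month")
ax.set_ylabel("Sales")
ax.set_title("Monthly sales")
fig.savefig("sales.png")        # write the chart to an image file
```

In a notebook, the same figure renders inline, which is where the interactive, exploratory side of visualization really pays off.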
To be a data scientist, you need to be a data doctor with a special feel for sound versus ill-formatted data, a sense of which variables are likely to affect which, and the skill to slice and dice the data. Data presentation and visualization, with an interactive touch, are equally important. Sometimes, data interpretation and knowledge derivation depend on how you look at the data and what hints you have about what may be hidden in the data terrain.
It is ironic how we choose to participate in this datafication effort (providing data and allowing data collection about our activity online and offline), yet end up paying for data products built from the raw data we offer at no cost. There are moral and ethical issues around such activities that may surface and haunt us in the near future. What are your thoughts?
* Illustration from the Berkeley Science Review; see that page for symbol explanations.