Useful Tips and Tricks for Data Scientists
In the world of data science, there are countless libraries and tools that can help speed up your work and make your analyses more efficient.
As a data analyst and researcher, I have developed tools for processing terabytes of data and performed anomaly research and analysis in the field of brokerage and trading. During my career, I have accumulated some useful knowledge on how to efficiently analyze large amounts of data as well as solve uncommon tasks that arise in the field.
In this article, I will share some of the tricks and tools that I use to interpret data quickly and precisely and extract useful insights from it. I hope you will find something here to apply in your own data research.
Useful SQL Functions for Data Analysis
I do a lot of data engineering and dataset preparation in my daily research work, and I have gathered several useful tips that are not widely known but can be very helpful.
Nowadays, data analysts mostly use Python frameworks for data manipulation. However, there are often better and more efficient options.
Sometimes it is good to go "back to school" and use SQL functions instead. Unfortunately, SQL is no longer studied that widely at universities; Pandas has become the default option. Still, SQL has several advantages that can facilitate your work, and young data scientists should get to know and use it, because databases usually have more varied resources than Python notebooks.
Here are some less popular SQL functions that can be especially helpful with data preparation (a short runnable sketch follows the list):
- CROSS JOIN: used to perform operations between rows that might prove cumbersome in Python.
- SUM() OVER (PARTITION BY ... ORDER BY): a grouping expression that can be applied without GROUP BY. It divides a result set into partitions based on the values of one or more columns, and calculations and aggregate functions can then be performed over those partitions.
- ROW_NUMBER(): assigns a unique sequential identifier to each row, which facilitates sorting, filtering, and other operations; useful for sequential analysis.
- COALESCE(): used to handle NULL values. It evaluates its arguments in order and always returns the first non-null value. If you are doing multiple unions and want a common filled column, COALESCE helps with that.
- GROUP_CONCAT() / STRING_AGG() (depending on the database): merges the string representations of the values in a group and returns the result as a single string.
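To make these functions concrete, here is a minimal sketch of SUM() OVER, ROW_NUMBER, COALESCE, and GROUP_CONCAT, run from Python against an in-memory SQLite database (version 3.25+ is needed for window functions). The trades table and its columns are invented purely for illustration; the same SQL carries over to MySQL or PostgreSQL with minor dialect differences.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE trades (id INTEGER, trader TEXT, city TEXT, amount REAL);
    INSERT INTO trades VALUES
        (1, 'Clark', 'Hamburg', 100.0),
        (2, 'Dave',  NULL,      250.0),
        (3, 'Ava',   'Paris',    75.0),
        (4, 'Ava',   'Paris',    30.0);
""")

query = """
SELECT
    trader,
    -- running total per trader without collapsing the rows
    SUM(amount) OVER (PARTITION BY trader ORDER BY id) AS running_amount,
    -- sequential identifier within each trader
    ROW_NUMBER() OVER (PARTITION BY trader ORDER BY id) AS n,
    -- first non-null value: fall back to a default when city is missing
    COALESCE(city, 'unknown') AS city_filled
FROM trades
"""
for row in conn.execute(query):
    print(row)

# GROUP_CONCAT collapses the rows of each group into one string
for row in conn.execute("SELECT trader, GROUP_CONCAT(city, ';') FROM trades GROUP BY trader"):
    print(row)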
Logs as a Data Source
In my career, I have spent quite a lot of time studying log files in order to spot anomalies, and I have gotten used to looking at these pieces of information "under a microscope." Logs help you understand in detail what happened at the moment an issue occurred, but how do you extract that information in the most efficient way?
Most of the time, logs are treated as an unwieldy mass of information in which you have to search for a particular moment in time or a certain event (or error). Then you have to figure out what exactly went wrong and, based on your findings, fix the error.
But what if you treated a log file as a dataset? Looking at it from this angle, you will realize that logs can actually be very helpful for statistical analysis. Below are some of the techniques you can apply and how they can help you interpret your data.
- Through statistical analysis of the logs, you can get a better understanding of what exactly the app is doing and how often.
- You can apply anomaly detection to the delays between events (see the sketch after this list).
- Through supervised learning, you may be able to identify whether the app is already experiencing problems. Classification comes in handy here, as it uses an algorithm to assign test data to specific categories.
- Another application of supervised learning is predicting the future performance of the app.
- You might be able to model "what if" scenarios, although this can be quite complicated.
- Autocorrelation (ACF) can be helpful for finding delays and queues.
- If you are looking at unknown data, NLP techniques can help present it in a more user-friendly form.
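As a simple illustration of the first two points, here is a sketch that treats a handful of log lines as a dataset and looks at the gaps between consecutive events. The log format used here (ISO timestamp, level, two-word message) is an assumption made purely for illustration; adapt the parsing to your own files.
import io
import pandas as pd

# Hypothetical log lines; in practice you would read the real file.
raw_logs = io.StringIO(
    "2024-05-01T10:00:00 INFO request started\n"
    "2024-05-01T10:00:01 INFO request finished\n"
    "2024-05-01T10:00:01 INFO request started\n"
    "2024-05-01T10:00:02 INFO request finished\n"
    "2024-05-01T10:00:02 INFO request started\n"
    "2024-05-01T10:00:10 ERROR request timeout\n"
)

df = pd.read_csv(raw_logs, sep=" ", names=["ts", "level", "action", "status"])
df["ts"] = pd.to_datetime(df["ts"])

# Treat the log as a time series: the delay between consecutive entries
# is itself a signal worth analyzing.
df["delay_s"] = df["ts"].diff().dt.total_seconds()

# A very crude anomaly flag: gaps above the 95th percentile of all gaps.
print(df[df["delay_s"] > df["delay_s"].quantile(0.95)])

# Event frequencies answer "what is the app doing, and how often?"
print(df.groupby(["level", "status"]).size())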
Data Aggregation Without Losing Details
Working with big data implies various aggregation techniques. What is the downside of this approach?
Most of the time, data is normalized in some form and has a flat, in other words relational, representation. When you perform aggregation, the data is transformed in such a way that some fields and attributes are skipped, simply because they are not relevant at the time.
However, when you try to build a hypothesis on the aggregated data with ML or any other statistical analysis tool, you might need to bring those skipped details back. The problem is that you have already lost them from your dataset, so you have to start the whole cycle over and include the field that now seems important.
This is how you can solve the problem and save a lot of time:
- Keep the original dataset available for enrichment.
- Perform "top-down" analysis, which means joining the aggregated data back to the original dataset in order to spot anomalies.
Here are equivalent queries for MySQL, ClickHouse, and PostgreSQL that you can use:
-- create
CREATE TABLE EMPLOYEE (
    empId INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    dept TEXT NOT NULL,
    city TEXT NULL,
    new_city TEXT NULL
);
-- insert
INSERT INTO EMPLOYEE VALUES (0001, 'Clark', 'Sales', 'Hamburg', NULL);
INSERT INTO EMPLOYEE VALUES (0002, 'Dave', 'Accounting','Paris', NULL);
INSERT INTO EMPLOYEE VALUES (0003, 'Ava', 'Sales','Tallinn', NULL);
INSERT INTO EMPLOYEE VALUES (0004, 'Ava', 'Sales','Paris', NULL);
INSERT INTO EMPLOYEE VALUES (0005, 'Eva', 'Sales',NULL, 'Madrid');
-- QUERY MySQL
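-- For each department: a sequence number, distinct-city counts, and every
-- source row packed into one string for later drill-down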
SELECT dept
, ROW_NUMBER() OVER () as n
, COUNT(DISTINCT city) uniq_cities
, COUNT(DISTINCT coalesce(city, new_city) ) uniq_cities_
, GROUP_CONCAT(CONCAT_WS(';',name, dept,city) SEPARATOR '\n')
FROM EMPLOYEE
GROUP BY dept;
-- QUERY ClickHouse
SELECT dept
, ROW_NUMBER() OVER () as n
, uniq(city) uniq_cities
, uniq(coalesce(city, new_city)) uniq_cities_
, arrayStringConcat(groupArray(concatWithSeparator(',', toString(tuple(*)))),';') as all_fields
FROM EMPLOYEE
GROUP BY dept;
-- QUERY PostgreSQL
SELECT dept
, ROW_NUMBER() OVER () as n
, COUNT(DISTINCT city) uniq_cities
, COUNT(DISTINCT coalesce(city, new_city) ) uniq_cities_
, json_agg(ROW_TO_JSON(EMPLOYEE))
FROM EMPLOYEE
GROUP BY dept;
Alternatively, if you are using Pandas, run this code:
import pandas as pd

data = [
    (1, "Clark", "Sales", "Hamburg"),
    (2, "Dave", "Accounting", "Riga"),
    (3, "Ava", "Sales", "Tallinn"),
    (4, "Ava", "Sales", "Paris"),
    (5, "Eva", "Sales", "Paris"),
]
df = pd.DataFrame(data, columns=["empId", "name", "dept", "city"])

# Aggregate (distinct cities per department) while keeping every original
# row of the group next to its aggregate for later drill-down.
agg = df.groupby("dept").agg(uniq_cities=("city", "nunique"))
details = df.groupby("dept").apply(lambda g: g.to_dict("records")).rename("all_fields")
print(agg.join(details))
This allows you to unfold the data back and check the details behind your aggregates.
Effective Visualization for Data Discovery
When I think about how to improve the visibility of a certain process and describe it from a statistical analysis perspective, I always come to the conclusion that visualization is a compressed representation of vectors that can reflect multiple dimensions at once.
All statistics books and articles say that the very first thing you should do with the data is take a sample and draw a scatter plot or some other chart. But what if your dataset is very large and contains 100 or more columns? In that case, it is quite hard to show as many dimensions as possible while keeping them informative and relevant.
Use a clear, easy-to-understand representation so that the discussion is about the actual data problems being shown rather than about what means what. Remember that if you are tackling a complicated problem, you will always need a combination of plots. Follow these important steps to make your data easier to understand in visual form:
- Show what the dataset looks like so that there is an understanding of “what we are looking at”
- Demonstrate your data
- Write a comment
Over the years, I have singled out several visualization options that I find very helpful for data analysis. These are the Python charts that have helped me save time and reach preliminary conclusions (a small sketch follows the list):
- Scatter plot with coloring and sizing of different categories
- A heatmap or a facet grid of heatmaps
- ECDF and Q-Q plots that can be used for exploring the data distribution and checking pairs of factors for 'normality'
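Below is a minimal sketch of the first and last items, using Matplotlib and SciPy. The dataset and its column names (latency_ms, volume, venue) are synthetic and invented purely for illustration; substitute your own fields.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Synthetic example data standing in for a real dataset.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "latency_ms": rng.lognormal(3, 0.4, 500),
    "volume": rng.integers(1, 1000, 500),
    "venue": rng.choice(["A", "B", "C"], 500),
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1. Scatter plot: color encodes the category, marker size a third dimension.
codes = df["venue"].astype("category").cat.codes
axes[0].scatter(df["volume"], df["latency_ms"], c=codes, s=df["volume"] / 20, alpha=0.5)
axes[0].set_title("latency vs volume, colored by venue")

# 2. ECDF: a quick look at the distribution without choosing histogram bins.
x = np.sort(df["latency_ms"])
axes[1].plot(x, np.arange(1, len(x) + 1) / len(x))
axes[1].set_title("ECDF of latency")

# 3. Q-Q plot against the normal distribution to check 'normality'.
stats.probplot(df["latency_ms"], dist="norm", plot=axes[2])

plt.tight_layout()
plt.show()
A heatmap or a facet grid of heatmaps follows the same pattern once the data is pivoted into a matrix.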