{"id":1381,"date":"2019-05-25T19:21:16","date_gmt":"2019-05-25T07:21:16","guid":{"rendered":"http:\/\/www.helenanderson.co.nz\/?p=1381"},"modified":"2020-08-30T20:14:28","modified_gmt":"2020-08-30T08:14:28","slug":"big-data-a-to-z","status":"publish","type":"post","link":"https:\/\/helenanderson.co.nz\/big-data-a-to-z\/","title":{"rendered":"Big Data from A to Z"},"content":{"rendered":"\n

Welcome to an awesome list of all the tools Data Scientists and Data Engineers use to build platforms and models.<\/p>\n\n\n\n


\n\n\n\n

Athena<\/a>
Batch Processing<\/a>
Compute<\/a>
Docker<\/a>
Ethical Guidelines<\/a>
Fuzzy Logic<\/a>
GPU<\/a>
Hadoop<\/a>
Image Recognition<\/a>
Jupyter Notebook<\/a>
Kaggle<\/a>
Linear Regression<\/a>
Map Reduce<\/a>
Natural Language Processing<\/a>
Overfitting<\/a>
Pattern Recognition<\/a>
Quantitative v Qualitative<\/a>
Real Time<\/a>
Spark<\/a>
Testing<\/a>
Unstructured Data<\/a>
Volume and Velocity<\/a>
Web Scraping<\/a>
XML<\/a>
Numpy<\/a>
ZooKeeper<\/a><\/strong><\/p><\/blockquote>\n\n\n\n


\n\n\n\n

<\/a>
Athena<\/h3>\n\n\n\n

AWS Athena<\/a> is a service used to query files in S3 buckets<\/a> directly, on a pay-for-what-you-use basis. This makes it easy to start querying data in various formats without having to use an ETL tool to load it into a database first.<\/p>\n\n\n\n

The service can be used on its own, integrated with AWS Glue<\/a> as a Data Catalogue or with AWS Lambda<\/a> as part of a bigger architecture.<\/p>\n\n\n\n
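
Querying from code follows the same pattern as the console. Here is a minimal sketch using boto3 (the AWS SDK for Python); the database, table and results bucket names are hypothetical.<\/p>\n\n\n\n

<pre><code># Minimal Athena query sketch using boto3; names are illustrative only
import boto3

athena = boto3.client('athena')
response = athena.start_query_execution(
    QueryString='SELECT * FROM events LIMIT 10',
    QueryExecutionContext={'Database': 'my_datalake'},
    ResultConfiguration={'OutputLocation': 's3://my-athena-results/'},
)

# Athena runs asynchronously; poll get_query_execution for the status
print(response['QueryExecutionId'])
<\/code><\/pre>\n\n\n\n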


\n\n\n\n

<\/a>
Batch Data Processing<\/h3>\n\n\n\n

Data Science projects rely on the Data Scientist being able to process terabytes or even petabytes of data. Tools like Apache Flink<\/a> can get the job done using data streams<\/a> or batch processing.<\/p>\n\n\n\n


\n\n\n\n

<\/a>
Compute<\/h3>\n\n\n\n

To allow Data Scientists to process large data sets, the infrastructure needs to be there to support them. This can be put in place by using autoscaling to make sure there is enough capacity to process the volume of data.<\/p>\n\n\n\n

To make this even easier to manage, AWS<\/a> has introduced Predictive Auto Scaling<\/a>, which uses Machine Learning to scale up compute resources to support Machine Learning.<\/p>\n\n\n\n

So meta.<\/p>\n\n\n\n


\n\n\n\n

<\/a>
Docker<\/h3>\n\n\n\n

Sharing the results of Data Science experiments isn’t always easy. Operating systems and R libraries aren’t always compatible depending on who you are sharing with. Security<\/a> is also an issue when sharing datasets and final dashboards between users.<\/p>\n\n\n\n

That’s where Docker<\/a> comes in. Data Engineers can provision Docker Images that freeze the Operating System and libraries so sandboxes or final products can be shared securely.<\/p>\n\n\n\n
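
A minimal Dockerfile for a shareable analysis sandbox might pin the base image and library versions like this; the versions and paths are illustrative only.<\/p>\n\n\n\n

<pre><code># Illustrative Dockerfile: freeze the OS, Python and libraries
FROM python:3.8-slim

# Pin the exact library versions the experiment was built with
RUN pip install --no-cache-dir numpy==1.19.1 pandas==1.1.0 notebook==6.1.3

WORKDIR /work
COPY notebooks/ /work/

# Launch Jupyter so the sandbox can be explored by whoever pulls the image
CMD jupyter notebook --ip=0.0.0.0 --allow-root
<\/code><\/pre>\n\n\n\n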


\n\n\n\n

<\/a>
Ethical Guidelines<\/h3>\n\n\n\n

Use of customers’ personal information in analysis needs to be taken seriously, and guidelines need to be in place to keep it secure. This is more than just complying with legal requirements<\/a>. Models should not have any kind of bias, and participants should always know where their data is being used.<\/p>\n\n\n\n


\n\n\n\n

<\/a>
Fuzzy Logic<\/h3>\n\n\n\n

Fuzzy Logic is used to calculate the distance between two strings, matching values that are similar rather than identical. This is similar to using wildcards in SQL<\/a> and Regular Expressions<\/a> in many other languages, but more forgiving of typos and variant spellings.<\/p>\n\n\n\n

In the Data Science world, we can use the FuzzyWuzzy Python library to do this across large data sets, as in the sketch below.<\/p>\n\n\n\n
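
A minimal sketch with FuzzyWuzzy; the strings here are made up for illustration.<\/p>\n\n\n\n

<pre><code># Compare similar-but-not-identical strings with FuzzyWuzzy
from fuzzywuzzy import fuzz, process

# Similarity score between two strings, from 0 to 100
print(fuzz.ratio('Helen Anderson', 'Helen Andersen'))  # 93

# Find the closest matches to a value in a list of candidates
suppliers = ['Acme Ltd', 'ACME Limited', 'Apex Ltd']
print(process.extract('acme ltd', suppliers, limit=2))
<\/code><\/pre>\n\n\n\n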


\n\n\n\n

<\/a>
GPU<\/h3>\n\n\n\n

Graphics Processing Units (GPUs) are designed to process images and are made up of many small cores. Because they can process huge batches of data<\/a> in parallel, performing the same task over and over, they are also used in Data Science.<\/p>\n\n\n\n
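
As an illustration, the CuPy library (an assumption; the post names no specific GPU tool) mirrors the NumPy API but runs the work on a CUDA GPU.<\/p>\n\n\n\n

<pre><code># Same-task-many-times workloads map well onto GPU cores (requires CUDA)
import cupy as cp

x = cp.random.random((10000, 10000))  # array allocated in GPU memory
y = x @ x.T                           # multiply runs in parallel on the GPU
print(float(y.sum()))                 # copy the scalar result back to the CPU
<\/code><\/pre>\n\n\n\n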


\n\n\n\n

<\/a>
Hadoop<\/h3>\n\n\n\n

The Open-Source Hadoop<\/a> project is a collection of utilities that decouples Storage and Compute<\/a> so these can be scaled up and down as needed.<\/p>\n\n\n\n

The Hadoop Distributed File System (HDFS) breaks files into logical chunks for storage; Spark, MapReduce or another tool can then take over to do the processing (more on both later in the post).<\/p>\n\n\n\n

Fun fact:<\/strong> Hadoop is the name of the creator’s son’s toy elephant<\/a>.<\/p><\/blockquote>\n\n\n\n


\n\n\n\n

<\/a>
Image Recognition<\/h3>\n\n\n\n

TensorFlow<\/a> is a Machine Learning framework used to train models using Neural Networks to perform image recognition.<\/p>\n\n\n\n

Neural Networks break inputs up into vectors, which they then interpret, cluster and classify.<\/p>\n\n\n\n
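
A minimal sketch of the idea using TensorFlow’s Keras API and the MNIST digit images bundled with the library; real image models are larger and usually convolutional.<\/p>\n\n\n\n

<pre><code># Train a tiny neural network to recognise handwritten digits
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),    # image to vector
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),  # one class per digit
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=1)
model.evaluate(x_test, y_test)
<\/code><\/pre>\n\n\n\n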


\n\n\n\n

<\/a>
Jupyter Notebook<\/h3>\n\n\n\n

Jupyter Notebooks<\/a> run code, perform statistical analysis and present data visualisations<\/a> all in one place. Jupyter supports over 40 languages, and its name is a nod to Galileo’s notebooks recording the discovery of the moons of Jupiter.<\/p>\n\n\n\n


\n\n\n\n

<\/a>
Kaggle<\/h3>\n\n\n\n

If you are looking to get some practice in or need a dataset for a project Kaggle<\/a> is the place to start. Once you’ve practised on a few of the test data sets you can then compete in competitions to solve problems. The community and discussions are friendly and you can use your tool of choice.<\/p>\n\n\n\n


\n\n\n\n

<\/a>
Linear Regression<\/h3>\n\n\n\n

Regression is one of the statistical techniques used in Data Science to predict how one variable influences another. Linear regression can be used to analyse the relationship between long queues at the supermarket and customer satisfaction or temperature and ice cream sales.<\/p>\n\n\n\n

If you think there is a relationship between two things, you can use regression to confirm it.<\/p>\n\n\n\n
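
A minimal sketch using scikit-learn, with made-up numbers for temperature and ice cream sales.<\/p>\n\n\n\n

<pre><code># Fit a line through illustrative temperature vs sales data
import numpy as np
from sklearn.linear_model import LinearRegression

temperature = np.array([[15], [18], [21], [24], [27], [30]])  # degrees C
sales = np.array([120, 150, 200, 240, 300, 350])              # units sold

model = LinearRegression().fit(temperature, sales)
print(model.coef_[0], model.intercept_)  # slope and intercept of the line
print(model.predict([[25]]))             # predicted sales at 25 degrees
<\/code><\/pre>\n\n\n\n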


\n\n\n\n

<\/a>
MapReduce<\/h3>\n\n\n\n

MapReduce<\/a> is the compute part of the Hadoop ecosystem. Once we have stored the data using HDFS, we can use MapReduce to do the processing. MapReduce splits the data into logical chunks, processes them in parallel, then aggregates the results.<\/p>\n\n\n\n
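
The pattern itself is simple enough to sketch in plain Python; a real job distributes these steps across many nodes rather than one process.<\/p>\n\n\n\n

<pre><code># Toy word count showing the map, shuffle and reduce steps
from itertools import groupby
from operator import itemgetter

chunks = ['big data tools', 'big data platforms']

# Map: emit a (word, 1) pair for every word in every chunk
mapped = [(word, 1) for chunk in chunks for word in chunk.split()]

# Shuffle: group the pairs by key
mapped.sort(key=itemgetter(0))

# Reduce: aggregate the counts for each word
counts = {word: sum(n for _, n in group)
          for word, group in groupby(mapped, key=itemgetter(0))}
print(counts)  # {'big': 2, 'data': 2, 'platforms': 1, 'tools': 1}
<\/code><\/pre>\n\n\n\n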


\n\n\n\n

<\/a>
Natural Language Processing<\/h3>\n\n\n\n

Natural Language Processing (NLP) is the arm of Artificial Intelligence concerned with how computers can derive meaning from human language. If you’ve ever used Siri, Cortana, or Grammarly you’ve encountered NLP.<\/p>\n\n\n\n
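
A minimal sketch using the NLTK library (an assumption; the post doesn’t name a specific NLP tool).<\/p>\n\n\n\n

<pre><code># Tokenise a sentence and tag each word with its part of speech
import nltk
nltk.download('punkt')                       # tokeniser models, first run only
nltk.download('averaged_perceptron_tagger')  # POS tagger model

from nltk import word_tokenize, pos_tag

tokens = word_tokenize('Computers can derive meaning from human language.')
print(pos_tag(tokens))  # e.g. ('Computers', 'NNS'), ('can', 'MD'), ...
<\/code><\/pre>\n\n\n\n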


\n\n\n\n

<\/a>
Overfitting<\/h3>\n\n\n\n

Both overfitting and underfitting lead to poor predictions.<\/p>\n\n\n\n

Overfitting<\/strong> – happens when a model is too complex and has learned the noise in the training data. The model ‘memorises’ the training data rather than learning general patterns, so it can’t ‘fit’ another data set.<\/p>\n\n\n\n

Underfitting<\/strong> – happens when a model is too simple and there aren’t enough parameters to capture trends. Both are illustrated in the sketch below.<\/p>\n\n\n\n
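
A small NumPy sketch with synthetic noisy data: the degree-1 fit is too simple, while the degree-15 fit chases the noise.<\/p>\n\n\n\n

<pre><code># Compare under-, well- and overfitted polynomial models on noisy data
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

fits = {'underfit': np.polyfit(x, y, 1),   # too simple, misses the curve
        'good fit': np.polyfit(x, y, 3),   # captures the trend
        'overfit': np.polyfit(x, y, 15)}   # memorises the noise

# Score each model on unseen points; the overfit model typically scores worst
x_new = np.linspace(0.02, 0.98, 7)
for name, coeffs in fits.items():
    err = np.mean((np.polyval(coeffs, x_new) - np.sin(2 * np.pi * x_new)) ** 2)
    print(name, round(float(err), 3))
<\/code><\/pre>\n\n\n\n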


\n\n\n\n

<\/a>
Pattern Recognition<\/h3>\n\n\n\n

Pattern Recognition is used to detect similarities or irregularities in data sets. Practical applications can be seen in fingerprint identification, analysis of seismic activity and speech recognition.<\/p>\n\n\n\n


\n\n\n\n

<\/a>
Quantitative v Qualitative<\/h3>\n\n\n\n

If moving into Data Science from an Engineering background you may need to brush up on your statistics. Learn more about the skills needed to transition into the role in this fascinating interview with Julia Silge<\/a> of Stack Overflow.<\/p>\n\n\n\n


\n\n\n\n

<\/a>
Real Time<\/h3>\n\n\n\n

Apache Kafka<\/a> is a pub\/sub system that allows streaming of data from logs, web activity and monitoring systems.<\/p>\n\n\n\n

Kafka is used for two classes of applications:<\/p>\n\n\n\n

Building real-time streaming data pipelines that reliably get data between systems or applications<\/p>\n\n\n\n

Building real-time streaming applications that transform or react to the streams of data<\/p>\n\n\n\n
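
A minimal producer and consumer sketch using the kafka-python package (an assumption), with a broker at localhost:9092 and a hypothetical web-activity topic.<\/p>\n\n\n\n

<pre><code># Publish an event and then read it back from the stream
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers='localhost:9092')
producer.send('web-activity', b'page_view:/home')  # fire an event
producer.flush()

consumer = KafkaConsumer('web-activity',
                         bootstrap_servers='localhost:9092',
                         auto_offset_reset='earliest')
for message in consumer:
    print(message.value)  # react to each event as it streams in
    break
<\/code><\/pre>\n\n\n\n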


\n\n\n\n

<\/a>
Spark<\/h3>\n\n\n\n

Apache Spark,<\/a> like MapReduce, is a tool for data processing.<\/p>\n\n\n\n

Spark<\/strong> – processes data in memory, so it is much faster. Useful if data needs to be processed iteratively or in real time.<\/p>\n\n\n\n

MapReduce<\/strong> – must read from and write to disk, but can work with far larger data sets than Spark. If results aren’t required right away, this may be a good choice.<\/p>\n\n\n\n
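
A minimal PySpark sketch of the in-memory style; the logs.txt input file is hypothetical.<\/p>\n\n\n\n

<pre><code># Word count on an RDD; intermediate steps stay in memory, not on disk
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('wordcount').getOrCreate()
lines = spark.sparkContext.textFile('logs.txt')

counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.take(5))
spark.stop()
<\/code><\/pre>\n\n\n\n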


\n\n\n\n

<\/a>
Testing<\/h3>\n\n\n\n

Artificial Intelligence (AI) has practical uses in Marketing with real-time product recommendations, Sales with VR systems helping shoppers make decisions, and Customer Support with Natural Language Processing.<\/p>\n\n\n\n

An emerging use case is Software Testing. AI can be used to prioritise the order of tests, automate and optimise test cases, and free up QAs from tedious tasks.<\/p>\n\n\n\n


\n\n\n\n

<\/a>
Unstructured Data<\/h3>\n\n\n\n

Structured Data can be stored in a Relational Database in columns, rows and tables.<\/p>\n\n\n\n

When it comes to Unstructured Data, which includes images, videos and text, the storage needs change. Data Lakes can hold both types of data at low cost.<\/p>\n\n\n\n

Data stored here is retrieved and read when required, and organised based on need, making it popular with data scientists who would rather keep the quirks and ‘noise’ in than have it cleaned and aggregated.<\/p>\n\n\n\n


\n\n\n\n

<\/a>
Volume and Velocity<\/h3>\n\n\n\n

In 2001 Big Data was defined by the three Vs:<\/p>\n\n\n\n

Volume<\/p>

Velocity<\/p>

Variety<\/p><\/blockquote>\n\n\n\n

Fast forward to today and there are additional Vs used in industry publications:<\/p>\n\n\n\n

Value<\/p>

Veracity<\/p>

Variability<\/p>

Visualisation<\/p><\/blockquote>\n\n\n\n

There is debate over whether these are relevant or truly describe what Big Data and Data Science are, but if you are researching the industry they will inevitably come up.<\/p>\n\n\n\n


\n\n\n\n

<\/a>
Web Scraping<\/h3>\n\n\n\n

Use cases for Web Scraping in Data Science projects include:<\/p>\n\n\n\n