How to Build a Data Lake On-Premises

Today, we are going to talk about how to build a cloud data lake, or a cloud-like data lake, on premises. Let's talk about the infrastructure underlying these cloud-like data lakes, see how we build them, and give you some actual examples of how to build them.

A few constraints up front. We want you to keep the storage you already have and never pay again for storage you already bought. And you cannot run all of this with just one or two people and forget about storage. Say you have an analytics cluster built on direct-attached storage in a hyper-converged architecture, and you add higher-capacity nodes: when one of those nodes fails, it causes a huge amount of rebalancing across the cluster.

So what do we need from the storage? First, multidimensional performance: no matter what application I throw at it and whatever the data looks like, different sizes, sequential or random access, batch or real-time jobs, a large number of small files or a small number of large files, whatever the characteristics of the app are, I need to deliver high-throughput, low-latency, consistent performance, and that is key. Second, it needs to be an intelligent architecture built on today's technologies; today's storage demands flash. And I know DataOps is a very buzzy word right now.
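The small-files-versus-large-files point above can be sketched with a toy benchmark: the same bytes are stored once as many small files and once as a single large file, and a full scan is timed for each layout. This is a local-disk illustration of workload shape only; the file names and sizes are made up for the example, and it measures no particular storage system.

```python
import os
import tempfile
import time

def write_layout(root, n_files, file_size):
    """Write n_files files of file_size bytes each under root."""
    os.makedirs(root, exist_ok=True)
    for i in range(n_files):
        with open(os.path.join(root, f"part-{i:05d}.bin"), "wb") as f:
            f.write(b"\x00" * file_size)

def scan_layout(root):
    """Read every file under root; return (total_bytes, seconds)."""
    start = time.perf_counter()
    total = 0
    for name in sorted(os.listdir(root)):
        with open(os.path.join(root, name), "rb") as f:
            total += len(f.read())
    return total, time.perf_counter() - start

with tempfile.TemporaryDirectory() as tmp:
    small = os.path.join(tmp, "small")  # 2,000 files x 4 KiB
    large = os.path.join(tmp, "large")  # 1 file x 8 MiB
    write_layout(small, 2000, 4096)
    write_layout(large, 1, 2000 * 4096)
    small_bytes, small_t = scan_layout(small)
    large_bytes, large_t = scan_layout(large)
    # Same data volume either way; only the per-file overhead differs.
    print(small_bytes == large_bytes, f"{small_t:.3f}s vs {large_t:.3f}s")
```

The aggregate bytes are identical, so any timing difference comes purely from per-file open/close overhead, which is exactly the workload-shape variation consistent storage has to absorb.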
Back before 2015, you had these hyper-converged nodes: you dedicated a certain number of nodes to a particular application, whether Hadoop or Spark or whatever it might be, and you scaled out to hundreds of nodes, 200 nodes, 300 nodes. Then from roughly 2015 to 2020 we moved into the cloud data warehouse world, where there is a separation of compute and storage: the storage became a cloud object layer like S3, and cloud data warehouses brought compute to the query, spinning up effectively unlimited compute for a particular query for a few minutes and then spinning it down when you don't need it. We want to take that same cloud-like approach, where storage and compute are disaggregated.
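The "bring compute to the query, then spin it down" pattern above can be sketched in miniature: a shared store holds partitioned data, a pool of workers is created just for one query, and the pool is torn down afterward while the data stays put. Everything here (the in-memory store, the partition names, the sum query) is invented for illustration; in a real system the store would be an S3-compatible object store and the workers would be Spark or Dremio executors.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a shared object store: 8 partitions of records.
STORE = {f"part-{i}": list(range(i * 100, (i + 1) * 100)) for i in range(8)}

def scan_partition(name):
    """One worker scans one partition from the shared store."""
    return sum(STORE[name])

def run_query(n_workers):
    """Spin up n_workers of compute, run the query, spin them down."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(scan_partition, sorted(STORE))
        return sum(partials)
    # Leaving the `with` block releases the compute; STORE is untouched.

print(run_query(4))  # 0 + 1 + ... + 799 = 319600
```

The same query can be run with one worker or eight and the answer is identical; only the amount of compute rented for those few minutes changes, which is the elasticity being described.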

And it needs to be reliable and always available. Even when you are doing upgrades or adding capacity, you don't want to take the storage down; it needs to stay available through upgrades and patches, and the data needs to be protected against ransomware attacks and against any kind of failure scenario.

Whatever the analytics software, even some kind of deep learning software, you have to keep performance tuning it: users are always complaining about query speeds or about something not functioning. All of this costs money and adds complexity, and you are well aware of that.

So let's see how our data architectures have evolved to take into account this paradigm shift we've seen over the last 10 years. Back in 2015 and before, we had Hadoop clusters and data warehouses where data and compute were co-located. So this is what most data teams want, and we know that; but what infrastructure challenges are preventing us from getting there? Let's start with an architecture, or "marketecture", diagram showing the various layers: on top, you have your applications.

A lot of this content was developed by Joshua Robinson, who is a chief technologist at Pure, and he has written a very detailed blog describing it.
Next, cloud-like economics: you don't want to pay for storage you did not use. Let me pay only for the gigabytes I use today, rather than planning capacity five years ahead, buying everything up front, and tying up all my money. People are very operationally focused, so pay only for what you use.

Together, Dremio and Pure FlashBlade create a modern data lake and/or warehouse with the flexibility of cloud-native query engines and storage.
What developers and organizations want is to automate their data pipelines and make them self-service. We all know the siloed world is not where we want to be; we all want to shift to something more sane, secure, and reliable, something we can take from experimentation to production rapidly. Everybody knows this, and you've seen many slides like this, so what is the state we want to be in today?

Disaggregating compute and storage brings you exactly that elasticity: you can bring a lot of compute to one problem for a few minutes, then take it away for another problem, with very high reliability; the storage is literally never going to be down. Or, in the old model, if I have a lot of queries but very little data, I still have to buy the storage along with the compute.

As for the difference between FlashBlade and a RAID array of SSDs: you can put a bunch of SSDs together and make it work for a few terabytes of data, but as you start scaling, you are going to see all these problems with complexity and performance. What you want instead is to run nodes with only the operating system and basic functions on the local drives, and keep all your data on a centralized unified fast file and object store (what we call UFFO). That way you create an open data layer that can be used by any application, so you are not locking yourself into silos. To me, that is the difference between FlashBlade and a RAID array of SSDs. (Speaker background: he has held various positions at Databricks, Splunk, and Cisco Systems.)
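The open data layer described above can be imitated locally: the data is written once in a partitioned, self-describing layout, and two independent "engines" query the same files without any copies. JSON-lines files stand in for Parquet here purely so the sketch needs no third-party libraries; the `date=` partition naming mirrors the Hive-style layout that engines such as Spark and Dremio understand.

```python
import json
import os
import tempfile

def write_open_layer(root, records):
    """Write records once, Hive-style partitioned by date."""
    by_date = {}
    for rec in records:
        by_date.setdefault(rec["date"], []).append(rec)
    for date, rows in by_date.items():
        part_dir = os.path.join(root, f"date={date}")
        os.makedirs(part_dir, exist_ok=True)
        with open(os.path.join(part_dir, "part-00000.jsonl"), "w") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")

def scan(root):
    """Any engine can scan the same files; no per-engine copy."""
    for part in sorted(os.listdir(root)):
        part_dir = os.path.join(root, part)
        for name in sorted(os.listdir(part_dir)):
            with open(os.path.join(part_dir, name)) as f:
                for line in f:
                    yield json.loads(line)

records = [
    {"date": "2021-01-01", "value": 10},
    {"date": "2021-01-01", "value": 5},
    {"date": "2021-01-02", "value": 7},
]

with tempfile.TemporaryDirectory() as lake:
    write_open_layer(lake, records)
    # "Engine" 1 aggregates; "engine" 2 filters. Same open layer.
    total = sum(r["value"] for r in scan(lake))
    jan1 = [r for r in scan(lake) if r["date"] == "2021-01-01"]

print(total, len(jan1))  # 22 2
```

The point is that both readers work from one copy of the data; swapping JSON-lines for Parquet and the local directory for an object store gives the open layer the talk describes.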
Let's say I have very few queries but I am getting more data: I need to add storage, so I would have to add a couple of extra nodes. It is difficult to plan capacity ahead of time, and when you add or remove a node from a cluster, your data suddenly starts rebalancing and you have to move data from one location to another. You need to install patches. It is just complex, and the complexity scales with the data: you start with a few terabytes of data or less, then you scale to more users, more data, more clusters and nodes, and complexity goes through the roof along with your scale.
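The rebalancing pain described above can be made concrete with a toy placement model: blocks are assigned to nodes by hashing the block name modulo the node count, and we count how many blocks land on a different node after one node is added. The naive modulo scheme is an illustration, not how any particular product places data, but it shows why growing a tightly coupled cluster can shuffle most of the data.

```python
import hashlib

def node_for(block, n_nodes):
    """Deterministically place a block on one of n_nodes."""
    digest = hashlib.md5(block.encode()).hexdigest()
    return int(digest, 16) % n_nodes

blocks = [f"block-{i}" for i in range(10_000)]
before = {b: node_for(b, 10) for b in blocks}  # 10-node cluster
after = {b: node_for(b, 11) for b in blocks}   # one node added
moved = sum(1 for b in blocks if before[b] != after[b])
# With naive modulo placement, roughly 10 out of 11 blocks relocate.
print(f"{moved}/{len(blocks)} blocks would move")
```

Schemes like consistent hashing reduce the moved fraction, but the point stands: any placement tied to the node count forces data movement on every cluster resize, which is the operational cost disaggregated storage avoids.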

And finally, each piece of analytics software in your pipeline, whether it is Spark, Splunk, Elastic, Dremio, or some deep learning software, demands this kind of constant performance tuning. That is going to be a problem for being agile and creating value quickly for your end-user teams. And then there is management complexity on top of it all.

Flash also makes operations easy: beyond the speed itself, a system built on flash is stable, simple to manage with no tuning required, and has lower power requirements. And there is MLOps too, which is also a super-big buzzword in the industry right now.

That is the old architecture, the old way of doing things: the hyper-converged architecture with a fixed amount of compute and a fixed amount of storage, where if you need to add storage, the compute just comes along with it.
This is a fantastic shift; it really brought elasticity and agility to the cloud world. What we are seeing in 2020 and beyond, especially with innovators like Dremio, is cloud data lakes built on open data, where there is a separation between compute and data: you have an open data layer on top of your storage, built on open metadata standards, open file formats like Parquet, and open table formats such as Delta Lake and Iceberg, and that open data layer is accessed by various applications, via Dremio, Spark, or whatever the application may be.

Like Dave said, I am a senior solutions manager for analytics and AI at Pure Storage. Today, unstructured, machine-generated data is growing exponentially, as everybody knows: IoT data and geospatial data generated by devices, video generated by cameras, log data.

Dave: The first question is: what is the difference between a FlashBlade and a RAID array of SSDs?
Naveen: Fantastic, so this is a great question.
The other layer, on top of this storage: we spoke about building that open data platform, and we also need software to manage storage for your containers. As you spin containers up and down, you want storage to be allocated automatically when a container is spun up; and in failure scenarios, when there are needs for backup, for migrating data, for creating dev/test environments, for encrypting data, or when a container fails and Kubernetes creates another to handle the failure or scaling scenario, all of those storage needs have to be addressed. You need a Kubernetes data services platform to address all those requirements, and Pure Storage acquired a company called Portworx, which offers the industry's leading Kubernetes data services platform. It can be used for building, automating, and protecting your cloud-native applications, with modules for core storage, backup, disaster recovery, application data migration, security, and infrastructure automation, all taken care of by this one-hundred-percent-software solution.

Forward-thinking companies say that within 10 years, most of the code generated will be AI and ML code. The storage I would build to meet these requirements would be an object store capable of many things; let's see what capabilities we should bring here.
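As a rough sketch of that Kubernetes data services layer, a storage class and a volume claim might look like the following. The names here are hypothetical; the `pxd.portworx.com` provisioner and the `repl` replication parameter follow Portworx's documented conventions, but treat every field as illustrative and verify against the current Portworx documentation before use.

```yaml
# Illustrative only: a replicated Portworx-backed storage class...
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: px-replicated          # hypothetical name
provisioner: pxd.portworx.com  # Portworx CSI provisioner
parameters:
  repl: "2"                    # keep two replicas of each volume
---
# ...and a claim that an analytics pod could mount.
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: spark-scratch          # hypothetical name
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: px-replicated
  resources:
    requests:
      storage: 100Gi
```

With a class like this in place, the "storage allocated when a container is spun up" behavior is just dynamic provisioning: any pod that references the claim gets a replicated volume without an operator touching the storage.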
Different applications use different protocols: cloud-like applications will use an S3 or object protocol. And the second shift we are seeing is, of course, that with the cloud, people are moving toward object storage for unstructured data.

Great, let's get started with the agenda.

Multi-protocol support also needs to be native. You will use object storage, but you may have legacy software using NFS, or even current software using NFS or SMB; whatever protocol your application uses to access the data should be available. And it should be native to the platform, so performance is good no matter which protocol the application uses to access data.

You are not creating pipelines for the sake of creating data pipelines; you may encounter new tools, you want to use the latest and greatest tools, and you want to allocate the right amount of resources to the right project at the right time.

And the second trend is people doing more predictive, real-time use cases, where you are not just getting insights and putting them on a dashboard: you have a piece of code that gets an insight and then takes an action. For example, in trading, you get an insight and then trade a stock, or the software responds to a security threat automatically. For you to take immediate action on the data, you need real-time data; and if the response is going to be automated, you had better have real-time data, and it needs to be predictive as we move toward machine learning.
Modern infrastructure is going to be built on containers; these applications are going to run on containers, so hopefully you have something like a container-as-a-service, cluster-as-a-service, or platform-as-a-service layer with containers and virtual machines. This is what you have in mind, so let's look at storage and how we bring this paradigm to storage. You are able to do this with a combination of a Hive source of data, Dremio, and FlashBlade.
First, we have unpredictable performance. You have data pipelines that service various teams with various requirements; their jobs might be slow, their queries might be slowing them down, and anybody whose query gets stuck is going to give up and not use the system. Your users, your business leaders, and your customers are impatient and want predictable performance, but it is hard to tune every system and figure out where the bottlenecks are: is it a latency bottleneck, a throughput bottleneck, or just a stuck process? It is hard to find out where the performance is bad.

And finally, multi-protocol support: you don't want to bet all your dollars on one particular protocol. Again, this delivers consistent performance and consistent security, which is essential to create the architecture you need for today's modern data services.

Below the Kubernetes layer, you are going to have a layer of data management services for Kubernetes: as a container is spun up or spun down, this layer provides the storage to the Kubernetes layer. Then you are going to have your modern data lake layer, which is based on open data formats, and that layer is built on top of block or object storage, or possibly more legacy systems.
Second, there is a lack of agility. Requirements change on you at any given time; when you start building something, the requirements change and the tools change, so if your infrastructure is rigid, if your data is rigid and you have a fixed set of resources allocated to you ("you've got these 10 nodes and two terabytes of data"), that is all you have, and you have to work within that rigid infrastructure.

So let's look at this: if we had some magic pixie dust, what kind of storage would we build to meet these requirements?

Dave: Again, if you have any questions, you can use the button in the upper right-hand corner to share your audio or video, and you will automatically be put into a queue; if for some reason you are having trouble with that, just ask your question in the chat.

Naveen: The question is essentially: you take a bunch of SSDs, put them together in a box, and it becomes a FlashBlade, right? Actually, it is not only storage: there is compute and networking built into every blade of a FlashBlade, so you get that linear increase in performance as you scale, and you will see the details in that video.
There is just so much engineering that goes into it, and really smart PhDs worked on putting these things together to make it scale. Actually, if you go to YouTube and search for FlashBlade, there is a Field Day session with Brian Gold where he walks through all of the design. And by the way, the questioner corrected himself: it was not B-O-C but B-O-X, so "box".

And on MinIO: MinIO is like the quick, cheap, and dirty version of that; and basically yes, you could use our Portworx software to completely manage all your storage.

And finally, it simplifies operations as you scale. Like I said, our storage is managed from the cloud and can be consumed as a service, completely storage as a service: you only pay for what you use, and you never have to be down for any upgrade or patching; even a controller upgrade is covered by Pure's Evergreen guarantee. With that, we'll switch over to Q&A.

Dave: Great. Folks, if you have any other questions, let's go ahead and get them in. I did post the link, so you might check Slack to see if anybody else is there, but at this point I think we should just wrap it up.
On top you have Spark, Dremio, or whatever you are using: you could be running data science environments, streaming real-time analytics, or scale-out SQL analytics. I am sure you have seen several slides like this throughout this conference; everybody starts with one of them, and everybody knows that today's environment is siloed. You have data warehouses, a team working on streaming analytics, a backup copy of some data somewhere, a data lake, a team working on AI and ML, and many times you have to create copies of your data in all of these different environments. These environments have different teams managing them, different levels of service, different reliability standards, and different security.

Dave: If we didn't get to your questions (I think we got them all, but if not), you can reach Naveen in Slack. Before you go, we would appreciate it if you would fill out the super-short Slido session survey, which you will find in the chat. The next session is coming up; I believe we have a panel, or a keynote fireside chat.

